ACM Transactions on the Web (TWEB)

A Novel Evidence-Based Bayesian Similarity Measure for Recommender Systems

User-based collaborative filtering, a widely used nearest neighbour-based recommendation technique, predicts an item’s rating by aggregating its...

A Buyer-Friendly and Mediated Watermarking Protocol for Web Context

Watermarking protocols are used in conjunction with digital watermarking techniques to protect digital copyright on the Internet. They define the...


We present Q2P, a system that discovers query templates from search engines via their query autocompletion services. Q2P is distinct from the existing works in that it does not rely on query logs of search engines that are typically not readily available. Q2P is also unique in that it uses a trie to economically store queries sampled from a search...

Activity Dynamics in Collaboration Networks

Many online collaboration networks struggle to gain user activity and become self-sustaining due to the ramp-up problem or dwindling activity within the system. Prominent examples include online encyclopedias such as (Semantic) MediaWikis, Question and Answering portals such as StackOverflow, and many others. Only a small fraction of these systems...

Probabilistic QoS Aggregations for Service Composition

In this article, we propose a comprehensive approach for Quality of Service (QoS) calculation in service composition. Differing from the existing work...

Search and Breast Cancer

We seek to understand the evolving needs of people who are faced with a life-changing medical diagnosis based on analyses of queries extracted from an anonymized search query log. Focusing on breast cancer, we manually tag a set of Web searchers as showing patterns of search behavior consistent with someone grappling with the screening, diagnosis,...

What Users Actually Do in a Social Tagging System

Social tagging systems have established themselves as an important part in today’s Web and have attracted the interest of our research community...


About TWEB

The journal Transactions on the Web (TWEB) publishes refereed articles reporting the results of research on Web content, applications, use, and related enabling technologies.

The scope of TWEB is described on the Call for Papers page. Authors are invited to submit original research papers for consideration by following the directions on the Author Guidelines page.

Forthcoming Articles
A Large-Scale Evaluation of U.S. Financial Institutions' Standardized Privacy Notices

Financial institutions in the United States are required by the Gramm-Leach-Bliley Act to provide annual privacy notices. In 2009, eight federal agencies jointly released a model privacy form for these disclosures. While the use of this model privacy form is not required, it has been widely adopted. We automatically evaluated 6,191 U.S. financial institutions' privacy notices posted on the World Wide Web. We found large variance in stated practices, even among institutions of the same type. While thousands of financial institutions share personal information without providing the opportunity for consumers to opt out, some institutions' practices are more privacy-protective. Regression analyses show that large institutions and those headquartered in the Northeastern region share consumers' personal information at higher rates than all other institutions. Furthermore, our analysis helped us uncover institutions that do not let consumers limit data sharing when legally required to do so, as well as institutions making self-contradictory statements. We discuss implications for privacy in the financial industry, issues with the design and use of the model privacy form on the World Wide Web, and future directions for standardized privacy notice.

Prediction and Predictability for Search Query Acceleration

A commercial web search engine shards its index among many servers, and therefore the response time of a search query is dominated by the slowest server that processes the query. Prior approaches target improving responsiveness by reducing the tail latency of an individual search server. They predict query execution time, and if a query is predicted to be long-running, it runs in parallel; otherwise, it runs sequentially. These approaches are, however, not accurate enough for reducing a high tail latency when responses are aggregated from many servers, because this requires each server to reduce a substantially higher tail latency (e.g., the 99.99th-percentile), which we call extreme tail latency. Our extensive evaluation results show that, for both scenarios, the proposed framework is effective in reducing the extreme tail latency compared to a state-of-the-art predictor because of its higher recall, and it improves server throughput by more than 70% because of its improved precision.
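The fan-out argument in this abstract can be made concrete with a short calculation. The sketch below is illustrative only: it assumes that servers respond independently and that the aggregator waits for every server, neither of which is stated in the abstract, and the function names are hypothetical.

```python
def fraction_of_fast_queries(per_server_percentile: float, num_servers: int) -> float:
    """Probability that all servers answer within their own percentile latency.

    Assumes independent servers (a simplifying assumption, not from the article).
    """
    return per_server_percentile ** num_servers


def required_per_server_percentile(target: float, num_servers: int) -> float:
    """Per-server percentile needed so that a `target` fraction of
    aggregated queries sees no slow server (same independence assumption)."""
    return target ** (1.0 / num_servers)


# With 100 servers, a 99th-percentile guarantee per server leaves only
# about 37% of aggregated queries free of a slow server.
print(round(fraction_of_fast_queries(0.99, 100), 3))

# To keep 99% of aggregated queries fast, each server must meet its
# latency target at roughly the 99.99th percentile.
print(round(required_per_server_percentile(0.99, 100), 5))
```

Under these assumptions, the per-server requirement tightens quickly as the number of servers grows, which is the effect the abstract calls extreme tail latency.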

PEACE-ful Web Event Extraction and Processing as Bi-Temporal Mutable Events

The web is the world's largest bulletin board. Events of all types, from flight arrivals to business meetings, are announced on this board. Tracking and reacting to such event announcements, however, is a tedious manual task, only slightly alleviated by email or similar notifications. Announcements are published with human readers in mind, and updates or delayed announcements are frequent. These characteristics have hampered attempts at automatic tracking. PEACE provides the first integrated framework for event processing on top of web event ads, consisting of event extraction, complex event processing, and action execution in response to these events. Given a schema of the events to be tracked, the framework populates this schema by extracting events from announcement sources. This extraction is performed by small programs called wrappers, which produce the events, including updates and retractions. PEACE then queries these events to detect complex events, often combining announcements from multiple sources. To deal with updates and delayed announcements, PEACE's schemas are bitemporal, so as to distinguish between occurrence and detection time. This allows complex event specifications to track updates and to react to differences between occurrence and detection time. In case of new, changing, or deleted events, PEACE allows actions to be executed, such as tweeting or sending out email notifications. Actions are typically specified as web interactions, e.g., to fill and submit a form with attributes of the triggering event. Our evaluation shows that PEACE's processing is dominated by the time needed for accessing the web to extract events and perform actions, which amounts to 97.4%. Thus, PEACE requires only 2.6% overhead, and therefore the complex event processor scales well even with moderate resources. We further show that simple and reasonable restrictions on complex event specifications and the timing of constituent events suffice to guarantee that PEACE only requires a constant buffer to process arbitrarily many event announcements.
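As a rough illustration of the bitemporal idea, the sketch below stores each announcement with both an occurrence time and a detection time, and answers what was known about an event at a given moment. It is not PEACE's schema or query language; the class, field, and function names are hypothetical, and the flight data is made up for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class EventVersion:
    event_id: str
    occurrence_time: datetime  # when the event is announced to happen
    detection_time: datetime   # when this announcement was extracted


def latest_known(versions: Iterable[EventVersion], event_id: str,
                 as_of: datetime) -> Optional[EventVersion]:
    """Return the most recently detected version of an event at time `as_of`."""
    candidates = [v for v in versions
                  if v.event_id == event_id and v.detection_time <= as_of]
    return max(candidates, key=lambda v: v.detection_time, default=None)


# A flight arrival announced at 09:00 for 12:00, then updated at 11:00 to 13:30.
versions = [
    EventVersion("flight-1", datetime(2016, 1, 1, 12, 0), datetime(2016, 1, 1, 9, 0)),
    EventVersion("flight-1", datetime(2016, 1, 1, 13, 30), datetime(2016, 1, 1, 11, 0)),
]

before_update = latest_known(versions, "flight-1", datetime(2016, 1, 1, 10, 0))
after_update = latest_known(versions, "flight-1", datetime(2016, 1, 1, 11, 30))
print(before_update.occurrence_time, after_update.occurrence_time)
```

Keeping every version, rather than overwriting the occurrence time, is what lets a rule react to the difference between the two announcements.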

Periodicity in User Engagement with a Search Engine and its Application to Online Controlled Experiments

Nowadays, billions of people use the Web in connection with their daily needs. A significant part of these needs consists of search tasks that are usually addressed by search engines. Thus, daily search needs result in regular user engagement with a search engine. User engagement with web services has been studied in various aspects, but there appear to be no studies of its regularity and periodicity. In this paper, we study periodicity of the user engagement with a popular search engine by applying spectrum analysis to temporal sequences of different engagement metrics. We found periodicity patterns of user engagement and revealed classes of users whose periodicity patterns do not change over a long period of time. In addition, we used the spectrum series as key metrics to evaluate search quality. We found that the novel periodicity metrics outperform the state-of-the-art quality metrics both in terms of significance level (p-value) and sensitivity to different search engine changes.
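To illustrate what applying spectrum analysis to an engagement sequence can look like, the sketch below computes a plain periodogram with a discrete Fourier transform and reports the strongest period. This is a generic textbook technique, not necessarily the authors' method; the function names and the daily query counts are invented for the example.

```python
import cmath
import math


def periodogram(series):
    """Return (frequency index k, power) pairs for k = 1 .. n/2.

    The mean is removed first so the constant component does not dominate.
    """
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    result = []
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centered))
        result.append((k, abs(coeff) ** 2 / n))
    return result


def dominant_period(series):
    """Period, in samples, of the frequency with the highest power."""
    k, _ = max(periodogram(series), key=lambda pair: pair[1])
    return len(series) / k


# Hypothetical daily query counts for one user over four weeks:
# ten queries on weekdays, four on weekend days.
daily_queries = [10, 10, 10, 10, 10, 4, 4] * 4
print(dominant_period(daily_queries))
```

For this made-up series the strongest component corresponds to a weekly cycle. Real engagement data would be noisier and would likely need longer windows or smoothing.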

A Comprehensive Survey and Classification of Approaches for Community Question Answering

Community question answering (CQA) systems, such as Yahoo! Answers, Stack Overflow or Quora, belong to a prominent group of successful and popular Web 2.0 applications, which are used every day by millions of users to post complex, subjective or context-dependent questions. In order to answer them effectively, CQA systems should optimally harness the collective intelligence of the whole online community, which is impossible without appropriate collaboration support provided by information technologies. Therefore, CQA became an interesting and promising subject of research in computer science, and now we can gather the results of ten years of research. Nevertheless, in spite of the increasing number of publications emerging each year, the research on CQA systems has so far lacked a comprehensive state-of-the-art survey. We attempt to fill this gap with a review of 265 articles published between 2005 and 2014, which were selected from major conferences and journals. Based on this review, we first propose a framework that defines descriptive attributes of CQA approaches. Second, we introduce a classification of all approaches with respect to the problems they aim to solve. The classification is subsequently employed in a review of a significant number of representative approaches, which are described by means of attributes from the descriptive framework. As a part of the survey, we also describe the current trends and highlight the areas that require further attention from the research community.

Scalable and Efficient Web Search Result Diversification

It has been shown that top-k retrieval quality can be considerably improved by taking not only relevance, but also diversity into account. However, currently proposed diversification approaches have paid little attention to practical usability in large-scale settings, such as modern Web search systems. In this work, we make two contributions towards this goal. First, we propose a combination of optimizations and heuristics for an implicit diversification algorithm based on the desirable facility placement principle, and present two algorithms that achieve linear complexity without compromising the retrieval effectiveness. Instead of an exhaustive comparison of documents, these algorithms first perform a clustering phase, and then exploit its outcome to compose the diverse result set. Second, we describe and analyse two variants for distributed diversification in a computing cluster, for large-scale IR where the document collection is too large to keep in one node. Extensive evaluations on a standard TREC framework demonstrate that the proposed optimizations achieve retrieval quality competitive with the baseline algorithm while reducing the processing time by more than 80%, and shed light on the efficiency and effectiveness trade-offs of diversification when applied on top of a distributed architecture.


