ACM Transactions on the Web (TWEB)

Latest Articles

Evaluating Quality in Use of Corporate Web Sites: An Empirical Investigation

In our prior work, we presented a novel approach to the evaluation of quality in use of corporate web sites based on an original quality model (QM-U) and a related methodology (EQ-EVAL). This article focuses on two research questions. The first one aims at investigating whether expected quality obtained through the application of EQ-EVAL... (more)

Localness of Location-based Knowledge Sharing: A Study of Naver KiN “Here”

In location-based social Q&A services, people ask a question with a high expectation that local residents who have local knowledge will answer the question. However, little is known about the locality of user activities in location-based social Q&A services. This study aims to deepen our understanding of location-based knowledge sharing... (more)

Extracting and Summarizing Situational Information from the Twitter Social Media during Disasters

Microblogging sites like Twitter have become important sources of real-time information during... (more)

Completeness Management for RDF Data Sources

The Semantic Web is commonly interpreted under the open-world assumption, meaning that information available (e.g., in a data source) captures only a subset of the reality. Therefore, there is no certainty about whether the available information provides a complete representation of the reality. The broad aim of this article is to contribute a... (more)

Optimizing Whole-Page Presentation for Web Search

Modern search engines aggregate results from different verticals: webpages, news, images, video, shopping, knowledge cards, local maps, and so on.... (more)

Faster Base64 Encoding and Decoding Using AVX2 Instructions

Web developers use base64 formats to include images, fonts, sounds, and other resources directly inside HTML, JavaScript, JSON, and XML files. We... (more)
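For context on the operation the article accelerates, here is a scalar base64 round-trip using Python's standard library. This is only an illustrative sketch of what base64 does, not the authors' implementation; their contribution is replacing this kind of byte-at-a-time processing with AVX2 SIMD instructions.

```python
import base64

# Encode raw bytes into base64 text (the form embedded in HTML/JSON/XML),
# then decode it back.  Each group of 3 input bytes becomes 4 output
# characters drawn from a 64-symbol alphabet, padded with '='.
payload = b"Hello, Web!"
encoded = base64.b64encode(payload)   # bytes -> base64 text
decoded = base64.b64decode(encoded)   # base64 text -> original bytes

assert decoded == payload
print(encoded.decode("ascii"))  # SGVsbG8sIFdlYiE=
```

A vectorized encoder processes 32 bytes per instruction instead of one, which is where the speedup in the article comes from.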


About TWEB

The journal Transactions on the Web (TWEB) publishes refereed articles reporting the results of research on Web content, applications, use, and related enabling technologies.

The scope of TWEB is described on the Call for Papers page. Authors are invited to submit original research papers for consideration by following the directions on the Author Guidelines page.

Forthcoming Articles
Understanding Cross-site Linking in Online Social Networks

Online social networks (OSNs) have become a commodity in people's daily lives. Given the diverse focuses of different OSN services, a user often has accounts on multiple sites. In this paper, we study the emerging "cross-site linking" function, which is supported by a number of mainstream OSN services. To gain a deep and systematic understanding of this function, we first conduct a data-driven analysis using the crawled profiles and social connections of all 60+ million Foursquare users. Our analysis shows that the cross-site linking function is adopted by 57.10% of all Foursquare users, and that users who have enabled this function are more active than other users. We also find that users who are more concerned with online privacy are less likely to enable cross-site linking. By further exploring the cross-site links between Foursquare and leading OSN sites, we formalize the cross-site information aggregation problem. Using massive data collected from Foursquare, Facebook, and Twitter, we demonstrate the usefulness and challenges of cross-site information aggregation. In addition to these measurements, we carry out a survey to collect users' detailed opinions about cross-site linking. The survey reveals why people choose to enable cross-site linking or not, and investigates the motivations and concerns behind enabling this function.

Exploiting usage to predict instantaneous app popularity: Trend filters and retention rates

The popularity of mobile apps is traditionally measured by metrics such as the number of downloads, installations, or user ratings. A problem with these measures is that they reflect usage only indirectly. We propose to exploit actual app usage statistics. Indeed, retention rates, i.e., the number of days users continue to interact with an installed app, have been suggested to predict successful app lifecycles. We conduct the first independent and large-scale study of retention rates and usage trends on a database of app-usage data from a community of 339,842 users and more than 213,667 apps. Our analysis shows that, on average, applications lose 65% of their users in the first week, while very popular applications (top 100) lose only 35%. It also reveals, however, that many applications have more complex usage behavior patterns due to seasonality, marketing, or other factors. To capture such effects, we develop a novel app-usage trend measure which provides instantaneous information about the popularity of an application. Our analysis shows that roughly 40% of all apps never gain more than a handful of users (Marginal apps). Less than 0.4% of the remaining 60% are constantly popular (Dominant apps), 1% have a quick drain of usage after an initial steep rise (Expired apps), and 7% continuously rise in popularity (Hot apps). From these, we can distinguish, for instance, trendsetters from copycat apps. We conclude by demonstrating that usage behavior trend information can be used to develop better mobile app recommendations.
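As a minimal sketch of the retention-rate notion described above (toy data and a hypothetical log format for illustration only, not the article's large-scale pipeline):

```python
def retention_rate(usage_log, day):
    """Fraction of an app's installers still active `day` days after install.

    `usage_log` maps user -> list of days (0 = install day) on which the
    user opened the app.  The format is hypothetical, for illustration.
    """
    users = list(usage_log)
    retained = sum(1 for u in users if any(d >= day for d in usage_log[u]))
    return retained / len(users)

# Toy data: three users; only one is still active a week after install.
log = {
    "u1": [0, 1, 2],      # churned within the first week
    "u2": [0, 3, 9, 14],  # retained past day 7
    "u3": [0],            # opened the app once, never returned
}
print(retention_rate(log, 7))  # 1/3 of users retained at day 7
```

Computing this for every day since install yields the retention curve whose first-week drop (65% on average, per the abstract) the study quantifies.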

Analyzing Privacy Policies at Scale: From Crowdsourcing to Automated Annotations

Website privacy policies are often long and difficult to understand. While research shows that Internet users care about their privacy, they do not have the time to understand the policies of every website they visit, and most users hardly ever read privacy policies. Some recent efforts have aimed to use a combination of crowdsourcing, machine learning, and natural language processing to interpret privacy policies at scale, thus producing annotations for use in interfaces that inform Internet users of salient policy details. However, little attention has been devoted to studying the accuracy of crowdsourced privacy policy annotations, how crowdworker productivity can be enhanced for such a task, and the levels of granularity that are feasible for automatic analysis of privacy policies. In this paper we present a trajectory of work addressing each of these topics. We include analyses of crowdworker performance, evaluation of a method to make a privacy-policy oriented task easier for crowdworkers, a coarse-grained approach to labeling segments of policy text with descriptive themes, and a fine-grained approach to identifying user choices described in policy text. Together, results from these efforts show the effectiveness of using automated and semi-automated methods for extracting from privacy policies the details that are salient to Internet users' interests.

Unsupervised Domain Ranking in Large-Scale Web Crawls

With the proliferation of web spam and infinite auto-generated web content, large-scale web crawlers require low-complexity ranking methods to effectively budget their limited resources and allocate bandwidth to reputable sites. To shed light on Internet-wide spam avoidance, we study topology-based ranking algorithms on domain-level graphs from the two largest academic crawls -- a 6.3B-page IRLbot dataset and a 1B-page ClueWeb09 exploration. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods, including TrustRank. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method called TSE that can achieve much better crawl prioritization in practice. It is especially beneficial in applications with limited hardware resources.
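As a toy illustration of ranking domains by combining BFS distance from a trusted seed set with in-degree (a drastic simplification of the techniques the paper compares; domain names and edges below are hypothetical):

```python
from collections import deque

# Toy domain-level graph: an edge points from linking domain to linked domain.
edges = [
    ("blog.example", "news.example"),
    ("spam1.example", "spam2.example"),
    ("news.example", "wiki.example"),
    ("blog.example", "wiki.example"),
    ("forum.example", "wiki.example"),
]

def bfs_levels(edges, seeds):
    """BFS distance of each domain from a trusted seed set, via out-links."""
    adj = {}
    for src, dst in edges:
        adj.setdefault(src, []).append(dst)
    level = {s: 0 for s in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, []):
            if nxt not in level:
                level[nxt] = level[node] + 1
                queue.append(nxt)
    return level

def indegree(edges):
    deg = {}
    for _, dst in edges:
        deg[dst] = deg.get(dst, 0) + 1
    return deg

# Rank reachable domains: shallower BFS level first, higher in-degree second.
lv, deg = bfs_levels(edges, ["blog.example"]), indegree(edges)
ranking = sorted(lv, key=lambda d: (lv[d], -deg.get(d, 0)))
print(ranking)  # spam domains are unreachable from the seed, so never ranked
```

Note how the isolated spam cluster never enters the ranking at all, which hints at why seed-based BFS techniques avoid spam well; the paper's TSE estimator achieves a similar effect at far lower cost on multi-billion-page graphs.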

new phone, who dis? Modeling Millennials' Backup Behavior

Given the ever-rising frequency of malware attacks and other problems leading people to lose their files, backups are an important proactive protective behavior in which users can engage. Backing up files can prevent emotional and financial losses and improve overall user experience. Yet, we find that less than half of young adults perform mobile or computer backups at least every few months. To understand why, we model the factors that drive mobile and computer backup behavior, and changes in that behavior over time, using data from a panel survey of 384 diverse young adults. We develop a set of models that explain 37% and 38% of the variance in reported mobile and computer backup behaviors, respectively. These models show consistent relationships between Internet skills and backup frequency on both mobile and computer devices. We find that this relationship holds longitudinally: increases in Internet skills lead to increased frequency of computer backups. This paper provides a foundation for understanding what drives young adults' backup behavior. It concludes with recommendations for motivating people to back up and for future work modeling similar user behaviors.

Exploring and Analysing the African Web Ecosystem

It is well known that Africa's Internet infrastructure is progressing at a rapid pace. A flurry of recent research has quantified this, highlighting the expansion of its underlying connectivity network. However, improving the infrastructure is not useful without appropriately provisioned services to exploit it. This paper measures the availability and utilisation of web infrastructure in Africa. Whereas others have explored web infrastructure in developed regions, we shed light on practices in developing regions. To achieve this, we apply a comprehensive measurement methodology to collect data from a variety of sources. We first focus on Google to reveal that its content infrastructure in Africa is, indeed, expanding. We find, however, that much of its web content is still served from the US and Europe, despite Google being the most popular website in many African countries. We repeat the same analysis across a number of other regionally popular websites to find that even national African websites prefer to host their content abroad. To explore the reasons for this, we evaluate some of the major bottlenecks facing content delivery networks (CDNs) in Africa. Amongst other things, we find a lack of peering between the networks hosting our probes, preventing the sharing of CDN servers, as well as poorly configured DNS resolvers. We conclude the work with a number of suggestions for alleviating the issues observed.

You, the Web and Your Device: Longitudinal Characterization of Browsing Habits

Understanding how people interact with the web is key for a variety of applications, from the design of effective web pages to the definition of successful online marketing campaigns. User browsing behavior has traditionally been represented and studied by means of clickstreams, i.e., graphs whose vertices are pages and whose edges are the paths followed by users. Obtaining large and representative data from which to extract clickstreams is, however, challenging. The evolution of the web raises the question of whether user behavior is changing and, by consequence, whether the properties of clickstreams are changing. This paper presents a longitudinal study of clickstreams over the last 3 years. We capture an anonymized dataset of HTTP traces in a large ISP, where thousands of households are connected. We first propose a methodology to identify the actual URLs requested by users from the massive set of requests automatically fired by browsers when rendering pages. Then, we characterize web usage patterns and clickstreams, taking into account both the temporal evolution and the impact of the device used to explore the web. Our analyses uncover and quantify interesting patterns, such as the increasing importance of social networks for content promotion on smartphones, the rather limited number of pages (on average, fewer than 5 pages per day per active smartphone) visited by users while at home, and the impact of the browser, with Internet Explorer users visiting half as much content as Firefox or Chrome users. Finally, we contribute our dataset of anonymized clickstreams to the community to foster new studies.
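As a minimal sketch of the clickstream representation the abstract describes (toy sessions and hypothetical page names; the article extracts such graphs from anonymized HTTP traces at ISP scale):

```python
# A clickstream graph: vertices are pages, and weighted edges count how
# often users transitioned between consecutive page views in a session.
sessions = [
    ["home", "news", "article"],
    ["home", "social", "article"],
    ["home", "news"],
]

clickstream = {}
for path in sessions:
    for src, dst in zip(path, path[1:]):
        clickstream[(src, dst)] = clickstream.get((src, dst), 0) + 1

print(clickstream)  # e.g. the ("home", "news") edge has weight 2
```

Properties of this weighted graph (degree distributions, path lengths, per-device differences) are the quantities whose evolution the longitudinal study tracks.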

A rule-based transducer for querying incompletely aligned datasets

A growing number of Linked Open Data sources (of diverse provenance and about different domains) are made available that can be freely browsed and searched to find and extract useful information. However, accessing them is difficult for users for several reasons. This paper is mainly concerned with the heterogeneity problem: it is quite common for datasets to describe the same or overlapping domains using different vocabularies. This paper presents a transducer that transforms a SPARQL query, expressed in terms of the vocabularies used in a source dataset, into another SPARQL query, expressed in terms of a target dataset supported by different vocabularies. The transducer obtains an acceptable transformation of the original query in order to increase the chances of getting answers even in adverse situations (such as when no direct translation of terms seems possible). It does not always preserve the semantics of the query, although it does produce an equivalent translation when one is at hand. Transformation across datasets is achieved through the management of a wide range of transformation rules. The feasibility of our proposal has been validated with a prototype implementation that processes queries appearing in well-known benchmarks and SPARQL endpoint logs. Results of the experiments show that the system is quite effective at achieving adequate transformations.
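As a toy sketch of vocabulary-driven query rewriting (a drastic simplification of the article's transducer, which handles cases with no direct translation; the prefixes and rules below are hypothetical):

```python
# Substitution table mapping source-dataset vocabulary terms onto the
# target dataset's terms.  Real transformation rules are far richer.
rules = {
    "foaf:name": "rdfs:label",
    "dbo:birthPlace": "schema:birthPlace",
}

def rewrite(query, rules):
    """Apply each source->target term rule to the SPARQL query text."""
    for src_term, dst_term in rules.items():
        query = query.replace(src_term, dst_term)
    return query

source_query = "SELECT ?p WHERE { ?p foaf:name ?n . ?p dbo:birthPlace ?c . }"
print(rewrite(source_query, rules))
```

A textual substitution like this only works when a one-to-one term mapping exists; the article's contribution is precisely in producing useful rewritings when it does not.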

Mining Abstract XML Data-Types

Schema matching is an integral part of various data engineering domains. The currently dominant standard for specifying schemas on the Web is the XML language. Since XML offers advanced modeling capabilities, schema grammars usually represent complex data models. Thus, schema matching approaches that identify semantically similar complex XML data models are vital in data engineering domains. However, the top-rated state-of-the-art matching approaches do not focus on identifying semantically similar complex XML data models; they simply match schema elements or combinations of schema elements. To fill this gap in the literature, we represent schemas in a complete way (i.e., without discarding schema elements and relations) in order to capture complex data models, and we propose an automated approach that matches complex data models by performing the following steps: (i) mining structural design patterns usually encountered inside such data models, and (ii) matching the semantically similar ones. Since the traditional tree pattern mining and matching technique is computationally demanding and in some cases inefficient, our approach extends it with a pruning technique, an indexing technique, and a greedy technique for mining and matching patterns efficiently. These efficiency techniques reduce the number of mined patterns and produced matchings. In particular, based on these techniques, our approach does not enumerate all possible patterns and matchings, but only the best possible ones. To decide whether a pattern or a matching is better than another, these techniques calculate the confidences of patterns and matchings, respectively, which are computed using two suites of newly proposed metrics. We evaluate our approach in terms of its effectiveness (i.e., its capability to identify semantically similar structural design patterns between different schemas) and its (time and space) efficiency. In particular, we evaluate the impact of the proposed pruning, indexing, and greedy techniques, and of the suites of (pattern and matching) confidence metrics, on the efficiency and effectiveness of our approach. We also evaluate the effectiveness variability of our approach (i.e., whether its effectiveness remains high across different schema pairs). Our evaluation uses the schemas of the matching benchmark XBenchMatch, which has already been used for the joint evaluation of top-rated state-of-the-art matching approaches. Overall, the results of our evaluation show that the usage of structural design patterns helps our approach match complex XML data models effectively. Additionally, the proposed pruning, indexing, and greedy techniques and the suites of (pattern and matching) confidence metrics keep both the effectiveness and the efficiency of our approach steadily high in various cases.

Imaginary People Representing Real Numbers: Generating Personas from Online Social Media Data

We develop a methodology to automate creating imaginary people, referred to as personas, by processing complex behavioral and demographic data of social media audiences. From a popular social media account containing more than 30 million interactions by viewers from 198 countries engaging with more than 4,200 online videos produced by a global media corporation, we demonstrate that our methodology has several novel accomplishments, including: (a) identifying distinct user behavioral segments based on the user content consumption patterns; (b) identifying impactful demographics groupings; and (c) creating rich persona descriptions by automatically adding pertinent attributes, such as names, photos, and personal characteristics. We validate our approach by implementing the methodology into an actual working system; we then evaluate it via quantitative methods by examining the accuracy of predicting content preference of personas, the stability of the personas over time, and the generalizability of the method via applying to two other datasets. Research findings show the approach can develop rich personas representing the behavior and demographics of real audiences using privacy preserving aggregated online social media data from major online platforms. Results have implications for media companies and other organizations distributing content via online platforms.

Test-Based Security Certification of Composite Services

The diffusion of service-based and cloud-based systems has led to a scenario where software is often made available as a service, offered as a commodity over corporate networks or the global net. This scenario supports the definition of business processes as composite services, which are implemented via runtime composition of offerings provided by different suppliers. Fast and accurate evaluation of services' security properties then becomes a fundamental requirement. In this paper, we show how the verification of security properties of composite services can be handled by test-based security certification. Our approach builds on existing security certification schemes for monolithic services and extends them towards service compositions. It virtually certifies composite services, starting from certificates awarded to the component services. We describe three heuristic algorithms for generating runtime test-based evidence that the composite service holds the properties. These algorithms are compared with the corresponding exhaustive algorithm to evaluate their quality and performance. We also evaluate the proposed approach in a real-world industrial scenario involving the ENGpay online payment system of Engineering Ingegneria Informatica S.p.A. This industrial evaluation demonstrates the utility and generality of the proposed approach by showing how certification results can be used as a basis to establish compliance with the Payment Card Industry Data Security Standard (PCI DSS).

Top-k User-Defined Vertex Scoring Queries in Edge-Labeled Graph Databases

We consider identifying highly ranked vertices in large graph databases such as social networks or the Semantic Web where there are edge labels. There are many applications where users express scoring queries against such databases that involve two elements: (i) a set of patterns describing relationships that a vertex of interest to the user must satisfy, and (ii) a scoring mechanism in which the user may use properties of the vertex in order to assign a score to that vertex. We define the concept of a partial pattern map query (partial PM-query), which intuitively allows us to prune, and show that finding an optimal partial PM-query is NP-hard. We then propose two algorithms, PScore_LP and PScore_NWST, to find the answer to a scoring (top-k) query. In PScore_LP, the optimal partial PM-query is found using a list-oriented pruning method. PScore_NWST leverages Node-Weighted Steiner Trees to quickly compute slightly sub-optimal solutions. We conduct detailed experiments comparing our algorithms with (i) an algorithm (PScore_Base) that computes all answers to the query, evaluates them according to the scoring method, and chooses the top-k, and (ii) two Semantic Web query processing systems (Jena and GraphDB). Our algorithms show better performance than PScore_Base and the Semantic Web query processing systems; moreover, PScore_NWST outperforms PScore_LP on large queries and on queries with a tree structure.
