Online social networks (OSNs) have become a commodity in people's daily lives. Given the diverse focuses of different OSN services, a user often has accounts on multiple sites. In this paper, we study the emerging "cross-site linking" function, which is supported by a number of mainstream OSN services. To gain a deep and systematic understanding of this function, we first conduct a data-driven analysis using crawled profiles and social connections of all 60+ million Foursquare users. Our analysis shows that the cross-site linking function is adopted by 57.10% of all Foursquare users, and that the users who have enabled this function are more active than other users. We also find that users who are more concerned with online privacy are less likely to enable the cross-site linking function. By further exploring the cross-site links between Foursquare and leading OSN sites, we formalize the cross-site information aggregation problem. Using the massive data collected from Foursquare, Facebook and Twitter, we demonstrate the usefulness and challenges of cross-site information aggregation. In addition to measurements, we also carry out a survey to let users provide their detailed opinions about cross-site linking. The survey reveals why people choose to enable or not to enable cross-site linking, and investigates the motivations for and concerns about enabling this function.
The popularity of mobile apps is traditionally measured by metrics such as the number of downloads, installations, or user ratings. A problem with these measures is that they reflect usage only indirectly. We propose to exploit actual app usage statistics. Indeed, retention rates, i.e., the number of days users continue to interact with an installed app, have been suggested to predict successful app lifecycles. We conduct the first independent and large-scale study of retention rates and usage trends on a database of app-usage data from a community of 339,842 users and more than 213,667 apps. Our analysis shows that, on average, applications lose 65% of their users in the first week, while very popular applications (top 100) lose only 35%. It also reveals, however, that many applications have more complex usage behavior patterns due to seasonality, marketing, or other factors. To capture such effects, we develop a novel app-usage trend measure which provides instantaneous information about the popularity of an application. Our analysis shows that roughly 40% of all apps never gain more than a handful of users (Marginal apps). Less than 0.4% of the remaining 60% are constantly popular (Dominant apps), 1% have a quick drain of usage after an initial steep rise (Expired apps), and 7% continuously rise in popularity (Hot apps). From these, we can distinguish, for instance, trendsetters from copycat apps. We conclude by demonstrating that usage behavior trend information can be used to develop better mobile app recommendations.
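The retention measure described above can be illustrated with a minimal sketch. It assumes retention is the number of days between a user's first and last recorded use of an app, and that a user is "lost in the first week" when that span is shorter than seven days; the abstract does not give the exact definition, and the usage log, user ids, and function names below are hypothetical.

```python
from datetime import date


def retention_days(usage_dates):
    """Days between a user's first and last recorded use of an app."""
    return (max(usage_dates) - min(usage_dates)).days


def fraction_lost_within(per_user_dates, days=7):
    """Share of users whose last use falls fewer than `days` days after their first."""
    spans = [retention_days(dates) for dates in per_user_dates.values()]
    lost = sum(1 for span in spans if span < days)
    return lost / len(spans)


# Hypothetical usage log for one app: user id -> dates on which the app was used.
usage = {
    "u1": [date(2024, 1, 1), date(2024, 1, 2)],
    "u2": [date(2024, 1, 1), date(2024, 1, 20)],
    "u3": [date(2024, 1, 3)],
    "u4": [date(2024, 1, 5), date(2024, 1, 30)],
}

lost_first_week = fraction_lost_within(usage, days=7)
print(lost_first_week)
```

In this toy log, two of the four users stop within a week, so the function reports a first-week loss of one half.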
With the proliferation of web spam and infinite auto-generated web content, large-scale web crawlers require low-complexity ranking methods to effectively budget their limited resources and allocate bandwidth to reputable sites. To shed light on Internet-wide spam avoidance, we study topology-based ranking algorithms on domain-level graphs from the two largest academic crawls -- a 6.3B-page IRLbot dataset and a 1B-page ClueWeb09 exploration. We first propose a new methodology for comparing the various rankings and then show that in-degree BFS-based techniques decisively outperform classic PageRank-style methods, including TrustRank. However, since BFS requires several orders of magnitude higher overhead and is generally infeasible for real-time use, we propose a fast, accurate, and scalable estimation method called TSE that can achieve much better crawl prioritization in practice. It is especially beneficial in applications with limited hardware resources.
In a previous paper we presented a novel approach to the evaluation of quality in use of corporate web sites based on an original quality model (QM-U) and a related methodology to put it into practice (EQ-EVAL). This paper focuses on two research questions. The first one investigates whether the expected quality obtained by applying the EQ-EVAL methodology with a small panel of evaluators is a good approximation of the actual quality obtained through experimentation with real users. In order to answer this research question, a comparative study has been carried out involving five evaluators and fifty real users. The second research question aims to demonstrate that the adoption of the EQ-EVAL methodology can provide useful information for web site improvement. Three original indicators, namely coherence, coverage, and ranking, have been defined in order to answer this second question, and an additional study comparing the assessments of two panels of five and ten evaluators respectively has been carried out. The results obtained in both comparative studies are largely positive and provide rational support for the adoption of the EQ-EVAL methodology.
Given the ever-rising frequency of malware attacks and other problems leading people to lose their files, backups are an important proactive protective behavior in which users can engage. Backing up files can prevent emotional and financial losses and improve overall user experience. Yet, we find that fewer than half of young adults perform mobile or computer backups at least every few months. To understand why, we model the factors that drive mobile and computer backup behavior, and changes in that behavior over time, using data from a panel survey of 384 diverse young adults. We develop a set of models that explain 37% and 38% of the variance in reported mobile and computer backup behaviors, respectively. These models show consistent relationships between Internet skills and backup frequency on both mobile and computer devices. We find that this relationship holds longitudinally: increases in Internet skills lead to increased frequency of computer backups. This paper provides a foundation for understanding what drives young adults' backup behavior. It concludes with recommendations for motivating people to back up and for future work modeling similar user behaviors.
Modern search engines aggregate results from different verticals: webpages, news, images, video, shopping, knowledge cards, local maps, etc. Unlike ``ten blue links'', these search results are heterogeneous in nature and not even arranged in a list on the page. This revolution directly challenges the conventional ``ranked list'' formulation in ad hoc search. Therefore, finding a proper presentation for a gallery of heterogeneous results is critical for modern search engines. We propose a novel framework that learns the optimal page presentation to render heterogeneous results onto a search result page (SERP). Page presentation is broadly defined as the strategy to present a set of items on a SERP, and is much more expressive than a ranked list. It can specify item positions, image sizes, text fonts, and any other styles as long as variations are within business and design constraints. The learned presentation is content-aware, i.e., tailored to specific queries and returned results. Simulation experiments show that the framework automatically learns eye-catching presentations for relevant results. Experiments on real data show that simple instantiations of the framework already outperform the leading algorithm in federated search result presentation. This means the framework can learn its own result presentation strategy purely from data, without even knowing the ``probability ranking principle''.
In location-based social Q&A, the questions related to a local community (e.g., local services and places) are typically answered by local residents (i.e., people who have the local knowledge). This study aims to deepen our understanding of location-based knowledge sharing by investigating general users' behavioral characteristics, the topical and typological patterns related to geographic characteristics, the geographic locality of user activities, and the motivations for local knowledge sharing. To this end, we analyzed a 12-month Q&A dataset from Naver KiN Here and a supplementary survey dataset from 285 mobile users. Our results revealed several unique characteristics of location-based social Q&A. When compared with conventional social Q&A sites, Naver KiN Here had distinctive user behavior patterns and different topical/typological patterns. In addition, Naver KiN Here exhibited a strong spatial locality where the answers mostly had 1-3 spatial clusters of contributions, and a typical cluster spanned a few neighboring districts. We also uncovered unique motivators, e.g., ownership of local knowledge and a sense of local community. The findings reported in the paper have significant implications for the design of Q&A systems, especially location-based social Q&A systems.
It is well known that Africa's Internet infrastructure is progressing at a rapid pace. A flurry of recent research has quantified this, highlighting the expansion of its underlying connectivity network. However, improving the infrastructure is not useful without appropriately provisioned services to exploit it. This paper measures the availability and utilisation of web infrastructure in Africa. Whereas others have explored web infrastructure in developed regions, we shed light on practices in developing regions. To achieve this, we apply a comprehensive measurement methodology to collect data from a variety of sources. We first focus on Google to reveal that its content infrastructure in Africa is, indeed, expanding. We, however, find that much of its web content is still served from the US and Europe, despite Google being the most popular website in many African countries. We repeat the same analysis across a number of other regionally popular websites to find that even national African websites prefer to host their content abroad. To explore the reasons for this, we evaluate some of the major bottlenecks facing content delivery networks (CDNs) in Africa. Amongst other things, we find a lack of peering between the networks hosting our probes, preventing the sharing of CDN servers, as well as poorly configured DNS resolvers. We conclude the work with a number of suggestions for alleviating the issues observed.
Understanding how people interact with the web is key for a variety of applications, e.g., from the design of effective web pages to the definition of successful online marketing campaigns. User browsing behavior has traditionally been represented and studied by means of clickstreams, i.e., graphs whose vertices are pages and whose edges are the paths followed by users. Obtaining large and representative data to extract clickstreams is, however, challenging. The evolution of the web raises the question of whether user behavior is changing and, as a consequence, whether properties of clickstreams are changing. This paper presents a longitudinal study of clickstreams in the last 3 years. We capture an anonymized dataset of HTTP traces in a large ISP, where thousands of households are connected. We first propose a methodology to identify actual URLs requested by users from the massive set of requests automatically fired by browsers when rendering pages. Then, we characterize web usage patterns and clickstreams, taking into account both the temporal evolution and the impact of the device used to explore the web. Our analyses uncover and quantify interesting patterns, such as the increasing importance of social networks for content promotion on smartphones, the rather limited number of pages (on average fewer than 5 pages per day per active smartphone) visited by users while at home, and the impact of the browser, with Internet Explorer users visiting half as much content as Firefox or Chrome users. Finally, we contribute our dataset of anonymized clickstreams to the community to foster new studies.
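The clickstream representation described above (pages as vertices, followed paths as edges) can be sketched as follows. This is a minimal illustration that assumes an edge is created between each pair of consecutively visited pages; the abstract does not say how edges are derived from HTTP traces, and the session data and function name are hypothetical.

```python
from collections import Counter


def build_clickstream(visits):
    """Return (vertices, edge_counts) from an ordered list of visited pages.

    Assumption: each pair of consecutive visits forms one directed edge.
    """
    vertices = set(visits)
    edges = Counter(zip(visits, visits[1:]))
    return vertices, edges


# Hypothetical ordered page visits for one user session.
session = ["a.com/", "a.com/news", "b.com/story", "a.com/news", "b.com/story"]

vertices, edges = build_clickstream(session)
for (src, dst), count in sorted(edges.items()):
    print(src, "->", dst, count)
```

Keeping a count per edge, rather than a plain set of edges, preserves how often each path was followed, which is useful when comparing usage patterns across devices or time periods.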
Microblogging sites like Twitter have become important sources of real-time information during disaster events. A large amount of valuable situational information is posted on these sites during disasters; however, the information is dispersed among hundreds of thousands of tweets containing sentiments and opinions of the masses. To effectively utilize microblogging sites during disaster events, it is necessary not only to extract the situational information from the large amounts of sentiment and opinion, but also to summarize the large amounts of situational information posted in real-time. During disasters in countries like India, a sizeable number of tweets are posted in local resource-poor languages besides the normal English-language tweets. For instance, in the Indian subcontinent, a large number of tweets are posted in Hindi / Devanagari (the national language of India), and some of the information contained in such non-English tweets is not available (or is available only at a later point in time) through English tweets. In this work, we develop a novel classification-summarization framework which handles tweets in both English and Hindi -- we first extract tweets containing situational information, and then summarize this information. Our proposed methodology is developed based on the understanding of how several concepts evolve in Twitter during disasters. This understanding helps us achieve superior performance compared to the state-of-the-art tweet classifiers and summarization approaches on English tweets. Additionally, to our knowledge, this is the first attempt to extract situational information from non-English tweets.
The Semantic Web is commonly interpreted under the open-world assumption. Under this setting, available information only captures a subset of the reality, thus hindering certainty as to whether the reality is fully described (e.g., in the answer to a query). While there are several aspects of the reality where one can observe complete information, there is currently no way to assert meta-information about completeness in a machine-readable form. The aim of this paper is to fill this gap and to contribute a (formal) study of how to describe the completeness of parts of the Semantic Web, and how to leverage this novel information for query answering. One immediate benefit is that query answers can now be complemented with information about their completeness. More specifically, we introduce a theoretical framework that allows RDF data sources to be augmented with statements, also expressed in RDF, about their completeness. We then study the impact of completeness statements on the complexity of query answering by considering different fragments of the SPARQL language, including the RDFS entailment regime, and the federated scenario. We implement an efficient method for reasoning about query completeness and provide an experimental evaluation in the presence of large sets of completeness statements.
A growing number of Linked Open Data sources (of diverse provenance and about different domains) are being made available, and they can be freely browsed and searched to find and extract useful information. However, users find them difficult to access for several reasons. This paper is mainly concerned with the heterogeneity aspect. It is quite common for datasets to describe the same or overlapping domains using different vocabularies. This paper presents a transducer that transforms a SPARQL query, suitably expressed in terms of the vocabularies used in a source dataset, into another SPARQL query, suitably expressed for a target dataset supported by different vocabularies. The transducer obtains an acceptable transformation of the original query in order to increase the chances of getting answers even in adverse situations (such as when no direct translation of terms seems possible). The transformation may not always preserve the semantics of the query, although the transducer produces an equivalent translation whenever one is available. Transformation across datasets is achieved through the management of a wide range of transformation rules. The feasibility of our proposal has been validated with a prototype implementation that processes queries that appear in well-known benchmarks and SPARQL endpoint logs. The experimental results show that the system is quite effective at achieving adequate transformations.
This paper addresses web interfaces for High Performance Computing (HPC) simulation software. First, it presents a brief history, starting in the 1990s with Java applets, of web interfaces used for accessing and making the best possible use of remote HPC resources. It then reviews the present state of such HPC web-based portals. We identify and discuss the key features and constraints that characterize HPC portals. The design and development of Bull extreme factory Computing Studio v3 (XCS3) is chosen as a common thread for showing how these features can all be implemented in a single software product: multi-tenancy, multi-scheduler compatibility, HPC application template framework, complete control through an HTTP RESTful API, customizable user interface with Responsive Web Design, remote visualization, Role-Based Access Control, and access through the proven Authentication, Authorization, and Accounting security framework. The paper concludes with the benefits of using such an HPC portal for both end-users and IT administrators.
We consider identifying highly ranked vertices in large graph databases with edge labels, such as social networks or the Semantic Web. There are many applications where users express scoring queries against such databases that involve two elements: (i) a set of patterns describing relationships that a vertex of interest to the user must satisfy, and (ii) a scoring mechanism in which the user may use properties of the vertex in order to assign a score to that vertex. We define the concept of a partial pattern map query (partial PM-query), which intuitively allows us to prune, and show that finding an optimal partial PM-query is NP-hard. We then propose two algorithms, PScore_LP and PScore_NWST, to find the answer to a scoring (top-k) query. In PScore_LP, the optimal partial PM-query is found using a list-oriented pruning method. PScore_NWST leverages Node-Weighted Steiner Trees to quickly compute slightly sub-optimal solutions. We conduct detailed experiments comparing our algorithms with (i) an algorithm (PScore_Base) that computes all answers to the query, evaluates them according to the scoring method, and chooses the top-k, and (ii) two Semantic Web query processing systems (Jena and GraphDB). Our algorithms show better performance than PScore_Base and the Semantic Web query processing systems; moreover, PScore_NWST outperforms PScore_LP on large queries and on queries with a tree structure.
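The exhaustive strategy described above for the baseline (compute all answers, score them, keep the top-k) can be sketched in a few lines. This is only an illustration of that general idea, not the authors' PScore_Base implementation: the pattern format (a list of `(label, target)` pairs that a vertex's outgoing edges must satisfy), the toy graph, and all names below are assumptions made for the example.

```python
import heapq


def matches(vertex, edges, patterns):
    """True if `vertex` has a matching outgoing edge for every (label, target) pattern.

    A target of None matches any endpoint (hypothetical pattern format).
    """
    for label, target in patterns:
        found = any(
            src == vertex and lbl == label and (target is None or dst == target)
            for src, lbl, dst in edges
        )
        if not found:
            return False
    return True


def top_k_baseline(vertices, edges, patterns, score, k):
    """Enumerate every matching vertex, score each one, and return the k best."""
    candidates = [v for v in vertices if matches(v, edges, patterns)]
    return heapq.nlargest(k, candidates, key=score)


# Hypothetical edge-labeled graph as (source, label, target) triples.
edges = {
    ("alice", "follows", "bob"),
    ("carol", "follows", "bob"),
    ("dave", "follows", "bob"),
    ("alice", "likes", "jazz"),
    ("carol", "likes", "jazz"),
    ("dave", "likes", "rock"),
}
# Hypothetical vertex property used for scoring (e.g., number of posts).
posts = {"alice": 10, "bob": 50, "carol": 30, "dave": 40}

patterns = [("follows", "bob"), ("likes", "jazz")]
result = top_k_baseline(posts.keys(), edges, patterns, score=lambda v: posts[v], k=1)
print(result)
```

Because every candidate is enumerated and scored before selection, the cost grows with the number of matches, which is the overhead that pruning-based approaches aim to avoid.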