Although web crawlers have been around for over twenty years, there is virtually no freely available, open-source crawling software that guarantees high throughput, overcomes the limits of single-machine systems and at the same time scales linearly with the amount of resources available. This paper aims at filling this gap through the description of BUbiNG, our next-generation web crawler, built upon the authors' experience with UbiCrawler and on the last ten years of research on the topic. BUbiNG is an open-source, fully distributed crawler written in Java; a single BUbiNG agent, using sizeable hardware, can crawl several thousand pages per second while respecting strict politeness constraints, both host- and IP-based. Unlike existing open-source distributed crawlers that rely on batch techniques (like MapReduce), BUbiNG job distribution is based on modern high-speed protocols so as to achieve very high throughput.
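The host- and IP-based politeness constraints can be pictured with a minimal rate-limiting sketch: a URL is fetched only if enough time has elapsed since the last request to both its host and the IP that host resolves to. The delay values and the `may_fetch` helper below are illustrative assumptions, not BUbiNG's actual scheduler.

```python
import socket
import time
from urllib.parse import urlsplit

# Illustrative per-host and per-IP politeness delays in seconds;
# these values are assumptions, not BUbiNG's configuration.
HOST_DELAY = 1.0
IP_DELAY = 0.5

last_host_fetch = {}  # host -> time of last request
last_ip_fetch = {}    # IP   -> time of last request

def may_fetch(url):
    """Allow a fetch only if both the host and its IP have rested long enough."""
    now = time.monotonic()
    host = urlsplit(url).hostname
    ip = socket.gethostbyname(host)  # a real crawler would cache DNS lookups
    host_ok = now - last_host_fetch.get(host, float("-inf")) >= HOST_DELAY
    ip_ok = now - last_ip_fetch.get(ip, float("-inf")) >= IP_DELAY
    if host_ok and ip_ok:
        last_host_fetch[host] = now
        last_ip_fetch[ip] = now
        return True
    return False
```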
Online social networks (OSNs) have reached widespread diffusion, and people often subscribe to several OSNs. This phenomenon leads to online social internetworking (OSI) scenarios, where users who subscribe to multiple OSNs are termed bridges. Unfortunately, several important features make the study of information propagation in an OSI scenario a difficult task, e.g., correlations in both the structural characteristics of OSNs and the bridge interconnections, heterogeneity and size of OSNs, activity factors, cross-posting propensity, etc. In this paper we propose a directed random graph-based model, amenable to efficient numerical solution, to analyze the phenomenon of information propagation in an OSI scenario; in developing the model we take into account heterogeneity and correlations introduced both by topological factors (correlations among node degrees and among bridge distributions) and by user-related factors (activity index, cross-posting propensity). We first validate the model predictions against simulations on snapshots of interconnected OSNs in a reference scenario. Subsequently, we exploit the model to show the impact on information propagation of several characteristics of the reference scenario, i.e., size and complexity of the OSI scenario, degree distribution and overall number of bridges, growth and decline of OSNs over time, and time-varying cross-posting propensity of users.
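A simple way to picture the mechanism the model captures is a toy discrete-time simulation of two directed random graphs coupled by bridge users who cross-post with some propensity. The graph sizes, activity and cross-posting probabilities, and bridge fraction below are illustrative assumptions, not the paper's reference scenario.

```python
import random
import networkx as nx

def simulate(n=1000, p_edge=0.005, p_activity=0.2, p_cross=0.3,
             bridge_frac=0.05, seed=0):
    """Toy propagation across two OSNs linked by bridge users (assumed parameters)."""
    rng = random.Random(seed)
    osn_a = nx.gnp_random_graph(n, p_edge, seed=seed, directed=True)
    osn_b = nx.gnp_random_graph(n, p_edge, seed=seed + 1, directed=True)
    # Bridge users subscribe to both OSNs; here they are paired by node id.
    bridges = set(rng.sample(range(n), int(bridge_frac * n)))

    informed_a, informed_b = {0}, set()   # the message originates in OSN A
    frontier_a, frontier_b = {0}, set()
    while frontier_a or frontier_b:
        next_a, next_b = set(), set()
        for u in frontier_a:
            # Active followers repost within OSN A.
            for v in osn_a.successors(u):
                if v not in informed_a and rng.random() < p_activity:
                    next_a.add(v)
            # Bridge users may cross-post the message to OSN B.
            if u in bridges and u not in informed_b and rng.random() < p_cross:
                next_b.add(u)
        for u in frontier_b:
            for v in osn_b.successors(u):
                if v not in informed_b and rng.random() < p_activity:
                    next_b.add(v)
            if u in bridges and u not in informed_a and rng.random() < p_cross:
                next_a.add(u)
        informed_a |= next_a
        informed_b |= next_b
        frontier_a, frontier_b = next_a, next_b
    return len(informed_a), len(informed_b)

print(simulate())  # informed users in each OSN for one toy run
```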
Content Security Policy (CSP) is a recent W3C standard introduced to prevent and mitigate the impact of content injection vulnerabilities on websites. In this paper we introduce a formal semantics for the latest stable version of the standard, CSP Level 2. We then perform a systematic, large-scale analysis of the effectiveness of the current CSP deployment, using the formal semantics to substantiate our methodology and to assess the impact of the detected issues. We focus on four key aspects that affect the effectiveness of CSP: browser support, website adoption, correct configuration and constant maintenance. Our analysis shows that browser support for CSP is largely satisfactory, with the exception of a few notable issues, but unfortunately there are several shortcomings relative to the other three aspects. CSP deployment is still rather limited and, more crucially, existing policies exhibit a number of weaknesses and misconfiguration errors. Moreover, content security policies are not regularly updated to ban insecure practices and remove unintended security violations. We argue that many of these problems can be fixed by better exploiting the monitoring facilities of CSP, while other issues deserve additional research, being more deeply rooted in the CSP design.
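The kinds of weaknesses mentioned above can be checked mechanically from a site's response headers. The sketch below fetches a policy and flags a few common issues (inline scripts allowed, wildcard sources, no violation reporting); the URL and the specific set of checks are assumptions for the sake of the example, not the paper's full methodology.

```python
import requests

def audit_csp(url):
    """Fetch a page and flag a few common CSP weaknesses (illustrative checks only)."""
    resp = requests.get(url, timeout=10)
    policy = resp.headers.get("Content-Security-Policy")
    if policy is None:
        return ["no Content-Security-Policy header (policy not deployed)"]
    # Parse "directive value value; directive value" into a dictionary.
    directives = {}
    for part in policy.split(";"):
        tokens = part.split()
        if tokens:
            directives[tokens[0].lower()] = " ".join(tokens[1:])
    issues = []
    script_src = directives.get("script-src", directives.get("default-src", ""))
    if "'unsafe-inline'" in script_src:
        issues.append("script-src allows 'unsafe-inline' (inline injection not blocked)")
    if "*" in script_src.split():
        issues.append("script-src contains a bare wildcard source")
    if "report-uri" not in directives and "report-to" not in directives:
        issues.append("no violation reporting configured (monitoring facility unused)")
    return issues

print(audit_csp("https://example.com"))  # hypothetical target URL
```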
The exponential growth in smartphone adoption is contributing to the availability of vast amounts of human behavioral data. This data enables the development of increasingly accurate data-driven user models, which in turn enable the delivery of personalized services that are often free in exchange for the use of customers' data. Although such usage conventions have raised many privacy concerns, the increasing value of personal data is motivating diverse entities to collect and exploit it aggressively. In this paper, we propose the concept of constrained user modeling, focusing on the possibility of non-explicit uses of personal data. The concept is demonstrated with mobile online activity data, collected in the wild from 61 mobile phone users for a minimum of 30 days. We outline realistic scenarios of constrained user modeling and evaluate their feasibility. Our scenarios attempt to model heterogeneous user traits and interests, including personality, boredom proneness, demographics, and shopping interests. Based on our modeling results, we discuss various implications for personalization, privacy, and personal data rights.
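As a rough illustration of what such a user model looks like in practice, the sketch below fits a logistic regression predicting a binary trait from coarse mobile-usage features. The features, labels and synthetic data are illustrative assumptions, not the study's actual dataset or modeling pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for per-user mobile-usage features; the feature
# choices below are assumptions made purely for illustration.
rng = np.random.default_rng(1)
n_users = 61
sessions_per_day = rng.poisson(40, n_users)      # assumed activity feature
night_usage_frac = rng.random(n_users)           # assumed share of night-time use
distinct_apps = rng.integers(5, 60, n_users)     # assumed app-diversity feature

X = np.column_stack([sessions_per_day, night_usage_frac, distinct_apps])
y = rng.random(n_users) < 0.5                    # placeholder binary trait labels

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict_proba(X[:3])[:, 1])          # predicted trait probabilities
```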
A knowledge graph is a graph with entities of different types as nodes and various relations among them as edges. The construction of knowledge graphs over the past decades has facilitated many applications, such as link prediction, web search analysis, question answering, etc. Knowledge graph embedding aims to represent the entities and relations of a large-scale knowledge graph as elements of a continuous vector space. Existing methods, e.g., TransE, TransH and TransR, learn the embedding representation by defining a global margin-based loss function over the data. However, the optimal loss function is determined experimentally, with its parameters examined among a closed set of candidates. Moreover, embeddings over two knowledge graphs with different entities and relations share the same set of candidate loss functions, ignoring the locality of each graph. This limits the performance of embedding-related applications. In this paper, a locally adaptive translation method for knowledge graph embedding, called TransA, is proposed to find the optimal loss function by adaptively determining its margin over different knowledge graphs. The convergence of TransA is then verified in terms of its uniform stability. To keep the embeddings up to date when new vertices and edges are added to the knowledge graph, an incremental algorithm for TransA, called iTransA, is proposed, which adaptively adjusts the optimal margin over time. Experiments on two benchmark data sets demonstrate the superiority of the proposed method compared to state-of-the-art ones.
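The margin-based loss the Trans* family optimizes can be sketched concretely. Below, a TransE-style score is combined with a margin placed adaptively between the score distributions of positive and corrupted triples; the specific adaptation rule (the midpoint of the mean scores) is an illustrative assumption, not TransA's actual derivation.

```python
import numpy as np

def score(h, r, t):
    """TransE-style plausibility score: lower means more plausible (h + r close to t)."""
    return np.linalg.norm(h + r - t, ord=1, axis=-1)

def adaptive_margin_loss(pos, neg):
    """pos, neg: tuples of (head, relation, tail) embedding matrices."""
    pos_scores = score(*pos)
    neg_scores = score(*neg)
    # Assumed adaptation rule: place the margin halfway between the mean
    # scores of positive and corrupted triples for this graph.
    margin = max(0.0, (neg_scores.mean() - pos_scores.mean()) / 2.0)
    loss = np.maximum(0.0, margin + pos_scores - neg_scores).sum()
    return loss, margin

# Minimal usage with random embeddings standing in for learned ones.
dim, n = 50, 128
rng = np.random.default_rng(0)
pos = tuple(rng.normal(size=(n, dim)) for _ in range(3))   # positive triples
neg = tuple(rng.normal(size=(n, dim)) for _ in range(3))   # corrupted triples
print(adaptive_margin_loss(pos, neg))
```

Recomputing the margin from the current score distributions each time new triples arrive is one plausible way to read the incremental variant, though the paper's iTransA rule may differ.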
The use of queries to find products and services that are located nearby is increasing rapidly, due mainly to the ubiquity of Internet access and the location services provided by smartphones. Local search engines help users by matching queries with a predefined geographical connotation (local queries) against a database of local business listings. Local search differs from traditional Web search because, to correctly capture users' click behavior, the estimation of relevance between a query and candidate results must be integrated with geographical signals, such as distance. The intuition is that users prefer businesses that are physically closer to them or in a convenient area (e.g., close to their home). However, this notion of closeness depends upon other factors, like the business category, the quality of the service provided, the density of businesses in the area of interest, the hour of the day or even the day of the week. In this work we perform an extensive analysis of online users' interactions with a local search engine, investigating their intent and temporal patterns, and highlighting relationships between distance-to-business and other factors, such as business reputation. Furthermore, we investigate the problem of estimating the click-through rate on local search (LCTR) by exploiting the combination of standard retrieval methods with a rich collection of geo-, user- and business-dependent features. We validate our approach on a large log collected from a real-world local search service. Our evaluation shows that the non-linear combination of business and user information, geo-local and textual relevance features leads to significant improvements over existing alternative approaches based on a combination of relevance, distance and business reputation.
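One common way to realize such a non-linear combination of features is a gradient-boosted model over per-impression feature vectors. The sketch below uses synthetic data and an assumed feature set (textual relevance, distance, reputation, hour of day) purely to illustrate the setup; it is not the paper's model or query log.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic impressions standing in for the local-search log; the feature
# set and the click-generation rule are illustrative assumptions.
rng = np.random.default_rng(0)
n = 5000
text_relevance = rng.random(n)             # e.g. a normalized retrieval score
distance_km = rng.exponential(3.0, n)      # user-to-business distance
reputation = rng.random(n)                 # e.g. a normalized average rating
hour_of_day = rng.integers(0, 24, n)

X = np.column_stack([text_relevance, distance_km, reputation, hour_of_day])
# Synthetic clicks: closer, more relevant, better-rated businesses are clicked more.
p_click = 1 / (1 + np.exp(-(2 * text_relevance - 0.3 * distance_km + reputation - 1)))
y = rng.random(n) < p_click

model = GradientBoostingClassifier().fit(X, y)
print(model.predict_proba(X[:5])[:, 1])    # estimated click-through rates
```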