Archive

Posts Tagged ‘Algorithms’

Julia Stoyanovich and Gerome Miklau are going to give a talk at Télécom ParisTech on December 5th

November 15th, 2011

Webdam is very happy to welcome you to Télécom ParisTech on December 5th for the talks organized by Pierre Senellart.

The talks will take place at Télécom ParisTech, 46 rue Barrault, 75013 Paris, in room C017 in the basement.

Planning:

Gerome Miklau talk abstract

Using Inference to Improve the Accuracy of Differentially-Private Output

Differential privacy is a rigorous privacy standard that protects against powerful adversaries, offers precise accuracy guarantees, and has been successfully applied to a range of data analysis tasks. When differential privacy is satisfied, participants in a dataset enjoy the compelling assurance that information released about the dataset is virtually indistinguishable whether or not their personal data is included.

Differential privacy is achieved by introducing randomness into query answers, and a major goal of research in this area is to devise methods that offer the best accuracy for a fixed level of privacy. The original algorithm for achieving differential privacy, commonly called the Laplace mechanism, returns the true answer after the addition of random noise drawn from a Laplace distribution. If an analyst requires only the answer to a single query about the database, then a version of the Laplace mechanism is known to offer optimal accuracy. But the Laplace mechanism can be severely suboptimal when a set of correlated queries is submitted, and despite much recent work, optimal strategies for answering a collection of correlated queries are not known.
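As a rough illustration of the Laplace mechanism described above, here is a minimal Python sketch for a single numeric query. The function names, the `sensitivity` and `epsilon` parameters and the example numbers are illustrative assumptions and do not come from the talk; the noise scale `sensitivity / epsilon` is the usual calibration.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two independent exponential variables with mean
    # `scale` follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    # Return the true answer perturbed by Laplace noise of scale
    # sensitivity / epsilon (smaller epsilon means more noise).
    if rng is None:
        rng = random.Random()
    return true_answer + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical counting query: one person changes a count by at most 1,
# so the sensitivity is assumed to be 1.
true_count = 100
print(laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5))
```

Each call draws fresh noise, so repeated calls on the same count return different values that scatter around the true count.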

After reviewing the basic principles of differential privacy, I will describe two examples of how query constraints and statistical inference can be used to construct more accurate differentially-private algorithms, with no privacy penalty. The first example comes from our recent work investigating the properties of a social network that can be studied without threatening the privacy of individuals and their connections. I will show that the degree distribution of a network can be estimated privately and accurately by asking a special query for which constraints are known to hold, and then exploiting the constraints to infer a more accurate final result. The second example comes from the analysis of more typical tabular data (such as census or medical data). When answering a set of predicate counting queries, I will show that correlations amongst the queries can be exploited to significantly reduce error introduced by the privacy mechanism.
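The abstract does not spell out the special query or the inference step. The sketch below shows one plausible reading, under stated assumptions: the query is assumed to return the degrees in sorted order, so the true answer is known to be non-decreasing, and the noisy answer is then projected back onto non-decreasing sequences by least squares (pool-adjacent-violators). The function names and the example numbers are hypothetical. Because the projection uses only the noisy output, it is post-processing and adds no privacy cost.

```python
def isotonic_fit(values):
    # Least-squares non-decreasing fit via pool-adjacent-violators.
    blocks = []  # each block is [sum of values, number of values]
    for v in values:
        blocks.append([v, 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    fitted = []
    for s, c in blocks:
        fitted.extend([s / c] * c)
    return fitted

def squared_error(estimate, truth):
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))

# Hypothetical example: true sorted degrees and a noisy answer to the query.
true_sorted = [1, 1, 2, 2, 3, 5]
noisy = [1.4, 0.2, 2.9, 1.5, 3.3, 4.6]
fitted = isotonic_fit(noisy)
print(squared_error(noisy, true_sorted), squared_error(fitted, true_sorted))
```

Since the true sequence itself satisfies the ordering constraint, this projection can never increase the squared error relative to the raw noisy answer.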

Julia Stoyanovich talk abstract

Ranked Exploration of Large Structured Datasets

In online applications such as Yahoo! Personals and Trulia.com, users define structured profiles in order to find potentially interesting matches. Typically, profiles are evaluated against large datasets and produce thousands of ranked matches. Highly ranked results tend to be homogeneous, which hinders data exploration. For example, a dating website user who is looking for a partner between 20 and 40 years old, and who sorts the matches by income from higher to lower, will see a large number of matches in their late 30s who hold an MBA degree and work in the financial industry, before seeing any matches in different age groups and walks of life. An alternative to presenting results in a ranked list is to find clusters, identified by a combination of attributes that correlate with rank, and that allow for richer exploration of the result set.

In the first part of this talk I will propose a novel data exploration paradigm, termed rank-aware interval-based clustering. I will formally define the problem and, to solve it, will propose a novel measure of locality, together with a family of clustering quality measures appropriate for this application scenario. These ingredients may be used by a variety of clustering algorithms, and I will present BARAC, a particular subspace-clustering algorithm that enables rank-aware interval-based clustering in domains with heterogeneous attributes. I will present results of a large-scale user study that validates the effectiveness of this approach. I will also demonstrate scalability with an extensive performance evaluation on datasets from Yahoo! Personals, a leading online dating site, and on restaurant data from Yahoo! Local.

In the second part of this talk I will describe on-going work on data exploration for datasets in which multiple alternative rankings are defined over the items, and where each ranking orders only a subset of the items. Such datasets arise naturally in a variety of application domains, including social (e.g., restaurant and movie rating sites) and biological (e.g., analysis of genetic data). In these datasets there is often a need to aggregate multiple rankings, computing, e.g., a single ranked list of differentially expressed genes across a variety of experimental conditions, or of restaurants that are well-liked by one’s friends. I will argue that blindly aggregating multiple rankings into a single list may lead to an uninformative result, because it may not fully leverage opinions of different, possibly disagreeing, groups of judges. I will describe a framework that robustly identifies ranked agreement, i.e., it finds groups of judges whose rankings can be meaningfully aggregated. Finally, I will show how structured attributes of items and of judges can be used to guide the process of identifying ranked agreement, and to describe the resulting consensus rankings to a user.

Bio:
Julia Stoyanovich is a Visiting Scholar at the University of Pennsylvania. Julia holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst. After receiving her B.S., Julia went on to work for two start-ups and one real company in New York City, where she interacted with, and was puzzled by, a variety of massive datasets. Julia’s research focuses on modeling and exploring large datasets in the presence of rich semantic and statistical structure. She has recently worked on personalized search and ranking in social content sites, rank-aware clustering in large structured datasets that focus on dating and restaurant reviews, data exploration in repositories of biological objects as diverse as scientific publications, functional genomics experiments and scientific workflows, and representation and inference in large datasets with missing values.


PARIS: Probabilistic Alignment of Relations, Instances, and Schema (website)

November 2nd, 2011

A website for PARIS

One of the main challenges that the Semantic Web faces is the integration of a growing number of independently designed ontologies. In this work, we present PARIS, an approach for the automatic alignment of ontologies. PARIS aligns not only instances, but also relations and classes. Alignments at the instance level cross-fertilize with alignments at the schema level. Thereby, our system provides a truly holistic solution to the problem of ontology alignment. The heart of the approach is probabilistic, i.e., we measure degrees of matchings based on probability estimates. This allows PARIS to run without any parameter tuning. We demonstrate the efficiency of the algorithm and its precision through extensive experiments. In particular, we obtain a precision of around 90% in experiments with some of the world’s largest ontologies.


Introduction to Social Networks on Web

December 11th, 2008

Report on the presentation by Pierre Senellart, December 11, 2008.
See slides for more details.
Warning: this report reflects the understanding of the post author (Alban Galland) and nothing more.

Typology

Definition: a social content web site is a web site with users, content and implicit or explicit links between users.

This rather broad definition covers blog and multimedia content sites as well as sites explicitly built around social networks (SN). Social content web sites are either user-based or content-based. User-based sites may be pure SN (professional, such as LinkedIn; friendship, such as MySpace; or mixed, such as Facebook), blog communities (SkyRock) or dating sites (Meetic). Content-based sites are sites where users can share or annotate content and meet through common interests. They may be catalogs of content (from music, such as Last.fm, to bookmarks, such as Delicious), content-sharing sites (pictures, such as Flickr; videos, such as YouTube), content-producing sites (Wikipedia, forums, Yahoo! Answers…) or web shops (eBay or Amazon).

Models

The natural model is a graph, directed or undirected, which could be multipartite (users, content, tags …). The links between users could be explicit (bridging links, declaration) or implicit (bonding links, through content).

The SN graphs are characterized by

  • sparsity (few edges compared with the number of possible edges)
  • small distances (small-world graph, six degrees of separation theory)
  • high transitivity (clustering: two nodes close to a third one are likely to be close to each other)
  • a degree distribution that follows a power law
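Two of the properties listed above can be measured directly on a graph. The following Python sketch computes the density and the transitivity (the fraction of connected triples that are closed into triangles) of a small undirected graph stored as a dictionary of neighbour sets. The toy graph and the function names are illustrative assumptions, not part of the presentation.

```python
from itertools import combinations

def density(adj):
    # Fraction of possible undirected edges that are present.
    n = len(adj)
    m = sum(len(neighbours) for neighbours in adj.values()) / 2
    return 2 * m / (n * (n - 1))

def transitivity(adj):
    # Fraction of paths a - v - b whose endpoints a and b are also linked.
    triples = 0
    closed = 0
    for v, neighbours in adj.items():
        for a, b in combinations(neighbours, 2):
            triples += 1
            if b in adj[a]:
                closed += 1
    return closed / triples if triples else 0.0

# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(density(graph), transitivity(graph))
```

On a real SN graph one would expect a density close to zero together with a transitivity much higher than that of a random graph of the same density.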

SN are neither random graphs (which are only sparse, with small distances) nor random modifications of a regular grid (which are only sparse, with small distances and high transitivity). They are closer to scale-free graphs, built by adding nodes one by one and linking each new node so as to preserve the properties described above.

Algorithms

  • PageRank: this algorithm is used to rank nodes in a graph according to their importance in the graph. It is not helpful on undirected graphs, since it converges to the degree of the node, but variants exist.
  • Search for communities: communities can be extracted from the graph using minimum cut/maximum flow algorithms, Markov clustering algorithms (MCL), or the removal of high-betweenness edges.
  • Improving information retrieval (IR): tags can be used to improve semantic search. Recommendation is also a topic of interest, using collaborative filtering (user-based) or item-based recommendation. Finally, IR can be biased by distance on the SN graph.
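The remark about PageRank on undirected graphs can be checked with a small power-iteration sketch. The function name, the `damping` and `iterations` parameters and the toy graph are illustrative assumptions. The code assumes every node has at least one neighbour. Exact proportionality to the degree holds for the pure random walk (`damping=1.0`) on a connected, non-bipartite undirected graph; with the usual teleportation term the scores are only approximately proportional to the degree.

```python
def pagerank(adj, damping=0.85, iterations=200):
    # Power iteration on a graph given as {node: set of neighbours}.
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Teleportation share, then each node splits its rank among neighbours.
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            share = damping * rank[v] / len(adj[v])
            for w in adj[v]:
                new_rank[w] += share
        rank = new_rank
    return rank

# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
total_degree = sum(len(neighbours) for neighbours in graph.values())
ranks = pagerank(graph, damping=1.0)
for v in graph:
    print(v, ranks[v], len(graph[v]) / total_degree)
```

The two printed columns agree closely, which is why plain PageRank adds little information beyond the degree on an undirected SN graph.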

Conclusion

  • SN is larger than Facebook!
  • There are some natural models and some natural research directions on IR, trust…
