Archive

Posts Tagged ‘Meetings’

Julia Stoyanovich and Gerome Miklau will give talks at Télécom ParisTech on December 5th

November 15th, 2011
Comments Off

Webdam is very happy to welcome you to Télécom ParisTech on December 5th for the talks organized by Pierre Senellart.

The talks will take place at Télécom ParisTech, 46 rue Barrault, 75013 Paris, in room C017 in the basement.

Program:

Gerome Miklau's talk abstract

Using Inference to Improve the Accuracy of Differentially-Private Output

Differential privacy is a rigorous privacy standard that protects against powerful adversaries, offers precise accuracy guarantees, and has been successfully applied to a range of data analysis tasks. When differential privacy is satisfied, participants in a dataset enjoy the compelling assurance that information released about the dataset is virtually indistinguishable whether or not their personal data is included.

Differential privacy is achieved by introducing randomness into query answers, and a major goal of research in this area is to devise methods that offer the best accuracy for a fixed level of privacy. The original algorithm for achieving differential privacy, commonly called the Laplace mechanism, returns the true answer after the addition of random noise drawn from a Laplace distribution. If an analyst requires only the answer to a single query about the database, then a version of the Laplace mechanism is known to offer optimal accuracy. But the Laplace mechanism can be severely suboptimal when a set of correlated queries are submitted, and despite much recent work, optimal strategies for answering a collection of correlated queries are not known.
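As a concrete illustration (a generic sketch, not the speaker's code), the Laplace mechanism for a counting query, whose sensitivity is 1, can be written as follows:

```python
import numpy as np

def laplace_mechanism(true_answer, sensitivity, epsilon, rng=None):
    """Differentially private answer: the true value plus Laplace
    noise with scale sensitivity / epsilon."""
    rng = rng or np.random.default_rng()
    return true_answer + rng.laplace(0.0, sensitivity / epsilon)

# A counting query ("how many people are over 30?") changes by at
# most 1 when one person's record is added or removed, so its
# sensitivity is 1.
ages = [23, 35, 41, 29, 52, 37]
true_count = sum(1 for a in ages if a > 30)
private_count = laplace_mechanism(true_count, sensitivity=1, epsilon=0.1)
```

Smaller epsilon means stronger privacy but a larger noise scale, which is exactly the accuracy/privacy trade-off the talk addresses.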

After reviewing the basic principles of differential privacy, I will describe two examples of how query constraints and statistical inference can be used to construct more accurate differentially-private algorithms, with no privacy penalty. The first example comes from our recent work investigating the properties of a social network that can be studied without threatening the privacy of individuals and their connections. I will show that the degree distribution of a network can be estimated privately and accurately by asking a special query for which constraints are known to hold, and then exploiting the constraints to infer a more accurate final result. The second example comes from the analysis of more typical tabular data (such as census or medical data). When answering a set of predicate counting queries, I will show that correlations amongst the queries can be exploited to significantly reduce error introduced by the privacy mechanism.
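The degree-distribution example can be sketched in the spirit of constrained inference: query the degree sequence in sorted order, add noise, then project the noisy answers back onto the set of non-decreasing sequences (isotonic regression via Pool Adjacent Violators). Function names are illustrative, not from the talk:

```python
import numpy as np

def pava_nondecreasing(y):
    """Least-squares projection of y onto non-decreasing sequences,
    via the Pool Adjacent Violators algorithm."""
    sums, counts = [], []
    for v in map(float, y):
        sums.append(v)
        counts.append(1)
        # Merge adjacent blocks while their means violate monotonicity.
        while len(sums) > 1 and sums[-2] / counts[-2] > sums[-1] / counts[-1]:
            s, c = sums.pop(), counts.pop()
            sums[-1] += s
            counts[-1] += c
    out = []
    for s, c in zip(sums, counts):
        out.extend([s / c] * c)
    return out

# Sorted degree sequence of a small graph, noisy private answers,
# and answers repaired using the sortedness constraint.
rng = np.random.default_rng(0)
true_degrees = [1, 1, 2, 3, 5, 8]
noisy = [d + rng.laplace(0, 2.0) for d in true_degrees]
repaired = pava_nondecreasing(noisy)
```

The projection never hurts and often reduces error substantially, and because it only post-processes the noisy output, it costs nothing in privacy.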

Julia Stoyanovich's talk abstract

Ranked Exploration of Large Structured Datasets

In online applications such as Yahoo! Personals and Trulia.com, users define structured profiles in order to find potentially interesting matches. Typically, profiles are evaluated against large datasets and produce thousands of ranked matches. Highly ranked results tend to be homogeneous, which hinders data exploration. For example, a dating website user who is looking for a partner between 20 and 40 years old, and who sorts the matches by income from higher to lower, will see a large number of matches in their late 30s who hold an MBA degree and work in the financial industry, before seeing any matches in different age groups and walks of life. An alternative to presenting results in a ranked list is to find clusters, identified by a combination of attributes that correlate with rank, and that allow for richer exploration of the result set.

In the first part of this talk I will propose a novel data exploration paradigm, termed rank-aware interval-based clustering. I will formally define the problem and, to solve it, will propose a novel measure of locality, together with a family of clustering quality measures appropriate for this application scenario. These ingredients may be used by a variety of clustering algorithms, and I will present BARAC, a particular subspace-clustering algorithm that enables rank-aware interval-based clustering in domains with heterogeneous attributes. I will present results of a large-scale user study that validates the effectiveness of this approach. I will also demonstrate scalability with an extensive performance evaluation on datasets from Yahoo! Personals, a leading online dating site, and on restaurant data from Yahoo! Local.

In the second part of this talk I will describe on-going work on data exploration for datasets in which multiple alternative rankings are defined over the items, and where each ranking orders only a subset of the items. Such datasets arise naturally in a variety of application domains, including social (e.g., restaurant and movie rating sites) and biological (e.g., analysis of genetic data). In these datasets there is often a need to aggregate multiple rankings, computing, e.g., a single ranked list of differentially expressed genes across a variety of experimental conditions, or of restaurants that are well-liked by one’s friends. I will argue that blindly aggregating multiple rankings into a single list may lead to an uninformative result, because it may not fully leverage opinions of different, possibly disagreeing, groups of judges. I will describe a framework that robustly identifies ranked agreement, i.e., it finds groups of judges whose rankings can be meaningfully aggregated. Finally, I will show how structured attributes of items and of judges can be used to guide the process of identifying ranked agreement, and to describe the resulting consensus rankings to a user.
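The talk's framework for detecting ranked agreement is not spelled out in this summary, but the kind of blind aggregation it argues against can be illustrated with a simple Borda count (the data is invented for the example):

```python
from collections import defaultdict

def borda_aggregate(rankings):
    """Blindly aggregate ranked lists by Borda count: position i in a
    list of length n earns n - i points; ties break alphabetically."""
    scores = defaultdict(float)
    for ranking in rankings:
        n = len(ranking)
        for i, item in enumerate(ranking):
            scores[item] += n - i
    return sorted(scores, key=lambda item: (-scores[item], item))

# Two disagreeing groups of judges: blending them promotes a
# compromise item ("tacos") that neither group ranked first.
group_a = [["sushi", "tacos", "pizza"], ["sushi", "tacos", "pizza"]]
group_b = [["pizza", "tacos", "sushi"], ["tacos", "pizza", "sushi"]]
blended = borda_aggregate(group_a + group_b)
```

This is exactly the "uninformative result" problem: aggregating within each group of agreeing judges would be far more informative than the blended list.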

Bio:
Julia Stoyanovich is a Visiting Scholar at the University of Pennsylvania. Julia holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst. After receiving her B.S. Julia went on to work for two start-ups and one real company in New York City, where she interacted with, and was puzzled by, a variety of massive datasets. Julia's research focuses on modeling and exploring large datasets in the presence of rich semantic and statistical structure. She has recently worked on personalized search and ranking in social content sites, rank-aware clustering in large structured datasets of dating profiles and restaurant reviews, data exploration in repositories of biological objects as diverse as scientific publications, functional genomics experiments and scientific workflows, and representation and inference in large datasets with missing values.

Events

Webdam meeting

March 14th, 2011

PeerSoN: a P2P social network

March 10th, 2009

Report on the presentation of Sonja Buchegger, March 9th, 2009
See the PeerSoN web site and the slides for more details.
Warning: this report reflects the understanding of the post author (Alban Galland) and nothing more.

Context

Ubiquitous computing is a model in which devices and systems collaborate to solve tasks for the user without the user being conscious of it. This paradigm raises privacy problems, since you leave traces everywhere in a virtual world that is integrated with the real world. All these data can be mined, for purposes ranging from advertising to surveillance. This virtual world usually suffers from a lack of forgetting. These systems also tend to centralize users' data on one part of the system. The personal (private), public and commercial spheres collide in this context.

Social networks are another setting where this privacy issue arises, since users store very personal data on these systems. They are usually Web 2.0 services that require an Internet connection. Their main feature is to let users keep in touch with their friends in an ambient way.

Integrating ubiquitous computing and social networks into a privacy-preserving ubiquitous P2P social network is therefore especially challenging. One of the main reasons to design such a system is that social networks naturally overlap with the real world, so ubiquity is especially desirable. It also settles most of the questions about data ownership and prevents providers from dictating the terms of use.

Distribution

Social networks and ubiquitous computing are naturally distributed, and PeerSoN uses distributed storage for the data. To address the online-availability problem, it replicates data on friends, the key parameters being chosen from a trace of users characterizing their temporal and geographical distribution. To address bootstrapping, it also stores data on random nodes. The peers communicate directly, but they use a DHT for lookup. This DHT was built on OpenDHT and PlanetLab in a first version, but too many availability problems led to a centralized emulation of the hash table (put/get/remove operations) in the current version. Peers are identified by the hash of a globally unique identifier (such as an email address). When connecting to the DHT, a user registers her user id, her machine's IP address and her data.
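The put/get/remove lookup interface described above can be sketched as follows (class and field names are hypothetical, not PeerSoN's actual code):

```python
import hashlib

class LookupService:
    """Centralized emulation of a DHT's put/get/remove interface,
    in the spirit of PeerSoN's fallback described above."""

    def __init__(self):
        self._table = {}

    @staticmethod
    def node_id(global_identifier: str) -> str:
        # Peers are identified by the hash of a globally unique
        # identifier, such as an email address.
        return hashlib.sha1(global_identifier.encode("utf-8")).hexdigest()

    def put(self, key, value):
        self._table[key] = value

    def get(self, key):
        return self._table.get(key)

    def remove(self, key):
        self._table.pop(key, None)

# On connection, a user registers her id, current IP address and
# a pointer to her data.
dht = LookupService()
uid = LookupService.node_id("alice@example.org")
dht.put(uid, {"ip": "203.0.113.7", "files": ["profile", "wall"]})
```

In the real system the same three operations would be served by a distributed hash table rather than a single in-memory dictionary.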

Direct exchange

In order not to depend on a network connection (and to push ubiquity further), the design should take delay-tolerant networks into account. They are useful for carrying information from friend to friend; asynchronous messaging is an example of such content, although it is not clear that distribution will work well this way. They are also useful for storage, since the system should use whatever storage is available nearby.

Access control

There is a trade-off between privacy and searchability. The user defines what she wants to be searchable. The system emulates fine-grained access control with keys (who can see which part of the profile). This method also protects against the storage providers. Key management emulates a standard public-key infrastructure, and keys may be exchanged by direct contact.

Related work and issues

  • Distributed file management: usually, the assumptions are that data is stable and that interests follow a Zipf distribution. In a social-network context, data changes a lot and the distribution of interest is local
  • Anonymity: distribution in a DHT leaves fewer traces of the queries
  • Media storage: storage should be optimized by exploiting content novelty

On-going work

Response-time testing under different assumptions about the network.

News

Anonymisation in social-based P2P networks

March 2nd, 2009

Report on the presentation of Fabrice Le Fessant, February 23rd, 2009
See the slides for more details.
Warning: this report reflects the understanding of the post author (Alban Galland) and nothing more.

Context

In P2P file-sharing networks, malicious peers may try to keep a log of the queries issued on the network in order to build upload and download profiles of other peers. To avoid censorship in particular, one may want to design a network where non-trusted peers can contribute to the life of the network without being able to locate either the publisher or the querier. A social-based P2P network naturally fits this requirement: friends are not hidden but trusted, and they can anonymise the exchanges.

Previous work

Some social-based P2P networks already exist, such as the Turtle network. It is close to Gnutella but built on a social network, meaning that connections are chosen and trusted. Search is done by flooding, which is quite expensive in bandwidth.

There are also anti-censorship networks, such as Freenet. It manages small encrypted documents. Search is done by depth-first search, guided by a notion of distance between users. Data is replicated along the back-path of the query. Such a network could easily be restricted to friends.

GNUnet is another example of an anti-censorship network. Search is done by a bounded breadth-first search. It uses a shortcut system that randomly rewrites the identifiers on queries for anonymisation, and a credit system to avoid flooding. It has been shown that these two optimizations are in fact a weakness for anonymisation.

Some clues about Orkut

Some simulations have been done based on a trace of Orkut. They raised interesting questions about the topology of the network.

  • What is the distribution of the nodes degrees?
  • What happens to connectivity when nodes are removed?

The answers to these questions depend heavily on how the crawl was made.

Problem

  • How to manage big files?
  • How to specify the attacker's power, in order to obtain different theoretical guarantees?
  • How to restrict a real network to a sub-network?

Ideas

The load should be balanced between query time and publication time. Most P2P methods do the work at query time, but one could also think of a diffusion process when a resource is published (through feed subscriptions, replication or the materialization of local index tables). Both methods can be mixed; this is the case in structured networks such as DHTs, where a distributed index is materialized and then queried.

Finally, the methods should be optimized depending on the file type and the file size.

News

Social Networks APIs

January 26th, 2009

Report on the presentation of Alban Galland, January 26th, 2009
See the slides for more details.
Warning: this report reflects the understanding of the post author (Alban Galland) and nothing more.

Some existing APIs

There are different kinds of APIs for social networks: micro-formats, ontologies, query interaction with social sites, and application definition for social sites. These different kinds of APIs give different levels of detail on how to represent a social graph and how to query it.

Micro-formats

XFN (XHTML Friends Network) is a micro-format used to tag hyperlinks with predefined friendship concepts. There are plenty of other micro-formats related to social networks. These micro-formats are easy to use and naturally distributed, but it is hard to get a global view of the network and to query it.
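For instance, an ordinary blogroll link can carry XFN's standard rel values directly in the markup (the URLs below are placeholders):

```html
<!-- XFN: friendship metadata attached to plain hyperlinks -->
<a href="http://jane.example.org/" rel="friend met colleague">Jane</a>
<a href="http://bob.example.org/" rel="acquaintance">Bob</a>
```

A crawler that understands XFN can reconstruct a (partial) social graph from such links, which is exactly why a global view requires crawling the whole web.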

Ontologies

FOAF (Friend Of A Friend) is an ontology in RDF and OWL that describes a social network. In particular, it offers a large number of ways to identify people. Because ontologies are hard to design, this one is still unstable. It is also hard to get a global view of the network, and the ontology can lead to overly complex descriptions. Nevertheless, some tools such as the Google Social Graph API add a layer to query both XFN and FOAF information on the web as a whole.

Interaction with social sites

OpenSocial is an example of an API that allows applications to interact with social sites. It is a package of three APIs (JavaScript, REST and RPC). Using one of these APIs, a pair (viewer, application) can query a social site (the container) about the social information of a given user (the owner). For example, a CoolApplication application could use my credentials as a logged-in Orkut user to query Orkut about some of my friends. The readable data are the users' profiles (corresponding to the Orkut profile) and their lists of friends. There are also some capabilities for discovering an unstructured table of data; this feature was probably designed for containers not owned by Google that would nevertheless like to use the API. That is, of course, the limit of the "design for yourself, then try to convince others" approach. The container can implement its own security policy to filter what is readable, but there is no way to specify such a policy and no indication of what the default policy should be. The application can also write two kinds of information: a log of activities and some persistent attribute-value pairs. Facebook's FBJS, REST-like and Connect APIs are designed in the same spirit.
These APIs allow some transfer of data from social sites to applications, which makes the former more useful and the latter usable on more containers. They are nevertheless designed in a centralized way (around a container), and privacy is largely under-specified.

Application specification

FBML is an example of an application-specification API (or at least of rendering a user interface through embedded service calls to Facebook). The principle is quite different from the interaction APIs, since FBML is evaluated by Facebook and the data is not transferred to the application server. FBML can access data using FQL, or using complex tags that wrap FQL queries. FQL is an SQL-like language, and the Facebook social network is described as tables with indexability constraints that restrict the allowed queries. This kind of API limits the expressiveness of applications, but it offers better privacy protection since no data is transferred to the applications.
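FQL has since been retired; in its documented form, a query had to filter on an indexable column (here uid and uid1), for example:

```sql
-- Names of the logged-in user's friends. The WHERE clauses must use
-- indexed columns, per the tables' indexability constraints.
SELECT name FROM user
WHERE uid IN (SELECT uid2 FROM friend WHERE uid1 = me())
```

The indexability constraint is what makes the language safe for the container: arbitrary scans over the whole social graph are simply not expressible.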

Some clues about design of SN APIs

To summarize, a good API should be easy to use, distributed, easy to query as a whole, and should allow data transfer with privacy control.

Such APIs must rely on a model of social networks with:

  • People, who can be identified and authenticated, and who have a profile
  • Relationships, which can be of different types
  • Applications, which can be identified and authenticated, are described by code, and can write to parts of a user's profile

The APIs must also support queries subject to access rights. We believe that a good model of a social network is a distributed knowledge base with access rights. We are currently defining such a model.

News

Incentives for Users of Social Software

January 14th, 2009

Report on the presentation of Panayotis Antoniadis, January 14th, 2009
See the slides for more details.
Warning: this report reflects the understanding of the post author (Alban Galland) and nothing more.

Context

The presentation focused on how to understand and model users' behavior in P2P systems. The design of incentive mechanisms must indeed take into account not only economics but also social behavior.

Social networks are directly connected to the notion of self-organized communities. P2P systems are becoming social more slowly than centralized systems, for legal reasons or because they are less open to participation (friend-to-friend networks control their access). Web-based communities are efficient, but some of them already use P2P for content distribution. One may argue that this content-distribution layer could be used to manage the whole community, making the web-access layer useless. The web layer also raises privacy and censorship problems, which encourages systems that enable independence. Other reasons to use P2P systems may come from applications that exploit distribution features. In truth, the benefit is still not totally obvious: web-based and P2P systems seem in fact complementary. For example, web-based systems could be used to meet people, and P2P systems to interact with friends (in a private way).

Challenges

There are two kinds of challenges:

  • Technical issues: content distribution, information integrity, and various privacy/security issues. In general, identity matters in social systems, and some data should not be shared
  • Incentive issues: participation, resource sharing, trust

The presentation focused on the latter.

Example of Wireless Neighborhood communities

Hybrid on-line communities are both physical and virtual communities; the notion is connected to P2P systems and ad-hoc networks. They can provide Internet access to everybody without relying on hotspots, but it is also fun to have a private network, and it is a means to organize things in a neighborhood. Localized communities already exist (lifeAt, i-neighbors, peuplade.fr, Facebook neighborhoods, Meetup…). These services usually allow users to exchange services and information within their neighborhood, but they are web-based (not physical). There are also grassroots wireless-network communities (Seattle Wireless, AWMN). The idea is to bring both components together to create incentives.

About incentives

Economic vision: users share resources through markets or reciprocity (tokens, reputation…). The goal is to design markets such that, when they reach equilibrium, users exhibit the targeted behavior. Modeling the optimization process is possible when utility and cost are known. But there is an information problem: utility and cost are usually unknown, even to the users themselves, whose behavior is then hard to predict.

Social vision: even without economic reasons, P2P systems still work: people make efforts without direct economic incentives. Some reasons are community spirit, value/cost ratio, self-efficacy, altruism, and social incentives in general…

In general, incentives cover a wide range from extrinsic to intrinsic motivations: payments, reciprocity, long-term benefit, popularity, status, self-image, sense of efficacy, community vision, interest, fun…

The economic vision can be difficult to apply at the resource layer. For example, FON lets you share a bit of your WiFi with other Foneros, based on global reciprocity. But there is a deep asymmetry in resource usage, since few people will try to access the WiFi of someone living in an isolated place. There are other models, such as the Yellow Chair project or the wifi-thank-you site, where incentives are purely social. In general, incentives are not additive: the more you control (extrinsic incentives), the less people are self-motivated (intrinsic incentives).

An interesting idea is to use cross-layer incentives: sharing low-level resources is rewarded at the community level, and reciprocally, good members of the community get more low-level resources.

Application to social-software design

The key features of social software are:

  • the general vision or promise
  • the community outcome
  • the personal image of the user
  • local activity: the user must feel that somebody has seen her profile or interacts with her
  • what the user sees of the rest of the world
  • the types of relations and interactions…

On-going work

After studying at the Computer Science Department of the University of Crete and at the Department of Informatics of the Athens University of Economics and Business, Panayotis is now a post-doctoral researcher at the LIP6 laboratory (Université Pierre et Marie Curie, Paris), working on the design of incentive mechanisms for shared network testbeds (like PlanetLab) and virtual communities (on the Internet and wireless networks). Panayotis collaborates with Benedicte Le Grand, Marcelo Dias de Amorim, Ileana Apostol and Tridib Banerjee. He is a member of the WIP project.

Updated 01/20/2009 thanks to P. Antoniadis' helpful comments

News

01/09/2009 : Webdam kickoff meeting

January 7th, 2009

When: Friday, January 9th

Who: by invitation only

Where: École Normale Supérieure, Cachan, LSV library

Agenda:

  • 12:00-14:00: lunch
  • 14:00-15:00: presentation of Webdam (Serge)
  • 15:00-16:00: brainstorming

Summary: slides of the presentation.

Events

Introduction to Social Networks on the Web

December 11th, 2008

Report on the presentation of Pierre Senellart, December 11, 2008.
See the slides for more details.
Warning: this report reflects the understanding of the post author (Alban Galland) and nothing more.

Typology

Definition: a social-content web site is a web site with users, content, and implicit or explicit links between users.

This rather broad definition covers blog and multimedia-content sites as well as sites explicitly based on social networks (SN). Social-content web sites are either user-based or content-based. User-based sites may be pure social networks (professional, like LinkedIn; friendship-oriented, like MySpace; or mixed, like Facebook), blog communities (SkyRock) or dating sites (Meetic). Content-based sites are sites where users share or annotate content and meet through common interests. They may be content catalogs (from music, like Last.fm, to bookmarks, like Delicious), content-sharing sites (pictures on Flickr, videos on YouTube), content-producing sites (Wikipedia, forums, Yahoo! Answers…) or web shops (eBay or Amazon).

Models

The natural model is a graph, directed or undirected, which may be multipartite (users, content, tags…). Links between users may be explicit (bridging links, declared) or implicit (bonding links, through content).

SN graphs are characterized by:

  • sparsity
  • small distances (small-world graphs, the six-degrees-of-separation theory)
  • high transitivity (clustering: two nodes close to a third one are likely to be close to each other)
  • a degree distribution that follows a power law

SN graphs are neither random graphs (which are only sparse with small distances) nor random modifications of a regular grid (which are only sparse, with small distances and high transitivity). They are closer to scale-free graphs, built by adding nodes one by one and linking each new node so as to preserve the properties described above.
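The construction just described (add nodes one by one, link them preferentially to well-connected nodes) can be sketched with the standard Barabási-Albert model, which may differ in detail from the variant the talk had in mind:

```python
import random
from collections import Counter

def preferential_attachment_graph(n, m=2, seed=0):
    """Grow a scale-free graph: each new node links to m distinct
    existing nodes chosen with probability proportional to degree."""
    rng = random.Random(seed)
    edges = [(0, 1)]   # seed graph: a single edge
    targets = [0, 1]   # each node appears once per unit of degree
    for new in range(2, n):
        chosen = set()
        while len(chosen) < min(m, new):
            chosen.add(rng.choice(targets))
        for t in chosen:
            edges.append((new, t))
            targets += [new, t]
    return edges

edges = preferential_attachment_graph(1000)
deg = Counter()
for u, v in edges:
    deg[u] += 1
    deg[v] += 1
# The graph is sparse (about 2n edges) while the degree distribution
# is heavy-tailed: a few hubs, many low-degree nodes.
```

Sampling attachment targets from the degree-weighted `targets` list is what produces the power-law tail; uniform sampling would instead yield an (approximately) exponential degree distribution.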

Algorithms

  • PageRank: this algorithm ranks the nodes of a graph according to their importance in the graph. It is not very helpful on undirected graphs, since it essentially converges to the degree of each node, but variants exist.
  • Community search: extracting communities from the graph can be done using minimum-cut/maximum-flow algorithms, Markov clustering (MCL), or the removal of high-betweenness edges.
  • Improving information retrieval: tags can be used to improve semantic search. Recommendation is also a topic of interest, using collaborative filtering (user-based) or item-based recommendation. Finally, IR can be biased by distance in the SN graph.
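The remark about PageRank on undirected graphs can be checked numerically: with the damping factor close to 1, power-iteration PageRank is essentially proportional to node degree. A small sketch (not a production implementation):

```python
import numpy as np

def pagerank(adj, damping=0.999, iters=1000):
    """Power-iteration PageRank; adj[i, j] = 1 iff there is an edge i -> j."""
    n = adj.shape[0]
    M = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic walk matrix
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * (M.T @ r)
    return r

# Undirected "triangle with a pendant" graph; degrees are 3, 2, 2, 1.
A = np.array([[0, 1, 1, 1],
              [1, 0, 1, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0]], dtype=float)
r = pagerank(A)
# With damping near 1, r is (almost) proportional to the degree
# vector [3, 2, 2, 1]: the stationary distribution of a random walk
# on an undirected graph is deg(i) / (2 * #edges).
```

With the usual damping of 0.85 the scores deviate somewhat from the degrees, but the ranking on undirected graphs still tracks degree closely, which is why the plain algorithm adds little information there.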

Conclusion

  • Social networks are larger than Facebook!
  • There are natural models, and natural research directions on IR, trust…

News