Webdam Project

Yanlei Diao: Scalable, Low-Latency Data Analytics and its Applications

December 6th, 2012


When: Thursday, December 20th 2012, 14.00, room 445, PCRI

Abstract

An integral part of many data-intensive applications is the need to collect and analyze enormous data sets, such as click streams, search logs, and sensor streams, to derive answers and insights with low latency. Concurrently, new programming models and architectures have been developed for large-scale cluster computing, exemplified by recent MapReduce systems. However, these systems are designed for batch processing and require the data set to be fully loaded into the cluster before analytical queries can run, which delays query answers considerably.

In this talk, I present the design of Scalla, a scalable, low-latency analytics platform that fundamentally transforms the existing cluster computing paradigm into an incremental parallel processing paradigm, providing the combined benefits of massive parallelism, incremental answers, and I/O efficiency. Our technical contributions include replacing an existing popular mechanism for partitioned parallelism with a purely hash-based mechanism and using dynamic frequency analysis to offer in-memory processing for most of the data. I will also examine two application scenarios: click stream analysis, which has been used in our evaluation, and genomic data analysis, a new project that leverages Scalla for massive-scale genomic data processing and analysis.
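
As a rough illustration of the hash-based, incremental style of processing described above (not Scalla's actual implementation, which the abstract does not detail), the following Python sketch routes each click record to a partition by hashing its key and reports running counts before the whole input has been seen. The function names, the partition count, and the reporting interval are assumptions made for this example only.

```python
import zlib
from collections import Counter
from typing import Dict, Iterable, Iterator


def partition_of(key: str, num_partitions: int) -> int:
    """Pick a partition from the key's hash alone (no sorting involved)."""
    # crc32 is used because it is stable across processes, unlike hash(str).
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def incremental_counts(
    clicks: Iterable[str], num_partitions: int = 4, report_every: int = 3
) -> Iterator[Dict[str, int]]:
    """Yield a snapshot of per-key counts after every `report_every` records."""
    # One in-memory counter per (hypothetical) partition.
    partitions = [Counter() for _ in range(num_partitions)]
    for seen, key in enumerate(clicks, start=1):
        partitions[partition_of(key, num_partitions)][key] += 1
        if seen % report_every == 0:
            # Each key lives in exactly one partition, so merging is a union.
            snapshot = Counter()
            for part in partitions:
                snapshot.update(part)
            yield dict(snapshot)


if __name__ == "__main__":
    stream = ["home", "cart", "home", "search", "home", "cart"]
    for answer in incremental_counts(stream):
        print(answer)
```

Each snapshot is an early answer over the records seen so far, in contrast with a batch job that would report only after loading the full data set.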

Short bio

Yanlei Diao is an Associate Professor of Computer Science at the University of Massachusetts Amherst. Her research interests are in information architectures and data management systems, with a focus on large-scale data analysis, data streams, uncertain data management, and flash memory databases. She received her PhD in Computer Science from the University of California, Berkeley in 2005, her M.S. in Computer Science from the Hong Kong University of Science and Technology in 2000, and her B.S. in Computer Science from Fudan University in 1998.

Yanlei Diao was a recipient of the NSF CAREER Award and the IBM Scalable Innovation Faculty Award, and was a finalist for the Microsoft Research New Faculty Fellowship. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin. Her PhD dissertation “Query Processing for Large-Scale XML Message Brokering” won the 2006 ACM-SIGMOD Dissertation Award Honorable Mention. She is an associate editor of PVLDB 2013 and has served on the organizing committees of SIGMOD, CIDR, DMSN, the New Researcher Symposium, and the New England Database Summit. She has served on program committees of numerous international conferences and workshops.

Events, News

WebDam-MoDaS Workshop in Eilat 2012

November 6th, 2012


Information and the report about the Eilat workshop are available here.

You can also find the details on the website.

Events, News

Seminar Yannis Papakonstantinou – Friday September 14th – ENS Cachan

August 30th, 2012


Speaker: Prof. Yannis Papakonstantinou, Computer Science and
Engineering, Univ of California at San Diego

When: Friday September 14th, 10:30

Where: ENS Cachan http://www.ens-cachan.fr/
amphithéâtre 121, Léonard de Vinci building

Title: Declarative, optimizable data-driven specifications of web &
mobile applications

Abstract:
Developers of web and mobile applications write too much
low-level “plumbing” code to efficiently access, integrate and
coordinate application state that resides on multiple sub-systems of
the architecture and is accessed using different languages: SQL at
the database server; HTML and JavaScript at the browser, which in
HTML5 includes its own database state; Java or other programming
languages at the application server.

In the spirit of Active XML, the FORWARD project replaces such low-level
code with declarative specifications. Its cornerstones are
(i) the unified application state virtual database, which enables
modeling and manipulating the entire application state in an extension
of SQL, named SQL++, and
(ii) the specification of Ajax pages as essentially rendered views over
the unified application state.

We discuss problems solved in the last three years and the system
resulting from this activity. We then discuss a cluster of issues resulting
from both mobile agents and demanding Big Data visualizations and
propose a recently-initiated effort on an asynchronous SQL.

Consequently, the following three problems are resolved by appropriate
reduction to data management problems, where prior database research
literature is leveraged and extended.

1. The partial change of Ajax pages, in response to application state
changes, is reduced to an incremental view maintenance problem. IDs
that retain the provenance of the page data play an instrumental
role in efficiency.

2. Efficient data access is reduced to semistructured query processing
over an integrated view that involves large database(s) and small main
memory-based sources. We connect with prior work on OQL.

3. The inherent location transparency of the specifications is
exploited in order to perform computation at the appropriate location
(browser vs server). More broadly, the talk discusses ongoing and
future work in utilizing the increased abilities of HTML5 clients
towards achieving low-latency mobile web applications,
while location transparency of the specifications is retained.
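
To make the first reduction more concrete, here is a minimal Python sketch of incremental view maintenance keyed by row IDs. It is not FORWARD's algorithm or its SQL++ syntax; the view definition (rows priced under 100), the `render` function, and the delta format are all hypothetical choices for illustration. The point is only that, when each page fragment keeps the ID of the row it came from, a change to the application state can be mapped to the few fragments that need refreshing.

```python
from typing import Dict, Optional, Set

Row = Dict[str, object]


def in_view(row: Row) -> bool:
    """Hypothetical view predicate: show only rows priced under 100."""
    return row["price"] < 100


def render(row: Row) -> str:
    """Hypothetical rendering of one row as a page fragment."""
    return f"<li>{row['name']}: {row['price']}</li>"


def full_view(rows: Dict[int, Row]) -> Dict[int, str]:
    """Recompute the whole rendered view from scratch (the expensive path)."""
    return {row_id: render(row) for row_id, row in rows.items() if in_view(row)}


def apply_delta(view: Dict[int, str], delta: Dict[int, Optional[Row]]) -> Set[int]:
    """Update `view` in place from changed rows only.

    `delta` maps a row ID to its new value, or to None if the row was deleted.
    Returns the IDs whose page fragment actually changed.
    """
    changed = set()
    for row_id, row in delta.items():
        new = render(row) if row is not None and in_view(row) else None
        old = view.get(row_id)
        if new == old:
            continue  # nothing visible changed for this row
        if new is None:
            del view[row_id]
        else:
            view[row_id] = new
        changed.add(row_id)
    return changed
```

Only the fragments whose IDs are returned by `apply_delta` would need to be re-sent to the browser; the rest of the page is left untouched.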

Short Bio:
Yannis Papakonstantinou (http://db.ucsd.edu/people/yannis.htm) is a
Professor of Computer Science and Engineering at the University of
California, San Diego. His research is in the intersection of data
management technologies and the web, where he has published over
eighty research articles. He has given multiple tutorials and invited
talks, has served on journal editorial boards and has chaired and
participated in program committees for many international conferences
and workshops.

Yannis enjoys commercializing his research and letting that experience
inform his research in turn. He was the CEO and Chief Scientist of Enosys Software,
which built and commercialized an early XML-based Enterprise
Information Integration platform. Enosys Software was acquired in 2003
by BEA Systems. His lab’s FORWARD platform (for the rapid development
of data-driven Ajax applications) is now in use by many commercial
applications. He is involved in data analytics in the pharmaceutical
industry and is on the technical advisory board of Brightscope Inc.

Yannis holds a Diploma of Electrical Engineering from the National
Technical University of Athens, an MS and a Ph.D. in Computer Science from
Stanford University (1997), and an NSF CAREER award for his work on
data integration.

Events, News

WebDam-MoDaS Workshop in Eilat

July 30th, 2012


This meeting will be held jointly by Webdam (in its last year) and
MoDaS (in its first). The meeting will bring together members of the
two projects with the world’s leading specialists in these topics.

http://www.cs.tau.ac.il/workshop/modas/

Meeting Topic:

We are being overwhelmed by the masses of information that are
available. Typically pieces of information are noisy: imprecise,
incomplete, inconsistent. This may be the case for global information
on the public Web as well as for private information in social network
systems. We are concerned with combining all the techniques we can
to evaluate the quality of information and work to improve it. This
will typically involve both reasoning in an imprecise environment
(as stressed by Webdam) and relying on crowd participation (as
advocated by MoDaS). The workshop will bring together the two
approaches with an emphasis on the intersection of the two topics
but also considering their disjunction to bring the two groups up to
date with the two topics.
The workshop will serve both as an assessment for Webdam and
a brainstorming for MoDaS.

Program chairs: Tova Milo (Tel Aviv University), Serge Abiteboul
(INRIA, ENS Cachan)

Events, News

Webdam participated in the WWW conference in Lyon

June 1st, 2012


On this occasion, Serge Abiteboul presented the Webdam project [slides].

Events

Julia Stoyanovich and Gerome Miklau are going to give a talk at Télécom ParisTech on December 5th

November 15th, 2011


Webdam is very happy to welcome you to Télécom ParisTech on December 5th for the talks organized by Pierre Senellart.

The talks will take place at Télécom ParisTech, 46, rue Barrault – 75013 Paris, in room C017 in the basement.

Schedule:

14:00 Gerome Miklau
15:00 Julia Stoyanovich

Gerome Miklau talk abstract

Using Inference to Improve the Accuracy of Differentially-Private Output

Differential privacy is a rigorous privacy standard that protects against powerful adversaries, offers precise accuracy guarantees, and has been successfully applied to a range of data analysis tasks. When differential privacy is satisfied, participants in a dataset enjoy the compelling assurance that information released about the dataset is virtually indistinguishable whether or not their personal data is included.

Differential privacy is achieved by introducing randomness into query answers, and a major goal of research in this area is to devise methods that offer the best accuracy for a fixed level of privacy. The original algorithm for achieving differential privacy, commonly called the Laplace mechanism, returns the true answer after the addition of random noise drawn from a Laplace distribution. If an analyst requires only the answer to a single query about the database, then a version of the Laplace mechanism is known to offer optimal accuracy. But the Laplace mechanism can be severely suboptimal when a set of correlated queries is submitted, and despite much recent work, optimal strategies for answering a collection of correlated queries are not known.
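
For readers unfamiliar with it, the Laplace mechanism for a single numeric query can be sketched in a few lines of Python. This is the textbook formulation rather than code from the talk: noise is drawn from a Laplace distribution with scale equal to the query's sensitivity divided by the privacy parameter epsilon. The function and parameter names are chosen for this example, and the standard-library random generator used here is suitable for illustration only, not for a real privacy-sensitive deployment.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution.

    The difference of two independent exponential variables with mean
    `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(
    true_answer: float, sensitivity: float, epsilon: float, rng: random.Random
) -> float:
    """Return the true answer plus Laplace noise of scale sensitivity / epsilon."""
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return true_answer + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random()
    # A counting query changes by at most 1 when one person is added or
    # removed, so its sensitivity is 1.
    print(laplace_mechanism(true_answer=100, sensitivity=1.0, epsilon=0.5, rng=rng))
```

A smaller epsilon gives stronger privacy but a larger noise scale, which is the accuracy trade-off the abstract refers to.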

After reviewing the basic principles of differential privacy, I will describe two examples of how query constraints and statistical inference can be used to construct more accurate differentially-private algorithms, with no privacy penalty. The first example comes from our recent work investigating the properties of a social network that can be studied without threatening the privacy of individuals and their connections. I will show that the degree distribution of a network can be estimated privately and accurately by asking a special query for which constraints are known to hold, and then exploiting the constraints to infer a more accurate final result. The second example comes from the analysis of more typical tabular data (such as census or medical data). When answering a set of predicate counting queries, I will show that correlations amongst the queries can be exploited to significantly reduce error introduced by the privacy mechanism.
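
The abstract does not spell out which constraints are used. As one plausible illustration, assume the query asks for the degree sequence in sorted order, so the true answers are known to be non-decreasing. Noisy answers can then be post-processed into the closest non-decreasing sequence (in the least-squares sense) with the pool-adjacent-violators algorithm. The Python sketch below shows that post-processing step only; the function name and the choice of constraint are assumptions for this example, and because the step touches only already-released noisy values, it consumes no additional privacy budget.

```python
from typing import List, Sequence


def isotonic_fit(values: Sequence[float]) -> List[float]:
    """Least-squares non-decreasing fit via pool-adjacent-violators."""
    # Each block holds [sum of values, number of values]; its fitted value
    # is the block mean.
    blocks: List[List[float]] = []
    for value in values:
        blocks.append([float(value), 1])
        # Merge backwards while the previous block's mean exceeds the last one's.
        while (
            len(blocks) > 1
            and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted: List[float] = []
    for total, count in blocks:
        fitted.extend([total / count] * int(count))
    return fitted


if __name__ == "__main__":
    noisy_sorted_degrees = [1.2, 0.4, 2.9, 2.1, 5.3]
    print(isotonic_fit(noisy_sorted_degrees))
```

Averaging out-of-order neighbours cancels part of the independent noise, which is why enforcing a known constraint can improve accuracy.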

Julia Stoyanovich talk abstract

Ranked Exploration of Large Structured Datasets

In online applications such as Yahoo! Personals and Trulia.com, users define structured profiles in order to find potentially interesting matches. Typically, profiles are evaluated against large datasets and produce thousands of ranked matches. Highly ranked results tend to be homogeneous, which hinders data exploration. For example, a dating website user who is looking for a partner between 20 and 40 years old, and who sorts the matches by income from higher to lower, will see a large number of matches in their late 30s who hold an MBA degree and work in the financial industry, before seeing any matches in different age groups and walks of life. An alternative to presenting results in a ranked list is to find clusters, identified by a combination of attributes that correlate with rank, and that allow for richer exploration of the result set.

In the first part of this talk I will propose a novel data exploration paradigm, termed rank-aware interval-based clustering. I will formally define the problem and, to solve it, will propose a novel measure of locality, together with a family of clustering quality measures appropriate for this application scenario. These ingredients may be used by a variety of clustering algorithms, and I will present BARAC, a particular subspace-clustering algorithm that enables rank-aware interval-based clustering in domains with heterogeneous attributes. I will present results of a large-scale user study that validates the effectiveness of this approach. I will also demonstrate scalability with an extensive performance evaluation on datasets from Yahoo! Personals, a leading online dating site, and on restaurant data from Yahoo! Local.

In the second part of this talk I will describe on-going work on data exploration for datasets in which multiple alternative rankings are defined over the items, and where each ranking orders only a subset of the items. Such datasets arise naturally in a variety of application domains, including social (e.g., restaurant and movie rating sites) and biological (e.g., analysis of genetic data). In these datasets there is often a need to aggregate multiple rankings, computing, e.g., a single ranked list of differentially expressed genes across a variety of experimental conditions, or of restaurants that are well-liked by one’s friends. I will argue that blindly aggregating multiple rankings into a single list may lead to an uninformative result, because it may not fully leverage opinions of different, possibly disagreeing, groups of judges. I will describe a framework that robustly identifies ranked agreement, i.e., it finds groups of judges whose rankings can be meaningfully aggregated. Finally, I will show how structured attributes of items and of judges can be used to guide the process of identifying ranked agreement, and to describe the resulting consensus rankings to a user.

Bio:
Julia Stoyanovich is a Visiting Scholar at the University of Pennsylvania. Julia holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics and Statistics from the University of Massachusetts at Amherst. After receiving her B.S., Julia went on to work for two start-ups and one real company in New York City, where she interacted with, and was puzzled by, a variety of massive datasets. Julia’s research focuses on modeling and exploring large datasets in the presence of rich semantic and statistical structure. She has recently worked on personalized search and ranking in social content sites, rank-aware clustering in large structured datasets that focus on dating and restaurant reviews, data exploration in repositories of biological objects as diverse as scientific publications, functional genomics experiments and scientific workflows, and representation and inference in large datasets with missing values.

Events Algorithms, Anonymization, Meetings, Members, Visitors

Webdam meeting

March 14th, 2011


Webdam gathered its members at Telecom ParisTech in March 2011. The program and a public version of the slides presented are available.

Events Meetings, Members, Workshop

Webdam at Dagstuhl (2)

June 21st, 2010


Webdam and the Fox European project will organize a Dagstuhl workshop on Foundations of Distributed Data Management, 17-21 October 2011.
If you want to attend, please block the dates and stay tuned.

Events, News Dissemination, Workshop

07/21/2009 : Webdam Workshop on Modal Logic

July 17th, 2009


When : Tuesday July 21st
Where : Ecole Normale Superieure, Cachan, LSV Library
Agenda :

Petrucio Viana (from Rio de Janeiro): algebras of binary relations and graph calculi
Gaelle Fontaine (from Amsterdam): a characterization of the continuous fragment of the mu-calculus and/or an easy completeness proof for the mu-calculus on finite trees.
Balder Ten Cate : on modal definability and universal Horn conditions.

Events Dissemination, Workshop

08/28/2009 : Following-VLDB workshop

February 13th, 2009


When : Friday August 28th (just after VLDB)

Who : Webdam members and Webdam advisory board

Where : Telecom ParisTech, Paris [46, rue Barrault - 75013 Paris]

Agenda :

Brainstorming session with all the members
Private first meeting of the advisory board

Summary :

What do we expect from foundations of Web data management?
What are the goals?
What should be the content?
What can be achieved in Webdam?

Full program : see Workshop program
Notes on the meeting

Events Dissemination, Workshop
