Ongoing research projects in DBWeb

Data managed by information systems are increasingly complex, distributed, heterogeneous, dynamic, and varied in form. The DBWeb project focuses on the fundamental issues raised in modern data and knowledge management systems, especially on the Web or in collaborative contexts oriented towards peer-to-peer networks. Our research interests cover the theoretical foundations of database management systems, practical solutions and applications, and cognitive aspects. Here are the main research projects we are involved in.

DEUS: Extraction and Querying of Complex Objects from the Structured Web

Abstract

We are witnessing today a tremendous growth of the so-called structured Web, in which documents are no longer quasi-textual but data-centric, presenting structured content and complex objects: a Web in which information is no longer served “raw”. While current search platforms have mostly benefited from top research in information retrieval and distributed systems, the shift towards schematized data calls for more precise, richer querying of the Web and raises new challenges to which the data management community can provide answers. We believe that a key challenge for future Web interrogation is to leverage the structured part of the Web for better understanding of, extraction of, and access to data, which would enable richer search interactions. Entity search applications focus on simple, atomic objects. We expand beyond them, building on the idea that such simple, easily recognizable entities can be further organized into more complex relations or objects, often with spatio-temporal components, which we call structured object descriptions (SODs). In this project we intend to study the theoretical and practical challenges raised in querying for complex objects on the structured Web.

Participants in DBWeb

DIADEM: Domain-centric Intelligent Automated Data Extraction Methodology

Abstract

The aim of this project, headed by Georg Gottlob from the University of Oxford, is to provide the logical, algorithmic, and methodological foundations for the knowledge-based extraction of structured data from websites belonging to specific domains within various industries and areas of general interest. One core part of this will be a comprehensive multi-dimensional logical data model that will be used to represent at the same time the content of a large website, its structure, inferred user-interaction patterns, and all meta-information and knowledge (factual and rule-based) that is necessary to automatically perform the desired extraction tasks. We further aim at designing new methods and algorithms for integrating existing machine-learning-based approaches and methods for document analysis and annotation into our knowledge-based framework. The vision is that such new foundations will enable us to build powerful systems that autonomously explore websites of a given domain, understand their structure, and extract and output richly structured data in formats such as XML and RDF. The breakthrough in automatic data extraction we strive for would be most beneficial to two interrelated technologies that are the hottest upcoming topics in web search: vertical search, that is, web search in specialized domains, and object search, that is, the search for web data objects rather than web pages.

Participants in DBWeb

Other institutions involved

Duration: 2010–2014

The DIADEM project is funded by the European Research Council.

DataRing: P2P Data Sharing for Online Communities

Abstract

The DataRing project addresses the problem of P2P data sharing for online communities by offering a high-level network ring across distributed data source owners. Users may be numerous and interested in different kinds of collaboration and in sharing their knowledge, ideas, experiences, etc. Data sources can be numerous, fairly autonomous (i.e., locally owned and controlled), and highly heterogeneous, with different semantics and structures. What we need, then, are new decentralized data management techniques that scale up while addressing the autonomy, dynamic behavior, and heterogeneity of both users and data sources.

Website of the DataRing project

Participants in DBWeb

Other institutions involved

Duration: 2009–2011

The DataRing project (2009–2011) is sponsored by the French national research agency ANR (Agence Nationale de la Recherche), within the programme Future Networks and Services (VERSO).

ISICIL: Trust Management in Open Communities and Online Social Networks

Abstract

Our contribution to the ISICIL project concerns trust management in open communities and online social networks. It is often necessary in data management applications to control the ways in which data is accessed, modified, and transformed. When data is under centralized control, arbitrarily complex restriction scenarios can be actively enforced inside the boundaries of the owner. All this becomes much harder when data cannot be actively controlled and monitored, for instance when it is shared in a distributed and open context such as large social networks for information and knowledge sharing. The management of trust and privacy is becoming crucial in many applications, such as the collaborative publishing of information (Wikipedia, open software communities, eBay) or social network applications. Many novel issues are raised in such contexts, and one of our objectives is to study appropriate models and tools for trust and privacy management. More precisely, the ISICIL project must innovate on the following points: (1) Better understanding of the use of trust models and their limits in open communities. (2) Developing suitable models for privacy based on trust measures in open communities for information sharing and publishing. In particular, these models should allow data owners to preserve their anonymity and to control how private information is disseminated, accessed, or modified.

Website of the ISICIL project

Participants in DBWeb

Duration: 2009–2010

The ISICIL project (2009–2011) is sponsored by the French national research agency ANR (Agence Nationale de la Recherche), within the programme Content and Interactions (CONTINT).

LpOD: Access Control and Update Restrictions in Distributed XML-based Documents

Abstract

Our contribution to the LpOD project concerns access control and update restrictions in shared ODF documents. We intend to study formal models for specifying access and update restrictions on a document in exchange/workflow scenarios, as well as techniques for enforcing security policies in this context. This is a complex problem, since decisions regarding access may depend on various factors that could exceed the scope and knowledge of individual entities in the workflow process. For instance, policies may depend on private aspects concerning the user (who is requesting access to data) or the data publisher. In general, settings in which access policies are distributed are more and more common, and we believe they are highly relevant for workflow scenarios. Finding techniques and formal models for enforcing security on exchanged or published data is in itself a challenging problem. Moreover, reasoning about data integrity and authenticity is also challenging, and we intend to extend our work on integrity inference for XML-based data to the setting of OpenDocument (ODF) information management.

Website of the LpOD project

Participants in DBWeb

Duration: 2009–2010

The LpOD project (2009–2010) is sponsored by the French national research agency ANR (Agence Nationale de la Recherche), within the programme Content and Interactions (CONTINT).

MILC: Artificial intelligence and cognitive aspects

Abstract

This research on language and cognition (MILC sub-project) focuses on the quest for fundamental principles underlying the language faculty and the will to communicate. The main areas of interest are currently relevance and honest communication.

Website of the MILC project

Participants in DBWeb

PANIC: Pro-Activity of Audience and Digitization of Cultural Industries

Abstract

This research project, which involves economists and social scientists, deals with the evolution of the audience of cultural media (books, press, games, music, etc.) with the advent of digitization and the Internet. In this project, we are involved in data cleaning and data enrichment tasks.

Website of the PANIC project

Participants in DBWeb

Other institutions involved

Duration: 2010–2012

The PANIC project (2009–2011) is sponsored by the French national research agency ANR (Agence Nationale de la Recherche), within the programme Content and Interactions (CONTINT).

REWRITE: Answering relational/XML queries using views

Abstract

We study in this project the problem of querying data sources that accept only a limited set of queries, such as sources accessible by Web services which can implement very large (potentially infinite) families of queries. We first revisit a classical setting in which the application queries are conjunctive queries and the source accepts families of (possibly parameterized) conjunctive queries specified as the expansions of a (potentially recursive) Datalog program with parameters, under the assumption that sources satisfy integrity constraints. We then consider XML queries and views. The standard approach to optimizing XPath queries by rewriting them using views consists in navigating inside a view’s output, which allows only one view to be used in the rewritten query. Algorithms for richer classes of XPath rewritings, using intersection or joins on node identifiers, have been proposed, but they either lack completeness guarantees or require additional information about the data. We study restrictions under which an XPath query can be rewritten in polynomial time using an intersection of views, and effective algorithms that can work for any document or type of identifier. Moreover, we are interested in the complexity of the related problem of deciding whether an XPath query with intersection can be equivalently rewritten as one without intersection or union. Starting from our novel techniques for XML query answering using multiple views, we then study expressibility and support when a (potentially infinite) set of views is specified using the QSS (Query Set Specification) formalism.
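As a purely illustrative sketch (not the project's algorithms), the idea of answering an XPath query from an intersection of views on node identifiers can be shown on a toy document. The document, the two views, and the helper `intersect_views` are all hypothetical, and Python object identity stands in for persistent node identifiers.

```python
import xml.etree.ElementTree as ET

# Toy document (hypothetical data, for illustration only).
DOC = """
<library>
  <book id="b1"><author>Ann</author><year>2009</year></book>
  <book id="b2"><author>Ann</author><year>2011</year></book>
  <book id="b3"><author>Bob</author><year>2009</year></book>
</library>
"""
root = ET.fromstring(DOC)

# Two hypothetical views, each materializing the nodes selected by one XPath.
view_by_author = root.findall(".//book[author='Ann']")
view_by_year = root.findall(".//book[year='2009']")


def intersect_views(*views):
    """Keep the nodes present in every view, compared by node identity.

    Assumption: id(node) plays the role of a node identifier; a real
    system would rely on persistent identifiers exposed by the views.
    """
    common = {id(node) for node in views[0]}
    for view in views[1:]:
        common &= {id(node) for node in view}
    # Preserve the order of the first view.
    return [node for node in views[0] if id(node) in common]


# The query asks for books by Ann published in 2009. Neither view alone
# answers it, but the intersection of the two view outputs does.
rewritten = intersect_views(view_by_author, view_by_year)

# Direct evaluation on the document, for comparison.
direct = root.findall(".//book[author='Ann'][year='2009']")

print([book.get("id") for book in rewritten])
print([book.get("id") for book in direct])
```

In this toy case the rewriting is equivalent by construction; deciding when such a rewriting exists, and finding it efficiently, is the hard part studied in the project.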

Participants in DBWeb

Other institutions involved

Duration: 2008–2011

WebPlan: Query-Driven Data Acquisition from Web-Based Data Sources

Abstract

The functioning of entities as diverse as enterprises and government agencies depends on obtaining high-quality data. Increasingly these entities depend on external sources for their operational data: critical data is obtained dynamically via web services, is extracted from web pages, or is purchased from third parties. These sources can differ radically in their completeness, accuracy, and availability. It is not possible for applications to index and explore data from each source in advance of querying: there are too many sources, they are too costly to access, and the data in them may be refreshed constantly.

How should data acquisition proceed in such situations?

In this project we will develop algorithms for answering queries in the presence of large numbers of web-based data sources, sources that may overlap substantially in their datasets but have different access restrictions and costs. Our approach will make use of schema information about the data an application is querying: data format, integrity constraints, and any prior knowledge of costs that may be available. The core of the project will be algorithms for answering a query by interactively exploring the sources, dynamically pruning out irrelevant or exhausted sources in the process.
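The exploration loop described above can be sketched as follows. This is a minimal, hypothetical illustration and not the project's algorithm: the `Source` class, its fields, the greedy cheapest-first strategy, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    """Hypothetical web data source with a known per-access cost."""
    name: str
    cost: float                  # assumed prior knowledge of access cost
    attributes: frozenset        # schema attributes the source can return
    batches: list = field(default_factory=list)  # results not yet fetched

    def exhausted(self):
        return not self.batches

    def access(self):
        """Simulate one access: return the next batch of results."""
        return self.batches.pop(0)


def answer(query_attrs, wanted, sources):
    """Greedy sketch: always access the cheapest remaining source.

    Sources that cannot provide the queried attributes are pruned up
    front as irrelevant; sources that run out of results are pruned
    during exploration.
    """
    active = [s for s in sources if query_attrs <= s.attributes]
    results, spent = set(), 0.0
    while active and len(results) < wanted:
        src = min(active, key=lambda s: s.cost)
        if src.exhausted():
            active.remove(src)   # prune exhausted source
            continue
        results |= set(src.access())
        spent += src.cost
    return results, spent


sources = [
    Source("A", 1.0, frozenset({"title", "price"}), [["x"], ["y"]]),
    Source("B", 0.5, frozenset({"title"}), [["w"]]),   # lacks "price"
    Source("C", 2.0, frozenset({"title", "price"}), [["y", "z"]]),
]
results, spent = answer(frozenset({"title", "price"}), 3, sources)
print(sorted(results), spent)
```

Here the overlap between sources A and C (both return "y") means some paid accesses yield no new answers, which is the kind of cost a smarter plan would try to avoid.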

Website of the WebPlan project

Participants in DBWeb

Other institutions involved

Duration: 2010–2013

WebPlan is sponsored by the UK Engineering and Physical Sciences Research Council (EPSRC).

Webdam: Foundations of Web data management

Abstract

The goal of the Webdam project, headed by Serge Abiteboul from INRIA Saclay, is to develop a formal model for Web data management. This model will open new horizons for the development of the Web in a well-principled way, enhancing its functionality, performance, and reliability. Specifically, the goal is to develop a universally accepted formal framework for describing complex and flexible interacting Web applications featuring, notably, data exchange, sharing, integration, querying, and updating. We also propose to develop formal foundations that will enable peers to concurrently reason about global data management activities, cooperate in solving specific tasks, and support services with the desired quality of service. Although the proposal addresses fundamental issues, its goal is to serve as the basis for future software development for Web data management.

Website of the Webdam project

Participants in DBWeb

Other institutions involved

Duration: 2009–2013

The Webdam project is funded by the European Research Council under the European Community’s Seventh Framework Programme (FP7/2007-2013) / ERC grant Webdam, agreement n° 226513.

For any question regarding this website, please contact dbweb@telecom-paristech.fr.