Ongoing research projects in DBWeb
Data managed by information systems are increasingly complex,
distributed, heterogeneous, dynamic, and of various forms. The DBWeb project focuses on
the fundamental issues raised in modern data and knowledge
management systems, especially on the Web or in collaborative
contexts oriented towards peer-to-peer networks. Our research interests
cover the theoretical foundations of database management systems,
practical solutions and applications, as well as cognitive aspects.
Here are the main research projects we are involved in.
DEUS: Extraction and Querying of Complex Objects from the Structured Web
Abstract
We are witnessing today a tremendous growth of the so-called structured Web,
in which documents are no longer quasi-textual, but are data-centric, presenting structured content, complex objects,
a Web in which information is no longer served “raw”.
While current search platforms have mostly benefited from top research in information retrieval
and distributed systems, the shift towards schematized data calls for more precise, richer querying of the
Web and raises new challenges to which the data management community can provide answers.
We believe that a key challenge for future Web interrogation is to leverage the structured part of the Web, for better understanding, extraction and access to data that would enable richer search interactions.
Entity search applications focus on simple, atomic objects. We expand beyond them, building on the idea
that such simple, easily recognizable entities can be further organized into more complex relations or objects,
often with spatio-temporal components, which we call structured object descriptions (SODs). In this project, we intend to study the theoretical and practical challenges raised by querying for complex objects on the structured Web.
Participants in DBWeb
DIADEM: Domain-centric Intelligent Automated Data Extraction Methodology
Abstract
The aim of this project, headed by Georg Gottlob from
University of Oxford, is to provide the
logical, algorithmic, and methodological foundations for the
knowledge-based extraction of structured data from web sites
belonging to specific domains within various industries and areas
of general interest. One core part of this will be a
comprehensive multi-dimensional logical data model that will be
used to represent at the same time the content of a large
Website, its structure, inferred user-interaction patterns and
all meta-information and knowledge (factual and rule-based) that
is necessary to automatically perform the desired extraction
tasks. We further aim at designing new methods and algorithms for
integrating existing machine-learning based approaches and
methods for document analysis and annotation into our knowledge-based
framework. The vision is that such new foundations will enable
us to build powerful systems that autonomously explore websites
of a given domain, understand their structure, and extract and
output richly structured data in formats such as XML and RDF. The
breakthrough in automatic data extraction we strive for would be
most beneficial to two interrelated technologies that are the
next hot topics in web search: vertical search, that is, web
search in specialized domains, and object search, that is, the
search for web data objects rather than web pages.
Participants in DBWeb
Other institutions involved
Duration: 2010–2014
The DIADEM project is funded by the European Research
Council.
DataRing: P2P Data Sharing for Online Communities
Abstract
The DataRing project addresses the problem of P2P data sharing for
online communities, by offering a high-level network ring across
distributed data source owners. Users may be numerous and
interested in different kinds of collaboration and in sharing their
knowledge, ideas, experiences, etc. Data sources can be
numerous, fairly autonomous, i.e. locally owned and controlled, and
highly heterogeneous with different semantics and structures. What
is needed, then, are new decentralized data management techniques that
scale up while addressing the autonomy, dynamic behavior and
heterogeneity of both users and data sources.
Website of the DataRing project
Participants in DBWeb
Other institutions involved
Duration: 2009–2011
The DataRing project (2009–2011) is sponsored by the French
national research agency ANR (Agence Nationale de la Recherche),
within the programme Future Networks and Services
(VERSO).
ISICIL: Trust Management in Open Communities and Online Social Networks
Abstract
Our contribution in the ISICIL project concerns trust management in open communities and online social networks. It is often necessary in data management
applications to control the ways in which data is accessed, modified and transformed. When data is under
centralized control, arbitrarily complex restriction scenarios can be actively enforced inside the boundaries of
the owner. All this becomes much harder when data cannot be actively controlled and monitored, for instance
when it is shared in a distributed and open context such as large social networks for information and knowledge
sharing. The management of trust and privacy is becoming crucial in many applications, like the collaborative
publishing of information (Wikipedia, open software communities, eBay) or social network applications.
Many novel issues are raised in such contexts and one of our objectives is to study appropriate models and tools
for trust and privacy management. More precisely, the ISICIL project must innovate on the following points: (1) better
understanding the use of trust models and their limits in open communities; (2) developing suitable models for privacy based on trust measures in open communities for
information sharing and publishing. In particular, these models should allow data owners to preserve their
anonymity and to control how private information is disseminated, accessed or modified.
Website of the ISICIL project
Participants in DBWeb
Duration: 2009–2010
The ISICIL project (2009–2011) is sponsored by the French
national research agency ANR (Agence Nationale de la Recherche),
within the programme Content and Interactions (CONTINT).
LpOD: Access Control and Update Restrictions in Distributed XML-based Documents
Abstract
Our contribution in the LpOD project concerns access control and update restrictions in shared ODF documents. We intend to study formal models for specifying access and update
restrictions on a document in exchange/workflow scenarios, as well as
techniques for enforcing security policies in this context. This is a
complex problem since decisions regarding access may depend on various
factors that could exceed the scope and knowledge of individual entities
in the workflow process. For instance, policies may depend on private
aspects concerning the user (who is demanding access to data) or the data
publisher. In general, settings in which access policies are distributed
are more and more common and we believe they are highly relevant for
workflow scenarios. Finding techniques and formal models for enforcing
security on exchanged or published data is in itself a challenging
problem. Moreover, reasoning about data integrity and authenticity is also
challenging, and we intend to extend our work on integrity inference for
XML-based data to the setting of OpenDocument (ODF) information management.
Website of the LpOD project
Participants in DBWeb
Duration: 2009–2010
The LpOD project (2009–2010) is sponsored by the French
national research agency ANR (Agence Nationale de la Recherche),
within the programme Content and Interactions
(CONTINT).
MILC: Artificial intelligence and cognitive aspects
Abstract
This research on language and cognition (MILC sub-project) focuses
on the quest for fundamental principles
underlying the language faculty and the will to communicate.
The main areas of interest are currently relevance and honest communication.
Website of the MILC project
Participants in DBWeb
PANIC: Pro-Activity of Audience and Digitization of Cultural Industries
Abstract
This research project, which involves economists and social scientists,
deals with the evolution of the audience of cultural media (books, press, games, music, etc.) with
the advent of digitization and the Internet. In this project, we are involved in data cleaning and data
enrichment tasks.
Website of the PANIC project
Participants in DBWeb
Other institutions involved
Duration: 2010–2012
The PANIC project (2009–2011) is sponsored by the French
national research agency ANR (Agence Nationale de la Recherche),
within the programme Content and Interactions
(CONTINT).
REWRITE: Answering relational/XML queries using views
Abstract
We study in this project the problem of querying data sources that accept only a
limited set of queries, such as sources accessible by Web services
which can implement very large (potentially infinite) families of
queries. We first revisit a classical setting in which the
application queries are conjunctive queries and the source accepts
families of (possibly parameterized) conjunctive
queries specified as the expansions of a
(potentially recursive) Datalog program with parameters, under the assumption that sources
satisfy integrity constraints.
We then consider XML queries and views. The standard approach to optimizing XPath queries by rewriting them
using views consists in navigating inside a view’s output, which allows
only one view to be used in the rewritten query. Algorithms for richer classes of XPath rewritings,
using intersection
or joins on node identifiers, have been proposed, but they either lack
completeness guarantees, or require additional information about the
data. We study restrictions under which an XPath query can
be rewritten in polynomial time using an intersection of views, as well as effective algorithms that work for
any document and any type of identifier. Moreover, we are interested in the complexity
of the related problem of deciding if an XPath query with intersection can
be equivalently rewritten as one without intersection or union.
Starting from our novel techniques for XML query answering using multiple views, we then study
expressibility and support when a (potentially infinite) set
of views is specified using the QSS (Query Set Specification) formalism.
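As a purely illustrative sketch (not the project's algorithms), the idea of answering an XPath query from an intersection of views can be shown on a toy document with Python's standard `xml.etree.ElementTree`. The document, the two views and the helper `intersect_by_identity` are assumptions made up for this example, and Python object identity stands in for node identifiers.

```python
import xml.etree.ElementTree as ET

# Toy document, invented for this illustration.
DOC = """
<lib>
  <book id="b1"><author>A</author><year>2001</year></book>
  <book id="b2"><author>B</author></book>
  <book id="b3"><year>1999</year></book>
  <shelf><book id="b4"><author>C</author><year>2010</year></book></shelf>
</lib>
"""


def intersect_by_identity(*node_lists):
    """Keep the nodes present in every list, compared by node identity.

    Python object identity plays the role of node identifiers here;
    the order of the first list is preserved.
    """
    common = {id(node) for node in node_lists[0]}
    for nodes in node_lists[1:]:
        common &= {id(node) for node in nodes}
    return [node for node in node_lists[0] if id(node) in common]


root = ET.fromstring(DOC)

# Two hypothetical materialized views over the same document.
view_with_author = root.findall(".//book[author]")
view_with_year = root.findall(".//book[year]")

# The query .//book[author][year] answered from the views alone ...
rewritten = intersect_by_identity(view_with_author, view_with_year)
# ... and, for comparison, evaluated directly on the document.
direct = root.findall(".//book[author][year]")

rewritten_ids = [node.get("id") for node in rewritten]
direct_ids = [node.get("id") for node in direct]
print(rewritten_ids, direct_ids)
```

On this document both evaluations return the same books. Neither view alone suffices, since navigating inside a single view's output cannot filter on the condition that only the other view checks.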
Participants in DBWeb
Other institutions involved
Duration: 2008–2011
WebPlan: Query-Driven Data Acquisition from Web-Based Data Sources
Abstract
The functioning of entities as diverse as enterprises and
government agencies depends on obtaining high-quality data.
Increasingly these entities depend on external sources for their
operational data: critical data is obtained dynamically via web
services, is extracted from web pages, or is purchased from third
parties. These sources can differ radically in their completeness,
accuracy, and availability. It is not possible for applications to
index and explore data from each source in advance of querying:
there are too many sources, they are too costly to access, and the
data in them may be refreshed constantly.
How should data acquisition proceed in such situations?
In this project we will develop algorithms for answering queries in
the presence of large numbers of web-based data sources, sources
that may overlap substantially in their datasets but have different
access restrictions and costs. Our approach will make use of schema
information about the data an application is querying: data format,
integrity constraints, and any prior knowledge of costs that may be
available. The core of the project will be algorithms for answering
a query by interactively exploring the sources, dynamically pruning
out irrelevant or exhausted sources in the process.
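The exploration loop described above can be pictured with a small greedy sketch. It is only an illustration under simplifying assumptions, not the project's algorithms: the `Source` class, the `acquire` function, the per-call cost model and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """Hypothetical web source: each call costs money and returns one batch."""
    name: str
    cost_per_call: float
    relations: frozenset   # schema information: relations the source covers
    batches: list          # tuples returned by successive calls
    calls_made: int = 0

    def exhausted(self):
        return self.calls_made >= len(self.batches)

    def fetch(self):
        batch = self.batches[self.calls_made]
        self.calls_made += 1
        return batch


def acquire(relation, predicate, sources, wanted, budget):
    """Greedily call the cheapest useful source until enough answers are found."""
    # Use schema information to prune sources irrelevant to the query.
    active = [s for s in sources if relation in s.relations and not s.exhausted()]
    answers, spent, calls = set(), 0.0, []
    while active and len(answers) < wanted:
        source = min(active, key=lambda s: s.cost_per_call)
        if spent + source.cost_per_call > budget:
            break  # even the cheapest remaining source is too expensive
        spent += source.cost_per_call
        calls.append(source.name)
        answers.update(t for t in source.fetch() if predicate(t))
        # Dynamically prune sources that have nothing left to return.
        active = [s for s in active if not s.exhausted()]
    return answers, spent, calls


sources = [
    Source("cheap_api", 1.0, frozenset({"flight"}),
           [[("PAR", "LON", 80)], [("PAR", "ROM", 120)]]),
    Source("premium_feed", 5.0, frozenset({"flight"}),
           [[("PAR", "LON", 80), ("PAR", "NYC", 400)]]),
    Source("hotel_site", 1.0, frozenset({"hotel"}), [[("LON", 90)]]),
]

# Query: up to three flights from PAR costing less than 150, with a budget of 10.
answers, spent, calls = acquire(
    "flight", lambda t: t[0] == "PAR" and t[2] < 150, sources, wanted=3, budget=10.0
)
print(sorted(answers), spent, calls)
```

In this run the hotel source is never contacted, the cheap source is called until it is exhausted, and the overlapping tuple returned by the premium source is counted only once. A real planner would also have to handle integrity constraints and richer access restrictions, which this sketch ignores.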
Website of the WebPlan project
Participants in DBWeb
Other institutions involved
Duration: 2010–2013
WebPlan is sponsored by the UK Engineering and Physical
Sciences Research Council (EPSRC).
Webdam: Foundations of Web data management
Abstract
The goal of the Webdam project, headed by Serge Abiteboul
from INRIA Saclay, is to develop a formal model
for Web data management. This model will open new horizons for the
development of the Web in a well-principled way, enhancing its
functionality, performance, and reliability. Specifically, the goal
is to develop a universally accepted formal framework for
describing complex and flexible interacting Web applications
featuring notably data exchange, sharing, integration, querying and
updating. We also propose to develop formal foundations that will
enable peers to concurrently reason about global data management
activities, cooperate in solving specific tasks and support
services with desired quality of service. Although the proposal
addresses fundamental issues, its goal is to serve as the basis for
future software development for Web data management.
Website of the Webdam project
Participants in DBWeb
Other institutions involved
Duration: 2009–2013
The Webdam project is funded by the European Research
Council under the European Community’s Seventh Framework Programme
(FP7/2007-2013) / ERC grant Webdam, agreement n° 226513.