School: Ingénierie, STI
Master: Informatics
Track: Data & Knowledge

Program

First semester, first period: 3 mandatory courses + 1 soft-skills course (3 x 5 ECTS = 15 ECTS, plus 2.5 ECTS)

First semester, second period: 6 out of the following choices (6 x 2.5 ECTS = 15 ECTS)

Plus a mandatory soft-skills course (2.5 ECTS)

Second semester: one of the following (each 25 ECTS)

Mandatory Courses

The following courses are mandatory and take place in the first period of the first semester. Each course is 42h and 5 ECTS.

Complex Data and Knowledge

Responsibles: Nicole Bidoit (Univ. Paris-Sud), Brigitte Safar (Univ. Paris-Sud), Silviu Maniu (Univ. Paris-Sud)

Content: Recent years have seen a massive increase in the amount of data, in particular on the Web. This course exposes students to current technology and research issues in Web data and knowledge management. This includes concepts, methods, techniques, and tools for handling heterogeneous, complex, incomplete, and semantic data. More concretely, the module covers the basics of semi-structured data models such as the XML standard and RSS, schema languages such as DTD and XML Schema, query languages such as XPath, XQuery, and XSLT, and more advanced topics such as static analysis, XML views, and XQuery evaluation. It also covers knowledge representation for the Semantic Web, starting with various representation formalisms and their reasoning mechanisms, and then focusing on Semantic Web standards such as RDF, SPARQL, and OWL. The course also provides an understanding of methods for combined XML-RDF data management, investigating the bridge between the two types of techniques.
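
To give a flavour of the XML part of the course, here is a minimal sketch of an XPath-style query, using only Python's standard library (xml.etree.ElementTree supports a subset of XPath); the sample catalogue and element names are invented for illustration.

    # Minimal sketch: run an XPath-style query over a small XML document.
    # The document and element names are illustrative only.
    import xml.etree.ElementTree as ET

    xml_doc = """
    <catalog>
      <book lang="en"><title>Foundations of Databases</title></book>
      <book lang="fr"><title>Bases de données</title></book>
    </catalog>
    """

    root = ET.fromstring(xml_doc)

    # XPath-style query: titles of all English-language books.
    for title in root.findall(".//book[@lang='en']/title"):
        print(title.text)  # -> Foundations of Databases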

Intended outcome: This course gives students a broad and detailed understanding of XML database technology as well as the Semantic Web. Its goal is also that students be able to identify the latest related research topics.

Prerequisites:

Architectures for Massive Data Management

Responsibles: Ioana Manolescu (INRIA Saclay), Albert Bifet (Télécom ParisTech)

Content: This module will present concepts, architectures and algorithms for data storage, management, and analytics, at a very large scale, especially in distributed settings. The following topics will be covered:

A strong focus will be given to labs in this class, so that students can gather enough experience with different existing systems, and understand their respective advantages. The architecture of all distributed computing and storage systems will be discussed in detail during lectures.
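
One recurring idea in the distributed storage architectures discussed in the lectures, sketched below under invented node names and keys, is hash-based partitioning (sharding) of records across nodes; real systems add replication and re-balancing on top of this.

    # Minimal sketch: route each record to a storage node by hashing its key,
    # so that data is spread roughly evenly across nodes. Illustrative only.
    import hashlib

    NODES = ["node-1", "node-2", "node-3"]

    def node_for(key):
        """Pick the storage node responsible for a given record key."""
        digest = int(hashlib.sha1(key.encode()).hexdigest(), 16)
        return NODES[digest % len(NODES)]

    for key in ["user:42", "user:43", "order:7", "order:8"]:
        print(key, "->", node_for(key))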

Prerequisites: Databases, Algorithms & Data Structures, Java programming

Data Warehouses

Responsible: Benoit Groz (Univ. Paris-Sud)

Content: This module covers architectures and technologies dedicated to transforming raw data into valuable information for business processes. It first covers various aspects of data warehouses, where raw data is harvested from various sources to obtain a global vision. More specifically, we will discuss data warehouse architectures, data modeling (conceptual, e.g., multi-dimensional modeling, and logical, e.g., star schemas), query languages, and the query optimization techniques used to efficiently execute analytical queries. The course then outlines different types of analytical applications, including OLAP queries as well as simple data profiling and data mining techniques.
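
To make the star-schema idea concrete, here is a minimal sketch using SQLite from Python's standard library: a tiny fact table joined to two dimension tables and aggregated by an OLAP-style roll-up query; the table and column names are invented for illustration.

    # Minimal sketch of a star schema and an analytical (OLAP-style) query.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_date  (date_id INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
        CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, city TEXT);
        CREATE TABLE fact_sales (          -- fact table referencing the dimensions
            date_id  INTEGER REFERENCES dim_date,
            store_id INTEGER REFERENCES dim_store,
            amount   REAL
        );
    """)
    con.executemany("INSERT INTO dim_date VALUES (?,?,?)", [(1, 2023, 1), (2, 2023, 2)])
    con.executemany("INSERT INTO dim_store VALUES (?,?)", [(1, "Paris"), (2, "Lyon")])
    con.executemany("INSERT INTO fact_sales VALUES (?,?,?)",
                    [(1, 1, 100.0), (1, 2, 80.0), (2, 1, 120.0)])

    # Roll-up: total sales per city and month (a typical analytical query).
    query = """
        SELECT s.city, d.month, SUM(f.amount)
        FROM fact_sales f
        JOIN dim_date d  ON f.date_id  = d.date_id
        JOIN dim_store s ON f.store_id = s.store_id
        GROUP BY s.city, d.month
        ORDER BY s.city, d.month
    """
    for row in con.execute(query):
        print(row)  # ('Lyon', 1, 80.0), ('Paris', 1, 100.0), ('Paris', 2, 120.0)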

Prerequisites: Introduction to databases (L2-L3), implementation of databases and query optimization (M1)

First Semester, Second Period

Students have to choose 6 out of the following courses. Each course is 21h and 2.5 ECTS.

Stream Data Mining

Responsible: Albert Bifet

Content: Data streams are everywhere, from Formula 1 racing and electricity networks to social media feeds. Data stream mining, or real-time analytics, relies on and develops new incremental algorithms that process streams under strict resource limitations. This course focuses on, and extends, the methods implemented in open-source tools such as MOA and Apache SAMOA. Students will learn how to select and apply an appropriate method for a given data stream problem, how to design and implement such algorithms, and how to evaluate and compare different solutions.
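
As a taste of the incremental, constant-memory algorithms studied in this course, the sketch below implements reservoir sampling, which maintains a uniform random sample of an unbounded stream in O(k) memory; it is a generic illustration in plain Python, not code taken from MOA or Apache SAMOA.

    # Minimal sketch: reservoir sampling (Algorithm R) over a data stream.
    import random

    def reservoir_sample(stream, k, seed=0):
        """Return k items drawn uniformly at random from an arbitrarily long stream."""
        rng = random.Random(seed)
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = rng.randint(0, i)     # keep the new item with probability k/(i+1)
                if j < k:
                    reservoir[j] = item
        return reservoir

    # Example: sample 5 readings from a simulated stream of one million values.
    print(reservoir_sample(range(1_000_000), k=5))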

Knowledge Base Construction

Responsible: Fabian Suchanek (Télécom ParisTech)

Content: This module will teach students the basics of semantic information extraction. It will cover the concepts, methods, and algorithms used to extract factual information from text in order to construct a coherent knowledge base. This includes some NLP (part-of-speech tagging, dependency parsing, etc.), as well as the techniques and concepts of entity disambiguation, instance extraction, extraction from semi-structured sources (wrapper induction, Wikipedia-based approaches), extraction from unstructured sources (e.g., pattern-based approaches), and extraction by soft reasoning (Markov Logic, MAX-SAT, etc.). We will also cover the design of extraction approaches in general (evaluation, iteration, etc.) and the alignment of knowledge bases in the Linked Open Data framework.
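
To illustrate the pattern-based approaches mentioned above, the sketch below applies a single Hearst-style "X such as Y" pattern to raw text with a regular expression; the sentences and the extracted type facts are invented, and real extractors combine many such patterns with NLP preprocessing.

    # Minimal sketch: pattern-based extraction of type facts from text.
    import re

    text = ("Query languages such as SPARQL, XQuery and XSLT are covered. "
            "Cities such as Paris and Lyon host famous museums.")

    # "<class> such as <instance>(, <instance>)* (and <instance>)?"
    pattern = re.compile(r"(\w+) such as ((?:\w+(?:, )?)+(?: and \w+)?)")

    for concept, instances in pattern.findall(text):
        for instance in re.split(r", | and ", instances):
            print(f"type({instance}, {concept})")  # e.g. type(SPARQL, languages)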

Prerequisites:

Data Science for Big Data

This course has been canceled

Responsible: Michalis Vazirgiannis (Ecole Polytechnique)

Content: This course acquaints students with the algorithms, methods, and techniques used over the life cycle of a data science project, i.e., the iterative and incremental approach to making sense of data (structured, graph, text), organised around two key components: data engineering and data analysis. This includes data pre-processing and cleaning, feature extraction and creation, and supervised and unsupervised learning methods for potentially big data.

Syllabus:

Prerequisites: Data Bases, Algorithms, Probability/Statistics, Programming

Dynamic Content Management

Responsible: Nicoleta Preda (UVSQ)

Content: This module will examine the management of dynamic data, for a variety of distributed Web applications. The course includes an introduction to standard tools for developing Web applications (REST/SOAP Web Services, XML/JSON, XSLT, BPEL), followed by an exploration of the problems that come from the dynamic nature of the data returned by Web services: wrapper construction, on-the-fly entity resolution, query evaluation using services with limited access patterns, workflow selection, verification/provenance of workflows. We will also cover the dynamic integration into RDF knowledge bases (Linked Open Data) of the data exported by digital libraries using Web service APIs.
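
As a small illustration of wrapper construction, the sketch below wraps a hypothetical JSON Web service that can only be queried by author name (a limited access pattern), and then shows the post-processing step on a JSON payload shaped like the assumed answer; the endpoint URL, field names, and data are all invented.

    # Minimal sketch: a wrapper around a (hypothetical) REST/JSON Web service.
    import json
    import urllib.parse
    import urllib.request

    def lookup_author(name, base_url="https://example.org/api/authors"):
        """Query the hypothetical service by author name and return parsed JSON."""
        url = base_url + "?" + urllib.parse.urlencode({"name": name})
        with urllib.request.urlopen(url) as resp:
            return json.load(resp)

    # Offline illustration of the wrapper's post-processing, using a payload
    # shaped like the hypothetical service answer.
    payload = json.loads('{"results": [{"name": "Jane Doe", "books": 12}]}')
    for record in payload["results"]:
        print(record["name"], record["books"])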

Prerequisites: Basics of the Web (HTTP, HTML, Web forms, XML), Basics of distributed and database systems.

Information Integration

Responsibles: Nathalie Pernelle (Univ. Paris-Sud), Fatiha Saïs (Univ. Paris-Sud)

Content: Nowadays, the Web of documents has evolved into a Web of Data connecting distributed, structured data (e.g., RDF, RDFa, microformats) across the Web. To benefit from the richness of the Web of Data, it is important to establish whether two pieces of data refer to the same real-world entity. In this module, we first survey well-known data integration architectures. Then, we present the data linking problem and give a classification of the main existing approaches: supervised/unsupervised, local/global, knowledge-based, and single/multi-ontology. After that, we introduce the data fusion issue encountered when data connected by an identity link has to be integrated, which raises the problem of conflicting values. The main approaches, techniques, and knowledge used to solve all these issues are explored.

Intended outcome: This course gives students an understanding of the difficulties encountered when designing an application that has to decide whether the “Musée des Arts Premier”, located near “Trocadero”, and the “Musée du quai Branly”, located in “Paris’s 7th arrondissement”, refer to the same museum. It also gives an understanding of the criteria for choosing a data linking approach that takes into account the characteristics of the data and of the application. Finally, it introduces students to the data fusion issue, allowing them to develop tools specifically adapted to the data and the application domain.
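
As a minimal illustration of the local, similarity-based family of linking approaches, the sketch below links two invented records when the Jaccard similarity of their labels exceeds a threshold; note that such a purely syntactic method would fail on the harder “Musée des Arts Premier” example above, which requires external knowledge.

    # Minimal sketch: link two records when their labels are similar enough.
    def jaccard(a, b):
        """Jaccard similarity between the token sets of two strings."""
        a, b = set(a.lower().split()), set(b.lower().split())
        return len(a & b) / len(a | b)

    record_1 = {"label": "Musée du quai Branly", "city": "Paris"}
    record_2 = {"label": "Musée du quai Branly Jacques Chirac", "city": "Paris"}

    score = jaccard(record_1["label"], record_2["label"])
    if score > 0.5 and record_1["city"] == record_2["city"]:
        print(f"owl:sameAs link (label similarity = {score:.2f})")  # 0.67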

Cognitive Modelling

Responsible: Jean-Louis Dessalles, Telecom ParisTech

Content: Knowledge is most often created by humans and intended for humans. This course addresses the fundamental issue of knowledge representation and processing from a cognitive perspective. The course will first present cognitive semantics models (‘language of thought’ models, cognitive linguistics, conceptual spaces) and address the corresponding issues (grounding problem, frame problem, consistency, holism, learnability, ...). Various implementations will be proposed so that students get a good grasp of these models. The link with KR engineering techniques will be made. The course will also address the question of knowledge relevance. Most text fragments are motivated (e.g., they deal with an issue), and most queries by end users, e.g. in search engines, are motivated as well. The course will present theories and implemented models of relevance (newsworthiness, interest, argumentative relevance).

Prerequisites: First-order logic, Logic programming.

Distributed Data Mining and Machine Learning

Responsible: Mauro Sozio

Content: The course will present machine learning and data mining algorithms for massive data analysis. It will cover the main theoretical and practical aspects behind machine learning and data mining, with an emphasis on designing efficient parallel/distributed algorithms and their implementation in MapReduce (Hadoop).
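
To fix the programming model, here is a minimal word-count job expressed as a map function and a reduce function, with the shuffle step simulated in plain Python; in Hadoop the shuffle and the distribution across machines are handled by the framework, and the documents below are invented.

    # Minimal sketch: the MapReduce model (map, shuffle, reduce) in plain Python.
    from collections import defaultdict

    def map_fn(document):
        """Map: emit (word, 1) for every word of the document."""
        for word in document.lower().split():
            yield word, 1

    def reduce_fn(word, counts):
        """Reduce: sum the partial counts of one word."""
        return word, sum(counts)

    documents = ["big data systems", "big data mining", "distributed data mining"]

    # Shuffle: group intermediate pairs by key (done by the framework in Hadoop).
    groups = defaultdict(list)
    for doc in documents:
        for word, count in map_fn(doc):
            groups[word].append(count)

    print(dict(reduce_fn(w, c) for w, c in groups.items()))
    # {'big': 2, 'data': 3, 'systems': 1, 'mining': 2, 'distributed': 1}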

Syllabus

Prerequisites: Data Bases, Algorithms, Probability/Statistics, Programming

Very Large Data and Knowledge in Bioinformatics

Responsible: Sarah Cohen-Boulakia (Paris Sud)

Content: The course will cover problems of very large data and knowledge in the domain of Bioinformatics. Topics include, but are not limited to: (1) methods and tools for scientific workflows, (2) storing and querying provenance in scientific workflows, (3) mining workflow databases, (4) the use of Semantic Web and Metadata in Bioinformatics.

Uncertain Data Management

Responsible: Silviu Maniu (U Paris-Sud)

Content: The objective of this class is to present models for the representation of uncertain data, as well as algorithms and tools to process this data while maintaining information about its uncertainty. Topics covered include:

Labs will feature practical uses of a probabilistic relational database engine, MayBMS/Sprout.
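
As a minimal illustration of the tuple-independent model behind such engines, the sketch below computes the probability of a simple existential query over an uncertain relation; the facts and probabilities are invented for illustration.

    # Minimal sketch: each tuple of the uncertain relation lives_in(person, city)
    # carries the probability that it is actually present in the database.
    lives_in = [
        ("alice", "paris", 0.9),
        ("alice", "lyon",  0.3),
        ("bob",   "paris", 0.5),
    ]

    def prob_someone_lives_in(city):
        """P(query 'someone lives in `city`' is true) = 1 minus the probability
        that every matching tuple is absent (tuples are assumed independent)."""
        p_none = 1.0
        for _, c, p in lives_in:
            if c == city:
                p_none *= (1.0 - p)
        return 1.0 - p_none

    print(prob_someone_lives_in("paris"))  # 1 - (1 - 0.9) * (1 - 0.5) = 0.95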

New trends in Data&Knowledge (“Module liberté”)

Responsible: Fabian M. Suchanek (Télécom ParisTech)

Content: The Data&Knowledge track acknowledges that new concepts and techniques will be developed over the coming years in the area of knowledge and data management. To ensure the timely coverage of these concepts, and also to welcome potential future lecturers into our track, we allow students to fill the credits of this module freely with courses offered at UPSa. The condition is that the courses be thematically related to knowledge and data management. The organisers of the Data&Knowledge track will examine each proposed course upon request and decide whether to admit it as a possible choice for the students.

Mandatory Softskill Courses

The following two courses are mandatory in the second period of the first semester. Each course is 2.5 ECTS.

Communication in Research

Responsible: all lecturers of Data&Knowledge

Content: In this module, students will get the opportunity to practice their English speaking skills as well as various soft-skills such as presentation techniques, team work, discussion or debating techniques. After introductory classes to these topics, students will prepare presentations (not necessarily limited to slideshows) on scientific papers, with the goal of explaining the scientific contributions to non-computer scientists in an understandable, accurate, but entertaining way.

Prerequisites: Introduction to database systems (L2-L3), implementation of database systems and optimization (M1)

Introduction to Research / Business

Responsible: N.N.

Content: This course corresponds to the classical French course “Formation à la recherche / à l'entreprise”. It introduces the basic concepts of research and business.