Program
First semester, first period: 3 mandatory courses + 1 softskill course (3 x 5 = 15 ECTS + 2.5 ECTS)
- Complex Data and Knowledge (Nicole Bidoit, Brigitte Safar)
- Architectures for Massive Data Management (Ioana Manolescu, Albert Bifet)
- Data Warehousing (Benoit Groz)
- Softskills seminar (Fabian Suchanek and all others)
First semester, second period: 6 out of the following choices (6 x 2.5 ECTS = 15 ECTS)
- Knowledge Base Construction (Fabian Suchanek), open to DataScale
- Cognitive Modelling (Jean-Louis Dessalles; canceled)
- Information Integration (Nathalie Pernelle and Fatiha Saïs; canceled)
- Uncertain Data Management (Silviu Maniu, Antoine Amarilli)
- Dynamic Content Management (Nicoleta Preda), open to DataScale
- Distributed Data Mining and Machine Learning (Mauro Sozio)
- Data Science for Big Data (Michalis Vazirgiannis; canceled)
- Very Large Data and Knowledge in Bioinformatics (Sarah Cohen-Boulakia)
- Stream Data Mining (Albert Bifet)
- New trends in Data&Knowledge (any course at Paris-Saclay University, by approval)
- Preparation for research / business (Emmanuel Waller)
Second semester: one of the following (each 25 ECTS)
- 6 month master thesis project
- 6 month internship in a company
Mandatory Courses
The following courses are mandatory and take place in the first period of the first semester. Each course is 42h and 5 ECTS.
Complex Data and Knowledge
Responsibles: Nicole Bidoit (Univ. Paris Sud), Brigitte Safar (U Paris Sud), Silviu Maniu (U Paris-Sud)
Content: Recent years have seen a massive increase in the amount of data, in particular on the Web. This course exposes students to current technology and research issues in Web data and knowledge management. This includes concepts, methods, techniques, and tools for the handling of heterogeneous, complex, incomplete, and semantic data. More concretely, this module covers the basics of semistructured data models such as the XML standard and RSS, schemas such as DTD and XML Schema, query languages such as XPath, XQuery, and XSLT, and more advanced topics such as static analysis, XML views, and XQuery evaluation. It also covers knowledge representation for the Semantic Web, starting with various representation formalisms and their reasoning mechanisms, and then focusing on Semantic Web standards such as RDF, SPARQL, and OWL. The course will also provide an understanding of methods for XML-RDF data management, investigating the bridge between the two types of techniques.
Intended outcome: This course gives the student a broad and detailed understanding of XML database technology as well as the Semantic Web. Its goal is also that students be able to identify the latest related research topics.
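As a small taste of the XPath material above, the following sketch uses Python's standard library; the XML document and the query are invented for illustration, not taken from the course:

```python
import xml.etree.ElementTree as ET

# A tiny invented catalog document, standing in for real Web data.
doc = """
<catalog>
  <book lang="en"><title>Foundations of Databases</title></book>
  <book lang="fr"><title>Bases de donnees</title></book>
</catalog>
"""

root = ET.fromstring(doc)
# ElementTree supports a subset of XPath: select titles of English books.
titles = [t.text for t in root.findall("./book[@lang='en']/title")]
print(titles)  # ['Foundations of Databases']
```

Full XPath and XQuery engines (as covered in the course) go far beyond this subset.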
Prerequisites:
- Relational databases (model and query languages)
- Programming skills (Java)
- Logic for computer science
Architectures for Massive Data Management
Responsibles: Ioana Manolescu (INRIA Saclay), Albert Bifet (Télécom ParisTech)
Content: This module will present concepts, architectures and algorithms for data storage, management, and analytics, at a very large scale, especially in distributed settings. The following topics will be covered:
- Introduction to distributed systems (cluster and P2P architectures, scalability and Amdahl’s law; consistency, availability, and the CAP theorem; ACID vs BASE)
- Massively distributed / cloud-based file systems (e.g., HDFS, GFS)
- Distributed computation models: MapReduce vs MPI
- Distributed NoSQL databases:
- Key-value stores: Search-tree based, e.g., DNS, Baton, BigTable/HBase; Hash-based (DHTs): consistent hashing, Chord, DynamoDB
- Graph databases, e.g., Neo4J, Pregel
- Distributed triple stores
- Document stores, e.g., MongoDB
- Cloud Storage and Computing
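Consistent hashing, listed above among the DHT topics, can be sketched in a few lines of Python. This is a minimal illustration (node names and parameters are invented), not the implementation used by Chord or DynamoDB:

```python
import hashlib
from bisect import bisect

def h(key: str) -> int:
    """Deterministic hash onto a large integer ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, replicas=100):
        # Place `replicas` virtual points per node on the ring,
        # which smooths out the load distribution.
        self.ring = sorted((h(f"{n}#{i}"), n)
                           for n in nodes for i in range(replicas))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # A key is owned by the first virtual point clockwise from its hash.
        i = bisect(self.points, h(key)) % len(self.points)
        return self.ring[i][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")
# Adding or removing a node only remaps about 1/N of the keys,
# which is the property that makes this scheme "consistent".
```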
Prerequisites: Databases, Algorithms & Data Structure, Java programming
Data Warehouses
Responsible: Benoit Groz (U Paris Sud)
Content: This module will cover architectures and technologies dedicated to transforming raw data into valuable information for business processes. It first covers various aspects of data warehouses, where raw data is harvested from various sources to obtain a global vision. More specifically, we will discuss data warehouse architectures, data modeling (conceptual, e.g., multi-dimensional modeling, and logical, e.g., star schemas), query languages, and the query optimization techniques used to efficiently execute analytical queries. The course then outlines different types of analytical applications, including OLAP queries as well as simple data profiling and data mining techniques.
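The multi-dimensional aggregation behind OLAP queries can be illustrated with a toy CUBE rollup in plain Python (the fact table and dimension names are invented; a real warehouse would run this as SQL over a star schema):

```python
from collections import defaultdict
from itertools import combinations

# Toy fact table (invented): (store, product, revenue).
sales = [("paris", "tea", 10), ("paris", "coffee", 20), ("lyon", "tea", 5)]

def cube(rows):
    """Aggregate revenue over every subset of the dimensions (a CUBE rollup)."""
    out = defaultdict(int)
    for store, product, revenue in rows:
        dims = {"store": store, "product": product}
        for r in range(len(dims) + 1):
            for keys in combinations(sorted(dims), r):
                out[tuple((k, dims[k]) for k in keys)] += revenue
    return dict(out)

c = cube(sales)
print(c[()])                     # 35  (grand total: empty group-by)
print(c[(("store", "paris"),)])  # 30  (one slice of the cube)
```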
Prerequisites: Introduction to databases (L2-L3), implementation of databases and query optimization (M1)
First Semester, Second Period
Students have to choose 6 out of the following courses. Each course is 21h and 2.5 ECTS.
Stream Data Mining
Responsible: Albert Bifet
Content: Data streams are everywhere, from F1 racing and electricity networks to social media feeds. Data stream mining, or real-time analytics, relies on and develops new incremental algorithms that process streams under strict resource limitations. This course focuses on, and extends, the methods implemented in open-source tools such as MOA and Apache SAMOA. Students will learn how to select and apply an appropriate method for a given data stream problem; they will learn how to design and implement such algorithms; and they will learn how to evaluate and compare different solutions.
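A classic example of the kind of incremental, fixed-memory algorithm this course builds on is reservoir sampling, which maintains a uniform random sample of a stream of unknown length. A minimal sketch (not taken from MOA or SAMOA):

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k/(i+1),
            # evicting a uniformly chosen resident.
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 10)  # 10 items, O(k) memory
```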
Knowledge Base Construction
Responsible: Fabian Suchanek (Télécom Paris Tech)
Content: This module will teach students the basics of semantic information extraction. It will cover the concepts, methods, and algorithms to extract factual information from text in order to construct a coherent knowledge base. This includes some NLP (Part-of-Speech tagging, Dependency Parsing, etc.), and the techniques and concepts of entity disambiguation, instance extraction, the extraction from semi-structured sources (Wrapper Induction, Wikipedia-based approaches), the extraction from unstructured sources (e.g., by Pattern-based approaches), and the extraction by Soft Reasoning (Markov Logic, MAX SAT, etc.). We will also cover the design of extraction approaches in general (Evaluation, Iteration, etc.), and the alignment of knowledge bases in the Linked Open Data framework.
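The pattern-based approaches mentioned above can be illustrated with a single Hearst-style pattern ("X such as Y"). This is a deliberately toy regex sketch (the sentence and the triple format are invented), far simpler than the extraction systems covered in the course:

```python
import re

# One Hearst-style pattern: "<class> such as <instance>, <instance>, ..."
PATTERN = re.compile(r"(\w+) such as ((?:\w+(?:, )?)+)")

def extract(text):
    """Extract (instance, 'type', class) triples from plain text."""
    facts = []
    for cls, items in PATTERN.findall(text):
        for inst in items.split(", "):
            facts.append((inst, "type", cls))
    return facts

print(extract("We study composers such as Bach, Handel"))
# [('Bach', 'type', 'composers'), ('Handel', 'type', 'composers')]
```

Real systems combine many such patterns with NLP preprocessing and disambiguation, as the course content describes.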
Prerequisites:
- Propositional & First Order Logic
- Basics of the Web (HTTP, HTML, (Web forms), XML, ...)
- Basics of the Semantic Web (knowledge representation, RDF, OWL,...)
- Graph Theory
- Java programming
Data Science for Big Data
Responsible: Michalis Vazirgiannis, Ecole Polytechnique
This course has been canceled.
Content: This course acquaints students with algorithms, methods, and techniques for the life cycle of a data science project, i.e., the iterative and incremental approach to making sense of data (structured, graph, text), around the following key components: data engineering and data analysis. This includes data pre-processing and cleaning, feature extraction and creation, and supervised and unsupervised learning methods for potentially Big data.
- (Big) data pre-processing
- Data cleaning, normalization
- Feature selection & creation
- Spectral decompositions
- Dimensionality reduction
- Data analysis
- Descriptive (data quality)
- Exploratory (summary statistics, correlation, ANOVA)
- Inferential (theory of generalization, sampling, statistical testing)
- Predictive (supervised, unsupervised machine learning)
- Sequence, text, and graph mining
- Case Studies (from data mining cups or Kaggle)
Prerequisites: Data Bases, Algorithms, Probability/Statistics, Programming
Dynamic Content Management
Responsible: Nicoleta Preda (UVSQ)
Content: This module will examine the management of dynamic data, for a variety of distributed Web applications. The course includes an introduction to standard tools for developing Web applications (REST/SOAP Web Services, XML/JSON, XSLT, BPEL), followed by an exploration of the problems that come from the dynamic nature of the data returned by Web services: wrapper construction, on-the-fly entity resolution, query evaluation using services with limited access patterns, workflow selection, verification/provenance of workflows. We will also cover the dynamic integration into RDF knowledge bases (Linked Open Data) of the data exported by digital libraries using Web service APIs.
Prerequisites: Basics of the Web (HTTP, HTML, Web forms, XML), Basics of distributed and database systems.
Information Integration
Responsible: Nathalie Pernelle (Paris Sud), Fatiha Saïs (Paris Sud)
This course has been canceled.
Content: Nowadays, the Web of documents has evolved into a Web of Data connecting distributed and structured data (e.g., RDF, RDFa, Microformats) across the Web. To benefit from the richness of the Web of Data, it is important to establish whether two pieces of data refer to the same real-world entity. In this module, we first survey well-known data integration architectures. Then, we present the data linking problem by giving a classification of the main existing approaches: supervised/unsupervised, local/global, knowledge-based, and single/multi-ontology. After that, we introduce the data fusion issue encountered when data connected by an identity link has to be integrated, which raises the problem of conflicting values. The main approaches, techniques, and knowledge used to solve all these issues are explored.
Intended outcome: This course gives students an understanding of the difficulties encountered in designing an application when one has to decide that the “Musée des Arts Premier”, located near “Trocadero”, and the “Musée du quai Branly”, located in “Paris’s 7th arrondissement”, refer to the same museum. It also gives an understanding of the criteria for choosing a data linking approach that takes into account the characteristics of the data and the application. Finally, it introduces students to the data fusion issue, allowing them to develop tools specifically adapted to the data and application domain.
Cognitive Modelling
Responsible: Jean-Louis Dessalles, Telecom ParisTech
This course has been canceled.
Content: Knowledge is most often created by humans and intended for humans. This course addresses the fundamental issue of knowledge representation and processing from a cognitive perspective. The course will first present cognitive semantics models (‘language of thought’ models, cognitive linguistics, conceptual spaces) and address the corresponding issues (grounding problem, frame problem, consistency, holism, learnability,...). Various implementations will be proposed for students to get a good grasp on these models. The link with KR engineering techniques will be made. The course will also address the question of knowledge relevance. Most text fragments are motivated (e.g. deal with an issue) and most queries by end-users, e.g. in search engines, are motivated as well. The course will present theories and implemented models of relevance (newsworthiness, interest, argumentative relevance).
Prerequisites: First-order logic, Logic programming.
Distributed Data Mining and Machine Learning
Responsible: Mauro Sozio
Content: The course will present machine learning and data mining algorithms for massive data analysis. It will cover the main theoretical and practical aspects behind machine learning and data mining, with emphasis on designing efficient parallel/distributed algorithms and their implementation in MapReduce (Hadoop).
- Clustering (k-means, spectral clustering)
- Ranking (PageRank, personalized PageRank)
- Recommendation systems
- Frequent itemsets
- Expectation Maximization
- Convex optimization
- Support Vector Machines
- Streaming algorithms
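The MapReduce model used for the implementations above can be sketched in plain Python with the canonical word-count example. This toy driver only simulates the map, shuffle, and reduce phases that Hadoop distributes across a cluster:

```python
from collections import defaultdict
from itertools import chain

def mapper(line):
    # Map phase: emit (word, 1) for every word in the input line.
    return [(w, 1) for w in line.split()]

def reducer(key, values):
    # Reduce phase: combine all counts for one word.
    return (key, sum(values))

def run(lines):
    # Shuffle phase: group intermediate (key, value) pairs by key,
    # then apply the reducer to each group.
    groups = defaultdict(list)
    for k, v in chain.from_iterable(mapper(l) for l in lines):
        groups[k].append(v)
    return dict(reducer(k, vs) for k, vs in groups.items())

print(run(["to be or not to be"]))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```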
Prerequisites: Data Bases, Algorithms, Probability/Statistics, Programming
Very Large Data and Knowledge in Bioinformatics
Responsible: Sarah Cohen-Boulakia (Paris Sud)
Content: The course will cover problems of very large data and knowledge in the domain of Bioinformatics. Topics include, but are not limited to: (1) methods and tools for scientific workflows, (2) storing and querying provenance in scientific workflows, (3) mining workflow databases, (4) the use of Semantic Web and Metadata in Bioinformatics.
Uncertain Data Management
Responsible: Silviu Maniu (U Paris-Sud)
Content: The objective of this class is to present models for the representation of uncertain data, as well as algorithms and tools to process this data while maintaining information about its uncertainty. Topics covered include:
- Sources of uncertain data
- Incomplete data models: SQL NULLs and Codd tables, c-tables, incomplete XML
- Certain answers, consistent answers, strong representation systems.
- Measuring uncertainty: probability distributions, possibility distributions, Dempster-Shafer theory
- Possible worlds semantics
- Probabilistic relations: tuple-independent, block-independent-disjoint tuples, probabilistic c-tables. Models and querying.
- Probabilistic XML: local dependencies and global dependencies. Models and querying.
- Updating probabilistic databases.
- Applications of probabilistic databases. Inferring probability distributions.
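The tuple-independent model and possible-worlds semantics listed above can be illustrated with a brute-force sketch: each tuple exists independently with its own probability, and a Boolean query's probability is summed over all worlds where it holds (the relation and values are invented):

```python
from itertools import product

# A tuple-independent relation: each tuple exists independently
# with the given probability (data is invented for illustration).
lives_in = [(("alice", "paris"), 0.9), (("bob", "paris"), 0.5)]

def query_probability(relation, predicate):
    """P(some tuple satisfying `predicate` exists), by enumerating worlds."""
    total = 0.0
    for world in product([True, False], repeat=len(relation)):
        # Probability of this particular possible world.
        p = 1.0
        for present, (_, prob) in zip(world, relation):
            p *= prob if present else 1 - prob
        facts = [t for present, (t, _) in zip(world, relation) if present]
        if any(predicate(t) for t in facts):
            total += p
    return total

# P(someone lives in Paris) = 1 - (1 - 0.9)(1 - 0.5) = 0.95
p = query_probability(lives_in, lambda t: t[1] == "paris")
```

Enumeration is exponential in the number of tuples; the course covers the models and query-evaluation techniques that avoid this blow-up where possible.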
New trends in Data&Knowledge (“Module liberté”)
Responsible: Fabian M. Suchanek (Télécom ParisTech)
Content: The Data&Knowledge track acknowledges that new concepts and techniques will be developed over the coming years in the area of knowledge and data management. To ensure the timely coverage of these concepts, and also to welcome potential future lecturers into our track, we allow students to fill the credits of this module completely freely from the courses that are offered at UPSa. The condition is that the courses be thematically related to knowledge and data management. The organisers of the Data&Knowledge track will examine each proposed course upon request and decide whether to admit it as a possible choice for the students.
Mandatory Softskill Courses
The following two courses are mandatory in the second period of the first semester. Each course is 2.5 ECTS.
Communication in Research
Responsible: all lecturers of Data&Knowledge
Content: In this module, students will get the opportunity to practice their English speaking skills as well as various soft-skills such as presentation techniques, team work, discussion or debating techniques. After introductory classes to these topics, students will prepare presentations (not necessarily limited to slideshows) on scientific papers, with the goal of explaining the scientific contributions to non-computer scientists in an understandable, accurate, but entertaining way.
Prerequisites: Introduction to database systems (L2-L3), Implementation of database systems and optimization (M1).
Introduction to Research / Business
Responsible: N.N.
Content: This course corresponds to the classical French course “Formation à la recherche / à l'entreprise”. It introduces the basic concepts of research and business.