List of accepted tutorials:

3 hours

  • Computational fact-checking: a content management perspective

    by Sylvie Cazalens (INSA-Lyon, France), Julien Leblay (AIST Tokyo, Japan), Philippe Lamarre (INSA-Lyon, France), Ioana Manolescu (INRIA, France) and Xavier Tannier (Univ. Sorbonne, Paris, France).

    A particularly popular and active area of data journalism is fact-checking. The term was born in the journalism community and referred to the process of verifying and ensuring the accuracy of published media content; in recent years, however, it has increasingly focused on the analysis of politics, economy, science, and news content shared in any form, but first and foremost on the Web.

    This tutorial outlines the current state of affairs in the area of digital (or computational) fact-checking in newsrooms, by journalists, NGO workers, scientists and IT companies. It shows which areas of digital content management research, in particular those relying on the Web, can be leveraged to help fact-checking, and gives a comprehensive survey of efforts in this area. Finally, the tutorial highlights ongoing trends, unsolved problems, and areas where we envision future scientific and practical advances.

  • Data Integration and Machine Learning: a Natural Synergy

    by Xin Luna Dong (Amazon, USA) and Theodoros Rekatsinas (Univ. Wisconsin-Madison, USA).

    As data volume and variety have increased, the ties between machine learning and data integration have grown stronger. For machine learning to be effective, one must utilize data from the greatest possible variety of sources; this is why data integration plays a key role. At the same time, machine learning is driving automation in data integration, resulting in an overall reduction of integration costs and improved accuracy. This tutorial focuses on three aspects of the synergistic relationship between data integration and machine learning: (1) we survey how state-of-the-art data integration solutions rely on machine learning-based approaches for accurate results and effective human-in-the-loop pipelines, (2) we review how end-to-end machine learning applications rely on data integration to identify accurate, clean, and relevant data for their analytics exercises, and (3) we discuss open research challenges and opportunities that span data integration and machine learning.
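    As one concrete illustration of this synergy, consider entity matching: deciding whether two records describe the same real-world entity. State-of-the-art integration systems learn the matching function from labeled examples; the hypothetical Python sketch below substitutes a fixed Jaccard-similarity threshold for the learned model.

```python
# Minimal, illustrative entity-matching sketch. Real systems learn the
# matching function; here a fixed Jaccard threshold stands in for it.

def tokens(record: str) -> set:
    """Lowercase whitespace tokenization of a record string."""
    return set(record.lower().split())

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A & B| / |A | B|."""
    return len(a & b) / len(a | b) if a | b else 0.0

def match(r1: str, r2: str, threshold: float = 0.5) -> bool:
    """Declare two records the same entity if similarity meets the threshold."""
    return jaccard(tokens(r1), tokens(r2)) >= threshold

print(match("Apple Inc. Cupertino CA", "apple inc. cupertino ca"))  # True
print(match("Apple Inc. Cupertino CA", "Amazon Seattle WA"))        # False
```

    In a learned pipeline, the threshold (or the similarity function itself) would be fit from training pairs rather than chosen by hand.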

  • Database and Distributed Computing Fundamentals for Scalable, Fault-tolerant and Consistent Maintenance of Blockchains

    by Amr El Abbadi, Divy Agrawal, Sujaya A. Maiyya and Victor Zakhary (UC Santa Barbara, USA).

    Bitcoin is a successful and interesting example of a global-scale peer-to-peer cryptocurrency that integrates many techniques and protocols from cryptography, distributed systems, and databases. The main underlying data structure is the blockchain, a scalable, fully replicated structure that is shared among all participants and guarantees a consistent view of all user transactions by all participants in the cryptocurrency system. In this tutorial, we discuss the basic protocols used in blockchains, and elaborate on their main advantages and limitations. To overcome these limitations, we provide the necessary distributed systems background in managing large-scale fully replicated ledgers, using Byzantine Agreement protocols to solve the consensus problem. Finally, we expound on some of the most recent proposals to design scalable and efficient blockchains. The focus of the tutorial is on the distributed systems and database technical aspects of the recent innovations in blockchains.
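    The core linking idea behind the data structure can be sketched in a few lines: each block embeds the hash of its predecessor, so any retroactive change to a block invalidates every later link. A minimal, illustrative Python sketch follows (not Bitcoin's actual block format):

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the block."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def append_block(chain: list, transactions: list) -> list:
    """Append a block linked to its predecessor by that block's hash."""
    prev = block_hash(chain[-1]) if chain else "0" * 64
    chain.append({"prev_hash": prev, "transactions": transactions})
    return chain

def verify(chain: list) -> bool:
    """Recompute each link; any tampering breaks a prev_hash pointer."""
    return all(chain[i]["prev_hash"] == block_hash(chain[i - 1])
               for i in range(1, len(chain)))

chain = []
append_block(chain, ["alice pays bob 1"])
append_block(chain, ["bob pays carol 2"])
print(verify(chain))   # True
chain[0]["transactions"] = ["alice pays mallory 99"]  # tamper with history
print(verify(chain))   # False
```

    What this sketch leaves out, and what the tutorial addresses, is agreement among mutually distrusting participants on which chain is the valid one.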

1.5 hours

  • Graph Data Models, Query Languages and Programming Paradigms

    by Alin Deutsch and Yannis Papakonstantinou (UC San Diego, USA).

    Numerous databases support semi-structured, schemaless and heterogeneous data, typically in the form of graphs (often restricted to trees and nested data). They also provide corresponding high-level query languages or graph-tailored programming paradigms.

    The evolving query languages present multiple variations: some are superficial syntactic ones, while others are genuine differences in modeling, language capabilities, and semantics. Incompatibility with SQL presents a learning challenge for graph databases, while table orientation often leads to cumbersome syntactic/semantic structures that are ill-suited to graph data. Furthermore, the query languages often fall short of full-fledged semistructured and graph query language capabilities, when compared to the yardsticks set by prior academic efforts.

    We survey the features, design options, and differences in the approaches taken by current systems. We cover both declarative query languages, whose semantics is independent of the underlying model of computation, and languages with an operational semantics that is more tightly coupled to the model of computation. For the declarative languages over both general graphs and tree-shaped graphs (as motivated by XML and the recent generation of nested formats, such as JSON and Parquet), we present SQL extensions that capture the essentials of such database systems.

    More precisely, rather than presenting a single SQL extension, we present a language with multiple configuration options, whose possible settings formally capture multiple possible (and different) semantics. We show how appropriate settings of the configuration options morph the semantics into that of each surveyed language, hence providing a compact and formal tool to understand the essential semantic differences between systems.

    Finally, we compare with prior nested and graph query languages (notably OQL, XQuery, Lorel, StruQL, PigLatin), and we transfer into the modern graph database context lessons from the semistructured query processing research of the 90s and 00s, combining them with insights on current graph databases.
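    To make the data model concrete, the sketch below encodes a tiny property graph in plain Python and evaluates a transitive-reachability query, the kind of path expression (e.g. a Cypher-style (a)-[*]->(b)) that declarative graph languages express natively. All node names and data here are hypothetical:

```python
from collections import deque

# A tiny property graph: labeled nodes with properties, and labeled edges.
nodes = {
    1: {"label": "Person", "name": "Ada"},
    2: {"label": "Person", "name": "Alan"},
    3: {"label": "Paper",  "title": "On Computable Numbers"},
}
edges = [(1, "KNOWS", 2), (2, "WROTE", 3)]

def neighbors(node, edge_label=None):
    """Outgoing neighbors, optionally restricted to one edge label."""
    return [dst for src, lbl, dst in edges
            if src == node and (edge_label is None or lbl == edge_label)]

def reachable(start):
    """BFS transitive closure of `start`: the operational counterpart of a
    declarative variable-length path pattern."""
    seen, queue = set(), deque([start])
    while queue:
        for m in neighbors(queue.popleft()):
            if m not in seen:
                seen.add(m)
                queue.append(m)
    return seen

print(reachable(1))  # {2, 3}
```

    A declarative language states only the pattern; the traversal strategy above is one of many evaluation plans a system may choose.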

  • Forecasting Big Time Series: Old and New

    by Christos Faloutsos (Carnegie Mellon Univ. and Amazon Research, USA), Jan Gasthaus, Tim Januschowski and Yuyang Wang (Amazon Research, USA).

    Time series forecasting is a key ingredient in the automation and optimization of business processes: in retail, deciding which products to order and where to store them depends on the forecasts of future demand in different regions; in cloud computing, the estimated future usage of services and infrastructure components guides capacity planning; and workforce scheduling in warehouses and factories requires forecasts of the future workload. Recent years have witnessed a paradigm shift in forecasting techniques and applications, from computer-assisted model- and assumption-based to data-driven and fully-automated. This shift can be attributed to the availability of large, rich, and diverse time series data sources. The challenges that need to be addressed are therefore the following: How can we build statistical models to efficiently and effectively learn to forecast from large and diverse data sources? How can we leverage the statistical power of "similar" time series to improve forecasts in the case of limited observations? What are the implications for building forecasting systems that can handle large data volumes?

    The objective of this tutorial is to provide a concise and intuitive overview of the most important methods and tools available for solving large-scale forecasting problems. We review the state of the art in three related fields: (1) classical modeling of time series, (2) scalable tensor methods, and (3) deep learning for forecasting. Further, we share lessons learned from building scalable forecasting systems. While our focus is on providing an intuitive overview of the methods and the practical issues, which we illustrate via case studies, we also present some technical details underlying these powerful tools.
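    As a taste of the classical modeling thread, the sketch below implements simple exponential smoothing, one of the oldest model-based forecasting methods; the demand series and smoothing factor are made up for illustration:

```python
def ses(series, alpha=0.5):
    """Simple exponential smoothing: the level is a weighted average of the
    latest observation and the previous level; the final level serves as the
    one-step-ahead forecast."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

# Hypothetical daily demand for one product.
demand = [10, 12, 11, 13, 12]
print(ses(demand, alpha=0.5))  # 12.0
```

    Methods in this family forecast each series in isolation; the data-driven techniques the tutorial covers instead learn shared structure across many related series.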