# Program Schedule

Room : Segóvia I

Room : Segóvia II, Segóvia III, Segóvia IV

Room : El Pardo I

Room : Oriente

Room : Aranjuez

#### Conference Inauguration

Room : Segóvia I, Segóvia II, Segóvia III, Segóvia IV

Room : Segóvia I, Segóvia II, Segóvia III, Segóvia IV

Chair : Fabio Porto (LNCC)

Authors: Renée J. Miller (Northeastern University)

Open Data plays a major role in open government initiatives. Governments around the world are adopting Open Data Principles promising to make their Open Data complete, primary, and timely. These properties make this data tremendously valuable to data scientists. However scientists generally do not have a priori knowledge about what data is available (its schema or content), but will want to be able to use Open Data and integrate it with other public or private data they are studying. Traditionally, data integration is done using a framework called “query discovery” where the main task is to discover a query (or transformation script) that transforms data from one form into another. The goal is to find the right operators to join, nest, group, link, and twist data into a desired form. In this talk, I introduce a new paradigm for thinking about Open Data Integration where the focus is on “data discovery”, but highly efficient internet-scale discovery that is heavily query-aware. As an example, a join-aware discovery algorithm finds datasets, within a massive data lake, that join (in a precise sense of having high containment) with a known dataset. I describe a research agenda and recent progress in developing scalable query-aware data discovery algorithms.

Room : Segóvia I

Chair : Arnab Bhattacharya (IIT Kanpur)

##### MLBench: Benchmarking Machine Learning Services Against Human Experts

Authors: Yu Liu (ETH Zurich), Hantian Zhang (ETH Zurich), Luyuan Zeng (ETH Zurich), Wentao Wu (Microsoft Research), Ce Zhang (ETH)

Modern machine learning services and systems are complicated data systems - the process of designing such systems is an art of compromising between functionality, performance, and quality. Providing different levels of system supports for different functionalities, such as automatic feature engineering, model selection and ensemble, and hyperparameter tuning, could improve the quality, but also introduce additional cost and system complexity. In this paper, we try to facilitate the process of asking the following type of questions: How much will the users lose if we remove the support of functionality x from a machine learning service? Answering this type of questions using existing datasets, such as the UCI datasets, is challenging. The main contribution of this work is a novel dataset, MLBench, harvested from Kaggle competitions. Unlike existing datasets, MLBench contains not only the raw features for a machine learning task, but also those used by the winning teams of Kaggle competitions. The winning features serve as a baseline of best human effort that enables multiple ways to measure the quality of machine learning services that cannot be supported by existing datasets, such as relative ranking on Kaggle and relative accuracy compared with best-effort systems. We then conduct an empirical study using MLBench to understand example machine learning services from Amazon and Microsoft Azure, and showcase how MLBench enables a comparative study revealing the strength and weakness of these existing machine learning services quantitatively and systematically. The full version of this paper can be found at arxiv.org/abs/1707.09562

##### Scalable Training of Hierarchical Topic Models

Authors: Jianfei Chen (Tsinghua University), Jun Zhu (Tsinghua University), Jie Lu (Tsinghua University), Shixia Liu (Tsinghua University)

Large-scale topic models serve as basic tools for feature extraction and dimensionality reduction in many practical applications. As a natural extension of flat topic models, hierarchical topic models (HTMs) are able to learn topics of different levels of abstraction, which lead to deeper understanding and better generalization than their flat counterparts. However, existing scalable systems for flat topic models cannot handle HTMs, due to their complicated data structures such as trees and concurrent dynamically growing matrices, as well as their susceptibility to local optima. In this paper, we study the hierarchical latent Dirichlet allocation (hLDA) model which is a powerful nonparametric Bayesian HTM. We propose an efficient partially collapsed Gibbs sampling algorithm for hLDA, as well as an initialization strategy to deal with local optima introduced by tree-structured models. We also identify new system challenges in building scalable systems for HTMs, and propose efficient data layout for vectorizing HTM as well as distributed data structures including dynamic matrices and trees. Empirical studies show that our system is 87 times more efficient than the previous open-source implementation for hLDA, and can scale to thousands of CPU cores. We demonstrate our scalability on a 131-million-document corpus with 28 billion tokens, which is 4-5 orders of magnitude larger than previously used corpus. Our distributed implementation can extract 1,722 topics from the corpus with 50 machines in just 7 hours.

##### On Optimizing Operator Fusion Plans for Large-Scale Machine Learning in SystemML

Authors: Matthias Boehm (IBM Research - Almaden), Berthold Reinwald (IBM Research - Almaden), Dylan Hutchison (University of Washington), Prithviraj Sen (IBM Research - Almaden), Alexandre Evfimievski (IBM Research - Almaden), Niketan Pansare (IBM Research - Almaden)

Many machine learning (ML) systems allow the specification of ML algorithms by means of linear algebra programs, and automatically generate efficient execution plans. The opportunities for fused operators---in terms of fused chains of basic operators---are ubiquitous, and include fewer materialized intermediates, fewer scans of inputs, and sparsity exploitation across operators. However, existing fusion heuristics struggle to find good plans for complex operator DAGs or hybrid plans of local and distributed operations. In this paper, we introduce an exact yet practical cost-based optimization framework for fusion plans and describe its end-to-end integration into Apache SystemML. We present techniques for candidate exploration and selection of fusion plans, as well as code generation of local and distributed operations over dense, sparse, and compressed data. Our experiments in SystemML show end-to-end performance improvements of up to 22x, with negligible compilation overhead.

##### Snorkel: Rapid Training Data Creation with Weak Supervision

Authors: Alexander Ratner (Stanford University), Stephen Bach (Stanford University), Henry Ehrenberg (Stanford University), Jason Fries (Stanford University), Sen Wu (Stanford University), Christopher Re (Stanford University)

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8x faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8x speedup per pipeline execution. In two collaborations, with the U.S. Department of Veterans Affairs and the U.S. Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

##### Ease.ml: Towards Multi-tenant Resource Sharing for Machine Learning Workloads

Authors: Tian Li (ETH Zurich), Jie Zhong (University of Rochester), Ji Liu (University of Rochester), Wentao Wu (Microsoft Research), Ce Zhang (ETH)

We present ease.ml, a declarative machine learning service platform. With ease.ml, a user defines the high-level schema of an ML application and submits the task via a Web interface. The system then deals with the rest, such as model selection and data movement. The ultimate question we hope to understand is that, as a â€œservice providerâ€ that manages a shared cluster of machines running machine learning workloads, what is the resource sharing strategy that maximizes the global satisfaction of all our users? This paper does not completely answer this general question, but focuses on solving the first technical challenge we were facing when trying to build ease.ml. We observe that resource sharing is a critical yet subtle issue in this multi-tenant scenario, as we have to balance between efficiency and fairness. We first formalize the problem that we call multi-tenant model selection, aiming for minimizing the total regret of all users running automatic model selection tasks. We then develop a novel algorithm that combines multi-armed bandits with Bayesian optimization and prove a regret bound under the multi-tenant setting. Finally, we report our evaluation of ease.ml on synthetic data and on two services we are providing to our users, namely, image classification with deep neural networks and binary classification with Azure ML Studio. Our experimental evaluation results show that our proposed solution can be up to 9.8x faster in achieving the same global average accuracy for all users as the two popular heuristics used by our users before ease.ml, and 4.1x faster than state-of-the-art systems.

Room : Segóvia II

Chair : Pinar Tozun (IT University of Copenhagen)

##### Scalable Database Logging for Multicores

Author: Hyungsoo Jung (Hanyang University), Hyuck Han (Donduk Women's University), Sooyong Kang (Hanyang University)

Modern databases, guaranteeing atomicity and durability, store transaction logs in a volatile, central log buffer and then flush the log buffer to non-volatile storage by the write-ahead logging principle. Buffering logs in central log store has recently faced a severe multicore scalability problem, and log flushing has been challenged by synchronous I/O delay. We have designed and implemented a fast and scalable logging method, ELEDA, that can migrate a surge of transaction logs from volatile memory to stable storage without risking durable transaction atomicity. Our efficient implementation of ELEDA is enabled by a highly concurrent data structure, GRASSHOPPER, that eliminates a multicore scalability problem of centralized logging and enhances system utilization in the presence of synchronous I/O delay. We implemented ELEDA and plugged it to WiredTiger and Shore-MT by replacing their log managers. Our evaluation showed that ELEDA-based transaction systems improve performance up to 71 x, thus showing the applicability of ELEDA.

##### In-RDBMS Hardware Acceleration of Advanced Analytics

Authors: Divya Mahajan (Georgia Institute of Technology), Joon Kyung Kim (Georgia Institute of Technology), Jacob Sacks (Georgia Institute of Technology), Adel Ardalan (University of Wisconsin-Madison), Arun Kumar (University of California), Hadi Esmaeilzadeh (University of California)

The data revolution is fueled by advances in areas such as databases, hardware design, and machine learning. Now, programmable accelerators are making their way into each of these areas in isolation. As such, there is timely need for solutions that enable emerging accelerators in the conjunction of these areas. This paper sets out to be the initial step towards such a unifying solution. The aim is to devise a solution for the in-Database Acceleration of Advanced Analytics (DAnA). DAnA empowers database users to leap beyond traditional data summarization techniques and seamlessly utilize hardware-accelerated machine learning. Deploying specialized hardware, such as FPGAs, for in-database analytics currently requires hand-designing the hardware and manually routing the data. Instead, DAnA automatically maps a high-level specification of in-database analytics queries to the FPGA accelerator. The accelerator implementation is generated from a User De- fined Function (UDF), expressed as a part of an SQL query using Python-embedded Domain-Specific Language (DSL). To realize efficient in-database integration, DAnA-generated accelerators contain a novel hardware structure, Striders, that directly interface with the buffer pool of the database. DAnA obtains the schema and page layout information from the database catalog to configure the Striders. In turn, Striders extract, cleanse, and process the training data tuples, which are consumed by a multi-threaded FPGA engine that executes the analytics algorithm. We integrated DAnA with PostgreSQL to generate hardware accelerators for a range of real-world and synthetic datasets running diverse ML algorithms. Results show that DAnA-enhanced PostgreSQL provides, on average, 8.3x end-to-end speedup for real datasets, with the maximum at 28.2x. Moreover, DAnA-enhanced PostgreSQL is 4.0x faster, on average, than the multi-threaded Apache MADLib running on Greenplum. DAnA provides these benefits while hiding the com- plexity of hardware design from data scientists and allowing them to express the algorithm in 30-60 lines of Python code.

##### PolarFS: An Ultra-low Latency and Failure Resilient Distributed File System for Shared Storage Cloud Database

Authors: Zhenjun Liu (Alibaba), Wei Cao (Alibaba)

PolarFS is a distributed file system with ultra-low latency and high availability, designed for database service POLARDB on Alibaba Cloud. PolarFS utilizes lightweight network stack and I/O stack in user-space, taking full advantage of the emerging techniques like RDMA, NVMe, and SPDK. In this way, the end-to-end latency of PolarFS has been reduced drastically and our experiments show that the write latency of PolarFS is quite close to that of local file system on SSD. To keep replica consistency while maximizing I/O throughput for PolarFS, we proposed ParallelRaft, a consensus protocol derived from Raft, which breaks Raft's strict serialization by exploiting the out-of-order I/O completion tolerance capability of databases. ParallelRaft inherits the understandability and easy implementation of Raft while providing much better I/O scalability for PolarFS. We also described the shared storage architecture of PolarFS, which gives a strong support for POLARDB.

##### Efficient Distributed Memory Management with RDMA and Caching

Authors: Qingchao Cai (NUS), Wentian Guo (NUS), Hao Zhang (NUS), Divy Agrawal (University of California), Gang Chen (Zhejiang University), Beng Chin Ooi (NUS), Kian-Lee Tan (NUS), Yong Meng Teo (NUS), Sheng Wang (NUS)

Recent advancements in high-performance networking interconnect significantly narrow the performance gap between intra-node and inter-node communications, and opens up opportunities for distributed memory platforms to enforce cache coherency among distributed nodes. To this end, we propose GAM, an efficient distributed in-memory platform that provides a directory-based cache coherence protocol over remote direct memory access (RDMA). GAM manages the free memory distributed among multiple nodes to provide a unified memory model, and supports a set of user-friendly APIs for memory operations. To remove writes from critical execution paths, GAM allows a write to be reordered with the following reads and writes, and hence enforces partial store order (PSO) memory consistency. A light-weight logging scheme is designed to provide fault tolerance in GAM. We further build a transaction engine and a distributed hash table (DHT) atop GAM to show the ease-of-use and applicability of the provided APIs. Finally, we conduct an extensive micro benchmark to evaluate the read/write/lock performance of GAM under various workloads, and a macro benchmark against the transaction engine and DHT. The results show the superior performance of GAM over existing distributed memory platforms.

Room : Segóvia III

Chair : Cong Yu (Google AI)

##### Managing Healthy Communities needs Data Collaboration

Authors: Dr. Anand Deshpande (CEO&Founder Persistent Systems - India)

##### Big Data and Cloud: Challenges in Accelerating Analytics

Authors: Dr. Surajit Chaudhury (Distinguished Scientist Microsoft - USA)

##### Making Sense of Applying ML to APIs

Authors: Dr. Anant Jhingran ( CTO Apigee - USA)

Room : Segóvia IV

Authors: Christos Faloutsos (Carnegie Mellon Univ. and Amazon Research, USA), Jan Gasthaus (Amazon), Tim Januschowski (Amazon) and Yuyang Wang (Amazon)

Room : El Pardo I

Chair : Lei Chen (HKUST)

##### Question Answering Over Knowledge Graphs: Question Understanding Via Template Decomposition

Authors: Weiguo Zheng (CUHK), Jeffrey Xu Yu (CUHK), Lei Zou (Peking University), Hong Cheng (CUHK)

Using natural language to query large knowledge graphs is an interesting and important task, which provides an easier and natural way for casual end users to explore knowledge graphs. However, the gap between unstructured natural language and structured data makes it challenging to build such a system. Many existing methods construct a structured query for the input question based on a syntactic parser. Once the input question is parsed incorrectly, a false structured query will be generated, which may result in false or incomplete answers. The problem gets worse especially for complex questions. In this paper, we propose a novel systematic method to understand natural language questions by using a large number of binary templates rather than semantic parsers. It is clear that sufficient templates are critical in the procedure. We present a low-cost approach that can build a huge number of templates automatically. To reduce the search space, we carefully devise an index to facilitate the online template decomposition. Moreover, we identify two-level ambiguities, i.e., entity-level ambiguity and structure-level ambiguity. We design effective strategies to perform the two-level disambiguations by considering the query semantics. Extensive experiments over real datasets demonstrate that our proposed approach is effective as it significantly outperforms state-of-the-art methods in terms of both precision and recall.

##### CERES: Distantly Supervised Relation Extraction from the Semi-Structured Web

Authors: Colin Lockard (University of Washington), Luna Dong (Amazon), Arash Einolghozati (Facebook), Prashant Shiralkar (Amazon)

The web contains countless semi-structured websites, which can be a rich source of information for populating knowledge bases. Existing methods for extracting relations from the DOM trees of semi-structured webpages can achieve high precision and recall only when manual annotations for each website are available. Although there have been efforts to learn extractors from automatically generated labels, these methods are not sufficiently robust to succeed in settings with complex schemas and information-rich websites. In this paper we present a new method for automatic extraction from semi-structured websites based on distant supervision. We automatically generate training labels by aligning an existing knowledge base with a website and leveraging the unique structural characteristics of semi-structured websites. We then train a classifier based on the potentially noisy and incomplete labels to predict new relation instances. Our method can compete with annotation-based techniques in the literature in terms of extraction quality. A large-scale experiment on over 400,000 pages from dozens of multi-lingual long-tail websites harvested 1.25 million facts at a precision of 90%.

##### Query-Driven On-The-Fly Knowledge Base Construction

Authors: Dat Nguyen (MPI for Informatics), Abdalghani Abujabal (MPI for Informatics), Khanh Tran (L3S Research Center), Martin Theobald (University of Luxembourg), Gerhard Weikum (Max Planck Institute for Informatics)

Today’s openly available knowledge bases, such as DBpedia, Yago, Wikidata or Freebase, capture billions of facts about the world’s entities. However, even the largest among these (i) are still limited in up-to-date coverage of what happens in the real world, and (ii) miss out on many relevant predicates that precisely capture the wide variety of relationships among entities. To overcome both of these limitations, we propose a novel approach to build on-the-fly knowledge bases in a query-driven manner. Our system, called QKBfly, supports analysts and journalists as well as question answering on emerging topics, by dynamically acquiring relevant facts as timely and comprehensively as possible. QKBfly is based on a semantic-graph representation of sentences, by which we perform three key IE tasks, namely named-entity disambiguation, co-reference resolution and relation extraction, in a light-weight and integrated manner. In contrast to Open IE, our output is canonicalized. In contrast to traditional IE, we capture more predicates, including ternary and higher-arity ones. Our experiments demonstrate that QKBfly can build high-quality, on-the-fly knowledge bases that can readily be deployed, e.g., for the task of ad-hoc question answering.

##### The Vadalog System: Datalog-based Reasoning for Knowledge Graphs

Authors: Luigi Bellomarini (University of Oxford), Emanuel Sallinger (University of Oxford), Georg Gottlob (University of Oxford)

Over the past years, there has been a resurgence of Datalog-based systems in the database community as well as in industry. In this context, it has been recognized that to handle the complex knowledge-based scenarios encountered today, such as reasoning over large knowledge graphs, Datalog has to be extended with features such as existential quantification. Yet, Datalog-based reasoning in the presence of existential quantification is in general undecidable. Many efforts have been made to define decidable fragments. Warded Datalog+/- is a very promising one, as it captures PTIME complexity while allowing ontological reasoning. Yet so far, no implementation of Warded Datalog+/- was available. In this paper we present the Vadalog system, a Datalog-based system for performing complex logic reasoning tasks, such as those required in advanced knowledge graphs. The Vadalog system is Oxford’s contribution to the VADA research programme, a joint effort of the universities of Oxford, Manchester and Edinburgh and around 20 industrial partners. As the main contribution of this paper, we illustrate the first implementation of Warded Datalog+/-, a high-performance Datalog+/- system utilizing an aggressive termination control strategy. We also provide a comprehensive experimental evaluation.

##### A Survey and Experimental Comparison of Distributed SPARQL Engines for Very Large RDF Data

Authors: Ibrahim Abdelaziz (KAUST), Razen Harbi(Saudi Aramco), Zuhair Khayyat (KAUST), Panos Kalnis (KAUST)

Distributed SPARQL engines promise to support very large RDF datasets by utilizing shared-nothing computer clusters. Some are based on distributed frameworks such as MapReduce; others implement proprietary distributed processing; and some rely on expensive preprocessing for data partitioning. These systems exhibit a variety of trade-offs that are not well-understood, due to the lack of any comprehensive quantitative and qualitative evaluation. In this paper, we present a survey of 22 state-of-the-art systems that cover the entire spectrum of distributed RDF data processing and categorize them by several characteristics. Then, we select 12 representative systems and perform extensive experimental evaluation with respect to preprocessing cost, query performance, scalability and workload adaptability, using a variety of synthetic and real large datasets with up to 4.3 billion triples. Our results provide valuable insights for practitioners to understand the trade-offs for their usage scenarios. Finally, we publish online our evaluation framework, including all datasets and workloads, for researchers to compare their novel systems against the existing ones.

Room : Oriente

Room : Segóvia I

Moderator: Julia Stoyanovich (Drexel University)

Panelists: Bill Howe (University of Washington), HV Jagadish (University of Michigan), and Gerome Miklau (University of Massachusetts)

Room : Segóvia II

Chair : Tamer Ozsu (University of Waterloo)

##### The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing

Authors: Siddhartha Sahu (University of Waterloo), Amine Mhedhbi (University of Waterloo), Semih Salihoglu (University of Waterloo), Jimmy Lin (University of Waterloo), Tamer Özsu (University of Waterloo)

Graph processing is becoming increasingly prevalent across many application domains. In spite of this prevalence, there is little research about how graphs are actually used in practice. We conducted an online survey aimed at understanding: (i) the types of graphs users have; (ii) the graph computations users run; (iii) the types of graph software users use; and (iv) the major challenges users face when processing their graphs. We describe the participants' responses to our questions highlighting common patterns and challenges. We further reviewed user feedback in the mailing lists, bug reports, and feature requests in the source repositories of a large suite of software products for processing graphs. Through our review, we were able to answer some new questions that were raised by participants' responses and identify specific challenges that users face when using different classes of graph software. The participants' responses and data we obtained revealed surprising facts about graph processing in practice. In particular, real-world graphs represent a very diverse range of entities and are often very large, and scalability and visualization are undeniably the most pressing challenges faced by participants. We hope these findings can guide future research.

##### Analyzing the Impact of System Architecture on the Scalability of OLTP Engines for High-Contention Workloads

Authors: Raja Appuswamy (EPFL), Angelos Anadiotis (EPFL), Danica Porobic (Oracle), Mustafa Iman (EPFL), Anastasia Ailamaki (EPFL)

Main-memory OLTP engines are being increasingly deployed on multicore servers that provide abundant thread-level parallelism. However, recent research has shown that even the state-of-the-art OLTP engines are unable to exploit available parallelism for high contention workloads. While previous studies have shown the lack of scalability of all popular concurrency control protocols, they consider only one system architecture—a non-partitioned, shared everything one where transactions can be scheduled to run on any core and can access any data or metadata stored in shared memory. In this paper, we perform a thorough analysis of the impact of other architectural alternatives (Data-oriented transaction execution, Partitioned Serial Execution, and Delegation) on scalability under high contention scenarios. In doing so, we present Trireme, a main-memory OLTP engine testbed that implements four system architectures and several popular concurrency control protocols in a single code base. Using Trireme, we present an extensive experimental study to understand i) the impact of each system architecture on overall scalability, ii) the interaction between system architecture and concurrency control protocols, and iii) the pros and cons of new architectures that have been proposed recently to explicitly deal with high-contention workloads.

##### FlexPS: Flexible Parallelism Control in Parameter Server Architecture

Authors: Yuzhen Huang (CUHK), Tatiana Jin (CUHK), Yidi Wu (CUHK), Zhenkun Cai (CUHK), Xiao Yan (CUHK), Fan Yang (CUHK), Jinfeng Li (CUHK), Yuying Guo (CUHK), James Cheng (CUHK)

As a general abstraction for coordinating the distributed storage and access of model parameters, the parameter server (PS) architecture enables distributed machine learning to handle large datasets and high dimensional models. Many systems, such as Parameter Server and Petuum, have been developed based on the PS architecture and widely used in practice. However, none of these systems supports changing parallelism during runtime, which is crucial for the efficient execution of machine learning tasks with dynamic workloads. We propose a new system, called FlexPS, which introduces a novel multi-stage abstraction to support flexible parallelism control. With the multi-stage abstraction, a machine learning task can be mapped to a series of stages and the parallelism for a stage can be set according to its workload. Optimizations such as stage scheduler, stage-aware consistency controller, and direct model transfer are proposed for the efficiency of multi-stage machine learning in FlexPS. As a general and complete PS systems, FlexPS also incorporates many optimizations that are not limited to multi-stage machine learning. We conduct extensive experiments using a variety of machine learning workloads, showing that FlexPS achieves significant speedups and resource saving compared with the state-of-the-art PS systems such as Petuum and Multiverso.

##### CloudKit: Structured Storage for Mobile Applications

Authors: Alexander Shraer (Apple), Alexandre Aybes (Apple), Bryan Davis (Apple), Christos Chrysafis (Apple), Dave Browning (Apple), Eric Krugler (Apple), Eric Stone (Apple), Harrison Chandler (Apple), Jacob Farkas (Apple), John Quinn (Apple), Jonathan Ruben (Apple), Michael Ford (Apple), Mike McMahon (Apple), Nathan Williams (Apple), Nicolas Favre-Felix (Apple), Nihar Sharma (Apple) Inc, Ori Herrnstadt (Apple), Paul Seligman (Apple), Raghav Pisolkar (Apple), Scott Dugas (Apple), Scott Gray (Apple), Sytze Harkema (Apple), Valentin Kravtsov (Apple), Vanessa Hong (Apple), Wan Ling Yih (Apple), Yizuo Tian (Apple), Shirley Lu (Apple)

CloudKit is Apple’s cloud backend service and application development framework that provides strongly-consistent storage for structured data and makes it easy to synchronize data across user devices or share it among multiple users. Launched more than 3 years ago, CloudKit forms the foundation for more than 50 Apple apps, including many of our most important and popular applications such as Photos, iCloud Drive, Notes, Keynote, and News, as well as many third-party apps. To deliver this at large scale, CloudKit explicitly leverages multi-tenancy at the application level as well as at the user level to guide efficient data placement and distribution. By using CloudKit application developers are free to focus on delivering the application front-end and logic while relying on CloudKit for scale, consistency, durability and security. CloudKit manages petabytes of data and handles hundreds of millions of users around the world on a daily basis.

Room : Segóvia III

Chair : Eric Li (Huawei)

##### Benchmarking Industrial IoT Applications

Authors: Dr. Umesh Dayal (Sr. Fellow Hitachi - Japan/USA)

##### Enterprise In-Memory Database Technology: A Long Journey from Dreaming to Crossing the Chasm

Authors: Dr. Sang Cha (SNU Big Data Institute, Founder of the predecessor to SAP-Hana - Korea)

##### Machine Learning for Online Advertising

Authors: Dr. Ramakrishnan Srikant (Google Fellow - USA)

Room : Segóvia IV

Authors: Luna Dong (Amazon.com), Theodoros Rekatsinas (University of Wisconsin-Madison)

Room : Oriente

Room : Segóvia I

Chair : Herodotos Herodotou (Cyprus University of Technology)

##### Frontier: Resilient Edge Processing for the Internet of Things

Authors: Dan O'Keeffe (Imperial College London), Theodoros Salonidis (IBM T.J. Watson Research Center), Peter Pietzuch (Imperial College London)

In an edge deployment model, Internet-of-Things (IoT) applications, e.g. for building automation or video surveillance, must process data locally on IoT devices without relying on permanent connectivity to a cloud backend. The ability to harness the combined resources of multiple IoT devices for computation is influenced by the quality of wireless network connectivity. An open challenge is how practical edge-based IoT applications can be realised that are robust to changes in network bandwidth between IoT devices, due to interference and intermittent connectivity. We present Frontier, a distributed and resilient edge processing platform for IoT devices. The key idea is to express data-intensive IoT applications as continuous data-parallel streaming queries and to improve query throughput in an unreliable wireless network by exploiting network path diversity: a query includes operator replicas at different IoT nodes, which increases possible network paths for data. Frontier dynamically routes stream data to operator replicas based on network path conditions. Nodes probe path throughput and use backpressure stream routing to decide on transmission rates, while exploiting multiple operator replicas for data-parallelism. If a node loses network connectivity, a transient disconnection recovery mechanism reprocesses the lost data. Our experimental evaluation of Frontier shows that network path diversity improves throughput by 1.3×–2.8× for different IoT applications, while being resilient to intermittent network connectivity.

##### ForkBase: An Efficient Storage Engine for Blockchain and Forkable Applications

Authors: Sheng Wang (NUS), Anh Dinh (NUS), Qian Lin (NUS), Zhongle Xie (NUS), Meihui Zhang (Beijing Institute of Technology), Qingchao Cai (NUS), Gang Chen (Zhejiang University), Beng Chin Ooi (NUS), Pingcheng Ruan (NUS)

Existing data storage systems offer a wide range of functionalities to accommodate an equally diverse range of applications. However, new classes of applications have emerged, e.g., blockchain and collaborative analytics, featuring data versioning, fork semantics, tamper-evidence or any combination thereof. They present new opportunities for storage systems to efficiently support such applications by embedding the above requirements into the storage. In this paper, we present ForkBase, a storage engine designed for blockchain and forkable applications. By integrating core application properties into the storage, ForkBase not only delivers high performance but also reduces development effort. The storage manages multiversion data and supports two variants of fork semantics which enable different fork worklflows. ForkBase is fast and space efficient, due to a novel index class that supports efficient queries as well as effective detection of duplicate content across data objects, branches and versions. We demonstrate ForkBase’s performance using three applications: a blockchain platform, a wiki engine and a collaborative analytics application. We conduct extensive experimental evaluation against respective state-of-the-art solutions. The results show that ForkBase achieves superior performance while significantly lowering the development effort.

##### LightDB: A DBMS for Virtual Reality Video

Authors: Brandon Haynes (University of Washington), Amrita Mazumdar (University of Washington), Armin Alaghi (University of Washington), Magdalena Balanziska (University of Washington), Luis Ceze (University of Washington), Alvin Cheung (University of Washington)

We present the data model, architecture, and evaluation of LightDB, a database management system designed to efficiently manage virtual, augmented, and mixed reality (VAMR) video content. VAMR video differs from its two-dimensional counterpart in that it is spherical with periodic angular dimensions, is nonuniformly and continuously sampled, and applications that consume such videos often have demanding latency and throughput requirements. To address these challenges, LightDB treats VAMR video data as a logically-continuous six-dimensional light field. Furthermore, LightDB supports a rich set of operations over light fields, and automatically transforms declarative queries into executable physical plans. We have implemented a prototype of LightDB and, through experiments with VAMR applications in the literature, we find that LightDB offers up to 4× throughput improvements compared with prior work.

##### Data Synthesis based on Generative Adversarial Networks

Authors: Noseong Park (UNC Charlotte), Mahmoud Mohammadi (UNC Charlotte), Kshitij Gorde (UNC Charlotte), Sushil Jajodia (George Mason University), Hongkyu Park (ETRI), Youngmin Kim (ETRI)

Privacy is an important concern for our society where sharing data with partners or releasing data to the public is a frequent occurrence. Some of the techniques that are being used to achieve privacy are to remove identifiers, alter quasi-identifiers, and perturb values. Unfortunately, these approaches suffer from two limitations. First, it has been shown that private information can still be leaked if attackers possess some background knowledge or other information sources. Second, they do not take into account the adverse impact these methods will have on the utility of the released data. In this paper, we propose a method that meets both requirements. Our method, called table-GAN, uses generative adversarial networks (GANs) to synthesize fake tables that are statistically similar to the original table yet do not incur information leakage. We show that the machine learning models trained using our synthetic tables exhibit performance that is similar to that of models trained using the original table for unknown testing cases. We call this property model compatibility. We believe that anonymization/perturbation/synthesis methods without model compatibility are of little value. We used four real-world datasets from four different domains for our experiments and conducted in-depth comparisons with state-of-the-art anonymization, perturbation, and generation techniques. Throughout our experiments, only our method consistently shows balance between privacy level and model compatibility.

##### Efficient Searchable Encryption Through Compression

Authors: Ioannis Demertzis (University of Maryland), Charalampos Papamanthou (University of Maryland), Rajdeep Talapatra (University of Maryland)

In this work we design new searchable encryption schemes whose goal is to minimize the number of server-side cryptographic operations required to retrieve the result---a dimension mostly overlooked by previous works, yet very important in practice. Towards this goal, we utilize bitmap and inverted list compression to compress the plaintext indexes before producing the encrypted searchable indices. Our solution can use any existing Searchable Encryption (SE) as a black box and any combination of lossless compression algorithms, without compromising security. The efficiency of our schemes varies based on SE leakage exposed by the underlying application. For instance, for private keyword search (more leakage), we demonstrate up to 217x savings in search time, while for database search (less leakage) our saving drops to 63x. The power of our approach is better manifested when combined with more secure, yet less practical, cryptographic tools, such as Oblivious Random Access Memory (ORAM). In particular while ORAM has been criticized as prohibitively expensive for large-scale applications, we show that our compression solutions allow up to \textbf{63x} more efficient search time, which reflects into reducing the time for ORAM solutions from approximately 21 hours to 20 minutes, when executing a query of result size more than $1$ million.

Room : Segóvia II

Chair : Altigran Soares da Silva (Federal University of Amazonas)

##### Interleaving with Coroutines: A Practical Approach for Robust Index Joins

Authors: Georgios Psaropoulos (EPFL), Thomas Legler (SAP SE), Norman May (SAP SE), Anastasia Ailamaki (EPFL)

Index join performance is determined by the efficiency of the lookup operation on the involved index. Although database indexes are highly optimized to leverage processor caches, main memory accesses inevitably increase lookup runtime when the index outsizes the last-level cache; hence, index join performance drops. Still, robust index join performance becomes possible with instruction stream interleaving: given a group of lookups, we can hide cache misses in one lookup with instructions from other lookups by switching among their respective instruction streams upon a cache miss. In this paper, we propose interleaving with coroutines for any type of index join. We showcase our proposal on SAP HANA by implementing binary search and CSB+-tree traversal for an instance of index join related to dictionary compression. Coroutine implementations not only perform similarly to prior interleaving techniques, but also resemble the original code closely, while supporting both interleaved and non-interleaved execution. Thus, we claim that coroutines make interleaving practical for use in real DBMS codebases.

##### BlockJoin: Efficient Matrix Partitioning Through Joins

Authors: Andreas Kunft (TU Berlin, Delft University of Technology), Asterios Katsifodimos (SAP Innovation Center), Sebastian Schelter (TU Berlin), Tilmann Rabl (TU Berlin), Volker Markl (TU Berlin, Delft University of Technology)

Linear algebra operations are at the core of many Machine Learning (ML) programs. At the same time, a considerable amount of the effort for solving data analytics problems is spent in data preparation. As a result, end-toend ML pipelines often consist of (i) relational operators used for joining the input data, (ii) user defined functions used for feature extraction and vectorization, and (iii) linear algebra operators used for model training and crossvalidation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. These dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this paper, we investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. We present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines. Our experimental evaluation shows speedups up to 6× and the skew resistance of BlockJoin compared to stateof-the-art pipelines implemented in Spark.

##### Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models

Authors: Martin Kiefer (TU Berlin), Max Heimel (Snowflake Computing), Sebastian Breß (TU Berlin, 3German Research Center for Artificial Intelligence), Volker Markl (TU Berlin, 3German Research Center for Artificial Intelligence)

Accurately predicting the cardinality of intermediate plan operations is an essential part of any modern relational query optimizer. The accuracy of said estimates has a strong and direct impact on the quality of the generated plans, and incorrect estimates can have a negative impact on query performance. One of the biggest challenges in this field is to predict the result size of join operations. Kernel Density Estimation (KDE) is a statistical method to estimate multivariate probability distributions from a data sample. Previously, we introduced a modern, self-tuning selectivity estimator for range scans based on KDE that outperforms state-of-the-art multidimensional histograms and is ecient to evaluate on graphics cards. In this paper, we extend these bandwidth-optimized KDE models to estimate the result size of single and multiple joins. In particular, we propose two approaches: (1) Building a KDE model from a sample drawn from the join result. (2) Eciently combining the information from base table KDE models. We evaluated our KDE-based join estimators on a variety of synthetic and real-world datasets, demonstrating that they are superior to state-of-the art join estimators based on sketching or sampling.

##### Leveraging Similarity Joins for Signal Reconstruction

Authors: Abolfazl Asudeh (University of Michigan), Azade Nazi (Microsoft Research), Jees Augustine (UTA), Saravanan Thirumuruganathan (QCRI), Nan Zhang (Pennsylvania State University), Gautam Das (UTA), Divesh Srivastava (AT&T Labs Research)

Signal reconstruction problem (SRP) is an important optimization problem where the objective is to identify a solution to an underdetermined system of linear equations that is closest to a given prior. It has a substantial number of applications in diverse areas including network traffic engineering, medical image reconstruction, acoustics, astronomy and many more. Most common approaches for SRP do not scale to large problem sizes. In this paper, we propose a dual formulation of this problem and show how adapting database techniques developed for scalable similarity joins provides a significant speedup. Extensive experiments on real-world and synthetic data show that our approach produces a significant speedup of up to 20x over competing approaches.

##### Set Similarity Joins on MapReduce: An Experimental Survey

Authors: Fabian Fier (Humboldt University of Berlin), Nikolaus Augsten (University of Salzburg), Panagiotis Bouros (Johannes Gutenberg University Mainz), Ulf Leser (Humboldt University of Berlin), Johann-Christoph Freytag (Humboldt University of Berlin)

Set similarity joins, which compute pairs of similar sets, constitute an important operator primitive in a variety of applications, including applications that must process large amounts of data. To handle these data volumes, several distributed set similarity join algorithms have been proposed. Unfortunately, little is known about the relative performance, strengths and weaknesses of these techniques. Previous comparisons are limited to a small subset of relevant algorithms, and the large differences in the various test setups make it hard to draw overall conclusions. In this paper we survey ten recent, distributed set similarity join algorithms, all based on the MapReduce paradigm. We empirically compare the algorithms in a uniform test environment on twelve datasets that expose different characteristics and represent a broad range of applications. Our experiments yield a surprising result: All algorithms in our test fail to scale for at least one dataset and are sensitive to long sets, frequent set elements, low similarity thresholds, or a combination thereof. Interestingly, some algorithms even fail to handle the small datasets that can easily be processed in a non-distributed setting. Our analytic investigation of the algorithms pinpoints the reasons for the poor performance and targeted experiments confirm our analytic findings. Based on our investigation, we suggest directions for future research in the area.

Room : Segóvia III

Chair : Eduardo Cunha de Almeida (UFPR)

##### Real-Time Data Analytics and Intelligence at Alibaba

Authors: Dr. Jingren Zhou (V.P. Alibaba - China)

##### Digital Transformation toward Intelligent Enterprise

Authors: Dr. Wen-Syan Li (Senior Vice President SAP - China)

##### Caching: 5 minutes ought to be enough for everybody

Authors: Dr. Anastasia Ailamaki (CEO RAW labs - Switzeland)

Room : Segóvia IV

Authors: Luna Dong (Amazon.com), Theodoros Rekatsinas (University of Wisconsin-Madison)

Room : Oriente

Room : El Pardo II

Room : Segóvia I, Segóvia II, Segóvia III, Segóvia IV

Room : Segóvia I, Segóvia II, Segóvia III, Segóvia IV

Chair : Jian Pei (Simon Fraser University, JD.com)

Author: Beng Chin Ooi(NUS)

While AI and data-driven approaches are still evolving, they are likely to surpass current medical practices in the healthcare domain soon. The potential advantages are not only faster and more accurate analysis, but also the democratization of healthcare services. Notwithstanding, there are some common challenges when applying existing approaches onto the healthcare domain, due to the noise and bias of electronic health records (EHR), complex and heterogeneous feature relations, access control and data privacy and etc. In this talk, I discuss our design and implementation strategies: solve common challenges, instill domain knowledge, automate knowledge extraction, and enable system-based global optimization. I discuss our rationale on building a general analytics stack instead of solving individual problems, and explain how these challenges are being addressed. Several detailed technologies from both system and algorithm perspectives in our healthcare data management and analytics framework are also described. Finally, we introduce our healthcare analytics stack and MediLOT blockchain system, with the hope of playing a role in healthcare transformation.

Room : Segóvia I

Chair : Arun Kumar (University of California, San Diego)

##### Automating Large-Scale Data Quality Verification

Authors: Sebastian Schelter (Amazon), Dustin Lange (Amazon), Philipp Schmidt (Amazon), Meltem Celikel (Amazon), Felix Biessmann (Amazon), Andreas Grafberger (University of Augsburg)

Modern companies rely on data to guide every single business process and decision. Missing or incorrect information seriously compromises any decision process downstream. Therefore, a crucial, but tedious task for every team involved in data processing is to verify the quality of their data. We present a system for automating the verification of data quality at scale, which meets the requirements of production use cases. Our system provides a declarative API, which combines common quality constraints with user-defined validation code, and enables `unit tests' for data. We efficiently execute the resulting constraint validation workload by translating it to aggregation queries on Apache Spark. Our platform supports the incremental validation of data quality on growing datasets and leverages machine learning in many places. We discuss our design decisions, describe the resulting system architecture, and present an experimental evaluation on various datasets.

##### Moment-Based Quantile Sketches for Efficient High Cardinality Aggregation Queries

Authors: Edward Gan (Stanford University), Jialin Ding (Stanford University), Kai Sheng Tai (Stanford University), Vatsal Sharan (Stanford University), Peter Bailis (Stanford University)

Interactive analytics increasingly involves querying for quantiles over specific sub-populations and time windows of high cardinality datasets. Data processing engines such as Druid and Spark use mergeable summaries to estimate quantiles on these large datasets, but summary merge times are a bottleneck during high-cardinality aggregation. We show how a compact and efficiently mergeable quantile sketch can support aggregation workloads. This data structure, which we refer to as the moments sketch, operates with a small memory footprint (200 bytes) and computationally efficient (50ns) merges by tracking only a set of summary statistics, notably the sample moments. We demonstrate how we can efficiently and practically estimate quantiles using the method of moments and the maximum entropy principle, and show how the use of a cascade further improves query time for threshold predicates. Empirical evaluation on real-world datasets shows that the moments sketch can achieve less than 1 percent quantile error with 15 times less merge overhead than comparable summaries, improving end query time in the MacroBase engine by up to 7 times and the Druid engine by up to 60 times.

##### Efficient Denial Constraint Discovery with Hydra

Authors: Tobias Bleifuß (Hasso Plattner Institute), Sebastian Kruse (Hasso-Plattner-Institut), Felix Naumann (Hasso Plattner Institute)

Denial constraints (DCs) are a generalization of many other integrity constraints (ICs) widely used in databases, such as key constraints, functional dependencies, or order dependencies. Therefore, they can serve as a unified reasoning framework for all of these ICs and express business rules that cannot be expressed by the more restrictive IC types. The process of formulating DCs by hand is difficult, because it requires not only domain expertise but also database knowledge, and due to DCs’ inherent complexity, this process is tedious and error-prone. Hence, an automatic DC discovery is highly desirable: we search for all valid denial constraints in a given database instance. However, due to the large search space, the problem of DC discovery is computationally expensive. We propose a new algorithm Hydra, which overcomes the quadratic runtime complexity in the number of tuples of state-of-the-art DC discovery methods. The new algorithm’s experimentally determined runtime grows only linearly in the number of tuples. This results in a speedup by orders of magnitude, especially for datasets with a large number of tuples. Hydra can deliver results in a matter of seconds that to date took hours to compute.

##### HomeRun: Scalable Sparse-Spectrum Reconstruction of Aggregated Historical Data

Authors: Faisal Almutairi (University of Minnesota), Fan Yang (University of Pittsburgh), Hyun Ah Song (Carnegie Mellon University), Christos Faloutsos (CMU), Nicholas Sidiropoulos (University of Virginia), Vladimir Zadorozhny (University of Pittsburgh)

Recovering a time sequence of events from multiple aggregated and possibly overlapping reports is a major challenge in historical data fusion. The goal is to reconstruct a higher resolution event sequence from a mixture of lower resolution samples as accurately as possible. For example, we may aim to disaggregate overlapping monthly counts of people infected with measles into weekly counts. In this paper, we propose a novel data disaggregation method, called HOMERUN, that exploits an alternative representation of the sequence and finds the spectrum of the target sequence. More specifically, we formulate the problem as so-called basis pursuit using the Discrete Cosine Transform (DCT) as a sparsifying dictionary and impose non-negativity and smoothness constraints. HOMERUN utilizes the energy compaction feature of the DCT by finding the sparsest spectral representation of the target sequence that contains the largest (most important) coefficients. We leverage the Alternating Direction Method of Multipliers to solve the resulting optimization problem with scalable and memory efficient steps. Experiments using real epidemiological data show that our method considerably outperforms the state-of-the-art techniques, especially when the DCT of the sequence has a high degree of energy compaction.

##### An Analytical Study of Large SPARQL Query Logs

Authors: Angela Bonifati (University of Lyon), Wim Martens (University of Bayreuth), Thomas Timm (University of Bayreuth)

With the adoption of RDF as the data model for Linked Data and the Semantic Web, query specification from end-users has become more and more common in SPARQL endpoints. In this paper, we conduct an in-depth analytical study of the queries formulated by end-users and harvested from large and up-to-date query logs from a wide variety of RDF data sources. As opposed to previous studies, ours is the first assessment on a voluminous query corpus, spanning over several years and covering many representative SPARQL endpoints. Apart from the syntactical structure of the queries, that exhibits already interesting results on this generalized corpus, we drill deeper in the structural characteristics related to the graph- and hypergraph representation of queries. We outline the most common shapes of queries when visually displayed as undirected graphs, and characterize their (hyper-)tree width. Moreover, we analyze the evolution of queries over time, by introducing the novel concept of a streak, i.e., a sequence of queries that appear as subsequent modifications of a seed query. Our study offers several fresh insights on the already rich query features of real SPARQL queries formulated by real users, and brings us to draw a number of conclusions and pinpoint future directions for SPARQL query evaluation, query optimization, tuning, and benchmarking.

Room : Segóvia II

Chair : Asterios Katsifodimos (TU Delft)

##### Challenges and Experiences in Building an Efficient Apache Beam Runner For IBM Streams

Authors: Shen Li (IBM Research), Paul Gerver (IBM Watson Cloud Platform), John MacMillan (IBM Watson Cloud Platform), Daniel Debrunner (IBM Watson Cloud Platform), William Marshall (IBM Watson Cloud Platform), Kun-Lung Wu (IBM Research)

This paper describes the challenges and experiences in the development of IBM Streams runner for Apache Beam. Apache Beam is emerging as a common stream programming interface. Applications written with the Beam SDK can be executed on different underlying stream computing engines, with negligible migration penalty. IBM Streams is a widely-used enterprise streaming platform. It has a rich set of connectors and toolkits for easy integration of streaming applications with other enterprise applications. It also supports a broad range of programming language interfaces, including Java, C++, Python, Stream Processing Language (SPL) and Apache Beam. This paper focuses on our solutions to efficiently support the Beam programming abstractions in IBM Streams runner. Beam's discrete window model, on the one hand, supports out-of-order data arrivals, but on the other hand, forces runners to maintain more states, which leads to higher space and computation overhead...

##### Chi: A Scalable and Programmable Control Plane for Distributed Stream Processing Systems

Authors: Luo Mai (Imperial College London), Kai Zeng (Microsoft), Rahul Potharaju (Microsoft), Le Xu (UIUC), Shivaram Venkataraman (Microsoft), Costa Paolo (Microsoft)

Stream processing workloads and modern shared cluster environments are known to exhibit high variability and unpredictability. Furthermore, the large parameter space and the diverse set of user SLOs make modern streaming systems very challenging to statically configure and tune. All these issues necessitate introducing a powerful control plane into modern streaming systems with the ability to do continuous monitoring and feedback, as well as dynamically modify configuration. To address these requirements, in this paper we explore a novel control-plane design where we embed control plane messages in data plane channels. Building on this idea, we develop Chi, a low latency and extensible control plane for stream processing. We introduce a new reactive programming model and design mechanisms for asynchronously executing control policies thus avoiding global synchronization. We show how this allows us to easily implement a wide spectrum of control policies to address different use cases observed in production. Large-scale experiments using production workloads from a popular cloud provider demonstrate the flexibility and efficiency of our approach.

##### Providing Streaming Joins as a Service at Facebook

Stream processing applications reduce the latency of batch data pipelines and enable engineers to quickly identify production issues. Many times, a service can log data to distinct streams, even if they relate to the same real-world event (e.g., a search on Facebook's search bar). Furthermore, the logging of related events can appear on the server side with different delay, causing one stream to be significantly behind the other in terms of logged event times for a given log entry. To be able to stitch this information together with low latency, we need to be able to join two different streams where each stream may have its own characteristics regarding the degree in which its data is out-of-order. Doing so in a streaming fashion is challenging as a join operator consumes lots of memory, especially with significant data volumes.

##### Model-free Control for Distributed Stream Data Processing using Deep Reinforcement Learning

Authors: Teng Li (Syracuse University), Zhiyuan Xu (Syracuse University), Jian Tang (Syracuse University), Yanzhi Wang (Syracuse University)

In this paper, we focus on general-purpose Distributed Stream Data Processing Systems (DSDPSs), which deal with processing of unbounded streams of continuous data at scale distributedly in real or near-real time. A fundamental problem in a DSDPS is the scheduling problem (i.e., assigning workload to workers/machines) with the objective of minimizing average end-to-end tuple processing time. A widelyused solution is to distribute workload evenly over machines in the cluster in a round-robin manner, which is obviously not ecient due to lack of consideration for communication delay. Model-based approaches (such as queueing theory) do not work well either due to the high complexity of the system environment. We aim to develop a novel model-free approach that can learn to well control a DSDPS from its experience rather than accurate and mathematically solvable system models, just as a human learns a skill (such as cooking, driving, swimming, etc). Specifically, we, for the first time, propose to leverage emerging Deep Reinforcement Learning (DRL) for enabling model-free control in DSDPSs; and present design, implementation and evaluation of a novel and highly e↵ective DRL-based control framework, which minimizes average end-to-end tuple processing time by jointly learning the system environment via collecting very limited runtime statistics data and making decisions under the guidance of powerful Deep Neural Networks (DNNs). To validate and evaluate the proposed framework, we implemented it based on a widely-used DSDPS, Apache Storm, and tested it with three representative applications: continuous queries, log stream processing and word count (stream version). Extensive experimental results show 1) Compared to Storm’s default scheduler and the state-of-the-art model-based method, the proposed framework reduces average tuple processing by 33.5% and 14.0% respectively on average. 2) The proposed framework can quickly reach a good scheduling solution during online learning, which justifies its practicability for online control in DSDPSs.

##### Join Query Optimization Techniques for Complex Event Processing Applications

Authors: Ilya Kolchinsky (Technion, Israel Institute of Technology ), Assaf Schuster (Technion, Israel Institute of Technology )

Complex event processing (CEP) is a prominent technology used in many modern applications for monitoring and tracking events of interest in massive data streams. CEP engines inspect real-time information flows and attempt to detect combinations of occurrences matching predefined patterns. This is done by combining basic data items, also called â€œprimitive eventsâ€, according to a pattern detection plan, in a manner similar to the execution of multi-join queries in traditional data management systems. Despite this similarity, little work has been done on utilizing existing join optimization methods to improve the performance of CEP-based systems. In this paper, we provide the first theoretical and experimental study of the relationship between these two research areas. We formally prove that the CEP Plan Generation problem is equivalent to the Join Query Plan Generation problem for a restricted class of patterns and can be reduced to it for a considerably wider range of classes. This result implies the NP-completeness of the CEP Plan Generation problem. We further show how join query optimization techniques developed over the last decades can be adapted and utilized to provide practically efficient solutions for complex event detection. Our experiments demonstrate the superiority of these techniques over existing strategies for CEP optimization in terms of throughput, latency, and memory consumption.

Room : Segóvia III

Chair : José Macêdo (Federal University of Ceará)

##### How Good Are Modern Spatial Analytics Systems?

Authors: Varun Pandey (Technical University of Munich), Andreas Kipf (Technical University of Munich), Thomas Neumann (Technical University of Munich), Alfons Kemper (Technical University of Munich)

Spatial data is pervasive. Large amount of spatial data is produced everyday from GPS enabled devices such as cell phones, cars, sensors, and various consumer based applications such as Uber, location-tagged posts in Facebook, Instagram, Snapchat etc. This growth in spatial data coupled with the fact that spatial queries, analytical or transactional, can be computationally extensive has attracted enormous interest from the research community to develop systems that can efficiently process and analyze this data. In recent years a lot of spatial analytics systems have emerged. Existing work compare either limited features of these systems or the studies are outdated since new systems have emerged. In this work, we first explore the available modern spatial processing systems and then thoroughly compare them based on features and queries they support, using real world datasets.

##### GPU Rasterization for Real-Time Spatial Aggregation over Arbitrary Polygons

Authors: Eleni Tzirita Zacharatou (EPFL), Harish Doraiswamy (NYU), Anastasia Ailamaki (EPFL), Claudio Silva (NYU), Juliana Freire (NYU)

Visual exploration of spatial data relies heavily on spatial aggregation queries that slice and summarize the data over different regions. These queries comprise computationally-intensive point-in-polygon tests that associate data points to polygonal regions, challenging the responsiveness of visualization tools. This challenge is compounded by the sheer amounts of data, requiring a large number of such tests to be performed. Traditional pre-aggregation approaches are unsuitable in this setting since they fix the query constraints and support only rectangular regions. On the other hand, query constraints are defined interactively in visual analytics systems, and polygons can be of arbitrary shapes. In this paper, we convert a spatial aggregation query into a set of drawing operations on a canvas and leverage the rendering pipeline of the graphics hardware (GPU) to enable interactive response times. Our technique trades-off accuracy for response time by adjusting the canvas resolution, and can even provide accurate results when combined with a polygon index. We evaluate our technique on two large real-world data sets, exhibiting superior performance compared to index-based approaches.

##### Theoretically Optimal and Empirically Efficient R-trees with Strong Parallelizability

Authors: Jianzhong Qi (The University of Melbourne), Yufei Tao (The Chinese University of Hong Kong), Yanchuan Chang (University of Melbourne), Rui Zhang (University of Melbourne)

The massive amount of data and large variety of data distributions in the big data era call for access methods that are efficient in both query processing and index bulk-loading, and over both practical and worst-case workloads. To address this need, we revisit a classic multidimensional access method - the R-tree. We propose a novel R-tree packing strategy that produces R-trees with an asymptotically optimal I/O complexity for window queries in the worst case. Our experiments show that the R-trees produced by the proposed strategy are highly efficient on real and synthetic data of different distributions. The proposed strategy is also simple to parallelize, since it relies only on sorting. We propose a parallel algorithm for R-tree bulk-loading based on the proposed packing strategy, and analyze its performance under the massively parallel communication model. Experimental results confirm the efficiency and scalability of the parallel algorithm over large data sets.

##### ChronosDB: Distributed, File Based, Geospatial Array DBMS

Author: Ramon Antonio Rodriges Zalipynis (National Research University Higher School of Economics)

An array DBMS streamlines large N-d array management. A large portion of such arrays originates from the geospatial domain. The arrays often natively come as raster files while standalone command line tools are one of the most popular ways for processing these files. Decades of development and feedback resulted in numerous feature-rich, elaborate, free and quality-assured tools optimized mostly for a single machine. ChronosDB partially delegates in situ data processing to such tools and offers a formal N-d array data model to abstract from the files and the tools. ChronosDB readily provides a rich collection of array operations at scale and outperforms SciDB by up to 75x on average.

##### An Experimental Study on Hub Labeling based Shortest Path Algorithms

Authors: Ye Li (University of Macau), Leong Hou U (University of Macau), Man Lung Yiu (Hong Kong Polytechnic University), Ngai Meng Kou (University of Macau)

Shortest path distance retrieval is a core component in many important applications. For a decade, hub labeling (HL) techniques have been considered as a practical solution with fast query response time (e.g., 1-3 orders of magnitude faster), competitive indexing time, and slightly larger storage overhead (e.g., several times larger). These techniques enhance query throughput up to hundred thousands queries per second, which is particularly helpful in large user environment. Despite the importance of HL techniques, we are not aware of any comprehensive experimental study on HL techniques. Thus it is difficult for a practitioner to adopt HL techniques for her applications. To address the above issues, we provide a comprehensive experimental study on the state-of-the-art HL technique with analysis of their efficiency, effectiveness and applicability. From insightful summary of different HL techniques, we further develop a simple yet effective HL techniques called Significant path based Hub Pushing (SHP) which greatly improves indexing time of previous techniques while retains good query performance. We also complement extensive comparisons between HL techniques and other shortest path solutions to demonstrate robustness and efficiency of HL techniques.

Room : Segóvia IV

Authors: Sylvie Cazalens (INSA-Lyon, France), Julien Leblay (AIST Tokyo, Japan), Philippe Lamarre (INSA-Lyon, France), Ioana Manolescu (INRIA, France) and Xavier Tannier(Univ. Sorbonne, Paris, France)

Room : El Pardo I

Chair : Senjuti Basu Roy (New Jersey Institute of Technology)

##### Exact Processing of Uncertain Top-k Queries in Multi-criteria Settings

Authors: Kyriakos Mouratidis (Singapore Management University), Bo Tang (Southern University of Science and Technology)

Traditional rank-aware processing assumes a dataset that contains available options to cover a specific need (e.g., restaurants, hotels, etc) and users who browse that dataset via top-k queries with linear scoring functions, i.e., by ranking the options according to the weighted sum of their attributes, for a set of given weights. In practice, however, user preferences (weights) may only be estimated with bounded accuracy, or may be inherently uncertain due to the inability of a human user to specify exact weight values with absolute accuracy. Motivated by this, we introduce the uncertain top-k query (UTK). Given uncertain preferences, that is, an approximate description of the weight values, the UTK query reports all options that may belong to the top-k set. A second version of the problem additionally reports the exact top-k set for each of the possible weight settings. We develop a scalable processing framework for both UTK versions, and demonstrate its efficiency using standard benchmark datasets.

##### Adaptive Sampling for Rapidly Matching Histograms

Authors: Stephen Macke (University of Illinois-Urbana Champaign), Yiming Zhang (Shanghai Jiao Tong University), Silu Huang (University of Illinois-Urbana Champaign), Aditya Parameswaran (University of Illinois-Urbana Champaign)

In exploratory data analysis, analysts often have a need to identify histograms that possess a specific distribution, among a large class of candidate histograms, e.g., find countries whose income distribution is most similar to that of Greece. This distribution could be a new one that the user is curious about, or a known distribution from an existing histogram visualization. At present, this process of identification is brute-force, requiring the manual generation and evaluation of a large number of histograms. We present FastMatch: an end-to-end approach for interactively retrieving the histogram visualizations most similar to a user-specified target, from a large collection of histograms. The primary technical contribution underlying FastMatch is a probabilistic algorithm, HistSim, a theoretically sound sampling-based approach to identify the top-k closest histograms under ℓ1 distance. While HistSim can be used independently, within FastMatch we couple HistSim with a novel system architecture that is aware of practical considerations, employing asynchronous block-based sampling policies. FastMatch obtains near-perfect accuracy with up to 35× speedup over approaches that do not use sampling on several real-world datasets.

##### RC-Index: Diversifying Answers to Range Queries

Authors: Yue Wang (University of Massachusetts - Amherst), Alexandra Meliou (University of Massachusetts - Amherst), Gerome Miklau (University of Massachusetts - Amherst)

Query result diversification is widely used in data exploration, Web search, and recommendation systems. The problem of returning diversified query results consists of finding a small subset of valid query answers that are representative and different from one another, usually quantified by a diversity score. Most existing techniques for query diversification first compute all valid query results and then find a diverse subset. These techniques are inefficient when the set of valid query results is large. Other work has proposed efficient solutions for restricted application settings, where results are shared across multiple queries. In this paper, our goal is to support result diversification for general range queries over a single relation. We propose the RC-Index, a novel index structure that achieves efficiency by reducing the number of items that must be retrieved by the database to form a diverse set of the desired size (about 1 second for a dataset of 1 million items). Further, we prove that an RC-Index offers strong approximation guarantees. To the best of our knowledge, this is the first index-based diversification method with a guaranteed approximation ratio for range queries.

Authors: Siqiang Luo (The University of Hong Kong), Ben Kao (The University of Hong Kong), Guoliang Li (Tsinghua University), Jiafeng Hu (The University of Hong Kong), Reynold Cheng (The University of Hong Kong), Yudian Zheng (The University of Hong Kong)

We study the classical kNN queries on road networks. Existing solutions mostly focus on reducing query processing time. In many applications, however, system throughput is a more important measure. We devise a mathematical model that describes throughput in terms of a number of system characteristics. We show that query time is only one of the many parameters that impact throughput. Others include update time and query/update arrival rates. We show that the traditional approach of improving query time alone is generally inadequate in optimizing throughput. Moreover, existing solutions lack flexibility in adapting to environments of different characteristics. We propose TOAIN, which is a very flexible algorithm that can be easily trained to adapt to a given environment for maximizing query throughput. We conduct extensive experiments on both real and synthetic data and show that TOAIN gives significantly higher throughput compared with existing solutions.

##### HD-Index: Pushing the Scalability-Accuracy Boundary for Approximate kNN Search in High-Dimensional Spaces

Authors: Akhil Arora (American Express Big Data Labs), Sakshi Sinha (Fresh Gravity Inc.), Piyush Kumar (Visa Inc.), Arnab Bhattacharya (IIT Kanpur)

Nearest neighbor searching of large databases in high-dimensional spaces is inherently difficult due to the curse of dimensionality. A flavor of approximation is, therefore, necessary to practically solve the problem of nearest neighbor search. In this paper, we propose a novel yet simple indexing scheme, HD-Index, to solve the problem of approximate k-nearest neighbor queries in massive high-dimensional databases. HD-Index consists of a set of novel hierarchical structures called RDB-trees built on Hilbert keys of database objects. The leaves of the RDB-trees store distances of database objects to reference objects, thereby allowing efficient pruning using distance filters. In addition to triangular inequality, we also use Ptolemaic inequality to produce better lower bounds. Experiments on massive (up to billion scale) high-dimensional (up to 1000+) datasets show that HD-Index is effective, efficient, and scalable.

Room : Oriente

Room : Segóvia I

Chair : Kyuseok Shim (Seoul National University)

##### Morton Filters: Faster, Space-Efficient Cuckoo Filters via Biasing, Compression, and Decoupled Logical Sparsity

Authors: Alexander Breslow (AMD Research), Nuwan Jayasena (AMD Research)

Approximate set membership data structures (ASMDSs) are ubiquitous in computing. They trade a tunable, often small, error rate (∈) for large space savings. The canonical ASMDS is the Bloom filter, which supports lookups and insertions but not deletions in its simplest form. Cuckoo filters (CFs), a recently proposed class of ASMDSs, add deletion support and often use fewer bits per item for equal ∈. This work introduces the Morton filter (MF), a novel ASMDS that introduces several key improvements to CFs. Like CFs, MFs support lookups, insertions, and deletions, but improve their respective throughputs by 1.3× to 2.5×, 0.9× to 15.5×, and 1.3× to 1.6×. MFs achieve these improvements by (1) introducing a compressed format that permits a logically sparse filter to be stored compactly in memory, (2) leveraging succinct embedded metadata to prune unnecessary memory accesses, and (3) heavily biasing insertions to use a single hash function. With these optimizations, lookups, insertions, and deletions often only require accessing a single hardware cache line from the filter. These improvements are not at a loss in space efficiency, as MFs typically use comparable to slightly less space than CFs for the same ∈.

##### Efficient Haar+ Synopsis Construction for the Maximum Absolute Error Measure

Authors: Jinhyun Kim (Seoul National University), Jun-Ki Min (Korea University of Technology and Education), Kyuseok Shim (Seoul National University)

Several wavelet synopsis construction algorithms were previously proposed based on dynamic programming for unrestricted Haar wavelet synopses as well as Haar+ synopses. However, they find an optimal synopsis for every incoming value in each node of a coefficient tree, even if different incoming values share an identical optimal synopsis. To alleviate the limitation, we present novel algorithms, which keep only a minimal set of the distinct optimal synopses in each node of the tree, for the error-bounded synopsis problem. Furthermore, we propose the methods to restrict coefficient values to be considered to compute the optimal synopses in each node. In addition, by partitioning all optimal synopses in each node into a set of groups, such that every group can be represented by a compact representation, we significantly improve the performance of the proposed algorithms.

##### Efficient Construction of Approximate Ad-Hoc ML models Through Materialization and Reuse

Authors: Sona Hasani (University of Texas at Arlington), Saravanan Thirumuruganathan (QCRI), Abolfazl Asudeh (University of Michigan), Nick Koudas (University of Toronto), Gautam Das (University of Texas at Arlington)

Machine learning has become an essential toolkit for com- plex analytic processing. Data is typically stored in large data warehouses with multiple dimension hierarchies. Often, data used for building an ML model are aligned on OLAP hierarchies such as location or time. In this paper, we investigate the feasibility of efficiently constructing approximate ML models for new queries from previously constructed ML models by leveraging the concepts of model materialization and reuse. For example, is it possible to construct an approximate ML model for data from the year 2017 if one already has ML models for each of its quarters? We propose algorithms that can support a wide variety of ML models such as generalized linear models for classification along with K-Means and Gaussian Mixture models for clustering. We propose a cost based optimization framework that identifies appropriate ML models to combine at query time and conduct extensive experiments on real-world and synthetic datasets. Our results indicate that our framework can support analytic queries on ML models, with superior performance, achieving dramatic speedups of several orders in magnitude on very large datasets.

##### Improved Selectivity Estimation by Combining Knowledge from Sampling and Synopses

Authors: Magnus Mueller (University of Mannheim), Guido Moerkotte (University of Mannheim), Oliver Kolb (University of Mannheim)

Estimating selectivities remains a critical task in query processing. Optimizers rely on the accuracy of selectivities when generating execution plans and, in approximate query answering, estimated selectivities affect the quality of the result. Many systems maintain synopses, e.g., histograms, and, in addition, provide sampling facilities. In this paper, we present a novel approach to combine knowledge from synopses and sampling for the purpose of selectivity estimation for conjunctive queries. We first show how to extract information from synopses and sampling such that they are mutually consistent. In a second step, we show how to combine them and decide on an admissible selectivity estimate. We compare our approach to state-of-the-art methods and evaluate the strengths and limitations of each approach.

##### Cardinality Estimation: An Experimental Survey

Authors: Hazar Harmouch (Hasso Plattner Institute), Felix Naumann (Hasso Plattner Institute)

Data preparation and data profiling comprise many both basic and complex tasks to analyze a dataset at hand and extract metadata, such as data distributions, key candidates, and functional dependencies. Among the most important types of metadata is the number of distinct values in a column, also known as the zeroth-frequency moment. Cardinality estimation itself has been an active research topic in the past decades due to its many applications. The aim of this paper is to review the literature of cardinality estimation and to present a detailed experimental study of twelve algorithms, scaling far beyond the original experiments. First, we outline and classify approaches to solve the problem of cardinality estimation -we describe their main idea, error-guarantees, advantages, and disadvantages. Our experimental survey then compares the performance all twelve cardinality estimation algorithms. We evaluate the algorithms' accuracy, runtime, and memory consumption using synthetic and real-world datasets. Our results show that different algorithms excel in different in categories, and we highlight their trade-offs.

Room : Segóvia II

Chair : Nick Koudas (University of Toronto)

##### Effective Temporal Dependence Discovery in Time Series Data

Authors: Qingchao Cai (NUS), Zhongle Xie (NUS), Meihui Zhang (Beijing Institute of Technology), Gang Chen (Zhejiang University), H. Jagadish (University of Michigan), Beng Chin Ooi (NUS)

To analyze user behavior over time, it is useful to group users into cohorts, giving rise to {\em cohort analysis}. We identify several crucial limitations of current cohort analysis, motivated by the unmet need for temporal dependence discovery. To address these limitations, we propose a generalization that we call {\em recurrent cohort analysis}. We introduce a set of operators for recurrent cohort analysis and design access methods specific to these operators in both single-node and distributed environments. Through extensive experiments, we show that recurrent cohort analysis when implemented using the proposed access methods is up to six orders faster than one implemented as a layer on top of a database in a single-node setting, and two orders faster than one implemented using Spark SQL in a distributed setting.

##### Vocalizing Large Time Series Efficiently

Authors: Immanuel Trummer (Cornell University), Mark Bryan (Cornell University), Ramya Narasimha (Cornell University)

We vocalize query results for time series data. We describe a holistic approach that integrates query evaluation and vocalization. In particular, we generate only those parts of the query result that are relevant for voice output. We exploit the fact that voice output has to be concise and simple to be understandable for listeners. Hence, the problem of generating voice output reduces to choosing between several coarse-grained alternatives. To make that choice, it is sufficient to evaluate the time series at a few carefully chosen locations. We use techniques from the area of optimal experimental design to choose optimal sampling points. Our algorithm is iterative and generates in each iteration a set of promising voice description candidates. We consider multiple metrics when generating voice descriptions, including the accuracy of description as well as its complexity and length. Then, we choose a near-optimal batch of sampling points to refine our choice between promising candidates. We compare this algorithm experimentally against several baselines, demonstrating superior performance in terms of execution time and output quality. We also conducted a user study, showing that it enables users to execute simple exploratory data analysis via voice descriptions alone. We also compare against visual interfaces and sonification (i.e., non-speech sound) interfaces in terms of user performance.

##### Efficient Adaptive Detection of Complex Event Patterns

Authors: Ilya Kolchinsky (Technion, Israel Institute of Technology), Assaf Schuster (Technion, Israel Institute of Technology)

Complex event processing (CEP) is widely employed to detect occurrences of predefined combinations (patterns) of events in massive data streams. As new events are accepted, they are matched using some type of evaluation structure, commonly optimized according to the statistical properties of the data items in the input stream. However, in many real-life scenarios the data characteristics are never known in advance or are subject to frequent on-the-fly changes. To modify the evaluation structure as a reaction to such changes, adaptation mechanisms are employed. These mechanisms typically function by monitoring a set of properties and applying a new evaluation plan when significant deviation from the initial values is observed. This strategy often leads to missing important input changes or it may incur substantial computational overhead by over-adapting. In this paper, we present an efficient and precise method for dynamically deciding whether and how the evaluation structure should be reoptimized. This method is based on a small set of constraints to be satisfied by the monitored values, defined such that a better evaluation plan is guaranteed if any of the constraints is violated. To the best of our knowledge, our proposed mechanism is the first to provably avoid false positives on reoptimization decisions. We formally prove this claim and demonstrate how our method can be applied on known algorithms for evaluation plan generation. Our extensive experimental evaluation on real-world datasets confirms the superiority of our strategy over existing methods in terms of performance and accuracy.

##### ModelarDB: Modular Model-Based Time Series Management with Spark and Cassandra

Authors: Søren Jensen (Aalborg University), Torben Pedersen (Aalborg University), Christian Thomsen (Aalborg University)

Industrial systems, e.g., wind turbines, generate big amounts of data from reliable sensors with high velocity. As it is unfeasible to store and query such big amounts of data, only simple aggregates are currently stored. However, aggregates remove fluctuations and outliers that can reveal underlying problems and limit the knowledge to be gained from historical data. As a remedy, we present the distributed Time Series Management System (TSMS) ModelarDB that uses models to store sensor data. We thus propose an online, adaptive multi-model compression algorithm that maintains data values within a user-defined error bound (possibly zero). We also propose (i) a database schema to store time series as models, (ii) methods to push-down predicates to a key-value store utilizing this schema, (iii) optimized methods to execute aggregate queries on models, (iv) a method to optimize execution of projections through code-generation, and (v) dynamic extensibility that allows new models to be used without recompiling the TSMS. Further, we present a general modular distributed TSMS architecure and its implementation, ModelarDB, as a portable library, using Apache Spark for query processing and Apache Cassandra for storage. An experimental evaluation shows that, unlike current systems, ModelarDB hits a sweet spot and offers fast ingestion, good compression, and fast, scalable online aggregate query processing at the same time. This is achieved by dynamically adapting to data sets using multiple models. The system degrades gracefully as more outliers occur and the actual errors are much lower than the bounds.

##### GRETA: Graph-based Real-time Event Trend Aggregation

Authors: Olga Poppe (WPI), Chuan Lei (IBM Research - Almaden), Elke Rundensteiner (WPI), David Maier (Portland State University)

Streaming applications from algorithmic trading to traffic management deploy Kleene patterns to detect and aggregate arbitrarily-long event sequences, called event trends. State-of-the-art systems process such queries in two steps. Namely, they first construct all trends and then aggregate them. Due to the exponential costs of trend construction, this two-step approach suffers from both a long delays and high memory costs. To overcome these limitations, we propose the Graph-based Real-time Event Trend Aggregation (GRETA) approach that dynamically computes event trend aggregation without first constructing these trends. We define the GRETA graph to compactly encode all trends. Our GRETA runtime incrementally maintains the graph, while dynamically propagating aggregates along its edges. Based on the graph, the final aggregate is incrementally updated and instantaneously returned at the end of each query window. Our GRETA runtime represents a win-win solution, reducing both the time complexity from exponential to quadratic and the space complexity from exponential to linear in the number of events. Our experiments demonstrate that GRETA achieves up to four orders of magnitude speed-up and up to 50--fold memory reduction compared to the state-of-the-art two-step approaches.

Room : Segóvia III

Chair : Maria Luisa Damiani (University of Milano, Italy)

##### Trajectory Simplification: An Experimental Study and Quality Analysis

Authors: Dongxiang Zhang (UESTC), Mengting Ding (UESTC), Dingyu Yang (Shanghai Dian Ji University), Yi Liu (UESTC), Ju Fan (Renmin University of China), Heng Tao Shen (UESTC)

The ubiquitousness of GPS sensors in smart-phones, vehicles and wearable devices has enabled the collection of massive volumes of trajectory data from tracing moving objects. Consequently, an unprecedented scale of timestamped GPS data has been generated and posed an urgent demand for an effective storage mechanism for trajectory databases. The mainstream compression technique is called trajectory simplification, that finds a subsequence to approximate the original trajectory and attempts to minimize the information loss under a distance measure. Even though various simplification algorithms have been proposed in the past decades, there still lacks a thorough comparison to cover all the state-of-the-art algorithms and evaluate their quality using datasets in diversified motion patterns. Hence, it still remains a challenge for GPS data collectors to determine a proper algorithm in a concrete application. In addition, almost the entire line of previous methods uses error-based metrics to evaluate the compression quality, while ignoring their usability in supporting spatio-temporal queries on top of the reduced database. To bridge these gaps, we conduct so far the most comprehensive evaluation on trajectory simplification techniques. We compare the performance of 25 algorithms in total using five real datasets in different motion patterns. According to the experimental findings, we present useful guidance for the selection or development of effective trajectory simplification algorithms.

##### A Unified Approach to Route Planning for Shared Mobility

Authors: Yongxin Tong (Beihang University), Yuxiang Zeng (HKUST), Zimu Zhou (ETH Zurich), Lei Chen (HKUST), Jieping Ye (Didichuxing Inc.), Ke Xu (Beihang University)

There has been a dramatic growth of shared mobility applications, including ride-sharing, food delivery, crowdsourced parcel delivery, etc. Shared mobility refers to transportation services that are shared among users, where a central issue is route planning. Specifically, given a set of workers and requests, route planning is to plan for each worker a route, i.e. a sequence of locations to pick up and drop off passengers/parcels that arrive from time to time, with different optimization objectives. Previous studies lack practicability due to their conflicted objectives and inefficiency in inserting a new request into a route, a basic operation called insertion. In this paper, we present a unified formulation of the route planning problem called URPSM. It has a well-defined parameterized objective function which not only eliminates the contradicted objectives in existing studies but also benefits flexible multi-objective route planning in shared mobility. Then we prove the problem itself is NP-hard and there is no polynomial-time algorithm with constant competitive ratio for the URPSM problem and its variants. In response, we devise an effective and efficient solution to address the URPSM problem approximately. We first design a novel dynamic programming (DP) algorithm to accelerate the insertion operation from cubic or quadric time in previous work to only linear time. On basis of the DP algorithm, we propose a greedy based solution to the URPSM problem. Experimental results on real datasets show that our solution outperforms the state-of-the-arts by 1.2 to 12.8 times in effectiveness, and also runs 2.6 to 20.7 times faster.

##### Efficient Mining of Regional Movement Patterns in Semantic Trajectories

Authors: Dong-Wan Choi (Imperial College London), Jian Pei (Simon Fraser University, JD.com), Thomas Heinis (Imperial College)

Semantic trajectory pattern mining is becoming more and more important with the rapidly growing volumes of semantically rich trajectory data. Extracting sequential patterns in semantic trajectories plays a key role in understanding semantic behaviour of human movement, which can widely be used in many applications such as location-based advertising, road capacity optimisation, and urban planning. However, most of existing works on semantic trajectory pattern mining focus on the entire spatial area, leading to missing some locally significant patterns within a region. Based on this motivation, this paper studies a regional semantic trajectory pattern mining problem, aiming at identifying all the regional sequential patterns in semantic trajectories. Specifically, we propose a new density scheme to quantify the frequency of a particular pattern in space, and thereby formulate a new mining problem of finding all the regions in which such a pattern densely occurs. For the proposed problem, we develop an efficient mining algorithm, called RegMiner (Regional Semantic Trajectory Pattern Miner), which effectively reveals movement patterns that are locally frequent in such a region but not necessarily dominant in the entire space. Our empirical study using real trajectory data shows that RegMiner finds many interesting local patterns that are hard to find by a state-of-the-art global pattern mining scheme, and it also runs several orders of magnitude faster than the global pattern mining algorithm.

##### UlTraMan: A Unified Platform for Big Trajectory Data Management and Analytics

Authors: Xin Ding (Zhejiang University), Lu Chen (Aalborg University), Yunjun Gao (Zhejiang University), Christian Jensen (Aalborg University), Hujun Bao (Zhejiang University)

Massive trajectory data is being generated by GPS-equipped devices, such as cars and mobile phones, which is used increasingly in transportation, location-based services, and urban computing. As a result, a variety of methods have been proposed for trajectory data management and analytics. However, traditional systems and methods are usually designed for very specific data management or analytics needs, which forces users to stitch together heterogeneous systems to analyze trajectory data in an inefficient manner. Targeting the overall data pipeline of big trajectory data management and analytics, we present a unified platform, termed as UlTraMan. In order to achieve scalability, efficiency, persistence, and flexibility, (i) we extend Apache Spark with respect to both data storage and computing by seamlessly integrating a key-value store, and (ii) we enhance the MapReduce paradigm to allow flexible optimizations based on random data access. We study the resulting system's flexibility using case studies on data retrieval, aggregation analyses, and pattern mining. Extensive experiments on real and synthetic trajectory data are reported to offer insight into the scalability and performance of UlTraMan.

##### Order Dispatch in Price-aware Ridesharing

Authors: Libin Zheng (HKUST), Lei Chen (HKUST)

With the prevalence of car-hailing applications, ridesharing becomes more and more popular because of its great potential in monetary saving and environmental protection. Order dispatch is the key problem in ridesharing, which has a strong impact on riders' experience and platform's performance. Existing order dispatch research works fail to consider the price of the orders, which can be an important reference because it directly relates to the platform's profit. Our work takes the order price into concern, and formulates a constrained optimization problem, which takes platform's profit as the optimization objective and performs controls on riders' detour distance and waiting time. We prove the problem is NP-hard, thus, we propose approximation methods. We further develop a simulation framework based on real ridesharing order and vehicle data. We conduct experiments with this simulation framework to evaluate the effectiveness and efficiency of the proposed methods.

Room : Segóvia IV

Authors: Sylvie Cazalens (INSA-Lyon, France), Julien Leblay (AIST Tokyo, Japan), Philippe Lamarre (INSA-Lyon, France), Ioana Manolescu (INRIA, France) and Xavier Tannier (University of Sorbonne)

Room : El Pardo I

##### Concurrent Log-Structured Memory for Many-Core Key-Value Stores

Authors: Alexander Merritt (Georgia Institute of Technology), Ada Gavrilovska (Georgia Institute of Technology), Yuan Chen (Hewlett Packard Labs), Dejan Milojicic (Hewlett Packard Labs)

Key-value stores are an important tool in managing and accessing large in-memory data sets. As many applications benefit from having as much of their working state fit into main memory, an important design of the memory management of modern key-value stores is the use of log-structured approaches, enabling efficient use of the memory capacity, by compacting objects to avoid fragmented states. However, with the emergence of thousand-core and peta-byte memory platforms (DRAM or future storage-class memories) log-structured designs struggle to scale, preventing parallel applications from exploiting the full capabilities of the hardware: careful coordination is required for background activities (compacting and organizing memory) to remain asynchronous with respect to the use of the interface, and for insertion operations to avoid contending for centralized resources such as the log head and memory pools. In this work, we present the design of a log-structured key-value store called Nibble that incorporates a multi-head log for supporting concurrent writes, a novel distributed epoch mechanism for scalable memory reclamation, and an optimistic concurrency index. We implement Nibble in the Rust language in ca. 4000 lines of code, and evaluate it across a variety of data-serving workloads on a 240-core cache-coherent server. Our measurements show Nibble scales linearly in uniform YCSB workloads, matching competitive non-log-structured key-value stores for write-dominated traces at 50 million operations per second on 1 TiB-sized working sets. Our memory analysis shows Nibble is efficient, requiring less than 10% additional capacity, whereas memory use by non-log-structured key-value store designs may be as high as 2x.

##### Accordion: Better Memory Organization for LSM Key-Value Stores

Authors: Edward Bortnikov (Yahoo Research), Anastasia Braginsky (Yahoo Research), Eshcar Hillel (Yahoo Research), Idit Keidar (Technion IIT and Yahoo Research), Gali Sheffi (Yahoo Research)

Log-structured merge (LSM) stores are the technology of choice for building scalable write-intensive key-value storage systems. An LSM store replaces random I/O with sequential I/O by accumulating large batches of writes in a memory store prior to flushing them to disk storage; the latter is continuously re-organized in the background through a compaction process for efficiency of reads. Frequent compactions are a major pain point because they slow down data store operations, and increase disk wear. Another performance bottleneck is the fragmented memory layout of the dynamic memory store. In this paper we show that these pain points may be mitigated via better organization of the memory store. We present Accordion â€“ an algorithm that addresses these problems by re-applying the LSM design principles to memory management. Accordion became the default deployment option in the production code of Apache HBase with double-digit performance gains versus the baseline HBase implementation.

##### SlimDB: A Space-Efficient Key-Value Storage Engine For Semi-Sorted Data

Authors: Kai Ren (Carnegie Mellon University), Qing Zheng (Carnegie Mellon University), Joy Arulraj (Carnegie Mellon University), Garth Gibson (Carnegie Mellon University)

Modern key-value stores often use write-optimized indexes and compact in-memory indexes to speed up read and write performance. One popular write-optimized index is the Logstructured merge-tree (LSM-tree) which provides indexed access to write-intensive data. It has been increasingly used as a storage backbone for many services, including file system metadata management, graph processing engines, and machine learning feature storage engines. Existing LSMtree implementations often exhibit high write amplifications caused by compaction, and lack optimizations to maximize read performance on solid-state disks. The goal of this paper is to explore techniques that leverage common workload characteristics shared by many systems using key-value stores to reduce the read/write amplification overhead typically associated with general-purpose LSM-tree implementations. Our experiments show that by applying these design techniques, our new implementation of a key-value store, SlimDB, can be two to three times faster, use less memory to cache metadata indices, and show lower tail latency in read operations compared to popular LSM-tree implementations such as LevelDB and RocksDB.

##### Sundial: Harmonizing Concurrency Control and Caching in a Distributed OLTP Database Management System

Authors: Xiangyao Yu (MIT), Yu Xia (MIT), Andrew Pavlo (Carnegie Mellon University), Daniel Sanchez (MIT), Larry Rudolph (MIT), Srinivas Devadas (MIT)

Distributed transactions suffer from poor performance due to two major limiting factors. First, distributed transactions suffer from high latency because each of their accesses to remote data incurs a long network delay. Second, this high latency increases the likelihood of contention among distributed transactions, leading to high abort rates and low performance. We present Sundial, an in-memory distributed optimistic concurrency control protocol that addresses these two limitations. First, to reduce the transaction abort rate, Sundial dynamically determines the logical order among transactions at runtime, based on their data access patterns. Sundial achieves this by applying logical leases to each data element, which allows the database to dynamically calculate a transaction’s logical commit timestamp. Second, to reduce the overhead of remote data accesses, Sundial allows the database to cache remote data in a server’s local main memory and maintains cache coherence. With logical leases, Sundial integrates concurrency control and cache coherence into a simple unified protocol. We evaluate Sundial against state-of-the-art distributed concurrency control protocols. Sundial outperforms the next-best protocol by up to 57% under high contention. Sundial’s caching scheme improves performance by up to 4.6× in workloads with high access skew.

##### ReCache: Reactive Caching for Fast Analytics over Heterogeneous Data

Authors: Tahir Azim (EPFL), Manos Karpathiotakis (EPFL), Anastasia Ailamaki (EPFL)

As data continues to be generated at exponentially growing rates in heterogeneous formats, fast analytics to extract meaningful information is becoming increasingly important. Systems widely use in-memory caching as one of their primary techniques to speed up data analytics. However, caches in data analytics systems cannot rely on simple caching policies and a fixed data layout to achieve good performance. Different datasets and workloads require different layouts and policies to achieve optimal performance. This paper presents ReCache, a cache-based performance accelerator that is reactive to the cost and heterogeneity of diverse raw data formats. Using timing measurements of caching operations and selection operators in a query plan, ReCache accounts for the widely varying costs of reading, parsing, and caching data in nested and tabular formats. Combining these measurements with information about frequently accessed data fields in the workload, ReCache automatically decides whether a nested or relational column-oriented layout would lead to better query performance. Furthermore, ReCache keeps track of commonly utilized operators to make informed cache admission and eviction decisions. Experiments on synthetic and real-world datasets show that our caching techniques decrease caching overhead for individual queries by an average of 59%. Furthermore, over the entire workload, ReCache reduces execution time by 19-75% compared to existing techniques.

Room : Oriente

Room : Segóvia I, Segóvia II, Segóvia III, Segóvia IV

Chair : Sihem Amer-Yahia (CNRS Univ. Grenoble Alpes)

Title: Antisocial Behavior on the Web

Author: Jure Leskovec (Stanford University)

User contributions in the form of posts, comments, and votes are essential to the success of online communities. However, allowing user participation also invites undesirable and harmful behavior. In the talk I will discuss antisocial behavior in online discussion communities by analyzing users who were banned from these communities. We will find that such users tend to concentrate their efforts in a small number of threads, are more likely to post irrelevantly, and are more successful at garnering responses from other users. Studying the evolution of these users from the moment they join a community up to when they get banned, we find that not only do they write worse than other users over time, but they also become increasingly less tolerated by the community. Our analysis also reveals distinct groups of users with different levels of antisocial behavior that can change over time. We will use these insights to identify antisocial users early on, a task of high practical importance to community maintainers.

Title: Structured Data for Building Online Trust

In 22 out of 28 countries surveyed by Edelman in 2018, there are more people who distrust the media than those who trust it. In the US, according to Gallup, trust in the media reached an all time low in 2016. Those figures matter because a free and trustworthy news ecosystem is the foundation of democratic society. Technology played a role in the decline of journalism and technology can play a role in rebuilding it. In this talk, I discuss various aspects for (re)building online trust and how data researchers in our community can make a difference. I conclude this talk by reissuing the Call to Arms originally made to the database community more than 8 years ago (Cohen et al, CIDR 2011): it's time for us to tackle some of the most important societal challenges in journalism and online information ecosystem.

Room : Segóvia I

Chair : Carmem Hara (Universidade Federal do Parana)

##### Discovery of Genuine Functional Dependencies from Relational Data with Missing Values

Authors: Laure Berti-Equille (Aix-Marseille University), Hazar Harmouch (Hasso Plattner Institute), Felix Naumann (Hasso Plattner Institute), Noel Novelli (Aix-Marseille University), Saravanan Thirumuruganathan (QCRI)

Functional dependencies (FDs) play an important role in maintaining data quality. They can be used to enforce data consistency and to guide repairs over a database. In this work, we investigate the problem of missing values and its impact on FD discovery. When using existing FD discovery algorithms, some genuine FDs could not be detected precisely due to missing values or some non-genuine FDs can be discovered even though they are caused by missing values with a certain NULL semantics. We define a notion of genuineness and propose algorithms to compute the genuineness score of a discovered FD. This can be used to identify the genuine FDs among the set of all valid dependencies that hold on the data. We evaluate the quality of our method over various real-world and semi-synthetic datasets with extensive experiments. The results show that our method performs well for relatively large FD sets and is able to accurately capture genuine FDs.

##### Explaining Repaired Data with CFDs

Authors: Joeri Rammelaere (University of Antwerp), Floris Geerts (University of Antwerp)

Many popular data cleaning approaches are rule-based: Constraints are formulated in a logical framework, and data is considered dirty if constraints are violated. These constraints are often discovered from data, but to ascertain their validity, user verification is necessary. Since the full set of discovered constraints is typically too large for manual inspection, recent research integrates user feedback into the discovery process. We propose a different approach that employs user interaction only at the start of the algorithm: a user manually cleans a small set of dirty tuples, and we infer the constraint underlying those repairs, called an explanation. We make use of conditional functional dependencies (CFDs) as the constraint formalism. We introduce XPlode, an on-demand algorithm which discovers the best explanation for a given repair. Guided by this explanation, data can then be cleaned using state-of-the-art CFD-based cleaning algorithms. Experiments on synthetic and real-world datasets show that the best explanation can typically be inferred using a limited number of modifications. Moreover, XPlode is substantially faster than discovering all CFDs that hold on a dataset, and is robust to noise in the modifications.

##### Efficient Discovery of Approximate Dependencies

Authors: Sebastian Kruse (Hasso-Plattner-Institut), Felix Naumann (Hasso Plattner Institute)

Functional dependencies (FDs) and unique column combinations (UCCs) form a valuable ingredient for many data management tasks, such as data cleaning, schema recovery, and query optimization. Because these dependencies are unknown in most scenarios, their automatic discovery has been well researched. However, existing methods mostly discover only exact dependencies, i.e., those without violations. Real-world dependencies, in contrast, are frequently approximate due to data exceptions, ambiguities, or data errors. This relaxation to approximate dependencies renders their discovery an even harder task than the already challenging exact dependency discovery. To this end, we propose the novel and highly efficient algorithm Pyro to discover both approximate FDs and approximate UCCs. Pyro combines a separate-and-conquer search strategy with sampling-based guidance that quickly detects dependency candidates and verifies them. In our broad experimental evaluation, Pyro outperforms existing discovery algorithms by a factor of up to 33, scales to larger datasets, and at the same time requires the least main memory.

##### Axiomatic Foundations and Algorithms for Deciding Semantic Equivalences of SQL Queries

Authors: Shumo Chu (University of Washington), Brendan Murphy (University of Washington), Jared Roesch (University of Washington), Alvin Cheung (University of Washington), Dan Suciu (University of Washington)

Deciding the equivalence of SQL queries is a fundamental problem in data management. As prior work has mainly focused on studying the theoretical limitations of the problem, very few implementations for checking such equivalences exist. In this paper, we present a new formalism and implementation for reasoning about the equivalences of SQL queries. Our formalism, U-semiring, extends SQL's semiring semantics with unbounded summation and duplicate elimination. U-semiring is defined using only very few axioms and can thus be easily implemented using proof assistants such as Coq for automated query reasoning. Yet, they are sufficient enough to enable us reason about sophisticated SQL queries that are evaluated over bags and sets, along with various integrity constraints. To evaluate the effectiveness of U-semiring, we have used it to formally verify 68 equivalent queries and rewrite rules from both classical data management research papers and real-world SQL engines, where many of them have never been proven correct before.

##### A Formal Semantics of SQL Queries, Its Validation, and Applications

Authors: Paolo Guagliardo (University of Edinburgh), Leonid Libkin (University of Edinburgh)

While formal semantics of theoretical languages underlying SQL have been provided in the past, they all made simplifying assumptions ranging from changes in the syntax to omitting bag semantics and nulls. This situation is reminiscent of what happens in the field of programming languages, where semantics of formal calculi underlying the main features of languages are abundant, but formal semantics of real languages that people use are few and far between. We consider the basic class of SQL queries – essentially SELECT-FROM-WHERE queries with subqueries, set/bag operations, and nulls – and define a formal semantics for it, without any departures from the real language. This fragment already requires decisions related to the data model and handling variable names that are normally disregarded by simplified semantics. To justify our choice of the semantics, we validate it experimentally on a large number of randomly generated queries and databases. We give two applications of the semantics. One is the first formal proof of the equivalence of basic SQL and relational algebra that extends to bag semantics and nulls. The other application looks at the three-valued logic employed by SQL, which is universally assumed to be necessary to handle nulls. We prove however that this is not so, as three-valued logic does not add expressive power: every SQL query in our fragment can be evaluated under the usual two-valued Boolean semantics of conditions.

Room : Segóvia II

Chair : Alan Fekete (University of Sydney)

##### SQL Statement Logging for Making SQLite Truly Lite

Authors: Jong-Hyeok Park (Sungkyunkwan University), Gihwan Oh (Sungkyunkwan University), Sang Won Lee (Sungkyunkwan University)

The lightweight codebase of SQLite was helpful in making it become the de-facto standard database in most mobile devices, but, at the same time, forced it to take lesscomplicated transactional schemes, such as physical page logging, journaling, and force commit, which in turn cause excessive write amplification. Thus, the write IO cost in SQLite is not lightweight at all. In this paper, to make SQLite truly lite in terms of IO efficiency for the transactional support, we propose SQLite/SSL, a per-transaction SQL statement logging scheme: when a transaction commits, SQLite/SSL ensures its durability by storing only SQL statements of small size, thus writing less and performing faster at no compromise of transactional solidity. Our main contribution is to show that, based on the observation that mobile transactions tend to be short and exhibit strong update locality, logical logging can, though long discarded, become an elegant and perfect fit for SQLite-based mobile applications. Further, we leverage the WAL journal mode in vanilla SQLite as a transaction-consistent checkpoint mechanism which is indispensable in any logical logging scheme. In addition, we show for the first time that byte-addressable NVM (non-volatile memory) in host-side can realize the full potential of logical logging because it allows to store fine-grained logs quickly. We have prototyped SQLite/SSL by augmenting vanilla SQLite with a transaction-consistent checkpoint mechanism and a redo-only recovery logic, and have evaluated its performance using a set of synthetic and real workloads. When a real NVM board is used as its log device, SQLite/SSL can outperform vanilla SQLite’s WAL mode by up to 300x and also outperform the state-of-the-arts SQLite/PPL scheme by several folds in terms of IO time.

##### Contention-Aware Lock Scheduling for Transactional Databases

Authors: Boyu Tian (University of Michigan), Barzan Mozafari (University of Michigan), Grant Schoenebeck (University of Michigan)

Lock managers are among the most well-studied components in concurrency control and transactional systems. However, one question seems to have been generally overlooked: "when there are multiple lock requests on the same object, which one(s) should be granted first?" Nearly all existing systems rely on a FIFO (first in, first out) strategy to decide which transaction(s) to grant the lock to. However, in this paper, we show that the choice of lock scheduling has significant ramifications on the overall performance of a transactional system. Despite the large body of research on job scheduling outside the database context, lock scheduling presents subtle but challenging requirements that render existing results on scheduling inapt for a transactional database. By carefully studying this problem, we present the concept of contention-aware scheduling, show the hardness of the problem, and propose novel lock scheduling algorithms (LDSF and bLDSF), which guarantee a constant factor approximation of the best scheduling. We conduct extensive experiments using a popular database on both TPC-C and a microbenchmark. Compared to FIFO---the default scheduler in most database systems---our bLDSF algorithm yields up to 300x speedup in overall transaction latency. On the other hand, our LDSF algorithm, which is simpler and achieves comparable performance to \ldsfb, has already been adopted by open-source community, and chosen as default scheduling strategy in MySQL 8.0.3+.

##### OLTPShare: The Case for Sharing in OLTP Workloads

Authors: Robin Rehrmann (SAP SE), Carsten Binnig (TU Darmstadt), Alexander Boehm (SAP SE), Kihong Kim (SAP Labs Korea), Wolfgang Lehner (TU Dresden), Amr Rizk (TU Darmstadt)

In the past, resource sharing has been extensively studied for OLAP workloads. Naturally, the question arises, why studies mainly focus on OLAP and not on OLTP workloads? At First sight, OLTP queries, due to their short runtime, may not have enough potential for the additional overhead. In addition, OLTP workloads do not only execute read operations but also updates. In this paper, we address query sharing for OLTP workloads. We first analyze the sharing potential in real-world OLTP workloads. Based on those findings, we then present an execution strategy, called OLTPShare that implements a novel batching scheme for OLTP workloads. We analyze the sharing benefits by integrating OLTPShare into a prototype version of the commercial database system SAP HANA. Our results show for different OLTP workloads that OLTPShare enables SAP HANA to provide a significant throughput increase in high-load scenarios compared to the conventional execution strategy without sharing.

##### Causal Consistency and Latency Optimality: Friend or Foe?

Authors: Diego Didona (EPFL), Rachid Guerraoui (EPFL), Jingjing Wang (EPFL), Willy Zwaenepoel (EPFL)

Causal consistency is an attractive consistency model for replicated data stores. It is provably the strongest model that tolerates partitions, it avoids the long latencies associated with strong consistency, and, especially when using read-only transactions, it prevents many of the anomalies of weaker consistency models. Recent work has shown that causal consistency allows "latency-optimal" read-only transactions, that are nonblocking, single-version and single-round in terms of communication. On the surface, this latency optimality is very appealing, as the vast majority of applications are assumed to have read-dominated workloads. In this paper, we show that such "latency-optimal" read-only transactions induce an extra overhead on writes that is so high that it actually jeopardizes performance even in read-dominated workloads. We show this result from a practical and a theoretical angle. First, we present a protocol that implements "almost latency-optimal" ROTs but does not impose on the writes any of the overhead of latency-optimal protocols. In this protocol, ROTs are nonblocking, one version and can be configured to use either two or one a half rounds of client-server communication. We experimentally show that this protocol not only provides better throughput, as expected, but that, surprisingly, for all but the lowest loads and most read-heavy workloads, it also provides better latencies. Then, we prove that the extra overhead imposed on writes by latency-optimal read-only transactions is inherent, i.e., it is not an artifact of the design we consider, and cannot be avoided by any implementation of latency-optimal read-only transactions. We show in particular that this overhead grows linearly with the number of clients.

##### Exploiting Coroutines to Attack the "Killer Nanoseconds"

Authors: Christopher Jonathan (Hasso Plattner Institute), Umar Farooq Minhas (Microsoft Research), James Hunter (Microsoft Research), Justin Levandoski (Microsoft Research), Gor Nishanov (Microsoft)

Database systems use many pointer-based data structures, including hash tables and B+-trees, which require extensive pointer chasing. Each pointer dereference, e.g., during a hash probe or a B+-tree traversal, can result in a CPU cache miss, stalling the CPU. Recent work has shown that CPU stalls due to main memory accesses are a significant source of overhead, even for cache conscious data structures, and has proposed techniques to reduce this overhead, by hiding memory-stall latency. In this work, we compare and contrast the state-of-the-art approaches to reduce CPU stalls due to cache misses for pointer-intensive data structures. We present an in-depth experimental evaluation and a detailed analysis using four popular data structures: hash table, binary search, MassTree, and Bw-tree. Our focus is on understanding the practicality of using coroutines to improve throughput of such data structures. The implementation, experiments, and analysis presented in this paper promote a deeper understanding of how to exploit coroutines-based approaches to build highly efficient systems.

Room : Segóvia III

Chair : Mario Nascimento (University of Alberta)

##### Indexed Fast Network Proximity Querying

Authors: Mustafa Coskun (Case Western Reserve University), Ananth Grama (Purdue University), Mehmet Koyuturk (Case Western Reserve University)

Node proximity queries are among the most common operations on network databases. A common measure of node proximity is random walk based proximity, which has been shown to be less susceptible to noise and missing data. Realtime processing of random-walk based proximity queries poses significant computational challenges for larger graphs with over billions of nodes and edges, since it involves solution of large linear systems of equations. Due to the importance of this operation, significant effort has been devoted to developing efficient methods for random-walk based node proximity computations. These methods either aim to speed up iterative computations by exploiting numerical properties of random walks, or rely on computation and storage of matrix inverses to avoid computation during query processing. Although both approaches have been well studied, the speedup achieved by iterative approaches does not translate to real-time query processing, and the storage requirements of inversion-based approaches prohibit their use on very large graph databases. We present a novel approach to significantly reducing the computational cost of random walk based node proximity queries with scalable indexing. Our approach combines domain graph-partitioning based indexing with fast iterative computations during query processing using Chebyshev polynomials over the complex elliptic plane. This approach combines the query processing benefits of inversion techniques with the memory and storage benefits of iterative approache. Using real-world networks with billions of nodes and edges, and top-k proximity queries as the benchmark problem, we show that our algorithm, I-Chopper, significantly outperforms existing methods. Specifically, it drastically reduces convergence time of the iterative procedure, while also reducing storage requirements for indexing.

##### Parallel Personalized Pagerank on Dynamic Graphs

Authors: Wentian Guo (NUS), Yuchen Li (Singapore Management University), Mo Sha (NUS), Kian-Lee Tan (NUS)

Personalized PageRank (PPR) is a well-known proximity measure in graphs. To meet the need for dynamic PPR maintenance, recent works have proposed a local update scheme to support incremental computation. Nevertheless, sequential execution of the scheme is still too slow for high-speed stream processing. Therefore, we are motivated to design a parallel approach for dynamic PPR computation. First, as updates always come in batches, we devise a batch processing method to reduce synchronization cost among every single update and enable more parallelism for iterative parallel execution. Our theoretical analysis shows that the parallel approach has the same asymptotic complexity as the sequential approach. Second, we devise novel optimization techniques to effectively reduce runtime overheads for parallel processes. Experimental evaluation shows that our parallel algorithm can achieve orders of magnitude speedups on GPUs and multi-core CPUs compared with the state-of-the-art sequential algorithm.

##### Efficient Algorithms for Adaptive Influence Maximization

Authors: Kai Han (University of Science and Technology of China), Keke Huang (Nanyang Technological Univeristy), Xiaokui Xiao (NUS), Jing Tang (NUS), Aixin Sun (Nanyang Technological University), Xueyan Tang (Nanyang Technological University)

Given a social network G, the influence maximization (IM) problem seeks a set S of k seed nodes in G to maximize the expected number of nodes activated via an influence cascade starting from S. Although a lot of algorithms have been proposed for IM, most of them only work under the non-adaptive setting, i.e., when all k seed nodes are selected before we observe how they influence other users. In this paper, we study the adaptive IM problem, where we select the k seed nodes in batches of equal size b, such that the choice of the i-th batch can be made after the influence results of the first i − 1 batches are observed. We propose the first practical algorithms for adaptive IM with an approximation guarantee of 1 − exp(ξ − 1) for b = 1 and 1 − exp(ξ − 1 + 1/e) for b > 1, where ξ is any number in (0, 1). Our approach is based on a novel AdaptGreedy framework instantiated by non-adaptive IM algorithms, and its performance can be substantially improved if the non-adaptive IM algorithm has a small expected approximation error. However, no current non-adaptive IM algorithms provide such a desired property. Therefore, we further propose a non-adaptive IM algorithm called EPIC, which not only has the same worst-case performance bounds with that of the state-of-the-art non-adaptive IM algorithms, but also has a reduced expected approximation error. We also provide a theoretical analysis to quantify the performance gain brought by instantiating AdaptGreedy using EPIC, compared with a naive approach using the existing IM algorithms. Finally, we use real social networks to evaluate the performance of our approach through extensive experiments, and the experimental experiments strongly corroborate the superiorities of our approach.

##### Noticeable Network Delay Minimization via Node Upgrades

Authors: Sourav Medya (UCSB), Jithin Vachery (Indian Institute of Technology, Madras), Sayan Ranu (Indian Institute of Technology, Delhi), Ambuj Singh (UCSB)

In several domains, the flow of data is governed by an underlying network. Reduction of delays in end-to-end data flow is an important network optimization task. Reduced delays enable shorter travel times for vehicles in road networks, faster information flow in social networks, and increased rate of packets in communication networks. While techniques for network delay minimization have been proposed, they fail to provide any noticeable reduction in individual data flows. Furthermore, they treat all nodes as equally important, which is often not the case in real-world networks. In this paper, we incorporate these practical aspects and propose a network design problem where the goal is to perform k network upgrades such that it maximizes the number of flows in the network with a noticeable reduction in delay. We show that the problem is NP-hard, APX-hard, and non-submodular. We overcome these computational challenges by designing an importance sampling based algorithm with provable quality guarantees. Through extensive experiments on real and synthetic data sets, we establish that importance sampling imparts up to 1000 times speed-up over the greedy approach, and provides up to 70 times the improvement achieved by the state-of-the-art technique.

##### ProbeSim: Scalable Single-Source and Top-k SimRank Computations on Dynamic Graphs

Authors: Yu Liu (Renmin University of China), Bolong Zheng (Sun Yat-sen University), Xiaodong He (Renmin University of China), Zhewei Wei (Renmin University of China), Xiaokui Xiao (Nanyang Technological University), Kai Zheng (UESTC), Jiaheng Lu (University of Helsinki)

Single-source and top-k SimRank queries are two important types of similarity search in graphs with numerous applications in web mining, social network analysis, spam detection, etc. A plethora of techniques have been proposed for these two types of queries, but very few can efficiently support similarity search over large dynamic graphs, due to either significant preprocessing time or large space overheads. This paper presents ProbeSim, an index-free algorithm for single-source and top-k SimRank queries that provides a nontrivial theoretical guarantee in the absolute error of query results. ProbeSim estimates SimRank similarities without precomputing any indexing structures, and thus can naturally support real-time SimRank queries on dynamic graphs. Besides the theoretical guarantee, ProbeSim also offers satisfying practical efficiency and effectiveness due to non-trivial optimizations. We conduct extensive experiments on a number of benchmark datasets, which demonstrate that our solutions outperform the existing methods in terms of efficiency and effectiveness. Notably, our experiments include the first empirical study that evaluates the effectiveness of SimRank algorithms on graphs with billion edges, using the idea of pooling.

Room : Segóvia IV

Authors: Alin Deutsch and Yannis Papakonstantinou (UC San Diego, USA)

Room : El Pardo I

Chair : Luna Dong (Amazon)

##### Transform-Data-by-Example (TDE): Extensible Data Transformation using Functions

Authors: Yeye He (Microsoft Research), Xu Chu (Georgia Institute of Technology), Kris Ganjam (Microsoft Research), Yudian Zheng (Twitter), Vivek Narasayya (Microsoft Research), Surajit Chaudhuri (Microsoft Research)

Today, business analysts and data scientists increasingly need to clean, standardize and transform diverse data sets, such as name, address, date time, and phone number, before they can perform analysis. This process of data transformation is an important part of data preparation, and is known to be difficult and time-consuming for end-users. Traditionally, developers have dealt with these longstanding transformation problems using custom code libraries. They have built vast varieties of custom logic for name parsing and address standardization, etc., and shared their source code in places like GitHub. Data transformation would be a lot easier for end-users if they can discover and reuse such existing transformation logic. We developed Transform-Data-by-Example (TDE), which works like a search engine for data transformations. TDE "indexes" vast varieties of transformation logic in source code, DLLs, web services and mapping tables, so that users only need to provide a few input/output examples to demonstrate a desired transformation, and TDE can interactively find relevant functions to synthesize new programs consistent with all examples. Using an index of 50K functions crawled from GitHub and Stackoverflow, TDE can already handle many common transformations not currently supported by existing systems. On a benchmark with over 200 transformation tasks, TDE generates correct transformations for 72% tasks, which is considerably better than other systems evaluated. A beta version of TDE for Microsoft Excel is available via Office store. Part of the TDE technology also ships in Microsoft Power BI.

##### ICARUS: Minimizing Human Effort in Iterative Data Completion

Authors: Protiva Rahman (The Ohio State University), Courtney Hebert (The Ohio State University), Arnab Nandi (The Ohio State University)

An important step in data preparation involves dealing with incomplete datasets. In some cases, the missing values are unreported because they are characteristics of the domain and are known by practitioners. Due to this nature of the missing values, imputation and inference methods do not work and input from domain experts is required. A common method for experts to fill missing values is through rules. However, for large datasets with thousands of missing data points, it is laborious and time consuming for a user to make sense of the data and formulate effective completion rules. Thus, users need to be shown subsets of the data that will have the most impact in completing missing fields. Further, these subsets should provide the user with enough information to make an update. Choosing subsets that maximize the probability of filling in missing data from a large dataset is computationally expensive. To address these challenges, we present Icarus, which uses a heuristic algorithm to show the user small subsets of the database in the form of a matrix. This allows the user to iteratively fill in data by applying suggested rules based on their direct edits to the matrix. The suggested rules amplify the users' input to multiple missing fields by using the database schema to infer hierarchies. Simulations show Icarus has an average improvement of 50% across three datasets over the baseline system. Further, in-person user studies demonstrate that naive users can fill in 68% of missing data within an hour, while manual rule specification spans weeks.

##### Automated Migration of Hierarchical Data to Relational Tables using Programming-by-Example

Authors: Navid Yaghmazadeh (UT Austin), Xinyu Wang (UT Austin), Isil Dillig (UT Austin)

While many applications export data in hierarchical formats like XML and JSON, it is often necessary to convert such hierarchical documents to a relational representation. This paper presents a novel programming-by-example approach, and its implementation in a tool called Mitra, for automatically migrating tree-structured documents to relational tables. Given a set of small input-output examples that illustrates the required transformation, our method automatically synthesizes a program that automates the desired task. We have evaluated the proposed technique using two sets of experiments. In the first experiment, we used Mitra to automate 98 data transformation tasks collected from StackOverflow. Our method can generate the desired program for 94% of these benchmarks with an average synthesis time of 3.8 seconds. In the second experiment, we used Mitra to generate programs which are used to convert four real-world XML and JSON datasets to full-fledged relational databases. Our evaluation shows that Mitra can automate the desired transformation for all datasets.

##### Synthesizing Entity Matching Rules by Examples

Authors: Rohit Singh (MIT), Venkata Vamsikrishna Meduri (Arizona State University), Ahmed Elmagarmid (QCRI), Samuel Madden (MIT), Paolo Papotti (Arizona State University), Jorge-Arnulfo Quiane-Ruiz (QCRI), Armando Solar-Lezama (MIT), Nan Tang (QCRI)

Entity matching (EM) is a critical part of data integration. We study how to synthesize entity matching rules from positive-negative matching examples. The core of our solution is program synthesis, a powerful tool to automatically generate rules (or programs) that satisfy a given high-level specification, via a predefined grammar. This grammar describes a General Boolean Formula (GBF) that can include arbitrary attribute matching predicates combined by conjunctions, disjunctions and negations, and is expressive enough to model EM problems, from capturing arbitrary attribute combinations to handling missing attribute values. The rules in the form of GBF are more concise than traditional EM rules represented in Disjunctive Normal Form (DNF). Consequently, they are more interpretable than decision trees and other machine learning algorithms that output deep trees with many branches. We present a new synthesis algorithm that, given only positive-negative examples as input, synthesizes EM rules that are effective over the entire dataset. Extensive experiments show that we outperform other interpretable rules (e.g. decision trees with low depth) in effectiveness, and are comparable with non-interpretable tools (e.g. decision trees with high depth, gradient-boosting trees, random forests and SVM).

##### Approximate String Joins with Abbreviations

Authors: Wenbo Tao (MIT), Dong Deng (MIT), Michael Stonebraker (MIT)

String joins have wide applications in data integration and cleaning. The inconsistency of data caused by data errors, term variations and missing values has led to the need for approximate string joins (ASJ). In this paper, we study ASJ with abbreviations, which are a frequent type of term variation. Although prior works have studied ASJ given a user-inputted dictionary of synonym rules, they have three common limitations. First, they suffer from low precision in the presence of abbreviations having multiple full forms. Second, their join algorithms are not scalable because their time complexity is exponential in the number of applicable synonym rules in a string. Third, the dictionary may not exist since abbreviations are highly domain-dependent. We propose an end-to-end workflow to address these limitations. There are three main components in the workflow: (1) a new similarity measure taking abbreviations into account that can handle abbreviations having multiple full forms, (2) an efficient join algorithm following the filter-verification framework and (3) an automatic approach to learn a dictionary of abbreviation rules from input strings. We evaluate our workflow on four real-world datasets and show that our workflow outputs accurate join results, scales well as input size grows and greatly outperforms state-of-the-art approaches in both accuracy and efficiency.

Room : Oriente

#### VLDB Early Career Research Contribution Award Talk

Room : Segóvia I, Segóvia II, Segóvia III, Segóvia IV

Chair : Sihem Amer-Yahia (CNRS Univ. Grenoble Alpes), Jian Pei (Simon Fraser University, JD.com)

Room : Segóvia I

Chair : Reynold Cheng (The University of Hong Kong)

##### Subgraph Matching: on Compression and Computation

Authors: Miao Qiao (Massey University), Hao Zhang (Chinese University of Hong Kong), Hong Cheng (Chinese University of Hong Kong)

Subgraph matching finds a set I of all occurrences of a pattern graph in a target graph. It has a wide range of applications while suffers an expensive computation. This efficiency issue has been studied extensively. All existing approaches, however, turn a blind eye to the output crisis, that is, when the system has to materialize I as a preprocessing/intermediate/final result or an index, the cost of the export of I dominates the overall cost, which could be prohibitive even for a small pattern graph. This paper studies subgraph matching via two problems. 1) Is there an ideal compression of I? 2) Will the compression of I reversely boost the computation of I? For the problem 1), we propose a technique called VCBC to compress I to code(I) which serves effectively the same as I. For problem 2), we propose a subgraph matching computation framework CBF which computes code(I) instead of I to bring down the output cost. CBF further reduces the overall cost by reducing the intermediate results. Extensive experiments show that the compression ratio of VCBC can be up to 105 which also significantly lowers the output cost of CBF. Extensive experiments show the superior performance of CBF over existing approaches.

##### Distributed Evaluation of Subgraph Queries Using Worst-case Optimal and Low-Memory Dataflows

Authors: Khaled Ammar (University of Waterloo), Frank McSherry (ETHZ), Semih Salihoglu (University of Waterloo), Manas Joglekar (Google)

We study the problem of finding and monitoring fixed-size subgraphs in a continually changing large-scale graph. We present the first approach that (i) performs worst-case optimal computation and communication, (ii) maintains a total memory footprint linear in the number of input edges, and (iii) scales down per-worker computation, communication, and memory requirements linearly as the number of workers increases, even on adversarially skewed inputs. Our approach is based on worst-case optimal join algorithms, recast as a data-parallel dataflow computation. We describe the general algorithm and modifications that make it robust to skewed data, prove theoretical bounds on its resource requirements in the {\em massively parallel computing} model, and implement and evaluate it on graphs containing as many as 64 billion edges. The underlying algorithm and ideas generalize from subgraph monitoring to the more general problem of computing and maintaining relational equi-joins over dynamic relations.

##### An Optimal and Progressive Approach to Online Search of Top-K Influential Communities

Authors: Fei Bi (University of New South Wales), Lijun Chang (The University of Sydney), Xuemin Lin (University of New South Wales), Wenjie Zhang (University of New South Wales)

Community search over large graphs is a fundamental problem in graph analysis. Recent studies propose to compute top-k influential communities, where each reported community not only is a cohesive subgraph but also has a high influence value. The existing approaches to the problem of top-k influential community search can be categorized as index-based algorithms and online search algorithms without indexes. The index-based algorithms, although being very efficient in conducting community searches, need to pre-compute a special-purpose index and only work for one built-in vertex weight vector. In this paper, we investigate online search approaches and propose an instance-optimal algorithm LocalSearch whose time complexity is linearly proportional to the size of the smallest subgraph that a correct algorithm needs to access without indexes. In addition, we also propose techniques to make LocalSearch progressively compute and report the communities in decreasing influence value order such that k does not need to be specified. Moreover, we extend our framework to the general case of top-k influential community search regarding other cohesiveness measures. Extensive empirical studies on real graphs demonstrate that our algorithms outperform the existing online search algorithms by several orders of magnitude.

##### Maximum Co-located Community Search in Large Scale Social Networks

Authors: Lu Chen (Swinburne University of Technology), Chengfei Liu (Swinburne University of Technology), Rui Zhou (Swinburne University of Technology), Jianxin Li (The University of Western Australia), Xiaochun Yang (Northeastern University), Bin Wang (Northeastern University)

The problem of k-truss search has been well defined and investigated to find the highly correlated user groups in social networks. But there is no previous study to consider the constraint of users' spatial information in k-truss search, denoted as co-located community search in this paper. The co-located community can serve many real applications. To search the maximum co-located communities efficiently, we first develop an efficient exact algorithm with several pruning techniques. After that, we further develop an approximation algorithm with adjustable accuracy guarantees and explore more effective pruning rules, which can reduce the computational cost significantly. To accelerate the real-time efficiency, we also devise a novel quadtree based index to support the efficient retrieval of users in a region and optimise the search regions with regards to the given query region. Finally, we verify the performance of our proposed algorithms and index using five real datasets.

##### Effective and Efficient Dynamic Graph Coloring

Authors: Long Yuan (UNSW), Lu Qin (UTS), Xuemin Lin (University of New South Wales), Lijun Chang (The University of Sydney), Wenjie Zhang (University of New South Wales)

Graph coloring is a fundamental graph problem that is widely applied in a variety of applications. The aim of graph coloring is to minimize the number of colors used to color the vertices in a graph such that no two incident vertices have the same color. Existing solutions for graph coloring mainly focus on computing a good coloring for a static graph. However, since many real-world graphs are highly dynamic, in this paper, we aim to incrementally maintain the graph coloring when the graph is dynamically updated. We target on two goals: high effectiveness and high efficiency. To achieve high effectiveness, we maintain the graph coloring in a way such that the coloring result is consistent with one of the best static graph coloring algorithms. To achieve high efficiency, we investigate efficient incremental algorithms to update the graph coloring by only exploring a small number of vertices. We design a color-propagation based algorithm which only explores the vertices within the 2-hop neighbors of the update-related and color-changed vertices. We then propose a novel color index to maintain some summary color information and, thus, bound the explored vertices within the neighbors of these vertices. Moreover, we derive some effective pruning rules to further reduce the number of propagated vertices. The experimental results demonstrate the high effectiveness and efficiency of our approach.

Room : Segóvia II

Chair : Karthik Ramachandra (Microsoft Gray Systems Lab)

##### F1 Query: Declarative Querying at Scale

This paper presents the end-to-end design of F1 Query. Evolved out of F1, the distributed database that Google uses to manage its advertising data, F1 Query has been in production for multiple years at Google and serves the querying needs of a large number of users and systems. F1 Query is a stand-alone, federated query processing platform that executes SQL queries against data stored in different file-based formats as well as different storage systems at Google. F1 Query eliminates the need to maintain the traditional distinction between different types of data processing workloads by simultaneously supporting: (i) OLTP, (ii) low-latency OLAP, and (iii) ETL pipelines. F1 Query has also significantly reduced the need for developing hard-coded data processing pipelines by enabling declarative queries integrated with custom business logic.

##### Query Fresh: Log Shipping on Steroids

Authors: Tianzheng Wang (University of Toronto), Ryan Johnson (Logic Blox USA), Ippokratis Pandis (Amazon)

Hot standby systems often have to trade safety (i.e., not losing committed work) and freshness (i.e., having access to recent updates) for performance. Guaranteeing safety requires synchronous log shipping that blocks the primary until the log records are durably replicated in one or multiple backups; maintaining freshness necessitates fast log replay on backups, but is often defeated by the dual-copy architecture and serial replay: a backup must generate the "real" data from the log to make recent updates accessible to read-only queries. This paper proposes Query Fresh, a hot standby system that provides both safety and freshness while maintaining high performance on the primary. The crux is an append-only storage architecture used in conjunction with fast networks (e.g., InfiniBand) and byte-addressable, non-volatile memory (NVRAM). Query Fresh avoids the dual-copy design and treats the log as the database, enabling lightweight, parallel log replay that does not block the primary. Experimental results using the TPC-C benchmark show that under Query Fresh, backup servers can replay log records faster than they are generated by the primary server, using one quarter of the available compute resources. With a 56Gbps network, Query Fresh can support up to 4-5 synchronous replicas, each of which receives and replays ~1.4GB of log records per second, with up to 4-6% overhead on the primary compared to a standalone server that achieves 620kTPS without replication.

##### Selecting Subexpressions to Materialize at Datacenter Scale

Authors: Alekh Jindal (Microsoft), Konstantinos Karanasos (Microsoft), Sriram Rao (Microsoft), Hiren Patel (Microsoft)

We observe significant overlaps in the computations performed by user jobs in modern shared analytics clusters. Naively computing the same subexpressions multiple times results in wasting cluster resources and longer execution times. Given that these shared cluster workloads consist of tens of thousands of jobs, identifying overlapping computations across jobs is of great interest to both cluster operators and users. Nevertheless, existing approaches support orders of magnitude smaller workloads or employ heuristics with limited effectiveness. In this paper, we focus on the problem of subexpression selection for large workloads, i.e., selecting common parts of job plans and materializing them to speed-up the evaluation of subsequent jobs. We provide an ILP-based formulation of our problem and map it to a bipartite graph labeling problem. Then, we introduce BIGSUBS, a vertex-centric graph algorithm to iteratively choose in parallel which subexpressions to materialize and which subexpressions to use for evaluating each job. We provide a distributed implementation of our approach using our internal SQL-like execution framework, SCOPE, and assess its effectiveness over production workloads. BIGSUBS supports workloads with tens of thousands of jobs, yielding savings of up to 40% in machine-hours. We are currently integrating our techniques with the SCOPE runtime in our production clusters.

##### You Say ‘What’, I Hear ‘Where’ and ‘Why’ — (Mis-)Interpreting SQL to Derive Fine-Grained Provenance

Authors: Tobias Müller (Universität Tübingen), Benjamin Dietrich (Universität Tübingen), Torsten Grust (Universität Tübingen)

SQL declaratively specifies what the desired output of a query is. This work shows that a non-standard interpretation of the SQL semantics can, instead, disclose where a piece of the output originated in the input and why that piece found its way into the result. We derive such data provenance for very rich SQL dialects—including recursion, windowed aggregates, and user-defined functions—at the fine-grained level of individual table cells. The approach is non-invasive and implemented as a compositional source-level SQL rewrite: an input SQL query is transformed into its own interpreter that wields data dependencies instead of regular values. We deliberately design this transformation to preserve the shape of both data and query, which allows provenance derivation to scale to complex queries without overwhelming the underlying database system.

##### Smoke: Fine-grained Lineage at Interactive Speed

Authors: Fotis Psallidas (Columbia University), Eugene Wu (Columbia University)

Data lineage describes the relationship between individual input and output data items of a workflow and is an integral ingredient for both traditional (e.g., debugging or auditing) and emergent (e.g., explanations or cleaning) applications. The core, long-standing problem that lineage systems need to address---and the main focus of this paper---is to quickly capture lineage across a workflow in order to speed up future queries over lineage. Current lineage systems, however, either incur high lineage capture overheads, high lineage query processing costs, or both. In response, developers resort to manual implementations of applications that, in principal, can be expressed and optimized in lineage terms. This paper describes Smoke, an in-memory database engine that provides both fast lineage capture and lineage query processing. To do so, Smoke tightly integrates the lineage capture logic into physical database operators; stores lineage in efficient lineage representations; and employs optimizations if future lineage queries are known up-front. Our experiments on microbenchmarks and realistic workloads show that Smoke reduces the lineage capture overhead and lineage query costs by multiple orders of magnitude as compared to state-of-the-art alternatives. On real-world applications, we show that Smoke meets the latency requirements of interactive visualizations (e.g., <150ms) and outperforms hand-written implementations of data profiling primitives.

Room : Segóvia III

Chair : Essam Mansour (Qatar Computing Research Institute)

##### Table Union Search on Open Data

Authors: Fatemeh Nargesian (University of Toronto), Erkang Zhu (University of Toronto), Ken Pu (University of Ontario Inst. of Technology), Renee Miller (University of Toronto)

We define the table union search problem and present a probabilistic solution for finding tables that are unionable with a query table within massive repositories. Two tables are unionable if they share attributes from the same domain. Our solution formalizes three statistical models that describe how unionable attributes are generated from set domains, semantic domains with values from an ontology, and natural language domains. We propose a data-driven approach that automatically determines the best model to use for each pair of attributes. Through a distribution-aware algorithm, we are able to find the optimal number of attributes in two tables that can be unioned. To evaluate accuracy, we created and open-sourced a benchmark of Open Data tables. We show that our table union search outperforms in speed and accuracy existing algorithms for finding related tables and scales to provide efficient search over Open Data repositories containing more than one million attributes.

##### Distributed Representations of Tuples for Entity Resolution

Authors: Muhammad Ebraheem (QCRI), Saravanan Thirumuruganathan (QCRI), Shafiq Joty (Nanyang Technological University), Mourad Ouzzani (QCRI), Nan Tang (QCRI)

Entity resolution (ER) is a key data integration problem. Despite the efforts in 70+ years in all aspects of ER, there is still a high demand for democratizing ER - humans are heavily involved in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representation of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human efforts). For accuracy, we use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available (we use GloVe in this paper) as well the case where they are not; we present ways to learn and tune the distributed representations that are customized for a specific ER task under different scenarios. For efficiency, we propose a locality sensitive hashing (LSH) based blocking approach that uses distributed representations of tuples; it takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only few attributes. For ease-of-use, DeepER requires much less human labeled data, and does not need feature engineering, compared with traditional machine learning based approaches which require handcrafted features, and similarity functions along with their associated thresholds. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that DeepER outperforms existing solutions.

##### Efficient Estimation of Inclusion Coefficient using HyperLogLog Sketches

Authors: Azade Nazi (Microsoft Research), Bolin Ding (Alibaba Group), Vivek Narasayya (Microsoft Research), Surajit Chaudhuri (Microsoft Research)

Efficiently estimating the inclusion coefficient – the fraction of values of one column that are contained in another column – is useful for tasks such as data profiling and foreign-key detection. We present a new estimator, BML, for inclusion coefficient based on Hyperloglog sketches that results in significantly lower error compared to the state-of-the art approach that uses Bottom-k sketches. We evaluate the error of the BML estimator using experiments on industry benchmarks such as TPC-H and TPC-DS, and several realworld databases. As an independent contribution, we show how Hyperloglog sketches can be maintained incrementally with data deletions using only a constant amount of additional memory.

##### Domain-Aware Multi-Truth Discovery from Conflicting Sources

Authors: Xueling Lin (HKUST), Lei Chen (HKUST)

In the Big Data era, truth discovery has served as a promising technique to solve conflicts in the facts provided by numerous data sources. The most significant challenge for this task is to estimate source reliability and select the answers supported by high quality sources. However, existing works assume that one data source has the same reliability on any kinds of entity, ignoring the possibility that a source may vary in reliability on different domains. To capture the influence of various levels of expertise in different domains, we integrate domain expertise knowledge to achieve a more precise estimation of source reliability. We propose to infer the domain expertise of a data source based on its data richness in different domains. We also study the mutual influence between domains, which will affect the inference of domain expertise. Through leveraging the unique features of the multi-truth problem that sources may provide partially correct values of a data item, we assign more reasonable confidence scores to value sets. We propose an integrated Bayesian approach to incorporate the domain expertise of data sources and confidence scores of value sets, aiming to find multiple possible truths without any supervision. Experimental results on two real-world datasets demonstrate the feasibility, efficiency and effectiveness of our approach.

##### Stylus: A Strongly-Typed Store for Serving Massive RDF Data

Authors: Liang He (University of Science and Technology of China), Bin Shao (Microsoft Research Asia), Yatao Li (Microsoft Research Asia), Huanhuan Xia (Microsoft Research Asia), Yanghua Xiao (Fudan University), Enhong Chen (University of Science and Technology of China), Liang Chen (Microsoft Research Asia)

RDF is one of the most commonly used knowledge representation forms. Many highly influential knowledge bases, such as Freebase and PubChemRDF, are in RDF format. An RDF data set is usually represented as a collection of subject-predicate-object triples. Despite the flexibility of RDF triples, it is challenging to serve SPARQL queries on RDF data eciently by directly managing triples due to the following two reasons. First, heavy joins on a large number of triples are needed for query processing, resulting in a large number of data scans and large redundant intermediate results; Second, weakly-typed triple representation provides suboptimal random access – typically with logarithmic complexity. This data access challenge, unfortunately, cannot be easily met by a better query optimizer as large graph processing is extremely I/O-intensive. In this paper, we argue that strongly-typed graph representation is the key to high-performance RDF query processing. We propose Stylus – a strongly-typed store for serving massive RDF data. Stylus exploits a strongly-typed storage scheme to boost the performance of RDF query processing. The storage scheme is essentially a materialized join view on entities, it thus can eliminate a large number of unnecessary joins on triples. Moreover, it is equipped with a compact representation for intermediate results and an ecient graphdecomposition based query planner. Experimental results on both synthetic and real-life RDF data sets confirm that the proposed approach can dramatically boost the performance of SPARQL query processing.

Room : Segóvia IV

Chair : Thomas Neumann (TUM)

##### Relaxed Operator Fusion for In-Memory Databases: Making Compilation, Vectorization, and Prefetching Work Together At Last

Authors: Prashanth Menon (Carnegie Mellon University), Todd C. Mowry (Carnegie Mellon University), Andrew Pavlo (Carnegie Mellon University)

In-memory database management systems (DBMSs) are a key component of modern on-line analytic processing (OLAP) applications, since they provide low-latency access to large volumes of data. Because disk accesses are no longer the principle bottleneck in such systems, the focus in designing query execution engines has shifted to optimizing CPU performance. Recent systems have revived an older technique of using just-in-time (JIT) compilation to execute queries as native code instead of interpreting a plan. The state-ofthe-art in query compilation is to fuse operators together in a query plan to minimize materialization overhead by passing tuples efficiently between operators. Our empirical analysis shows, however, that more tactful materialization yields better performance. We present a query processing model called “relaxed operator fusion” that allows the DBMS to introduce staging points in the query plan where intermediate results are temporarily materialized. This allows the DBMS to take advantage of inter-tuple parallelism inherent in the plan using a combination of prefetching and SIMD vectorization to support faster query execution on data sets that exceed the size of CPU-level caches. Our evaluation shows that our approach reduces the execution time of OLAP queries by up to 2.2x and achieves up to 1.8x better performance compared to other in-memory DBMSs.

##### Approximately Counting Triangles in Large Graph Streams Including Edge Duplicates with a Fixed Memory Usage

Authors: Pinghui Wang (Xi'an Jiaotong University), Yiyan Qi (Xi'an Jiaotong University), Yu Sun (Xi'an Jiaotong University), Xiangliang Zhang (King Abdullah University of Science and Technology), Jing Tao (Xi'an Jiaotong University), Xiaohong Guan (Xi'an Jiaotong University)

Counting triangles in a large graph is important for detecting network anomalies such as spam web pages and suspicious accounts (e.g., fraudsters and advertisers) on online social networks. However, it is challenging to compute the number of triangles in a large graph represented as a stream of edges with a low computational cost when given a limited memory. Recently, several effective sampling-based approximation methods have been developed to solve this problem. However, they assume the graph stream of interest contains no duplicate edges, which does not hold in many real-world graph streams (e.g., phone calling networks). In this paper, we observe that these methods exhibit a large estimation error or computational cost even when modified to deal with duplicate edges using deduplication techniques such as Bloom filter and hash-based sampling. To solve this challenge, we design a onepass streaming algorithm for uniformly sampling distinct edges at a high speed. Compared to state-of-the-art algorithms, our algorithm reduces the sampling cost per edge from O(log k) (k is the maximum number of sampled edges determined by the available memory space) to O(1) without using any additional memory space. Based on sampled edges, we develop a simple yet accurate method to infer the number of triangles in the original graph stream. We conduct extensive experiments on a variety of realworld large graphs, and the results demonstrate that our method is several times more accurate and faster than state-of-the-art methods with the same memory usage.

##### AIDA - Abstraction for Advanced In-Database Analytics

Authors: Joseph Vinish D'silva (McGill University), Florestan De Moor (McGill University), Bettina Kemme (McGill University)

With the tremendous growth in data science and machine learning, it is becoming increasingly clear that traditional relational database management systems (RDBMS) are lacking appropriate support for the programming paradigms required by such applications, whose developers prefer tools and packages that perform the computation outside the database system. While the database community has attempted to integrate some of these tools as user defined functions (UDFs) in the RDBMS, this has not swayed the trend as UDFs and SQL are not convenient for an incremental, iterative development approach used in these fields. In this paper, we propose AIDA - an abstraction for advanced in-database analytics. AIDA emulates the syntax and semantics of popular data science tools and packages but transparently executes the required transformations and computations inside the RDBMS. In particular, AIDA enables users to avail the regular Python interpreter as a client to connect to the database and perform computations. Furthermore, it supports the interleaved use of both relational and linear algebra operations using a convenient, unified abstraction. AIDA uses the RDBMS engine to efficiently execute relational operations while relying on an embedded Python interpreter and NumPy to perform linear algebra operations. The transformation between data formats is done transparently by AIDA and avoids data copying whenever possible. AIDA does not require changes to statistical packages or RDBMS implementations making it easily portable.

##### BTrim - Hybrid In-Memory Database Architecture for Extreme Transaction Processing in VLDBs

Authors: Aditya Gurajada (SAP Inc.), Dheren Gala (SAP Inc.), Amit Pathak (SAP Inc.), Zhan-Feng Ma (SAP Inc.)

To address the need for extreme OLTP performance on commodity multi-core hardware supporting large amounts of memory, SAP ASE is re-architected to tightly integrate an In-Memory Row Store (IMRS) within the existing database engine. The IMRS is both a store and a caching layer to host "hot" rows in-memory, in a row-oriented format. The IMRS is an extension to the traditional buffer-cache which deals with data in a page-oriented storage format (referred to as the page-store). Data in individual tables marked as IMRS-enabled can be fully memory-resident or can straddle the page store and the IMRS. Cold data in the IMRS is organically identified, harvested, and "packed" back to the page store. All Transact-SQL capabilities and language constructs are supported. Full durability for in-memory data is provided. The high-level system design supporting this architecture, along with experimental results and performance benefits is presented.

Room : El Pardo I

Chair : Wagner Meira Jr. (UFMG)

##### Coconut: A Scalable Bottom-Up Approach for Building Data Series Indexes

Authors: Haridimos Kondylakis (FORTH-ICS), Niv Dayan (Harvard University), Kostas Zoumpatianos (Harvard University), Themis Palpanas (Paris Descartes University)

Many modern applications produce massive amounts of data series that need to be analyzed, requiring efficient similarity search operations. However, the state-of-the-art data series indexes that are used for this purpose do not scale well for massive datasets in terms of performance, or storage costs. We pinpoint the problem to the fact that existing summarizations of data series used for indexing cannot be sorted while keeping similar data series close to each other in the sorted order. This leads to two design problems. First, traditional bulk-loading algorithms based on sorting cannot be used. Instead, index construction takes place through slow top-down insertions, which create a non-contiguous index that results in many random I/Os. Second, data series cannot be sorted and split across nodes evenly based on their median value; thus, most leaf nodes are in practice nearly empty. This further slows down query speed and amplifies storage costs. To address these problems, we present Coconut. The first innovation in Coconut is an inverted, sortable data series summarization that organizes data series based on a z-order curve, keeping similar series close to each other in the sorted order. As a result, Coconut is able to use bulk-loading techniques that rely on sorting to quickly build a contiguous index using large sequential disk I/Os. We then explore prefix-based and median-based splitting policies for bottom-up bulk-loading, showing that median-based splitting outperforms the state of the art, ensuring that all nodes are densely populated. Overall, we show analytically and empirically that Coconut dominates the state-of-the-art data series indexes in terms of construction speed, query speed, and storage costs.

##### VERIFAS: A Practical Verifier for Artifact Systems

Authors: Yuliang Li (UC San Diego), Alin Deutsch (UC San Diego), Victor Vianu (UC San Diego)

Data-driven workflows, of which IBM's Business Artifacts are a prime exponent, have been successfully deployed in practice, adopted in industrial standards, and have spawned a rich body of research in academia, focused primarily on static analysis. The present research bridges the gap between the theory and practice of artifact verification with VERIFAS, the first implementation of practical significance of an artifact verifier with full support for unbounded data. VERIFAS verifies within seconds linear-time temporal properties over real-world and synthetic workflows of complexity in the range recommended by software engineering practice. Compared to our previous implementation based on the widely-used Spin model checker, VERIFAS not only supports a model with richer data manipulations but also outperforms it by over an order of magnitude. VERIFAS' good performance is due to a novel symbolic representation approach and a family of specialized optimizations.

##### Worker Recommendation for Crowdsourced Q&A Services: A Triple-Factor Aware Approach

Authors: Zheng Liu (HKUST), Lei Chen (HKUST)

Worker Recommendation (WR) is one of the most important functions for crowdsourced Q&A services. Specifically, given a set of tasks to be solved, WR recommends each task with a certain group of workers, whom are expected to give timely answers with high qualities. To address the WR problem, recent studies have introduced a number of recommendation approaches, which take advantage of workers' expertises or preferences towards different types of tasks. However, without a thorough consideration of workers' characters, such approaches will lead to either inadequate task fulfillment or inferior answer quality. In this work, we propose the Triple-factor Aware Worker Recommendation framework, which collectively considers workers' expertises, preferences and activenesses to maximize the overall production of high quality answers. We construct the Latent Hierarchical Factorization Model, which is able to infer the tasks' underlying categories and workers' latent characters from the historical data; and we propose a novel parameter inference method, which only requires the processing of positive instances, giving rise to significantly higher time efficiency and better inference quality. What's more, the sampling-based recommendation algorithm is developed, such that the near optimal worker recommendation can be generated for a presented batch of tasks with considerably reduced time consumption. Comprehensive experiments have been carried out using both real and synthetic datasets, whose results verify the effectiveness and efficiency of our proposed methods.

##### Task Assignment in Spatial Crowdsourcing

Authors: Peng Cheng (HKUST), Xun Jian (HKUST), Lei Chen (HKUST)

Recently, with the rapid development of mobile devices and the crowdsourcing platforms, the spatial crowdsourcing has attracted much attention from the database community. Specifically, spatial crowdsourcing refers to sending a location-based request to workers according to their positions, and workers need to physically move to specified locations to conduct tasks. Many works have studied task assignment problems in spatial crowdsourcing, however, their problem settings are different from each other. Thus, it is hard to compare the performances of existing algorithms on task assignment in spatial crowdsourcing. In this paper, we present a comprehensive experimental comparison of most existing algorithms on task assignment in spatial crowdsourcing. Specifically, we first give general definitions about spatial workers and spatial tasks based on definitions in the existing works such that the existing algorithms can be applied on the same synthetic and real data sets. Then, we provide a uniform implementation for all the tested algorithms of task assignment problems in spatial crowdsourcing (open sourced). Finally, based on the results on both synthetic and real data sets, we discuss the strengths and weaknesses of tested algorithms, which can guide future research on the same area and practical implementations of spatial crowdsourcing systems.

##### Constraint-based Explanation and Repair of Filter-Based Transformations

Authors: Dolan Antenucci (University of Michigan), Michael Cafarella (University of Michigan)

Data analysts often need to transform an existing dataset, such as with filtering, into a new dataset for downstream analysis. Even the most trivial of mistakes in this phase can introduce bias and lead to the formation of invalid conclusions. For example, consider a researcher identifying subjects for trials of a new statin drug. She might identify patients with a high dietary cholesterol intake as a population likely to benefit from the drug, however, selection of these individuals could bias the test population to those with a generally unhealthy lifestyle, thereby compromising the analysis. Reducing the potential for bias in the dataset transformation process can minimize the need to later engage in the tedious, time-consuming process of trying to eliminate bias while preserving the target dataset. We propose a novel interaction model for explain-and-repair data transformation systems, in which users interactively define constraints for transformation code and the resultant data. The system satisfies these constraints as far as possible, and provides an explanation for any problems encountered. We present an algorithm that yields filter-based transformation code satisfying user constraints. We implemented and evaluated a prototype of this architecture, Emeril, using both synthetic and real-world datasets. Our approach finds solutions 34% more often and 77% more quickly than the previous state-of-the-art solution.

Room : Oriente

Room : Segóvia I

Chair : Sourav Bhowmick (Nanyang Technological University)

##### 2SCENT: An Efficient Algorithm to Enumerate All Simple Temporal Cycles

Authors: Rohit Kumar (ULB), Toon Calders (Universiteit Antwerpen)

In interaction networks nodes may interact continuously and repeatedly. Not only which nodes interact is important, but also the order in which interactions take place and the patterns they form. These patterns cannot be captured by solely inspecting the static network of who interacted with whom and how frequently, but also the temporal nature of the network needs to be taken into account. In this paper we focus one such fundamental interaction pattern, namely a temporal cycle. Temporal cycles have many applications and appear naturally in communication networks where one person posts a message and after a while reacts to a thread of reactions from peers on the post. In financial networks, on the other hand, the presence of a temporal cycle could be indicative for certain types of fraud. We present an efficient algorithms to find all temporal cycles in a directed interaction network. Our algorithm is a non-trivial temporal extension of a seminal algorithm for finding cycles in static graphs, preceded by an efficient candidate root filtering technique which can be based on Bloom filters to reduce the memory footprint. We tested our algorithm on seven real-world data sets, showing that our algorithm is up to 300 times faster than the only existing competitor and scales up to networks with millions of nodes and hundreds of millions of interactions. Results of a qualitative experiment indicate that different interaction networks may have vastly different distributions of temporal cycles and hence temporal cycles are able to characterize an important aspect of the dynamic behavior in the networks.

##### Real-time Constrained Cycle Detection in Large Dynamic Graphs

Authors: Xiafei Qiu (Alibaba Group), Zhengping Qian (Alibaba Group), You Peng (University of New South Wales), Ying Zhang (University of Technology Sydney), Xuemin Lin (University of New South Wales), Jingren Zhou (Alibaba Group)

As graph data is prevalent for an increasing number of internet applications, continuously monitoring structural patterns in dynamic graphs becomes critical for many applications. In this paper, we present a new system GraphS to efficiently detect constrained cycles in a dynamic graph in real-time. With optimized data representation, the system supports high frequency structural updates to the graph without sacrificing query performance. A hot point based index is built and efficiently maintained for each query so as to greatly speed-up query time and achieve high system throughput. The system is developed at Alibaba to actively monitor various online fraudulent activities based on cycle detection. For a dynamic graph with hundreds of millions of edges and vertices, the system is capable to cope with a peak rate of tens of thousands of edge updates per second and find all the cycles with pre-defined constraints within a millisecond.

##### Streaming Graph Partitioning: An Experimental Study

Authors: Zainab Abbas (KTH Royal Institute of Technology), Vasiliki Kalavri (ETH Zurich), Paris Carbone (KTH Royal Institute of Technology), Vladimir Vlassov (KTH Royal Institute of Technology)

Graph partitioning is an essential yet challenging task for massive graph analysis in distributed computing. Common graph partitioning methods scan the complete graph to obtain structural characteristics offline, before partitioning. However, the emerging need for low-latency, continuous graph analysis led to the development of online partitioning methods. Online methods ingest edges or vertices as a stream, making partitioning decisions on the fly based on partial knowledge of the graph. Prior studies have compared offline graph partitioning techniques across different systems. Yet, little effort has been put into investigating the characteristics of online graph partitioning strategies. In this work, we describe and categorize online graph partitioning techniques based on their assumptions, objectives and costs. Furthermore, we employ an experimental comparison across different applications and datasets, using a unified distributed runtime based on Apache Flink. Our experimental results showcase that model-dependent online partitioning techniques such as low-cut algorithms offer better performance for communication-intensive applications such as bulk synchronous iterative algorithms, albeit higher partitioning costs. Otherwise, model-agnostic techniques trade off data locality for lower partitioning costs and balanced workloads which is beneficial when executing data-parallel single-pass graph algorithms.

##### Clustering Uncertain Graphs

An uncertain graph G = (V,E, p : E ! (0, 1]) can be viewed as a probability space whose outcomes (referred to as possible worlds) are subgraphs of G where any edge e ∈ E occurs with probability p(e), independently of the other edges. These graphs naturally arise in many application domains where data management systems are required to cope with uncertainty in interrelated data, such as computational biology, social network analysis, network reliability, and privacy enforcement, among the others. For this reason, it is important to devise fundamental querying and mining primitives for uncertain graphs. This paper contributes to this endeavor with the development of novel strategies for clustering uncertain graphs. Specifically, given an uncertain graph G and an integer k, we aim at partitioning its nodes into k clusters, each featuring a distinguished center node, so to maximize the minimum/average connection probability of any node to its cluster’s center, in a random possible world. We assess the NP-hardness of maximizing the minimum connection probability, even in the presence of an oracle for the connection probabilities, and develop ecient approximation algorithms for both problems and some useful variants. Unlike previous works in the literature, our algorithms feature provable approximation guarantees and are capable to keep the granularity of the returned clustering under control. Our theoretical findings are complemented with several experiments that compare our algorithms against some relevant competitors, with respect to both running-time and quality of the returned clusterings.

##### Efficient Structural Graph Clustering: An Index-Based Approach

Authors: Dong Wen (University of Technology Sydney), Lu Qin (University of Technology Sydney), Ying Zhang (University of Technology Sydney), Lijun Chang (The University of Sydney), Xuemin Lin (University of New South Wales)

Graph clustering is a fundamental problem widely experienced across many industries. The structural graph clustering (SCAN) method obtains not only clusters but also hubs and outliers. However, the clustering results closely depend on two sensitive parameters, and µ, while the optimal parameter setting depends on different graph properties and various user requirements. Moreover, all existing SCAN solutions need to scan at least the whole graph, even if only a small number of vertices belong to clusters. In this paper we propose an index-based method for SCAN. Based on our index, we cluster the graph for any and µ in O( P C∈C |EC |) time, where C is the result set of all clusters and |EC | is the number of edges in a specific cluster C. In other words, the time expended to compute structural clustering depends only on the result size, not on the size of the original graph. Our index’s space complexity is bounded by O(m), where m is the number of edges in the graph. To handle dynamic graph updates, we propose algorithms and several optimization techniques for maintaining our index. We conduct extensive experiments to practically evaluate the performance of all our proposed algorithms on 10 real-world networks, one of which contains more than 1 billion edges. The experimental results demonstrate that our approaches significantly outperform existing solutions.

Room : Segóvia II

Chair : Matthias Boehm (IBM Research - Almaden)

##### Scalable Replay-Based Replication For Fast Databases

Authors: Dai Qin (University of Toronto), Angela Brown (University of Toronto), Ashvin Goel (University of Toronto)

Primary-backup replication is commonly used for providing fault tolerance in databases. It is performed by replaying the database recovery log on a backup server. Such a scheme raises several challenges for modern, high-throughput multicore databases. It is hard to replay the recovery log concurrently, and so the backup can become the bottleneck. Moreover, with the high transaction rates on the primary, the log transfer can cause network bottlenecks. Both these bottlenecks can significantly slow the primary database. In this paper, we propose using record-replay for replicating fast databases. Our design enables replay to be performed scalably and concurrently, so that the backup performance scales with the primary performance. At the same time, our approach requires only 15-20% of the network bandwidth required by traditional logging, reducing network infrastructure costs significantly.

##### Filter Before You Parse: Faster Analytics on Raw Data with Sparser

Authors: Shoumik Palkar (Stanford University), Firas Abuzaid (Stanford University), Peter Bailis (Stanford University), Matei Zaharia (Stanford University and Databricks)

Exploratory big data applications often run on unstructured or semistructured raw data formats, such as JSON files or text logs. These applications can spend 80–90% of their execution time parsing the data. In this paper, we propose a new approach for reducing this overhead: apply filters on the data’s raw bytestream before parsing. This technique, which we call raw filtering, leverages the features of modern hardware and the high selectivity of queries found in many exploratory applications. With raw filtering, a user-specified query predicate is compiled into a set of filtering primitives called raw filters (RFs). RFs are fast, SIMD-based operators that occasionally yield false positives, but never any false negatives. We combine multiple RFs into an RF cascade to decrease the false positive rate and maximize parsing throughput. Because the best RF cascade is data-dependent, we propose an optimizer that dynamically selects the combination of RFs with the best expected throughput, achieving within 10% of the global optimum cascade while adding less than 1.2% overhead. We implement these techniques in a system called Sparser, which automatically manages a parsing cascade given a data stream in a supported format (e.g., JSON, Avro, Parquet, or log files) and a user query. We show that many real-world applications are highly selective and benefit from Sparser. Across diverse workloads, Sparser accelerates state-of-the-art parsers such as Mison and RapidJSON by 2–22× and improves end-to-end application performance by up to 9×.

##### Froid: Optimization of Imperative Programs in a Relational Database

Authors: Karthik Ramachandra (Microsoft Gray Systems Lab), Kwanghyun Park (Microsoft Gray Systems Lab), K. Venkatesh Emani (IIT Bombay), Alan Halverson (Microsoft Gray Systems lab), Cesar Galindo-Legaria (Microsoft), Conor Cunningham (Microsoft)

For decades, RDBMSs have supported declarative SQL as well as imperative functions and procedures as ways for users to express data processing tasks. While the evaluation of declarative SQL has received a lot of attention resulting in highly sophisticated techniques, the evaluation of imperative programs has remained naive and highly inefficient. Imperative programs offer several benefits over SQL and hence are often preferred and widely used. But unfortunately, their abysmal performance discourages, and even prohibits their use in many situations. We address this important problem that has hitherto received little attention. We present Froid, an extensible framework for optimizing imperative programs in relational databases. Froid's novel approach automatically transforms entire User Defined Functions (UDFs) into relational algebraic expressions, and embeds them into the calling SQL query. This form is now amenable to cost-based optimization and results in efficient, set-oriented, parallel plans as opposed to inefficient, iterative, serial execution of UDFs. Froid's approach additionally brings the benefits of many compiler optimizations to UDFs with no additional implementation effort. We describe the design of Froid and present our experimental evaluation that demonstrates performance improvements of up to multiple orders of magnitude on real workloads.

##### Evaluating End-to-End Optimization for Data Analytics Applications in Weld

Authors: Shoumik Palkar (Stanford University), James Thomas (Stanford University), Deepak Narayanan (Stanford University), Pratiksha Thaker (Stanford University), Rahul Palamuttam (Stanford University), Parimarjan Negi (Stanford University), Anil Shanbhag (Massachusetts institute of technology - MIT), Malte Schwarzkopf (MIT CSAIL), Holger Pirk (Imperial College), Saman Amarasinghe (Massachusetts institute of technology - MIT), Samuel Madden (Massachusetts institute of technology - MIT), Matei Zaharia (Stanford University and Databricks)

Modern analytics applications use a diverse mix of libraries and functions. Unfortunately, there is no optimization across these libraries, resulting in performance penalties as high as an order of magnitude in many applications. To address this problem, we proposed Weld, a common runtime for existing data analytics libraries that performs key physical optimizations such as pipelining under existing, imperative library APIs. In this work, we further develop the Weld vision by designing an automatic adaptive optimizer for Weld applications, and evaluating its impact on realistic data science workloads. Our optimizer eliminates multiple forms of overhead that arise when composing imperative libraries like Pandas and NumPy, and uses lightweight measurements to make data-dependent decisions at runtime in ad-hoc workloads where no statistics are available, with sub-second overhead. We also evaluate which optimizations have the largest impact in practice and whether Weld can be integrated into libraries incrementally. Our results are promising: using our optimizer, Weld accelerates data science workloads by up to 23x on one thread and 80x on eight threads, and its adaptive optimizations provide up to a 3.75x speedup over rule-based optimization. Moreover, Weld provides benefits if even just 4-5 operators in a library are ported to use it. Our results show that common runtime designs like Weld may be a viable approach to accelerate analytics.

##### Quickstep: A Data Platform Based on the Scaling-Up Approach

Modern servers pack enough storage and computing power that just a decade ago was spread across a modest-sized cluster. This paper presents a prototype system, called Quickstep, to exploit the large amount of parallelism that is packed inside modern servers. Quickstep builds on a vast body of previous methods for organizing data, optimizing, scheduling and executing queries, and brings them together in a single system. Quickstep also includes new query processing methods that go beyond previous approaches. To keep the project focused, the project's initial target is read-mostly in-memory data warehousing workloads in single-node settings. In this paper, we describe the design and implementation of Quickstep for this target application space. We also present experimental results comparing the performance of Quickstep to a number of other systems, demonstrating that Quickstep is often faster than many other contemporary systems, and in some cases faster by orders-of-magnitude. Quickstep is an Apache (incubating) project.

Room : Segóvia III

Chair : Alan Halverson (Microsoft Research)

##### Plan Stitch: Harnessing the Best of Many Plans

Authors: Bailu Ding (Microsoft Research), Sudipto Das (Microsoft Research), Wentao Wu (Microsoft Research), Surajit Chaudhuri (Microsoft Research), Vivek Narasayya (Microsoft Research)

Query performance regression due to the query optimizer selecting a bad query execution plan is a major pain point in production workloads. Commercial DBMSs today can automatically detect and correct such query plan regressions by storing previously-executed plans and reverting to a previous plan which is still valid and has the least execution cost. Such reversion-based plan correction has relatively low risk of plan regression since the decision is based on observed execution costs. However, this approach ignores potentially valuable information of eff icient subplans collected from other previously-executed plans.In this paper, we propose a novel technique, Plan Stitch, that automatically and opportunistically combines efficient subplans of previously-executed plans into a valid new plan, which can be cheaper than any individual previously-executed plan. We implement Plan Stitch on top of Microsoft SQL Server. Our experiments on TPC-DS benchmark and three real-world customer workloads show that plans obtained via Plan Stitch can reduce execution cost signif icantly, with a reduction of up to two orders of magnitude and less than 2.7% of plans regresses by more than 10% in execution cost when compared to reverting to the cheapest previously-executed plan.

##### Robustness Metrics for Relational Query Execution Plans

Authors: Florian Wolf (TU Ilmenau), Michael Brendle (University of Konstanz), Norman May (SAP SE), Paul Willems (SAP SE), Kai-Uwe Sattler (TU Ilmenau), Michael Grossniklaus (University of Konstanz)

The quality of query execution plans in database systems determines how fast a query can be executed. It has been shown that conventional query optimization still selects sub-optimal or even bad execution plans, due to errors in the cardinality estimation. Although cardinality estimation errors are an evident problem, they are in general not considered in the selection of query execution plans. In this paper, we present three novel metrics for the robustness of relational query execution plans w.r.t. cardinality estimation errors. We also present a novel plan selection strategy that takes both, estimated cost and estimated robustness into account, when choosing a plan for execution. Finally, we share the results of our experimental comparison between robust and conventional plan selection on real world and synthetic benchmarks, showing a speedup of at most factor 3.44.

##### Conjunctive Queries with Inequalities Under Updates

Authors: Muhammad Idris (ULB and TU Dresden), Martin Ugarte (ULB), Stijn Vansummeren (ULB), Hannes Voigt (Neo4j), Wolfgang Lehner (TU Dresden)

Modern application domains such as Composite Event Recognition (CER) and real-time Analytics require the ability to dynamically refresh query results under high update rates. Traditional approaches to this problem are based either on the materialization of subresults (to avoid their recomputation) or on the recomputation of subresults (to avoid the space overhead of materialization). Both techniques have recently been shown suboptimal: instead of materializing results and subresults, one can maintain a data structure that supports efficient maintenance under updates and can quickly enumerate the full query output, as well as the changes produced under single updates. Unfortunately, these data structures have been developed only for aggregate-join queries composed of equi-joins, limiting their applicability in domains such as CER where temporal joins are commonplace. In this paper, we present a new approach for dynamically evaluating queries with multi-way θ-joins under updates that is effective in avoiding both materialization and recomputation of results, while supporting a wide range of applications. To do this we generalize Dynamic Yannakakis, an algorithm for dynamically processing acyclic equi-join queries. In tandem, and of independent interest, we generalize the notions of acyclicity and free-connexity to arbitrary θ-joins. We instantiate our framework to the case where θ-joins are only composed of equalities and inequalities (<, ≤, =, >, ≥) and experimentally compare this algorithm, called IEDyn, to state of the art CER systems as well as incremental view maintenance engines. IEDyn performs consistently better than the competitor systems with up to two orders of magnitude improvements in both time and memory consumption.

##### BzTree: A High-Performance Latch-free Range Index for Non-Volatile Memory

Authors: Joy Arulraj (Carnegie Mellon University), Justin Levandoski (Microsoft Research), Umar Farooq Minhas (Microsoft Research), Per-Ake Larson (University of Waterloo)

Storing a database (rows and indexes) entirely in non-volatile memory (NVM) potentially enables both high performance and fast recovery. To fully exploit parallelism on modern CPUs, modern main-memory databases use latch-free (lock-free) index structures, e.g. Bw-tree or skip lists. To achieve high performance NVMresident indexes also need to be latch-free. This paper describes the design of the BzTree, a latch-free B-tree index designed for NVM. The BzTree uses a persistent multi-word compare-and-swap operation (PMwCAS) as a core building block, enabling an index design that has several important advantages compared with competing index structures such as the Bw-tree. First, the BzTree is latch-free yet simple to implement. Second, the BzTree is fast - showing up to 2x higher throughput than the Bw-tree in our experiments. Third, the BzTree does not require any special-purpose recovery code. Recovery is near-instantaneous and only involves rolling back (or forward) any PMwCAS operations that were in-flight during failure. Our end-to-end recovery experiments of BzTree report an average recovery time of 145 µs. Finally, the same BzTree implementation runs seamlessly on both volatile RAM and NVM, which greatly reduces the cost of code maintenance.

##### Holistic Query Evaluation over Information Extraction Pipelines

Authors: Ekaterini Ioannou (Open University of Cyprus), Minos Garofalakis (Technical University of Crete)

We introduce holistic in-database query processing over information extraction pipelines. This requires considering the joint conditional distribution over generic Conditional Random Fields that uses factor graphs to encode extraction tasks. Our approach introduces Canopy Factor Graphs, a novel probabilistic model for effectively capturing the joint conditional distribution given a canopy clustering of the data, and special query operators for retrieving resolution information. Since inference on such models is intractable, we introduce an approximate technique for query processing and optimizations that cut across the integrated tasks for reducing the required processing time. Effectiveness and scalability are verified through an extensive experimental evaluation using real and synthetic data.

Room : Segóvia IV

Authors: Amr El Abbadi, Divy Agrawal, Sujaya A. Maiyya and Victor Zakhary (UC Santa Barbara, USA)

Room : El Pardo I

Chair : Artur Ziviani (LNCC)

##### Are Key-Foreign Key Joins Safe to Avoid when Learning High-Capacity Classifiers?

Authors: Vraj Shah (University of California), Arun Kumar (University of California), Xiaojin Zhu (University of Wisconsin-Madison)

Machine learning (ML) over relational data is a booming area of data management. While there is a lot of work on scalable and fast ML systems, little work has addressed the pains of sourcing data for ML tasks. Real-world relational databases typically have many tables (often, dozens) and data scientists often struggle to even obtain all tables for joins before ML. In this context, Kumar et al. showed recently that key-foreign key dependencies (KFKDs) between tables often lets us avoid such joins without significantly affecting prediction accuracy--an idea they called "avoiding joins safely." While initially controversial, this idea has since been used by multiple companies to reduce the burden of data sourcing for ML. But their work applied only to linear classifiers. In this work, we verify if their results hold for three popular high-capacity classifiers: decision trees, non-linear SVMs, and ANNs. We conduct an extensive experimental study using both real-world datasets and simulations to analyze the effects of avoiding KFK joins on such models. Our results show that these high-capacity classifiers are surprisingly and counter-intuitively more robust to avoiding KFK joins compared to linear classifiers, refuting an intuition from the prior work's analysis. We explain this behavior intuitively and identify open questions at the intersection of data management and ML theoretical research. All of our code and datasets are available for download from http://cseweb.ucsd.edu/~arunkk/hamlet.

##### Locality-Sensitive Hashing for Earthquake Detection: A Case Study Scaling Data-Driven Science

Authors: Kexin Rong (Stanford University), Peter Bailis (Stanford University), Philip Levis (Stanford University)

In this work, we report on a novel application of Locality Sensitive Hashing (LSH) to seismic data at scale. Based on the high waveform similarity between reoccurring earthquakes, our application identifies potential earthquakes by searching for similar time series segments via LSH. However, a straightforward implementation of this LSH-enabled application has difficulty scaling beyond 3 months of continuous time series data measured at a single seismic station. As a case study of a data-driven science workflow, we illustrate how domain knowledge can be incorporated into the workload to improve both the efficiency and result quality. We describe several end-to-end optimizations of the analysis pipeline from pre-processing to post-processing, which allow the application to scale to time series data measured at multiple seismic stations. Our optimizations enable an over 100x speed up in the end-to-end analysis pipeline. This improved scalability enabled seismologists to perform seismic analysis on more than ten years of continuous time series data from over ten seismic stations, and has directly enabled the discovery of 597 new earthquakes near the Diablo Canyon nuclear power plant in California and 6123 new earthquakes in New Zealand.

##### Scalable Semantic Querying of Text

Authors: Xiaolan Wang (University of Massachusetts Amherst), Aaron Feng (Recruit Institute of Technology), Behzad Golshan (Recruit Institute of Technology), Alon Halevy (Recruit Institute of Technology), George Mihaila (Recruit Institute of Technology), Hidekazu Oiwa (Recruit Institute of Technology), Wang-Chiew Tan (Recruit Institute of Technology)

We present the KOKO system that takes declarative information extraction to a new level by incorporating advances in natural language processing techniques in its extraction language. KOKO is novel in that its extraction language simultaneously supports conditions on the surface of the text and on the structure of the dependency parse tree of sentences, thereby allowing for more refined extractions. KOKO also supports conditions that are forgiving to linguistic variation of expressing concepts and allows to aggregate evidence from the entire document in order to filter extractions. To scale up, KOKO exploits a multi-indexing scheme and heuristics for efficient extractions. We extensively evaluate KOKO over publicly available text corpora. We show that KOKO indices take up the smallest amount of space, are notably faster and more effective than a number of prior indexing schemes. Finally, we demonstrate KOKO’s scalability on a corpus of 5 million Wikipedia articles.

##### Efficient Document Analytics on Compressed Data: Method, Challenges, Algorithms, Insights

Authors: Feng Zhang (Renmin University of China), Jidong Zhai (Tsinghua University), Xipeng Shen (North Carolina State University), Onur Mutlu (ETH Zurich), Wenguang Chen (Tsinghua University)

Today's rapidly growing document volumes pose pressing challenges to modern document analytics, in both space usage and processing time. In this work, we propose the concept of compression-based direct processing to alleviate issues in both dimensions. The main idea is to enable direct document analytics on compressed data. We present how the concept can be materialized on Sequitur, a compression method that produces hierarchical grammar-like representations. We discuss the major complexities in applying the idea to various document analytics tasks, and reveal a set of guidelines and also assistant software modules for developers to effectively apply compression-based direct processing. Experiments show that the method saves 90.8% storage space and 87.9% memory usage, while speeding up data processing significantly (on average, 1.6X on sequential systems, and 2.2X on distributed clusters).

##### Clustering Stream Data by Exploring the Evolution of Density Mountain

Authors: Shufeng Gong (NorthEasten University), Yanfeng Zhang (NorthEasten University), Ge Yu (Northeast University)

Stream clustering is a fundamental problem in many streaming data analysis applications. Comparing to classical batchmode clustering, there are two key challenges in stream clustering: (i) Given that input data are changing continuously, how to incrementally update their clustering results efficiently? (ii) Given that clusters continuously evolve with the evolution of data, how to capture the cluster evolution activities? Unfortunately, most of existing stream clustering algorithms can neither update the cluster result in real-time nor track the evolution of clusters. In this paper, we propose a stream clustering algorithm EDMStream by exploring the Evolution of Density Mountain. The density mountain is used to abstract the data distribution, the changes of which indicate data distribution evolution. We track the evolution of clusters by monitoring the changes of density mountains. We further provide efficient data structures and filtering schemes to ensure that the update of density mountains is in real-time, which makes online clustering possible. The experimental results on synthetic and real datasets show that, comparing to the state-of-the-art stream clustering algorithms, e.g., DStream, DenStream, DBSTREAM and MR-Stream, our algorithm is able to response to a cluster update much faster (say 7-15x faster than the best of the competitors) and at the same time achieve comparable cluster quality. Furthermore, EDMStream successfully captures the cluster evolution activities.

Room : Oriente

Room : Segóvia I

Chair : Xuemin Lin (The University of New South Wales)

##### A Distributed Multi-GPU System for Fast Graph Processing

Authors: Zhihao Jia (Stanford University), Yongkee Kwon (UT Austin), Galen Shipman (LANL), Pat McCormick (LANL), Mattan Erez (UT Austin), Alex Aiken (Stanford University)

We present Lux, a distributed multi-GPU system that achieves fast graph processing by exploiting the aggregate memory bandwidth of multiple GPUs and taking advantage of locality in the memory hierarchy of multi-GPU clusters. Lux provides two execution models that optimize algorithmic efficiency and enable important GPU optimizations, respectively. Lux also uses a novel dynamic load balancing strategy that is cheap and achieves good load balance across GPUs. In addition, we present a performance model that quantitatively predicts the execution times and automatically selects the runtime configurations for Lux applications. Experiments show that Lux achieves up to 20 times speedup over state-of-the-art shared memory systems and up to two orders of magnitude speedup over distributed systems.

##### Experimental Analysis of Distributed Graph Systems

Authors: Khaled Ammar (University of Waterloo), M. Tamer Özsu (University of Waterloo)

This paper evaluates eight parallel graph processing systems: Hadoop, HaLoop, Vertica, Giraph, GraphLab (PowerGraph), Blogel, Flink Gelly, and GraphX (SPARK) over four very large datasets (Twitter, World Road Network, UK 200705, and ClueWeb) using four workloads (PageRank, WCC, SSSP and K-hop). The main objective is to perform an independent scale-out study by experimentally analyzing the performance, usability, and scalability (using up to 128 machines) of these systems. In addition to performance results, we discuss our experiences in using these systems and suggest some system tuning heuristics that lead to better performance.

##### Accelerating Dynamic Graph Analytics on GPUs

Authors: Mo Sha (NUS), Yuchen Li (Singapore Management University), Bingsheng He (NUS), Kian-Lee Tan (NUS)

As graph analytics often involves compute-intensive operations, GPUs have been extensively used to accelerate the processing. However, in many applications such as social networks, cyber security, and fraud detection, their representative graphs evolve frequently and one has to perform a rebuild of the graph structure on GPUs to incorporate the updates. Hence, rebuilding the graphs becomes the bottleneck of processing high-speed graph streams. In this paper, we propose a GPU-based dynamic graph storage scheme to support existing graph algorithms easily. Furthermore, we propose parallel update algorithms to support efficient stream updates so that the maintained graph is immediately available for high-speed analytic processing on GPUs. Our extensive experiments with three streaming applications on large-scale real and synthetic datasets demonstrate the superior performance of our proposed approach.

##### Lusail: A System for Querying Linked Data at Scale

Authors: Ibrahim Abdelaziz (KAUST), Essam Mansour (QCRI), Mourad Ouzzani (QCRI), Ashraf Aboulnaga (QCRI), Panos Kalnis (King Abdullah University of Science and Technology)

The RDF data model allows publishing interlinked RDF datasets, where each dataset is independently maintained and is queryable via a SPARQL endpoint. This creates a large decentralized geo-distributed queryable RDF graph, and many applications would benefit from querying this graph through a federated SPARQL query processor. A crucial factor for good performance in federated query processing is pushing as much computation as possible to the local endpoints. Surprisingly, existing federated SPARQL engines are not effective at this task since they rely only on schema information. Consequently, they cause unnecessary data retrieval and communication, leading to poor scalability and response time. This paper addresses these limitations and presents Lusail, a scalable and efficient federated SPARQL system for querying large RDF graphs geo-distributed on different endpoints. Lusail uses a novel query rewriting algorithm to push computation to the local endpoints by relying on information about the RDF instances and not only the schema. The query rewriting algorithm has the additional advantage of exposing parallelism in query processing, which Lusail exploits through advanced scheduling at query run time. Our experiments on billions of triples of real and synthetic data show that Lusail outperforms state-of-the-art systems by orders of magnitude in terms of scalability and response time.

##### LA3: A Scalable Link- and Locality-Aware Linear Algebra-Based Graph Analytics System

Authors: Yousuf Ahmad (Carnegie Mellon University in Qatar), Omar Khattab (Carnegie Mellon University in Qatar), Arsal Malik (Carnegie Mellon University in Qatar), Ahmad Musleh (Carnegie Mellon University in Qatar), Mohammad Hammoud (Carnegie Mellon University), Mucahid Kutlu (Qatar University), Mostafa Shehata (Qatar University), Tamer Elsayed (Qatar University)

This paper presents LA3, a scalable distributed system for graph analytics. LA3 couples a vertex-based programming model with a highly optimized linear algebra-based engine. It translates any vertex-centric program into an iteratively executed sparse matrix-vector multiplication (SpMV). To reduce communication and enhance scalability, the adjacency matrix representing an input graph is partitioned into locality-aware 2D tiles distributed across multiple processes. Alongside, three major optimizations are incorporated to preclude redundant computations and minimize communication. First, the link-based structure of the input graph is exploited to classify vertices into different types. Afterwards, vertices of special types are factored out of the main loop of the graph application to avoid superfluous computations. We refer to this novel optimization as computation filtering. Second, a communication filtering mechanism is involved to optimize for the high sparsity of the input matrix due to power-law distributions, common in real-world graphs. This optimization ensures that each process receives only the messages that pertain to non-zero entries in its tiles, substantially reducing communication traffic since most tiles are highly sparse. Lastly, a pseudo-asynchronous computation and communication optimization is proposed, whereby processes progress and communicate asynchronously, consume messages as soon as they become available, and block otherwise. We implemented and extensively tested LA3 on private and public clouds. Results show that LA3 outperforms six related state-of-the-art and popular distributed graph analytics systems by an average of 10x.

Room : Segóvia II

Chair : Umar Farooq Minhas (Microsoft Research)

##### RHEEM: Enabling Cross-Platform Data Processing -- May The Big Data Be With You! --

Authors: Divy Agrawal (University of California), Sanjay Chawla (QCRI), Bertty Contreras-Rojas (QCRI), Ahmed Elmagarmid (QCRI), Yasser Idris (QCRI), Zoi Kaoudi (QCRI), Sebastian Kruse (Hasso-Plattner-Institut), Ji Lucas (QCRI), Essam Mansour (QCRI), Mourad Ouzzani (QCRI), Paolo Papotti (Eurecom), Jorge-Arnulfo Quiane-Ruiz (QCRI), Nan Tang (QCRI), Saravanan Thirumuruganathan (QCRI), Anis Troudi (QCRI)

Solving business problems increasingly requires going beyond the limits of a single data processing platform (platform for short), such as Hadoop or a DBMS. As a result, organizations typically perform tedious and costly tasks to juggle their code and data across different platforms. Addressing this pain and achieving automatic cross-platform data processing is quite challenging: finding the most efficient platform for a given task requires quite good expertise for all the available platforms. We present Rheem, a general-purpose cross-platform data processing system that decouples applications from the underlying platforms. It not only determines the best platform to run an incoming task, but also splits the task into subtasks and assigns each subtask to a specific platform to minimize the overall cost (e.g., runtime or monetary cost). It features (i) a robust interface to easily compose data analytic tasks; (ii) a novel cost-based optimizer able to find the most efficient platform in almost all cases; and (iii) an executor to efficiently orchestrate tasks over different platforms. As a result, it allows users to focus on the business logic of their applications rather than on the mechanics of how to compose and execute them. Using different real-world applications with Rheem, we demonstrate how cross-platform data processing can accelerate performance by more than one order of magnitude compared to single-platform data processing.

##### An Authorization Model for Multi-Provider Queries

Authors: Sabrina De Capitani di Vimercati (University of Milan), Sara Foresti (University of Milan), Sushil Jajodia (George Mason University), Giovanni Livraga (University of Milan), Stefano Paraboschi (University of Bergamo), Pierangela Samarati (University of Milan)

We present a novel approach for the specification and enforcement of authorizations that enables controlled data sharing for collaborative queries in the cloud. Data authorities can establish authorizations regulating access to their data distinguishing three visibility levels (no visibility, encrypted visibility, and plaintext visibility). Authorizations are enforced in the query execution by possibly restricting operation assignments to other parties and by adjusting visibility of data on-the-fly. Our approach enables users and data authorities to fully enjoy the benefits and economic savings of the competitive open cloud market, while maintaining control over data.

##### Taking Omid to the Clouds: Fast, Scalable Transactions for Real-Time Cloud Analytics

Authors: Ohad Shacham (Yahoo Research), Yonatan Gottesman (Yahoo Research), Aran Bergman (Technion), Edward Bortnikov (Yahoo Research), Eshcar Hillel (Yahoo Research), Idit Keidar (Technion IIT and Yahoo Research)

We describe how we evolve Omid, a transaction processing (tps) system for Apache HBase, to power Apache Phoenix, a cloud-grade real-time SQL analytics engine. Omid was originally designed for throughput-oriented data processing pipelines at Yahoo. Providing a platform to support converged real-time tps and analytics applications, dubbed translytics, introduces new requirements. For example, SQL support is key for developer productivity, multi-tenancy is essential for cloud deployment, and latency is cardinal for just-in-time data ingestion and analytics insights. We discuss our efforts to adapt Omid to these domains, as part of the process of integrating it into Phoenix as the tps backend. A central piece of our work is latency reduction in Omid, which also improves scalability. The new protocol's latency is up to an order of magnitude smaller than the legacy Omid. We further describe a fast path protocol, which enables processing them almost as fast as native HBase operations.

##### Bubble Execution: Resource-aware Reliable Analytics at Cloud Scale

Authors: Zhicheng Yin (Microsoft), Jin Sun (Microsoft), Ming Li (Microsoft), Jaliya Ekanayake (Microsoft), Haibo Lin (Microsoft), Marc Friedman (Microsoft), Jose Blakeley (Microsoft), Clemens Szyperski (Microsoft), Nikhil Devanur (Microsoft)

Enabling interactive data exploration at cloud scale requires minimizing end-to-end query execution latency, while guaranteeing fault tolerance, and query execution under resource constraints. Typically, such a query execution involves orchestrating the execution of hundreds or thousands of related tasks on cloud scale clusters. Without any resource constraints, all query tasks can be scheduled to execute simultaneously (gang scheduling) while connected tasks stream data between them. When the data size referenced by a query increases, gang scheduling may be resource-wasteful or unsatisfiable with a limited, per-query resource budget. This paper introduces Bubble Execution, a new query processing framework for processing interactive workloads at cloud scale, that balances cost-based query optimization, fault tolerance, optimal resource management, and execution orchestration. Bubble execution involves dividing a query execution graph into a collection of query sub-graphs (bubbles), and scheduling them within a per-query resource budget. The query operators (tasks) inside a bubble stream data between them while fault tolerance is handled by persisting temporary results at bubble boundaries. Our implementation enhances our JetScope service, for interactive workloads, deployed in production clusters at Microsoft. Experiments with TPC-H-inspired queries show that bubble execution can reduce resource usage signicantly in the presence of failures while maintaining performance competitive with gang execution.

##### FusionInsight LibrA: Huawei's Enterprise Cloud Data Analytics Platform

Authors: Jianjun Chen (Huawei America Research), Mingyi Zhang (Huawei America Research), Yang Sun (Futurewei Technologies Inc), Le Cai (Huawei America Research), Kuorong Chiang (Huawei America Research), Ahmad Ghazal (Huawei America Research), Luara Chen (Huawei America Research), Chunfeng Pei (Huawei America Research), Kamini Jagtiani (Huawei America Research), Jacques Hebert (Huawei America Research), Marko Dimitrijevic (Huawei America Research), Yonghua Ding (Huawei America Research), Cheng Zhu (Huawei America Research), Ye Liu (Huawei America Research), suzhen lin (Huawei America Research), Jun Chen (Huawei America Research), Demai Ni (Huawei America Research), Li Zhang (Huawei America Research), Yu Dong (Huawei America Research), Yongyan Wang (Huawei America Research)

Huawei FusionInsight LibrA (FI-MPPDB) is a petabyte scale enterprise analytics platform developed by the Huawei database group. It was started as a prototype more than five years ago, and is now being used by many enterprise customers over the globe, including some of the world's largest financial institutions. This paper describes the architecture of FI-MPPDB and some of its major enhancements. In particular, we focus on top four requirements from our customers related to data analytics on the cloud: system availability, auto tuning, query over heterogeneous data models on the cloud, and the ability to utilize powerful modern hardware for good performance. We present our latest advancements in the above areas including online expansion, auto tuning in query optimizer, SQL on HDFS, and intelligent JIT compiled execution. Finally, we present some experimental results to demonstrate the effectiveness of these technologies.

Room : Segóvia III

Chair : Bolin Ding (Alibaba Group)

##### Differentially Private Hierarchical Group Size Estimation

Authors: Yu-Hsuan Kuo (Penn State University), Cho-Chun Chiu (Penn State University), Daniel Kifer (Penn State University), Michael Hay (Colgate University), Ashwin Machanavajjhala (Duke University)

We consider the problem of privately releasing a class of queries that we call hierarchical count-of-counts histograms. Count-of-counts histograms partition the rows of an input table into groups (e.g., group of people in the same household), and for every integer j report the number of groups of size j. Hierarchical count-of-counts queries report count-of-counts histograms at different granularities as per a hierarchy defined on one of the attributes in the input data (e.g., geographical location of a household at the national, state and county levels). In this paper, we introduce this problem, along with appropriate error metrics and propose a differentially private solution that generates count-of-counts histograms that are consistent across all levels of the hierarchy.

##### The Tao of Inference in Privacy-Protected Databases

Authors: Vincent Bindschaedler (UIUC), Paul Grubbs (Cornell Tech), David Cash (University of Chicago), Thomas Ristenpart (Cornell Tech), Vitaly Shmatikov (Cornell University)

To protect database confidentiality even in the face of full compromise while supporting standard functionality, recent academic proposals and commercial products rely on a mix of encryption schemes. The recommendation is to apply strong, semantically secure encryption to the â€œsensitiveâ€ columns and protect other columns with property-revealing encryption (PRE) that supports operations such as sorting. We design, implement, and evaluate a new methodology for inferring data stored in such encrypted databases. The cornerstone is the multinomial attack, a new inference technique that is analytically optimal and empirically outperforms prior heuristic attacks against PRE-encrypted data. We also extend the multinomial attack to take advantage of correlations across multiple columns. This recovers PRE-encrypted data with sufficient accuracy to then apply machine learning and record linkage methods to infer columns protected by semantically secure encryption or redaction. We evaluate our methodology on medical, census, and union-membership datasets, showing for the first time how to infer full database records. For PRE-encrypted attributes such as demographics and ZIP codes, our attack outperforms the best prior heuristic by a factor of 16. Unlike any prior technique, we also infer attributes, such as incomes and medical diagnoses, protected by strong encryption. For example, when we infer that a patient in a hospital-discharge dataset has a mental health or substance abuse condition, this prediction is 97% accurate.

##### Optimizing error of high-dimensional statistical queries under differential privacy

Authors: Ryan McKenna (University of Massachusetts Amherst), Gerome Miklau (University of Massachusetts Amherst), Michael Hay (Colgate University), Ashwin Machanavajjhala (Duke University)

Differentially private algorithms for answering sets of predicate counting queries on a sensitive database have many applications. Organizations that collect individual-level data, such as statistical agencies and medical institutions, use them to safely release summary tabulations. However, existing techniques are accurate only on a narrow class of query workloads, or are extremely slow, especially when analyzing more than one or two dimensions of the data. In this work we propose HDMM, a new differentially private algorithm for answering a workload of predicate counting queries, that is especially effective for higher-dimensional datasets. HDMM represents query workloads using an implicit matrix representation and exploits this compact representation to efficiently search (a subset of) the space of differentially private algorithms for one that answers the input query workload with high accuracy. We empirically show that HDMM can efficiently answer queries with lower error than state-of-the-art techniques on a variety of low and high dimensional datasets.

##### Towards Practical Differential Privacy for SQL Queries

Authors: Noah Johnson (UC Berkeley), Joseph Near (UC Berkeley), Dawn Song (UC Berkeley)

Differential privacy promises to enable general data analytics while protecting individual privacy, but existing differential privacy mechanisms do not support the wide variety of features and databases used in real-world SQL-based analytics systems. This paper presents the first practical approach for differential privacy of SQL queries. Using 8.1 million real-world queries, we conduct an empirical study to determine the requirements for practical differential privacy, and discuss limitations of previous approaches in light of these requirements. To meet these requirements we propose elastic sensitivity, a novel method for approximating the local sensitivity of queries with general equijoins. We prove that elastic sensitivity is an upper bound on local sensitivity and can therefore be used to enforce differential privacy using any local sensitivity-based mechanism. We build FLEX, a practical end-to-end system to enforce differential privacy for SQL queries using elastic sensitivity. We demonstrate that FLEX is compatible with any existing database, can enforce differential privacy for real-world SQL queries, and incurs negligible (0.03%) performance overhead.

Room : Segóvia IV

Authors: Amr El Abbadi, Divy Agrawal, Sujaya A. Maiyya and Victor Zakhary (UC Santa Barbara, USA)

Room : El Pardo II

Room : Segóvia I

Room : Segóvia II

Room : Segóvia III

Room : Segóvia IV

Room : El Pardo I

Room : Oriente

Room : Aranjuez