Open Data Integration

Renée J. Miller (Northeastern University)


Open Data plays a major role in open government initiatives. Governments around the world are adopting Open Data Principles promising to make their Open Data complete, primary, and timely. These properties make this data tremendously valuable to data scientists. However, scientists generally do not have a priori knowledge about what data is available (its schema or content), yet they will want to use Open Data and integrate it with other public or private data they are studying. Traditionally, data integration is done using a framework called “query discovery” where the main task is to discover a query (or transformation script) that transforms data from one form into another. The goal is to find the right operators to join, nest, group, link, and twist data into a desired form. In this talk, I introduce a new paradigm for thinking about Open Data Integration where the focus is on “data discovery”, specifically highly efficient, internet-scale discovery that is heavily query-aware. As an example, a join-aware discovery algorithm finds datasets, within a massive data lake, that join (in a precise sense of having high containment) with a known dataset. I describe a research agenda and recent progress in developing scalable query-aware data discovery algorithms.


Renée J. Miller is a University Distinguished Professor of Computer Science at Northeastern University. She is a Fellow of the Royal Society of Canada, Canada’s National Academy of Science, Engineering and the Humanities. She received the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the United States government on outstanding scientists and engineers beginning their careers. She received an NSF CAREER Award, the Ontario Premier’s Research Excellence Award, and an IBM Faculty Award. She formerly held the Bell Canada Chair of Information Systems at the University of Toronto and is a fellow of the ACM. Her work has focused on the long-standing open problem of data integration and has achieved the goal of building practical data integration systems. She and her co-authors (Fagin, Kolaitis and Popa) received the (10 Year) ICDT Test-of-Time Award for their influential 2003 paper establishing the foundations of data exchange. Professor Miller has led the NSERC Business Intelligence Strategic Network and was elected president of the non-profit Very Large Data Base Foundation. She is an Editor-in-Chief of the VLDB Journal. She received her PhD in Computer Science from the University of Wisconsin, Madison and bachelor’s degrees in Mathematics and Cognitive Science from MIT.

Slides: VLDB2018-keynote-open-data-integration-renee-miller.pdf

Healthcare Transformation from Data and System Perspectives

Beng Chin Ooi (National University of Singapore)


While AI and data-driven approaches are still evolving, they are likely to surpass current medical practices in the healthcare domain soon. The potential advantages are not only faster and more accurate analysis, but also the democratization of healthcare services. Nevertheless, applying existing approaches to the healthcare domain raises some common challenges, including noise and bias in electronic health records (EHR), complex and heterogeneous feature relations, access control, and data privacy. In this talk, I discuss our design and implementation strategies: solve common challenges, instill domain knowledge, automate knowledge extraction, and enable system-based global optimization. I discuss our rationale for building a general analytics stack instead of solving individual problems, and explain how these challenges are being addressed. I also describe several technologies, from both system and algorithm perspectives, in our healthcare data management and analytics framework. Finally, I introduce our healthcare analytics stack and MediLOT blockchain system, with the hope of playing a role in healthcare transformation.


Beng Chin is a Distinguished Professor of Computer Science, NGS faculty member and Director of Smart Systems Institute (SSI@NUS) at the National University of Singapore (NUS), an adjunct Chang Jiang Professor at Zhejiang University, China, and the director of NUS AI Innovation and Commercialization Centre at Suzhou, China. He obtained his BSc (1st Class Honors) and PhD from Monash University, Australia, in 1985 and 1989 respectively.

Beng Chin's research interests include database systems, distributed and blockchain systems, and large-scale analytics, with a focus on system architectures, performance issues, security, accuracy, and correctness. He works closely with industry (e.g., NUHS, Jurong Health, Tan Tock Seng Hospital, Singapore General Hospital, and KK Hospital on healthcare analytics and prediabetes prevention), and exploits IT for efficiency in various application domains, including healthcare, finance, and smart city. He is a co-founder of yzBigData (2012) for Big Data management and analytics, Shentilium Technologies (2016) for AI- and data-driven financial data analytics, and MediLot Technologies (2018) for blockchain-based healthcare data management and analytics; an advisor to a RegTech company, Cynopsis Solutions; and an advisor to a blockchain-based KYC ICO. Beng Chin serves as a non-executive and independent director of ComfortDelGro, a transportation company, and as a member of the Hangzhou Government AI Development Committee (AI TOP 30).

Beng Chin is a fellow of the ACM, IEEE, and Singapore National Academy of Science (SNAS).

Slides: vldb18-keynote-ben-chin-ooi.pdf

Data Journalism

Antisocial Behavior on the Web

Jure Leskovec (Stanford University)


User contributions in the form of posts, comments, and votes are essential to the success of online communities. However, allowing user participation also invites undesirable and harmful behavior. In this talk, I discuss antisocial behavior in online discussion communities by analyzing users who were banned from these communities. We find that such users tend to concentrate their efforts in a small number of threads, are more likely to post irrelevantly, and are more successful at garnering responses from other users. Studying the evolution of these users from the moment they join a community up to when they get banned, we find that not only do they write worse than other users over time, but they also become increasingly less tolerated by the community. Our analysis also reveals distinct groups of users with different levels of antisocial behavior that can change over time. We use these insights to identify antisocial users early on, a task of high practical importance to community maintainers.


Jure Leskovec is Associate Professor of Computer Science at Stanford University, Chief Scientist at Pinterest, and investigator at Chan Zuckerberg Biohub. His research focuses on machine learning and data mining of large social and information networks, their evolution, and the diffusion of information and influence over them. Computation over massive data is at the heart of his research and has applications in computer science, social sciences, economics, marketing, and healthcare. This research has won several awards including a Lagrange Prize, a Microsoft Research Faculty Fellowship, an Alfred P. Sloan Fellowship, and numerous best paper awards. Leskovec received his bachelor's degree in computer science from the University of Ljubljana, Slovenia, and his PhD in machine learning from Carnegie Mellon University, and completed postdoctoral training at Cornell University.

Structured Data for Building Online Trust

Cong Yu (Google AI)


In 22 out of 28 countries surveyed by Edelman in 2018, there are more people who distrust the media than those who trust it. In the US, according to Gallup, trust in the media reached an all-time low in 2016. Those figures matter because a free and trustworthy news ecosystem is the foundation of democratic society. Technology played a role in the decline of journalism, and technology can play a role in rebuilding it. In this talk, I discuss various aspects of (re)building online trust and how data researchers in our community can make a difference. I conclude this talk by reissuing the Call to Arms originally made to the database community more than 8 years ago (Cohen et al., CIDR 2011): it's time for us to tackle some of the most important societal challenges in journalism and the online information ecosystem.


Cong Yu is a Research Scientist at Google AI in New York City and leads the Structured Data group. The group’s mission is to understand and leverage structured data on the Web to enhance user experience for Google products, and it has been responsible for several impactful products and features such as WebTables, Structured Snippets, and Fact Checking at Google. Cong is passionate about the health of the news ecosystem and has been partnering with journalists and policy advisors to combat online misinformation and polarization with the goal of helping Google users be more informed. His research interests are structured data exploration and mining, computational journalism, applied machine learning, and scalable data analysis. He twice served as Program Co-Chair for VLDB. Outside of Google, he periodically teaches at NYU Courant's Department of Computer Science. Before Google, Cong was a Research Scientist at Yahoo! Research, also in NYC. He has a PhD from the University of Michigan, Ann Arbor, advised by Prof. H.V. Jagadish.