Utz Westermann

Keynote

Buoyancy in Data Lakes - Agile Metadata Management in Hadoop Data Warehouses at Otto Group

Current metadata and information about data lineage are crucial for understanding and interpreting data in a Hadoop data warehouse. At the same time, Hadoop data warehouse projects sink or swim with the ability to continuously add new data sources and views as business requirements evolve.

Conventional ETL workflow schedulers and metadata management approaches, however, prove to be millstones around a project's neck:

  • with every rollout of new views, significant effort has to be invested in one-time migration scripts for schema and data;
  • manual data documentation is laborious and quickly gets out of sync; and
  • automatic metadata derivation approaches are invasive, resource-taxing, and usually yield a perspective too technical for business.

The talk proposes integrated specification of data structure, data dependencies, and computation logic as a way to keep ETL productive and metadata current.

Based on such specifications, a scheduler can detect changes automatically and, as necessary, perform

  • appropriate schema migrations and
  • data recomputations,

significantly reducing rollout and operations effort.
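
The change detection described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Schedoscope's actual API: the `ViewSpec` class, the checksum helpers, and `plan_actions` are hypothetical names, and the sketch assumes that a schema change also forces a recomputation of the view's data.

```python
import hashlib
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ViewSpec:
    """Hypothetical integrated view specification (not the Schedoscope API)."""
    name: str
    fields: tuple          # (column name, type) pairs: the data structure
    depends_on: tuple      # names of upstream views: the data dependencies
    transformation: str    # e.g. a SQL statement: the computation logic


def _digest(value) -> str:
    # Stable checksum over the textual representation of a spec part.
    return hashlib.sha256(repr(value).encode("utf-8")).hexdigest()


def schema_checksum(spec: ViewSpec) -> str:
    return _digest(spec.fields)


def logic_checksum(spec: ViewSpec) -> str:
    return _digest((spec.depends_on, spec.transformation))


def plan_actions(spec: ViewSpec, deployed: dict) -> list:
    """Compare a spec with the checksums recorded at the last rollout."""
    actions = []
    if deployed.get("schema") != schema_checksum(spec):
        actions.append("migrate_schema")
    # Assumption: a migrated schema always requires recomputing the data.
    if actions or deployed.get("logic") != logic_checksum(spec):
        actions.append("recompute_data")
    return actions


# Example: a view is rolled out, then its computation logic is changed.
v1 = ViewSpec(
    name="clicks_per_shop",
    fields=(("shop_id", "string"), ("clicks", "bigint")),
    depends_on=("clickstream",),
    transformation="SELECT shop_id, COUNT(*) FROM clickstream GROUP BY shop_id",
)
deployed = {"schema": schema_checksum(v1), "logic": logic_checksum(v1)}
v2 = replace(v1, transformation=v1.transformation + " HAVING COUNT(*) > 0")

print(plan_actions(v1, deployed))  # unchanged spec
print(plan_actions(v2, deployed))  # changed computation logic
```

Because structure and logic live in one specification, the scheduler needs no separate migration script: comparing checksums at rollout time is enough to decide what has to happen.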

Also, the very same specifications explicitly "program" rich metadata, avoiding

  • additional manual documentation labor and
  • the overhead and low semantic level of automatic metadata derivation.

This greatly simplifies the implementation of metadata exploration tools.
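
To illustrate how specifications can double as metadata, the following sketch reads documentation and lineage directly off a set of view specifications. All names here (`Field`, `View`, `upstream_lineage`, `describe`, and the example views) are hypothetical and chosen for illustration only; they are not taken from Schedoscope or Metascope.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    name: str
    type: str
    comment: str = ""


@dataclass(frozen=True)
class View:
    name: str
    comment: str
    fields: tuple
    depends_on: tuple = ()


def upstream_lineage(view_name, catalog):
    """Return all direct and transitive upstream views of view_name."""
    seen = []
    stack = list(catalog[view_name].depends_on)
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(catalog[current].depends_on)
    return sorted(seen)


def describe(view_name, catalog):
    """Assemble a metadata record purely from the specifications."""
    view = catalog[view_name]
    return {
        "view": view.name,
        "documentation": view.comment,
        "columns": {f.name: f.comment for f in view.fields},
        "lineage": upstream_lineage(view_name, catalog),
    }


# Hypothetical example catalog of three source views and one report view.
catalog = {
    v.name: v
    for v in (
        View("clickstream", "Raw click events", (Field("url", "string"),)),
        View("products", "Product master data", (Field("sku", "string"),)),
        View("sessions", "Clicks grouped into sessions",
             (Field("session_id", "string", "Unique session key"),),
             depends_on=("clickstream",)),
        View("shop_report", "Sessions and products per shop",
             (Field("shop_id", "string", "Shop identifier"),),
             depends_on=("sessions", "products")),
    )
}

print(describe("shop_report", catalog))
```

Nothing in this sketch parses logs or queries after the fact: the comments and dependencies are written once, by the engineer, in the same place as the view definition.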

The talk illustrates this approach with Schedoscope, a scheduler developed at Otto Group based on integrated view specification, and Metascope, a collaborative metadata exploration tool built on top of Schedoscope. Schedoscope and Metascope drive Otto Group BI's data platform, which processes clickstream, product, and CRM data from 120 online shops with a yearly revenue north of 5bn Euros. Schedoscope has enabled Otto Group BI's small team of data engineers to continuously release new data sources and views for more than two years now; with Metascope, Otto Group's analysts and data scientists have access to always up-to-date metadata and documentation.


Schedoscope and Metascope are available as open source at http://schedoscope.org
 

CV

Utz Westermann is Senior Data Architect at Otto Group BI, Hamburg, with 18 years of experience in large-volume data processing.

He is the tech lead for Otto Group BI's data platform, which processes clickstream, product, and CRM data from 120 online shops.

Utz started out in academia researching multimedia databases, receiving a doctoral degree with distinction from the Technical University of Vienna in 2004. After postdoctoral visits at VTT Oulu and UC Irvine, he left academia in 2006. Prior to joining Otto, he worked as a technical consultant on EAI and as CTO of a startup developing an SEO tool that monitored millions of search rankings.

Utz has published regularly in peer-reviewed international scientific journals and presented at scientific conferences. Utz is the maintainer of the Schedoscope open-source project.
 

Senior Data Architect at Otto Group