Current metadata and information about data lineage are crucial for understanding and interpreting data in a Hadoop data warehouse. At the same time, Hadoop data warehouse projects sink or swim with the ability to continuously add new data sources and views as business requirements evolve.
Conventional ETL workflow schedulers and metadata management approaches, however, prove to be millstones around a project's neck:
The talk proposes integrated specification of data structure, data dependencies, and computation logic as a way to keep ETL productive and metadata current.
Based on such specifications, a scheduler can automatically detect changes and perform
Also, the very same specifications explicitly "program" rich metadata, avoiding
The talk illustrates this approach with Schedoscope, a scheduler developed at Otto Group based on integrated view specification, and Metascope, a collaborative metadata exploration tool built on top of Schedoscope. Schedoscope and Metascope drive Otto Group BI's data platform, which processes clickstream, product, and CRM data from 120 online shops with a yearly revenue north of 5bn Euros. Schedoscope has enabled Otto Group BI's small team of data engineers to continuously release new data sources and views for more than two years now; with Metascope, Otto Group's analysts and data scientists have access to always up-to-date metadata and documentation.
Schedoscope and Metascope are available as open source at http://schedoscope.org
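To make the idea of integrated specification more concrete, here is a minimal sketch in Python. It is not Schedoscope's actual API: `ViewSpec`, `views_to_recompute`, and `metadata` are hypothetical names invented for illustration. The sketch assumes that a scheduler could detect changes by comparing a checksum of each specification against a previously stored one, and that metadata could be derived from the same object.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewSpec:
    """Hypothetical integrated specification of a single view."""
    name: str
    columns: tuple          # data structure: (column name, type) pairs
    depends_on: tuple = ()  # data dependencies: names of upstream views
    logic: str = ""         # computation logic, e.g. a SQL statement
    comment: str = ""       # documentation that travels with the spec

    def checksum(self) -> str:
        # Hash structure, dependencies, and logic. The comment is left out
        # (an assumption) so that documentation edits trigger no recomputation.
        payload = repr((self.columns, self.depends_on, self.logic))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def views_to_recompute(specs, stored_checksums):
    """Return the names of views whose spec changed, plus everything downstream."""
    dirty = {s.name for s in specs
             if stored_checksums.get(s.name) != s.checksum()}
    # Propagate along dependencies until no further view becomes dirty.
    grew = True
    while grew:
        grew = False
        for s in specs:
            if s.name not in dirty and any(d in dirty for d in s.depends_on):
                dirty.add(s.name)
                grew = True
    return dirty


def metadata(spec):
    """Derive a metadata record directly from the specification."""
    return {
        "view": spec.name,
        "columns": dict(spec.columns),
        "lineage": list(spec.depends_on),
        "comment": spec.comment,
    }
```

Because the lineage and column information come from the same object that drives scheduling, the metadata in this sketch cannot drift away from what is actually computed.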
Utz Westermann is Senior Data Architect at Otto Group BI, Hamburg, with 18 years of experience in large-volume data processing.
He is the tech lead for Otto Group BI's data platform, which processes clickstream, product, and CRM data from 120 online shops.
Utz started out in academia researching multimedia databases, receiving a doctoral degree with distinction from Technical University of Vienna in 2004. After postdoctoral visits at VTT Oulu and UC Irvine, he left academia in 2006. Prior to joining Otto, he worked as a technical consultant on EAI and as CTO of a startup developing an SEO tool that monitored millions of search rankings.
Utz has published regularly in peer-reviewed international scientific journals and presented at scientific conferences. Utz is the maintainer of the Schedoscope open-source project.