Open Source DataHub: LinkedIn Metadata Search and Discovery Platform

Finding the right data quickly is essential for any company that relies on large amounts of data to make decisions. This not only affects the productivity of data users (analysts, machine learning developers, data scientists, and data engineers), but also has a direct impact on end products that depend on a quality machine learning (ML) pipeline. Moreover, the trend towards implementing or building machine learning platforms naturally raises the question: what is your method for internally discovering features, models, metrics, datasets, and so on?

In this article, we will talk about how we open sourced DataHub, our metadata search and discovery platform, tracing its history back to the project's early days as WhereHows. LinkedIn maintains its own version of DataHub separate from the open source version. We will start by explaining why we need two separate development environments, then discuss our early approaches to open sourcing WhereHows and compare our internal (production) version of DataHub with the version on GitHub. We will also share details about our new automated solution for pushing and pulling open source updates to keep both repositories in sync. Finally, we will provide instructions on how to get started with the open source DataHub and briefly discuss its architecture.

WhereHows is now DataHub!

The LinkedIn metadata team previously introduced DataHub (the successor to WhereHows), LinkedIn's metadata search and discovery platform, and shared plans to open source it. Shortly after that announcement, we released an alpha version of DataHub and shared it with the community. Since then, we have been continuously contributing to the repository and working with interested users to add the most requested features and resolve issues. Now we are happy to announce the official release of DataHub on GitHub.

Open source approaches

WhereHows, LinkedIn's original portal for data discovery and lineage, started as an internal project; the metadata team open sourced it in 2016. Since then, the team has always maintained two different codebases, one for open source and one for LinkedIn's internal use, as not all product features developed for LinkedIn use cases were generally applicable to a wider audience. In addition, WhereHows has some internal dependencies (infrastructure, libraries, and so on) that are not open source. Over the following years, WhereHows went through many iterations and development cycles, which made keeping the two codebases in sync a big challenge. The metadata team tried different approaches over the years to keep internal and open source development in sync.

First Attempt: "Open Source First"

Initially, we followed an "open source first" development model, where most development takes place in the open source repository and changes are then brought into the internal deployment. The problem with this approach is that code is always pushed to GitHub first, before it has been fully tested internally. Production issues would not surface until changes were pulled from the open source repository and a new internal deployment was done. In the case of a bad rollout, it was also very difficult to identify the culprit, because changes were brought in in batches.

In addition, this model reduced the team's productivity when developing new features that required rapid iteration, as it forced all changes to be pushed into the open source repository first and then into the internal repository. To reduce turnaround time, a necessary fix or change could be made in the internal repository first, but this became a huge problem when it came to merging those changes back into the open source repository, because the two repositories drifted out of sync.

This model is much easier to implement for shared platforms, libraries, or infrastructure projects than for full-featured custom web applications. It is also ideal for projects that are open source from day one, but WhereHows was built as a fully in-house web application. It proved really difficult to completely abstract away all the internal dependencies, so we needed to keep the internal fork, and keeping an internal fork while developing mostly in the open did not quite work out.

Second attempt: "Internal first"

As a second attempt, we moved to an "internal first" development model, in which most development happens in-house and changes are pushed to open source on a regular basis. While this model is the best fit for our use case, it has inherent problems. Pushing all diffs directly to the open source repository and then resolving merge conflicts later is an option, but it is time consuming. In most cases, developers try to avoid doing this every time they check in their code. As a result, pushes happen much less often, in batches, which makes later merge conflict resolution even more difficult.

Third time's the charm!

The two failed attempts mentioned above left the WhereHows GitHub repo out of date for a long time. The team continued to improve the product's features and architecture, so the internal LinkedIn version of WhereHows pulled further and further ahead of the open source version. It even got a new name: DataHub. Based on the previous failed attempts, the team decided to develop a scalable, long-term solution.

For any new open source project, LinkedIn's open source development team advises and supports a development model in which the project's modules are developed entirely in the open. Versioned artifacts are deployed to a public repository and then checked back into LinkedIn as an internal artifact via an external library request (ELR). Following this development model is not only good for open source users, but also leads to a more modular, extensible, and pluggable architecture.

However, it will take a significant amount of time for a mature back-end application such as DataHub to reach this state. It also precludes open sourcing a fully working implementation before all internal dependencies have been abstracted away. That is why we developed tools that help us make open source contributions faster and much less painful. This solution benefits both the metadata team (the DataHub developers) and the open source community. The following sections discuss this new approach.

Open source publishing automation

The metadata team's latest approach to the open source DataHub is to develop a tool that automatically synchronizes the internal codebase and the open source repository. The high-level features of this toolkit include:

  1. Synchronizing LinkedIn code to/from open source, similar to rsync.
  2. License header generation, similar to Apache Rat.
  3. Automatic creation of open source commit logs from internal commit logs.
  4. Prevention of internal changes that break the open source build, via dependency testing.

The following subsections discuss in detail the features above that pose interesting problems.

Source Code Synchronization

Unlike the open source version of the DataHub, which is a single GitHub repository, the LinkedIn version of the DataHub is a combination of multiple repositories (internally referred to as multiproducts). The DataHub interface, Metadata Model Library, Metadata Store Backend Service, and Stream Jobs are in separate repositories on LinkedIn. However, to facilitate the experience of open source users, we have a single repository for the open source version of the DataHub.

Figure 1: Synchronization between LinkedIn's DataHub repositories and the single open source DataHub repository

To support automated build, push, and pull workflows, our new tool automatically creates a file-level mapping for each source file. However, the toolkit requires initial configuration: users must provide a high-level module mapping, as shown below.

{
  "datahub-dao": [
    "${datahub-frontend}/datahub-dao"
  ],
  "gms/impl": [
    "${dataset-gms}/impl",
    "${user-gms}/impl"
  ],
  "metadata-dao": [
    "${metadata-models}/metadata-dao"
  ],
  "metadata-builders": [
    "${metadata-models}/metadata-builders"
  ]
}

The module-level mapping is a simple JSON file whose keys are the target modules in the open source repository and whose values are the lists of source modules in the LinkedIn repositories. Any target module in the open source repository can be fed by any number of source modules. Bash-style string interpolation is used to denote the internal repository names in source modules. Using the module-level mapping file, the tools create a file-level mapping file by scanning all files in the associated directories.

{
  "${metadata-models}/metadata-builders/src/main/java/com/linkedin/Foo.java":
    "metadata-builders/src/main/java/com/linkedin/Foo.java",
  "${metadata-models}/metadata-builders/src/main/java/com/linkedin/Bar.java":
    "metadata-builders/src/main/java/com/linkedin/Bar.java",
  "${metadata-models}/metadata-builders/build.gradle": null
}

The file-level mapping is automatically generated by the tools; however, it can also be manually updated by the user. This is a 1:1 mapping of a LinkedIn source file to a file in an open source repository. There are several rules associated with this automatic file association creation:

  • When multiple source modules feed a single target module in open source, there may be conflicts, for example the same FQCN (fully qualified class name) existing in more than one source module. As a conflict resolution strategy, our tools default to "last one wins".
  • "null" means the source file is not part of the open source repository.
  • Every push to or pull from open source automatically updates this mapping and creates a snapshot. This is needed to detect additions and removals of source code since the last action.
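The actual toolkit is internal to LinkedIn, but the scan-and-map step described above can be sketched roughly as follows. The function name `build_file_mapping`, the `repo_roots` argument, and mapping a shadowed file to null are illustrative assumptions, not the real tool's interface:

```python
import os
import re


def build_file_mapping(module_mapping, repo_roots):
    """Expand a module-level mapping into a 1:1 file-level mapping.

    module_mapping: {open source target module: [source modules, with
                     bash-style ${repo} placeholders]}
    repo_roots:     {internal repo name: local checkout path}

    Returns {source file (placeholder form): open source path or None}.
    """
    file_mapping = {}
    seen_targets = {}  # open source file path -> source key, for "last wins"
    for target_module, source_modules in module_mapping.items():
        for source_module in source_modules:
            # Resolve ${repo} placeholders against the local checkouts.
            resolved = re.sub(
                r"\$\{([^}]+)\}", lambda m: repo_roots[m.group(1)], source_module
            )
            for dirpath, _, filenames in os.walk(resolved):
                for name in sorted(filenames):
                    rel = os.path.relpath(
                        os.path.join(dirpath, name), resolved
                    ).replace(os.sep, "/")
                    source_key = f"{source_module}/{rel}"
                    target = f"{target_module}/{rel}"
                    # "Last wins": a later source module shadows an earlier
                    # one; the shadowed file is mapped to None (null).
                    if target in seen_targets:
                        file_mapping[seen_targets[target]] = None
                    seen_targets[target] = source_key
                    file_mapping[source_key] = target
    return file_mapping
```

Running this over the `gms/impl` example above would map `${dataset-gms}/impl` files first and let same-named `${user-gms}/impl` files win any conflicts.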

Commit Log Generation

Commit logs for open source commits are also generated automatically by merging commit logs from the internal repositories. Below is a sample commit log showing the structure produced by our tool. A commit clearly specifies which versions of the source repositories are packaged in that commit and provides a summary of the commit log. Check out this commit for a real example of a commit log created by our toolkit.

metadata-models 29.0.0 -> 30.0.0
    Added aspect model foo
    Fixed issue bar

dataset-gms 2.3.0 -> 2.3.4
    Added rest.li API to serve foo aspect

MP_VERSION=dataset-gms:2.3.4
MP_VERSION=metadata-models:30.0.0
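A commit message with this structure could be assembled along the following lines. The function name and the shape of its input are assumptions for illustration, not the internal tool's actual interface:

```python
def build_commit_message(repo_changes):
    """Assemble an open source commit message from internal commit logs.

    repo_changes: list of (repo, old_version, new_version, summary_lines)
    tuples describing what each internal repository contributed.
    """
    blocks = []
    for repo, old, new, summaries in repo_changes:
        lines = [f"{repo} {old} -> {new}"]
        lines.extend(f"    {summary}" for summary in summaries)
        blocks.append("\n".join(lines))
    # Machine-readable trailer: exactly which versions are packaged here.
    trailer = "\n".join(
        f"MP_VERSION={repo}:{new}" for repo, _, new, _ in repo_changes
    )
    return "\n\n".join(blocks) + "\n\n" + trailer
```

The `MP_VERSION` trailer lines make the packaged versions easy to parse back out when diagnosing a bad batch.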

Dependency Testing

LinkedIn has a dependency testing infrastructure that helps ensure changes to an internal multiproduct do not break the builds of dependent multiproducts. The open source DataHub repository is not a multiproduct and cannot be a direct dependency of any multiproduct, but with a wrapper multiproduct that pulls in the open source DataHub source, we can still use this dependency testing system. Thus, any change (which may later be pushed to open source) in any of the multiproducts that feed the open source DataHub repository triggers a build event in the wrapper multiproduct. Therefore, any change that breaks the wrapper multiproduct's build fails the tests before the original multiproduct is committed, and is reverted.

This is a useful mechanism that helps catch, at commit time, any internal commit that would break the open source build. Without it, it would be quite difficult to determine which internal commit caused the open source repository build to fail, because we batch internal changes into the open source DataHub repository.

Differences between the open source DataHub and our production version

Up to this point, we have discussed our solution for keeping the two versions of the DataHub repositories in sync, but so far we have not outlined the reasons why we need two different development streams at all. In this section, we will list the differences between the public version of the DataHub and the production version on LinkedIn servers, and explain the reasons for these differences.

One source of discrepancy stems from the fact that our production version has dependencies on code that is not yet open source, such as LinkedIn's Offspring (LinkedIn's internal dependency injection framework). Offspring is widely used in the internal codebase because it is the preferred method of dynamic configuration management. But it is not open source, so we needed to find open source alternatives for the open source DataHub.

There are other reasons as well. As we create extensions to the metadata model for LinkedIn's needs, these extensions are usually very specific to LinkedIn and may not apply directly to other environments. For example, we have very specific labels for member IDs and other types of matching metadata. So, for the time being, we have excluded these extensions from the open source DataHub metadata model. As we interact with the community and understand their needs, we will work on shared open source versions of these extensions where appropriate.

Ease of use and easier adoption for the open source community also inspired some of the differences between the two versions of the DataHub. Differences in the streaming infrastructure are a good example of this. While our internal version uses a managed streaming framework, we chose to use native (standalone) streaming for the open source version because it avoids creating yet another infrastructure dependency.

Another example of a difference is having one GMS (Generic Metadata Store) in the open source implementation rather than multiple GMSs. GMA (Generic Metadata Architecture) is the name of DataHub's internal architecture, and GMS is the metadata store in the context of GMA. GMA is a very flexible architecture that lets you distribute each data construct (for example, datasets or users) into its own metadata store, or store multiple data constructs in a single metadata store, as long as the registry containing the data construct mapping in GMS is updated. For ease of use, we chose a single GMS instance that stores all the various data constructs in the open source DataHub.
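The routing idea behind that registry can be sketched as a simple lookup table. The entity type names, endpoints, and `resolve_gms` helper below are hypothetical illustrations, not DataHub's actual configuration format:

```python
# Hypothetical sketch of a GMA-style registry: each data construct
# (entity type) is routed to a metadata store endpoint.

SINGLE_GMS_REGISTRY = {
    # Open source DataHub: every construct is served by one GMS instance.
    "dataset": "http://localhost:8080",
    "corpUser": "http://localhost:8080",
}

MULTI_GMS_REGISTRY = {
    # LinkedIn-style deployment: one GMS per data construct.
    "dataset": "http://dataset-gms:8080",
    "corpUser": "http://user-gms:8080",
    "metric": "http://metric-gms:8080",
}


def resolve_gms(registry, entity_type):
    """Return the metadata store endpoint responsible for an entity type."""
    try:
        return registry[entity_type]
    except KeyError:
        raise ValueError(f"no GMS registered for entity type {entity_type!r}")
```

Moving a construct to its own store is then just a registry update; callers keep resolving endpoints the same way.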

The full list of differences between the two implementations is shown in the table below.

| Product Features | LinkedIn DataHub | Open source DataHub |
| --- | --- | --- |
| Supported Data Constructs | 1) Datasets 2) Users 3) Metrics 4) ML Features 5) Charts 6) Dashboards | 1) Datasets 2) Users |
| Supported Metadata Sources for Datasets | 1) Ambry 2) Couchbase 3) Dalids 4) Espresso 5) HDFS 6) Hive 7) Kafka 8) MongoDB 9) MySQL 10) Oracle 11) Pinot 12) Presto 13) Seas 14) Teradata 15) Vector 16) Venice | 1) Hive 2) Kafka 3) RDBMS |
| Pub-sub | LinkedIn Kafka | Confluent Kafka |
| Stream Processing | Managed | Embedded (standalone) |
| Dependency Injection & Dynamic Configuration | LinkedIn Offspring | Spring |
| Build Tooling | Ligradle (LinkedIn's internal Gradle wrapper) | gradlew |
| CI/CD | CRT (LinkedIn's internal CI/CD) | Travis CI and Docker Hub |
| Metadata Stores | Distributed, multiple GMSs: 1) Dataset GMS 2) User GMS 3) Metric GMS 4) Feature GMS 5) Chart/Dashboard GMS | Single GMS for: 1) Datasets 2) Users |

Microservices in Docker containers

Docker simplifies the deployment and distribution of applications through containerization. Every service in the open source DataHub, including infrastructure components such as Kafka, Elasticsearch, Neo4j, and MySQL, has its own Docker image. To orchestrate the Docker containers, we use Docker Compose.

Figure 2: Open source DataHub architecture

The figure above shows the high-level architecture of DataHub. In addition to the infrastructure components, it has four different Docker containers:

datahub-gms: the metadata store service.

datahub-frontend: a Play application that serves the DataHub interface.

datahub-mce-consumer: a Kafka Streams application that consumes the metadata change event (MCE) stream and updates the metadata store.

datahub-mae-consumer: a Kafka Streams application that consumes the metadata audit event (MAE) stream and updates the search index and graph database.

The open source repository's documentation and the original DataHub blog post contain more detailed information about the functions of the various services.

CI/CD at the Open Source DataHub

The open source DataHub repository uses Travis CI for continuous integration and Docker Hub for continuous deployment. Both have good GitHub integration and are easy to set up. For most open source infrastructure developed by the community or by private companies (for example, Confluent), Docker images are built and deployed to Docker Hub for ease of use by the community. Any Docker image found on Docker Hub can be used with a simple docker pull command.

With every commit to the open source DataHub repository, all Docker images are automatically built and deployed to Docker Hub with the "latest" tag. If Docker Hub is configured with branch-name regular expressions, all tags in the open source repository are also released with corresponding tag names on Docker Hub.

Using DataHub

Setting up DataHub is very easy and consists of three simple steps:

  1. Clone the open source repository and start all Docker containers with Docker Compose, using the provided quick-start docker-compose script.
  2. Load the sample data provided in the repository, using the command line tool that is also provided.
  3. Browse DataHub in your browser.

An actively monitored Gitter chat is also set up for quick questions. Users can also create issues directly in the GitHub repository. Most importantly, we welcome and appreciate all feedback and suggestions!

Plans for the future

Currently, every piece of infrastructure and every microservice for the open source DataHub is built as a Docker container, and the entire system is orchestrated with docker-compose. Given the popularity and widespread adoption of Kubernetes, we would also like to provide a Kubernetes-based solution in the near future.

We also plan to provide a turnkey solution for deploying the DataHub on a public cloud service such as Azure, AWS or Google Cloud. Given the recent announcement of LinkedIn's migration to Azure, this would be in line with the metadata team's internal priorities.

Last but not least, thanks to all the early adopters of DataHub in the open source community who evaluated the DataHub alpha releases and helped us identify issues and improve the documentation.

Source: habr.com
