How to search data quickly and easily with Whale

This article covers a simple, fast data discovery tool, which you can see at work in the lead image. Interestingly, whale is designed to be hosted on a remote git server. Details below.

How Airbnb's Data Discovery Tool Changed My Life

In my career, I've had the pleasure of working on some fun problems: I studied flow math while doing my degree at MIT, worked on uplift models and the open source project pylift at Wayfair, and implemented new homepage targeting models and CUPED improvements at Airbnb. But none of this work was ever glamorous; in fact, I often spent most of my time searching for, researching, and validating data. Although this was a constant at work, it didn't occur to me that it was a problem until I got to Airbnb, where it had been solved with a data discovery tool: Dataportal.

Where can I find {{data}}? Dataportal.
What does this column mean? Dataportal.
How is {{metric}} doing today? Dataportal.
What is the meaning of life? In Dataportal, probably.

Okay, you get the picture. Finding data and understanding what it means, how it was created, and how to use it took just a few minutes, not hours. I could spend my time drawing simple conclusions or building new algorithms (or answering random questions about the data) rather than digging through notes, writing repetitive SQL queries, and pinging colleagues on Slack to try to recreate context that someone else already had.

What's the problem?

I realized that most of my friends didn't have access to such a tool. Few companies are willing to devote huge resources to building and maintaining a platform tool like Dataportal. And while there are a few open source solutions, they tend to be designed for scale, which makes them difficult to set up and maintain without a dedicated DevOps engineer. So I decided to create something new.

Whale: A stupidly simple data discovery tool


And yes, by stupidly simple I mean stupidly simple. Whale has only two components:

  1. A Python library that collects metadata and formats it in Markdown.
  2. A Rust command-line interface for searching through this data.
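
To make the first component concrete, here is a minimal Python sketch of turning collected metadata into a Markdown file. The `TableMetadata` and `Column` classes, the `to_markdown` function, and the output layout are all hypothetical names chosen for illustration; they are not Whale's actual API or file format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Column:
    name: str
    dtype: str
    description: str = ""


@dataclass
class TableMetadata:
    schema: str
    name: str
    description: str = ""
    columns: List[Column] = field(default_factory=list)


def to_markdown(table: TableMetadata) -> str:
    """Render one table's metadata as a small Markdown document."""
    lines = [f"# {table.schema}.{table.name}", ""]
    if table.description:
        lines += [table.description, ""]
    # One Markdown table row per column.
    lines += ["| column | type | description |", "| --- | --- | --- |"]
    for col in table.columns:
        lines.append(f"| {col.name} | {col.dtype} | {col.description} |")
    return "\n".join(lines) + "\n"


# Hypothetical example table, not taken from any real warehouse.
example = TableMetadata(
    schema="analytics",
    name="bookings",
    description="One row per booking.",
    columns=[
        Column("id", "bigint", "Primary key"),
        Column("created_at", "timestamp"),
    ],
)
print(to_markdown(example))
```

Because the result is plain text, it can be committed to a git repository and diffed like any other file.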

In terms of infrastructure to maintain, there is only a set of text files and a program that updates them. That's it, so hosting on a git server like GitHub is trivial. There is no new query language to learn, no infrastructure to manage, and no backups. Everyone knows Git, so syncing and collaboration come for free. Let's take a closer look at the functionality of Whale v1.0.

Full-featured git-based GUI

Whale is designed to swim in the ocean of a remote git server. It is very easy to configure: define some connections, copy the GitHub Actions script (or write one for your chosen CI/CD platform), and you'll have a data discovery web tool right away. You will be able to search, view, document, and share your tables directly on GitHub.

An example of a table stub generated using GitHub Actions. For a full working demo, see this section.

Lightning fast CLI search for your repository

Whale lives and breathes on the command line, providing powerful, millisecond lookups across your tables. Even with millions of tables, we managed to make whale incredibly performant by using some clever caching mechanisms and by rebuilding the backend in Rust. You won't notice any search delay [hello, Google DS].

Whale demo: a lookup across a million tables.
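
Since the metadata is just a directory of text files, the idea behind the lookup can be sketched in a few lines of Python. This is a naive linear scan written for illustration only: it is not Whale's Rust implementation and has none of its caching. The `search_tables` function and the file names in the demo are hypothetical.

```python
import tempfile
from pathlib import Path


def search_tables(root: Path, query: str) -> list:
    """Return Markdown files under root whose name or text contains query."""
    needle = query.lower()
    hits = []
    for path in sorted(root.rglob("*.md")):
        text = path.read_text(encoding="utf-8")
        # Case-insensitive match on either the file name or its contents.
        if needle in path.stem.lower() or needle in text.lower():
            hits.append(path)
    return hits


# Demo on a throwaway directory with two hypothetical table stubs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "analytics.bookings.md").write_text(
        "# analytics.bookings\nOne row per booking.\n", encoding="utf-8"
    )
    (root / "analytics.users.md").write_text(
        "# analytics.users\nOne row per user.\n", encoding="utf-8"
    )
    print([p.name for p in search_tables(root, "booking")])
```

A scan like this rereads every file on each query, which is exactly the cost a cache or a prebuilt index avoids at larger scale.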

Automatic calculation of metrics [in beta]

One of my least favorite things as a data scientist is running the same queries over and over again just to check the quality of the data being used. Whale lets you define metrics in plain SQL that are scheduled to run along with your metadata collection pipelines. Define a YAML metrics block inside the table stub, and Whale will run the queries nested under metrics on a schedule.

```metrics
metric-name:
  sql: |
    select count(*) from table
```

Combined with GitHub, this approach means whale can serve as an easy central source of truth for metric definitions. Whale even saves the values along with a timestamp in "~/.whale/metrics" in case you want to do some charting or more in-depth research.
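
As a rough illustration of the idea, the Python sketch below runs each metric's SQL and appends the value with a timestamp to a file. Everything here is an assumption made for the example: the `run_metrics` function, the already-parsed `metrics` dictionary, the JSON-lines file layout, and the SQLite connection are stand-ins, not Whale's actual implementation or storage format.

```python
import json
import sqlite3
import tempfile
import time
from pathlib import Path


def run_metrics(conn, metrics: dict, out_dir: Path) -> dict:
    """Run each metric's SQL and append a timestamped value to a per-metric file."""
    out_dir.mkdir(parents=True, exist_ok=True)
    results = {}
    for name, spec in metrics.items():
        # Each metric query is assumed to return a single scalar.
        value = conn.execute(spec["sql"]).fetchone()[0]
        record = {"timestamp": time.time(), "value": value}
        with open(out_dir / f"{name}.jsonl", "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record) + "\n")
        results[name] = value
    return results


# Demo against an in-memory SQLite table (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("create table bookings (id integer)")
conn.executemany("insert into bookings values (?)", [(1,), (2,), (3,)])

# Mirrors the shape of the metrics block: metric name -> sql.
metrics = {"row-count": {"sql": "select count(*) from bookings"}}

with tempfile.TemporaryDirectory() as tmp:
    print(run_metrics(conn, metrics, Path(tmp)))
```

Appending one record per run keeps the history of each metric, which is what makes later charting possible.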

Future

After talking to users of our pre-release versions of whale, we realized that people needed more functionality. Why just a table lookup tool? Why not a metrics search tool? Why not monitoring? Why not a SQL query execution tool? While whale v1 was originally conceived as a simple CLI companion tool to Dataportal/Amundsen, it has already evolved into a full-featured standalone platform, and we hope it will become an integral part of the data scientist's toolkit.

If there is something you would like to see developed, join our Slack community, open an issue on GitHub, or even contact us directly on LinkedIn. We already have a number of cool features - Jinja templates, bookmarks, search filters, Slack alerts, Jupyter integration, even a CLI dashboard for metrics - but we'd love your input.

Conclusion

Whale is developed and maintained by Dataframe, a startup that I recently had the pleasure of co-founding with other people. While whale is made for data scientists, Dataframe is made for data teams. For those of you who want to collaborate more closely, feel free to reach out, and we will add you to the waiting list.


Source: habr.com