The history of our open source: how we made an analytics service in Go and made it publicly available

Currently, almost every company in the world collects statistics about user actions on its web resources. The motivation is clear: companies want to know how their product or website is being used and to understand their users better. Of course, there are many tools on the market for this, from analytics systems that present data as dashboards and graphs (for example, Google Analytics) to Customer Data Platforms, which let you collect and aggregate data from different sources into any storage (for example, Segment).

But we found a problem that hadn't been solved yet, and so EventNative, an open-source analytics service, was born. Why we decided to develop our own service, what it gave us, and what came out of it in the end (with pieces of code) - read under the cut.


Why did we develop our own service?

It was the nineties, we survived as best we could. Fast forward to 2019: we developed the kSense Customer Data Platform API, which allowed aggregating data from different sources (Facebook Ads, Stripe, Salesforce, Google Play, Google Analytics, etc.) for more convenient data analysis and for identifying dependencies. We noticed that many users relied on Google Analytics (hereinafter referred to as GA) as a data source in our analytics platform. We spoke with some of them and found out that they need their product analytics data, which they receive via GA, but Google samples the data, and for many the GA user interface is far from a standard of convenience. After enough conversations with our users we realized that many also used the Segment platform (which, by the way, was sold just a few days ago for $3.2 billion).

They installed the Segment JavaScript pixel on their web resource, and their user behavior data was loaded into a database of their choice (e.g. Postgres). But Segment also has its downside: the price. For example, if a web resource has 90,000 MTU (monthly tracked users), you have to pay about $1,000 per month. There was also a third problem: some browser extensions (such as AdBlock) blocked the collection of analytics, because HTTP requests from the browser were sent to the GA and Segment domains. Based on our clients' wishes, we created an analytics service that collects a full set of data (without sampling), is free of charge, and can run on your own infrastructure.

How the service works

The service consists of three parts: a JavaScript pixel (which we later rewrote in TypeScript), a server side implemented in Go, and a database: initially we planned to use Redshift and BigQuery (support for Postgres, ClickHouse, and Snowflake was added later).

We decided to leave the structure of GA and Segment events unchanged. All that was needed was to duplicate every event from the web resource where the pixel is installed to our backend. As it turned out, this is easy to do: the JavaScript pixel overrides the original GA library method with a new one that duplicates the event to our system.

//'ga' is the standard Google Analytics variable name
if (window.ga) {
    ga(tracker => {
        var originalSendHitTask = tracker.get('sendHitTask');
        tracker.set('sendHitTask', (model) => {
            var payLoad = model.get('hitPayload');
            //send the original event to GA
            originalSendHitTask(model);
            let jsonPayload = this.parseQuery(payLoad);
            //send the event to our service
            this.send3p('ga', jsonPayload);
        });
    });
}

The Segment pixel is simpler: it provides middleware methods, and we used one of them.


//'analytics' is the standard Segment variable name
if (window.analytics) {
    if (window.analytics.addSourceMiddleware) {
        window.analytics.addSourceMiddleware(chain => {
            try {
                //duplicate the event to our service
                this.send3p('ajs', chain.payload);
            } catch (e) {
                LOG.warn('Failed to send an event', e)
            }
            //send the original event to Segment
            chain.next(chain.payload);
        });
    } else {
        LOG.warn("Invalid interceptor state. Analytics js initialized, but not completely");
    }
} else {
    LOG.warn('Analytics.js listener is not set.');
}

In addition to copying events, we added the ability to send an arbitrary JSON object:


//sending an event with an arbitrary json object
eventN.track('product_page_view', {
    product_id: '1e48fb70-ef12-4ea9-ab10-fd0b910c49ce',
    product_price: 399.99,
    price_currency: 'USD',
    product_release_start: '2020-09-25T12:38:27.763000Z'
});

Next, let's talk about the server side. The backend accepts HTTP requests, enriches them with additional information such as geo data (thanks to MaxMind for it), and writes them to the database. We wanted to make the service as convenient as possible so that it can be used with minimal configuration, so we implemented detection of the data schema from the structure of the incoming event JSON. Data types are determined by values. Nested objects are decomposed and reduced to a flat structure:

//incoming json
{
  "field_1":  {
    "sub_field_1": "text1",
    "sub_field_2": 100
  },
  "field_2": "text2",
  "field_3": {
    "sub_field_1": {
      "sub_sub_field_1": "2020-09-25T12:38:27.763000Z"
    }
  }
}

//result
{
  "field_1_sub_field_1":  "text1",
  "field_1_sub_field_2":  100,
  "field_2": "text2",
  "field_3_sub_field_1_sub_sub_field_1": "2020-09-25T12:38:27.763000Z"
}
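
For illustration, a minimal flattening routine in Go might look like the sketch below; the function and its names are our illustration, not the actual EventNative code:

package main

import (
	"encoding/json"
	"fmt"
)

// flatten recursively collapses nested objects into a single-level map,
// joining key segments with underscores ("field_1" + "sub_field_1" -> "field_1_sub_field_1").
func flatten(prefix string, in map[string]interface{}, out map[string]interface{}) {
	for key, value := range in {
		name := key
		if prefix != "" {
			name = prefix + "_" + key
		}
		if nested, ok := value.(map[string]interface{}); ok {
			flatten(name, nested, out)
		} else {
			out[name] = value
		}
	}
}

func main() {
	raw := `{"field_1": {"sub_field_1": "text1", "sub_field_2": 100}, "field_2": "text2"}`
	var event map[string]interface{}
	if err := json.Unmarshal([]byte(raw), &event); err != nil {
		panic(err)
	}
	flat := map[string]interface{}{}
	flatten("", event, flat)
	fmt.Println(flat) // map[field_1_sub_field_1:text1 field_1_sub_field_2:100 field_2:text2]
}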

However, arrays are currently simply converted to strings, since not all relational databases support repeated fields. It is also possible to rename or remove fields using optional mapping rules. They let you change the data schema if necessary or cast one data type to another. For example, if a JSON field contains a string with a timestamp (field_3_sub_field_1_sub_sub_field_1 from the example above), then in order to create a field with the timestamp type in the database, you need to write a mapping rule in the configuration. In other words, the data type of a field is determined first by its JSON value, and then the type casting rule (if configured) is applied. We have identified four main data types: STRING, FLOAT64, INT64, and TIMESTAMP. The mapping and casting rules look like this:

rules:
  - "/field_1/subfield_1 -> " #ΠΏΡ€Π°Π²ΠΈΠ»ΠΎ удалСния поля
  - "/field_2/subfield_1 -> /field_10/subfield_1" #ΠΏΡ€Π°Π²ΠΈΠ»ΠΎ пСрСноса поля
  - "/field_3/subfield_1/subsubfield_1 -> (timestamp) /field_20" #ΠΏΡ€Π°Π²ΠΈΠ»ΠΎ пСрСноса поля ΠΈ привСдСния Ρ‚ΠΈΠΏΠ°

Algorithm for determining the data type:

  • convert the JSON structure to a flat structure
  • determine the data types of the fields by their values
  • apply mapping and type casting rules
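
To make the second step concrete, here is a rough Go sketch of inferring a column type from a decoded JSON value. The four type names match the list above; the function itself and the automatic detection of RFC3339 timestamp strings are illustrative assumptions, not the service's actual implementation:

package main

import (
	"fmt"
	"time"
)

// DataType is one of the four supported column types.
type DataType string

const (
	String    DataType = "STRING"
	Float64   DataType = "FLOAT64"
	Int64     DataType = "INT64"
	Timestamp DataType = "TIMESTAMP"
)

// detectType infers the column type from a decoded JSON value.
// encoding/json decodes every number as float64, so integers are
// recognized by checking for the absence of a fractional part.
func detectType(value interface{}) DataType {
	switch v := value.(type) {
	case float64:
		if v == float64(int64(v)) {
			return Int64
		}
		return Float64
	case string:
		if _, err := time.Parse(time.RFC3339, v); err == nil {
			return Timestamp
		}
		return String
	default:
		return String
	}
}

func main() {
	fmt.Println(detectType(100.0))                         // INT64
	fmt.Println(detectType(399.99))                        // FLOAT64
	fmt.Println(detectType("2020-09-25T12:38:27.763000Z")) // TIMESTAMP
	fmt.Println(detectType("text"))                        // STRING
}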

Then from the incoming json structure:

{
    "product_id":  "1e48fb70-ef12-4ea9-ab10-fd0b910c49ce",
    "product_price": 399.99,
    "price_currency": "USD",
    "product_type": "supplies",
    "product_release_start": "2020-09-25T12:38:27.763000Z",
    "images": {
      "main": "picture1",
      "sub":  "picture2"
    }
}

the following data schema will be obtained:

"product_id" character varying,
"product_price" numeric (38,18),
"price_currency" character varying,
"product_type" character varying,
"product_release_start" timestamp,
"images_main" character varying,
"images_sub" character varying

We also thought that the user should be able to set up partitioning or split the data in the database according to other criteria, so we implemented the ability to set the table name as a constant or an expression in the configuration. In the example below, the event will be saved to a table whose name is calculated from the values of the product_type and _timestamp fields (for example, supplies_2020_10):

tableName: '{{.product_type}}_{{._timestamp.Format "2006_01"}}'
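
The expression above uses Go's text/template syntax; a small sketch of how such a name can be resolved for a flattened event (the event map below is made up for illustration):

package main

import (
	"fmt"
	"strings"
	"text/template"
	"time"
)

func main() {
	// The table name expression from the configuration above.
	tmpl := template.Must(template.New("table").Parse(
		`{{.product_type}}_{{._timestamp.Format "2006_01"}}`))

	// A flattened event; _timestamp holds the event time.
	event := map[string]interface{}{
		"product_type": "supplies",
		"_timestamp":   time.Date(2020, 10, 15, 12, 0, 0, 0, time.UTC),
	}

	var name strings.Builder
	if err := tmpl.Execute(&name, event); err != nil {
		panic(err)
	}
	fmt.Println(name.String()) // supplies_2020_10
}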

However, the structure of incoming events can change at runtime. We have implemented an algorithm that checks the difference between the structure of the existing table and the structure of the incoming event. If a difference is found, the table is updated with the new fields using a patch SQL query:

--Example for Postgres
ALTER TABLE "schema"."table" ADD COLUMN new_column character varying
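
A simplified Go sketch of how such a patch could be generated from the difference between the existing table schema and the incoming event schema (the identifiers and types here are illustrative, not the actual Table manager code):

package main

import "fmt"

// diffColumns returns the columns that are present in the incoming event schema
// but missing from the existing table schema.
func diffColumns(table, event map[string]string) map[string]string {
	missing := map[string]string{}
	for column, sqlType := range event {
		if _, ok := table[column]; !ok {
			missing[column] = sqlType
		}
	}
	return missing
}

func main() {
	table := map[string]string{"product_id": "character varying"}
	event := map[string]string{
		"product_id": "character varying",
		"new_column": "character varying",
	}
	for column, sqlType := range diffColumns(table, event) {
		fmt.Printf(`ALTER TABLE "schema"."table" ADD COLUMN %s %s`+"\n", column, sqlType)
	}
}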

Architecture

(Architecture diagram)

Why write events to the file system instead of writing them directly to the database? Databases do not always perform well with a large number of single inserts (see the Postgres recommendations). So the Logger writes incoming events to a file, and in a separate goroutine (thread) the File reader reads that file, after which the transformation and data schema detection take place. Once the Table manager makes sure that the table schema is up to date, the data is written to the database in one batch. Later we added the ability to write data directly to the database, but we use that mode only for low-volume events, for example conversions.
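
A very rough Go sketch of this pattern, an append-only log plus a separate goroutine that drains it in one batch (all names, and the lack of a timer, are simplified assumptions):

package main

import (
	"bufio"
	"fmt"
	"os"
)

// writeEvent appends one raw event to the log file; this is all the hot path of the HTTP handler does.
func writeEvent(f *os.File, payload string) error {
	_, err := f.WriteString(payload + "\n")
	return err
}

// uploadBatch stands in for schema detection, table patching, and a single bulk insert.
func uploadBatch(batch []string) {
	fmt.Printf("uploading %d events in one batch\n", len(batch))
}

func main() {
	f, err := os.OpenFile("events.log", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Incoming requests only append to the file.
	for i := 0; i < 3; i++ {
		_ = writeEvent(f, fmt.Sprintf(`{"event_id": %d}`, i))
	}

	// A separate goroutine (in the real service it runs periodically)
	// reads the accumulated file and flushes its contents as one batch.
	done := make(chan struct{})
	go func() {
		defer close(done)
		reader, err := os.Open("events.log")
		if err != nil {
			return
		}
		defer reader.Close()
		var batch []string
		scanner := bufio.NewScanner(reader)
		for scanner.Scan() {
			batch = append(batch, scanner.Text())
		}
		uploadBatch(batch)
	}()
	<-done
}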

Open Source and future plans

At some point the service grew into a full-fledged product, and we decided to release it as open source. At the moment, integrations with Postgres, ClickHouse, BigQuery, Redshift, S3, and Snowflake have been implemented. All integrations support both batch and streaming data loading modes. We also added support for sending events via an API.

The current integration scheme looks like this:

(Integration scheme diagram)

Although the service can be used independently (for example, via Docker), we also have a hosted version, where you can set up an integration with your data warehouse, add a CNAME for your domain, and view statistics on the number of events. Our immediate plans are to add the ability to aggregate not only statistics from a web resource, but also data from external data sources, and to save them to any storage of your choice!

→ GitHub
→ Documentation
→ Slack

We will be glad if EventNative helps you solve your problems!


What statistics collection system is used in your company?

  • Google Analytics: 48.0% (12 votes)
  • Segment: 4.0% (1 vote)
  • Other (write in the comments): 16.0% (4 votes)
  • We implemented our own service: 32.0% (8 votes)

25 users voted. 6 users abstained.

Source: habr.com
