Currently, almost every company in the world collects statistics about user actions on a web resource. The motivation is clear: companies want to know how their product or website is being used and to better understand their users. Of course, there are plenty of tools on the market for this, from analytics systems that present data as dashboards and graphs (for example, Google Analytics) to services that collect raw event data and load it into a database (for example, Segment).
But we found a problem that had not been solved yet, and that is how EventNative was born.
Why did we develop our own service?
It was the nineties; we survived as best we could. In 2019 we developed the first version of our Customer Data Platform API, kSense, which allowed aggregating data from different sources (Facebook Ads, Stripe, Salesforce, Google Play, Google Analytics, etc.) for more convenient analysis, finding dependencies, and so on. We noticed that many users of our analytics platform relied on Google Analytics (hereinafter referred to as GA) as a data source. We spoke with some of them and found out that they need the product analytics data they collect with GA, but GA samples it and does not give them access to the raw events. Some of them solved this problem with Segment.
They installed a Segment JavaScript pixel on their web resource, and their user behavior data was loaded into a database of their choice (e.g. Postgres). But Segment has its own drawback: the price. For example, if a web resource has 90,000 MTU (monthly tracked users), you have to pay about $1,000 per month. There was also a third problem: some browser extensions (such as AdBlock) blocked the collection of analytics, because HTTP requests from the browser were sent to the GA and Segment domains. Based on our clients' needs, we created an analytics service that collects a full set of data (without sampling), is free of charge, and can run on your own infrastructure.
How the service works
The service consists of three parts: a JavaScript pixel (which we later rewrote in TypeScript), a server part implemented in Go, and the destination database. Initially we planned to support Redshift and BigQuery as destinations (later we added support for Postgres, ClickHouse and Snowflake).
We decided to keep the GA and Segment event structure unchanged. All that was needed was to duplicate every event from the web resource where the pixel is installed to our backend. As it turns out, this is easy to do: the JavaScript pixel overrides the original GA library method with a new one that also sends a copy of the event to our system.
//'ga' is the standard name of the Google Analytics variable
if (window.ga) {
  ga(tracker => {
    var originalSendHitTask = tracker.get('sendHitTask');
    tracker.set('sendHitTask', (model) => {
      var payLoad = model.get('hitPayload');
      //send the original event to GA
      originalSendHitTask(model);
      let jsonPayload = this.parseQuery(payLoad);
      //send a copy of the event to our service
      this.send3p('ga', jsonPayload);
    });
  });
}
With the Segment pixel everything is simpler: it provides middleware methods, and we used one of them.
//'analytics' is the standard name of the Segment variable
if (window.analytics) {
  if (window.analytics.addSourceMiddleware) {
    window.analytics.addSourceMiddleware(chain => {
      try {
        //send a copy of the event to our service
        this.send3p('ajs', chain.payload);
      } catch (e) {
        LOG.warn('Failed to send an event', e)
      }
      //send the original event to Segment
      chain.next(chain.payload);
    });
  } else {
    LOG.warn("Invalid interceptor state. Analytics js initialized, but not completely");
  }
} else {
  LOG.warn('Analytics.js listener is not set.');
}
In addition to copying events, we added the ability to send arbitrary json:
//Sending an event with an arbitrary JSON object
eventN.track('product_page_view', {
  product_id: '1e48fb70-ef12-4ea9-ab10-fd0b910c49ce',
  product_price: 399.99,
  price_currency: 'USD',
  product_release_start: '2020-09-25T12:38:27.763000Z'
});
Next, let's talk about the server side. The backend accepts HTTP requests, enriches them with additional information (for example, geodata resolved from the IP address), and writes them to the database. Since the target databases work with flat tables, incoming nested JSON objects are flattened, and field names are built from the nesting path:
//incoming JSON
{
"field_1": {
"sub_field_1": "text1",
"sub_field_2": 100
},
"field_2": "text2",
"field_3": {
"sub_field_1": {
"sub_sub_field_1": "2020-09-25T12:38:27.763000Z"
}
}
}
//result
{
"field_1_sub_field_1": "text1",
"field_1_sub_field_2": 100,
"field_2": "text2",
"field_3_sub_field_1_sub_sub_field_1": "2020-09-25T12:38:27.763000Z"
}
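Below is a minimal Go sketch of this kind of flattening (an illustration of the idea, not the actual EventNative code): nested objects are walked recursively and key names are joined with an underscore.

package main

import (
	"encoding/json"
	"fmt"
)

// flatten recursively walks a decoded JSON object and collects
// leaf values under keys joined with "_" (e.g. field_1_sub_field_1).
func flatten(prefix string, value interface{}, out map[string]interface{}) {
	switch v := value.(type) {
	case map[string]interface{}:
		for key, child := range v {
			name := key
			if prefix != "" {
				name = prefix + "_" + key
			}
			flatten(name, child, out)
		}
	default:
		// scalars (and, in this sketch, arrays) are stored as-is;
		// the real service serializes arrays to strings, as noted below
		out[prefix] = v
	}
}

func main() {
	raw := `{"field_1":{"sub_field_1":"text1","sub_field_2":100},"field_2":"text2"}`
	var event map[string]interface{}
	json.Unmarshal([]byte(raw), &event)

	flat := map[string]interface{}{}
	flatten("", event, flat)
	fmt.Println(flat) // map[field_1_sub_field_1:text1 field_1_sub_field_2:100 field_2:text2]
}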
Arrays, however, are currently simply converted to strings, since not all relational databases support repeated fields. It is also possible to rename or remove fields using optional mapping rules. They allow you to change the data schema if necessary, or to cast one data type to another. For example, if a JSON field contains a string with a timestamp (field_3_sub_field_1_sub_sub_field_1 from the example above), then in order to create a field of the timestamp type in the database, you need to write a mapping rule in the configuration. In other words, the data type of a field is determined first from the JSON value, and then the type casting rule (if configured) is applied. We identified four main data types: STRING, FLOAT64, INT64 and TIMESTAMP. The mapping and casting rules look like this:
rules:
  - "/field_1/subfield_1 -> " #rule for removing a field
  - "/field_2/subfield_1 -> /field_10/subfield_1" #rule for moving a field
  - "/field_3/subfield_1/subsubfield_1 -> (timestamp) /field_20" #rule for moving a field and casting its type
The algorithm for determining the data type:
- convert the JSON structure to a flat structure
- determine the data types of the fields from their values
- apply the mapping and type casting rules
Then from the incoming json structure:
{
"product_id": "1e48fb70-ef12-4ea9-ab10-fd0b910c49ce",
"product_price": 399.99,
"price_currency": "USD",
"product_type": "supplies",
"product_release_start": "2020-09-25T12:38:27.763000Z",
"images": {
"main": "picture1",
"sub": "picture2"
}
}
the following data schema will be obtained:
"product_id" character varying,
"product_price" numeric (38,18),
"price_currency" character varying,
"product_type" character varying,
"product_release_start" timestamp,
"images_main" character varying,
"images_sub" character varying
We also thought that the user should be able to set up partitioning or split data in the database according to other criteria, so we implemented the ability to set the table name either as a constant or as a template based on event fields:
tableName: '{{.product_type}}_{{._timestamp.Format "2006_01"}}'
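The template follows Go text/template syntax, so it can be resolved against the fields of each event. A small illustrative sketch (the event map and the _timestamp field here are assumptions):

package main

import (
	"fmt"
	"strings"
	"text/template"
	"time"
)

func main() {
	// hypothetical flattened event; _timestamp stands for the event arrival time
	event := map[string]interface{}{
		"product_type": "supplies",
		"_timestamp":   time.Date(2020, 9, 25, 12, 38, 27, 0, time.UTC),
	}

	// the same template as in the configuration example above
	tmpl := template.Must(template.New("tableName").
		Parse(`{{.product_type}}_{{._timestamp.Format "2006_01"}}`))

	var name strings.Builder
	if err := tmpl.Execute(&name, event); err != nil {
		panic(err)
	}
	fmt.Println(name.String()) // supplies_2020_09
}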
However, the structure of incoming events can change at runtime, so we implemented an algorithm that compares the structure of the existing table with the structure of an incoming event. If a difference is found, the table is patched with the new fields using an SQL query like this:
#Example for Postgres
ALTER TABLE "schema"."table" ADD COLUMN new_column character varying
Architecture
Why write events to the file system first instead of writing them directly to the database? Databases do not always show high performance with a large number of small inserts, so incoming events are first appended to log files and then periodically loaded into the database in batches.
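A very simplified Go sketch of this buffering idea (purely illustrative; the real service also handles file rotation, retries and destination-specific bulk loading): events are appended to a log file, and a background job periodically flushes the accumulated batch.

package main

import (
	"bufio"
	"fmt"
	"os"
	"sync"
	"time"
)

// fileBuffer appends incoming events to a log file so that a
// background job can later bulk-load them into the database.
type fileBuffer struct {
	mu sync.Mutex
	w  *bufio.Writer
}

func newFileBuffer(path string) (*fileBuffer, error) {
	f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
	if err != nil {
		return nil, err
	}
	return &fileBuffer{w: bufio.NewWriter(f)}, nil
}

// Write appends one JSON event per line.
func (b *fileBuffer) Write(eventJSON []byte) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if _, err := b.w.Write(eventJSON); err != nil {
		return err
	}
	return b.w.WriteByte('\n')
}

// Flush forces buffered events onto disk; a periodic job would then
// load the file contents into the destination database in one batch.
func (b *fileBuffer) Flush() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.w.Flush()
}

func main() {
	buf, err := newFileBuffer("events.log")
	if err != nil {
		panic(err)
	}
	buf.Write([]byte(`{"event":"product_page_view"}`))

	// flush on a timer instead of inserting every event separately
	ticker := time.NewTicker(time.Minute)
	defer ticker.Stop()
	go func() {
		for range ticker.C {
			if err := buf.Flush(); err != nil {
				fmt.Println("flush failed:", err)
			}
		}
	}()

	time.Sleep(time.Second) // keep the sketch alive briefly
	buf.Flush()
}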
Open Source and future plans
At some point, the service grew into a full-fledged product, and we decided to open-source it. At the moment, integrations with Postgres, ClickHouse, BigQuery, Redshift, S3 and Snowflake have been implemented. All integrations support both batch and streaming data loading modes. We have also added support for sending events via an API.
The current integration scheme looks like this:
Although the service can be used independently (for example, via Docker), we also offer a hosted version.
We will be glad if EventNative helps you solve your problems!
Poll: What statistics collection system is used in your company?
- Google Analytics: 48.0% (12 votes)
- Segment: 4.0% (1 vote)
- Other (write in the comments): 16.0% (4 votes)
- Implemented your own service: 32.0% (8 votes)
25 users voted. 6 users abstained.
Source: habr.com