Getting Amplitude data via API

Introduction

Amplitude has proven itself as a product analytics tool thanks to its simple event setup and flexible visualizations. But you often need to build your own attribution model, cluster users, or build a dashboard in another BI system. That is only possible if you have the raw event data from Amplitude. This article explains how to get that data with minimal programming knowledge.

Prerequisites

  1. A project in Amplitude in which events are already configured correctly and statistics are being collected
  2. Python installed (I work with version 3.8.3) and at least basic familiarity with it

Instructions

Step 1. Getting the API key and secret key

To download data, you first need to get an API key and a secret key.

You can find them as follows:

  1. Open "Manage data" (at the bottom left of the screen)
  2. Select the project you want to export data from and open it
  3. In the project menu that opens, select "Project settings"
  4. Find the API key and secret key fields, copy the values, and store them somewhere safe.

Alternatively, you can go directly to a link of the form:
analytics.amplitude.com/$$$$$$/manage/project/******/settings,
where $$$$$$ is your organization's name in Amplitude and ****** is the project number.

Step 2. Checking for required libraries

The good news is that you almost certainly already have these libraries installed, but it is worth checking. The full list of libraries I use at the time of writing (versions in brackets where relevant):

  1. requests (2.10.0) - sends the API request for the data
  2. pandas (1.0.1) - reads json, builds the dataframe, and writes it to a file
  3. zipfile - extracts files from the archive received via the API
  4. gzip - unpacks the json files from .gz
  5. os - lists the files in the unpacked archive
  6. time - optional, measures the script's running time
  7. tqdm - optional, shows a progress bar for file processing
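As a quick check, you can verify that each library imports and print its version. This snippet is a convenience sketch, not part of the original script:

```python
# Check that the required libraries are importable and print versions.
from importlib import import_module

third_party = ['requests', 'pandas', 'tqdm']  # may need pip install
stdlib = ['zipfile', 'gzip', 'os', 'time']    # ship with Python

for name in third_party + stdlib:
    try:
        module = import_module(name)
        # Standard-library modules usually have no __version__ attribute
        print(name + ':', getattr(module, '__version__', 'built-in'))
    except ImportError:
        print(name + ': NOT INSTALLED - run `pip install ' + name + '`')
```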

Step 3. Writing the data loading script

Hint: the full download script is at the end of the article; if you wish, you can take it right away and refer back to the step-by-step explanations as needed.

Step 3.1. Import libraries

We import all the libraries listed in step two.

# Import libraries
import requests
import pandas as pd
import zipfile
import gzip
import os
import time
from tqdm import tqdm

Step 3.2. Sending a Request to Amplitude

We record the script's start time in the variable a.

startdate and enddate define the export period and are embedded in the request sent to the Amplitude server; besides the date, you can also specify the hour by changing the value after the 'T' in the request.

api_key and secret_key are the values obtained in the first step; for security reasons I show random sequences here instead of my own.

a = time.time()
# Start and end date parameters
startdate = '20200627'
enddate = '20200628'

api_key = 'kldfg844203rkwekfjs9234'
secret_key = '094tfjdsfmw93mxwfek'
# Send the request to Amplitude
response = requests.get('https://amplitude.com/api/2/export?start=' + startdate + 'T0&end=' + enddate + 'T0', auth=(api_key, secret_key))
print('1. Request sent')
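Before saving the response, it can be worth checking the status code. The sketch below wraps the same request in two small helpers (export_url and fetch_export are names introduced here for illustration); in my experience the Export API typically answers 404 when there is no data for the period:

```python
import requests

def export_url(startdate, enddate):
    # Same URL format as the request above; the hour after 'T' defaults to 0
    return ('https://amplitude.com/api/2/export?start=' + startdate +
            'T0&end=' + enddate + 'T0')

def fetch_export(api_key, secret_key, startdate, enddate):
    response = requests.get(export_url(startdate, enddate),
                            auth=(api_key, secret_key))
    if response.status_code == 404:
        # Typically means no data exists for the requested period
        raise RuntimeError('No data for the requested period')
    response.raise_for_status()  # other errors, e.g. 403 for wrong keys
    return response.content
```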

Step 3.3. Download archive with data

We choose a name for the archive and store it in the filename variable. For convenience, I include the period and a note that this is Amplitude data. Then we write the response received from Amplitude to the archive.

# Download the archive with the data
filename = 'period_since' + startdate + 'to' + enddate + '_amplitude_data'
with open(filename + '.zip', "wb") as code:
    code.write(response.content)
print('2. Archive with files downloaded successfully')

Step 3.4. Extract files in a folder on a computer

Here the zipfile library comes into play to extract the files. In the extractall call, be careful to substitute your own path, wherever it is convenient for you to extract to. Note that backslashes in Windows paths must be doubled, since sequences like \U break a Python string literal.

# Extract the files into a folder on the computer
with zipfile.ZipFile(filename + '.zip', 'r') as z:
    z.extractall(path='C:\\Users\\...\\' + filename)
print('3. Archive extracted into the folder ' + filename)

Step 3.5. json conversion

After extracting the files from the archive, you need to convert the json files, which arrive in .gz format, and write them to a dataframe for further work.

Note that here you again need to change the path to your own, and instead of 000000 write your project number from Amplitude (or open the folder the archive was extracted to and check the name of the folder inside).

In order: we write the directory to a variable, get the list of files in it, create an empty dataframe, call time.sleep(1) so that tqdm renders correctly, then inside the loop open each .gz file, read the json with pandas, and append it to the dataframe.

# Convert json to a regular tabular format
directory = 'C:\\Users\\...\\' + filename + '\\000000'
files = os.listdir(directory)
amplitude_dataframe = pd.DataFrame()
print('File processing progress:')
time.sleep(1)
for i in tqdm(files):
    with gzip.open(directory + '\\' + i) as f:
        add = pd.read_json(f, lines=True)
    amplitude_dataframe = pd.concat([amplitude_dataframe, add])
time.sleep(1)
print('4. JSON files from the archive converted and written to the dataframe')

Step 3.6. Writing dataframe to excel

Exporting to Excel is just an example here; in many cases it is more convenient to keep working with the resulting dataframe inside python or to store the data elsewhere.

You will also have to replace the output path with your own.

# Write the resulting table to an Excel file
amplitude_dataframe.to_excel('C:\\Users\\...\\' + filename + '.xlsx', index=False)
print('5. Dataframe written to the file ' + filename)
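If Excel is not required, writing the same dataframe to CSV sidesteps Excel's row limit of 1,048,576 rows, which large exports can easily exceed. A minimal sketch with a stand-in frame (the real script would use amplitude_dataframe and your own path):

```python
import pandas as pd

# Stand-in frame; in the script this is amplitude_dataframe
df = pd.DataFrame({'event_type': ['signup'], 'uuid': ['abc-123']})

# CSV has no row cap and writes faster for large exports
df.to_csv('export_sample.csv', index=False, encoding='utf-8')
```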

Step 3.7. Counting the script running time

We record the current time in the variable b, compute the difference, convert it to minutes, and print the total. This is the last step.

b = time.time()
diff = b - a
minutes = diff // 60
print('The code took {:.0f} minute(s) to run'.format(minutes))

Conclusion

You can start working with the table through the amplitude_dataframe variable the data was written to. It will contain about 50 columns, of which in 80% of cases you will use: event_type - the event name, event_properties - the event parameters, event_time - the event time, uuid - the client id, user_properties - the client parameters; start with these. And when comparing your own calculations with numbers from Amplitude dashboards, remember that the system uses its own methodology for counting unique customers, funnels, and so on, so be sure to read the Amplitude documentation first.
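Narrowing the dataframe to those five columns can be sketched as follows; the empty stand-in frame mimics amplitude_dataframe's wider schema ('platform' and 'country' are placeholder extras, not guaranteed column names):

```python
import pandas as pd

# The five columns highlighted above
key_columns = ['event_type', 'event_properties', 'event_time',
               'uuid', 'user_properties']

# Empty stand-in; in the script this is the filled amplitude_dataframe
amplitude_dataframe = pd.DataFrame(columns=key_columns + ['platform', 'country'])

# Select only the key columns for further analysis
core = amplitude_dataframe[key_columns]
print(core.columns.tolist())
```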

Thank you for your attention! Now you can export raw event data from Amplitude and make full use of it in your work.

The whole script:

# Import libraries
import requests
import pandas as pd
import zipfile
import gzip
import os
import time
from tqdm import tqdm

a = time.time()
# Start and end date parameters
startdate = '20200627'
enddate = '20200628'

api_key = 'd988fddd7cfc0a8a'
secret_key = 'da05cf1aeb3a361a61'
# Send the request to Amplitude
response = requests.get('https://amplitude.com/api/2/export?start=' + startdate + 'T0&end=' + enddate + 'T0', auth=(api_key, secret_key))
print('1. Request sent')

# Download the archive with the data
filename = 'period_since' + startdate + 'to' + enddate + '_amplitude_data'
with open(filename + '.zip', "wb") as code:
    code.write(response.content)
print('2. Archive with files downloaded successfully')

# Extract the files into a folder on the computer
with zipfile.ZipFile(filename + '.zip', 'r') as z:
    z.extractall(path='C:\\Users\\...\\' + filename)
print('3. Archive extracted into the folder ' + filename)

# Convert json to a regular tabular format
directory = 'C:\\Users\\...\\' + filename + '\\000000'
files = os.listdir(directory)
amplitude_dataframe = pd.DataFrame()
print('File processing progress:')
time.sleep(1)
for i in tqdm(files):
    with gzip.open(directory + '\\' + i) as f:
        add = pd.read_json(f, lines=True)
    amplitude_dataframe = pd.concat([amplitude_dataframe, add])
time.sleep(1)
print('4. JSON files from the archive converted and written to the dataframe')

# Write the resulting table to an Excel file
amplitude_dataframe.to_excel('C:\\Users\\...\\' + filename + '.xlsx', index=False)
print('5. Dataframe written to the file ' + filename)

b = time.time()
diff = b - a
minutes = diff // 60
print('The code took {:.0f} minute(s) to run'.format(minutes))

Source: habr.com
