Good afternoon. It has been two years since the last article on parsing Habr was written, and some things have changed since then.
When I wanted to have my own copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into, you can read under the cut.
The first version of the parser. One thread, many problems
To start with, I decided to make a prototype script in which an article would be parsed immediately after downloading and placed into the database. Without thinking twice, I used sqlite3, because it was less labor-intensive: no need to have a local server, or to create, inspect, and delete it, and so on.
one_thread.py
from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.exceptions.RequestException:
            # Log failed requests so they can be retried later
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if r.status_code != 200:
            print("{} - {}".format(i, r.status_code))
            continue
        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')
        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except AttributeError:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)
Everything here is classic: we use Beautiful Soup and requests, and a quick prototype is ready. It's just that...
Pages are downloaded in a single thread.
If you interrupt the script, the whole database is lost, since the commit only happens after all the parsing is done.
Of course, you can commit to the database after every insert, but then the script's execution time grows significantly.
Parsing the first 100,000 articles took me hours.
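The commit trade-off above can be softened by committing in batches rather than per insert or only once at the very end. A minimal sketch, not from the original script (the BATCH size and the in-memory table are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory database just for this demo
c = conn.cursor()
c.execute("CREATE TABLE habr(id INT, title TEXT)")

BATCH = 100  # hypothetical batch size: commit once per 100 inserts
for i in range(1, 1001):
    c.execute("INSERT INTO habr VALUES (?, ?)", (i, "title {}".format(i)))
    if i % BATCH == 0:
        conn.commit()  # on interruption, at most BATCH rows are lost

conn.commit()  # flush the final partial batch
print(c.execute("SELECT COUNT(*) FROM habr").fetchone()[0])  # → 1000
```

This keeps most of the speed of a single big transaction while limiting how much work an interruption can destroy.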
Then I found an article by the user cointegrated, read it, and found a few life hacks to speed the process up:
Using multithreading speeds up downloading considerably.
You can fetch not the full version of Habr, but its mobile version.
For example, if a cointegrated article weighs 378 KB in the desktop version, in the mobile version it is already 126 KB.
The second version. Many threads, a temporary ban from Habr
When I scoured the internet on the topic of multithreading in Python, I chose the simplest option, multiprocessing.dummy, and noticed that problems showed up along with multithreading.

SQLite3 does not want to work with more than one thread.
This is fixed with check_same_thread=False, but this error is not the only one: when trying to insert into the database, errors sometimes occur that I could not resolve. So I decided to give up inserting articles directly into the database on the spot and, remembering cointegrated's solution, decided to use files instead, since there are no problems with multi-threaded writing to a file.
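As a side note on the check_same_thread=False fix mentioned above: it only disables sqlite3's thread-affinity check, it does not make the connection thread-safe by itself, which may explain those unresolved insert errors. A minimal sketch, not from the original script (the lock and the posts table are illustrative):

```python
import sqlite3
import threading

# check_same_thread=False lets other threads use this connection,
# but writes still have to be serialized by hand, here with a Lock
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE posts(id INT)")
lock = threading.Lock()

def insert(i):
    with lock:  # without this, concurrent inserts can still raise errors
        conn.execute("INSERT INTO posts VALUES (?)", (i,))

threads = [threading.Thread(target=insert, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0])  # → 10
```

Writing each article to its own file sidesteps this serialization problem entirely.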
Habr starts banning you for using more than three threads.
Particularly zealous attempts to reach Habr can end with an IP ban for a couple of hours. So you have to use only 3 threads, but even that is already good, since the time to iterate over 100 articles dropped from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and downloads periodically fail on a large number of articles.
async_v1.py
from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)
    try:
        r = requests.get(url)
    except requests.exceptions.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Record requests blocked by the server
    if r.status_code == 503:
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
        logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or has been hidden
    if r.status_code != 200:
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Flag marking the post as a translation or a tutorial
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except AttributeError:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Save the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr bans the IP for a while
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
The third version. Final
While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site accesses. It loads faster than the mobile version, since it is just json, which does not even need to be parsed. In the end, I decided to rewrite my script once more.
So, having found this link to the API, you can start parsing it.
async_v2.py
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)
    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.exceptions.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)

        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr bans the IP for a while
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
It contains fields relating both to the article itself and to the author who wrote it.
API.png
I did not dump the full json of each article, but saved only the fields I needed:
id
is_tutorial
time_published
title
content
comments_count
lang - the language the article is written in. So far it only contains en and ru.
tags_string - all of the post's tags
reading_count
author
score - the article's rating.
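Since async_v2.py writes these fields as a tuple and parser.py later unpacks them by position, the field order must match on both sides. A small round-trip sketch (the record values are made up for illustration):

```python
import json

# hypothetical record, in the same field order async_v2.py writes
data = (1, False, "2019-01-01T00:00+03:00", "Title", "<p>text</p>",
        10, "ru", "python, habr", 100, "author", 5)

encoded = json.dumps(data)
# json has no tuple type, so the record comes back as a list
decoded = json.loads(encoded)
(id, is_tutorial, time_published, title, content, comments_count,
 lang, tags_string, reading_count, author, score) = decoded
print(author, score)  # → author 5
```

Any change to one side of this tuple silently corrupts the columns on the other side, which is the main fragility of the file hand-off.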
So, by using the API, I cut the script's execution time down to 8 seconds per 100 urls.
Once we have downloaded the data we need, we have to process it and put it into the database. I had no problems with this either:
parser.py
import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0')  # Turn off write confirmation; this speeds things up considerably.
    c.execute("""CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT,
                 lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)""")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                f = open(filename)
                data = json.load(f)
                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data
                # For the sake of a more readable database you can sacrifice code readability. Or not?
                # If you think so, just replace the tuple with the data argument. Up to you.
                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                            title, content, lang,
                                                                                            comments_count, reading_count,
                                                                                            score, is_tutorial,
                                                                                            tags_string))
                f.close()
            except IOError:
                logging.info('FileNotExists')
                continue
    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)
Statistics
Well, traditionally, to finish, let's extract some statistics from the data:
Of the 490,406 expected downloads, only about 228,000 articles were actually downloaded. It turns out that more than half of the articles on Habr had been hidden or deleted.
The entire database, formed from almost half a million articles, weighs 2.95 GB. Compressed, it is 495 MB.
In total, there are 37,804 authors on Habr. Let me remind you that these statistics come only from live posts.
The most prolific author on Habr is alizar, with 8,774 articles.
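Statistics like these can be pulled with plain SQL over the articles table that parser.py creates. A sketch with made-up data (the real habr.db would be queried the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for habr.db
c = conn.cursor()
c.execute("CREATE TABLE articles(id INTEGER, author TEXT)")
c.executemany("INSERT INTO articles VALUES (?, ?)",
              [(1, "alizar"), (2, "alizar"), (3, "someone")])

# distinct author count and the most prolific author
authors = c.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone()[0]
top = c.execute("SELECT author, COUNT(*) AS n FROM articles "
                "GROUP BY author ORDER BY n DESC LIMIT 1").fetchone()
print(authors, top)  # → 2 ('alizar', 2)
```

The database size and the hidden-post counts above come from similar one-off queries and from the file sizes on disk.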