Good evening. It has been two years since the last article about parsing Habr was written, and some things have changed since then.
When I wanted to have my own copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into, you can read under the cut.
To start, I decided to build a script prototype in which an article would be parsed immediately after downloading and saved to the database. Without thinking twice, I used sqlite3, because it was less labor-intensive: no need to run a local server; you create the file, look at it, delete it, and so on.
one_thread.py
from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")

    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.exceptions.RequestException:
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue

        if r.status_code != 200:
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except AttributeError:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)

    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)
Everything is classic: Beautiful Soup, requests, and a quick prototype is ready. That's just…

Page downloading runs in a single thread.

If you interrupt the script, the whole database goes nowhere, since the commit happens only after all the parsing is done.
Of course, you can commit to the database after every insert, but then the script's execution time grows significantly.
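To make the trade-off concrete, here is a small benchmark sketch of my own (not part of the original scripts): it inserts the same number of rows twice, once committing after every row and once with a single commit at the end, and prints both timings.

```python
import os
import sqlite3
import tempfile
from datetime import datetime

def insert_rows(db_path, n, commit_each_row):
    # Assumed benchmark helper: insert n rows, committing per row or once at the end.
    conn = sqlite3.connect(db_path)
    c = conn.cursor()
    c.execute("CREATE TABLE t(id INT, content TEXT)")
    start = datetime.now()
    for i in range(n):
        c.execute("INSERT INTO t VALUES (?, ?)", (i, "x" * 100))
        if commit_each_row:
            conn.commit()  # one transaction (and disk sync) per row
    conn.commit()  # in batch mode, this is the only commit
    elapsed = datetime.now() - start
    count = c.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    return elapsed, count

with tempfile.TemporaryDirectory() as d:
    per_row, n1 = insert_rows(os.path.join(d, "a.db"), 1000, True)
    batched, n2 = insert_rows(os.path.join(d, "b.db"), 1000, False)
    print("per-row commit:", per_row, "single commit:", batched)
```

On a spinning disk the per-row variant is typically slower by an order of magnitude, since each commit forces its own sync to disk.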
Parsing the first 100,000 articles took me many hours.
Next, I found an article by the user cointegrated, which I read and found a few life hacks to speed up the process:

Using multithreading speeds up downloading considerably.
You can fetch not the full version of Habr, but its mobile version.

For example, if a cointegrated article weighs 378 KB in the desktop version, then in the mobile version it is already 126 KB.
The second version. Many threads, a temporary ban from Habr
While scouring the Internet on the topic of multithreading in Python, I chose the simplest option, multiprocessing.dummy, and noticed that problems came along with the multithreading.

SQLite3 does not want to work with more than one thread.
This is fixed with check_same_thread=False, but that is not the only error: when trying to insert into the database, errors sometimes occurred that I could not resolve.
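As a sketch of that workaround: check_same_thread=False lets several threads share one connection, and a lock around the inserts serializes the writes. The lock is my addition for illustration, not something the original script used.

```python
import sqlite3
import threading

# check_same_thread=False allows one connection to be used from several threads;
# the Lock serializes writes so they do not interleave inside SQLite.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE habr(id INT, title TEXT)")
db_lock = threading.Lock()

def insert_article(i):
    with db_lock:
        conn.execute("INSERT INTO habr VALUES (?, ?)", (i, "title {}".format(i)))

threads = [threading.Thread(target=insert_article, args=(n,)) for n in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
conn.commit()
saved = conn.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
print(saved)  # 10
```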
Because of that, I decided to abandon instant insertion of articles straight into the database and, remembering the cointegrated solution, decided to use files instead, since there are no problems with multi-threaded writing to a file.

Habr starts banning you for using more than three threads.
Particularly zealous attempts to get through to Habr can end with an IP ban for a couple of hours. So you have to limit yourself to just 3 threads, but even that is already good, since the time to iterate over 100 articles dropped from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and downloads periodically fail on large numbers of articles.
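One way to paper over that instability would be a small retry helper around the download call. This is a hypothetical addition on my part, not something the original scripts do; the flaky callable below just simulates a request that fails twice before succeeding.

```python
import time

def with_retries(fn, attempts=3, delay=1.0):
    # Hypothetical helper: call fn(), retrying a few times on any exception.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the last attempt
            time.sleep(delay)

# Simulated flaky download: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("temporary failure")
    return "ok"

result = with_retries(flaky, attempts=5, delay=0.0)
print(result)  # ok
```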
async_v1.py
from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\\{}.json".format(i)
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except requests.exceptions.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests blocked by the server
    if r.status_code == 503:
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
        logging.warning('{} / 503 Error'.format(i))

    # If the post does not exist or has been hidden
    if r.status_code != 200:
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()
        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']
        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]
        # Flags marking a post as a translation or a tutorial
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()
        rating = soup.find(class_="tm-votes-score").get_text()
    except (AttributeError, TypeError, KeyError):
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Save the article as json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP address
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time after all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
The third version. The final one

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site accesses. It loads faster than the mobile version, since it is just JSON, which does not even need to be parsed. In the end, I decided to rewrite my script once more.

So, having found this link to the API, you can start parsing it.
async_v2.py
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\\{}.json".format(i)
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.exceptions.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)

        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP address
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time after all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
It contains fields related to both the article itself and the author who wrote it.

[API.png: a screenshot of the API's JSON response]

I did not dump the full JSON of every article, but saved only the fields I needed:
id
is_tutorial
time_published
title
content
comments_count
lang is the language the article is written in. So far it has only en and ru.
tags_string - all the tags of the post
reading_count
author
score - the article's rating.
Thus, using the API, I reduced the script's execution time to 8 seconds per 100 URLs.
After downloading the data we need, it has to be processed and inserted into the database. I had no problems with this either:
parser.py
import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0')  # Turn off write confirmation; this speeds things up severalfold.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, "
              "lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files\\{}.json".format(i)
                f = open(filename)
                data = json.load(f)
                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data
                # For the sake of database readability you can sacrifice code readability. Or not?
                # If you think so, just replace the tuple with the data argument. Up to you.
                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                          (id, time_published, author, title, content, lang,
                           comments_count, reading_count, score, is_tutorial, tags_string))
                f.close()
            except IOError:
                logging.info('FileNotExists')
                continue
    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)
Statistics

Well, traditionally, at the end, you can extract some statistics from the data:

Of the 490,406 expected downloads, only about 228 thousand articles actually came through. It turns out that more than half of the articles on Habr have been hidden or deleted.
The whole database, consisting of almost half a million articles, weighs 2.95 GB. Compressed: 495 MB.
In total, there are 37,804 authors on Habr. I remind you that these statistics cover only live articles.
The most productive author on Habr is alizar, with 8,774 articles.
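Stats like these can be pulled from the articles table with plain SQL. A minimal sketch on toy data (the three sample rows below are made up for illustration; against the real database the same queries would run over habr.db):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE articles(id INTEGER, time_published TEXT, author TEXT, "
          "title TEXT, content TEXT, lang TEXT, comments_count INTEGER, "
          "reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
rows = [
    (1, "2019-01-01", "alizar", "a", "", "ru", 0, 0, 0, 0, ""),
    (2, "2019-01-02", "alizar", "b", "", "ru", 0, 0, 0, 0, ""),
    (3, "2019-01-03", "someone", "c", "", "en", 0, 0, 0, 0, ""),
]
c.executemany("INSERT INTO articles VALUES (?,?,?,?,?,?,?,?,?,?,?)", rows)

# Number of distinct authors
author_count = c.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone()[0]
# The most productive author and their article count
top_author, top_n = c.execute(
    "SELECT author, COUNT(*) AS n FROM articles GROUP BY author ORDER BY n DESC LIMIT 1"
).fetchone()
print(author_count, top_author, top_n)  # 2 alizar 2
```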