Good afternoon. Two years have passed since the last article about parsing Habr was written, and some things have changed since then.
When I wanted to have my own copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into, you can read under the cut.
To start with, I decided to make a version of the script that would parse each page and put it into the database immediately on download. Without thinking twice I went with sqlite3, because it takes less effort: no need for a local server; create it, look at it, delete it, and so on.
one_thread.py
from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.RequestException:
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if r.status_code != 200:
            print("{} - {}".format(i, r.status_code))
            continue
        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')
        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = str(soup.find(id="post-content-body"))
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except AttributeError:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)
Everything is classic: we take Beautiful Soup and requests, and a quick prototype is ready. That's just...
The pages download in a single thread
If you interrupt the script, the whole database goes nowhere, because the commit is performed only after all the parsing is finished.
Of course, you can commit to the database after every insertion, but then the execution time grows considerably.
Parsing just the first 100 000 articles took me hours.
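The cost of committing after every insert is easy to demonstrate in isolation. A minimal sketch (a throwaway table on a temp-file database, not the article's schema) that times both strategies:

```python
import os
import sqlite3
import tempfile
from datetime import datetime

def insert_rows(per_row_commit, n=5000):
    """Insert n dummy rows into a throwaway on-disk database,
    committing either after every row or once at the end."""
    fd, path = tempfile.mkstemp(suffix=".db")
    os.close(fd)
    conn = sqlite3.connect(path)
    c = conn.cursor()
    c.execute("CREATE TABLE t(id INT, payload TEXT)")
    start = datetime.now()
    for i in range(n):
        c.execute("INSERT INTO t VALUES (?, ?)", (i, "x" * 100))
        if per_row_commit:
            conn.commit()
    conn.commit()
    elapsed = (datetime.now() - start).total_seconds()
    count = c.execute("SELECT COUNT(*) FROM t").fetchone()[0]
    conn.close()
    os.remove(path)
    return count, elapsed

rows_slow, t_slow = insert_rows(per_row_commit=True)
rows_fast, t_fast = insert_rows(per_row_commit=False)
print(t_slow, t_fast)  # the single big transaction is dramatically faster on disk
```

Every per-row commit forces sqlite to sync the journal to disk, which is why one big transaction wins by orders of magnitude.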
Then I came across a linked article by another user, read it, and found a few life hacks for speeding the process up:
Using multithreading speeds up the download many times over.
You can fetch not the full version of Habr but its mobile version.
For example, if a linked article weighs 378 KB in the desktop version, in the mobile version it is already down to 126 KB.
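The saving from the mobile version can be quantified with a couple of helpers. A sketch; the helper functions and the post id are illustrative, not part of the original script:

```python
import requests

def size_kb(body: bytes) -> float:
    """Response body size in kilobytes."""
    return round(len(body) / 1024, 1)

def savings_percent(desktop_kb: float, mobile_kb: float) -> int:
    """How much smaller the mobile page is, in percent."""
    return round(100 * (1 - mobile_kb / desktop_kb))

def compare(post_id):
    """Fetch both versions of a post and report their sizes (hits the network)."""
    desktop = requests.get("https://habr.com/post/{}".format(post_id))
    mobile = requests.get("https://m.habr.com/post/{}".format(post_id))
    return size_kb(desktop.content), size_kb(mobile.content)

# With the figures from the text: a 378 KB desktop page vs a 126 KB mobile page
print(savings_percent(378, 126))  # 67, i.e. about two thirds saved
```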
Version two. Many threads, a temporary ban from Habr
While scouring the internet on the subject of multithreading in Python, I picked the simplest option, multiprocessing.dummy, and noticed that problems arrived together with multithreading.
SQLite3 refuses to work with more than one thread.
check_same_thread=False fixes that, but it is not the only error: when trying to insert into the database, errors sometimes occur that I could not resolve.
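A common workaround for this (not the route taken in this article) is to let exactly one thread own the sqlite3 connection and feed it rows through a queue, so downloader threads never touch the database directly. A sketch with made-up helper names:

```python
import queue
import sqlite3
import threading

def run_with_single_writer(rows, db_path=":memory:"):
    """Insert rows produced by several threads through one writer thread."""
    q = queue.Queue()
    counts = []

    def writer():
        # only this thread ever touches the connection
        conn = sqlite3.connect(db_path)
        c = conn.cursor()
        c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, title TEXT)")
        while True:
            row = q.get()
            if row is None:            # sentinel: producers are done
                break
            c.execute("INSERT INTO habr VALUES (?, ?)", row)
        conn.commit()
        counts.append(c.execute("SELECT COUNT(*) FROM habr").fetchone()[0])
        conn.close()

    w = threading.Thread(target=writer)
    w.start()

    def producer(chunk):
        # stands in for a download thread handing off parsed rows
        for row in chunk:
            q.put(row)

    producers = [threading.Thread(target=producer, args=(rows[i::3],))
                 for i in range(3)]    # 3 threads, matching the article
    for p in producers:
        p.start()
    for p in producers:
        p.join()
    q.put(None)
    w.join()
    return counts[0]

print(run_with_single_writer([(i, "post") for i in range(10)]))  # 10
```

The queue serializes all writes, so check_same_thread never has to be disabled at all.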
So I decided to drop the instant insertion of articles straight into the database and, remembering the linked solution, to use files instead, since there is no problem with many threads writing to files.
Habr starts banning when more than three threads are used.
Especially zealous attempts to get through to Habr can end with an ip ban for a couple of hours. So you have to settle for just 3 threads, but even that is already good: the time to iterate over 100 articles drops from 26 to 12 seconds.
It is worth noting that this version is rather unstable, and downloads periodically fail on large numbers of articles.
async_v1.py
from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1
    url = "https://m.habr.com/post/{}".format(i)
    try:
        r = requests.get(url)
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2
    # Log requests the server refused
    if r.status_code == 503:
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
        logging.warning('{} / 503 Error'.format(i))
    # The post does not exist or has been hidden
    if r.status_code != 200:
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')
    try:
        author = soup.find(class_="tm-user-info__username").get_text()
        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']
        content = str(soup.find(id="post-content-body"))
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]
        # Flag marking the post as a translation or a tutorial
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()
        rating = soup.find(class_="tm-votes-score").get_text()
    except (AttributeError, TypeError, KeyError):
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
    # Dump the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except Exception:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])
    # With more than 3 threads
    # Habr temporarily bans the ip
    pool = ThreadPool(3)
    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))
    # Print the elapsed time once all threads have closed
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
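The instability mentioned above can be softened by simply re-trying the ids that failed. A generic retry helper, as a sketch: the fetch callable stands in for worker, and the backoff numbers are arbitrary:

```python
import time

def retry(fetch, attempts=3, delay=1.0):
    """Call fetch() until it returns, doubling the pause after each failure."""
    last_exc = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:
            last_exc = exc
            time.sleep(delay * (2 ** attempt))   # simple exponential backoff
    raise last_exc

# e.g. retry(lambda: worker(i), attempts=3, delay=5.0) for every id
# that ended up in req_errors.txt
```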
Version three. Final
While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site uses. It loads faster than the mobile version, since it is just json that doesn't even need to be parsed. In the end I decided to rewrite my script once more.
So, having found this link to the API, you can start parsing it.
async_v2.py
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1
    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)
    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2
    data = json.loads(r.text)
    if data['success']:
        article = data['data']['article']
        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']
        data = (id, is_tutorial, time_published, title, content, comments_count,
                lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])
    # With more than 3 threads
    # Habr temporarily bans the ip
    pool = ThreadPool(3)
    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))
    # Print the elapsed time once all threads have closed
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
It contains fields relating both to the article itself and to the author who wrote it.
API.png
I didn't dump each article's full json, but saved only the fields I needed:
id
is_tutorial
time_published
title
content
comments_count
lang - the language the article is written in. So far it only contains en and ru.
tags_string - all the tags from the post
reading_count
author
score - the article's rating.
So, by using the API, I cut the script's run time down to 8 seconds per 100 urls.
Once the data we need has been downloaded, it has to be processed and put into the database. I had no problems with that either:
parser.py
import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0')  # Turn off write confirmation; this speeds things up many times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, "
              "lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                f = open(filename)
                data = json.load(f)
                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data
                # For the sake of a more readable database you can sacrifice code readability. Or not?
                # If you think so, just replace the tuple with the data argument. Up to you.
                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                          (id, time_published, author, title, content, lang,
                           comments_count, reading_count, score, is_tutorial, tags_string))
                f.close()
            except IOError:
                logging.info('FileNotExists')
                continue
    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)
Statistics
Well, traditionally, some statistics can finally be pulled out of the data:
Of the expected 490 406 posts, only some 228 thousand were downloaded. It turns out that more than half of the articles on Habr have been hidden or deleted.
The entire database, holding nearly half a million articles, weighs 2.95 GB. Archived, it is 495 MB.
In total, 37 804 people have published on Habr. Bear in mind that these statistics are drawn only from the live posts.
The most prolific author on Habr is alizar, with 8 774 articles.
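Figures like these fall out of a couple of SQL queries against the articles table. A sketch over toy rows (the real database is queried the same way; the sample data here is made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE articles(id INTEGER, author TEXT, title TEXT)")
# toy rows standing in for the real quarter-million articles
c.executemany("INSERT INTO articles VALUES (?, ?, ?)",
              [(1, "alizar", "a"), (2, "alizar", "b"), (3, "someone", "c")])

# how many distinct authors there are
author_count = c.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone()[0]

# the most prolific author and their article count
top_author = c.execute(
    "SELECT author, COUNT(*) AS n FROM articles "
    "GROUP BY author ORDER BY n DESC LIMIT 1").fetchone()

print(author_count, top_author)  # 2 ('alizar', 2)
```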