Hi everyone. It's been two years since my last article about parsing Habr, and some things have changed since then.
When I decided I wanted my own copy of Habr, I set out to write a parser that would save all of the authors' content to a database. How it went and what mistakes I ran into — read on under the cut.
To start with, I decided to prototype a script that would parse each article right after downloading it and put it into the database. Without thinking twice, I went with sqlite3, because it takes less effort: no local server to keep running, nothing to create, inspect, and delete along the way.
one_thread.py
from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            # Log requests that failed altogether
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if r.status_code != 200:
            print("{} - {}".format(i, r.status_code))
            continue
        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')
        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)
Everything by the book: BeautifulSoup, requests, and a quick prototype is ready. Except that...

The page is downloaded in a single thread.

If you interrupt the script, the whole database goes nowhere, since the commit happens only after all the parsing is done. You could, of course, commit after every insert, but then the script's execution time grows substantially.
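There is also a middle ground between one giant transaction and a commit per row — a sketch of my own, not something the script above does: commit every N inserts, so an interruption loses at most one batch. The schema matches one_thread.py; the batch size is illustrative.

```python
import sqlite3

def save_batched(rows, batch_size=1000):
    # In-memory DB for the sketch; the real script would use 'habr.db'.
    conn = sqlite3.connect(':memory:')
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), "
              "title VARCHAR(255), content TEXT, tags TEXT)")
    for n, row in enumerate(rows, 1):
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        if n % batch_size == 0:
            conn.commit()  # at most batch_size rows are lost on interruption
    conn.commit()  # flush the final, possibly partial batch
    return conn

conn = save_batched([(i, 'author', 'title', 'text', '') for i in range(2500)])
print(conn.execute('SELECT COUNT(*) FROM habr').fetchone()[0])  # 2500
```

With batches of a thousand rows, commit overhead becomes negligible while the worst-case loss stays bounded.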
Parsing the first 100,000 articles took me hours.
Then I came across an article by another user, from which I picked up several life hacks to speed this process up:

Using multithreading speeds up the download considerably.
You can fetch not the full version of Habr, but its mobile version.

For example, an article that weighs 378 KB in the desktop version is only 126 KB in the mobile one.
Version two. Many threads, a temporary ban from Habr
While googling multithreading in Python and settling on the simplest option, multiprocessing.dummy, I noticed that the multithreading brought problems with it.

SQLite3 refuses to work with more than one thread.
check_same_thread=False fixes that, but it isn't the only error: when inserting into the database, errors would occasionally occur that I couldn't resolve.
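A common workaround for exactly this problem — a sketch for illustration only, not what I ended up doing, and names like write_queue are mine — is to route all writes through a single writer thread via a queue, so that only one thread ever touches the SQLite connection:

```python
import sqlite3
import threading
import queue

write_queue = queue.Queue()
STOP = object()   # sentinel telling the writer to shut down
results = []      # lets the sketch report the final row count

def writer():
    # The only thread that ever touches the SQLite connection.
    conn = sqlite3.connect(':memory:')  # in-memory DB for the sketch
    conn.execute('CREATE TABLE habr(id INT, title TEXT)')
    while True:
        item = write_queue.get()
        if item is STOP:
            break
        conn.execute('INSERT INTO habr VALUES (?, ?)', item)
    conn.commit()
    results.append(conn.execute('SELECT COUNT(*) FROM habr').fetchone()[0])

t = threading.Thread(target=writer)
t.start()
for i in range(100):                  # worker threads would put() here instead
    write_queue.put((i, 'title'))
write_queue.put(STOP)
t.join()
print(results[0])  # 100
```

queue.Queue is thread-safe, so any number of download threads can put() rows without touching SQLite themselves.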
So I decided to drop the immediate insertion of articles straight into the database and, remembering the solution from that article, decided to use files, since there are no problems with multi-threaded writing to a file.

Habr starts banning you for using more than three threads.
Particularly zealous attempts to get through to Habr can end in an IP ban for a couple of hours. So you have to make do with just 3 threads, but even that is an improvement: the time to fetch 100 articles dropped from 26 to 12 seconds.

It's worth noting that this version is rather unstable, and the download periodically fails on large numbers of articles.
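One way to soften that flakiness — a hypothetical helper of mine, not part of the original scripts — is to retry failed requests with a growing delay. The fetch function is injected so the sketch can be exercised without touching the network:

```python
import time

def get_with_retries(fetch, url, retries=3, delay=2.0):
    # fetch is e.g. requests.get; injected to keep the sketch testable offline
    last_exc = None
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as exc:
            last_exc = exc
            time.sleep(delay * (attempt + 1))  # linear backoff between attempts
    raise last_exc

# Usage with a fake fetcher that fails twice, then succeeds:
calls = {'n': 0}
def flaky(url):
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError("boom")
    return "ok"

print(get_with_retries(flaky, "https://m.habr.com/post/1", delay=0))  # ok
```

In the real worker you would wrap the requests.get call in this helper instead of writing the id to req_errors.txt on the first failure.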
async_v1.py
from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = os.path.join("files", "{}.json".format(i))
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1
    url = "https://m.habr.com/post/{}".format(i)
    try:
        r = requests.get(url)
    except:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2
    # Log requests blocked by the server
    if r.status_code == 503:
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
        logging.warning('{} / 503 Error'.format(i))
    # If the post does not exist or has been hidden
    if r.status_code != 200:
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')
    try:
        author = soup.find(class_="tm-user-info__username").get_text()
        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']
        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]
        # Flag marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()
        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
    # Save the article as JSON
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Parameters min and max are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])
    os.makedirs("files", exist_ok=True)  # make sure the output directory exists
    # With more than 3 threads Habr temporarily bans the IP
    pool = ThreadPool(3)
    # Start the timer, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))
    # Print the elapsed time once all threads are done
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
Version three. Final

While debugging the second version, I discovered that Habr suddenly has an API, which the mobile version of the site uses. It loads faster than the mobile version, since it's just JSON that doesn't even need parsing. In the end, I decided to rewrite my script yet again.

So, having found this link to the API, you can start parsing with it.
async_v2.py
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = os.path.join("files", "{}.json".format(i))
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1
    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)
    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2
    data = json.loads(r.text)
    if data['success']:
        article = data['data']['article']
        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']
        data = (id, is_tutorial, time_published, title, content, comments_count,
                lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("Parameters min and max are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])
    os.makedirs("files", exist_ok=True)  # make sure the output directory exists
    # With more than 3 threads Habr temporarily bans the IP
    pool = ThreadPool(3)
    # Start the timer, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))
    # Print the elapsed time once all threads are done
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
It contains fields related both to the article itself and to its author.
API.png
I didn't save the full JSON of each article, only the fields I needed:

id
is_tutorial
time_published
title
content
comments_count
lang — the language the article is written in. So far there are only en and ru.
tags_string — all the tags of the post
reading_count
author
score — the article's rating.
So, using the API, I cut the script's execution time down to 8 seconds per 100 urls.

Once the data we need has been downloaded, it has to be processed and loaded into the database. No problems there either:
parser.py
import os
import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0')  # Disable write confirmation; this speeds things up many times over.
    c.execute("""CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT,
                 title TEXT, content TEXT, lang TEXT, comments_count INTEGER,
                 reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)""")
    try:
        for i in range(min, max):
            try:
                filename = os.path.join("files", "{}.json".format(i))
                with open(filename) as f:
                    data = json.load(f)
                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data
                # For the sake of database readability you can sacrifice code readability. Or not?
                # If you think so, just replace the tuple with the data argument. Your call.
                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                          (id, time_published, author, title, content, lang,
                           comments_count, reading_count, score, is_tutorial, tags_string))
            except IOError:
                logging.info('FileNotExists')
                continue
    finally:
        conn.commit()
        conn.close()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)
Statistics
Well, traditionally, to finish things off, a few statistics pulled from the data:

Of the 490,406 articles expected, only some 228 thousand were downloaded. It turns out that more than half of the articles on Habr had been hidden or deleted.
The entire database, holding its nearly half a million articles, weighs 2.95 GB; compressed, 495 MB.
In total, there are 37,804 authors on Habr. Keep in mind that these statistics only cover live posts.
The most prolific author on Habr is alizar, with 8,774 articles.
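The numbers above boil down to a couple of SQL queries against the finished database. A self-contained sketch, using a tiny in-memory table in place of habr.db (swap ':memory:' for 'habr.db' and drop the sample inserts to run it for real; column names match parser.py):

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for 'habr.db'
conn.execute('CREATE TABLE articles(id INTEGER, author TEXT, title TEXT)')
conn.executemany('INSERT INTO articles VALUES (?, ?, ?)',
                 [(1, 'alizar', 'a'), (2, 'alizar', 'b'), (3, 'someone', 'c')])

# How many distinct authors have live posts
authors = conn.execute('SELECT COUNT(DISTINCT author) FROM articles').fetchone()[0]

# The most prolific author
top = conn.execute('SELECT author, COUNT(*) AS cnt FROM articles '
                   'GROUP BY author ORDER BY cnt DESC LIMIT 1').fetchone()

print(authors)  # 2
print(top)      # ('alizar', 2)
```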