Good afternoon. I wrote this 2 years ago.

Wanting to have a copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it turned out and what errors I ran into - you can read under the cut.
TL;DR -
The first version of the parser. One thread, many problems

To start with, I decided to make a prototype script in which an article would be parsed immediately after download and placed into the database. Without thinking twice, I used sqlite3, because it was less labor-intensive: no need to run a local server - create it, look at it, delete it, and so on.
one_thread.py
```python
from bs4 import BeautifulSoup
import sqlite3
import requests
import os
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if r.status_code != 200:
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')
        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)
```
Everything follows the classics - we use Beautiful Soup and requests, and a quick prototype is ready. Except that...

- The page is downloaded in a single thread.

- If you interrupt the script, the whole database goes nowhere, because the commit is performed only after all the parsing is done. You can, of course, commit after every insert, but then the script's run time grows many times over.

Parsing the first 100,000 articles took me about 7 hours.
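The trade-off can be sketched against a scratch SQLite file (the table and row count below are made up for illustration): committing after every insert keeps the already-parsed rows safe if the script is interrupted, but costs a lot of run time compared to one big transaction.

```python
import sqlite3
import tempfile, os
from datetime import datetime

def insert_rows(commit_each, n=500):
    # A scratch database file, so commits actually hit disk.
    path = os.path.join(tempfile.mkdtemp(), "habr_test.db")
    conn = sqlite3.connect(path)
    c = conn.cursor()
    c.execute("CREATE TABLE habr(id INT, title TEXT)")
    start = datetime.now()
    for i in range(n):
        c.execute("INSERT INTO habr VALUES (?, ?)", (i, "title"))
        if commit_each:
            conn.commit()  # a transaction per row: survives interruption, but slow
    if not commit_each:
        conn.commit()      # one big transaction: fast, but lost if interrupted
    elapsed = datetime.now() - start
    count = c.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
    conn.close()
    return elapsed, count

slow, n1 = insert_rows(commit_each=True)
fast, n2 = insert_rows(commit_each=False)
print(n1, n2)  # both variants insert every row; only the timing differs
```

On my runs the per-row-commit variant is slower by an order of magnitude or more, which is why the prototype commits only once at the end.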
Then I came across a user's article from which I learned that:

- Multithreading speeds up the download considerably.

- You can fetch not the full version of Habr, but its mobile version.

For example, if an article weighs 378 KB in the desktop version, the mobile version is only 126 KB.
The second version. Many threads, a temporary ban from Habr

While scouring the internet on the subject of multithreading in Python, I picked the simplest option, multiprocessing.dummy, and noticed that problems arrived together with multithreading.
SQLite3 refuses to work with more than one thread.
This is fixed with check_same_thread=False, but it is not the only error: when trying to insert into the database, errors sometimes occur that I could not resolve.
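The check_same_thread behaviour is easy to see in isolation: by default a sqlite3 connection refuses to be used from another thread, and the flag merely silences that check - it does not make concurrent inserts safe, which matches the errors above.

```python
import sqlite3
import threading

conn = sqlite3.connect(":memory:")  # default: check_same_thread=True
errors = []
results = []

def query(connection, out):
    # Try to use the connection from a worker thread.
    try:
        out.append(connection.execute("SELECT 1").fetchone()[0])
    except sqlite3.ProgrammingError as e:
        errors.append(e)

t = threading.Thread(target=query, args=(conn, results))
t.start(); t.join()
print(len(errors))  # 1 - the default connection raises in a foreign thread

shared = sqlite3.connect(":memory:", check_same_thread=False)
t = threading.Thread(target=query, args=(shared, results))
t.start(); t.join()
print(results)  # [1] - the flag allows cross-thread use
```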
So I decided to abandon inserting articles straight into the database and, remembering the solution with files, decided to use them, since there are no problems with multithreaded writes to a file.

Habr starts banning when more than three threads are used.
Especially zealous attempts to reach Habr can end in an IP ban for a couple of hours. So you have to limit yourself to 3 threads, but even that is already good, since the time to iterate over 100 articles drops from 26 to 12 seconds.

It is worth noting that this version is rather unstable, and the download periodically fails on large batches of articles.
async_v1.py
```python
from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = os.path.join("files", "{}.json".format(i))
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)
    try:
        r = requests.get(url)
    except:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests blocked by the server
    if r.status_code == 503:
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
        logging.warning('{} / 503 Error'.format(i))

    # The post does not exist or has been hidden
    if r.status_code != 200:
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')
    try:
        author = soup.find(class_="tm-user-info__username").get_text()
        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']
        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]
        # Marks whether the post is a translation or a tutorial
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()
        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Save the article as json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)

    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the clock and the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads are done
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
```
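Part of the instability can be smoothed over by retrying 503 responses with a small backoff instead of giving up at once. A sketch around a stub fetch function - flaky_fetch and the delays are invented for the example; the real script would pass a requests-based callable returning the status code:

```python
import time

def fetch_with_retry(fetch, i, retries=3, delay=0.01):
    # Retry a few times on 503 before giving up; 'fetch' is any callable
    # taking an article id and returning an HTTP status code.
    for attempt in range(retries):
        status = fetch(i)
        if status != 503:
            return status
        time.sleep(delay * (attempt + 1))  # back off a little longer each time
    return 503

calls = []
def flaky_fetch(i):
    # Stub: fail twice with 503, then succeed.
    calls.append(i)
    return 503 if len(calls) < 3 else 200

result = fetch_with_retry(flaky_fetch, 1)
print(result, len(calls))  # 200 3
```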
The third version. Final

While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site accesses. It loads faster than the mobile version, since it is just json, which doesn't even need to be parsed. In the end, I decided to rewrite my script once again.

So, having found this endpoint, we can start parsing it.
async_v2.py
```python
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = os.path.join("files", "{}.json".format(i))
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)
    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)
    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count,
                lang, tags_string, reading_count, author, score)

        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)

    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the clock and the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads are done
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
```
The response contains fields relating both to the article itself and to the author who wrote it.

API.png

I did not dump the full json of every article, but saved only the fields I need:
- id
- is_tutorial
- time_published
- title
- content
- comments_count
- lang - the language the article is written in. So far it contains only en and ru.
- tags_string - all the tags of the post
- reading_count
- author
- score - the article's rating.
This way, using the API, I cut the script's run time down to 8 seconds per 100 urls.

After we have downloaded the data we need, it has to be processed and put into the database. There were no problems with this either:
parser.py
```python
import json
import os
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0')  # Turn off write confirmation; this speeds things up many times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, "
              "title TEXT, content TEXT, lang TEXT, comments_count INTEGER, reading_count INTEGER, "
              "score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = os.path.join("files", "{}.json".format(i))
                f = open(filename)
                data = json.load(f)
                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of a readable database you can sacrifice code readability. Or not?
                # If you think so, just replace the tuple with the data argument. Up to you.
                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                          (id, time_published, author, title, content, lang,
                           comments_count, reading_count, score, is_tutorial, tags_string))
                f.close()
            except IOError:
                logging.info('FileNotExists')
                continue
    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)
```
Statistics

Well, traditionally, at the end, we can pull some statistics out of the data:
- Of the expected 490,406, only 228,512 articles were downloaded. It turns out that more than half (261,894) of the articles on Habr were hidden or deleted.

- The entire database, holding almost half a million articles, weighs 2.95 GB. Compressed - 495 MB.

- In total, 37,804 people are authors on Habr. Let me remind you that these statistics come from live posts only.

- The most productive author on Habr is alizar, with 8,774 articles. The highest-rated article has 1,448 pluses. The most-read article has 1,660,841 views. The most-discussed article has 2,444 comments.
Well, and in the form of tops:

Top 15 authors
Top 15 by rating
Top 15 by reads
Top 15 by comments
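Tops like these fall out of plain SQL over the articles table that parser.py builds. A sketch on toy rows - the sample data below is invented, only the schema matches the real one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE articles(id INTEGER, time_published TEXT, author TEXT, "
          "title TEXT, content TEXT, lang TEXT, comments_count INTEGER, "
          "reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
rows = [  # invented sample rows
    (1, "2019-01-01", "alice", "a", "", "ru", 10, 100, 5, 0, "dev"),
    (2, "2019-01-02", "alice", "b", "", "ru", 2000, 900000, 1400, 0, "news"),
    (3, "2019-01-03", "bob",   "c", "", "en", 1, 50, 3, 1, "dev"),
]
c.executemany("INSERT INTO articles VALUES (?,?,?,?,?,?,?,?,?,?,?)", rows)

# Most productive authors (the "Top 15 authors" table)
top_authors = c.execute(
    "SELECT author, COUNT(*) AS n FROM articles "
    "GROUP BY author ORDER BY n DESC LIMIT 15").fetchall()
# Highest-rated / most-read / most-discussed article; SQLite returns
# the row on which the aggregate MAX() is reached.
top_rated = c.execute("SELECT title, MAX(score) FROM articles").fetchone()
top_read = c.execute("SELECT title, MAX(reading_count) FROM articles").fetchone()
top_discussed = c.execute("SELECT title, MAX(comments_count) FROM articles").fetchone()
print(top_authors[0], top_rated, top_read, top_discussed)
```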
source: www.habr.com