Good afternoon. Two years have passed since the last article about parsing Habr was written, and some things have changed since then.
When I decided I wanted my own copy of Habr, I set out to write a parser that would save everything its authors had published into a database. How that went, and what errors I ran into, you can read under the cut.
The first version of the parser. One thread, many problems
To start with, I decided to build a script prototype that would parse an article and insert it into the database immediately after downloading it. Without thinking twice, I went with sqlite3, because it is small and handy: no local server to set up; create it, look at it, delete it, and so on.
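The "no server needed" point is easy to see; a minimal sketch, independent of the parser code below, using an in-memory database:

```python
import sqlite3

# An in-memory database: no server, no file, gone when the connection closes.
conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE habr(id INT, title TEXT)")
c.execute("INSERT INTO habr VALUES (?, ?)", (1, "Test post"))
conn.commit()
row = c.execute("SELECT title FROM habr WHERE id = 1").fetchone()
conn.close()
```

Swapping `':memory:'` for a filename like `'habr.db'` is the only change needed to make the data persistent.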
one_thread.py
from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.RequestException:
            # Log failed requests instead of silently dropping them
            with open("req_errors.txt", "a") as file:
                file.write("{}\n".format(i))
            continue
        if r.status_code != 200:
            print("{} - {}".format(i, r.status_code))
            continue
        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')
        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = str(soup.find(id="post-content-body"))
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except AttributeError:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)
If you interrupt the script, all the data goes nowhere: the transaction is only committed once the entire parse is finished.
Of course, you could commit after every insert, but then the script's running time would increase considerably.
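A middle ground between one giant transaction and a commit per insert is to commit every N rows, so an interruption loses at most one batch; a sketch of the idea (the batch size of 1000 is an arbitrary choice, not from the original script):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
c.execute("CREATE TABLE habr(id INT, title TEXT)")

BATCH = 1000  # arbitrary batch size; tune it against your disk speed
for i in range(5000):
    c.execute("INSERT INTO habr VALUES (?, ?)", (i, "post {}".format(i)))
    if i % BATCH == 0:
        conn.commit()  # at most BATCH rows are lost if the script dies here
conn.commit()  # flush the tail of the last batch
count = c.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
```
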
Parsing the first 100 000 articles took me many hours.
Next, I came across another user's article on the subject, read it, and found a few life hacks to speed the process up:
Using multithreading speeds up the download considerably.
You can fetch not the full version of Habr, but its mobile version.
For example, if a given article weighs 378 KB in the desktop version, the mobile version is already only 126 KB.
The second version. Many threads, a temporary ban from Habr
When I scoured the internet on the subject of multithreading in Python and picked the simplest option, multiprocessing.dummy, I discovered that problems came along with the threads.
SQLite3 does not want to work with more than one thread.
That is fixed with check_same_thread=False, but it is not the only error: when trying to insert into the database, errors sometimes occur that I could not resolve.
So I decided to abandon inserting articles directly into the database on the fly and, remembering the solution from the article mentioned above, decided to use files, since there are no problems with writing to files from many threads.
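For reference, the flag mentioned above is passed to sqlite3.connect(); it only disables the module's thread-ownership check, so concurrent writes to one connection still need to be serialized by hand. A sketch of that pattern (not the author's final solution, which switched to files):

```python
import sqlite3
import threading

# check_same_thread=False lets other threads use this connection,
# but sqlite3 itself does not make concurrent writes safe.
conn = sqlite3.connect(':memory:', check_same_thread=False)
conn.execute("CREATE TABLE habr(id INT)")
lock = threading.Lock()

def insert(i):
    with lock:  # serialize writes so threads don't collide inside sqlite3
        conn.execute("INSERT INTO habr VALUES (?)", (i,))

threads = [threading.Thread(target=insert, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = conn.execute("SELECT COUNT(*) FROM habr").fetchone()[0]
```
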
Habr started banning me for using more than three threads.
Particularly zealous attempts to hammer Habr can end with an IP ban for a couple of hours. So only 3 threads can be used, but even that is already good, since the time to iterate over 100 articles dropped from 26 to 12 seconds.
It is worth noting that this version is rather unstable, and downloads periodically fail on large batches of articles.
async_v1.py
from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1
    url = "https://m.habr.com/post/{}".format(i)
    try:
        r = requests.get(url)
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write("{}\n".format(i))
        return 2
    # Log requests blocked by the server
    if r.status_code == 503:
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
        logging.warning('{} / 503 Error'.format(i))
    # If the post does not exist or was hidden
    if r.status_code != 200:
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')
    try:
        author = soup.find(class_="tm-user-info__username").get_text()
        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']
        content = str(soup.find(id="post-content-body"))
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]
        # Label marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()
        rating = soup.find(class_="tm-votes-score").get_text()
    except AttributeError:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
    # Write the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])
    # With more than 3 threads,
    # Habr bans the IP for a while
    pool = ThreadPool(3)
    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))
    # After all threads finish, print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
The third version. Final
While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site uses. It loads faster than the mobile pages, since it is just JSON that does not even need to be parsed. In the end, I decided to rewrite my script once more.
Having downloaded the data we need, the next step is to process it and load it into the database. I had no problems with that either:
parser.py
import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0')  # Disable write confirmation; this speeds things up many times over.
    c.execute("""CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT,
                 lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)""")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                with open(filename) as f:
                    data = json.load(f)
                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data
                # For the sake of database readability you can sacrifice code readability. Or not?
                # If you think so, just replace the tuple with the data argument. Up to you.
                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)', (id, time_published, author,
                                                                                            title, content, lang,
                                                                                            comments_count, reading_count,
                                                                                            score, is_tutorial,
                                                                                            tags_string))
            except IOError:
                logging.info('FileNotExists')
                continue
    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)
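The tuple juggling in parser.py exists because the json files store the fields in the order the downloader saved them, which differs from the column order of the articles table. The reordering can be isolated in a small helper; a sketch (the sample record below is made up purely for illustration):

```python
def to_db_row(data):
    # Field order as stored in the json files
    (id, is_tutorial, time_published, title, content, comments_count, lang,
     tags_string, reading_count, author, score) = data
    # Column order of the articles table
    return (id, time_published, author, title, content, lang,
            comments_count, reading_count, score, is_tutorial, tags_string)

# A made-up record, only to show the reordering
sample = [1, 0, "2019-06-29", "Title", "<p>text</p>", 5, "ru",
          "python habr", 100, "alizar", 10]
row = to_db_row(sample)
```
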
Statistics
Well, traditionally, at the end, you can pull some statistics out of the data:
Of the 490 406 posts expected, only about 228 000 articles were actually downloaded. It turns out that more than half of the articles on Habr had been hidden or deleted.
All the data, covering almost half a million posts, weighs 2.95 GB. Compressed: 495 MB.
In total, 37 804 people are authors on Habr. A reminder: these statistics only count live articles.
The most prolific author on Habr is alizar, with 8 774 articles.
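Figures like these come straight out of SQL against the articles table; a sketch with toy data standing in for the real database (the numbers above came from the full half-million-row table):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
# A cut-down articles table with only the columns the queries need
c.execute("CREATE TABLE articles(id INTEGER, author TEXT)")
c.executemany("INSERT INTO articles VALUES (?, ?)",
              [(1, "alizar"), (2, "alizar"), (3, "marks")])

# Number of distinct authors
authors = c.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone()[0]
# Most prolific author and their article count
top = c.execute("SELECT author, COUNT(*) AS n FROM articles "
                "GROUP BY author ORDER BY n DESC LIMIT 1").fetchone()
```
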