Good afternoon. It has been 2 years since this was last written about.
When I wanted to have my own copy of Habr, I decided to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into - you can read under the cut.
TLDR-
The first version of the parser. One thread, many problems
To begin with, I decided to make a prototype script in which an article would be parsed immediately upon download and put into the database. Without thinking twice, I went with sqlite3, because it required less effort: no need for a local server - create it, look at it, delete it, and that sort of thing.
one_thread.py
from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")

    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.RequestException:
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if r.status_code != 200:
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)
Everything is classic - we use Beautiful Soup and requests, and a quick prototype is ready. That's just…

- The page is downloaded in a single thread.
- If you interrupt the execution of the script, the whole database goes nowhere: after all, the commit happens only once, after all the processing. Yes, you can commit to the database after every single insertion, but then the script's execution time grows considerably.
- Parsing the first 100 articles took me 8 hours.

Next I came across a user's article from which I learned:

- Using multithreading speeds up downloading many times over.
- You can fetch not the full version of habr, but its mobile version. For example, if a linked article weighs 378 KB in the desktop version, then in the mobile version it is already 126 KB.
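A compromise between one giant transaction and a commit per row is to commit in batches. A minimal sketch of the idea - the `insert_batched` helper, its `db` parameter and the batch size are illustrative, not part of the original script:

```python
import sqlite3

def insert_batched(rows, db='habr.db', batch_size=100):
    # rows: iterable of (id, author, title, content, tags) tuples
    conn = sqlite3.connect(db)
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), "
              "title VARCHAR(255), content TEXT, tags TEXT)")
    pending = 0
    for row in rows:
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', row)
        pending += 1
        if pending >= batch_size:
            conn.commit()  # at most batch_size rows are lost on interruption
            pending = 0
    conn.commit()
    c.execute('SELECT COUNT(*) FROM habr')
    total = c.fetchone()[0]
    conn.close()
    return total
```

This way an interrupted run loses at most one batch instead of everything, without paying the per-row commit cost.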
Second version. Many threads, a temporary ban from Habr
While scouring the Internet on the topic of multithreading in python, I chose the simplest option with multiprocessing.dummy, and noticed that problems arrived together with multithreading.
SQLite3 does not want to work with more than one thread.
It is fixed with check_same_thread=False
, but this error is not the only one: when trying to insert into the database, errors sometimes occur that I could not fix.
So I decided to abandon inserting articles straight into the database on the spot and, remembering the linked solution, to use files instead, since there are no problems with multithreaded writing to a file.
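Another way around SQLite's threading restriction, besides writing to files, would be to funnel all inserts through a single writer thread via a queue. This is a hypothetical sketch, not the approach taken in the article (the `writer` function, the `habr_demo.db` name and the two-column table are made up for illustration):

```python
import sqlite3
import threading
import queue

def writer(db_path, q):
    # The only thread that ever touches the connection,
    # so check_same_thread can stay at its default
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS habr(id INT, content TEXT)")
    while True:
        row = q.get()
        if row is None:  # sentinel: no more work
            break
        conn.execute('INSERT INTO habr VALUES (?, ?)', row)
    conn.commit()
    conn.close()

q = queue.Queue()
t = threading.Thread(target=writer, args=('habr_demo.db', q))
t.start()

# Worker threads would only ever call q.put() - no sqlite calls at all
for i in range(3):
    q.put((i, "article {}".format(i)))
q.put(None)
t.join()
```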
Habr starts banning you for using more than three threads.
Particularly zealous attempts to get through to Habr can end with an ip ban for a couple of hours. So you have to use just three threads, but even that is already good, since the time to iterate over 100 articles dropped from 26 to 12 seconds.
It is worth noting that this version is rather unstable, and on a large number of articles the downloads periodically fall off.
async_v1.py
from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)
    try:
        r = requests.get(url)
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log the requests blocked by the server
    if r.status_code == 503:
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
        logging.warning('{} / 503 Error'.format(i))

    # The post does not exist or has been hidden
    if r.status_code != 200:
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()
        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']
        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]
        # Flags marking the post as a translation or a tutorial
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()
        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Dump the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads Habr temporarily bans your ip
    pool = ThreadPool(3)

    # Start the timer, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
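The instability could probably be smoothed out with retries and a pause after a 503. A sketch of the idea - the `get_with_retries` helper, the retry count and the delays are my assumptions, not part of the original script:

```python
import time
import requests

def get_with_retries(url, retries=3, delay=5):
    """Return a response, retrying on connection errors and 503s."""
    for attempt in range(retries):
        try:
            r = requests.get(url)
        except requests.RequestException:
            time.sleep(delay)  # network hiccup: wait and retry
            continue
        if r.status_code == 503:
            # Back off progressively while Habr is rejecting us
            time.sleep(delay * (attempt + 1))
            continue
        return r
    return None  # give up; the caller logs the id and moves on
```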
Third version. Final
While finishing the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site talks to. It loads faster than the mobile version, since it is plain json that does not even really need to be parsed. In the end, I decided to rewrite my script yet again.
So, having found it:
async_v2.py
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files\{}.json".format(i)
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)
    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)
    if data['success']:
        article = data['data']['article']
        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']
        data = (id, is_tutorial, time_published, title, content, comments_count,
                lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads Habr temporarily bans your ip
    pool = ThreadPool(3)

    # Start the timer, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Print the elapsed time once all threads have finished
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
It contains fields relating both to the article itself and to the author who wrote it.
API.png
I did not dump the full json of every article, but kept only the fields I needed:
- id
- is_tutorial
- time_published
- title
- content
- comments_count
- lang - the language the article is written in. So far it only takes the values en and ru.
- tags_string - all the tags from the post
- reading_count
- author
- score - the article's rating.
Thus, by using the API, I reduced the script's running time to 8 seconds per 100 urls.
After downloading the data we need, it has to be processed and put into the database. I had no problems with that either:
parser.py
import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0')  # Disable write confirmation, the speed goes up several times
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, "
              "lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files\{}.json".format(i)
                f = open(filename)
                data = json.load(f)
                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data
                # For the sake of database readability you can sacrifice code readability. Or not?
                # If you think so, just replace the tuple with the data argument. Your call.
                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                          (id, time_published, author, title, content, lang,
                           comments_count, reading_count, score, is_tutorial, tags_string))
                f.close()
            except IOError:
                logging.info('FileNotExists')
                continue
    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)
Statistics
Well, traditionally, at the end, you can pull some statistics out of the data:
- Of the expected 490 406 downloads, only 228 thousand-odd articles were actually fetched. It turns out that more than half of the articles on Habré had been hidden or deleted.
- The entire database, holding close to half a million records, weighs 2.95 GB. In compressed form - 495 MB.
- In total, there are 37804 authors on Habré. Let me remind you that these statistics come only from the articles that are still live.
- The most prolific author on Habré is alizar - 8774 articles.
- The top-rated article has 1448 pluses. The most-read article has 1660841 views. The most-discussed article has 2444 comments.
Well, in the form of tops:
Top 15 authors
Top 15 by rating
Top 15 by reads
Top 15 by comments
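Tops like these boil down to simple aggregate queries over the articles table from parser.py. A sketch, assuming that schema (the helper names are mine):

```python
import sqlite3

def top_authors(conn, limit=15):
    # Most prolific authors by article count
    return conn.execute(
        "SELECT author, COUNT(*) AS posts FROM articles "
        "GROUP BY author ORDER BY posts DESC LIMIT ?", (limit,)).fetchall()

def top_by_score(conn, limit=15):
    # Highest-rated articles
    return conn.execute(
        "SELECT title, score FROM articles "
        "ORDER BY score DESC LIMIT ?", (limit,)).fetchall()
```

The most-read and most-discussed tops are the same query with reading_count or comments_count in the ORDER BY clause.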
Source: www.habr.com