Hello. It has been more than two years since the last article about parsing Habr, and some things have changed since then.
When I wanted a copy of Habr of my own, I decided to write a parser that would save all of the authors' content to a database. How it went and what errors I ran into: read on under the cut.
First version of the parser. One thread, many problems
To begin with, I decided to make a prototype script that would parse an article and put it into the database as soon as it was downloaded. Without thinking twice, I went with sqlite3, because it meant less fuss: no need for a local server, just create it, look at it, delete it, and so on.
single_thread.py
from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except:
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if r.status_code != 200:
            print("{} - {}".format(i, r.status_code))
            continue
        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')
        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)
Everything here is classic: Beautiful Soup, requests, and a quick prototype is ready. Except that…
Pages are downloaded in a single thread.
If you interrupt the script, the whole database goes nowhere, because the commit happens only once all the parsing is done.
Of course, you can commit after each insert, but then the script's run time grows considerably.
Parsing the first 100,000 articles took me hours.
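That commit trade-off is easy to measure. A minimal sketch (toy table and made-up rows, not the real Habr schema) comparing one commit per insert against a single transaction over a file-backed SQLite database:

```python
import os
import sqlite3
import tempfile
from datetime import datetime

def insert_rows(db_path, rows, commit_each):
    # Insert rows, committing either after every row or once at the end.
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE t(id INT, content TEXT)")
    start = datetime.now()
    for row in rows:
        conn.execute("INSERT INTO t VALUES (?, ?)", row)
        if commit_each:
            conn.commit()  # one journal flush per row: this is what hurts
    conn.commit()          # a single flush for the whole batch
    elapsed = datetime.now() - start
    conn.close()
    return elapsed

rows = [(i, "some text") for i in range(2000)]
with tempfile.TemporaryDirectory() as d:
    slow = insert_rows(os.path.join(d, "slow.db"), rows, commit_each=True)
    fast = insert_rows(os.path.join(d, "fast.db"), rows, commit_each=False)
print("per-row commits:", slow)
print("one transaction:", fast)
```

On a spinning disk the gap is dramatic, since every commit forces a flush to the journal; a single transaction pays that cost once.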
Next I came across an article by the user cointegrated, which I read and from which I picked up a few life hacks for speeding this process up:
Using multithreading speeds up downloading considerably.
Fetch not the full version of Habr, but its mobile version.
For example, if a cointegrated article weighs 378 KB in the desktop version, the mobile version is already only 126 KB.
Second version. Many threads, a temporary ban from Habr
Scouring the Internet on the subject of multithreading in python, I picked the simplest option, multiprocessing.dummy, and noticed that problems arrived together with the multithreading.
SQLite3 refuses to work with more than one thread.
This is fixed with check_same_thread=False, but it is not the only problem: when inserting into the database, errors sometimes occurred that I was never able to resolve.
So I decided to drop inserting articles straight into the database on the fly and, remembering cointegrated's solution, to use files instead, since there are no problems with multithreaded writes to files.
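Writing one JSON file per article (as the script below does) sidesteps contention entirely: no two threads ever touch the same file. If you instead wanted all threads appending to one shared file, the usual fix is a threading.Lock around the write. A toy sketch (hypothetical worker and file name, not the real downloader):

```python
import json
import os
import threading
from multiprocessing.dummy import Pool as ThreadPool

OUT = "articles.jsonl"
write_lock = threading.Lock()

# Start from a clean file so reruns don't accumulate lines.
if os.path.exists(OUT):
    os.remove(OUT)

def worker(i):
    # Stand-in for the real download/parse step.
    line = json.dumps({"id": i, "title": "post {}".format(i)})
    # Without the lock, concurrent appends could interleave partial lines.
    with write_lock:
        with open(OUT, "a") as f:
            f.write(line + "\n")
    return i

pool = ThreadPool(3)
results = pool.map(worker, range(100))
pool.close()
pool.join()

with open(OUT) as f:
    n = sum(1 for _ in f)
print(n)
```

The per-article-file approach avoids even this lock, at the cost of a directory with hundreds of thousands of small files.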
Habr starts banning you for using more than three threads.
Particularly zealous attempts to reach Habr can end with an IP ban for a couple of hours. So you have to limit yourself to 3 threads, but even that is a win: the time to iterate over 100 articles drops from 26 to 12 seconds.
It is worth noting that this version is fairly unstable, and on a large number of articles the downloads periodically fall over.
async_v1.py
from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = os.path.join("files", "{}.json".format(i))
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)
    try:
        r = requests.get(url)
    except:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests the server blocked
    if r.status_code == 503:
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
        logging.warning('{} / 503 Error'.format(i))

    # The post does not exist or has been hidden
    if r.status_code != 200:
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()
        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']
        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]
        # Labels marking the post as a translation or a tutorial
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()
        rating = soup.find(class_="tm-votes-score").get_text()
    except:
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Dump the article to json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Close all threads, then print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
Third version. Final
While debugging the second version, I discovered that Habr, all of a sudden, has an API that the mobile version of the site uses. It loads faster than the mobile pages, since it is plain json that doesn't even need to be parsed. In the end, I decided to rewrite my script once more.
So, having found this API link, we can start parsing it.
async_v2.py
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = os.path.join("files", "{}.json".format(i))
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)
    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)
    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang,
                tags_string, reading_count, author, score)

        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("The min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # With more than 3 threads,
    # Habr temporarily bans the IP
    pool = ThreadPool(3)

    # Start the clock, launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # Close all threads, then print the elapsed time
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
It contains fields relating both to the article itself and to the author who wrote it.
API.png
I did not dump each article's full json, but kept only the fields I needed:
id
is_tutorial
time_published
title
content
comments_count
lang is the language the article is written in. So far it holds only en and ru.
tags_string - all of the post's tags
reading_count
author
score - the article's rating.
As a result, using the API, I cut the script's run time to 8 seconds per 100 urls.
Once we have downloaded the data we need, we have to process it and load it into the database. That didn't give me any trouble either:
parser.py
import json
import sqlite3
import logging
import os
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0')  # Turn off write confirmation; this speeds things up several times over.
    c.execute("""CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT,
                 title TEXT, content TEXT, lang TEXT, comments_count INTEGER, reading_count INTEGER,
                 score INTEGER, is_tutorial INTEGER, tags_string TEXT)""")
    try:
        for i in range(min, max):
            try:
                filename = os.path.join("files", "{}.json".format(i))
                f = open(filename)
                data = json.load(f)
                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data
                # Unpacking keeps the insert readable at the cost of verbosity. Or does it?
                # If you disagree, just replace the tuple below with the data list itself.
                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                          (id, time_published, author, title, content, lang,
                           comments_count, reading_count, score, is_tutorial, tags_string))
                f.close()
            except IOError:
                logging.info('FileNotExists')
                continue
    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)
Statistics
Well, traditionally, to finish off, here are some statistics pulled out of the data:
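Each of the figures below falls out of a one-line SQL query over the articles table. A minimal sketch against a toy in-memory database (made-up rows; only the columns the queries touch):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles(id INTEGER, author TEXT, score INTEGER)")
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)",
                 [(1, "alice", 10), (2, "bob", 5), (3, "alice", 7)])

# How many distinct authors are there?
authors = conn.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone()[0]

# Who wrote the most articles?
top = conn.execute("SELECT author, COUNT(*) AS n FROM articles "
                   "GROUP BY author ORDER BY n DESC LIMIT 1").fetchone()

print(authors)  # 2
print(top)      # ('alice', 2)
```

The same two queries, run against the real habr.db, produce the author counts below.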
Of the expected 490,406 articles, only a part were actually downloaded: it turns out that more than half of the posts on Habr (228 thousand and up) had been hidden or deleted.
The whole database, holding nominally half a million articles, weighs 2.95 GB. Compressed: 495 MB.
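That roughly six-fold shrink is no surprise: article bodies are redundant HTML, which compresses very well. A quick sanity check with zlib (illustrative repeated text, not the real data):

```python
import zlib

# Stand-in for a stored article body: HTML markup is highly repetitive.
html = ("<p>Some article body repeated to mimic redundant HTML markup.</p>" * 200).encode("utf-8")
packed = zlib.compress(html, 9)
ratio = len(html) / len(packed)
print(len(html), len(packed), round(ratio, 1))
```

Real article bodies are less repetitive than this toy string, so the achievable ratio is lower, but still substantial.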
In total, 37,804 people have authored posts on Habr. Bear in mind that these statistics come only from live posts.
The most productive author on Habr is alizar: 8,774 articles.