Good day. It has been two years since my last article about parsing Habr, and some things have changed since then.
When I decided I wanted my own copy of Habr, I set out to write a parser that would save all of the authors' content to a database. How it went and what mistakes I ran into, you can read under the cut.
The first version of the parser. One thread, many problems
To start, I decided to build a prototype script in which each article is parsed immediately after downloading and put into the database. Without thinking twice, I went with sqlite3, because it was less work: no need for a local server, none of the create-inspect-delete ceremony, and so on.
one_thread.py
from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min_id, max_id):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min_id, max_id):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.RequestException:
            # Log failed requests so the ids can be retried later
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if r.status_code != 200:
            print("{} - {}".format(i, r.status_code))
            continue
        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')
        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = str(soup.find(id="post-content-body"))
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except AttributeError:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."
        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    conn.close()
    print(datetime.now() - start_time)

main(1, 490406)
Everything is classic: we use Beautiful Soup and requests, and a quick prototype is ready. Except that…
Pages are downloaded in a single thread.
If you interrupt the script, the entire database is lost, because the commit only happens after all the parsing is done.
Of course, you could commit to the database after every insert, but then the script's execution time grows significantly.
Parsing the first 100,000 articles took me 8 hours.
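The cost of committing per insert versus once at the end is easy to see even on a toy table (a minimal sketch with synthetic rows, not the original habr table):

```python
import os
import sqlite3
import tempfile
import time

def insert_rows(path, commit_each, n=1000):
    # Insert n rows, committing either after every row or once at the end.
    conn = sqlite3.connect(path)
    c = conn.cursor()
    c.execute("CREATE TABLE IF NOT EXISTS t(id INT, body TEXT)")
    start = time.perf_counter()
    for i in range(n):
        c.execute("INSERT INTO t VALUES (?, ?)", (i, "x" * 100))
        if commit_each:
            conn.commit()  # one disk sync per row: durable but slow
    conn.commit()          # batched: a single sync for the whole run
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed

with tempfile.TemporaryDirectory() as d:
    t_each = insert_rows(os.path.join(d, "each.db"), commit_each=True)
    t_once = insert_rows(os.path.join(d, "once.db"), commit_each=False)
    print("commit per row: {:.2f}s, single commit: {:.2f}s".format(t_each, t_once))
```

On a spinning disk the gap is even larger, since every commit forces a sync to storage.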
Then I came across an article by the user cointegrated, read it, and found a couple of life hacks to speed the process up:
Using multithreading speeds up downloading considerably.
You can fetch not the full version of Habr, but its mobile version.
For example, the cointegrated article weighs 378 KB in the desktop version, but only 126 KB in the mobile one.
The second version. Many threads, a temporary ban from Habr
While scouring the Internet on the topic of multithreading in Python, I chose the simplest option with multiprocessing.dummy, and noticed that problems came along with the threads.
SQLite3 doesn't want to work with more than one thread.
Passing check_same_thread=False helps, but it's not the only error: when trying to insert into the database, errors sometimes occur that I couldn't resolve.
So I decided to give up inserting articles directly into the database on the fly and, remembering the cointegrated solution, decided to use files instead, since there are no problems with multithreaded writing to files.
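The trick is that every worker writes only its own file, so the threads never touch the same data and no locking is needed (a sketch with dummy payloads, using the same multiprocessing.dummy pool as the real script):

```python
import json
import os
import tempfile
from multiprocessing.dummy import Pool as ThreadPool

outdir = tempfile.mkdtemp()

def worker(i):
    # Each thread owns exactly one output file, so writes never collide.
    path = os.path.join(outdir, "{}.json".format(i))
    with open(path, "w") as f:
        json.dump({"id": i, "title": "post {}".format(i)}, f)
    return path

pool = ThreadPool(3)
paths = pool.map(worker, range(10))
pool.close()
pool.join()
print(len(paths))  # 10 files, written without any locking
```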
Habr starts banning you if you use more than three threads.
Particularly zealous attempts to reach Habr can end with an IP ban for a couple of hours. So you have to limit yourself to just 3 threads, but even that is already good: the time to iterate over 100 articles drops from 26 to 12 seconds.
It's worth noting that this version is rather unstable, and downloads periodically fail on a large number of articles.
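One way to make such a downloader less fragile is to retry 503 responses with a growing delay instead of giving up immediately (a sketch with a stubbed fetch function; the real script would pass in a wrapper around requests.get):

```python
import time

def fetch_with_retry(fetch, url, retries=3, base_delay=1.0):
    # Retry a fetch that returns (status_code, body) whenever the server
    # answers 503, sleeping base_delay, 2*base_delay, ... between tries.
    for attempt in range(retries):
        status, body = fetch(url)
        if status != 503:
            return status, body
        time.sleep(base_delay * (2 ** attempt))
    return status, body

# Stub that fails twice with 503, then succeeds.
calls = {"n": 0}
def fake_fetch(url):
    calls["n"] += 1
    return (503, "") if calls["n"] < 3 else (200, "<html>post</html>")

status, body = fetch_with_retry(fake_fetch, "https://m.habr.com/post/1", base_delay=0.01)
print(status)  # 200 after two retries
```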
async_v1.py
from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)
    try:
        r = requests.get(url)
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # Log requests blocked by the server
    if r.status_code == 503:
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "\n")
        logging.warning('{} / 503 Error'.format(i))

    # The post does not exist or was hidden
    if r.status_code != 200:
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')
    try:
        author = soup.find(class_="tm-user-info__username").get_text()
        timestamp = soup.find(class_='tm-user-meta__date')['title']
        content = str(soup.find(id="post-content-body"))
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]
        # Labels marking the post as a translation or a tutorial
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()
        rating = soup.find(class_="tm-votes-score").get_text()
    except (AttributeError, TypeError):
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # Save the article as json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("min and max parameters are required. Usage: async_v1.py 1 100")
        sys.exit(1)
    min_id = int(sys.argv[1])
    max_id = int(sys.argv[2])
    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)
    # Start the clock and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min_id, max_id))
    # Print the elapsed time once all threads are done
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
The third version. Final
While debugging the second version, I suddenly discovered that Habr has an API which the mobile version of the site uses. It loads faster than the mobile version, since it's just json that doesn't even need to be parsed. In the end, I decided to rewrite my script once more.
So, having found this link to the API, you can start parsing it.
async_v2.py
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)
    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)
    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)
    if data['success']:
        article = data['data']['article']
        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']
        data = (id, is_tutorial, time_published, title, content, comments_count,
                lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("min and max parameters are required. Usage: async_v2.py 1 100")
        sys.exit(1)
    min_id = int(sys.argv[1])
    max_id = int(sys.argv[2])
    # With more than 3 threads
    # Habr temporarily bans the IP
    pool = ThreadPool(3)
    # Start the clock and launch the threads
    start_time = datetime.now()
    results = pool.map(worker, range(min_id, max_id))
    # Print the elapsed time once all threads are done
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
It contains fields relating both to the article itself and to the author who wrote it.
[Screenshot: an example API response]
I didn't dump the full json of every article; I saved only the fields I needed:
id
is_tutorial
time_published
title
content
comments_count
lang — the language the article is written in. So far it only contains en and ru.
tags_string — all of the post's tags
reading_count
author
score — the article's rating.
Thus, using the API, I cut the script's execution time down to 8 seconds per 100 urls.
Once we've downloaded the data we need, we have to process it and put it into the database. I had no problems with that either:
parser.py
import json
import sqlite3
import logging
from datetime import datetime

def parser(min_id, max_id):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0')  # Turn off write confirmation; speed goes up severalfold.
    c.execute("""CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT,
                 title TEXT, content TEXT, lang TEXT, comments_count INTEGER,
                 reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)""")
    try:
        for i in range(min_id, max_id):
            try:
                filename = "files/{}.json".format(i)
                with open(filename) as f:
                    data = json.load(f)
                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data
                # For a more readable database you can sacrifice code readability. Or not?
                # If you think so, just replace the tuple with the data argument. Up to you.
                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                          (id, time_published, author, title, content, lang,
                           comments_count, reading_count, score, is_tutorial, tags_string))
            except IOError:
                logging.info('FileNotExists')
                continue
    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)
Statistics
Well, traditionally, to finish, here are a few statistics from the data:
Of the expected 490,406 posts, only 228,512 articles were downloaded. It turns out that more than half (261,894) of the articles on Habr were hidden or deleted.
The entire database, covering almost half a million articles, weighs 2.95 GB. Compressed — 495 MB.
In total, 37,804 people have published on Habr. As a reminder, these statistics are based only on live posts.
The most prolific author on Habr is alizar, with 8,774 articles.
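Numbers like these come from simple SQL queries against the articles table (a sketch over a toy in-memory copy of the schema, with made-up rows, not the real dataset):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
# Same schema as parser.py creates.
c.execute("""CREATE TABLE articles(id INTEGER, time_published TEXT, author TEXT,
             title TEXT, content TEXT, lang TEXT, comments_count INTEGER,
             reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)""")
rows = [
    (1, "2020-01-01", "alice", "t1", "...", "ru", 0, 10, 5, 0, "python"),
    (2, "2020-01-02", "alice", "t2", "...", "ru", 2, 20, 7, 1, "sql"),
    (3, "2020-01-03", "bob",   "t3", "...", "en", 1, 30, 3, 0, "go"),
]
c.executemany("INSERT INTO articles VALUES (?,?,?,?,?,?,?,?,?,?,?)", rows)

# Total downloaded articles, distinct authors, and the most prolific author.
total = c.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
authors = c.execute("SELECT COUNT(DISTINCT author) FROM articles").fetchone()[0]
top = c.execute("""SELECT author, COUNT(*) AS n FROM articles
                   GROUP BY author ORDER BY n DESC LIMIT 1""").fetchone()
print(total, authors, top)  # 3 2 ('alice', 2)
```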