ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์— ์žˆ๋Š” ๋ชจ๋“  Habr

์ข‹์€ ์˜คํ›„์—์š”. ์“ด์ง€ 2๋…„์ด ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ๋งˆ์ง€๋ง‰ ๊ธฐ์‚ฌ Habr ๊ตฌ๋ฌธ ๋ถ„์„์— ๋Œ€ํ•ด ๋ช‡ ๊ฐ€์ง€ ์‚ฌํ•ญ์ด ๋ณ€๊ฒฝ๋˜์—ˆ์Šต๋‹ˆ๋‹ค.

Habr์˜ ๋ณต์‚ฌ๋ณธ์„ ๊ฐ–๊ณ  ์‹ถ์—ˆ์„ ๋•Œ ์ €์ž์˜ ๋ชจ๋“  ์ฝ˜ํ…์ธ ๋ฅผ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์— ์ €์žฅํ•˜๋Š” ํŒŒ์„œ๋ฅผ ์ž‘์„ฑํ•˜๊ธฐ๋กœ ๊ฒฐ์ •ํ–ˆ์Šต๋‹ˆ๋‹ค. ์–ด๋–ป๊ฒŒ ๋ฐœ์ƒํ–ˆ๊ณ  ์–ด๋–ค ์˜ค๋ฅ˜๊ฐ€ ๋ฐœ์ƒํ–ˆ๋Š”์ง€ - ์•„๋ž˜์—์„œ ์ฝ์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

TL;DR: link to the database

ํŒŒ์„œ์˜ ์ฒซ ๋ฒˆ์งธ ๋ฒ„์ „์ž…๋‹ˆ๋‹ค. ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ, ๋งŽ์€ ๋ฌธ์ œ

First of all, I decided to make a prototype script that would parse an article as soon as it was downloaded and put it into the database. Without thinking twice I used sqlite3: it was less labor-intensive, with no need for a local server and no create-look-delete ceremony.

one_thread.py

from bs4 import BeautifulSoup
import sqlite3
import requests
from datetime import datetime

def main(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute("CREATE TABLE IF NOT EXISTS habr(id INT, author VARCHAR(255), title VARCHAR(255), content  TEXT, tags TEXT)")

    start_time = datetime.now()
    c.execute("begin")
    for i in range(min, max):
        url = "https://m.habr.com/post/{}".format(i)
        try:
            r = requests.get(url)
        except requests.RequestException:
            with open("req_errors.txt", "a") as file:
                file.write(str(i) + "\n")
            continue
        if(r.status_code != 200):
            print("{} - {}".format(i, r.status_code))
            continue

        html_doc = r.text
        soup = BeautifulSoup(html_doc, 'html.parser')

        try:
            author = soup.find(class_="tm-user-info__username").get_text()
            content = soup.find(id="post-content-body")
            content = str(content)
            title = soup.find(class_="tm-article-title__text").get_text()
            tags = soup.find(class_="tm-article__tags").get_text()
            tags = tags[5:]
        except AttributeError:
            author, title, tags = "Error", "Error {}".format(r.status_code), "Error"
            content = "An error occurred while parsing this page."

        c.execute('INSERT INTO habr VALUES (?, ?, ?, ?, ?)', (i, author, title, content, tags))
        print(i)
    c.execute("commit")
    print(datetime.now() - start_time)

main(1, 490406)

๋ชจ๋“  ๊ฒƒ์ด ๊ณ ์ „์ ์ž…๋‹ˆ๋‹ค. Beautiful Soup, ์š”์ฒญ ๋ฐ ๋น ๋ฅธ ํ”„๋กœํ†  ํƒ€์ž…์„ ์‚ฌ์šฉํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๊ฑด ๊ทธ๋ƒฅโ€ฆ

  • ํŽ˜์ด์ง€ ๋‹ค์šด๋กœ๋“œ๊ฐ€ ํ•˜๋‚˜์˜ ์Šค๋ ˆ๋“œ์— ์žˆ์Œ

  • ์Šคํฌ๋ฆฝํŠธ ์‹คํ–‰์„ ์ค‘๋‹จํ•˜๋ฉด ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๊ฐ€ ์•„๋ฌด๋ฐ๋„ ๊ฐ€์ง€ ์•Š์Šต๋‹ˆ๋‹ค. ๊ฒฐ๊ตญ ์ปค๋ฐ‹์€ ๋ชจ๋“  ๊ตฌ๋ฌธ ๋ถ„์„ ํ›„์— ์ˆ˜ํ–‰๋ฉ๋‹ˆ๋‹ค.
    ๋ฌผ๋ก  ์‚ฝ์ž…ํ•  ๋•Œ๋งˆ๋‹ค ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์— ๋ณ€๊ฒฝ ์‚ฌํ•ญ์„ ์ปค๋ฐ‹ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ ์Šคํฌ๋ฆฝํŠธ ์‹คํ–‰ ์‹œ๊ฐ„์ด ํฌ๊ฒŒ ๋Š˜์–ด๋‚ฉ๋‹ˆ๋‹ค.

  • ์ฒ˜์Œ 100๊ฐœ์˜ ๊ธฐ์‚ฌ๋ฅผ ํŒŒ์‹ฑํ•˜๋Š” ๋ฐ 000์‹œ๊ฐ„์ด ๊ฑธ๋ ธ์Šต๋‹ˆ๋‹ค.

๋‹ค์Œ์œผ๋กœ ์‚ฌ์šฉ์ž์˜ ๊ธฐ์‚ฌ๋ฅผ ์ฐพ์Šต๋‹ˆ๋‹ค. ๊ณต์ ๋ถ„, ์ด ํ”„๋กœ์„ธ์Šค๋ฅผ ๊ฐ€์†ํ™”ํ•˜๊ธฐ ์œ„ํ•ด ๋ช‡ ๊ฐ€์ง€ ์ƒํ™œ ํ•ดํ‚น์„ ์ฝ๊ณ  ์ฐพ์•˜์Šต๋‹ˆ๋‹ค.

  • ๋ฉ€ํ‹ฐ์Šค๋ ˆ๋”ฉ์„ ์‚ฌ์šฉํ•˜๋ฉด ๋•Œ๋•Œ๋กœ ๋‹ค์šด๋กœ๋“œ ์†๋„๊ฐ€ ๋นจ๋ผ์ง‘๋‹ˆ๋‹ค.
  • habr์˜ ์ •์‹ ๋ฒ„์ „์ด ์•„๋‹Œ ๋ชจ๋ฐ”์ผ ๋ฒ„์ „์„ ์–ป์„ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    ์˜ˆ๋ฅผ ๋“ค์–ด ๋ฐ์Šคํฌํ†ฑ ๋ฒ„์ „์—์„œ ๊ณต๋™ ํ†ตํ•ฉ๋œ ๊ธฐ์‚ฌ์˜ ๋ฌด๊ฒŒ๊ฐ€ 378KB์ธ ๊ฒฝ์šฐ ๋ชจ๋ฐ”์ผ ๋ฒ„์ „์—์„œ๋Š” ์ด๋ฏธ 126KB์ž…๋‹ˆ๋‹ค.

Second version. Many threads, a temporary ban from Habr

ํŒŒ์ด์ฌ์˜ ๋ฉ€ํ‹ฐ์Šค๋ ˆ๋”ฉ ์ฃผ์ œ์— ๋Œ€ํ•ด ์ธํ„ฐ๋„ท์„ ์ƒ…์ƒ…์ด ๋’ค์กŒ์„ ๋•Œ multiprocessing.dummy๋กœ ๊ฐ€์žฅ ๊ฐ„๋‹จํ•œ ์˜ต์…˜์„ ์„ ํƒํ–ˆ๋Š”๋ฐ ๋ฉ€ํ‹ฐ์Šค๋ ˆ๋”ฉ๊ณผ ํ•จ๊ป˜ ๋ฌธ์ œ๊ฐ€ ๋‚˜ํƒ€๋‚˜๋Š” ๊ฒƒ์„ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค.

SQLite3 does not want to work with more than one thread.
check_same_thread=False takes care of that, but it is not the only error: when trying to insert into the database, errors sometimes occur that I could not resolve.

๋”ฐ๋ผ์„œ ๋‚˜๋Š” ๊ธฐ์‚ฌ๋ฅผ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์— ์ง์ ‘ ์‚ฝ์ž…ํ•˜๋Š” ๊ฒƒ์„ ํฌ๊ธฐํ•˜๊ณ  ๊ณต๋™ ํ†ตํ•ฉ ์†”๋ฃจ์…˜์„ ๊ธฐ์–ตํ•˜๋ฉด์„œ ํŒŒ์ผ์„ ์‚ฌ์šฉํ•˜๊ธฐ๋กœ ๊ฒฐ์ •ํ–ˆ์Šต๋‹ˆ๋‹ค. ํŒŒ์ผ์— ๋Œ€ํ•œ ๋‹ค์ค‘ ์Šค๋ ˆ๋“œ ์“ฐ๊ธฐ์— ๋ฌธ์ œ๊ฐ€ ์—†๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

Habr started banning when more than three threads were used.
Particularly zealous attempts to get through to Habr could end with an IP ban for a few hours. So you have to limit yourself to 3 threads, but even that is already good, since the time to iterate over 100 articles dropped from 26 to 12 seconds.

์ด ๋ฒ„์ „์€ ๋‹ค์†Œ ๋ถˆ์•ˆ์ •ํ•˜๋ฉฐ ๋งŽ์€ ์ˆ˜์˜ ๊ธฐ์‚ฌ์—์„œ ์ฃผ๊ธฐ์ ์œผ๋กœ ๋‹ค์šด๋กœ๋“œ๊ฐ€ ์ค‘๋‹จ๋œ๋‹ค๋Š” ์ ์€ ์ฃผ๋ชฉํ•  ๊ฐ€์น˜๊ฐ€ ์žˆ์Šต๋‹ˆ๋‹ค.

async_v1.py

from bs4 import BeautifulSoup
import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/post/{}".format(i)

    try:
        r = requests.get(url)
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    # ะ—ะฐะฟะธััŒ ะทะฐะฑะปะพะบะธั€ะพะฒะฐะฝะฝั‹ั… ะทะฐะฟั€ะพัะพะฒ ะฝะฐ ัะตั€ะฒะตั€
    if (r.status_code == 503):
        with open("Error503.txt", "a") as write_file:
            write_file.write(str(i) + "n")
            logging.warning('{} / 503 Error'.format(i))

    # ะ•ัะปะธ ะฟะพัั‚ะฐ ะฝะต ััƒั‰ะตัั‚ะฒัƒะตั‚ ะธะปะธ ะพะฝ ะฑั‹ะป ัะบั€ั‹ั‚
    if (r.status_code != 200):
        logging.info("{} / {} Code".format(i, r.status_code))
        return r.status_code

    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html5lib')

    try:
        author = soup.find(class_="tm-user-info__username").get_text()

        timestamp = soup.find(class_='tm-user-meta__date')
        timestamp = timestamp['title']

        content = soup.find(id="post-content-body")
        content = str(content)
        title = soup.find(class_="tm-article-title__text").get_text()
        tags = soup.find(class_="tm-article__tags").get_text()
        tags = tags[5:]

        # Flag marking the post as a translation or a tutorial.
        tm_tag = soup.find(class_="tm-tags tm-tags_post").get_text()

        rating = soup.find(class_="tm-votes-score").get_text()
    except (AttributeError, TypeError):
        author = title = tags = timestamp = tm_tag = rating = "Error"
        content = "An error occurred while parsing this page."
        logging.warning("Error parsing - {}".format(i))
        with open("Errors.txt", "a") as write_file:
            write_file.write(str(i) + "\n")

    # ะ—ะฐะฟะธัั‹ะฒะฐะตะผ ัั‚ะฐั‚ัŒัŽ ะฒ json
    try:
        article = [i, timestamp, author, title, content, tm_tag, rating, tags]
        with open(currentFile, "w") as write_file:
            json.dump(article, write_file)
    except:
        print(i)
        raise

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("ะะตะพะฑั…ะพะดะธะผั‹ ะฟะฐั€ะฐะผะตั‚ั€ั‹ min ะธ max. ะ˜ัะฟะพะปัŒะทะพะฒะฐะฝะธะต: async_v1.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # ะ•ัะปะธ ะฟะพั‚ะพะบะพะฒ >3
    # ั‚ะพ ั…ะฐะฑั€ ะฑะฐะฝะธั‚ ipัˆะฝะธะบ ะฝะฐ ะฒั€ะตะผั
    pool = ThreadPool(3)

    # ะžั‚ัั‡ะตั‚ ะฒั€ะตะผะตะฝะธ, ะทะฐะฟัƒัะบ ะฟะพั‚ะพะบะพะฒ
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # ะŸะพัะปะต ะทะฐะบั€ั‹ั‚ะธั ะฒัะตั… ะฟะพั‚ะพะบะพะฒ ะฟะตั‡ะฐั‚ะฐะตะผ ะฒั€ะตะผั
    pool.close()
    pool.join()
    print(datetime.now() - start_time)
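Since downloads in this version kept breaking off periodically, one obvious hardening step (not in the original script) would be to retry failed requests with a short backoff. A sketch with a hypothetical `with_retries` helper, exercised here against a flaky stub instead of a real `requests.get`:

```python
import time

def with_retries(fn, attempts=3, delay=0.01):
    # Call fn(); on an exception, wait and try again, up to `attempts` times.
    last_exc = None
    for n in range(attempts):
        try:
            return fn()
        except Exception as exc:
            last_exc = exc
            time.sleep(delay * (2 ** n))  # exponential backoff
    raise last_exc

# Flaky stub standing in for requests.get(url): fails twice, then succeeds.
calls = {"n": 0}
def flaky_get():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "200 OK"

print(with_retries(flaky_get))  # prints "200 OK" on the third attempt
```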

Third version. The final one

๋‘ ๋ฒˆ์งธ ๋ฒ„์ „์„ ๋””๋ฒ„๊น…ํ•˜๋‹ค๊ฐ€ ๊ฐ‘์ž๊ธฐ Habr์— ์‚ฌ์ดํŠธ์˜ ๋ชจ๋ฐ”์ผ ๋ฒ„์ „์ด ์•ก์„ธ์Šคํ•˜๋Š” API๊ฐ€ ์žˆ์Œ์„ ๋ฐœ๊ฒฌํ–ˆ์Šต๋‹ˆ๋‹ค. ํŒŒ์‹ฑํ•  ํ•„์š”์กฐ์ฐจ ์—†๋Š” json์ด๊ธฐ ๋•Œ๋ฌธ์— ๋ชจ๋ฐ”์ผ ๋ฒ„์ „๋ณด๋‹ค ๋น ๋ฅด๊ฒŒ ๋กœ๋“œ๋ฉ๋‹ˆ๋‹ค. ๊ฒฐ๊ตญ ๋‚˜๋Š” ๋Œ€๋ณธ์„ ๋‹ค์‹œ ์“ฐ๊ธฐ๋กœ ํ–ˆ๋‹ค.

๊ทธ๋ž˜์„œ, ์ฐพ์€ ์ด ๋งํฌ API, ๊ตฌ๋ฌธ ๋ถ„์„์„ ์‹œ์ž‘ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

async_v2.py

import requests
import os, sys
import json
from multiprocessing.dummy import Pool as ThreadPool
from datetime import datetime
import logging

def worker(i):
    currentFile = "files/{}.json".format(i)

    if os.path.isfile(currentFile):
        logging.info("{} - File exists".format(i))
        return 1

    url = "https://m.habr.com/kek/v1/articles/{}/?fl=ru%2Cen&hl=ru".format(i)

    try:
        r = requests.get(url)
        if r.status_code == 503:
            logging.critical("503 Error")
            return 503
    except requests.RequestException:
        with open("req_errors.txt", "a") as file:
            file.write(str(i) + "\n")
        return 2

    data = json.loads(r.text)

    if data['success']:
        article = data['data']['article']

        id = article['id']
        is_tutorial = article['is_tutorial']
        time_published = article['time_published']
        comments_count = article['comments_count']
        lang = article['lang']
        tags_string = article['tags_string']
        title = article['title']
        content = article['text_html']
        reading_count = article['reading_count']
        author = article['author']['login']
        score = article['voting']['score']

        data = (id, is_tutorial, time_published, title, content, comments_count, lang, tags_string, reading_count, author, score)
        with open(currentFile, "w") as write_file:
            json.dump(data, write_file)

if __name__ == '__main__':
    if len(sys.argv) < 3:
        print("ะะตะพะฑั…ะพะดะธะผั‹ ะฟะฐั€ะฐะผะตั‚ั€ั‹ min ะธ max. ะ˜ัะฟะพะปัŒะทะพะฒะฐะฝะธะต: asyc.py 1 100")
        sys.exit(1)
    min = int(sys.argv[1])
    max = int(sys.argv[2])

    # ะ•ัะปะธ ะฟะพั‚ะพะบะพะฒ >3
    # ั‚ะพ ั…ะฐะฑั€ ะฑะฐะฝะธั‚ ipัˆะฝะธะบ ะฝะฐ ะฒั€ะตะผั
    pool = ThreadPool(3)

    # ะžั‚ัั‡ะตั‚ ะฒั€ะตะผะตะฝะธ, ะทะฐะฟัƒัะบ ะฟะพั‚ะพะบะพะฒ
    start_time = datetime.now()
    results = pool.map(worker, range(min, max))

    # ะŸะพัะปะต ะทะฐะบั€ั‹ั‚ะธั ะฒัะตั… ะฟะพั‚ะพะบะพะฒ ะฟะตั‡ะฐั‚ะฐะตะผ ะฒั€ะตะผั
    pool.close()
    pool.join()
    print(datetime.now() - start_time)

์—ฌ๊ธฐ์—๋Š” ๊ธฐ์‚ฌ ์ž์ฒด์™€ ๊ธฐ์‚ฌ๋ฅผ ์ž‘์„ฑํ•œ ์ž‘์„ฑ์ž ๋ชจ๋‘์™€ ๊ด€๋ จ๋œ ํ•„๋“œ๊ฐ€ ํฌํ•จ๋ฉ๋‹ˆ๋‹ค.

[Screenshot: the fields of the API response]

๊ฐ ๊ธฐ์‚ฌ์˜ ์ „์ฒด json์„ ๋คํ”„ํ•˜์ง€ ์•Š๊ณ  ํ•„์š”ํ•œ ํ•„๋“œ๋งŒ ์ €์žฅํ–ˆ์Šต๋‹ˆ๋‹ค.

  • id
  • is_tutorial
  • time_published
  • title
  • content
  • comments_count
  • lang is the language the article is written in; so far there are only en and ru
  • tags_string contains all of the post's tags
  • reading_count
  • author
  • score is the article's rating

So, thanks to the API, I cut the script's running time down to 8 seconds per 100 URLs.

ํ•„์š”ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๋‹ค์šด๋กœ๋“œํ•œ ํ›„ ์ฒ˜๋ฆฌํ•˜๊ณ  ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์— ์ž…๋ ฅํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. ๋‚˜๋Š” ์ด๊ฒƒ์—๋„ ์•„๋ฌด๋Ÿฐ ๋ฌธ์ œ๊ฐ€ ์—†์—ˆ์Šต๋‹ˆ๋‹ค.

parser.py

import json
import sqlite3
import logging
from datetime import datetime

def parser(min, max):
    conn = sqlite3.connect('habr.db')
    c = conn.cursor()
    c.execute('PRAGMA encoding = "UTF-8"')
    c.execute('PRAGMA synchronous = 0') # Turn off write confirmation; it speeds things up many times over.
    c.execute("CREATE TABLE IF NOT EXISTS articles(id INTEGER, time_published TEXT, author TEXT, title TEXT, content TEXT, "
              "lang TEXT, comments_count INTEGER, reading_count INTEGER, score INTEGER, is_tutorial INTEGER, tags_string TEXT)")
    try:
        for i in range(min, max):
            try:
                filename = "files/{}.json".format(i)
                with open(filename) as f:
                    data = json.load(f)

                (id, is_tutorial, time_published, title, content, comments_count, lang,
                 tags_string, reading_count, author, score) = data

                # For the sake of the database's readability you can sacrifice the code's readability. Or not?
                # If you think so, you can simply replace the tuple with the data argument. Up to you.

                c.execute('INSERT INTO articles VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)',
                          (id, time_published, author, title, content, lang,
                           comments_count, reading_count, score, is_tutorial, tags_string))

            except IOError:
                logging.info('FileNotExists')
                continue

    finally:
        conn.commit()

start_time = datetime.now()
parser(490000, 490918)
print(datetime.now() - start_time)

Statistics

์Œ, ์ „ํ†ต์ ์œผ๋กœ ๋งˆ์ง€๋ง‰์œผ๋กœ ๋ฐ์ดํ„ฐ์—์„œ ๋ช‡ ๊ฐ€์ง€ ํ†ต๊ณ„๋ฅผ ์ถ”์ถœํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.

  • ์˜ˆ์ƒ๋˜๋Š” 490๊ฑด์˜ ๋‹ค์šด๋กœ๋“œ ์ค‘ 406๊ฑด์˜ ๊ธฐ์‚ฌ๋งŒ ๋‹ค์šด๋กœ๋“œ๋˜์—ˆ์Šต๋‹ˆ๋‹ค. ํ•˜๋ธŒ๋ ˆ ๊ธฐ์‚ฌ์˜ ์ ˆ๋ฐ˜ ์ด์ƒ(228๊ฑด)์ด ์ˆจ๊ฑฐ๋‚˜ ์‚ญ์ œ๋œ ๊ฒƒ์œผ๋กœ ๋“œ๋Ÿฌ๋‚ฌ๋‹ค.
  • ๊ฑฐ์˜ 2.95๋งŒ ๊ฐœ์˜ ๊ธฐ์‚ฌ๋กœ ๊ตฌ์„ฑ๋œ ์ „์ฒด ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์˜ ๋ฌด๊ฒŒ๋Š” 495GB์ž…๋‹ˆ๋‹ค. ์••์ถ•๋œ ํ˜•์‹ - XNUMXMB.
  • ์ด 37804๋ช…์ด Habrรฉ์˜ ์ž‘๊ฐ€์ž…๋‹ˆ๋‹ค. ์ด ํ†ต๊ณ„๋Š” ๋ผ์ด๋ธŒ ๊ฒŒ์‹œ๋ฌผ์—์„œ๋งŒ ๊ฐ€์ ธ์˜จ ๊ฒƒ์ž„์„ ์•Œ๋ ค๋“œ๋ฆฝ๋‹ˆ๋‹ค.
  • Habrรฉ์—์„œ ๊ฐ€์žฅ ์ƒ์‚ฐ์ ์ธ ์ž‘๊ฐ€ - ์•Œ๋ฆฌ์ž๋ฅด - 8774๊ฐœ์˜ ๊ธฐ์‚ฌ.
  • ์ตœ๊ณ  ํ‰์  ๊ธฐ์‚ฌ โ€” 1448 ํ”Œ๋Ÿฌ์Šค
  • ๊ฐ€์žฅ ๋งŽ์ด ์ฝ์€ ๊ธฐ์‚ฌ โ€” 1660841 ์กฐํšŒ์ˆ˜
  • ๊ฐ€์žฅ ๋งŽ์ด ๋…ผ์˜๋œ ๊ธฐ์‚ฌ โ€” 2444๊ฐœ์˜ ๋Œ“๊ธ€

๊ธ€์Ž„,์ƒ์˜์˜ ํ˜•ํƒœ๋กœ์ƒ์œ„ 15๋ช…์˜ ์ €์žํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์— ์žˆ๋Š” ๋ชจ๋“  Habr
ํ‰๊ฐ€ ๊ธฐ์ค€ ์ƒ์œ„ 15์œ„ํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์— ์žˆ๋Š” ๋ชจ๋“  Habr
์ƒ์œ„ 15๊ฐœ ์ฝ๊ธฐํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์— ์žˆ๋Š” ๋ชจ๋“  Habr
๋…ผ์˜๋œ ์ƒ์œ„ 15๊ฐœํ•˜๋‚˜์˜ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค์— ์žˆ๋Š” ๋ชจ๋“  Habr

Source: habr.com
