How I parsed Habr, part 1: trends

When the New Year's Olivier salad was finished, I had nothing to do, so I decided to download all the articles from Habrahabr (and related platforms) to my computer and explore them.

Several interesting stories came out of it. The first is how the format and topics of articles developed over the 12 years of the site's existence. The dynamics of some topics, for example, are quite revealing. Continued under the cut.

Parsing process

To understand how Habr developed, I had to go through all of its articles and extract meta-information from them (for example, dates). The crawl itself was easy, because the links to all articles look like "habrahabr.ru/post/337722/", and the numbers are assigned strictly in order. Knowing that the latest post had a number slightly below 350 thousand, I simply went through all possible document ids in a loop (Python code):

import numpy as np
from multiprocessing import Pool
with Pool(100) as p:
    docs = p.map(download_document, np.arange(350000))

The download_document function tries to load the page with the corresponding id and to extract meaningful information from the html structure.

import pickle

import requests
from bs4 import BeautifulSoup

def download_document(pid):
    """ Download and process a Habr document and its comments """
    # download the document
    r = requests.get('https://habrahabr.ru/post/' + str(pid) + '/')
    # parse the document
    soup = BeautifulSoup(r.text, 'html5lib') # instead of html.parser
    doc = {}
    doc['id'] = pid
    if not soup.find("span", {"class": "post__title-text"}):
        # this happens if the article never existed or was deleted
        doc['status'] = 'title_not_found'
    else:
        doc['status'] = 'ok'
        doc['title'] = soup.find("span", {"class": "post__title-text"}).text
        doc['text'] = soup.find("div", {"class": "post__text"}).text
        doc['time'] = soup.find("span", {"class": "post__time"}).text
        # create other fields: hubs, tags, views, comments, votes, etc.
        # ...
    # save the result to a separate file
    fname = r'files/' + str(pid) + '.pkl'
    with open(fname, 'wb') as f:
        pickle.dump(doc, f)

Along the way, I discovered a few new things.

First, they say that creating more processes than there are cores in the processor is useless. But in my case the limiting resource turned out to be the network, not the processor, and 100 processes work faster than 4 or, say, 20.
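
The effect is easy to reproduce without touching the network. Below is a minimal sketch, not the crawler itself: fake_download is a hypothetical stand-in that only sleeps to imitate network latency, and a thread pool is used instead of a process pool to keep the example light. The latency value and task counts are arbitrary assumptions.

```python
import time
from multiprocessing.pool import ThreadPool

def fake_download(pid, latency=0.05):
    """Hypothetical stand-in for a network request: it only waits, using no CPU."""
    time.sleep(latency)
    return pid

def timed_run(n_workers, n_tasks=40):
    """Run n_tasks fake downloads on n_workers workers; return (seconds, results)."""
    start = time.perf_counter()
    with ThreadPool(n_workers) as pool:
        results = pool.map(fake_download, range(n_tasks))
    return time.perf_counter() - start, results

if __name__ == "__main__":
    for n in (4, 20, 40):
        elapsed, _ = timed_run(n)
        print(f"{n:>3} workers: {elapsed:.2f} s")
```

Because each task spends its time waiting rather than computing, the total time keeps falling as workers are added well beyond the number of cores.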

Second, some posts contained combinations of special characters, for example euphemisms like "%&#@". It turned out that html.parser, which I used at first, reacts painfully to the combination &#, treating it as the beginning of an html entity. I was about to resort to black magic, but the forum suggested that you can simply change the parser.

Third, I managed to download all the publications except three. The documents numbered 65927, 162075, and 275987 were instantly deleted by my antivirus. These are articles, respectively, about a javascript chain that downloads a malicious pdf, SMS ransomware in the form of a set of browser plugins, and the CrashSafari.com site that sends iPhones into a reboot. The antivirus found one more article later, during a system scan: post 338586, about scripts on a pet store website that use the visitor's processor to mine cryptocurrency. So the antivirus can be considered to be doing its job quite adequately.

"Live" articles turned out to be only half of the potential maximum: 166307 of them. For the rest, Habr answers that "the page is outdated, has been deleted, or did not exist at all." Well, anything can happen.

Downloading the articles was followed by technical work: for example, publication dates had to be converted from the format "21 December 2006 at 10:47 am" to a standard datetime, and "12,8k" views to 12800. A few more incidents surfaced at this stage. The funniest one concerns vote counts and data types: some old posts had an int overflow and received 65535 votes each.
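
The two conversions mentioned above could look roughly like this. It is a sketch under assumptions: it handles only the two example strings quoted in the text (English month names, an optional am/pm marker, a decimal comma with a "k" suffix), and the names parse_time and parse_views are hypothetical helpers, not functions from the original crawler.

```python
import re
from datetime import datetime
from decimal import Decimal

MONTHS = {name: i for i, name in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}

DATE_RE = re.compile(r"(\d{1,2}) (\w+) (\d{4}) at (\d{1,2}):(\d{2})(?: (am|pm))?")

def parse_time(raw):
    """Convert a string like '21 December 2006 at 10:47 am' to a datetime."""
    match = DATE_RE.search(raw)
    if match is None:
        raise ValueError(f"unrecognized date: {raw!r}")
    day, month, year, hour, minute, half = match.groups()
    hour = int(hour)
    if half:  # 12-hour clock: 12 am is 0:00, 12 pm is 12:00
        hour = hour % 12 + (12 if half == "pm" else 0)
    return datetime(int(year), MONTHS[month], int(day), hour, int(minute))

def parse_views(raw):
    """Convert a view counter like '12,8k' (decimal comma, 'k' suffix) to an int."""
    text = raw.strip().lower().replace(",", ".")
    if text.endswith("k"):
        # Decimal avoids float rounding surprises when scaling by 1000
        return int(Decimal(text[:-1]) * 1000)
    return int(Decimal(text))
```

Parsing the month by lookup rather than with strptime keeps the result independent of the machine's locale settings.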

As a result, the texts of the articles (without pictures) took up 1.5 gigabytes, the comments with their meta-information another 3, and the meta-information about the articles about a hundred megabytes. All of this fits entirely in RAM, which was a pleasant surprise for me.

I started the analysis not with the texts themselves but with the meta-information: dates, tags, hubs, views and likes. It turned out that it can tell a lot.

Habrahabr Development Trends

Articles have been published on the site since 2006, most intensively in 2008-2016.

How actively these articles were read at different times is not so easy to assess. Texts from 2012 and earlier received more comments and votes, while newer texts have more views and bookmarks. These metrics moved the same way (they halved) only once, in 2015. Perhaps, amid the economic and political crisis, readers' attention shifted from IT blogs to more painful issues.

In addition to the articles themselves, I also downloaded the comments on them. There were 6 million comments, although 240 thousand of them had been banned ("a UFO flew in and published this inscription here"). A useful property of comments is that they carry a timestamp. By studying comment times, you can get a rough idea of when articles are read at all.

It turned out that most articles are both written and commented on somewhere between 10 am and 8 pm, i.e. during a typical Moscow working day. This may mean that Habr is read for professional purposes, or that it is a good way to procrastinate at work. By the way, this distribution over the time of day has been stable from the founding of Habr to the present day.
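
A time-of-day distribution like this can be computed with a simple counter. The sketch below assumes the timestamps have already been parsed into datetime objects in Moscow time; hourly_share is a hypothetical helper name.

```python
from collections import Counter
from datetime import datetime

def hourly_share(timestamps):
    """Return {hour: fraction of timestamps that fall in that hour of the day}."""
    counts = Counter(ts.hour for ts in timestamps)
    total = sum(counts.values())
    if total == 0:
        return {hour: 0.0 for hour in range(24)}
    return {hour: counts[hour] / total for hour in range(24)}
```

Applying it separately to the comments of each year is one way to check whether the daily profile stays stable over time.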

However, the main benefit of comment timestamps is not the time of day but the duration of an article's "active life". I calculated how the time from an article's publication to each of its comments is distributed. It turned out that the median comment (green line) now arrives after about 20 hours, i.e. slightly more than half of all comments on an article are left within the first day after publication. Within two days, 75% of all comments are left. Earlier articles were read even faster: in 2010, for example, half of the comments arrived within the first 6 hours.
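
The median and the 75% mark are simply quartiles of the publication-to-comment delay. Here is a minimal sketch using the standard library, assuming the publication time and comment times are datetime objects; delay_quartiles is a hypothetical helper, and the "inclusive" interpolation method is one reasonable choice among several.

```python
from datetime import datetime, timedelta
from statistics import quantiles

def delay_quartiles(published, comment_times):
    """Quartiles (in hours) of the delay between publication and each comment."""
    delays = [(t - published).total_seconds() / 3600 for t in comment_times]
    # quantiles() needs at least two data points on older Python versions
    q1, median, q3 = quantiles(delays, n=4, method="inclusive")
    return q1, median, q3

if __name__ == "__main__":
    published = datetime(2017, 1, 1, 12, 0)
    # made-up comment delays of 1, 5, 20, 30 and 48 hours
    comments = [published + timedelta(hours=h) for h in (1, 5, 20, 30, 48)]
    print(delay_quartiles(published, comments))
```

Pooling the delays of all articles published in a given year and taking the same quartiles gives one point per year on a chart of this kind.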

It surprised me that comments have become longer: the average number of characters in a comment has almost doubled over the lifetime of Habr!

A simpler form of feedback than comments is votes. Unlike many other resources, Habr lets you give not only pluses but also minuses. However, readers do not use the latter option very often: the current share of dislikes is about 15% of all votes cast. It used to be higher, but readers have become kinder over time.

The texts themselves have changed over time. For example, the typical length of a text has kept growing steadily since the launch of the site, despite the crises. In a decade, texts have become almost ten times longer!

The style of the texts has also changed (to a first approximation). During the first years of Habr's existence, for example, the share of code and numbers in the texts grew.

Having understood the overall dynamics of the site, I decided to measure how the popularity of various topics changed. Topics can be extracted automatically from the texts, but to begin with there is no need to reinvent the wheel: you can use the ready-made tags that the authors attached to each article. I plotted four typical trends on the chart. The "Google" topic dominated at first (perhaps mainly because of SEO), but has been losing weight over the years. Javascript has been a popular topic all along and keeps growing slowly, while machine learning has started to gain popularity rapidly only in recent years. Linux, on the other hand, has remained equally relevant throughout the decade.
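
A tag trend of this kind boils down to the share of each year's articles that carry a given tag. The sketch below assumes each document is a dict with hypothetical "year" and "tags" fields; the original crawler only hints at these fields, so the names are illustrative.

```python
from collections import Counter

def tag_share_by_year(docs, tag):
    """Return {year: share of that year's articles that carry `tag`}."""
    totals = Counter()
    hits = Counter()
    for doc in docs:
        totals[doc["year"]] += 1
        if tag in doc["tags"]:
            hits[doc["year"]] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}
```

Using a share instead of a raw count keeps the trend comparable across years with very different numbers of published articles.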

Of course, I became interested in which topics attract the most reader activity. I calculated the median number of views, votes and comments for each topic. Here is what came out:

  • Most viewed topics: arduino, web design, web development, digest, links, css, html, html5, nginx, algorithms.
  • Most "liked" topics: vkontakte, humor, jquery, opera, c, html, web development, html5, css, web design.
  • Most discussed topics: opera, skype, freelance, vkontakte, ubuntu, work, nokia, nginx, arduino, firefox.
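
Rankings like these can be obtained by grouping a metric by tag and taking the median. A minimal sketch follows, assuming documents are dicts with a hypothetical "tags" list and numeric fields such as "views"; the min_articles threshold is an added assumption meant to keep one-off tags out of the ranking.

```python
from collections import defaultdict
from statistics import median

def median_by_tag(docs, field, min_articles=1):
    """Return (tag, median of `field`) pairs, highest median first."""
    values = defaultdict(list)
    for doc in docs:
        for tag in set(doc["tags"]):  # count each tag once per article
            values[tag].append(doc[field])
    medians = {tag: median(v) for tag, v in values.items()
               if len(v) >= min_articles}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)
```

The same function works for votes or comments by passing a different field name.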

By the way, since I am comparing topics anyway, they can also be ranked by frequency (and the results compared with a similar article from 2013).

  • Over all the years of Habr's existence, the most popular tags (in descending order) are google, android, javascript, microsoft, linux, php, apple, java, python, programming, startups, development, ios, startup, social networks.
  • In 2017, the most popular were javascript, python, java, android, development, linux, c++, programming, php, c#, ios, machine learning, information security, microsoft, react.

Comparing these rankings, one can notice, for example, the victorious march of Python and the decline of php, or the "sunset" of startup topics and the rise of machine learning.

Not all tags on Habr have such an obvious thematic coloring. For example, here are ten tags that occurred only once but simply seemed funny to me. So: "the idea is the engine of progress", "boot from a floppy disk image", "Iowa State", "drama", "superalesh", "steam engine", "things to do on Saturday", "I have a fox in a meat grinder", "it turned out as always", "we couldn't come up with funny tags". To determine the topic of such articles, tags are not enough: you have to run topic modeling on the article texts.
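
A frequency ranking of tags, overall or for a single year, is a short exercise with Counter. The sketch assumes the same hypothetical document layout as elsewhere (dicts with "year" and "tags"); top_tags is an illustrative name.

```python
from collections import Counter

def top_tags(docs, n=15, year=None):
    """Return the n most frequent tags, optionally restricted to one year."""
    counts = Counter()
    for doc in docs:
        if year is None or doc["year"] == year:
            counts.update(set(doc["tags"]))  # each tag counts once per article
    return [tag for tag, _ in counts.most_common(n)]
```

Note that near-duplicate tags such as "startup" and "startups" are counted separately here, just as in the lists above.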

A more detailed analysis of the content of the articles will come in the next post. First, I am going to build a model that predicts the number of views of an article from its content. Second, I want to teach a neural network to generate texts in the same style as Habr's authors. So subscribe 🙂

P.S. And here is the beeped dataset.

Source: www.habr.com
