Habrastatistics: exploring the most and least visited sections of the site

Hello, Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views, and their ratings. The popularity of the site's individual sections, however, was left unexamined. It seemed worth looking at this in more detail and finding the most popular and the least popular hubs. Finally, I will take a closer look at the geektimes effect and finish with a new selection of the best articles according to the new rankings.

For those who want to know how it turned out, the continuation is under the cut.

Let me remind you once again that the statistics and ratings are unofficial; I have no insider data. There is also no guarantee that I have not made a mistake somewhere or missed something. Still, I think the result is interesting. We will start with the code; anyone not interested in it can skip the first sections.

Data collection

The first version of the parser only recorded the number of views, the number of comments, and the article rating. That is already useful, but it does not allow more complex queries. It is time to analyze the site's thematic sections, which makes some interesting studies possible, for example seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved. It now returns the hubs an article belongs to, as well as the author's nickname and karma (a lot of interesting things can be done with these too, but that will come later). The data is saved in a csv file that looks roughly like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
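As an illustration, one row in this layout could be unpacked as follows. This is only a sketch using the standard csv module; the helper name parse_row, the shortened title, and the choice of which fields to convert to integers are assumptions, not the author's parser.

```python
import csv
import io

# Hypothetical row in the same layout as the sample above (title shortened).
sample = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Some title, with a comma",'
          'votes:7,votesplus:8,votesmin:1,bookmarks:32,'
          'views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft')

def parse_row(line: str) -> dict:
    """Split one CSV row into a dict; 'key:value' cells are unpacked by key."""
    cells = next(csv.reader(io.StringIO(line)))
    row = {"datetime": cells[0], "link": cells[1], "title": cells[2]}
    for cell in cells[3:]:
        key, _, value = cell.partition(":")
        row[key] = value
    # Hubs are joined with '+', so split them into a list
    row["hubs"] = row["hubs"].split("+")
    # Convert a few counters to integers (the selection is an assumption)
    for key in ("votes", "views", "comments", "bookmarks"):
        row[key] = int(row[key])
    return row

row = parse_row(sample)
print(row["views"], row["hubs"])
```

The quoted title keeps its comma because csv.reader honours the double quotes, which a plain split(',') would not.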

Let's get a list of the site's main thematic hubs.

import requests

# Str is the author's str subclass with a find_between() helper (introduced in an earlier part)
def get_as_str(link: str) -> Str:
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        # On any request error return an empty string, so the parsing below simply finds nothing
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic hubs are marked with "*"
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)

The find_between function and the Str class extract the string between two markers; I have used them before. Thematic hubs are marked with "*", so they are easy to pick out, and you can uncomment the corresponding lines to get the sections of the other categories.

The output of the get_hubs function is a rather impressive list, which we save as a set. I deliberately give the list in full so that you can appreciate its size.
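The Str class itself is not shown in this part. A minimal stand-in, assuming it is a str subclass whose find_between returns the text between the first occurrence of one marker and the next occurrence of another, might look like this (a reconstruction for illustration, not the author's code):

```python
class Str(str):
    """Hypothetical stand-in for the author's helper class (not shown in this part)."""

    def find_between(self, start: str, end: str) -> "Str":
        # Return the text between the first `start` and the next `end`,
        # or an empty Str if either marker is missing.
        i = self.find(start)
        if i < 0:
            return Str("")
        i += len(start)
        j = self.find(end, i)
        if j < 0:
            return Str("")
        return Str(self[i:j])

# Made-up HTML fragment, shaped like the hub links parsed above
html = Str('<a href="https://habr.com/ru/hub/python/">Python</a>')
info = html.find_between('"https://habr.com/ru/hub', '</a>')
print(info.find_between('/', '/"'))
```

Returning an empty Str on a miss keeps chained calls safe, which matches how get_hubs checks the length of the result before using it.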

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

For comparison, the geektimes sections look more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write functions that report whether an article belongs to geektimes or to a profile hub.

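from typing import List

def is_geektimes(hubs: List) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_geektimes_only(hubs: List) -> bool:
    # Geektimes article that has no profile hub at all
    return is_geektimes(hubs) and not is_profile(hubs)

def is_profile(hubs: List) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0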

Similar functions were written for the other sections ("development", "administration", and so on).

Processing

It is time to start the analysis. We load the dataset and process the hub data.
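Rather than writing one function per section, the same set-intersection idea can be generalized. The sketch below is one possible variation and not the author's code: the SECTIONS mapping, the tiny hub subsets, and the helper names are assumptions, and only_in is slightly stricter than is_geektimes_only because it excludes every other known section, not just profile hubs.

```python
from typing import Dict, Iterable, Set

# Tiny illustrative subsets; the real sets are the full lists shown above.
SECTIONS: Dict[str, Set[str]] = {
    "profile": {"python", "cpp", "controllers"},
    "geektimes": {"DIY", "space", "gadgets"},
}

def sections_of(hubs: Iterable[str]) -> Set[str]:
    """Return the names of all sections that share at least one hub with the article."""
    hub_set = set(hubs)
    return {name for name, members in SECTIONS.items() if hub_set & members}

def only_in(hubs: Iterable[str], section: str) -> bool:
    """True if the article belongs to `section` and to no other known section."""
    return sections_of(hubs) == {section}

print(sorted(sections_of(["DIY", "controllers", "cpp"])))
print(only_in(["space"], "geektimes"))
```

An article tagged "DIY" and "cpp" lands in both sections, which is exactly the overlap that makes a separate "geektimes only" flag necessary.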

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => [popular_science, astronomy]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# on_bad_lines='error' (pandas >= 1.3) replaces error_bad_lines=True, which was removed in pandas 2.0
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', on_bad_lines='error', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
# Shift the timestamps by 3 hours before extracting the calendar date
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
# One boolean column per section, computed from the hub list
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)
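The three-hour shift matters for the per-day grouping: a post published late in the evening by the "Z" timestamp falls on the next calendar day after the shift. The sketch below shows this with the standard library only; treating the offset as a conversion from UTC to Moscow time is my assumption, and local_date is a hypothetical helper.

```python
from datetime import datetime, timedelta

def local_date(stamp: str, offset_hours: int = 3):
    """Parse a 'YYYY-MM-DDTHH:MMZ' stamp and return the date after shifting by offset_hours."""
    utc = datetime.strptime(stamp, "%Y-%m-%dT%H:%MZ")
    return (utc + timedelta(hours=offset_hours)).date()

print(local_date("2018-12-18T12:43Z"))  # stays on the same calendar day
print(local_date("2018-12-31T22:30Z"))  # rolls over to the next day (and year)
```

Without the shift, late-evening posts would be counted one day early, which would slightly blur daily peaks such as the holiday ones.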

Now we can group the data by day and show the number of publications for the different sections.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
# numeric_only=True sums only numeric/boolean columns; newer pandas would otherwise try to add up text columns too
grouped = g.sum(numeric_only=True).reset_index()
# 20-day rolling average of the number of articles per day in each section
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
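To make the smoothing step concrete, here is a plain-Python sketch of a trailing average that behaves like rolling(window=N, min_periods=1).mean(): early values are averaged over however many points are available so far. The function name and the sample numbers are made up for illustration.

```python
from typing import List, Sequence

def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """Trailing mean over up to `window` most recent values (like min_periods=1)."""
    result = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window
            running -= values[i - window]
        result.append(running / min(i + 1, window))
    return result

print(rolling_mean([10, 20, 30, 40], window=2))  # [10.0, 15.0, 25.0, 35.0]
```

With a 20-day window, day-to-day noise (weekends, for example) is averaged out, at the cost of sharp changes appearing smeared over roughly three weeks.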

We plot the number of published articles using Matplotlib:

[Figure: number of articles published per day, by section]

I separated "geektimes" and "geektimes only" articles on the chart, because an article can belong to both sections at once (for example, "DIY" + "microcontrollers" + "C++"). I use the name "profile" for the site's profile articles, although the English word profile may not be quite the right term for this.

In the previous part, readers asked about the "geektimes effect" associated with the change in the rules for paying for geektimes articles starting this summer. Let's plot the geektimes articles separately:

# Keep only the articles that belong to geektimes and to no profile hub
df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
# numeric_only=True sums only numeric/boolean columns
grouped = group_gt.sum(numeric_only=True).reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped['views'].rolling(window=20, min_periods=1).mean()

The result is interesting. The ratio of views of geektimes articles to total views is roughly 1:5. But while total views fluctuated noticeably, views of the "entertainment" articles stayed at about the same level.

[Figure: total views per day, all articles and geektimes articles]

You can also see that total views of articles in the "geektimes" section did drop after the rule change, but, judging by eye, by no more than 5% of the total.

It is interesting to look at the average number of views per article:

[Figure: average number of views per article]

For "entertainment" articles it is about 40% above the average. This is probably not surprising. The dip at the beginning of April is unclear to me: maybe that is how it really was, maybe it is a parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the chart shows two more noticeable peaks in the number of article views: the New Year and May holidays.

Hubs

Let's move on to the promised analysis of hubs. Here are the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all is assumed to be the collection of all hub names gathered above
hubs_info = []
for hub_name in hubs_all:
    # Select the articles whose hub list contains this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: sort by total views and keep the first 20
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [v[2] for v in hubs_top]
top_names = [v[0] for v in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()
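The aggregation behind this chart can also be sketched without pandas. In the example below the article records are invented and hub_totals is a hypothetical helper; it only illustrates that an article's views are credited to every hub it lists.

```python
from collections import defaultdict

# Hypothetical articles: (hubs, views). Real values come from the parsed dataset.
articles = [
    (["infosecurity", "programming"], 12000),
    (["programming", "python"], 8000),
    (["gtk"], 300),
]

def hub_totals(items):
    """Return {hub: [article_count, total_views]}; an article counts once per hub."""
    totals = defaultdict(lambda: [0, 0])
    for hubs, views in items:
        for hub in set(hubs):
            totals[hub][0] += 1
            totals[hub][1] += views
    return dict(totals)

totals = hub_totals(articles)
# Sort hubs by total views, highest first
ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
print(ranked[0][0], ranked[-1][0])
```

Because views are credited to each listed hub, the per-hub totals add up to more than the site-wide total, so they are useful for ranking hubs against each other rather than for computing shares.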

Result:

[Figure: top 20 hubs by number of views]

Surprisingly, the most popular hub by views is "Information Security"; the top 5 also includes "Programming" and "Popular Science".

The anti-top is occupied by Gtk and Cocoa.

[Figure: least viewed hubs]

I'll let you in on a secret: the top hubs can also be seen here, although the number of views is not shown there.

Rating

And finally, the promised rating. Using the hub analysis data, we can show the most popular articles in the most popular hubs for this year, 2019.
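A per-hub rating of this kind can be sketched as "filter by hub, then take the top N". The records, the field names, and the choice of views as the ranking key are assumptions made for illustration; the source does not say which metric the rating uses.

```python
import heapq

# Hypothetical records; field names are assumptions made for this sketch.
articles = [
    {"title": "A", "hubs": ["infosecurity"], "views": 9000},
    {"title": "B", "hubs": ["infosecurity", "programming"], "views": 15000},
    {"title": "C", "hubs": ["programming"], "views": 4000},
    {"title": "D", "hubs": ["gtk"], "views": 700},
]

def top_articles(items, hub, n=3, key="views"):
    """Return the titles of the n highest-ranked articles that list `hub`."""
    in_hub = [a for a in items if hub in a["hubs"]]
    return [a["title"] for a in heapq.nlargest(n, in_hub, key=lambda a: a[key])]

print(top_articles(articles, "infosecurity"))  # ['B', 'A']
print(top_articles(articles, "gtk"))           # ['D']
```

A hub with a single article trivially puts that article in first place, which is the situation described for "gtk" further down.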

Information Security

Programming

Popular Science

Career

Legislation in IT

Web Development

GTK

And finally, so that no one feels left out, here is the rating for the least visited hub, "gtk". In a year it published one article, which "automatically" also takes first place in the rating.

Conclusion

There will be no conclusion. Happy reading, everyone.

Source: www.hab.com
