Habrastatistics: exploring the most and least visited sections of the site

Hello, Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views, and their ratings. However, the question of how popular the individual sections of the site are remained unexamined. It became interesting to look at this in more detail and find the most popular and the least popular hubs. Finally, I will look at the geektimes effect in more detail and finish with a new selection of the best articles based on the new rankings.


For those interested in what came of it, the continuation is under the cut.

Let me remind you once again that the statistics and ratings are unofficial; I have no insider information. There is also no guarantee that I have not made a mistake somewhere or missed something. Still, I think it turned out interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

The first version of the parser took into account only the number of views, comments, and ratings of articles. That is already good, but it does not allow more complex queries. It is time to analyze the thematic sections of the site; this makes quite interesting research possible, for example, seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now returns the hubs the article belongs to, as well as the author's nickname and rating (a lot of interesting things can be done here too, but that will come later). The data is saved to a csv file that looks something like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
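
To make the row format concrete, here is a minimal sketch of splitting one such line into named fields with the standard csv module. The function name parse_row and the shortened title are assumptions made for the example; the original parser is not shown in the article.

```python
import csv
import io

# One row in the format shown above (title shortened for the example).
sample = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Slack, pros and cons",'
          'votes:7,votesplus:8,votesmin:1,bookmarks:32,'
          'views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft')

def parse_row(line: str) -> dict:
    """Split one CSV line into a dict; 'key:value' cells become entries."""
    cells = next(csv.reader(io.StringIO(line)))
    row = {"datetime": cells[0], "link": cells[1], "title": cells[2]}
    for cell in cells[3:]:
        key, _, value = cell.partition(":")
        row[key] = value
    # Hubs are joined with '+', the counters are plain integers.
    row["hubs"] = row["hubs"].split("+")
    for key in ("votes", "votesplus", "votesmin", "bookmarks", "views", "comments"):
        row[key] = int(row[key])
    return row

row = parse_row(sample)
print(row["views"], row["hubs"])
```

The csv module keeps the quoted title intact even though it contains a comma, which a naive split(',') would not.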

Let's get the list of the site's main thematic hubs.

import requests

# Str (a str subclass with a find_between method) comes from the previous parts of this series.
def get_as_str(link: str) -> Str:
    try:
        r = requests.get(link)
        return Str(r.text)
    except requests.RequestException:
        # Treat a failed request as an empty page
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic (profile) hubs are marked with "*" on the page
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)

The find_between function and the Str class extract a string between two tags; I used them earlier. Thematic hubs are marked with a "*", so they are easy to identify, and you can also uncomment the corresponding lines to get the sections of the other categories.
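
Since Str and find_between are defined in an earlier part and not repeated here, the following is only a guess at what such a helper might look like: a str subclass whose find_between returns the text between the first occurrence of a start marker and the next end marker. The implementation and the sample HTML are assumptions, not the author's code.

```python
class Str(str):
    """Hypothetical stand-in for the author's helper: a str with find_between."""

    def find_between(self, start: str, end: str) -> "Str":
        """Return the text between the first `start` and the next `end`, or ''."""
        begin = self.find(start)
        if begin < 0:
            return Str("")
        begin += len(start)
        stop = self.find(end, begin)
        if stop < 0:
            return Str("")
        return Str(self[begin:stop])

# Made-up HTML fragment, only to exercise the two-step extraction used in get_hubs.
html = Str('<a href="https://habr.com/ru/hub/python/" class="x">Python</a>')
info = html.find_between('"https://habr.com/ru/hub', 'class')
print(info.find_between('/', '/"'))
```

Returning a Str rather than a plain str is what allows the chained call info.find_between(...) in get_hubs.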

The output of get_hubs is a rather impressive list, which we save as a set. I deliberately give the list in full so that you can estimate its size.

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

For comparison, the geektimes sections look more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that tells whether an article belongs to geektimes or to a profile hub.

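from typing import List

def is_profile(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0

def is_geektimes(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_geektimes_only(hubs: List[str]) -> bool:
    # Geektimes article with no profile hub at all
    return is_geektimes(hubs) and not is_profile(hubs)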

Similar functions were written for the other sections ("development", "administration", etc.).
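
Because these checks differ only in the set they test against, one possible way to avoid repeating them is a small factory function. This is a sketch, not the author's code: make_hub_checker and the shortened hubs_develop and hubs_admin sets are hypothetical names and data.

```python
from typing import Callable, Iterable, Set

def make_hub_checker(section_hubs: Set[str]) -> Callable[[Iterable[str]], bool]:
    """Build a predicate that is True when an article has a hub in the section."""
    def checker(hubs: Iterable[str]) -> bool:
        return not section_hubs.isdisjoint(hubs)
    return checker

# Hypothetical, shortened hub sets; the real ones would come from get_hubs().
hubs_develop = {"programming", "python", "cpp"}
hubs_admin = {"sys_admin", "devops", "linux"}

is_develop = make_hub_checker(hubs_develop)
is_admin = make_hub_checker(hubs_admin)

print(is_develop(["python", "DIY"]), is_admin(["python", "DIY"]))
```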

Processing

It is time to start the analysis. We load the dataset and process the hub data.

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => [popular_science, astronomy]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# on_bad_lines='error' replaces error_bad_lines=True, which was removed in pandas 2.0
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', on_bad_lines='error', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
# Shift the UTC timestamps by three hours (Moscow time)
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)

Now we can group the data by day and show the number of publications for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
# Sum only the boolean flag columns: each sum is the number of matching articles that day
flags = ['is_profile', 'is_geektimes', 'is_geektimes_only', 'is_admin', 'is_develop']
grouped = g[flags].sum().reset_index()
# Smooth the daily counts with a 20-day rolling mean
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
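
For readers unfamiliar with rolling means, here is a small pure-Python sketch of what rolling(window=..., min_periods=1).mean() computes for data without gaps: each point is the average of itself and the preceding values, and the first points simply average over what is available. The function name rolling_mean and the sample numbers are illustrative.

```python
from typing import List

def rolling_mean(values: List[float], window: int) -> List[float]:
    """Trailing moving average; early points average over what is available."""
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

daily_counts = [10, 20, 30, 40]
print(rolling_mean(daily_counts, window=2))  # [10.0, 15.0, 25.0, 35.0]
```

With window=20, as in the code above, day-to-day noise is smoothed out while trends lasting a few weeks remain visible.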

We plot the number of published articles using Matplotlib:
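
The plotting code itself is not included in the article, so the following is only a minimal sketch of how such a chart could be drawn. The short lists stand in for year_days and the *_per_day_avg series, and the values, labels, and output file name are assumptions.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line for an interactive window
import matplotlib.pyplot as plt

# Synthetic stand-ins for year_days and the *_per_day_avg series computed above.
days = [datetime.date(2019, 1, 1) + datetime.timedelta(days=i) for i in range(5)]
profile_avg = [40.0, 42.5, 41.0, 43.0, 44.5]
geektimes_avg = [12.0, 11.5, 13.0, 12.5, 12.0]

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(days, profile_avg, label="profile")
ax.plot(days, geektimes_avg, label="geektimes")
ax.set_xlabel("date")
ax.set_ylabel("articles per day (rolling mean)")
ax.legend()
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
fig.tight_layout()
fig.savefig("articles_per_day.png")
```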

[Figure: number of published articles per day, by section]

I separated "geektimes" and "geektimes only" articles in the chart because an article can belong to both sections at once (for example, "DIY" + "microcontrollers" + "C++"). I used the label "profile" to highlight the site's specialized articles, although perhaps the English term profile is not entirely correct for this.

In the previous part, readers asked about the "geektimes effect" associated with the change in the payment rules for geektimes articles this summer. Let's show the geektimes articles separately:

# Keep only the articles that belong to geektimes hubs and to no profile hub
df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
# Sum only the views column; the text columns are not needed here
grouped = group_gt[['views']].sum().reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped['views'].rolling(window=20, min_periods=1).mean()

The result is interesting. The approximate ratio of views of geektimes articles to the total is about 1:5. But while the total number of views fluctuates noticeably, the views of "entertainment" articles stay at roughly the same level.

[Figure: views per day, geektimes articles versus all articles]

You can also notice that the total number of views of articles in the "geektimes" section did fall after the rules changed, but, judging by eye, by no more than 5% of the total values.

It is interesting to look at the average number of views per article:
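
As a sketch of the calculation behind this metric: for each day, divide the sum of views by the number of articles published that day. The article presumably does this with pandas; the version below uses plain Python and made-up numbers to keep the arithmetic visible.

```python
from collections import defaultdict

# Synthetic (date, views) pairs standing in for the parsed articles.
articles = [("2019-04-01", 8000), ("2019-04-01", 12000), ("2019-04-02", 5000)]

totals = defaultdict(lambda: [0, 0])  # date -> [views_sum, article_count]
for date, views in articles:
    totals[date][0] += views
    totals[date][1] += 1

avg_views = {date: views_sum / count for date, (views_sum, count) in totals.items()}
print(avg_views)
```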

[Figure: average number of views per article]

For "entertainment" articles it is about 40% higher than the average. This is probably not surprising. The dip at the beginning of April is unclear to me: maybe that is really what happened, or it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the chart shows two more noticeable peaks in article views: the New Year and the May holidays.

Hubs

Let's move on to the promised analysis of hubs. Let's list the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all is assumed to hold the names of all hubs collected above
hubs_info = []
for hub_name in hubs_all:
    # Select the articles that list this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: the 20 hubs with the most views, as a bar chart
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [v[2] for v in hubs_top]
top_names = [v[0] for v in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

Result:

[Figure: top 20 hubs by number of views]

Surprisingly, the most popular hub by views turned out to be "Information Security"; the top 5 also include "Programming" and "Popular Science".

The anti-top is occupied by Gtk and Cocoa.
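
The anti-top can be obtained from the same kind of (hub_name, count, views) tuples by sorting in ascending order instead of descending. A minimal sketch follows; the tuples and their numbers are made up for illustration and are not the article's data.

```python
# Hypothetical (hub_name, article_count, total_views) tuples, shaped like hubs_info.
hubs_info = [("infosecurity", 900, 9000000), ("gtk", 1, 3000),
             ("cocoa", 2, 5000), ("programming", 800, 8000000)]

# Ascending sort by total views puts the least-viewed hubs first.
hubs_bottom = sorted(hubs_info, key=lambda v: v[2])[:2]
bottom_names = [name for name, _, _ in hubs_bottom]
print(bottom_names)
```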

[Figure: least viewed hubs]

I will tell you a secret: the top hubs can also be seen here, although the view counts are not shown there.

Rating

And finally, the promised rating. Using the hub analysis data, we can show the most popular articles in the most popular hubs for this year, 2019.

Information Security

Programming

Popular Science

Career

Legislation in IT

Web Development

GTK
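
A per-hub selection like the ones above could be produced along these lines. The sketch assumes the ranking is by number of views, which the article does not state explicitly; top_articles and the sample records are hypothetical.

```python
from typing import Dict, List

# Made-up records; only the field names follow the CSV columns shown earlier.
articles = [
    {"title": "A", "views": 5000, "hubs": ["infosecurity", "linux"]},
    {"title": "B", "views": 9000, "hubs": ["infosecurity"]},
    {"title": "C", "views": 7000, "hubs": ["programming"]},
]

def top_articles(items: List[Dict], hub: str, limit: int = 10) -> List[str]:
    """Titles of the most-viewed articles that list `hub`, best first."""
    in_hub = [a for a in items if hub in a["hubs"]]
    in_hub.sort(key=lambda a: a["views"], reverse=True)
    return [a["title"] for a in in_hub[:limit]]

print(top_articles(articles, "infosecurity"))
```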

And finally, so that no one feels left out, here is the rating of the least visited hub, "gtk". In a year it published a single article, which "automatically" takes first place in the rating.

Conclusion

There will be no conclusion. Happy reading, everyone.

Source: www.habr.com
