Habrastatistics: exploring the most frequently visited sections of the site

Hello, Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views and their ratings. However, the popularity of the site's individual sections remained unexamined. It seemed interesting to look at this in more detail and to find the most popular and the least popular hubs. Finally, I will look at the geektimes effect more closely and finish with a new selection of the best articles based on the new rankings.

For those interested in what came of it, the continuation is under the cut.

Let me remind you once again that the statistics and ratings are unofficial; I have no insider information. There is also no guarantee that I have not made a mistake or missed something. Still, I think it turned out interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

The first version of the parser took into account only the number of views, comments and article ratings. That was already good, but it did not allow more complex queries. It is time to analyze the thematic sections of the site; this makes some quite interesting research possible, for example seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now returns the hubs an article belongs to, as well as the author's nickname and rating (many interesting things can be done here too, but that will come later). The data is saved in a csv file that looks like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,
views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
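
To make the record layout concrete, here is a minimal sketch of reading one such line back into a dictionary. It assumes the first three fields are the timestamp, the URL and the title, and that every remaining field is a key:value pair. The function name parse_record and the field names datetime, link and title are my own labels, not the author's parser, and the sample title is made up.

```python
import csv
from typing import Dict, List, Union

def parse_record(line: str) -> Dict[str, Union[str, int, List[str]]]:
    """Parse one CSV record of the form shown above (illustrative sketch)."""
    # csv.reader handles the quoted title, which may itself contain commas.
    fields = next(csv.reader([line]))
    record = {"datetime": fields[0], "link": fields[1], "title": fields[2]}
    for field in fields[3:]:
        if not field:
            continue  # skip empty trailing fields
        key, _, value = field.partition(":")
        if key == "hubs":
            record[key] = value.split("+")   # "a+b" -> ["a", "b"]
        elif value.lstrip("-").isdigit():
            record[key] = int(value)         # counters such as views or votes
        else:
            record[key] = value              # e.g. the author's nickname
    return record

# Made-up title; the remaining fields follow the sample record above.
sample = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Title, with comma",'
          'votes:7,votesplus:8,votesmin:1,bookmarks:32,views:8300,comments:10,'
          'user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft')
record = parse_record(sample)
print(record["views"], record["hubs"])
```

Keeping the hubs as a list from the start makes the later membership checks straightforward.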

Let's get the list of the site's main thematic hubs.

import requests

def get_as_str(link: str) -> Str:
    # Download a page and wrap its text in Str (the helper class from the previous part);
    # on a request error return an empty Str so that the caller can carry on
    try:
        r = requests.get(link)
        return Str(r.text)
    except requests.RequestException:
        return Str("")

def get_hubs():
    hubs = []
    # Walk through pages 1..11 of the hub catalogue
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic hubs are marked with "*" after the name
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)

The find_between function and the Str class extract the string between two tags; I used them earlier. Thematic hubs are marked with "*", so they are easy to pick out, and you can also uncomment the corresponding lines to get the sections of the other categories.

The output of the get_hubs function is a rather impressive list, which we save as a set. I deliberately show the list in full so that you can judge its size.
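
The Str helper itself is not shown in this part. Below is a minimal sketch of what such a class could look like, assuming it subclasses str and that find_between returns an empty Str when either marker is missing. That behavior is my assumption, not the author's code.

```python
class Str(str):
    """A str subclass with a helper for cutting text between two markers."""

    def find_between(self, start: str, end: str) -> "Str":
        # Locate the first occurrence of the opening marker.
        begin = self.find(start)
        if begin == -1:
            return Str("")
        begin += len(start)
        # Locate the closing marker after the opening one.
        finish = self.find(end, begin)
        if finish == -1:
            return Str("")
        return Str(self[begin:finish])

# A made-up fragment shaped like the hub links the parser looks for.
info = Str('/python/" class="list-snippet__title-link"')
print(info.find_between('/', '/"'))
```

Because Str is still a str, checks such as "*</span>" in info keep working unchanged.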

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

For comparison, the geektimes sections look more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that tells whether an article belongs to geektimes or to a profile hub.

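from typing import List

def is_geektimes(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_profile(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0

def is_geektimes_only(hubs: List[str]) -> bool:
    # geektimes hubs only, with no profile hub among them
    return is_geektimes(hubs) and not is_profile(hubs)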

Similar functions were written for the other sections ("development", "administration", etc.).

Processing

It is time to start the analysis. We open the dataset and process the hub data.
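
Since these checks differ only in the set they consult, one way to avoid repeating them is a small factory. This is a sketch rather than the author's code: make_hub_checker is a hypothetical name, and the two small sets below are stand-ins for the full hub lists.

```python
from typing import Callable, Iterable, Set

def make_hub_checker(hub_set: Set[str]) -> Callable[[Iterable[str]], bool]:
    """Return a function telling whether any of an article's hubs is in hub_set."""
    def checker(hubs: Iterable[str]) -> bool:
        # isdisjoint is False as soon as one common element exists
        return not hub_set.isdisjoint(hubs)
    return checker

# Tiny stand-ins for the real sets (illustrative only).
demo_admin = {"sys_admin", "devops", "linux"}
demo_develop = {"python", "cpp", "webdev"}

is_admin_demo = make_hub_checker(demo_admin)
is_develop_demo = make_hub_checker(demo_develop)

print(is_admin_demo(["linux", "python"]))
print(is_develop_demo(["history"]))
```

The first call finds "linux" in the stand-in admin set; the second finds nothing in common with the development set.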

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => ["popular_science", "astronomy"]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# Malformed lines raise an error, which is the pandas default
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
# The timestamps are stored in UTC; shift them by three hours (UTC+3)
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
# One boolean flag per section for every article
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)
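
The three-hour shift matters for articles published late in the evening UTC, because it can move them to the next calendar day and therefore into a different daily group. A small sketch with two timestamps shows the effect; the second timestamp is made up, and reading UTC+3 as Moscow time is my assumption.

```python
import datetime
import pandas as pd

# The first timestamp is from the sample record; the second is made up.
stamps = pd.Series(["2018-12-18T12:43Z", "2018-12-31T22:30Z"])
utc = pd.to_datetime(stamps, format="%Y-%m-%dT%H:%MZ")

# Shift to UTC+3 before taking the calendar date.
local = utc + datetime.timedelta(hours=3)
print(local.dt.date.tolist())
```

The second article lands on 1 January 2019 instead of 31 December 2018 once the shift is applied.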

Now we can group the data by day and display the number of publications for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
# Summing the boolean flags gives the number of matching articles per day;
# numeric_only=True keeps the list-valued 'hubs' column out of the sum
grouped = g.sum(numeric_only=True).reset_index()
# Smooth the daily counts with a 20-day rolling mean
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
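
To see what the rolling average does to the daily counts, here is a small sketch on made-up numbers, with a window of 3 instead of 20 so that the effect is easy to follow. With min_periods=1 the first values are averaged over however many days are available rather than being left as NaN.

```python
import pandas as pd

# Made-up daily article counts, for illustration only.
counts = pd.Series([10, 20, 30, 40, 50])

# Each value is the mean of the current day and up to two preceding days.
smoothed = counts.rolling(window=3, min_periods=1).mean()
print(smoothed.tolist())
```

Here the first value stays at 10, the second is the mean of 10 and 20, and from the third on each value averages the last three days.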

We display the number of published articles using Matplotlib:

[Figure: number of articles published per day, by section]
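
The plotting code is not shown in the article. Below is a minimal sketch of how such a chart might be drawn from the smoothed series computed above. The function name plot_sections and the demo data are hypothetical, and the Agg backend is selected only so that the sketch runs without a display.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line for on-screen use
import matplotlib.pyplot as plt
import pandas as pd

def plot_sections(days, series_by_label):
    """Draw one line per section over a shared date axis and return the figure."""
    fig, ax = plt.subplots(figsize=(10, 5))
    for label, values in series_by_label.items():
        ax.plot(days, values, label=label)
    ax.set_ylabel("Articles per day (rolling average)")
    ax.legend()
    fig.autofmt_xdate()  # tilt the date labels so that they do not overlap
    return fig

# Made-up data standing in for year_days and the *_per_day_avg series.
demo_days = [datetime.date(2019, 1, 1) + datetime.timedelta(days=i) for i in range(5)]
demo_series = {
    "profile": pd.Series([50, 55, 60, 58, 62]),
    "geektimes": pd.Series([20, 22, 19, 21, 23]),
}
fig = plot_sections(demo_days, demo_series)
```

Calling fig.savefig with a file name, or plt.show() with an interactive backend, then produces the picture.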

I separated the "geektimes" and "geektimes only" articles in the chart because an article can belong to both sections at the same time (for example, "DIY" + "microcontrollers" + "C++"). I used the designation "profile" to mark the site's specialized articles, although the English term profile may not be entirely correct for this.

In the previous part we asked about the "geektimes effect" associated with the change in the payment rules for geektimes articles that took effect this summer. Let's display the geektimes articles separately:

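# Keep only the articles that belong to geektimes hubs and to no profile hub
df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
# A separate name avoids overwriting the totals computed for all articles
grouped_gt = group_gt.sum(numeric_only=True).reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped_gt['views'].rolling(window=20, min_periods=1).mean()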

The result is interesting. The ratio of geektimes article views to the total is roughly 1:5. But while the total number of views fluctuated noticeably, the views of "entertainment" articles stayed at about the same level.

[Figure: views per day, all articles and geektimes articles]

You can also notice that the total number of views of articles in the "geektimes" section did drop after the rules changed, but, judging by eye, by no more than 5% of the total value.

It is interesting to look at the average number of views per article:

[Figure: average number of views per article]

For "entertainment" articles it is about 40% above average. This is probably not surprising. The dip in early April is unclear to me: maybe that is what really happened, or it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the chart shows two more noticeable peaks in the number of article views: the New Year holidays and the May holidays.

Hubs

Let's move on to the promised analysis of the hubs. Here are the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all is not defined in this part; presumably it holds all hub names collected above
hubs_info = []
for hub_name in hubs_all:
    # Select the articles that list this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: the 20 with the most views, in descending order
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [views for _, _, views in hubs_top]
top_names = [name for name, _, _ in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

Result:

[Figure: top 20 hubs by number of views]

Surprisingly, the most popular hub by views is "Information Security"; the top 5 leaders also include "Programming" and "Popular Science".

The anti-top is occupied by Gtk and Cocoa.
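
The code for the anti-top is not shown. One way to get it, sketched here with DataFrame.explode on a tiny made-up table rather than the author's loop, is to give each hub its own row and sum the views per hub. The column names follow the ones used above; the numbers are invented.

```python
import pandas as pd

# Made-up articles; 'hubs' holds a list per article, as in the DataFrame above.
demo = pd.DataFrame({
    "hubs": [["python", "gtk"], ["python", "cpp"], ["cocoa"]],
    "views": [1000, 5000, 300],
})

# One row per (article, hub) pair, then the total views per hub.
views_by_hub = demo.explode("hubs").groupby("hubs")["views"].sum()

# Sorting in ascending order puts the least viewed hubs first.
bottom = views_by_hub.sort_values().head(2)
print(bottom.index.tolist())
```

Sorting the same series in descending order gives the top of the list, so one aggregation serves both charts.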

[Figure: the least viewed hubs]

I'll tell you a secret: the top hubs can also be seen here, although the number of views is not shown there.

Rating

And finally, the promised rating. Using the hub analysis data, we can display the most popular articles in the most popular hubs for this year, 2019.

Information Security

Programming

Popular Science

Career

Legislation in IT

Web Development

GTK

And finally, so that no one is offended, I will give the rating of the least visited hub, "gtk". Over the year only one article was published in it, and it "automatically" takes first place in the rating.

Conclusion

There will be no conclusion. Happy reading, everyone.
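
The per-hub ratings could be produced along the following lines. This is only a sketch: the article does not say which measure the ratings use, so ranking by views is my assumption, as are the function name top_articles and the title column; the data is made up.

```python
import pandas as pd

def top_articles(df: pd.DataFrame, hub_name: str, n: int = 3) -> pd.DataFrame:
    """Return the n most viewed articles that list hub_name among their hubs."""
    mask = df["hubs"].apply(lambda hubs: hub_name in hubs)
    return df[mask].nlargest(n, "views")[["title", "views"]]

# Made-up articles, for illustration only.
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "hubs": [["python"], ["python", "cpp"], ["cpp"], ["python"]],
    "views": [100, 900, 500, 300],
})
print(top_articles(demo, "python", n=2)["title"].tolist())
```

For a hub with a single article, such as "gtk", the function simply returns that one row.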

Source: www.habr.com