Habrastatistics: exploring the most and least visited sections of the site

Hi, Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views, and their ratings. However, the popularity of the site's individual sections was left unexamined. It seemed interesting to look at this in more detail and find the most popular and least popular hubs. At the end I will take a closer look at the geektimes effect and finish with a new selection of the best articles based on the new rankings.


For those interested in what came of it, read on below the cut.

Let me remind you once again that the statistics and ratings are unofficial, and I have no insider information. There is also no guarantee that I did not make a mistake somewhere or miss something. Still, I think the result turned out interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

The first version of the parser only took into account the number of views, the number of comments, and the rating of each article. That is already useful, but it does not allow more complex queries. It is time to analyze the thematic sections of the site, which makes quite interesting research possible, for example, seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now returns the hubs an article belongs to, as well as the author's nickname and rating (many interesting things can be done with that too, but that will come later). The data is saved in a csv file that looks roughly like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,
views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
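To make the record layout concrete, here is a minimal sketch of how one such line could be split into fields with the standard csv module. The function name parse_record and the shortened example title are assumptions made for illustration, not part of the author's parser; the numeric values are copied from the sample record above, which is a single CSV line wrapped onto two.

```python
import csv
import io


def parse_record(line: str) -> dict:
    """Split one CSV record into a dict (hypothetical helper, not the author's code)."""
    fields = next(csv.reader(io.StringIO(line)))
    # The first three fields are positional: timestamp, link, quoted title.
    record = {"datetime": fields[0], "link": fields[1], "title": fields[2]}
    # The remaining fields have the form "key:value".
    for field in fields[3:]:
        if not field:
            continue
        key, _, value = field.partition(":")
        if key == "hubs":
            record[key] = value.split("+")
            continue
        try:
            record[key] = int(value)
        except ValueError:
            record[key] = value  # non-numeric values such as the user name
    return record


# Shortened, made-up title; the other fields follow the sample record above.
sample = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,'
          '"Example title, with a comma",votes:7,votesplus:8,votesmin:1,bookmarks:32,'
          'views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft')
rec = parse_record(sample)
print(rec["views"], rec["hubs"])
```

The quoted title is the reason for using a real CSV reader here: a plain split on commas would break titles that contain commas.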

Let's get the list of the site's main thematic hubs.

import requests

# Str is the author's str subclass with a find_between() helper;
# it was introduced earlier and is not defined in this listing.

def get_as_str(link: str) -> Str:
    # Download a page and return its HTML, or an empty Str on any failure
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    hubs = []
    # The hub catalogue is paginated; pages 1..11 are scanned
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic hubs are marked with "*" in the page markup
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)

The find_between function and the Str class extract the string between two tags; I have used them before. Thematic hubs are marked with "*", so they are easy to pick out, and you can uncomment the corresponding lines to get the hubs of other categories.
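Since the helper itself is not shown here, the following is only a guess at what such a class might look like: a str subclass whose find_between returns the text between the first occurrence of a start marker and the next occurrence of an end marker. The implementation and the HTML fragment are assumptions for illustration; the author's actual class may differ.

```python
class Str(str):
    """Hypothetical stand-in for the author's helper class."""

    def find_between(self, start: str, end: str) -> "Str":
        # Locate the start marker; return an empty Str if it is missing.
        begin = self.find(start)
        if begin == -1:
            return Str("")
        begin += len(start)
        # Locate the end marker after the start marker.
        stop = self.find(end, begin)
        if stop == -1:
            return Str("")
        return Str(self[begin:stop])


# Made-up fragment shaped like the markup that get_hubs looks for.
fragment = Str('<a href="https://habr.com/ru/hub/python/" class="x">'
               'Python <span>*</span></a> list-snippet__tags')
info = fragment.find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
hub_name = info.find_between('/', '/"')
print(hub_name)
```

Returning a Str rather than a plain str is what lets the two calls be chained the way get_hubs does.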

The output of the get_hubs function is a rather impressive list, which we save as a set. I deliberately show the list in full so that you can appreciate its size.

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

For comparison, the geektimes section looks more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that reports whether an article belongs to geektimes or to a profile hub.

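from typing import List

def is_geektimes(hubs: List) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_geektimes_only(hubs: List) -> bool:
    # True if the article is in geektimes hubs and in no profile hub
    return is_geektimes(hubs) and not is_profile(hubs)

def is_profile(hubs: List) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0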

Similar functions were written for the other sections ("development", "administration", and so on).
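Because these per-section checks all have the same shape, one way to avoid repeating them is a small factory that builds a check from a set of hubs. This is a sketch of an alternative, not the author's code: make_section_check, hubs_admin_example, and is_admin_example are hypothetical names, and the example set is an illustrative subset rather than the real "administration" list, which is not given in the article.

```python
from typing import Callable, Iterable, Set


def make_section_check(section_hubs: Set[str]) -> Callable[[Iterable[str]], bool]:
    """Build a function that tells whether any of an article's hubs is in section_hubs."""
    def check(hubs: Iterable[str]) -> bool:
        # isdisjoint is True when the two collections share no element
        return not section_hubs.isdisjoint(hubs)
    return check


# Illustrative subset only; the real per-section sets are not listed in the article.
hubs_admin_example = {"sys_admin", "nginx", "dns"}
is_admin_example = make_section_check(hubs_admin_example)

print(is_admin_example(["nginx", "python"]))
print(is_admin_example(["python"]))
```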

Processing

Time to start the analysis. We load the dataset and process the hub data.

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => ["popular_science", "astronomy"]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# on_bad_lines='error' replaces error_bad_lines=True, which was removed in pandas 2.0
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', on_bad_lines='error', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
# Shift the UTC timestamps by three hours before taking the calendar date
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
# One boolean flag per section, computed from each article's hub list
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)
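The three-hour shift matters for articles published late in the evening. The timestamps end in "Z", which denotes UTC, and the offset presumably converts them to Moscow time (UTC+3); that reading is an assumption, since the article does not say so. A small standard-library sketch, with the hypothetical helper to_local_date, shows how the shift can move an article to the next calendar day:

```python
import datetime


def to_local_date(stamp: str, offset_hours: int = 3) -> datetime.date:
    """Parse a 'YYYY-MM-DDTHH:MMZ' UTC timestamp and return the shifted calendar date."""
    utc = datetime.datetime.strptime(stamp, '%Y-%m-%dT%H:%MZ')
    return (utc + datetime.timedelta(hours=offset_hours)).date()


# Timestamp from the sample record: the date does not change.
print(to_local_date('2018-12-18T12:43Z'))
# Made-up late-evening timestamp: the article lands on the next day (and year).
print(to_local_date('2018-12-31T22:30Z'))
```

Without the shift, such late-evening articles would be counted toward the previous day in the per-day grouping.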

Now we can group the data by day and show the number of publications for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
grouped = g.sum().reset_index()
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
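For readers unfamiliar with rolling(window=20, min_periods=1).mean(): each output value is the mean of the current day and up to 19 preceding days, and min_periods=1 means that the first days are averaged over however many values are available instead of producing NaN. The pure-Python function below, a hypothetical rolling_mean written only for illustration, reproduces that behaviour for a trailing window:

```python
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; early positions use the values available so far."""
    result = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the window
            running -= values[i - window]
        result.append(running / min(i + 1, window))
    return result


# Made-up daily counts, smoothed with a window of 3 for brevity.
smoothed = rolling_mean([1, 2, 3, 4, 5], window=3)
print(smoothed)
```

A 20-day window smooths out the weekly rhythm of publications, which would otherwise dominate the chart.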

We plot the number of published articles using Matplotlib:

[Chart: number of articles published per day, by section]
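The plotting code is not included in the article, so the following is only a sketch of how such a chart could be drawn. The function plot_daily_series and the sample numbers are made up for illustration; in the article's setting the x values would be year_days and the y values the *_per_day_avg series computed above.

```python
import datetime

import matplotlib.pyplot as plt


def plot_daily_series(days, series):
    """Draw one line per section; series maps a label to a list of daily values."""
    fig, ax = plt.subplots(figsize=(8, 6))
    for label, values in series.items():
        ax.plot(days, values, label=label)
    ax.set_xlabel("Date")
    ax.set_ylabel("Articles per day (rolling mean)")
    ax.legend()
    fig.autofmt_xdate()  # tilt the date labels so they do not overlap
    fig.tight_layout()
    return fig


# Made-up values, only to exercise the function.
days = [datetime.date(2019, 1, 1) + datetime.timedelta(days=i) for i in range(5)]
fig = plot_daily_series(days, {
    "profile": [30, 32, 35, 33, 36],
    "geektimes": [12, 11, 13, 12, 14],
})
```

Calling plt.show() afterwards displays the figure in an interactive session; fig.savefig("articles.png") writes it to a file instead.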

I separated "geektimes" and "geektimes only" articles on the chart because an article can belong to both sections at once (for example, "DIY" + "microcontrollers" + "C++"). I use the label "profile" for the site's core articles, although the English term "profile" may not be quite right for this.

In the previous part we asked about the "geektimes effect" associated with the change in the payment rules for geektimes articles that took effect this summer. Let's show the geektimes articles separately:

df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
grouped_gt = group_gt.sum().reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped_gt['views'].rolling(window=20, min_periods=1).mean()

The result is interesting. The ratio of views of geektimes articles to total views is roughly 1:5. But while the total number of views fluctuated noticeably, views of "entertainment" articles stayed at about the same level.

[Chart: daily views of geektimes-only articles compared with total views]

You can also notice that the total number of views of articles in the "geektimes" sections did drop after the rules changed, but, judging by eye, by no more than 5% of the total.

It is interesting to look at the average number of views per article:

[Chart: average number of views per article]

For "entertainment" articles it is about 40% above average. This is probably not surprising. The dip at the beginning of April is unclear to me: maybe that is what really happened, or it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the chart shows two more noticeable peaks in article views: the New Year holidays and the May holidays.
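The average number of views per article discussed above is simply the day's total views divided by the number of articles published that day. The code for that chart is not shown in the article, so here is a minimal standard-library sketch under that assumption; mean_views_per_article and the sample rows are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def mean_views_per_article(rows: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Map each date to total views divided by the number of articles on that date."""
    totals = defaultdict(lambda: [0, 0])  # date -> [views sum, article count]
    for date, views in rows:
        totals[date][0] += views
        totals[date][1] += 1
    return {date: views / count for date, (views, count) in sorted(totals.items())}


# Made-up (date, views) pairs, one per article.
rows = [("2019-04-01", 1000), ("2019-04-01", 3000), ("2019-04-02", 500)]
averages = mean_views_per_article(rows)
print(averages)
```

With the pandas objects from the listings above, the same quantity would be the per-day views sum divided by the per-day counts column, optionally smoothed with the same rolling mean.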

Hubs

Let's move on to the promised analysis of hubs. We will list the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all is not defined in this listing; presumably it holds all collected hub names
hubs_info = []
for hub_name in hubs_all:
    # Select the articles whose hub list contains this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: sort by total views and keep the 20 largest
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [info[2] for info in hubs_top]
top_names = [info[0] for info in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

Result:

[Chart: top 20 hubs by number of views]

Surprisingly, the most popular hub by views is "Information Security"; the top 5 also includes "Programming" and "Popular Science".

The anti-top is occupied by Gtk and Cocoa.

[Chart: hubs with the fewest views]

I will let you in on a secret: the top hubs can also be seen here, although the number of views is not shown there.
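The listing above scans the whole table once per hub. As an alternative sketch, not the author's method, the same totals can be accumulated in a single pass with collections.Counter, which also makes it easy to read off both ends of the ranking. The function views_by_hub and the sample articles are made up for illustration.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def views_by_hub(articles: Iterable[Tuple[List[str], int]]) -> Counter:
    """Sum article views per hub; an article counts toward every hub it is in."""
    totals = Counter()
    for hubs, views in articles:
        for hub in set(hubs):  # set() guards against a hub listed twice
            totals[hub] += views
    return totals


# Made-up (hubs, views) pairs, one per article.
articles = [
    (["infosecurity", "programming"], 9000),
    (["programming", "python"], 4000),
    (["gtk"], 300),
]
totals = views_by_hub(articles)
top = totals.most_common(2)                                 # most viewed hubs
bottom = sorted(totals.items(), key=lambda kv: kv[1])[:1]   # least viewed hubs
print(top, bottom)
```

Note that views are counted once per hub, so the per-hub totals add up to more than the site-wide total whenever articles belong to several hubs; the same is true of the original loop.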

Rating

And finally, the promised ratings. Using the hub analysis data, we can show the most popular articles for the most popular hubs of this year, 2019.

Information Security

Programming

Popular Science

Career

Legislation in IT

Web Development

GTK

And finally, so that nobody feels left out, I will give the rating for the least visited hub, "gtk". Only one article was published in it this year, so it "automatically" takes first place in the rating.

Conclusion

There will be no conclusion. Happy reading, everyone.

Source: www.habr.com
