Habrastatistics: exploring the most and least visited sections of the site

Hi Habr.

The previous part analyzed Habr's traffic by its main parameters: the number of articles, their views and their ratings. However, the popularity of the site's sections was left unexamined. It seemed interesting to look at this in more detail and find the most popular and the least popular hubs. Finally, I will take a closer look at the "geektimes effect" and finish with a new selection of the best articles based on the new rankings.

For those interested in what came of it, the continuation is under the cut.

Let me remind you once again that the statistics and ratings are unofficial, and I have no insider information. There is also no guarantee that I haven't made a mistake somewhere or missed something. Still, I think the result turned out interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

The first version of the parser only recorded the number of views, the number of comments and the article rating. That is already useful, but it does not allow more complex queries. It is time to analyze the site's thematic sections; this makes some interesting research possible, for example seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now returns the hubs the article belongs to, as well as the author's nickname and rating (a lot of interesting things can be done with these too, but that will come later). The data is saved in a csv file that looks something like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,
views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
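As a rough illustration of how such a record can be split into fields, here is a minimal sketch using only the standard library. The sample line reuses the values shown above with a shortened, made-up title, and the column names `link` and `title` as well as the `parse_record` helper are assumptions for this example, not the author's parser.

```python
import csv
import io

# One record in the format shown above (title shortened for illustration).
sample = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Slack, an example title",'
          'votes:7,votesplus:8,votesmin:1,bookmarks:32,'
          'views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft')

def parse_record(line: str) -> dict:
    # csv.reader handles the quoted title, which may itself contain commas
    row = next(csv.reader(io.StringIO(line)))
    record = {"datetime": row[0], "link": row[1], "title": row[2]}
    # The remaining fields are "key:value" pairs
    for field in row[3:]:
        key, _, value = field.partition(":")
        record[key] = value
    # Hubs are joined with "+"; the counters are converted to int
    record["hubs"] = record["hubs"].split("+")
    for key in ("votes", "votesplus", "votesmin", "bookmarks", "views", "comments"):
        record[key] = int(record[key])
    return record

record = parse_record(sample)
print(record["views"], record["hubs"])
```

Keeping the hubs as a list makes the later membership checks straightforward.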

Let's get the list of the site's main thematic hubs.

import requests

# Str is the helper string class from the previous part (it provides find_between)

def get_as_str(link: str) -> Str:
    # Download a page; return an empty string if the request fails
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic hubs are marked with "*"
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)

The find_between function and the Str class extract the string between two markers; I used them in the previous part. Thematic hubs are marked with "*", so they are easy to pick out, and you can uncomment the corresponding lines to get the hubs of the other categories.
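The Str class itself is not shown in this part. A minimal stand-in, assuming it is a str subclass whose find_between returns the text between the first occurrence of one marker and the next occurrence of another, could look like this. The HTML fragment is made up for the example.

```python
class Str(str):
    """A str subclass with a helper for cutting out text between two markers."""

    def find_between(self, first: str, last: str) -> "Str":
        # Locate the first marker; return an empty string if it is absent
        start = self.find(first)
        if start < 0:
            return Str("")
        start += len(first)
        # Locate the closing marker after the first one
        end = self.find(last, start)
        if end < 0:
            return Str("")
        return Str(self[start:end])

# Hypothetical fragment of a hub list page
html = Str('<a href="https://habr.com/ru/hub/python/" class="list-snippet__title-link">')
info = html.find_between('"https://habr.com/ru/hub', 'class')
hub_name = info.find_between('/', '/"')
print(hub_name)
```

Returning a Str rather than a plain str is what allows the two find_between calls to be chained, as get_hubs does.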

The output of get_hubs is a rather impressive list, which we save as a Python set. I give the list in full so that you can appreciate its size.

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

For comparison, the geektimes sections look more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that tells whether an article belongs to geektimes or to a profile hub.

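from typing import List

def is_geektimes(hubs: List) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_geektimes_only(hubs: List) -> bool:
    # True if the article has a geektimes hub and no profile hub
    return is_geektimes(hubs) and not is_profile(hubs)

def is_profile(hubs: List) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0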

Similar functions were written for the other sections ("development", "administration", and so on).
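Since these checks all follow the same pattern, one possible way to avoid repeating it is a small factory that builds a predicate from a hub set. This is only a sketch: `make_hub_checker` is a hypothetical helper, and the two sample sets are illustrative guesses, not the real contents of the "administration" and "development" categories.

```python
from typing import Callable, Iterable, Set

def make_hub_checker(hub_set: Set[str]) -> Callable[[Iterable[str]], bool]:
    # Returns a predicate: True if the article has at least one hub in hub_set
    def check(hubs: Iterable[str]) -> bool:
        return not hub_set.isdisjoint(hubs)
    return check

# Small illustrative subsets; the real sets would come from get_hubs()
hubs_admin_sample = {"sys_admin", "devops", "nginx"}
hubs_develop_sample = {"programming", "python", "cpp"}

is_admin = make_hub_checker(hubs_admin_sample)
is_develop = make_hub_checker(hubs_develop_sample)

print(is_admin(["nginx", "DIY"]), is_develop(["DIY", "space"]))
```

Using isdisjoint avoids building an intermediate intersection set, though for hub lists this short the difference is negligible.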

Processing

It is time to start the analysis. We load the dataset and process the hub data.

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => [popular_science, astronomy]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# error_bad_lines was removed in pandas 2.0; on_bad_lines='error' (pandas 1.3+) is the equivalent
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', on_bad_lines='error', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
# Shift the UTC timestamps by +3 hours (presumably to Moscow time)
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
# Turn the "hubs:a+b" strings into lists and flag each article by category
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)

Now we can group the data by day and display the number of publications for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
# Summing the boolean flags gives the number of articles per day in each category;
# numeric_only=True skips the text columns (title, link and so on)
grouped = g.sum(numeric_only=True).reset_index()
# Smooth the daily counts with a 20-day rolling average
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
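To make the smoothing step more concrete, here is a pure-Python sketch of what a trailing rolling mean with min_periods=1 computes. The `rolling_mean` helper and the daily counts are made up for illustration; the article itself relies on the pandas call above.

```python
from typing import List, Sequence

def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    # Trailing mean over up to `window` most recent values;
    # early positions average over however many values exist so far
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

daily_counts = [10, 20, 30, 40]  # made-up articles-per-day values
print(rolling_mean(daily_counts, window=3))  # [10.0, 15.0, 20.0, 30.0]
```

With min_periods=1 the first days of the year still get a value, at the cost of being averaged over fewer points than the rest of the curve.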

We plot the number of published articles using Matplotlib:

I separated the "geektimes" and "geektimes only" articles in the chart because an article can belong to both sections at once (for example, "DIY" + "microcontrollers" + "C++"). I used the name "profile" for the site's specialized articles, although the English term "profile" may not be quite right for this.

In the previous part we asked about the "geektimes effect", which is associated with the change in the payment rules for geektimes articles that took effect this summer. Let's display the geektimes articles separately:

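# Keep only the articles that belong to geektimes hubs and to no profile hub
df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
# numeric_only=True skips the text columns when summing
grouped = group_gt.sum(numeric_only=True).reset_index()
year_days_gt = days_count_gt['date'].values
# 20-day rolling average of the total daily views
view_gt_per_day_avg = grouped['views'].rolling(window=20, min_periods=1).mean()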

The result is interesting. The ratio of geektimes article views to total views is roughly 1:5. But while the total number of views fluctuated noticeably, the views of "entertainment" articles stayed at about the same level.

You can also notice that the total number of views of articles in the "geektimes" section did fall after the rules changed, but judging "by eye", by no more than 5% of the total.

It is interesting to look at the average number of views per article:
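The calculation behind such a curve is not shown in the article. One plausible approach is to divide each day's total views by the number of articles published that day, sketched here in plain Python with made-up numbers.

```python
from collections import defaultdict

# Made-up (date, views) pairs standing in for rows of the dataset
articles = [
    ("2019-04-01", 8300),
    ("2019-04-01", 1700),
    ("2019-04-02", 4000),
]

views_per_day = defaultdict(int)
count_per_day = defaultdict(int)
for date, views in articles:
    views_per_day[date] += views
    count_per_day[date] += 1

# Average views per article for each day
avg_views = {d: views_per_day[d] / count_per_day[d] for d in views_per_day}
print(avg_views)
```

In the pandas version this would correspond to dividing the summed 'views' column by the per-day 'counts' before smoothing.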

For "entertainment" articles it is about 40% above the average. This is probably not surprising. The dip in early April is unclear to me: maybe that is simply how it was, or it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the graph shows two more noticeable peaks in the number of article views: the New Year and the May holidays.

Hubs

Let's move on to the promised analysis of hubs. We will list the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all is assumed to be the combined set of all hub names collected above
hubs_info = []
for hub_name in hubs_all:
    # Select the articles that list this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: top 20 by total views
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [h[2] for h in hubs_top]
top_names = [h[0] for h in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

Result:

Surprisingly, the most popular hub by views is "Information Security"; the top 5 also include "Programming" and "Popular Science".

The anti-top is led by Gtk and Cocoa.
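The anti-top can be obtained from the same hubs_info tuples by sorting in ascending order. The sketch below uses placeholder numbers, not the real statistics, and the decision to skip hubs with no articles at all is an assumption rather than something the article states.

```python
# (hub_name, article_count, total_views); the numbers are placeholders
hubs_info = [
    ("infosecurity", 120, 900000),
    ("gtk", 1, 1500),
    ("cocoa", 2, 3000),
    ("programming", 100, 800000),
]

# Skip hubs with no articles, then sort by views in ascending order
non_empty = [h for h in hubs_info if h[1] > 0]
hubs_bottom = sorted(non_empty, key=lambda v: v[2])[:2]
print([name for name, _, _ in hubs_bottom])
```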

I'll let you in on a secret: the top hubs can also be seen here, although the number of views is not shown there.

Rating

And finally, the promised rating. Using the hub analysis data, we can show the most popular articles in the most popular hubs for this year, 2019.
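The selection code is not included in the article. A minimal sketch of one way to build such a per-hub rating, assuming the articles are ranked by views (the text does not say which metric was used), might look like this. The `top_articles` helper and the records are hypothetical.

```python
from typing import Dict, List

def top_articles(articles: List[Dict], hub_name: str, n: int = 3) -> List[Dict]:
    # Keep articles that list the hub, then order by views, highest first
    in_hub = [a for a in articles if hub_name in a["hubs"]]
    return sorted(in_hub, key=lambda a: a["views"], reverse=True)[:n]

# Made-up records; only the field names follow the csv shown earlier
articles = [
    {"title": "A", "views": 5000, "hubs": ["infosecurity", "programming"]},
    {"title": "B", "views": 9000, "hubs": ["infosecurity"]},
    {"title": "C", "views": 7000, "hubs": ["gtk"]},
]
print([a["title"] for a in top_articles(articles, "infosecurity")])
```

A hub with a single article, like "gtk" in this toy data, returns that article as its whole rating.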

Information Security

Programming

Popular Science

Career

Legislation in IT

Web Development

GTK

And finally, so that no one feels left out, here is the rating for the least visited hub, "gtk". Over the year, one article was published in it, and it "automatically" takes first place in the rating.

Conclusion

There will be no conclusion. Happy reading, everyone.

source: www.habr.com