Habrastatistics: exploring the most and least visited sections of the site

Hello, Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views, and their ratings. The popularity of the site's sections, however, was not examined. It became interesting to look at this in more detail and to find the most and least popular hubs. At the end I will look at the "geektimes effect" in more detail and finish with a new selection of the best articles based on the new rankings.


For those interested in what came of it, the continuation is under the cut.

Let me remind you once again that the statistics and ratings are unofficial; I have no insider information. There is also no guarantee that I have not made a mistake somewhere or missed something. Still, I think it turned out interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

The first version of the parser took into account only the number of views, comments, and article ratings. That is already good, but it does not allow more complex queries. It is time to analyze the site's thematic sections, which makes some interesting research possible, for example seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now returns the hubs an article belongs to, as well as the author's nickname and rating (many interesting things can be done with these too, but that will come later). The data is saved in a csv file that looks like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
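
To make the record layout concrete, here is a minimal sketch that splits one such line into typed fields using only the standard library. The helper name parse_record and the dictionary keys "datetime", "link", and "title" are assumptions made for illustration; the author's own parser is not shown in the article.

```python
import csv
from typing import Dict, List, Union

# The sample record from above, written as a single line.
SAMPLE = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,'
          '"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",'
          'votes:7,votesplus:8,votesmin:1,bookmarks:32,'
          'views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft')

Value = Union[str, int, List[str]]


def parse_record(line: str) -> Dict[str, Value]:
    """Split one CSV line into a dict (hypothetical helper, not the author's code)."""
    # csv.reader keeps the quoted title intact even though it contains commas.
    fields = next(csv.reader([line], quotechar='"'))
    record: Dict[str, Value] = {
        "datetime": fields[0],
        "link": fields[1],
        "title": fields[2],
    }
    # The remaining fields have the form "key:value".
    for field in fields[3:]:
        key, _, value = field.partition(":")
        if key == "hubs":
            record[key] = value.split("+")   # "a+b" => ["a", "b"]
        elif value.lstrip("-").isdigit():
            record[key] = int(value)         # whole numbers only
        else:
            record[key] = value              # anything else stays a string
    return record


record = parse_record(SAMPLE)
print(record["views"], record["hubs"])
```

Converting the counters to integers at this stage is what later makes sums and averages over fields such as views meaningful.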

Let's get the list of the site's thematic hubs.

import requests

# Str is the author's str subclass with a find_between() helper (defined elsewhere).

def get_as_str(link: str) -> Str:
    """Download a page and return its HTML, or an empty string on any error."""
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    """Print the names of the thematic hubs found on the hub listing pages."""
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        # Each hub card on the page starts with this CSS class
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic hubs are marked with "*"
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)

The find_between function and the Str class extract the substring between two tags; I have used them before. Thematic hubs are marked with "*", so they are easy to pick out, and you can uncomment the corresponding lines to get the hubs of the other categories.
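
The Str class itself is not shown in this article. The sketch below is one possible implementation, written under the assumption that find_between returns the text between the first occurrence of the opening marker and the next occurrence of the closing marker, and an empty Str when either is missing. The HTML fragment is made up to resemble the markup that get_hubs searches through.

```python
class Str(str):
    """A str subclass with a helper for cutting text out of HTML (assumed behavior)."""

    def find_between(self, start: str, end: str) -> "Str":
        """Return the text between the first `start` and the next `end`, or "" if absent."""
        begin = self.find(start)
        if begin < 0:
            return Str("")
        begin += len(start)
        stop = self.find(end, begin)
        if stop < 0:
            return Str("")
        # Wrap the result so that calls can be chained.
        return Str(self[begin:stop])


# A made-up fragment shaped like one hub card.
fragment = Str('<a href="https://habr.com/ru/hub/python/" class="t">'
               'Python<span>*</span></a> list-snippet__tags')
info = fragment.find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
hub_name = info.find_between('/', '/"')
print(hub_name)
```

Returning a Str rather than a plain str matters here, because get_hubs calls find_between again on the result of the first call.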

The output of get_hubs is an impressive list, which we save as a set. I deliberately show the whole list so that you can appreciate its size.

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

For comparison, the geektimes sections look much more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write functions that tell whether an article belongs to geektimes or to a profile hub.

from typing import List

def is_geektimes(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_profile(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0

def is_geektimes_only(hubs: List[str]) -> bool:
    # True if the article is in geektimes hubs and in no profile hub
    return is_geektimes(hubs) and not is_profile(hubs)

Similar functions were written for the other sections ("development", "administration", and so on).
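
Those other functions are not listed in the article. One way to avoid writing each of them by hand is a small factory, sketched below. The name make_hub_filter is hypothetical, and the two hub sets contain only a few placeholder names; the real sets would come from running get_hubs on the "develop" and "admin" pages.

```python
from typing import Callable, Iterable, Set


def make_hub_filter(hub_set: Set[str]) -> Callable[[Iterable[str]], bool]:
    """Build a predicate that is True when an article has at least one hub in hub_set."""
    def predicate(hubs: Iterable[str]) -> bool:
        return not hub_set.isdisjoint(hubs)
    return predicate


# Placeholder sets for illustration only, not the real category contents.
hubs_develop = {"programming", "python", "cpp"}
hubs_admin = {"sys_admin", "devops", "nginx"}

is_develop = make_hub_filter(hubs_develop)
is_admin = make_hub_filter(hubs_admin)

print(is_develop(["python", "DIY"]), is_admin(["python", "DIY"]))
```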

Processing

It is time to start the analysis. We load the dataset and process the hub data.

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => ["popular_science", "astronomy"]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# Malformed lines raise an error by default; the old error_bad_lines flag was removed in pandas 2.0
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
# Shift the UTC timestamps by three hours before taking the calendar date
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
# One boolean flag per category
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)
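
The three-hour shift is not explained in the article; presumably it converts the UTC timestamps to Moscow time (UTC+3), and that reading is an assumption. The sketch below uses only the standard library to show why the shift matters: a late-evening UTC publication falls on the next calendar day after the shift. The function name to_local_date is hypothetical.

```python
import datetime


def to_local_date(stamp: str, offset_hours: int = 3) -> datetime.date:
    """Parse a UTC timestamp in the CSV format and return the date after the shift."""
    utc = datetime.datetime.strptime(stamp, "%Y-%m-%dT%H:%MZ")
    return (utc + datetime.timedelta(hours=offset_hours)).date()


print(to_local_date("2018-12-18T12:43Z"))  # same calendar day
print(to_local_date("2018-12-18T22:43Z"))  # rolls over to the next day
```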

Now we can group the data by day and display the number of publications in the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
# Sum only the boolean flags; summing text columns is slow and fails on some pandas versions
flag_columns = ['is_profile', 'is_geektimes', 'is_geektimes_only', 'is_admin', 'is_develop']
grouped = g[flag_columns].sum().reset_index()
# 20-day moving averages of the number of articles per day
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
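
For readers unfamiliar with rolling(window=20, min_periods=1).mean(), the sketch below reproduces the same idea in plain Python for data without missing values: each point is the average of the current day and up to window - 1 preceding days, and the first days are averaged over however many values are available. The function name and the sample numbers are made up for illustration.

```python
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int = 20) -> List[float]:
    """Trailing moving average that also averages the partial windows at the start."""
    result = []
    for i in range(len(values)):
        # Take the current value and up to window - 1 values before it.
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


daily_counts = [10, 20, 30, 40]  # made-up numbers
print(rolling_mean(daily_counts, window=2))  # [10.0, 15.0, 25.0, 35.0]
```

With min_periods=1 the smoothed curve starts on the first day instead of after 20 days, at the cost of being noisier at the very beginning.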

We display the number of published articles using Matplotlib:

[Figure: number of articles published per day in each category, 20-day moving average]

I separated "geektimes" and "geektimes only" on the chart, because an article can belong to both kinds of section at once (for example, "DIY" + "microcontrollers" + "C++"). I used the term "profile" to denote the site's specialized articles, although the English word "profile" may not be entirely accurate for this.

In the previous part, readers asked about the "geektimes effect" related to the change in the payment rules for geektimes articles that took effect this summer. Let's display the geektimes articles separately:

df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
# Sum only the views column (it must already be numeric)
grouped_gt = group_gt[['views']].sum().reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped_gt['views'].rolling(window=20, min_periods=1).mean()

The result is interesting. The ratio of views of geektimes articles to total views is roughly 1:5. But while the total number of views fluctuated noticeably, the views of "entertainment" articles stayed at about the same level.

[Figure: views per day for all articles and for geektimes-only articles]

You can also see that the total number of views of articles in the "geektimes" section did fall after the rules changed, but, judging by eye, by no more than 5% of the total.

It is interesting to look at the average number of views per article:

[Figure: average number of views per article]

For "entertainment" articles it is about 40% above the average. This is probably not surprising. The dip at the beginning of April is unclear to me: maybe that is what really happened, or it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the chart also shows two noticeable peaks in the number of article views: the New Year holidays and the May holidays.

Hubs

Let's move on to the promised analysis of hubs. We display the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all is assumed to hold the names of all hubs collected above
hubs_info = []
for hub_name in hubs_all:
    # Select the articles that list this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: sort by total views and keep the first 20
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [hub[2] for hub in hubs_top]
top_names = [hub[0] for hub in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

Result:

[Figure: top 20 hubs by number of views]

Surprisingly, the most popular hub by views turned out to be "Information Security"; the top 5 also included "Programming" and "Popular Science".

The anti-top is taken by Gtk and Cocoa.

[Figure: hubs with the fewest views]
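
The code for the anti-top is not shown in the article. A minimal sketch, assuming the same (name, count, views) tuples as in hubs_info, is to sort in ascending order instead. Skipping hubs with no articles at all is my assumption, and the sample tuples are made up.

```python
from typing import List, Tuple

HubInfo = Tuple[str, int, int]  # (hub name, article count, total views)


def bottom_hubs(hubs_info: List[HubInfo], n: int = 20) -> List[HubInfo]:
    """Return the n hubs with the fewest views among hubs that have at least one article."""
    non_empty = [hub for hub in hubs_info if hub[1] > 0]
    return sorted(non_empty, key=lambda hub: hub[2])[:n]


# Made-up values, only to show the ordering.
sample = [("hub_a", 120, 900000), ("hub_b", 1, 2500), ("hub_c", 0, 0), ("hub_d", 3, 7000)]
print(bottom_hubs(sample, n=2))
```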

I'll tell you a secret: the top hubs can also be seen here, although the number of views is not shown there.

Rating

And finally, the promised rating. Using the hub analysis data, we can list the most popular articles in the most popular hubs for this year, 2019.
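
The article does not show how the per-hub lists were built. The sketch below is one way to do it, assuming that articles are ranked by views (the actual criterion is not stated) and that each article is a dictionary with "title", "views", and "hubs" keys. The function name and the sample articles are made up.

```python
import heapq
from typing import Any, Dict, List

Article = Dict[str, Any]


def top_articles(articles: List[Article], hub_name: str, n: int = 5) -> List[Article]:
    """Return the n most viewed articles that list hub_name among their hubs."""
    in_hub = [a for a in articles if hub_name in a["hubs"]]
    return heapq.nlargest(n, in_hub, key=lambda a: a["views"])


# Made-up articles for illustration.
articles = [
    {"title": "A", "views": 500, "hubs": ["infosecurity", "programming"]},
    {"title": "B", "views": 9000, "hubs": ["infosecurity"]},
    {"title": "C", "views": 7000, "hubs": ["gtk"]},
    {"title": "D", "views": 1200, "hubs": ["infosecurity", "linux"]},
]
for article in top_articles(articles, "infosecurity", n=2):
    print(article["title"], article["views"])
```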

Information Security

Programming

Popular Science

Career

Legislation in IT

Web Development

GTK

And finally, so that no one feels left out, here is the rating for the least visited hub, "gtk". Only one article was published in it during the year, so it "automatically" takes first place in the rating.

Conclusion

There will be no conclusion. Happy reading, everyone.

Source: www.habr.com
