Habrastatistics: exploring the most and least visited sections of the site

Hi, Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views, and their ratings. However, the popularity of the site's sections was left unexamined. It seemed interesting to look at this in more detail and find the most and least popular hubs. Finally, I will look at the "geektimes effect" in more detail and finish with a new selection of the best articles based on the new rankings.

For those interested in what came out of it, the continuation is under the cut.

Let me remind you once again that the statistics and ratings are unofficial; I have no insider information. There is also no guarantee that I have not made a mistake somewhere or missed something. Still, I think the result turned out interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

In the first version of the parser, only the number of views, the number of comments, and the article ratings were taken into account. That is already good, but it does not allow more complex queries. It is time to analyze the site's thematic sections; this makes for quite interesting research, for example seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved and now returns the hubs the article belongs to, as well as the author's nickname and rating (karma). There is a lot of interesting work to do here too, but that will come later. The data is saved in a csv file that looks like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,
views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...

Next, we get the list of the site's main thematic hubs.

import requests

def get_as_str(link: str) -> Str:
    # Download a page; return an empty Str on any error.
    # Str is the helper class used in the previous part (a str with find_between).
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic hubs are marked with "*" in the page markup
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)

The find_between function and the Str class extract the substring between two tags; I used them earlier. Thematic hubs are marked with "*", so they are easy to pick out, and you can also uncomment the corresponding lines to get the hubs of the other sections.
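The Str helper itself is not shown in this part. A minimal sketch of what such a class might look like is given below; the implementation (a str subclass whose find_between returns an empty Str when a marker is missing) is an assumption, not the author's original code, and the sample HTML fragment is made up.

```python
class Str(str):
    """A str subclass with a helper for cutting text out between two markers.

    Hypothetical reconstruction: the original class is not shown in the article.
    """

    def find_between(self, first: str, last: str) -> "Str":
        # Locate the opening marker; return an empty string if it is absent.
        start = self.find(first)
        if start == -1:
            return Str("")
        start += len(first)
        # Locate the closing marker after the opening one.
        end = self.find(last, start)
        if end == -1:
            return Str("")
        return Str(self[start:end])


# Made-up fragment that mimics the shape of a hub link.
html = Str('<a href="https://habr.com/ru/hub/python/" class="x">Python</a>')
info = html.find_between('"https://habr.com/ru/hub', 'class')
print(info.find_between('/', '/"'))  # python
```

Returning an empty Str instead of raising keeps the scraping loop simple: a fragment without the expected markers just fails the later length check.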

The output of the get_hubs function is a rather impressive list, which we save as a set. I show the whole list on purpose so that you can appreciate its size.

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

For comparison, the geektimes sections look more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that tells whether an article belongs to geektimes or to a profile hub.

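from typing import List

def is_geektimes(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_profile(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0

def is_geektimes_only(hubs: List[str]) -> bool:
    # True if the article has geektimes hubs and no profile hubs
    return is_geektimes(hubs) and not is_profile(hubs)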

Similar functions were written for the other sections ("development", "administration", and so on).
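Since these per-section checks all follow the same pattern, they could also be generated from one helper. The sketch below is only an illustration: make_section_check is a hypothetical name, and the two sets are short made-up excerpts, not the real section lists.

```python
from typing import Callable, Iterable, Set

def make_section_check(section_hubs: Set[str]) -> Callable[[Iterable[str]], bool]:
    """Build a predicate that is True when an article has a hub in section_hubs."""
    def check(hubs: Iterable[str]) -> bool:
        # isdisjoint() is False as soon as one hub is shared.
        return not section_hubs.isdisjoint(hubs)
    return check

# Illustrative excerpts only; the real sets would come from get_hubs().
hubs_admin = {'sys_admin', 'devops', 'nginx'}
hubs_develop = {'python', 'cpp', 'webdev'}

is_admin = make_section_check(hubs_admin)
is_develop = make_section_check(hubs_develop)

print(is_admin(['devops', 'python']))   # True
print(is_develop(['popular_science']))  # False
```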

Processing

It is time to start the analysis. We load the dataset and process the hub data.

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => ['popular_science', 'astronomy']
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# error_bad_lines was removed in pandas 2.0; malformed rows still raise an error by default
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
# Timestamps are stored in UTC ("Z"); shift them to UTC+3 (Moscow time)
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
# One boolean flag per section, computed from each article's hub list
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)
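The sample row stores counters as "name:value" strings (for example views:8300), while the later code sums df['views'] as numbers, so a conversion step is implied but not shown. One possible way to do it, assuming every such cell has exactly the form name:integer, is sketched below; to_int is a hypothetical helper and the second row is made up.

```python
import pandas as pd

def to_int(s: str) -> int:
    # "views:8300" => 8300
    return int(s.split(':')[1])

# A tiny stand-in for the real dataset.
df = pd.DataFrame({
    'views': ['views:8300', 'views:1200'],
    'comments': ['comments:10', 'comments:0'],
})

for column in ['views', 'comments']:
    df[column] = df[column].map(to_int)

print(df['views'].sum())  # 9500
```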

Now we can group the data by day and display the number of publications for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
# numeric_only=True sums only numeric and boolean columns and skips text and list columns
grouped = g.sum(numeric_only=True).reset_index()
# Summing a boolean flag per day gives the number of articles in that section
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
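To make the smoothing step concrete: rolling(window=20, min_periods=1).mean() averages each day with up to 19 preceding days, and min_periods=1 lets the first days use whatever history is available instead of producing NaN. A toy example with a window of 3 and made-up counts:

```python
import pandas as pd

counts = pd.Series([10, 20, 30, 40])
smoothed = counts.rolling(window=3, min_periods=1).mean()
print(smoothed.tolist())  # [10.0, 15.0, 20.0, 30.0]
```

The first two values are averages over one and two elements; from the third element on, the full window of 3 is used.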

We plot the number of published articles using Matplotlib:

I separated the "geektimes" and "geektimes only" articles on the chart because an article can belong to both sections at once (for example, "DIY" + "microcontrollers" + "C++"). I used the word "profile" to denote the site's specialized articles, although the English term "profile" may not be entirely correct for this.
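The plotting code for the chart described above is not included in the article. A minimal sketch of how such a chart could be drawn follows; the synthetic data, the labels, and the output file name are placeholders, and in the article's context year_days and the *_per_day_avg series would be passed instead.

```python
import datetime

import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic stand-ins for year_days, profile_per_day_avg and geektimes_per_day_avg.
days = [datetime.date(2019, 1, 1) + datetime.timedelta(days=i) for i in range(30)]
profile_avg = pd.Series(range(30, 60))
geektimes_avg = pd.Series(range(10, 40))

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(days, profile_avg, label="profile")
ax.plot(days, geektimes_avg, label="geektimes")
ax.set_ylabel("Articles per day (20-day rolling mean)")
ax.legend()
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
fig.tight_layout()
fig.savefig("articles_per_day.png")
```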

In the previous part, readers asked about the "geektimes effect" associated with the change in the payment rules for geektimes articles that took effect this summer. Let's show the geektimes articles separately:

# Keep only the articles that belong exclusively to geektimes hubs
df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
# Use a separate name so the totals in `grouped` are not overwritten
grouped_gt = group_gt.sum(numeric_only=True).reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped_gt['views'].rolling(window=20, min_periods=1).mean()

The result is interesting. The ratio of views of geektimes articles to total views is roughly 1:5. But while the total number of views fluctuated noticeably, views of the "entertainment" articles stayed at about the same level.

You can also notice that the total number of views of articles in the "geektimes" section did fall after the rule change, but judging by eye, by no more than 5% of the total.

It is interesting to look at the average number of views per article:

For "entertainment" articles it is about 40% above average. This is probably not surprising. The dip at the beginning of April is unclear to me: maybe that is what actually happened, or it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the chart also shows two noticeable peaks in article views: the New Year holidays and the May holidays.
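The calculation behind the per-article average is not shown. One plausible way, offered here as an assumption, is to divide each day's total views by that day's article count and then smooth the result as before. A sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({
    'date': ['2019-01-01', '2019-01-01', '2019-01-02'],
    'views': [1000, 3000, 5000],
})

g = df.groupby('date')
# Mean views per article for each day: total views divided by article count.
views_per_article = g['views'].sum() / g.size()
views_per_article_avg = views_per_article.rolling(window=20, min_periods=1).mean()
print(views_per_article.tolist())  # [2000.0, 5000.0]
```

Running the same calculation on the "geektimes only" subset and on the whole dataset would give the two curves being compared.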

Hubs

Let's move on to the promised analysis of hubs. Let's list the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all is assumed to be the union of all the hub sets collected above
hubs_info = []
for hub_name in hubs_all:
    # Select the articles whose hub list contains this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: the 20 hubs with the most total views
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [hub[2] for hub in hubs_top]
top_names = [hub[0] for hub in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

The result:

Surprisingly, the most popular hub by views was "Information Security"; the top 5 also included "Programming" and "Popular Science".

The bottom of the ranking is occupied by Gtk and Cocoa.
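The bottom of the ranking can be obtained from the same kind of list by sorting in ascending order. The sketch below uses made-up tuples in the (hub_name, count, views) layout; skipping hubs without articles is an assumption about how empty hubs should be treated.

```python
# (hub_name, article_count, total_views); the numbers are made up.
hubs_info = [
    ('infosecurity', 120, 900000),
    ('gtk', 1, 2500),
    ('cocoa', 2, 4000),
    ('python', 80, 600000),
    ('fortran', 0, 0),
]

# Skip hubs without any articles, then take the least-viewed ones.
non_empty = [h for h in hubs_info if h[1] > 0]
hubs_bottom = sorted(non_empty, key=lambda v: v[2])[:2]
print([name for name, _, _ in hubs_bottom])  # ['gtk', 'cocoa']
```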

I'll tell you a secret: the top hubs can also be seen here, although the number of views is not shown there.

Rating

And finally, the promised rating. Using the hub analysis data, we can show the most popular articles in the most popular hubs for 2019.
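The selection code for the rating is not shown. Assuming a DataFrame like the one from the processing step, with a list-valued hubs column and numeric views, the top articles of one hub could be picked as sketched below. Ranking by views is an assumption, top_articles is a hypothetical helper, and the rows are placeholders.

```python
import pandas as pd

df = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'hubs': [['infosecurity'], ['infosecurity', 'python'], ['python'], ['infosecurity']],
    'views': [500, 9000, 7000, 3000],
})

def top_articles(df: pd.DataFrame, hub_name: str, n: int = 10) -> pd.DataFrame:
    # Keep rows whose hub list contains hub_name, then take the n most viewed.
    mask = df['hubs'].apply(lambda hubs: hub_name in hubs)
    return df[mask].nlargest(n, 'views')

print(top_articles(df, 'infosecurity', n=2)['title'].tolist())  # ['B', 'D']
```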

Information Security

Programming

Popular Science

Career

Legislation in IT

Web Development

GTK

And finally, so that nobody feels left out, I will give the rating of the least visited hub, "gtk". Over the year, one article was published in it, and it "automatically" takes first place in the rating.

Conclusion

There will be no conclusion. Happy reading, everyone!

Source: www.habr.com
