Habrastatistics: exploring the most and least visited sections of the site

Hello, Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views and their ratings. However, the popularity of the site's sections was left unexamined. It seemed interesting to look at this in more detail and find the most popular and the least popular hubs. Finally, I will take a closer look at the "geektimes effect" and finish with a new selection of the best articles based on the new rankings.


For those interested in what came of it, the continuation is under the cut.

Let me remind you once again that these statistics and ratings are unofficial; I have no insider information. There is also no guarantee that I did not make a mistake somewhere or miss something. Still, I think the result turned out interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

In the first version of the parser, only the number of views, the number of comments and the article rating were taken into account. That is already useful, but it does not allow more complex queries. It is time to analyze the thematic sections of the site; this makes some quite interesting research possible, for example seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now returns the hubs the article belongs to, as well as the author's nickname and karma (many interesting things can be done with these too, but that will come later). The data is saved in a csv file that looks like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,
views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
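To make the record layout concrete, here is a minimal sketch of how one such line could be split into fields using only the standard library. The function name parse_record and the choice of which fields to convert to integers are my assumptions, not part of the original parser, and the title is shortened for the example.

```python
import csv
import io

# One record in the format shown above (title shortened for the example).
line = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Title, with a comma",'
        'votes:7,votesplus:8,votesmin:1,bookmarks:32,'
        'views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft')


def parse_record(text: str) -> dict:
    """Hypothetical helper: split one csv line into a dict of fields."""
    # csv.reader handles the quoted title, which may itself contain commas.
    row = next(csv.reader(io.StringIO(text)))
    record = {"datetime": row[0], "url": row[1], "title": row[2]}
    # The remaining fields have the form "name:value".
    for field in row[3:]:
        key, _, value = field.partition(":")
        record[key] = value
    # Hubs are joined with "+"; counters are converted to integers.
    record["hubs"] = record["hubs"].split("+")
    for key in ("votes", "bookmarks", "views", "comments"):
        record[key] = int(record[key])
    return record


record = parse_record(line)
print(record["views"], record["hubs"])
```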

Let's get the list of the site's thematic hubs.

import requests

# Str is the author's str subclass with a find_between() helper (described below).
def get_as_str(link: str) -> Str:
    # Download a page; return an empty string on any error
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            if "*</span>" in info:  # thematic hubs are marked with "*"
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)
    return hubs

The find_between function and the Str class extract the string between two tags; I used them earlier. Thematic hubs are marked with "*", so they are easy to pick out, and you can also uncomment the corresponding lines to get the hubs of the other categories.
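The Str class itself is not shown in this part. A minimal sketch of what such a helper might look like is given below; it is my reconstruction from how find_between is called above, not the author's actual code.

```python
class Str(str):
    """Hypothetical stand-in for the author's helper class."""

    def find_between(self, start: str, end: str) -> "Str":
        # Locate the first occurrence of `start`; give up if it is absent.
        begin = self.find(start)
        if begin < 0:
            return Str("")
        begin += len(start)
        # Search for `end` only after the start marker.
        finish = self.find(end, begin)
        if finish < 0:
            return Str("")
        return Str(self[begin:finish])


# Made-up HTML fragment, only to show the call chain.
sample = Str('<a href="https://habr.com/ru/hub/python/">Python *</a>')
hub = sample.find_between('"https://habr.com/ru/hub', '"').find_between('/', '/')
print(hub)
```

Returning an empty Str instead of raising keeps the calling loop simple: a block without the expected markers just yields a name of length zero and is filtered out.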

The output of get_hubs is a rather impressive list, which we store as a Python set. I give the list in full so that you can appreciate its size.

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

By comparison, the geektimes sections look more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that tells whether an article belongs to geektimes or to a profile hub.

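from typing import List

def is_geektimes(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_profile(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0

def is_geektimes_only(hubs: List[str]) -> bool:
    # Geektimes article that is not in any profile hub
    return is_geektimes(hubs) and not is_profile(hubs)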

Similar functions were written for the other sections ("development", "administration", and so on).

Processing

It is time to start the analysis. We load the dataset and process the hub data.
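Since these checks differ only in the set they test against, they could also be generated from one factory. The sketch below is one possible way to do it; make_section_check is a hypothetical name, and the two small sets are placeholders standing in for the real results of get_hubs for the "Admin" and "Develop" pages.

```python
from typing import Callable, Iterable, Set


def make_section_check(section_hubs: Set[str]) -> Callable[[Iterable[str]], bool]:
    """Build a predicate that is True when an article touches the section."""
    def check(hubs: Iterable[str]) -> bool:
        # isdisjoint() is False as soon as one hub is shared.
        return not section_hubs.isdisjoint(hubs)
    return check


# Placeholder sets (assumption): the real ones come from get_hubs().
hubs_admin = {"sys_admin", "devops"}
hubs_develop = {"programming", "python"}

is_admin = make_section_check(hubs_admin)
is_develop = make_section_check(hubs_develop)

print(is_admin(["devops", "soft"]), is_develop(["soft"]))
```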

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => ["popular_science", "astronomy"]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# on_bad_lines='error' (pandas 1.3+) replaces error_bad_lines=True, which was removed in pandas 2.0
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', on_bad_lines='error', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
# Timestamps are stored in UTC; shift them by 3 hours (presumably to Moscow time)
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
# One boolean flag per section
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)
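In the sample record the numeric fields carry their name as a prefix ("views:8300"). If the columns arrive in the DataFrame in that form (an assumption on my part, since the loading step is not shown in full), they have to be turned into numbers before they can be summed. A minimal sketch with a hypothetical helper to_int:

```python
def to_int(value: str) -> int:
    """Hypothetical helper: "views:8300" -> 8300, plain "8300" -> 8300."""
    # rpartition() keeps everything after the last ":"; without a ":" it
    # returns the whole string as the third element.
    return int(str(value).rpartition(":")[2])


print(to_int("views:8300"), to_int("8300"))
```

With pandas this could be applied as df['views'] = df['views'].map(to_int) before any grouping.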

Now we can group the data by day and show the number of publications for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
# Sum only the boolean flag columns: True counts as 1, so each sum is the number of articles per day
flags = ['is_profile', 'is_geektimes', 'is_geektimes_only', 'is_admin', 'is_develop']
grouped = g[flags].sum().reset_index()
# Smooth the daily counts with a 20-day rolling mean
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
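For readers unfamiliar with rolling(window=20, min_periods=1).mean(), here is a small pure-Python sketch of the same calculation: each point is the mean of the current value and up to window - 1 preceding values, so the first points are averaged over a shorter window. The function name rolling_mean is mine.

```python
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int = 20) -> List[float]:
    """Trailing moving average; early points use as many values as exist."""
    result = []
    for i in range(len(values)):
        # Take the current value and up to window - 1 values before it.
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result


smoothed = rolling_mean([10, 20, 30, 40], window=2)
print(smoothed)  # [10.0, 15.0, 25.0, 35.0]
```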

We plot the number of published articles with Matplotlib:

[Chart: number of published articles per day by section]

I separated "geektimes" and "geektimes only" articles on the chart, because an article can belong to both kinds of sections at once (for example, "DIY" + "microcontrollers" + "C++"). I used the name "profile" to denote the site's profile articles, although the English word "profile" may not be entirely accurate for this.

In the previous part, readers asked about the "geektimes effect" related to the change in payment rules for geektimes articles starting this summer. Let's show the geektimes articles separately:

# Keep only the articles that are in geektimes hubs and in no profile hub
df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
# Total views per day (the 'views' column must be numeric)
grouped = group_gt[['views']].sum().reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped['views'].rolling(window=20, min_periods=1).mean()

The result is interesting. The ratio of views of geektimes articles to total views is roughly 1:5. But while the total number of views fluctuated noticeably, the views of "entertainment" articles stayed at about the same level.

[Chart: views per day, geektimes articles compared with the total]

You can also notice that the total number of views of articles in the "geektimes" section did fall after the rules changed, but, judging by eye, by no more than 5% of the total.

It is interesting to look at the average number of views per article:

[Chart: average number of views per article]

For "entertainment" articles it is about 40% above the average. This is probably not surprising. The dip at the beginning of April is unclear to me: maybe that is what really happened, or it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the graph shows two more noticeable peaks in the number of article views: the New Year and May holidays.
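The per-article average is simply the day's total views divided by the number of articles published that day. A minimal pure-Python sketch of that calculation follows; the function name and the sample numbers are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def avg_views_per_article(rows: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """rows are (date, views) pairs, one per article."""
    totals = defaultdict(lambda: [0, 0])  # date -> [total views, article count]
    for date, views in rows:
        totals[date][0] += views
        totals[date][1] += 1
    # Divide total views by the number of articles for each day.
    return {d: v / n for d, (v, n) in sorted(totals.items())}


# Made-up sample data.
rows = [("2019-04-01", 1000), ("2019-04-01", 3000), ("2019-04-02", 500)]
averages = avg_views_per_article(rows)
print(averages)
```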

Hubs

Let's move on to the promised analysis of hubs. We list the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all: all hub names collected above (assumed to be the union of the hub sets)
hubs_info = []
for hub_name in hubs_all:
    # Select the articles that list this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: sort by total views and keep the top 20
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [v[2] for v in hubs_top]
top_names = [v[0] for v in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

Result:

[Chart: top 20 hubs by number of views]

Surprisingly, the most popular hub by views turned out to be "Information Security"; the top 5 also included "Programming" and "Popular science".

The anti-top is occupied by Gtk and Cocoa.
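The anti-top can be obtained from the same hubs_info tuples by sorting in ascending order. The sketch below is one way to do it; least_viewed and the min_articles filter are my additions, and the numbers in the sample list are invented, not the real statistics.

```python
from typing import List, Tuple

HubInfo = Tuple[str, int, int]  # (hub name, article count, total views)


def least_viewed(info: List[HubInfo], n: int = 20, min_articles: int = 1) -> List[HubInfo]:
    """Return the n hubs with the fewest views, skipping hubs with too few articles."""
    active = [h for h in info if h[1] >= min_articles]
    return sorted(active, key=lambda v: v[2])[:n]


# Invented numbers, for illustration only.
sample_info = [
    ("infosecurity", 120, 900000),
    ("gtk", 1, 1500),
    ("cocoa", 2, 2100),
    ("python", 80, 400000),
    ("empty_hub", 0, 0),
]
bottom = least_viewed(sample_info, n=2)
print([name for name, _, _ in bottom])
```

Filtering out hubs with no articles at all avoids filling the bottom of the chart with zeros.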

[Chart: least viewed hubs]

I'll let you in on a secret: the top hubs can also be seen here, although the number of views is not shown there.

Rating

And finally, the promised rating. Using the hub analysis data, we can list the most popular articles in the most popular hubs for this year, 2019.
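The selection step is not shown in the article. A minimal sketch of how such a per-hub top could be built is given below, assuming that articles are ranked by views (the original may use another metric, which is why the field is a parameter). The function name and the sample articles are hypothetical.

```python
from typing import Dict, List


def top_articles(articles: List[Dict], hub_name: str, n: int = 5, key: str = "views") -> List[Dict]:
    """Return the n highest-ranked articles that list the given hub."""
    in_hub = [a for a in articles if hub_name in a["hubs"]]
    return sorted(in_hub, key=lambda a: a[key], reverse=True)[:n]


# Made-up articles, for illustration only.
articles = [
    {"title": "A", "views": 500, "hubs": ["gtk", "linux"]},
    {"title": "B", "views": 9000, "hubs": ["infosecurity"]},
    {"title": "C", "views": 12000, "hubs": ["infosecurity", "linux"]},
]
best = top_articles(articles, "infosecurity", n=2)
print([a["title"] for a in best])
```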

Information Security

Programming

Popular science

Career

Legislation in IT

Web development

GTK

And finally, so that no one feels left out, here is the rating for the least visited hub, "gtk". Only one article was published in it during the year, so it "automatically" takes first place in the rating.

Conclusion

There will be no conclusion. Happy reading, everyone.

Source: www.habr.com
