Habrastatistics: exploring the most and least visited sections of the site

Hi Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views and their ratings. However, the question of how popular the individual sections of the site are remained unexamined. It became interesting to look at this in more detail and find the most popular and the least popular hubs. Finally, I will take a closer look at the "geektimes effect" and finish with a new selection of the best articles based on the new ratings.


For those interested in what came out of it, the continuation is under the cut.

Let me remind you once again that the statistics and ratings are unofficial; I have no insider information. There is also no guarantee that I did not make a mistake somewhere or miss something. Still, I think it turned out interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

The first version of the parser took into account only the number of views, comments and article ratings. That is already good, but it does not allow more complex queries. It is time to analyze the thematic sections of the site; this makes quite interesting research possible, for example, seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now returns the hubs an article belongs to, as well as the author's nickname and rating (plenty of interesting things can be done with those too, but that will come later). The data is saved to a csv file that looks roughly like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,
views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
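As a rough illustration of how one such line could be read back without pandas, here is a minimal sketch. The function name `parse_row`, the dictionary keys `link` and `title`, and the shortened English title in the sample are my assumptions; only the field layout is taken from the example above.

```python
import csv
import io

# Hypothetical sample row in the same layout as the file above (title shortened).
SAMPLE = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Slack messenger, pros and cons",'
          'votes:7,votesplus:8,votesmin:1,bookmarks:32,views:8300,comments:10,'
          'user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft\n')

COUNTERS = ("votes", "votesplus", "votesmin", "bookmarks", "views", "comments", "subscribers")

def parse_row(line: str) -> dict:
    """Split one CSV line into a dict; 'key:value' fields are keyed by their prefix."""
    fields = next(csv.reader(io.StringIO(line)))
    row = {"datetime": fields[0], "link": fields[1], "title": fields[2]}
    for field in fields[3:]:
        if not field:
            continue  # tolerate an empty field left by a trailing comma
        key, _, value = field.partition(":")
        row[key] = value
    # Integer counters become ints; the hub string becomes a list of hub names.
    for key in COUNTERS:
        row[key] = int(row[key])
    row["hubs"] = row["hubs"].split("+")
    return row

row = parse_row(SAMPLE)
print(row["views"], row["hubs"])  # 8300 ['productpm', 'soft']
```

The `csv` module is used instead of a plain `split(',')` because article titles are quoted and may contain commas themselves.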

Let's get the list of the site's main thematic hubs.

import requests

def get_as_str(link: str) -> Str:
    # Download the page; on any network error fall back to an empty string
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags') 
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if len(hub_name) > 0 and len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)

The find_between function and the Str class extract the string between two tags; I used them earlier. Thematic hubs are marked with a "*", so they are easy to single out, and you can also uncomment the corresponding lines to get the sections of the other categories.
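The Str class itself is not shown in this part. A minimal sketch of what such a helper might look like, assuming it is a thin subclass of the built-in `str` and that an empty string is returned when either delimiter is missing (both are my assumptions, not the original implementation):

```python
class Str(str):
    """Hypothetical reconstruction of the helper; the original class is not shown."""

    def find_between(self, left: str, right: str) -> "Str":
        """Return the text between the first `left` and the next `right`, or an empty Str."""
        start = self.find(left)
        if start < 0:
            return Str("")
        start += len(left)
        end = self.find(right, start)
        if end < 0:
            return Str("")
        return Str(self[start:end])


# Made-up HTML fragment shaped like the markers used in get_hubs().
fragment = Str('<a href="https://habr.com/ru/hub/python/" class="x">Python<span>*</span></a> list-snippet__tags')
info = fragment.find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
hub_name = info.find_between('/', '/"')
print(hub_name)  # python
```

Returning a Str rather than a plain `str` is what lets the calls be chained, as get_hubs does.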

The result of the get_hubs function is a rather impressive list, which we save as a set. I give the list in full so that you can appreciate its size.

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

For comparison, the geektimes sections look more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that returns whether a given article belongs to geektimes or to a profile hub.

def is_geektimes(hubs: List) -> bool:
    return len(set(hubs) & hubs_gt) > 0

def is_geektimes_only(hubs: List) -> bool:
    return is_geektimes(hubs) and not is_profile(hubs)

def is_profile(hubs: List) -> bool:
    return len(set(hubs) & hubs_profile) > 0

Similar functions were written for the other sections ("development", "administration", etc.).
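Since all of these checks have the same shape, they could also be generated from the hub sets. Below is a minimal sketch of that idea. The factory name `make_section_check` and the tiny sample sets are hypothetical; the names is_develop and is_admin appear later in the article, but their actual bodies are not shown.

```python
from typing import Callable, Iterable, Set

def make_section_check(section_hubs: Set[str]) -> Callable[[Iterable[str]], bool]:
    """Build a predicate that is True when an article has at least one hub from the section."""
    def check(hubs: Iterable[str]) -> bool:
        return not section_hubs.isdisjoint(hubs)
    return check

# Tiny illustrative subsets, not the real hub lists.
hubs_develop_sample = {"programming", "python", "cpp"}
hubs_admin_sample = {"sys_admin", "devops", "linux"}

is_develop = make_section_check(hubs_develop_sample)
is_admin = make_section_check(hubs_admin_sample)

print(is_develop(["python", "DIY"]), is_admin(["python", "DIY"]))  # True False
```

`isdisjoint` accepts any iterable, so the article's hub list does not have to be converted to a set first.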

Processing

It's time to start the analysis. We load the dataset and process the hub data.

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => [popular_science, astronomy]
    return s.split(':')[1].split('+')

def to_date(dt: datetime) -> datetime.date:
    return dt.date()

# error_bad_lines was removed in pandas 2.0; raising on malformed lines is the default anyway
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)
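The date handling above parses a UTC timestamp, shifts it by three hours and keeps only the calendar date. The same step for a single value, using only the standard library, might look like this. The function name `to_local_date` is hypothetical, and reading the three-hour shift as a conversion to Moscow time (UTC+3) is my assumption.

```python
import datetime

def to_local_date(stamp: str, offset_hours: int = 3) -> datetime.date:
    """Parse a UTC stamp such as '2018-12-18T12:43Z' and return the date after the shift."""
    utc = datetime.datetime.strptime(stamp, "%Y-%m-%dT%H:%MZ")
    return (utc + datetime.timedelta(hours=offset_hours)).date()

print(to_local_date("2018-12-18T12:43Z"))  # 2018-12-18
print(to_local_date("2018-12-31T22:30Z"))  # 2019-01-01
```

The second call shows why the shift matters for daily statistics: an article published late in the evening UTC lands on the next calendar day.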

Now we can group the data by day and display the number of publications for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
grouped = g.sum().reset_index()
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
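For readers unfamiliar with `rolling(window=20, min_periods=1).mean()`: it is a trailing moving average in which the first points are averaged over however many values are available so far. A plain-Python sketch of the same calculation (the function name `rolling_mean` is mine):

```python
from typing import List, Sequence

def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average; early points average over the values seen so far."""
    result = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        result.append(running / min(i + 1, window))
    return result

print(rolling_mean([10, 20, 30, 40], window=2))  # [10.0, 15.0, 25.0, 35.0]
```

With a 20-day window the daily noise (weekends, for instance) is smoothed out, at the cost of sharp changes appearing on the chart with some delay.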

We use Matplotlib to display the number of published articles:


I separated "geektimes" and "geektimes only" articles on the chart, because an article can belong to both sections at once (for example, "DIY" + "microcontrollers" + "C++"). I used the label "profile" for the site's profile articles, although the English word "profile" may not be quite the right term for this.

In the previous part we asked about the "geektimes effect", related to the change in the payment rules for geektimes articles starting this summer. Let's display the geektimes articles separately:

df_gt = df[(df['is_geektimes_only'] == True)]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
grouped = group_gt.sum().reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped['views'].rolling(window=20, min_periods=1).mean()

The result is interesting. The ratio of views of geektimes articles to the total appears to be roughly 1:5. But while the total number of views fluctuated noticeably, the views of "entertainment" articles stayed at about the same level.


You can also see that the total number of views of articles in the "geektimes" section did fall after the rules changed, but, judging "by eye", by no more than 5% of the total values.

It is interesting to look at the average number of views per article:


For "entertainment" articles it is about 40% above the average. This is probably not surprising. The dip at the beginning of April is unclear to me; maybe that is what really happened, or it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the graph shows two more noticeable peaks in the number of article views: the New Year and the May holidays.
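The code for the average number of views per article is not shown here. One plausible way to compute it, sketched in plain Python under the assumption that it is simply the day's total views divided by the number of articles published that day (the function name and the sample numbers are made up):

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def avg_views_per_article(rows: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """rows: (date, views) pairs, one per article. Returns date -> mean views per article."""
    totals = defaultdict(lambda: [0, 0])  # date -> [sum of views, number of articles]
    for date, views in rows:
        totals[date][0] += views
        totals[date][1] += 1
    return {date: views_sum / count for date, (views_sum, count) in sorted(totals.items())}

sample_rows = [("2019-01-01", 1000), ("2019-01-01", 3000), ("2019-01-02", 500)]
print(avg_views_per_article(sample_rows))  # {'2019-01-01': 2000.0, '2019-01-02': 500.0}
```

A daily series like this would then be smoothed with the same 20-day rolling mean as the other curves.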

Hubs

Let's move on to the promised analysis of hubs. Let's list the top 20 hubs by number of views:

hubs_info = []
for hub_name in hubs_all:
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = list(map(lambda x: x[2], hubs_top))
top_names = list(map(lambda x: x[0], hubs_top))

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

Result:

Surprisingly, the most popular hub by views turned out to be "Information Security"; the top 5 also included "Programming" and "Popular Science".

The anti-top is occupied by Gtk and Cocoa.
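Both ends of the ranking can also be obtained in a single pass over the articles instead of filtering the whole table once per hub. A minimal sketch with made-up numbers (the function name `views_by_hub` and the sample data are hypothetical):

```python
from collections import Counter
from typing import Iterable, List, Tuple

def views_by_hub(articles: Iterable[Tuple[List[str], int]]) -> Counter:
    """articles: (hubs, views) pairs. Returns a Counter mapping hub -> total views."""
    totals = Counter()
    for hubs, views in articles:
        for hub in set(hubs):  # set() guards against a hub listed twice
            totals[hub] += views
    return totals

sample = [
    (["infosecurity", "programming"], 9000),
    (["programming", "python"], 4000),
    (["gtk"], 300),
]
totals = views_by_hub(sample)
top = totals.most_common(2)                                # most viewed hubs first
bottom = sorted(totals.items(), key=lambda kv: kv[1])[:2]  # least viewed hubs first
print(top)     # [('programming', 13000), ('infosecurity', 9000)]
print(bottom)  # [('gtk', 300), ('python', 4000)]
```

Note that an article's views are credited in full to every hub it belongs to, which is also how the per-hub sum in the article's code behaves.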


I'll tell you a secret: the top hubs can also be seen here, although the number of views is not shown there.

Rating

And finally, the promised rating. Using the hub analysis data, we can display the most popular articles for the most popular hubs for this year, 2019.
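The selection code is not shown in the article. A minimal sketch of how the top articles of one hub could be picked, assuming articles are dictionaries with `title`, `views` and `hubs` keys and that views are the ranking criterion (both are my assumptions; the sample titles are made up):

```python
import heapq
from typing import Dict, Iterable, List

def top_articles(articles: Iterable[Dict], hub: str, n: int = 5, key: str = "views") -> List[Dict]:
    """Return up to n articles that belong to `hub`, largest `key` value first."""
    in_hub = (a for a in articles if hub in a["hubs"])
    return heapq.nlargest(n, in_hub, key=lambda a: a[key])

sample_articles = [
    {"title": "Article A", "views": 12000, "hubs": ["infosecurity"]},
    {"title": "Article B", "views": 45000, "hubs": ["infosecurity", "programming"]},
    {"title": "Article C", "views": 800, "hubs": ["gtk"]},
]
for article in top_articles(sample_articles, "infosecurity", n=2):
    print(article["views"], article["title"])
```

A hub with a single article, such as "gtk" in the sample, simply returns that one article as its whole rating.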

Information Security

Programming

Popular Science

Career

Legislation in IT

Web Development

GTK

And finally, so that no one is offended, I will give the rating of the least visited hub, "gtk". Only one article was published in it during the year, and it "automatically" takes first place in the rating.

Conclusion

There will be no conclusion. Happy reading, everyone.

Source: www.habr.com
