Habrastatistics: exploring the most and least visited sections of the site

Hello, Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views, and their ratings. However, the popularity of the site's individual sections was not examined. It seemed worth looking at this in more detail and finding the most popular and the least popular hubs. Finally, I will take a closer look at the "geektimes effect" and finish with a new selection of the best articles based on the new rankings.

For those interested in what came of it, the continuation is under the cut.

Let me remind you once again that the statistics and ratings are unofficial; I have no insider information. There is also no guarantee that I haven't made a mistake somewhere or missed something. Still, I think the result turned out interesting. We'll start with the code; those who aren't interested in it can skip the first sections.

Data collection

The first version of the parser only took into account the number of views, the number of comments, and the article rating. That is already useful, but it doesn't allow more complex queries. It's time to analyze the site's thematic sections; this opens the way to some interesting research, for example, seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now returns the hubs an article belongs to, as well as the author's nickname and rating (there are many interesting things to do with those too, but that will come later). The data is saved in a csv file that looks something like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,
views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
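As a rough illustration of the record format above, the sketch below parses one such record into a dictionary using only the standard library. The helper name parse_record and the names given to the first three positional fields are assumptions of this example, not part of the original parser, and the sample line uses a shortened placeholder title.

```python
import csv
from typing import Dict

def parse_record(line: str) -> Dict[str, str]:
    """Split one record into named fields (hypothetical helper)."""
    # csv.reader handles the quoted title, which may itself contain commas.
    fields = next(csv.reader([line], quotechar='"'))
    record = {"datetime": fields[0], "link": fields[1], "title": fields[2]}
    # The remaining fields look like "key:value", e.g. "views:8300".
    for field in fields[3:]:
        if field:
            key, _, value = field.partition(":")
            record[key] = value
    return record

# Placeholder title; the other values are copied from the sample record.
sample = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,'
          '"Sample title, with a comma",'
          'votes:7,votesplus:8,votesmin:1,bookmarks:32,views:8300,comments:10,'
          'user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft')
record = parse_record(sample)
print(record["views"], record["hubs"].split("+"))
```

Values stay as strings here; converting "views" or "karma" to numbers is left to the caller.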

Let's get the list of the site's main thematic hubs.

import requests

# Str is the str subclass with a find_between method from the previous part.

def get_as_str(link: str) -> Str:
    # Download a page; return an empty string on any error.
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic hubs are marked with "*".
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)

The find_between function and the Str class extract the string between two markers; I used them earlier. Thematic hubs are marked with "*", so they are easy to pick out, and you can uncomment the corresponding lines to get the sections of the other categories.
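The Str class itself is not shown in this part. A minimal sketch of how such a helper might look is given below; it is an illustrative reconstruction, not the author's actual code, and the HTML fragment is made up for the example.

```python
class Str(str):
    """A str subclass with a find_between helper (illustrative reconstruction)."""

    def find_between(self, start: str, end: str) -> "Str":
        # Locate the first start marker; return an empty Str if it is missing.
        begin = self.find(start)
        if begin == -1:
            return Str("")
        begin += len(start)
        # Look for the end marker only after the start marker.
        stop = self.find(end, begin)
        if stop == -1:
            return Str("")
        return Str(self[begin:stop])

# Made-up fragment that mimics a link to a hub page.
html = Str('<a href="https://habr.com/ru/hub/python/" class="list-snippet__title-link">')
info = html.find_between('"https://habr.com/ru/hub', 'class')
hub_name = info.find_between('/', '/"')
print(hub_name)
```

Returning Str rather than a plain str is what allows the calls to be chained, as get_hubs does.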

The output of get_hubs is an impressive list, which we save as a Python set. I'm deliberately showing it in full so that you can appreciate its size.

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

For comparison, the geektimes sections look more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that tells whether an article belongs to geektimes or to a profile hub.

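from typing import List

def is_geektimes(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a geektimes hub.
    return len(set(hubs) & hubs_gt) > 0

def is_profile(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a profile hub.
    return len(set(hubs) & hubs_profile) > 0

def is_geektimes_only(hubs: List[str]) -> bool:
    return is_geektimes(hubs) and not is_profile(hubs)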

Similar functions were written for the other sections ("development", "administration", etc.).
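To make the set-based check concrete, here is a small self-contained sketch that labels an article by the sections its hubs fall into. The two sample sets are tiny stand-ins for the full hub sets, and the classify function and its labels are assumptions of this example only.

```python
from typing import Iterable

# Tiny stand-ins for the full hub sets listed earlier.
PROFILE_SAMPLE = {"python", "cpp", "controllers"}
GT_SAMPLE = {"DIY", "space", "gadgets"}

def classify(hubs: Iterable[str]) -> str:
    """Label an article by which sample sets its hubs intersect."""
    hub_set = set(hubs)
    in_profile = not hub_set.isdisjoint(PROFILE_SAMPLE)
    in_gt = not hub_set.isdisjoint(GT_SAMPLE)
    if in_profile and in_gt:
        return "geektimes + profile"
    if in_gt:
        return "geektimes only"
    if in_profile:
        return "profile only"
    return "other"

print(classify(["DIY", "controllers", "cpp"]))
print(classify(["space"]))
print(classify(["python"]))
```

An article tagged with both kinds of hubs counts as "geektimes" but not as "geektimes only", which is the distinction the later charts rely on.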

Processing

It's time to start the analysis. We load the dataset and process the hub data.

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => ["popular_science", "astronomy"]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
dates += datetime.timedelta(hours=3)  # shift the "Z" timestamps by three hours
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
# One boolean column per section, computed from each article's hub list.
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)
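The three-hour shift matters for grouping by day: an article published late in the evening by the "Z" timestamp lands on the next calendar day after the shift. The reason for the offset is not stated in the text; presumably it converts UTC to Moscow time (UTC+3), which is an assumption here. A small standard-library sketch, with a hypothetical helper name:

```python
import datetime

def local_date(stamp: str, offset_hours: int = 3) -> datetime.date:
    """Parse a 'YYYY-MM-DDTHH:MMZ' stamp and return the date after shifting."""
    moment = datetime.datetime.strptime(stamp, "%Y-%m-%dT%H:%MZ")
    return (moment + datetime.timedelta(hours=offset_hours)).date()

print(local_date("2018-12-18T12:43Z"))  # same day after the shift
print(local_date("2018-12-18T22:30Z"))  # rolls over to the next day
```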

Now we can group the data by day and plot the number of publications for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
# Sum only the boolean flag columns: each daily sum is the number of articles in that section.
flag_columns = ['is_profile', 'is_geektimes', 'is_geektimes_only', 'is_admin', 'is_develop']
grouped = g[flag_columns].sum().reset_index()
# Smooth the daily counts with a 20-day rolling average.
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
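For readers unfamiliar with rolling(window=20, min_periods=1).mean(), the following dependency-free sketch shows what that smoothing computes for data without missing values: at each position, the mean of the current value and up to window - 1 preceding ones. The function name is hypothetical and the numbers are made up.

```python
from typing import List, Sequence

def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """Trailing mean over at most `window` most recent values."""
    result = []
    for i in range(len(values)):
        # With min_periods=1 the first positions average fewer values.
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

print(rolling_mean([2, 4, 6, 8], window=2))  # [2.0, 3.0, 5.0, 7.0]
```

Because of min_periods=1, the curve starts at the very first day instead of being empty for the first 19 days.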

We plot the number of published articles using Matplotlib:

[Figure: number of articles published per day, by section]

I separated "geektimes" and "geektimes only" articles in the chart because an article can belong to both sections at once (for example, "DIY" + "microcontrollers" + "C++"). I used the label "profile" for the site's profile articles, although the English word "profile" may not be entirely accurate here.

In the previous part there were questions about the "geektimes effect" associated with the change in the payment rules for geektimes articles that took effect this summer. Let's plot the geektimes articles separately:

df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
# Total views of geektimes-only articles per day.
grouped_gt = group_gt[['views']].sum().reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped_gt['views'].rolling(window=20, min_periods=1).mean()

The result is interesting. The approximate ratio of geektimes article views to the total is somewhere around 1:5. But while the total number of views fluctuated noticeably, views of the "entertainment" articles stayed at roughly the same level.

[Figure: views per day, all articles vs. geektimes-only articles]

You can also see that the total number of views of articles in the "geektimes" section did fall after the rule change, but, judging "by eye", by no more than 5% of the total values.

It is interesting to look at the average number of views per article:

[Figure: average number of views per article]

For "entertainment" articles it is about 40% above average. This is probably not surprising. The dip at the beginning of April is unclear to me: maybe that is really what happened, or it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).
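The code behind the per-article average is not shown in the text. One plausible way to compute it is to divide each day's total views by the number of articles published that day. The sketch below does this with the standard library only; the function name and the sample rows are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def views_per_article(rows: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Map each date to (total views that day) / (articles that day)."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for date, views in rows:
        totals[date] += views
        counts[date] += 1
    return {date: totals[date] / counts[date] for date in totals}

# Made-up (date, views) pairs, one per article.
rows = [("2019-04-01", 3000), ("2019-04-01", 9000), ("2019-04-02", 5000)]
averages = views_per_article(rows)
print(averages)
```

A day with very few articles makes this ratio jumpy, which is one reason a rolling average is useful before plotting.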

By the way, the chart shows two more noticeable peaks in the number of article views: the New Year holidays and the May holidays.

Hubs

Let's move on to the promised analysis of hubs. Here are the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all is the collection of all hub names gathered above.
hubs_info = []
for hub_name in hubs_all:
    # Select the articles whose hub list contains this hub.
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: the 20 hubs with the largest total views
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [v[2] for v in hubs_top]
top_names = [v[0] for v in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()
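The loop above scans the whole dataset once per hub. As an alternative sketch (not the approach used in the article), the same per-hub totals can be accumulated in a single pass over the articles with collections.Counter. The function name and the sample data are invented for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def hub_totals(articles: Iterable[Tuple[List[str], int]]) -> Tuple[Counter, Counter]:
    """Return (article count per hub, total views per hub)."""
    counts, views = Counter(), Counter()
    for hubs, article_views in articles:
        for hub in set(hubs):  # guard against a hub listed twice
            counts[hub] += 1
            views[hub] += article_views
    return counts, views

# Made-up (hubs, views) pairs, one per article.
articles = [
    (["python", "programming"], 12000),
    (["programming", "cpp"], 8000),
    (["gtk"], 900),
]
counts, views = hub_totals(articles)
print(views.most_common(2))                      # most viewed hubs
print(min(views.items(), key=lambda kv: kv[1]))  # least viewed hub
```

Counter.most_common gives the top of the ranking directly, and the least viewed hubs can be read from the other end of the same totals.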

Result:

[Figure: top 20 hubs by number of views]

Surprisingly, the most popular hub by views is "Information Security"; the top 5 also includes "Programming" and "Popular Science".

The anti-top is taken by Gtk and Cocoa.

[Figure: least viewed hubs]

I'll let you in on a secret: the top hubs can also be seen here, although the number of views is not shown there.

Rating

And finally, the promised rating. Using the hub analysis data, we can list the most popular articles in the most popular hubs for this year, 2019.

Information Security

Programming

Popular Science

Career

Legislation in IT

Web Development

GTK

And finally, so that no one feels offended, here is the rating for the least visited hub, "gtk". Only one article was published in it during the year, so that article "automatically" takes first place in the rating.
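The text does not say how the per-hub ratings were built. A minimal sketch of one way to do it is shown below, assuming articles are ranked by views within each hub; the function name, the Article tuple layout, and the sample data are all hypothetical.

```python
import heapq
from typing import List, Tuple

Article = Tuple[str, int, List[str]]  # (title, views, hubs)

def top_articles(articles: List[Article], hub: str, n: int = 3) -> List[Tuple[str, int]]:
    """Return up to n (title, views) pairs from the given hub, most viewed first."""
    in_hub = [(title, views) for title, views, hubs in articles if hub in hubs]
    return heapq.nlargest(n, in_hub, key=lambda item: item[1])

# Made-up articles for illustration.
articles = [
    ("A", 500, ["gtk"]),
    ("B", 7000, ["infosecurity"]),
    ("C", 9000, ["infosecurity", "programming"]),
    ("D", 100, ["programming"]),
]
print(top_articles(articles, "infosecurity", n=2))
print(top_articles(articles, "gtk"))
```

With a single article in the hub, as in the "gtk" case, that article is trivially the whole rating.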

Conclusion

There will be no conclusion. Happy reading, everyone.

Source: www.habr.com
