Habrastatistics: exploring the most and least visited sections of the site

Hi, Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views and their ratings. However, the question of how popular the individual sections of the site are remained unexamined. It seemed interesting to look at this in more detail and find the most popular and the least popular hubs. Finally, I will take a closer look at the "geektimes effect" and finish with a new selection of the best articles based on the new rankings.

For those interested in what came of it, the continuation is under the cut.

Let me remind you once again that the statistics and ratings are unofficial; I have no insider information. There is also no guarantee that I have not made a mistake somewhere or missed something. Still, I think it turned out to be interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

In the first version of the parser, only the number of views, the number of comments and the article ratings were taken into account. That is already good, but it does not allow more complex queries. It is time to analyze the thematic sections of the site; this makes some interesting studies possible, for example seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now also returns the hubs an article belongs to, as well as the author's nickname and rating (many interesting things can be done here too, but that is for later). The data is saved in a csv file that looks like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,
views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
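
A record in this format could be parsed roughly as follows. This is a minimal sketch, not the author's parser: the helper name `parse_record`, the names given to the first three positional fields, and the shortened title are assumptions made for illustration.

```python
import csv
import io

# One record in the format shown above; the title is shortened for the example.
RECORD = (
    '2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Slack, a sample title",'
    'votes:7,votesplus:8,votesmin:1,bookmarks:32,views:8300,comments:10,'
    'user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft'
)

# Fields that hold plain integer counters in the sample record.
INT_FIELDS = ("votes", "votesplus", "votesmin", "bookmarks",
              "views", "comments", "subscribers")


def parse_record(line: str) -> dict:
    """Split one CSV line into a dict (hypothetical helper)."""
    # csv.reader handles the quoted title, which may contain commas.
    fields = next(csv.reader(io.StringIO(line)))
    # The first three fields are positional; their names are assumed here.
    row = {"datetime": fields[0], "link": fields[1], "title": fields[2]}
    for field in fields[3:]:
        if not field:
            continue  # tolerate an empty field, e.g. after a trailing comma
        key, _, value = field.partition(":")
        row[key] = value
    for key in INT_FIELDS:
        row[key] = int(row[key])
    row["hubs"] = row["hubs"].split("+")
    return row


row = parse_record(RECORD)
print(row["views"], row["hubs"])
```

Using `csv.reader` rather than a plain `split(',')` matters here because article titles routinely contain commas.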

Let's get the list of the site's main thematic hubs.

import requests

# Str is the author's str subclass with a find_between() helper from the previous part.

def get_as_str(link: str) -> Str:
    # Download a page; return an empty string on any error
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            if "*</span>" in info:  # thematic hubs are marked with "*"
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)
    return hubs

The find_between function and the Str class select the string between two tags; I have used them before. Thematic hubs are marked with "*", so they are easy to pick out, and you can uncomment the corresponding lines to get the hubs of the other categories.
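
The Str class itself is not shown in this part. A minimal sketch of what such a helper might look like is given below; it is an assumption reconstructed from how `find_between` is called in the listing, not the author's actual implementation.

```python
class Str(str):
    """A str subclass with a substring-extraction helper (assumed behaviour)."""

    def find_between(self, start: str, end: str) -> "Str":
        # Locate the first occurrence of the start marker.
        i = self.find(start)
        if i < 0:
            return Str("")
        i += len(start)
        # Locate the end marker after the start marker.
        j = self.find(end, i)
        if j < 0:
            return Str("")
        return Str(self[i:j])


html = Str('<a href="https://habr.com/ru/hub/python/" class="list-snippet__tags">')
info = html.find_between('"https://habr.com/ru/hub', 'class')
hub_name = info.find_between('/', '/"')
print(hub_name)
```

Returning a `Str` rather than a plain `str` is what allows the chained `info.find_between(...)` call used in `get_hubs`.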

The result of get_hubs is a rather impressive list, which we save as a set (the braces below are a Python set literal, not a dictionary). I give the list in full so that its size can be appreciated.

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

By comparison, the geektimes sections look much more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that tells whether an article belongs to geektimes or to a profile hub.

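from typing import List

def is_geektimes(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_profile(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0

def is_geektimes_only(hubs: List[str]) -> bool:
    # Geektimes article that does not belong to any profile hub
    return is_geektimes(hubs) and not is_profile(hubs)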

Similar functions were written for the other sections ("development", "administration" and so on).
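
To make the set logic concrete, here is a small self-contained sketch. The two sets are tiny illustrative subsets of the real ones, and the sample hub lists are made up.

```python
from typing import List

# Tiny illustrative subsets of the real hub sets.
hubs_gt = {"DIY", "popular_science", "astronomy"}
hubs_profile = {"cpp", "python", "controllers"}


def is_geektimes(hubs: List[str]) -> bool:
    # A non-empty intersection means at least one geektimes hub.
    return len(set(hubs) & hubs_gt) > 0


def is_profile(hubs: List[str]) -> bool:
    return len(set(hubs) & hubs_profile) > 0


def is_geektimes_only(hubs: List[str]) -> bool:
    return is_geektimes(hubs) and not is_profile(hubs)


mixed = ["DIY", "controllers", "cpp"]       # belongs to both groups
pure_gt = ["popular_science", "astronomy"]  # geektimes hubs only

print(is_geektimes(mixed), is_geektimes_only(mixed))
print(is_geektimes(pure_gt), is_geektimes_only(pure_gt))
```

An article that mixes hubs from both groups counts as "geektimes" but not as "geektimes only", which is why the two flags are tracked separately.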

Processing

It is time to start the analysis. We load the dataset and process the hub data.

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => ["popular_science", "astronomy"]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# on_bad_lines='error' (pandas >= 1.3) replaces the deprecated error_bad_lines=True
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', on_bad_lines='error', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
dates += datetime.timedelta(hours=3)  # shift from UTC, presumably to Moscow time (UTC+3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)

Now we can group the data by day and display the number of publications for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
grouped = g.sum().reset_index()
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
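
The grouping and smoothing above can be checked on a toy table. The sketch below uses made-up data and a window of 2 instead of 20 so that the numbers are easy to follow by hand.

```python
import pandas as pd

# Five made-up articles over three days.
df = pd.DataFrame({
    "date": ["2019-01-01", "2019-01-01", "2019-01-02", "2019-01-03", "2019-01-03"],
    "is_profile": [True, False, True, True, True],
})

# Summing a boolean flag per day counts the articles that have it set.
per_day = df.groupby("date")["is_profile"].sum()

# min_periods=1 lets the first days produce a value before the window is full.
smoothed = per_day.rolling(window=2, min_periods=1).mean()

print(per_day.tolist())
print(smoothed.tolist())
```

With `min_periods=1` the curve starts on the very first day instead of beginning with NaN values.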

We plot the number of published articles using Matplotlib:

[Chart: number of articles published per day (20-day rolling mean) for the profile, geektimes, geektimes only, admin and develop groups]
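
The plotting code is not included in the article. A line chart of this kind could be drawn roughly as shown below; this is a sketch on synthetic numbers, and the labels, the output file name and the choice of two series are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch also runs without a display
import matplotlib.pyplot as plt

# Synthetic placeholder series; in the article these would be year_days
# and the *_per_day_avg series computed earlier.
days = list(range(1, 11))
profile_avg = [50, 52, 51, 55, 54, 53, 56, 58, 57, 60]
geektimes_avg = [20, 21, 19, 22, 20, 18, 19, 21, 20, 22]

fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(days, profile_avg, label="profile")
ax.plot(days, geektimes_avg, label="geektimes")
ax.set_xlabel("day")
ax.set_ylabel("articles per day (rolling mean)")
ax.legend()
fig.tight_layout()
fig.savefig("articles_per_day.png")  # or plt.show() in an interactive session
```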

I separated "geektimes" and "geektimes only" articles on the chart, because an article can belong to both sections at once (for example, "DIY" + "microcontrollers" + "C++"). I used the name "profile" for the site's specialized articles, although the English term may not be entirely accurate.

In the previous part we asked about the "geektimes effect" related to the change in the payment rules for geektimes articles that took effect this summer. Let's show the geektimes articles separately:

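# Keep only the articles that belong exclusively to geektimes hubs
df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
# A separate name, so the per-day totals for all articles in `grouped` are not overwritten
grouped_gt = group_gt.sum().reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped_gt['views'].rolling(window=20, min_periods=1).mean()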

The result is interesting. The ratio of geektimes article views to the total is roughly 1:5. But while the total number of views fluctuated noticeably, the views of "entertainment" articles stayed at about the same level.

[Chart: views per day, all articles vs. geektimes articles]

You can also see that the total number of views of articles in the "geektimes" section did fall after the rule change, but, judging by eye, by no more than 5% of the total.

It is interesting to look at the average number of views per article:

[Chart: average number of views per article, all articles vs. geektimes articles]

For "entertainment" articles it is about 40% above the average. That is probably not surprising. The dip at the beginning of April is unclear to me: maybe that is what really happened, or it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the graph shows two more noticeable peaks in article views: the New Year holidays and the May holidays.
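
Both quantities discussed here, the share of views and the average views per article, reduce to simple ratios. The sketch below shows the arithmetic on made-up numbers, which are chosen for easy checking and do not reflect the real dataset.

```python
import pandas as pd

# Four made-up articles; the values are illustrative only.
df = pd.DataFrame({
    "views": [1000, 3000, 2000, 6000],
    "is_geektimes_only": [False, True, False, True],
})

gt = df[df["is_geektimes_only"]]

# Share of all views collected by geektimes-only articles.
share = gt["views"].sum() / df["views"].sum()

# Average views per article, for the subset and for all articles.
avg_gt = gt["views"].mean()
avg_all = df["views"].mean()
ratio = avg_gt / avg_all

print(share, avg_gt, avg_all, ratio)
```

A ratio above 1 means that the subset's articles are viewed more often than the average article, which is the comparison made in the paragraph above.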

Hubs

Let's move on to the promised analysis of hubs. Here are the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all is expected to hold the names of all collected hubs
hubs_info = []
for hub_name in hubs_all:
    # Select the articles whose hub list contains this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: sort by total views and keep the first 20
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [v[2] for v in hubs_top]
top_names = [v[0] for v in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

Result:

[Chart: top 20 hubs by total number of views]

Surprisingly, the most popular hub by views turned out to be "Information Security"; the top 5 also included "Programming" and "Popular science".

The anti-top is occupied by Gtk and Cocoa.

[Chart: hubs with the fewest views]

I'll tell you a secret: the top hubs can also be seen here, although the number of views is not shown there.

Rating

And finally, the promised rating. Using the hub analysis data, we can list the most popular articles in the most popular hubs of 2019.
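
The per-hub totals can also be computed without a Python loop over hub names. The sketch below swaps in a different technique from the one in the listing, pandas `explode` followed by `groupby`, and runs on made-up data; it assumes the 'hubs' column already holds lists and 'views' is numeric.

```python
import pandas as pd

# Three made-up articles.
df = pd.DataFrame({
    "hubs": [["python", "cpp"], ["python"], ["gtk"]],
    "views": [100, 50, 5],
})

per_hub = (
    df.explode("hubs")                 # one row per (article, hub) pair
      .groupby("hubs")["views"]
      .agg(["count", "sum"])           # articles and total views per hub
      .sort_values("sum", ascending=False)
)

top = per_hub.head(2)       # most viewed hubs
anti_top = per_hub.tail(1)  # least viewed hubs

print(per_hub)
```

An article listed in several hubs contributes its views to each of them, which matches the behaviour of the mask-based loop.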
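
The selection code for the rating is not shown. One possible way to pick the top articles of a hub is sketched below; the helper name `top_articles`, the 'title' column and ranking by views are assumptions, since the article does not state which metric was used.

```python
import pandas as pd

# Four made-up articles.
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "hubs": [["infosecurity"], ["infosecurity", "python"], ["python"], ["infosecurity"]],
    "views": [1200, 5400, 800, 3100],
})


def top_articles(frame: pd.DataFrame, hub_name: str, n: int = 2) -> pd.DataFrame:
    """Return the n most viewed articles of one hub (hypothetical helper)."""
    mask = frame["hubs"].apply(lambda hubs: hub_name in hubs)
    return frame[mask].nlargest(n, "views")[["title", "views"]]


top = top_articles(df, "infosecurity")
print(top)
```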

Information Security

Programming

Popular Science

Career

Legislation in IT

Web Development

GTK

And finally, so that no one feels left out, here is the rating for "gtk", the least visited hub. Over the year, a single article was published in it, so it "automatically" takes first place in the rating.

Conclusion

There will be no conclusion. Happy reading, everyone.

Source: www.habr.com
