Habrastatistics: exploring the most and least visited sections of the site

Hi Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views and their ratings. However, the popularity of the site's sections remained unexplored. It seemed interesting to look at this in more detail and find the most and least popular hubs. Finally, I will look at the geektimes effect in more detail and finish with a new selection of the best articles based on the new ratings.

For those interested in what came of it, the continuation is under the cut.

Let me remind you once again that the statistics and ratings are not official; I have no insider information. There is also no guarantee that I have not made a mistake somewhere or missed something. Still, I think it turned out interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

In the first version of the parser, only the number of views, comments and article ratings were taken into account. That is already good, but it does not allow more complex queries. It is time to analyze the thematic sections of the site; this makes some interesting research possible, for example seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now returns the hubs an article belongs to, as well as the author's nickname and rating (many interesting things can be done with these too, but that will come later). The data is saved in a csv file that looks like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Slack messenger: reasons for choosing it, pitfalls during adoption, and service features that make life easier",votes:7,votesplus:8,votesmin:1,bookmarks:32,
views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
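
As a rough illustration of this layout, one such row could be read back as sketched here. This assumes the first three columns are the timestamp, URL and title and the rest are `key:value` pairs, as in the sample; the helper name `parse_row` and the shortened title are hypothetical and not part of the original parser.

```python
import csv
import io

def parse_row(line: str) -> dict:
    # Hypothetical helper: split one CSV line into a dict of fields.
    # Assumes three positional columns followed by "key:value" columns.
    cells = next(csv.reader(io.StringIO(line)))
    row = {"datetime": cells[0], "url": cells[1], "title": cells[2]}
    for cell in cells[3:]:
        if not cell:
            continue  # skip empty cells, e.g. after a trailing comma
        key, _, value = cell.partition(":")
        row[key] = value
    return row

# Placeholder title; the quoted field may itself contain commas
sample = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Some, title",'
          'votes:7,votesplus:8,votesmin:1,bookmarks:32,'
          'views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,'
          'hubs:productpm+soft')
row = parse_row(sample)
print(row["views"], row["hubs"].split("+"))
```

Values stay as strings in this sketch; converting `views` or `votes` to integers would be a separate step.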

First we get the list of the site's main thematic hubs.

import requests

def get_as_str(link: str) -> Str:
    # Download a page and wrap its text in Str; return an empty Str on any error
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        # Each hub card on the page starts with this CSS class
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic (profile) hubs are marked with an asterisk
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)

The find_between function and the Str class extract the string between two tags; I used them earlier. Thematic hubs are marked with "*", so they are easy to pick out, and you can also uncomment the corresponding lines to get the sections of other categories.
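
The Str class itself is not shown in this article. A minimal sketch of what such a helper might look like, assuming it is a plain `str` subclass whose `find_between` returns the text between the first occurrence of one marker and the next occurrence of another (this is a guess at the behaviour, not the author's actual code):

```python
class Str(str):
    # Hypothetical reconstruction: a str subclass with a find_between helper.
    def find_between(self, first: str, last: str) -> "Str":
        # Return the text between the first occurrence of `first`
        # and the next occurrence of `last`; empty Str if either is missing.
        start = self.find(first)
        if start == -1:
            return Str("")
        start += len(first)
        end = self.find(last, start)
        if end == -1:
            return Str("")
        return Str(self[start:end])

# Made-up HTML fragment, for illustration only
html = Str('<a href="https://habr.com/ru/hub/python/" class="x">Python</a>')
info = html.find_between('"https://habr.com/ru/hub', 'class')
hub_name = info.find_between('/', '/"')
print(hub_name)
```

Returning an empty Str instead of raising keeps the calling loop simple, since a missing marker just fails the later length check.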

The output of the get_hubs function is a rather impressive list, which we save as a set. I am deliberately showing the full list so that you can appreciate its size.

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

For comparison, the geektimes sections look more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that tells whether an article belongs to geektimes or to a profile hub.

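from typing import List

def is_geektimes(hubs: List) -> bool:
    # True if the article is in at least one geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_geektimes_only(hubs: List) -> bool:
    # True if the article is in geektimes hubs and in no profile hub
    return is_geektimes(hubs) and not is_profile(hubs)

def is_profile(hubs: List) -> bool:
    # True if the article is in at least one profile hub
    return len(set(hubs) & hubs_profile) > 0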

Similar functions were written for the other sections ("development", "administration", etc.).
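
Those functions are not listed in the article. One way they could be written is sketched here, assuming sets named `hubs_develop` and `hubs_admin` were collected with get_hubs from the commented-out URLs. The set names, the shared helper `in_category` and the few hub names in them are illustrative placeholders, not the real lists.

```python
from typing import Iterable, Set

# Illustrative placeholders only: the real sets would come from get_hubs()
hubs_develop: Set[str] = {"python", "cpp", "javascript"}
hubs_admin: Set[str] = {"sys_admin", "nginx", "devops"}

def in_category(hubs: Iterable[str], category: Set[str]) -> bool:
    # True if at least one of the article's hubs belongs to the category
    return not category.isdisjoint(hubs)

def is_develop(hubs: Iterable[str]) -> bool:
    return in_category(hubs, hubs_develop)

def is_admin(hubs: Iterable[str]) -> bool:
    return in_category(hubs, hubs_admin)

print(is_develop(["python", "DIY"]), is_admin(["python", "DIY"]))
```

`isdisjoint` stops at the first common element, so it avoids building the full intersection just to check that it is non-empty.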

Processing

It is time to start the analysis. We load the dataset and process the hub data.

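import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => [popular_science, astronomy]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# on_bad_lines='error' replaces error_bad_lines=True, which was removed in pandas 2.0
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', on_bad_lines='error', quotechar='"', comment='#')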
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)

Now we can group the data by date and show the number of articles for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
# Sum only numeric and boolean columns; text columns are not needed here
grouped = g.sum(numeric_only=True).reset_index()
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
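
To make the smoothing step concrete: `rolling(window=20, min_periods=1).mean()` replaces each day's value with the mean of that day and up to 19 preceding days, using fewer values at the start of the series. A pure-Python sketch of the same calculation, with a hypothetical function name and a smaller window for readability:

```python
from typing import List, Sequence

def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    # Trailing moving average; early positions use however many values exist,
    # which mirrors min_periods=1 in pandas.
    result = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        result.append(sum(chunk) / len(chunk))
    return result

smoothed = rolling_mean([10, 20, 30, 40], window=2)
print(smoothed)  # [10.0, 15.0, 25.0, 35.0]
```

A 20-day window smooths out the weekday and weekend swings in daily article counts, at the cost of lagging behind sudden changes.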

We plot the number of published articles using Matplotlib:

I separated "geektimes" and "geektimes only" articles in the chart, because an article can belong to both sections at the same time (for example, "DIY" + "microcontrollers" + "C++"). I used the name "profile" for the site's profile articles, although the English term "profile" may not be quite accurate for this.

In the previous part we asked about the "geektimes effect" related to the change in payment rules for geektimes articles that started this summer. Let's show geektimes articles separately:

df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
grouped = group_gt.sum(numeric_only=True).reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped['views'].rolling(window=20, min_periods=1).mean()

The result is interesting. The ratio of views of geektimes articles to the total is somewhere around 1:5. But while the total number of views fluctuated noticeably, the views of "entertainment" articles stayed at about the same level.

You can also see that the total number of views of articles in the "geektimes" section did drop after the rule change, but, judging "by eye", by no more than 5% of the total values.

It is interesting to look at the average number of views per article:

For "entertainment" articles it is about 40% above the average. This is probably not surprising. The dip at the beginning of April is not clear to me; maybe that is what really happened, or it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the graph shows two more noticeable peaks in the number of article views: the New Year and May holidays.
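
The code for the per-article average is not shown in the article. The quantity itself is simply a day's total views divided by the number of articles published that day, which might be sketched as follows; the function name and the sample numbers are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def avg_views_per_day(rows: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    # rows are (date, views) pairs, one pair per article
    totals = defaultdict(int)
    counts = defaultdict(int)
    for date, views in rows:
        totals[date] += views
        counts[date] += 1
    return {date: totals[date] / counts[date] for date in totals}

# Made-up numbers, for illustration only
rows = [("2019-04-01", 8000), ("2019-04-01", 4000), ("2019-04-02", 3000)]
averages = avg_views_per_day(rows)
print(averages)
```

Dividing by the article count is what separates "more articles were published" from "each article was read more", which the total-views chart cannot show.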

Hubs

Let's move on to the promised analysis of hubs. We will display the top 20 hubs by number of views:

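import matplotlib.pyplot as plt

# hubs_all is not defined in the article; presumably it combines the hub sets collected above
hubs_info = []
for hub_name in hubs_all:
    # Select the articles tagged with this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))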

# Draw hubs
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = list(map(lambda x: x[2], hubs_top))
top_names = list(map(lambda x: x[0], hubs_top))

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

Result:

As it turns out, the most popular hub by views was "Information Security"; the top 5 also included "Programming" and "Popular Science".

The anti-top is occupied by Gtk and Cocoa.

I will tell you a secret: the top hubs can also be seen here, although the number of views is not shown there.
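
The per-hub totals behind both the top and the anti-top can also be gathered in a single pass over the articles instead of filtering the whole table once per hub. This is an alternative technique, not the code used in the article; the function name and sample data are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple

def views_by_hub(articles: Iterable[Tuple[List[str], int]]) -> Counter:
    # articles are (hubs, views) pairs; each article's views are credited
    # to every hub it is tagged with
    totals = Counter()
    for hubs, views in articles:
        for hub in set(hubs):
            totals[hub] += views
    return totals

# Made-up data, for illustration only
data = [(["python", "webdev"], 1000), (["python"], 500), (["gtk"], 20)]
totals = views_by_hub(data)
print(totals.most_common(2))    # most viewed hubs
print(totals.most_common()[-1])  # least viewed hub
```

Note that an article in several hubs is counted once per hub, so the per-hub totals add up to more than the site-wide number of views.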

Rating

And finally, the promised rating. Using the hub analysis data, we can show the most popular articles in the most popular hubs for this year, 2019.
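
The selection step might be sketched as follows, assuming each article is a record with `title`, `views` and `hubs` fields. The function name `top_articles` and the sample records are made up for illustration.

```python
from typing import Dict, List

def top_articles(articles: List[Dict], hub: str, n: int = 3) -> List[str]:
    # Keep the articles tagged with the hub, sort by views, return top-n titles
    in_hub = [a for a in articles if hub in a["hubs"]]
    in_hub.sort(key=lambda a: a["views"], reverse=True)
    return [a["title"] for a in in_hub[:n]]

# Made-up records, for illustration only
articles = [
    {"title": "A", "views": 100, "hubs": ["python", "gtk"]},
    {"title": "B", "views": 900, "hubs": ["python"]},
    {"title": "C", "views": 500, "hubs": ["cpp"]},
]
print(top_articles(articles, "python"))
print(top_articles(articles, "gtk"))
```

A hub with only one article returns that article as its entire rating, whatever its view count.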

Information Security

Programming

Popular Science

Career

Legislation in IT

Web development

GTK

And finally, so that no one feels offended, I will give the rating for the least visited hub, "gtk". Only one article was published in it during the year, and it "automatically" takes the first line of the rating.

Conclusion

There will be no conclusion. Happy reading, everyone.

Source: www.habr.com
