Habrastatistics: exploring the most and least visited sections of the site

Hi, Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views, and their ratings. However, the popularity of the site's individual sections remained unexplored. It became interesting to look at this in more detail and to find the most popular and the least popular hubs. Finally, I will take a closer look at the "geektimes effect" and finish with a new selection of the best articles based on the new rankings.


For those interested in what came of it, the continuation follows below.

Let me remind you once again that the statistics and rankings are unofficial; I have no insider information. There is also no guarantee that I have not made a mistake somewhere or missed something. Still, I think the result turned out interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

The first version of the parser took into account only the number of views, comments, and ratings of each article. That was already good, but it does not allow more complex queries. It is time to analyze the site's thematic sections; this makes quite interesting research possible, for example, seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now returns the hubs the article belongs to, as well as the author's nickname and rating (many interesting things can be done with these too, but that will come later). The data is stored in a csv file that looks like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,
views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
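
To make the row format concrete, here is a minimal sketch of how one such line could be split into fields using only the standard library. It assumes that the two displayed lines form one logical CSV row, that the first three columns are the timestamp, URL, and title, and that every remaining column is a `key:value` pair; the `parse_row` helper and the shortened title are hypothetical and not part of the original parser.

```python
import csv
import io

# Sample row modeled on the one shown above (title shortened for illustration).
ROW = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Slack, a messenger",'
       'votes:7,votesplus:8,votesmin:1,bookmarks:32,views:8300,comments:10,'
       'user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft')


def parse_row(line: str) -> dict:
    """Split one CSV row into a dict; hypothetical helper, not the author's code."""
    # csv.reader handles the quoted title, which may itself contain commas.
    fields = next(csv.reader(io.StringIO(line)))
    record = {"datetime": fields[0], "url": fields[1], "title": fields[2]}
    for field in fields[3:]:
        if not field:
            continue  # skip empty columns left by a trailing comma
        key, _, value = field.partition(":")
        record[key] = value
    # Hubs are joined with '+', e.g. "productpm+soft".
    hubs = record.get("hubs", "")
    record["hubs"] = hubs.split("+") if hubs else []
    return record


record = parse_row(ROW)
print(record["views"], record["hubs"])
```

All values stay strings in this sketch; converting `views` or `karma` to numbers would be a separate step.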

Let's get the list of the site's main thematic hubs.

import requests

# Str is the helper class from the previous part: a str subclass
# that provides the find_between(start, end) method.

def get_as_str(link: str) -> Str:
    # Download a page; return an empty Str if anything goes wrong
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d/" % p)  # Admin
        # Each hub on the page sits in its own "media-obj media-obj_hub" block
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic (profile) hubs are marked with "*"
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)
    return hubs

The find_between function and the Str class select the string between two tags; I have used them before. Thematic hubs are marked with "*", so they are easy to identify; you can also uncomment the corresponding lines to get the hubs of the other categories.
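
The Str class itself is not shown in this part. As a rough illustration only, a helper with the behavior described above might look like the sketch below; the implementation is an assumption reconstructed from how `find_between` is called in `get_hubs`, not the author's actual code.

```python
class Str(str):
    """A str subclass with a helper for cutting out text between two markers.

    Assumed reconstruction; the real class comes from the previous part.
    """

    def find_between(self, start: str, end: str) -> "Str":
        # Locate the start marker; give up with an empty Str if it is absent.
        i = self.find(start)
        if i < 0:
            return Str("")
        i += len(start)
        # Locate the end marker, searching only after the start marker.
        j = self.find(end, i)
        if j < 0:
            return Str("")
        return Str(self[i:j])


snippet = Str('<a href="https://habr.com/ru/hub/python/" class="x">')
tail = snippet.find_between('"https://habr.com/ru/hub', 'class')
hub_name = tail.find_between('/', '/"')
print(hub_name)
```

Returning an empty Str when a marker is missing lets the caller chain calls and filter by length, which is how `get_hubs` uses it.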

The output of the get_hubs function is a fairly impressive list, which we save as a set. I deliberately show the whole list so that you can appreciate its size.

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

By comparison, the geektimes sections look more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that tells whether an article belongs to geektimes or to a profile hub.

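from typing import List

def is_geektimes(hubs: List) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_profile(hubs: List) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0

def is_geektimes_only(hubs: List) -> bool:
    # Geektimes article with no profile hubs at all
    return is_geektimes(hubs) and not is_profile(hubs)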

Similar functions were written for the other sections ("development", "administration", and so on).
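
Since all of these checks share one shape, they could be generated from the hub sets rather than written by hand. The sketch below is one possible way to do that; `make_hub_checker` is a hypothetical helper, and the tiny `hubs_admin` and `hubs_develop` sets are placeholders for illustration, not the real lists.

```python
from typing import Callable, Iterable, Set


def make_hub_checker(hub_set: Set[str]) -> Callable[[Iterable[str]], bool]:
    """Build a function that tests whether an article has any hub from hub_set."""
    def check(hubs: Iterable[str]) -> bool:
        # isdisjoint() is True when there is no common element, so negate it.
        return not hub_set.isdisjoint(hubs)
    return check


# Placeholder subsets for illustration only; the real sets are much larger.
hubs_admin = {"sys_admin", "linux", "nginx"}
hubs_develop = {"python", "cpp", "javascript"}

is_admin = make_hub_checker(hubs_admin)
is_develop = make_hub_checker(hubs_develop)

print(is_admin(["linux", "DIY"]), is_develop(["DIY"]))
```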

Analysis

It is time to start the analysis. We load the dataset and process the hub data.

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => ["popular_science", "astronomy"]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# error_bad_lines was removed in pandas 2.0; malformed rows raise an error by default
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
# Shift the timestamps by 3 hours (presumably from UTC to Moscow time)
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
# Flag each article by the hub categories it belongs to
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)

Now we can group the data by day and display the number of publications for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
# Sum only numeric/boolean columns; summing the boolean flags counts articles per day
grouped = g.sum(numeric_only=True).reset_index()
# Smooth the daily counts with a 20-day rolling average
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
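
For readers unfamiliar with `rolling(window=20, min_periods=1).mean()`, the following plain-Python sketch shows what that smoothing computes: each point is the mean of the current value and up to `window - 1` preceding values, and the first points simply average whatever is available. The `rolling_mean` function is a hypothetical stand-in written for illustration, not part of pandas.

```python
from typing import List, Sequence


def rolling_mean(values: Sequence[float], window: int) -> List[float]:
    """Trailing moving average that also emits values for the first points."""
    result = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the element that just left the window.
            running -= values[i - window]
        # Until the window fills up, average over the points seen so far.
        n = min(i + 1, window)
        result.append(running / n)
    return result


daily_counts = [1, 2, 3, 4]
smoothed = rolling_mean(daily_counts, window=2)
print(smoothed)
```

A window of 20 days hides the weekday/weekend rhythm of publications at the cost of blurring short events.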

We plot the number of published articles using Matplotlib:

[Figure: number of published articles per day by hub category]

I separated "geektimes" and "geektimes only" articles on the chart because an article can belong to both categories at once (for example, "DIY" + "microcontrollers" + "C++"). I used the term "profile" to denote articles on the site's core topics, although the English word may not be quite right for this.

In the previous part, a question came up about the "geektimes effect" related to the change in payment rules for geektimes articles starting this summer. Let's display the geektimes articles separately:

# Keep only the articles that have geektimes hubs and no profile hubs
df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
grouped = group_gt.sum(numeric_only=True).reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped['views'].rolling(window=20, min_periods=1).mean()

The result is interesting. The ratio of views of geektimes articles to the total is roughly 1:5. But while the total number of views fluctuated noticeably, views of "entertainment" articles stayed at about the same level.

[Figure: views per day, all articles compared with geektimes articles]

You can also notice that the total number of views of articles in the "geektimes" section did drop after the rule change, but only "by eye", by no more than 5% of the total.

It is interesting to look at the average number of views per article:

[Figure: average number of views per article]

"Entertainment" articles get about 40% more views than average. This is probably not surprising. The dip at the beginning of April is unclear to me: maybe that is what really happened, or it is some kind of parsing error, or perhaps one of the geektimes authors went on vacation ;).

By the way, the chart shows two more noticeable spots in the number of article views: the New Year and the May holidays.
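
The per-article average discussed above is just the day's total views divided by the number of articles published that day. The plotting code for it is not shown, so the following is only a sketch of the arithmetic under that assumption, using the standard library; `avg_views_per_day` and the sample numbers are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def avg_views_per_day(records: Iterable[Tuple[str, int]]) -> Dict[str, float]:
    """Average views per article for each day, given (day, views) pairs."""
    totals: Dict[str, int] = defaultdict(int)
    counts: Dict[str, int] = defaultdict(int)
    for day, views in records:
        totals[day] += views
        counts[day] += 1
    return {day: totals[day] / counts[day] for day in totals}


# Made-up sample data: two articles on one day, one on the next.
sample = [("2019-04-01", 1000), ("2019-04-01", 3000), ("2019-04-02", 500)]
averages = avg_views_per_day(sample)
print(averages)
```

With the DataFrame from the article, the same quantity would come from dividing the daily sum of `views` by the daily article count before smoothing.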

Hubs

Let's move on to the promised analysis of hubs. Here are the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all is assumed to hold every hub name collected above (its definition is not shown)
hubs_info = []
for hub_name in hubs_all:
    # Select the articles that list this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: top 20 by total views
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [views for _, _, views in hubs_top]
top_names = [name for name, _, _ in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()
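
The loop above scans the whole table once per hub. As an alternative sketch, the same totals can be accumulated in a single pass over the articles with `collections.Counter`; this is a different technique from the one used in the article, and the sample data and the `hub_view_totals` helper are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def hub_view_totals(articles: Iterable[Tuple[List[str], int]]) -> Counter:
    """Sum article views per hub in one pass over (hubs, views) pairs."""
    totals: Counter = Counter()
    for hubs, views in articles:
        # set() guards against a hub listed twice on the same article.
        for hub in set(hubs):
            totals[hub] += views
    return totals


# Made-up sample data for illustration.
sample_articles = [
    (["infosecurity", "programming"], 5000),
    (["programming", "python"], 3000),
    (["gtk"], 100),
]
totals = hub_view_totals(sample_articles)
top = totals.most_common(2)             # most viewed hubs
bottom = sorted(totals.items(), key=lambda kv: kv[1])[:1]  # least viewed hubs
print(top, bottom)
```

An article with several hubs contributes its views to each of them, so the per-hub totals add up to more than the site total, exactly as in the masked-DataFrame version.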

The result:

[Figure: top 20 hubs by number of views]

Surprisingly, the most popular hub by views turned out to be "Information Security"; the top 5 also included "Programming" and "Popular Science".

The anti-top is occupied by Gtk and Cocoa.

[Figure: least viewed hubs]

I'll tell you a secret: the top hubs can also be seen here, although the number of views is not shown there.

Rankings

And finally, the promised rankings. Using the hub analysis data, we can list the most popular articles in the most popular hubs for this year, 2019.
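
The selection code is not shown in the article. As a minimal sketch, picking the top articles of one hub could look like the following; it assumes the ranking is by views (the source does not say which metric was used), and the `top_articles` helper and sample records are hypothetical.

```python
import heapq
from typing import Dict, List


def top_articles(articles: List[Dict], hub: str, n: int = 3) -> List[Dict]:
    """Return the n most viewed articles that list the given hub."""
    in_hub = (a for a in articles if hub in a["hubs"])
    # nlargest avoids sorting the whole list when only a few items are needed.
    return heapq.nlargest(n, in_hub, key=lambda a: a["views"])


# Made-up sample records for illustration.
articles = [
    {"title": "A", "hubs": ["infosecurity"], "views": 900},
    {"title": "B", "hubs": ["infosecurity", "programming"], "views": 1500},
    {"title": "C", "hubs": ["programming"], "views": 2000},
    {"title": "D", "hubs": ["gtk"], "views": 50},
]
best = top_articles(articles, "infosecurity", n=2)
print([a["title"] for a in best])
```

A hub with a single article, such as "gtk" in the sample, trivially returns that article as its top entry.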

Information Security

Programming

Popular Science

Career

IT Legislation

Web Development

GTK

And finally, so that no one feels left out, here is the ranking for the least visited hub, "gtk". Over the year, one article was published in it, which therefore also "automatically" takes first place in the ranking.

Conclusion

There will be no conclusion. Happy reading, everyone.

Source: www.habr.com
