Habrastatistics: exploring the most and least visited sections of the site

Hello, Habr.

In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views and their ratings. The popularity of the site's individual sections, however, was left unexamined. It seemed interesting to look at this in more detail and find the most and the least popular hubs. Finally, I will take a closer look at the geektimes effect and finish with a new selection of the best articles based on the new rankings.

For those interested in what came of it, the continuation is under the cut.

Let me remind you once again that these statistics and ratings are not official, and I have no insider information. There is also no guarantee that I have not made a mistake somewhere or missed something. Still, I think the result is interesting. We will start with the code; those who are not interested in it can skip the first sections.

Data collection

The first version of the parser recorded only the number of views, comments and the rating of each article. That is already useful, but it does not allow more complex queries. It is time to analyze the site's thematic sections, which makes some quite interesting research possible, for example seeing how the popularity of the "C++" section has changed over several years.

The article parser has been improved: it now also returns the hubs an article belongs to, as well as the author's nickname and rating (many interesting things can be done with these too, but that will come later). The data is saved in a csv file that looks roughly like this:

2018-12-18T12:43Z,https://habr.com/ru/post/433550/,"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",votes:7,votesplus:8,votesmin:1,bookmarks:32,views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft
...
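As a rough illustration of how one such record could be read back, here is a minimal sketch using only the standard library. It assumes the first three fields are the timestamp, the link and the quoted title, and that every following field has the form `key:value`; the function name `parse_record` is made up for this example and is not part of the original parser.

```python
import csv
import io

# One record in the format shown above, written as a single line.
SAMPLE = (
    '2018-12-18T12:43Z,https://habr.com/ru/post/433550/,'
    '"Мессенджер Slack — причины выбора, косяки при внедрении и особенности сервиса, облегчающие жизнь",'
    'votes:7,votesplus:8,votesmin:1,bookmarks:32,'
    'views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft'
)


def parse_record(line: str) -> dict:
    """Split one CSV line into a dict; 'key:value' fields become entries."""
    row = next(csv.reader(io.StringIO(line), quotechar='"'))
    record = {"datetime": row[0], "link": row[1], "title": row[2]}
    for field in row[3:]:
        key, _, value = field.partition(":")
        record[key] = value
    # Hubs are joined with '+', so split them into a list.
    record["hubs"] = record.get("hubs", "").split("+")
    return record


record = parse_record(SAMPLE)
print(record["views"], record["hubs"])
```

The quoted title contains commas, which is why the `csv` module is used here rather than a plain `split(',')`.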

Let's get the list of the site's main thematic hubs.

import requests

# Str is the helper string class from the previous part (it provides find_between).

def get_as_str(link: str) -> Str:
    """Download a page and return its HTML, or an empty string on any error."""
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception:
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # Uncomment one of the lines below to collect another category instead
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic hubs are marked with "*" on the page
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if 0 < len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)
    return hubs

The get_as_str function and the Str class, whose find_between method extracts the substring between two tags, were already used in the previous part. Thematic hubs are marked with "*", so they are easy to pick out; you can also uncomment the corresponding lines to get the hubs of the other categories.

The output of the get_hubs function is a rather impressive list, which we save as a set. I give the whole list on purpose so that you can appreciate its size.
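Since the Str class itself is not shown here, the following is only a guess at what such a helper might look like: a str subclass whose find_between returns the text between two markers. Returning an empty string when a marker is missing is an assumption of this sketch, not something taken from the original parser, and the HTML fragment is invented for the demonstration.

```python
class Str(str):
    """A str subclass with a helper for cutting text out between two markers."""

    def find_between(self, start: str, end: str) -> "Str":
        # Return the text between the first `start` and the next `end`,
        # or an empty Str when either marker is missing.
        i = self.find(start)
        if i < 0:
            return Str("")
        i += len(start)
        j = self.find(end, i)
        if j < 0:
            return Str("")
        return Str(self[i:j])


# Invented fragment that only imitates the shape of a hub link.
html = '<a href="https://habr.com/ru/hub/python/" class="x">Python<span>*</span></a>'
info = Str(html).find_between('"https://habr.com/ru/hub', 'class')
hub_name = info.find_between('/', '/"')
print(hub_name)
```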

hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi',  'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}

For comparison, the geektimes sections look more modest:

hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}

The remaining hubs were saved in the same way. Now it is easy to write a function that tells whether an article belongs to geektimes or to a profile hub.

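from typing import List

def is_geektimes(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_profile(hubs: List[str]) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0

def is_geektimes_only(hubs: List[str]) -> bool:
    return is_geektimes(hubs) and not is_profile(hubs)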

Similar functions were written for the other sections ("development", "administration", etc.).

Processing

It is time to start the analysis. We load the dataset and process the hub data.
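Because all of these checks have the same shape, they could also be generated from the hub sets. The sketch below is one possible way to do that; the factory name `make_section_check` and the tiny demo sets are invented for illustration and are not the full hub lists.

```python
from typing import Callable, Iterable, Set


def make_section_check(section_hubs: Set[str]) -> Callable[[Iterable[str]], bool]:
    """Build a predicate that is True when an article has a hub in the section."""
    def check(hubs: Iterable[str]) -> bool:
        return not section_hubs.isdisjoint(hubs)
    return check


# Tiny illustrative subsets, not the real hub sets.
demo_gt = {"DIY", "space", "physics"}
demo_profile = {"cpp", "python", "controllers"}

is_gt = make_section_check(demo_gt)
is_prof = make_section_check(demo_profile)

article = ["DIY", "controllers", "cpp"]
print(is_gt(article), is_prof(article))         # the article falls into both sections
print(is_gt(article) and not is_prof(article))  # so it is not "geektimes only"
```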

import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => ["popular_science", "astronomy"]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# on_bad_lines='error' replaces error_bad_lines=True, which was removed in pandas 2.0
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', on_bad_lines='error', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
# Timestamps are stored in UTC; shift them by 3 hours before taking the date
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)
hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
# Boolean flags: does the article belong to the given group of hubs?
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)
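The three-hour shift matters because the timestamps end in "Z" (UTC), so a late-evening publication can land on the next calendar day once shifted; the offset is presumably meant to match Moscow time (UTC+3). A small standard-library sketch of that effect, with the helper name `local_date` invented for the example:

```python
import datetime


def local_date(stamp: str, offset_hours: int = 3) -> datetime.date:
    """Parse a '%Y-%m-%dT%H:%MZ' UTC stamp and return the date after shifting."""
    utc = datetime.datetime.strptime(stamp, "%Y-%m-%dT%H:%MZ")
    return (utc + datetime.timedelta(hours=offset_hours)).date()


print(local_date("2018-12-18T12:43Z"))  # stays on the same calendar day
print(local_date("2018-12-18T22:30Z"))  # rolls over to the next day
```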

Now we can group the data by day and show the number of publications for the different hubs.

g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values
# Summing the boolean flags counts the matching articles per day;
# numeric_only=True keeps text and list columns out of the sum
grouped = g.sum(numeric_only=True).reset_index()
# 20-day rolling mean to smooth the daily counts
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
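To see what `rolling(window=..., min_periods=1).mean()` does to the daily counts, here is a toy example with made-up numbers and a window of 3 instead of 20. With `min_periods=1`, the first values are averaged over however many points are available so far rather than being left empty.

```python
import pandas as pd

# Made-up daily counts, only to show how the window behaves.
counts = pd.Series([10, 20, 30, 40, 50])

# A window of 3 instead of 20 keeps the example short.
smoothed = counts.rolling(window=3, min_periods=1).mean()
print(smoothed.tolist())
```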

We plot the number of published articles using Matplotlib:

[Figure: number of published articles per day, by section]
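The plotting code is not shown in the article. A minimal sketch of how such a chart might be drawn is given below; the numbers are invented stand-ins for year_days and the per-day averages, and the labels, figure size and output file name are assumptions of this example.

```python
import matplotlib
matplotlib.use("Agg")  # off-screen backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up numbers standing in for year_days and the *_per_day_avg series.
days = pd.date_range("2019-01-01", periods=5, freq="D")
series = {
    "profile": [40, 42, 45, 43, 41],
    "geektimes": [15, 14, 16, 15, 13],
    "geektimes only": [9, 8, 10, 9, 8],
}

fig, ax = plt.subplots(figsize=(8, 6))
for label, values in series.items():
    ax.plot(days, values, label=label)
ax.set_ylabel("articles per day (rolling mean)")
ax.legend()
fig.autofmt_xdate()  # tilt the date labels so they do not overlap
fig.tight_layout()
fig.savefig("articles_per_day.png")
```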

I plotted "geektimes" and "geektimes only" articles separately on the chart, because an article can belong to both sections at once (for example, "DIY" + "microcontrollers" + "C++"). I used the label "profile" to mark the site's profile articles, although the English word "profile" may not be quite the right term for this.

In the previous part there was a question about the "geektimes effect" associated with the change in the payment rules for geektimes articles that took effect this summer. Let's show the geektimes articles separately:

# Keep only the articles that belong to geektimes hubs and to no profile hub
df_gt = df[df['is_geektimes_only']]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
grouped_gt = group_gt.sum(numeric_only=True).reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped_gt['views'].rolling(window=20, min_periods=1).mean()

The result is interesting. The ratio of views of geektimes articles to the total is roughly 1:5. But while the total number of views fluctuated noticeably, views of the "entertainment" articles stayed at about the same level.

[Figure: views per day, all articles and geektimes-only articles]

You can also see that the number of views of articles in the "geektimes" section did drop after the rules changed, but, judging by eye, by no more than 5% of the total.

It is interesting to look at the average number of views per article:
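A ratio like the one quoted above can be computed directly from the flags. The rows below are made up purely to show the calculation; they are chosen so that the share comes out at one fifth and say nothing about the real data.

```python
import pandas as pd

# Made-up rows; the real frame has one row per article.
demo = pd.DataFrame({
    "views": [8000, 12000, 10000, 20000],
    "is_geektimes_only": [False, False, True, False],
})

# Total views of the flagged articles divided by all views.
gt_views = demo.loc[demo["is_geektimes_only"], "views"].sum()
share = gt_views / demo["views"].sum()
print(f"geektimes-only share of views: {share:.1%}")
```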

[Figure: average number of views per article]

For the "entertainment" articles it is about 40% above average, which is probably not surprising. The dip at the beginning of April is unclear to me: maybe that is what really happened, maybe it is some parsing error, or maybe one of the geektimes authors went on vacation ;).

By the way, the graph shows two more noticeable peaks in the number of article views: the New Year holidays and the May holidays.
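The average number of views per article discussed above is not computed in the listings shown. One possible way to obtain it per day, sketched on made-up rows, is to divide the daily sum of views by the daily article count:

```python
import pandas as pd

# Made-up rows: two articles on the first day, three on the second.
demo = pd.DataFrame({
    "date": ["2019-04-01", "2019-04-01", "2019-04-02", "2019-04-02", "2019-04-02"],
    "views": [4000, 6000, 3000, 9000, 6000],
})

per_day = demo.groupby("date")["views"].agg(["sum", "count"])
per_day["views_per_article"] = per_day["sum"] / per_day["count"]
print(per_day["views_per_article"].tolist())
```

The resulting series could then be smoothed with the same rolling mean as the other curves.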

Hubs

Let's move on to the promised analysis of hubs. Here are the top 20 hubs by number of views:

import matplotlib.pyplot as plt

# hubs_all is not defined in this excerpt; presumably it combines the hub sets collected earlier
hubs_info = []
for hub_name in hubs_all:
    # Select the articles that list this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]

    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs: the 20 hubs with the largest total number of views
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = [v[2] for v in hubs_top]
top_names = [v[0] for v in hubs_top]

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()

The result:

[Figure: top 20 hubs by number of views]

Surprisingly, the most popular hub by number of views turned out to be "Information Security"; the top 5 also included "Programming" and "Popular Science".

The anti-top is occupied by Gtk and Cocoa.

[Figure: least viewed hubs]
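The anti-top can be taken from the same kind of list by sorting in ascending order. The tuples below are made up in the (hub_name, article_count, total_views) layout of hubs_info; skipping hubs with no articles at all is an assumption of this sketch, since the article does not say how empty hubs were handled.

```python
# Made-up (hub_name, article_count, total_views) tuples in the hubs_info layout.
demo_info = [
    ("infosecurity", 120, 900000),
    ("gtk", 1, 1500),
    ("programming", 100, 700000),
    ("cocoa", 2, 2500),
    ("empty_hub", 0, 0),
]

# Skip hubs with no articles, then take the lowest view totals first.
with_articles = [h for h in demo_info if h[1] > 0]
hubs_bottom = sorted(with_articles, key=lambda h: h[2])[:2]
print([name for name, _, _ in hubs_bottom])
```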

I'll let you in on a secret: the top hubs can also be seen here, although the number of views is not shown there.

Rating

And finally, the promised rating. Using the hub analysis data, we can show the most popular articles in the most popular hubs for this year, 2019.
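The selection code for the rating is not shown. One way it might be done, sketched on an invented frame with the same "hubs" and "views" columns, is to reuse the hub mask from the previous section and keep the most viewed rows; the function name `top_articles` is hypothetical.

```python
import pandas as pd

# Invented articles; titles and numbers are placeholders.
demo = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "views": [100, 400, 250, 300],
    "hubs": [["infosecurity"], ["infosecurity", "programming"],
             ["programming"], ["infosecurity"]],
})


def top_articles(df: pd.DataFrame, hub_name: str, n: int = 5) -> pd.DataFrame:
    """Return the n most viewed articles that list hub_name among their hubs."""
    mask = df["hubs"].apply(lambda hubs: hub_name in hubs)
    return df[mask].nlargest(n, "views")[["title", "views"]]


print(top_articles(demo, "infosecurity", n=2))
```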

Information Security

Programming

Popular Science

Career

Legislation in IT

Web development

GTK

And finally, so that no one feels left out, here is the rating for the least visited hub, "gtk". Over the year, one article was published in it, which "automatically" takes first place in the rating.

Conclusion

There will be no conclusion. Happy reading, everyone.

Source: www.habr.com
