Habrastatistics: exploring the most and least visited sections of the site

Hello, Habr.



In the previous part, Habr's traffic was analyzed by its main parameters: the number of articles, their views and their ratings. However, the popularity of the site's sections was not considered. I wanted to look at this in more detail and find the most and least popular hubs. I will also examine the “geektimes effect” more closely, and at the end readers will get a new selection of the best articles based on the new ratings.







For those interested in what came of it, the continuation is under the cut.



Let me remind you once again that the statistics and ratings are unofficial; I have no insider information. There is also no guarantee that I did not make a mistake or miss something. Still, I think the result turned out interesting. We will start with the code; readers for whom it is not relevant may skip the first sections.



Data collection



The first version of the parser took into account only the number of views, the number of comments and the rating of each article. That is already good, but it does not allow more complex queries. It's time to analyze the thematic sections of the site, which makes quite interesting research possible, for example, seeing how the popularity of the “C++” section has changed over several years.



The article parser has been improved: it now returns the hubs to which the article belongs, as well as the author's nickname and rating (a lot of interesting things can be done here too, but more on that later). The data is saved in a CSV file of approximately the following form:



2018-12-18T12:43Z,https://habr.com/ru/post/433550/," Slack —  ,      ,  ",votes:7,votesplus:8,votesmin:1,bookmarks:32, views:8300,comments:10,user:ReDisque,karma:5,subscribers:2,hubs:productpm+soft ...
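As a minimal sketch of how such a row could be read back, the snippet below splits the `key:value` cells into a dictionary. The `parse_row` helper and the placeholder title are assumptions made for illustration; they are not part of the original parser.

```python
import csv
import io

def parse_row(row):
    """Turn one CSV row into a dict; 'key:value' cells become typed entries."""
    record = {"datetime": row[0], "link": row[1], "title": row[2].strip()}
    for cell in row[3:]:
        key, _, value = cell.strip().partition(":")
        if key == "hubs":
            record[key] = value.split("+")   # "a+b" => ["a", "b"]
        elif key == "user":
            record[key] = value              # nickname stays a string
        else:
            record[key] = float(value)       # counters and karma as numbers
    return record

# Hypothetical row modelled on the sample above (the title is a placeholder).
sample = ('2018-12-18T12:43Z,https://habr.com/ru/post/433550/,'
          '"Placeholder title, with commas",votes:7,votesplus:8,votesmin:1,'
          'bookmarks:32, views:8300,comments:10,user:ReDisque,karma:5,'
          'subscribers:2,hubs:productpm+soft')

record = parse_row(next(csv.reader(io.StringIO(sample))))
print(record["views"], record["hubs"])
```

The quoted title keeps its commas because `csv.reader` honours the double quotes, and the stray space before `views` is removed by `strip()`.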
      
      





Get a list of the main thematic hubs of the site.



import requests

# Str and its find_between() method come from the previous part.

def get_as_str(link: str) -> Str:
    # Download a page; return an empty Str on any error
    try:
        r = requests.get(link)
        return Str(r.text)
    except Exception as e:
        return Str("")

def get_hubs():
    hubs = []
    for p in range(1, 12):
        page_html = get_as_str("https://habr.com/ru/hubs/page%d/" % p)
        # page_html = get_as_str("https://habr.com/ru/hubs/geektimes/page%d/" % p)  # Geektimes
        # page_html = get_as_str("https://habr.com/ru/hubs/develop/page%d/" % p)  # Develop
        # page_html = get_as_str("https://habr.com/ru/hubs/admin/page%d" % p)  # Admin
        for hub in page_html.split("media-obj media-obj_hub"):
            info = Str(hub).find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
            # Thematic hubs are marked with "*"
            if "*</span>" in info:
                hub_name = info.find_between('/', '/"')
                if len(hub_name) > 0 and len(hub_name) < 32:
                    hubs.append(hub_name)
    print(hubs)
      
      





The find_between function and the Str class extract the substring between two tags; I used them earlier. Thematic hubs are marked with “*”, so they are easy to pick out; you can also uncomment the corresponding lines to get the sections of other categories.
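The Str class itself is not shown in this part. A minimal sketch of what such a helper might look like follows; it assumes Str is a plain `str` subclass and that `find_between` returns an empty string when a marker is missing, which may differ from the original implementation.

```python
class Str(str):
    """A str subclass with a helper for cutting text between two markers."""

    def find_between(self, left: str, right: str) -> "Str":
        start = self.find(left)
        if start < 0:
            return Str("")
        start += len(left)
        end = self.find(right, start)
        if end < 0:
            return Str("")
        # Slicing a str subclass yields a plain str, so wrap it again
        return Str(self[start:end])

# Hypothetical HTML fragment shaped like the markers used in get_hubs().
snippet = Str('<a href="https://habr.com/ru/hub/python/" class="x">Python</a> list-snippet__tags')
info = snippet.find_between('"https://habr.com/ru/hub', 'list-snippet__tags')
hub_name = info.find_between('/', '/"')
print(hub_name)
```

Returning a Str rather than a plain string is what lets the two `find_between` calls be chained, as they are in get_hubs.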



The get_hubs function outputs a fairly impressive list, which we save as a set. I quote the entire list on purpose so that its size can be appreciated.



 hubs_profile = {'infosecurity', 'programming', 'webdev', 'python', 'sys_admin', 'it-infrastructure', 'devops', 'javascript', 'open_source', 'network_technologies', 'gamedev', 'cpp', 'machine_learning', 'pm', 'hr_management', 'linux', 'analysis_design', 'ui', 'net', 'hi', 'maths', 'mobile_dev', 'productpm', 'win_dev', 'it_testing', 'dev_management', 'algorithms', 'go', 'php', 'csharp', 'nix', 'data_visualization', 'web_testing', 's_admin', 'crazydev', 'data_mining', 'bigdata', 'c', 'java', 'usability', 'instant_messaging', 'gtd', 'system_programming', 'ios_dev', 'oop', 'nginx', 'kubernetes', 'sql', '3d_graphics', 'css', 'geo', 'image_processing', 'controllers', 'game_design', 'html5', 'community_management', 'electronics', 'android_dev', 'crypto', 'netdev', 'cisconetworks', 'db_admins', 'funcprog', 'wireless', 'dwh', 'linux_dev', 'assembler', 'reactjs', 'sales', 'microservices', 'search_technologies', 'compilers', 'virtualization', 'client_side_optimization', 'distributed_systems', 'api', 'media_management', 'complete_code', 'typescript', 'postgresql', 'rust', 'agile', 'refactoring', 'parallel_programming', 'mssql', 'game_promotion', 'robo_dev', 'reverse-engineering', 'web_analytics', 'unity', 'symfony', 'build_automation', 'swift', 'raspberrypi', 'web_design', 'kotlin', 'debug', 'pay_system', 'apps_design', 'git', 'shells', 'laravel', 'mobile_testing', 'openstreetmap', 'lua', 'vs', 'yii', 'sport_programming', 'service_desk', 'itstandarts', 'nodejs', 'data_warehouse', 'ctf', 'erp', 'video', 'mobileanalytics', 'ipv6', 'virus', 'crm', 'backup', 'mesh_networking', 'cad_cam', 'patents', 'cloud_computing', 'growthhacking', 'iot_dev', 'server_side_optimization', 'latex', 'natural_language_processing', 'scala', 'unreal_engine', 'mongodb', 'delphi', 'industrial_control_system', 'r', 'fpga', 'oracle', 'arduino', 'magento', 'ruby', 'nosql', 'flutter', 'xml', 'apache', 'sveltejs', 'devmail', 'ecommerce_development', 'opendata', 'Hadoop', 'yandex_api', 'game_monetization', 
'ror', 'graph_design', 'scada', 'mobile_monetization', 'sqlite', 'accessibility', 'saas', 'helpdesk', 'matlab', 'julia', 'aws', 'data_recovery', 'erlang', 'angular', 'osx_dev', 'dns', 'dart', 'vector_graphics', 'asp', 'domains', 'cvs', 'asterisk', 'iis', 'it_monetization', 'localization', 'objectivec', 'IPFS', 'jquery', 'lisp', 'arvrdev', 'powershell', 'd', 'conversion', 'animation', 'webgl', 'wordpress', 'elm', 'qt_software', 'google_api', 'groovy_grails', 'Sailfish_dev', 'Atlassian', 'desktop_environment', 'game_testing', 'mysql', 'ecm', 'cms', 'Xamarin', 'haskell', 'prototyping', 'sw', 'django', 'gradle', 'billing', 'tdd', 'openshift', 'canvas', 'map_api', 'vuejs', 'data_compression', 'tizen_dev', 'iptv', 'mono', 'labview', 'perl', 'AJAX', 'ms_access', 'gpgpu', 'infolust', 'microformats', 'facebook_api', 'vba', 'twitter_api', 'twisted', 'phalcon', 'joomla', 'action_script', 'flex', 'gtk', 'meteorjs', 'iconoskaz', 'cobol', 'cocoa', 'fortran', 'uml', 'codeigniter', 'prolog', 'mercurial', 'drupal', 'wp_dev', 'smallbasic', 'webassembly', 'cubrid', 'fido', 'bada_dev', 'cgi', 'extjs', 'zend_framework', 'typography', 'UEFI', 'geo_systems', 'vim', 'creative_commons', 'modx', 'derbyjs', 'xcode', 'greasemonkey', 'i2p', 'flash_platform', 'coffeescript', 'fsharp', 'clojure', 'puppet', 'forth', 'processing_lang', 'firebird', 'javame_dev', 'cakephp', 'google_cloud_vision_api', 'kohanaphp', 'elixirphoenix', 'eclipse', 'xslt', 'smalltalk', 'googlecloud', 'gae', 'mootools', 'emacs', 'flask', 'gwt', 'web_monetization', 'circuit-design', 'office365dev', 'haxe', 'doctrine', 'typo3', 'regex', 'solidity', 'brainfuck', 'sphinx', 'san', 'vk_api', 'ecommerce'}
      
      





For comparison, geektimes sections look more modest:



 hubs_gt = {'popular_science', 'history', 'soft', 'lifehacks', 'health', 'finance', 'artificial_intelligence', 'itcompanies', 'DIY', 'energy', 'transport', 'gadgets', 'social_networks', 'space', 'futurenow', 'it_bigraphy', 'antikvariat', 'games', 'hardware', 'learning_languages', 'urban', 'brain', 'internet_of_things', 'easyelectronics', 'cellular', 'physics', 'cryptocurrency', 'interviews', 'biotech', 'network_hardware', 'autogadgets', 'lasers', 'sound', 'home_automation', 'smartphones', 'statistics', 'robot', 'cpu', 'video_tech', 'Ecology', 'presentation', 'desktops', 'wearable_electronics', 'quantum', 'notebooks', 'cyberpunk', 'Peripheral', 'demoscene', 'copyright', 'astronomy', 'arvr', 'medgadgets', '3d-printers', 'Chemistry', 'storages', 'sci-fi', 'logic_games', 'office', 'tablets', 'displays', 'video_conferencing', 'videocards', 'photo', 'multicopters', 'supercomputers', 'telemedicine', 'cybersport', 'nano', 'crowdsourcing', 'infographics'}
      
      





The remaining hubs were saved in the same way. Now it is easy to write functions that report whether an article belongs to geektimes or to a profile hub.



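from typing import List

def is_geektimes(hubs: List) -> bool:
    # True if at least one of the article's hubs is a geektimes hub
    return len(set(hubs) & hubs_gt) > 0

def is_geektimes_only(hubs: List) -> bool:
    return is_geektimes(hubs) is True and is_profile(hubs) is False

def is_profile(hubs: List) -> bool:
    # True if at least one of the article's hubs is a profile hub
    return len(set(hubs) & hubs_profile) > 0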
      
      





Similar functions were made for other sections (“development”, “administration”, etc.).
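To illustrate how these checks behave, here is a small self-contained sketch. The two sets are tiny hypothetical subsets of the real hub sets, and `isdisjoint` is used as an equivalent way to test for a non-empty intersection.

```python
from typing import List

# Tiny hypothetical subsets of the real hub sets, for illustration only.
hubs_gt = {"DIY", "popular_science", "astronomy"}
hubs_profile = {"cpp", "controllers", "python"}

def is_geektimes(hubs: List[str]) -> bool:
    return not hubs_gt.isdisjoint(hubs)

def is_profile(hubs: List[str]) -> bool:
    return not hubs_profile.isdisjoint(hubs)

def is_geektimes_only(hubs: List[str]) -> bool:
    return is_geektimes(hubs) and not is_profile(hubs)

for hubs in (["DIY", "controllers", "cpp"], ["popular_science", "astronomy"], ["python"]):
    print(hubs, is_geektimes(hubs), is_profile(hubs), is_geektimes_only(hubs))
```

An article tagged with both a geektimes hub and a profile hub counts as “geektimes” but not as “geektimes only”.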



Processing



It's time to start the analysis. We load the dataset and process the hub data.



import datetime
from typing import List

import pandas as pd

def to_list(s: str) -> List[str]:
    # "hubs:popular_science+astronomy" => [popular_science, astronomy]
    return s.split(':')[1].split('+')

def to_date(dt: datetime.datetime) -> datetime.date:
    return dt.date()

# error_bad_lines=True was the default behaviour and the argument has since
# been removed from pandas, so it is omitted here
df = pd.read_csv("habr_2019.csv", sep=',', encoding='utf-8', quotechar='"', comment='#')
dates = pd.to_datetime(df['datetime'], format='%Y-%m-%dT%H:%MZ')
dates += datetime.timedelta(hours=3)
df['date'] = dates.map(to_date, na_action=None)

hubs = df["hubs"].map(to_list, na_action=None)
df['hubs'] = hubs
df['is_profile'] = hubs.map(is_profile, na_action=None)
df['is_geektimes'] = hubs.map(is_geektimes, na_action=None)
df['is_geektimes_only'] = hubs.map(is_geektimes_only, na_action=None)
df['is_admin'] = hubs.map(is_admin, na_action=None)
df['is_develop'] = hubs.map(is_develop, na_action=None)
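The three-hour shift matters for articles published late in the evening, because it can move them to the next calendar day. A minimal standard-library sketch of the same conversion is shown below; the `to_local_date` name is hypothetical, and reading the shift as a conversion from UTC to Moscow time is my assumption.

```python
import datetime

def to_local_date(stamp: str, shift_hours: int = 3) -> datetime.date:
    """Parse a 'YYYY-MM-DDTHH:MMZ' stamp and shift it before taking the date."""
    dt = datetime.datetime.strptime(stamp, "%Y-%m-%dT%H:%MZ")
    return (dt + datetime.timedelta(hours=shift_hours)).date()

print(to_local_date("2018-12-18T12:43Z"))  # stays on the same calendar day
print(to_local_date("2018-12-18T22:30Z"))  # rolls over to the next day
```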
      
      





Now we can group the data by day and display the number of publications by different hubs.



g = df.groupby(['date'])
days_count = g.size().reset_index(name='counts')
year_days = days_count['date'].values

grouped = g.sum().reset_index()
profile_per_day_avg = grouped['is_profile'].rolling(window=20, min_periods=1).mean()
geektimes_per_day_avg = grouped['is_geektimes'].rolling(window=20, min_periods=1).mean()
geektimesonly_per_day_avg = grouped['is_geektimes_only'].rolling(window=20, min_periods=1).mean()
admin_per_day_avg = grouped['is_admin'].rolling(window=20, min_periods=1).mean()
develop_per_day_avg = grouped['is_develop'].rolling(window=20, min_periods=1).mean()
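For readers unfamiliar with `rolling(window=20, min_periods=1).mean()`, the following pure-Python sketch shows the same kind of trailing moving average on a short list. The `rolling_mean` function is written here for illustration and is not part of the original code.

```python
from collections import deque

def rolling_mean(values, window=20, min_periods=1):
    """Trailing moving average; None until min_periods values are available."""
    buf = deque(maxlen=window)
    total = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            total -= buf[0]      # the oldest value is about to leave the window
        buf.append(v)
        total += v
        out.append(total / len(buf) if len(buf) >= min_periods else None)
    return out

print(rolling_mean([10, 20, 30, 40], window=3))
```

With `min_periods=1` the first points are averaged over however many days are available so far, which is why the curves start immediately instead of after 20 days.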
      
      





Display the number of published articles using Matplotlib:







I separated “geektimes” and “geektimes only” articles in the graph because an article can belong to both sections at once (for example, “DIY” + “microcontrollers” + “C++”). I used the label “profile” for the site's profile articles, although the English term “profile” may not be quite right for this.



In the previous part we asked about the “geektimes effect” associated with this summer's change in the rules for paying for geektimes articles. Let's select the geektimes-only articles separately:



df_gt = df[(df['is_geektimes_only'] == True)]
group_gt = df_gt.groupby(['date'])
days_count_gt = group_gt.size().reset_index(name='counts')
grouped = group_gt.sum().reset_index()
year_days_gt = days_count_gt['date'].values
view_gt_per_day_avg = grouped['views'].rolling(window=20, min_periods=1).mean()
      
      





The result is interesting. The ratio of geektimes article views to total views is roughly 1:5. But while the total number of views fluctuated noticeably, views of “entertaining” articles stayed at approximately the same level.







You can also see that the total number of views of articles in the “geektimes” section did fall after the rules changed, but, judging by eye, by no more than 5% of the total.



It is interesting to see the average number of views per article:







For “entertaining” articles it is about 40% above average, which is probably not surprising. The dip at the beginning of April is not clear to me: maybe it really happened, maybe it is some kind of parsing error, or maybe one of the geektimes authors went on vacation ;).



By the way, the chart also shows two noticeable peaks in the number of article views - New Year and May holidays.
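The code for the average views per article is not shown in the text. One plausible way to compute it is to divide the daily sum of views by the daily article count, as in the sketch below; the sample data is hypothetical, and the author's actual calculation may differ (for example, it may add a rolling average on top).

```python
from collections import defaultdict

# Hypothetical (date, views) pairs standing in for rows of the dataset.
articles = [
    ("2019-04-01", 8300), ("2019-04-01", 12000),
    ("2019-04-02", 4000), ("2019-04-02", 5000), ("2019-04-02", 9000),
]

views_per_day = defaultdict(int)
count_per_day = defaultdict(int)
for day, views in articles:
    views_per_day[day] += views
    count_per_day[day] += 1

# Average views per article for each day, in chronological order
avg_views = {day: views_per_day[day] / count_per_day[day] for day in sorted(views_per_day)}
print(avg_views)
```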



Hubs



Let's move on to the promised analysis of the hubs. We will display the top 20 hubs by the number of views:



import matplotlib.pyplot as plt

# hubs_all: presumably the collection of all hub names gathered above
hubs_info = []
for hub_name in hubs_all:
    # Select the articles that belong to this hub
    mask = df['hubs'].apply(lambda x: hub_name in x)
    df_hub = df[mask]
    count, views = df_hub.shape[0], df_hub['views'].sum()
    hubs_info.append((hub_name, count, views))

# Draw hubs
hubs_top = sorted(hubs_info, key=lambda v: v[2], reverse=True)[:20]
top_views = list(map(lambda x: x[2], hubs_top))
top_names = list(map(lambda x: x[0], hubs_top))

plt.rcParams["figure.figsize"] = (8, 6)
plt.bar(range(0, len(top_views)), top_views)
plt.xticks(range(0, len(top_names)), top_names, rotation=90)
plt.ticklabel_format(style='plain', axis='y')
plt.tight_layout()
plt.show()
      
      





Result:







Surprisingly, the “Information Security” hub turned out to be the most popular by views; “Programming” and “Popular science” are also in the top 5.



The anti-top is occupied by Gtk and Cocoa.







I'll tell you a secret: the top hubs can also be seen here, although the number of views is not shown there.
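The per-hub totals can also be collected in a single pass over the articles instead of filtering the whole dataset once per hub. The sketch below shows the idea on hypothetical sample rows; as in the code above, an article that belongs to several hubs is counted in each of them.

```python
from collections import defaultdict

# Hypothetical (hubs, views) pairs standing in for rows of the dataset.
articles = [
    (["infosecurity", "programming"], 304000),
    (["programming", "cpp"], 77600),
    (["gtk"], 1700),
]

stats = defaultdict(lambda: [0, 0])  # hub -> [article count, total views]
for hubs, views in articles:
    for hub in hubs:
        stats[hub][0] += 1
        stats[hub][1] += views

# Sort hubs by total views, most viewed first
top = sorted(stats.items(), key=lambda kv: kv[1][1], reverse=True)
for hub, (count, views) in top:
    print(hub, count, views)
```

Reversing the sort order gives the anti-top in the same way.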



Rating



And finally, the promised rating. Using the data from the hub analysis, we can list the most popular articles in the most popular hubs for 2019.
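The selection code is not shown in the text. A minimal sketch of one way to pick the most viewed articles of a hub follows; the record layout, the `top_articles` name and the `career` hub identifier are assumptions, while the titles and view counts are taken from the lists below.

```python
import heapq

# Hypothetical records; the field names are assumptions, not the author's schema.
articles = [
    {"title": "How I did not work for a year at Sberbank",
     "hubs": ["infosecurity", "career"], "views": 304000},
    {"title": "About one guy", "hubs": ["programming", "career"], "views": 167000},
    {"title": "Development King", "hubs": ["career"], "views": 179000},
]

def top_articles(articles, hub, n=10):
    """Return the n most viewed articles that belong to the given hub."""
    in_hub = (a for a in articles if hub in a["hubs"])
    return heapq.nlargest(n, in_hub, key=lambda a: a["views"])

for a in top_articles(articles, "career", n=2):
    print(a["title"], a["views"])
```

Because an article may belong to several hubs, the same title can appear in more than one of the lists.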



Information Security



How I did not work for a year at Sberbank 304,000 views, 599 comments, rating + 457.0 / -14.0

Smart bulbs thrown into the trash are a valuable source of personal information 232,000 views, 147 comments, rating + 75.0 / -11.0

Fraudsters and EDS - everything is very bad 176,000 views, 778 comments, rating + 356.0 / -0.0

How Megafon got caught on mobile subscriptions 166,000 views, 676 comments, rating + 624.0 / -2.0

Hacking VK, two-factor authentication will not save 148,000 views, 332 comments, rating + 124.0 / -17.0

How the browser helps Comrade Major 132,000 views, 321 comments, rating + 246.0 / -19.0

The largest dump in history: 2.7 billion accounts, of which 773 million unique 123,000 views, 154 comments, rating + 86.0 / -5.0

Honey, we kill the Internet 121,000 views, 933 comments, rating + 392.0 / -83.0

'Mobile content' is free, without SMS and registrations. Megafon fraud details 114,000 views, 478 comments, rating + 488.0 / -8.0

Port scanner in the personal account of Rostelecom 111,000 views, 194 comments, rating + 300.0 / -8.0



Programming



About one guy 167,000 views, 249 comments, rating + 239.0 / -33.0

The faster you forget OOP, the better for you and your programs 129,000 views, 1271 comments, rating + 131.0 / -63.0

Why Senior Developers can't get a job 119,000 views, 901 comments, rating + 151.0 / -14.0

Old people don't belong here? We program after thirty-five 116,000 views, 649 comments, rating + 222.0 / -16.0

New programming languages imperceptibly kill our connection with reality 106,000 views, 764 comments, rating + 164.0 / -52.0

What I learned from my bitter experience (over 30 years in software development) 101,000 views, 128 comments, rating + 178.0 / -9.0

The rarest and most expensive programming languages 82,900 views, 119 comments, rating + 38.0 / -10.0

Lecture course on JavaScript and Node.js in KPI 80,300 views, 14 comments, rating + 34.0 / -2.0

IT terms on the example of the process of growing potatoes 78,000 views, 86 comments, rating + 84.0 / -14.0

256 lines of bare C++: writing a ray tracer from scratch in a few hours 77,600 views, 124 comments, rating + 241.0 / -0.0



Popular science



What the designer smoked: an unusual firearm 236,000 views, 123 comments, rating + 119.0 / -9.0

Scientists have found the oldest living vertebrate on Earth 234,000 views, 212 comments, rating + 82.0 / -14.0

The series 'Chernobyl': watch and think 173,000 views, 803 comments, rating + 164.0 / -25.0

A 12-year-old teenager conducted a nuclear fusion reaction in his home laboratory 145,000 views, 280 comments, rating + 126.0 / -29.0

The Tale of the Rose Alloy and the Fallen Krenka 134,000 views, 244 comments, rating + 217.0 / -1.0

Increase it! The current increase in resolution 134,000 views, 235 comments, rating + 377.0 / -1.0

Software for the Boeing-737 Max was written by outsourcers earning $9 per hour 126,000 views, 560 comments, rating + 153.0 / -6.0

Do not be nervous, do not rush, do not interrupt: the story of one tragedy 121,000 views, 384 comments, rating + 242.0 / -4.0

Mathematicians have found the perfect way to multiply numbers 108,000 views, 222 comments, rating + 173.0 / -10.0

New programming languages imperceptibly kill our connection with reality 106,000 views, 764 comments, rating + 164.0 / -52.0



Career



How I did not work for a year at Sberbank 304,000 views, 599 comments, rating + 457.0 / -14.0

I ruin developers' lives with my code reviews and I'm sorry 187,000 views, 21 comments, rating + 37.0 / -3.0

Development King 179,000 views, 668 comments, rating + 315.0 / -60.0

About one guy 167,000 views, 249 comments, rating + 239.0 / -33.0

Retired at 22 158,000 views, 927 comments, rating + 259.0 / -100.0

How to replace a light bulb in the workplace so that you are not fired? 139,000 views, 762 comments, rating + 200.0 / -20.0

Innovations in Russian 128,000 views, 612 comments, rating + 480.0 / -33.0

Why Senior Developers can't get a job 119,000 views, 901 comments, rating + 151.0 / -14.0

'Burnt' employees: is there a way out? 117,000 views, 398 comments, rating + 210.0 / -14.0

Old people don't belong here? We program after thirty-five 116,000 views, 649 comments, rating + 222.0 / -16.0



Legislation in IT



Fraudsters and EDS - everything is very bad 176,000 views, 778 comments, rating + 356.0 / -0.0

How Megafon got caught on mobile subscriptions 166,000 views, 676 comments, rating + 624.0 / -2.0

Innovations in Russian 128,000 views, 612 comments, rating + 480.0 / -33.0

'Mobile content' is free, without SMS and registrations. Megafon fraud details 114,000 views, 478 comments, rating + 488.0 / -8.0

How the authorities of Kazakhstan are trying to cover up their failure with the introduction of the certificate 111,000 views, 77 comments, rating + 122.0 / -14.0

How Protonmail is blocked in Russia 102,000 views, 398 comments, rating + 418.0 / -7.0

The law on isolation of the Runet was adopted by the State Duma in three readings 88,200 views, 878 comments, rating + 73.0 / -18.0

How a programmer chose a bank and read the contract 87,200 views, 611 comments, rating + 166.0 / -9.0

The Ministry of Communications and Mass Media has approved the draft law on isolation of Runet 83,600 views, 364 comments, rating + 79.0 / -9.0

A detailed answer to the comment, as well as a little about the life of providers in the Russian Federation 74,700 views, 389 comments, rating + 290.0 / -1.0



Web development



Old people don't belong here? We program after thirty-five 116,000 views, 649 comments, rating + 222.0 / -16.0

How to make sites in 2019 110,000 views, 278 comments, rating + 233.0 / -11.0

Learning Docker, Part 1: Basics 91,300 views, 24 comments, rating + 52.0 / -10.0

Lecture course on JavaScript and Node.js in KPI 80,300 views, 14 comments, rating + 34.0 / -2.0

Trainee Vasya and his stories about API idempotency 68,900 views, 160 comments, rating + 216.0 / -3.0

The understanding of joins is broken. This is definitely not the intersection of circles, honestly 65,900 views, 223 comments, rating + 138.0 / -41.0

Why you do not need to spend your time creating niche thematic sites 62,700 views, 243 comments, rating + 179.0 / -13.0

We make a modern web application from scratch 62,200 views, 122 comments, rating + 56.0 / -8.0

A dark day for Vue.js 60,800 views, 133 comments, rating + 77.0 / -6.0

Why is modern web development so complicated? Part 1 57,700 views, 319 comments, rating + 101.0 / -6.0



GTK



And finally, so as not to offend anyone, here is the rating of the least visited hub, “gtk”. Only one article was published in it over the year, so it “automatically” takes first place in the rating.



Using GtkApplication. Features of librsvg rendering 1,700 views, 9 comments, rating + 9.0 / -1.0



Conclusion



There will be no conclusion. Happy reading, everyone.


