Habrastatistics: analyzing reader comments

Hi Habr. The previous part analyzed the popularity of the site's various sections, and along the way a question came up: what data can be extracted from the comments on articles? I also wanted to test a hypothesis, which I discuss below.





The data turned out to be quite interesting, and it was even possible to put together a small “mini-rating” of commenters. Continued under the cut.



Data collection



For the analysis we will use the data for the current year, 2019, especially since I already have the list of articles as a csv file. All that remains is to extract the comments from each article. Fortunately, they are stored in the article page itself, so no additional requests are needed.



To extract comments from an article, the following code is enough:



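import requests
import dateparser

# Str is a helper that is not shown in the original snippet; this is an assumed minimal version.
class Str(str):
    def find_between(self, first, last):
        """Return the text between the markers first and last, or an empty Str."""
        try:
            start = self.index(first) + len(first)
            end = self.index(last, start)
            return Str(self[start:end])
        except ValueError:
            return Str("")

r = requests.get("https://habr.com/ru/post/467453/")
data_html = r.text
# Every chunk after the first one starts at a comment block
comments = data_html.split('<div class="comment" id=')

comments_list = []
for comment in comments:
    body = Str(comment).find_between('<div class="comment__message', '<div class="comment__footer"').find_between('>', '</div>')
    if len(body) < 4:
        continue  # skip chunks without a real comment body

    # Drop control characters, then remove quotes, commas and simple tags so the body fits in one CSV field
    body = body.translate(str.maketrans(dict.fromkeys("\t\n\r\v\f")))
    body = body.replace('"', "'").replace(',', " ").replace('<br>', ' ').replace('<p>', '').replace('</p>', '').replace('  ', ' ')

    user = Str(comment).find_between('data-user-login', '>').find_between('"', '"')
    date_str = Str(comment).find_between('<time class="comment__date-time comment__date-time_published', 'time>').find_between('>', '<')
    vote = Str(comment).find_between('<div class="voting-wjt', '</div>').find_between('<span', 'span>').find_between('>', '<')
    date = dateparser.parse(date_str)

    csv_data = "{},{},{},{}".format(user, date, vote, body)
    comments_list.append(csv_data)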
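
To see how the chained find_between calls work without touching the network, here is a minimal sketch that runs the same extraction on a hand-made HTML fragment. The fragment, the standalone find_between function and parse_comment are assumptions made for illustration; the real Habr markup is larger and may change over time.

```python
def find_between(text, first, last):
    """Return the part of text between first and last, or '' if a marker is missing."""
    try:
        start = text.index(first) + len(first)
        end = text.index(last, start)
        return text[start:end]
    except ValueError:
        return ""


# Hypothetical fragment that imitates the markers used in the parsing code above.
SAMPLE_HTML = (
    '<div class="comment" id="comment_1">'
    '<span class="user-info" data-user-login="alice">alice</span>'
    '<time class="comment__date-time comment__date-time_published">6 Feb 2019 at 11:50</time>'
    '<div class="voting-wjt"><span class="voting-wjt__counter">+1</span></div>'
    '<div class="comment__message">Nice article, thanks!</div>'
    '<div class="comment__footer"></div>'
    '</div>'
)


def parse_comment(chunk):
    """Extract (user, date_str, vote, body) from one comment chunk."""
    message = find_between(chunk, '<div class="comment__message', '<div class="comment__footer"')
    body = find_between(message, '>', '</div>')
    user = find_between(find_between(chunk, 'data-user-login', '>'), '"', '"')
    time_tag = find_between(chunk, '<time class="comment__date-time comment__date-time_published', 'time>')
    date_str = find_between(time_tag, '>', '<')
    voting = find_between(chunk, '<div class="voting-wjt', '</div>')
    vote = find_between(find_between(voting, '<span', 'span>'), '>', '<')
    return user, date_str, vote, body


chunks = SAMPLE_HTML.split('<div class="comment" id=')
# chunks[0] is whatever precedes the first comment, so it is skipped
parsed = [parse_comment(chunk) for chunk in chunks[1:]]
print(parsed)
```

Each inner call narrows the search to one tag, and the outer call then cuts out the text between the nearest delimiters. This is fragile compared with a real HTML parser, but it matches the approach used in the article.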

      
      





The output is a list of lines like these (nicknames hidden):



xxxxxxx,2019-02-06 11:50:00,0, ?

xxxxxxx,2019-02-24 16:15:00,+1, .

xxxxxxx,2019-02-23 20:15:00,–5,
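
Lines in this layout are easy to load back for analysis. Below is a minimal sketch, assuming the four fields produced by the parsing code (user, date, vote, body) and a date written as YYYY-MM-DD HH:MM:SS. The function names and the sample nicknames and comment texts are made up for illustration. Note that the vote field may contain a typographic dash instead of an ASCII hyphen, so it is normalised before conversion.

```python
from collections import Counter
from datetime import datetime


def parse_vote(text):
    """Convert a vote such as '+1', '0' or a dash-prefixed number to int."""
    # Replace an en dash (U+2013) or a minus sign (U+2212) with an ASCII hyphen
    cleaned = text.strip().replace("\u2013", "-").replace("\u2212", "-")
    return int(cleaned) if cleaned else 0


def parse_line(line):
    """Split one 'user,date,vote,body' line; commas were already removed from the body."""
    user, date_str, vote, body = line.split(",", 3)
    date = datetime.strptime(date_str, "%Y-%m-%d %H:%M:%S")
    return user, date, parse_vote(vote), body.strip()


# Made-up sample lines in the same layout as the output above.
lines = [
    "alice,2019-02-06 11:50:00,0,Why is that?",
    "bob,2019-02-24 16:15:00,+1,Good point.",
    "alice,2019-02-23 20:15:00,\u20135,I disagree",
]

records = [parse_line(line) for line in lines]

# Number of comments and summed vote score per user
comments_per_user = Counter(user for user, _, _, _ in records)
rating_per_user = Counter()
for user, _, vote, _ in records:
    rating_per_user[user] += vote

print(comments_per_user.most_common())
print(dict(rating_per_user))
```

Splitting with a limit of three keeps the body intact even if it contains stray separators, and counting per user is enough to build a simple ranking of commenters.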









, , , , . , .



, , — , . , youtube — , , , . , , , -. , … , , .. , . , , . «» — , __ . , .





, disclaimer. , , . , . , , .



, . , 2019 ( ). 448533 , csv- 288. , .





, .







, . « », 10 18 ;) , , .



:







- — , , ( ).



, , , — , .





, . , , 25000 .



, , :







, . 5% 60% . 10% — 74% ( , , 450). , , ( , ).





, — . , , , .



, 5 VoXXXX (3377 ), 0xdXXXXX (3286 ), strXXXX (3043 ), AmXXXX (2897 ) khXXXX (2748 ).



, 5 amXXXX (1395 , +3231/-309), tvXXXX (1544 , +3231/-97), WhuXXXX (921 , +2288/-13), MTXXXX (1328 , +1383/-7) amaXXXX (736 , +1340/-16).



( ) Milfgard Boomburum. , , , .



. siXX (473 , 699 ), khXX (1915 , 573 ) nicXXXXX (456 , 487 ). , . «» vladXXXX (55 , 84 , 0 ), ekoXXXX (77 , 92 , 1 ) iMXXXX (225 , 205 , 12 ).





, , .



, . , « » . - , .


