Sunday coding sessions: #1 Detecting the language of HIVE content by emrebeyler

Hive hosts content in many different languages. If you want to categorize that content, the language of a post or comment is one of the obvious dimensions.

Detecting a post's language was one of the requirements in my current project. It turned out to be pretty easy, so I'm dropping some notes here for future reference.

#### What do we need?
***

- A function that fetches the post body and removes the noise from it
- A function that returns the language code detected in the cleaned body

#### Clearing the noise
****
To feed the language detector, we need to filter out the HTML and Markdown tags. The cleaner the input, the more accurate the language detection.

```
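import re

from bs4 import BeautifulSoup
from markdown import markdown

def clear_noise(post_body):
    # Render Markdown to HTML so both kinds of markup can be stripped together
    body_as_html = markdown(post_body)
    soup = BeautifulSoup(body_as_html, "html.parser")
    # Keep only the text nodes of the parsed document
    post_body = ''.join(soup.find_all(string=True))
    # Drop any leftover tags and fenced code blocks
    post_body = re.sub('<.*?>', '', post_body)
    post_body = re.sub(
        '```.*?```', '', post_body, flags=re.MULTILINE | re.DOTALL)
    return post_body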
```
***
This simple snippet removes Markdown and HTML tags and tries to return only sentences and words.
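Bare URLs and @mentions are plain text, so they survive tag stripping even though they are not natural language. A minimal sketch of an extra pass that could run on the cleaned text, using only the standard library; the helper name `strip_urls_and_mentions` and the regular expressions are assumptions for illustration, not part of the original script:

```python
import re

# Hypothetical helper, not part of the original script.
URL_RE = re.compile(r"https?://\S+")
# An @ followed by a Hive-style account name, but not the @ inside an e-mail address
MENTION_RE = re.compile(r"(?<!\w)@[a-z0-9][a-z0-9.\-]*")
WHITESPACE_RE = re.compile(r"\s+")

def strip_urls_and_mentions(text):
    """Remove bare URLs and @mentions, then collapse runs of whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    return WHITESPACE_RE.sub(" ", text).strip()

sample = "Thanks @emrebeyler for the tip, see https://hive.blog for details"
print(strip_urls_and_mentions(sample))
```

The patterns are deliberately loose; a mention followed directly by a period, for example, would lose the period as well.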

#### Detecting the language
***
*"There is a library for that"* is the answer to most questions in the Python world. I've used [langdetect](https://pypi.org/project/langdetect/) in the past and it worked quite well.

```
from lighthive.client import Client
from langdetect import detect

def detect_language(author, permlink):
    c = Client()
    post_body = c.get_content(author, permlink)["body"]
    post_body = clear_noise(post_body)

    return detect(post_body)
```
***
We used lighthive's `Client` to fetch the post body, cleaned the text with `clear_noise`, and passed it to the language detector.
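Very short or empty bodies (an image-only post, for instance) give the detector little to work with, and langdetect raises an exception when it finds no usable text. A hedged sketch of a guard around the detection step; `safe_detect`, `min_chars`, and the `"und"` fallback (the ISO 639-2 code for "undetermined") are assumptions for illustration, and the detector is passed in as a callable so the snippet runs without langdetect installed:

```python
def safe_detect(text, detector, min_chars=20, fallback="und"):
    """Return detector(text), or fallback if the text is too short or detection fails."""
    text = text.strip()
    if len(text) < min_chars:
        return fallback
    try:
        return detector(text)
    except Exception:
        # Broad on purpose for this sketch; with langdetect the expected
        # error is its LangDetectException ("No features in text").
        return fallback

# Stand-in detector used only for this demonstration
def fake_detector(text):
    if not any(ch.isalpha() for ch in text):
        raise ValueError("no features in text")
    return "en"

print(safe_detect("", fake_detector))
print(safe_detect("This sentence is long enough to classify.", fake_detector))
print(safe_detect("1234567890 1234567890 1234567890", fake_detector))
```

With the real library, the call would look like `safe_detect(clear_noise(post_body), detect)`, with `detect` imported from langdetect.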

Let's test it with some example posts; I've picked four different posts with four different languages:

```
posts = [
    ('themarkymark', 'stemgeek-s-first-hackathon'),
    ('emrebeyler', 'yeni-baslayanlar-icin-hive'),
    ('clayop', '4v8phu-9'),
    ('satren', 'virtuelles-dach-meetup-am-sonntag-dem-17-05-2020-18-00-q9r1qv'),
]

for author, permlink in posts:
    language = detect_language(author, permlink)
    print("Language of @%s/%s is %s" % (author, permlink, language))
```
***

Output:
```
Language of @themarkymark/stemgeek-s-first-hackathon is en
Language of @emrebeyler/yeni-baslayanlar-icin-hive is tr
Language of @clayop/4v8phu-9 is ko
Language of @satren/virtuelles-dach-meetup-am-sonntag-dem-17-05-2020-18-00-q9r1qv is de
```
***
Pretty accurate. You can try the script without installing anything at [repl.it](https://repl.it/@emre2/HIVE-language-detection).


#### Bonus data: Language saturation on HIVE
***

This covers only a short period of time, but it probably gives a good idea of the languages actively used on HIVE.

![Screen Shot 2020-05-03 at 18.59.24.png](https://images.hive.blog/DQmZ9U8ujS5zy1CcsTvTVBoGcRWNdeS7yPaTZm1SKsueKxP/Screen%20Shot%202020-05-03%20at%2018.59.24.png)
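A tally like the one in the screenshot can be produced by counting the detected codes. A minimal sketch, assuming the language codes have already been collected into a list; the `detected` sample data and the `language_share` helper are made up for illustration and are not the figures from the chart:

```python
from collections import Counter

def language_share(codes):
    """Return (code, count, percentage) tuples, most common language first."""
    counts = Counter(codes)
    total = sum(counts.values())
    return [(code, n, round(100 * n / total, 1))
            for code, n in counts.most_common()]

# Made-up sample data, one entry per analyzed post
detected = ["en", "en", "es", "en", "ko", "es", "tr", "en"]

for code, n, pct in language_share(detected):
    print("%s: %d posts (%.1f%%)" % (code, n, pct))
```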


#### Notes
***
- langdetect supports 55 languages and returns [ISO 639-1](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes) language codes.

- There is also the [langid](https://github.com/saffsd/langid.py) library, which is a little less precise but detects faster. Check it out if you have tight time constraints.
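
The two-letter codes are compact but not very reader-friendly. A small sketch that maps the four codes from the test run above to English names; the `LANGUAGE_NAMES` table and the `language_name` helper are illustrative and cover only these four codes, not the full ISO 639-1 list:

```python
# Illustrative subset only; extend as needed.
LANGUAGE_NAMES = {
    "en": "English",
    "tr": "Turkish",
    "ko": "Korean",
    "de": "German",
}

def language_name(code):
    """Return a readable name for a language code, or the code itself if unknown."""
    return LANGUAGE_NAMES.get(code, code)

for code in ["en", "tr", "ko", "de", "pl"]:
    print(code, "->", language_name(code))
```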

Posted 2020-05-03 17:14:18 by @emrebeyler (tags: python, programming)

@bashadow · 2020-05-03 17:48:27
Interesting to see the number of languages being used. As awareness of the Hive blockchain grows, it will be nice to see the number of native languages grow too. Now all we need is an on-chain post translation button on some of the front ends, instead of having to use Google Translate.

A translation widget that front-end developers can add to their settings page, and then a translate-post button directly on the post the user is looking at. For example, adding one more item to the three-dot (ellipsis) menu that peakd uses.

@doze · 2020-05-03 17:33:51
@emrebeyler my friend, pt please! ;)
Cheers!

@gitplait · 2020-05-04 14:23:27
This is a fun and useful publication. We are picking it as one of the nice dev publications on the Hive chain for the day, and we will feature it on our front end, gitplait.tech. Well done, and we wish you success with the project you are building.

@mys · 2020-05-04 07:08:51
Polish community in the top 10 😊

@emrebeyler · 2020-05-04 07:53:15 (reply to @mys)
heh, it's a very small time frame so that may not show the real picture but yeah, pl looks strong.

@rishi556 · 2020-05-10 04:40:06
Damn it. This is exactly what I needed, but in JS. Time to go hunting for a library myself.

@shmoogleosukami · 2020-05-03 23:12:45
That's pretty neat!

@tngflx · 2020-05-04 09:18:09
Hmm, I was expecting Korean to be second. Perhaps it is on the Steem blockchain.

@emrebeyler · 2020-05-04 09:42:48 (reply to @tngflx)
It may have been their sleeping time. This data covers a timeframe of only 4-5 hours.