
Improving Harvesting Strategy for a Distributed Social Media Search Engine by singhpratyush

https://i2.wp.com/blog.fossasia.org/wp-content/uploads/2017/07/Loklak-blog-2-banner.png?w=640&ssl=1

# About Kaizen Harvester

Kaizen is an alternative harvesting strategy in [loklak](http://loklak.org/). It uses the timelines it has already collected to generate new search queries. It maintains a queue of queries, populated by extracting the following from each timeline:

1. Hashtags in Tweets
2. User mentions in Tweets
3. Tweets from areas near to each Tweet in the timeline
4. Tweets older than oldest Tweet in the timeline

It can also use the Twitter API to fetch trending keywords, and it can get search suggestions from other loklak peers.

It was introduced by [@yukiisbored](https://github.com/yukiisbored) in pull request [loklak/loklak_server#960](https://github.com/loklak/loklak_server/pull/960).

# The Problem: Unbiased Harvesting Decision

On each run, the Kaizen harvester either searches for a query from its queue or tries to grab new queries (trending topics from the Twitter API, or suggestions from the backend). In the [previous version of KaizenHarvester](https://github.com/loklak/loklak_server/blob/89455cc84b1d5df54d8fc80627ee74318774c2ce/src/org/loklak/harvester/strategy/KaizenHarvester.java), the choice between harvesting and info-grabbing was made by a random boolean generator:

```java
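@Override
public int harvest() {
   // Fair coin flip: the queue's fill level plays no part in the decision
   if (!queries.isEmpty() && random.nextBoolean())
       return harvestMessages();
   // Otherwise fetch new query suggestions
   grabSuggestions();
   return 0;
}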
```
[[SOURCE](https://github.com/loklak/loklak_server/blob/89455cc84b1d5df54d8fc80627ee74318774c2ce/src/org/loklak/harvester/strategy/KaizenHarvester.java#L216-L224)]

In a typical configuration, the Kaizen harvester uses a fixed-size queue and drops any query submitted once the queue is full. Because the decision ignores how full the queue is, the harvester calls `grabSuggestions()` about half the time even when the queue has no room left.

With a full queue, the grabbed suggestions are simply lost, so the time and resources spent fetching them (from the backend or the API) are wasted. The decision needed to take the state of the queue into account.
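
To make the cost concrete, here is a minimal sketch, not loklak's actual code. `BoundedQueueDemo`, `BoundedQueue`, `offer`, and `countWastedGrabs` are hypothetical names, and the sketch assumes a fixed-size queue that rejects additions once full. The queue is deliberately held full (nothing is consumed) so that only the coin-flip decision is being measured.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Random;

public class BoundedQueueDemo {
    // Hypothetical stand-in for the harvester's fixed-size query queue.
    static final class BoundedQueue {
        private final Deque<String> items = new ArrayDeque<>();
        private final int capacity;

        BoundedQueue(int capacity) {
            this.capacity = capacity;
        }

        // Returns false (the query is dropped) when the queue is already full.
        boolean offer(String query) {
            if (items.size() >= capacity) {
                return false;
            }
            items.addLast(query);
            return true;
        }

        boolean isEmpty() {
            return items.isEmpty();
        }
    }

    // Counts coin-flip decisions that pick suggestion-grabbing while the queue is full.
    static int countWastedGrabs(int capacity, int iterations, long seed) {
        BoundedQueue queue = new BoundedQueue(capacity);
        for (int i = 0; i < capacity; i++) {
            queue.offer("query-" + i); // fill the queue completely
        }
        Random random = new Random(seed);
        int wasted = 0;
        for (int i = 0; i < iterations; i++) {
            // Same shape as the old decision: harvest only on a successful coin flip.
            boolean harvest = !queue.isEmpty() && random.nextBoolean();
            // A grabbed suggestion that the full queue rejects is wasted work.
            if (!harvest && !queue.offer("suggestion-" + i)) {
                wasted++;
            }
        }
        return wasted;
    }

    public static void main(String[] args) {
        int wasted = countWastedGrabs(100, 10_000, 42L);
        System.out.println("Wasted suggestion fetches: " + wasted + " of 10000 decisions");
    }
}
```

Under these assumptions, roughly half of all decisions on a full queue end in a fetch whose result is thrown away.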

# The Solution: Making Decision Biased

To fix this, the harvester now makes its decision as follows:

1. Calculate the fraction of the queue that is filled, `q.size() / q.maxSize()`.
2. Generate a random floating-point number between 0 and 1.
3. If the number is less than the fraction, harvest. Otherwise, grab harvesting suggestions.

# Why would this work?

Initially, when the queue is mostly empty, the ratio is small, so a random number between 0 and 1 is very likely to be greater than it, and Kaizen grabs search suggestions.

When the ratio is large (i.e. the queue is almost full), the random number is very likely to be less than it, so Kaizen searches for results instead of grabbing suggestions.
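
The argument above can be stated as a short calculation. The symbols are introduced here for illustration: $r$ is the queue fill ratio and $U$ is the random draw, assumed to be uniform on $[0, 1)$.

$$
P(\text{harvest}) = P(U < r) = r, \qquad P(\text{grab suggestions}) = 1 - r, \qquad 0 \le r \le 1
$$

So over $n$ independent decisions at a fixed ratio, the expected number of harvests is $nr$: roughly 2,500 out of 10,000 when the queue is a quarter full, and all 10,000 when it is full.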

# Graph?

The graph (generated with the code linked under Resources) shows how the harvester's decision changes with the queue ratio. For each ratio, it runs 10,000 iterations and plots the number of times the harvesting decision was taken.
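
The experiment described above can be approximated with a small Java sketch. This is not the author's script (which is linked under Resources); `HarvestDecisionSimulation`, `shouldHarvest`, and `countHarvests` are hypothetical names, the queue size of 100 and the seed are arbitrary, and `maxSize` is assumed to be positive. It prints counts rather than plotting them.

```java
import java.util.Random;

public class HarvestDecisionSimulation {
    // Biased decision: harvest when a uniform draw in [0, 1) falls below the fill ratio.
    static boolean shouldHarvest(int queueSize, int maxSize, float draw) {
        float ratio = queueSize / (float) maxSize; // assumes maxSize > 0
        return queueSize > 0 && draw < ratio;
    }

    // Number of "harvest" decisions out of `iterations` at a fixed queue size.
    static int countHarvests(int queueSize, int maxSize, int iterations, Random random) {
        int harvests = 0;
        for (int i = 0; i < iterations; i++) {
            if (shouldHarvest(queueSize, maxSize, random.nextFloat())) {
                harvests++;
            }
        }
        return harvests;
    }

    public static void main(String[] args) {
        int maxSize = 100;
        int iterations = 10_000;
        Random random = new Random(7L);
        // Sweep the fill ratio from 0.0 to 1.0 in steps of 0.1.
        for (int size = 0; size <= maxSize; size += 10) {
            int harvests = countHarvests(size, maxSize, iterations, random);
            System.out.printf("ratio=%.1f harvests=%d%n", size / (float) maxSize, harvests);
        }
    }
}
```

The counts should grow roughly linearly with the ratio: no harvests for an empty queue, and every decision a harvest for a full one.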

# Change in Code

The `harvest()` method was changed in [loklak/loklak_server#1158](https://github.com/loklak/loklak_server/pull/1158) to make this biased decision between harvesting and info-grabbing:

```java
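@Override
public int harvest() {
   // Uniform draw in [0, 1)
   float targetProb = random.nextFloat();
   // Fall back to an even split when QUERIES_LIMIT is not positive
   float prob = 0.5F;
   if (QUERIES_LIMIT > 0) {
       // Fraction of the queue that is currently filled
       prob = queries.size() / (float)QUERIES_LIMIT;
   }
   if (!queries.isEmpty() && targetProb < prob) {
       return harvestMessages();
   }

   grabSuggestions();

   return 0;
}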
```

[[SOURCE](https://github.com/loklak/loklak_server/blob/a0ccbdfdd86b128824e56235c877e942cee7c325/src/org/loklak/harvester/strategy/KaizenHarvester.java#L216-L230)]

# Conclusion

This change made the Kaizen harvester sensitive to how full its queue is. When the queue is full, it no longer requests suggestions from the backend only to drop them.

# Resources

* Current state of Kaizen Harvester: https://github.com/loklak/loklak_server/blob/development/src/org/loklak/harvester/strategy/KaizenHarvester.java
* Kaizen Harvester usage guide: https://github.com/loklak/loklak_server/blob/development/docs/kaizen.md
* Code used to generate the graph: https://gist.github.com/singhpratyush/8292b6fc815e5a18311848f635724f99

> Originally posted at FOSSASIA blog - [Improving Harvesting Decision for Kaizen Harvester in loklak server](https://blog.fossasia.org/improving-harvesting-decision-for-kaizen-harvester-in-loklak-server/)
@rajatdangi ·
The picture is funny!
@singhpratyush ·
Thanks, Rajat!