Improving Hive’s semantic search performance by blocktrades

![blocktrades update.png](https://images.hive.blog/DQmSihw8Kz4U7TuCQa98DDdCzqbqPFRumuVWAbareiYZW1Z/blocktrades%20update.png)

[HiveSense](https://gitlab.syncad.com/hive/hivesense) is a HAF app that creates semantic embeddings for Hive posts. These embeddings will be used by the new HiveSense API calls to find posts that are similar to each other and to find posts that match a user’s search request.

A while back, one of our devs posted on the early work we did developing a semantic search API server for Hive posts called HiveSense.

Today’s post describes the follow-on work over the past 2.5 months as we prepare for an official release of HiveSense as part of the standard HAF API stack. It assumes you’ve previously read [the original post about HiveSense](https://hive.blog/hive-139531/@thebeedevs/hivesense-why-nothing-worked-at-first-and-what-we-did-about-it).

This post is even more technical than the previous one, so the primary audience is devs interested in practical considerations associated with using semantic search in their apps or who would like to contribute to HiveSense development in the future.

### Performance Optimizations

* Use Ollama’s batch API to generate multiple embeddings in a single call
* Add support for, and default to, 16-bit precision floating point vectors. This reduced the 768-dimension embeddings table size from 100GB to 50GB.
* Use PySBD for sentence detection instead of spaCy (PySBD is smaller)
* Dramatically reduce the size of the Docker images (many changes here, but the biggest win was removing pgai from the default config)
* Allow HiveSense to process data concurrently while hivemind is still in massive sync (originally HiveSense could only be synced after hivemind was in livesync, and hivemind takes about 2.5 days to sync on a very fast system). Since HiveSense syncs faster than hivemind, and can now sync in parallel with hivemind except for the time to create the HNSW index, HiveSense now only adds 1.5 hours to the overall time to sync up a HAF API node from scratch.
* Redesign the worker thread implementation
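
The batching item can be sketched as follows. This is a minimal illustration, assuming an Ollama server that exposes the `/api/embed` endpoint, which accepts a list of strings under `input` and returns one vector per string under `embeddings`. The model name, the base URL, and the helper names `build_embed_request` and `embed_batch` are placeholders for illustration; the post does not say which model or client code HiveSense uses.

```python
import json
import urllib.request


def build_embed_request(model, texts):
    """Build the JSON body for one batched call: every text goes under "input"."""
    return {"model": model, "input": list(texts)}


def embed_batch(texts, model="nomic-embed-text", base_url="http://localhost:11434"):
    """POST all texts in a single request and return one embedding per text.

    The model name and URL are assumptions; adjust them to your own setup.
    """
    body = json.dumps(build_embed_request(model, texts)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/embed",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["embeddings"]
```

Calling `embed_batch(["first chunk", "second chunk"])` against a running server should return two vectors from one HTTP round trip instead of two.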

### Recall Optimizations

Recall here basically refers to the “quality” of the search results.

* Chunk long posts on sentence boundaries to capture semantic meaning better in the chunks
* Chunk all of a post instead of limiting a post to a max of 3 chunks. This creates more embeddings, but allows for better matching of long posts, especially if they switch topics.
* Target chunk size based on number of tokens rather than raw character count
* Prepend the title to the post so the title will be used when calculating the embedding of the post’s first chunk
* Change the minimum word count for posts to a minimum token count, make the token count configurable, and default the minimum to 75 tokens
* Add many options to ease switching to better embedding models in the future and to examine tradeoffs between models
* Use query prefix to improve embeddings generated for search queries
* Don’t discard non-ASCII characters (these were being removed for embedding calculations)
* Improve filtering of HTML
* Increase the HNSW index parameters m and ef_construction to improve recall quality, especially for “small queries” that sometimes got poor search results.
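
The chunking items above (sentence boundaries, a token-based target size, a minimum token count, and the prepended title) can be sketched roughly as follows. This is not HiveSense's code: the regex splitter is a naive stand-in for PySBD, whitespace word counts stand in for the embedding model's tokenizer, and the function and parameter names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive stand-in for PySBD: split after ., ! or ? followed by whitespace."""
    return [part for part in re.split(r"(?<=[.!?])\s+", text.strip()) if part]


def count_tokens(text):
    """Whitespace word count, a rough stand-in for a real tokenizer."""
    return len(text.split())


def chunk_post(title, body, target_tokens=50, min_tokens=0):
    """Chunk a whole post on sentence boundaries, with the title prepended."""
    full_text = f"{title}\n\n{body}" if title else body
    if count_tokens(full_text) < min_tokens:
        return []  # post is too short to be worth embedding
    chunks, current, current_tokens = [], [], 0
    for sentence in split_sentences(full_text):
        tokens = count_tokens(sentence)
        # Close the current chunk if this sentence would exceed the target.
        if current and current_tokens + tokens > target_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        # A single sentence longer than the target becomes its own chunk.
        current.append(sentence)
        current_tokens += tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = chunk_post("T", "a b c. d e f. g h i.", target_tokens=4)
print(len(chunks))
```

Because the title is part of the text before splitting, it ends up in the first chunk only, which matches the behavior described in the list.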

### Miscellaneous changes

* Track the number of tokens in each post and order the embeddings generated for each post
* Allow filtering out shorter posts at search time (previously this could only be done at index time)
* Change API results to make paging deterministic (in progress)

## Redesign of embeddings tables and indexing methodology

Originally we just used a single table with 768-dimension vectors and built an HNSW index on this table. Both the table and the index were originally 100GB each in size (i.e. 200GB total storage required by HiveSense). Our first optimization was to use 16-bit precision numbers instead of 32-bit to cut storage requirements in half (100GB in total size, which seemed like a reasonable amount of storage).
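
To make the halving concrete, here is a small sketch using Python's `struct` module, whose `e` format is IEEE 754 half precision. It only shows raw bytes per 768-dimension vector and the rounding error on a made-up vector; it is not how HiveSense stores data, and real table sizes also include per-row and index overhead.

```python
import math
import struct

DIMENSIONS = 768  # vector width quoted in the post


def pack_vector(values, fmt_char):
    """Serialize floats little-endian: 'f' is 32-bit, 'e' is 16-bit."""
    return struct.pack(f"<{len(values)}{fmt_char}", *values)


def unpack_vector(blob, fmt_char):
    """Inverse of pack_vector for the same format character."""
    count = len(blob) // struct.calcsize(f"<{fmt_char}")
    return list(struct.unpack(f"<{count}{fmt_char}", blob))


# Made-up vector with components in [-1, 1], for illustration only.
vector = [math.sin(i) for i in range(DIMENSIONS)]

as_fp32 = pack_vector(vector, "f")
as_fp16 = pack_vector(vector, "e")
restored = unpack_vector(as_fp16, "e")
max_error = max(abs(a - b) for a, b in zip(vector, restored))

print(len(as_fp32), len(as_fp16))  # 3072 1536
print(max_error)  # small rounding error introduced by half precision
```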

But another problem we found was that it was very time consuming to create an HNSW index this large. On systems without a LOT of memory, it would take quite a few days. On systems with 128GB of RAM installed, this time could be cut down to around 8.5 hours (the current code that computes this index really favors having sufficient memory for the index creation), but this seemed a steep requirement for most API servers (we have internal servers that have this much memory, but the cloud servers we rent only have 64GB).

The solution we arrived at was to create a secondary, smaller embeddings table and build an HNSW index on that smaller table, then use the larger embeddings table for the final similarity computation.

HiveSense uses principal component analysis (PCA) to generate a second embeddings table with much smaller 128-dimension vectors (table size 9GB) and a much smaller HNSW index on this table (the new HNSW index is only 16GB). This new index only takes 1.5 hours to build and only requires 28GB of memory (it can be built in 4.5 hours on systems with less memory).
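
The two-stage idea can be sketched in plain Python. This is an illustration under stated assumptions, not HiveSense's implementation: a brute-force scan over the reduced vectors stands in for the HNSW index, the `components` argument stands in for a fitted PCA basis (mean-centering is omitted), and all names and the toy 4-dimension data are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def project(vector, components):
    """Project a full vector onto a list of basis vectors (e.g. PCA components)."""
    return [sum(v * c for v, c in zip(vector, comp)) for comp in components]


def two_stage_search(query, full_vectors, components, candidates=3, top_k=2):
    """Pick candidates in the reduced space, then rerank with full vectors."""
    reduced_query = project(query, components)
    reduced = {pid: project(vec, components) for pid, vec in full_vectors.items()}
    # Stage 1: coarse ranking on small vectors (an HNSW index would do this).
    coarse = sorted(
        reduced, key=lambda pid: cosine(reduced_query, reduced[pid]), reverse=True
    )[:candidates]
    # Stage 2: final similarity computed on the full-size embeddings.
    reranked = sorted(
        coarse, key=lambda pid: cosine(query, full_vectors[pid]), reverse=True
    )
    return reranked[:top_k]


posts = {
    "a": [0.9, 0.1, 0.0, 0.0],
    "b": [1.0, 0.0, 0.0, 0.8],
    "c": [0.0, 1.0, 0.0, 0.0],
    "d": [-1.0, 0.0, 0.0, 0.0],
}
components = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # keep first two axes
print(two_stage_search([1.0, 0.0, 0.0, 0.0], posts, components))  # ['a', 'b']
```

In this toy data the reduced space ranks "b" ahead of "a", and the full-vector rerank corrects the order, which is the reason for keeping the larger table around.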

Storage-wise, with all the optimizations, we reduced total storage usage from 100+100=200GB down to 50+9+16=75GB.

This approach also dramatically speeds up API query time as we’re searching a much smaller index, but we don’t have full statistics for this yet (our guess is somewhere between 3x and 10x faster).

Of course, we did have one concern about this approach: we needed to ensure it didn’t negatively affect recall results. To check this, we compared search results for various queries between a full brute force search of the embeddings and a search using the new index to confirm the results didn’t significantly change.
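
One common way to quantify such a comparison is recall@k: the fraction of the brute-force top-k results that also appear in the index's top-k. The sketch below assumes each search returns an ordered list of post identifiers; the post does not say which metric was actually used, so treat this as one possible approach.

```python
def recall_at_k(exact_ids, approx_ids, k):
    """Fraction of the exact top-k that the approximate search also returned."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 1.0  # nothing to find, so nothing was missed
    return len(exact_top & set(approx_ids[:k])) / len(exact_top)


def mean_recall(result_pairs, k):
    """Average recall@k over (exact_ids, approx_ids) pairs, one per query."""
    scores = [recall_at_k(exact, approx, k) for exact, approx in result_pairs]
    return sum(scores) / len(scores)


pairs = [
    ([1, 2, 3, 4], [1, 3, 5, 2]),  # top-3 overlap is {1, 3}
    ([7, 8, 9], [7, 8, 9]),        # identical results
]
print(mean_recall(pairs, k=3))
```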


## New Sync Mode for HiveSense

A normal CPU is sufficient to generate embeddings for short text phrases like those used for search queries, but generating semantic embeddings for posts is too computationally intensive, so a GPU is required to generate them at a reasonable speed. 

We didn’t want to force API node operators to have a GPU, so HiveSense can be configured to operate in two different modes: independent mode and sync mode. 

In the independent mode, HiveSense expects to have access to one or more Ollama servers with GPUs providing computation power. 

In sync mode, the embeddings for posts are fetched from another HiveSense server, so the local HiveSense server only needs to compute embeddings for user search queries (which can be computed with an Ollama server powered just by a reasonable CPU).

As we don’t expect most current API nodes to have access to a GPU (our primary API node, api.hive.blog, doesn’t), we expect most API node operators will configure HiveSense to operate in sync mode, sparing their server from repeating the expensive computations required for computing post embeddings.

## What’s next for HiveSense?

We need to change the API to stabilize paging of search results based on our new approach: we will return 1000 results, with the first 20 results including permlink + summary results for the post, and the remaining results just providing permlinks. Client side apps will need to fetch further post summaries in case the user pages beyond the first page.
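
A rough sketch of that response shape follows. The API is still being finalized, so the field names (`permlink`, `summary`, `score`), the tie-break on permlink, and both helper functions are assumptions for illustration only.

```python
def build_search_response(ranked_posts, summary_count=20, max_results=1000):
    """Return up to max_results entries; only the first summary_count carry summaries."""
    # Sort by score, breaking ties on permlink so repeated calls give the same order.
    ordered = sorted(ranked_posts, key=lambda p: (-p["score"], p["permlink"]))
    response = []
    for position, post in enumerate(ordered[:max_results]):
        entry = {"permlink": post["permlink"]}
        if position < summary_count:
            entry["summary"] = post["summary"]
        response.append(entry)
    return response


def permlinks_needing_summaries(response, page, page_size=20):
    """Permlinks on a given zero-based page that the client still has to fetch."""
    start = page * page_size
    page_entries = response[start:start + page_size]
    return [e["permlink"] for e in page_entries if "summary" not in e]


posts = [
    {"permlink": f"p{i}", "summary": f"s{i}", "score": 100 - i} for i in range(25)
]
response = build_search_response(posts)
print(permlinks_needing_summaries(response, page=1))
```

Since the client holds the full ordered list of permlinks after the first call, paging never re-runs the search, so the order cannot shift between pages.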

We need to update our app testing API server, api.syncad.com, with the new stack so that Hive apps can add support for HiveSense and perform “real-world” testing.

Finally, we need to officially release HiveSense alongside the other updated HAF apps. Currently I expect that to happen near the end of this quarter (sometime in September).
Posted by @blocktrades in hive-139531 on 2025-08-01 21:58:48.
@amazing23 ·
This is kind of encouraging for Hive communities.
@celeste413 ·
I don't fully understand it, but it sounds impressive 👍
@hatoto ·
That is a really important update. Thanks a lot for working on it!
@holoz0r ·
Great news. Hive search can really use these improvements for content discoverability. 
@latinowinner ·
very technical information
@mahirv ·
This is a very important update. Thank you very much for the hard work on this.
@spiritabsolute ·
What would we do without you? It's good that you exist! Well done!