BOINC User XML data serialization comparison by cm-steem

![](https://steemitimages.com/DQmfXomNtSNqX2Mvg8dTp5M2Vd4NphkjvqbT8mzwx228SPU/boinc_logo.png)

# Comparing BOINC User XML data serialization methods

I've been working on converting BOINC project user XML GZ extracts into more convenient data formats, using xmltodict to convert the XML to a Python dict and then writing it out as [JSON](https://www.json.org/), [MsgPack](https://msgpack.org/) and [ProtoBuffers](https://developers.google.com/protocol-buffers/).
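A minimal sketch of the conversion step, under a few assumptions. The post used xmltodict; this version swaps in the standard library's `xml.etree.ElementTree` so it runs without third-party packages, and it only covers the JSON output. The `<users>`/`<user>` layout, the field names and the sample record are illustrative guesses at the shape of a BOINC user export, not data from a real project.

```python
import gzip
import json
import xml.etree.ElementTree as ET

# Made-up sample in the assumed shape of a BOINC user export.
SAMPLE_XML = b"""<users>
  <user>
    <id>1</id>
    <name>example</name>
    <total_credit>1234.5</total_credit>
    <cpid>0123456789abcdef0123456789abcdef</cpid>
  </user>
</users>"""


def users_from_xml_gz(blob: bytes) -> list:
    """Decompress a gzipped XML export and return one dict per <user>."""
    root = ET.fromstring(gzip.decompress(blob))
    return [
        {child.tag: (child.text or "") for child in user}
        for user in root.iter("user")
    ]


def to_json_bytes(users: list) -> bytes:
    """Serialize the list of user dicts as UTF-8 encoded JSON."""
    return json.dumps(users).encode("utf-8")


# Simulate the .gz extract in memory instead of reading a real file.
users = users_from_xml_gz(gzip.compress(SAMPLE_XML))
print(to_json_bytes(users).decode("utf-8"))
```

Every value comes out as a string here, so a real pipeline would probably cast ids and credits to numbers before serializing.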

## Let's start by comparing file sizes!

![](https://cdn.steemitimages.com/DQmWe8s7Sy5VrbXP2XDh7BgxXLjz52tGgCDMPrLkXCQ77vv/image.png)

It's interesting to note the difference in file size between the three data serialization formats: the clear winner is Google's ProtoBuffers!
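One plausible reason for the size gap is that JSON repeats every key name as text in every record, while the Protobuf wire format replaces keys with small numeric field tags. The sketch below hand-encodes a single record following the documented wire-format rules for varint and length-delimited fields. It is not how the files in this post were produced (that would use classes generated from a `.proto` file), and the field numbers and sample values are hypothetical.

```python
import json


def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint (7 payload bits per byte)."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # high bit set: more bytes follow
        else:
            out.append(low7)
            return bytes(out)


def encode_user(user_id: int, cpid: str) -> bytes:
    """Hypothetical message: field 1 = id (varint), field 2 = cpid (string)."""
    cpid_bytes = cpid.encode("ascii")
    id_field = encode_varint((1 << 3) | 0) + encode_varint(user_id)
    cpid_field = (
        encode_varint((2 << 3) | 2)        # tag: field 2, length-delimited
        + encode_varint(len(cpid_bytes))   # payload length
        + cpid_bytes
    )
    return id_field + cpid_field


# Made-up record, encoded both ways for a size comparison.
record = {"id": 150, "cpid": "0123456789abcdef0123456789abcdef"}
as_json = json.dumps(record).encode("utf-8")
as_wire = encode_user(record["id"], record["cpid"])
print(len(as_json), len(as_wire))
```

The saving per record is modest, but it is paid on every record in the file.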

## Now let's compare how long it took to read the files from disk:

![](https://cdn.steemitimages.com/DQmbMfBCBLPvFPqi1goApQ9zy7usF6APBmjutbth7m1gzhW/image.png)

This was performed on a low-power laptop with an SSD. It's clear from the above stats that ProtoBuffers is the winner, followed by MsgPack and then, shortly after, JSON.
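A read-time measurement of this kind could be sketched as follows. This is an assumed harness, not the script behind the chart above: it times only JSON, writes made-up records to a temporary file, and `time_read` is a hypothetical helper name. Any other decoder that takes the file's bytes could be passed in place of `json.loads`.

```python
import json
import os
import tempfile
import time


def time_read(path, loader, repeats=5):
    """Return the best wall-clock time in seconds to read and decode a file."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as fh:
            loader(fh.read())
        best = min(best, time.perf_counter() - start)
    return best


# Made-up records written to a temporary file so the sketch is self-contained.
records = [{"id": i, "total_credit": i * 1.5} for i in range(1000)]
with tempfile.NamedTemporaryFile("wb", suffix=".json", delete=False) as tmp:
    tmp.write(json.dumps(records).encode("utf-8"))

elapsed = time_read(tmp.name, json.loads)
os.remove(tmp.name)
print(f"JSON read: {elapsed * 1000:.3f} ms")
```

Repeated reads of the same file are usually served from the operating system's page cache, so numbers from a loop like this mostly reflect decoding cost rather than disk speed.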

---

You can find the constructed data formats within the [GRC HUG REST API Github repo](https://github.com/gridcoin-community/GRC-HUG-REST-API/tree/master/STATS_DUMP).

If you have any questions or suggestions for alternative data serialization methods, please reply below!

Best regards,
CM.

Posted: 2018-07-18 23:29:00 · Tags: gridcoin, boinc, programming, python, json
@aquatarkus ·
Very good work! Results like these can be expected from using protobuf as it's machine-readable only. I consider msgpack to be "almost" human-readable :D
Posted: 2018-07-19 09:55:30, in reply to @cm-steem's post
@cm-steem ·
Yeah, if you open the msgpack file you can clearly make out CPIDs in the compressed text, whereas the protobuffer file's contents look like garbage.

Interestingly enough, if you gzip each file, they all come out at approximately the same size.
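A rough way to check this kind of observation is sketched below. As a stand-in for the three formats, which would need third-party packages, it compares a verbose and a compact JSON layout of the same made-up records. The point is only that gzip tends to absorb repeated structure, so the gap between layouts should shrink after compression.

```python
import gzip
import json
import random

random.seed(0)  # reproducible made-up data
records = [
    {"id": i, "cpid": "%032x" % random.getrandbits(128)}
    for i in range(500)
]

# Two layouts of the same data, standing in for different serialization formats.
payloads = {
    "json_pretty": json.dumps(records, indent=2).encode("utf-8"),
    "json_compact": json.dumps(records, separators=(",", ":")).encode("utf-8"),
}

# Map each layout name to (raw size, gzipped size) in bytes.
sizes = {
    name: (len(raw), len(gzip.compress(raw)))
    for name, raw in payloads.items()
}
for name, (raw_size, gz_size) in sizes.items():
    print(f"{name}: raw={raw_size} gz={gz_size}")
```

The random CPID strings carry most of the information in each record, which is why they dominate the compressed size whatever the surrounding layout looks like.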

Given protobuffer read speeds, you could avoid storing data in memory & rather request it directly from disk for each query, lol.
Posted: 2018-07-19 15:12:15, in reply to @aquatarkus
@barton26 · (edited)
ProtoBuffers seems almost too good to be true.  Are there any significant downsides to using ProtoBuffers for serialization?  Does it require a lot of CPU to serialize/deserialize?  Does it require special software?
Posted: 2018-07-19 01:45:30, in reply to @cm-steem's post
@cm-mobile ·
For a fairer comparison I should time how long it took to write to disk.

The only downside of proto buffers is that it's slightly confusing to work with at first, but now that we've got an established proto file it's easily replicated.

It doesn't need much CPU to serialize/deserialize; however, I don't have the stats to back that up.

In terms of special software, you just need the protobuf3 software package; there should be implementations in other languages, C++ for example, for interacting with the files.
Posted: 2018-07-19 08:03:15, in reply to @barton26
@ravonn ·
(it's Protobuf, @cm-steem :))

The only downside I can think of is that it's binary so it's more difficult to read off the air. I use Protobuf at work to publish data from a microcontroller to an Android app and a web service. Doing that in text with lexical interpretations would be a nightmare.

To further compress the binary serialization we could use a 16-byte binary representation of the CPIDs instead of using its hexadecimal form. I suspect that's where a lot of the storage goes.
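A minimal sketch of that idea in Python, assuming a CPID is a 32-character hexadecimal string (the value below is made up). `bytes.fromhex` and `bytes.hex` are standard library methods and convert in both directions without loss.

```python
cpid_hex = "0123456789abcdef0123456789abcdef"  # made-up CPID, 32 hex characters

cpid_raw = bytes.fromhex(cpid_hex)  # 16 raw bytes, half the length
restored = cpid_raw.hex()           # back to the 32-character hex string

print(len(cpid_hex), len(cpid_raw), restored == cpid_hex)
```

This only helps in formats that can carry raw bytes, such as a Protobuf `bytes` field or MsgPack's binary type. JSON has no byte type, so the CPID would have to stay as text there.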
Posted: 2018-07-19 15:29:51, in reply to @barton26
@cm-steem ·
> The only downside I can think of is that it's binary so it's more difficult to read off the air.

Do you think that's possible via [FlatBuffers](https://google.github.io/flatbuffers/) or [gRPC](https://grpc.io/)?

> To further compress the binary serialization we could use a 16-byte binary representation of the CPIDs instead of using its hexadecimal form. I suspect that's where a lot of the storage goes.

Do you have more details on how this can be done in Python? Do you mean [compressing the string](https://github.com/CordySmith/PySmaz) or just converting the CPID from a string to binary?

The files would be far smaller if the CPID was omitted, relying on userId instead & perhaps constructing a separate index for userId:CPID for quick lookup.
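The separate-index idea could look roughly like this. It is a sketch only: `split_cpid_index` is a hypothetical helper, and the record fields and values are made up for illustration.

```python
def split_cpid_index(users):
    """Return (lean_records, index), where index maps user id -> CPID."""
    index = {user["id"]: user["cpid"] for user in users}
    lean = [
        {key: value for key, value in user.items() if key != "cpid"}
        for user in users
    ]
    return lean, index


# Made-up records in the assumed shape of the converted user data.
users = [
    {"id": 1, "total_credit": 10.0, "cpid": "0123456789abcdef0123456789abcdef"},
    {"id": 2, "total_credit": 20.0, "cpid": "fedcba9876543210fedcba9876543210"},
]

lean, cpid_by_user = split_cpid_index(users)
print(lean[0], cpid_by_user[1])
```

The index still has to store every CPID once, so the saving applies to the main data file rather than to the total amount stored.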
Posted: 2018-07-19 15:53:36, in reply to @ravonn