# Comparing BOINC User XML data serialization methods

I've been working on converting BOINC project user XML GZ extracts to more convenient data formats, using xmltodict to convert the XML to a dict in Python and then writing it out as [JSON](https://www.json.org/), [MsgPack](https://msgpack.org/) and [ProtoBuffers](https://developers.google.com/protocol-buffers/).

## Let's start by comparing file sizes!

It's interesting to note the difference in file size between the three data serialization formats, the clear winner being Google's ProtoBuffers!

## Now let's compare how long it took to read the files from disk:

This was performed on a low-power laptop with an SSD. It's clear from the stats that ProtoBuffers is the winner, followed by MsgPack and then, shortly after, JSON.

---

You can find the constructed data formats within the [GRC HUG REST API GitHub repo](https://github.com/gridcoin-community/GRC-HUG-REST-API/tree/master/STATS_DUMP).

If you have any questions or suggestions for alternative data serialization methods, please do reply below!

Best regards,
CM.
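
For readers who want to try the conversion step described at the top of the post, here is a minimal sketch. It is not the code used for the measurements: it swaps xmltodict for the standard library's `xml.etree.ElementTree` so it runs without third-party packages, and the sample XML layout (`<users>`, `<user>`, `id`, `name`, `cpid`, `total_credit`) and the helper names are assumptions made for illustration.

```python
import gzip
import json
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like a BOINC user export; real extracts are far larger.
SAMPLE_XML = b"""<users>
  <user><id>1</id><name>alice</name><cpid>0123456789abcdef0123456789abcdef</cpid><total_credit>1500.5</total_credit></user>
  <user><id>2</id><name>bob</name><cpid>fedcba9876543210fedcba9876543210</cpid><total_credit>42.0</total_credit></user>
</users>"""


def parse_users(gz_bytes):
    """Decompress a gzipped XML extract and return one dict per <user> element."""
    root = ET.fromstring(gzip.decompress(gz_bytes))
    # Every value stays a string here; no type conversion is attempted.
    return [{child.tag: child.text for child in user} for user in root.iter("user")]


def to_json_bytes(users):
    """Serialize the user list as compact UTF-8 JSON."""
    return json.dumps(users, separators=(",", ":")).encode("utf-8")


gz_extract = gzip.compress(SAMPLE_XML)  # stands in for a downloaded .xml.gz file
users = parse_users(gz_extract)
json_blob = to_json_bytes(users)
print(len(users), "users,", len(json_blob), "bytes of JSON")
```

The same `users` list could then be handed to a MsgPack or Protocol Buffers encoder, and `len()` of each resulting blob gives the file-size comparison.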
author | cm-steem |
---|---|
permlink | boinc-user-xml-data-serialization-comparison |
category | gridcoin |
json_metadata | {"tags":["gridcoin","boinc","programming","python","json"],"image":["https://steemitimages.com/DQmfXomNtSNqX2Mvg8dTp5M2Vd4NphkjvqbT8mzwx228SPU/boinc_logo.png","https://cdn.steemitimages.com/DQmWe8s7Sy5VrbXP2XDh7BgxXLjz52tGgCDMPrLkXCQ77vv/image.png","https://cdn.steemitimages.com/DQmbMfBCBLPvFPqi1goApQ9zy7usF6APBmjutbth7m1gzhW/image.png"],"links":["https://www.json.org/","https://msgpack.org/","https://developers.google.com/protocol-buffers/","https://github.com/gridcoin-community/GRC-HUG-REST-API/tree/master/STATS_DUMP"],"app":"steemit/0.1","format":"markdown"} |
created | 2018-07-18 23:29:00 |
last_update | 2018-07-18 23:29:00 |
depth | 0 |
children | 7 |
last_payout | 2018-07-25 23:29:00 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 7.471 HBD |
curator_payout_value | 1.298 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 1,363 |
author_reputation | 58,522,774,254,119 |
root_title | "BOINC User XML data serialization comparison" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 65,172,397 |
net_rshares | 4,190,491,239,172 |
author_curate_reward | "" |
voter | weight | wgt% | rshares | pct | time |
---|---|---|---|---|---|
scalextrix | 0 | 32,131,907,711 | 100% | ||
cm-steem | 0 | 539,074,761,031 | 100% | ||
vortac | 0 | 3,369,870,875,178 | 100% | ||
neuralminer | 0 | 3,099,658,004 | 100% | ||
sc-steemit | 0 | 71,763,087,292 | 100% | ||
peppernrino | 0 | 15,677,028,640 | 100% | ||
steemtruth | 0 | 4,549,332,978 | 15% | ||
barton26 | 0 | 9,592,044,949 | 100% | ||
ravonn | 0 | 6,800,594,411 | 100% | ||
wilbur | 0 | 38,351,385,572 | 100% | ||
sodom | 0 | 2,512,980,945 | 100% | ||
cm-mobile | 0 | 1,375,732,228 | 100% | ||
grider123 | 0 | 1,480,834,419 | 100% | ||
jkkim | 0 | 76,723,970 | 10% | ||
diogogomes | 0 | 472,796,692 | 85% | ||
jringo | 0 | 27,156,560,057 | 100% | ||
theissen | 0 | 2,012,068,995 | 100% | ||
xxcynicalkidxx | 0 | 211,876,554 | 50% | ||
crt | 0 | 18,920,183,314 | 100% | ||
zipity | 0 | 3,415,042,950 | 100% | ||
unrared | 0 | 13,863,591,971 | 20% | ||
trixiedraws | 0 | 4,220,875,021 | 100% | ||
parejan | 0 | 9,618,899,540 | 100% | ||
lumendan | 0 | 587,798,512 | 100% | ||
gregan | 0 | 6,640,401,126 | 100% | ||
alexmaksto | 0 | 598,438,473 | 100% | ||
itsragged | 0 | 305,145,796 | 100% | ||
c4h8n8o8 | 0 | 434,390,202 | 100% | ||
grwd | 0 | 2,057,695,286 | 100% | ||
steemgridcoin | 0 | 3,027,284,603 | 100% | ||
aquatarkus | 0 | 591,242,752 | 100% | ||
fionsamerica | 0 | 0 | -100% |
Very good work! Results like these can be expected from using protobuf as it's machine-readable only. I consider msgpack to be "almost" human-readable :D
author | aquatarkus |
---|---|
permlink | re-cm-steem-boinc-user-xml-data-serialization-comparison-20180719t095526883z |
category | gridcoin |
json_metadata | {"tags":["gridcoin"],"app":"steemit/0.1"} |
created | 2018-07-19 09:55:30 |
last_update | 2018-07-19 09:55:30 |
depth | 1 |
children | 1 |
last_payout | 2018-07-26 09:55:30 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 0.660 HBD |
curator_payout_value | 0.214 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 153 |
author_reputation | 125,496,221,660 |
root_title | "BOINC User XML data serialization comparison" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 65,222,079 |
net_rshares | 414,288,831,995 |
author_curate_reward | "" |
voter | weight | wgt% | rshares | pct | time |
---|---|---|---|---|---|
cm-steem | 0 | 377,975,076,560 | 65% | ||
barton26 | 0 | 10,320,554,692 | 100% | ||
cm-mobile | 0 | 1,455,801,300 | 100% | ||
jringo | 0 | 24,537,399,443 | 100% |
Yeah, if you open the msgpack file you can clearly interpret CPIDs in the compressed text, whereas the protobuffer text contents look like garbage. Interestingly enough, if you compress each file type in a GZ file they come out at approximately the same file size. Given protobuffer read speeds, you could avoid storing data in memory & instead request it directly from disk for each query, lol.
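
A rough way to check the gzip observation on your own data is sketched below. The records are synthetic, and the binary layout is a hypothetical fixed-width stand-in built with `struct`, not MsgPack or ProtoBuffers, so the numbers it prints say nothing about the real files.

```python
import gzip
import json
import struct

# Synthetic records; the field names and values are made up for illustration.
records = [
    {"id": i, "cpid": "%032x" % (i * 2654435761), "total_credit": i * 1.5}
    for i in range(1000)
]

json_blob = json.dumps(records).encode("utf-8")

# Stand-in binary layout: 4-byte unsigned id, 16-byte CPID, 8-byte double (28 bytes per record).
binary_blob = b"".join(
    struct.pack("<I16sd", r["id"], bytes.fromhex(r["cpid"]), r["total_credit"])
    for r in records
)


def gz_sizes(blobs):
    """Map each name to (raw size, gzipped size) in bytes."""
    return {name: (len(blob), len(gzip.compress(blob))) for name, blob in blobs.items()}


report = gz_sizes({"json": json_blob, "binary": binary_blob})
for name, (raw, packed) in report.items():
    print(f"{name}: {raw} bytes raw, {packed} bytes gzipped")
```

Swapping the two blobs for the contents of the real JSON, MsgPack and ProtoBuffers files would reproduce the comparison.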
author | cm-steem |
---|---|
permlink | re-aquatarkus-re-cm-steem-boinc-user-xml-data-serialization-comparison-20180719t151215206z |
category | gridcoin |
json_metadata | {"tags":["gridcoin"],"app":"steemit/0.1"} |
created | 2018-07-19 15:12:15 |
last_update | 2018-07-19 15:12:15 |
depth | 2 |
children | 0 |
last_payout | 2018-07-26 15:12:15 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 0.055 HBD |
curator_payout_value | 0.015 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 400 |
author_reputation | 58,522,774,254,119 |
root_title | "BOINC User XML data serialization comparison" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 65,254,621 |
net_rshares | 34,505,111,604 |
author_curate_reward | "" |
voter | weight | wgt% | rshares | pct | time |
---|---|---|---|---|---|
barton26 | 0 | 10,381,263,837 | 100% | ||
jringo | 0 | 24,123,847,767 | 100% | ||
aquatarkus | 0 | 0 | 100% |
ProtoBuffers seems almost too good to be true. Are there any significant downsides to using ProtoBuffers for serialization? Does it require a lot of CPU to serialize/deserialize? Does it require special software?
author | barton26 |
---|---|
permlink | re-cm-steem-boinc-user-xml-data-serialization-comparison-20180719t014531949z |
category | gridcoin |
json_metadata | {"tags":["gridcoin"],"app":"steemit/0.1"} |
created | 2018-07-19 01:45:30 |
last_update | 2018-07-19 01:45:45 |
depth | 1 |
children | 4 |
last_payout | 2018-07-26 01:45:30 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 0.675 HBD |
curator_payout_value | 0.216 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 215 |
author_reputation | 3,089,378,353,442 |
root_title | "BOINC User XML data serialization comparison" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 65,181,484 |
net_rshares | 425,580,873,962 |
author_curate_reward | "" |
voter | weight | wgt% | rshares | pct | time |
---|---|---|---|---|---|
cm-steem | 0 | 384,171,389,290 | 65% | ||
sodom | 0 | 2,564,266,270 | 100% | ||
diogogomes | 0 | 466,696,090 | 85% | ||
jringo | 0 | 26,605,157,823 | 100% | ||
theissen | 0 | 1,971,827,616 | 100% | ||
parejan | 0 | 9,801,536,873 | 100% |
For a fairer comparison I should also time how long it took to write to disk. The downside of ProtoBuffers is just that it's slightly confusing to work with at first, but now that we've got an established proto file it's easily replicated. It doesn't need much CPU to serialize/deserialize, however I don't have the stats to back that up. In terms of special software, you just need the protobuf3 software package; there should be alternative language implementations for interacting with the files, in C++ for example.
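
Timing both directions could look roughly like the sketch below. The helper name `time_round_trip` and the sample records are hypothetical, and JSON is used only because it ships with Python; any pair of dump/load callables that work with bytes could be passed in instead.

```python
import json
import os
import tempfile
import time


def time_round_trip(records, dump, load):
    """Return (write_seconds, read_seconds, loaded_records) for one serializer pair."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "users.bin")

        # Time serialization plus the write to disk.
        start = time.perf_counter()
        with open(path, "wb") as fh:
            fh.write(dump(records))
        write_s = time.perf_counter() - start

        # Time the read from disk plus deserialization.
        start = time.perf_counter()
        with open(path, "rb") as fh:
            loaded = load(fh.read())
        read_s = time.perf_counter() - start
    return write_s, read_s, loaded


records = [{"id": i, "name": f"user{i}"} for i in range(10_000)]
write_s, read_s, loaded = time_round_trip(
    records,
    dump=lambda data: json.dumps(data).encode("utf-8"),
    load=lambda blob: json.loads(blob.decode("utf-8")),
)
print(f"write {write_s:.4f}s, read {read_s:.4f}s")
```

Note that a file read straight after being written is likely served from the operating system's cache, so repeated runs and larger files give more trustworthy numbers.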
author | cm-mobile |
---|---|
permlink | re-barton26-re-cm-steem-boinc-user-xml-data-serialization-comparison-20180719t080313096z |
category | gridcoin |
json_metadata | {"tags":["gridcoin"],"app":"steemit/0.1"} |
created | 2018-07-19 08:03:15 |
last_update | 2018-07-19 08:03:15 |
depth | 2 |
children | 0 |
last_payout | 2018-07-26 08:03:15 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 0.457 HBD |
curator_payout_value | 0.151 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 497 |
author_reputation | 64,075,241,881 |
root_title | "BOINC User XML data serialization comparison" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 65,212,477 |
net_rshares | 289,645,032,465 |
author_curate_reward | "" |
voter | weight | wgt% | rshares | pct | time |
---|---|---|---|---|---|
barton26 | 0 | 10,563,391,272 | 100% | ||
jringo | 0 | 25,088,801,677 | 100% | ||
gentlebot | 0 | 253,992,839,516 | 15% |
(it's Protobuf, @cm-steem :)) The only downside I can think of is that it's binary so it's more difficult to read off the air. I use Protobuf at work to publish data from a microcontroller to an Android app and a web service. Doing that in text with lexical interpretations would be a nightmare. To further compress the binary serialization we could use a 16-byte binary representation of the CPIDs instead of using its hexadecimal form. I suspect that's where a lot of the storage goes.
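
A minimal sketch of that conversion in Python, assuming each CPID is a 32-character hexadecimal string (the function names are made up for illustration):

```python
def cpid_to_bytes(cpid_hex):
    """Convert a 32-character hex CPID into its 16-byte binary form."""
    raw = bytes.fromhex(cpid_hex)
    if len(raw) != 16:
        raise ValueError(f"expected 16 bytes, got {len(raw)}")
    return raw


def cpid_to_hex(raw):
    """Convert the 16-byte form back to the lowercase hex string."""
    return raw.hex()


cpid = "0123456789abcdef0123456789abcdef"
packed = cpid_to_bytes(cpid)
print(len(cpid), "characters ->", len(packed), "bytes")
```

In the proto file this would presumably mean declaring the field as `bytes` rather than `string`, which halves the space each CPID takes before any compression.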
author | ravonn |
---|---|
permlink | re-barton26-re-cm-steem-boinc-user-xml-data-serialization-comparison-20180719t152952434z |
category | gridcoin |
json_metadata | {"tags":["gridcoin"],"users":["cm-steem"],"app":"steemit/0.1"} |
created | 2018-07-19 15:29:51 |
last_update | 2018-07-19 15:29:51 |
depth | 2 |
children | 2 |
last_payout | 2018-07-26 15:29:51 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 1.030 HBD |
curator_payout_value | 0.254 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 488 |
author_reputation | 1,551,172,951,761 |
root_title | "BOINC User XML data serialization comparison" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 65,256,519 |
net_rshares | 610,018,911,012 |
author_curate_reward | "" |
voter | weight | wgt% | rshares | pct | time |
---|---|---|---|---|---|
cm-steem | 0 | 573,158,927,571 | 100% | ||
barton26 | 0 | 10,806,227,853 | 100% | ||
jringo | 0 | 26,053,755,588 | 100% |
> The only downside I can think of is that it's binary so it's more difficult to read off the air.

Do you think that's possible via [flat buffers](https://google.github.io/flatbuffers/) or [grpc](https://grpc.io/)?

> To further compress the binary serialization we could use 16 byte binary representation of the CPIDs instead of using its hexadecimal form. I suspect that's where a lot of the storage goes.

Do you have more details on how this can be done in Python? Do you mean [compressing the string](https://github.com/CordySmith/PySmaz) or just converting the CPID from a string to binary?

The files would be far smaller if the CPID was omitted, relying on userId instead & perhaps constructing a separate index for userId:CPID for quick lookup.
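
The separate-index idea could be sketched like this. The field names `id` and `cpid` and the helper `split_cpid_index` are assumptions for illustration, not the layout of the actual stats dumps.

```python
def split_cpid_index(users):
    """Return (records without CPIDs, {user id: CPID}) from a list of user dicts."""
    index = {}
    slim = []
    for user in users:
        record = dict(user)  # copy so the caller's data is left untouched
        index[record["id"]] = record.pop("cpid")
        slim.append(record)
    return slim, index


users = [
    {"id": 1, "name": "alice", "cpid": "0123456789abcdef0123456789abcdef"},
    {"id": 2, "name": "bob", "cpid": "fedcba9876543210fedcba9876543210"},
]
slim_users, cpid_index = split_cpid_index(users)
print(slim_users[0], cpid_index[1])
```

The slim records and the index can then be serialized to separate files, so a query that only needs credit figures never has to load the CPIDs.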
author | cm-steem |
---|---|
permlink | re-ravonn-re-barton26-re-cm-steem-boinc-user-xml-data-serialization-comparison-20180719t155339281z |
category | gridcoin |
json_metadata | {"tags":["gridcoin"],"links":["https://google.github.io/flatbuffers/","https://grpc.io/","https://github.com/CordySmith/PySmaz"],"app":"steemit/0.1"} |
created | 2018-07-19 15:53:36 |
last_update | 2018-07-19 15:53:36 |
depth | 3 |
children | 1 |
last_payout | 2018-07-26 15:53:36 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 0.058 HBD |
curator_payout_value | 0.017 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 754 |
author_reputation | 58,522,774,254,119 |
root_title | "BOINC User XML data serialization comparison" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 65,259,080 |
net_rshares | 36,203,595,184 |
author_curate_reward | "" |
voter | weight | wgt% | rshares | pct | time |
---|---|---|---|---|---|
barton26 | 0 | 10,563,391,272 | 100% | ||
jringo | 0 | 25,640,203,912 | 100% |