Data Analysis Gotchas by gutzofter

![](https://i.supload.com/BJyNwsJdx.jpg)

[Pixabay](https://pixabay.com/en/analytics-chart-data-graph-1841554/)

# Introduction
I was very excited to see an analysis post about the draining of the reward pool. After looking through the data, I've come to the conclusion that the analysis is flawed. Its claim may be true, but the data is aggregated incorrectly, which makes the conclusion suspect.

I was less interested in the draining of the reward pool than in exploring how the rewards are distributed. See the image at the end of this post.

# Initial
The data set seemed to be incomplete in the < $1.00 bin, so I fudged it (this would normally be a big no-no, but I was only exploring).*

# Exploration

A. This was very interesting. I was expecting a linear or power law distribution. Instead, I see a somewhat bell-shaped distribution with a slight bifurcation.

B. The bifurcation of the posts is caused by the $5-$10 bin. I expected to see a power law rising from left to right.

C and D. After further exploration, I've come to the conclusion that the variable bin widths are skewing the distributions.*
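
To make the binning gotcha concrete, here is a minimal sketch of how variable-width bins can distort a distribution's shape. Everything in it is an assumption for illustration: the payouts are synthetic (drawn from an exponential distribution with a $10 mean), the bin edges are hypothetical, and none of it comes from the original data set. Raw counts are compared with counts divided by bin width (a per-dollar density), which is the usual way to make bins of different widths comparable.

```python
import bisect
import random


def bin_counts(values, edges):
    """Count values falling in the half-open bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if edges[0] <= v < edges[-1]:
            counts[bisect.bisect_right(edges, v) - 1] += 1
    return counts


def bin_densities(counts, edges):
    """Divide each count by its bin width so unequal bins are comparable."""
    return [c / (edges[i + 1] - edges[i]) for i, c in enumerate(counts)]


random.seed(0)
# Synthetic, steadily decreasing payouts (mean $10). NOT the original data.
payouts = [random.expovariate(1 / 10) for _ in range(10_000)]

# Hypothetical variable-width bins, in dollars.
edges = [0, 1, 5, 10, 50, 100]

counts = bin_counts(payouts, edges)
densities = bin_densities(counts, edges)

for i, (c, d) in enumerate(zip(counts, densities)):
    print(f"${edges[i]}-${edges[i + 1]}: count={c}, per-dollar density={d:.1f}")
```

On this synthetic sample the raw counts rise, dip, and rise again across the bins, even though the underlying data decreases steadily. The per-dollar densities recover the steady decrease. The bumps in the raw counts come from the bin widths, not from the data.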

# Conclusion
Because I did not have access to the source data, I could not create fixed-width bins. Any conclusions drawn from this data must be considered suspect and **not used for any further analysis**.


*These are some gotchas when trying to reach a conclusion or when using data to make decisions.

Link:
[Number of bins and width](https://en.wikipedia.org/wiki/Histogram#Number_of_bins_and_width)
>There is no "best" number of bins, and different bin sizes can reveal different features of the data. Grouping data is at least as old as Graunt's work in the 17th century, but no systematic guidelines were given[10] until Sturges's work in 1926.[11]

>Using wider bins where the density is low reduces noise due to sampling randomness; using narrower bins where the density is high (so the signal drowns the noise) gives greater precision to the density estimation. Thus varying the bin-width within a histogram can be beneficial. Nonetheless, equal-width bins are widely used.

>Some theoreticians have attempted to determine an optimal number of bins, but these methods generally make strong assumptions about the shape of the distribution. Depending on the actual data distribution and the goals of the analysis, different bin widths may be appropriate, so experimentation is usually needed to determine an appropriate width. There are, however, various useful guidelines and rules of thumb.[12]
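
As a rough illustration of the rules of thumb mentioned in the quote, the sketch below computes a bin count with Sturges's formula and with the square-root rule, then builds equal-width bin edges. The function names are made up for this example, and the sample size and value range in the usage lines are assumptions, not figures from the original analysis.

```python
import math


def sturges_bins(n):
    """Sturges's formula: ceil(log2(n)) + 1 bins for n observations."""
    return math.ceil(math.log2(n)) + 1


def sqrt_bins(n):
    """Square-root rule: ceil(sqrt(n)) bins for n observations."""
    return math.ceil(math.sqrt(n))


def equal_width_edges(lo, hi, k):
    """Return k + 1 edges splitting [lo, hi] into k bins of equal width."""
    width = (hi - lo) / k
    return [lo + i * width for i in range(k + 1)]


# Assumed sample size and payout range, for illustration only.
n = 1000
k = sturges_bins(n)
edges = equal_width_edges(0, 100, k)
print(k, sqrt_bins(n), edges[:3])
```

The two rules can disagree widely for the same sample size, which fits the quote's point that experimentation is usually needed to find an appropriate width.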

Link:
**[Potential fitting biases resulting from grouping data into variable width bins](http://www.sciencedirect.com/science/article/pii/S0370269314004183)**

Image:
![](https://i.supload.com/BJO_Io1ux.png)

@gutzofter is crazy. Crazy like a fox.
Posted by @gutzofter on 2017-02-01 19:33:51 in category "data". Tags: data, analysis, gotchas, steemit, introduction.

@abit ·
Your post is a bit hard for me to understand.
(posted 2017-02-05 20:13:36)

@gutzofter ·
How so? Is it because it is disorganized?
(posted 2017-02-05 20:31:21)

@abit ·
Yeah.. not well organized, maybe.
(posted 2017-02-06 21:55:06)