# Clean the Data! Story time

by @ecoinstant
In another life, I became a "data guy".  In 2009 I was in my second year with the Fiscal and Economic Research Center (FERC) at the University of Wisconsin-Whitewater, working with a professor doing studies on education using open data.  All 421 school districts in Wisconsin were legally obligated to report certain "open" data: things like standardized test results, free lunch program data, and other anonymized statistics about their inputs and "results".

My first job was to go to the website of each of these 421 school districts and download the data.  Once we did that, we could run analysis on it... right?


![image.png](https://files.peakd.com/file/peakd-hive/ecoinstant/23uR7dUTp51HDgCH4ovunJJbrQtWtYvSMrZQ3jtqkQuw87JFd7bSUjdTDvq3mg9rMiWUx.png)


## Data is not enough, it must be "structured"

So it turned out that while all 421 school districts were reporting the mandatory data, not a single pair of them was reporting it in the same "way": the same column structure, the same order, the same format.  It was all different, every single one.  Some were csv files, some were xlsx files, some were txt files - etc, etc.

So my job was to "clean the data" - which basically means to get it all into the same format.  Once that is done, regression analysis is easy, but before that - it's impossible.  I spent months learning to work with SPSS and cleaning data before I could even put my econometrics regression analysis skills into practice.  It was a very "real world application" for me, tying everything I had ever learned in the classroom together and... throwing it out the window immediately for some practical obstacle.
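To give a feel for what that cleaning looks like, here is a minimal sketch in Python with pandas (a modern stand-in, not the SPSS workflow we actually ran, with hypothetical file and column names): every district file, whatever its format, gets read, renamed, and forced into one shared schema before anything is pooled.

```python
from pathlib import Path
import pandas as pd

# Hypothetical map from the many names districts used to one shared schema.
COLUMN_MAP = {
    "Dist_Name": "district",
    "DistrictName": "district",
    "FreeLunch%": "free_lunch_pct",
    "pct_free_lunch": "free_lunch_pct",
    "ReadScore": "test_score",
    "avg_test": "test_score",
}

# One reader per file format the districts sent in.
READERS = {
    ".csv": pd.read_csv,
    ".xlsx": pd.read_excel,
    ".txt": lambda p: pd.read_csv(p, sep="\t"),
}

def load_district(path: Path) -> pd.DataFrame:
    """Read one district file, whatever its format, into the shared schema."""
    df = READERS[path.suffix.lower()](path)
    df = df.rename(columns=COLUMN_MAP)
    return df[["district", "free_lunch_pct", "test_score"]]

# Pool every district into a single table - only now is regression easy.
frames = [load_district(p) for p in Path("districts_2010").iterdir()
          if p.suffix.lower() in READERS]
panel = pd.concat(frames, ignore_index=True)
```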


![image.png](https://files.peakd.com/file/peakd-hive/ecoinstant/EoEjCQtTq7HZRhgQMiZfJsDJpqkTo3yLGWdpwQZScDDwk6CDfdiVvhWhX7no67Gmz2g.png)

## Unstructured Data vs Data that has not yet been Structured

Officially, unstructured data is data that cannot be structured, not data that you can get structured with an undergraduate assistant.  But practically, it's the same thing - you cannot run the analysis on unstructured data, so you have to get it structured.  There are a number of ways to do this; for example, you could retype it all into a new csv or excel file, this time ensuring that every file has the same format.  But this way has its drawbacks.

For example, assuming we were doing this analysis on the 2010 data, we could - with some effort from an undergraduate tryhard - manually fix all the data.  But what do we do "next year" with all the data from 2011?  Are we doomed to repeat this manual process every year for the rest of the history of the Fiscal and Economic Research Center budget?

Instead, I was taught how to make data cleaning scripts with a program called SPSS.  
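The whole point of a script over manual retyping is repeatability: next year's files run through the same pipeline for free.  Here is a sketch of that idea (again in Python rather than SPSS syntax, reusing the hypothetical `load_district()` and `READERS` from above):

```python
import sys
from pathlib import Path
import pandas as pd

def clean_year(year: int, raw_root: Path = Path("raw")) -> pd.DataFrame:
    """Run the whole cleaning pipeline for one reporting year."""
    frames = []
    for path in sorted((raw_root / str(year)).iterdir()):
        if path.suffix.lower() not in READERS:
            continue                 # skip stray files that sometimes show up
        df = load_district(path)     # same harmonizer sketched above
        df["year"] = year            # tag rows so multiple years can be pooled
        frames.append(df)
    clean = pd.concat(frames, ignore_index=True)
    clean.to_csv(f"clean/districts_{year}.csv", index=False)
    return clean

if __name__ == "__main__":
    # 2010 this year, 2011 next year: only the argument changes.
    clean_year(int(sys.argv[1]))
```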

<center>
![image.png](https://files.peakd.com/file/peakd-hive/ecoinstant/EppHnxkcBnSCt2GD7zrxjwt2FyaokG2tLUNCguaKUCrNVmFPATBGWwHGxV5nj5DuVJx.png)</center>

Now SPSS does a lot of other things too, but we used it mostly for cleaning data, as the professor had access to other, more powerful, analytics programs once the data was clean.  

I spent many, many hours of my life earning beer and rent money by cleaning data.  And one day, after nearly 18 months of work on several years of school district data, the professor (who I still remember fondly for all the things he taught me) said, "We could write a paper together".  And you know what I did?

I dropped out of college and ran away to South America without money or a plan.  

<center>
![image.png](https://files.peakd.com/file/peakd-hive/ecoinstant/23xp2As2hbGse6HGHkddaBQbFTyi673qLp6wUs1SikcSeyeoNDhiS148U2F4SEG6JWWmm.png)</center>

## So long, no thanks for the cubicle.

I hadn't thought about it in a while, though what I learned about data and analysis there has always stuck with me, always been "a part" of my tool box or skill set.  It wasn't until recently that I really remembered this whole story - when @thecrazygm and I bumped up against the widest and least organized data set I have ever seen.  We are interested in working with this data - it's open data - and we will.

Right after I clean it.

## Freedom and Friendship

 
@djbravo ·
Previously there was no such software, so it was quite difficult to pull data out of every single thing and every single place. Now many different software tools have come out. We can search and find some very good software, which lets us do this work easily.
@gornat ·
You might want to consider using RStudio with the Tidyverse package. I found it a very strong tool for data cleaning, and it's completely free. I haven't used SPSS though, so I can't compare. Hope you get your work done soon and with the least pain possible.
Good luck!
@ecoinstant ·
I used R for a project a few years later - I love it!!  Appreciate you sharing that!
@gornat ·
Thank you, I had to learn R cuz of work requirements and when I discovered
1. RStudio 
2. Tidyverse 
It was a game changer for working with data.
@holoz0r ·
Have you read the paper "Tidy Data"?

Everyone should read it. It isn't just about data, but about observation, and how each data point should be an observation. 
@ecoinstant ·
I will read it!
@onezetty ·
hahahahaha, I love it: 

> *I dropped out of college and ran away to South America without money or a plan.*

You made my day, my friend. ❤️
@shanibeer ·
> not a single pair of them were reporting it in the same "way", with the same column structure, in the same order, with the same format. It was all different. Every single one

Welcome to my world. This has happened so many times!
@thecrazygm ·
Good Luck and Godspeed to us both, it is no small *feat*. 😂
@tydynrain ·
Data cleaning is certainly a useful skill to have under your belt. I really like the idea of data cleaning applications or scripts, especially considering how often I've cleaned data manually...lol! May you make your chaotic data immaculate! 😁 🙏 💚 ✨ 🤙 