# Data Science Madness (Updated)

by @mada
![8sBKecYtxNsiCiMlfkOz--1--7godk_2x.jpg](https://images.hive.blog/DQmNfa7cw8oazvVG31jeTH2A1UzbozEXZjMuDKnQyYgzzXD/8sBKecYtxNsiCiMlfkOz--1--7godk_2x.jpg)

_The above image was made with stable diffusion using the prompt 'a shower of book pages.'_

Today I'm going a little nuts because of a stupid problem. I'm trying to convert a sql dump file into a csv with pipe delimiters and can't seem to find a reasonable way to do this. I've so far tried Google, Stack Exchange, ChatGPT, and all of my friends. No solutions have been forthcoming.

**UPDATE: our tech guy was able to export a pipe separated csv from the database.**
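
For anyone else stuck at the same wall: if you can get client access to the live database, the export can be scripted in a few lines. I don't know exactly how our tech guy did it, but a minimal sketch using the `mysql-connector-python` package might look like this (credentials and table name are hypothetical):

```
import csv
import mysql.connector  # pip install mysql-connector-python

# Hypothetical connection details -- substitute your own
con = mysql.connector.connect(host='localhost', user='user',
                              password='secret', database='newsdb')
cur = con.cursor()
cur.execute('SELECT * FROM news')

with open('news.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='|')
    writer.writerow(cur.column_names)  # header row from the cursor
    writer.writerows(cur)              # the cursor yields row tuples
con.close()
```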

## Context

The file in question is WantToKnow.info's news database. Five years ago, I wrote a bunch of [code](https://github.com/ma-da/misc-share) for a [news mapping project](https://youtu.be/lW9tvgH5r_w) using this database. At that time, I was able to convert the sql dump to sqlite using an online service, then export the sqlite to csv with pipe delimiters so I could play around with the data in Pandas.
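
The export step of that old pipeline amounted to just a few lines of pandas, something like this (file and table names are hypothetical):

```
import sqlite3
import pandas as pd

# Hypothetical file and table names from the converted database
con = sqlite3.connect('news.sqlite')
df = pd.read_sql_query('SELECT * FROM news', con)
df.to_csv('news.csv', sep='|', index=False)  # pipe-delimited csv
con.close()
```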

Now, 5 years later, our database has grown to over 12k entries, and I want to tinker with news story recommendation algorithms. My thinking is that trigram vector proximity might make for better recommendations than simple keywords. And maybe entity recognition has improved to the point where it alone could work. Either way, I've already written most of the code necessary to start trying stuff out. But before I can even start tinkering, I need the data to be in usable form.
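
For the curious, here's a rough sketch of the trigram idea, assuming the archive is already loaded into a dataframe with a hypothetical `summary` column:

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(df, story_index, top_n=5):
    # Represent each story as tf-idf weighted character trigrams
    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
    vectors = vectorizer.fit_transform(df['summary'].fillna(''))

    # Rank every story by cosine proximity to the chosen one
    scores = cosine_similarity(vectors[story_index], vectors).ravel()
    ranked = scores.argsort()[::-1]

    # A story is always its own nearest neighbor, so skip it
    return df.iloc[[i for i in ranked if i != story_index][:top_n]]
```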

## Issues

This go-around, the first issue I encountered was that there was no longer an easy way to convert the sql into sqlite. My file is 30 MB, which was too big for any of the free converters I could find. The most popular conversion tool online now charges $89 for the service. There is a program called [DB Browser](https://sqlitebrowser.org/) that claims to be able to import sql files and export csv, but it didn't work for me.

Whenever I tried importing the dump, the software failed because it detected syntax errors in the file, which came from MariaDB 5.5.68 with latin1 encoding. In hindsight, that's not surprising: MariaDB dumps are full of MySQL-specific syntax that SQLite parsers won't accept. I'll keep looking for a program that works for my situation, but I'm not hopeful.

## Python

The obvious next step is to use python to convert the file. This would be easy if my data were simpler. Some data can just be split into rows by parentheses and into columns by commas. Not mine. My data includes parentheses and commas in the column text, as well as character encoding errors throughout.

So I've been trying to use regular expressions to transform the sql into csv and it feels like a road to madness. Here's as far as I've gotten:
```
import csv
import re

def convert_sql_to_csv(sql_file, csv_file):
    # The dump came out of MariaDB as latin1, so reading it as utf-8
    # blows up partway through the file
    with open(sql_file, 'r', encoding='latin-1') as f:
        sql_data = f.read()

    # Extract column names from the CREATE TABLE statement only --
    # scanning the whole dump for backticks also grabs the table
    # and index names
    create_stmt = re.search(r"CREATE TABLE.*?;", sql_data, re.S).group(0)
    column_names = re.findall(r"\n\s*`(.*?)`", create_stmt)

    # Extract data rows by matching parentheses -- this breaks whenever
    # a text column itself contains a closing parenthesis
    data_rows = re.findall(r"\((.*?)\)", sql_data)

    # Split columns on the quoted-comma separator -- this breaks whenever
    # a text column itself contains ', ' or a pipe
    data_rows = [row.replace("', '", "|").split("|") for row in data_rows]

    with open(csv_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|')
        writer.writerow(column_names)
        writer.writerows(data_rows)

    print(f'Successfully converted {sql_file} to {csv_file}!')
```

This seems like the kind of problem ChatGPT could easily solve, but I went back and forth a dozen times with the AI and no solution was forthcoming. Even so, the exchange was helpful, providing me with an easy structure to work with. Now I'm examining the data more closely in hopes of identifying character patterns that could be used instead of parentheses and commas to correctly arrange the data for a csv file.
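
One pattern that might actually hold up: the quoted strings themselves are unambiguous. If each quoted string is consumed as a single token first, then every parenthesis and comma left over is structural. Here's a rough sketch along those lines (the backslash-escape convention is an assumption about how MariaDB wrote the dump):

```
import re

# One token at a time: a complete quoted SQL string (allowing \' and
# '' escapes), or a single structural character
TOKEN = re.compile(r"'(?:\\.|''|[^'\\])*'|[(),]")

def parse_insert_values(values_clause):
    """Split the VALUES clause of an INSERT into rows of raw values.

    Quoted strings are consumed as whole tokens, so parentheses and
    commas inside the article text can't break the row structure.
    """
    rows, current, buf = [], [], []
    depth = 0
    last = 0
    for m in TOKEN.finditer(values_clause):
        # Text between tokens is an unquoted value (a number or NULL)
        buf.append(values_clause[last:m.start()])
        last = m.end()
        tok = m.group(0)
        if tok == '(':
            if depth == 0:
                current, buf = [], []  # start of a new row
            else:
                buf.append(tok)        # paren inside an unquoted value
            depth += 1
        elif tok == ')':
            depth -= 1
            if depth == 0:
                current.append(''.join(buf).strip())  # close the row
                rows.append(current)
                buf = []
            else:
                buf.append(tok)
        elif tok == ',' and depth == 1:
            current.append(''.join(buf).strip())      # next column
            buf = []
        else:
            buf.append(tok)  # a quoted string, kept verbatim for now
    return rows

if __name__ == '__main__':
    demo = "(1, 'Title with, commas (and parens)', NULL), (2, 'it\\'s', 7);"
    for row in parse_insert_values(demo):
        print(row)
    # ['1', "'Title with, commas (and parens)'", 'NULL']
    # ['2', "'it\\'s'", '7']
```

The values come back still wearing their SQL quotes, so there would still be a small unescaping pass before loading into pandas, but at least the rows and columns would line up.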

All I want to do is get the news archive into a pandas dataframe, and instead I'm stuck in preprocessing. My rational mind tells me to be patient until a solution presents itself. Yet in my experience, these kinds of fiddly technical issues don't just fix themselves. Regardless, I'll keep working on it.

___

**Read my novels:**
- **Small Gods of Time Travel** is available as a [web book on IPFS](https://rstory.mypinata.cloud/ipfs/QmVt9kp8CJKwUm2cvD1mhMVtswsdFh3iEH8AFGEUxnrMu1/) and as a 41 piece Tezos [NFT collection on Objkt](https://objkt.com/collection/KT1TJEWFRDcudZYwPVfm2j1HAANen3TL7fof).
- **The Paradise Anomaly** is available in [print via Blurb](https://www.blurb.com/b/10994168-the-paradise-anomaly) and [for Kindle on Amazon](https://www.amazon.com/dp/B09NLB95NZ).
- **Psychic Avalanche** is available in [print via Blurb](https://www.blurb.com/b/10891426-psychic-avalanche) and for [Kindle on Amazon](https://www.amazon.com/dp/B09JS6CV9X).
- **One Man Embassy** is available in [print via Blurb](https://blurb.com/b/9876573-one-man-embassy) and for [Kindle on Amazon](https://amazon.com/One-Man-Embassy-Mark-Bailey-ebook/dp/B0836SRC8K).
- **Flying Saucer Shenanigans** is available in [print via Blurb](https://blurb.com/b/10002213-flying-saucer-shenanigans) and for [Kindle on Amazon](https://amazon.com/Flying-Saucer-Shenanigans-Mark-Bailey-ebook/dp/B0863FRJN2).
- **Rainbow Lullaby** is available in [print via Blurb](https://blurb.com/b/9330918-rainbow-lullaby) and for [Kindle on Amazon](https://amazon.com/Rainbow-Lullaby-Mark-Bailey-ebook/dp/B07P4MYTGT).
- **The Ostermann Method** is available in [print via Blurb](https://blurb.com/b/9660167-the-ostermann-method) and for [Kindle on Amazon](https://amazon.com/Ostermann-Method-Mark-Bailey-ebook/dp/B07Y6RDTJF).
- **Blue Dragon Mississippi** is available in [print via Blurb](https://blurb.com/b/10192086-blue-dragon-mississippi) and for [Kindle on Amazon](https://amazon.com/Blue-Dragon-Mississippi-Mark-Bailey-ebook/dp/B08C54F99R).

**See my NFTs:**
- **Small Gods of Time Travel** is a 41 piece Tezos [NFT collection on Objkt](https://objkt.com/collection/KT1TJEWFRDcudZYwPVfm2j1HAANen3TL7fof) that goes with my book by the same name.
- **History and the Machine** is a 20 piece Tezos [NFT collection on Objkt](https://objkt.com/collection/KT1KmnSSXykx4SA6pdVVh9PweSqK5tkDRsLX) based on my series of oil paintings of interesting people from history.
- **Artifacts of Mind Control** is a 15 piece Tezos [NFT collection on Objkt](https://objkt.com/collection/KT1Fc7naXm8XApN1EHoJXkC1Tggv2jK9eeZK) based on declassified CIA documents from the MKULTRA program.

## Comments

**@amberjyang:**
This sounds complicated and way above my understanding of reality. I'm cheering you on from the sidelines :)

On another note, thanks for working on this! It'll be soooo awesome to have articles and links people can explore while they traverse our site. Sounds like you're staying patient and diligent. Hope you're also having fun and remembering to hydrate and eat :)

**@mada:**
Haha I'm definitely having fun with this. And I plan to keep posting my progress as it's made.