# Data Science Madness (Updated)

by @mada
![8sBKecYtxNsiCiMlfkOz--1--7godk_2x.jpg](https://images.hive.blog/DQmNfa7cw8oazvVG31jeTH2A1UzbozEXZjMuDKnQyYgzzXD/8sBKecYtxNsiCiMlfkOz--1--7godk_2x.jpg)

_The above image was made with stable diffusion using the prompt 'a shower of book pages.'_

Today I'm going a little nuts because of a stupid problem. I'm trying to convert a sql dump file into a csv with pipe delimiters and can't seem to find a reasonable way to do this. I've so far tried Google, Stack Exchange, ChatGPT, and all of my friends. No solutions have been forthcoming.

**UPDATE: our tech guy was able to export a pipe separated csv from the database.**
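
For anyone else stuck at the same wall: if you can get client access to the live database, the export can be scripted in a few lines. I don't know exactly how our tech guy did it, but a minimal sketch using the `mysql-connector-python` package might look like this (credentials and table name are hypothetical):

```
import csv
import mysql.connector  # pip install mysql-connector-python

# Hypothetical connection details -- substitute your own
con = mysql.connector.connect(host='localhost', user='user',
                              password='secret', database='newsdb')
cur = con.cursor()
cur.execute('SELECT * FROM news')

with open('news.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter='|')
    writer.writerow(cur.column_names)  # header row from the cursor
    writer.writerows(cur)              # the cursor yields row tuples
con.close()
```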

## Context

The file in question is WantToKnow.info's news database. Five years ago, I wrote a bunch of [code](https://github.com/ma-da/misc-share) for a [news mapping project](https://youtu.be/lW9tvgH5r_w) using this database. At that time, I was able to convert the sql dump to sqlite using an online service, then export the sqlite to csv with pipe delimiters so I could play around with the data in Pandas.
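
The export step of that old pipeline amounted to just a few lines of pandas, something like this (file and table names are hypothetical):

```
import sqlite3
import pandas as pd

# Hypothetical file and table names from the converted database
con = sqlite3.connect('news.sqlite')
df = pd.read_sql_query('SELECT * FROM news', con)
df.to_csv('news.csv', sep='|', index=False)  # pipe-delimited csv
con.close()
```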

Now, 5 years later, our database has grown to over 12k entries, and I want to tinker with news story recommendation algorithms. My thinking is that trigram vector proximity might make for better recommendations than simple keywords. And maybe entity recognition has improved to the point where it alone could work. Either way, I've already written most of the code necessary to start trying stuff out. But before I can even start tinkering, I need the data to be in usable form.
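
For the curious, here's a rough sketch of the trigram idea, assuming the archive is already loaded into a dataframe with a hypothetical `summary` column:

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def recommend(df, story_index, top_n=5):
    # Represent each story as tf-idf weighted character trigrams
    vectorizer = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 3))
    vectors = vectorizer.fit_transform(df['summary'].fillna(''))

    # Rank every story by cosine proximity to the chosen one
    scores = cosine_similarity(vectors[story_index], vectors).ravel()
    ranked = scores.argsort()[::-1]

    # A story is always its own nearest neighbor, so skip it
    return df.iloc[[i for i in ranked if i != story_index][:top_n]]
```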

## Issues

This go-around, the first issue I encountered was that there was no longer an easy way to convert the sql into sqlite. My file is 30 MB, which was too big for any of the free converters I could find. The most popular conversion tool online now charges $89 for the service. There is a program called [DB Browser](https://sqlitebrowser.org/) that claims to be able to import sql files and export csv, but it didn't work for me.

Whenever I tried importing the dump, the software failed because it detected syntax errors in the file, which came from MariaDB 5.5.68 with latin1 encoding. In hindsight, that's not surprising: MariaDB dumps are full of MySQL-specific syntax that SQLite parsers won't accept. I'll keep looking for a program that works for my situation, but I'm not hopeful.

## Python

The obvious next step is to use python to convert the file. This would be easy if my data were simpler. Some data can just be split into rows by parentheses and into columns by commas. Not mine. My data includes parentheses and commas in the column text, as well as character encoding errors throughout.

So I've been trying to use regular expressions to transform the sql into csv and it feels like a road to madness. Here's as far as I've gotten:
```
import csv
import re

def convert_sql_to_csv(sql_file, csv_file):
    # The dump came out of MariaDB as latin1, so reading it as utf-8
    # blows up partway through the file
    with open(sql_file, 'r', encoding='latin-1') as f:
        sql_data = f.read()

    # Extract column names from the CREATE TABLE statement only --
    # scanning the whole dump for backticks also grabs the table
    # and index names
    create_stmt = re.search(r"CREATE TABLE.*?;", sql_data, re.S).group(0)
    column_names = re.findall(r"\n\s*`(.*?)`", create_stmt)

    # Extract data rows by matching parentheses -- this breaks whenever
    # a text column itself contains a closing parenthesis
    data_rows = re.findall(r"\((.*?)\)", sql_data)

    # Split columns on the quoted-comma separator -- this breaks whenever
    # a text column itself contains ', ' or a pipe
    data_rows = [row.replace("', '", "|").split("|") for row in data_rows]

    with open(csv_file, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f, delimiter='|')
        writer.writerow(column_names)
        writer.writerows(data_rows)

    print(f'Successfully converted {sql_file} to {csv_file}!')
```

This seems like the kind of problem ChatGPT could easily solve, but I went back and forth a dozen times with the AI and no solution was forthcoming. Even so, the exchange was helpful, providing me with an easy structure to work with. Now I'm examining the data more closely in hopes of identifying character patterns that could be used instead of parentheses and commas to correctly arrange the data for a csv file.
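
One pattern that might actually hold up: the quoted strings themselves are unambiguous. If each quoted string is consumed as a single token first, then every parenthesis and comma left over is structural. Here's a rough sketch along those lines (the backslash-escape convention is an assumption about how MariaDB wrote the dump):

```
import re

# One token at a time: a complete quoted SQL string (allowing \' and
# '' escapes), or a single structural character
TOKEN = re.compile(r"'(?:\\.|''|[^'\\])*'|[(),]")

def parse_insert_values(values_clause):
    """Split the VALUES clause of an INSERT into rows of raw values.

    Quoted strings are consumed as whole tokens, so parentheses and
    commas inside the article text can't break the row structure.
    """
    rows, current, buf = [], [], []
    depth = 0
    last = 0
    for m in TOKEN.finditer(values_clause):
        # Text between tokens is an unquoted value (a number or NULL)
        buf.append(values_clause[last:m.start()])
        last = m.end()
        tok = m.group(0)
        if tok == '(':
            if depth == 0:
                current, buf = [], []  # start of a new row
            else:
                buf.append(tok)        # paren inside an unquoted value
            depth += 1
        elif tok == ')':
            depth -= 1
            if depth == 0:
                current.append(''.join(buf).strip())  # close the row
                rows.append(current)
                buf = []
            else:
                buf.append(tok)
        elif tok == ',' and depth == 1:
            current.append(''.join(buf).strip())      # next column
            buf = []
        else:
            buf.append(tok)  # a quoted string, kept verbatim for now
    return rows

if __name__ == '__main__':
    demo = "(1, 'Title with, commas (and parens)', NULL), (2, 'it\\'s', 7);"
    for row in parse_insert_values(demo):
        print(row)
    # ['1', "'Title with, commas (and parens)'", 'NULL']
    # ['2', "'it\\'s'", '7']
```

The values come back still wearing their SQL quotes, so there would still be a small unescaping pass before loading into pandas, but at least the rows and columns would line up.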

All I want to do is get the news archive into a pandas dataframe, and instead I'm stuck in preprocessing. My rational mind tells me to be patient until a solution presents itself. Yet in my experience, these kinds of fiddly technical issues don't just fix themselves. Regardless, I'll keep working on it.

___

**Read my novels:**
- **Small Gods of Time Travel** is available as a [web book on IPFS](https://rstory.mypinata.cloud/ipfs/QmVt9kp8CJKwUm2cvD1mhMVtswsdFh3iEH8AFGEUxnrMu1/) and as a 41 piece Tezos [NFT collection on Objkt](https://objkt.com/collection/KT1TJEWFRDcudZYwPVfm2j1HAANen3TL7fof).
- **The Paradise Anomaly** is available in [print via Blurb](https://www.blurb.com/b/10994168-the-paradise-anomaly) and [for Kindle on Amazon](https://www.amazon.com/dp/B09NLB95NZ).
- **Psychic Avalanche** is available in [print via Blurb](https://www.blurb.com/b/10891426-psychic-avalanche) and for [Kindle on Amazon](https://www.amazon.com/dp/B09JS6CV9X).
- **One Man Embassy** is available in [print via Blurb](https://blurb.com/b/9876573-one-man-embassy) and for [Kindle on Amazon](https://amazon.com/One-Man-Embassy-Mark-Bailey-ebook/dp/B0836SRC8K).
- **Flying Saucer Shenanigans** is available in [print via Blurb](https://blurb.com/b/10002213-flying-saucer-shenanigans) and for [Kindle on Amazon](https://amazon.com/Flying-Saucer-Shenanigans-Mark-Bailey-ebook/dp/B0863FRJN2).
- **Rainbow Lullaby** is available in [print via Blurb](https://blurb.com/b/9330918-rainbow-lullaby) and for [Kindle on Amazon](https://amazon.com/Rainbow-Lullaby-Mark-Bailey-ebook/dp/B07P4MYTGT).
- **The Ostermann Method** is available in [print via Blurb](https://blurb.com/b/9660167-the-ostermann-method) and for [Kindle on Amazon](https://amazon.com/Ostermann-Method-Mark-Bailey-ebook/dp/B07Y6RDTJF).
- **Blue Dragon Mississippi** is available in [print via Blurb](https://blurb.com/b/10192086-blue-dragon-mississippi) and for [Kindle on Amazon](https://amazon.com/Blue-Dragon-Mississippi-Mark-Bailey-ebook/dp/B08C54F99R).

**See my NFTs:**
- **Small Gods of Time Travel** is a 41 piece Tezos [NFT collection on Objkt](https://objkt.com/collection/KT1TJEWFRDcudZYwPVfm2j1HAANen3TL7fof) that goes with my book by the same name.
- **History and the Machine** is a 20 piece Tezos [NFT collection on Objkt](https://objkt.com/collection/KT1KmnSSXykx4SA6pdVVh9PweSqK5tkDRsLX) based on my series of oil paintings of interesting people from history.
- **Artifacts of Mind Control** is a 15 piece Tezos [NFT collection on Objkt](https://objkt.com/collection/KT1Fc7naXm8XApN1EHoJXkC1Tggv2jK9eeZK) based on declassified CIA documents from the MKULTRA program.

## Comments

**@amberjyang:**
This sounds complicated and way above my understanding of reality. I'm cheering you on from the sidelines :)

On another note, thanks for working on this! It'll be soooo awesome to have articles and links people can explore while they traverse our site. Sounds like you're staying patient and diligent. Hope you're also having fun and remembering to hydrate and eat :)

**@mada:**
Haha I'm definitely having fun with this. And I plan to keep posting my progress as it's made.