PHP Machine Learning Diary: Preparing Random Phrases with Linux Commands by programarivm

### Related Repositories
- https://github.com/php-ai/php-ml
- https://github.com/programarivm/pgn-chess
- https://github.com/awesomedata/awesome-public-datasets

![robot.jpg](https://cdn.steemitimages.com/DQmeN5sCJhSApDeiQMQpduxC5U1nAi7u3AeG5bmhk3Ny321/robot.jpg)

### What Will I Learn?
In almost any data science project you need to find, clean and prepare data. We will do some digging on the web to prepare two well-formed CSV files (`eng.csv` and `fra.csv`) that meet our requirements. These files will contain random phrases in English and French for further processing by PHP scripts.

#### Requirements

- Basic concepts of machine learning
- A few Linux commands
- Some PHP
- Be a little patient

### Difficulty

- Intermediate

### Tutorial Contents

Who said that you cannot do machine learning with PHP?

I am learning about it, and today I'm sharing a useful tip for curating a dataset of random phrases written in any imaginable language.

This may interest you if you want to train a machine learning model for pattern recognition purposes, text classification, language detection, and so on. The list could go on.

The reason behind today's tutorial is to help you get familiar with a few basic concepts while I learn about the topic myself. I want to share my learn-by-doing process with the world!

And I am excited because I have already learned how important it is to find and curate data by hand.

> Remember: First things first, in almost any data science project you need to find, clean and prepare data. The more tools you can master for this purpose, the better.

### A Long Time Ago...

<iframe width="560" height="315" src="https://www.youtube.com/embed/PQD1dT6b5sQ" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

Let me start by giving some context. My story began a few months ago, when after working as a web developer for a while I just thought, "Why don't I write a chess engine in PHP?"

My rhetorical question might sound a bit naive given the mainstream data science trend, because the vast majority of data scientists use Python in their projects. However, PHP web devs may well want to do some machine learning with [PHP-ML](https://php-ml.readthedocs.io/en/latest/), which is still under development by the way -- currently on version 0.6.2.

Then I did some research and found out that my chess engine could rely on a multilayer perceptron (MLP) classifier, in a way similar to the one described in the paper [Learning to Evaluate Chess Positions with Deep Neural Networks and Limited Lookahead](https://www.researchgate.net/publication/322539902_Learning_to_Evaluate_Chess_Positions_with_Deep_Neural_Networks_and_Limited_Lookahead).

Here is a conclusion:

> The results show how relatively simple Multilayer Perceptrons (MLPs) outperform Convolutional Neural Networks (CNNs) in all the experiments that we have performed.

My ultimate goal is to normalize a bunch of PGN chess board positions to train the MLP model -- if I am correct, this can be achieved by just transforming the output of the [status()](https://pgn-chess.readthedocs.io/en/latest/game-methods/#status) method into a format that the MLP classifier can understand.

### Mmm

I scratched my head a little harder, only to conclude that MLPs are still a bit too much for my current machine learning skill set, and that mastering them will take some time.

Here is what has to happen: I first need to digest all this [tacit knowledge](https://en.wikipedia.org/wiki/Tacit_knowledge) by taking baby steps. I am being patient. 

A cool thing about machine learning algorithms is that they can be approached as if they were black boxes, meaning that you don't actually need a mathematics background to use them. Just be curious and try experiments by yourself.

The [Naive Bayes](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) algorithm -- this one can be used for [language identification](https://pdfs.semanticscholar.org/3057/8d7a38ca228e912bd65afa30ec9488d945db.pdf) purposes -- is definitely easier to start off with than an MLP.

So let's forget about chess for now. Take it easy. Listen to some classical music for brain power!

<iframe width="560" height="315" src="https://www.youtube.com/embed/UdNOqNHxoQI" frameborder="0" allow="autoplay; encrypted-media" allowfullscreen></iframe>

### Preparing and Cleaning Data with Linux Commands

Suppose we're working on a new exciting data science project on language detection and we're using a [Naive Bayes classifier](https://php-ml.readthedocs.io/en/latest/machine-learning/classification/naive-bayes/). Going back to the issue of preparing data for machine learning and AI, the things to do now are:

- Collect quality data
- Adapt the collected data to our requirements

Regarding the collection of data, there's [Tatoeba](https://tatoeba.org/eng/):

> Tatoeba is a collection of sentences and translations. It's collaborative, open, free and even addictive.

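Just [download](http://downloads.tatoeba.org/exports/sentences.tar.bz2) this huge [TSV](https://en.wikipedia.org/wiki/Tab-separated_values) file (355.8 MB), which contains millions of phrases written in any imaginable language in the world. The download is a `tar.bz2` archive, so extract it first to get `sentences.csv`.

Here is what Tatoeba's `sentences.csv` file looks like. Each line holds a sentence id, a language code and the sentence text, separated by tabs: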

```
1	cmn	我們試試看!
2	cmn	我该去睡觉了。
3	cmn	你在干什麼啊?
...
5630	rus	Тем не менее, обратное также верно.
5631	rus	Мы видим вещи не такими, какие они есть, а такими, каковы мы сами.
5632	rus	Мир - это клетка для безумных.
...
5994	eng	Maria has long hair.
5995	fra	Maria a les cheveux longs.
5996	jpn	あしたは、来なくていいよ。
...
7088863	hun	Tom épp most mondta nekünk, hogy kirúgták.
```
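
To get a feel for how the sentences are spread across languages, it can help to count the lines per language code before sampling. The following is a minimal sketch; `count_languages` is a hypothetical helper name, and it assumes the three tab-separated columns shown above.

```shell
# count_languages FILE (hypothetical helper, not part of the original post)
# Prints "<count> <language code>" pairs, most frequent language first.
count_languages() {
    cut -f2 "$1" |   # keep only the language column (tab is cut's default delimiter)
        sort |       # group identical codes together so uniq can count them
        uniq -c |    # prefix each distinct code with its number of occurrences
        sort -rn     # order by count, largest first
}
```

For example, `count_languages sentences.csv | head` would list the most common language codes first.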

![random-chars.jpg](https://cdn.steemitimages.com/DQmPiJCBwP2E921A1VxNBRfmzkCU5fEU1VbW7uHh6JTu4ms/random-chars.jpg)

The problem is that we'd want a tidy, concise, perfectly formed CSV file like the following one containing random sentences in English only.

```
eng,What do you want for Christmas?
eng,There are pictures on alternate pages of the book.
eng,This language is perfectly clear to me when written but absolutely incomprehensible when spoken.
...
eng,In my opinion a well-designed website shouldn't require horizontal scrolling.
```
No worries, let's create a bash shell script with some Linux shell commands:

| Command     | Description                                                                                       |
|-------------|---------------------------------------------------------------------------------------------------|
| `shuf`      | Writes a random permutation of the input lines to standard output.                                |
| `awk`       | A pattern scanning and processing language.                                                       |
| `tr`        | Translates, squeezes, and/or deletes characters from standard input, writing to standard output.  |
| `cut`       | Prints selected parts of lines from each file to standard output.                                 |
| `rm`        | Removes (unlinks) files.                                                                          |

Here is the bash shell script:

```
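#!/bin/bash
# Pick 5000 random lines from the full Tatoeba dump (all languages)
shuf -n 5000 sentences.csv > lang_sample.tsv
# Keep only the lines whose second (language) field is "eng"
awk -F'\t' '$2=="eng"' lang_sample.tsv > eng_sample.tsv
# Delete the commas inside sentences, then turn the tab separators into commas
tr -d ',' < eng_sample.tsv | tr '\t' ',' > eng_sample.csv
# Drop the first field (the sentence id), leaving "eng,<sentence>"
cut -d, -f1 --complement < eng_sample.csv > eng.csv
# Clean up the intermediate files
rm eng_sample.csv eng_sample.tsv lang_sample.tsv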
```

With the help of Linux pipes the commands above can be merged into one:

```
shuf -n 5000 sentences.csv | awk '$2=="eng" {print}' | tr -d \, | tr "\\t" "," | cut -d, -f1 --complement > eng.csv
```
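
To see what each stage of the pipeline does, it may help to push a single line through it by hand. The sketch below uses a made-up input line (the comma in the sentence is there on purpose) and stores the result of every step in its own variable.

```shell
# A made-up input line in the sentences.csv layout: id <TAB> language <TAB> text
line=$(printf '5994\teng\tMaria has long, dark hair.')

# Step 1: delete every comma so that commas can later act as field separators
step1=$(printf '%s\n' "$line" | tr -d ',')

# Step 2: replace the tab separators with commas
step2=$(printf '%s\n' "$step1" | tr '\t' ',')

# Step 3: drop the first field (the id); for this line -f2- gives the same
# result as the -f1 --complement used in the one-liner
step3=$(printf '%s\n' "$step2" | cut -d, -f2-)

printf '%s\n' "$step2" "$step3"
# prints:
# 5994,eng,Maria has long dark hair.
# eng,Maria has long dark hair.
```

Note that the comma inside the sentence is lost for good. That is the price of keeping the CSV simple enough to need no quoting.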

Cool! We can easily fetch any other bunch of random phrases, in French for example:

```
shuf -n 5000 sentences.csv | awk '$2=="fra" {print}' | tr -d \, | tr "\\t" "," | cut -d, -f1 --complement > fra.csv
```
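
Since the two one-liners differ only in the language code, they could be wrapped in a small function. This is just a sketch: `sample_lang` and its arguments are hypothetical names. Note that it filters by language before sampling, so it returns up to N sentences in the requested language, whereas the one-liners above sample 5000 lines from all languages and keep only the matching ones.

```shell
# sample_lang LANG N FILE (hypothetical helper, not part of the original post)
# Prints up to N random "LANG,<sentence>" lines taken from a Tatoeba-style TSV file.
sample_lang() {
    local lang="$1" n="$2" file="$3"
    awk -F'\t' -v lang="$lang" '$2 == lang' "$file" \
        | shuf -n "$n" \
        | tr -d ',' \
        | tr '\t' ',' \
        | cut -d, -f2-
}
```

With this in place, `sample_lang eng 5000 sentences.csv > eng.csv` and `sample_lang fra 5000 sentences.csv > fra.csv` would produce the two files.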

The brand new clean CSVs are a good thing. We are keeping things simple. Now it is a piece of cake for our PHP scripts to process the information:

```
...
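// Each row of the CSV is "<label>,<sentence>", e.g. "eng,Maria has long hair."
$file = fopen($this->filepath, 'r');
while (($line = fgetcsv($file)) !== false) {
    $this->labels[] = $line[0];   // language code, used as the class label
    $this->samples[] = $line[1];  // sentence text, used as the sample
}
fclose($file);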
...
```
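
One caveat: `fgetcsv()` treats the double quote as a field enclosure by default, so a sentence that contains `"` may not be read back exactly as it was written. A quick sanity check before handing the files to PHP could look like the sketch below. `check_csv` is a hypothetical helper name, and the rule it applies (exactly two fields, no double quotes) is an assumption about what counts as well formed here.

```shell
# check_csv FILE (hypothetical helper, not part of the original post)
# Reports every line that does not have exactly two comma-separated fields
# or that contains a double quote. Exits with status 1 if any line is reported.
check_csv() {
    awk -F',' '
        NF != 2 || /"/ { printf "line %d: %s\n", NR, $0; bad = 1 }
        END { exit bad }
    ' "$1"
}
```

Running `check_csv eng.csv` prints nothing and returns 0 when every line passes. The reported lines can then be fixed or dropped by hand.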

That's all for now. I hope you liked today’s post. Thank you for reading and sharing your human thoughts.

### Conclusion

PHP machine learning can be done with PHP-ML, which is still under development -- currently on version 0.6.2. Web developers can use several algorithms in their PHP projects: `Apriori`, `SVC`, `KNearestNeighbors`, `NaiveBayes`, `LeastSquares`, `MLPClassifier`, among others.

A cool thing about machine learning algorithms is that they can be approached as if they were black boxes, meaning that you don't actually need a mathematics background to use them.

Just be curious and a little patient in the beginning.

The first thing to do in almost any data science project is to find, clean and prepare the data. And that is what we did today. We prepared a couple of concise, perfectly formed CSV files (`eng.csv` and `fra.csv`) for our purposes, containing random phrases in English and French for further processing by PHP scripts.
@malpica1 ·
excellent post I love from this moment I follow you, in this way causes to see content in steemit greetings and my respects and my support with my vote
@programarivm · (edited)
Thanks for your comment @malpica1, it is encouraging! Happy that you liked this post :)
@portugalcoin ·
Thank you for your contribution.
After analyzing your tutorial we suggest the following:

- The tutorial in technical terms is quite short, we recommend that the next tutorial be more technical.
- It's important to explain in detail the code that is in the tutorial.
- We suggest you always put comments in your code.

Your contribution has been evaluated according to [Utopian policies and guidelines](https://join.utopian.io/guidelines), as well as a predefined set of questions pertaining to the category.

To view those questions and the relevant answers related to your post, [click here](https://review.utopian.io/result/8/32334444).

---- 
Need help? Write a ticket on https://support.utopian.io/. 
Chat with us on [Discord](https://discord.gg/uTyJkNm). 
[[utopian-moderator]](https://join.utopian.io/)