Alphabet Detection and Frequency Analysis of Unicode Ranges with PHP by programarivm

#### Repository
- https://github.com/programarivm/unicode-ranges

![many-languages.jpg](https://cdn.steemitimages.com/DQmS9G5KPn5rDc5exJfsuEe98idBgqZGALwgGPoztrsMJUo/many-languages.jpg)

#### Related Repositories
- https://github.com/programarivm/babylon

Have you ever needed to create a random string from Unicode characters drawn from blocks of your choosing? I did a few months ago, but couldn't find a library that made it easy.

So I decided to write Unicode Ranges, a PHP library that provides you with Unicode ranges -- blocks, if you like -- in a friendly, object-oriented way.

By the way, if you are not very familiar with Unicode, [click here](https://en.wikipedia.org/wiki/Unicode_block) for a quick introduction to the ranges: Basic Latin, Cyrillic, Hangul Jamo, and many, many others.

Here is an example that creates a random char encoded in any of these three Unicode ranges: `BasicLatin`, `Tibetan` and `Cherokee`.

```
use UnicodeRanges\Randomizer;
use UnicodeRanges\Range\BasicLatin;
use UnicodeRanges\Range\Tibetan;
use UnicodeRanges\Range\Cherokee;

$char = Randomizer::char([
    new BasicLatin,
    new Tibetan,
    new Cherokee,
]);

echo $char . PHP_EOL;

```

Output:

```
Ꮉ
```

And this is how to create a random string with `Arabic`, `HangulJamo` and `Phoenician` characters:


```
use UnicodeRanges\Randomizer;
use UnicodeRanges\Range\Arabic;
use UnicodeRanges\Range\HangulJamo;
use UnicodeRanges\Range\Phoenician;

$letters = Randomizer::letters([
    new Arabic,
    new HangulJamo,
    new Phoenician,
], 20);

echo $letters . PHP_EOL;
```

Output:

```
ᄺᆺڽ𐤂ᆉᅔᅱ𐤆𐤄ᅰᇼᄓ𐤊𐤄ᄃ𐤋ᆝᆛەᅎ
```

This is very useful if you want to create random UTF-8 tokens, for example.
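
If you are curious about the idea behind random generation, here is a minimal sketch that uses only PHP's `mbstring` functions. It is not how Unicode Ranges is implemented: `randomLetters()` is a hypothetical helper written for this illustration, and the block boundaries are taken from the Unicode block list rather than from the library.

```php
<?php
// Hypothetical helper, not part of Unicode Ranges: builds a string of
// $length random code points taken from inclusive [first, last] ranges.
function randomLetters(array $ranges, int $length): string
{
    $out = '';
    for ($i = 0; $i < $length; $i++) {
        // Pick one range at random, then one code point inside it.
        [$first, $last] = $ranges[random_int(0, count($ranges) - 1)];
        $out .= mb_chr(random_int($first, $last), 'UTF-8');
    }

    return $out;
}

// Assumed block boundaries (from the Unicode block list).
$arabic     = [0x0600, 0x06FF];
$hangulJamo = [0x1100, 0x11FF];
$phoenician = [0x10900, 0x1091F];

$token = randomLetters([$arabic, $hangulJamo, $phoenician], 20);

echo $token . PHP_EOL;
```

Note that this naive version may also return code points that are unassigned within a block, so it is only meant to show the principle.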

I hope these examples will give you the context to follow my explanation -- for further information please read the [Documentation](https://unicode-ranges.readthedocs.io/en/latest/).

### New Features

Let's now cut to the chase.

Yesterday I added the following feature to Unicode Ranges so that [Babylon](https://github.com/programarivm/babylon) can compute range frequencies -- in other words, the number of times that a particular Unicode range appears in a text.

![babel.jpg](https://cdn.steemitimages.com/DQmWqpLYeZoVbbU9BX4n1268tLKcJDCJ7tGeSLSAKV8LKvb/babel.jpg)

The ultimate goal is for the language detector to understand alphabets.

This is how the feature is implemented:

- [Feature/power ranges #1](https://github.com/programarivm/unicode-ranges/pull/1)

On the one hand, `PowerRanges` provides an array containing all 255 Unicode ranges.

Of course, I didn't instantiate the 255 classes by hand, which would have been just tedious! Instead, the `PowerRanges` array is built dynamically by reading the files stored in the [unicode-ranges/src/Range/](https://github.com/programarivm/unicode-ranges/tree/master/src/Range) folder.

This is possible with PHP's [ReflectionClass](http://php.net/manual/en/class.reflectionclass.php).

```
<?php
namespace UnicodeRanges;

class PowerRanges
{
    const RANGES_FOLDER = __DIR__ . '/Range';

    protected $ranges = [];

    public function __construct()
    {
        // Each file in the Range folder defines one range class named after the file.
        $files = array_diff(scandir(self::RANGES_FOLDER), ['.', '..']);
        foreach ($files as $file) {
            $filename = pathinfo($file, PATHINFO_FILENAME);
            $classname = "\\UnicodeRanges\\Range\\$filename";
            // Instantiate the class by name, passing no constructor arguments.
            $rangeClass = new \ReflectionClass($classname);
            $rangeObj = $rangeClass->newInstanceArgs();
            $this->ranges[] = $rangeObj;
        }
    }

    public function ranges()
    {
        return $this->ranges;
    }
}
```
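
To try the same trick without cloning the repository, here is a small self-contained sketch. The two classes and the hard-coded file list are stand-ins invented for this example (they simulate what `scandir()` would return); only `ReflectionClass`, `pathinfo()` and `array_diff()` are real PHP APIs.

```php
<?php
// Stand-in range classes; in the library each range lives in its own file.
class BasicLatin
{
    public function name(): string
    {
        return 'Basic Latin';
    }
}

class Cyrillic
{
    public function name(): string
    {
        return 'Cyrillic';
    }
}

// Simulated directory listing, i.e. what scandir() would return.
$files = ['.', '..', 'BasicLatin.php', 'Cyrillic.php'];

$ranges = [];
foreach (array_diff($files, ['.', '..']) as $file) {
    // 'BasicLatin.php' becomes the class name 'BasicLatin'.
    $classname = pathinfo($file, PATHINFO_FILENAME);
    $ranges[] = (new ReflectionClass($classname))->newInstance();
}

foreach ($ranges as $range) {
    echo $range->name() . PHP_EOL;
}
// Prints:
// Basic Latin
// Cyrillic
```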

On the other hand, `Converter::unicode2range($char)` takes any multibyte char and returns the object representing the Unicode range it belongs to.
 
Example:

```
use UnicodeRanges\Converter;

$char = 'a';
$range = Converter::unicode2range($char);

echo "Total: {$range->count()}".PHP_EOL;
echo "Name: {$range->name()}".PHP_EOL;
echo "Range: {$range->range()[0]}-{$range->range()[1]}".PHP_EOL;
echo 'Characters: ' . PHP_EOL;
print_r($range->chars());
```

Output:

```
Total: 96
Name: Basic Latin
Range: 0020-007F
Characters:
Array
(
    [0] =>  
    [1] => !
    [2] => "
    [3] => #
    [4] => $
    [5] => %
    [6] => &
    [7] => '
    ...
```
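
Conceptually, this kind of conversion boils down to comparing a code point against range boundaries. The following sketch shows one way to do that with `mb_ord()`; it is not the library's code. `BLOCKS` and `blockOf()` are hypothetical names, the table covers only four blocks, and the Basic Latin lower bound simply mirrors the output above.

```php
<?php
// Hypothetical lookup table: [name, first code point, last code point].
const BLOCKS = [
    ['Basic Latin', 0x0020, 0x007F], // bounds as printed in the output above
    ['Cyrillic', 0x0400, 0x04FF],
    ['Arabic', 0x0600, 0x06FF],
    ['Hiragana', 0x3040, 0x309F],
];

// Returns the name of the block containing $char, or null if the
// character falls outside this small sample table.
function blockOf(string $char): ?string
{
    $codePoint = mb_ord($char, 'UTF-8');
    foreach (BLOCKS as [$name, $first, $last]) {
        if ($codePoint >= $first && $codePoint <= $last) {
            return $name;
        }
    }

    return null;
}

echo blockOf('a') . PHP_EOL; // Basic Latin
echo blockOf('ж') . PHP_EOL; // Cyrillic
```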

![stats.jpg](https://cdn.steemitimages.com/DQmaT6BmTF8xEMVdmcqHN6LTKMxo7Uun9mjTE2c3FgUaj2n/stats.jpg)

[This is how](https://github.com/programarivm/babylon/blob/master/tests/unit/UnicodeRangeStatsTest.php) Babylon can now analyze the frequency of the Unicode ranges:

```
/**
 * @test
 */
public function freq()
{
    $text = '律絕諸篇俱宇宙古今مليارات في мале,тъйжалнопе hola que tal como 토마토쥬스 estas tu hoy この平安朝の';
    $expected = [
        'Basic Latin' => 25,
        'Cyrillic' => 14,
        'CJK Unified Ideographs' => 12,
        'Arabic' => 9,
        'Hangul Syllables' => 5,
        'Hiragana' => 3,
    ];

    $this->assertEquals($expected, (new UnicodeRangeStats($text))->freq());
}
```

As you can see, the test instantiates [UnicodeRangeStats](https://github.com/programarivm/babylon/blob/master/src/UnicodeRangeStats.php), which is the class that runs `Converter::unicode2range($char)` on every char of the text, as shown below.

```
<?php

namespace Babylon;

use Babylon;
use UnicodeRanges\Converter;

/**
 * Unicode range stats.
 *
 * @author Jordi Bassagañas <info@programarivm.com>
 * @link https://programarivm.com
 * @license MIT
 */
class UnicodeRangeStats
{
	const N_FREQ_UNICODE_RANGES = 10;

	/**
     * Text to be analyzed.
     *
     * @var string
     */
	protected $text;

	/**
     * Unicode ranges frequency -- number of times that the unicode ranges appear in the text.
     *
     * Example:
     *
     *      Array
     *      (
     *         [Basic Latin] => 25
     *         [Cyrillic] => 14
     *         [CJK Unified Ideographs] => 12
     *         [Arabic] => 9
     *         [Hangul Syllables] => 5
     *         [Hiragana] => 3
	 *          ...
     *      )
     *
     * @var array
     */
	protected $freq;

	/**
     * Constructor.
     *
     * @param string $text
     */
	public function __construct(string $text)
	{
		$this->text = $text;
	}

	/**
     * The most frequent unicode ranges in the text.
     *
     * @return array
     * @throws \InvalidArgumentException
     */
	public function freq(): array
	{
		// Start from scratch so that repeated calls don't accumulate counts.
		$this->freq = [];
		$chars = $this->mbStrSplit($this->text);
		foreach ($chars as $char) {
			$name = Converter::unicode2range($char)->name();
			$this->freq[$name] = ($this->freq[$name] ?? 0) + 1;
		}
		arsort($this->freq);

		return array_slice($this->freq, 0, self::N_FREQ_UNICODE_RANGES);
	}

	/**
     * The name of the most frequent unicode range in the text.
     *
     * @return string
     * @throws \InvalidArgumentException
     */
	public function mostFreq(): string
	{
		return key(array_slice($this->freq(), 0, 1));
	}

	/**
     * Converts a multibyte string into an array of chars, ignoring whitespace.
     *
     * @param string $text
     * @return array
     */
	private function mbStrSplit(string $text): array
	{
		$text = preg_replace('/\s+/', '', $text);

		return preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
	}
}

```
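
If you'd like to play with the counting logic without installing Babylon, here is a standalone sketch of the same idea. Keep in mind that `SAMPLE_BLOCKS` and `rangeFrequencies()` are hypothetical names made up for this example, and the three-block table is an assumption standing in for the full set of ranges.

```php
<?php
// Hypothetical lookup table: [name, first code point, last code point].
const SAMPLE_BLOCKS = [
    ['Basic Latin', 0x0020, 0x007F],
    ['Cyrillic', 0x0400, 0x04FF],
    ['Hiragana', 0x3040, 0x309F],
];

// Counts how many chars of $text fall into each sample block.
function rangeFrequencies(string $text): array
{
    // Drop whitespace, then split the string into code points.
    $text = preg_replace('/\s+/u', '', $text);
    $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);

    $freq = [];
    foreach ($chars as $char) {
        $codePoint = mb_ord($char, 'UTF-8');
        foreach (SAMPLE_BLOCKS as [$name, $first, $last]) {
            if ($codePoint >= $first && $codePoint <= $last) {
                $freq[$name] = ($freq[$name] ?? 0) + 1;
                break;
            }
        }
    }
    arsort($freq); // highest count first

    return $freq;
}

$freq = rangeFrequencies('hola мир この');

print_r($freq);                       // Basic Latin => 4, Cyrillic => 3, Hiragana => 2
echo array_keys($freq)[0] . PHP_EOL;  // Basic Latin
```

Chars outside the sample table are silently skipped here, which is a simplification for the sake of brevity.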

That's all for now! 

Today I showed you a few applications of the Unicode Ranges library:

- Random phrases (tokens) with UTF-8 chars
- Alphabet detection
- Frequency analysis of Unicode ranges

Could you think of any more to add to this list? 

Any ideas are welcome! Thank you for reading today's post and sharing your views with the community.


#### GitHub Account
https://github.com/programarivm

Posted by @programarivm in utopian-io on 2018-09-03 17:45:06 (last updated 2018-09-03 17:54:39). Tags: utopian-io, development, php, unicode, utf8.

@amosbastian ·
Thanks for the contribution, @programarivm! It's always cool to read about people creating something for a specific need that they couldn't find elsewhere!

Some thoughts about the pull request:

* Even though there is little code, you could still add some comments (like function declarations, for example).
* Commit messages could be better - [this](https://chris.beams.io/posts/git-commit/) is a good reference.

I look forward to seeing more of your contributions!

Your contribution has been evaluated according to [Utopian policies and guidelines](https://join.utopian.io/guidelines), as well as a predefined set of questions pertaining to the category.

To view those questions and the relevant answers related to your post, [click here](https://review.utopian.io/result/3/1341224).


Posted 2018-09-05 20:18:12.

@programarivm ·
Thanks for the review, @amosbastian.

With regard to commenting the code, I believe it is okay not to write comments as long as the code is simple enough and self-explanatory, and the names of variables, methods, constants and so on are meaningful.

Anyway, I am already reviewing the code, thank you.

Posted 2018-09-08 09:10:27.

@amosbastian ·
I agree.

Posted 2018-09-08 11:19:57.