#### Repository

- https://github.com/programarivm/unicode-ranges

#### Related Repositories

- https://github.com/programarivm/babylon

Have you ever needed to create a random string from Unicode characters, drawn from blocks that you pick at will? I did a few months ago, but couldn't find any library that made it easy. So I decided to write Unicode Ranges, a PHP library that provides you with Unicode ranges -- blocks, if you like -- in a friendly, object-oriented way.

By the way, if you are not very familiar with Unicode, [click here](https://en.wikipedia.org/wiki/Unicode_block) for a quick introduction to the ranges: Basic Latin, Cyrillic, Hangul Jamo, and many, many others.

Here is an example that creates a random char encoded in any of these three Unicode ranges: `BasicLatin`, `Tibetan` and `Cherokee`.

```
use UnicodeRanges\Randomizer;
use UnicodeRanges\Range\BasicLatin;
use UnicodeRanges\Range\Tibetan;
use UnicodeRanges\Range\Cherokee;

$char = Randomizer::char([
    new BasicLatin,
    new Tibetan,
    new Cherokee,
]);

echo $char . PHP_EOL;
```

Output:

```
Ꮉ
```

And this is how to create a random string with `Arabic`, `HangulJamo` and `Phoenician` characters:

```
use UnicodeRanges\Randomizer;
use UnicodeRanges\Range\Arabic;
use UnicodeRanges\Range\HangulJamo;
use UnicodeRanges\Range\Phoenician;

$letters = Randomizer::letters([
    new Arabic,
    new HangulJamo,
    new Phoenician,
], 20);

echo $letters . PHP_EOL;
```

Output:

```
ᄺᆺڽ𐤂ᆉᅔᅱ𐤆𐤄ᅰᇼᄓ𐤊𐤄ᄃ𐤋ᆝᆛەᅎ
```

This is very useful if you want to create random UTF-8 tokens, for example.

I hope these examples give you enough context to follow my explanation -- for further information please read the [Documentation](https://unicode-ranges.readthedocs.io/en/latest/).

### New Features

Let's now cut to the chase.
Yesterday I added the following feature to Unicode Ranges so that [Babylon](https://github.com/programarivm/babylon) can compute the ranges' frequencies -- or, put another way, the number of times that a particular Unicode range appears in a text. The ultimate goal is for the language detector to understand alphabets.

This is how the feature is implemented:

- [Feature/power ranges #1](https://github.com/programarivm/unicode-ranges/pull/1)

On the one hand, `PowerRanges` provides an array containing all 255 Unicode ranges. Of course, I didn't manually instantiate the 255 classes, which would have been just tedious! Note that the `PowerRanges` array is built dynamically by reading the files stored in the [unicode-ranges/src/Range/](https://github.com/programarivm/unicode-ranges/tree/master/src/Range) folder. This is possible with PHP's [ReflectionClass](http://php.net/manual/en/class.reflectionclass.php).

```
<?php

namespace UnicodeRanges;

class PowerRanges
{
    const RANGES_FOLDER = __DIR__ . '/Range';

    protected $ranges = [];

    public function __construct()
    {
        // List the class files in the Range folder, skipping the dot entries.
        $files = array_diff(scandir(self::RANGES_FOLDER), ['.', '..']);
        foreach ($files as $file) {
            // The file name without its extension is the class name.
            $filename = pathinfo($file, PATHINFO_FILENAME);
            $classname = "\\UnicodeRanges\\Range\\$filename";
            // Instantiate the range class through reflection.
            $rangeClass = new \ReflectionClass($classname);
            $rangeObj = $rangeClass->newInstanceArgs();
            $this->ranges[] = $rangeObj;
        }
    }

    public function ranges()
    {
        return $this->ranges;
    }
}
```

On the other hand, `Converter::unicode2range($char)` converts any multibyte char into its object-oriented Unicode range counterpart.

Example:

```
use UnicodeRanges\Converter;

$char = 'a';
$range = Converter::unicode2range($char);

echo "Total: {$range->count()}".PHP_EOL;
echo "Name: {$range->name()}".PHP_EOL;
echo "Range: {$range->range()[0]}-{$range->range()[1]}".PHP_EOL;
echo 'Characters: ' . PHP_EOL;
print_r($range->chars());
```

Output:

```
Total: 96
Name: Basic Latin
Range: 0020-007F
Characters:
Array
(
    [0] =>
    [1] => !
    [2] => "
    [3] => #
    [4] => $
    [5] => %
    [6] => &
    [7] => '
    ...
```

[This is how](https://github.com/programarivm/babylon/blob/master/tests/unit/UnicodeRangeStatsTest.php) Babylon can now analyze the frequency of the Unicode ranges:

```
/**
 * @test
 */
public function freq()
{
    $text = '律絕諸篇俱宇宙古今مليارات في мале,тъйжалнопе hola que tal como 토마토쥬스 estas tu hoy この平安朝の';

    $expected = [
        'Basic Latin' => 25,
        'Cyrillic' => 14,
        'CJK Unified Ideographs' => 12,
        'Arabic' => 9,
        'Hangul Syllables' => 5,
        'Hiragana' => 3,
    ];

    $this->assertEquals($expected, (new UnicodeRangeStats($text))->freq());
}
```

As you can see, the test instantiates a [UnicodeRangeStats](https://github.com/programarivm/babylon/blob/master/src/UnicodeRangeStats.php) object, which is the class that runs `Converter::unicode2range($char)` on every character, as shown below.

```
<?php

namespace Babylon;

use Babylon;
use UnicodeRanges\Converter;

/**
 * Unicode range stats.
 *
 * @author Jordi Bassagañas <info@programarivm.com>
 * @link https://programarivm.com
 * @license MIT
 */
class UnicodeRangeStats
{
    const N_FREQ_UNICODE_RANGES = 10;

    /**
     * Text to be analyzed.
     *
     * @var string
     */
    protected $text;

    /**
     * Unicode ranges frequency -- number of times that the unicode ranges appear in the text.
     *
     * Example:
     *
     *      Array
     *      (
     *          [Basic Latin] => 25
     *          [Cyrillic] => 14
     *          [CJK Unified Ideographs] => 12
     *          [Arabic] => 9
     *          [Hangul Syllables] => 5
     *          [Hiragana] => 3
     *          ...
     *      )
     *
     * @var array
     */
    protected $freq = [];

    /**
     * Constructor.
     *
     * @param string $text
     */
    public function __construct(string $text)
    {
        $this->text = $text;
    }

    /**
     * The most frequent unicode ranges in the text.
     *
     * @return array
     * @throws \InvalidArgumentException
     */
    public function freq(): array
    {
        // Start from scratch so that repeated calls (e.g. through mostFreq())
        // do not add to the counts of a previous call.
        $this->freq = [];
        $chars = $this->mbStrSplit($this->text);
        foreach ($chars as $char) {
            $name = Converter::unicode2range($char)->name();
            $this->freq[$name] = ($this->freq[$name] ?? 0) + 1;
        }
        // Most frequent ranges first.
        arsort($this->freq);

        return array_slice($this->freq, 0, self::N_FREQ_UNICODE_RANGES);
    }

    /**
     * The name of the most frequent unicode range in the text.
     *
     * @return string
     * @throws \InvalidArgumentException
     */
    public function mostFreq(): string
    {
        return key(array_slice($this->freq(), 0, 1));
    }

    /**
     * Converts a multibyte string into an array of chars, ignoring whitespace.
     *
     * @param string $text
     * @return array
     */
    private function mbStrSplit(string $text): array
    {
        // Collapse runs of whitespace, then drop the remaining spaces.
        $text = preg_replace('!\s+!', ' ', $text);
        $text = str_replace(' ', '', $text);

        // Split between every pair of adjacent UTF-8 characters.
        return preg_split('/(?<!^)(?!$)/u', $text);
    }
}
```

That's all for now! Today I showed you a few applications of the Unicode Ranges library:

- Random phrases (tokens) with UTF chars
- Alphabet detection
- Frequency analysis of Unicode ranges

Can you think of any more to add to this list? Any ideas are welcome!

Thank you for reading today's post and sharing your views with the community.

#### GitHub Account

https://github.com/programarivm
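#### A Library-Free Sketch of the Frequency Count

As a rough illustration of the frequency analysis described in this post, the same idea can be sketched without the library. This is not how Unicode Ranges or Babylon are implemented: the `BLOCKS` table and the `blockOf()` and `blockFrequencies()` helpers are hypothetical names made up for this example, and only six blocks are listed. The sketch assumes PHP 7.2 or later with the mbstring extension, since it relies on `mb_ord()`.

```php
<?php

// Hypothetical, hand-picked subset of Unicode blocks: name => [first, last] code point.
// The real library ships one class per block; this table is for illustration only.
const BLOCKS = [
    'Basic Latin'            => [0x0000, 0x007F],
    'Cyrillic'               => [0x0400, 0x04FF],
    'Arabic'                 => [0x0600, 0x06FF],
    'Hiragana'               => [0x3040, 0x309F],
    'CJK Unified Ideographs' => [0x4E00, 0x9FFF],
    'Hangul Syllables'       => [0xAC00, 0xD7AF],
];

// Returns the name of the block containing $char, or null if it is not in the table.
function blockOf(string $char): ?string
{
    $codePoint = mb_ord($char, 'UTF-8');
    foreach (BLOCKS as $name => [$first, $last]) {
        if ($codePoint >= $first && $codePoint <= $last) {
            return $name;
        }
    }

    return null;
}

// Counts how many characters of $text fall into each known block, most frequent first.
function blockFrequencies(string $text): array
{
    $freq = [];
    // Drop whitespace, then split the text into single UTF-8 characters.
    $chars = preg_split('//u', preg_replace('/\s+/u', '', $text), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($chars as $char) {
        $name = blockOf($char);
        if ($name !== null) {
            $freq[$name] = ($freq[$name] ?? 0) + 1;
        }
    }
    arsort($freq);

    return $freq;
}

print_r(blockFrequencies('hola que мале 토'));
```

For that input the counts are 7 for Basic Latin, 4 for Cyrillic and 1 for Hangul Syllables. A lookup table like this is easy to follow but has to be maintained by hand, which is exactly the chore that one class per range plus `PowerRanges` takes care of.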
author | programarivm |
---|---|
permlink | alphabet-detection-and-frequency-analysis-of-unicode-ranges-with-php |
category | utopian-io |
json_metadata | {"tags":["utopian-io","development","php","unicode","utf8"],"image":["https://cdn.steemitimages.com/DQmS9G5KPn5rDc5exJfsuEe98idBgqZGALwgGPoztrsMJUo/many-languages.jpg","https://cdn.steemitimages.com/DQmWqpLYeZoVbbU9BX4n1268tLKcJDCJ7tGeSLSAKV8LKvb/babel.jpg","https://cdn.steemitimages.com/DQmaT6BmTF8xEMVdmcqHN6LTKMxo7Uun9mjTE2c3FgUaj2n/stats.jpg"],"links":["https://github.com/programarivm/unicode-ranges","https://github.com/programarivm/babylon","https://en.wikipedia.org/wiki/Unicode_block","https://unicode-ranges.readthedocs.io/en/latest/","https://github.com/programarivm/unicode-ranges/pull/1","https://github.com/programarivm/unicode-ranges/tree/master/src/Range","http://php.net/manual/en/class.reflectionclass.php","https://github.com/programarivm/babylon/blob/master/tests/unit/UnicodeRangeStatsTest.php","https://github.com/programarivm/babylon/blob/master/src/UnicodeRangeStats.php","https://github.com/programarivm"],"app":"steemit/0.1","format":"markdown"} |
created | 2018-09-03 17:45:06 |
last_update | 2018-09-03 17:54:39 |
depth | 0 |
children | 6 |
last_payout | 2018-09-10 17:45:06 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 9.187 HBD |
curator_payout_value | 2.926 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 7,617 |
author_reputation | 2,631,258,794,707 |
root_title | "Alphabet Detection and Frequency Analysis of Unicode Ranges with PHP" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 70,207,407 |
net_rshares | 10,401,735,394,557 |
author_curate_reward | "" |
voter | weight | wgt% | rshares | pct | time |
---|---|---|---|---|---|
andrejcibik | 0 | 35,371,008,095 | 100% | ||
utopian-io | 0 | 10,262,689,166,530 | 6.53% | ||
amosbastian | 0 | 25,716,840,752 | 39.94% | ||
organicgardener | 0 | 1,109,760,053 | 10% | ||
simplymike | 0 | 11,745,413,590 | 10% | ||
statsexpert | 0 | 1,342,509,883 | 20% | ||
schmozzle | 0 | 485,099,532 | 100% | ||
beetlevc | 0 | 621,728,889 | 1% | ||
kolxoznik0 | 0 | 500,038,620 | 100% | ||
alina34 | 0 | 500,884,125 | 100% | ||
dimka10 | 0 | 500,271,629 | 100% | ||
petrenkosashka | 0 | 499,785,143 | 100% | ||
amaratonna | 0 | 499,317,199 | 100% | ||
ingakoral | 0 | 501,230,199 | 100% | ||
antonova2030 | 0 | 501,156,609 | 100% | ||
missnadeen | 0 | 499,711,180 | 100% | ||
czciborj | 0 | 509,637,305 | 100% | ||
bndage | 0 | 499,161,504 | 100% | ||
submitchair | 0 | 501,917,747 | 100% | ||
ancestorobserve | 0 | 501,559,228 | 100% | ||
bankhayloft | 0 | 501,905,138 | 100% | ||
seniorid | 0 | 499,221,367 | 100% | ||
wombtick | 0 | 489,756,185 | 100% | ||
sigur | 0 | 499,546,390 | 100% | ||
cookiees | 0 | 499,059,204 | 100% | ||
fedykosoy00 | 0 | 510,922,538 | 100% | ||
ira.timirova91 | 0 | 490,136,291 | 100% | ||
elvinhender | 0 | 501,554,760 | 100% | ||
mightypanda | 0 | 25,157,120,482 | 100% | ||
publicchord | 0 | 499,186,770 | 100% | ||
warblingunchin | 0 | 510,409,873 | 100% | ||
filkreserved | 0 | 501,812,186 | 100% | ||
luffnoisy | 0 | 499,820,547 | 100% | ||
cloudtickets | 0 | 499,184,921 | 100% | ||
sutegloss | 0 | 508,436,106 | 100% | ||
eczemamuon | 0 | 490,994,240 | 100% | ||
soapmousse | 0 | 499,063,918 | 100% | ||
graphbaggy | 0 | 490,342,749 | 100% | ||
headerharem | 0 | 502,129,700 | 100% | ||
windtwitter | 0 | 501,985,658 | 100% | ||
monkbulimia | 0 | 511,389,210 | 100% | ||
honeycheek | 0 | 501,559,726 | 100% | ||
annasokolova955 | 0 | 499,459,693 | 100% | ||
djane860 | 0 | 499,336,665 | 100% | ||
be1ozer | 0 | 499,475,281 | 100% | ||
peruska | 0 | 498,987,362 | 100% | ||
ingusik | 0 | 490,041,496 | 100% | ||
lapin124 | 0 | 489,753,133 | 100% | ||
artyr.kalmetov | 0 | 509,431,504 | 100% | ||
irina.abramovva | 0 | 490,680,299 | 100% | ||
shallowcuttle | 0 | 499,866,692 | 100% | ||
novatroup | 0 | 500,270,973 | 100% | ||
patcheswish | 0 | 500,224,507 | 100% | ||
curiouscred | 0 | 490,689,922 | 100% | ||
snailrepeat | 0 | 499,697,020 | 100% | ||
altitudetennis | 0 | 500,332,329 | 100% | ||
corvushilt | 0 | 490,793,393 | 100% | ||
barcodemail | 0 | 499,938,856 | 100% | ||
pointsbee | 0 | 491,226,891 | 100% | ||
crankyboned | 0 | 500,485,203 | 100% | ||
fastandcurious | 0 | 2,840,021,418 | 100% | ||
crystaleur | 0 | 509,360,251 | 100% | ||
mercedesmetric | 0 | 500,377,148 | 100% | ||
bohrbowling | 0 | 491,363,794 | 100% | ||
grouseunhelpful | 0 | 500,570,568 | 100% | ||
blueberryeither | 0 | 510,077,005 | 100% | ||
jacekw.dev | 0 | 1,928,867,741 | 100% | ||
pitondessert | 0 | 501,149,705 | 100% | ||
bikinivinyl | 0 | 485,560,987 | 100% | ||
bullinachinashop | 0 | 2,195,726,249 | 100% | ||
atihonov1990 | 0 | 500,445,273 | 100% | ||
kristiansim | 0 | 500,279,345 | 100% | ||
zelc | 0 | 506,801,733 | 100% | ||
kaczynski | 0 | 52,366,350 | 100% |
Thanks for the contribution, @programarivm! It's always cool to read about people creating something for a specific need that they couldn't find elsewhere!

Some thoughts about the pull request:

* Even though there is little code, you could still add some comments (like function declarations, for example).
* Commit messages could be better - [this](https://chris.beams.io/posts/git-commit/) is a good reference.

I look forward to seeing more of your contributions!

Your contribution has been evaluated according to [Utopian policies and guidelines](https://join.utopian.io/guidelines), as well as a predefined set of questions pertaining to the category. To view those questions and the relevant answers related to your post, [click here](https://review.utopian.io/result/3/1341224).

----

Need help? Write a ticket on https://support.utopian.io/. Chat with us on [Discord](https://discord.gg/uTyJkNm). [[utopian-moderator]](https://join.utopian.io/)
author | amosbastian |
---|---|
permlink | re-programarivm-alphabet-detection-and-frequency-analysis-of-unicode-ranges-with-php-20180905t201813765z |
category | utopian-io |
json_metadata | {"tags":["utopian-io"],"users":["programarivm"],"links":["https://chris.beams.io/posts/git-commit/","https://join.utopian.io/guidelines","https://review.utopian.io/result/3/1341224","https://support.utopian.io/","https://discord.gg/uTyJkNm","https://join.utopian.io/"],"app":"steemit/0.1"} |
created | 2018-09-05 20:18:12 |
last_update | 2018-09-05 20:18:12 |
depth | 1 |
children | 3 |
last_payout | 2018-09-12 20:18:12 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 6.101 HBD |
curator_payout_value | 1.989 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 959 |
author_reputation | 174,473,586,900,705 |
root_title | "Alphabet Detection and Frequency Analysis of Unicode Ranges with PHP" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 70,434,423 |
net_rshares | 7,363,501,153,723 |
author_curate_reward | "" |
voter | weight | wgt% | rshares | pct | time |
---|---|---|---|---|---|
mys | 0 | 7,868,402,294 | 6.07% | ||
pixelfan | 0 | 935,266,906 | 0.36% | ||
espoem | 0 | 17,180,282,695 | 15% | ||
utopian-io | 0 | 7,330,817,842,602 | 5.55% | ||
zapncrap | 0 | 2,001,559,740 | 5% | ||
curx | 0 | 1,782,932,993 | 5% | ||
mightypanda | 0 | 2,575,138,317 | 10% | ||
mops2e | 0 | 339,728,176 | 10% |
Thanks for the review, @amosbastian. Regarding code comments, I believe it is okay not to write them as long as the code is simple enough and self-explanatory, and the names of variables, methods, constants and so on are meaningful. Anyway, I am already reviewing the code. Thank you.
author | programarivm |
---|---|
permlink | re-amosbastian-re-programarivm-alphabet-detection-and-frequency-analysis-of-unicode-ranges-with-php-20180908t091026457z |
category | utopian-io |
json_metadata | {"tags":["utopian-io"],"users":["amosbastian"],"app":"steemit/0.1"} |
created | 2018-09-08 09:10:27 |
last_update | 2018-09-08 09:10:27 |
depth | 2 |
children | 1 |
last_payout | 2018-09-15 09:10:27 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 0.000 HBD |
curator_payout_value | 0.000 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 298 |
author_reputation | 2,631,258,794,707 |
root_title | "Alphabet Detection and Frequency Analysis of Unicode Ranges with PHP" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 70,683,055 |
net_rshares | 0 |
I agree.
author | amosbastian |
---|---|
permlink | re-programarivm-re-amosbastian-re-programarivm-alphabet-detection-and-frequency-analysis-of-unicode-ranges-with-php-20180908t111956611z |
category | utopian-io |
json_metadata | {"tags":["utopian-io"],"app":"steemit/0.1"} |
created | 2018-09-08 11:19:57 |
last_update | 2018-09-08 11:19:57 |
depth | 3 |
children | 0 |
last_payout | 2018-09-15 11:19:57 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 0.000 HBD |
curator_payout_value | 0.000 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 8 |
author_reputation | 174,473,586,900,705 |
root_title | "Alphabet Detection and Frequency Analysis of Unicode Ranges with PHP" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 70,690,807 |
net_rshares | 0 |
Thank you for your review, @amosbastian! So far this week you've reviewed 10 contributions. Keep up the good work!
author | utopian-io |
---|---|
permlink | re-re-programarivm-alphabet-detection-and-frequency-analysis-of-unicode-ranges-with-php-20180905t201813765z-20180909t014514z |
category | utopian-io |
json_metadata | "{"app": "beem/0.19.42"}" |
created | 2018-09-09 01:45:15 |
last_update | 2018-09-09 01:45:15 |
depth | 2 |
children | 0 |
last_payout | 2018-09-16 01:45:15 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 0.000 HBD |
curator_payout_value | 0.000 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 115 |
author_reputation | 152,955,367,999,756 |
root_title | "Alphabet Detection and Frequency Analysis of Unicode Ranges with PHP" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 70,747,158 |
net_rshares | 16,557,960,926 |
author_curate_reward | "" |
voter | weight | wgt% | rshares | pct | time |
---|---|---|---|---|---|
espoem | 0 | 16,252,205,568 | 15% | ||
mops2e | 0 | 305,755,358 | 10% |
Hi @programarivm, I'm @checky! While checking the mentions made in this post I noticed that @throws doesn't exist on Steem. Did you mean to write @thow?

###### If you found this comment useful, consider upvoting it to help keep this bot running. You can see a list of all available commands by replying with `!help`.
author | checky |
---|---|
permlink | re-programarivm-alphabet-detection-and-frequency-analysis-of-unicode-ranges-with-php |
category | utopian-io |
json_metadata | {"app":"checky/0.1.0","format":"markdown","tags":["mentions","bot","checky"]} |
created | 2018-09-03 17:45:15 |
last_update | 2018-09-03 17:45:15 |
depth | 1 |
children | 0 |
last_payout | 2018-09-10 17:45:15 |
cashout_time | 1969-12-31 23:59:59 |
total_payout_value | 0.000 HBD |
curator_payout_value | 0.000 HBD |
pending_payout_value | 0.000 HBD |
promoted | 0.000 HBD |
body_length | 328 |
author_reputation | 933,802,853,502 |
root_title | "Alphabet Detection and Frequency Analysis of Unicode Ranges with PHP" |
beneficiaries | [] |
max_accepted_payout | 1,000,000.000 HBD |
percent_hbd | 10,000 |
post_id | 70,207,421 |
net_rshares | 0 |