There are several encodings of Cyrillic characters.
When building websites, the most commonly used ones are:
- utf-8
- windows-1251
- koi8-r
Less common encodings:
- iso-8859-5
- ibm866
- mac-cyrillic
This is probably not an exhaustive list; these are just the encodings I encounter most often.
Sometimes you need to determine the encoding of a piece of text, and PHP even has a function for this: mb_detect_encoding. But, as m00t wrote in the article "Detecting Text Encoding in PHP: An Overview of Existing Solutions Plus Yet Another Bicycle" — in short, it does not work.
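A quick way to see the problem (a small sketch of my own; the candidate list and the strictness flag are my choices, not from the article): almost any byte sequence is "valid" in a single-byte Cyrillic encoding, so mb_detect_encoding simply reports the first single-byte candidate that fits.

```php
<?php
// Convert a UTF-8 phrase to windows-1251, then ask PHP what encoding it is.
$utf8   = 'Привет, мир';
$cp1251 = mb_convert_encoding($utf8, 'Windows-1251', 'UTF-8');

// Every byte here is a defined windows-1251 character, so with this
// candidate order the answer is 'Windows-1251' — but KOI8-R text would be
// reported as 'Windows-1251' too, for exactly the same reason.
$detected = mb_detect_encoding($cp1251, ['UTF-8', 'Windows-1251', 'KOI8-R'], true);
```

The order of the candidate list, not the content of the text, decides the answer for single-byte encodings — which is why the function is of little use here.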
After reading m00t's article, I was not won over by his method and found this solution instead: "Determining Text Encoding in PHP and Python".
As m00t would put it: "character codes again".
I tested that character-code detection function, was satisfied with the results, and used it for a couple of years.
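The character-code idea can be sketched roughly like this (my simplified illustration, not the actual function from the linked article): lowercase Cyrillic letters are by far the most frequent, and each encoding stores them in a different byte range, so we count hits per range and pick the winner.

```php
<?php
// Simplified character-code detector. The byte ranges below are where each
// encoding keeps the lowercase letters а–я; everything else is ignored.
function guessCyrillicEncoding(string $text): string
{
    $ranges = [
        'windows-1251' => [0xE0, 0xFF],
        'koi8-r'       => [0xC0, 0xDF],
        'iso-8859-5'   => [0xD0, 0xEF],
    ];

    $scores = array_fill_keys(array_keys($ranges), 0);
    foreach (str_split($text) as $char) {
        $byte = ord($char);
        foreach ($ranges as $encoding => [$lo, $hi]) {
            if ($byte >= $lo && $byte <= $hi) {
                $scores[$encoding]++;
            }
        }
    }

    arsort($scores);
    return array_key_first($scores);
}
```

Note how the iso-8859-5 range overlaps both of the others — overlapping code tables are exactly what drags detection accuracy down, as the test results below show.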
Recently I decided to rewrite the project that used this function and looked for a ready-made package on packagist.org. I found cnpait/detect_encoding, which detects the encoding using m00t's method.
That package has been installed more than 1,200 times, so I am clearly not the only one who periodically needs to detect text encoding.
I could have just installed it and been done, but I decided to dig deeper.
In the end, I made my own package: onnov/detect-encoding.
How to use it is described in its README.md.
Below I describe how I tested it and compared it with the cnpait/detect_encoding package.
Testing methodology
Take a large text: Tolstoy's Anna Karenina, 1,701,480 characters in total.
Strip out everything unnecessary, leaving only the Cyrillic letters:
$text = preg_replace('/[^а-яё]/ui', '', $text);
That leaves 1,336,252 Cyrillic characters.
In a loop, take a piece of the text (5, 15, 30, ... characters), convert it to a known encoding, and try to detect that encoding with the script; then check whether the answer is correct.
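The loop above can be sketched as follows (a simplified harness of my own; the $detect callable stands in for whichever detector is under test — the real packages expose their own APIs):

```php
<?php
// Measure detection accuracy: take random slices of a known-good UTF-8 text,
// convert each slice to the target encoding, run the detector, and count how
// often it names that encoding correctly.
function accuracy(callable $detect, string $utf8Text, string $encoding, int $length, int $samples = 1000): float
{
    $correct = 0;
    $total   = mb_strlen($utf8Text, 'UTF-8');
    for ($i = 0; $i < $samples; $i++) {
        $offset    = random_int(0, $total - $length);
        $chunk     = mb_substr($utf8Text, $offset, $length, 'UTF-8');
        $converted = mb_convert_encoding($chunk, $encoding, 'UTF-8');
        if ($detect($converted) === $encoding) {
            $correct++;
        }
    }
    return 100 * $correct / $samples; // accuracy in %
}
```

Running this for each encoding and each slice length (5, 15, 30, ...) produces exactly the kind of accuracy table shown below.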
Here is the table: encodings are listed on the left, the number of characters used for detection runs across the top, and the cells show the detection accuracy in %.
The worst accuracy is with mac-cyrillic: at least 60 characters are needed to detect this encoding with 92.15% accuracy. Accuracy for windows-1251 is also quite low. This is because the byte values of their characters overlap heavily in the code tables.
Fortunately, mac-cyrillic and ibm866 encodings are not used to encode web pages.
Let's try without them:
Detection accuracy is high even for short strings of 5 to 10 letters, and for strings of 60 letters it reaches 100%. Detection is also very fast: for example, a text of more than 1,300,000 Cyrillic characters is checked in 0.00096 seconds (on my computer).
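A timing like the one quoted can be reproduced along these lines (my sketch; strlen() stands in for a real detector call just to keep the snippet self-contained):

```php
<?php
// Build a large single-byte Cyrillic string (~1.4 million bytes) and time
// one detection pass over it.
$chunk = mb_convert_encoding('анна каренина ', 'Windows-1251', 'UTF-8');
$text  = str_repeat($chunk, 100000);

$start   = microtime(true);
$length  = strlen($text); // replace with the detector call being timed
$elapsed = microtime(true) - $start;

printf("%d bytes processed in %.5f s\n", $length, $elapsed);
```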
And here is what the statistical method described by m00t shows:
As you can see, its detection results are also good. The script is fast on short texts, but noticeably slower on huge ones: a text of more than 1,300,000 Cyrillic characters is checked in 0.32 seconds (on my computer).
My findings
- Both methods give good results.
- The accuracy of the methods is close.
- Detection by character codes is faster on large texts, but this hardly matters, since it is unlikely that anyone will be checking such huge texts.
- The statistical method still has the potential to increase the accuracy of encoding determination.
Which method to use is up to you; in principle, you can use both at once.