Detecting text encoding in PHP instead of mb_detect_encoding

There are several encodings of Cyrillic characters.



When building websites, UTF-8 and windows-1251 are typically used.

Other popular encodings include:

- koi8-r
- iso-8859-5
- ibm866
- mac-cyrillic

This is probably not an exhaustive list; these are just the encodings I come across most often.



Sometimes you need to determine the encoding of a piece of text, and PHP even has a function for that:



mb_detect_encoding
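A minimal illustration of the problem (the test string and candidate list are my own example, not from the article):

```php
<?php
// Encode a Russian string into KOI8-R, then ask mb_detect_encoding
// to choose between two single-byte Cyrillic encodings.
$koi = mb_convert_encoding('привет мир', 'KOI8-R', 'UTF-8');

$detected = mb_detect_encoding($koi, ['Windows-1251', 'KOI8-R'], true);

// Almost every Cyrillic byte sequence is also *valid* Windows-1251, so the
// result depends on candidate order and PHP version rather than on the
// actual encoding -- KOI8-R text is frequently misreported as Windows-1251.
var_dump($detected);
```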
      
      





But, as m00t wrote in the article Defining Text Encoding in PHP - An Overview of Existing Solutions plus Another Bike:

In short, it does not work.

After reading m00t's article, I was not inspired by his method and instead found this solution: Determining the Text Encoding in PHP and Python.

As m00t put it:

character codes again

I tested this character-code-based detection function; the results satisfied me, and I used it for a couple of years.



Recently I decided to rewrite the project where I used this function and found a ready-made package on packagist.org, cnpait/detect_encoding, in which the encoding is determined using m00t's method.



That package has been installed more than 1200 times, so I am clearly not the only one who periodically needs to detect text encoding.



I could have just installed this package and moved on, but I decided to tinker.



In the end, I made my own package: onnov/detect-encoding.



How to use it is written in README.md



Below I describe testing it and comparing it with the cnpait/detect_encoding package.



Testing methodology



Take a large text: Tolstoy, Anna Karenina.

In total: 1'701'480 characters.



We remove everything unnecessary, leaving only Cyrillic letters:



 $text = preg_replace('/[^а-яё]/ui', '', $text);
      
      





This leaves 1'336'252 Cyrillic characters.



In a loop we take a part of the text (5, 15, 30, ... characters), convert it to a known encoding, and try to detect that encoding with the script. Then we check whether the result is correct.
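The loop can be sketched like this; `detect_encoding()` is a hypothetical stand-in for whichever detector is being benchmarked, not either package's real API:

```php
<?php
// Hypothetical stand-in for the detector under test; the real benchmark
// would call the package being evaluated instead.
function detect_encoding(string $bytes): string
{
    $enc = mb_detect_encoding($bytes, ['Windows-1251', 'KOI8-R', 'ISO-8859-5'], true);
    return $enc === false ? 'unknown' : $enc;
}

$text    = 'какой нибудь длинный отрывок текста на русском языке'; // placeholder sample
$targets = ['Windows-1251', 'KOI8-R', 'ISO-8859-5'];

foreach ([5, 15, 30] as $len) {
    foreach ($targets as $enc) {
        $chunk = mb_substr($text, 0, $len, 'UTF-8');
        $bytes = mb_convert_encoding($chunk, $enc, 'UTF-8'); // the "known encoding"
        $guess = detect_encoding($bytes);
        printf("%-11s %2d chars: %s\n", $enc, $len,
            strcasecmp($guess, $enc) === 0 ? 'ok' : "wrong ($guess)");
    }
}
```

The real benchmark additionally tallies correct guesses per encoding and length to produce the accuracy percentages shown below.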



Here is the results table: the encoding is on the left, the number of characters used for detection runs across the top, and the cells show detection accuracy in %.

letters ->      5      15     30     60     120    180    270
windows-1251    99.13  98.83  98.54  99.04  99.73  99.93  100.0
koi8-r          99.89  99.98  100.0  100.0  100.0  100.0  100.0
iso-8859-5      81.79  99.27  99.98  100.0  100.0  100.0  100.0
ibm866          99.81  99.99  100.0  100.0  100.0  100.0  100.0
mac-cyrillic    12.79  47.49  73.48  92.15  99.30  99.94  100.0


The worst accuracy is with mac-cyrillic: at least 60 characters are needed to detect this encoding with even 92.15% accuracy. windows-1251 is also noticeably less accurate than the other encodings at short lengths. This is because their character codes overlap heavily in the code tables.
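The overlap is easy to see by comparing the byte each encoding assigns to the same letter. A small sketch (the mac-cyrillic value is taken from the published Unicode mapping tables, since mbstring itself has no MacCyrillic support):

```php
<?php
// Byte value of the lowercase letter "т" in each single-byte encoding.
foreach (['Windows-1251', 'KOI8-R', 'CP866'] as $enc) {
    $byte = bin2hex(mb_convert_encoding('т', $enc, 'UTF-8'));
    printf("%-12s 0x%s\n", $enc, strtoupper($byte));
}
// Prints 0xF2 for windows-1251, 0xD4 for koi8-r, 0xE2 for ibm866 (CP866).
// Per the Unicode mapping files, mac-cyrillic also encodes "т" as 0xF2 --
// identical to windows-1251, which is why the two are hard to tell apart
// by byte values alone.
```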



Fortunately, mac-cyrillic and ibm866 encodings are not used to encode web pages.



Let's try without them:

letters ->      5      10     15     30     60
windows-1251    99.40  99.69  99.86  99.97  100.0
koi8-r          99.89  99.98  99.98  100.0  100.0
iso-8859-5      81.79  96.41  99.27  99.98  100.0


Detection accuracy is high even on short strings of 5 to 10 letters, and for strings of 60 letters it reaches 100%. Moreover, detection is very fast: for example, a text of more than 1'300'000 Cyrillic characters is checked in 0.00096 seconds (on my computer).



Now let's see what the statistical method described by m00t shows:

letters ->      5      10     15     30     60
windows-1251    88.75  96.62  98.43  99.90  100.0
koi8-r          85.15  95.71  97.96  99.91  100.0
iso-8859-5      88.60  96.77  98.58  99.93  100.0


As you can see, the detection results are good. The script is fast, especially on short texts, but on huge texts it is noticeably slower: a text of more than 1'300'000 Cyrillic characters takes 0.32 seconds (on my computer).
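The statistical approach can be sketched as a toy frequency scorer; the letter list and scoring below are my simplified assumptions, not m00t's actual statistics:

```php
<?php
// Toy frequency scorer: decode the bytes with each candidate encoding and
// count how many of the most frequent Russian lowercase letters appear.
// The sample text is lowercase, so a wrong decoding mostly yields
// uppercase or rare letters and scores low.
function score_candidates(string $bytes, array $candidates): array
{
    $frequent = ['о', 'е', 'а', 'и', 'н', 'т', 'с', 'р', 'в', 'л'];
    $scores = [];
    foreach ($candidates as $enc) {
        $decoded = mb_convert_encoding($bytes, 'UTF-8', $enc);
        $score = 0;
        foreach (mb_str_split($decoded, 1, 'UTF-8') as $ch) {
            if (in_array($ch, $frequent, true)) {
                $score++;
            }
        }
        $scores[$enc] = $score;
    }
    arsort($scores); // best-scoring candidate first
    return $scores;
}

$bytes  = mb_convert_encoding('обыкновенное предложение на русском языке', 'KOI8-R', 'UTF-8');
$scores = score_candidates($bytes, ['Windows-1251', 'KOI8-R', 'ISO-8859-5']);
print_r($scores);
```

A real implementation would use per-letter frequency weights rather than a flat top-ten list, but the principle is the same: the correct encoding decodes to the most statistically plausible Russian text.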



My findings





Which method to use is up to you; in principle, you can use both at once.


