John Davidson

debugging - How can I reproducibly represent a non-UTF8 string in PHP (Browser)



I received a string with an unknown character encoding via import. How can I display such a string in the browser so that it can be reproduced as PHP code?


I would like to illustrate the problem with an example.


$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8,'CP1252');

var_dump($stringWin1252); //string(17) "The price is 15 �"
var_export($stringWin1252); // 'The price is 15 �'

The string produced by var_export does not match the original: every byte that cannot be decoded is displayed as the � symbol. Here the string is generated with mb_convert_encoding only for test purposes, so the character encoding is known. In practice the string comes from imports, e.g. via file_get_contents(), and the character encoding is unknown.
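
As a small illustration of why the symbol appears, the sketch below inspects the raw bytes of the test string. It assumes the mbstring extension is loaded and writes the CP1252 euro sign directly as the byte 0x80, so the result does not depend on how the source file is saved.

```php
<?php
// "\x80" is the CP1252 byte for the euro sign; on its own it is not valid UTF-8.
$stringWin1252 = "The price is 15 \x80";

var_dump(mb_check_encoding($stringWin1252, 'UTF-8')); // bool(false)
var_dump(strlen($stringWin1252));                     // int(17)
echo bin2hex(substr($stringWin1252, -1)), "\n";       // 80
```

The byte is still present in the string; only its display is lossy.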


The output with an improved var_export that I expect looks like this:


"The price is 15 \x80"

My approach to the solution is to find all non-UTF8 characters and then show them in hexadecimal. The code for this is too extensive to be shown here.


Another variant is to output all characters in hexadecimal PHP notation.


function strToHex2($str) {
    return '\x'.rtrim(chunk_split(strtoupper(bin2hex($str)),2,'\x'),'\x');
}
echo strToHex2($stringWin1252);

Output:


\x54\x68\x65\x20\x70\x72\x69\x63\x65\x20\x69\x73\x20\x31\x35\x20\x80

This variant is well suited for purely binary data, but the output is long and hard to read for ordinary text.


My question in other words:


How can I convert all bytes of a string that are not valid UTF-8 to the PHP hex notation "\xnn" while leaving valid UTF-8 characters unchanged?



My Answer:
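One possible solution is to walk through the string byte by byte, read the expected sequence length from each lead byte, and validate the whole sequence with mb_check_encoding(). A byte that does not start a valid sequence is converted to its hexadecimal representation. The sequence has to be checked as a whole, because every single byte of a multi-byte character such as € is invalid UTF-8 when tested on its own. Here is an example code snippet to achieve this:

function convertNonUTF8ToHex($str) {
    $result = '';
    $length = strlen($str);
    $i = 0;

    while ($i < $length) {
        $byte = ord($str[$i]);

        // Expected sequence length, derived from the lead byte.
        if ($byte < 0x80) {
            $size = 1;
        } elseif ($byte >= 0xC2 && $byte <= 0xDF) {
            $size = 2;
        } elseif ($byte >= 0xE0 && $byte <= 0xEF) {
            $size = 3;
        } elseif ($byte >= 0xF0 && $byte <= 0xF4) {
            $size = 4;
        } else {
            $size = 0; // stray continuation byte or invalid lead byte
        }

        $seq = $size > 0 ? substr($str, $i, $size) : '';

        if ($size > 0 && strlen($seq) === $size && mb_check_encoding($seq, 'UTF-8')) {
            $result .= $seq;   // valid character, keep as is
            $i += $size;
        } else {
            $result .= '\x' . strtoupper(bin2hex($str[$i]));
            $i++;
        }
    }

    return $result;
}

$stringUTF8 = "The price is 15 €";
$stringWin1252 = mb_convert_encoding($stringUTF8, 'CP1252');

echo convertNonUTF8ToHex($stringWin1252); // The price is 15 \x80


The function takes the sequence length (1 to 4 bytes) from the lead byte and checks the complete sequence with `mb_check_encoding()`. If the sequence is valid UTF-8, it is appended unchanged. Otherwise only the current byte is converted with `bin2hex()` and appended with the `\x` prefix, and the scan continues at the next byte.

This way, bytes that are not valid UTF-8 are shown in the PHP hex representation `\xnn`, while correct UTF-8 characters, including multi-byte ones such as €, stay unchanged.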
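
If the output should be pasted back into PHP source as a double-quoted string, backslashes, double quotes, dollar signs and control characters in the input would also need escaping. The following is a minimal sketch of that extension, not a definitive implementation; the name `toPhpLiteral` is hypothetical, and it assumes the mbstring extension is available.

```php
<?php
/**
 * Hypothetical helper (the name is an assumption, not from the original post).
 * Returns a double-quoted PHP string literal that evaluates to $bytes.
 */
function toPhpLiteral(string $bytes): string
{
    $out = '';
    $len = strlen($bytes);
    $i = 0;

    while ($i < $len) {
        $b = ord($bytes[$i]);

        // Sequence length implied by the lead byte; 0 means "not a lead byte".
        if ($b < 0x80) {
            $size = 1;
        } elseif ($b >= 0xC2 && $b <= 0xDF) {
            $size = 2;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $size = 3;
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $size = 4;
        } else {
            $size = 0;
        }

        $seq = substr($bytes, $i, max($size, 1));

        if ($size > 1 && strlen($seq) === $size && mb_check_encoding($seq, 'UTF-8')) {
            // Valid multi-byte character: keep it readable.
            $out .= $seq;
            $i += $size;
        } elseif ($size === 1 && $b >= 0x20 && $b <= 0x7E) {
            // Printable ASCII; backslash, double quote and dollar sign get a backslash.
            $out .= ($seq === '\\' || $seq === '"' || $seq === '$') ? '\\' . $seq : $seq;
            $i++;
        } else {
            // Control character or byte outside any valid UTF-8 sequence.
            $out .= sprintf('\\x%02X', $b);
            $i++;
        }
    }

    return '"' . $out . '"';
}

echo toPhpLiteral("The price is 15 \x80"), "\n"; // "The price is 15 \x80"
```

Because every special character is escaped, evaluating the returned literal as PHP code should give back the original bytes, which makes the representation reproducible.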
