John Davidson

Trouble decoding string from JSON in PHP \u00e6\u0097\u00a5\u00e6\u009c\u00ac

Message:


TLDR: Trying to convert the string \u00e6\u0097\u00a5\u00e6\u009c\u00ac to 日本 in PHP.
(Trying to get \u00e6\u0097\u00a5\u00e6\u009c\u00ac to echo out 日本)


Hi folks,


I have a JSON file from Instagram (a download of my data), and many of my posts contain Japanese text, which I believe is stored encoded in UTF-8 (please correct me if I'm wrong, especially as mb_detect_encoding("\u00e6\u0097\u00a5\u00e6\u009c\u00ac") returns "ASCII").
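On the "ASCII" result: a likely explanation is that the string holds the literal escape text (a backslash, a u, and four hex digits, six times over) rather than any decoded bytes, so every byte in it is plain ASCII. A minimal check, assuming $str is assigned with single quotes so that PHP leaves the backslashes alone:

```php
<?php
// Assumption: $str holds the escape text exactly as it appears in the JSON file.
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';

$length  = strlen($str);                      // six escapes of six characters each
$isAscii = mb_check_encoding($str, 'ASCII');  // true when every byte is below 0x80

var_dump($length, $isAscii);
```

If that holds, the encoding converters have nothing to convert, because the escapes have not yet been turned into characters.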


For example \u00e6\u0097\u00a5\u00e6\u009c\u00ac becomes 日本.


The conversions can be seen working fine on this encoder/decoder website:
https://mothereff.in/utf-8


(Note that if you put 日本 into the above site it returns \xE6\x97\xA5\xE6\x9C\xAC, so adding \xE6\x97\xA5\xE6\x9C\xAC \u00e6\u0097\u00a5\u00e6\u009c\u00ac to the encoded field will produce 日本 日本 in the decoded field)


I'm trying to convert it back to regular Japanese text but am having issues.


I've been googling and searching Stack Overflow for just over a day and have tried many different methods, but I just can't get it to convert. I'm clearly missing something. In most cases the string does not change at all.


For the scope of this question, I'm simply trying to convert \u00e6\u0097\u00a5\u00e6\u009c\u00ac into 日本.
I am not trying to convert the JSON file (though I am open to any suggestions that would require it).


(For the record I am using the variable $str for \u00e6\u0097\u00a5\u00e6\u009c\u00ac)


The following attempts resulted in no visible change; the output was still \u00e6\u0097\u00a5\u00e6\u009c\u00ac


echo call_user_func_array('mb_convert_encoding', array(&$str,'HTML-ENTITIES','UTF-8'));
echo iconv('ASCII', 'UTF-8', $str);
echo iconv("UTF-8", "CP1252", $str);
echo iconv('UTF-8', 'ISO-8859-1', $str);
echo iconv('UTF-8', 'UTF-8//IGNORE', utf8_encode($str));
echo iconv('ISO-8859-1', 'UTF-8', $str);
echo iconv('ISO-8859-9', 'UTF-8', $str);
echo iconv(mb_detect_encoding($str, mb_detect_order(), true), "UTF-8", $str);
echo htmlentities($str);
echo mb_convert_encoding($str, 'utf-8', 'iso-8859-1');
echo mb_convert_encoding($str, "EUC-JP", "auto");
echo mb_convert_encoding($str, "utf-8", "windows-1251");
echo mb_convert_encoding($str, "windows-1251", "utf-8");
echo mb_convert_encoding($str,'HTML-ENTITIES', 'UTF-8');
echo mb_convert_encoding($str,"UTF-8","auto");
echo mb_convert_encoding($str,"UTF-8");
echo mb_convert_encoding($str, 'UTF-8', array('EUC-JP', 'SHIFT-JIS', 'AUTO'));
echo mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
echo mb_convert_encoding($str, "UTF-8", mb_detect_encoding($str, "UTF-8, ISO-8859-1, ISO-8859-15", true));
echo mb_convert_encoding($str, "ISO-8859-1", mb_detect_encoding($str, "UTF-8, ISO-8859-1, ISO-8859-15", true));
echo mb_convert_encoding($str, 'ISO-8859-1', 'UTF-8');
echo utf8_decode($str);
echo utf8_encode($str);

The following attempts resulted in the backslashes being doubled and double quotation marks being added, "\\u00e6\\u0097\\u00a5\\u00e6\\u009c\\u00ac"


echo json_encode($str,JSON_HEX_TAG);
echo json_encode($str,JSON_UNESCAPED_UNICODE |JSON_PRETTY_PRINT);
echo json_encode($str,JSON_UNESCAPED_UNICODE|JSON_UNESCAPED_SLASHES);

The following attempts resulted in nothing being returned,


echo json_decode($str, JSON_HEX_TAG);
echo json_decode($str, false);
echo json_decode($str, false, 512, JSON_UNESCAPED_UNICODE);
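A possible reason for the empty result: the bare escape text is not a JSON document on its own, so json_decode() returns null. JSON_HEX_TAG and JSON_UNESCAPED_UNICODE are json_encode() options, so they would not help with decoding. The sketch below wraps the text in double quotes to make it a valid JSON string, then maps the resulting code points back to single bytes. It assumes the export wrote each UTF-8 byte as its own \u00XX escape, which the sample data suggests but which is not confirmed here.

```php
<?php
// Assumption: $str holds the raw escape text, as in the attempts above.
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';

// The bare text is not valid JSON, so this yields null.
$bare = json_decode($str);

// Wrapped in double quotes it is a valid JSON string literal.
$decoded = json_decode('"' . $str . '"');

// Each \u00XX became one code point (U+00E6, U+0097, ...), which reads as mojibake.
// Mapping those code points back to single bytes gives the UTF-8 bytes of the text.
$fixed = mb_convert_encoding($decoded, 'ISO-8859-1', 'UTF-8');

var_dump($bare);
echo $fixed, "\n";
```

The conversion direction looks backwards, but the aim is only to collapse every code point below U+0100 into one byte. The bytes that come out are then read as UTF-8.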

The following attempt resulted in the slashes changing to an unknown character, �_u00e6�_u0097�_u00a5�_u00e6�_u009c�_u00ac


echo mb_convert_encoding($str, "SJIS");

From the PHP documentation I tried this to see if any of the combinations would work, but none did.
https://www.php.net/manual/en/function.mb-convert-encoding.php#97902


foreach(mb_list_encodings() as $chr){
echo mb_convert_encoding($str, 'UTF-8', $chr)." : ".$chr."<br>";
}
echo "<br>--- REVERSE TRY ---<br><br>";
foreach(mb_list_encodings() as $chr){
echo mb_convert_encoding($str, $chr, 'UTF-8')." : ".$chr."<br>";
}

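I tried using the Unicode Codepoint Escape Syntax, but that gave mojibake (æ¥æ¬) rather than 日本, because each \u{...} escape is written out as the UTF-8 encoding of that code point rather than as a single byte.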
https://www.php.net/manual/en/migration70.new-features.php#migration70.new-features.unicode-codepoint-escape-syntax


echo "\u{00e6}\u{0097}\u{00a5}\u{00e6}\u{009c}\u{00ac}";
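To see what the two literal syntaxes actually emit, it may help to compare their bytes. The variable names are illustrative only.

```php
<?php
// Each \u{..} escape is emitted as the UTF-8 encoding of that code point.
$viaCodepoints = "\u{00e6}\u{0097}\u{00a5}\u{00e6}\u{009c}\u{00ac}";

// Each \x.. escape is emitted as exactly one byte.
$viaBytes = "\xE6\x97\xA5\xE6\x9C\xAC";

$codepointHex = bin2hex($viaCodepoints);
$byteHex      = bin2hex($viaBytes);

echo $codepointHex, "\n"; // c3a6c297c2a5c3a6c29cc2ac
echo $byteHex, "\n";      // e697a5e69cac
```

The first string is twelve bytes long and the second is six. Only the second is the UTF-8 encoding of 日本.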

As mentioned in the brackets earlier, \xE6\x97\xA5\xE6\x9C\xAC does convert to 日本 when echoed.


echo "\xE6\x97\xA5\xE6\x9C\xAC";

Noticing above that the two different codes had the same endings, I tried using str_replace so that they would match, but this time the literal text \xE6\x97\xA5\xE6\x9C\xAC was echoed.


echo str_replace("\U00","\x",strtoupper($str));
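The \x escapes are only interpreted when they appear in PHP source code, so a string built at runtime stays as text. One way to have them interpreted at runtime is stripcslashes(), which understands C-style escapes including \xHH. This is a sketch for the short sample only: stripcslashes() would also rewrite any other backslash sequences in the input.

```php
<?php
// Assumption: $str holds the raw escape text, assigned with single quotes.
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';

// Still only text at this point: backslash, x, two hex digits, repeated.
$asText = str_replace('\u00', '\x', $str);

// stripcslashes() turns each \xHH sequence into the byte it names.
$asBytes = stripcslashes($asText);

echo $asBytes, "\n";
```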

I have also tried all of the above with and without the following:


header('Content-Type: text/plain; charset="UTF-8"');

Here is a segment of the original JSON file (original file is 13k lines, so here is a single element).


{
"media": [
{
"uri": "media/posts/202104/175127092_241529264421003_4026764649651789139_n_18106766305234668.jpg",
"creation_timestamp": 1619277565,
"title": "Time to head back to Tokyo.\nFukuoka Airport, Japan.\n18 October 2020\n.\n.\n.\n.\n.\n#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac #toyphotography #toy #\u00e3\u0081\u008a\u00e3\u0082\u0082\u00e3\u0081\u00a1\u00e3\u0082\u0083 #\u00e3\u0083\u00ad\u00e3\u0083\u009b\u00e3\u0082\u0099\u00e3\u0083\u0083\u00e3\u0083\u0088 #GodJesusRobot #robot #toyholiday #holiday #vacation #\u00e6\u0097\u0085\u00e8\u00a1\u008c #photography #\u00e5\u0086\u0099\u00e7\u009c\u009f #japan_of_insta #japantravel #\u00e6\u0097\u00a5\u00e6\u009c\u00ac\u00e6\u0097\u0085\u00e8\u00a1\u008c #travel #kitakyushu #\u00e5\u008c\u0097\u00e4\u00b9\u009d\u00e5\u00b7\u009e #airport #\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #fukuokaairport #\u00e7\u00a6\u008f\u00e5\u00b2\u00a1\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #plane #airplane #aeroplane #\u00e9\u00a3\u009b\u00e8\u00a1\u008c\u00e6\u00a9\u009f #windowseat #window"
}
]
}

UPDATE


Based on the comments by @jerry and @yourcommonsense, hex2bin can work, provided the string is first reduced to plain hex by dropping each \u00 prefix. hex2bin(str_replace('\u00', '', $str)); will definitely work for the string mentioned in the TLDR and the upper part of the question, but to tackle the full title string in the JSON I've come up with a very ugly and messy method.
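A small sketch of why the one-liner suits the short sample but not the full title: hex2bin() only accepts a pure hex string, so any ordinary text mixed in makes it fail. The input for $mixed is made up for illustration.

```php
<?php
// Assumption: $str holds the raw escape text, assigned with single quotes.
$str = '\u00e6\u0097\u00a5\u00e6\u009c\u00ac';

// Dropping every "\u00" prefix leaves a pure hex string: e697a5e69cac
$hex    = str_replace('\u00', '', $str);
$result = hex2bin($hex);

// With ordinary text mixed in, the input is no longer pure hex and hex2bin()
// returns false (the @ only silences the accompanying warning).
$mixed = @hex2bin(str_replace('\u00', '', '#japan #' . $str));

echo $result, "\n";
var_dump($mixed);
```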


$str = "Time to head back to Tokyo.\nFukuoka Airport, Japan.\n18 October 2020\n.\n.\n.\n.\n.\n#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac #toyphotography #toy #\u00e3\u0081\u008a\u00e3\u0082\u0082\u00e3\u0081\u00a1\u00e3\u0082\u0083 #\u00e3\u0083\u00ad\u00e3\u0083\u009b\u00e3\u0082\u0099\u00e3\u0083\u0083\u00e3\u0083\u0088  #GodJesusRobot #robot #toyholiday  #holiday #vacation #\u00e6\u0097\u0085\u00e8\u00a1\u008c #photography  #\u00e5\u0086\u0099\u00e7\u009c\u009f #japan_of_insta #japantravel #\u00e6\u0097\u00a5\u00e6\u009c\u00ac\u00e6\u0097\u0085\u00e8\u00a1\u008c #travel #kitakyushu #\u00e5\u008c\u0097\u00e4\u00b9\u009d\u00e5\u00b7\u009e #airport #\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #fukuokaairport #\u00e7\u00a6\u008f\u00e5\u00b2\u00a1\u00e7\u00a9\u00ba\u00e6\u00b8\u00af #plane #airplane #aeroplane #\u00e9\u00a3\u009b\u00e8\u00a1\u008c\u00e6\u00a9\u009f #windowseat #window";
$pattern = '/(\\\\u00..)+/i';

function getHex2Bin($matches) {
return hex2bin(str_replace("\U00","",strtoupper($matches[0])));
}

$result = preg_replace_callback($pattern, 'getHex2Bin', $str);
echo $result;

This does work, giving me my desired result:
Time to head back to Tokyo. Fukuoka Airport, Japan. 18 October 2020 . . . . . #japan #日本 #toyphotography #toy #おもちゃ #ロボット #GodJesusRobot #robot #toyholiday #holiday #vacation #旅行 #photography #写真 #japan_of_insta #japantravel #日本旅行 #travel #kitakyushu #北九州 #airport #空港 #fukuokaairport #福岡空港 #plane #airplane #aeroplane #飛行機 #windowseat #window

But I can't help feeling that there is a much neater solution.
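One possibly neater route is to let json_decode() handle the escapes for the whole document and then repair every string value in one pass. This is a sketch, not a confirmed fix. The helper name fixInstagramText is hypothetical, the JSON is a cut-down copy of the element above, and the approach assumes every non-ASCII character in the export was written byte by byte as \u00XX. Any genuine code point above U+00FF would be lost in the conversion.

```php
<?php
// Hypothetical helper name; not from the question or any library.
function fixInstagramText(string $text): string
{
    // Map each code point U+0000..U+00FF back to the single byte it stands for.
    return mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
}

// Cut-down sample of the element from the question (nowdoc keeps backslashes as-is).
$json = <<<'JSON'
{
  "media": [
    {
      "creation_timestamp": 1619277565,
      "title": "Fukuoka Airport, Japan.\n#japan #\u00e6\u0097\u00a5\u00e6\u009c\u00ac #\u00e6\u0097\u0085\u00e8\u00a1\u008c"
    }
  ]
}
JSON;

$data = json_decode($json, true);

// Repair every string value in place; integers such as timestamps are left alone.
array_walk_recursive($data, function (&$value): void {
    if (is_string($value)) {
        $value = fixInstagramText($value);
    }
});

echo $data['media'][0]['title'], "\n";
```

For the real export, $json would come from file_get_contents() on the downloaded file instead of the inline sample.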


Update 2


Here is a PHP Sandbox showing the results of all attempts mentioned above, including the messy working one.



My Answer:
When I decode the JSON in PHP, the Japanese text is displayed as \u00e6\u0097\u00a5\u00e6\u009c\u00ac instead of the actual characters 日本.

I have tried using json_decode with the JSON_UNESCAPED_UNICODE flag, but it doesn't seem to work. I have also tried using utf8_decode and utf8_encode, but they also don't seem to work.

Here is a snippet of my code:

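<?php
$json = file_get_contents('data.json');
$data = json_decode($json, true);

// In the JSON segment shown in the question, entries live under "media"
// and the caption text is stored in "title".
foreach ($data['media'] as $post) {
    echo $post['title'] . "\n";
}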


Is there a way to properly decode the Japanese text so that it displays correctly as 日本? Any help would be greatly appreciated. Thank you!

© 2024 Hayatsk.info - Personal Blogs Platform. All Rights Reserved.