John Davidson

dom - Load UTF-8 encoded HTML site with PHP

Message:


I crawl some information from a website. To do that, I create a new DOM document and load the website with loadHTMLFile.
Now I have the problem that the website I'm crawling uses accented characters (French accents). I've read that loadHTMLFile doesn't use UTF-8 encoding by default.


So I've tried to add UTF-8 encoding manually, but it doesn't work. The accented characters still don't show correctly.


For example, the letter ì (with an accent) is shown as %C3%AC. Words without accents are shown correctly.


This is the complete code:


header('Content-Type: text/html; charset=utf-8');

foreach (range(0, 50) as $number) {

    $url = 'https://www.xxyyy.com/' . $number . '';
    $dom = new DOMDocument('1.0', 'utf-8');
    $dom->loadHTMLFile(mb_convert_encoding($url, 'HTML-ENTITIES', 'UTF-8'));
    $dom->substituteEntities = true;
    $xpath = new DOMXPath($dom);

    $content = $xpath->query("//a[contains(@href,'character')]/@href");

    foreach ($content as $node) {

        echo $node->nodeValue;
    }
}


My Answer:
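The header() call only sets the charset of the page your script outputs; it has no effect on how DOMDocument parses the remote HTML. loadHTML and loadHTMLFile typically fall back to ISO-8859-1 when the markup does not declare its encoding, so the encoding has to be stated in the document you hand to the parser. Here's an updated version of your code: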

php
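header('Content-Type: text/html; charset=utf-8');

// Collect parser warnings from malformed real-world HTML instead of printing them.
libxml_use_internal_errors(true);

foreach (range(0, 50) as $number) {
    $url = 'https://www.xxyyy.com/' . $number;
    $html = file_get_contents($url);
    if ($html === false) {
        continue; // skip pages that could not be fetched
    }

    $dom = new DOMDocument();
    // The XML declaration prefix makes libxml read the markup as UTF-8.
    $dom->loadHTML('<?xml encoding="UTF-8">' . $html);
    // Only affects serialization (saveHTML/saveXML), not parsing.
    $dom->encoding = 'UTF-8';
    $xpath = new DOMXPath($dom);

    $content = $xpath->query("//a[contains(@href,'character')]/@href");

    foreach ($content as $node) {
        // href values are percent-encoded (e.g. %C3%AC), so decode them for display.
        echo rawurldecode($node->nodeValue), "\n";
    }
}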


In this updated code, we use `file_get_contents` to fetch the HTML and then load the resulting string into the DOMDocument. Note that `%C3%AC` is not a charset failure: it is the URL percent-encoding of the two UTF-8 bytes of ì, which is how non-ASCII characters are normally written inside an href. To print the readable character, pass the attribute value through `rawurldecode()`.
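
To check both points without hitting the network, here is a minimal sketch that parses an inline HTML string. The sample markup, the `/character/Tin%C3%AC` path and the `$links` array are made up for illustration and do not come from the real site; the file is assumed to be saved as UTF-8.

```php
<?php
// Hypothetical sample markup standing in for a fetched page.
$html = '<html><body><a href="/character/Tin%C3%AC">Tinì</a></body></html>';

$dom = new DOMDocument();
// Prefixing an XML declaration makes libxml treat the input bytes as UTF-8.
$dom->loadHTML('<?xml encoding="UTF-8">' . $html);

$xpath = new DOMXPath($dom);
$links = [];
foreach ($xpath->query("//a[contains(@href,'character')]") as $a) {
    $links[] = [
        // Percent-decoding turns %C3%AC back into the character ì.
        'href' => rawurldecode($a->getAttribute('href')),
        // Link text is decoded by the parser, so no extra step is needed here.
        'text' => $a->textContent,
    ];
}

print_r($links);
```

The link text depends on the parser's encoding, while the href depends only on percent-decoding, so the two problems can be checked separately.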

© 2024 Hayatsk.info - Personal Blogs Platform. All Rights Reserved.