PHP DOMDocument loadHTML not encoding UTF-8 correctly

Question

I'm trying to parse some HTML using DOMDocument, but when I do, I suddenly lose my encoding (at least that is how it appears to me).

$profile = "<div><p>various japanese characters</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile); 

$divs = $dom->getElementsByTagName('div');

foreach ($divs as $div) {
    echo $dom->saveHTML($div);
}

The result of this code is that I get a bunch of characters that are not Japanese. However, if I do:

echo $profile;

it displays correctly. I've tried saveHTML and saveXML, and neither displays correctly. I am using PHP 5.3.

What I see:

ã¤ãªãã¤å·ã·ã«ã´ã«ã¦ãã¢ã¤ã«ã©ã³ãç³»ã®å®¶åºã«ã9äººåå¼ã®5çªç®ã¨ãã¦çã¾ãããå½¼ãå«ãã¦4äººãä¿³åªã«ãªã£ããç¶è¦ªã¯æ¨æã®ã»ã¼ã«ã¹ãã³ã§ãæ¯è¦ªã¯éµä¾¿å±ã®å®¢å®¤ä¿ã ã£ããé«æ ¡æä»£ã¯ãã£ãã£ã®ã¢ã«ãã¤ãã«å¤ãã¿ãæè²è³éãåããªããã«ããªãã¯ç³»ã®é«æ ¡ã¸é²å¦ã

What should be shown:

イリノイ州シカゴにて、アイルランド系の家庭に、9人兄弟の5番目として生まれる。彼を含めて4人が俳優になった。父親は木材のセールスマンで、母親は郵便局の客室係だった。高校時代はキャディのアルバイトに勤しみ、教育資金を受けながらカトリック系の高校へ進学

EDIT: I've simplified the code down to five lines so you can test it yourself.

$profile = "<div lang=ja><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>";
$dom = new DOMDocument();
$dom->loadHTML($profile);
echo $dom->saveHTML();
echo $profile;

Here is the html that is returned:

<div lang="ja"><p>ã‚¤ãƒªãƒŽã‚¤å·žã‚·ã‚«ã‚´ã«ã¦ã€ã‚¢ã‚¤ãƒ«ãƒ©ãƒ³ãƒ‰ç³»ã®å®¶åºã«ã€</p></div>
<div lang="ja"><p>イリノイ州シカゴにて、アイルランド系の家庭に、</p></div>

Thanks. I checked all those and nothing helped. I don't get ????, but some other strange text. I'll try to paste it here, but don't know how the site will display it. — Slightly A., Commented Nov 21, 2011 at 21:03
Tried with no success. Returned the same characters as before. — Slightly A., Commented Nov 21, 2011 at 21:22

cmbuckley · Accepted Answer · 2025-01-09 14:55:22Z

Firstly, DOMDocument uses an HTML4 parser. If you're loading HTML5, you should probably be using Dom\HTMLDocument::createFromString with PHP 8.4+.
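As a rough sketch of the PHP 8.4+ route mentioned above: Dom\HTMLDocument::createFromString uses an HTML5 parser and, to my understanding, assumes UTF-8 when the input declares no encoding. The helper name firstParagraphText is made up for illustration, and LIBXML_NOERROR is passed only to silence the warnings that a fragment without a doctype triggers.

```php
<?php
// Hypothetical helper: parse a UTF-8 fragment with the HTML5 parser (PHP 8.4+)
// and return the text of its first <p>, or null when the class is unavailable.
function firstParagraphText(string $html): ?string
{
    if (!class_exists('Dom\HTMLDocument')) {
        return null; // older PHP: fall back to one of the DOMDocument workarounds
    }
    $dom = Dom\HTMLDocument::createFromString($html, LIBXML_NOERROR);
    return $dom->querySelector('p')?->textContent;
}

$profile = '<div lang=ja><p>イリノイ州シカゴにて、</p></div>';
var_dump(firstParagraphText($profile));
```

On PHP versions before 8.4 the helper simply returns null, so the remaining workarounds in this answer still apply there.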

DOMDocument::loadHTML will treat your string as being in ISO-8859-1 (the HTTP/1.1 default character set) unless you tell it otherwise. This results in UTF-8 strings being interpreted incorrectly.
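To see what that misinterpretation does to the bytes, here is a small sketch that does not involve DOMDocument at all. It assumes the mbstring extension is loaded and only imitates the effect: UTF-8 bytes are read as if each byte were one ISO-8859-1 character.

```php
<?php
// "é" is the two bytes C3 A9 in UTF-8.
$utf8 = "\xC3\xA9";

// Pretend those bytes are ISO-8859-1 and re-encode them as UTF-8,
// roughly what happens when the parser assumes the wrong charset.
$misread = mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');

echo bin2hex($utf8), "\n";    // c3a9
echo bin2hex($misread), "\n"; // c383c2a9, which displays as "Ã©"
```

The same doubling happens to every multi-byte character, which is why the Japanese text in the question turns into long runs of accented Latin letters.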

If you're dealing with simple snippets of (X)HTML, you could prepend an XML encoding declaration or a meta charset declaration to cause the string to be treated as UTF-8:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();

// This version preserves the original characters
$contentType = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">';
$dom->loadHTML($contentType . $profile);
echo $dom->saveHTML();

// This version will HTML-encode high-ASCII bytes
$dom->loadHTML('<meta charset="utf8">' . $profile);
echo $dom->saveHTML();

// This version will also HTML-encode high-ASCII bytes,
// and won't work for LIBXML_DOTTED_VERSION >= 2.12.0
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

If you cannot know if the HTML will already contain declarations, there's a workaround in SmartDOMDocument which should help you:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8'));
echo $dom->saveHTML();

In PHP 8.2+, you'll get a deprecation warning, so the alternative would be:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML(mb_encode_numericentity($profile, [0x80, 0x10FFFF, 0, ~0], 'UTF-8'));
echo $dom->saveHTML();

(For a better explanation of that rather cryptic array, see here.)
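In short, the array is a conversion map made of groups of four integers: first code point, last code point, offset, and bitmask. Code points inside the range get the offset added, are ANDed with the mask, and are written as numeric entities. A minimal sketch, assuming the mbstring extension is available (the sample string is mine, not from the question):

```php
<?php
// Conversion map: [first code point, last code point, offset, bitmask].
// 0x80..0x10FFFF covers everything outside ASCII; offset 0 and mask ~0
// leave the code point value unchanged.
$map = [0x80, 0x10FFFF, 0, ~0];

$encoded = mb_encode_numericentity("<p>caf\u{E9} \u{30A4}</p>", $map, 'UTF-8');
echo $encoded, "\n"; // <p>caf&#233; &#12452;</p>
```

ASCII characters, including the markup itself, fall outside the range and pass through untouched.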

This is not a great workaround, but since not all characters can be represented in ISO-8859-1 (like these katakana), it's the safest alternative.

Yes, that did it. Thank you for your help. I tried saveHTML, saveXML, didn't think that the problem may have been coming during the load. — Slightly A., Commented Nov 21, 2011 at 21:34
The mb_convert_encoding call worked for me, whereas prepending the encoding declaration didn't. Likely because the document already had a conflicting declaration. Many thanks - saved me a lot of time chasing this down. — Peter Bagnall, Commented Jul 4, 2013 at 12:43
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $content); fixed it for me in PHP7 (so it is still an issue) - this is a really annoying problem, because I defined utf8 in the HTML document (with <meta charset="UTF-8" />) but that has no effect, it seems to need the <?xml part, which is totally unintuitive. — iquito, Commented Apr 20, 2016 at 14:00
Still in 2017 this answer is relevant and worked for me too. I had my database, multibyte, html meta tag and DOM encoding all set to utf8 and still had bad encoding on importing node from one DOC to another. php.net/manual/en/function.mb-convert-encoding.php was the fix. — Louis Loudog Trottier, Commented Mar 6, 2017 at 21:43
Using 'HTML-ENTITIES' is a horrible hack, but in october 2017, that is the only trick that is suggested on this page that works on RHEL7! — Free Radical, Commented Oct 18, 2017 at 14:38

Greeso · Answer · 2023-09-22 16:21:27Z

94

The problem is with saveHTML() and saveXML(): neither works correctly on Unix. They do not save UTF-8 characters correctly on Unix, but they work on Windows.

The workaround is very simple:

If you try the default, you will get the error you described

$str = $dom->saveHTML(); // saves incorrectly

All you have to do is save as follows:

$str = $dom->saveHTML($dom->documentElement); // saves correctly

This line of code will get your UTF-8 characters to be saved correctly. Use the same workaround if you are using saveXML().

Update

As suggested by "Jack M" in the comments section below, and verified by "Pamela" and "Marco Aurélio Deleu", the following variation might work in your case:

$str = utf8_decode($dom->saveHTML($dom->documentElement));

Update 2

utf8_decode is now deprecated. An alternative is mb_convert_encoding(). You have to set it according to your needs.
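As a sketch of what that replacement might look like: utf8_decode() converted UTF-8 to ISO-8859-1, and mb_convert_encoding() with the arguments below performs the same conversion. The target encoding here is an assumption chosen to match utf8_decode(); use whichever encoding your output actually needs. Characters with no ISO-8859-1 equivalent, such as Japanese, cannot survive this conversion.

```php
<?php
$utf8 = "caf\u{E9}"; // "café" as UTF-8 (5 bytes)

// Same direction as the deprecated utf8_decode(): UTF-8 in, ISO-8859-1 out.
$latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

echo bin2hex($latin1), "\n"; // 636166e9
```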

Note

English characters do not cause any problem when you use saveHTML() without parameters, because English characters are stored as single bytes in UTF-8.
The problem happens when you have multi-byte characters (such as Chinese, Russian, Arabic, Hebrew, etc.).

I recommend reading this article: http://coding.smashingmagazine.com/2012/06/06/all-about-unicode-utf8-character-sets/. You will understand how UTF-8 works and why you have this problem. It will take you about 30 minutes, but it is time well spent.

answered Dec 19, 2013 at 7:06 by Greeso; edited Sep 22, 2023 at 16:21

I had to utf8_decode while using this solution. Thanks! – Jack M., Commented Sep 8, 2014 at 23:48
This had to become utf8_decode($dom->saveHTML($dom->documentElement)) to preserve my special characters. Otherwise, they just became something else. Just mentioning it in case it helps someone else. – Jack M., Commented Sep 10, 2014 at 13:52
Thanks @MrJack. I also had to do the same to make it display without the strange characters: $str = utf8_decode($dom->saveHTML($dom->documentElement)); – Pamela, Commented Jan 15, 2016 at 11:34
utf8_decode($dom->saveHTML($dom->documentElement)); did it perfectly for me. – Marco Aurélio Deleu, Commented Oct 20, 2016 at 22:48
@Rounin-StandingwithUkraine Well wow, it is going to be 10 years since I wrote this answer; glad it is still relevant. – Greeso, Commented Jan 31, 2023 at 1:39

Hossein · Answer · 2011-11-21 20:59:06Z

20

Make sure the real source file is saved as UTF-8 (You may even want to try the non-recommended BOM Chars with UTF-8 to make sure).

Also in case of HTML, make sure you have declared the correct encoding using meta tags:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

If it's a CMS (as you've tagged your question with Joomla) you may need to configure appropriate settings for the encoding.

answered Nov 21, 2011 at 20:59 by Hossein

I understand what you're saying, but I have no problems displaying the characters. If I do "echo $profile;" it works fine. It's when the DOMDocument gets ahold of it that it starts failing. – Slightly A., Commented Nov 21, 2011 at 21:08
Your meta prevents saveHTML from encoding everything above ASCII into entities. The solution I was looking for :) – sod, Commented Jun 28, 2013 at 13:32
As a side note, the newer <meta charset="UTF-8"> tag doesn't work with DOMDocument. – Taylan, Commented Oct 16, 2015 at 15:23
@Taylan: no problem at all with <meta charset="UTF-8">: see 3v4l.org/AATjh – Casimir et Hippolyte, Commented Oct 17, 2020 at 19:50

user8972079 (edited by Dharman) · Answer · 2021-12-23 16:27:51Z

19

This took me a while to figure out but here's my answer.

Before using DomDocument I would use file_get_contents to retrieve URLs and then process them with string functions. Perhaps not the best way but quick. After being convinced Dom was just as quick I first tried the following:

$dom = new DomDocument('1.0', 'UTF-8');
if ($dom->loadHTMLFile($url) == false) { // read the url
    // error message
}
else {
    // process
}

This failed spectacularly in preserving UTF-8 encoding despite the proper meta tags, PHP settings, and all the rest of the remedies offered here and elsewhere. Here's what works:

$dom = new DomDocument('1.0', 'UTF-8');
$str = file_get_contents($url);
if ($dom->loadHTML(mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8')) == false) {
    // error message
}

etc. Now everything's right with the world.

answered Nov 20, 2017 at 15:57 by user8972079; edited Dec 23, 2021 at 16:27 by Dharman

Just wanted to add to my answer above that another way to address this is with the following, suggested elsewhere as well: if ($dom->loadHTML('<?xml encoding="UTF-8">' . $str) == false). After posting my answer I found an occasion where my first suggestion failed but the second worked. – user8972079, Commented Nov 20, 2017 at 16:14
Works for me even without the params in DomDocument('1.0', 'UTF-8'). But in my case only partial html is loaded. – JKB, Commented Jun 17, 2020 at 13:59
thanks a lot man, worked for me dealing with hebrew 👍 – Sagive, Commented Dec 28, 2021 at 10:29
PHP8.2: Deprecated: mb_convert_encoding(): Handling HTML entities via mbstring is deprecated; use htmlspecialchars, htmlentities, or mb_encode_numericentity/mb_decode_numericentity instead – mickmackusa, Commented Sep 22, 2023 at 10:59

Jan Turoň · Answer · 2021-04-05 00:23:35Z

13

Use correct header for UTF-8

Don't get satisfied by "it works".

@cmbuckley, in his accepted answer, advised prepending <?xml encoding="utf-8" ?> to the document. However, using an XML declaration in an HTML document is a bit odd. HTML is not XML (unless it is XHTML), and the declaration can confuse browsers and other software on the way to the client (which may be the source of the failures reported by others).

I successfully used HTML5 declaration:

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<!DOCTYPE html><meta charset="UTF-8">' . $profile);
echo $dom->saveHTML();

If you use another standard, use the correct header for it. DOMDocument follows the standards quite pedantically and seems to support HTML5, too (if not in your case, try updating the libxml extension).

answered Apr 4, 2021 at 22:21 by Jan Turoň; edited Apr 5, 2021 at 0:23

There is no support for HTML5 in PHP, unfortunately, because libxml doesn't support it. You'd get the same results with <!DOCTYPE alsfjaswrtoiufn>, i.e. it would just output whatever you typed. – miken32, Commented Dec 23, 2021 at 18:00
I'm running PHP 8.1.0 on Windows and adding only the tag <meta charset="UTF-8"> works fine for me. No need to use <html> or <!DOCTYPE...> – MMJ, Commented Mar 25, 2022 at 22:24
The html5 charset meta, with multi-byte input, would output every char HTML-entity encoded. Confirm the output with mb_strlen, or just view the page source. Use the html4 <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> for the html4 parser. – DWB, Commented Feb 14, 2024 at 21:14

Ivan (edited by trincot) · Answer · 2015-11-22 09:49:38Z

12

You could prefix a line enforcing utf-8 encoding, like this:

@$doc->loadHTML('<?xml version="1.0" encoding="UTF-8"?>' . "\n" . $profile);

And you can then continue with the code you already have, like:

$doc->saveXML()

answered Nov 22, 2015 at 5:42 by Ivan; edited Nov 22, 2015 at 9:49 by trincot

Alexander Goncharov · Answer · 2016-06-15 11:57:19Z

6

Use this for a correct result:

$dom = new DOMDocument();
$dom->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $profile);
echo $dom->saveHTML();
echo $profile;

This operation

mb_convert_encoding($profile, 'HTML-ENTITIES', 'UTF-8');

is a bad approach, because $profile may already contain entities such as &lt; and &gt;, and mb_convert_encoding will not encode them a second time. That opens a hole for XSS and incorrect HTML.

answered Jun 15, 2016 at 11:57 by Alexander Goncharov

Can you elaborate on "they will not convert twice after mb_convert_encoding"? – Motivated, Commented Nov 7, 2020 at 19:51

Lazaros Kosmidis (edited by cmbuckley) · Answer · 2014-11-26 16:53:00Z

5

You must feed DOMDocument a version of your HTML with a header that makes sense, just like HTML5.

$profile ='<?xml version="1.0" encoding="'.$_encoding.'"?>'. $html;

It may be a good idea to keep your HTML as valid as you can, so you don't run into issues when you start querying it :-) And stay away from htmlentities! That's an unnecessary back and forth that wastes resources. Keep your code sane!

answered Jun 4, 2013 at 7:25 by Lazaros Kosmidis; edited Nov 26, 2014 at 16:53 by cmbuckley

This is more or less part of the accepted answer. – Dwza, Commented Sep 6, 2021 at 12:31

mMo · Answer · 2016-06-23 13:47:30Z

4

Works fine for me:

$dom = new \DOMDocument;
$dom->loadHTML(utf8_decode($html));
...
return  utf8_encode( $dom->saveHTML());

answered Jun 23, 2016 at 13:47 by mMo

Be careful, utf8_decode may lose information (replaced with a ?) – jwal, Commented Sep 14, 2017 at 16:42

Luke Madhanga · Answer · 2019-11-08 13:38:51Z

The only thing that worked for me was the accepted answer of

$profile = '<p>イリノイ州シカゴにて、アイルランド系の家庭に、9</p>';
$dom = new DOMDocument();
$dom->loadHTML('<?xml encoding="utf-8" ?>' . $profile);
echo $dom->saveHTML();

HOWEVER

This brought about new issues, of having <?xml encoding="utf-8" ?> in the output of the document.

The solution for me was then to do

// Copy the live node list first so that removing nodes does not skip any
foreach (iterator_to_array($dom->childNodes) as $xx) {
    if ($xx instanceof \DOMProcessingInstruction) {
        $xx->parentNode->removeChild($xx);
    }
}

Some solutions told me that to remove the xml header, that I had to perform

$dom->saveXML($dom->documentElement);

This didn't work for me because, for a partial document (e.g. a doc with two <p> tags), only one of the <p> tags was being returned.

The best solution IMHO. – Furgas, Commented May 9, 2024 at 18:25

copndz (edited by xKobalt) · Answer · 2020-07-17 20:19:53Z

-1

The problem is that when you pass a parameter to DOMDocument::saveHTML(), you lose the encoding. In a few cases, you'll need to avoid the parameter and use old string functions to find what you are looking for.

I think the previous answer works for you, but since this workaround didn't work for me, I'm adding that answer to help people who may be in my case.

answered May 16, 2013 at 11:14 by copndz; edited Jul 17, 2020 at 20:19 by xKobalt
