html_entity_decode
(PHP 4 >= 4.3.0, PHP 5)
html_entity_decode -- Convert all HTML entities to their applicable characters
Description
string html_entity_decode ( string string [, int quote_style [, string charset]] )
html_entity_decode() is the opposite of htmlentities(): it converts all HTML entities in string to their applicable characters.
The optional quote_style argument specifies how 'single' and "double" quotes are handled. Its value is one of the following three constants (the default is ENT_COMPAT):
Table 1. quote_style constants
| Constant name | Description |
|---|---|
| ENT_COMPAT | Converts double quotes and leaves single quotes unchanged. |
| ENT_QUOTES | Converts both double and single quotes. |
| ENT_NOQUOTES | Leaves both double and single quotes unchanged. |
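The effect of the three constants can be sketched as follows. This is a minimal illustration rather than part of the original manual text; the sample string is made up, and the charset is passed explicitly as 'ISO-8859-1' only to match the documented default.

```php
<?php
// Input containing an encoded double quote and an encoded single quote.
$encoded = '&quot;double&quot; and &#039;single&#039;';

// ENT_COMPAT: double quotes are decoded, single quotes are left encoded.
$compat = html_entity_decode($encoded, ENT_COMPAT, 'ISO-8859-1');

// ENT_QUOTES: both kinds of quotes are decoded.
$quotes = html_entity_decode($encoded, ENT_QUOTES, 'ISO-8859-1');

// ENT_NOQUOTES: neither kind of quote is decoded.
$noquotes = html_entity_decode($encoded, ENT_NOQUOTES, 'ISO-8859-1');

echo $compat, "\n";   // "double" and &#039;single&#039;
echo $quotes, "\n";   // "double" and 'single'
echo $noquotes, "\n"; // &quot;double&quot; and &#039;single&#039;
```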
The optional third argument charset defines the character set used in the conversion. The default character set is ISO-8859-1.
The following character sets are supported as of PHP 4.3.0.
Table 2. Supported charsets
| Charset | Aliases | Description |
|---|---|---|
| ISO-8859-1 | ISO8859-1 | Western European, Latin-1. |
| ISO-8859-15 | ISO8859-15 | Western European, Latin-9. Adds the euro sign and French and Finnish letters missing in Latin-1 (ISO-8859-1). |
| UTF-8 | | 8-bit Unicode encoding, compatible with ASCII. |
| cp866 | ibm866, 866 | DOS-specific Cyrillic charset. Supported in 4.3.2. |
| cp1251 | Windows-1251, win-1251, 1251 | Windows-specific Cyrillic charset. Supported in 4.3.2. |
| cp1252 | Windows-1252, 1252 | Windows-specific charset for Western European. |
| KOI8-R | koi8-ru, koi8r | Russian charset. Supported in 4.3.2. |
| BIG5 | 950 | Traditional Chinese, mainly used in Taiwan. |
| GB2312 | 936 | Simplified Chinese, national standard character set. |
| BIG5-HKSCS | | Big5 with Hong Kong extensions. |
| Shift_JIS | SJIS, 932 | Japanese charset. |
| EUC-JP | EUCJP | Japanese charset. |
Note:
Charsets not listed above are not supported, and ISO-8859-1 is used in their place.
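To make the role of charset concrete, the sketch below decodes the same entity into two of the charsets from Table 2 and prints the resulting bytes. The sample string is an assumption chosen for illustration.

```php
<?php
// The same named entity decoded under two different target charsets.
$encoded = 'caf&eacute;';

// ISO-8859-1: the decoded character is the single byte 0xE9.
$latin1 = html_entity_decode($encoded, ENT_QUOTES, 'ISO-8859-1');

// UTF-8: the decoded character is the two-byte sequence 0xC3 0xA9.
$utf8 = html_entity_decode($encoded, ENT_QUOTES, 'UTF-8');

echo bin2hex($latin1), "\n"; // 636166e9
echo bin2hex($utf8), "\n";   // 636166c3a9
```

The charset passed here should match the encoding in which the page is finally served, otherwise the decoded bytes are displayed as the wrong characters.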
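Example 1. Decoding HTML entities
<?php
$orig = "I'll \"walk\" the <b>dog</b> now";
$a = htmlentities($orig);
$b = html_entity_decode($a);
// Output shown for the default quote_style, ENT_COMPAT.
echo $a; // I'll &quot;walk&quot; the &lt;b&gt;dog&lt;/b&gt; now
echo $b; // I'll "walk" the <b>dog</b> now

// For versions before PHP 4.3.0, the same result can be obtained like this:
function unhtmlentities($string)
{
    $trans_tbl = get_html_translation_table(HTML_ENTITIES);
    $trans_tbl = array_flip($trans_tbl);
    return strtr($string, $trans_tbl);
}
$c = unhtmlentities($a);
echo $c; // I'll "walk" the <b>dog</b> now
?>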
Note:
It may seem strange that the result of
trim(html_entity_decode('&nbsp;')); is not an empty string.
The reason is that '&nbsp;' is converted not to the character with
ASCII code 32 (which trim() strips) but to the character with
code 160 (0xa0) in the default ISO-8859-1 charset.
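The behaviour described in this note can be reproduced with a short sketch. The explicit character list passed to trim() is one possible workaround, not something the manual prescribes, and it is only safe for single-byte charsets such as ISO-8859-1.

```php
<?php
// Decode a non-breaking space into the ISO-8859-1 charset.
$decoded = html_entity_decode('&nbsp;', ENT_QUOTES, 'ISO-8859-1');

// The result is the single byte 0xA0, which trim() does not strip by default.
$afterTrim = trim($decoded);

// Workaround: pass trim()'s default characters plus 0xA0 explicitly.
$stripped = trim($decoded, " \t\n\r\0\x0B\xA0");

echo bin2hex($afterTrim), "\n"; // a0
echo strlen($stripped), "\n";   // 0
```

In UTF-8 the same entity decodes to the two bytes 0xC2 0xA0, so a byte-wise character list like the one above should not be reused there.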
See also htmlentities(),
htmlspecialchars(),
get_html_translation_table(),
and urldecode().
User Contributed Notes
html_entity_decode
Matt Robinson
22-Oct-2007 11:11
Bafflingly, html_entity_decode() only converts the 100 most common named entities, whereas the HTML 4.01 Recommendation lists over 250. This wrapper function converts all known named entities to numeric ones before handing over to the original html_entity_decode, and hopefully isn't too insufferably slow (am I right in thinking that making the conversion table static will prevent it being reinitialised on each call?)
Unfortunately it's just a little too long for this documentation. You can see the code at http://www.lazycat.org/software/html_entity_decode_full.phps
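On the question about static: a static local variable is initialised once and keeps its value between calls, so a table built inside the function is not rebuilt. The sketch below makes that observable; the function name decode_table() and the build counter are hypothetical and are not taken from the linked code.

```php
<?php
// Hypothetical helper: builds the entity => character table once and reuses it.
function decode_table(int &$builds): array
{
    // Static locals survive between calls to the same function.
    static $table = null;

    if ($table === null) {
        $builds++; // counts how often the table is actually built
        $table = array_flip(
            get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8')
        );
    }

    return $table;
}

$builds = 0;
decode_table($builds);
decode_table($builds);
$table = decode_table($builds);

echo $builds, "\n";         // 1
echo $table['&amp;'], "\n"; // &
```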
Hayley Watson
01-Oct-2007 03:15
To go further with Fabian's comment:
The XML specification (production 66) says that (decimal) numeric character references start with '&#', followed by one or more digits [0-9], and end with a ';' - just as the documented regular expression states. Hex references start with "&#x" and the allowed digits are [0-9a-fA-F].
And indeed, &#039; is a legitimate reference for an apostrophe (but don't tell Internet Explorer).
So Fabian's alteration to the expression is necessary. It's still insufficient, however, as chr() does not handle multibyte characters such as "€".
Hayley Watson
01-Oct-2007 02:54
Fabian's observation that chr(039) returns "a heart character" is explained by the fact that numeric literals that start with '0' are interpreted in base 8, which doesn't have a digit '9'. So 039==3 and hence chr(039) is equivalent to chr(3), NOT chr(39).
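A small sketch of the base-8 rule, using a valid octal literal because PHP 7 and later reject a literal such as 039 with a parse error instead of truncating it. Digits captured from text are strings, so they can be converted in base 10 regardless of leading zeros.

```php
<?php
// A leading 0 makes an integer literal octal: 047 is 4*8 + 7 = 39.
$octalLiteral = 047;

// Digits taken from text are strings; casting them to int is always base 10.
$fromString = (int) '039';

// Leading zeros can also be removed explicitly before the digits are reused.
$trimmed = ltrim('039', '0');

echo $octalLiteral, "\n";    // 39
echo $fromString, "\n";      // 39
echo chr($fromString), "\n"; // '
echo $trimmed, "\n";         // 39
```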
Fabian
28-Sep-2007 02:31
Actually I am not sure about the regex replacements from numeric entities back.
If you give &#39; to a browser, you get a single quote; &#039; will also turn into a single quote.
But if I do a:
<?php
chr(039);
?>
I will get not a single quote but a heart character (haven't seen it since DOS days :))
However
<?php
chr(39);
?>
gives the correct result.
This makes the correct preg something like this
<?php
$string = preg_replace('~&#x0*([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#0*([0-9]+);~e', 'chr(\\1)', $string);
?>
The reason is also already found on preg_replace manual page:
http://de.php.net/manual/en/function.preg-replace.php#69478
039 is interpreted as octal
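The /e modifier used above was removed in PHP 7, so on current versions the same idea has to be written with preg_replace_callback(). The following is a sketch under that assumption, not the note author's code; the function name decode_single_byte_refs() is hypothetical, and references above 255 are deliberately left untouched because chr() only produces single bytes.

```php
<?php
// Hypothetical helper: decodes numeric references whose code fits in one byte.
function decode_single_byte_refs(string $text): string
{
    // Hexadecimal references such as &#x27; or &#x0027; (leading zeros skipped).
    $text = preg_replace_callback(
        '~&#x0*([0-9a-f]+);~i',
        function (array $m): string {
            $code = hexdec($m[1]);
            return $code < 256 ? chr((int) $code) : $m[0];
        },
        $text
    );

    // Decimal references such as &#39; or &#039; (leading zeros skipped).
    return preg_replace_callback(
        '~&#0*([0-9]+);~',
        function (array $m): string {
            $code = (int) $m[1]; // string to int is base 10, never octal
            return $code < 256 ? chr($code) : $m[0];
        },
        $text
    );
}

$result = decode_single_byte_refs('It&#039;s &#x27;quoted&#x27; &#8364;');
echo $result, "\n"; // It's 'quoted' &#8364;
```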
akniep at rayo dot info
13-Jul-2007 09:39
In answer to "laurynas dot butkus at gmail dot com" and "romans@void.lv" and their great code2utf function: I added handling for entries in [128, 160[, which are not ASCII but are the same in all major Western encodings such as ISO8859-X and UTF-8, as has been mentioned before.
Now the following function should in fact convert any number (table entry) into a UTF-8 character. Thus, the return value of code2utf( <number> ) equals the character that is represented by the XML entity &#<number>; (exceptions: #129, #141, #143, #144, #157).
To give an example, the function may be useful for creating a UTF-8-compatible html_entity_decode-function or determining the entry-position of UTF-8-characters in order to find the correct entity-replacement or similar.
function code2utf($number)
{
if ($number < 0)
return FALSE;
if ($number < 128)
return chr($number);
// Removing / Replacing Windows Illegals Characters
if ($number < 160)
{
if ($number==128) $number=8364;
elseif ($number==129) $number=160; // (Rayo:) #129 using no relevant sign, thus, mapped to the saved-space #160
elseif ($number==130) $number=8218;
elseif ($number==131) $number=402;
elseif ($number==132) $number=8222;
elseif ($number==133) $number=8230;
elseif ($number==134) $number=8224;
elseif ($number==135) $number=8225;
elseif ($number==136) $number=710;
elseif ($number==137) $number=8240;
elseif ($number==138) $number=352;
elseif ($number==139) $number=8249;
elseif ($number==140) $number=338;
elseif ($number==141) $number=160; // (Rayo:) #141 using no relevant sign, thus, mapped to the saved-space #160
elseif ($number==142) $number=381;
elseif ($number==143) $number=160; // (Rayo:) #143 using no relevant sign, thus, mapped to the saved-space #160
elseif ($number==144) $number=160; // (Rayo:) #144 using no relevant sign, thus, mapped to the saved-space #160
elseif ($number==145) $number=8216;
elseif ($number==146) $number=8217;
elseif ($number==147) $number=8220;
elseif ($number==148) $number=8221;
elseif ($number==149) $number=8226;
elseif ($number==150) $number=8211;
elseif ($number==151) $number=8212;
elseif ($number==152) $number=732;
elseif ($number==153) $number=8482;
elseif ($number==154) $number=353;
elseif ($number==155) $number=8250;
elseif ($number==156) $number=339;
elseif ($number==157) $number=160; // (Rayo:) #157 using no relevant sign, thus, mapped to the saved-space #160
elseif ($number==158) $number=382;
elseif ($number==159) $number=376;
} //if
if ($number < 2048)
return chr(($number >> 6) + 192) . chr(($number & 63) + 128);
if ($number < 65536)
return chr(($number >> 12) + 224) . chr((($number >> 6) & 63) + 128) . chr(($number & 63) + 128);
if ($number < 2097152)
return chr(($number >> 18) + 240) . chr((($number >> 12) & 63) + 128) . chr((($number >> 6) & 63) + 128) . chr(($number & 63) + 128);
return FALSE;
} //code2utf()
laurynas dot butkus at gmail dot com
15-May-2007 04:24
In PHP 4, html_entity_decode() does not work well with UTF-8; it emits: "Warning: cannot yet handle MBCS in html_entity_decode()!".
This is a working solution combining several workarounds:
<?php
function html_entity_decode_utf8($string)
{
static $trans_tbl;
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'code2utf(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'code2utf(\\1)', $string);
if (!isset($trans_tbl))
{
$trans_tbl = array();
foreach (get_html_translation_table(HTML_ENTITIES) as $val=>$key)
$trans_tbl[$key] = utf8_encode($val);
}
return strtr($string, $trans_tbl);
}
function code2utf($num)
{
if ($num < 128) return chr($num);
if ($num < 2048) return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
if ($num < 65536) return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
if ($num < 2097152) return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128) . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
return '';
}
?>
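Because the /e modifier no longer exists in PHP 7 and later, the numeric part of this workaround can be sketched with preg_replace_callback() instead. The names codepoint_to_utf8() and decode_numeric_refs_utf8() are hypothetical; the byte arithmetic mirrors code2utf() from the note, and the digit limits in the pattern are an added assumption to keep the numbers within integer range.

```php
<?php
// Hypothetical helper mirroring the note's code2utf(): code point to UTF-8 bytes.
function codepoint_to_utf8(int $num): string
{
    if ($num < 0x80) {
        return chr($num);
    }
    if ($num < 0x800) {
        return chr(($num >> 6) + 192) . chr(($num & 63) + 128);
    }
    if ($num < 0x10000) {
        return chr(($num >> 12) + 224) . chr((($num >> 6) & 63) + 128)
             . chr(($num & 63) + 128);
    }
    if ($num < 0x200000) {
        return chr(($num >> 18) + 240) . chr((($num >> 12) & 63) + 128)
             . chr((($num >> 6) & 63) + 128) . chr(($num & 63) + 128);
    }
    return ''; // out of range, dropped as in the note
}

// Hypothetical helper: replaces decimal and hexadecimal references in one pass.
function decode_numeric_refs_utf8(string $text): string
{
    return preg_replace_callback(
        '~&#(x[0-9a-f]{1,6}|[0-9]{1,7});~i',
        function (array $m): string {
            $ref = $m[1];
            $isHex = ($ref[0] === 'x' || $ref[0] === 'X');
            $code = $isHex ? (int) hexdec(substr($ref, 1)) : (int) $ref;
            return codepoint_to_utf8($code);
        },
        $text
    );
}

echo bin2hex(decode_numeric_refs_utf8('&#8364;')), "\n";  // e282ac
echo bin2hex(decode_numeric_refs_utf8('&#x20AC;')), "\n"; // e282ac
```

On PHP 5 and later, html_entity_decode($string, ENT_QUOTES, 'UTF-8') decodes such references directly, so a hand-written decoder like this is mainly of interest for understanding the encoding steps.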
teecee[(a)]teecee[pont]hu
13-May-2007 08:29
Hi!
The main problem when you try to unhtmlentities UTF-8 strings is that get_html_translation_table() returns a non-UTF-8 conversion table. So the idea is to get the translation table and then convert its non-UTF-8 strings to UTF-8...
I have this code working, actually this code is the one sent by 'daviscabral', just with an extra foreach in it ( http://hu.php.net/manual/en/function.htmlentities.php#68479 )
And the code is:
<?
function unhtmlentitiesUtf8($string) {
// replace numeric entities
$string = preg_replace('~&#x([0-9a-f]+);~ei', 'chr(hexdec("\\1"))', $string);
$string = preg_replace('~&#([0-9]+);~e', 'chr("\\1")', $string);
// replace literal entities
$trans_tbl = get_html_translation_table(HTML_ENTITIES);
$trans_tbl = array_flip($trans_tbl);
// changing translation table to UTF-8
foreach( $trans_tbl as $key => $value ) {
$trans_tbl[$key] = iconv( 'ISO-8859-1', 'UTF-8', $value );
}
return strtr($string, $trans_tbl);
}
?>
If you need this in production code, I suggest putting $trans_tbl into a common include file; I think it should be faster. (Maybe the easiest way to do this is to add die(var_export($trans_tbl, true)); after the translation and copy and paste the source of the displayed text.) And don't forget to check that the browser uses the UTF-8 code page! ;)
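On newer PHP versions the per-entry conversion can be avoided, assuming a version whose get_html_translation_table() accepts an encoding as its third argument. The sketch below requests the table in UTF-8 and flips it; the sample string is made up for illustration.

```php
<?php
// Request the table directly in UTF-8 instead of converting each entry.
$table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES, 'UTF-8');

// Flip it so that entities become keys and characters become values.
$decodeMap = array_flip($table);

// strtr() replaces every known entity in a single pass over the string.
$decoded = strtr('caf&eacute; &amp; cr&egrave;me', $decodeMap);

echo bin2hex($decoded), "\n";
```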
elektronaut gmx.net
10-Jan-2007 05:11
I made my own fix to allow numerical entities in utf8 in php4...
<?
function utf8_replaceEntity($result){
$value = (int)$result[1];
$string = '';
$len = round(pow($value,1/8));
for($i=$len;$i>0;$i--){
$part = ($value & (255>>2)) | pow(2,7);
if ( $i == 1 ) $part |= 255<<(8-$len);
$string = chr($part) . $string;
$value >>= 6;
}
return $string;
}
function utf8_html_entity_decode($string){
return preg_replace_callback(
'/&#([0-9]+);/u',
'utf8_replaceEntity',
$string
);
}
$string = '&#8217;&#8216; &#8211; &#8220; &#8221;'
.' &#263; &#324; &#345;'
;
$string = utf8_html_entity_decode($string,null,'UTF-8');
header('Content-Type: text/html; charset=UTF-8');
echo '<li>'.$string;
?>
inco
28-Dec-2006 12:26
@ romekt:
iconv could not be implemented, so alternatively use utf8_decode and utf8_encode to solve the utf-8 / iso-8859-1 problem
jojo
03-Nov-2006 08:27
This decodes characters encoded by JavaScript's escape() function.
It is effective when multibyte characters are used on the page.
javascript escape('aa