Encoding problems come up often when writing parsers and reading data from XML and CSV files. Below are several ways to solve them.
1
windows-1251 to UTF-8
$text = iconv('windows-1251', 'UTF-8//IGNORE', $text);
echo $text;
PHP
$text = mb_convert_encoding($text, 'UTF-8', 'windows-1251');
echo $text;
PHP
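If the input may or may not already be UTF-8, converting it unconditionally would double-encode it. The sketch below is an illustration, not part of the original snippets: it checks validity first. The helper name toUtf8FromCp1251 is hypothetical, and the mbstring extension is assumed to be available. Short windows-1251 strings can occasionally also be valid UTF-8, so treat the check as a heuristic.

```php
<?php
// Hypothetical helper: convert windows-1251 to UTF-8 only when the input
// is not already valid UTF-8.
function toUtf8FromCp1251(string $text): string
{
    if (mb_check_encoding($text, 'UTF-8')) {
        return $text; // already valid UTF-8, leave untouched
    }
    return mb_convert_encoding($text, 'UTF-8', 'Windows-1251');
}

$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2"; // "Привет" as windows-1251 bytes
echo toUtf8FromCp1251($cp1251), PHP_EOL; // Привет
```

Calling the helper on text that is already UTF-8 returns it unchanged, so it is safe to apply more than once.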
2
UTF-8 to windows-1251
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
echo $text;
PHP
$text = mb_convert_encoding($text, 'windows-1251', 'utf-8');
echo $text;
PHP
3
When nothing else helps
$text = iconv('utf-8', 'cp1252//IGNORE', $text);
$text = iconv('cp1251', 'utf-8//IGNORE', $text);
echo $text;
PHP
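The pair of calls above repairs one specific kind of damage: text that was stored as windows-1251, wrongly read as windows-1252, and then saved as UTF-8. Below is a minimal sketch of that round trip. It uses mb_convert_encoding instead of iconv (a substitution made here for portability), and the sample string is made up for illustration.

```php
<?php
// "Привет" stored as windows-1251, misread as windows-1252 and saved as UTF-8.
$broken = "Ïðèâåò";

// Step 1: recover the original bytes by encoding the text as windows-1252.
$bytes = mb_convert_encoding($broken, 'Windows-1252', 'UTF-8');

// Step 2: interpret those bytes as what they really are, windows-1251.
$fixed = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1251');

echo $fixed, PHP_EOL; // Привет
```

This only works when the damage followed exactly that path; applied to healthy text it will corrupt it.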
Sometimes it gets absurd, but this works (a round trip through windows-1251 drops every character that encoding cannot represent):
$text = iconv('utf-8', 'windows-1251//IGNORE', $text);
$text = iconv('windows-1251', 'utf-8//IGNORE', $text);
echo $text;
PHP
4
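file_get_contents / cURL
Sometimes file_get_contents() or cURL returns gibberish (ÐлмазнÑе боÑÑ). The text itself is valid UTF-8; it is displayed in the wrong encoding because nothing declares the charset. Prepending a UTF-8 BOM lets the browser detect the encoding: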
$text = file_get_contents('https://example.com');
$text = "\xEF\xBB\xBF" . $text;
echo $text;
PHP
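Prepending the BOM twice leaves a stray character in the output, so a guard helps when the source may already contain one. Below is a small sketch; the function name withUtf8Bom is made up here. Where you control the response, sending a header such as Content-Type: text/html; charset=utf-8 is another common way to declare the encoding.

```php
<?php
// Hypothetical helper: add a UTF-8 BOM only if the text does not start with one.
function withUtf8Bom(string $text): string
{
    $bom = "\xEF\xBB\xBF";
    if (strncmp($text, $bom, 3) === 0) {
        return $text; // BOM already present
    }
    return $bom . $text;
}

$once  = withUtf8Bom('Алмазные');
$twice = withUtf8Bom($once);
var_dump($once === $twice); // bool(true)
```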
In other cases file_get_contents() returns text that looks like this:
�mw�Ƒ0�����&IkAI��f��j4/{�</�&�h�� ��({�o�����:/��<g���g��(�=�9�Paɭ
This is GZIP-compressed data: file_get_contents() does not negotiate compression through request headers and does not decompress the response. A cURL-based solution:
function getcontents($url) {
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_ENCODING, 'gzip');     // request gzip and decompress automatically
    // Warning: the next two lines disable TLS certificate checks; remove them in production.
    curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
    curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
    $output = curl_exec($ch);
    curl_close($ch);
    return $output;
}
echo getcontents('https://example.com');
PHP
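If the compressed bytes have already been fetched, they can also be unpacked after the fact. The sketch below assumes the zlib extension is loaded; the helper name maybeGunzip is hypothetical. It looks for the two-byte gzip signature (1F 8B) and falls back to the raw data when decoding fails.

```php
<?php
// Hypothetical helper: decompress data only if it starts with the gzip signature.
function maybeGunzip(string $data): string
{
    if (strncmp($data, "\x1f\x8b", 2) === 0) {
        $decoded = @gzdecode($data); // suppress the warning on corrupt input
        if ($decoded !== false) {
            return $decoded;
        }
    }
    return $data; // not gzip, or not decodable: return as is
}

$compressed = gzencode('hello');
echo maybeGunzip($compressed), PHP_EOL; // hello
echo maybeGunzip('plain text'), PHP_EOL; // plain text
```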
12.01.2017, updated 02.11.2021
Source: https://snipp.ru/php/iconv-utf-8
(PHP 4 >= 4.0.6, PHP 5, PHP 7, PHP 8)
mb_convert_encoding: converts a string from one character encoding to another
Description
mb_convert_encoding(array|string $string, string $to_encoding, array|string|null $from_encoding = null): array|string|false
Parameters
- string
  The string or array to convert.
- to_encoding
  The desired encoding of the result.
- from_encoding
  The current encoding used to interpret string. Multiple encodings can be given as an array or as a comma-separated string; in that case the correct encoding is determined by the same algorithm as mb_detect_encoding().
  If from_encoding is omitted or null, the mbstring.internal_encoding setting is used if it is set, otherwise the default encoding.
Valid values for to_encoding and from_encoding are listed on the supported encodings page.
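When the source encoding is one of a few known candidates, it can help to run the detection step explicitly and inspect its result before converting. A minimal sketch, assuming the candidate list and the sample bytes shown here (both chosen for illustration); strict mode is used so that encodings in which the bytes are invalid are rejected.

```php
<?php
$candidates = ['UTF-8', 'Windows-1251'];
$bytes = "\xCF\xF0\xE8\xE2\xE5\xF2"; // "Привет" as windows-1251 bytes, not valid UTF-8

// Strict mode (third argument) rejects encodings in which the bytes are invalid.
$detected = mb_detect_encoding($bytes, $candidates, true);
echo $detected, PHP_EOL; // Windows-1251

$utf8 = mb_convert_encoding($bytes, 'UTF-8', $detected);
echo $utf8, PHP_EOL; // Привет
```

Detection is a heuristic: when the bytes are valid in several candidate encodings the result can differ between PHP versions, so an explicitly known source encoding is preferable whenever one is available.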
Return Values
The converted string or array, or false on failure.
Errors
As of PHP 8.0.0, a ValueError is thrown if the value of to_encoding or from_encoding is an invalid encoding. Before PHP 8.0.0, an E_WARNING was raised instead.
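Code that has to run on both sides of the PHP 8.0 change can treat the exception and the false return value the same way. A sketch under that assumption; the wrapper name convertOrNull is hypothetical.

```php
<?php
// Hypothetical wrapper: returns null for an invalid encoding name or a failed
// conversion, whether PHP reports it via ValueError (8.0+) or via false (older).
function convertOrNull(string $text, string $to, string $from): ?string
{
    try {
        $result = @mb_convert_encoding($text, $to, $from);
    } catch (ValueError $e) {
        return null;
    }
    return is_string($result) ? $result : null;
}

var_dump(convertOrNull('abc', 'UTF-8', 'ASCII'));            // string(3) "abc"
var_dump(convertOrNull('abc', 'UTF-8', 'no-such-encoding')); // NULL
```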
Changelog
Version | Description |
---|---|
8.0.0 | mb_convert_encoding() now throws a ValueError if an invalid encoding is passed in to_encoding. |
8.0.0 | mb_convert_encoding() now throws a ValueError if an invalid encoding is passed in from_encoding. |
8.0.0 | from_encoding can now be null. |
7.2.0 | The function now also accepts an array in string. Previously only strings were supported. |
Examples
Example #1 mb_convert_encoding() example
<?php
/* Convert the string to SJIS encoding */
$str = mb_convert_encoding($str, "SJIS");

/* Convert from EUC-JP to UTF-7 */
$str = mb_convert_encoding($str, "UTF-7", "EUC-JP");

/* Auto-detect the encoding among JIS, eucjp-win, sjis-win, then convert to UCS-2LE */
$str = mb_convert_encoding($str, "UCS-2LE", "JIS, eucjp-win, sjis-win");

/* If mbstring.language is "Japanese", "auto" stands for "ASCII,JIS,UTF-8,EUC-JP,SJIS" */
$str = mb_convert_encoding($str, "EUC-JP", "auto");
?>
See Also
- mb_detect_order() - Set/get the list of encodings used for encoding detection
- UConverter::transcode() - Convert a string from one character encoding to another
- iconv() - Convert a string from one character encoding to another
josip at cubrad dot com ¶
10 years ago
For my last project I needed to convert several CSV files from Windows-1250 to UTF-8. After several days of searching I found a function that partially solved my problem, but it still did not convert all the characters. So I made this:
function w1250_to_utf8($text) {
// map based on:
// http://konfiguracja.c0.pl/iso02vscp1250en.html
// http://konfiguracja.c0.pl/webpl/index_en.html#examp
// http://www.htmlentities.com/html/entities/
$map = array(
chr(0x8A) => chr(0xA9),
chr(0x8C) => chr(0xA6),
chr(0x8D) => chr(0xAB),
chr(0x8E) => chr(0xAE),
chr(0x8F) => chr(0xAC),
chr(0x9C) => chr(0xB6),
chr(0x9D) => chr(0xBB),
chr(0xA1) => chr(0xB7),
chr(0xA5) => chr(0xA1),
chr(0xBC) => chr(0xA5),
chr(0x9F) => chr(0xBC),
chr(0xB9) => chr(0xB1),
chr(0x9A) => chr(0xB9),
chr(0xBE) => chr(0xB5),
chr(0x9E) => chr(0xBE),
chr(0x80) => '&euro;',
chr(0x82) => '&sbquo;',
chr(0x84) => '&bdquo;',
chr(0x85) => '&hellip;',
chr(0x86) => '&dagger;',
chr(0x87) => '&Dagger;',
chr(0x89) => '&permil;',
chr(0x8B) => '&lsaquo;',
chr(0x91) => '&lsquo;',
chr(0x92) => '&rsquo;',
chr(0x93) => '&ldquo;',
chr(0x94) => '&rdquo;',
chr(0x95) => '&bull;',
chr(0x96) => '&ndash;',
chr(0x97) => '&mdash;',
chr(0x99) => '&trade;',
chr(0x9B) => '&rsaquo;',
chr(0xA6) => '&brvbar;',
chr(0xA9) => '&copy;',
chr(0xAB) => '&laquo;',
chr(0xAE) => '&reg;',
chr(0xB1) => '&plusmn;',
chr(0xB5) => '&micro;',
chr(0xB6) => '&para;',
chr(0xB7) => '&middot;',
chr(0xBB) => '&raquo;',
);
return html_entity_decode(mb_convert_encoding(strtr($text, $map), 'UTF-8', 'ISO-8859-2'), ENT_QUOTES, 'UTF-8');
}
Julian Egelstaff ¶
10 months ago
If you have what looks like ISO-8859-1, but it includes "smart quotes" courtesy of Microsoft software, or people cutting and pasting content from Microsoft software, then what you're actually dealing with is probably Windows-1252. Try this:
<?php
$cleanText = mb_convert_encoding($text, 'UTF-8', 'Windows-1252');
?>
The annoying part is that the auto detection (ie: the mb_detect_encoding function) will often think Windows-1252 is ISO-8859-1. Close, but no cigar. This is critical if you're then trying to do unserialize on the resulting text, because the byte count of the string needs to be perfect.
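The difference is easy to see on the two "smart quote" bytes. In the sketch below (the sample bytes are chosen here for illustration), the same input is decoded once as Windows-1252 and once as ISO-8859-1:

```php
<?php
$bytes = "\x93Hi\x94"; // curly double quotes as Windows-1252 bytes

$asCp1252 = mb_convert_encoding($bytes, 'UTF-8', 'Windows-1252');
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

echo $asCp1252, PHP_EOL;          // “Hi”
echo bin2hex($asLatin1), PHP_EOL; // c2934869c294 (C1 control characters, not quotes)
```

Both results are valid UTF-8, which is why the wrong guess tends to go unnoticed until the text is displayed or its length is checked.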
regrunge at hotmail dot it ¶
13 years ago
I've been trying to find the charset of a norwegian (with a lot of ø, æ, å) txt file written on a Mac, i've found it in this way:
<?php
$text = "A strange string to pass, maybe with some ø, æ, å characters.";
foreach (mb_list_encodings() as $chr) {
echo mb_convert_encoding($text, 'UTF-8', $chr)." : ".$chr."<br>";
}
?>
The line that looks good, gives you the encoding it was written in.
Hope can help someone
volker at machon dot biz ¶
16 years ago
Hey guys. For everybody who's looking for a function that is converting an iso-string to utf8 or an utf8-string to iso, here's your solution:
// Shown as plain functions; add "public" again when using them as class methods.
function encodeToUtf8($string) {
return mb_convert_encoding($string, "UTF-8", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
}
function encodeToIso($string) {
return mb_convert_encoding($string, "ISO-8859-1", mb_detect_encoding($string, "UTF-8, ISO-8859-1, ISO-8859-15", true));
}
For me these functions are working fine. Give it a try
francois at bonzon point com ¶
14 years ago
aaron, to discard unsupported characters instead of printing a ?, you might as well simply set the configuration directive:
mbstring.substitute_character = "none"
in your php.ini. Be sure to include the quotes around none. Or at run-time with
<?php
ini_set('mbstring.substitute_character', "none");
?>
eion at bigfoot dot com ¶
17 years ago
many people below talk about using
<?php
mb_convert_encoding($s,'HTML-ENTITIES','UTF-8');
?>
to convert non-ascii code into html-readable stuff. Due to my webserver being out of my control, I was unable to set the database character set, and whenever PHP made a copy of my $s variable that it had pulled out of the database, it would convert it to nasty latin1 automatically and not leave it in its beautiful UTF-8 glory.
So [insert korean characters here] turned into ?????.
I found myself needing to pass by reference (call-time pass-by-reference is deprecated/nonexistent in recent versions of PHP),
so instead of
<?php
mb_convert_encoding(&$s,'HTML-ENTITIES','UTF-8');
?>
which worked perfectly until I upgraded, I had to use
<?php
call_user_func_array('mb_convert_encoding', array(&$s,'HTML-ENTITIES','UTF-8'));
?>
Hope it helps someone else out
aaron at aarongough dot com ¶
14 years ago
My solution below was slightly incorrect, so here is the correct version (I posted at the end of a long day, never a good idea!)
Again, this is a quick and dirty solution to stop mb_convert_encoding from filling your string with question marks whenever it encounters an illegal character for the target encoding.
<?php
function convert_to ( $source, $target_encoding )
{
    // detect the character encoding of the incoming file
    $encoding = mb_detect_encoding( $source, "auto" );

    // escape all of the question marks so we can remove artifacts from
    // the unicode conversion process
    $target = str_replace( "?", "[question_mark]", $source );

    // convert the string to the target encoding
    $target = mb_convert_encoding( $target, $target_encoding, $encoding);

    // remove any question marks that have been introduced because of illegal characters
    $target = str_replace( "?", "", $target );

    // replace the token string "[question_mark]" with the symbol "?"
    $target = str_replace( "[question_mark]", "?", $target );

    return $target;
}
?>
Hope this helps someone! (Admins should feel free to delete my previous, incorrect, post for clarity)
-A
Rainer Perske ¶
1 year ago
Text-encoding HTML-ENTITIES will be deprecated as of PHP 8.2.
To convert all non-ASCII characters into entities (to produce pure 7-bit HTML output), I was using:
<?php
echo mb_convert_encoding( htmlspecialchars( $text, ENT_QUOTES, 'UTF-8' ), 'HTML-ENTITIES', 'UTF-8' );
?>
I can get the identical result with:
<?php
echo mb_encode_numericentity( htmlentities( $text, ENT_QUOTES, 'UTF-8' ), [0x80, 0x10FFFF, 0, ~0], 'UTF-8' );
?>
The output contains well-known named entities for some often used characters and numeric entities for the rest.
Stephan van der Feest ¶
18 years ago
To add to the Flash conversion comment below, here's how I convert back from what I've stored in a database after converting from Flash HTML text field output, in order to load it back into a Flash HTML text field:
function htmltoflash($htmlstr)
{
return str_replace("<br />","\n",
str_replace("<","<",
str_replace(">",">",
mb_convert_encoding(html_entity_decode($htmlstr),
"UTF-8","ISO-8859-1"))));
}
urko at wegetit dot eu ¶
11 years ago
If you are trying to generate a CSV (with extended chars) to be opened in Excel for Mac, the only thing that worked for me was:
<?php mb_convert_encoding( $CSV, 'Windows-1252', 'UTF-8'); ?>
I also tried this:
<?php
// Fields separated OK, chars wrong
iconv('MACINTOSH', 'UTF8', $CSV);
// Fields separated wrong, chars OK
chr(255).chr(254).mb_convert_encoding( $CSV, 'UCS-2LE', 'UTF-8');
?>
But the first one didn't show extended chars correctly, and the second one didn't separate fields correctly.
vasiliauskas dot agnius at gmail dot com ¶
5 years ago
When you need to convert from HTML-ENTITIES, but your UTF-8 string is partially broken (not all chars in UTF-8) - in this case passing string to mb_convert_encoding($string, 'UTF-8', 'HTML-ENTITIES'); - corrupts chars in string even more. In this case you need to replace html entities gradually to preserve character good encoding. I wrote such closure for this job :
<?php
$decode_entities = function($string) {
preg_match_all("/&#?\w+;/", $string, $entities, PREG_SET_ORDER);
$entities = array_unique(array_column($entities, 0));
foreach ($entities as $entity) {
$decoded = mb_convert_encoding($entity, 'UTF-8', 'HTML-ENTITIES');
$string = str_replace($entity, $decoded, $string);
}
return $string;
};
?>
Daniel Trebbien ¶
14 years ago
Note that `mb_convert_encoding($val, 'HTML-ENTITIES')` does not escape '\'', '"', '<', '>', or '&'.
chzhang at gmail dot com ¶
14 years ago
instead of ini_set(), you can try this
mb_substitute_character("none");
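A short sketch of the effect, assuming the sample string below (made up for illustration). The euro sign has no equivalent in ISO-8859-1, so it is either replaced by the substitute character or dropped; the previous setting is restored afterwards so the change does not leak into other code.

```php
<?php
$previous = mb_substitute_character(); // remember the current setting

$text = "a€b"; // € does not exist in ISO-8859-1
$withCurrent = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

mb_substitute_character('none'); // drop unconvertible characters silently
$withNone = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');

mb_substitute_character($previous); // restore the earlier setting

echo $withCurrent, PHP_EOL; // a?b when the substitute character is the default "?"
echo $withNone, PHP_EOL;    // ab
```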
bmxmale at qwerty dot re ¶
1 year ago
/**
* Convert Windows-1250 to UTF-8
* Based on https://www.php.net/manual/en/function.mb-convert-encoding.php#112547
*/
class TextConverter
{
private const ENCODING_TO = 'UTF-8';
private const ENCODING_FROM = 'ISO-8859-2';
private array $mapChrChr = [
0x8A => 0xA9,
0x8C => 0xA6,
0x8D => 0xAB,
0x8E => 0xAE,
0x8F => 0xAC,
0x9C => 0xB6,
0x9D => 0xBB,
0xA1 => 0xB7,
0xA5 => 0xA1,
0xBC => 0xA5,
0x9F => 0xBC,
0xB9 => 0xB1,
0x9A => 0xB9,
0xBE => 0xB5,
0x9E => 0xBE
];
private array $mapChrString = [
0x80 => '&euro;',
0x82 => '&sbquo;',
0x84 => '&bdquo;',
0x85 => '&hellip;',
0x86 => '&dagger;',
0x87 => '&Dagger;',
0x89 => '&permil;',
0x8B => '&lsaquo;',
0x91 => '&lsquo;',
0x92 => '&rsquo;',
0x93 => '&ldquo;',
0x94 => '&rdquo;',
0x95 => '&bull;',
0x96 => '&ndash;',
0x97 => '&mdash;',
0x99 => '&trade;',
0x9B => '&rsaquo;',
0xA6 => '&brvbar;',
0xA9 => '&copy;',
0xAB => '&laquo;',
0xAE => '&reg;',
0xB1 => '&plusmn;',
0xB5 => '&micro;',
0xB6 => '&para;',
0xB7 => '&middot;',
0xBB => '&raquo;'
];
/**
* @param $text
* @return string
*/
public function execute($text): string
{
$map = $this->prepareMap();
return html_entity_decode(
mb_convert_encoding(strtr($text, $map), self::ENCODING_TO, self::ENCODING_FROM),
ENT_QUOTES,
self::ENCODING_TO
);
}
/**
* @return array
*/
private function prepareMap(): array
{
$maps[] = $this->arrayMapAssoc(function ($k, $v) {
return [chr($k), chr($v)];
}, $this->mapChrChr);
$maps[] = $this->arrayMapAssoc(function ($k, $v) {
return [chr($k), $v];
}, $this->mapChrString);
return array_merge([], ...$maps);
}
/**
* @param callable $function
* @param array $array
* @return array
*/
private function arrayMapAssoc(callable $function, array $array): array
{
return array_column(
array_map(
$function,
array_keys($array),
$array
),
1,
0
);
}
}
Daniel ¶
7 years ago
If you are attempting to convert "UTF-8" text to "ISO-8859-1" and the result is always returning in "ASCII", place the following line of code before the mb_convert_encoding:
mb_detect_order(array('UTF-8', 'ISO-8859-1'));
It is necessary to force a specific search order for the conversion to work
me at gsnedders dot com ¶
14 years ago
It appears that when dealing with an unknown "from encoding" the function will both throw an E_WARNING and proceed to convert the string from ISO-8859-1 to the "to encoding".
katzlbtjunk at hotmail dot com ¶
15 years ago
Clean a string for use as filename by simply replacing all unwanted characters with underscore (ASCII converts to 7bit). It removes slightly more chars than necessary. Hope its useful.
$fileName = 'Test:!"$%&/()=ÖÄÜöäü<<';
echo strtr(mb_convert_encoding($fileName,'ASCII'),
' ,;:?*#!§$%&/(){}<>=`´|\\\'"',
'____________________________');
Tom Class ¶
17 years ago
Why did you use the php html encode functions? mbstring has its own encoding, which is (as far as I tested it) much more useful:
HTML-ENTITIES
Example:
$text = mb_convert_encoding($text, 'HTML-ENTITIES', "UTF-8");
mac.com@nemo ¶
17 years ago
For those wanting to convert from $set to MacRoman, use iconv():
<?php
$string = iconv('UTF-8', 'macintosh', $string);
?>
('macintosh' is the IANA name for the MacRoman character set.)
nicole ¶
7 years ago
// convert UTF8 to DOS = CP850
//
// $utf8_text=UTF8-Formatted text;
// $dos=CP850-Formatted text;
// have fun
$dos = mb_convert_encoding($utf8_text, "CP850", mb_detect_encoding($utf8_text, "UTF-8, CP850, ISO-8859-15", true));
lanka at eurocom dot od dot ua ¶
20 years ago
Another sample of recoding without MultiByte enabling.
(Russian koi->win, if input in win-encoding already, function recode() returns unchanged string)
<?php
// 0 - win
// 1 - koi
function detect_encoding($str) {
$win = 0;
$koi = 0;
for ($i = 0; $i < strlen($str); $i++) {
if( ord($str[$i]) >224 && ord($str[$i]) < 255) $win++;
if( ord($str[$i]) >192 && ord($str[$i]) < 223) $koi++;
}
if ($win < $koi) {
return 1;
} else return 0;
}
// recodes koi to win
function koi_to_win($string) {$kw = array(128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 254, 224, 225, 246, 228, 229, 244, 227, 245, 232, 233, 234, 235, 236, 237, 238, 239, 255, 240, 241, 242, 243, 230, 226, 252, 251, 231, 248, 253, 249, 247, 250, 222, 192, 193, 214, 196, 197, 212, 195, 213, 200, 201, 202, 203, 204, 205, 206, 207, 223, 208, 209, 210, 211, 198, 194, 220, 219, 199, 216, 221, 217, 215, 218);
$wk = array(128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 225, 226, 247, 231, 228, 229, 246, 250, 233, 234, 235, 236, 237, 238, 239, 240, 242, 243, 244, 245, 230, 232, 227, 254, 251, 253, 255, 249, 248, 252, 224, 241, 193, 194, 215, 199, 196, 197, 214, 218, 201, 202, 203, 204, 205, 206, 207, 208, 210, 211, 212, 213, 198, 200, 195, 222, 219, 221, 223, 217, 216, 220, 192, 209);$end = strlen($string);
$pos = 0;
do {
$c = ord($string[$pos]);
if ($c>128) {
$string[$pos] = chr($kw[$c-128]);
}
} while (++$pos < $end);
return $string;
}
function recode($str) {
$enc = detect_encoding($str);
if ($enc==1) {
$str = koi_to_win($str);
}
return $str;
}
?>
nospam at nihonbunka dot com ¶
15 years ago
rodrigo at bb2 dot co dot jp wrote that iconv works better than mb_convert_encoding. I find that when converting from UTF-8 to Shift_JIS
$conv_str = mb_convert_encoding($str,$toCS,$fromCS);
works while
$conv_str = iconv($fromCS,$toCS.'//IGNORE',$str);
removes tildes from $str.
David Hull ¶
16 years ago
As an alternative to Johannes's suggestion for converting strings from other character sets to a 7bit representation while not just deleting latin diacritics, you might try this:
<?php
$text = iconv($from_enc, 'US-ASCII//TRANSLIT', $text);
?>
The only disadvantage is that it does not convert "ä" to "ae", but it handles punctuation and other special characters better.
--
David
aofg ¶
16 years ago
When converting Japanese strings to ISO-2022-JP or JIS on PHP >= 5.2.1, you can use "ISO-2022-JP-MS" instead of them.
Kishu-Izon (platform dependent) characters are converted correctly with the encoding, as same as with eucJP-win or with SJIS-win.
jamespilcher1 — hotmail ¶
19 years ago
be careful when converting from iso-8859-1 to utf-8.
even if you explicitly specify the character encoding of a page as iso-8859-1(via headers and strict xml defs), windows 2000 will ignore that and interpret it as whatever character set it has natively installed.
for example, i wrote char #128 into a page, with char encoding iso-8859-1, and it displayed in internet explorer (& mozilla) as a euro symbol.
it should have displayed a box, denoting that char #128 is undefined in iso-8859-1. The problem was it was displaying in "Windows: western europe" (my native character set).
this led to confusion when i tried to convert this euro to UTF-8 via mb_convert_encoding()
IE displays UTF-8 correctly- and because PHP correctly converted #128 into a box in UTF-8, IE would show a box.
so all i saw was mb_convert_encoding() converting a euro symbol into a box. It took me a long time to figure out what was going on.
StigC ¶
15 years ago
For the php-noobs (like me) - working with flash and php.
Here's a simple snippet of code that worked great for me, getting php to show special Danish characters, from a Flash email form:
<?php
// Name Escape
$escName = mb_convert_encoding($_POST["Name"], "ISO-8859-1", "UTF-8");
// message escape
$escMessage = mb_convert_encoding($_POST["Message"], "ISO-8859-1", "UTF-8");
// Headers.. and so on...
?>
gullevek at gullevek dot org ¶
13 years ago
If you want to convert japanese to ISO-2022-JP it is highly recommended to use ISO-2022-JP-MS as the target encoding instead. This includes the extended character set and avoids ? in the text. For example the often used "1 in a circle" ① will be correctly converted then.
rodrigo at bb2 dot co dot jp ¶
15 years ago
For those who can't use mb_convert_encoding() to convert from one charset to another because of an older version of PHP, try iconv().
I had this problem converting to a Japanese charset:
$txt=mb_convert_encoding($txt,'SJIS',$this->encode);
And I could fix it by using this:
$txt = iconv('UTF-8', 'SJIS', $txt);
Maybe it's helpful for someone else! ;)
phpdoc at jeudi dot de ¶
17 years ago
I'd like to share some code to convert latin diacritics to their
traditional 7bit representation, like, for example,
- à,ç,é,î,... to a,c,e,i,...
- ß to ss
- ä,Ä,... to ae,Ae,...
- ë,... to e,...
(mb_convert "7bit" would simply delete any offending characters).
I might have missed on your country's typographic
conventions--correct me then.
<?php
/**
* @args string $text line of encoded text
* string $from_enc (encoding type of $text, e.g. UTF-8, ISO-8859-1)
*
* @returns 7bit representation
*/
function to7bit($text,$from_enc) {
$text = mb_convert_encoding($text,'HTML-ENTITIES',$from_enc);
$text = preg_replace(
array('/&szlig;/','/&(..)lig;/',
'/&([aouAOU])uml;/','/&(.)[^;]*;/'),
array('ss',"$1","$1".'e',"$1"),
$text);
return $text;
}
?>
Enjoy :-)
Johannes
==
[EDIT BY danbrown AT php DOT net: Author provided the following update on 27-FEB-2012.]
==
An addendum to my "to7bit" function referenced below in the notes.
The function is supposed to solve the problem that some languages require a different 7bit rendering of special (umlauted) characters for sorting or other applications. For example, the German ß ligature is usually written "ss" in 7bit context. Dutch ÿ is typically rendered "ij" (not "y").
The original function works well with word (alphabet) character entities and I've seen it used in many places. But non-word entities cause funny results:
E.g., "&copy;" is rendered as "c", "&shy;" as "s" and "&rquo;" as "r".
The following version fixes this by converting non-alphanumeric characters (also chains thereof) to '_'.
<?php
/**
* @args string $text line of encoded text
* string $from_enc (encoding type of $text, e.g. UTF-8, ISO-8859-1)
*
* @returns 7bit representation
*/
function to7bit($text,$from_enc) {
$text = preg_replace('/\W+/','_',$text);
$text = mb_convert_encoding($text,'HTML-ENTITIES',$from_enc);
$text = preg_replace(
array('/&szlig;/','/&(..)lig;/',
'/&([aouAOU])uml;/','/&yuml;/','/&(.)[^;]*;/'),
array('ss',"$1","$1".'e','ij',"$1"),
$text);
return $text;
}
?>
Enjoy again,
Johannes
qdb at kukmara dot ru ¶
11 years ago
mb_substr and probably several other functions works faster in ucs-2 than in utf-8. and utf-16 works slower than utf-8. here is test, ucs-2 is near 50 times faster than utf-8, and utf-16 is near 6 times slower than utf-8 here:
<?php
header('Content-Type: text/html; charset=utf-8');
mb_internal_encoding('utf-8');
$s='укгезәөшөхзәхөшк2049һһлдябчсячмииюсит.июбҗрарэ'
.'лдэфвәәуүйәуйүәу034928348539857әшаыдларорашһһрлоавы';
$s.=$s;
$s.=$s;
$s.=$s;
$s.=$s;
$s.=$s;
$s.=$s;
$s.=$s;
$t1=microtime(true);
$i=0;
while($i<mb_strlen($s)){
$a=mb_substr($s,$i,2);
$i+=2;
if($i==10)echo$a.'. ';
//echo$a.'. ';
}
echo$i.'. ';
echo(microtime(true)-$t1);
echo '<br>';
$s=mb_convert_encoding($s,'UCS-2','utf8');
mb_internal_encoding('UCS-2');
$t1=microtime(true);
$i=0;
while($i<mb_strlen($s)){
$a=mb_substr($s,$i,2);
$i+=2;
if($i==10)echo mb_convert_encoding($a,'utf8','ucs2').'. ';
//echo$a.'. ';
}
echo$i.'. ';
echo(microtime(true)-$t1);
echo '<br>';
$s=mb_convert_encoding($s,'utf-16','ucs-2');
mb_internal_encoding('utf-16');
$t1=microtime(true);
$i=0;
while($i<mb_strlen($s)){
$a=mb_substr($s,$i,2);
$i+=2;
if($i==10)echo mb_convert_encoding($a,'utf8','utf-16').'. ';
//echo$a.'. ';
}
echo$i.'. ';
echo(microtime(true)-$t1);
?>
output:
өх. 12416. 1.71738100052
өх. 12416. 0.0211279392242
өх. 12416. 11.2330229282
DanielAbbey at Hotmail dot co dot uk ¶
9 years ago
When using the Windows Notepad text editor, it is important to note that when you select 'Save As' there is an Encoding selection dropdown. The default encoding is set to ANSI, with the other two options being Unicode and UTF-8. Since most text on the web is in UTF-8 format it could prove vital to save the .txt file with this encoding, since this function does not work on ANSI-encoded text.
Stephan van der Feest ¶
18 years ago
Here's a tip for anyone using Flash and PHP for storing HTML output submitted from a Flash text field in a database or whatever.
Flash submits its HTML special characters in UTF-8, so you can use the following function to convert those into HTML entity characters:
function utf8html($utf8str)
{
return htmlentities(mb_convert_encoding($utf8str,"ISO-8859-1","UTF-8"));
}
Edward ¶
15 years ago
If mb_convert_encoding doesn't work for you, and iconv gives you a headache, you might be interested in this free class I found. It can convert almost any charset to almost any other charset. I think it's wonderful and I wish I had found it earlier. It would have saved me tons of headache.
I use it as a fail-safe, in case mb_convert_encoding is not installed. Download it from http://mikolajj.republika.pl/
This is not my own library, so technically it's not spamming, right? ;)
Hope this helps.
jackycms at outlook dot com ¶
9 years ago
// mb_convert_encoding($input,'UTF-8','windows-874'); error : Illegal character encoding specified
// so convert Thai to UTF-8 is better use iconv instead
<?php
iconv("windows-874","UTF-8",$input);
?>
mightye at gmail dot com ¶
15 years ago
To petruzanauticoyahoo?com!ar
If you don't specify a source encoding, then it assumes the internal (default) encoding. ñ is a multi-byte character whose bytes in your configuration default (often iso-8859-1) would actually mean Ã±. mb_convert_encoding() is upgrading those characters to their multi-byte equivalents within UTF-8.
Try this instead:
<?php
print mb_convert_encoding( "ñ", "UTF-8", "UTF-8" );
?>
Of course this function does no work (for the most part - it can actually be used to strip characters which are not valid for UTF-8).
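The double-encoding effect described in this note can be reproduced directly. In the sketch below the source encoding is stated explicitly as ISO-8859-1 to stand in for a misconfigured default (an assumption made for the demonstration):

```php
<?php
$original = "ñ"; // UTF-8 bytes C3 B1

// Wrong: declaring the source as ISO-8859-1 encodes each byte as its own character.
$doubled = mb_convert_encoding($original, 'UTF-8', 'ISO-8859-1');
echo $doubled, PHP_EOL; // Ã±

// Reversing the mistake restores the original text.
$repaired = mb_convert_encoding($doubled, 'ISO-8859-1', 'UTF-8');
echo $repaired, PHP_EOL; // ñ
```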
Problem Description:
I have a small piece of HTML that I need to convert to UTF-8.
I use iconv("windows-1251", "utf-8", $html);
All text converts correctly, except text inside a tag such as <i>...</i>: it is not converted, and I see something like Показать РјРЅ
Solution – 1
If you have access to the mbstring (multibyte string) extension, you can try it. See the PHP page here:
http://www.php.net/manual/en/function.mb-convert-encoding.php
$html_utf8 = mb_convert_encoding($html, "utf-8", "windows-1251");
Solution – 2
You see a message like Показать мн when the page encoding is windows-1251 but the text is encoded in UTF-8.
I ran into this problem in one of my projects; just change the page encoding to UTF-8 and the text will be shown correctly.
Here are some examples.
If the page is in UTF-8 but the text is in windows-1251, you will see something like this:
???? ?? ?????? ??? ????? ??? ??????? ?? ????? ???? ??? ?????
If the page is in windows-1251 but the text is in UTF-8, you will see this:
"Мобильные телефоны";"Apple iPhone 4
Solution – 3
I always use manual conversion (character by character), like this:
$input= 'Обращение РљР°С';
$s= str_replace('С?','fgr43443443',$input);
$s= mb_convert_encoding($s, "windows-1251", "utf-8");
$s= str_replace('fgr43443443','ш',$s);
echo $s;
P.S. Don't forget that the .php file itself should be saved as UTF-8.
Also, insert the standard UTF-8 declaration in the head of the HTML:
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
Solution – 4
Most of the solutions lack conversion to single-byte encoding.
I use mb_convert_encoding($string, 'windows-1251') to convert from UTF-8 in my case.
function ru2Lat($string)
{
$rus = array('ё','ж','ц','ч','ш','щ','ю','я','Ё','Ж','Ц','Ч','Ш','Щ','Ю','Я');
$lat = array('yo','zh','tc','ch','sh','sh','yu','ya','YO','ZH','TC','CH','SH','SH','YU','YA');
$string = str_replace($rus,$lat,$string);
$string = strtr($string,
"АБВГДЕЗИЙКЛМНОПРСТУФХЪЫЬЭабвгдезийклмнопрстуфхъыьэ",
"ABVGDEZIJKLMNOPRSTUFH_I_Eabvgdezijklmnoprstufh'i'e");
return($string);
}
function transliterate($string){
if (!is_string($string)) return $string;
return ru2lat(mb_convert_encoding($string,'windows-1251'));
}
function transliterate_array($a){
$c = array_map('transliterate', $a);
return $c;
}
Solution – 5
try this, works for me!
$result = str_replace ('€', '€' , $result);
Solution – 6
$data = mb_convert_encoding($data, "utf-8", "windows-1251");
$data = mb_convert_encoding($data, "windows-1251", "Windows-1250");
// works for me
Answer by Bridget Saunders
iconv: Convert string to requested character encoding.
Performs a character set conversion on the string string from from_encoding to to_encoding.
If and how //TRANSLIT works exactly depends on the system's iconv() implementation (cf. ICONV_IMPL). Some implementations are known to ignore //TRANSLIT, so the conversion is likely to fail for characters which are illegal for the to_encoding.
string: The string to be converted.
Original : This is the Euro symbol '€'.
TRANSLIT : This is the Euro symbol 'EUR'.
IGNORE : This is the Euro symbol ''.
Plain :
Notice: iconv(): Detected an illegal character in input string in .\iconv-example.php on line 7
Answer by Russell Malone
If not, try to guess the encoding (CP1252 or ISO-8859-1 would be my first guess) and convert it to UTF-8, see if the output is valid:
This worked fine. Then, I figured out that mb_internal_encoding("UTF-8"); is enough. So now it works. Thanks for all the suggestions!
Answer by Marie Newman
@1nt3g3r, your script won't work. You missed * in the filename templates. To make it work the first line should look like this:
What you actually should use for this operation is enca, since it will correctly detect input encoding and act accordingly.
However, your variant works much better than the TS's. It works even with the unprintable characters in the filenames. Thanks!
For many Russian filenames with spaces etc., and autodetection of the codepage (macOS), this works best for me:
find ./ -name "*.sql" -type f | while read file; do enca -L russian -x UTF-8 "$file"; done;
find ./ -name "*.txt" -o -name "*.html" -o -name "*.css" -o -name "*.js" -type f |
Answer by Saoirse Wilcox
$config['charset'] = "windows-1251";
Answer by Creed Romero
echo '<pre>';
var_dump($_POST["NAME"]);
echo '</pre>';
string(16) "SRR R R s•R R†R VRR RRR" // here should be "check"
var_dump(mb_detect_encoding($_POST["NAME"])); // UTF-8
var_dump(iconv('UTF-8','windows-1251', $_POST["NAME"]));
Received:
string(8) "Resurrects"