I need to convert a string from Windows-1251 to UTF-8.
I tried to do this with iconv, but all I get is something like this:
пїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅ пїЅпїЅпїЅпїЅпїЅпїЅпїЅпїЅ
var iconv = new Iconv('windows-1251', 'utf-8')
title = iconv.convert(title).toString('utf-8')
Pang
9,622146 gold badges81 silver badges122 bronze badges
asked Jan 1, 2012 at 13:52
1
Here is working solution to your problem. You have to use Buffer and convert your string to binary first.
const Iconv = require('iconv').Iconv;
request({
uri: website_url,
method: 'GET',
encoding: 'binary'
}, function (error, response, body) {
const body = new Buffer(body, 'binary');
conv = Iconv('windows-1251', 'utf8');
body = conv.convert(body).toString();
});
Ahmet Şimşek
1,4111 gold badge15 silver badges24 bronze badges
answered Jan 29, 2012 at 0:20
Alex KolarskiAlex Kolarski
3,2651 gold badge25 silver badges35 bronze badges
1
If you’re reading from file, you could use something like that:
const iconv = require('iconv-lite');
const fs = require("fs");
fs.readFile("filename.xml", null, (err, data) => {
if(err) {
console.log(err)
return
}
const encodedData = iconv.encode(iconv.decode(data, 'win1251'), 'utf8')
fs.writeFile("result_filename.xml", encodedData, () => { })
})
answered Jul 14, 2021 at 18:27
I use Node version 16 and code bellow works fine. You don’t need to use Buffer node will write warnings. You need to install iconv
package before.
fs = require('fs')
fs.readFile('printed_document.txt', function (err,data) {
if (err) {
return console.log(err);
}
console.log(require('iconv').Iconv('windows-1251', 'utf-8').convert(data).toString())
})
answered Oct 13, 2022 at 13:44
Orlov ConstOrlov Const
3323 silver badges10 bronze badges
Using regular expressions is probably the best way. You can see a bunch of tests here (taken from chromium)
function validateEmail(email) {
const re = /^(([^<>()[\]\\.,;:\s@"]+(\.[^<>()[\]\\.,;:\s@"]+)*)|(".+"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
return re.test(String(email).toLowerCase());
}
Here’s the example of regular expresion that accepts unicode:
const re = /^(([^<>()[\]\.,;:\s@\"]+(\.[^<>()[\]\.,;:\s@\"]+)*)|(\".+\"))@(([^<>()[\]\.,;:\s@\"]+\.)+[^<>()[\]\.,;:\s@\"]{2,})$/i;
But keep in mind that one should not rely only upon JavaScript validation. JavaScript can easily be disabled. This should be validated on the server side as well.
Here’s an example of the above in action:
function validateEmail(email) {
const re = /^(([^<>()[\]\\.,;:\s@\"]+(\.[^<>()[\]\\.,;:\s@\"]+)*)|(\".+\"))@((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\])|(([a-zA-Z\-0-9]+\.)+[a-zA-Z]{2,}))$/;
return re.test(email);
}
function validate() {
const $result = $("#result");
const email = $("#email").val();
$result.text("");
if (validateEmail(email)) {
$result.text(email + " is valid :)");
$result.css("color", "green");
} else {
$result.text(email + " is not valid :(");
$result.css("color", "red");
}
return false;
}
$("#email").on("input", validate);
<script src="https://ajax.googleapis.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<label for=email>Enter an email address:</label>
<input id="email">
<h2 id="result"></h2>
A closure is a pairing of:
- A function, and
- A reference to that function’s outer scope (lexical environment)
A lexical environment is part of every execution context (stack frame) and is a map between identifiers (ie. local variable names) and values.
Every function in JavaScript maintains a reference to its outer lexical environment. This reference is used to configure the execution context created when a function is invoked. This reference enables code inside the function to «see» variables declared outside the function, regardless of when and where the function is called.
If a function was called by a function, which in turn was called by another function, then a chain of references to outer lexical environments is created. This chain is called the scope chain.
In the following code, inner
forms a closure with the lexical environment of the execution context created when foo
is invoked, closing over variable secret
:
function foo() {
const secret = Math.trunc(Math.random()*100)
return function inner() {
console.log(`The secret number is ${secret}.`)
}
}
const f = foo() // `secret` is not directly accessible from outside `foo`
f() // The only way to retrieve `secret`, is to invoke `f`
In other words: in JavaScript, functions carry a reference to a private «box of state», to which only they (and any other functions declared within the same lexical environment) have access. This box of the state is invisible to the caller of the function, delivering an excellent mechanism for data-hiding and encapsulation.
And remember: functions in JavaScript can be passed around like variables (first-class functions), meaning these pairings of functionality and state can be passed around your program: similar to how you might pass an instance of a class around in C++.
If JavaScript did not have closures, then more states would have to be passed between functions explicitly, making parameter lists longer and code noisier.
So, if you want a function to always have access to a private piece of state, you can use a closure.
…and frequently we do want to associate the state with a function. For example, in Java or C++, when you add a private instance variable and a method to a class, you are associating state with functionality.
In C and most other common languages, after a function returns, all the local variables are no longer accessible because the stack-frame is destroyed. In JavaScript, if you declare a function within another function, then the local variables of the outer function can remain accessible after returning from it. In this way, in the code above, secret
remains available to the function object inner
, after it has been returned from foo
.
Uses of Closures
Closures are useful whenever you need a private state associated with a function. This is a very common scenario — and remember: JavaScript did not have a class syntax until 2015, and it still does not have a private field syntax. Closures meet this need.
Private Instance Variables
In the following code, the function toString
closes over the details of the car.
function Car(manufacturer, model, year, color) {
return {
toString() {
return `${manufacturer} ${model} (${year}, ${color})`
}
}
}
const car = new Car('Aston Martin','V8 Vantage','2012','Quantum Silver')
console.log(car.toString())
Functional Programming
In the following code, the function inner
closes over both fn
and args
.
function curry(fn) {
const args = []
return function inner(arg) {
if(args.length === fn.length) return fn(...args)
args.push(arg)
return inner
}
}
function add(a, b) {
return a + b
}
const curriedAdd = curry(add)
console.log(curriedAdd(2)(3)()) // 5
Event-Oriented Programming
In the following code, function onClick
closes over variable BACKGROUND_COLOR
.
const $ = document.querySelector.bind(document)
const BACKGROUND_COLOR = 'rgba(200,200,242,1)'
function onClick() {
$('body').style.background = BACKGROUND_COLOR
}
$('button').addEventListener('click', onClick)
<button>Set background color</button>
Modularization
In the following example, all the implementation details are hidden inside an immediately executed function expression. The functions tick
and toString
close over the private state and functions they need to complete their work. Closures have enabled us to modularise and encapsulate our code.
let namespace = {};
(function foo(n) {
let numbers = []
function format(n) {
return Math.trunc(n)
}
function tick() {
numbers.push(Math.random() * 100)
}
function toString() {
return numbers.map(format)
}
n.counter = {
tick,
toString
}
}(namespace))
const counter = namespace.counter
counter.tick()
counter.tick()
console.log(counter.toString())
Examples
Example 1
This example shows that the local variables are not copied in the closure: the closure maintains a reference to the original variables themselves. It is as though the stack-frame stays alive in memory even after the outer function exits.
function foo() {
let x = 42
let inner = function() { console.log(x) }
x = x+1
return inner
}
var f = foo()
f() // logs 43
Example 2
In the following code, three methods log
, increment
, and update
all close over the same lexical environment.
And every time createObject
is called, a new execution context (stack frame) is created and a completely new variable x
, and a new set of functions (log
etc.) are created, that close over this new variable.
function createObject() {
let x = 42;
return {
log() { console.log(x) },
increment() { x++ },
update(value) { x = value }
}
}
const o = createObject()
o.increment()
o.log() // 43
o.update(5)
o.log() // 5
const p = createObject()
p.log() // 42
Example 3
If you are using variables declared using var
, be careful you understand which variable you are closing over. Variables declared using var
are hoisted. This is much less of a problem in modern JavaScript due to the introduction of let
and const
.
In the following code, each time around the loop, a new function inner
is created, which closes over i
. But because var i
is hoisted outside the loop, all of these inner functions close over the same variable, meaning that the final value of i
(3) is printed, three times.
function foo() {
var result = []
for (var i = 0; i < 3; i++) {
result.push(function inner() { console.log(i) } )
}
return result
}
const result = foo()
// The following will print `3`, three times...
for (var i = 0; i < 3; i++) {
result[i]()
}
Final points:
- Whenever a function is declared in JavaScript closure is created.
- Returning a
function
from inside another function is the classic example of closure, because the state inside the outer function is implicitly available to the returned inner function, even after the outer function has completed execution. - Whenever you use
eval()
inside a function, a closure is used. The text youeval
can reference local variables of the function, and in the non-strict mode, you can even create new local variables by usingeval('var foo = …')
. - When you use
new Function(…)
(the Function constructor) inside a function, it does not close over its lexical environment: it closes over the global context instead. The new function cannot reference the local variables of the outer function. - A closure in JavaScript is like keeping a reference (NOT a copy) to the scope at the point of function declaration, which in turn keeps a reference to its outer scope, and so on, all the way to the global object at the top of the scope chain.
- A closure is created when a function is declared; this closure is used to configure the execution context when the function is invoked.
- A new set of local variables is created every time a function is called.
Links
- Douglas Crockford’s simulated private attributes and private methods for an object, using closures.
- A great explanation of how closures can cause memory leaks in IE if you are not careful.
- MDN documentation on JavaScript Closures.
Время на прочтение
5 мин
Количество просмотров 22K
Есть у меня старый сайт на Народ.Ру, и недавно я закинул туда несколько статей — как это я теперь делаю в UTF-8. Кодировка была указана в теге meta
, но, взглянув на страницы, я увидел крякозябры: «Р§С‚Рѕ-то случилось.
» Оказывается, Народ.Ру шлёт HTTP-заголовок Content-Type: text/html; charset=windows-1251
и это на нём никак не отключается. Пользователь может получить читабельный текст — только если догадается вручную переключить кодировку в браузере.
Что делать? Переходить на другой хостинг? Само собой, но пока руки не дошли, хотелось добиться результата тут. Перекодировать тексты? Более достойным и интересным показалось поставить Javascript-«заплатку».
Способа переключить кодировку из Javascript я не нашёл. Остался вариант перекодировать текст скриптом, запускаемым по событию onready
документа.
Итак, браузер получает текст в UTF-8, разбивает UTF-последовательности на группы по 8 бит и трактует их как коды символов в кодировке Windows-1251. Чтобы восстановить читаемость текста, нужно получить эти коды, объединить их в UTF-последовательности, а из них — восстановить Unicode-коды символов и вернуть последние посредством числовых ссылок HTML на символы. В этом деле обнаружились несколько закавык.
Во-первых, считывая текст из свойства innerHTML
, мы обнаруживаем на месте неразрывного пробела (0xA0) HTML-сущность «
». Нужно её заменять обратно на 0xA0
.
Во-вторых, функция charCodeAt
возвращает код символа в Unicode, а не в Windows-1251, значит нужно преобразовывать первый во второй. В-третьих, символа с кодом 0x98
в Windows-1251 нет, так что эта функция возвращает для него undefined
, это нужно предусмотреть.
В-четвёртых, Internet Explorer и Safari не позволяют поменять заголовок документа через DOM, только через соответствующее свойство документа — но туда нельзя писать числовые ссылки HTML. Для этого случая можно переводить Unicode-коды в шестнадцатеричную систему счисления, записывать их в виде «%код» и пропускать через функцию unescape
.
Итоговый код получается таким:
bindReady(
function(){
var Win1251 =
{
0x0: 0x0,
0x1: 0x1,
0x2: 0x2,
0x3: 0x3,
0x4: 0x4,
0x5: 0x5,
0x6: 0x6,
0x7: 0x7,
0x8: 0x8,
0x9: 0x9,
0xA: 0xA,
0xB: 0xB,
0xC: 0xC,
0xD: 0xD,
0xE: 0xE,
0xF: 0xF,
0x10: 0x10,
0x11: 0x11,
0x12: 0x12,
0x13: 0x13,
0x14: 0x14,
0x15: 0x15,
0x16: 0x16,
0x17: 0x17,
0x18: 0x18,
0x19: 0x19,
0x1A: 0x1A,
0x1B: 0x1B,
0x1C: 0x1C,
0x1D: 0x1D,
0x1E: 0x1E,
0x1F: 0x1F,
0x20: 0x20,
0x21: 0x21,
0x22: 0x22,
0x23: 0x23,
0x24: 0x24,
0x25: 0x25,
0x26: 0x26,
0x27: 0x27,
0x28: 0x28,
0x29: 0x29,
0x2A: 0x2A,
0x2B: 0x2B,
0x2C: 0x2C,
0x2D: 0x2D,
0x2E: 0x2E,
0x2F: 0x2F,
0x30: 0x30,
0x31: 0x31,
0x32: 0x32,
0x33: 0x33,
0x34: 0x34,
0x35: 0x35,
0x36: 0x36,
0x37: 0x37,
0x38: 0x38,
0x39: 0x39,
0x3A: 0x3A,
0x3B: 0x3B,
0x3C: 0x3C,
0x3D: 0x3D,
0x3E: 0x3E,
0x3F: 0x3F,
0x40: 0x40,
0x41: 0x41,
0x42: 0x42,
0x43: 0x43,
0x44: 0x44,
0x45: 0x45,
0x46: 0x46,
0x47: 0x47,
0x48: 0x48,
0x49: 0x49,
0x4A: 0x4A,
0x4B: 0x4B,
0x4C: 0x4C,
0x4D: 0x4D,
0x4E: 0x4E,
0x4F: 0x4F,
0x50: 0x50,
0x51: 0x51,
0x52: 0x52,
0x53: 0x53,
0x54: 0x54,
0x55: 0x55,
0x56: 0x56,
0x57: 0x57,
0x58: 0x58,
0x59: 0x59,
0x5A: 0x5A,
0x5B: 0x5B,
0x5C: 0x5C,
0x5D: 0x5D,
0x5E: 0x5E,
0x5F: 0x5F,
0x60: 0x60,
0x61: 0x61,
0x62: 0x62,
0x63: 0x63,
0x64: 0x64,
0x65: 0x65,
0x66: 0x66,
0x67: 0x67,
0x68: 0x68,
0x69: 0x69,
0x6A: 0x6A,
0x6B: 0x6B,
0x6C: 0x6C,
0x6D: 0x6D,
0x6E: 0x6E,
0x6F: 0x6F,
0x70: 0x70,
0x71: 0x71,
0x72: 0x72,
0x73: 0x73,
0x74: 0x74,
0x75: 0x75,
0x76: 0x76,
0x77: 0x77,
0x78: 0x78,
0x79: 0x79,
0x7A: 0x7A,
0x7B: 0x7B,
0x7C: 0x7C,
0x7D: 0x7D,
0x7E: 0x7E,
0x7F: 0x7F,
0x402: 0x80,
0x403: 0x81,
0x201A: 0x82,
0x453: 0x83,
0x201E: 0x84,
0x2026: 0x85,
0x2020: 0x86,
0x2021: 0x87,
0x20AC: 0x88,
0x2030: 0x89,
0x409: 0x8A,
0x2039: 0x8B,
0x40A: 0x8C,
0x40C: 0x8D,
0x40B: 0x8E,
0x40F: 0x8F,
0x452: 0x90,
0x2018: 0x91,
0x2019: 0x92,
0x201C: 0x93,
0x201D: 0x94,
0x2022: 0x95,
0x2013: 0x96,
0x2014: 0x97,
0x2122: 0x99,
0x459: 0x9A,
0x203A: 0x9B,
0x45A: 0x9C,
0x45C: 0x9D,
0x45B: 0x9E,
0x45F: 0x9F,
0xA0: 0xA0,
0x40E: 0xA1,
0x45E: 0xA2,
0x408: 0xA3,
0xA4: 0xA4,
0x490: 0xA5,
0xA6: 0xA6,
0xA7: 0xA7,
0x401: 0xA8,
0xA9: 0xA9,
0x404: 0xAA,
0xAB: 0xAB,
0xAC: 0xAC,
0xAD: 0xAD,
0xAE: 0xAE,
0x407: 0xAF,
0xB0: 0xB0,
0xB1: 0xB1,
0x406: 0xB2,
0x456: 0xB3,
0x491: 0xB4,
0xB5: 0xB5,
0xB6: 0xB6,
0xB7: 0xB7,
0x451: 0xB8,
0x2116: 0xB9,
0x454: 0xBA,
0xBB: 0xBB,
0x458: 0xBC,
0x405: 0xBD,
0x455: 0xBE,
0x457: 0xBF,
0x410: 0xC0,
0x411: 0xC1,
0x412: 0xC2,
0x413: 0xC3,
0x414: 0xC4,
0x415: 0xC5,
0x416: 0xC6,
0x417: 0xC7,
0x418: 0xC8,
0x419: 0xC9,
0x41A: 0xCA,
0x41B: 0xCB,
0x41C: 0xCC,
0x41D: 0xCD,
0x41E: 0xCE,
0x41F: 0xCF,
0x420: 0xD0,
0x421: 0xD1,
0x422: 0xD2,
0x423: 0xD3,
0x424: 0xD4,
0x425: 0xD5,
0x426: 0xD6,
0x427: 0xD7,
0x428: 0xD8,
0x429: 0xD9,
0x42A: 0xDA,
0x42B: 0xDB,
0x42C: 0xDC,
0x42D: 0xDD,
0x42E: 0xDE,
0x42F: 0xDF,
0x430: 0xE0,
0x431: 0xE1,
0x432: 0xE2,
0x433: 0xE3,
0x434: 0xE4,
0x435: 0xE5,
0x436: 0xE6,
0x437: 0xE7,
0x438: 0xE8,
0x439: 0xE9,
0x43A: 0xEA,
0x43B: 0xEB,
0x43C: 0xEC,
0x43D: 0xED,
0x43E: 0xEE,
0x43F: 0xEF,
0x440: 0xF0,
0x441: 0xF1,
0x442: 0xF2,
0x443: 0xF3,
0x444: 0xF4,
0x445: 0xF5,
0x446: 0xF6,
0x447: 0xF7,
0x448: 0xF8,
0x449: 0xF9,
0x44A: 0xFA,
0x44B: 0xFB,
0x44C: 0xFC,
0x44D: 0xFD,
0x44E: 0xFE,
0x44F: 0xFF
}
String.prototype.Win1251_charCodeAt=function(char_num){
var char_code=this.charCodeAt(char_num);
return (char_code===undefined)?0x98:Win1251[char_code];
}
function utf8_decode(text){
text=text.replace(/ /g,"\u00A0");
var char_code, char_code2, char_code3, char_code4;
var result_str='';
for(var char_num=0; char_num<text.length; char_num++)
if((char_code=text.Win1251_charCodeAt(char_num))<0x80 || char_code===text.charCodeAt(char_num))
result_str+=text.charAt(char_num);//0zzzzzzz - 00000000 00000000 00000000 0zzzzzzz
else if(char_code>=0xC0)
if(char_code<0xE0){
if(
(char_code2=text.Win1251_charCodeAt(++char_num))>=0x80 &&
char_code2<0xC0
)//110yyyyy 10zzzzzz - 00000000 00000000 00000yyy yyzzzzzz
result_str+="&#"+((char_code-0xC0)*0x40+(char_code2-0x80))+";";
}
else if(char_code<0xF0){
if(
(char_code2=text.Win1251_charCodeAt(++char_num))>=0x80 &&
char_code2<0xC0 &&
(char_code3=text.Win1251_charCodeAt(++char_num))>=0x80 &&
char_code3<0xC0
)//1110xxxx 10yyyyyy 10zzzzzz - 00000000 00000000 xxxxyyyy yyzzzzzz
result_str+="&#"+((char_code-0xE0)*0x1000+(char_code2-0x80)*0x40+(char_code3-0x80))+";";
}
else if(
char_code<0xF8 &&
(char_code2=text.Win1251_charCodeAt(++char_num))>=0x80 &&
char_code2<0xC0 &&
(char_code3=text.Win1251_charCodeAt(++char_num))>=0x80 &&
char_code3<0xC0 &&
(char_code4=text.Win1251_charCodeAt(++char_num))>=0x80 && char_code4<0xC0
)//11110www 10xxxxxx 10yyyyyy 10zzzzzz - 00000000 000wwwxx xxxxyyyy yyzzzzzz
result_str+="&#"+((char_code-0xF0)*0x40000+(char_code2-0x80)*0x1000+(char_code3-0x80)*0x40+(char_code4-0x80))+";";
return result_str;
}
function unescapeTitle(title){
return unescape(
utf8_decode(document.title).replace(
/&#([0-9]+);/g,
function(expression, value){
if(isNaN(value=parseInt(value, 10)))
return NaN;
var i=0, retval="", radix=16;
while(i++<4 || value>0){
retval=["0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "A", "B", "C", "D", "E", "F"][value%radix]+retval;
value=Math.floor(value/radix);
}
return "%u"+retval;
}
)
);
}
document.body.innerHTML=utf8_decode(document.body.innerHTML);
if("vendor" in navigator && navigator.vendor.indexOf("Apple")>-1){
document.title=unescapeTitle(document.title);
return;
}
/*@cc_on
document.title=unescapeTitle(document.title);
return;
@*/
document.getElementsByTagName("title")[0].innerHTML=utf8_decode(document.title);
}
);
Проверено в Firefox 3 и 4; Opera 9, 10 и 11; Internet Explorer 5.5, 6, 7, 8; Google Chrome и Safari последних релизных версий.
Конечно, это кунштюк; я на нём разобрался, что такое UTF-8. По-хорошему, сервер не должен вредить своими HTTP-заголовками. Но тут встаёт философский вопрос: а кому лучше знать кодировку документа — серверу (автору .htaccess) или самому HTML-документу (его автору)? Возможно, у браузеров есть веская причина верить серверу, а не meta
-тегу в документе.
iconv-lite: Pure JS character encoding conversion
- No need for native code compilation. Quick to install, works on Windows, Web, and in sandboxed environments.
- Used in popular projects like Express.js (body_parser),
Grunt, Nodemailer, Yeoman and others. - Faster than node-iconv (see below for performance comparison).
- Intuitive encode/decode API, including Streaming support.
- In-browser usage via browserify or webpack (~180kb gzip compressed with Buffer shim included).
- Typescript type definition file included.
- React Native is supported (need to install
stream
module to enable Streaming API). - License: MIT.
Usage
Basic API
var iconv = require('iconv-lite'); // Convert from an encoded buffer to a js string. str = iconv.decode(Buffer.from([0x68, 0x65, 0x6c, 0x6c, 0x6f]), 'win1251'); // Convert from a js string to an encoded buffer. buf = iconv.encode("Sample input string", 'win1251'); // Check if encoding is supported iconv.encodingExists("us-ascii")
Streaming API
// Decode stream (from binary data stream to js strings) http.createServer(function(req, res) { var converterStream = iconv.decodeStream('win1251'); req.pipe(converterStream); converterStream.on('data', function(str) { console.log(str); // Do something with decoded strings, chunk-by-chunk. }); }); // Convert encoding streaming example fs.createReadStream('file-in-win1251.txt') .pipe(iconv.decodeStream('win1251')) .pipe(iconv.encodeStream('ucs2')) .pipe(fs.createWriteStream('file-in-ucs2.txt')); // Sugar: all encode/decode streams have .collect(cb) method to accumulate data. http.createServer(function(req, res) { req.pipe(iconv.decodeStream('win1251')).collect(function(err, body) { assert(typeof body == 'string'); console.log(body); // full request body string }); });
Supported encodings
- All node.js native encodings: utf8, ucs2 / utf16-le, ascii, binary, base64, hex.
- Additional unicode encodings: utf16, utf16-be, utf-7, utf-7-imap, utf32, utf32-le, and utf32-be.
- All widespread singlebyte encodings: Windows 125x family, ISO-8859 family,
IBM/DOS codepages, Macintosh family, KOI8 family, all others supported by iconv library.
Aliases like ‘latin1’, ‘us-ascii’ also supported. - All widespread multibyte encodings: CP932, CP936, CP949, CP950, GB2312, GBK, GB18030, Big5, Shift_JIS, EUC-JP.
See all supported encodings on wiki.
Most singlebyte encodings are generated automatically from node-iconv. Thank you Ben Noordhuis and libiconv authors!
Multibyte encodings are generated from Unicode.org mappings and WHATWG Encoding Standard mappings. Thank you, respective authors!
Encoding/decoding speed
Comparison with node-iconv module (1000x256kb, on MacBook Pro, Core i5/2.6 GHz, Node v0.12.0).
Note: your results may vary, so please always check on your hardware.
operation iconv@2.1.4 iconv-lite@0.4.7
----------------------------------------------------------
encode('win1251') ~96 Mb/s ~320 Mb/s
decode('win1251') ~95 Mb/s ~246 Mb/s
BOM handling
- Decoding: BOM is stripped by default, unless overridden by passing
stripBOM: false
in options
(f.ex.iconv.decode(buf, enc, {stripBOM: false})
).
A callback might also be given as astripBOM
parameter — it’ll be called if BOM character was actually found. - If you want to detect UTF-8 BOM when decoding other encodings, use node-autodetect-decoder-stream module.
- Encoding: No BOM added, unless overridden by
addBOM: true
option.
UTF-16 Encodings
This library supports UTF-16LE, UTF-16BE and UTF-16 encodings. First two are straightforward, but UTF-16 is trying to be
smart about endianness in the following ways:
- Decoding: uses BOM and ‘spaces heuristic’ to determine input endianness. Default is UTF-16LE, but can be
overridden withdefaultEncoding: 'utf-16be'
option. Strips BOM unlessstripBOM: false
. - Encoding: uses UTF-16LE and writes BOM by default. Use
addBOM: false
to override.
UTF-32 Encodings
This library supports UTF-32LE, UTF-32BE and UTF-32 encodings. Like the UTF-16 encoding above, UTF-32 defaults to UTF-32LE, but uses BOM and ‘spaces heuristics’ to determine input endianness.
- The default of UTF-32LE can be overridden with the
defaultEncoding: 'utf-32be'
option. Strips BOM unlessstripBOM: false
. - Encoding: uses UTF-32LE and writes BOM by default. Use
addBOM: false
to override. (defaultEncoding: 'utf-32be'
can also be used here to change encoding.)
Other notes
When decoding, be sure to supply a Buffer to decode() method, otherwise bad things usually happen.
Untranslatable characters are set to � or ?. No transliteration is currently supported.
Node versions 0.10.31 and 0.11.13 are buggy, don’t use them (see #65, #77).
Testing
$ git clone git@github.com:ashtuchkin/iconv-lite.git $ cd iconv-lite $ npm install $ npm test $ # To view performance: $ node test/performance.js $ # To view test coverage: $ npm run coverage $ open coverage/lcov-report/index.html
Пример того, как преобразовать в Node.js кодировку windows-1251 в UTF-8.
Подобное преобразование актуально, если вы парсите сайты, у которых кодировка windows-1251 и с текстом на русском языке.
const iconv = require(‘iconv-lite’) var body = iconv.encode (iconv.decode (new Buffer (body, ‘binary’), ‘win1251’), ‘utf8’) |
Пример загрузки страницы и преобразования кодировки
request({ uri: url, method: ‘GET’, encoding: ‘binary’ }, function (error, response, body) { if(!error && response.statusCode===200){ var body = iconv.encode (iconv.decode (new Buffer (body, ‘binary’), ‘win1251’), ‘utf8’) var $ = cheerio.load(body) //Это строчка, если будете работать с Jquery } } ) |