Character Encoding questions

kepler · Nov 14, 2006

I can appreciate that, for internationalisation and forward compatibility, UTF-8 is the encoding to use for presenting data (e.g. webpages), but what about the backend?

Most existing textual data on filesystems is probably in the Windows 1252 encoding (or possibly ASCII), but what about when it is stored in a database? Does it retain its original encoding or get re-encoded according to some database default?

What happens when that data is extracted programmatically into a dataset, recordset or similar? Does it retain its encoding or get translated to UTF-16? If so, does this introduce a possibility of conversion errors?
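To make the conversion question concrete, here is a minimal Python sketch. The sample string is an assumption chosen for illustration, and it says nothing about what any particular database or recordset library does internally. It only shows that legacy bytes survive a round trip through UTF-16 when the source encoding is known, and where errors can arise when it is guessed wrongly or the target is narrower.

```python
# Sample text (assumed for illustration): "Köln – 10 €" written with escapes.
sample = "K\u00f6ln \u2013 10 \u20ac"

# Bytes as they might sit in a legacy Windows-1252 file or column.
legacy_bytes = sample.encode("cp1252")

# Decoding needs the correct source encoding; the result is Unicode text.
text = legacy_bytes.decode("cp1252")

# UTF-16 and UTF-8 both cover all of Unicode, so this direction loses nothing.
utf16_bytes = text.encode("utf-16-le")
utf8_bytes = text.encode("utf-8")
round_trip_ok = utf16_bytes.decode("utf-16-le").encode("cp1252") == legacy_bytes

# Error case 1: assuming the wrong source encoding.
try:
    legacy_bytes.decode("utf-8")
    wrong_guess_failed = False
except UnicodeDecodeError:
    wrong_guess_failed = True

# Error case 2: converting back to an encoding that lacks a character
# (the euro sign is not in ISO 8859-1).
try:
    text.encode("latin-1")
    narrow_target_failed = False
except UnicodeEncodeError:
    narrow_target_failed = True

print(round_trip_ok, wrong_guess_failed, narrow_target_failed)
```

In this sketch the risk lies in misidentifying the source encoding or in converting back down to a narrower one, not in the step up to UTF-16 itself.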

Should you make an effort to convert an existing data archive from the legacy encodings to UTF-16 to lose the backend conversion stage?

I've looked all over for this stuff but can only find wishy-washy stuff, no actual system-wide real world examples.

Any help is appreciated.

squid · Nov 15, 2006

They would stay in their original encodings as long as you are only viewing them. If you updated and saved them they would be converted, depending on the character encoding of the program you are doing it with.
I think UTF-16 is backwards compatible with UTF-8 and ASCII anyway.

This is just my opinion, not any facts or anything, so it could be wrong.

kepler · Nov 15, 2006

squid said:
They would stay in their original encodings as long as you are only viewing them. If you updated and saved them they would be converted, depending on the character encoding of the program you are doing it with.
I think UTF-16 is backwards compatible with UTF-8 and ASCII anyway.

Really? I've had a surprising amount of trouble trying to find anyone who will commit to anything on the compatibility front. I know that ASCII can be used as UTF-8 as it's a subset, and ISO 8859 can be viewed as Windows 1252, again as it's a subset; but the overlap of 1252 / UTF-8 / UTF-16 still remains a mystery to me.
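The overlap can be checked directly at the byte level. The following Python sketch compares how the same characters are encoded; it assumes "ISO 8859" here means ISO 8859-1 (Python's `latin-1` codec), which the post does not state.

```python
ascii_text = "plain"
o_umlaut = "\u00d6"  # O with diaeresis

# ASCII text has identical bytes in ASCII, UTF-8, ISO 8859-1 and Windows-1252.
ascii_same = (
    ascii_text.encode("ascii")
    == ascii_text.encode("utf-8")
    == ascii_text.encode("latin-1")
    == ascii_text.encode("cp1252")
)

# UTF-16 uses 16-bit code units, so even ASCII text has different bytes.
utf16_differs = ascii_text.encode("utf-16-le") != ascii_text.encode("ascii")

# ISO 8859-1 and Windows-1252 agree on 0xA0-0xFF: the O-umlaut is 0xD6 in both.
single_byte_same = o_umlaut.encode("latin-1") == o_umlaut.encode("cp1252") == b"\xd6"

# They differ in 0x80-0x9F: Windows-1252 has printable characters there,
# ISO 8859-1 has C1 control codes.
euro_cp1252 = b"\x80".decode("cp1252")   # euro sign, U+20AC
ctrl_latin1 = b"\x80".decode("latin-1")  # control character, U+0080

# Outside ASCII, UTF-8 and UTF-16 bytes match neither each other nor cp1252.
utf8_bytes = o_umlaut.encode("utf-8")       # b'\xc3\x96'
utf16_bytes = o_umlaut.encode("utf-16-le")  # b'\xd6\x00'

print(ascii_same, utf16_differs, single_byte_same)
print(utf8_bytes, utf16_bytes)
```

So byte-level compatibility only holds between ASCII and the ASCII-compatible encodings (UTF-8, ISO 8859-1, Windows-1252). UTF-16 shares the Unicode character set with UTF-8 but not its bytes, so moving between them is always a real conversion.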

Maybe it would be worth investing time in looking into converting our archival data into UTF-16. If we could do that it would seem (UTF-16 -> UCS-2/UTF-16 -> UTF-8) to take a lot of the headache out of it.

For now I've recommended we keep the datasources in their original encodings (a mix of ASCII, ISO 8859 & Windows 1252) and leave the application logic as is, except that all production output is to be encoded and marked up as UTF-8.
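A minimal sketch of that arrangement in Python, assuming the encoding of each datasource is known in advance. The source names, the `SOURCE_ENCODINGS` table and the `to_utf8` helper are hypothetical, invented only for illustration.

```python
# Hypothetical mapping from datasource to its known legacy encoding.
SOURCE_ENCODINGS = {
    "orders.csv": "ascii",
    "customers.txt": "iso-8859-1",
    "notes.txt": "cp1252",
}


def to_utf8(raw: bytes, source_name: str) -> bytes:
    """Decode with the source's declared encoding, then encode as UTF-8."""
    encoding = SOURCE_ENCODINGS[source_name]
    # The default errors="strict" raises on bytes the codec cannot map,
    # rather than silently corrupting the output.
    text = raw.decode(encoding)
    return text.encode("utf-8")


# b"K\xf6ln" is "Köln" as stored in Windows-1252.
page_fragment = to_utf8(b"K\xf6ln", "notes.txt")
print(page_fragment)
```

The point of the design is that the conversion happens once, at the output boundary, and each source is decoded with its own declared encoding rather than a single guessed default.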

I suppose it's the conversion of 1252 -> UTF-8 that's ringing warning bells for me, as I know some characters (e.g. O with an umlaut, which I know we use a lot) are really sensitive to that.
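For the O-umlaut case specifically, a short Python sketch shows what a correct conversion looks like and what the usual failure modes look like. It assumes the problems come from bytes being labelled with the wrong encoding, which is a common cause but not confirmed for this system.

```python
o_umlaut = "\u00d6"  # O with diaeresis

# Correct conversion: decode as Windows-1252, encode as UTF-8.
cp1252_bytes = o_umlaut.encode("cp1252")                    # b'\xd6'
utf8_bytes = cp1252_bytes.decode("cp1252").encode("utf-8")  # b'\xc3\x96'

# Failure mode 1: UTF-8 bytes displayed as if they were Windows-1252.
# 0xC3 becomes A-tilde and 0x96 becomes an en dash.
mojibake = utf8_bytes.decode("cp1252")

# Failure mode 2: Windows-1252 bytes served with a UTF-8 label.
# 0xD6 alone is an incomplete UTF-8 sequence, so a lenient decoder
# substitutes U+FFFD, the replacement character.
replaced = cp1252_bytes.decode("utf-8", errors="replace")

print(utf8_bytes, repr(mojibake), repr(replaced))
```

The character converts cleanly as long as the bytes are decoded with the encoding they were actually written in. The single byte becomes two bytes in UTF-8, which is why it breaks visibly when the label and the bytes disagree.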
