You are here

Fixing column encoding mess in MySQL

MySQL Performance Blog - Tue, 18/12/2007 - 2:34pm

Just had an interesting issue with an encoding mess on a column containing non-ASCII (Russian) text. The solution was not immediately obvious so I decided it's worth sharing.

The column (actually the whole table) was created with DEFAULT CHARSET cp1251. Most of the data was in proper cp1251 national encoding indeed. However, because of web application failure to properly set the encoding, some of the rows were actually in UTF-8. That needed to be fixed.

Simply using CONVERT(column USING xxx) did not work because MySQL treated the source data as if it was in cp1251. One obvious solution would be to write a throwaway PHP script which would SET NAMES cp1251, pull the offending rows (they'd come out in UTF-8), iconv() them to proper cp1251, and UPDATE them with new values.

However it's possible to fix the issue within MySQL. The trick is to tell it to treat the string coming from the table as binary, and then do charset conversion:

PLAIN TEXT CODE:
  1. UPDATE table SET column=CONVERT(CONVERT(CONVERT(column USING binary) USING utf8) USING cp1251) WHERE123;

This can be further simplified. After 2nd conversion the result is in UTF-8, and this time MySQL knows that it's UTF-8 as well. So it will perform conversion to per-table charset automatically, and the 3rd explicit CONVERT can be omitted.

PLAIN TEXT CODE:
  1. UPDATE table SET column=CONVERT(CONVERT(column USING binary) USING utf8) WHERE123;

The same trick could be applied to fix notorious KOI8-R vs CP1251 encoding issues in Russian, and other national SBCS encoding vs UTF-8 issues, I suppose. But, of course, ideally you'd always properly set the encoding in the Web application and avoid these issues at all.

Entry posted by shodan | 6 comments

Add to: delicious | digg | reddit | netscape | Google Bookmarks