Do you also face the problem with charsets in MySQL? I do quite often. The reason is that I work with different MySQL versions (4.0, 4.1, 5.0, 5.1), where each of them handles charsets a little differently. For example for conversions between 4.0 and 4.1 check this link.

Problem

We have a latin1 MySQL 5.0 server, where our database is in “latin1” charset. By the way “latin1” is the default charset for generic MySQL installation. We need to transform the database encoding to “UTF-8”.




Analysis

I generally deal with database/tables encoding conversions in two ways”

  1. Using ALTER DATABASE or ALTER TABLE command – this is useful when you convert databases/tables inside one database server (of course the server must support different encodings on different databases, so you need MySQL 4.1+).
  2. Using mysqldump command – this is useful when you convert between two versions of databases, i.e. between MySQL 4.0 ang MySQL 5.1

What we might want to analyze first are encodings used by our client/database/server. You can do it by submitting the following queries:

mysql> USE postfix;
mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | latin1                     |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

Important variable for us is character_set_database – this variable implies what is the default encoding of our database – this encoding is used when for example you create a new table inside this database.

You can change character_set_database with the following query:

mysql> ALTER DATABASE your_database DEFAULT CHARACTER SET utf8 COLLATE=utf8_slovak_ci;

The DEFAULT keyword is optional. After submitting this query you can check again your “char%” variables:

mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------+
| Variable_name            | Value                      |
+--------------------------+----------------------------+
| character_set_client     | latin1                     |
| character_set_connection | latin1                     |
| character_set_database   | utf8                       |
| character_set_filesystem | binary                     |
| character_set_results    | latin1                     |
| character_set_server     | latin1                     |
| character_set_system     | utf8                       |
| character_sets_dir       | /usr/share/mysql/charsets/ |
+--------------------------+----------------------------+
8 rows in set (0.00 sec)

We see that the character_set_database variable was changed. The other way of seeing the encoding and collation of your database is this query:

mysql> SHOW CREATE DATABASE `postfix`;
+----------+-----------------------------------------------------------------------------------------+
| Database | Create Database                                                                         |
+----------+-----------------------------------------------------------------------------------------+
| postfix  | CREATE DATABASE `postfix` /*!40100 DEFAULT CHARACTER SET utf8 COLLATE utf8_slovak_ci */ |
+----------+-----------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

IMPORTANT: One might think that this command changes the encoding of the whole database including the tables. Unfortunatelly this is not true. This query just sets the character_set_database variable to “utf8” but it does not convert the charset of your existing tables, it just says that new tables should be created in UTF-8 encoding.

Additionally despite you set the character_set_database variable to anything, each table and even column can have its own encoding. In order to see how to convert your existing latin1 tables to UTF-8, read the next section.




Solution

1. Using ALTER DATABASE or ALTER TABLE

In order to change the charset of your existing tables, you need to convert them:

mysql> ALTER TABLE your_table CONVERT TO CHARACTER SET utf8 COLLATE utf8_slovak_ci;

You might want to use MODIFY TO instead of CONVERT TO – this has implication though.
For more info read this page and search for “ALTER TABLE tbl_name CONVERT TO CHARACTER SET”.

2. Using mysqldump

First create your new database:

mysql> CREATE DATABASE postfix_new CHARACTER SET utf8

Next, export your old database in UTF8 encoding, using a command line “mysqldump” tool. I do NOT advice to do the export process in phpMyAdmin or other web-based admin tool, since you might have additional confusion with encodings used by the respective web application. So here is the mysqldump command:

$ mysqldump postfix -u root -p --default-character-set=utf8 > postfix.sql

Here “postfix” is name of the database in “latin1” encoding. The MySQL database reads correct data using its encoding. Important is, that character set of the “postfix.sql” dump file is UTF-8!

Because the SQL queries inside the file imply latin1 encoding, we need to replace the text “CHARSET=latin1” with “CHARSET=utf8” or better we deal with collation in one place and we replace it with “CHARSET=utf8 COLLATE=utf8_slovak_ci”:

sed -i 's/CHARSET=latin1/CHARSET=utf8 COLLATE=utf8_slovak_ci/' ./postfix.sql

Now if one SQL CREATE TABLE query looked like this:

CREATE TABLE `postfix_access` (
...
) ENGINE=MyISAM AUTO_INCREMENT=4 DEFAULT CHARSET=latin1;

Now it should be changed to this:

CREATE TABLE `postfix_access` (
...
) ENGINE=MyISAM AUTO_INCREMENT=4 DEFAULT CHARSET=utf8 COLLATE=utf8_slovak_ci;

Now your dump is ready to create valid UTF-8 tables inside your UTF-8 database (which was created above):

$ mysql postfix_new -u root -p < ./postfix.sql

Here “postfix_new_ is the new database and now it should be fully UTF-8.

Now you might want to make the new UTF-8 database your default one.

Resources

  1. http://dev.mysql.com/doc/refman/5.1/en/charset.html
  2. http://dev.mysql.com/doc/refman/5.0/en/charset.html
  3. http://forum.mambo-foundation.org/showthread.php?t=9620&page=2
  4. http://lists.linux.sk/pipermail/linux/2006-April/037200.html
Converting Character sets in MySQL to UTF8
Tagged on:     

Leave a Reply

Your email address will not be published. Required fields are marked *