Adding your own collation to MySQL

I’m the kind of DBA that prefers to keep everything simple, BUT, sometimes it’s not possible. Few days ago I’ve faced an issue where none of the collations shipped by default with MySQL would guarantee integrity of my database, and to avoid a massive re-write of application code, we have explored an option that up to the day, I didn’t know about.
Add your own collation to MySQL.

This option is described on this section of MySQL documentation . On this post I will show how to make MySQL identify volves with acute accent (fada) as a different letter:

First, let’s find out where is our character-set folder:

mysql [localhost] {msandbox} ((none)) > SHOW VARIABLES LIKE 'character_sets_dir';
+--------------------+---------------------------------------+
| Variable_name | Value |
+--------------------+---------------------------------------+
| character_sets_dir | /mysql/sources/5.6.26/share/charsets/ |
+--------------------+---------------------------------------+
1 row in set (0.00 sec)

In this case, I’ll be working on /mysql/sources/5.6.26/share/charsets/ it may be different on your environment.
You will find a file named Index.xml inside your character_sets_dir.

On this file, we will be using the locate data markup language syntax (LDML), more specifically reset, p and t elements.
You can have more information about LDML and the complete list of elements LDML Syntax Supported in MySQL.

Let’s edit this file and add our collation inside :

<charset name="utf8">
  <family>Unicode</family>
  <description>UTF-8 Unicode</description>
  <alias>utf-8</alias>
 . . .
  <collation name="utf8_test_ci" id="1122" version="5.2.0">
    <rules>
      <reset>A</reset>
      <p>\u00c1</p>
      <t>\u00e1</t>
      <reset>E</reset>
      <p>\u00c9</p>
      <t>\u00e9</t>
      <reset>I</reset>
      <p>\u00cd</p>
      <t>\u00ed</t>
      <reset>O</reset>
      <p>\u00d3</p>
      <t>\u00f3</t>
      <reset>U</reset>
      <p>\u00da</p>
      <t>\u00fa</t>
    </rules>
  </collation>
</charset>

So, let’s explain the first rule:

<reset>A</reset>
<p>\u00c1</p>
<t>\u00e1</t>
  • reset – is reseting the ordering an rules for the A (and a) character
  • p – is saying that \u00c1 (Á) will be treated as a different letter
  • t – is saying that \u00e1 (á) will be treated as the lower case version of \u00c1 (Á)

The same applies for the other letters (E / I / O / U). Now all we need is to restart MySQL service.

Let’s do some testing to find out how it works, I’ll create 2 tables one with utf8 default collation and 1 with our test collation:


mysql [localhost] {msandbox} (test) > CREATE TABLE `utf8` ( `name` varchar(50), PRIMARY KEY (`name`) ) ENGINE=InnoDB DEFAULT CHARSET utf8;
Query OK, 0 rows affected (0.04 sec)

mysql [localhost] {msandbox} (test) > INSERT INTO utf8 VALUES ('A');
Query OK, 1 row affected (0.01 sec)

mysql [localhost] {msandbox} (test) > INSERT INTO utf8 VALUES ('Á');
ERROR 1062 (23000): Duplicate entry 'Á' for key 'PRIMARY'
mysql [localhost] {msandbox} (test) > INSERT INTO utf8 VALUES ('a');
ERROR 1062 (23000): Duplicate entry 'a' for key 'PRIMARY'
mysql [localhost] {msandbox} (test) > INSERT INTO utf8 VALUES ('á');
ERROR 1062 (23000): Duplicate entry 'á' for key 'PRIMARY'
mysql [localhost] {msandbox} (test) > SELECT * FROM utf8;
+------+
| name |
+------+
| A |
+------+
1 row in set (0.00 sec)

MySQL only allowed 1 variation of the letter A, all subsequent failed due to duplication of the PRIMARY KEY , let’s see how it looks on our test collation:


mysql [localhost] {msandbox} (test) > CREATE TABLE `utf8_test` ( `name` varchar(50) CHARACTER SET utf8 COLLATE utf8_test_ci , PRIMARY KEY (`name`)) ENGINE=InnoDB DEFAULT CHARSET utf8;
Query OK, 0 rows affected (0.03 sec)

mysql [localhost] {msandbox} (test) > INSERT INTO utf8_test VALUES ('A');
Query OK, 1 row affected (0.01 sec)

mysql [localhost] {msandbox} (test) > INSERT INTO utf8_test VALUES ('Á');
Query OK, 1 row affected (0.01 sec)

mysql [localhost] {msandbox} (test) > INSERT INTO utf8_test VALUES ('a');
ERROR 1062 (23000): Duplicate entry 'a' for key 'PRIMARY'
mysql [localhost] {msandbox} (test) > INSERT INTO utf8_test VALUES ('á');
ERROR 1062 (23000): Duplicate entry 'á' for key 'PRIMARY'
mysql [localhost] {msandbox} (test) > SELECT * FROM utf8_test;
+------+
| name |
+------+
| A |
| Á |
+------+
2 rows in set (0.00 sec)

In this case, MySQL allowed 2 entries, A and Á, it blocked a because it’s a duplicate of A and it blocked á because it’s a duplicate of Á.

That is it, I hope it will be of any usage for you.

Liked? Help and Share!
This entry was posted in Charset, Collation, MySQL. Bookmark the permalink.