Database Collation - Explain Choosing
MySQL Database Collation Charts
This page exists to help a new site administrator understand what a Database Collation is, and how to go about selecting the best one for your site.
A collation is something you need for your MySQL database. A collation is a Character Set you'll be able to use in your database and later display on your TNG site, and a set of sorting and matching rules for these characters when you perform a TNG database search. This is different from the issue of using the TNG page at Admin>>Languages to select character set encodings such as UTF-8 or ISO-8859-1 for a TNG "language". (Somewhat confusingly, most natural languages explicitly supported by TNG has two variants that represent different character encodings, but each of those two variants is called a "Language" in TNG.)
Update: This article was written before UTF-8 became the almost-universally-used character encoding, not only in TNG, but also in most databases. At this point, it is probably best for most new TNG databases to use the collation utf8mb4_unicode_ci, or perhaps the slightly less accurate but simpler encodings utf8_unicode_ci or utf8_general_ci.
utf8_general_ci has probably been the most popular encoding for TNG in recent years. The newer collation utf8_unicode_ci handles additional considerations such as the distinctions among German lower-case "ß" and its two upper-case representations, "ẞ" and "SS". The even newer utfmb4_unicode_ci accommodates more characters, including emojis.
If you use a MySQL collation that begins with "UTF", you will need to use TNG's UTF8 languages. Most other MySQL collations call for the non-UTF8 "language" in TNG, though some MySQL collations that do not begin with "UTF" also require the TNG's UTF8 languages.
To pick the right collation you need to first answer two questions:
- What Characters do I need to be able to display on my site?
- When I type something in the "Search" box, how do I want TNG to match letters?
I'll walk you through my example and show you what decisions I made and point out a few tradeoffs.
What Characters do I need to be able to display on my site?
My heritage is Norwegian. This means that in addition to the standard A-Z that I'm used to seeing, I also need Æ Ø Å. These are used frequently in my database to describe places like Nærøy without having to anglicize the name to Naeroy. [Note: remember 'æ' is a single character, like the letter 'a'] If you search for either spelling (Nærøy or Naeroy) in Google Maps you will see the correct spelling presented in the result. I wanted my TNG Ancestry to operate similarly. Which brings us to our next decision...
When I type something in the "Search" box, how do I want TNG to match letters?
My last name is Arno. This name was anglicized from Arnø when my grandfather entered the US through Ellis Island. It is derived from the island my family originated from called lille Arnøya, Nærøy, Norway island.
So when I search for the last name Arno, I would like all spellings of "Arno" and "Arnø" to display. I'd like the same to happen for places. This means 'o' has to match or be treated "the same as" 'ø' in searches.
How do I go about finding a Collation to do these things?
You're in luck! I found a great site which maps out various Collation options Collation Charts. Each chart is very visual and easy to read. This site was put together by Alexander Barkov, a Senior Software Developer in MariaDB Foundation. (Obviously someone who understands the problem well and has the skill to address same).
In some cases, there are several collation charts for a given encoding. Each of those variants is based on exactly the same character set, but focuses on characters used by a set of natural languages (which the charts refer to as "scripts"). For example, there are 11 utf8_general_ci charts, each applying to a regions such as American, European, and Korean.
Collation chart for cp1257_general_ci (MySQL 6.0.4)
What this line in the Chart is telling me is the letter 'O' will match all the other letters to the right, and the reverse is also true. A match for 'O' will return letters 'o', 'Ø', 'ø', ... Conversely, a search for 'ø' will return matches for 'o', 'Ø', 'O'. This is what I want for my search results.
Let's check another line for the letter 'A':
Notice that A will match Æ and æ. So when I want to search for "Nærøy", and all I have is my American keyboard, I have to type "naroy" and the search will return all matches including "Nærøy". This means I can have "Nærøy" in my database, and "Nærøy" will display right in TNG. My Norwegian relatives who have 'æ' and 'ø' on their keyboards can spell "Nærøy" correctly and also find it in their search’s of my database. Also remembering the character matching map from the Collation page will help you figure out how to search for "Nærøy" using "naroy". NOTICE that a search for "Neroy" will -not- match. The Collation table shows why:
The letter 'E' won't match the letter 'Æ'
My Collation Selection
Conclusion - cp1257_general_ci behaves and displays what I want on my American computers for the partially Norwegian database.
Note: I had to set my TNG Language Setup to use UTF-8 for my cp1257_general_ci data to display correctly in web pages.
A caveat, or side effect of using cp1257_general_ci is you may actually want Æ Ø Å treated as separate letters in the Norwegian alphabet in your searches and TNG presentations, such as Roger discusses and demonstrates. Then you might look at utf8_danish_ci or one of the utf8_unicode_ci collations.
I'd like to point out that cp1257_general_ci will not work well for another language like French. Why? Because cp1257_general_ci is missing critical French symbols like 'ç' in the word Français. So for French you'll need to find a different chart with the right symbols and pattern matching rules to achieve the result you like.
French writing rules are different in that five diacritics and two orthographic ligatures are used in combination with letters in French communication. For instance 'ç' in the word Français isn't really a separate letter of the French alphabet, it is the letter 'c' with a diacritical marking. A database collation will have each combination of letter with diacritical and/or ligature marking combination represented as a separate character in the table. The question (#2) which you need to answer is how do you want these individual diacritical and/or ligature marking combinations handled in a Search on your site? Answer this and you'll be on your way to searching Mr. Barkov's Collation Chart collection to find the one collation which comes closest to your answers for #1 and #2 (Characters and Searching). The Collation is the best compromise between the language rules and computer implementation you can find to match your specific application requirements.
How are the Collation Charts Named? (magic decoder ring)
I don't have a complete answer for this, but do have a few helpful observations.
- Notice that every chart name ends in either a 'ci' or a 'cs'. These appear to mean
'ci' - Case Insensitive (or Indifferent) This means in this chart that 'A' and 'a' mean the same thing.
'cs' - Case Sensitive This means in this chart that 'A' and 'a' are two different letters and won't match in a search.
If you look at two charts named the same except one ends in "cs" and the other in "ci" you'll see both the upper and lowercase letters next to each other in the "ci" chart, and on different rows or both colored yellow in the "cs" chart.
Charts I found particularly useful, or interesting
Pretty much anything with the word "general", "european", or "Unicode" in the name. European seemed to have matching rules I liked, General was a little more stiff about not matching characters. i.e. in many cases 'o' and 'ø' are treated as two different characters, and in the Norwegian alphabet they -are- two different characters, but remember I wanted the anglicized version of my name to match the Norwegian spelling in a search. Unicode gave a lot of characters to use, and treated all as individual characters needing an exact match in a search.
If you want to see what kinds of headaches a database designer has to handle, look at Japanese Collation (keep scrolling down, down, down...)
So figure out what you want and then go searching through Mr. Barkov's Collation Charts for the best match for what you want to happen on your site.
Examples: What works for me
If you spent time selecting a collation for your site and wish to share, consider sharing your setup in the table below. i.e. Edit this section and add your row to this table. Your site and name are optional. Please don't modify any existing entries, unless you put it there yourself.
|Language(s)||Database Collation||Character Set||Your Site||Your Name|
|English, Norwegian||cp1257_general_ci||UTF-8||Arnø||Craig Arno|
|English, French||utf8_swedish_ci||UTF-8||Our Roy and Boucher Families||Ken Roy|