Database Collation - Explain Choosing

From TNG_Wiki
Jump to: navigation, search

MySQL Database Collation Charts

Quick external references for returning visitors: Collation Charts | My Norwegian Language Collation Choice


This page exists to help a new site administrator understand what a Database Collation is, and how to go about selecting the best one for your site.

A collation is something you need for your MySQL database. A collation is a Character Set you'll be able to use in your database and later display on your TNG site, and a set of sorting and matching rules for these characters when you perform a TNG database search. This is different from the issue of selecting UTF-8 or ISO-8859-1 in the page "Setup TNG Language".


To pick the right collation you need to first answer two questions:

  1. What Characters do I need to be able to display on my site?
  2. When I type something in the "Search" box, how do I want TNG to match letters?

I'll walk you through my example and show you what decisions I made and point out a few tradeoffs.


What Characters do I need to be able to display on my site?

My heritage is Norwegian. This means that in addition to the standard A-Z that I'm used to seeing, I also need Æ Ø Å. These are used frequently in my database to describe places like Nærøy without having to anglicize the name to Naeroy. [Note: remember 'æ' is a single character, like the letter 'a'] If you search for either spelling (Nærøy or Naeroy) in Google Maps you will see the correct spelling presented in the result. I wanted my TNG Ancestry to operate similarly. Which brings us to our next decision...


When I type something in the "Search" box, how do I want TNG to match letters?

My last name is Arno. This name was anglicized from Arnø when my grandfather entered the US through Ellis Island. It is derived from the island my family originated from called lille Arnøya, Nærøy, Norway island.

So when I search for the last name Arno, I would like all spellings of "Arno" and "Arnø" to display. I'd like the same to happen for places. This means 'o' has to match or be treated "the same as" 'ø' in searches.


How do I go about finding a Collation to do these things?

You're in luck! I found a great site which maps out various Collation options Collation Charts. Each chart is very visual and easy to read. This site was put together by Alexander Barkov, a Senior Software Developer in MariaDB Foundation. (Obviously someone who understands the problem well and has the skill to address same).


The Collation Charts contain elements like cp1257_general_ci:

Collation chart for cp1257_general_ci (MySQL 6.0.4)

O
4F
004F
o
6F
006F
Ø
A8
00D8
ø
B8
00F8
Ó
D3
00D3
Ō
D4
014C
Õ
D5
00D5
Ö
D6
00D6
ó
F3
00F3
ō
F4
014D
õ
F5
00F5
ö
F6
00F6

What this line in the Chart is telling me is the letter 'O' will match all the other letters to the right, and the reverse is also true. A match for 'O' will return letters 'o', 'Ø', 'ø', ... Conversely, a search for 'ø' will return matches for 'o', 'Ø', 'O'. This is what I want for my search results.


Let's check another line for the letter 'A':

A
41
0041
a
61
0061
Æ
AF
00C6
æ
BF
00E6
Ą
C0
0104
Ā
C2
0100
Ä
C4
00C4
Å
C5
00C5
ą
E0
0105
ā
E2
0101
ä
E4
00E4
å
E5
00E5

Notice that A will match Æ and æ. So when I want to search for "Nærøy", and all I have is my American keyboard, I have to type "naroy" and the search will return all matches including "Nærøy". This means I can have "Nærøy" in my database, and "Nærøy" will display right in TNG. My Norwegian relatives who have 'æ' and 'ø' on their keyboards can spell "Nærøy" correctly and also find it in their search’s of my database. Also remembering the character matching map from the Collation page will help you figure out how to search for "Nærøy" using "naroy". NOTICE that a search for "Neroy" will -not- match. The Collation table shows why:

E
45
0045
e
65
0065
Ę
C6
0118
Ē
C7
0112
É
C9
00C9
Ė
CB
0116
ę
E6
0119
ē
E7
0113
é
E9
00E9
ė
EB
0117

The letter 'E' won't match the letter 'Æ'


My Collation Selection

Conclusion - cp1257_general_ci behaves and displays what I want on my American computers for the partially Norwegian database.

Note: I had to set my TNG Language Setup to use UTF-8 for my cp1257_general_ci data to display correctly in web pages.

A caveat, or side effect of using cp1257_general_ci is you may actually want Æ Ø Å treated as separate letters in the Norwegian alphabet in your searches and TNG presentations, such as Roger discusses and demonstrates. Then you might look at utf8_danish_ci or one of the utf8_unicode_ci collations.

I'd like to point out that cp1257_general_ci will not work well for another language like French. Why? Because cp1257_general_ci is missing critical French symbols like 'ç' in the word Français. So for French you'll need to find a different chart with the right symbols and pattern matching rules to achieve the result you like.

French writing rules are different in that five diacritics and two orthographic ligatures are used in combination with letters in French communication. For instance 'ç' in the word Français isn't really a separate letter of the French alphabet, it is the letter 'c' with a diacritical marking. A database collation will have each combination of letter with diacritical and/or ligature marking combination represented as a separate character in the table. The question (#2) which you need to answer is how do you want these individual diacritical and/or ligature marking combinations handled in a Search on your site? Answer this and you'll be on your way to searching Mr. Barkov's Collation Chart collection to find the one collation which comes closest to your answers for #1 and #2 (Characters and Searching). The Collation is the best compromise between the language rules and computer implementation you can find to match your specific application requirements.

How are the Collation Charts Named? (magic decoder ring)

I don't have a complete answer for this, but do have a few helpful observations.

  1. Notice that every chart name ends in either a 'ci' or a 'cs'. These appear to mean
'ci' - Case Insensitive (or Indifferent)
        This means in this chart that 'A' and 'a' mean the same thing.
'cs' - Case Sensitive
    This means in this chart that 'A' and 'a' are two different letters and won't match in a search.

If you look at two charts named the same except one ends in "cs" and the other in "ci" you'll see both the upper and lowercase letters next to each other in the "ci" chart, and on different rows or both colored yellow in the "cs" chart.


Charts I found particularly useful, or interesting?

Pretty much anything with the word "general", "european", or "Unicode" in the name. European seemed to have matching rules I liked, General was a little more stiff about not matching characters. i.e. in many cases 'o' and 'ø' are treated as two different characters, and in the Norwegian alphabet they -are- two different characters, but remember I wanted the anglicized version of my name to match the Norwegian spelling in a search. Unicode gave a lot of characters to use, and treated all as individual characters needing an exact match in a search.

If you want to see what kinds of headaches a database designer has to handle, look at Japanese Collation (keep scrolling down, down, down...)


So figure out what you want and then go searching through Mr. Barkov's Collation Charts for the best match for what you want to happen on your site.


Examples: What works for me

If you spent time selecting a collation for your site and wish to share, consider sharing your setup in the table below. i.e. Edit this section and add your row to this table. Your site and name are optional. Please don't modify any existing entries, unless you put it there yourself.

Language(s) Database Collation Character Set Your Site Your Name
English, Norwegian cp1257_general_ci UTF-8 Arnø Craig Arno
English, French utf8_swedish_ci UTF-8 Our Roy and Boucher Families Ken Roy
Example Example Example Example Example

Related links