Character encoding
Formal Metadata
Number of parts: 150
License: CC Attribution 2.0 Belgium: You may use and modify the work or content for any legal purpose and reproduce, distribute and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner specified by them.
Identifiers: 10.5446/34441 (DOI)
Production year: 2015
FOSDEM 2015, part 48 of 150
Transcript: English (automatically generated)
00:07
Okay. So, hello everyone. My name is Maciej Dabczynski and I'll be talking about character encoding in MySQL, about how you break and then maybe fix your data that was incorrectly
00:26
encoded and stored. Okay, so let's start. So, we view documents as lines of text but computers have to see that data, see that information as binary data and so characters
00:47
become numeric values, numeric codes, and these codes become a character encoding, where a letter becomes a number. The two most popular encoding types are ASCII,
01:02
obviously, and Unicode. A collection of certain characters encoded in a certain way becomes a character set, which can be tied to a language: ASCII, which covers US English
01:21
or English in general; Latin-1, which covers most European languages that do not really need any fancy characters; Latin-2; and so on. Also, you have character sets that use Unicode encoding, which can be a multi-byte encoding where a character takes from one to four bytes, and these can be
01:42
more complex character sets like Asian languages, or pretty much anything these days. So, a character set is the context that sort of defines what each character code means and what it becomes when it's printed on the screen, on a printer, or wherever. So, when the context
02:06
is lost, we can no longer tell what a specific code means, what 65 is. Is it an A? Is it a question mark? Is it an exclamation mark? We don't know. So, the character set defines that character for us. If we mix up the character sets, then we are most likely unable to figure
02:25
it out anymore. So, what can go wrong or when can things go wrong? So, this session is about breaking and fixing character encoding. So, encoding is maintained on multiple levels
02:42
from the presentation layer, like a web page or a terminal or anything else, a screen, down to the data storage, so tables, columns and so on. This multi-level support
03:05
offers flexibility in how we can access, display and store information. But this flexibility also creates complexity that often leads to problems. In the next few examples, I'll show you
03:21
what these problems can be, how they occur and, later, how to fix some of them. So, problem number one, and this is in MySQL specifically, is that MySQL by default is Latin-1. So, its default character encoding is Latin-1. When you install it out of the
03:40
box, this is what it will be using, and this also applies to the upcoming MySQL 5.7, I think. And most people these days don't really bother looking at that, because they generally assume that, well, okay, it's a tool I need to use, so it will be multilingual, it will support whatever I want to store or read
04:08
from it, right? It's 2015. So, people don't really look at that too often. So, let's kind of follow this process of building an application. So, we start with
04:23
a clean MySQL and a new application, both server and client use Latin-1 by default for no reason. So, we create a schema and we create a first table to support our application that will allow users to store the places where they live. So, we have all these people
04:44
coming in. Okay. And then, of course, when we create this table, the server uses the default encoding for this table. It says Latin-1. So, we have a new table that will use Latin-1, of course. So, we have this application where people come in and add their
05:03
cities. And so, someone from Germany, someone from Poland, and then there is this guy from Japan who, I believe, tries to enter Tokyo using his own letters. So, he hits save and voila, it's great. It works, right? Everything displays correctly.
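What happens under the hood can be sketched roughly like this (the schema and table names here are made up for illustration, not taken from the talk): the new table silently inherits the server's Latin-1 default, and a client that also claims to speak Latin-1 triggers no conversion, so the UTF-8 bytes of the Japanese text go in and come back out untouched.

    -- server still runs with its default: character-set-server = latin1
    CREATE DATABASE app;                 -- inherits latin1 from the server
    CREATE TABLE app.cities (
        id   INT AUTO_INCREMENT PRIMARY KEY,
        city VARCHAR(100)                -- inherits latin1 via the table and schema defaults
    );

    SET NAMES latin1;                    -- the client claims latin1 as well
    INSERT INTO app.cities (city) VALUES ('München'), ('東京');
    SELECT city FROM app.cities;         -- no conversion on either side, so a UTF-8
                                         -- terminal shows both names "correctly"

This only looks right because the same mislabelled bytes travel in both directions without ever being converted.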
05:27
So, the problem is that we created a table that is Latin-1. And Latin-1 doesn't really support Japanese characters. They are not in the character set. These characters are outside
05:41
of this character set. But, you know, it works. So, we are probably fine, right? Well, actually, we are not fine. What MySQL does when it sees a client and the data is look at the character sets that they use. If the character sets are different on the two ends,
06:01
it will try to perform an on-the-fly conversion so that the client gets the information in the character set it wants. So, we set the client character set to UTF-8 because, well, we expect input from all around the world in different languages. So, we set this to UTF-8 and MySQL converts our cities into garbage. So, this is definitely not right, right? Interestingly, if we set
06:30
this to Latin-1, we get all the right things. So, well, the data is kind of correct, but in a kind of weird way. So, is it a problem? Well, I guess you can live with
06:47
that. Well, the application works. It's fine. So, you can use it and there appear to be no problems. But there may come a time when something goes wrong. For example,
07:00
you'll try to upgrade a library, and the library will want to work with that data, and then your library will tell you, okay, but this Japanese character doesn't really make sense in Latin-1, so I won't get this data for you. So, this is the point where you need to start thinking, okay, I need to fix this, right? So, you start fixing it.
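The symptom just described can be reproduced from the mysql client against the hypothetical cities table sketched above; this is only an illustration of the on-the-fly conversion, not a slide from the talk.

    SET NAMES utf8;                -- the client now asks for UTF-8
    SELECT city FROM app.cities;   -- the server converts latin1 -> utf8 and the
                                   -- mislabelled bytes come back as mojibake

    SET NAMES latin1;              -- back to the "wrong" setting
    SELECT city FROM app.cities;   -- bytes pass through untouched and look right again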
07:21
So, you update the server encoding, you set the client encoding right, you convert the data tables and whatnot, and you get a table that is now done correctly. It uses the correct Unicode encoding. So, the good news is that this is sometimes fixable. You can fix the encoding without losing or breaking anything. So, then we move on to kind of problem number
07:45
two. So, I said that encoding or character sets are handled on multiple levels. So, in MySQL you have what, five different settings alone to define what's going on with data,
08:01
how it's translated, how it's encoded and how it's converted. You also have schema level defaults, table level defaults and of course column character sets which define the actual storage character set. So, how the data is stored in tables. So, we fixed our first problem, right? We converted data back into, I mean, sorry,
08:23
we converted it into UTF-8, so it's now done correctly, and we continue to develop our application. So, we add new functionality; that is, we allow people to add their names, right? So, we create a new table and, well, we start our application. So, there were
08:45
two guys who had already added their names, and I decided to add mine as well, and I click save and then, oh, I have a question mark in the middle of my name. So, we fixed the problem, but apparently something is still wrong. So, we go and log in
09:06
and we see that our settings are generally correct, but we look at the table and it's Latin-1 for no apparent reason. So, why? Well, we start investigating the configuration
09:23
and we see this character_set_database setting that is Latin-1. Why is that? Well, because when we first created the database, we had the server running with the Latin-1 default encoding. So, the database got its own schema-level character set, set to Latin-1, and we
09:42
never fixed that. We converted the data, we converted the tables. That was fine, but all new tables that we kept creating later were receiving this old encoding associated with the schema. So, can we fix this problem? Well, we start digging and it turns out
10:01
that the accented N in my last name became a question mark, and it became a question mark permanently. So, at this point, there is no going back. I mean, we only have data that has a question mark. My name will always have a question mark in it. If you know my name, you can fix it, but you will not know all the answers
10:23
for all the rows in all tables. So, essentially, this means that this data is broken beyond repair, for one silly reason. So, it may not be enough to set up the server correctly. You have to be way more careful with that. Also, it's even more dangerous if you get
10:43
a difference between the server and client encoding, because of the implicit conversions that I mentioned before. So, going back to all these different settings that you have in MySQL. So, the reason that the second problem
11:03
happened was that schema-level defaults affect new tables and table-level defaults affect new columns. So, what you see in the table definition as a default character set will only affect new columns that you add without specifying a character set. And what you see in the schema definition will apply to any new table that you create
11:24
without a default character set specified explicitly. So, here are some more examples of that. So, we have this setup: we
11:41
have a server that runs Latin-1 and our client uses UTF-8, right? So, we create a schema and the schema receives the server's default encoding, which is Latin-1, regardless of our client setting. So, then we go into the schema and we create a table. A table
12:02
will obviously receive the schema's default encoding, which is also Latin-1. So, then we change the default character set for this new table to UTF-8, right? And it turns out that our column is somehow still character
12:26
set Latin-1. This is not displayed as long as the column character set matches the table's default character set. But as soon as there is a difference, the column gets this additional text stating its own character set, which is no longer the table's default encoding.
12:46
So, this is what can happen. So, changing the table's character set to something else doesn't really fix the problem, because the column remains Latin-1. It's kind of set independently.
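A rough sketch of the default inheritance and of this surprise (object names are again made up, and the exact SHOW CREATE TABLE output may differ slightly between versions):

    -- server default is still latin1
    CREATE DATABASE app2;                          -- schema default: latin1
    CREATE TABLE app2.users (name VARCHAR(100));   -- table default and column: latin1

    ALTER TABLE app2.users DEFAULT CHARACTER SET utf8;   -- changes the table default only
    SHOW CREATE TABLE app2.users;
    -- ... `name` varchar(100) CHARACTER SET latin1 ...  <- the existing column kept latin1

    ALTER TABLE app2.users ADD COLUMN email VARCHAR(100);  -- the new column picks up utf8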
13:01
So, then we can add a new column to this table, and it is UTF-8 right away. You can see the new column has no character set mentioned in this output, which means it is UTF-8, the same as the default encoding for this table. So, essentially, when this happens, when you have all these problems with characters, when you have UTF-8 data stored
13:26
in Latin-1 tables, what can you do? Well, first, it's important that you don't act too quickly. You have to analyze the situation, figure out what happened and why it happened, then assess the damage. Figure out if the problems are consistent, if all data is
13:44
broken in the same way, if some tables are broken in a different way, or maybe they are broken in several different ways. Each case has to be handled differently, so we have to know that. You also have to figure out if the data is repairable. Sometimes
14:00
it's not. So, what shouldn't you be doing? Well, you shouldn't act too quickly, because you can actually break things even more. You can't fix this on a replication slave, because if you start converting data on a replication slave, replication will continue to send you bad data; replication carries character set and encoding information.
14:22
You can't really fix this one table at a time, because this change in the application or server has to be sort of atomic. So, if you switch from Latin-1 to UTF-8, it affects all tables and all communication. So, you can't really do that; it has to be done in sort of one go. And you don't want to use these if
14:41
you have broken character set encoding anywhere, because these generally assume that everything is okay. They will convert data, or the last two at least, but only under the assumption that everything is right. So, what needs to be fixed? The schema default character set, like I mentioned, all the text columns in the database, also enum, which
15:05
isn't really a text column, but it carries some text information. You can grab all these columns, all these tables from the information schema. Other tables that do not have text information in them, you should eventually fix as well. They won't be affected by the existing problems, but you may want to add new columns to those tables later, and
15:25
if they are Latin-1 instead of UTF-8, you will get Latin-1 columns again. So, what are your options? Well, you can do a dump and restore. If you just tell mysqldump to use the same encoding as the tables use, so you use Latin-1
15:42
from Latin-1 tables, this skips the implicit conversion and will, for example, export the UTF-8 data correctly. Then you recreate everything: edit the mysqldump output to update the table definitions and import everything as UTF-8 into the server again.
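The text and enum columns mentioned a moment ago can be pulled from information_schema whichever option is used; a query along these lines (assuming the affected schema is called app) is one way to list them:

    SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, CHARACTER_SET_NAME
    FROM information_schema.COLUMNS
    WHERE TABLE_SCHEMA = 'app'
      AND DATA_TYPE IN ('char', 'varchar', 'tinytext', 'text',
                        'mediumtext', 'longtext', 'enum', 'set')
      AND CHARACTER_SET_NAME <> 'utf8';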
16:01
The other option is to use ALTER TABLE and convert each column separately, by converting it into binary form and then back into UTF-8. Converting column or table data into a binary encoding basically means that you drop the character set context, so no conversion is applied either way. So, this way, having UTF-8 data
16:25
in a Latin-1 table, you get it into binary form without any changes and then into UTF-8 without any changes as well, so it's correct. The last option, which we developed recently, is to use pt-online-schema-change to do this
16:42
online with pretty much no downtime. So we wrote a plugin that we use with pt-online-schema-change, plus a slight modification of pt-online-schema-change to actually let it work. It works like pt-online-schema-change, so you can perform your conversion online. The only real requirement is that you don't rotate your tables as they
17:04
are converted, so you end up with a database that is twice as large, with all the tables having both new and old versions. Then, when everything is converted, you stop your application, you fix your broken configs and everything, and then you rotate the tables: you basically replace the Latin-1 tables with the UTF-8 tables, for example,
17:24
so the downtime could be less than a minute. There could be a few gotchas: if you convert from Latin-1 to UTF-8, your storage requirements may increase, so you may need to increase the column length or
17:45
change the index length to accommodate the new requirements. Also, there could be more problems than you think, so this function can test whether the conversion worked: you can compare each pre-conversion value with the post-conversion value and test if it
18:05
actually worked correctly, in an automated way, so you can apply it to all data in the database. So how do you do it right? Set the character set during the initial configuration, and always explicitly specify the full character set for everything, including tables, columns
18:23
and schemas. Of course, don't forget to set up your applications correctly as well. Okay, and one more thing, which is not related to that: if you're interested in WebScaleSQL, we are sharing it with everyone. We've been using it and building
18:42
binary packages for it for a while, so we are now sharing the WebScaleSQL packages with you, and you can grab them and test them: the amazing WebScaleSQL from Facebook, Twitter, LinkedIn, Google and others. Any questions?
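For reference, the per-column binary round trip described earlier, together with the kind of explicit declarations the closing advice boils down to, might look roughly like this (table, column and lengths are hypothetical; as noted above, lengths and index prefixes may need adjusting):

    -- UTF-8 bytes were stored in a latin1 column: drop the character set context
    -- by going through a binary type, then relabel the column as UTF-8.
    ALTER TABLE cities MODIFY city VARBINARY(100);
    ALTER TABLE cities MODIFY city VARCHAR(100) CHARACTER SET utf8;
    ALTER TABLE cities DEFAULT CHARACTER SET utf8;

    -- For new objects, spell the character set out explicitly at every level.
    CREATE DATABASE app3 DEFAULT CHARACTER SET utf8;
    CREATE TABLE app3.users (
        name VARCHAR(100) CHARACTER SET utf8
    ) DEFAULT CHARACTER SET utf8;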