We're sorry but this page doesn't work properly without JavaScript enabled. Please enable it to continue.
Feedback

Character encoding

00:00

Formal Metadata

Title
Character encoding
Subtitle
Breaking and unbreaking your data
Alternative Title
MySQL And Friends - MySQL Character
Title of Series
Number of Parts
150
Author
License
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Identifiers
Publisher
Release Date
Language
Production Year2015

Content Metadata

Subject Area
Genre
18
20
Thumbnail
55:22
24
Thumbnail
49:05
26
Thumbnail
45:24
30
Thumbnail
25:44
37
Thumbnail
26:33
87
89
90
104
Thumbnail
22:20
126
Thumbnail
16:49
127
Codierung <Programmierung>Binary fileSet (mathematics)ASCIIUnicodeLatin squareVisual systemInformationFloating pointDatabaseTable (information)Hydraulic jumpSquare numberDefault (computer science)Level (video gaming)Inheritance (object-oriented programming)Maxima and minimaMultiplicationCodierung <Programmierung>OpticsComplex (psychology)Task (computing)Formal languageInformationNeuroinformatikLatin squareSet (mathematics)Level (video gaming)NumberStability theoryData storage deviceCASE <Informatik>Line (geometry)Electronic visual displayTable (information)Server (computing)TouchscreenWordClient (computing)Cartesian coordinate system1 (number)Radical (chemistry)Multiplication signData conversionPresentation of a groupDefault (computer science)Cone penetration testoutputDifferent (Kate Ryan album)GradientBus (computing)Context awarenessDatabaseDivisorWeb pageFunctional (mathematics)MereologyArithmetic meanRight angleCodePrimitive (album)CuboidBuildingProcess (computing)UnicodeType theoryNumeral (linguistics)Binary codeQuicksortConfiguration spaceXML
Server (computing)Client (computing)Control flowLevel (video gaming)Set (mathematics)Table (information)Default (computer science)Inheritance (object-oriented programming)DatabaseCodierung <Programmierung>Square numberUser interfaceStochastic differential equationStatement (computer science)Drop (liquid)Computer configurationAsynchronous Transfer ModeMathematicsPatch (Unix)Plug-in (computing)Analog-to-digital converterTuring testPrice indexLatin squareMultiplicationLimit (category theory)Software testingConfiguration spaceForceTwitterContext awarenessRevision controlCartesian coordinate systemState of matterTable (information)Line (geometry)Data conversionPower (physics)Medical imagingAngleInformationDimensional analysisDualismMathematicsElement (mathematics)Key (cryptography)Form (programming)ChainClient (computing)Server (computing)Default (computer science)QuicksortComputer configurationSystem callTelecommunicationLatin squareReplication (computing)CASE <Informatik>WebsiteFigurate numberGodLevel (video gaming)Set (mathematics)DatabaseDifferent (Kate Ryan album)AdditionNumbering schemeRight angleTwitterSoftware testingBinary fileRow (database)FlagSkewnessGame theoryPoint (geometry)Shared memorySubject indexingFacebookScaling (geometry)Codierung <Programmierung>Particle systemConfiguration spaceFunctional (mathematics)Database normalizationData storage deviceLengthIndependence (probability theory)Core dumpAtomic numberPlug-in (computing)Googol
Codierung <Programmierung>Linear regressionComputer animationLecture/Conference
GoogolComputer animation
Transcript: English(auto-generated)
Okay. So, hello everyone. My name is Maciej Dabczynski and I'll be talking about character encoding in MySQL, about how you break and then maybe fix your data that was incorrectly
encoded and stored. Okay, so let's start. So, we view documents as lines of text but computers have to see that data, see that information as binary data and so characters
become numeric values, numeric codes and these codes become character encoding where a letter becomes a number. There are two most popular encoding types which is ASCII
obviously and Unicode. Collection of certain characters encoded in a certain way becomes a character set which could be for a language like Latin 1 or Latin US which covers US English
or English in general. Latin 1 which covers most of European languages that do not really need all fancy characters, Latin 2. Also, you'll have characters that use unicode encoding which can be multibyte encoding where a character takes from 1 to 4 bytes and these could be
more complex character sets like Asian languages or pretty much anything these days. So, character set is the context that sort of defines what each character code means and what it becomes when it's printed on the screen on a printer or whatever. So, when the context
is lost, we no longer can tell what a specific code means, what is 65. Is it A? Is it question mark? Is it exclamation mark? We don't know that. So, the character set defines that character for us. If we mix up the character set, then we are most likely unable to figure
it out anymore. So, what can go wrong or when can things go wrong? So, this session is about breaking and fixing character encoding. So, encoding is maintained on multiple levels
from the presentation layer like a web page or a terminal or anything else, a screen, down to the data storage. So, tables, columns and so on. This multi-level layer support
offers flexibility in how we can access information, how we can display information and store information. But it also creates complexity, this is flexibility, but it also creates complexity that often leads to problems. In the next few examples, I'll show you
what these problems can be, how they occur and later how to fix some of those. So, problem number one is, and this is in MySQL specifically, is that MySQL by default is Latin-1. So, its default character encoding is Latin-1. When you install it out of the
go box, this is what it will be using and this also applies to the upcoming MySQL 5.7, I think. And most people these days don't bother really looking at that because they generally will assume that, well, okay, it's a tool I need to use. So, it will support, it will be multi-lingual, it will support whatever I want to store or read
from it, right? It's 2015. So, people don't really look at that too often. So, we kind of will start following this process of building applications. So, we start with
a clean MySQL and a new application, both server and client use Latin-1 by default for no reason. So, we create a schema and we create a first table to support our application that will allow users to store the places where they live. So, we have all these people
coming in. Okay. And then, of course, when we create this table, the server uses the default encoding for this table. It says Latin-1. So, we have a new table that will use Latin-1, of course. So, we have this application where people come in and add their
cities. And so, someone from Germany, someone from Poland, and then there is this guy from Japan who, what I believe, should be, tries to enter Tokyo using his own letters. So, hit save and voila, it's great. It works, right? Everything displays correctly.
So, the problem is that we created a table that is Latin-1. And Latin-1 doesn't really support Japanese characters. It's not in the character set. These characters are outside
of this character set. But, you know, it works. So, we are probably fine, right? Well, actually, we are not fine. So, what MySQL does when it sees a client and a data, it looks at a character set that they use. If the character sets are different on both ends,
it will try to perform an on-the-fly conversion so that the client gets information in the character set it wants. So, we set character set to UTF because, well, we expect input from all around the world in different languages. So, we set this in UTF and MySQL converts our cities into garbage. So, it is definitely not right, right? Interestingly, if we set
this to Latin-1, we got all the right things. So, well, the data is kind of correct but in a kind of weird way. So, is it a problem? Well, I guess you can live with
that. Well, application works. It's fine. So, you can use it and there appear to be no problems. But, there may be such time where something goes wrong. For example,
you'll try to upgrade a Rabi and then all the Rabi will work with that and then your Rabi will tell you, okay, but this Japanese character doesn't really make sense in Latin-1. So, I won't get this data for you. So, this is the place where you need to start thinking, okay, I need to fix this, right? So, you start fixing it.
So, you update a server encoding, you set the client encoding right, you convert data tables and whatnot and obviously you get a table that is now done correctly. It uses correct unicode encoding. So, the good news is sometimes fixable. You can fix the encoding without losing anything or breaking anything. So, then we move to kind of problem number
two. So, I said that encoding or character sets are handled on multiple levels. So, in MySQL you have what, five different settings alone to define what's going on with data,
how it's translated, how it's encoded and how it's converted. You also have schema level defaults, table level defaults and of course column character sets which define the actual storage character set. So, how the data is stored in tables. So, we fixed our first problem, right? We converted data back into, I mean, sorry,
we converted it into UTF so it's now done correctly and we continue to develop our applications. So, we add new functionality that is, we allow people to add their names, right? So, we create a new table and, well, we start our application. So, there were
two guys who already added their names and I decided to mine as well and I click save and then, oh, and then I have a question mark in the middle of my name. So, we fixed the problem but apparently something is still wrong. So, we go, we log in, we log in
and we see that our settings are generally correct but we look at the table and it's Latin one for no reason. So, why? Well, we started investigating configuration
and we see this character set database that is Latin one. Why is it? Well, because when we first created the database, we had the server running with Latin one default encoding. So, the database got its own schema level character set, set to Latin one and we
never fixed that. We converted data, we converted tables. That was fine but all new tables that we kept creating later were receiving this old encoding associated with a schema. So, can we fix this problem? Well, we start digging and it turns out
that my accented N in the last name became a question mark and it permanently became a question mark. So, at this point, there is no going back. I mean, we only have data that has a question mark. My name will always have a question mark in it. If you know my name, you can fix it but you will not know all the answers
for all the rows in all tables. So, essentially, this means that this data is broken beyond repair for one silly reason. So, it may not be enough to set up server correctly. You have to be way more careful with that. Also, it's even more dangerous if you get
difference between the server and client encoding because of the implicit conversions that I mentioned before. So, going back to all these different settings that you can do in MySQL. So, the reason that it happened, the second problem
happened was that schema level defaults, they affect new tables and table level defaults affect new columns. So, what you see in the table definition as a default character set will only be affecting any new column that you add without specifying the character set. And what you see in the schema definition will apply to any new table that you create
without the default character set specified explicitly. So, we have some more examples of that. So, we have this, we create this schema, we
have a server that runs Latin-1 and our client uses UTF-8, right? So, we create a schema and the schema receives the server's default encoding, which is Latin-1, regardless of our client setting. So, then we go into the schema and we create a table. A table
will obviously receive the schema's default encoding, which is also Latin-1. So, then we change the default schema, the default character set for this new table to UTF-8, right? And what it turns out is our column suddenly is still character
set Latin-1. This is not displayed here because the column character set matches the table's default character set. But as soon as there is a difference, the column has this additional text stating that it's a default encoding that the table's default encoding.
So, this is what can happen. So, changing table's character set to something else doesn't really fix the problem because column remains as Latin-1. It's kind of set independently.
So, then we can add a new column to this table, which is already UTF. So, you can see the new column has no character set mentioned in this output, which means it is UTF-8. It is the same as the default encoding for this table. So, essentially, when this happens, when you have all these problems with characters, when you have UTF data stored
in Latin-1 tables, what you can do? Well, first, it's important that you don't act too quickly. You have to analyze the situation, figure out what happened, why it happened, then reassess the damage. Figure out if the problems are consistent, if all data is
broken in the same way, are some tables broken in a different way, or maybe they are broken in several different ways. So, each case has to be handled differently. So, we have to know that. You have to also figure out if the data is reparable. Sometimes
it's not. So, what you shouldn't be doing? Well, you shouldn't act too quickly because you can actually break things even more. You can't fix this on a replication slave because replication will, if you start converting data on a replication slave, replication will continue to send you bad data. Replication carries character set information and encoding.
You can't really fix this one by one, the table is one by one, because this change in the application or server has to be sort of atomic. So, if you switch from Latin-1 to ADTF8, it affects all tables on all communication. So, you can't really do that. It has to be in sort of one go. And you don't want to use these if
you have a broken character set encoding anywhere, because these generally assume that everything is okay. They will convert data, or the last two at least, but only under assumption that everything is right. So, what needs to be fixed? Schema, default character set, like I mentioned, all the text columns in the database, also enum, which
isn't really a text column, but it carries some text information. You can grab all these columns, all these tables from the information schema. Other tables that do not have text information in them, you should eventually fix as well. They won't be affected by the existing problems, but you may want to add new columns to those tables later, and
if they are Latin-1 instead of UTF, you will get Latin-1 columns again. So, what are your options? Well, you can do dump and restore. If you just tell MySQL dump to use the same encoding as the data, as the tables use, so you use Latin-1
from Latin-1 tables, this will keep the implicit conversion and will, for example, export the UTF data correctly. Then you recreate everything, edit the MySQL dump to update table definitions and import everything as UTF into the server again.
The other option is to use alter table and convert each column separately by converting them into binary form and then back into UTF. Converting column data or table data into binary encoding means that basically you drop the character set context, so there is no conversion applied either way. So, this way you can kind of, having UTF data
in a Latin-1 table, you get it into binary form without any changes and then you get to UTF without any changes as well, so it's correct. Last option that we developed recently is to use PT-Online schema change to do this
online without pretty much any downtime. So we wrote a plugin that we use with a PT-Online schema change and a slight modification of the PT-Online schema change to actually let it work. That works like PT-Online schema change, so you can perform your conversion online. The only kind of requirement is that you don't rotate your tables as they
are converted, so you end up with a database that is twice as large with all the tables having both new and old versions and then when everything is converted, you store your application, you fix your broken configs and everything and then you rotate the tables, you basically replace the Latin-1 tables with UTF tables for example,
so the downtime could be less than a minute. There could be a few gotchas, which is if you convert from Latin-1 to UTF, your storage requirements may increase, so you may need to increase the column length or
change the index length to accommodate the new requirements. Also, there could be more problems than you think, so this function can test if the conversion worked, so you can convert all pre-conversion value with a post-conversion value and test if it
actually worked correctly in an automated way, so you can apply it on all data in the database. So how to do it right? Set the character set during initial configuration, always explicitly specify the full character set for everything, including tables, columns
and schemas. Of course, don't forget to set up your applications correctly as well. Okay and one more thing, which is not related to that, if you're interested in web-scale SQL, we are sharing this with everyone, we've been using this and building
binary packages for them for a while, so we are now sharing the web-scale packages with you, and you can grab them and test them, the amazing web-scale from Facebook, Twitter, LinkedIn, Google and others. Any questions?