Character encoding

Video in TIB AV-Portal: Character encoding

Formal Metadata

Character encoding
Breaking and unbreaking your data
Alternative Title
MySQL And Friends - MySQL Character
Title of Series
CC Attribution 2.0 Belgium:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor.
Release Date
Production Year

Content Metadata

Subject Area
Web page Complex (psychology) Presentation of a group Context awareness Table (information) Cone penetration test Code Latin square Codierung <Programmierung> Visual system Set (mathematics) Floating point Database Primitive (album) Unicode Number Formal language Neuroinformatik Set (mathematics) Energy level Cuboid Electronic visual display Information Codierung <Programmierung> ASCII Task (computing) Default (computer science) Multiplication Touchscreen Information Latin square Data storage device Line (geometry) Binary file Radical (chemistry) Arithmetic mean Personal digital assistant Optics Table (information)
Building Context awareness User interface Latin square State of matter Length Multiplication sign 1 (number) Set (mathematics) Numbering scheme Database Price index Client (computing) Mereology Replication (computing) Dimensional analysis Formal language Medical imaging Mathematics Different (Kate Ryan album) Computer configuration Set (mathematics) Bus (computing) Flag Data conversion Multiplication Stability theory God Gradient Latin square Data storage device Drop (liquid) Control flow Process (computing) Computer configuration Telecommunication Chain output Website Software testing Right angle Energy level Figurate number Quicksort Row (database) Point (geometry) Asynchronous Transfer Mode Server (computing) Functional (mathematics) Inheritance (object-oriented programming) Table (information) Divisor Codierung <Programmierung> Maxima and minima Limit (category theory) Binary file Number Element (mathematics) Power (physics) Revision control Energy level Statement (computer science) Codierung <Programmierung> Form (programming) Default (computer science) Plug-in (computing) Default (computer science) Addition Turing test Information Key (cryptography) Patch (Unix) Server (computing) Client (computing) Database Skewness Line (geometry) Cartesian coordinate system System call Mathematics Subject indexing Stochastic differential equation Word Database normalization Personal digital assistant Analog-to-digital converter Dualism Game theory Table (information) Hydraulic jump Square number
Turing test Table (information) Server (computing) Codierung <Programmierung> Client (computing) Control flow Cartesian coordinate system Particle system Set (mathematics) Configuration space Configuration space Table (information) Force
Facebook Scaling (geometry) Angle Linear regression Codierung <Programmierung> Shared memory Software testing Twitter Twitter
the this so hello everyone my name is my giants skiing and I'll be I'm talking about the character encoding my SQL about about how you break and then maybe peaks your died out that it was incorrectly encoded then stored the it on the case always start so we view documents past lines of text about computers have to see that divested information on us as binary lot of 10 so characters become numeric values numerate codes and these codes become character encoding where a letter becomes a number and the 2 most popular encoding tives which is task you obviously and and code a collection of of certainly Our characters encoded in a certain way becomes a character set which could be for a language like Latin 1 as which covers US-English optic English in general I'll let in 1 which covers most of Europe peeling which is that if we need all fancy characters like into fall so you'll have multi the you'll have characters that is unique code coding which can be multi-byte encoding where cocker takes from 1 to 4 bytes and this could be more complex character sets like Asian languages or of primitive things these days I can so Cockta set is the context so that certain defines what each carcass called means and what it because when it's printed on the screen on the printer or whatever so uh when the context is lost we no longer can tell what specific cone means what 65 is it a is it question mark is not accept a exclaimation mark we don't know that so the character set defines that for us but it we mix up the character set then we are most likely unable to figure it out anymore so what can go wrong when can things go wrong so this session is about breaking and fixing character encoding so encoding is maintained on multiple levels from the presentation layer like web page or the of the terminal and terminals or anything else a screen but down to 2 . us to reach so tables columns and so on this a 1st put this multi level layer support but offers flexibility in how is is how can access information out can display information and store information my also creates complexity of is that ability but it also creates a problem complexity that often leads to problems only in the next few examples I'll show you what these problems can be our have probably occur and the later how to fix some of those so problem number 1 is and it is in my school specific always in my skill by default is what Latin 1 so it's default cotton continues that want when a solid out of a box this is what you will be using and this also applies to the upcoming bicycle 5 7 lifting and what most people this is done by a really looking at that because they generally will assume that locates tool I need to use so it will support of love
crack the use of you will be multilingual will support where I want to store or read from it right it's 2 thousand so people don't really look at that too often so we can will start following this process of building applications so we start with a the my Scuola any application both server and client use Latin 1 by default for no reason so we create a schema and we created 1st able to support obligation to store the wild users to store up the places where they leave so we have all these people coming they at OK the and then of course when create a stable the server uses the full encoding for stable it's as much in 1 so we have a new table they will use like 1 of course so we have this application where people come in and obduracy and so if someone for German someone ball and then end-users guy from Japan who would ideally should be tries to and her Tokyo using his own of letters so I think it's safe and what it's is ready what's right and in this place correctly so the phone is that we created table is what and what and I think 1 doesn't really support Japanese characters it's not in the in Latin and it's not in the practice of these factors are outside of his character set but you know it works so we qualified right why actually we're not fine so in my school bus when he sees a client and that of it looks at the Carter said that the use if the character sets are different on both ends it will try to perform and in fact uh on-the-fly conversion so the client that's information in the character set it once so we said Carter said to you because what we expect input from all around the world in different languages so we said this new given my school converts are uh cities into garbage so so it is definitely not right right interestingly it was sentenced to Latin 1 we got all the right things so well that is kind of correct bonds in a weird way so is a problem well I guess you can live with that well application works it's fine so we can use it and there appear to be no problems but there may be such time where something goes wrong size for example you'll try to a robbing and all the Robby will work with that and in your Robbie will tell you OK but this Japanese character doesn't really make sense in letting ones so I want I want that this data for you so this is a word the place where you need to start thinking OK I need to I need a fixed it's right so Mr. faking fixing it aid updated server encoding you said declining calling right convert got temblors and what not see get a table that is now is not done correctly uses correct unique coding coding so the good news is a sometimes poetic except you can fixed encoding without losing anything or breaking anything so then we move to the problem number 2 so I said that I encoding of character sets are held on multiple levels say my skill you have so what 5 different settings along to define what was going on with the uh with the without how it's translated how it's encoded in the how it's converted you also have this schema level the false stable of all the forwards all the 1st column correctors sense which define the actual storage a character set so how did at a story in tables so with fixed on 1st problem right we converted that about into I'm it's our weaker interview UTF so done correctly our and we continue to develop our application so we are new functionalities are we also people to add their names right so we create a new any table I well we start application so there were 2 guys who already ordered your names and I decided to mine aswell medically safe and then and then I have a question mark in the middle of my name so we fix the problem but this a part of this something is still wrong so we go
will mean we see that our settings are generally correct but really the able and it's like 1 for no reason so so what i what would story investigating co duration and we see this Carter said database that is Latin 1 why is it while because when we 1st grade and updates we had the server running with Latin wonderful encoding so that a base got its own schema level of practice set set to lot in 1 and we've never fixed at we converted data we can converted tables that was fine but all new tables and we kept creating later were receiving this all encoding associated with a schema so can we fix this problem wall we started the game and it turns out that my accented and in the last name became a question more and permanently became a question mark so at this point there is no going back I mean we only have died it is a question mark and my will always have the question or if we you know my name you can fix it but he will not know all the answers for all the rows in all tables so essentially this means that this data is broken beyond repair for 1 silly reason so it may not be enough to set up sir correctly you have to be way more careful with that also is even would it more dangerous you get the difference between the server and client encoding because of the implicit convergence that I mentioned before so going back to this this so all these different settings that it can do in my skew up so the reason that it happened at the problem 2nd problem happened was that our schema level the false they affect new tables and table of all the false flag in columns so what you see in the table definition as a default character set will only be affected any new column that you are without specifying the practice at annual you seen the schema definition will apply to any new table that you created without the default character set specified explicitly so i'm we have some more examples of that so we and this we create previous schema we have a server that runs Latin 1 and art client uses the deified right so we create a scheme I and this team receives the servers default encoding which is lacking 1 regardless of our clients setting so then we go into this scheme of and we create a table a table will obviously received that schemas the full encoding which is also like in 1 of once of so then we change the default scheme of the full carcasses for his new table to UTF-8 right and what it turns out is are column summary is still Carter said Latin 1 this is not this plate here because the common characters and much of the tables evil characters but as soon as there is a difference in the column has this additional text states stating that is the default encoding that the tables if all encoding so this is what happens so changing people's character set to something else doesn't really fix the problem because column remains as of Latin 1 it's it's the kind of setting and then click so then we cannot be new column to these new to this table which is already so this is the problem has no character set mentioned in this in itself is out of which means it is to the of Lake is the same as the the calling for stable so essentially when this happens when you have all these problems with carcass we have like what a beauty of God lacking 1 tables so you can do all 1st it's important that you don't do that the you don't act to quickly you have to analyze the situation if you're out what happened why it happened then reassessed damage figure out if the problems are consistent if all that is broken is same day way our some tables broken in a different way or maybe they're brick and broken in several different ways so each case has to be handled differently so we have to knowledge you yet also if you're out dies the probable sometimes it's not so what he shouldn't be doing well all you shouldn't have to quickly because you can actually break things even more you can't fix this eradication slave because replication if you start converting other of the site replication will continue to send you back that replication carries practice information and encoding but you can't really fix this 1 by 1 of the papers tables 1 by 1 because this change in application or server has to be sort of atomic so if you switch from document a duty evaded it affects all tables on all communication so you can't really do that he has to seen sort of 1 go and you don't want to use deceive you haven't broken a characters syncording anywhere because D is generally assume that the things OK they will convert dada or the blast to at least but only on the assumption that everything is right so what the Sadr fixed schema of the full character set like a mention of all the call all the text columns in that our base also in in ways doesn't which isn't really a text column body carries some text information you can drop all these columns only sailors from information skin all the tables and Donald Hebb text information that you should mn tool takes as well they won't be affected by the existing problems but you may want to ot the new columns this was later and if they are what in 1 city together you will get in 1 columns so what are your options well you can do down if you just tell my spilled them to use the same encoding as the dot up Our as in tables as if tables you so you use Latin want to don't uh from tables this will keep the implicit conversion and will for example x for here about are correctly and you recreated thing edits that my down to the of the table definitions and important thing as everything us UTF into the server again the other option is to use all the table and convert each column separately by converting them into binary form and then bump into UTM convergent converting the column that out or table that ain't binary encoding means that basically you drop that characters of context so there's no conversion applied I'd away so this way you can kind of having you as you idiot dying at walking 1 Table you get our you get it into binary form without any chains and then you get to you without any changes as well so it's correct it would less option we developed recently is to use be dealing schema change to do this online without 3 months and downtime so we're when applying in that we use Wikipedia my schema change and a slight modification of the key elements dimensions to actually plenty work but that works like the dualist imaging so can perform your conversion lines the only kind of requirement is that you don't properly your tables of state power converters to end up with a database is twice as large with all the tables having fault new and all the versions and then when as the summary star application you fix your your broken context and another thing and then you retain the tables you basic replaced Latin 1 tables for with UTF tables for example so the downtime can be less than a minute the they could be if you not just which is if you convert from Latin once you hear about your storage requirements may increase so I and you could you may need to increase their of the column length or in change the index climbed to accommodate then your requirements well also there could be more problems than you think so this function can tell the contested the conversion work 2nd compared all pre-conversion value with a customer redundant that's that's it actually work correctly in it how to make it a automated ways it can applied on all that I that a base of so how to read write said the
practice entering her configuration always explicitly specify particle collect character set for everything including tables columns and art schemas of course don't forget to don't forget to set up your applications correctly as well the OK
and 1 more thing which is not related to that if
you're interested in went wept scale school all they're sharing this with everyone would be new using this and billing dire Park is for and for a while so we are not sure know assuring them that watch this topic is with you which is and you can problem and test and that's amazing what this question scholar from Facebook Twitter . angle elders of like
regressions yeah I think that it was the same for what and that the but only and sir you see yeah you does all the the a