Putting the cork back on the bottle

Automated media analysis

Speech transcript
Apart from the obvious pun with Cork and the bottle, it has always struck me how naturally the whole world of encodings could be described as a Pandora's box, or bottle [unclear]. So it's really a pun on both counts: on the city of Cork and on the general situation of encodings today. So let's start at the beginning. The genesis was
that during this summer Mojca Miklavec and I [unclear: Mojca is a student in Slovenia; the fields of study named here are garbled in the recording], and I'm doing something in between as well, maths and computer science. We were sponsored by Google, which gives out a lot of money to students to commit open source code. Mojca [unclear] something related to TeX documentation, and I made a proposal related to making TeX more Unicode-compliant, which was maybe a bit ambitious [unclear]. I'll start with what that means exactly.
First, a basic sanity check: what is Unicode? I suppose everyone has heard the name and has some idea about what it is. My informal definition is: a universal character set, suitable for any writing system and any script. It's not the official definition; that one I think you will find in the big book [unclear], which has more than a thousand pages, let's say. And then the actual content of my project for the summer: what does it mean for a TeX-based system, for a typesetting system, to be really Unicode-compliant? It's really not clear at all. Of course I had some idea before beginning in the spring, but then we started actually doing something related, but not completely the same. The first
step [unclear] is related to hyphenation patterns. As you know, we have quite a lot of hyphenation patterns for many languages in TeX distributions, and none of them are in UTF-8 encoding, so they're quite un-Unicode in that respect. That doesn't mean that they can't be used for many different writing systems, of course they can [unclear], but they're not in UTF-8 encoding [unclear]. It became a real problem when XeTeX started to be integrated in TeX Live one and a half years ago; actually it started being integrated approximately two years ago. It was a problem because XeTeX expects UTF-8 encoding by default. So Jonathan devised a system to automatically convert the different hyphenation patterns: for each pattern file available, some xxhyph.tex, he created a file called xu-xxhyph.tex, "xu" for "XeTeX Unicode", I assume. Given the diversity of pattern files that we have, I think it is quite appropriate to call it "zoo-hyphen": we have a real zoo of different patterns, some of them really wild, which we didn't know what to do with. I'll show you more examples afterwards, when I show you what we've actually done. The TeX code is quite simple, because the patterns, apart from a few in legacy encodings, were all in some font encoding. In order to convert the different characters, for each character we need to convert we simply make it active, so it becomes a macro, and we make that macro output the UTF-8 byte sequence, usually a two-byte sequence, for the appropriate character in the font encoding the pattern file is in. This does indeed work for XeTeX [unclear], but I think it's not really the way we wanted to have things,
because nowadays we might want to have the master files in Unicode and have them converted automatically to an 8-bit encoding if needed. It doesn't seem really reasonable to rely on the whole diversity of font encodings. So what we
did, with Mojca [unclear], was to reverse the problem and address it the other way around. That means taking every hyphenation pattern file available and converting it to UTF-8, so that it is converted beforehand and kept as the main file. At the same time we wanted to get rid of the many complicated macros lying around, because in the actual pattern files we do not only have patterns, that is the \patterns command with the different patterns for that language; we also have a lot of support code, for example the catcodes and lccodes [unclear], which are an important part of the support code, and then other really complicated things that messed up the situation [unclear]. On occasion it was really not clear which characters were actually defined and which actual patterns [unclear] were in the file. Finally, we wished to adopt a better naming scheme for the languages at hand. Until now the pattern files were mostly named with a two-character or three-character code followed by "hyph" or "hyphen", which was more or less OK, but we actually had real difficulties. Take English, for example: it is the most basic example, and yet it's already a problem. The original file, hyphen.tex [unclear], accounted for American English, and then someone also devised a file for British English which was called ukhyph [unclear]. Unfortunately "uk" is also the ISO code for Ukrainian, and that's a problem because we also have Ukrainian patterns [unclear]. So we decided to find some set of language tags that could account for the diversity of languages we had, and we found out that actually the only ones [unclear] that we
could use were the IETF language tags, defined in Request for Comments 4646, which are quite precise. I'm afraid I don't have time to discuss all of that, but I'd like to show you the complete list of languages at some point [unclear]; if anyone wants, I can discuss afterwards the exact problems we were facing. And just to insist on it: the ISO language codes simply weren't enough, because English, for example, is a single language, and yet we had patterns for American English and British English. We could also have Irish English, or Australian or Indian for that matter, but we don't have those for the moment, and of course I'm not contemplating working on them at all [unclear]. So, the strategy.
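To illustrate the naming problem just described, here is a minimal Python sketch. The table and the helper `split_tag` are hypothetical names introduced only for this illustration, and the parsing handles just the simple "language" and "language-REGION" forms, not the full tag syntax of the RFC.

```python
# Hypothetical table: legacy pattern-name prefixes mapped to IETF language tags.
LEGACY_TO_TAG = {
    "us": "en-US",  # American English
    "uk": "en-GB",  # British English, even though "uk" is the ISO code for Ukrainian
}


def split_tag(tag):
    """Split a simple 'language' or 'language-REGION' tag (simplified on purpose)."""
    language, _, region = tag.partition("-")
    return language.lower(), (region.upper() or None)


# Both English variants share the language subtag and differ only in the region.
print(split_tag(LEGACY_TO_TAG["us"]))  # language "en", region "US"
print(split_tag(LEGACY_TO_TAG["uk"]))  # language "en", region "GB"
# The bare tag "uk" stays free for Ukrainian.
print(split_tag("uk"))
```

With tags of this shape, the British English and Ukrainian patterns no longer compete for the same two-letter prefix.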
The next two slides are a bit technical [unclear]; this part has been contributed mostly by [unclear]. What we want is a single file, the equivalent of what Jonathan named the xu file, which we called loadhyph: "load hyphenation", followed by the language code for the particular language [unclear]. We have to detect, to test, which kind of TeX engine is running, and we do it like this. The macro actually expects to see two characters before the delimiter [unclear], and it stores the second one as its second argument. The funny thing here is that this is not a Latin character [unclear]; if you look at the file, it is the UTF-8 encoding of a Greek [unclear], which is two bytes. So if you're running something like XeTeX or LuaTeX, which is natively UTF-8, it sees a single character, because its input is UTF-8 [unclear]. So #1 is that character, which we just discard, and #2 is actually empty. But if you're running pdfTeX or TeX3 or some 8-bit TeX, it actually sees two bytes, so #2 is not empty. What the file does is simply this: if the second argument is empty, we issue a message to the user and then do nothing else; we simply input directly the actual file with the raw patterns. Otherwise, if the second argument is not empty, it means that we're running an 8-bit engine, so we also output a message and we resort to inputting a file that will do the conversion from UTF-8 to the appropriate 8-bit encoding. Just to be precise: Mojca and I are mostly ConTeXt users, and "EC" [unclear] is the name used on the ConTeXt side for the T1 encoding. So I gave
the Cork encoding because most languages of Europe that are written in the Latin alphabet use the Cork encoding; of course, it was devised for that. [unclear] The other file is simply a pattern file which we converted to UTF-8. The problem is not converting the files to UTF-8; the problem is loading them into TeX [unclear]. If you have general questions, could you just ask them after the talk?
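The engine test described in the talk can be modelled outside TeX. The following Python sketch is only an analogy, not the TeX macro itself: the function names are hypothetical, and the probe character (Greek capital omega) is an assumption, since the transcript only says that a two-byte Greek character is used.

```python
def visible_units(source: bytes, native_utf8: bool) -> list:
    """Return the units an engine would read from `source`."""
    if native_utf8:
        # A UTF-8 engine decodes the input and sees whole characters.
        return list(source.decode("utf-8"))
    # An 8-bit engine sees one unit per byte.
    return [bytes([b]) for b in source]


def second_argument_is_empty(source: bytes, native_utf8: bool) -> bool:
    """Mimic a macro that grabs the first unit and keeps the rest as argument 2."""
    units = visible_units(source, native_utf8)
    _first, rest = units[0], units[1:]  # the first unit is discarded
    return len(rest) == 0


# Assumed probe: U+03A9, encoded in UTF-8 as the two bytes CE A9.
PROBE = "\u03a9".encode("utf-8")

for native in (True, False):
    if second_argument_is_empty(PROBE, native):
        print("UTF-8 engine: load the UTF-8 pattern file directly")
    else:
        print("8-bit engine: load the converter first, then the patterns")
```

The point of the trick is that no engine-specific primitive is queried: the same two bytes are simply read as one character or as two.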
The converter TeX file is rather fun if you know the internals of UTF-8 [unclear]. This is an extract of the actual conversion file, which of course is much longer; I just show how it works for the three characters that are used for Slovenian [unclear]. Those three characters happen to be encoded with two bytes in UTF-8. Here the first one, c with caron, is encoded with the hexadecimal bytes C4 8D [unclear]. So there are two bytes: we simply make C4 active, and it takes one argument. If the argument is 8D, it simply outputs the appropriate character code for c with caron in the T1 encoding, which happens to be A3 [unclear]; if it sees anything else, it issues an error to the user, well, to the person running the file. Likewise for the other two characters, whose UTF-8 encoding starts with C5. I mentioned lccodes before, because they are the other important part of the mechanism. TeX expects any character in the pattern file to be a letter, that is, to have catcode 11, and also to have an appropriate lccode, lowercase code, which must be nonzero. In most cases, I think in all cases, we have the patterns in lowercase form, so the lccodes are simply the characters themselves. For the three characters at hand we simply give a sequence of [unclear].
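The two-level dispatch described here (lead byte first, then the following byte) can be sketched in Python. This is an illustration under stated assumptions, not the TeX conversion file: the names are hypothetical, and the T1 slots for s with caron and z with caron (B2 and BA) are filled in from the T1 encoding table rather than from the talk, which only spells out the first character.

```python
# Lead byte -> {continuation byte -> T1 code position}.
T1_FROM_UTF8 = {
    0xC4: {0x8D: 0xA3},              # U+010D, c with caron
    0xC5: {0xA1: 0xB2, 0xBE: 0xBA},  # U+0161 s with caron, U+017E z with caron
}


def utf8_to_t1(data: bytes) -> bytes:
    """Convert UTF-8 bytes to T1 bytes for the few characters in the table."""
    out = bytearray()
    i = 0
    while i < len(data):
        byte = data[i]
        if byte < 0x80:
            # ASCII is passed through unchanged.
            out.append(byte)
            i += 1
            continue
        followers = T1_FROM_UTF8.get(byte)
        if followers is None or i + 1 >= len(data) or data[i + 1] not in followers:
            # Mirrors the error raised when an unexpected byte follows.
            raise ValueError(f"no T1 mapping for byte sequence at offset {i}")
        out.append(followers[data[i + 1]])
        i += 2
    return bytes(out)


print(utf8_to_t1("\u010d\u0161\u017e".encode("utf-8")).hex())
```

Each lead byte plays the role of the active character, and the inner dictionary plays the role of the test on its single argument.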
The actual problem, and I knew I wouldn't have time, is that I still have to mention the different problems. Actually, that may answer [unclear]'s question: it doesn't work that way at all, because [unclear] we had problems with a few languages, and we still have a few. For example, some languages could be handled in both the T1 and OT1 encodings [unclear], and the pattern files tried to accommodate this. This was some sort of hack that was introduced in the German pattern file: the sharp s has a different code position in T1 and OT1, so it was included twice. It isn't straightforward at all to reproduce this from a single master UTF-8 file, because it means that whenever we encounter a sharp s in the UTF-8 pattern file we should actually output two different patterns for the 8-bit TeX engines. The same happens for French, Danish [unclear] and Latin, because each of them has some special character that has a different code position in T1 and OT1. So for those we actually simply dropped the nice approach that I showed two slides before; we simply don't do that at all. If we're running a UTF-8 engine, we input the UTF-8 file, and if we're running an 8-bit TeX, we simply input the legacy file, the old one. We didn't want to touch this [unclear]. Actually, we're not convinced that it does indeed work. I mean, the old files tried to accommodate both T1 and OT1, and I'm not convinced at all that it really works very well for OT1, because with OT1 you simply don't have access to [unclear] the accented characters. Having the sharp s for German is of course fine in OT1, but by far it doesn't account for everything, and likewise for French. And then there's also Latin
[unclear], because in the spelling of Latin nowadays you have the [unclear] ligature, which can be represented in OT1 by a single character [unclear]. So, how much time do I have? [unclear] Sometimes, as with the Cyrillic community in Eastern Europe, [unclear] because Russian and Ukrainian can load really completely different pattern sets. The master files, ruhyphen and [unclear], are a real zoo of their own, because you can just set a macro which defines which pattern set you want to use, since different people contributed different patterns; we have half a dozen, I think, for Russian. You can also use different encodings, because unlike the European languages that use the Latin alphabet, for which the default 8-bit encoding is T1 [unclear], in the Cyrillic script we have different encodings, mainly T2A, T2B, T2C [unclear], and also an encoding called X2, I think. When we realised that, we wondered for quite a while what to do, and we simply decided not to try to emulate this [unclear], because it was at too short notice and it seemed just not wise to emulate the old behaviour. [unclear] And, as I already said, sometimes the patterns are used for several languages [unclear], and we just can't accommodate that with language codes. So for Greek, monotonic and polytonic Greek [unclear], we really couldn't try to accommodate that [unclear], so we simply split the patterns apart [unclear]. And sometimes [unclear] we really need to fix things in Babel
as well [unclear]. And then the result
[unclear] two months ago [unclear], and thanks to Karl [unclear] it has been imported into TeX Live, and of course we have [unclear] UTF-8, and we imported it to [unclear]. I'm contemplating whether I will speak about my actual project [unclear]; let's say no, and move on to the thanks, right at the
[unclear] the people who were interested in what I actually do [unclear], and I discussed it at length with many people here, so thanks to all those people [unclear]. The people in the fourth paragraph were actually really receptive to our initiative, and it was nice to see that it's still something very much alive. [unclear], for example, made the [unclear] patterns at that time, in 1990, in Cork [unclear], so he did that 18 years ago, and probably started 20 years ago, and he was actually extremely receptive. [unclear] [Question from the audience, inaudible.] Can you repeat the question? [unclear] Is that what you really want: just to tell the people that don't use UTF-8 to get lost? [unclear] we have never supported that [unclear]
[unclear] I think he would have more things to say to answer you [unclear]. [Discussion with the audience, largely inaudible.] I mean, that's not our responsibility [unclear]. Of course I find it sad that we don't have more people trying to use the modern [unclear], and it's sad for them, but do we have to [unclear] for them? [unclear] I don't agree [unclear]. That's absolutely right: actually, when you look at the actual files, most of them use some particular encoding and don't try to use different encodings; the patterns are encoded for one particular encoding. German and French, with T1 and OT1, are an exception in that respect, and Russian and Ukrainian also, as I mentioned, because they are really the only ones that tried to accommodate different encodings [unclear]; the patterns themselves are in some particular encoding. [unclear] I think that is absolutely a sane way to proceed, and I think we will probably move to that in the future. In the beginning it wasn't clear how messy the situation was [unclear].
this but the report analysis them you can see get the time of

Metadata

Formal metadata

Title Putting the cork back on the bottle
Subtitle Improving Unicode support in TeX extensions
Series title The annual conference of the TeX Users Group (TUG 2008)
Part 09
Number of parts 33
Author Miklavec, Mojca
Reutenauer, Arthur
License CC Attribution 3.0 Unported:
You may use and modify the work or its content for any legal purpose, and reproduce, distribute and make it publicly available in unchanged or modified form, provided that you credit the author/rights holder in the manner they have specified.
DOI 10.5446/30797
Publisher River Valley TV
Release year 2012
Language English
Production place Cork, Ireland

Content metadata

Subject area Computer science
Abstract In the TeX world, the name of Cork is associated with a standardization effort dating back to 1990, the Cork font encoding, which can be used for most European languages written in the Latin script. At about the same time, though, a much wider standardization effort was initiated, as the Unicode Consortium was created to devise a universal character set suitable for any language and writing system. Of course, it wasn't long before people felt the need to support Unicode in TeX-like systems. How far are we today? The latest extensions to the TeX engine are all labelled as "supporting Unicode", but upon closer inspection this proves rather imprecise: does it mean enabling UTF-8 input, handling multibyte characters, or implementing all the Unicode character properties and algorithms?
