Putting the cork back on the bottle
Formal metadata
Title | Putting the cork back on the bottle
Part | 9
Number of parts | 33
License | CC Attribution 3.0 Unported: You may use, adapt, copy, distribute and make the work or content publicly available in unchanged or adapted form for any legal purpose, provided the name of the author/rights holder is stated in the manner specified by them.
Identifiers | 10.5446/30797 (DOI)
Production place | Cork, Ireland
Transcript: English (automatically generated)
00:00
Apart from the obvious pun with the cork and the bottle, it has always struck me how the whole world of TeX encodings could really be described as a Pandora's box, or bottle, or amphora, or something. So it's really a pun on both accounts, for the city of Cork and for the general
00:24
situation of TeX encodings today. So I'll start, obviously, with the beginning. The genesis was that during this summer, Mojca and I, Mojca is a young student in Slovenia, in computer science, bioinformatics, something like this, and I'm doing something in between
00:46
as well, math and computer science. So we were sponsored by Google, which gives out a lot of money to students to commit open-source code; Mojca does something related to TeX documentation, and I made a
01:03
proposal related to making TeX more Unicode-compliant. It may sound a bit ambitious, and maybe also unclear, so I'll start with what it means exactly. A basic sanity check: what is Unicode? I suppose everyone has at least
01:23
heard the name and has some idea about what it is, so my informal definition is that it's a universal character set suitable for any writing system and any script. It's not the official definition, which I think you'll find in the thick book behind that cover,
01:41
some 1,500 pages. What is TeX? I suppose everyone knows that as well. And then the actual content of my research, of my project for this summer: what does it mean for a TeX-based system, for a TeX-related system, to be Unicode-compliant?
02:06
It's really not clear at all. So of course I had some idea before beginning, back in the spring, but then we started actually doing something related but not completely the same. So our first step was suggested by Mojca, actually, and related to hyphenation patterns,
02:30
so as you know, we have quite a lot of hyphenation patterns for many languages in TeX distributions, and none of them are in the UTF-8 encoding, actually, so they're quite ignorant
02:48
of Unicode in that respect. Which doesn't mean that they can't be used for many different writing systems, of course they can, but they're not in Unicode at all; they're not in the UTF-8 encoding, which is a stronger requirement, I mean, a more precise requirement actually. And this
03:07
was a real problem when XeTeX started to be integrated into TeX Live one and a half years ago; actually, it started being integrated approximately two years ago. And it was a problem because XeTeX expects the UTF-8 encoding by default, so Jonathan devised a system to automatically
03:24
convert the different hyphenation patterns. That is, for each pattern file available, some hyphen.tex file, he wrapped it up in a file with an xu- prefix, for XeTeX Unicode
03:42
I assume. And given the bizarre diversity of files that we have, I think it's quite appropriate that with xu- we had a real zoo of different patterns around, some of them really in the wild, which we didn't know what to do with.
04:05
So the TeX code, I'll show you more samples after that, when I show you what we've actually done, but the TeX code is quite simple, because the so-called legacy encodings were all some 8-bit font encoding. So for this, in order to convert the different
04:27
8-bit characters we needed to convert, we simply make them active, so each becomes a macro, and we make that macro output the UTF-8 byte sequence, usually a two-byte sequence, for the appropriate character in the font encoding the pattern file is in.
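A rough sketch of that mechanism (the byte values are real: the T1 slot of c with caron is "A3 and its Unicode code point is U+010D, but the code and the file name are illustrative, not the actual xu- wrapper):

```tex
% Sketch only, not the actual xu- wrapper code.
% In a T1-encoded pattern file, byte "A3 is the slot of c with caron.
% Make that byte active, so it becomes a macro...
\catcode"A3=\active
% ...and let it expand to the Unicode character U+010D, which a
% native UTF-8 engine treats as the two UTF-8 bytes C4 8D.
\def^^a3{^^^^010d}% ^^^^ is XeTeX's four-digit hex character notation
% Then read the legacy 8-bit pattern file (hypothetical name):
% \input somehyph.tex
```

The same kind of definition is repeated for every accented character the legacy encoding contains.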
04:48
So this is all very well, and it does indeed work for XeTeX, but I think it's not the real way we want to have things, because nowadays we might want to have the
05:02
master files in Unicode, and have them loaded and converted automatically to an 8-bit encoding if needed; nowadays it doesn't seem really reasonable to rely on a haphazard diversity of font encodings. So what we did with Mojca was
05:26
to reverse the problem, to address it the other way around, which means we are going to take every hyphenation pattern file available and convert it to UTF-8. That is,
05:41
we would convert it beforehand and have that as the main file, and at the same time we wanted to get rid of many complicated macros that were around, because in the actual pattern file we did not only have the patterns, the \patterns command with the different patterns
06:02
for that language; we also had a lot of support code. The \catcode and \lccode settings are a very important part of the support code, and then we also had other things, really complicated things, that really messed up the situation and made it really unclear; on occasion it was really not clear which characters were actually in the file,
06:25
I mean which actual patterns, which actual stream of characters, were meant to be in the file. And finally, we wished to adopt a cleaner naming scheme for the languages at hand. That is, up to now the patterns were mostly named with something like a two-character or
06:47
three-character code, followed by the word 'hyphen', which was remotely okay, but actually we had real difficulties, because English, for example, is
07:02
the most basic example, and yet it's already a problem, because of how the files were named: there was Knuth's original file, called hyphen.tex, which more or less accounted for American English, and then someone also devised a file for British English,
07:25
whose name used 'uk', but unfortunately uk is also the ISO code for Ukrainian, and that's a problem because we also have a Ukrainian file. So of course, as soon as you know it, it's really not a problem,
07:41
but it really messes things up much more, so we decided to find some set of language tags that could account for the diversity of languages we had, and we found that actually the only one we could use was the IETF language tags, that is, RFC 4646,
08:04
which is quite precise. I'm afraid I don't really have time to discuss the whole list, but I'll show you the complete list of languages at one point, so if anyone wants, I can discuss the exact problems we were facing. So just to insist on it,
08:27
the ISO language codes simply weren't enough, because English, for example, is a single language, and yet we had codes for American English and British English; we could have Irish English or Australian English for that matter, but we don't have them for the moment,
08:42
and of course I'm not intending to work on that at all. So, the strategy. The next two slides are a bit of TeX code; it's really finicky. It was contributed mostly by Jonathan and Taco, actually, and Mojca and I reworked it,
09:02
so what we want to have is a single file, the equivalent of what Jonathan named the zoo file, which I call here loadhyph-⟨language code⟩, for the particular language. So this is, so to say, the top-level file.
09:23
So we have to detect, to test, which kind of TeX engine we're running, and we do it like this: this macro actually expects to see two characters before the bang,
09:41
and it defines a second-argument macro as being its second argument. And the funny thing here is that it's not a Latin T, it's a Greek T. So if you look in the file, you see the UTF-8 encoding for a Greek T, which is two bytes,
10:03
so this means that if we're running something like XeTeX or LuaTeX, which are natively UTF-8, the engine sees a single character, because its input is UTF-8. So it sees a single character, so #1 is that character, and we just discard it, and #2 is actually empty.
10:27
But if we're running pdfTeX or TeX 3, or some e-TeX as well, for example, we actually see two characters, because we see two bytes, so #2 is not empty. So the test is simply this: if the second argument is empty, we simply output a message to the user,
10:49
and then we do nothing else; we simply input directly the actual file with the real patterns. And otherwise, if the second argument is not empty, it means that we're running an 8-bit TeX engine,
11:03
so we output another message, and we input a file that will do the conversion from UTF-8 to the appropriate 8-bit encoding in TeX. Just to be precise, Mojca and I are really mostly ConTeXt users,
11:22
and the 'ec' name is some ConTeXt idiosyncrasy for the T1 encoding, so, again, the Cork encoding, because most languages of Europe that are written in the Latin alphabet use the Cork encoding; of course, it was devised for that.
11:42
So that's it. So I will not show hyph-sl.tex; it's simply a pattern file, which we converted to UTF-8 yesterday.
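The engine test described above can be sketched as follows; this is a hedged reconstruction with illustrative macro names, not the actual hyph-utf8 loader:

```tex
% Sketch of the engine test; macro names are illustrative.
% The argument below contains a Greek capital T (U+03A4), stored
% in the file as the two UTF-8 bytes CE A4.
\def\testengine#1#2!{\def\secondarg{#2}}
\testengine Τ!
\ifx\secondarg\empty
  % #2 empty: a native UTF-8 engine (XeTeX, LuaTeX) saw one character
  \message{UTF-8 engine detected, loading the UTF-8 patterns directly}%
  % \input hyph-sl.tex
\else
  % #2 non-empty: an 8-bit engine (TeX 3, e-TeX, pdfTeX) saw two bytes
  \message{8-bit engine detected, loading the UTF-8 to T1 converter first}%
  % \input the converter file here, then the patterns
\fi
```

In an 8-bit engine the two bytes CE A4 arrive as two separate tokens, so #2 picks up the second byte and the \else branch runs.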
12:07
The problem is not converting the files to UTF-8; the problem is loading them into TeX, and that's what I'll discuss after that. If you have general questions, could you just ask them after the talk?
12:23
So I'll show the converter TeX file now. It's really a bit funny: if you know the internals of UTF-8, you should guess how it works. This is an extract of the actual converter file, because of course it's much longer.
12:43
So I just showed how it works for the three characters that are used for Slovenian. Those three characters happen to be encoded with two bytes in UTF-8, and the first one, the c with caron, is encoded with hexadecimal C4 and hexadecimal 8D.
13:09
So these are the two bytes. So we simply make C4 active, and it takes one argument, and if the argument is 8D,
13:24
it simply outputs the appropriate character code for c with caron in the T1 encoding, which happens to be A3. And I stripped it out here, but actually, if it sees anything else, it insults the user, the person running the file.
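A hedged sketch of that converter fragment for c with caron (the byte values C4 8D and the T1 slot "A3 are as discussed; the error message and surrounding code are illustrative):

```tex
% Sketch of the UTF-8 to T1 converter; not the actual file.
% Give the T1 slot of c with caron letter catcode and a non-zero
% lowercase code, as \patterns requires:
\catcode"A3=11 \lccode"A3="A3
% Make the UTF-8 lead byte C4 active, taking the continuation byte
% as its argument:
\catcode"C4=\active
\def^^c4#1{\ifx#1^^8d^^a3\else
  \errmessage{Unexpected byte after C4 in the pattern file}\fi}
% Likewise for lead byte C5, which starts the UTF-8 encodings of
% the other two Slovenian characters, s and z with caron.
```

Inside \patterns, the active C4 then collapses each two-byte UTF-8 sequence back to its single T1 byte.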
13:42
And likewise for the other two characters, whose UTF-8 encoding starts with C5. And then, I mentioned \lccode before, because it's a very important part of the pattern-loading mechanism. Actually, TeX expects any character in the pattern file to be a letter, that is, to have catcode 11,
14:12
and also to have an appropriate \lccode, a lowercase code, which actually must be non-zero. In most cases, I think in all cases actually, we have the patterns
14:26
in lowercase form, and so the \lccode of each character is simply the character itself. So for the three characters at hand, we simply set their \lccode accordingly. So, the actual problems; and I knew I wouldn't have time for this,
14:45
but I still have to mention the different problems. Actually, this may answer John's question: it doesn't work that way at all,
15:00
because for probably any language we will have problems; we had problems and we will still have them in the future. So for example, some set of languages could be handled in OT1, the old TeX encoding, the 7-bit thing, and the pattern files tried to accommodate this, tried to accommodate both.
15:23
So there was some sort of hack introduced in the German pattern files: the eszett, the sharp s, has a different code position in T1 and OT1, and it was encoded twice, and it isn't straightforward at all to reproduce this from a single master file in UTF-8,
15:43
because it means that whenever we encounter a sharp s in the UTF-8 pattern file, we should actually output two different patterns for 8-bit TeX engines. And the same happens for French, Danish and Latin, because each one of them has some special character
16:04
that has a different code position in T1 and in OT1. So for those, we simply dropped the nice approach that I showed two slides before; we simply don't do that at all. We simply say: if we are running a UTF-8 TeX engine,
16:24
of course we input the UTF-8 pattern files, and if we are running an 8-bit TeX engine, we simply input the other one, the old file. We didn't want to touch this at all, and actually we are not convinced at all that it does indeed work;
16:43
what I mean is that the old file, which tried to accommodate T1 and OT1, I'm not that convinced at all that it really works very well for OT1, because in OT1 you simply don't have accented characters; you use TeX macros for that, you use the \accent macro. So having the sharp s in German is of course fine in OT1,
17:04
but that by far doesn't account for everything, and likewise for French and also Danish. And actually Latin is fine, because in the modern spelling of Latin nowadays you have this œ ligature and this æ ligature,
17:20
which can be completely represented in OT1 with single characters. How much time do I have, Anita and Peter? Cool. So, the Cyrillic community in Europe and Eastern Europe is actually very active, because Russian and Ukrainian can load completely different pattern sets,
17:46
so the master files, the legacy files, are really well done, because you can set a macro which defines which pattern set you want to use,
18:02
because different people contributed different patterns; we have half a dozen, I think, for Russian. And you can also use different encodings, because unlike the European languages that use the Latin alphabet, for which the default encoding, the mainstream encoding, really is T1,
18:23
unlike this, in the Cyrillic script we have different encodings, namely T2A, T2B, T2C, and also an encoding called X2, and things like that. So when we realized that, we weren't quite sure what to do, and we simply decided not to try to emulate this for inclusion in TeX Live 2008,
18:47
because it seemed too short notice, and it seemed just not wise to try to emulate this old behavior. And sometimes, it has to be mentioned of course, and I already said it actually, Unicode is inherently bad at representing some languages,
19:05
and we just can't accommodate Unicode, so for Greek, I mean ancient or polytonic Greek, because these are the problem languages, not monotonic Greek with its single accent sign.
19:22
Here we really couldn't try to accommodate Unicode's faults, and TeX's problems as well, so we simply split the patterns apart. And sometimes, and I already warned Johannes that I had to discuss this with him,
19:41
sometimes we really need to fix things in Babel as well. So bear with me, Johannes. Probably that was not really a pattern problem, not at all actually. And then, the result: we started this approximately two months ago,
20:01
it was really Mojca's idea, and it was all driven by her energy, and now, thanks to Karl Berry, it has been imported into TeX Live. So of course we uploaded it to CTAN, the name is hyph-utf8, and we imported it into TeX Live.
20:21
So I was contemplating whether I would speak about my actual project for Google Summer of Code; let's say no, and let's skip to the thanks. People who are interested in what I actually do can always talk with me, and I've already discussed it at length with many people here.
20:42
So thanks to all those good people: to Karl Berry first and foremost, before all, and to Jonathan, to Taco and Hans, and all those people. The people in the fourth paragraph were actually really receptive to our initiative, and it has been really nice to see that
21:03
it's still something very alive. Dejan Muhamedagić, for example, who contributed the Serbian patterns, actually at that time, in 1990, he called them Serbo-Croatian, so he did that 18 years ago, and probably he started 20 years ago,
21:20
and actually he was extremely receptive and said, yes, I have to fix things. And all those other people did too, of course: Werner Lemberg for German, and Vladimir Volovich helped us a lot for Russian too, etc., etc. So, thank you very much.
21:45
Can you repeat the question, then? Is that what you really want, to just tell the people that don't use XeTeX to get lost?
22:03
Is that a reasonable default? I mean, Karl would never have supported that at all. For example, I think he will have more things
22:26
to answer you during his own talk.
22:43
We're not patching anything about Unicode. Sorry, this is just completely wrong. I mean, I really don't care for...
23:04
Sorry? Yes, you said we should just drop the 8-bit TeX engines and force everyone to use XeTeX. Am I correct? Is that what you're implying?
23:26
OK, this is ridiculous. Can I speak now? I mean, that's not our responsibility to do. Of course, I find it sad that we don't have more people
23:42
trying to use modern TeX engines, and I really mean XeTeX and LuaTeX here. It's a bit sad for them, but do we have to force that upon them? Does it make sense? I don't agree, and Karl wouldn't agree.
24:02
OK, go and get him. Sure. I wish he had considered just
24:27
generating the 8-bit files out of the Unicode ones externally. I mean, the problem with the old TeX environment is that there was a bug in the way it was designed: it made the hyphenation
24:41
depend on the current font, rather than on the language. I mean, he called it language, but he was actually talking about what the font encoding was. So effectively, you need one for every encoding you use. Absolutely. Actually, when you look carefully at it, you realize that
25:02
some particular language really doesn't use different font encodings; it really uses one. The patterns are encoded for one particular encoding, and German and French, with T1 and OT1, are an exception in that respect. And Russian and Ukrainian also are, as I mentioned, because they are really the only ones
25:20
that try to accommodate different encodings, while the patterns themselves are in some particular encoding. Yeah? Why is it different?
25:41
German T1, German T2, German LY1... I mean, there are a lot more encodings in view, and if you just do the conversion to T1, I think you don't have to do all the macro encodings. Absolutely. It seems a sensible way to proceed, and I think we'll probably move to that
26:00
in the future. But in the beginning, it wasn't clear how messy the situation was. I don't understand... Oh, I just... Yeah. No, no, no. Those are the 49 languages.