Die Hard 1.1024.0: backward compatibility of a search engine with persistent IDs
Formal metadata
Number of parts: 60
License: CC Attribution 3.0 Germany: You may use, change, and reproduce, distribute, and make the work or content publicly available in unchanged or changed form for any legal purpose, as long as you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/42501 (DOI)
Transcript: English (automatically generated)
00:00
So we are very interested in sustainable software and in researching sustainable software. It's basically two persons: me, Thomas Krause, a computer scientist who kind of ended up in linguistics, and Stephan Druskat, who's sitting over there, who originally studied English and is now a software developer and computer scientist. And of course, both of
00:24
us are per definition research software engineers. More importantly, the case study will be about a software called ANNIS and its query language, so I want to give a brief introduction to what this actually is. Cited from the website,
00:42
more or less: it is a web browser-based search and visualization architecture for linguistic corpora with diverse types of annotations, and it's part of corpus-tools.org, a collection of tools for linguists. So what do we mean by annotations? I know that annotations are very different things for a lot of people. Everyone has their own
01:02
definition of annotations. In this context, for this software, it's structured information that is added to a text (linguists deal with language, so we deal with text), and it's represented internally as a graph with labels. It's used by expert users, and in this
01:21
case the expert users are normally linguists. They try to find and analyze linguistic phenomena. Finding linguistic phenomena means dealing with different corpora, so different texts and different annotations, and they don't want to query just one fixed thing; they want to be able to query everything. So they want to have
01:43
a query language where they can find annotations, but also various combinations of annotations. So they search for these node labels and then join them with operators, which basically constrain the relations between the nodes in the graph.
02:03
But these operators are linguistically defined, and for them it's much easier to use this ANNIS Query Language (AQL) than to use SQL, for example, because it's much more optimized for: okay, I want to find structures in linguistic phenomena.
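As an illustration of such a query (AQL syntax reconstructed from memory; the exact operator inventory may differ between ANNIS versions), searching for a token annotated as an adjective directly preceding a token annotated as a noun might look like:

```
pos="ADJ" . pos="NN"
```

Each term matches nodes by an annotation label, and the `.` operator constrains the first match to directly precede the second, i.e. it expresses a linguistically motivated relation between nodes in the annotation graph.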
02:22
Also, some more background: in the title I made an inside joke about semantic versioning, and because I can't assume everyone knows what it is, I want to give a short introduction. If I refer to semantic versioning, I mean the term as popularized by semver.org. It's an explicit statement about compatibility between versions of an API.
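Concretely, the semver.org scheme is MAJOR.MINOR.PATCH. A minimal sketch of the bump rules (illustrative only, not the semver.org reference implementation):

```python
def bump(version: str, change: str) -> str:
    """Bump a MAJOR.MINOR.PATCH version string.

    change: "breaking" -> major bump (removed or redefined API)
            "feature"  -> minor bump (backward-compatible addition)
            "fix"      -> patch bump (bug fix, API unchanged)
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

Note that minor and major bumps reset the lower components, since the new release starts a fresh compatibility line.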
02:43
You have the first part, which is the major version, then comes the minor version, and then the patch version. If you only do some bug fixing and don't change the API at all, you just increase the patch version. If you add things that are still backward compatible, you increase the minor version, and if you remove functionality or redefine semantics in a non-backward-compatible way,
03:04
you have to increase the major version. This is really nice, because you can make backward compatibility explicit, but it also has problems. The first problem is: what actually is part of the API? What is part of your contract with the user? Especially in
03:21
software like ANNIS, which has a lot of separate components that work together: is it the REST API? Is it the query language? Because the query language is also kind of part of the API. There are also data exchange formats we need to support. And imagine, for example, you have something in the REST API,
03:44
which is then combined with other things in the user interface. If the functionality in the REST API is still there, but you remove it from the user interface, is that backward compatible? I don't know; maybe it was useful to remove that functionality. So semantic versioning doesn't help you define what your interfaces are. And also what I see is a kind of
04:05
problem: because a lot of people don't want to be backward compatible, or don't want to keep stuff forever, as we've also seen in the talks before, is there maybe something like an anxiety about releasing a 1.0 version? Like: okay, this is definitely my API, and I stick to it. Or:
04:22
do I want to be able to change stuff in every release? Another part of the title is persistent identifiers. Imagine I say something like "the ANNIS software": what do I actually mean by this? Do I mean the home page? Do I mean the GitHub repo,
04:44
or the source code in it? Maybe I mean a specific fork, a specific version. Persistent identifiers, in this case specifically a Digital Object Identifier (DOI), allow me to identify the source code, or the digital object, long term. So I can say:
05:03
we have also stored the source code in Zenodo, which gives us a DOI, and by referring to this number, which is for the latest version, I can say: I really mean this source code, I mean this version. In general, a PID is about resolving an identifier to a resource;
05:21
the resource can be digital or not (every ISBN is in principle a PID), and it should never, ever change. The PID doesn't change: you can print it in a book, and it stays there. Of course, several systems exist: DOIs, handle.net, whatever. Still, there are open questions here, like: if a digital resource moves, like a website, who actually updates the reference?
05:46
This really can be a huge problem. If it's hosted at Zenodo, well, we hope it never moves, but if it's hosted elsewhere, it could be a problem. And also, of course, who provides and funds the infrastructure? Is this really long term in the sense that printing a book is long term?
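As a toy illustration of that property (purely hypothetical code, not how DOI or handle.net resolvers are actually implemented): the identifier itself is immutable and can safely be printed in a book, while the location it resolves to may be updated when the resource moves:

```python
class PidRegistry:
    """Maps immutable identifiers to (mutable) resource locations."""

    def __init__(self):
        self._targets = {}

    def register(self, pid: str, url: str) -> None:
        if pid in self._targets:
            # PIDs are permanent; a changed object needs a new identifier.
            raise ValueError(f"PID already registered: {pid}")
        self._targets[pid] = url

    def move(self, pid: str, new_url: str) -> None:
        # The resource may move, but the PID printed in a book stays valid.
        if pid not in self._targets:
            raise KeyError(pid)
        self._targets[pid] = new_url

    def resolve(self, pid: str) -> str:
        return self._targets[pid]
```

The open question from the talk is exactly who runs `move()` reliably for decades.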
06:03
Now I want to get more specific about how we try to achieve some backward compatibility in ANNIS 4. ANNIS allows these nice reference links, which are kind of short links to query results and single matches (short is relative), so you can say: I mean this result, I mean this query with these results. Internally it's a glorified
06:26
URL shortener: it's just a table that expands this kind of short URL to a longer URL encoding the match and the actual query parameters, like the query itself, the context size, the maximum number of matches and so on. The query is executed each time the
06:41
link is opened. The results are not stored internally, simply because there can be millions of them. People can easily generate a query with millions of matches and maybe never even use it again; should I really store all of that forever? Storing the queries is cheap, and the corpora, the database, are always there.
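A minimal sketch of such a reference link table (all names hypothetical): only the query parameters are stored under a short ID, and expanding the link rebuilds the full search URL, so the query is re-executed on every visit instead of materializing results:

```python
import secrets
from urllib.parse import urlencode


class ReferenceLinks:
    """Glorified URL shortener: short ID -> stored query parameters."""

    def __init__(self):
        self._table = {}

    def create(self, query: str, context: int, limit: int) -> str:
        short_id = secrets.token_urlsafe(6)
        self._table[short_id] = {"q": query, "ctx": context, "limit": limit}
        return short_id

    def expand(self, short_id: str) -> str:
        # Only the parameters are stored; results are recomputed on access,
        # because storing millions of matches forever would be too expensive.
        return "/search?" + urlencode(self._table[short_id])
```

The price of this design is the topic of the talk: the stored query must still mean the same thing in every future engine version.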
07:02
The problem: backward compatibility. The downside of this approach is that we have ANNIS 3, it's nice software, a lot of people use it, and ANNIS 3 is built in a way that the AQL queries are translated to SQL queries and then executed by PostgreSQL.
07:21
Easy enough. ANNIS 4 is a new implementation I did for research reasons, and practical reasons as well, but mainly research. It is a custom in-memory graph database search engine written in Rust.
07:42
It is different because it directly interprets AQL on the data; it doesn't need an intermediate translation. The thing is: we never really defined it. We said these are reference URLs, we never called them persistent, but still we treat them like persistent URLs. So we want all reference links to still
08:04
work, because these query results are part of the research process. People use them to argue about their findings, so they can't just vanish; it would be a problem for everyone using them. And people are literally printing these links in books. That's why I
08:22
made this example: I can't even know where a link is used. Even with a Google search for an identifier, I can't know; if someone printed it in a book, it's not indexed anywhere, so I wouldn't find it. One solution would be: okay, we just keep the old software running
08:42
forever, in parallel to the new one. This is a valid option, but it has all the usual problems: do we need virtualization? How do you fix bugs? For example, if there is a security issue, can you fix it easily in this old software? All the problems you have if you want to keep software running forever. The second option:
09:03
we say it's ANNIS 4, so in principle it can be backward incompatible, but can we still make sure that every query that has been referenced produces the same result in ANNIS 4 as it does in ANNIS 3? We chose this approach, and I want to present it. Basically it's very easy. If you want to make sure of this,
09:22
then, as long as you have both systems, you have to execute each query that has been referenced. We are storing them on the server (it's a URL shortener), so we execute each one in version 3, execute it in version 4, and compare the results. This is where the interesting stuff happens. If successful, we create the links for the new
09:42
installation. So it's kind of an automatic approach to solving: what is my digital resource? Where do I store it? How do I migrate it? Some things about that. Actually, being incompatible is a feature for us, because we did
10:02
remove functions that we really didn't like anymore. We also removed some bugs, because some of the bugs were, let's say, semantically unsound assumptions. You want to be able to remove these kinds of things. But the problem is that backward compatibility means you still have to support all features and even replicate these kinds of bugs. What do we do?
10:26
You need a quirks mode to emulate the old behavior. And this is not a new idea: Internet Explorer has done this for ages, because browsers have the same problem. They have old HTML versions and they still want to render them forever. But there are actually also newer approaches,
10:44
like in the Rust programming language, where they have these so-called editions, where you opt into breaking changes and they promise to keep the old edition working forever. What they say, or what some of the core team members say, is: we can't get rid of this, because we have a commitment not to break users' code. There will not be a Rust 2.0 version.
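The migration approach described earlier, combined with such a quirks mode, might be sketched like this (all function names hypothetical; the real ANNIS 4 code is Rust, this is only an outline): each referenced query is replayed in the old engine and in the new engine with legacy emulation enabled, and only queries with identical results get a link in the new installation:

```python
def migrate_links(references, run_v3, run_v4, create_new_link, report_failure):
    """Replay every referenced query in both engine versions and compare.

    references: iterable of (short_id, query) pairs from the shortener table
    run_v3(query): executes the query in the legacy engine
    run_v4(query, quirks): executes it in the new engine; quirks=True keeps
        the legacy semantics (the "quirks mode") in a separate code path
    """
    migrated, failed = [], []
    for short_id, query in references:
        old_result = run_v3(query)
        new_result = run_v4(query, quirks=True)
        if old_result == new_result:
            create_new_link(short_id, query)
            migrated.append(short_id)
        else:
            # Be transparent: this link is known not to work identically.
            report_failure(short_id, old_result, new_result)
            failed.append(short_id)
    return migrated, failed
```

The failure report is what later feeds the "we are really sorry, this link is known not to work" message and the administrator's feature-usage overview.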
11:03
That's a pretty bold statement. So, I want to present some selected problems that happened. One thing is the semantics. In an ideal world, a query language would be formally defined, like Datalog. You have predicate logic,
11:21
everything is sound. All you would need to restore the query result, the digital object, would be to have the data and execute the query again. In reality, we have SQL. And I want to argue that SQL is bad at backward compatibility. There have been standardized versions of SQL (SQL-92 and so on), but there are, of course, also various
11:43
implementations like MySQL, Postgres, and so on. All of these support different versions of the standard or have a lot of vendor extensions. AQL, for example, has only two implementations, ANNIS 3 and ANNIS 4. But the first version inherited a lot of semantics from SQL
12:00
and the specific implementation, PostgreSQL. So the problem of changing the AQL implementation is really similar to, for example, migrating to another database; you have very similar problems. One thing I already said is that we want to remove things. We want to remove features. For example, some of the features are just
12:22
not orthogonal to each other, and we want to remove them. The nice thing about the quirks mode is that you can move all that implementation to a separate code path. So you have the nice, sound stuff here, and the ugly stuff there. The other good thing about this approach, by checking these things, is that you can make it transparent. If you
12:45
know that a query feature was never used, you don't have to support it. That's kind of like telemetry: I know it's not used, so I just don't implement it anymore. If it's too hard to implement and people still use it, you can still make that transparent. You can show the user a message like: we are really, really sorry, but this link is known
13:04
not to work. You can try it anyway, but we know it's different. And here's a bug tracker URL and an email address; please write to us if you really, really want to have this feature. Another problem is that identifiers are not really stable. We use URIs as internal identifiers, and the matches are defined by
13:26
these URIs. A match for us is only the same if all the URIs are the same. But people use weird names, stuff like spaces, slashes, umlauts, and they will escape things.
13:41
They will double-escape things. They will triple-escape things. Everything will happen. And they will use Unicode, and Unicode is really, really rich. I can only recommend: if you own software, try a list of interesting strings on it and see if it crashes. So importing data via IDs and comparing them is really hard. Also, regular expressions are awful. They are an
14:05
important part of AQL, because people search for text and for patterns in text. Even how you invoke a regular expression is not standardized: you have PostgreSQL with its tilde operator, you have REGEXP in MySQL, and you have different syntaxes and different regex engines.
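As a concrete example (illustrative; exact behavior depends on the engine): a pattern with a backreference works in engines like Python's `re` or PCRE, but POSIX EREs and linear-time engines such as RE2 or the Rust `regex` crate reject it, because backreferences are not regular in the formal sense:

```python
import re

# A backreference pattern: match a doubled word such as "very very".
# Group 1 captures a word, and \1 requires the same word to repeat.
pattern = r"\b(\w+) \1\b"

doubled = re.search(pattern, "this is very very important")
print(doubled.group(0))   # "very very"

# Engines without backreference support (POSIX ERE, RE2, Rust's regex
# crate) reject this pattern outright, so the same stored query string
# can succeed in one engine and fail in another.
```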
14:21
And even if they say they're POSIX compatible, that doesn't mean a thing. They will have different features: backreferences, for example, are not actually regular expressions anymore, but some engines still support them. Some leave them out for performance reasons. And power users will always find out that this kind of stuff exists, even if you never document it. We never officially said we support backreferences.
14:44
People still use them, because the SQL implementation we used allows them. That's a problem. Another thing is string ordering. For query results, the order of results is important, because we want to refer to the second one, this match, that match. So we have to check the
15:04
order as well. Who of you can tell me what the result of this query is? Who says true? Okay. Who says false? Okay. The thing is, none of you are right.
15:23
It depends. It depends on your locale, because the standard never actually says what this means. You can have a system with a German, an English, or a plain C locale, and depending on what your server is running, you get different results.
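A small demonstration of the same effect in Python (a simplified stand-in; real locale-aware comparison would use something like `locale.strcoll` or an ICU collator): the same list of strings sorts differently depending on which comparison rule happens to be in effect:

```python
words = ["Zebra", "apple", "Ärger"]

# Codepoint order (roughly the "C" locale): uppercase letters sort
# before lowercase, and non-ASCII letters sort after all ASCII.
c_order = sorted(words)            # ['Zebra', 'apple', 'Ärger']

# Case-insensitive comparison is a crude stand-in for a natural-language
# collation; note the umlaut still sorts last, while a real German
# collation (de_DE in PostgreSQL) would put 'Ärger' right after 'apple'.
folded_order = sorted(words, key=str.casefold)   # ['apple', 'Zebra', 'Ärger']
```

If the order of matches is part of the persistent reference, as in ANNIS, this is exactly the kind of difference a migration check has to catch.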
15:41
In Postgres you can say: please use the German collation. But honestly, I don't think anyone ever did this before they had the problem in production. So you are kind of bound to the specific server running the software, which is really bad. Think about this before it happens. Okay, a quick conclusion. ANNIS 4 is for us currently running
16:02
in public beta. The users created about 13,000 reference links on the old server, so it's quite a lot. And all but about 140 queries work, so it's a really good result: only about 1% are failing. There are of course remaining issues, like unsupported regular expression features,
16:23
some edge cases we might never support, and actual bugs. The nice thing about this approach is that the reference links also improved our software, because they are a huge test case. You get this automatic migration for persistent IDs, which is not usual. And you get a lot of transparency for the administrator, who knows: okay, on my instance,
16:46
people use these query language features; does everything work or not? And also for the end user: if something didn't work, they know there's an actual problem. We hope this will enable us to retire the old version and run only the new version on our
17:01
server and not have to manage both of them. So thank you very much. Okay. Thank you very much for the talk. And are there any questions? Oh, there's one. Thank you for your
17:23
presentation. I really liked the idea of testing the query results across different versions, but what I ask myself now is, I guess this is an open source project, right? So basically if I wanted to contribute, do I have access to the complete list of reference
17:46
queries and their results? In case I want to contribute, to actually test whether my contribution is correct. And do you intend to maybe do that in the future? I think this is a good idea. All this data is anonymized; there are no IP addresses
18:01
logged. So this could be put online without a problem. I didn't do it yet because of time constraints, and also because executing this on the server takes about two days of comparison. And you actually need access to the data: the problem is that you would need access to all the corpora, and some of the corpora are not open access.
18:20
But I already published a large data set of about 3,000 queries that I used for my PhD thesis, which was collected similarly, and that also helped a lot. But yes, I should look into which of these corpora are open access and make a separate data set for testing. You can't execute this in CI, though; it would take too long.