Die Hard 1.1024.0: backward compatibility of a search engine with persistent IDs
Formal metadata
Number of parts: 60
License: CC Attribution 3.0 Germany: You may use, change, and reproduce, distribute, and make the work or content publicly available in unchanged or changed form for any legal purpose, as long as you credit the author/rights holder in the manner they specify.
Identifiers: 10.5446/42501 (DOI)
Transcript: English (automatically generated)
00:00
So we are very interested in sustainable software and in researching sustainable software. It's basically two persons: me, Thomas Krause, a computer scientist who kind of ended up in linguistics, and Stephan Druskat, who's sitting over there, who originally studied English and is now a software developer and computer scientist. And of course, both of
00:24
us are per definition research software engineers. More importantly, the case study will be about a software called ANNIS and its query language, so I want to give a brief introduction to what this actually is. Cited from the website,
00:42
more or less: it is a web browser-based search and visualization architecture for linguistic corpora with diverse types of annotations, and it's part of corpus-tools.org, a collection of tools for linguists. So what do we mean by annotations? I know that annotations are very different things for a lot of people. Everyone has their own
01:02
definition of annotations. In this context, for this software, it's structured information that is added to a text (linguists deal with language, so we deal with text), and it's represented internally as a graph with labels. It's used by expert users, and in this
01:21
case the expert users are normally linguists. They try to find and analyze linguistic phenomena. Finding linguistic phenomena means dealing with different corpora, so different texts and different annotations, and they don't want to query just one fixed thing; they want to be able to query everything. So they want to have
01:43
a query language where they can find annotations, but also various combinations of annotations. So they search for these node labels and then join them with operators, which basically constrain the relations between the nodes in the graph.
02:03
But these operators are linguistically defined, and for them it's much easier to use this ANNIS Query Language (AQL) than to use SQL, for example, because it's much more optimized for: okay, I want to find structures in linguistic phenomena.
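As an illustration of such a query (AQL syntax reconstructed from memory; the exact operator inventory may differ between ANNIS versions), searching for a token annotated as an adjective directly preceding a token annotated as a noun might look like:

```
pos="ADJ" . pos="NN"
```

Each term matches nodes by an annotation label, and the `.` operator constrains the first match to directly precede the second, i.e. it expresses a linguistically motivated relation between nodes in the annotation graph.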
02:22
Also, some more background: in the title I made an inside joke about semantic versioning, and because I can't assume everyone knows what it is, I want to give a short introduction. If I refer to semantic versioning, I mean the term as popularized by semver.org. It's an explicit statement about compatibility between versions of an API.
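Concretely, the semver.org scheme is MAJOR.MINOR.PATCH. A minimal sketch of the bump rules (illustrative only, not the semver.org reference implementation):

```python
def bump(version: str, change: str) -> str:
    """Bump a MAJOR.MINOR.PATCH version string.

    change: "breaking" -> major bump (removed or redefined API)
            "feature"  -> minor bump (backward-compatible addition)
            "fix"      -> patch bump (bug fix, API unchanged)
    """
    major, minor, patch = (int(part) for part in version.split("."))
    if change == "breaking":
        return f"{major + 1}.0.0"
    if change == "feature":
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```

Note that minor and major bumps reset the lower components, since the new release starts a fresh compatibility line.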
02:43
You have the first part, which is the major version, then comes the minor version, and then the patch version. If you only do some bug fixing and don't change the API at all, you just increase the patch version. If you add things that are still backward compatible, you increase the minor version, and if you remove functionality or redefine semantics in a non-backward-compatible way,
03:04
you have to increase the major version. This is really nice, because you can make backward compatibility explicit, but it also has problems. The first problem is: what actually is part of the API? What is part of your contract with the user? Especially in
03:21
software like ANNIS, which has a lot of separate components that work together: is it the REST API? Is it the query language? Because the query language is also kind of part of the API. There are also data exchange formats we need to support. And imagine, for example, you have something in the REST API,
03:44
which is then combined with other things in the user interface. If the functionality in the REST API is still there, but you remove it from the user interface, is that backward compatible? I don't know; maybe it was useful to remove that functionality. So semantic versioning doesn't help you define what your interfaces are. And also what I see is a kind of
04:05
problem: because a lot of people don't want to be backward compatible, or don't want to keep stuff forever, as we've also seen in the talks before, is there maybe something like an anxiety about releasing a 1.0 version? Like: okay, this is definitely my API, and I stick to it. Or:
04:22
do I want to be able to change stuff in every release? Another part of the title is persistent identifiers. Imagine I say something like "the ANNIS software": what do I actually mean by this? Do I mean the home page? Do I mean the GitHub repo,
04:44
or the source code in it? Maybe I mean a specific fork, a specific version. Persistent identifiers, in this case specifically a Digital Object Identifier (DOI), allow me to identify the source code, or the digital object, long term. So I can say:
05:03
we have also stored the source code in Zenodo, which gives us a DOI, and by referring to this number, which is for the latest version, I can say: I really mean this source code, I mean this version. In general, a PID is about resolving an identifier to a resource;
05:21
the resource can be digital or not (every ISBN is in principle a PID), and it should never, ever change. The PID doesn't change: you can print it in a book, and it stays there. Of course, several systems exist: DOIs, handle.net, whatever. Still, there are open questions here, like: if a digital resource moves, like a website, who actually updates the reference?
05:46
This really can be a huge problem. If it's hosted at Zenodo, well, we hope it never moves, but if it's hosted elsewhere, it could be a problem. And also, of course, who provides and funds the infrastructure? Is this really long term in the sense that printing a book is long term?
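As a toy illustration of that property (purely hypothetical code, not how DOI or handle.net resolvers are actually implemented): the identifier itself is immutable and can safely be printed in a book, while the location it resolves to may be updated when the resource moves:

```python
class PidRegistry:
    """Maps immutable identifiers to (mutable) resource locations."""

    def __init__(self):
        self._targets = {}

    def register(self, pid: str, url: str) -> None:
        if pid in self._targets:
            # PIDs are permanent; a changed object needs a new identifier.
            raise ValueError(f"PID already registered: {pid}")
        self._targets[pid] = url

    def move(self, pid: str, new_url: str) -> None:
        # The resource may move, but the PID printed in a book stays valid.
        if pid not in self._targets:
            raise KeyError(pid)
        self._targets[pid] = new_url

    def resolve(self, pid: str) -> str:
        return self._targets[pid]
```

The open question from the talk is exactly who runs `move()` reliably for decades.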
06:03
Now I want to get more specific about how we try to achieve some backward compatibility in ANNIS 4. ANNIS allows these nice reference links, which are kind of short links to query results and single matches (short is relative), so you can say: I mean this result, I mean this query with these results. Internally it's a glorified
06:26
URL shortener: it's just a table that expands this kind of short URL to a longer URL encoding the match and the actual query parameters, like the query itself, the context size, the maximum number of matches and so on. The query is executed each time the
06:41
link is opened. The results are not stored internally, simply because there can be millions of them. People can easily generate a query with millions of matches and maybe never even use it again; should I really store all of that forever? Storing the queries is cheap, and the corpora, the database, are always there.
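A minimal sketch of such a reference link table (all names hypothetical): only the query parameters are stored under a short ID, and expanding the link rebuilds the full search URL, so the query is re-executed on every visit instead of materializing results:

```python
import secrets
from urllib.parse import urlencode


class ReferenceLinks:
    """Glorified URL shortener: short ID -> stored query parameters."""

    def __init__(self):
        self._table = {}

    def create(self, query: str, context: int, limit: int) -> str:
        short_id = secrets.token_urlsafe(6)
        self._table[short_id] = {"q": query, "ctx": context, "limit": limit}
        return short_id

    def expand(self, short_id: str) -> str:
        # Only the parameters are stored; results are recomputed on access,
        # because storing millions of matches forever would be too expensive.
        return "/search?" + urlencode(self._table[short_id])
```

The price of this design is the topic of the talk: the stored query must still mean the same thing in every future engine version.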
07:02
The problem: backward compatibility. The downside of this approach is that we have ANNIS 3, it's nice software, a lot of people use it, and ANNIS 3 is built in a way that the AQL queries are translated to SQL queries and then executed by PostgreSQL.
07:21
Easy enough. ANNIS 4 is a new implementation I did for research reasons, and practical reasons as well, but mainly research. It is a custom in-memory graph database search engine written in Rust.
07:42
It is different because it directly interprets AQL on the data; it doesn't need an intermediate translation. The thing is: we never really defined it. We said these are reference URLs, we never called them persistent, but still we treat them like persistent URLs. So we want all reference links to still
08:04
work, because these query results are part of the research process. People use them to argue about their findings, so they can't just vanish; it would be a problem for everyone using them. And people are literally printing these links in books. That's why I
08:22
made this example: I can't even know where a link is used. Even with a Google search for an identifier, I can't know; if someone printed it in a book, it's not indexed anywhere, so I wouldn't find it. One solution would be: okay, we just keep the old software running
08:42
forever, in parallel to the new one. This is a valid option, but it has all the usual problems: do we need virtualization? How do you fix bugs? For example, if there is a security issue, can you fix it easily in this old software? All the problems you have if you want to keep software running forever. The second option:
09:03
we say it's ANNIS 4, so in principle it can be backward incompatible, but can we still make sure that every query that has been referenced produces the same result in ANNIS 4 as it does in ANNIS 3? We chose this approach, and I want to present it. Basically it's very easy. If you want to make sure of this,
09:22
then, as long as you have both systems, you have to execute each query that has been referenced. We are storing them on the server (it's a URL shortener), so we execute each one in version 3, execute it in version 4, and compare the results. This is where the interesting stuff happens. If successful, we create the links for the new
09:42
installation. So it's kind of an automatic approach to solving: what is my digital resource? Where do I store it? How do I migrate it? Some things about that. Actually, being incompatible is a feature for us, because we did
10:02
remove functions that we really didn't like anymore. We also removed some bugs, because some of the bugs were, let's say, semantically unsound assumptions. You want to be able to remove these kinds of things. But the problem is that backward compatibility means you still have to support all features and even replicate these kinds of bugs. What do we do?
10:26
You need a quirks mode to emulate the old behavior. And this is not a new idea: Internet Explorer has done this for ages, because browsers have the same problem. They have old HTML versions and they still want to render them forever. But there are actually also newer approaches,
10:44
like in the Rust programming language, where they have these so-called editions, where you opt into breaking changes and they promise to keep the old edition working forever. What they say, or what some of the core team members say, is: we can't get rid of this, because we have a commitment not to break users' code. There will not be a Rust 2.0 version.
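The migration approach described earlier, combined with such a quirks mode, might be sketched like this (all function names hypothetical; the real ANNIS 4 code is Rust, this is only an outline): each referenced query is replayed in the old engine and in the new engine with legacy emulation enabled, and only queries with identical results get a link in the new installation:

```python
def migrate_links(references, run_v3, run_v4, create_new_link, report_failure):
    """Replay every referenced query in both engine versions and compare.

    references: iterable of (short_id, query) pairs from the shortener table
    run_v3(query): executes the query in the legacy engine
    run_v4(query, quirks): executes it in the new engine; quirks=True keeps
        the legacy semantics (the "quirks mode") in a separate code path
    """
    migrated, failed = [], []
    for short_id, query in references:
        old_result = run_v3(query)
        new_result = run_v4(query, quirks=True)
        if old_result == new_result:
            create_new_link(short_id, query)
            migrated.append(short_id)
        else:
            # Be transparent: this link is known not to work identically.
            report_failure(short_id, old_result, new_result)
            failed.append(short_id)
    return migrated, failed
```

The failure report is what later feeds the "we are really sorry, this link is known not to work" message and the administrator's feature-usage overview.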
11:03
That's a pretty bold statement. So, I want to present some selected problems that happened. One thing is the semantics. In an ideal world, a query language would be formally defined, like Datalog. You have predicate logic,
11:21
everything is sound. All you would need to restore the query result, the digital object, would be to have the data and execute the query again. In reality, we have SQL. And I want to argue that SQL is bad at backward compatibility. There have been standardized versions of SQL (SQL-92 and so on), but there are, of course, also various
11:43
implementations like MySQL, Postgres, and so on. All of these support different versions of the standard or have a lot of vendor extensions. AQL, for example, has only two implementations, ANNIS 3 and ANNIS 4. But the first version inherited a lot of semantics from SQL
12:00
and the specific implementation, PostgreSQL. So the problem of changing the AQL implementation is really similar to, for example, migrating to another database; you have very similar problems. One thing I already said is that we want to remove things. We want to remove features. For example, some of the features are just
12:22
not orthogonal to each other, and we want to remove them. The nice thing about the quirks mode is that you can move all that implementation to a separate code path. So you have the nice, sound stuff here, and the ugly stuff there. The other good thing about this approach, by checking these things, is that you can make it transparent. If you
12:45
know that a query feature was never used, you don't have to support it. That's kind of like telemetry: I know it's not used, so I just don't implement it anymore. If it's too hard to implement and people still use it, you can still make that transparent. You can show the user a message like: we are really, really sorry, but this link is known
13:04
not to work. You can try it anyway, but we know it's different. And here's a bug tracker URL and an email address; please write to us if you really, really want to have this feature. Another problem is that identifiers are not really stable. We use URIs as internal identifiers, and the matches are defined by
13:26
these URIs. A match for us is only the same if all the URIs are the same. But people use weird names, stuff like spaces, slashes, umlauts, and they will escape things.
13:41
They will double-escape things. They will triple-escape things. Everything will happen. And they will use Unicode, and Unicode is really, really rich. I can only recommend: if you own software, try a list of interesting strings on it and see if it crashes. So importing data via IDs and comparing them is really hard. Also, regular expressions are awful. They are an
14:05
important part of AQL, because people search for text and for patterns in text. Even how you invoke a regular expression is not standardized: you have PostgreSQL with its tilde operator, you have REGEXP in MySQL, and you have different syntaxes and different regex engines.
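As a concrete example (illustrative; exact behavior depends on the engine): a pattern with a backreference works in engines like Python's `re` or PCRE, but POSIX EREs and linear-time engines such as RE2 or the Rust `regex` crate reject it, because backreferences are not regular in the formal sense:

```python
import re

# A backreference pattern: match a doubled word such as "very very".
# Group 1 captures a word, and \1 requires the same word to repeat.
pattern = r"\b(\w+) \1\b"

doubled = re.search(pattern, "this is very very important")
print(doubled.group(0))   # "very very"

# Engines without backreference support (POSIX ERE, RE2, Rust's regex
# crate) reject this pattern outright, so the same stored query string
# can succeed in one engine and fail in another.
```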
14:21
And even if they say they're POSIX compatible, that doesn't mean a thing. They will have different features: backreferences, for example, are not actually regular expressions anymore, but some engines still support them. Some leave them out for performance reasons. And power users will always find out that this kind of stuff exists, even if you never document it. We never officially said we support backreferences.
14:44
People still use them, because the SQL implementation we used allows them. That's a problem. Another thing is string ordering. For query results, the order of results is important, because we want to refer to the second one, this match, that match. So we have to check the
15:04
order as well. Who of you can tell me what the result of this query is? Who says true? Okay. Who says false? Okay. The thing is, none of you are right.
15:23
It depends. It depends on your locale, because the standard never actually says what this means. You can have a system with a German, an English, or a plain C locale, and depending on what your server is running, you get different results.
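A small demonstration of the same effect in Python (a simplified stand-in; real locale-aware comparison would use something like `locale.strcoll` or an ICU collator): the same list of strings sorts differently depending on which comparison rule happens to be in effect:

```python
words = ["Zebra", "apple", "Ärger"]

# Codepoint order (roughly the "C" locale): uppercase letters sort
# before lowercase, and non-ASCII letters sort after all ASCII.
c_order = sorted(words)            # ['Zebra', 'apple', 'Ärger']

# Case-insensitive comparison is a crude stand-in for a natural-language
# collation; note the umlaut still sorts last, while a real German
# collation (de_DE in PostgreSQL) would put 'Ärger' right after 'apple'.
folded_order = sorted(words, key=str.casefold)   # ['apple', 'Zebra', 'Ärger']
```

If the order of matches is part of the persistent reference, as in ANNIS, this is exactly the kind of difference a migration check has to catch.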
15:41
In Postgres you can say: please use the German collation. But honestly, I don't think anyone ever did this before they had the problem in production. So you are kind of bound to the specific server running the software, which is really bad. Think about this before it happens. Okay, a quick conclusion. ANNIS 4 is for us currently running
16:02
in public beta. The users created about 13,000 reference links on the old server, so it's quite a lot. And all but about 140 queries work, so it's a really good result: only about 1% are failing. There are of course remaining issues, like unsupported regular expression features,
16:23
some edge cases we might never support, and actual bugs. The nice thing about this approach is that the reference links also improved our software, because they are a huge test case. You get this automatic migration for persistent IDs, which is not usual. And you get a lot of transparency for the administrator, who knows: okay, on my instance,
16:46
people use these query language features; does everything work or not? And also for the end user: if something didn't work, they know there's an actual problem. We hope this will enable us to retire the old version and run only the new version on our
17:01
server and not have to manage both of them. So thank you very much. Okay. Thank you very much for the talk. And are there any questions? Oh, there's one. Thank you for your
17:23
presentation. I really liked the idea of testing the query results across different versions, but what I ask myself now is, I guess this is an open source project, right? So basically if I wanted to contribute, do I have access to the complete list of reference
17:46
queries and their results? In case I want to contribute, to actually test whether my contribution is correct. And do you intend to maybe do that in the future? I think this is a good idea. All this data is anonymized; there are no IP addresses
18:01
logged. So this could be put online without a problem. I didn't do it yet because of time constraints, and also because executing this on the server takes about two days of comparison. And you actually need access to the data: the problem is that you would need access to all the corpora, and some of the corpora are not open access.
18:20
But I already published a large data set of about 3,000 queries that I used for my PhD thesis, which was collected similarly, and that also helped a lot. But yes, I should look into which of these corpora are open access and make a separate data set for testing. You can't execute this in CI, though; it would take too long.