
Building new NLP solutions with spaCy and Prodigy


Formal Metadata

Title
Building new NLP solutions with spaCy and Prodigy
Series Title
Number of Parts
132
Author
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You may use, modify, and reproduce, distribute, and make publicly available the work or its content, in unmodified or modified form, for any legal and non-commercial purpose, provided that you credit the author/rights holder in the manner specified by them and that you distribute the work or this content, including in modified form, only under the terms of this license.
Identifiers
Publisher
Year of Publication
Language

Content Metadata

Subject Area
Genre
Abstract
Commercial machine learning projects are currently like start-ups: many projects fail, but some are extremely successful, justifying the total investment. While some people will tell you to "embrace failure", I say failure sucks --- so what can we do to fight it? In this talk, I will discuss how to address some of the most likely causes of failure for new Natural Language Processing (NLP) projects. My main recommendation is to take an iterative approach: don't assume you know what your pipeline should look like, let alone your annotation schemes or model architectures. I will also discuss a few tips for figuring out what's likely to work, along with a few common mistakes. To keep the advice well-grounded, I will refer specifically to our open-source library spaCy, and our commercial annotation tool Prodigy.
Transcript: English (automatically generated)
OK, so first of all, am I mic'd correctly? OK, great. So yeah, this is a talk about two technologies that we've built, spaCy and Prodigy, but it's not primarily a technology talk. The genre is not really a library talk. It's actually about much more general considerations that arise whenever you try to apply technologies
like natural language processing, but also computer vision or other machine learning things, to new problems. And there's an under-discussed design consideration that arises here, which I think is actually a very important issue that I'd like to highlight. So I guess it's kind of more like an opinion talk that I hope will be useful to people regardless of which
tool chain they happen to be using. So as a quick orientation, to help you get your bearings around this and the activities of the company: I'm a co-founder of a very small digital studio called Explosion AI. And we make spaCy, this open source library
for natural language processing. And we have a few other projects associated with it. So spaCy has its own machine learning library called Thinc that we use to power it and keep things working. And in particular, we have an annotation tool called Prodigy
which people use alongside spaCy and other tools to basically help them create new annotations and adapt models to their requirements. And then finally, we're going to be releasing a data store of pre-trained models that people will be able to use alongside spaCy as well. So these technologies are used in a variety of companies.
And they're primarily designed to basically make it easier to adopt more cutting-edge, at-the-edge-of-understanding technologies and put them into production quicker. So the work that I'll be talking about today is joint work with Ines Montani, my co-founder.
And you can see a little background about us. I've been working on spaCy since 2014 and have been working on natural language processing things for most of my career. So the analogy that we give to people about how this all works and how the company works, which you can actually hear more about in Ines's keynote after this,
is that the open source software is kind of like these free recipes that are published online. And then we initially did some consulting alongside this, which you can see is kind of like catering. And then this tool Prodigy is kind of like this gadgetry that you can use alongside the recipes. And so that's how kind of the things fit together as a set of offerings.
OK. So I guess as an opinion piece, this is kind of the motivating statement, and it does take a little bit of explanation. So the concept here is that these projects that use natural language processing are like startups. They fail a lot. And what I mean by this is that there'll
be a few projects that use NLP or other machine learning projects that are wildly successful, and the rest of the projects will basically struggle. And so if you think about why this might be true, you can imagine that the world would look very different if it were the case that natural language processing projects usually worked. Imagine if it were just easy to take
any process which involved natural language in an office situation or in a business situation and just automate it. Well, natural language is really the underpinning of the human information system, right? So it must be difficult to do this. Otherwise, the world would look very different from what it does.
And so we can see that, all right, there's enormous potential in these technologies. And indeed, often they do succeed wildly. But you can still say that, all right, it must be difficult to do this, otherwise things would look wildly different. And so the question is, all right, if natural language processing projects do fail a lot, then what's the cause of that failure? Like, what's the thing that makes this so hard?
OK, so we can turn this around a little bit and ask, slightly flippantly: all right, what would it look like if we were trying to maximize an NLP project's risk of failure? So we would start off by imagineering. We'll just decide what the application ought to do. And we really want to be ambitious, because nobody ever changed the world by doubting whether something would work. So let's just start with the vision, right? And leave the technology for later. Then the next step would be to forecast. OK, we've got a vision of what we want our app to do. So we say, all right, what accuracy do we think we'll need from the technology to drive this forward? What do we need to make this work?
And if we don't know, let's just say 90%, I mean, it sounds about right. Then there are some details here which we don't care so much about. So next, we'll just outsource the data collection. It's just click work. We'll pay somebody else to get the data. We'll think carefully about the requirements that we've stated and decide that we need 10,000 rows for some reason.
And then having got that, we can now start the real work, the part that everybody talks about when they talk about machine learning projects. And this is this process of wiring, this beautiful iterative sequence of tinkering where we implement a network and tensor all our flows and descend every gradient, optimize everything, tweak our hyperparameters, and come up
with something that fits beautifully well on our 10,000 rows. And then let's hope that it works. And if it doesn't, well, I hope that we have somebody to blame. So that's what it looks like when we fail, and I hope you can see a few hints here as to why this might not work so well. I'll flesh this out a little bit. But first, I want to say that we shouldn't accept
this risk of failure even if we acknowledge that it's true. We can accept that, all right, empirically, there's a high-risk activity here. But I still want to keep our eyes clear that failure sucks, right? So we still want to minimize this. We don't want to embrace methodologies that make it more likely that we fail. Even if we say, all right, many projects fail, and that's a reality of the situation,
it doesn't mean that we just say, oh, well, embrace this and move on. No, failure sucks. We want to fail less. How do we do this? OK, we can start thinking about this as kind of a hierarchy of needs. And I think at the base of this pyramid, the sort of core food and shelter level of this hierarchy of needs
is understanding how the model will work in a larger application or business process. So having a clarity of what we're trying to do and where the value is going to come from. What are we trying to ship? How's it going to work in the rest of our application? Why do we need machine learning at all? What can we do without? That sort of clarity of purpose. Then translating that clarity of purpose
into an annotation scheme and using it to guide what data we need to collect, I think, is the next stage of this. So translating the requirements into a set of models. I think that's really a key step, and it's the step that I'll be talking about the most through this process. Then translating that annotation scheme
after we've decided what models we ought to have, translating that into an annotation process so that we can actually get the data cleanly. So this is the project management stage of having attentive annotators who know what we're trying to do. We have a good quality control process and a good process for cleaning up
inconsistent annotations, et cetera. And then finally, at the top of this pyramid, the parts that matter less, but also the parts which are discussed much more, these questions of model architecture: making smart modeling decisions so that the model is more likely to be accurate, using the wisdom that's in the literature to basically have the right technologies or whatever,
and also optimization tricks so that we actually end up with good weights from this. And so you can see here that the parts which I've identified as sort of the tippy top, the self-actualization part, the less necessary part, these are the parts which are vastly more discussed than these other issues.
And it does make sense that these are vastly more discussed in the literature because kind of globally, if you think of the field as a whole, that is kind of the bottleneck, right? Like if we have better model architectures and better optimization techniques, that does generalize across all of the projects. But the same consideration doesn't necessarily apply if you're considering your specific project. If you're considering your specific project,
the set of considerations are kind of different. And the part that you should spend most of your time thinking about is why you need machine learning at all and how you're going to map that need into a set of specific models. And then how you're going to get data to meet that need. And if we're going to solve this, then a difficult chicken and egg problem ends up arising.
And so the difficult chicken and egg problem, the circular dependency here, works like this. If the most important thing is having a clear vision of the product and what we're trying to do, well, then we want to know how accurate the model might be so that we can basically come up with realistic plans. So we need an accuracy estimate.
But in order to get an accuracy estimate, we need to have training and evaluation data. And we need to train and evaluate a model. And then in order to do that, we need to get labeled data. And in order to do that, we need an annotation scheme. But if we have to decide what to annotate, then we're going to need to know how this is going to work in the product. So there's a feedback loop here.
There's a cycle. So what can we do? Well, as with anything else where we have this sort of cycle, the solution is iterative. So what we need to do is have an iterative process where we progressively refine these estimates. And the iteration has to happen not just on the code, but also on the data that we're collecting and the vision of the product that we're trying to build.
So basically, don't have this waterfall approach where you start off making these assumptions and just feed them forward and hope that they're correct. We need to accept that the initial estimates are going to be slightly wrong and basically start trying to travel in a circle and refine our estimates so that we can collect some evidence that we can base these on.
So we're asking, what model should we train to meet the business needs? Does the annotation scheme make sense? And then finally, does the problem look easy or hard? As soon as we start doing it, we can start getting evidence about that. And then we can also try to figure out what we can do to improve fault tolerance when we start to see what sort of mistakes the model might make and how serious those are.
So if we don't take an iterative approach and we just sort of blindly go with these things, then especially in natural language processing, it's very easy to make modeling decisions that are simple, obvious, and wrong. So as an example of this, imagine that we had the following requirements. We wanna build a crime database based on news reports
and we wanna label the following. So we want to extract information from text, a very common type of need where the technology currently performs quite well. So we wanna get the victim name, perpetrator name, crime location, offense date, and arrest date. Here's an example of what that sort of annotation might look like. This is the sort of output that we might want.
So we want something like "24-year-old Alex Smith" labeled as the victim, and then "was fatally stabbed in East London", and we want that labeled as a crime. So all right, how should we do this? How should we map this requirement into a set of modeling decisions? Well, the simple way to do this, which actually a lot of the current fashion
is guiding people towards, is to take an end-to-end approach and just basically map this labeling scheme directly to the model. And we say, all right, we're just gonna have a sequence labeling scheme where we're going to extract that information directly. Now I suggest that this is quite likely to be an unideal way to approach the problem. So instead, I suspect that it's actually gonna be better to do this.
Apply a label of crime to the whole text and then apply more generic labels to the individual entities. So apply the label person to the entity Alex Smith, the label location to the entity East London and also the label location to Kings Cross. And so what we're doing here is we're factoring the information better.
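To make the contrast concrete, here is a hedged sketch of what the two annotation schemes might look like as records; the field names follow a generic span-annotation style and are illustrative, not a specific spaCy or Prodigy format:

```python
text = "Alex Smith was fatally stabbed in East London."

# End-to-end scheme: every decision is coupled into one sequence
# labelling task, so the annotator (and the model) must resolve the
# event type, the entity, and the semantic role all at once.
end_to_end = {
    "text": text,
    "spans": [
        {"start": 0, "end": 10, "label": "VICTIM"},
        {"start": 34, "end": 45, "label": "CRIME_LOCATION"},
    ],
}

# Factored scheme: one document-level label, decided once over the
# whole text, plus generic entity labels a pre-trained model knows.
factored = {
    "text": text,
    "label": "CRIME",
    "spans": [
        {"start": 0, "end": 10, "label": "PERSON"},
        {"start": 34, "end": 45, "label": "LOC"},
    ],
}
```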
And so we need much less annotation data, because we're only adding one bit of information, crime, and we're deciding that once over the whole text. In the end-to-end example, by contrast, the bit of information crime is coupled to the first person entity, and the semantic role of victim is coupled into that as well.
And so you have to decide that all at once. And then as well in the next one, East London, you have to decide all at once that it's a crime and this is the location of the crime. And then again in Kings Cross, you have to decide that this is the location but it's not the crime location and therefore the label is null. And this makes the modeling much harder and you need many more examples
to estimate the model this way in many cases. So it's quite likely to be an unideal way to do this, and you should at least explore composing the models in a different way: saying, all right, I'll decide once that it's a crime, and then compose these things with a bit of rule-based logic to match this up afterwards.
So in terms of what that rule-based logic might look like, this is an example of the kind of generic annotation that can be applied to text by spaCy, and also by other technologies. This is a syntactic dependency parse. Here we can see that the phrase Alex Smith has a syntactic relationship of passive subject to "stabbed", "fatally" is a modifier, and "in East London" is attached as a prepositional phrase. And so we can use this kind of generic annotation to basically start building rules to hang our logic on.
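As a minimal sketch of inspecting that parse with spaCy (assuming the small English pipeline en_core_web_sm is installed):

```python
import spacy

# Assumes: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")
doc = nlp("Alex Smith was fatally stabbed in East London.")

# Print each token's dependency label and the head it attaches to.
for token in doc:
    print(f"{token.text:12} {token.dep_:10} -> {token.head.text}")

# Roughly: "Smith" attaches to "stabbed" as passive subject (nsubjpass),
# "fatally" as an adverbial modifier (advmod), and "in" introduces the
# prepositional phrase covering "East London".
```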
Now, it may not be the case that this is actually the optimal way to do it, but there are at least these choices, and I wanna bring awareness to the fact that there are many decisions to be made in how you're actually decomposing a set of needs into a set of models, so that you can at least try different options. Because that's the kind of decision that's going to decide whether the problem's easy or hard. And much more than using machine learning techniques
to solve a hard problem slightly better, making the problem easy in the first place is a much higher leverage way to get the problem solved. So the general sort of approach here is that we can compose generic models into novel solutions. So if we have generic categories like location and person, we can use pre-trained models
and just improve them on our data. Then I would normally recommend annotating events and topics at the sentence or paragraph level, so that you don't have to decide the exact boundaries of something like "a crime occurred". Instead, you can just apply the label at the sentence level, rather than coming up with boundary policies which you will struggle to enforce. And then for semantic roles as well, you can annotate these at a word or entity level
and use the dependency parse to find the boundaries. So this is kind of a suggestion of a solution. And this is what the workflow looks like in the specific tooling that we've built, in Prodigy. So specifically, Prodigy lets you quickly spin up an annotation task so that you can start trying out
whether it's easy to label sentences with a crime or not. So you basically run this command, you get a little web server, you make some annotations, they're stored in a database, and then you can train a model from them. And then the integration with spaCy is also quite nice. You can basically read this out as a spaCy pipeline and start using it directly.
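As a hedged sketch of that workflow: the recipe names below follow Prodigy's textcat recipes, but exact commands vary by version, and the dataset and file names are made up.

```python
# Annotate: spin up the web server and label examples as CRIME or not
# (run in a shell; names here are illustrative):
#
#   prodigy textcat.teach crime_news en_core_web_sm news.jsonl --label CRIME
#
# Train a text classifier from the stored annotations and export it
# (recipe names vary across Prodigy versions):
#
#   prodigy textcat.batch-train crime_news en_core_web_sm --output ./crime_model

# The exported model is then an ordinary spaCy pipeline:
import spacy

nlp = spacy.load("./crime_model")  # path produced by the training step
doc = nlp("A 24-year-old man was fatally stabbed in East London.")
print(doc.cats["CRIME"])  # e.g. 0.94: probability this text describes a crime
```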
So you get the capability of saying, all right, doc.cats["CRIME"], and you see the probability of that. So the other consideration that I want to raise, which I guess often stops people from taking an iterative approach, and which I think is worth awareness,
is that if you focus mostly on big annotation projects, then it becomes very difficult to collect evidence and very expensive to collect evidence because there's this high startup cost of starting a new project. So rather than viewing annotation as something that has to happen at scale with lots of people and something
where the biggest consideration is driving down the marginal cost of each additional annotation, you should focus on driving down the overhead of annotation projects so that you can try out more things. That's what's actually going to take more projects from failure to success, if you can get it right. And so if you're able to run
specific annotation projects much quicker, and basically decide in a few hours whether something is gonna work or not, then you can try things out and explore the space of different modeling options. And yeah, so this is the solution that we have to this. Basically, even as a data scientist yourself, you should have a methodology
or a workflow that lets you yourself have an idea and just label some data and try it out so that if you have an idea for something that you wanna try, you don't have to basically convene a meeting, convince your boss that your idea is good, who will then get the annotators to give you some time,
then you get the data back and you decide, oh, it didn't work. Instead, just labeling a few hundred examples yourself gives you a much better perspective on whether the thing is likely to work, and then you'll be able to try more things and have more successes. Then additionally, A-B evaluation is a particularly good methodology for this
and especially since it lets you work on generative tasks. So I don't have time to explain this in detail but basically, even if you have a task where you're trying to output text, so for instance, imagine you're trying to caption images, you can't compare this statically to one reference annotation
because you don't know what's a good caption or what's not. But if you use a randomized A-B evaluation, which Prodigy supports, you're still able to rigorously evaluate these tasks, and I think that's a very good tool to have in your toolbox.
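To give the flavour of the idea, here is a toy sketch of a randomized A-B trial; it is illustrative only, not Prodigy's actual interface:

```python
import random

def ab_trial(prompt, output_a, output_b):
    """Show two candidate outputs in random order; return the preferred system."""
    options = [("A", output_a), ("B", output_b)]
    random.shuffle(options)  # hide which system produced which option
    print(prompt)
    for i, (_, text) in enumerate(options, start=1):
        print(f"  {i}. {text}")
    choice = int(input("Which is better (1/2)? ")) - 1
    return options[choice][0]  # "A" or "B", hidden from the annotator

# Tallying wins over many such trials gives a meaningful preference
# estimate even when there's no single gold-standard reference.
```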
And then finally, another detail about annotation projects which people often get wrong: if you think of annotation primarily as boring work that doesn't matter in your project, then it shouldn't be so surprising that you end up with data that's unideal, and you also shouldn't be surprised that there ends up being this terrible overhead in your projects of maintaining quality and making the data good. So instead, you can just not do this.
It's actually not that expensive to hire people who don't have computer science degrees to just do things and you can hire them consistently and talk to them and stuff. So rather than trying to outsource this pathologically and everything, you can have a few people that work like 30 or 35 hours a week on your thing and you can talk to them and they will work like humans and understand what they're doing.
And this is generally a better way to get work done. So I would actually recommend this, rather than trying to make people as dehumanized and disconnected from their tasks as possible. And again, if annotation teams are smaller rather than larger, that's also quite good, because it lets you iterate. If you need 100 hours of annotation,
it's much better to have three people working for a period of time than to have 100 people do one hour each, because then you don't get any time to iterate. Okay, and so this is really the solution here. You know, as with any other problem with cyclic dependencies, we can't solve this analytically. We have to solve it iteratively.
So we have to basically as quickly as possible start moving through the cycle and say, all right, what would it look like if I could make this work? Here's how we can actually export the model and have it plugged into the rest of the product. And then you see, oh, okay, well, it doesn't quite work well this way. Let's like try it this other way. And moving around that cycle quicker
is going to lead to better results rather than having a very siloed perspective of say, getting lost in TensorFlow for weeks, improving the accuracy on some data set that might not even be the right data set for what you wanna do. It's much better to basically be moving over the whole pipeline. Thanks.
So thank you again. There's time for questions now.
So in terms of what's the funniest thing that I've seen go wrong, there's definitely been some misunderstandings about what the technology is able to do
and what are reasonable product plans and what are not. I would say that the most general or common mistake, one that I find sort of puzzling, is the general chatbot enthusiasm, which I think is driven by a quite deep misunderstanding of what the technology is actually doing.
And in particular, people act as though the primary task is understanding the message, when you also still have to have your application actually do the thing that the message encodes. So for instance, people imagine that if you can just understand what people are searching for in, say, a menu system, something like, oh, find me a place that sells French tarts at 2 a.m.,
then you can just look it up. But you still have to have your database indexed by whether the place sells French tarts, right? And so the scope of capabilities is so much more narrow than people imagine for this. And that's fundamental
because you're not just gonna like generate code. And so people are like, oh, why is it so narrow? Why does it feel so stiff? And I'm like, because it's still a program that you've just wired a user interface to. And so I think that that's definitely something that I've seen go wrong at a large scale across the industry and people trying to apply these technologies.
About the information extraction: because, you know, there are rule-based and model-based methods, do you think the rule-based method is still alive in the future or not?
So what I would normally recommend is actually using machine learning to add annotations to text that you can then hang rules off. And if you think about what you're actually doing with the machine learning, there's always some point
where you probably want the output of your machine learning system to feed back into some other system. So at some point you need to translate from like the continuous space that you're probably in if you're doing machine learning to some sort of Boolean logic that the rest of your program is going to interact with. And so the question is,
what's the minimum that I can learn about this text that gives me consistent attributes that I can then work with using rule-based approaches? So I would never want a rule-based approach that tells me whether some sentence is about a crime. That's silly. Like, you know, it's so much easier to do that with a machine learning approach. But it might be the case
that once I've noted it's a crime, I can just use a rule like "the first person mentioned is the victim", because of the nature of my data. Or, okay, if I've noted it's a crime and I've got this list of verbs, I can note which one is the crime that occurred. And that might be a much easier way to do it than trying to, you know, basically learn all of those bits of information coupled together.
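As a hedged sketch of what such a hybrid might look like with spaCy (the crime-verb list and the "passive subject is the victim" rule are illustrative assumptions about the data, not a general recipe):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Illustrative assumption about this data: the passive subject of a
# known crime verb is the victim.
CRIME_VERBS = {"stab", "shoot", "assault"}

def find_victim(doc):
    """Return the victim span if the main verb is a known crime verb."""
    for token in doc:
        if token.dep_ == "ROOT" and token.lemma_ in CRIME_VERBS:
            for child in token.children:
                if child.dep_ == "nsubjpass":
                    # Expand to the full subtree, e.g. "24-year-old Alex Smith".
                    return doc[child.left_edge.i : child.right_edge.i + 1]
    return None

doc = nlp("Alex Smith was fatally stabbed in East London.")
print(find_victim(doc))  # -> Alex Smith
```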
So that's what I would say is the hybrid of those approaches in practical terms. Hi, great talk. My question is on spaCy. It looks like a really useful tool. How much work is it
to add an additional language model to spaCy? So it really depends on what capabilities you're interested in adding. At this point, the process of just adding a new tokenizer is pretty easy.
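For instance, here is a minimal sketch of spinning up a tokenizer-only pipeline for a language that already has basic rules in spaCy (Finnish here is just an example):

```python
import spacy

# spacy.blank() gives you just the language's tokenizer and defaults,
# with no trained components attached.
nlp = spacy.blank("fi")  # Finnish: basic tokenization rules, no model
doc = nlp("Tämä on esimerkkilause.")
print([token.text for token in doc])  # ['Tämä', 'on', 'esimerkkilause', '.']
```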
And similarly, if you have a large sample of unlabeled text, the process of training basic word vectors is pretty easy. But then most languages need, say, lemmatization. So you normally want, say, "jumped" to map to "jump". In English, that's a very simple process, but in languages like, you know, Finnish or Arabic,
it's actually quite an involved process. And so that ends up being difficult. Then for most of the other things, you really need to have data. And there are some data sets which you can license or which you can use, depending on, you know, the licensing terms you need. But in many situations, you actually don't have
a suitable corpus, and then you have to create one. And so we're interested in doing annotation for this using Prodigy. But because we basically have to pay for that, those are likely to be more commercial models. But there's definitely, you know, some data sets out there which are available. And so we do want to provide models on that basis, basically, you know,
free like the current English one. Great talk. So I really liked your instructions. Will they be available online, so I don't have to, like, copy them? So do you mean, like, the specific commands on the slide, or...?
No, like the graph, like the cycling graph. I mean, so the slides will be available and the talk's recorded. But do you mean like, we don't have it written up as a blog post yet, but maybe we can like do that. Maybe. Thanks.
So the question is about what the considerations are
about how well the technologies will work on different languages. So in general, the less like English a language is, the worse everything works. So English, being the language most like English, everything works pretty well. Dutch is also quite like English, and so things work fairly well.
Chinese, even though there's more text than Dutch, it doesn't work as well because it's less like English. So, you know, these methods have been really quite well tuned to the characteristics of English as a language. And there are a couple of attributes of English that are slightly convenient that, you know, basically mean that there are some easier problems associated with it.
So I would say that that's the biggest consideration. I would say that even though there's plenty of text for say Arabic, Arabic language processing is quite difficult because it's quite unlike English. Okay, so let's thank Matthew again.