
Scikit-LLM: Beginner Friendly NLP Using LLMs


Formal Metadata

Title
Scikit-LLM: Beginner Friendly NLP Using LLMs
License
CC Attribution - NonCommercial - ShareAlike 3.0 Unported:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal and non-commercial purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.

Content Metadata

Abstract
The instruction following and in-context learning capabilities of LLMs make them suitable for tackling many NLP tasks. In this talk, we will introduce Scikit-LLM: github.com/iryna-kondr/scikit-llm, a rapidly growing, beginner-friendly library that abstracts the complexity of working with LLMs by providing a scikit-learn compatible API. We will showcase how Scikit-LLM can be utilized for solving text classification and text-to-text tasks, and will delve deeper into various methods to improve the model performance, such as prompting strategies and fine-tuning.
Transcript: English (auto-generated)
Yes, thank you. So, as introduced, we are going to give a beginner-friendly introduction to NLP using Scikit-LLM. If you want to follow the slides, you can scan this QR code and open them on your devices.
Yeah, and before we begin, I would like to ask a question. Have you ever tried using large language models for NLP tasks? Maybe just raise your hands if you did. Okay, one, two, three, four.
Maybe 10 people at most, so not that many, which means that you're in the right place, because today we are going to show that this can actually be as easy as writing three lines of scikit-learn compatible code. Yeah, and let us briefly introduce ourselves again.
I am Oleh. I am working as a data scientist at SCCH, which is an applied research center in Upper Austria. Yes, and I am Iryna. I also work as a data scientist in Austria. In our spare time, both Oleh and I contribute to several open source projects, one of which is Scikit-LLM,
which will be the focus of today's talk. Yes, so since this talk is aimed at absolute beginners, we will start by introducing some common NLP tasks. Then we will explain how those can be solved with large language models. Afterwards, we will introduce Scikit-LLM
and explain how it simplifies the whole process. And finally, we will briefly touch on the topic of using different LLM backends for both inference and fine-tuning. So what are the common NLP tasks we might need to solve? The most classic one is obviously text classification,
where the idea is to assign a label from a predefined set to an input sample. For example, here we have a sentence saying, "a great service for an affordable price", and we can determine that its sentiment is positive. Another common NLP task, which most of you probably have to deal
with from time to time, is text translation. For example, here we need to translate a sentence from English into German. Moving on, the text summarization task involves condensing a large chunk of text into a more compact representation while retaining all of the key ideas.
And the final example I'm going to give for now is the named entity recognition task, where the goal is to find and classify all of the key concepts in the text. For example, here we have the sentence, "EuroPython will take place in Prague", and we can determine that EuroPython is an event
and Prague is the location. And obviously there are many more tasks, and this list is far from exhaustive. We could probably spend the duration of the entire presentation just enumerating all of those tasks, but that would probably be a bit boring.
So let's talk about something else, namely how large language models can be used here. Yeah, and the first question we should ask ourselves is whether we can use LLMs for NLP tasks. It's difficult to give a definitive answer for every single task, since there are just too many of them.
However, at least for the ones we introduced earlier, the answer is yes. And indeed, if we just construct a simple prompt and pass it to ChatGPT, it already returns results in line with what we would expect. So now that it's clear that in principle LLMs can be used
for NLP tasks, the next question is whether we should use them. And again, it depends, and you would probably have to decide for yourself. If you do decide to use LLMs, there are certainly going to be some disadvantages to such an approach,
but there are also going to be many advantages. Just to name a single one: if you use LLMs, you can use the same model for many downstream tasks, which simplifies its management and potentially saves costs. However, building the whole NLP pipeline
from scratch might be inconvenient, since you would have to manually construct the prompts, make LLM calls, validate the output, et cetera. There are some libraries that provide certain levels of abstraction; however, those are usually a bit too general and not necessarily beginner friendly.
At the same time, there is an interface probably every single data scientist is familiar with, which is the scikit-learn interface, and so it would be really nice if we had something similar but for NLP. For example, there could be a GPT classifier
that uses ChatGPT under the hood to classify the text but is otherwise fully compliant with the scikit-learn API. And this is exactly the functionality Scikit-LLM provides. Scikit-LLM is an open source Python library that allows
you to seamlessly integrate large language models into scikit-learn for different NLP tasks. And believe it or not, this sentence wasn't even GPT generated. And now I would like to hand over to Iryna, who is going to give a more practical introduction to Scikit-LLM, starting with the text classification task.
Yes, so let's consider the classification example we've seen earlier, where the task is to determine the sentiment of a given text. The prompt for that task includes the instruction, highlighted in blue; the sample to classify, highlighted in green; and the list of candidate labels, which are in yellow.
This is a typical example of zero-shot classification, or more precisely, classification using zero-shot prompting. The concept behind this method is that we don't give the model any hints on how to classify the text. Instead, we expect it to use its background knowledge
to determine the most appropriate label. Now let's have a look at how it works in Scikit-LLM. Firstly, we need to import the zero-shot classifier and then use the very familiar scikit-learn API in order to obtain the predictions in just three lines of code.
The zero-shot approach is suitable when you either have no labeled data available or you need a very cheap solution, since zero-shot prompts are usually very short and therefore don't consume many tokens. However, the trade-off is that this approach usually offers lower accuracy compared to the other approaches we will see on the next slides.
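For reference, a minimal sketch of those three lines, assuming the import paths and the SKLLMConfig helper from the Scikit-LLM README (exact module paths and model names vary between versions, so treat them as assumptions):

```python
from skllm.config import SKLLMConfig
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

SKLLMConfig.set_openai_key("<YOUR_OPENAI_KEY>")  # placeholder key

X = ["A great service for an affordable price.",
     "The package arrived late and damaged."]
y = ["positive", "negative"]  # in the zero-shot case, y mainly supplies the candidate labels

clf = ZeroShotGPTClassifier(model="gpt-3.5-turbo")
clf.fit(X, y)
print(clf.predict(["I would definitely order here again."]))
```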
Now suppose you have some labeled data available, or you're willing to manually create a few labeled demonstrations. In that case, you could use another technique, known as few-shot prompting. The main idea here is that you include a few labeled examples,
or "shots", alongside the prompt. For instance, in this example one demonstration per sentiment label is provided in the prompt. If we have a look at the Scikit-LLM code, we can see that it remains mostly unchanged, and the transition from the zero-shot to the few-shot approach
only requires a minor adaptation of the import statement. The few-shot approach can outperform the zero-shot approach when provided with as little as one demonstration per class. However, it's not that uncommon to have datasets containing hundreds or even thousands of labeled examples.
And even though modern LLMs have very long context windows, capable of fitting that many observations, processing such a large number of examples would still be costly, and the efficiency of this approach is also questionable.
Therefore, we can extend the idea of few-shot prompting by incorporating an additional pre-processing step that dynamically selects only the N most relevant examples. We call this approach dynamic few-shot. The idea is as follows: during the training step, the labeled data is embedded
and a vector store is created. During the inference step, the user's query is embedded, and only the N nearest neighbors for each class are retrieved from the vector store. These samples are then included as demonstrations in the prompt. That approach allows us to scale
the training set size almost indefinitely while maintaining relatively constant token usage. The prompt for the dynamic few-shot would be identical to the one we used for few-shot prompting. However, the main difference here is that the demonstrations added to the prompt
are dynamically selected by the classifier. So the dynamic few-shot classifier efficiently utilizes large amounts of training data, but it obviously comes with additional token consumption and therefore costs. So overall, the message should be as follows:
if you have lots of training data and high accuracy is your primary goal, then start immediately with the dynamic few-shot approach. Otherwise, consider starting with the zero-shot approach as a baseline and then switch to the few-shot approach if necessary.
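A sketch of both variants, under the same caveats about exact import paths; n_examples, the number of nearest neighbours retrieved per class, is taken from the documented API and should be treated as an assumption:

```python
from skllm.models.gpt.classification.few_shot import (
    FewShotGPTClassifier,
    DynamicFewShotGPTClassifier,
)

# X_train / y_train / X_test are placeholders for your labeled texts and labels.
# Plain few-shot: every training example is embedded into the prompt.
few_shot = FewShotGPTClassifier(model="gpt-3.5-turbo")
few_shot.fit(X_train, y_train)

# Dynamic few-shot: fit() builds a vector store over X_train; at predict()
# time only the n_examples nearest neighbours per class go into the prompt.
dynamic = DynamicFewShotGPTClassifier(n_examples=3, model="gpt-3.5-turbo")
dynamic.fit(X_train, y_train)
labels = dynamic.predict(X_test)
```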
Another classification approach I would like to cover is called chain-of-thought classification. The main idea here is that the model is prompted to provide an intermediate reasoning step before it generates the final label. As you can see here, we modified the prompt used for the zero-shot classification
by additionally asking the model to explain its answer. As a result, the model first explains its reasoning and then provides a label. The chain-of-thought approach potentially enhances reasoning capabilities and also offers
better explainability, since it provides not only the label but also the model's reasoning as an output. However, this approach comes with additional token consumption and costs.
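A sketch of the chain-of-thought classifier; the CoTGPTClassifier name and its module path are assumptions based on the Scikit-LLM documentation and may differ in your version:

```python
from skllm.models.gpt.classification.zero_shot import CoTGPTClassifier  # assumed path

# X_train / y_train / X_test are placeholders for your data.
clf = CoTGPTClassifier(model="gpt-3.5-turbo")
clf.fit(X_train, y_train)

# Each prediction is expected to carry both the model's reasoning
# and the final label, which is what makes the approach explainable.
predictions = clf.predict(X_test)
```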
To give you some idea of the performance of the different approaches, we have evaluated the classifiers on binary and multiclass classification datasets. For all of our experiments we used the GPT-3.5 and GPT-4o models from OpenAI, together with the Stanford Sentiment Treebank dataset, which contains a corpus of phrases from movie reviews.
On this slide we can see the results obtained in the binary setting. The table shows that the few-shot approach improved the accuracy of both models. However, the chain-of-thought approach did not improve the performance of the GPT-4o model and even slightly worsened the accuracy of the GPT-3.5 model.
For assessing multiclass classification, the fine-grained SST dataset was used, where the list of possible labels is extended to five. We can see that the few-shot approach improved the accuracy of both models,
and the chain-of-thought approach improved the accuracy of both models to an even greater extent. Based on these results, we can assume that making use of the model's reasoning is advantageous for more complex tasks, while it does not help, or even confuses the model, on simpler tasks.
Until now we have only discussed single-label classification tasks. However, LLMs can also perform multi-label classification. For instance, in this example a review may be categorized into several classes at the same time,
for example price, delivery speed, and service quality. Scikit-LLM natively supports multi-label classification through special multi-label estimators. Such an estimator takes the maximum number of labels as a hyperparameter and otherwise works identically to the single-label classifiers we've seen earlier.
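A sketch of the multi-label case, assuming the MultiLabelZeroShotGPTClassifier estimator and its max_labels hyperparameter as described in the README:

```python
from skllm.models.gpt.classification.zero_shot import MultiLabelZeroShotGPTClassifier

X = ["Affordable prices, but the delivery took two weeks."]
y = [["price", "delivery speed"]]  # each sample maps to a list of labels

clf = MultiLabelZeroShotGPTClassifier(model="gpt-3.5-turbo", max_labels=3)
clf.fit(X, y)
print(clf.predict(X))
```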
Additionally, I would like to make a small remark about the zero-shot classifiers. It was mentioned before that those do not require any labeled data, but you might have noticed that all the code snippets you've seen so far still had both X and y passed to the fit method. Therefore, you might have a very logical question: how can I use the classifier when no training data is available? And the answer is very simple: you just need to pass None instead of X, and a list of candidate labels as y.
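In code, the no-training-data case would look like this (same assumptions as in the earlier sketches):

```python
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier

clf = ZeroShotGPTClassifier(model="gpt-3.5-turbo")
clf.fit(None, ["positive", "negative", "neutral"])  # None for X, candidate labels as y
print(clf.predict(["A great service for an affordable price."]))
```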
So far we have discussed text classification tasks, where the input is a text and the output is a label or a set of labels. However, as mentioned in the introduction, text classification is just one of many NLP tasks. Another category of tasks includes text-to-text modeling, where both input and output are unstructured text segments. For instance, here the LLM performs two consecutive text-to-text transformations.
Firstly, it summarizes the provided text in 10 words, and then it translates the summarized text into the Czech language. These tasks can also be performed by Scikit-LLM using specific scikit-learn transformers,
like a summarizer and a translator. As you can see here, firstly we need to initialize a summarizer to transform the initial text X, and then transform the result further using a GPT translator. However, there is a minor issue with this example. As you may have observed,
we manually pass the output of the first task into the second one. This is similar to asking ChatGPT to summarize a text and then manually copying and pasting the output into a subsequent prompt for translation.
A more natural approach would be to continue the conversation and ask the model to translate the already summarized text without manual copying and pasting. When working with Scikit-LLM, we can skip the step of manually chaining the components by using the fact that each estimator in Scikit-LLM
is a scikit-learn-compatible estimator and can therefore be used in a Pipeline object. Here you can see that we have formed a pipeline that consists of two steps: firstly, we apply a GPT summarizer, and then we apply a GPT translator.
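A sketch of such a pipeline, assuming the GPTSummarizer and GPTTranslator transformers and their max_words / output_language parameters from the Scikit-LLM docs:

```python
from sklearn.pipeline import Pipeline
from skllm.models.gpt.text2text.summarization import GPTSummarizer
from skllm.models.gpt.text2text.translation import GPTTranslator

pipe = Pipeline([
    ("summarize", GPTSummarizer(model="gpt-3.5-turbo", max_words=10)),
    ("translate", GPTTranslator(model="gpt-3.5-turbo", output_language="Czech")),
])

X = ["<a long review text to summarize and translate>"]  # placeholder document
# Each summary is fed directly into the translator; no manual copy-pasting.
czech_summaries = pipe.fit_transform(X)
```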
So that was it for the text-to-text tasks, and now Oleh is going to introduce another task supported by Scikit-LLM, which is text tagging.
Yeah, so the main idea of text tagging is to take an arbitrary original text and augment it with XML-like tags. As you might imagine, there are quite a few tasks that can be formulated this way. However, currently Scikit-LLM only supports one of them,
which you already know about: named entity recognition. Using the named entity recognition task in Scikit-LLM is also very easy. You just need to instantiate the NER object and pass a dictionary of entities into it, where the keys are the potential entity names
and the values are the textual descriptions. Yeah, so after Scikit-LLM produces the text output, it can also be automatically parsed and transformed into a highlighted, human-readable form,
similar to the one you are seeing on the slide. This works both in Jupyter notebooks, by displaying the text inline, and in standalone scripts, by generating a separate HTML page.
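A sketch of the tagging step; the GPTExplainableNER name, its module path, and the display_predictions flag are assumptions based on the Scikit-LLM documentation and may differ in your version:

```python
from skllm.models.gpt.tagging.ner import GPTExplainableNER  # assumed path

entities = {
    "EVENT": "a named event, e.g. a conference",
    "LOCATION": "a city, country, or other place",
}

# display_predictions=True is assumed to render the highlighted output inline.
ner = GPTExplainableNER(entities=entities, display_predictions=True)
tagged = ner.fit_transform(["EuroPython will take place in Prague."])
```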
Now, I would like to switch the topic a little bit and talk about different API families. As you probably noticed, so far we have always used GPT for all of the examples. However, it's not the only model supported by Scikit-LLM.
In Scikit-LLM, all estimators are split into multiple API families, where the family is mostly defined by the schema of the underlying API. So there are family-specific variants of the different estimators. For example, here we have the GPT classifier
for the GPT family and the Vertex classifier for the Vertex family. On top of that, each API family can work with different backends. So the GPT family can work with OpenAI, Azure, local GGUF models via GPT4All,
and virtually any backend by providing a custom URL and, if necessary, using an OpenAI-compatible proxy server. But the usage of this proxy server is not always required, since there are already many different providers that support the standard OpenAI API, for example the famous Hugging Face Inference Endpoints.
Yeah, and the Vertex family supports PaLM 2 and Gemini models in Vertex AI. Switching between those different families is relatively easy, as the full import path
and the class name consist of two things: the task descriptor and the family name itself. So, for example, here we have a zero-shot GPT classifier, where the zero-shot classifier is the task descriptor, meaning it should be used for zero-shot classification,
and GPT is obviously the family. And if we want to switch it to the Vertex family, we adjust the import statement a little bit, but everything else stays mostly the same.
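For example (the Vertex import path, class name, and model identifier below are assumptions that simply follow the naming convention just described):

```python
# GPT family:
from skllm.models.gpt.classification.zero_shot import ZeroShotGPTClassifier
clf = ZeroShotGPTClassifier(model="gpt-3.5-turbo")

# Vertex family: same task descriptor, different family name.
from skllm.models.vertex.classification.zero_shot import ZeroShotVertexClassifier
clf = ZeroShotVertexClassifier(model="text-bison@002")  # placeholder model id
```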
Yeah, and since there is not much time left, I will just quickly touch upon the topic of fine-tuning closed-source LLMs. As you probably know, both OpenAI and Vertex provide REST APIs for fine-tuning their closed-source LLMs, which means that you can actually fine-tune ChatGPT on your own data.
However, I must warn you that the fine-tuning costs can get out of hand really, really quickly, so that's not something you should use as your first approach. We would always recommend starting with something really simple, for example zero-shot classification as the baseline,
and slowly increasing the complexity if it's needed. But if fine-tuning is really needed for your use case, or you're just eager to try it and you're not afraid of the costs, this functionality is implemented in Scikit-LLM, and it uses the same three lines of scikit-learn code, so you don't really have to learn anything new in order to do that.
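A sketch of the tunable estimator; the GPTClassifier name, the tunable module path, and the base_model parameter are assumptions based on the Scikit-LLM documentation, and note that every fit() launches a paid fine-tuning job with the provider:

```python
from skllm.models.gpt.classification.tunable import GPTClassifier  # assumed path

# X_train / y_train / X_test are placeholders for your data.
clf = GPTClassifier(base_model="gpt-3.5-turbo-0613")  # placeholder base model
clf.fit(X_train, y_train)   # starts a fine-tuning job via the OpenAI API
labels = clf.predict(X_test)
```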
Yeah, so that was it for now. Wait a second, I have a couple of announcements. First of all, you can check out the Scikit-LLM GitHub page, and we would really appreciate it if you could star it.
I guess right now we are about 15 stars away from reaching 3K. And also, if you want to check out our other projects or simply get in contact with us, please go to our website, where you will find all the necessary links. And the final announcement: we also have a small table right behind this room where we have lots of stickers,
so if you want to grab a sticker or just have a chat, please find us there. That's it. Thank you. Thank you so much for your talk. It was really good and we all learned a lot. So now we have some time for questions.
If anybody has questions, you can go to this microphone here in the middle of the room and ask your questions. In the meantime, I would like to ask a question. So when we pass examples, how do we actually pass examples? Because I didn't see that in the code,
and I was wondering if that's something that is handled by the library, or if you handcraft your examples and then pass them. What exactly do you mean by the examples? So in few-shot classification, you mentioned that we can pass some examples. Yes, so this is basically the training set. When you have a classifier that you fit with X and y,
the X are the actual texts. I don't see the pointer. So these are the Xs, and negative, positive, and neutral is the list of your ys. Okay, so the examples are treated as a typical training set
in scikit-learn terms. Okay, that's very cool. Okay, then I'll carry on with another question. So you talked about dynamic example selection, and how that works really well. Yeah, it works really well. Did you try to combine it with
other methods for prompting, like with CoT? No, not yet, but probably it would work even better. Okay, so can you please explain how that would look, to combine both chain-of-thought prompting and... I mean, right now it's not natively supported, so
you could do that, but you would probably have to tinker a bit with the library. Probably this is something we should add in the future. Yeah, but basically the standard chain of thought is just a zero-shot thing with some additional explanation, but you could have a few-shot with also some explanation, which would be
the equivalent of what you're asking about. Okay, so then you would have an example which includes the chain of thought. I mean, for that you might need examples that already have the explanations in the training set, but it's also possible to bootstrap the explanations: first you use an LLM to provide the explanations,
and then you use them for the subsequent few-shot chain of thought. And there are actually some studies that suggest that you do not necessarily need handcrafted explanations; if you just have an explanation in the prompt, it already improves the performance. There are also
some studies that suggest that, for the few-shot case, even if your training data is total nonsense and you just have random labels, it already improves the performance. Okay, super interesting. And you mentioned that it's an open source project, right? Yes, on the very last slide
there is a GitHub link, so if you want to check it out... We are also looking for contributors. Yeah, that's what I was wondering, whether you're accepting contributors; it could be a very interesting opportunity for some people. So, please. One question: thank you for your presentation, and which type
of tokenization do you use in GPT models? We are not explicitly using any tokenization scheme, because every single model has its own tokenizer, and usually, for example, when you make a call to the OpenAI API, it does
the tokenization on the server side, so it's kind of abstracted away. And it's the same thing if you are using the local GGUF models: it already knows which tokenizer to use automatically, so you don't really have to worry about the tokenization at all. Thank you. Yes, please.
About the providers that you showed: in your experience, are some better for different tasks, for example is GPT better for text generation and Gemini better for labeling? I would say GPT-4 is almost always better for every task, which is
unfortunate, but for now it's true. Thank you guys for presenting, and thank you all for your questions and for being here. Next we actually have lunch.