
Why ChatGPT and its friends are bad at Math


Formal Metadata

Title
Why ChatGPT and its friends are bad at Math
Title of Series
Number of Parts
3
Author
License
CC Attribution - ShareAlike 4.0 International:
You are free to use, adapt and copy, distribute and transmit the work or content in adapted or unchanged form for any legal purpose as long as the work is attributed to the author in the manner specified by the author or licensor and the work or content is shared also in adapted form only under the conditions of this license.
Identifiers
Publisher
Release Date
Language

Content Metadata

Subject Area
Genre
Abstract
Dr. Greiner-Petter’s talk will explore the contradictory criticism of LLMs for their poor performance in mathematical applications alongside their ability to pass challenging mathematical milestones. He will further examine how LLMs sometimes succeed in high-profile math challenges while simultaneously lacking the ability to reliably perform basic arithmetic.
Transcript: English (auto-generated)
I want to talk quickly about why ChatGPT and its friends, and in particular large language models, are bad at mathematics, and what I mean by that. I think it's good to start with just some examples, and I want to keep it simple. So I have here a simple multiplication question.
Maybe not that simple; the numbers are rather large, five digits each. I show you the correct answer here, but maybe we would expect these large language models to be able to do this. So I asked six of the probably most famous large language models these days. And first, I want everybody to think for themselves about how many of these models get it right: maybe one, maybe all of them, maybe none of them. But since we are digital, we will just skip ahead, and everybody can keep their guess in mind. So let's see. Llama 8 billion: okay, that was rather close. 70 billion: that's good, so more parameters apparently means it gets closer to the answer. Then Mixtral is completely off and a bit talkative, GPT-3.5 is close but also not quite right. GPT-4 is closer, okay, so there is an improvement, and GPT-4o is actually correct, bang on. But I want to keep that out of the picture for a moment, because GPT-4o does
something that might be a bit surprising. So let's focus on the other answers first. Do you notice something about these answers, apart from the obvious fact that they are rather close? If you look closely, you will probably notice that the first few digits are always correct and the last few digits are always correct too, which is a bit weird. So it looks like the models are somehow able to do this but then get confused in the middle of the numbers. We will come back to why that might be the case a little later. The length is also almost always correct, except for Mixtral, so maybe don't use Mixtral for multiplications.
Actually, you probably wouldn't use any of these models for that, except perhaps GPT-4o. So maybe you're wondering how GPT-4o got it right when none of the other large language models could pull it off. Well, technically, it was cheating. If you ask GPT-4o this question, it generates Python code, runs that code instead, and returns the answer. So it didn't do the calculation itself, right? It outsourced the task to something better.
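To make that "outsourcing" concrete, here is a minimal sketch of the pattern, assuming a hypothetical code-generating model; the generated snippet and the second operand are placeholders, not what GPT-4o actually emitted:

```python
# Minimal sketch of tool use: the model emits Python source instead of doing
# the arithmetic itself, and the host executes that source and returns stdout.
import subprocess

generated_code = "print(50897 * 12345)"  # placeholder for whatever code the model generates

result = subprocess.run(
    ["python3", "-c", generated_code],
    capture_output=True, text=True, check=True,
)
print(result.stdout.strip())  # the exact product, computed by the Python interpreter
```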
And okay, you could play devil's advocate and say that large language models are not calculators, right? They are trained on text, so ask them something more logical, like mathematical reasoning. Iman Mirzadeh and colleagues did exactly this rather recently and evaluated the mathematical reasoning of large language models on the GSM8K dataset. You don't need to know the dataset; it's fairly easy school-level material. To give you an idea of what this dataset looks like, here's an example, a little bit shortened. It goes something like this: Sophie watches her nephew and gave him 31 building blocks, 8 stuffed animals, 9 colored rings, and so on, and then some bouncy balls.
So he ends up with 62 toys in total; how many bouncy balls did she buy him? The answer is relatively simple: you take the 62, subtract all the other toys he already has, and the remainder is the number of bouncy balls.
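Just to make the arithmetic concrete, and assuming the blocks, stuffed animals, and rings from the shortened example are the only other toys (the original item may list more), the check is one subtraction:

```python
# Numbers taken from the shortened example above; the "and so on" may hide more toys.
total_toys = 62
other_toys = 31 + 8 + 9            # blocks + stuffed animals + colored rings
bouncy_balls = total_toys - other_toys
print(bouncy_balls)                # 14 under this assumption
```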
What these researchers did was compute the eight-shot chain-of-thought performance across 50 evaluation sets. The dashed gray line you can see here for each of these models is that model's average accuracy over all of these sets.
But something interesting they also did: they changed, for example, the names in these questions, or they changed the numbers, or they changed a combination of both. And this happens with all models; what you see here is just a smaller picture, and I definitely urge you to read the paper, it's quite a nice one.
You can see that the accuracy drops, or changes quite significantly, just by changing something that is clearly not important for the final answer. If you understand how the question should be solved, you should be able to answer it independently of the actual words used or the actual numbers used, as long as they remain appropriate, of course. So the authors concluded: in summary, we show that large language model performance significantly deteriorates as the number of clauses in a question increases. That is something we have seen; the more complex, the more difficult, obviously. But then adding a single clause that seems relevant to the question causes significant performance drops of up to 65% across all state-of-the-art models, even though the clause doesn't contribute to the reasoning chain needed for the final answer. So this is something I want to highlight again: those were not difficult questions.
And on the other hand, we sometimes hear news that large language models are suddenly able to reach gold medals at the Mathematical Olympiad. We will come back to that later as well. So let's talk about why this is actually the case. Why are ChatGPT and its friends this bad at math, and then sometimes so good at it? Let's focus on why they are bad at math first. For me, it comes down to two categories: flawed information, by which I mean how the large language models read the information, and flawed knowledge, by which I mean how the LLMs memorize the information they have seen. So for the input: when you think about mathematical texts, there are always numbers involved, symbols involved, and formulae involved. So let's start with the numbers. How does a large language model see numbers? Well, ironically, numbers are not encoded as the actual number in the internal embeddings of a large language model. Instead, they have vector representations encoding the semantics of that token. So what does that mean?
First of all, we're now talking about tokens, not numbers anymore, and a token is not necessarily the entire number. So I've color-coded here how the large language model would see the input from the beginning as tokens. This is actually the tokenization that ChatGPT was using, namely tiktoken.
So up until GPT-4o, this is exactly how these models see their inputs. And you can see it didn't see 50,897 as one number; it actually saw two different numbers. That could maybe give us a hint as to why the answers were always correct in length and in their first and last digits.
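If you want to check this yourself, a small snippet with the tiktoken library (`pip install tiktoken`) shows the split; using the cl100k_base encoding of GPT-3.5/GPT-4 here is my assumption:

```python
# Show how the operand from the slide is split into tokens rather than read as one number.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # encoding used by GPT-3.5/GPT-4
pieces = [enc.decode([t]) for t in enc.encode("50897")]
print(pieces)  # chunks of digits (e.g. ['508', '97']), not a single number
```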
It's only a hint, though. If you, for example, multiply the first two tokens, 578 by 128, you get this number, and then the second pair as well. And we notice that, here again, we have the correct digits at the beginning and the correct digits at the end. And I thought, okay, well, that was surely very cherry-picked.
So I played around with this a lot, and it actually worked every time I tried it, so I'm fairly confident it has at least something to do with this. What I want to say is that large language models don't see a six and a seven as an actual binary encoding, as an actual number representation, but rather as the tokens "six" and "seven", as language tokens. With symbols, we have a similar issue as with numbers. First, the tokenization doesn't really respect mathematical tokens. For example, let's say we have an input here with pi and then, in parentheses, x plus y. You can see already that the tokenization doesn't respect x and y.
There is actually one token for the opening parenthesis together with x, and then one for plus y. And again, these models are trying to understand what each of these tokens means. So the model cannot tell you what x means, because in this context a standalone x doesn't exist for the model.
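The same experiment on a small formula shows that the split follows byte-pair statistics rather than mathematical structure; again, this is just an illustrative check with the cl100k_base encoding:

```python
# The tokenizer has no notion of pi, x, or y as mathematical objects;
# it simply splits the string according to its learned byte-pair merges.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
pieces = [enc.decode([t]) for t in enc.encode("\u03c0(x + y)")]
print(pieces)  # chunks like '(x' or ' y' can appear; 'x' need not be its own token
```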
And then you can imagine, sorry, that was a bit quick, that this gets very complicated for larger formulae and so on, and the model is not really able to learn anything about the meaning of x or y. For formulae it gets even more complex, because purely textual content, with no interruption by non-textual tokens such as math, always follows a sequential order: the quick brown fox jumps over the lazy dog, and then end of sentence. But a mathematical formula is usually understood as a tree structure.
So we have, for example, the equal sign with a left-hand and a right-hand side. The left-hand side is, for example, a function, and it has parameters which change the attributes of that function, and so on. We understand and usually write this down as a tree structure. And to the best of my knowledge, there is no model, in fact no one has even really tried one, that combines these two structural views, the sequential order and the tree structure, when training larger models.
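To illustrate what such a tree looks like, here is a small sketch using SymPy purely as a stand-in parser; this is an illustration of the tree view, not something the language model does internally:

```python
# Parse a tiny equation and print its expression tree: an equality node with a
# left-hand side (a function of x and y) and a right-hand side (a sum).
import sympy

x, y = sympy.symbols("x y")
f = sympy.Function("f")
equation = sympy.Eq(f(x, y), x + y)
print(sympy.srepr(equation))   # nested constructor form, i.e. the tree, not a flat token list
```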
So now let's focus on the flawed knowledge, which means how these models store the information. And mathematical knowledge is, as you probably already know, a highly structured and extremely well interconnected field, right?
Here is, for example, a graph of Mathematics Stack Exchange. But the way this information is then stored inside a large language model is not in this structured form. It looks more like this: for example, "that which does not kill you only makes you ...", and we're looking for the next token.
So how is this next token generated? Well, the model looks at the context, obviously, and then walks through a large vector space where each of these vectors and directions presumably represents something important about the context. So, for example, the starting token is "you", and from there it has to follow the context.
So it needs, say, an adjective next, preceded by "that which does not kill you", and related to growth and strength. And then hopefully we end up in a position that is roughly "stronger". And this "roughly" is exactly why this is not good enough for math, because slight variations here can have drastic consequences in the output, as you can imagine. For example, we might suddenly end up in a completely different position, which in this case means "stranger". So, all in all, large language model outputs are always only approximate results converged from contextual information. In mathematics, on the other hand, we expect rigorous, deterministic, logical reasoning, and that is simply not what large language models do.
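As a toy illustration of that sensitivity, with entirely made-up two-dimensional "embeddings", a nearest-neighbour next-token choice flips from "stronger" to "stranger" after a small shift of the context vector:

```python
# Toy illustration: next-token choice as a nearest-neighbour lookup in a tiny,
# invented embedding space; a small shift in the context vector flips the result.
import numpy as np

vocab = {
    "stronger": np.array([0.9, 0.1]),
    "stranger": np.array([0.8, 0.3]),
}

def cosine(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def next_token(context_vec):
    # pick the vocabulary entry whose direction is closest to the context
    return max(vocab, key=lambda word: cosine(context_vec, vocab[word]))

print(next_token(np.array([0.90, 0.12])))  # -> stronger
print(next_token(np.array([0.85, 0.25])))  # slightly shifted context -> stranger
```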
So, very quickly, what can we do to make these large language models better? I could focus on what researchers have tried over the years to fix each of these individual flaws, but since we don't have much time, I want to focus on something else here: how do the big corporations currently solve, or try to solve, this issue?
Or, in other words, how was AlphaGeometry, for example, able to almost get the gold medal in the International Mathematical Olympiad on geometry questions? And those are really tough questions, right? Well, under the hood, AlphaGeometry is a neuro-symbolic system that uses a large language model to guide a so-called symbolic deduction engine, which is just another fancy term for a theorem prover in this case. And the theorem prover walks through a so-called knowledge graph, which was automatically built during training by looking at many examples of geometric figures and premises.
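A deliberately oversimplified sketch of that division of labor might look like the following; the two callables are hypothetical interfaces, not the real AlphaGeometry components:

```python
# Sketch of the neuro-symbolic loop: the language model only proposes candidate
# steps, and a symbolic deduction engine is the only component that decides
# what actually counts as proven.
from typing import Callable, Optional

def neuro_symbolic_solve(
    problem: str,
    llm_propose: Callable[[str, list[str]], str],          # hypothetical LLM interface
    engine_closes_goal: Callable[[str, list[str]], bool],  # hypothetical prover interface
    max_steps: int = 16,
) -> Optional[list[str]]:
    steps: list[str] = []
    for _ in range(max_steps):
        if engine_closes_goal(problem, steps):     # rigorous, deterministic check
            return steps                           # proof found
        steps.append(llm_propose(problem, steps))  # heuristic guidance from the LLM
    return None                                    # no proof within the step budget
```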
So here, the large language model is nothing but an interface to bridge the gap between the textual question and the deduction engine behind the scenes. In other words, they cheated again, just like GPT-4o did by generating Python code to perform the calculations. And so, what can we do, in summary?
Well, we cheat, or rather most of them cheat, by silently outsourcing the reasoning and computation to engines much better suited for the given task, while still calling it their AI. ChatGPT is doing exactly that, for example, with the Python code generation. And, somewhat worryingly, you may have noticed that o1, the next model, also gives you the correct answer, but it no longer shows you the Python code, which might be worth discussing. There is also the rather infamous connection between ChatGPT and the Wolfram engine, which does more or less exactly this. And AlphaGeometry likewise uses the large language model only as an aiding tool, as an interpreter for the task description, right? It doesn't do the actual reasoning. And of course, I have used the word "cheating" a lot here; that is a bit of teasing, obviously, it's not really cheating. But you have to understand that the way these large language models solve these tasks now is not by solving the tasks themselves.
They outsource the task to something more appropriate. So this is what I want to send you home with at the end of this presentation: use the right tool for the right task. Large language models are good as interfaces, but then, yeah, use them to drive the appropriate tools in the background.
Thank you very much.