A dive into LLMs and their struggles with math.

Arithmetic is numbers you squeeze from your head to your hand to your pencil to your paper till you get the answer.

This line, from a poem by Carl Sandburg, has stood the test of time. And maybe, the test of small LLMs as well.

By small LLMs, I mean LLMs that have fewer than 50B parameters (observed experimentally).

Why is that? Are 1.5B parameters not enough to perform SIMPLE tasks like addition? Surely, it cannot be that difficult, right? Right??

Truth is, even with Chain-of-Thought reasoning, small LLMs find it harder to add numbers than your average 5th grader. Why is that? Let’s take a not-so-deep dive.

  1. Tokenization challenges: Text is processed by LLMs as a sequence of tokens, and numbers are not given any special privilege. This means that models rely on generalized patterns from the training data rather than performing a true mathematical operation. So, essentially, a model cannot “ADD” 7+5 to get 12; it just has a “gut” feeling from having seen similar examples in its training data (see the tokenizer sketch after this list).
  2. Length generalization limits: Let’s be real. How many times in our lives have we seen six or more six-digit numbers being added up? Not many (unless you ask a calculator). This means an LLM is not trained on many long additions with lots of carry-overs, which is why it struggles with numbers longer than those encountered during training. There is little to no explicit mechanism for handling the “positional” carry-over or borrow relationships.
  3. Parameter efficiency tradeoff: A fancy way of saying that, with a limited parameter space, a smaller model tends to “spend more of its energy” on language generation and prioritize it over numerical reasoning.
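To make the tokenization point concrete, here is a minimal sketch using OpenAI’s tiktoken library (assuming the cl100k_base encoding; other tokenizers split numbers differently). The takeaway is that a number like 123456 reaches the model not as a quantity but as a handful of arbitrary text chunks.

```python
# A quick look at how a BPE tokenizer chops up arithmetic.
# Requires: pip install tiktoken
import tiktoken

# cl100k_base is one common BPE encoding; the exact splits below
# vary from tokenizer to tokenizer, but the lesson is the same.
enc = tiktoken.get_encoding("cl100k_base")

for text in ["7+5=12", "123456 + 654321 = 777777", "999999999 + 1"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]
    print(f"{text!r} -> {pieces}")
```

With this particular encoding, long digit runs typically come out as chunks of up to three digits; the model then has to learn arithmetic over those chunks purely from co-occurrence statistics, not from any notion of place value.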

So the issue boils down to LLMs being pattern recognizers, even if they have the ability to generate Chain-of-Thought reasoning. They should not be mistaken for algorithmic problem solvers.
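For contrast, here is what an actual algorithmic problem solver for this task looks like: the grade-school, digit-by-digit procedure with an explicit carry, written as a small Python sketch (the name add_by_hand is just illustrative, and of course no LLM runs anything like this internally).

```python
def add_by_hand(a: str, b: str) -> str:
    """Add two non-negative integers given as decimal strings, digit by digit."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)   # pad to equal length
    digits, carry = [], 0
    # Walk from the least significant digit to the most significant one,
    # carrying over explicitly at each position.
    for da, db in zip(reversed(a), reversed(b)):
        total = int(da) + int(db) + carry
        digits.append(str(total % 10))
        carry = total // 10
    if carry:
        digits.append(str(carry))
    return "".join(reversed(digits))

# Exact for any input length; no "gut feeling" involved.
print(add_by_hand("123456", "654321"))    # 777777
print(add_by_hand("999999999999", "1"))   # 1000000000000
```

The procedure stays exact no matter how long the inputs get, precisely because the carry is handled positionally rather than pattern-matched.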

<aside> ❔

There’s a little bit of a fun twist to this that will be touched on later, thanks to Anthropic’s Interpretability team’s paper titled “On the Biology of a Large Language Model” and Anthropic’s Alignment team’s paper titled “Reasoning Models Don’t Always Say What They Think”.

</aside>

Getting back to the point, the experiment (read: voyage) that we are setting out on is a fairly simple task.

Okay, so here it goes. The task is to: