This is my intuition about prompt-hacking workflows I have done myself and have seen others do. I went through the math concretely with a friend over dinner and decided to write the argument down. (I am trying to reinforce the writing habit.) I have not yet run any experiments to test the argument on real-world examples.

Few-shot prompting does not mean you can get away with few-shot testing. You always need a large test dataset to tell a good prompt from a bad one before you start prompt engineering an LLM.

Consider the prompt engineering we all embark on. You have a sense of the task and a couple of examples. You write one prompt, and the few examples you spot check don’t work. Then you tweak the prompt, and some examples still don’t work. You keep tweaking, and after a few iterations the prompt works for all of the examples you spot check. The vibes are right. Great, now you can ship an API call with that prompt.

But does your prompt work?


Let’s take a step back and remember the framing of ML problems. Any model you are using has an innate accuracy: the probability that, given an example, its prediction is correct (irrespective of the value of the label). Let us call that probability:

\[p(correct|example)\]

The probability depends on the model and the example; the model gets some examples right and some wrong, possibly with a systematic bias. The accuracy measurement aggregates this probability over your dataset

\[acc = \left<p(correct|example[i])\right>_{i\in N_{data}}\]

to determine the model’s final accuracy.
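In code, this aggregation is just an average of per-example correctness. Here is a minimal sketch, where `predict` and `dataset` are hypothetical stand-ins for your model call and your labeled data, not any specific API:

```python
# Minimal sketch: accuracy is the average per-example correctness.
# `predict` and `dataset` are hypothetical stand-ins, not a specific API.
def accuracy(predict, dataset):
    """dataset: iterable of (example, label) pairs."""
    results = [predict(example) == label for example, label in dataset]
    return sum(results) / len(results)
```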

The goal of machine learning is to systematically find a model that increases that probability. With traditional ML and deep learning, your randomly initialized model starts with only a random-guess probability of getting the right answer. A randomly initialized neural network is obviously useless.

It used to take a lot of data to make a model accurate. Because traditional methods were data-hungry, needing many examples to train a better-than-random model, we ML engineers had to put a lot of thought into our datasets. We also had to carefully curate validation and test data to make sure we weren’t over-fitting during gradient descent and hyperparameter tuning.


Standard practice with LLMs has deviated from this. Having instruction-tuned LLMs around that are so easy to use seems great. These models seem to have an innate 70% probability of getting a task right given an arbitrary instruction. However, this ease of use is also misleading.

When you are prompt engineering, you are trying to increase that innate 70% probability to 99+% by continually tweaking the prompt. Unlike with deep learning, you no longer need a lot of data; this is why you can quickly get reasonable results with few-shot prompting. But how do you know you are making any progress?

In the prompt-hacking journey, you systematically kept trying prompts. Ideally, your efforts have steadily improved the accuracy from 0.7 up to 0.99.

But perturbing the prompt gives slightly different outputs, and different does not necessarily mean more accurate. What if the LLM’s innate accuracy never increased, and each prompt version was still 70% accurate? The probability of observing 100% accuracy on your eval set is:

\[p(\text{observe 100\% acc}) = \prod_{i=1}^{N_{test}} p(correct|example[i])\]

If you are only looking at 5 examples, the probability of a given prompt passing all of them is:

\[p(\text{pass gut check}|\text{5 examples}) = p(correct)^{5} = 0.7^{5}=16.8\%\]

So each prompt has a 16.8% chance of looking okay!

How many tries does it take to get 100% accuracy on your 5-example eval set? Simulate the process:

| tries | p(fooled) |
|---|---|
| 1 | 16.8% |
| 2 | 30.7% |
| 4 | 52.0% |
| 8 | 77.0% |
| 13 | 90.8% |
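Here is a minimal sketch of that simulation in Python. It assumes each prompt attempt independently has the same 70% per-example accuracy and that the 5 eval examples are independent, so a single attempt passes the gut check with probability $0.7^5$:

```python
# Probability of being fooled at least once after `tries` prompt attempts,
# when every attempt still has the same innate per-example accuracy.
def p_fooled(tries, acc=0.7, n_examples=5):
    p_pass = acc ** n_examples            # one prompt aces the tiny eval set
    return 1 - (1 - p_pass) ** tries      # at least one attempt aces it

for tries in (1, 2, 4, 8, 13):
    print(f"{tries:>2} tries -> p(fooled) = {p_fooled(tries):.1%}")
# Reproduces the table above (up to rounding in the last digit).
```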

By repeatedly tweaking your prompt, it is possible that you are making no progress at all, but that by randomly perturbing the LLM’s inputs you stumble upon a case where the prompt passes your tiny eval set. The probability of that occurring is very high: after just 4 tweaked prompts, there is over a 50% chance that you are fooled. After 13 prompts, the probability is over 90%. When you deploy the prompt to a new stream of inferences, the model will still be 70% accurate.

You p-hacked your way into a significant result!

How does that compare with the journey you think you are taking? As an engineer, you need to distinguish between the case of no progress and the case of progress.

| tries | $acc$ (no progress) | $p(fooled)$ | $acc$ (progress) | $p(confirmed)$ |
|---|---|---|---|---|
| 1 | 0.7 | 16.8% | 0.7 | 16.8% |
| 2 | 0.7 | 30.7% | 0.8 | 44.0% |
| 3 | 0.7 | 42.4% | 0.9 | 77.0% |
| 4 | 0.7 | 52.0% | 0.95 | 94.8% |
| 5 | 0.7 | 60.1% | 0.99 | 99.7% |
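The same sketch extends to both tracks. As an illustration using the per-attempt accuracies assumed in the table, "fooled" or "confirmed" here means that at least one attempt so far has passed all 5 examples:

```python
# Probability that at least one attempt passes all n_examples,
# given the per-attempt accuracies in `accs`.
def p_any_pass(accs, n_examples=5):
    p_all_fail = 1.0
    for acc in accs:
        p_all_fail *= 1 - acc ** n_examples
    return 1 - p_all_fail

no_progress = [0.7, 0.7, 0.7, 0.7, 0.7]
progress = [0.7, 0.8, 0.9, 0.95, 0.99]
for step in range(1, 6):
    fooled = p_any_pass(no_progress[:step])
    confirmed = p_any_pass(progress[:step])
    print(f"try {step}: p(fooled) = {fooled:.1%}, p(confirmed) = {confirmed:.1%}")
```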

If you observe a success at step 4, what is the probability that you actually made progress? Assume a prior that both tracks are equally likely. We can use Bayes’ rule to compute the probability that you are on the right track versus the wrong track:

\[p(\text{no progress}|\text{success at 4}) = \frac{ p(\text{success at 4}|\text{no progress}) p(\text{no progress}) }{p(\text{success at 4})} = \frac{52.0\%\times 50\%}{73.4\%} = 35.4\%\]

\[p(\text{progress}|\text{success at 4}) =\frac{ p(\text{success at 4}|\text{progress}) p(\text{progress}) }{p(\text{success at 4})} = \frac{94.8\%\times 50\%}{73.4\%} = 64.6\%\]

That is a 1/3 probability that you are fooled! With just 5 eval examples and ad hoc “vibe checking”, you cannot distinguish random luck from real progress on the model’s accuracy!
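The Bayes update itself is a two-line calculation. As a sketch, plugging in the step-4 values from the table above with a 50/50 prior:

```python
# Posterior over the two tracks after observing a success at step 4.
p_success_no_progress = 0.520   # p(success at 4 | no progress), from the table
p_success_progress = 0.948      # p(success at 4 | progress), from the table
prior = 0.5                     # equal prior on both tracks

p_success = prior * p_success_no_progress + prior * p_success_progress  # ~73.4%
p_no_progress = prior * p_success_no_progress / p_success               # ~35.4%
p_progress = prior * p_success_progress / p_success                     # ~64.6%
```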


How do you get around this?

If you instead annotate 100 examples for your test set, the probability that a model with only 70% accuracy passes all of them is:

\[p(\text{observe 100\%}|\text{100 examples}) = p(correct)^{100} = 0.7^{100}=3.23\times 10^{-16}\]

Now let us simulate the same methodology but with access to this robust test set each time:

| tries | $acc$ (no progress) | $p(fooled)$ | $acc$ (progress) | $p(confirmed)$ |
|---|---|---|---|---|
| 1 | 0.7 | $3\times10^{-16}$ | 0.7 | $3\times10^{-16}$ |
| 2 | 0.7 | $6\times10^{-16}$ | 0.8 | $2\times10^{-10}$ |
| 3 | 0.7 | $9\times10^{-16}$ | 0.9 | $2.6\times10^{-5}$ |
| 4 | 0.7 | $1.2\times10^{-15}$ | 0.95 | 0.6% |
| 5 | 0.7 | $1.6\times10^{-15}$ | 0.99 | 36.9% |
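This is the same simulation sketch as before, just with a 100-example test set; `log1p`/`expm1` are used only so the tiny no-progress probabilities are not lost to floating-point rounding:

```python
import math

# Probability that at least one attempt passes all n_examples,
# computed in log space so values around 1e-16 stay accurate.
def p_any_pass(accs, n_examples=100):
    log_all_fail = sum(math.log1p(-acc ** n_examples) for acc in accs)
    return -math.expm1(log_all_fail)

no_progress = [0.7] * 5
progress = [0.7, 0.8, 0.9, 0.95, 0.99]
for step in range(1, 6):
    fooled = p_any_pass(no_progress[:step])
    confirmed = p_any_pass(progress[:step])
    print(f"try {step}: p(fooled) = {fooled:.1e}, p(confirmed) = {confirmed:.1e}")
```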

The probability of getting fooled is exceedingly low with a test set this large. After 100 million prompt attempts, the probability that any of them passes every test example in the no-progress case is still only $3\times 10^{-8}$. You can also now look at accuracy measures directly: seeing 95/100 correct is a much stronger indication that the model is around 95% accurate. But remember that observing 95% accuracy on your test set still does not mean the model is 95% accurate on all future inferences. There is another flaw with this methodology: you are looking at the test set each time. This is where the test/validation split comes in.


The conclusion I argue for is that you always need a test set with hundreds of examples to test and validate your model. Few-shot prompting may work: thanks to the foundation model, it is possible that you need very little training or eval data. However, few-shot testing cannot determine whether your prompt-tuned API call will work. You always need robust test sets that you do not look at while prompt tuning. Prompting and evaluation tools need to stress this more.

Furthermore, it’s important to build up coverage of the fringe cases and understand how often they impact your use case. (There are probably more than 5 fringe cases in whatever you are doing.) This is not just a tip for LLMs: whether it’s a traditional ML model, a human eval system, a prompted or fine-tuned LLM, or even a list of hand-coded rules, systematic quality control is always required.