I have a visceral reaction to people talking about "prompt engineering". This is an attempt to reflect on why this phrase elicits such a strong negative emotional response in me. There are a lot of things to criticize about the current AI hype and its impact on the web, the environment, the economy and society at large. That’s not my focus here. Instead, I want to examine what I understand good (software) engineering to mean and contrast that with the activity of writing prompts.
What are some properties you want a software system to exhibit? I’d argue the list should include at least:
- Predictable. My mental model should be aligned with the software so that I can with reasonable certainty predict the outcome of calling any one function in the system. This is very much not the case for LLM prompts. Replacing a word with a synonym, fixing a spelling mistake or changing the order of statements can all have dramatic and unpredictable effects on the result.
- Deterministic / Repeatable. The software should be deterministic. Providing the same inputs to the system should reliably produce the same outputs. LLMs sample their output with built-in randomness, controlled by a parameter euphemistically referred to as “temperature”. Execute the same prompt multiple times and you may get a different result each time.
- Inspectable / Debuggable. I should be able to pop open the hood, look at the code, and figure out what it is doing. If I experience bugs or unexpected outputs I should be able to trace back through the code to understand where and why the logic went wrong or differed from my expectations. None of that is true for LLMs. If I get an unexpected result all I can do is permute my prompt and hope for the best. I can’t inspect the inner workings of the network to figure out which part of the input was responsible for the deviation.
- Composable. Software systems should be building blocks I can combine to build something bigger than the parts. LLMs are all or nothing. Their interface is effectively all of human language as input and output. Multimodal models go even further and include other modalities like images, video and audio. This is too wide and deep an interface. By its very design, an LLM is not only a monolith but also perfect spaghetti. Everything potentially affects everything else. There’s no isolation in the network, and no way to reuse just parts of it.
- Stable. Software should be stable across versions. LLMs change in random, unpredictable ways from one release to the next. A prompt that provides good results now may become entirely useless with the next generation of the model. Numeric embeddings produced with one version cannot be meaningfully compared with ones produced by a different version. Great fun to start over each time.
- Testable. Software should be (unit-)testable with fairly high coverage. LLMs offer a single entry point for a huge set of functionality. The input and output space is enormous. It is utterly hopeless to achieve anything approaching good test coverage. All you can do is shine a flashlight into a vast dark ocean and hope that your tiny collection of test examples covers all cases relevant to your problem.
- Efficient. Software should be resource efficient. The principle of least power applies. LLMs are massive resource hogs and you activate large fractions of them no matter how simple or complex your prompt is.
- Fast. Most software counts QPS (queries per second) and measures latency in milliseconds. LLMs count QPM (queries per minute) and measure latency in seconds. That may be fast enough for an interactive chat bot. It is painfully slow for working with large data sets, especially when conjuring up prompts is so random that responsible development involves a ton of repeated experimentation and testing on large validation sets.
- Precise. Software interfaces should be precise. Human language is not that. It is inherently ambiguous, redundant and open to interpretation. There’s no way to ensure the LLM will choose any one particular interpretation of a statement.
- Secure. Software should be robust against injection attacks, leaking data and other safety concerns. Given the properties listed so far, it seems highly doubtful that this can ever be guaranteed for prompting LLMs. How can we pretend to secure a system we don’t understand and can’t test comprehensively?
- Useful. This is arguably the one point where LLMs make up for not ticking even a single one of the other boxes. They do solve problems where we don’t currently have any alternative approach for a solution.
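To make the “temperature” point from the list above a bit more concrete, here is a minimal toy sketch of how sampling a next token might work. The function name `sample_token` and the three example scores are made up for illustration, and real inference stacks are considerably more involved. The only thing it is meant to show: picking the top-scoring token is repeatable, while sampling at a positive temperature is not.

```python
import math
import random

def sample_token(logits, temperature, rng):
    """Pick a token index from raw scores (logits).

    Assumption for this sketch: temperature == 0 means greedy decoding
    (always the top score); any positive temperature samples from
    softmax(logits / temperature).
    """
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical scores for three candidate next tokens.
logits = [2.0, 1.5, 0.5]
rng = random.Random()  # unseeded on purpose, so runs differ

greedy = [sample_token(logits, 0, rng) for _ in range(10)]
sampled = [sample_token(logits, 1.0, rng) for _ in range(10)]
print(greedy)   # the top-scoring index every time
print(sampled)  # a mix of indices that can change from run to run
```

Seeding the random number generator would make this toy repeatable again, but that is a knob on the sampler, not a property of the prompt.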
Prompt engineering is hard for all the wrong reasons. Don’t take my word for it. Take a recent (December 2024) paper, “The Prompt Report: A Systematic Survey of Prompting Techniques”, which surveys more than 1500 other papers to compile a taxonomy of prompting techniques. Everyone is just blindly pushing and prodding the machinery, hoping to tickle it in just the right ways to provide useful results. Consider choice quotes like “exemplar order can cause accuracy to vary from sub-50% to 90%+” or “providing models with exemplars with incorrect labels may not negatively diminish performance. However, under certain settings, there is significant performance impact” or the summary “prompt engineering is fundamentally different from other ways of getting a computer to behave the way you want it to: these systems are being cajoled, not programmed, and, in addition to being quite sensitive to the specific LLM being used, they can be incredibly sensitive to specific details in prompts without there being any obvious reason those details should matter”. This is utterly ridiculous and shows our fundamental lack of understanding of cause and effect with these things.
Prompt engineering is performing alchemy in a chemistry lab. It is an insult to the chemists.
So yeah. Devising good prompts is hard and occasionally even useful. But the current practice consists primarily of cargo culting and blind experimentation. It lacks the rigor and the explicit trade-offs found in other engineering disciplines.