Jay Garmon [dot] Net: What "The Monkey's Paw" can teach us about AI prompt engineering

I decided to try out an AI app builder -- in this case, Together.AI's LlamaCoder -- to see if one could actually build something useful from just a few prompts.

TL;DR, these tools are almost useful, but every prompt feels like making a wish with a Monkey's Paw: Unless you are ridiculously specific with your request, you'll end up with something other than what you wanted. (Usually nothing cursed, but also usually nothing truly correct.)

As a test model, I asked LlamaCoder to "Build me an app for generating characters for the Dungeons and Dragons TTRPG, using the the third edition ruleset." Here's what happened.

For those of you who aren't tabletop roleplaying game (TTRPG) nerds, D&D 3rd Edition came out nearly 25 years (it's currently in its Fifth Edition), so lots of its source material has been on the web for a very, very long time. There's no reason an app-builder born of web-scraping wouldn't have plenty of examples to go on, both for the text and the app design.

Here's what my initial prompt produced:

It looks like an app. And, when I click Generate Character, here's what happens:

The app has clearly generated all six standard Ability Scores within typical ranges for a Level 1 character, and randomly chose a Character Class, Race, Alignment, and Background. On its face, this looks like a barebones but reasonable app for pumping out the beginning of a basic D&D 3E character. Not super useful, but okay to save me some dice-rolling and decision-making.

Aside: Yes, manually calculating a full-fledged D&D character is not entirely dissimilar to filling out a tax return. We're only going over the equivalent of the 1040EZ in this example, but a "real" character generator has complexity similar to TurboTax, and for a lot of the same reasons. (In a future post, we can discuss the similarity between tax lawyers and munchkins.) My findings below are about getting an app of basic competency, not one intended for power users.

In our first output, we already have a problem: D&D didn't introduce Backgrounds to player-characters until 5th Edition. This app is already non-compliant with the rules I set forth. Moreover, a lot of vital character components are missing.

However, generative AI is probabilistic, not deterministic, so every time you enter a prompt, you'll get a slightly (or not-so-slightly) different result. Thus, I entered the exact same prompt again to see if the "Background problem" was just a one-time glitch.

Character Background is now gone, despite using word-for-word the same starting prompt. I now also have the option to choose a base Character Class rather than have one randomly assigned, and the system now appears to offer options to automatically calculate Armor Class and Hit Points.

However, Alignment and Race have disappeared, and those are crucial to every D&D character. Moreover, neither version of this app has included Saving Throws, Skills, or a Base Attack Bonus, which are needed to have a character fight or perform any actions in a game.

I also have a new dropdown to choose a method of generating character attributes: Roll 4d6 and drop lowest or Point Buy. This is missing two other classic methods: 3d6 down the line and the Standard Array (which are less popular, to be sure, but absolutely listed in the D&D Players Handbook as approved methods).

And now we have a new problem: choosing the Point Buy option doesn't change anything. The app behaves identically, regardless of my choice on that dropdown. It simply performs a random number generation irrespective of that setting. This is a dummy setting that LlamaCoder threw in of its own accord.

In contrast, choosing a Class does seem to affect the Armor Class and Hit Points of a character, which is to be expected, given there are Class Bonuses for these stats.

LlamaCoder lets you add additional prompts to refine the output, so let's start knocking down these issues.

I added the following secondary prompt to start: Start every character at Level 1, and generate their stats using the Standard Array. Be sure to choose a Race and Alignment for the character. Automatically calculate the character's Base Attack Bonus, Saving Throws, Hit Points, Armor Class, and Skill Bonuses.

This broke LlamaCoder.

Specifically, the system introduced errors into its own code.

Now, this is a free tool with limited capacity, so I tried breaking up the follow-up prompts to see if fewer instructions would prevent the error. I started with just the first one: Start every character at Level 1, and generate their stats using the Standard Array.

This is what I got:

The generation method dropdown is gone, and when I select Calculate Ability Scores, it randomly places a score from the Standard Array into each Ability, with no duplication (as is correct). Also, Saving Throws and Base Attack Bonus are now included despite no specific prompt. I suspect LlamaCoder is playing around with prompt retention, so it decided to add those features based on my last, failed prompt. Skills, however, were not added.

I also tested all the individual buttons to generate Hit Points, Armor Class, Base Attack Bonus, and Saves. Running them before ability scores are distributed (they start at 0) created the correct negative bonuses, and changing Races and Classes appeared to alter these stats appropriately. Unfortunately, when I chose anything from a dropdown, I had to click four buttons to get all the derived stat blocks to regenerate. (LlamaCoder is not LlamaUXDesigner, clearly.) Let's see if we can fix that.

I added this sentence to the follow-up prompt: A single button should re-calculate Ability Scores, Hit Points, Armor Class, Base Attack Bonus, and Saves.

It worked, sort of:

I've got my One Button to Rule Them All, and all the math is calculated correctly when I click it, but once again, Class, Race, and Alignment have been removed. The first two of those can affect the derived stats, so we need them back. Let's see if we can get there without overloading the system.

I added a third sentence to the follow-up prompt: Automatically choose a Class, Race, and Alignment for the character.

Here's what we got:

This is a passably useful app for generating Level 1 D&D 3E characters. But it took a few iterations, a lot of domain knowledge, and some specificity to get there. In other words, you don't get a good Monkey's Paw wish until very late in the process, when you know exactly all the caveats you need to declare to avoid a harmful result. To get something commercial-grade would take a lot more work.

This speaks to me as a software product manager with 20+ years of experience: my initial prompt was a pretty typical user story, but no engineer -- AI or human -- would have likely produced a good result without more specific acceptance criteria.

Or, as someone more pithy than I (but equally cynical) put it:

To replace programmers with robots, clients will have to accurately describe what they want. We're safe.

I have two decades of product definition experience and I've spent the past five years directly developing AI tools, and it still took me several tries and lot of tinkering to get an app I still probably won't use, because it's missing so many key features. Unless the task is simple, GenAI isn't going to build what you really want, because building good stuff is hard and defining what you want is sometimes even harder. (That's why product managers are necessary.)

Anyone who says otherwise is selling something.

Use the GenAI Money's Paw at your own risk.

---

PS - Want to hire a veteran product leader to actually deliver GenAI value? I'm presently available for hire. You can grab my resume and generic cover letter here.

Jay Garmon [dot] Net

Friday, October 11, 2024

What "The Monkey's Paw" can teach us about AI prompt engineering

No comments:

Post a Comment

Blog archive

Cyberstalk me