Jay Garmon [dot] Net: 2024

Friday, October 11, 2024

What "The Monkey's Paw" can teach us about AI prompt engineering

I decided to try out an AI app builder -- in this case, Together.AI's LlamaCoder -- to see if one could actually build something useful from just a few prompts.

TL;DR, these tools are almost useful, but every prompt feels like making a wish with a Monkey's Paw: Unless you are ridiculously specific with your request, you'll end up with something other than what you wanted. (Usually nothing cursed, but also usually nothing truly correct.)

As a test model, I asked LlamaCoder to "Build me an app for generating characters for the Dungeons and Dragons TTRPG, using the the third edition ruleset." Here's what happened.

For those of you who aren't tabletop roleplaying game (TTRPG) nerds, D&D 3rd Edition came out nearly 25 years ago (it's currently in its Fifth Edition), so lots of its source material has been on the web for a very, very long time. There's no reason an app-builder born of web-scraping wouldn't have plenty of examples to go on, both for the text and the app design.

Here's what my initial prompt produced:

It looks like an app. And, when I click Generate Character, here's what happens:

The app has clearly generated all six standard Ability Scores within typical ranges for a Level 1 character, and randomly chose a Character Class, Race, Alignment, and Background. On its face, this looks like a barebones but reasonable app for pumping out the beginning of a basic D&D 3E character. Not super useful, but okay to save me some dice-rolling and decision-making.

Aside: Yes, manually calculating a full-fledged D&D character is not entirely dissimilar to filling out a tax return. We're only going over the equivalent of the 1040EZ in this example, but a "real" character generator has complexity similar to TurboTax, and for a lot of the same reasons. (In a future post, we can discuss the similarity between tax lawyers and munchkins.) My findings below are about getting an app of basic competency, not one intended for power users.

In our first output, we already have a problem: D&D didn't introduce Backgrounds to player-characters until 5th Edition. This app is already non-compliant with the rules I set forth. Moreover, a lot of vital character components are missing.

However, generative AI is probabilistic, not deterministic, so every time you enter a prompt, you'll get a slightly (or not-so-slightly) different result. Thus, I entered the exact same prompt again to see if the "Background problem" was just a one-time glitch.

Character Background is now gone, despite using word-for-word the same starting prompt. I now also have the option to choose a base Character Class rather than have one randomly assigned, and the system now appears to offer options to automatically calculate Armor Class and Hit Points.

However, Alignment and Race have disappeared, and those are crucial to every D&D character. Moreover, neither version of this app has included Saving Throws, Skills, or a Base Attack Bonus, which are needed to have a character fight or perform any actions in a game.

I also have a new dropdown to choose a method of generating character attributes: Roll 4d6 and drop lowest or Point Buy. This is missing two other classic methods: 3d6 down the line and the Standard Array (which are less popular, to be sure, but absolutely listed in the D&D Player's Handbook as approved methods).

And now we have a new problem: choosing the Point Buy option doesn't change anything. The app behaves identically, regardless of my choice on that dropdown. It simply performs a random number generation irrespective of that setting. This is a dummy setting that LlamaCoder threw in of its own accord.

In contrast, choosing a Class does seem to affect the Armor Class and Hit Points of a character, which is to be expected, given there are Class Bonuses for these stats.

LlamaCoder lets you add additional prompts to refine the output, so let's start knocking down these issues.

I added the following secondary prompt to start: Start every character at Level 1, and generate their stats using the Standard Array. Be sure to choose a Race and Alignment for the character. Automatically calculate the character's Base Attack Bonus, Saving Throws, Hit Points, Armor Class, and Skill Bonuses.

This broke LlamaCoder.

Specifically, the system introduced errors into its own code.

Now, this is a free tool with limited capacity, so I tried breaking up the follow-up prompts to see if fewer instructions would prevent the error. I started with just the first one: Start every character at Level 1, and generate their stats using the Standard Array.

This is what I got:

The generation method dropdown is gone, and when I select Calculate Ability Scores, it randomly places a score from the Standard Array into each Ability, with no duplication (as is correct). Also, Saving Throws and Base Attack Bonus are now included despite no specific prompt. I suspect LlamaCoder is playing around with prompt retention, so it decided to add those features based on my last, failed prompt. Skills, however, were not added.

I also tested all the individual buttons to generate Hit Points, Armor Class, Base Attack Bonus, and Saves. Running them before ability scores are distributed (they start at 0) created the correct negative bonuses, and changing Races and Classes appeared to alter these stats appropriately. Unfortunately, when I chose anything from a dropdown, I had to click four buttons to get all the derived stat blocks to regenerate. (LlamaCoder is not LlamaUXDesigner, clearly.) Let's see if we can fix that.
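Why are negative bonuses "correct" for a score of 0? In 3E, an ability modifier is the score minus 10, divided by 2, rounded down, so a score of 0 works out to -5. A short Python sketch (the function name is mine):

```python
def ability_modifier(score: int) -> int:
    """D&D 3E ability modifier: (score - 10) / 2, rounded down."""
    # Floor division rounds toward negative infinity, which matches
    # the rulebook's "round down" for scores below 10.
    return (score - 10) // 2


if __name__ == "__main__":
    for score in (0, 8, 10, 15, 18):
        print(f"{score:>2} -> {ability_modifier(score):+d}")
```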

I added this sentence to the follow-up prompt: A single button should re-calculate Ability Scores, Hit Points, Armor Class, Base Attack Bonus, and Saves.

It worked, sort of:

I've got my One Button to Rule Them All, and all the math is calculated correctly when I click it, but once again, Class, Race, and Alignment have been removed. The first two of those can affect the derived stats, so we need them back. Let's see if we can get there without overloading the system.
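For a sense of what that one button has to do, and why Class matters to it, here is a rough Python sketch of a single recalculation step for a Level 1 character. It is my own illustration under several assumptions: only two classes are included, the character wears no armor, and racial adjustments are ignored entirely. The names `ClassProfile`, `CLASSES`, and `recalculate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClassProfile:
    """Level 1 numbers for one class (illustrative subset only)."""
    hit_die: int
    base_attack: int
    base_fort: int
    base_ref: int
    base_will: int


# Assumption: two sample classes; a real generator needs the full class list.
CLASSES = {
    "Fighter": ClassProfile(hit_die=10, base_attack=1, base_fort=2, base_ref=0, base_will=0),
    "Wizard": ClassProfile(hit_die=4, base_attack=0, base_fort=0, base_ref=0, base_will=2),
}


def ability_modifier(score: int) -> int:
    """D&D 3E ability modifier: (score - 10) / 2, rounded down."""
    return (score - 10) // 2


def recalculate(scores: dict[str, int], class_name: str) -> dict[str, int]:
    """Derive every Level 1 stat block in one pass (no armor, no race)."""
    profile = CLASSES[class_name]
    mods = {ability: ability_modifier(score) for ability, score in scores.items()}
    return {
        # Maximum hit die at first level, never less than 1 hit point.
        "hit_points": max(1, profile.hit_die + mods["CON"]),
        "armor_class": 10 + mods["DEX"],
        "base_attack_bonus": profile.base_attack,
        "fortitude": profile.base_fort + mods["CON"],
        "reflex": profile.base_ref + mods["DEX"],
        "will": profile.base_will + mods["WIS"],
    }


if __name__ == "__main__":
    sample = {"STR": 15, "DEX": 14, "CON": 13, "INT": 12, "WIS": 10, "CHA": 8}
    print(recalculate(sample, "Fighter"))
    print(recalculate(sample, "Wizard"))
```

Everything derived hangs off two inputs, the ability scores and the class, which is why losing the Class dropdown matters so much.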

I added a third sentence to the follow-up prompt: Automatically choose a Class, Race, and Alignment for the character.

Here's what we got:

This is a passably useful app for generating Level 1 D&D 3E characters. But it took a few iterations, a lot of domain knowledge, and some specificity to get there. In other words, you don't get a good Monkey's Paw wish until very late in the process, when you know exactly all the caveats you need to declare to avoid a harmful result. To get something commercial-grade would take a lot more work.

This speaks to me as a software product manager with 20+ years of experience: my initial prompt was a pretty typical user story, but no engineer -- AI or human -- would have likely produced a good result without more specific acceptance criteria.

Or, as someone more pithy than I (but equally cynical) put it:

To replace programmers with robots, clients will have to accurately describe what they want. We're safe.

I have two decades of product definition experience and I've spent the past five years directly developing AI tools, and it still took me several tries and a lot of tinkering to get an app I still probably won't use, because it's missing so many key features. Unless the task is simple, GenAI isn't going to build what you really want, because building good stuff is hard and defining what you want is sometimes even harder. (That's why product managers are necessary.)

Anyone who says otherwise is selling something.

Use the GenAI Monkey's Paw at your own risk.

---

PS - Want to hire a veteran product leader to actually deliver GenAI value? I'm presently available for hire. You can grab my resume and generic cover letter here.

Friday, October 04, 2024

AI isn't going to kill SaaS; AI is going to kill half-ass startups

There's a lot of bold ~~bullshit~~ prognostication about how new generative artificial intelligence is going to kill the Software-as-a-Service (SaaS) business model because now anybody can build custom enterprise apps with a simple ChatGPT prompt. Nothing could be further from the truth.

AI is going to lead to more SaaS products that don't suck. But, along the way, AI is going to kill most SaaS startups -- because most early-stage SaaS startups suck.

I'll let SMB explain what I mean.

For those who don't get the joke, here's the definition of a Minimum Viable Product (MVP) from Wikipedia: "a version of a product with just enough features to be usable by early customers who can then provide feedback for future product development."

The cult of the Lean Startup that dominates Silicon Valley and most venture-backed SaaS companies produces almost nothing but MVP SaaS products that are half-assed on their best day by design. The idea is to get early customers to tolerate these half-baked offerings until the founding team learns enough to turn it into actual enterprise software. (Customers buy at the MVP stage because they assume they'll get permanently grandfathered into absurdly reduced early pricing in exchange for being guinea pigs. Also, sometimes MVPs sort of work.)

But today, thanks to AI, I can type a few sentences and get a half-assed prototype for free. I don't need to try -- let alone pay for -- an outside startup's SaaS MVP that's buggy, unreliable, and may not last long. I can get that kind of V1 crap in an afternoon of puttering, complete with mildly functional code I can hand off to a real engineer as a proof of concept.

No, these AI-generated prototypes won't be very good. Most SaaS startup MVPs aren't any good, either. The difference is I don't have to go to a SaaS startup to get a crappy MVP anymore. I can roll my own.

That doesn't mean startups are going away -- startups are doing great -- it means SaaS startups can no longer get away with barebones MVPs. The bar for an initial version of a product just got a lot higher, especially if you want someone to pay for it.

The absurd idea that ChatGPT can spit out a useful SaaS CRM anytime somebody prompts it with "build me a Hubspot clone" is just that: absurd. Anyone who has ever built commercial-grade SaaS software (and I've built a lot) knows that it's really hard and requires sweating a lot of complicated details that GenAI code-vomiters won't address (and that's before we discuss the issue of systems maintenance and required security compliance). As such, there will be plenty of market left for human-driven SaaS startups to claim.

And, because GenAI makes it easy to spin up prototypes, internal alpha releases are cheaper and faster than previously possible. It will be easier than ever to launch a SaaS startup thanks to GenAI. But it will be harder than ever to make a SaaS product that people will pay for.

GenAI will make it simple to create generic, barely useful SaaS apps. But hyper-focused SaaS apps that meet very specific needs at a high level of competency will become much more valuable precisely because GenAI can't deliver that level of quality, and because maintaining that quality over time requires significant investment. Moreover, with actual SaaS, the costs of maintenance get spread over all customers, not incurred by a single customer running a bespoke in-house GenAI product. (In other words, build vs. buy economics still largely apply.)

The result will be an absolute explosion of niche, highly mature SaaS startups looking to claim extremely specialized market areas.

Which leads me to my final conclusion, one shared by my old colleague and current AI investor Rob May: venture-backed SaaS is probably dead.

GenAI makes rolling up custom migration tools super-cheap now, so SaaS switching costs are going to drop like crazy. Customer lock-in is going to be really hard to enforce. VCs like moats; GenAI is going to mass-produce bridges. As such, the only way to keep a customer on your product is for your product to actually be great. Being great is hard and expensive and may take a while. VCs aren't known for their patience, and they really hate customer churn.

Moreover, venture capital economics require that every company a VC invests in target a huge total addressable market (TAM), then set money on fire to try and claim a dominant position in the market before anyone else. The niche SaaS apps that AI can't create will have much smaller TAMs than VCs will tolerate. If you want to build a SaaS app in the future, be prepared to bootstrap.

In conclusion: The rise of code-generating artificial intelligence isn't going to destroy the SaaS business model -- just the SaaS business model as we've known it. The future of SaaS is a staggering variety of small, specialized, highly refined products that are ready for prime time at V1.

MVPs -- and VCs -- need not apply.

---

PS - Want to build a killer SaaS product before the GenAI market shift changes everything? I'm presently available for hire. You can grab my resume and generic cover letter here.

Wednesday, May 01, 2024

What Star Trek Can Teach Us About Police Reform

[Given everything happening with campus protests right now, I felt it was time to re-post this classic from July of 2020.]

There is a groundswell of public outcry to "defund the police," which is (to my perception) a provocatively worded demand to reform the police and divert many police duties to other, or even new, public safety agencies. Break up the police into several, smaller specialty services, rather than expecting any one police officer to be good at everything asked of a modern police department.

You know, like Star Trek.

As much as every Star Trek character is a polymath soldier/scientist/diplomat/engineer, Star Trek actually breaks up its borderline superheroes into specialty divisions, each wearing different technicolor uniforms to handily tell them apart. Scientists, engineers, soldiers, and commanders all specialize in their areas of expertise, so no one officer is asked to be all things to all peoples on all planets. Even Captain Kirk usually left the engineering to Scotty, and science-genius Spock most often left the medical work to Dr. McCoy. The same logic should apply to a city's public safety apparatus, which includes the police.

Specialization leads to effectiveness and efficiency. So, why do we expect the same police officer to be as good at managing traffic violations, domestic disturbances, bank robberies, and public drunkards? Those incidents require vastly different skills, resources and tools. They should be handled by different professionals.

This is not a new idea. Until the late 1960s, police departments also handled the duties that emergency medical services tackle today. And they weren't great at it. Pittsburgh's Freedom House Ambulance Service (motivated by the same issues of police racial discrimination and apathy as the current "Defund the Police" movement) pioneered the practice of trained emergency paramedic response, which became a model that the Lyndon Johnson administration helped spread nationwide.

Divesting emergency medical services from police departments has saved countless lives while also helping narrow the focus of modern police departments. Specialization was a net good. So let's expand on that.

So, how do we break up the modern police into their own Trek-style technicolor specialty divisions?

Let's look at the "away missions" that the police commonly undertake. The best indicators are police calls for service (CFS), which are largely 911 calls but can also include flagging down patrol officers in person. These are the "distress signals" the public sends out to request police "beam down" and offer aid. National data on aggregate calls for service is a little hard to come by, but this 2009 analysis of the Albuquerque Police Department CFS data gives a nice local breakdown.

From January of 2008 to April of 2009, this was the general distribution of APD calls for service:

CALL TYPE	# of CALLS	% of CALLS
Traffic	256,398	36.6
Suspicious Person(s)	90,040	12.8
Unknown/Other	88,961	12.7
Public Disorder	88,676	12.6
Property Crime	59,920	8.5
Automated Alarm	35,508	5.1
Violence	35,460	5.1
Auto Theft	12,953	1.8
Hang-up Call	10,017	1.4
Medical Emergency	6,241	0.9
Mental Patient	5,267	0.8
Missing Person	5,382	0.8
Drugs / Narcotics	2,110	0.3
Other Emergency	1,431	0.2
Animal Emergency	1,336	0.2
Sex Offenses	1,391	0.2

(NOTES: Unknown/Other, I believe, refers to calls where general police assistance is requested but the caller won't specify exactly what the police are needed for. Robbery would fall under Violence. Burglary would fall under Property Crime.)

A few items stand out, but first, let's recall how valuable it was to divest police of EMS duties. Medical emergencies are the cause of less than 1% of 911 calls, but they clearly warrant a non-police specialty agency to handle. Certainly some of these other, more common calls warrant specialist responses, too.

Similar findings were generated by this 2013 study of Prince George's County, MD.

"Overall, the top five most frequently used [911 Chief Complaint codes] were Protocol 113 (Disturbance/Nuisance): 22.6%; Protocol 131 (Traffic/Transportation Incident [Crash]): 12.7%; Protocol 130 (Theft [Larceny]): 12.5%; Protocol 114 (Domestic Disturbance/Violence): 7.2%, and Protocol 129 (Suspicious/ Wanted [Person, Circumstances, Vehicle]): 7.0%."

Right off the top, we can see that traffic enforcement takes up an inordinate amount of police calls for service. It seems rather ludicrous to send an armed security officer to write up fender-benders, hand out speeding tickets, rescue stranded motorists, or cite cars with broken tail lights or expired tags. An unarmed traffic safety agency, separate from the police, seems like an obvious innovation, just based on this data.

But what about all the ancillary crime "discovered" during routine traffic stops -- the smell of marijuana, weapons in plain sight, suspicious activity on the part of a driver? Well, a traffic safety officer can just as easily report these discoveries to police. But many of these "discoveries" were made during pretextual stops; cases where police already suspected the driver or passengers of wrongdoing and used a traffic stop as an excuse to search the person and property of the vehicle occupants.

These pretextual stops have been shown to erode trust in police and often lead to rampant abuses of power (and, too often, the paranoid execution of suspects in their own cars, as in the case of Philando Castile). Separating police from traffic enforcement will also separate them from the temptation to abuse pretextual stops.

Also, we could probably get a lot more people to sign up as traffic safety officers knowing they won't be asked to do any armed response work, and a lot more people will be eager to flag down a traffic safety officer for help with a flat tire if there's no chance a misunderstanding with that officer will lead to the motorist getting shot.

Beyond traffic enforcement, where else could specialization and divestment benefit the public and the police? Disturbance/Nuisances, Suspicious Persons, Public Disorder and Domestic Disturbances all represent a significant percentage of calls for service. Most often, someone loitering, being loud, arguing openly, or appearing inebriated (or simply being non-white in a predominantly white area) is not cause for sending in an armed officer. A social worker or mediator would be far more appropriate in many cases.

That said, domestic disturbances are often violent and unpredictable, as are public drunks and mentally ill vagrants. Sometimes a person skulking around is actually a public danger. While unarmed social workers may do more good -- and absolutely will shoot fewer suspects -- it is not entirely wise to send in completely defenseless mediators to every non-violent report of suspicious or concerning activity.

Again, we can learn from Star Trek.

When Starfleet sends some combination of experts on any mission, the diplomats, scientists, doctors, and counselors outrank (and often outnumber) the security officers -- but the redshirts nonetheless come along for the ride. Violence is the last resort, not the first, and persons trained and specialized in the use of force answer to people who lead with compassion, curiosity, and science. That's a great idea on its face; doubly so for police departments clearly struggling with their use of force.

So, we create a social intervention agency and send them in when the public nuisance has no obvious risk of violence. When there is a reasonable possibility of violence, we send a conventional police officer in to assist the mediator, but the mediator is in command. The redshirts report to Captain Kirk, not the other way around.

So, here's how I would break out a modern public safety agency, using Star Trek as a guide to reform and divest from the police.

Red Shirts: Fire & Rescue, doing all the same jobs fire departments do today
Gold Shirts: Emergency Medical Services, performing exactly as paramedics do today
Blue Shirts: Security, performing the armed response and crowd control duties of conventional police; the thin blue line becomes a bright blue shield
Gray Shirts: Traffic Patrol, handing out traffic citations, writing up non-fatal vehicle accidents, assisting stranded motorists, and other essential patrol duties that don't require an armed response
Green Shirts: Emergency Social Services, serving as mediators, counselors and on-site case managers when an armed police response is not warranted
White Shirts: Investigation and Code Enforcement, bringing together the police detectives, arson investigators, and the forensic and code-enforcement staff of other public agencies (like the Health Department, Revenue Commission, and Building Department) to investigate past crimes and identify perpetrators

Each division is identifiable by their uniform colors, so the public knows who and what they are dealing with at all times. It is also made abundantly clear that only Security blue-shirts are armed and that, if an active violent crime is not in progress, whichever of the other divisions is present on a Public Safety call is in charge.

Each of the six divisions is headed by a Chief -- a Security Chief, a Fire Chief, a Chief of Emergency Medical Services, a Traffic Patrol Chief, a Chief of Emergency Social Services, and a Chief Investigator -- who reports to a Commissioner of Public Safety.

That Commissioner should report to a civilian Commission, which is an independent oversight board that can investigate the conduct of any officer of any division. Accountability is as important as specialization. No good Starfleet captain was ever afraid to answer for the conduct of his or her crew.

Tricorders -- which is to say, body cameras and dash cams -- will be needed to log every mission. That's for the safety of both the public and the officers. Funding will need to be rethought. Staffing will need to be reallocated. The word "police" may no longer be a common term, but blue-clad armed peace officers will still be a necessary component of these new public safety agencies. They just won't be the only option, and they won't be the first option in most cases, either.

As Spock would say, it's only logical.