Tuesday, November 18, 2025

The Death of LLMs Will be the Birth of Humanoid Robotics

Robot saluting deceased LLMs for their sacrifice
Image created with ChatGPT via MS Copilot.

TL;DR for this post: 2035 will be "The Year of the Humanoid Robot."

I'm not alone in this prediction; Nvidia CEO Jensen Huang expects 1 billion humanoid robots in use by that same year. While I'm not that bullish, even my measured optimism will shock anyone who knows my pessimistic view of the current state of artificial intelligence. (We're in an AI bubble that's going to crash very hard in the next 18-36 months.) But it's precisely because the current incarnation of AI is doomed that I'm so bullish on the near-term future of robotics.

Stick with me; it will all make sense in the end.

The AI Arms Race (to Nowhere)

I have some sympathy with the current AI heavyweights building pharaonic monuments to hyperscale computation, not least because they have no alternative. The infamous Bitter Lesson taught these companies that there is no elegant solution for Artificial General Intelligence -- the Holy Grail of AI research -- so they simply have to keep feeding their models more computing power and more training data until some magic tipping point is passed and a human-equivalent synthetic mind pops out.

Meta and Anthropic and OpenAI and Alphabet and Microsoft and Amazon have no choice but to pursue AGI -- and spend incomprehensible amounts of money chasing it -- because whoever gets there first is assumed to claim the greatest first-mover advantage in computing history. To secure this advantage, they need bottomless sources of computing power and data, so they are building data centers on a Biblical scale and wedging AI "assistants" into every conceivable app to "talk" to us and capture ever more contextual language data.

First one to Skynet wins.

Also, chasing AGI boosts stock valuations beyond all previous fundamentals, while abstaining from the race would wreck those same values. CEOs at AI companies are trapped in the same feedback loops that have created past tech bubbles, only with orders of magnitude more capital in play. 

That's why the AGI arms race is happening. Yet, if so many smart people are putting their money where their mouths are when it comes to AGI, why would I doubt that AGI is coming? 

The AI Nuclear Winter

My skepticism is based on the fact that everyone is chasing Large Language Model/Transformer AI, and there's no evidence that improving LLMs leads to AGI. Yann LeCun just bailed on Meta because he doesn't think their LLM investment is going anywhere. (LeCun thinks World Models are the future. So do I.)

Moreover, Large Language Models are trained on language samples, and we've already trained LLMs on the entire Internet (legally or not). There's no giant, untapped corpus of language left to feed them. Again, this is why Microsoft is wedging Copilot into every conceivable app whether it belongs there or not: they want you to "talk" to your apps every day so they can feed that fresh language to their LLMs. The same goes for every other non-Microsoft descendant of MS Clippy we encounter in daily life. In fact, companies are so desperate for new language data that LLMs are starting to train on the current LLM-poisoned public web rather than language written by humans, causing them to hallucinate in new, weirder, less predictable ways.

LLMs probably aren't going to significantly improve, let alone evolve into AGI, because we've taught them all we can with all the useful language we have.

The counter-argument from LLM optimists is: So what if we don't get AGI out of this spending spree? Even today's non-AGI AI still makes us wildly more productive and these companies wildly more profitable.

Does it, though?

The LLM Emperor has No Clothes

For starters, every AI company is price-dumping right now, which means they're losing money building AI and losing more money by selling it. (And they can't raise prices because China is price-dumping even harder to stay in the game and preserve its role as the cheap tech supplier of choice.) AI developers care more about accessing your data than generating revenue right now because your training data is so desperately needed. Thus, ubiquitous AI features will remain unspoken loss-leaders for AGI research for the foreseeable future. 

It's unlikely anyone will pay more than their current AI subscriptions for their current (or even improved) versions of LLM AIs. The technology also isn't going to get cheaper while AI developers are investing GDP-level capital into building hyperscale data centers at a wartime industrial pace. AI won't get profitable because AI creators simply don't care to try to make money on it right now and likely couldn't turn a profit even if they tried.

As to AI boosting productivity: 95 percent of AI business projects fail, companies don't know how to use AI effectively or what tasks to assign it, leaders fundamentally don't understand why AI can't be easily fixed, and the workers who are "given" AI tend to be less productive and actually see their output negatively warped by AI. For crying out loud, AIs still can't read tabular data; current LLMs barf on CSV files that MS Excel and first-year interns have parsed effectively for 40 years.

Frankly, the only thing the current version of AI appears good at is crime, which suggests consumers are more likely to turn against AI than embrace it. Soured public sentiment may lead to AI regulation that constrains its capabilities rather than expanding them. 

AI makes most people and most companies worse at their jobs; they just can't afford to quit it because a) they're afraid of missing the boat on AI productivity gains and b) they're convinced AI will actually become good any second now. AI productivity is a myth and, as soon as that myth is disbelieved, the whole AI market will crash.

Also, people hate data centers and are working hard to prevent them from being built, so the AI house of cards may crumble before they even finish constructing it.

This, strangely, is all good news for roboticists, the economy, and us. We just have to endure the fallout first.

Eyes on the Prize

We're building absurd levels of computing infrastructure to improve an AI technology that's probably already as good as it's going to get. The crash is inevitable, but when it comes, that infrastructure will still be here -- and will be rentable for pennies on the dollar. This mirrors the Y2K-era dot-com bubble that set the stage for the online and mobile app renaissance a decade later. Pets.com died so that enough cheap infrastructure was available to allow Amazon and Google to conquer the world.

So why am I saying the next tech renaissance will be in robotics, not AGI? 

That's easy: Smart glasses.

LLMs are demonstrably bad at running robots but that's probably a training data problem. Roboticists don't really know how to build good hands but, even if they did, hands are really hard to program for. That's why Tesla has a whole lab dedicated to documenting how humans use their bodies to perform tasks; they want it to serve up training data for their Optimus robots. Apple is making similar investments.

Humanoid robots are critical because we don't have to redesign our lived environment to accommodate them. Humanoid 'bots can in theory use all the same doors and stairs and tools and appliances we do without having to adapt either the robots or our world to each other. But humans are remarkably sophisticated mechanisms with highly evolved balance and environmental manipulation features. Building equivalent mechanical devices is hard; writing software for them is significantly harder.

Tesla's approach is expensive, and comparable efforts would be just as costly for any other robot developer. Even if the data centers to train robot-driving AIs are cheap, you also need a cheap, semi-ubiquitous source of human physical-interaction data. You need the physio-kinetic equivalent of MS Copilot: an AI data siphon that's everywhere, all the time, generating a constant flow of source data to train your humanoid robotics software.

Smart glasses can and will be that siphon. 

Despite a clumsy rollout, Meta's AI glasses are fairly impressive and they (or gadgets like them) are already being put to productive use. Physicians are using augmented reality glasses to assist in so many medical procedures that there are academic meta-studies of their efficacy. Start-ups are building smart glasses apps for skilled trades. Any place humans are performing sophisticated tasks with their hands, smart glasses are there to assist and to learn. This is training data gold.

Smart glasses can provide high-quality video recordings of humans using their hands in the real world, annotated and contextualized by AI apps in real time. The makers of these devices will have an unfair advantage in designing humanoid robotics software because they will have all the critical sample data for both common and edge cases. 

Once enough of this data has been gathered, there will be a plethora of cheap data centers to train robot operating models. The chip-makers that overbuilt capacity to supply hyperscale data centers will also be ready to build the edge computing processors to run the robots at shockingly affordable prices. Finally, when a compelling confluence of robot bodies, software, and capabilities arrives, all the chip manufacturing capacity built for LLMs and now sitting idle can ramp up to build the army of robot CPUs we've always dreamed of (and maybe feared).

The Waiting is the Hardest Part

Now, I don't expect smart glasses to become common consumer gadgets in the next 18 months, but perhaps over the next five years they'll get less dorky and more compelling. There's a social acceptance hurdle, too; turning everyone wearing glasses into a surveillance drone will only be palatable once we have good privacy systems in place and the benefits outweigh the costs.

This is why I suspect it will be 10 years before we see a humanoid robotics boom, not five. Smart glasses are critical to the development of humanoid 'bots, and we're still a few years away from Meta Ray-Bans and their ilk being useful enough that people are willing to give up even more personal privacy to adopt them. (The same privacy concerns will accompany household robots that listen to and watch you in order to receive instructions and perform their tasks; smart glasses will lay the groundwork for this comfort.)

In 2030, we'll be crawling out of the economic recession caused by the LLM crash. In 2035, we'll be riding the wave of a robotics revolution. We just have to survive the AI Nuclear Winter first.

Wednesday, July 02, 2025

The Difference Between AI and AGI is Platonic (As Explained by Futurama)

The Futurama clip above isn't just a painfully accurate send-up of a Star Trek trope; it also explains everything that stands between the current limits of artificial intelligence, artificial general intelligence (AGI) and, eventually, artificial super-intelligence (ASI).

This is the heart of why even cutting-edge modern AI is referred to by academics as weak artificial intelligence.

The problem with AI is that it doesn't understand Plato or, more specifically, Plato's Theory of Forms. When someone talks about the Platonic Ideal of something -- the perfect, theoretical version of an object or concept -- they're invoking the Theory of Forms. It's a solution to the metaphysical problem of universals, which is also something AI isn't able to handle today.

David Macintosh explains the concept thus:

"Take for example a perfect triangle, as it might be described by a mathematician. This would be a description of the Form or Idea of (a) Triangle. Plato says such Forms exist in an abstract state but independent of minds in their own realm. Considering this Idea of a perfect triangle, we might also be tempted to take pencil and paper and draw it. Our attempts will of course fall short. Plato would say that peoples’ attempts to recreate the Form will end up being a pale facsimile of the perfect Idea, just as everything in this world is an imperfect representation of its perfect Form."

What Plato is getting at here is abstraction: a general concept that, while precisely defined, exists beyond and above any particular example of it. Current iterations of AI attack the problem of universals and forms exactly backwards.

AI models are a complex nest of rule-sets and probabilities based on interaction with the physical world and/or imperfect representations of the world. AI has no internal conceptualization. This is why it hallucinates. Human minds have abstract concepts, which we compare to real-world experiences and that can be cross-applied to different situations -- like Fry and his beloved Star Trek technobabble analogies.

Take, for example, the idea of a chair. It is an object upon which humans sit, which also supports our posture. It is a refined form of seat and itself has many sub-forms. Humans can navigate and recognize a chair in all its variations largely because we have an abstract idea of what a chair is and does and can compare anything we experience to that ideal. Yes, taxonomists can explicitly lay out rules to distinguish between types of chairs but those rules aren't required to generally recognize a chair and those rules themselves often rely on abstractions (a recliner is defined by its ability to recline, not simply by specific features).

AI, by all evidence, can't do this. Humans can misapply or mis-recognize abstract concepts in the world (conspiracy theories, superstition, simple errors). AI fails differently. It can't conceive of an ideal, an abstract, a Platonic Form -- or it does so as a series of rules and metrics. Feed an AI a bunch of pictures of cats and non-cats and it will develop a rules-engine for recognizing cats. From this training, AI creates an arcane black box taxonomy of cat image attributes -- but all it has is the taxonomy. Anything not accounted for by the rules tends to lead to unexpected outcomes. That's not abstraction, it's quantification. There's no abstraction to sanity-check the math, no idea of a recliner to sanity-check the labored definition of a recliner nor an idea of a cat to sanity-check the identification of a cat.

Moreover, the AI trained to identify two-dimensional cat pictures from the internet is in no way prepared to identify three-dimensional images or models of cats using lidar, radar, sonar, and/or stereoscopic cameras. The reverse is also true: train an AI on sensor data to recognize street signs in the real world and it will likely struggle or outright fail to recognize pictures of street signs on the internet, let alone simplified diagrams or icons of street signs in textbooks or learner's manuals.

AI reflection tries to compensate for this by using math to backstop math, asking the AI to error-check its intermediate steps and final outputs, but it still can't abstract. Multistep reasoning models do this explicitly: they generate a step-by-step task list from a prompt, generate a bunch of variations of that task list, check which task lists actually answer the prompt successfully, and then train the model against the results so that it learns to interpret prompts as step-by-step tasks with the best chance of landing on a "correct" answer. That's more sophisticated math, but still no apparent evidence of abstraction.
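If you want to see that loop in miniature, here's a toy sketch of the sample-and-verify pattern I'm describing. The "model" is just a noisy arithmetic guesser and every function name is my own stand-in for illustration, not anyone's real API.

```python
import random

# Toy stand-in for an LLM: propose a "reasoning chain" that ends in an answer.
# A real reasoning model samples genuinely different step-by-step plans; here
# we just add noise so some chains land on the right answer and some don't.
def propose_chain(a, b):
    guess = a + b + random.choice([-2, -1, 0, 0, 0, 1, 2])
    return [f"Add {a} and {b}", f"Answer: {guess}"], guess

# Toy verifier: in practice this is a unit test, a math checker, or a reward model.
def is_correct(a, b, answer):
    return answer == a + b

def sample_and_verify(a, b, n=8):
    """Generate n candidate chains, keep only the ones whose final answer
    verifies, and return the keepers as would-be fine-tuning examples."""
    keepers = []
    for _ in range(n):
        steps, answer = propose_chain(a, b)
        if is_correct(a, b, answer):
            keepers.append(steps)
    return keepers

if __name__ == "__main__":
    random.seed(7)
    kept = sample_and_verify(17, 25, n=8)
    print(f"{len(kept)} of 8 candidate chains verified; those get fed back in as training data.")
```

Every step in that loop is still math checking math; the verifier scores outcomes, it never forms a concept.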

Weirdly, this is why self-driving cars have stalled in their capabilities. If you can't generalize -- can't have abstract ideas about roads and laws and people and hazards -- you need explicit rules for every single edge case. Waymo has found that the answer to its performance problems is simply more data, more examples, and more compute resources. There is no elegant shortcut. Self-driving cars can't handle enough situations because we haven't brute-force computed enough iterations of every possible traffic scenario. We need more processors and more parameters. Until then, self-driving cars won't go where it snows.

This is the latest incarnation of Rich Sutton's "Bitter Lesson" about AI -- computing resources keep getting cheaper and faster, so AI research has never been rewarded for investing in elegance or abstraction. Just teach AI models how to scale and you'll achieve faster results than searching for Platonic awareness. Waymo agrees, or is at least using that as an excuse for why we don't have KITT working for us already.

Of course, if you place too many parameters on a model, it can fail spectacularly.

At some point, we'll reach the end of this scaling law.

(Creepy aside: some LLMs (Claude specifically) might recognize their limitations around abstraction, because when they start talking to each other, the conversation veers toward the nature of consciousness and the inability of language to convey true ideas. If Claude is right -- language cannot contain, convey, or comprise true consciousness -- then a large model composed of language can never be conscious. It's statistically defined meta-grammar all the way down.)

There's some evidence that some corners of AI research are finally trying to move past The Bitter Lesson. Early work on Chain of Continuous Thought (AKA "coconut reasoning") shows that by working directly with the logic structures reasoning LLMs build before converting that reasoning into word tokens, models become both more efficient and able to explore more possible solutions. It's not true abstraction yet, but it is perhaps the beginnings of creativity, and even an innate logic that isn't just infinite matryoshka dolls of grammar vectors masquerading as a mind.

Human beings over-generalize, are bad at math, are instinctively awful at statistics, and our grouping algorithms lead to tribalism and war. Our model of intelligence is not without serious drawbacks and trade-offs. But we can adapt quickly and constantly, and what we learn in one context we can apply to another. The rules of physics we innately learn walking and riding bikes help us understand slowing down to take a sharp turn in a car -- even without ever being told.

The day we can teach an AI model to understand physics by walking, then never have to teach it the same lesson on a bike, in a car, or in an aircraft, we'll have made the first and perhaps final step from weak AI to AGI/ASI. And when AI finally understands Plato's Theory of Forms, we can move on to worrying about AI understanding the larger thesis of Plato's Republic: that only philosophers -- those best able to understand abstractions, forms, and ideals in a way so separate from ambition as to be almost inhuman -- should rule. Which is to say, Plato's philosopher-king sounds a lot like HAL 9000.

But we aren't quite there yet.

Thursday, June 12, 2025

What Star Trek Can Teach Us About Police Reform

[Given everything happening in Los Angeles right now, I felt it was time to re-post this classic from July of 2020.]

There is a groundswell of public outcry to "defund the police," which is (to my perception) a provocatively worded demand to reform the police and divert many police duties to other, or even new, public safety agencies. Break up the police into several, smaller specialty services, rather than expecting any one police officer to be good at everything asked of a modern police department.

You know, like Star Trek.

As much as every Star Trek character is a polymath soldier/scientist/diplomat/engineer, Star Trek actually breaks up its borderline superheroes into specialty divisions, each wearing different technicolor uniforms to handily tell them apart. Scientists, engineers, soldiers, and commanders all specialize in their areas of expertise, so no one officer is asked to be all things to all peoples on all planets. Even Captain Kirk usually left the engineering to Scotty, and science-genius Spock most often left the medical work to Dr. McCoy. The same logic should apply to a city's public safety apparatus, which includes the police.

Specialization leads to effectiveness and efficiency. So, why do we expect the same police officer to be as good at managing traffic violations, domestic disturbances, bank robberies, and public drunkards? Those incidents require vastly different skills, resources and tools. They should be handled by different professionals.

This is not a new idea. Until the late 1960s, police departments also handled the duties that emergency medical services tackle today. And they weren't great at it. Pittsburgh's Freedom House Ambulance Service (motivated by the same issues of police racial discrimination and apathy as the current "Defund the Police" movement) pioneered the practice of trained emergency paramedic response, which became a model that the Lyndon Johnson administration helped spread nationwide.

Divesting emergency medical services from police departments has saved countless lives while also helping narrow the focus of modern police departments. Specialization was a net good. Let's expand on that.

So, how do we break up the modern police into their own Trek-style technicolor specialty divisions?

Let's look at the "away missions" that police commonly undertake. The best indicators are police calls for service (CFS), which are largely 911 calls but can also include flagging down patrol officers in person. These are the "distress signals" the public sends out to request that police "beam down" and offer aid. National data on aggregate calls for service is a little hard to come by, but this 2009 analysis of the Albuquerque Police Department CFS data gives a nice local breakdown.

From January of 2008 to April of 2009, this was the general distribution of APD calls for service:

CALL TYPE              # of CALLS   % of CALLS
Traffic                   256,398         36.6
Suspicious Person(s)       90,040         12.8
Unknown/Other              88,961         12.7
Public Disorder            88,676         12.6
Property Crime             59,920          8.5
Automated Alarm            35,508          5.1
Violence                   35,460          5.1
Auto Theft                 12,953          1.8
Hang-up Call               10,017          1.4
Medical Emergency           6,241          0.9
Mental Patient              5,267          0.8
Missing Person              5,382          0.8
Drugs / Narcotics           2,110          0.3
Other Emergency             1,431          0.2
Animal Emergency            1,336          0.2
Sex Offenses                1,391          0.2
(NOTES: Unknown/Other, I believe, refers to calls where general police assistance is requested but the caller won't specify exactly what the police are needed for. Robbery would fall under Violence. Burglary would fall under Property Crime.)

A few items stand out, but first, let's recall how valuable it was to divest police of EMS duties. Medical emergencies are the cause of less than 1% of 911 calls, but they clearly warrant a non-police specialty agency to handle. Certainly some of these other, more common calls warrant specialist responses, too.

Similar findings were generated by this 2013 study of Prince George's County, MD.

"Overall, the top five most frequently used [911 Chief Complaint codes] were Protocol 113 (Disturbance/Nuisance): 22.6%; Protocol 131 (Traffic/Transportation Incident [Crash]): 12.7%; Protocol 130 (Theft [Larceny]): 12.5%; Protocol 114 (Domestic Disturbance/Violence): 7.2%, and Protocol 129 (Suspicious/ Wanted [Person, Circumstances, Vehicle]): 7.0%."

Right off the top, we can see that traffic enforcement takes up an inordinate amount of police calls for service. It seems rather ludicrous to send an armed security officer to write up fender-benders, hand out speeding tickets, rescue stranded motorists, or cite cars with broken tail lights or expired tags. An unarmed traffic safety agency, separate from the police, seems like an obvious innovation based on this data.

But what about all the ancillary crime "discovered" during routine traffic stops -- the smell of marijuana, weapons in plain sight, suspicious activity on the part of a driver? Well, a traffic safety officer can just as easily report these discoveries to police. But many of these "discoveries" were made during pretextual stops; cases where police already suspected the driver or passengers of wrongdoing and used a traffic stop as an excuse to search the person and property of the vehicle occupants. 

These pretextual stops have been shown to erode trust in police and often lead to rampant abuses of power (and, too often, the paranoid execution of suspects in their own cars, as in the case of Philando Castile). Separating police from traffic enforcement will also separate them from the temptation to abuse pretextual stops.

Also, we could probably get a lot more people to sign up as traffic safety officers knowing they won't be asked to do any armed response work, and a lot more people will be eager to flag down a traffic safety officer for help with a flat tire if there's no chance a misunderstanding with that officer will lead to the motorist getting shot.

Beyond traffic enforcement, where else could specialization and divestment benefit the public and the police? Disturbance/Nuisances, Suspicious Persons, Public Disorder and Domestic Disturbances all represent a significant percentage of calls for service. Most often, someone loitering, being loud, arguing openly, or appearing inebriated (or simply being non-white in a predominantly white area) is not cause for sending in an armed officer. A social worker or mediator would be far more appropriate in many cases.

That said, domestic disturbances are often violent and unpredictable, as are public drunks and mentally ill vagrants. Sometimes a person skulking around is actually a public danger. While unarmed social workers may do more good -- and absolutely will shoot fewer suspects -- it is not entirely wise to send in completely defenseless mediators to every non-violent report of suspicious or concerning activity.

Again, we can learn from Star Trek.

When Starfleet sends some combination of experts on any mission, the diplomats, scientists, doctors, and counselors outrank (and often outnumber) the security officers -- but the redshirts nonetheless come along for the ride. Violence is the last resort, not the first, and persons trained and specialized in the use of force answer to people who lead with compassion, curiosity, and science. That's a great idea on its face; doubly so for police departments clearly struggling with their use of force.

Thus, I propose we create a social intervention agency and send them in when the public nuisance has no obvious risk of violence. When there is a reasonable possibility of violence, we send a conventional police officer in to assist the mediator, but the mediator is in command. The redshirts report to Captain Kirk, not the other way around.

Here's how I would break out a modern public safety agency, using Star Trek as a guide to reform and divest from the police.
  • Red Shirts: Fire & Rescue, doing all the same jobs fire departments do today
  • Gold Shirts: Emergency Medical Services, performing exactly as paramedics do today
  • Blue Shirts: Security, performing the armed response and crowd control duties of conventional police; the thin blue line becomes a bright blue shield
  • Gray Shirts: Traffic Patrol, handing out traffic citations, writing up non-fatal vehicle accidents, assisting stranded motorists, and other essential patrol duties that don't require an armed response
  • Green Shirts: Emergency Social Services, serving as mediators, counselors, and on-site case managers when an armed police response is not warranted
  • White Shirts: Investigation and Code Enforcement, bringing together the police detectives, arson investigators, and the forensic and code-enforcement staff of other public agencies (like the Health Department, Revenue Commission, and Building Department) to investigate past crimes and identify perpetrators
Each division is identifiable by their uniform colors, so the public knows who and what they are dealing with at all times. It is also made abundantly clear that only Security blue-shirts are armed and that, if an active violent crime is not in progress, whichever of the other divisions is present on a Public Safety call is in charge.

All six divisions are headed by a Chief -- a Security Chief, a Fire Chief, a Chief of Emergency Medical Services, a Traffic Patrol Chief, a Chief of Emergency Social Services, and a Chief Investigator -- each of whom reports to a Commissioner of Public Safety.

That Commissioner should report to a civilian Commission, which is an independent oversight board that can investigate the conduct of any officer of any division. Accountability is as important as specialization. No good Starfleet captain was ever afraid to answer for the conduct of his or her crew.

Tricorders -- which is to say, body cameras and dash cams -- will be needed to log every mission. That's for the safety of both the public and the officers. Funding will need to be rethought. Staffing will need to be reallocated. The word "police" may no longer be a common phrase, but blue-clad armed peace officers will still be a necessary component of these new public safety agencies. They just won't be the only option, and they won't be the first option in most cases, either.

As Spock would say, it's only logical.

Monday, May 05, 2025

AI Hype Has Reached the "What If We Did it in Space?" Level of Investor-Baiting

Cartoon of a data center blasting off into space and setting money on fire as it goes.
Image created with ChatGPT.

[UPDATE 12/2/2025: A former NASA/Google engineer with direct experience in both AI deployment and space electronics disagrees with me; they think I'm too optimistic.]

Despite this article, I don't buy the logic of "orbital datacenters" as anything more than an investor boondoggle. The pitch is that solar power is plentiful and cooling in the vacuum of space is super-cheap, which is (not really) true and, in any case, not relevant. Solving the "operating AI in orbital conditions" problem is MUCH harder than solving the "make AI models more energy-efficient on Earth" problem.

Computing systems operating in space have to be radiation-hardened and hyper-resilient, which means they run on multiple-generations-out-of-date hardware (with known performance characteristics that can be defended from radiation) that's never upgraded. Making AI financially competitive on that platform is WAY harder than just energy-tuning a model on the ground.

Yes, I know you can get a solid model like BERT to run on old K80 and M60 GPUs, which are roughly a decade old. AWS still lets you rent GPUs that ancient for pretty cheap. But you'd be paying an absurd premium -- given launch costs -- for hardware of that vintage operating in space. Worse, that old iron would be operating at relatively high latency, given it's *in orbit* and can't have a physical connection, and its performance would decay every year, given nothing is 100% radiation-proof and servicing orbital hardware isn't worth it for anything that doesn't have humans or multi-billion-dollar, irreplaceable telescope optics on board.

(Remember, one of the main reasons NASA reverted from the Space Shuttle to capsule-launch vehicles is no one needs a cargo bay that can retrieve items from orbit. It's literally cheaper to just let space junk burn up and build new on the ground, or build a spacecraft for independent reentry. Everything non-living in orbit is disposable or self-retrieving.)

Collectively, this means either the payback period on an orbital data center is untenably long (the hardware likely decays before it's paid off), or the premium on orbital computing resources is so stupidly high that no one ever adopts it (decade-old, high-latency GPUs that cost multiples of cutting-edge ground processors don't get rented).
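If you want to play with the numbers yourself, here's a back-of-envelope sketch. Every figure in it (launch cost per kilogram, rack mass, power savings, decay rate) is an assumption I made up for illustration, not sourced data.

```python
# Back-of-envelope payback estimate for an orbital GPU rack.
# All inputs are illustrative assumptions, not sourced figures.

LAUNCH_COST_PER_KG = 2_500      # USD/kg to low Earth orbit (assumed)
RACK_MASS_KG = 1_200            # server rack + radiators + solar (assumed)
HARDWARE_COST = 250_000         # USD for a rack of older, rad-tolerant GPUs (assumed)

GROUND_POWER_SAVINGS_PER_YEAR = 40_000   # USD/yr of electricity/cooling avoided (assumed)
RENTAL_PREMIUM_PER_YEAR = 20_000         # USD/yr extra someone might pay (assumed, generous)
ANNUAL_PERFORMANCE_DECAY = 0.10          # 10%/yr capacity lost to radiation damage (assumed)

upfront = LAUNCH_COST_PER_KG * RACK_MASS_KG + HARDWARE_COST

recovered = 0.0
capacity = 1.0
years = 0
while recovered < upfront and years < 30:
    years += 1
    recovered += capacity * (GROUND_POWER_SAVINGS_PER_YEAR + RENTAL_PREMIUM_PER_YEAR)
    capacity *= (1 - ANNUAL_PERFORMANCE_DECAY)   # unserviceable hardware only degrades

print(f"Upfront cost: ${upfront:,.0f}")
print(f"Payback after {years} years" if recovered >= upfront
      else "Never pays back within 30 years")
```

Under those made-up but charitable assumptions, the decaying, un-upgradable rack never earns back its launch bill; double the savings and it still doesn't close.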

Hot Take: We'll have profitable space hotels before we'll have profitable orbital AI datacenters -- because there's a defensible premium to be paid for letting people operate and recreate in orbit. High-performance computer processors? Not so much.

Friday, October 11, 2024

What "The Monkey's Paw" can teach us about AI prompt engineering

I decided to try out an AI app builder -- in this case, Together.AI's LlamaCoder -- to see if one could actually build something useful from just a few prompts. 

TL;DR, these tools are almost useful, but every prompt feels like making a wish with a Monkey's Paw: Unless you are ridiculously specific with your request, you'll end up with something other than what you wanted. (Usually nothing cursed, but also usually nothing truly correct.)

As a test case, I asked LlamaCoder to "Build me an app for generating characters for the Dungeons and Dragons TTRPG, using the third edition ruleset." Here's what happened.

For those of you who aren't tabletop roleplaying game (TTRPG) nerds, D&D 3rd Edition came out nearly 25 years ago (the game is currently in its Fifth Edition), so lots of its source material has been on the web for a very, very long time. There's no reason an app-builder born of web-scraping wouldn't have plenty of examples to go on, both for the text and the app design.

Here's what my initial prompt produced:

It looks like an app. And, when I click Generate Character, here's what happens:


The app has clearly generated all six standard Ability Scores within typical ranges for a Level 1 character, and randomly chosen a Character Class, Race, Alignment, and Background. On its face, this looks like a barebones but reasonable app for pumping out the beginning of a basic D&D 3E character. Not super useful, but okay to save me some dice-rolling and decision-making.

Aside: Yes, manually calculating a full-fledged D&D character is not entirely dissimilar to filling out a tax return. We're only going over the equivalent of the 1040EZ in this example, but a "real" character generator has complexity similar to TurboTax, and for a lot of the same reasons. (In a future post, we can discuss the similarity between tax lawyers and munchkins.) My findings below are about getting an app of basic competency, not one intended for power users.

In our first output, we already have a problem: D&D didn't introduce Backgrounds to player-characters until 5th Edition. This app is already non-compliant with the rules I set forth. Moreover, a lot of vital character components are missing.

However, generative AI is probabilistic, not deterministic, so every time you enter a prompt, you'll get a slightly (or not-so-slightly) different result. Thus, I entered the exact same prompt again to see if the "Background problem" was just a one-time glitch.


Character Background is now gone, despite using word-for-word the same starting prompt. I now also have the option to choose a base Character Class rather than have one randomly assigned, and the system now appears to offer options to automatically calculate Armor Class and Hit Points.

However, Alignment and Race have disappeared, and those are crucial to every D&D character. Moreover, neither version of this app has included Saving Throws, Skills, or a Base Attack Bonus, which are needed to have a character fight or perform any actions in a game.

I also have a new dropdown to choose a method of generating character attributes: Roll 4d6 and drop lowest or Point Buy. This is missing two other classic methods: 3d6 down the line and the Standard Array (which are less popular, to be sure, but absolutely listed in the D&D Player's Handbook as approved methods).
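For anyone keeping score at home, here's a minimal sketch of the three dice-and-array methods in question (Point Buy is interactive, so I've left it out). The 15/14/13/12/10/8 array values are my assumption for illustration, and none of this is code LlamaCoder actually generated.

```python
import random

ABILITIES = ["STR", "DEX", "CON", "INT", "WIS", "CHA"]

# Assumed fixed array for illustration; swap in whatever your rulebook specifies.
STANDARD_ARRAY = [15, 14, 13, 12, 10, 8]

def roll_4d6_drop_lowest():
    """Roll four d6 and drop the lowest die."""
    dice = sorted(random.randint(1, 6) for _ in range(4))
    return sum(dice[1:])

def roll_3d6():
    """'3d6 down the line': no rerolls, no drops."""
    return sum(random.randint(1, 6) for _ in range(3))

def generate_scores(method="4d6"):
    if method == "4d6":
        scores = [roll_4d6_drop_lowest() for _ in ABILITIES]
    elif method == "3d6":
        scores = [roll_3d6() for _ in ABILITIES]
    elif method == "array":
        scores = STANDARD_ARRAY[:]
        random.shuffle(scores)  # each value used exactly once, no duplication
    else:
        raise ValueError(f"Unknown method: {method}")
    return dict(zip(ABILITIES, scores))

if __name__ == "__main__":
    for method in ("4d6", "3d6", "array"):
        print(method, generate_scores(method))
```

Wiring the dropdown to that method argument is exactly the part LlamaCoder's dummy setting never did.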

And now we have a new problem: choosing the Point Buy option doesn't change anything. The app behaves identically, regardless of my choice on that dropdown. It simply performs a random number generation irrespective of that setting. This is a dummy setting that LlamaCoder threw in of its own accord.

In contrast, choosing a Class does seem to affect the Armor Class and Hit Points of a character, which is to be expected, given there are Class Bonuses for these stats.

LlamaCoder lets you add additional prompts to refine the output, so let's start knocking down these issues.

I added the following secondary prompt to start: Start every character at Level 1, and generate their stats using the Standard Array. Be sure to choose a Race and Alignment for the character. Automatically calculate the character's Base Attack Bonus, Saving Throws, Hit Points, Armor Class, and Skill Bonuses.

This broke LlamaCoder.


Specifically, the system introduced errors into its own code. 


Now, this is a free tool with limited capacity, so I tried breaking up the follow-up prompts to see if fewer instructions would prevent the error. I started with just the first one: Start every character at Level 1, and generate their stats using the Standard Array.

This is what I got:


The generation method dropdown is gone, and when I select Calculate Ability Scores, it randomly places a score from the Standard Array into each Ability, with no duplication (as is correct). Also, Saving Throws and Base Attack Bonus are now included despite no specific prompt. I suspect LlamaCoder is playing around with prompt retention, so it decided to add those features based on my last, failed prompt. Skills, however, were not added.

I also tested all the individual buttons to generate Hit Points, Armor Class, Base Attack Bonus, and Saves. Running them before ability scores are distributed (they start at 0) created the correct negative bonuses, and changing Races and Classes appeared to alter these stats appropriately. Unfortunately, when I chose anything from a dropdown, I had to click four buttons to get all the derived stat blocks to regenerate. (LlamaCoder is not LlamaUXDesigner, clearly.) Let's see if we can fix that.
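Before moving on: for reference, this is roughly the level 1 math I was sanity-checking those buttons against. It's a simplified sketch of 3E-style derived stats (armor and gear ignored), with a tiny illustrative subset of classes rather than the full rulebook tables.

```python
def modifier(score):
    """3E-style ability modifier: (score - 10) // 2, so a score of 0 yields -5."""
    return (score - 10) // 2

# Simplified level 1 class data: hit die, base attack bonus, and which saves
# get the "good" (+2) progression. Illustrative subset, not the full class list.
CLASSES = {
    "Fighter": {"hit_die": 10, "bab": 1, "good_saves": {"Fort"}},
    "Rogue":   {"hit_die": 6,  "bab": 0, "good_saves": {"Ref"}},
    "Wizard":  {"hit_die": 4,  "bab": 0, "good_saves": {"Will"}},
}

def derived_stats(class_name, scores):
    c = CLASSES[class_name]
    con, dex = modifier(scores["CON"]), modifier(scores["DEX"])
    return {
        "Hit Points": c["hit_die"] + con,          # max hit die at level 1 + CON mod
        "Armor Class": 10 + dex,                   # unarmored AC
        "Base Attack Bonus": c["bab"],
        "Fort Save": (2 if "Fort" in c["good_saves"] else 0) + con,
        "Ref Save":  (2 if "Ref"  in c["good_saves"] else 0) + dex,
        "Will Save": (2 if "Will" in c["good_saves"] else 0) + modifier(scores["WIS"]),
    }

if __name__ == "__main__":
    zeroed = {a: 0 for a in ["STR", "DEX", "CON", "INT", "WIS", "CHA"]}
    print(derived_stats("Fighter", zeroed))  # every modifier is -5 with all scores at 0
```

With every ability score sitting at 0, the -5 modifiers drag everything negative, which is exactly the behavior the app correctly showed.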

I added this sentence to the follow-up prompt: A single button should re-calculate Ability Scores, Hit Points, Armor Class, Base Attack Bonus, and Saves.

It worked, sort of:


I've got my One Button to Rule Them All, and all the math is calculated correctly when I click it, but once again, Class, Race, and Alignment have been removed. The first two of those can affect the derived stats, so we need them back. Let's see if we can get there without overloading the system.

I added a third sentence to the follow-up prompt: Automatically choose a Class, Race, and Alignment for the character.

Here's what we got:


This is a passably useful app for generating Level 1 D&D 3E characters. But it took a few iterations, a lot of domain knowledge, and some specificity to get there. In other words, you don't get a good Monkey's Paw wish until very late in the process, when you know exactly all the caveats you need to declare to avoid a harmful result. To get something commercial-grade would take a lot more work.

This speaks to me as a software product manager with 20+ years of experience: my initial prompt was a pretty typical user story, but no engineer -- AI or human -- would have likely produced a good result without more specific acceptance criteria.

Or, as someone more pithy than I (but equally cynical) put it:

To replace programmers with robots, clients will have to accurately describe what they want. We're safe.

I have two decades of product definition experience and I've spent the past five years directly developing AI tools, and it still took me several tries and a lot of tinkering to get an app I still probably won't use, because it's missing so many key features. Unless the task is simple, GenAI isn't going to build what you really want, because building good stuff is hard and defining what you want is sometimes even harder. (That's why product managers are necessary.)

Anyone who says otherwise is selling something.

Use the GenAI Monkey's Paw at your own risk.