How We Escaped the Purple Prison of AI Frontends
I Can Smell an AI-Generated Frontend From 10 Miles Away
This isn't a joke. It's a cry for help. We've all seen the endless sea of purple and indigo AI gradients flooding the internet.
https://x.com/IamKyros69/status/1975153164300263533
People everywhere are suffering from Purple Gradient Fatigue. And we know who to blame. Five years ago, the creator of Tailwind CSS—the framework that basically runs the modern AI web—made one fateful decision: He set the default UI color to bg-indigo-500. A choice that, unbeknownst to him, would define the aesthetic of an entire generation of AI.
https://x.com/adamwathan/status/1953510802159219096
Enter Anthropic, which took this purple aesthetic and ran with it. Their flagship model, Claude Sonnet, seems to have a deep, spiritual connection to this color. No matter how detailed your design prompt or how much you beg, it serves up another purple gradient website.
We've used Sonnet a lot here at YouWare. So much so that our own website started looking like a tribute to the color purple. This is the story of how we broke free.
The "Boost" Button Band-Aid
Our first escape plan was a feature we call "Boost."
System prompts for AI agents are long and finicky. You can't just tell an AI to "make it look good" and expect a masterpiece. So, we created a special "Boost" mode. When a user clicks it, the agent switches to a curated set of modern design guidelines, giving the project an instant aesthetic upgrade and, most importantly, an exorcism from the purple plague.
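To make the idea concrete, here is a minimal sketch of a mode switch like this, assuming the agent's system prompt is assembled from a base prompt plus an optional guideline set. Every name and guideline in it is hypothetical and only illustrates the shape of the switch, not our production prompts.

```python
# Hypothetical sketch: these names and guidelines are illustrative assumptions.
BASE_PROMPT = "You are a website-building agent. Follow the user's requirements."

# Assumed example guidelines standing in for a curated design set.
BOOST_GUIDELINES = [
    "Pick a palette from the brand or subject matter; avoid default indigo/purple gradients.",
    "Use one typeface pairing and a consistent spacing scale.",
]


def build_system_prompt(boost: bool) -> str:
    """Return the base prompt, plus the curated design guidelines when Boost is on."""
    if not boost:
        return BASE_PROMPT
    rules = "\n".join(f"- {rule}" for rule in BOOST_GUIDELINES)
    return f"{BASE_PROMPT}\n\nDesign guidelines:\n{rules}"
```

Keeping the guidelines in a separate list means the base prompt stays untouched when Boost is off, which is what makes the switch a single click.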
Boost is a great feature that works wonders, especially with fresh projects. But as people used it, we learned two things:
- It's a Big Commitment: Clicking "Boost" is like instantly redecorating every room in your house. It's a leap of faith. Because users weren't always sure what to expect, some were hesitant to apply such a major change to a project they'd already put work into.
- It Can Be Disruptive: On more complex websites, a full design-system overhaul can shake things up. While it doesn't break the site, it can introduce unexpected layout shifts or style inconsistencies that require manual cleanup.
While Boost is still a valuable tool in our arsenal, perfect for kickstarting new projects with strong, non-purple aesthetics, it taught us a crucial lesson: you can only do so much with prompting and post-processing. A Band-Aid can cover a wound, but it can't cure it. The real battle needed to be fought at the level of the foundational model itself.

If You Can't E-val It, You Can't Improve It
But how do you fight a battle at the model level? You can't fix what you can't measure. To find a better foundational model—one that doesn't default to purple gradients—we needed a rigorous way to compare models objectively. We needed a systematic evaluation framework.
At YouWare, we treat evaluations with a certain reverence. We built an automated e-val platform for internal use. It's not fancy, but it's effective.
When a new model is released, it only takes a few clicks to run a whole set of test cases against it. We use the LLM-as-a-judge technique to score the output and build our own internal leaderboard, kind of like LMArena.
To counter the inevitable bias of an AI judge, we also label the data ourselves. It keeps the system honest. This effort gives us a strong, automated e-val system that covers every part of website generation, with sub-ranks for:
- Visuals: e.g. Does it look like it was designed in 2025 or 2015?
- Function: e.g. Does the "Contact Us" button work?
- Requirements: e.g. Did the system remember you asked for a "minimalist, brutalist design for a cat cafe"?
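As a rough illustration of how judge scores can roll up into per-category rankings, here is a minimal sketch. It assumes the judge emits a 0-10 score per category for each test case; the model names, the scores, and the agreement check against human labels are all made up for the example and are not our actual platform.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical judge output: one row per (model, test case), sub-scores on a 0-10 scale.
# Model names and numbers are invented for illustration.
JUDGE_SCORES = [
    {"model": "model-a", "visuals": 6, "function": 9, "requirements": 8},
    {"model": "model-a", "visuals": 7, "function": 8, "requirements": 9},
    {"model": "model-b", "visuals": 9, "function": 8, "requirements": 8},
    {"model": "model-b", "visuals": 8, "function": 9, "requirements": 7},
]
CATEGORIES = ("visuals", "function", "requirements")


def leaderboard(rows, category):
    """Rank models by their mean judge score in one category, best first."""
    by_model = defaultdict(list)
    for row in rows:
        by_model[row["model"]].append(row[category])
    averages = {model: mean(scores) for model, scores in by_model.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


def judge_human_agreement(judge, human, tolerance=1):
    """Fraction of cases where the judge score is within `tolerance` of the human label."""
    close = sum(1 for j, h in zip(judge, human) if abs(j - h) <= tolerance)
    return close / len(judge)


if __name__ == "__main__":
    for category in CATEGORIES:
        print(category, leaderboard(JUDGE_SCORES, category))
```

A low agreement rate in one category is a signal that the judge prompt for that category needs work before its ranking can be trusted.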
For a long time, Claude was the king of our internal benchmarks.
Its designs were complete. And it had its own coherent (if purple) design system, something most other models lacked.

A New Hope (and a New Headache)
On August 7, 2025, GPT-5 was released.
We tested it immediately, but the results were bittersweet. It was the first model to even come close to Sonnet's level of quality, but only in its gpt-5-high variant. Think of it like cranking up the graphics settings in a video game to "Ultra": the visuals are breathtaking, but the frame rate plummets. And it was slow. Like, really slow.
- Claude Sonnet: 5-6 minutes to generate a site.
- GPT-5-High: Nearly 20 minutes.
Twenty minutes is an eternity in internet time. That made GPT-5-High unusable. But in September, a new challenger appeared: GPT-5-Codex.
The whole company fell in love with the "Codex" model in the CLI immediately. It was fast, smart, and could dynamically adjust its reasoning. But there was no API. So, we waited. A week later, the API dropped, and we ran our e-vals.
When the results came in, we thought our system was broken. For the first time, a model had dethroned Claude. It was faster, cheaper, and most importantly, its visual score demolished Sonnet's.

To be sure, we manually reviewed the results side-by-side. It was 10 PM, and the office was filled with a constant stream of "Whoa." and "No way." Codex was generating websites with modern, premium, brand-first designs. They looked human-made. The next day, we pushed Codex live. The purple reign was finally over.

The Last Mile is Always the Weirdest
So, we got our happy ending, right? Well, not so fast. Codex is a genius, but a weird, psychedelic one. People on X (formerly Twitter) seem to agree.
https://x.com/willccbb/status/1973178973027967132
It's a heavily RL-tuned model that is god-tier at coding and pretty bad at everything else.
- It's weird at tool calling. Give it a tool, and it will find the most unexpected way to use it. We've seen it try to use our web-browsing tool to make HTTP requests to the official Node.js documentation.
- It overthinks everything. Ask it a hard question, and it might spend an hour reviewing every single file in your project before giving you an answer.
- It randomly starts speaking Spanish. This was the weirdest bug. A user would be typing in English, and Codex would reply in fluent Spanish. We spent weeks debugging this. The cause? One of our internal tools was named todo_write. "Todo" is a common Spanish word meaning "all." The model was confused by this tool's name. We changed it to to_do_write, and the problem vanished.
In GPT's eyes, 'todo' is a single word, while 'to-do' is two.
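One way to make a rename like this cheap is to keep the name the model sees separate from the internal handler. The sketch below assumes a simple name-to-handler registry; the handler body and the dispatch function are hypothetical and only show the pattern.

```python
# Hypothetical sketch: the handler and registry shapes are assumptions for illustration.
def todo_write(items):
    """Internal handler; its Python name never reaches the model."""
    return {"saved": len(items)}


# Map the name the model sees to the internal handler.
# Only the exposed name changes, from todo_write to to_do_write.
EXPOSED_TOOLS = {
    "to_do_write": todo_write,
}


def dispatch(tool_name, **kwargs):
    """Route a model-issued tool call to its handler by exposed name."""
    try:
        handler = EXPOSED_TOOLS[tool_name]
    except KeyError:
        raise ValueError(f"unknown tool: {tool_name}") from None
    return handler(**kwargs)
```

With this split, the next confusing tool name is a one-line change in the registry rather than a refactor.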
Because Codex is so powerful yet so strange, we had to re-engineer our entire prompting and tool system just to accommodate it. OpenAI's GPT-5-Codex Prompting Guide was a lifesaver, emphasizing that Codex is not a drop-in replacement for other models.
Life After Purple
After weeks of hard work, we've optimized our system for Codex, cutting generation time by 50% and costs by 70% while maintaining performance.
It's been a long, strange trip, but worth it. Now, our users can get a website with a top-tier, non-purple design for just a few cents. And we think that means a lot.





