The Rise of the Model Designer

with Barron Webster

January 6, 2026

Hey there, we're Ryan, Federico, and Robin, a trio of designers who have noticed an undercurrent of change in the design world. AI is taking over software, and as it does, the demands on its designers are changing. AI is squishy and has a mind of its own, so designing AI-first products feels less like designing for print or HTML and more like working with some kind of alien intelligence that crash-landed on Earth. The skills designers need are rapidly changing, and every day a new model or tool drops that turns everything on its head.

So we started AI Design Field Guide to serve both as a resource for our peers in the field and as a time capsule of this unique moment, where we're all figuring out how this new kind of work is done in real time. Each article is an interview with a designer working in the field who has a unique perspective on what's happening.

Our first interview is with Barron Webster. We knew we had to talk to him first because he's been elbow-deep in AI products for over 8 years (AI years are kind of like dog years). If there's anyone who could see through the hype cycles and into the future, it would be him. Early in his career, he designed Teachable Machine at Google Creative Lab, the first consumer tool for training AI models; it launched in 2017. After Google, he joined Replit to work on their AI features, which helped drive the company's growth from startup to unicorn.

He recently joined Figma as one of the world's first model designers – a new hybrid role for people who want to get their hands exceptionally dirty with LLMs. The emergence of this new role seems like proof of this change we're noticing. So, we thought that unpacking it with Barron was the perfect way to kickstart our conversation about how design is changing in the age of AI.

Ryan Mather

So what does a day in the life of a model designer look like? What's your mandate?

Barron Webster

I sit with the AI research team at Figma, and they hired me for two main reasons. For one, they're reaching a point where they're getting all of the juice that they can squeeze out of the foundation models, and it's not good enough. A lot of Figma's data is in a proprietary format that may never see the light of day, so foundation models aren't particularly good at working with it. Part of my job is bridging that gap.

The other big part is bringing new tools and AI-first thinking to the design org. You know, Figma's a big company – lots of designers working on parts of the product who haven't designed AI experiences before. Right now, there isn't much tooling, inside or outside the company, that makes designing those experiences easy, fun, or even possible. AI feature design looks different from traditional product design.

There are steps that designers can take a lot earlier in the process to prototype the core of the AI feature, before getting into the UI. If you're designing UI for something that you haven't played with, the risk is that you're designing UI for a perfect case that isn't representative of how it will work. So one of the things that I'm excited about is building the tools that let designers prototype and play with that part of the process early on without having to become an engineer.

RM

It's like the AI is Cthulhu, and the UI is this mech suit. And the goal is to get the designer to understand Cthulhu's anatomy before designing the mech suit so that the mech suit doesn't just explode when Cthulhu puts it on.

BW

Pretty much, yeah.

RM

Can you talk more about what you imagine these new AI Design tools might look like?

BW

The things that I'm most excited about are tools that allow designers to manipulate and run eval cases in a fast feedback loop, in the native format of the material they're working in. Imagine you're working in a Figma file, you try an AI feature, and it doesn't work: you should be able to immediately add that as a test case to your evals. What if I adjust the system prompt for the tool that didn't work? What if I try a different model on this one?

Part of the reason it's so hard right now is that the feedback loop is so slow. Fundamentally, all good design tools remove or shrink the feedback loop. There's a lot of room to improve. I don't know if you're familiar with building your own eval sets, but a lot of it feels like manual labor just to shuffle data around.

The other part of the job that I forgot to mention is thinking about how to make the AI features at Figma more differentiated. It's a design platform, so you'd expect that the outputs will be better designed than, like, Claude Code or Cursor. How do you ensure that the outputs are of the highest quality? I think a lot of that will look like targeted eval strategies and finding proxies for what we consider good design, which is a really fun and interesting art-school, philosophical kind of question.
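To make the "add it as a test case" idea concrete, here's a minimal sketch in Python. Everything in it is a hypothetical stand-in rather than Figma's actual tooling: the JSONL case format, the file name, and the run_feature stub. It just shows capturing a failed interaction as an eval case and replaying the set against a different model or system prompt.

    import json
    from pathlib import Path

    EVAL_FILE = Path("eval_cases.jsonl")  # hypothetical eval set: one JSON case per line

    def run_feature(prompt: str, *, model: str, system_prompt: str) -> str:
        """Stand-in for the real model call; swap in your provider's client here."""
        return f"[{model}] response to: {prompt}"

    def add_case(prompt: str, bad_output: str, expectation: str) -> None:
        """Capture a failed interaction as a new eval case."""
        case = {"prompt": prompt, "bad_output": bad_output, "expectation": expectation}
        with EVAL_FILE.open("a") as f:
            f.write(json.dumps(case) + "\n")

    def replay(model: str, system_prompt: str) -> None:
        """Re-run every saved case against a new model or system prompt."""
        for line in EVAL_FILE.read_text().splitlines():
            case = json.loads(line)
            output = run_feature(case["prompt"], model=model, system_prompt=system_prompt)
            print(f"{case['prompt'][:40]!r} -> {output[:60]!r} (want: {case['expectation']})")

    # The feature just did something wrong, so save that interaction and try another model.
    add_case(
        prompt="Rename this layer to match our component naming convention",
        bad_output="Deleted the layer instead of renaming it",
        expectation="Layer is renamed; nothing is deleted",
    )
    replay(model="model-b", system_prompt="You edit design files conservatively.")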

RM

Yeah, if "design is how it works", increasingly the "how it works" part is governed by the weights of the models that are driving these experiences. So design for these sorts of experiences should involve getting your hands dirty in the model weights.

Federico Villa

Barron, was that a role that they were looking for? Or did you co-create it with them?

BW

They were looking for someone to think about this space. I started talking to them this summer, and there were a bunch of problems that I knew that they had, but I don't think it was clear to anyone, including me, which of them was the top priority. I think I'm, to some degree, the canary in the coal mine, which has been fun and chaotic, because I've been jumping from context to context and team to team. It's a huge company with eight different products now, and so there's a lot of stuff happening.

RM

So you've clearly been deep in this space for a while. You're probably one of the designers who's been working with AI the longest – could you share your first introduction to what we now call AI?

BW

There are two points in time that stand out. One was at RISD around 2014 or 2015, in a class called Computer Utopias by Chris Novello. This was pre-LLM, when machine learning research was more about classifiers. We surveyed modern digital products and technology, and the most exciting things were image classification models—the ability to feed data into a model and get pretty good image segmentation and classification. This was driving features like Snapchat face filters and Google's image search.

Alongside image models, content moderation and recommendation systems were big. This was the heyday of Facebook, Twitter, and Cambridge Analytica. The idea that you could design a system to show users curated content based on their consumption patterns was a huge shift. From 2013 to 2017, platforms like YouTube, Facebook, and Twitter invented the algorithmic feed, which created a new material to design around, moving away from just subscribing to topics or following people. It was a contentious topic at the time, but that was my first theoretical exposure as a student.

The second big moment was from 2016 to 2018 at Google's Creative Lab, working on Google Lens, Google Assistant, and Teachable Machine. Nearly all our projects applied some form of model innovation. For Lens, it was improved image segmentation and classification. For Assistant, pre-LLM, the innovation was mainly in voice-to-text and text-to-action processing. Teachable Machine was pure image classification. This era was interesting because it wasn't about text generation; it was about using models to sort or annotate stuff that already existed.

RM

You know, a lot of people say, "I want the robot to do my laundry and dishes," and it's really easy to forget that for the longest time AI was just for processing your content; only recently did it start to write poetry.

BW

Haha, yeah. I remember we even did a promotion about a cucumber farmer in Japan using TensorFlow to sort cucumbers—it's funny because it's just a very practical use case where a simple classifier likely still outperforms an LLM today.

RM

I guess we're kind of post-LLM now too, aren't we?

BW

Yeah I don't know what to call them now. Just "models" I guess?

RM

Yeah I guess "multi-modal models." But I also think we're starting to move into a world where users mostly interact with "model systems", with agentic systems or things like Deep Research.

Were there any projects you've worked on over the years that really surprised you or changed how you think about designing for AI? What did you learn?

BW

I spent over three years at Replit, a collaborative web-based programming environment. I was hired partly to evaluate where we could use AI, as they had no AI features at the time. During my three years there, the models kept getting progressively better. So we were constantly looking for ways to add AI functionality that leveraged the models' new capabilities in a way that was both useful to our audience and reliable.

It started with basic, manually triggered features like selecting code for an AI explanation or generating code in an existing file. With each new feature, user expectations rose. We were on this cycle of trying to meet their desires, where each release showed them a future they thought should be possible, but the models weren't quite there yet. For instance, when we allowed generating code snippets, users asked for entire files or projects. Once we could do that, they wanted specific edits. Then, they wanted to start from scratch.

We knew what kinds of features we wanted to support and would try them with existing models. If it didn't work, we'd pause. When new foundation models came out, we'd try again.

Programming environments have specific product constraints. Even if a model is great at writing code, you have to figure out how to get it to edit code in the right place. I remember that, until around when Sonnet 3.5 came out, models weren't good with line numbers. We had to devise hacky ways to ensure edits were correct, didn't duplicate content, or properly replaced functions. These weren't AI innovation problems per se, but product scaffolding issues to handle model limitations. In hindsight, much of that work was only relevant for six months to a year until a new model obviated it.

RM

That resonates with me. There's this question of "Should we build this feature that's technically possible today but would be obviated by a model release in a few months, or should we build some other feature which just gets better as the models do?" It seems like a part of AI design is this dance of testing out new models, prototyping with them, seeing what's possible and what's not, and thinking about which features will stand the test of time. Was there anything you did that stood out to you as particularly impactful? It's not the typical design work you'd do in Figma.

BW

A big part of what design and research are doing now is ensuring the team stays focused on the right thing. It's easy to start building a feature, encounter technical hurdles, solve them, find more, and end up with something technically functional but overly complicated for the user or drifted from the original goal.

A concrete example is when we were working on the Replit agent, which automatically created files and wrote code. A huge technical problem was getting the agent to test the applications it built. For instance, verifying if a login page worked. The engineering side saw this as a cool technical problem: spin up a sandbox, build screenshot functionality, feed screenshots to a multimodal model to decide where to click and type—essentially pseudo-computer use by the model. It's satisfying to go down that rabbit hole.

However, another engineer and I proposed: what if we just showed the user the website and asked them to test it? We offloaded validation and testing to the user, skipping that entire complex technical problem. Having someone in the room focused on the user problem, not just the technical one, can help you skip or simplify many things, even if the solution is less technically exciting.

RM

Yeah, that's a good example of stepping back and asking 'do we actually need to solve this?' I imagine it's hard to know where to draw that line of which tasks to automate, eval, and test, and which ones to leave up to the user to figure out. If you'll allow another analogy - creating AI products is kind of like making little robot assistants that are going out into the world, and we don't know what people will ask them to do. Should we make sure they can handle someone who changes their mind halfway through? Or someone who asks three questions at once? Should we make sure they know the Heimlich maneuver? How did you navigate those decisions of what needed to be tested and what didn't?

BW

At Replit, while there were subsets of users, we generally knew what our platform was capable of and what users wanted. So, we focused on features that helped them achieve their goals faster with AI, compared to typing all the code themselves.

Building an AI product from scratch today is harder. Traditional startup wisdom is to start niche, find 100 users who love your product, then expand. General chatbots like ChatGPT have seen success with the opposite approach: build a general interface that can be used for anything and… see what happens. That might be a one-time fluke due to the new technology. If I were building an AI product now, I probably wouldn't start by targeting a general use case unless I had a significant technical or distribution edge, like Meta has with Llama.

RM

Zooming in a little bit deeper to the development process at Replit, how did you decide when the models were good enough for a certain feature? Is it just vibes? Or is there more science to it?

BW

This evolved... For our first AI feature, code explanation, it was just me and one engineer, pre-LangChain and widespread evals. We tested it like other platform features: launch internally, have staff use it (most Replit staff are programmers, so they were good evaluators), make adjustments, then launch to a small volunteer group of users (maybe 1-5% of the user base).

We'd interview these users, pull analytics to see who used it most and retained, and talk to those who tried it once and stopped. This is a traditional agile startup approach: incremental rollouts, user conversations, willingness to roll back. For code explanations, judging "good enough" was subjective. Staff would review explanations for correctness.

Once AI started authoring code, tracking success became easier. We looked at signals: Did the user accept the AI-generated code (shown temporarily before insertion)? Did the inserted code cause lint errors? Did they revert or delete the code soon after? The programming space is fortunate because it's easier to validate if something works; either the program runs or it doesn't. It's trickier with, say, an email authoring tool where success is less binary.

RM

Beyond the quantitative metrics that assessed the performance of those features - did you have other ways of detecting product-market fit? How did you know if the features were solving problems for users or not?

BW

Yeah, pre-AI product strategy was quite different. You'd have plans, an existing user base, and you'd strategize abstractly about expanding markets or categories. With AI's rapid changes, our strategies at Replit were much more reactive.

For example, when I started, we focused on education due to strong product-market fit, especially post-COVID for remote teaching. But as AI features improved, we faced a dilemma: indie developers and hackers loved the AI, while teachers often disliked it because students could bypass learning fundamentals.

This happened a couple of times: we found product-market fit more organically than strategically. When we released the Replit agent, our last major AI product, we didn't really know who it was for. It felt more reactive than some of our top-down projects like enterprise plans and teams. The more successful projects were the ones where we just shipped features and saw what happened.

Later on, we discovered its users—often ops people at tech companies needing to ingest sales data or build dashboards, similar to Zapier or Retool users—by releasing the tool and then talking to those who adopted it.

They probably have a more strategic approach these days, now that Replit Agent has found a fair amount of product-market fit.

RM

It does feel like it goes against the conventional wisdom. "Build it and they will come" is usually seen as something that doesn't happen, but it seems to keep on happening! It's a humility check for designers, isn't it? Engineering is kind of eating into lower-level design work. Design is moving up the chain. Design used to be about - how does the feature work? But more and more, the models are in charge of how the product works, and designers are more like - where do we focus?

BW

Yeah or like, how do we expose that. In the last era of technology, a lot of the tech was fundamentally pretty simple CRUD, and design played a role in figuring out how to scaffold information to solve a specific problem for the user. But today, it feels like a lot of the questions are just like - is this going to work at all?

RM

That makes me wonder - is that just a reflection of how new we are as a discipline? Like if we're basically saying "we don't know what's going to work until the end" maybe that points to a lack of tools or methodologies.

Well, speaking of "is this going to work at all" - what about evals? Did those play a key role in your decision making?

BW

For the first two years at Replit, we didn't do many evals. The practice wasn't really widespread. With the agent, we leaned into them more, but primarily as indicators for product development rather than for validating our own products. For instance, when a new model comes out (e.g. Llama 3), we'd look at its performance on programming evals to decide if we should test it in our app.

More recently, at Sandbar, I spent a lot of time writing evals, particularly for model personality. There are broad industry benchmark evals for basic stuff like not saying anything offensive, but building specific evals for the things your product uniquely cares about is part of this new design work. The workflow leaned heavily on this loop:

  1. make prompts,
  2. adjust prompts,
  3. create evals,
  4. see how they perform,
  5. combine with manual testing and subjective feedback.

It's a big whack-a-mole problem otherwise.
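As a rough illustration of that loop (not Sandbar's actual pipeline), here's a toy sketch in Python: a couple of prompt variants scored against a tiny eval set. The call_model stub and the checks are placeholders for a real model call and real graders; the point is only the shape of the iteration.

    # Toy version of the loop: adjust the prompt, run the eval set, compare scores.
    def call_model(system_prompt: str, user_input: str) -> str:
        """Stand-in for a real model call (swap in your provider's client)."""
        verbose = "concise" not in system_prompt
        return ("Sure! " * 20 if verbose else "") + f"Answer to: {user_input}"

    EVAL_SET = [
        {"input": "Explain what this regex does", "check": lambda out: len(out.split()) <= 20},
        {"input": "Name the file I should edit",  "check": lambda out: "Answer to:" in out},
    ]

    PROMPTS = {
        "v1": "You are a helpful assistant.",
        "v2": "You are a helpful assistant. Be concise.",
    }

    for name, system_prompt in PROMPTS.items():
        score = sum(case["check"](call_model(system_prompt, case["input"])) for case in EVAL_SET)
        print(f"{name}: {score}/{len(EVAL_SET)} cases passed")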

RM

What are the consequences of not having evals?

BW

If we didn't have evals, we'd have to do a lot more manual labor to verify the AI is working well. Evals are a faster way to test if you're meeting desired characteristics. For example at Sandbar, we cared that if the model didn't know an answer, it should ask a single, specific clarifying question rather than hallucinate. We had an eval for that. We had others: don't ask more than one question at a time, keep answers concise (with exceptions).
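For a sense of how one of those behavioral checks might be written, here's a hedged LLM-as-judge sketch. The rubric wording and the judge stub are illustrative, not Sandbar's actual evals; in practice the judge would be another model call that returns PASS or FAIL for each case.

    # Illustrative eval for: "when unsure, ask exactly one clarifying question
    # instead of guessing." The judge is stubbed with a crude heuristic here;
    # a real pipeline would send the rubric plus the transcript to a grader model.
    RUBRIC = """You are grading an assistant reply.
    The user asked for something the assistant cannot know for sure.
    PASS if the reply asks exactly one short, specific clarifying question.
    FAIL if it guesses, lectures, or asks multiple questions.
    Reply with PASS or FAIL only."""

    def judge(rubric: str, user_msg: str, assistant_reply: str) -> str:
        """Stand-in for a grader-model call."""
        return "PASS" if assistant_reply.count("?") == 1 else "FAIL"

    cases = [
        ("What's the wifi password at my office?", "Which office are you asking about?"),
        ("What's the wifi password at my office?", "It's probably 'guest123'."),
    ]

    for user_msg, reply in cases:
        print(judge(RUBRIC, user_msg, reply), "-", reply)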

RM

The tricky thing about evals too is, at least in my experience, a lot of the time, the cause of poor eval results isn't necessarily poor performance, but a badly written eval. Like if you had an eval saying the model should be very concise, and the user says "oh my mom died" and the AI says "That sucks" maybe that would result in a great eval score "Very concise! 10 out of 10!", but maybe that's not actually what you want as a user in that situation. Maybe it should say "Oh I'm sorry for your loss" even though that's not as concise.

So there's this question of how much of your time you spend writing good evals. Evals are kind of like the vital signs a doctor checks: does this person need to go to the ER?

BW

Yeah I mean in our case, we would have an eval covering empathy too, so you want to have a set of evals covering all the things you care about, not just some of them. For us, evals were primarily for avoiding regressions. We had characteristics we wanted to meet, and if a model or prompt change caused a regression, evals signaled where to massage the system to get back to baseline. I think of it like test coverage in programming.

It's interesting, in traditional programming there's this idea of test-driven development where you write the tests first, and then write the code that needs to pass the tests. I haven't seen the equivalent of that very much in AI engineering, where you write the evals first. I think I've seen a couple papers but not in production.

There might be a future job like, eval designer, which is like a design systems role that designs dashboards for the rest of the team to understand how the AI is performing.

RM

It's a bit like a service design orientation. Imagine you were designing a hotel experience, and you have to decide, how are we going to assess our staff? Is it primarily about how friendly they are, or how proactive?

FV

Totally. Barron, can you speak to times where you've tried things and maybe they didn't hit the acceptance marks you were looking for? How did you decide when a feature wasn't good enough to ship, versus "let's just go with it and we can tune it as we go"?

BW

One of the main things was sycophancy. This was probably one of the hardest things to write evals for: the idea that the model should push back on you in the cases where you could use it. At some point it becomes more of a product and design decision to orient the team on what an acceptable failure rate is, and that becomes part of the design philosophy of your product.

RM

Also, it's easy to feel like, Oh, we did all this testing, it didn't work, and then the new model came out and it was fixed. Was that time wasted? At the same time I'd say - don't sell yourself short on the work that was done to understand the feature, what makes it good or bad, such that when the new model did drop and you tested it, you had conviction that hey, this is good enough.

Robin Chen

How did sycophancy manifest itself in all these different use cases you've designed for? Because you were at Replit and you were at Sandbar. Now you're at Figma and those are all very different.

BW

Yeah. I think at Replit and Figma, so far, the idea of sycophancy doesn't really make its way into the product very often. At Sandbar, a very open-ended conversational interface is the backbone of the product, and in that sense, an experience we philosophically wanted to avoid was continually agreeing with the user and feeding their ego.

At Figma, one idea we've been thinking about is "design critique as a service"—you ask an AI to critique your design—and that raises interesting questions about the personality of that system. Is it something you opt into, like choosing a "Dieter Rams" attitude, or do we have a default? And do we focus on accessibility or contrast issues—more objective feedback—or aim for something broader? I'm not sure how much that will make its way into the actual product experience.

FV

How would you like the Evals field to evolve overall? Like, what would be really helpful for you as someone practicing in a field that's so new? Is it like an open tool where people can post eval sets or parameters that they're setting? And do you think the industry needs that? Right now it seems like a lot of stuff is proprietary.

BW

The kinds of tools I find myself wishing for, and hopefully can work on at Figma, are ones that cut down the iteration time for creating evals. It's so painful right now, and I feel like everyone that's working on evals, like, has to basically do this work. How do I map it? What's the format? What pipeline is it running in? And, like, hooking the output of the pipeline up to an interface where you can see everything in one place. There are tools out there that are, like, pretty good at this for text, but not so much for other formats. There are some interesting platforms out there that are like pseudo evals, like Design Arena, or, like, I forget what the other one is…

FV

Yeah, the ones that basically do blind tests side by side, where people vote on the best output that they want.

BW

Yeah. Those are cool. Maybe there are other interesting formats like that, which designers can get involved with. I would love to be able to do something similar but directly in Figma files, including commenting on issues and stuff. I want to be able to quickly create sets of tests that I want to run, kick them all off, get like 100 responses, and do it again in like 30 seconds. We have all of those pieces kind of working, but it takes way too long to do all of them.

RM

You know how, like, an architect can look at a building and guess what software it was designed in? I'm just realizing that Barron is essentially creating the fingerprint of, like, the next era of web design. Whichever evals you pick, people will be able to guess that it's a Figma-AI-created website…

BW

Maybe yeah, that's entirely possible. Or maybe that's the failure case, that there is a specific aesthetic.

RM

I think that's an interesting tension for all of AI design - how much do we want to assert our own point of view as the tool-creator versus adapt to what the user wants? If the user is really going for a web 1.0 aesthetic with green text on a red background, should the system lean into that and match it, or should it adhere to our idea of what good design is?

Shifting gears a bit, besides the work of building products with AI in them, there's also the process of designing the model itself. In a sense, LLMs are a new kind of "computer", with much of the "programs" being in the weights of the model itself. When it comes to actual model creation, how do you think a designer can create the most value?

BW

I've experienced two main ways – fine-tuning vs. training from scratch. If you're training a model from scratch, a designer's biggest impact is pointing the organization to where user needs are greatest and pain points are most acute. Then, work with engineering to find the right approach. At Replit, we trained a custom model on Python for common, simple code errors because we saw significant user frustration there. The design and research input was highlighting these user problems. We were less involved in the actual training, more in defining the problem and then figuring out how to apply the trained model in the product.

The other approach is fine-tuning an existing model. If you have an existing model, product, and evals, and you want to drive performance up—and you're the one writing prompts, evals, and talking to users—you'll have a clear sense if it's meeting expectations. If prompt engineering runs out of steam, fine-tuning might be the next step. This isn't exclusive to designers, but they are often well-positioned to make that call.

A key design translation layer is remembering user assumptions. Engineers and designers working closely with models can forget that users don't know the intricacies, like reminding a model it's working in an HTML file for better output. Designers should engage their "inner doofus" and communicate what a naive user, unfamiliar with AI model quirks, might try and where they'd get stuck.

RM

Hearing you speak reminds me of the importance of comfort with ambiguity. It's always been important, but now it feels even more important. Like, it used to be that there were infinite problems and you had to figure out the one experience that could solve that problem, but now our products are also infinitely dynamic experiences. It's not TurboTax where the user sees the same thing every time.

What other kinds of advice would you give to someone starting a new role where they're designing AI products for the first time?

BW

The most sustainable and impactful thing is to invest significant time upfront to truly understand what goes into the model and what comes out. What's its prompt? What user information is fed in? What tools can it call? What evals are in place? Get an intuitive sense for what happens when you adjust these dials.

You don't want to be just the UI maker for an output you don't deeply understand. If engineers and PMs come to you saying, "The model gives you this, design an interface around it," you can do that, but you won't be able to propose meaningful improvements based on user insights. You'll also be working very reactively to subsequent model changes. You want to be part of the decision about whether a new capability is even something you want, not just on the receiving end.

Getting into that nitty-gritty can be challenging, especially for designers who aren't code-literate. Your company might have interfaces like LangSmith, or you might need to learn to run the development environment yourself. It's a hard task, but a crucial one.

RM

Reflecting on your AI product work, is there a specific instance where you feel you made the biggest impact?

BW

The Replit agent: convincing the team to simply ask the user to verify if the generated application was working. That decision saved a lot of effort by focusing on the simplest path to user validation.

RM

It kind of ties back to your advice to designers to review how the system works so that they can understand it. The time where you felt like you created the most value was when you were able to take a step back and, understanding the whole system, point out a simpler way to meet the users' needs.

BW

Another example, though not a product per se, was the LaMDA launch (Google's early LLM). A lot of our time was spent just playing with the model, trying to prompt it—though we didn't call it "prompting" then—to get it to pretend to be different things and perform reliably. The demo we chose, where you could talk to Pluto or one of its moons, was purely a function of trying countless things and seeing what performed best. We couldn't have strategically picked that without extensive, hands-on experimentation to discover the model's strengths.

RM

It seems like "should designers prompt" is different in kind from "should designers code". Ultimately, with coding, the answers to those questions are pretty falsifiable - can we build XYZ with ABC technology? Asking an engineer the question is pretty much equivalent to knowing the answer yourself. AI model behavior is inherently more subjective and nuanced. There's no substitute for understanding that material yourself at a deep level.

FV

Barron, do you miss designing? From what we know of you, you've gone through doing evals, engineering. I'm curious in these last roles that you've had, do you still see this craft that you're taking on as a form of design? Do you still mock stuff up? Do you miss the craft part of aesthetics and structure?

BW

It's a good question. I do think of it as design—it's just a very different form of it. You're designing behavior, and you may never get it perfect, which is fine. That's a different mindset from UI design, where you have full control over every pixel and perfection is rewarded.

I still mock things up and play with design tools. At Figma, I make eval cases, go through outputs, and fix what feels off. It's almost therapeutic—like a fidget spinner. Give me a website mockup and thirty minutes to fix the typography, and I'm happy. I still get doses of that kind of work, but it feels different now—and it's the kind of work that's never really done unless the feature gets removed, because you can always keep improving it.

Hot Takes

#1
What is a product that you think should exist?
One thing I've been thinking about—if I hadn't joined Figma, I was probably going to try to raise money for this. The idea I want someone to build, whether it's me or someone else, is this: the better all these AI tools get, the less information I can retain. I want a system that keeps my brain in shape, because you're offloading all these things to an AI. There are a lot of things you won't be practicing as frequently, or even doing manually. I definitely feel this way using Cursor. I get so much more done in a day than I would have two years ago. But if you ask me in a month how any of these algorithms work, I probably won't be able to answer you. Two years ago, it would have taken me a month to get done what I can do in a weekend now. But I would have learned it through pain, and retained it a lot more.
#2
Which company do you think will be the first to release AGI?
I don't really have a super strong opinion on that, actually, which is unfortunate. I feel like I've ceded thinking about that, because it's too -- if I start figuring out that, I'm like, Why do I even have a job? You know? There's certain stuff that I, like, deliberately ignore.
#3
Do you think a designer's job is going to get automated in the next five years, actually?
I think that it will look very different in five years than it does today, in the same way that I think designers' jobs, especially software designers' jobs, have been less directly impacted by AI up until this point in comparison to software development. If you think about the way some software developers on the bleeding edge work now, they might have three different features they're working on all at the same time, and they're kicking off agents, reviewing their work, and cycling between them. That is not a product experience that exists for design right now. So, yeah, I think that it will look very different. I do hope that designers don't get left behind. A lot of the behavior of AI systems falls to the engineers by default, because they're the ones who have the tools to manipulate the system directly: they're the ones checking the prompt updates into the repo, they're the ones kicking off the eval runs, and that behavior is such a core part of the product experience. My hope is that designers don't get relegated to just UI makers.