Written by Argos Multilingual
Published on 18 Mar 2026

Join Stephanie Harris-Yee and Erik Vogt for a deep dive into one of the most overlooked challenges in AI localization: governance. While large language models have dramatically improved translation quality and customization capabilities, many companies struggle to operationalize AI effectively. This conversation explores why governance frameworks are essential for successful AI deployment in localization workflows.

Key topics covered:

  • The three pillars of AI governance: model selection, model grounding, and risk-based workflow management
  • How to leverage RAG, terminology databases, and knowledge graphs to improve LLM outputs
  • Implementing content tiering and risk profiles to optimize human review processes
  • Managing cost-benefit trade-offs in selective human oversight
  • Addressing the industry’s margin compression challenge through efficient orchestration
  • Creating enforceable processes and quality control mechanisms for probabilistic AI systems

More “Field Notes” Episodes

Explore more topics with Stephanie and Erik in our Field Notes series, where we break down complex localization concepts, ideas, and experiments for industry professionals. Check out our other discussions below:


Field Notes – Episode 9: The Governance Problem in AI Localization

Stephanie Harris-Yee: [00:00:00] Hello. I’m Stephanie, and I’m here again with Erik to talk about the latest and greatest in the localization and AI space. So today we’re going to be talking about something called governance. Now that’s a big word. And we’re talking about it specifically in this AI localization field. But let’s go ahead and set some groundwork before we jump into that. Erik, AI translation has improved dramatically, with large language models now often outperforming traditional machine translation.

And now they can also be customized in ways that are really incredible, I think. But you have said in the past that companies are still struggling to operationalize AI in localization. So if the models are getting better, what’s actually going wrong here?

Erik Vogt: There’s plenty of discussion about how the models are improving and we have good data that we use to measure that [00:01:00] improvement. So I’m not gonna go into edit distance reports or quantification of how these systems are improving. It is clear that they are improving.

But the base models themselves still have limitations, and I think the LLMs give a lot more control. Things like tone and audience adaptation. There’s more terminology control. There are domain-specific prompts. But they’re still probabilistic systems, and they’re still gonna have a certain error rate.

They’re not deterministic, and just because they did it once doesn’t mean they’ll always do it. There can be semantic drift. The meaning changes, like how the model is interpreting meaning over time.

And obviously, famously, we understand things like bias, hallucinations, incorrect product definitions, rewriting things in new and unexpected ways. So all this stuff doesn’t really surprise anybody. We’ve been talking a lot about that. But what we’re [00:02:00] seeing is, how does the system react to changes in this workflow? The types of errors are changing into believable mistakes, things that if you read them, it’s like, that’s a perfectly grammatically correct sentence. But how do we govern that? How do we govern our human review process, our measurement ecosystem, and our risk profiles around the use of these models? So in general, even though they’re getting better, I think we’re under-investing in attention to the governance of these models. That’s my starting thesis here.

Stephanie Harris-Yee: Okay. So when you say governance, what does that mean in the basic sense? Are we just talking about workflow policies, or are we talking about something more technical? What’s the definition of governance in this case?

Erik Vogt: It definitely has nothing to do with parliament, and it has nothing to do with democracy. It has nothing to do with current events. But [00:03:00] governance is actually a complex and interesting field. There are a lot of different layers to it, but I’ll talk about three of them right now. The first is model selection.

So for example, different models behave differently. There are latency and quality trade-offs, and they perform differently depending on the domain. So there’s an element there of just which model you use for which application. Just decisions, right? Control points. Second is model grounding, and I’m using this term to address a wide range of things you can do to help the model perform better. It’s not just prompt engineering; it also has to do with supplying the model with the correct terminology authority, product knowledge, regulatory context. These are all things that a lot of LLM platforms let you build into their product ecosystem.

There’s a place where you can put this stuff, and the model will perform better. So think RAG, think terminology databases, think knowledge graphs, think product documentation. All of these can be used to inform a stronger LLM output. Sometimes astonishingly good, if those root resources are in [00:04:00] fact valid.

So note to self.

Stephanie Harris-Yee: Right.

Erik Vogt: Those things need to be the correct resources. They have to be coherent and they have to be relevant. So governance over which resources you supply to which projects becomes an important part of this. The model interacts with enterprise systems and retrieves authoritative data.

Then it verifies that information before generating text. That’s the way it works. So the best AI systems are not standalone models. Most people are just like, send it to ChatGPT, get it back again, or whatever. That’s not the way it ideally should work, especially with corporate high-risk content. These systems are connected, and governance is the framework by which you put all this stuff together.
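The grounding flow described here — retrieve approved terminology and authoritative product context, then fold it into the prompt before generation — can be sketched roughly like this. This is a toy illustration, not a prescribed implementation: the term base, document store, and function names are all hypothetical, and the LLM call itself is left out of scope.

```python
# Hypothetical sketch of a grounding step: collect approved terminology
# and relevant product documentation, then assemble a grounded prompt.

TERM_BASE = {  # approved source -> target terminology (toy example)
    "checkout": "Kasse",
    "cart": "Warenkorb",
}

PRODUCT_DOCS = [  # authoritative snippets a retriever might return
    "The Checkout page supports saved payment methods.",
]

def retrieve_grounding(source_text: str) -> dict:
    """Collect only the terminology and docs relevant to this segment."""
    terms = {s: t for s, t in TERM_BASE.items() if s in source_text.lower()}
    docs = [d for d in PRODUCT_DOCS if any(s in d.lower() for s in terms)]
    return {"terms": terms, "docs": docs}

def build_prompt(source_text: str, target_lang: str) -> str:
    """Assemble a grounded translation prompt from the retrieved context."""
    g = retrieve_grounding(source_text)
    term_lines = "\n".join(f"- {s} -> {t}" for s, t in g["terms"].items())
    context = "\n".join(g["docs"])
    return (
        f"Translate into {target_lang}. Use this approved terminology:\n"
        f"{term_lines}\n\nProduct context:\n{context}\n\nText:\n{source_text}"
    )

prompt = build_prompt("Go to the checkout to pay.", "German")
```

The point of the sketch is the governance hook: which term base and which documents get attached to which project is a controlled decision, not an afterthought.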

And that gets to my third point, which is risk-based workflow governance. So not every piece of content needs to be reviewed the same way, but you need to keep track of that. Which ones went through each of these tasks? And when you have, say, TMs that you created with MT or with a lightweight review, you don’t wanna leverage those TMs [00:05:00] into high-risk content, so you keep track of your different tiers. I think the smartest deployments out there are taking content tiering into consideration, with each tier tagged with a certain risk profile. And then you balance the cost-benefit of selective human oversight for each of those. Again, the governance is all about keeping track of all this stuff and controlling which content goes through which track and who’s working on it. There’s a whole bunch of different aspects to this, but those are really the main ones.

So in terms of creating value, and we’re gonna get to that in a second: you don’t want to over-review low-risk content, and you don’t wanna under-review high-risk content.
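That tiering logic can be sketched in a few lines. The tier names, review steps, and field names below are illustrative assumptions, not a prescribed taxonomy; the second function captures the point about not leveraging lightly-reviewed TMs into high-risk work.

```python
# Illustrative sketch: route content through review tracks by risk tier,
# so low-risk content isn't over-reviewed and high-risk isn't under-reviewed.

WORKFLOWS = {
    "high":     ["mt", "full_human_review", "in_country_review"],
    "moderate": ["mt", "spot_check"],
    "low":      ["mt"],  # raw MT, no human pass
}

def route(segment: dict) -> list[str]:
    """Pick a review track from the segment's tagged risk profile."""
    return WORKFLOWS[segment["risk"]]

def leverageable(tm_entry: dict, target_risk: str) -> bool:
    """Only reuse TM entries created at an equal-or-higher review tier."""
    order = {"low": 0, "moderate": 1, "high": 2}
    return order[tm_entry["created_at_tier"]] >= order[target_risk]
```

So a TM entry produced under the "low" track would be blocked from leveraging into "high" content, while the reverse direction is fine.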

Stephanie Harris-Yee: So yeah, let’s talk about value then. I guess you can look at value in a couple of different ways. What are we talking about in this case, and how does that value come out specifically in this orchestration process? Versus, say, [00:06:00] the value of AI and LLMs, which we use because they’re cheaper and faster. That’s one sense of value, but how does this other side of value come out here?

Erik Vogt: So let’s first look at what orchestration actually is: which model do you use? What information are you asking it to retrieve? When should humans intervene? And to some extent, which humans should intervene?

And that’s relevant to our conversation about subject matter expertise: whose experience are you paying for to review this? What are you tasking them with? What control do you give them? So that’s also part of it. There are also the quality signals, like how are you gonna measure what a failure is, and get an idea of how much it costs?

And then there are the learning feedback mechanisms: how do you improve models, how do you improve workflows, how do you identify when something’s gone off, and how do you fix it? So these are some of the areas that an orchestration conversation should cover. How are you controlling all these different [00:07:00] aspects?

Exception management, failure and rework workflows, change management, all that stuff. So the coordination is really critical, right? Why? What is the business objective? What is the value that this creates? I think one is that we’re using human attention wisely. If we have the right people looking at the right content and using their time effectively, it means not wasting it on non-value-adding tasks whenever possible. Second is you really wanna reduce redundant QA cycles. A lot of these are very fragmented, but some are very formal, and QA cycles cost time and money, delay releases. It’s a pain. So how do you detect, fix, and prevent from recurring the types of errors that are showstoppers?

There’s also the centralization of visibility across workflows. So imagine, let’s say, we have three different workflows: high risk, moderate risk, and low risk. [00:08:00] Each of these workflows has its own track, its own characteristics, its own control mechanisms, which all these orchestration elements would tell us about.

So you want to know how you’re doing against different criteria for each of these tracks. Like, how many critical errors did you have in the high-risk content? That might be your number one KPI: the number of critical fails in your high-risk content. Whereas your low-risk content might be optimized around total cost per word, or even total product cost.

Because now you can generate content where, poof, in a few prompts you’ve got an entire article. So how well are you delivering on that business value? Word rate may be irrelevant, or maybe it is just about word rate: trying to get from 2 cents a word down to one and a half cents a word through prompt optimization and good token management and things like that. So in general, this is hopefully giving an [00:09:00] idea that there are fundamentally different ways to think about these, and that if you do have a risk-based model for how you route work, then you will come up with different metrics and different orchestration parameters for each of these models.
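The per-track metrics sketched here could be tallied along these lines — critical-fail counts as the headline KPI for high-risk content, cost per word for low-risk content. The track names and job fields are hypothetical placeholders, not an established schema.

```python
# Hypothetical sketch: different KPIs per risk track -- critical-fail
# counts for high-risk content, total cost per word for everything else.

def kpi(track: str, jobs: list[dict]) -> float:
    if track == "high":
        # Headline KPI: total count of critical failures across jobs.
        return float(sum(j["critical_fails"] for j in jobs))
    # Low/moderate tracks: optimize total cost per word delivered.
    total_cost = sum(j["cost"] for j in jobs)
    total_words = sum(j["words"] for j in jobs)
    return total_cost / total_words

jobs_low = [{"cost": 20.0, "words": 1000}, {"cost": 10.0, "words": 1000}]
# kpi("low", jobs_low) -> 0.015, i.e. one and a half cents per word
```

The design point is that the two tracks return numbers in different units on purpose: you are not trying to force one quality score across all content.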

And then I guess the last part to keep in mind is that governance is all about creating enforceable processes. So if you don’t want any terminology failures, you need to have some control mechanisms to prevent and review content. Again, remember we’re dealing with a probabilistic system. We can’t be a hundred percent sure that a model hasn’t introduced some surprising interpretation of a common word with more than one meaning without a hundred percent human review. And in some cases, that hundred percent human review might even be more expensive than just having a human write it in the first place. So it’s very important to keep in mind the essential purpose of the content that you’re creating, and how you’re orchestrating all the different systems that [00:10:00] are supporting the delivery of value, which is the sum of all those different characteristics: the right quality, the right cost, the right characteristics of what you’re looking for.

Yeah, and just thinking here, this is also a big issue for our industry, right? We’re dealing with cost and margin compression as an industry. So the complexity of the orchestration, and dealing with that complexity, which I’ve talked about before as well, is becoming more and more important to solve.

Stephanie Harris-Yee: Yeah.

Erik Vogt: ’Cause we don’t have massive amounts of generic word count to buffer the cost of optimization. Let’s say orchestration costs a hundred dollars and it’s a thousand-dollar project. Then getting your workflows down, all that stuff, is 10% of your total job cost. But if you cut the translation price in half, your orchestration overhead is still a hundred dollars, and now it’s 20% of [00:11:00] your total cost. As prices keep falling, you could double or triple the relative operational overhead of that orchestration. So one of the things that we all have to figure out is how to orchestrate faster and more efficiently.

And we need tools to be able to optimize these workflows as quickly as possible and be able to customize the parameters by which we measure the success for each of those types of workflows.
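The arithmetic behind this margin-compression point can be made concrete. Treating overhead as a share of translation spend (Erik’s rough framing, using his illustrative dollar figures):

```python
# Fixed orchestration overhead against a shrinking translation spend
# becomes a growing share of the project's cost.

def overhead_ratio(orchestration: float, translation: float) -> float:
    """Orchestration overhead as a share of translation spend."""
    return orchestration / translation

original = overhead_ratio(100, 1000)  # 0.10 -> 10% of the job
halved   = overhead_ratio(100, 500)   # 0.20 -> the share doubles
```

Halving the translation price doubles the relative weight of a fixed orchestration cost, which is why the overhead itself has to shrink too.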

Stephanie Harris-Yee: Okay. Thank you Erik. I think this is good. It gives people a list, maybe of things to check to say, okay, am I looking at this aspect of governance and this and this? As they go down the list of trying to get their AI systems up and running, because I know that’s a huge pain point for a lot of folks right now.

The LLM seems great, but now there’s this whole second piece, so thank you for coming in and sharing your insight.

Erik Vogt: Always a pleasure, Steph. Thanks very much and have a great rest of your day.
