AI-Powered Quality Assurance in Localization
Join Stephanie Harris-Yee and Erik Vogt for an in-depth exploration of AI LQA (Linguistic Quality Assurance) and its transformative impact on the localization industry. This conversation delves into how artificial intelligence is enhancing quality control processes and reshaping traditional approaches to translation quality assessment.
Key topics covered:
- The current state of AI LQA technology and its practical applications,
- How AI compares to human reviewers in identifying and categorizing errors,
- The complementary relationship between AI and human expertise in quality assurance,
- The importance of fine-tuning and customization in AI LQA implementation,
- Real-world results and success rates from Argos MosAIQ’s AI quality assessment tools.
Stephanie Harris-Yee: Hi, I’m here with Erik Vogt to talk about a subject that’s top of mind for many in the localization industry, and that is AI LQA. Now, Erik, for those who are just exploring this topic, can you give a brief explanation of what we mean when we say AI LQA?
Erik Vogt: Sure. I think there’s a broad group of categories here. There’s quality estimation, there are tools that identify where quality issues might happen, and there are lots of different ways of approaching this. Under the hood, there are lots of internal capabilities involved. For example, when you use a tool like Grammarly while typing in English, it is essentially doing a QA for you. There are also lots of built-in products that help with where the commas go and where you should check your spelling. Things like that are built into the authoring environment, just like they’re built into the translation environment.
So some of this is sort of implicit, and a lot of the TMS and CAT tools out there are designed to incorporate some of these QA capabilities. But I think we’re talking about something else: after that process is done, is there a tool that can take another look and find where there might be some issues?
And this can be especially helpful when doing updates on large projects where you don’t really have the budget to review everything. You want to use a tool that will find issues more efficiently for you and just help you find where those things are. So QE capabilities are really asking: how likely is this to be right?
It attaches a score or a category to it, like red flag or not red flag. This category of products, and there are several very strong ones on the market, has several particular use cases. One is to get an overall assessment: how is this overall? What is our general guess as to whether this is a go or no-go the way it is? Another is, if you have a limited budget, can you isolate the segments that are likely good and not review them? Some of the case studies out there show you could eliminate 30-40% of the content from review, maybe more, if it’s scoring above a certain amount. That cuts that much out of your human labor.
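As a rough illustration of the triage described here, the sketch below splits segments by a quality-estimation score and reports how much content would skip human review. The `Segment` type, the 0-to-1 score scale, the 0.9 threshold, and the sample scores are all assumptions made for the example; they do not come from any particular QE tool.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    seg_id: int
    text: str
    qe_score: float  # hypothetical scale: 0.0 (likely bad) to 1.0 (likely good)


def triage(segments, threshold=0.9):
    """Split segments into (skip, review) using an assumed QE threshold."""
    skip = [s for s in segments if s.qe_score >= threshold]
    review = [s for s in segments if s.qe_score < threshold]
    # Lowest scores first, so limited reviewer time goes to the likely worst segments.
    review.sort(key=lambda s: s.qe_score)
    return skip, review


# Made-up scores, for illustration only.
segments = [
    Segment(1, "Welcome back!", 0.97),
    Segment(2, "Click Save to continue.", 0.93),
    Segment(3, "Your order has shipped.", 0.71),
    Segment(4, "Terms apply, see details.", 0.42),
    Segment(5, "Reset your password.", 0.88),
]

skip, review = triage(segments)
share_skipped = len(skip) / len(segments)
print(f"Skipped {share_skipped:.0%} of segments")
print("Review order:", [s.seg_id for s in review])
```

In practice the threshold would be tuned per language pair and content type against human-reviewed samples rather than fixed up front.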
And that can be a very powerful motivator. On the flip side, you could also say, no, just show me the bad ones, because I have a finite capacity and I want to focus my human attention on the most likely worst stuff that’s out there. So each of these ways of thinking about the value proposition of these tools helps you either focus on cost reduction, as in ‘this is safe,’ or focus the best use of human attention on what is worst. Now here’s the interesting part. If you have a capability that can detect what’s wrong, why not just have it fix everything? Just have it do it all together. That is where things start getting a little interesting, because we have to think about the capabilities of a QA system.
What’s your baseline? What are you comparing against? Generally, you’re comparing to a human. Now let’s look at human variability. If you send an LQA to different reviewers using the exact same instructions and rules, we see that there’s generally about a 0.6 correlation between their outputs using an MQM methodology.
That’s a number that I’ve heard mentioned from multiple sources. I think it’s probably true-ish that, generally speaking, there’s about that kind of correlation. What is the difference? It is often in what the reviewer finds. I’ve done tests myself where I’ve tried to find the errors. You can give people a test where you know how many errors there are, and in a standard LQA really good reviewers will find most of them, but different reviewers will miss different ones.
In other cases, they might classify them a little differently and put them in different categories. Is it terminology or is it accuracy? In other cases there’s a difference of opinion about severity.
But anyway, there’s a certain preferential dimension and a certain thoroughness dimension. AI has a couple of advantages. It’s pretty consistent, it doesn’t get tired, and it’s gonna find what it’s gonna find pretty fast. So it can be a powerful tool to help accelerate the human LQA process, which is a cool way to think about it. Now, AI is not as good yet. The typical correlation between an automated AI LQA and the human standard is about 0.4. So it’s good, but it’s gonna facilitate, not replace, humans. And we can have a debate about whether it’s gonna get to 0.6 next year or ever.
But that’s kind of where things are right now. So anyway, just to finish that thought, the quality target here is often to try to get up to the equivalent of a human or better. And that’s often a function of the amount of data behind the model that you’re doing that LQA with. Sometimes that needs to be equal to or greater than the amount of data that the MT or other tools had to start with. So it’s a lot more metadata, it’s a lot more training, it’s the same story we’ve worked on for years to try to figure out how to make MT the best. It also needs a lot more data to be better than the MT or the humans that it’s trying to judge. So I think some tools out there take the approach of making the tool available in the LQA environment to help facilitate the human review. And then you end up with a classic Human + AI is better than the alternative.
But as I mentioned, there are other use cases in there to consider as well. So that gives you a little bit of an idea of the big picture for the whole QE and LQA capabilities.
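For readers who want to check agreement figures like the 0.6 and 0.4 mentioned above on their own data, here is a minimal sketch. The interview does not say which correlation statistic is meant, so Pearson correlation over per-sample penalty scores is an assumption, and all the scores below are made up for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of the same length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical penalty scores for the same six samples (higher = more or worse errors).
reviewer_a = [2, 5, 0, 9, 3, 7]
reviewer_b = [3, 4, 1, 6, 5, 8]
ai_lqa = [1, 6, 2, 4, 6, 3]

print("human vs human:", round(pearson(reviewer_a, reviewer_b), 2))
print("AI vs human:   ", round(pearson(ai_lqa, reviewer_a), 2))
```

Running the same comparison on a shared, human-scored sample set is one way to see where a given AI LQA setup sits relative to reviewer-to-reviewer agreement.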
Stephanie Harris-Yee: Yeah. So maybe we already covered this, but what is the state of the research now? Like how good is it? Is this something that, you know, people who wanna test it out can kind of do and expect a reasonably good result? Or are there still vast fluctuations in how things are coming out in the end?
Erik Vogt: Some of the tests we’ve done have been incredibly encouraging. In some of them, we were catching 90 to 99% of the issues, which is very, very positive, especially for a very good starting point. The AI is capable of doing some really, really cool things.
And again, this is provided with the correct glossary, the correct guidance, you know, it’s theoretically possible. That having been said, there are other things to think about, like the level of false positives: it might find all the issues, but how many false positives are there, and what’s the effective effort of clearing all those false positives that the AI thinks could be issues but aren’t? There are other issues, such as variations between languages. I think East Asian languages are pretty reliably harder to bring to the same standard than the Western European languages, which are linguistically more similar. And also there’s a ton more data, so it tends to be more effective. We run pilots that take about a week to do all the setup and an effective review. Other systems may be able to do pilots in different amounts of time, depending on what their approach is. Some of them are ready to test the generic version right out of the box.
I think some tools like that are out there. I think TAUS provides a model with estimates that’s pretty much ready to go. And there are ways of improving it, you know, with the appropriate training and instructions.
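The trade-off between catching most issues and generating false positives can be made concrete with recall and precision. The sketch below is only an illustration: the segment IDs are made up, and treating a human reviewer's findings as the reference set is an assumption of the example.

```python
def precision_recall(flagged, true_errors):
    """Return (precision, recall) for AI-flagged segment IDs against a reference set."""
    flagged, true_errors = set(flagged), set(true_errors)
    true_positives = flagged & true_errors
    precision = len(true_positives) / len(flagged) if flagged else 0.0
    recall = len(true_positives) / len(true_errors) if true_errors else 0.0
    return precision, recall


# Made-up segment IDs, for illustration only.
true_errors = {3, 8, 12, 20}                  # issues confirmed by a human reviewer
ai_flagged = {3, 8, 12, 15, 20, 27, 31, 40}   # everything the AI flagged

precision, recall = precision_recall(ai_flagged, true_errors)
false_positives = ai_flagged - true_errors

print(f"recall: {recall:.0%}")        # share of real issues the AI caught
print(f"precision: {precision:.0%}")  # share of AI flags that were real issues
print("false positives to clear:", sorted(false_positives))
```

In this toy case the AI catches every real issue, yet half of its flags still have to be cleared by a person, which is the hidden effort described above.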
Stephanie Harris-Yee: So it sounds like there’s still some limitations right now on what you would even recommend someone to use this for. Are there any limitations that we haven’t covered that you would kind of warn people against? I know we talked about like the language will make a big difference if it’s Japanese or English, things like that.
But are there any other kinds of outlying factors that people should be aware of?
Erik Vogt: The one thing we recognize is models do have things that make them make mistakes. Like we think about it as hallucinating or just errors. Having worked with this a lot, it is hard not to personify these models a little bit because I think of them as kind of children in some ways. I’m finding it fascinating that I’ll give the same instructions to the AI sometimes multiple times, and it’ll come up with different, even fundamental ways of responding.
It’s almost like it’s experimenting to see what’s gonna make me happy. So what’s fundamental and critical is to break down the problem into chunks. People use the word agentic, but we also think about it in terms of tasking AI to do narrow, specific things one at a time; you put those in sequence and then you’re gonna get to a better result. So it’s not a nebulous brain, like a human’s. Ours is one. You can’t unpack our brains and have individual subprocesses work on it. But we can work on one part of the problem at a time; we can break down a problem and do one thing at a time.
So you might focus on glossary first, then you might look at style, then you might look at something else. And so we can string a number of different tasks or agents together to produce the desired outcome. The second thing I’ll just say is that tuning is critical, and tuning manifests in different ways, from the input instructions to the prompts themselves.
So we have a kind of a dynamic prompt mindset. So we automatically have the AI create the prompts that it uses, kind of based on the content that it’s evaluating. But I think it’s really critical to see how that works and then revise and cycle through some iterations there to make sure that you’re getting the most from that output.
I guess to answer your question much more succinctly: be careful of turnkey solutions out of the box, with no effort involved, that are just gonna do their thing. That’s probably gonna be less than what you could do if you provided more appropriate support and fine-tuning to ensure that you’re getting the best possible outcome at this exact time.
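The idea of chaining narrow, single-purpose checks (glossary first, then style, then something else) can be sketched as a simple pipeline. This is a sketch under stated assumptions, not the tooling discussed in the interview: plain functions stand in for what would be separate AI calls with their own narrow prompts, and the glossary entry and style rule are hypothetical.

```python
from typing import Callable, Dict, List

# Each check looks at one thing only and returns a list of issue descriptions.
Check = Callable[[str, str], List[str]]

# Hypothetical English-to-French glossary entry, for illustration only.
GLOSSARY: Dict[str, str] = {"dashboard": "tableau de bord"}


def check_glossary(source: str, target: str) -> List[str]:
    """Flag glossary terms in the source whose required translation is missing."""
    issues = []
    for term, required in GLOSSARY.items():
        if term in source.lower() and required not in target.lower():
            issues.append(f"glossary: expected '{required}' for '{term}'")
    return issues


def check_style(source: str, target: str) -> List[str]:
    """Assumed style rule for the example: no double spaces in the target."""
    return ["style: double space in target"] if "  " in target else []


def run_pipeline(source: str, target: str, checks: List[Check]) -> List[str]:
    """Run each narrow check in sequence and collect everything it finds."""
    issues: List[str] = []
    for check in checks:
        issues.extend(check(source, target))
    return issues


issues = run_pipeline(
    "Open the dashboard.",
    "Ouvrez le  panneau.",
    [check_glossary, check_style],
)
for issue in issues:
    print(issue)
```

Keeping each step narrow makes it easier to tune or replace one check at a time, which fits the iterative tuning described in this answer.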
Stephanie Harris-Yee: Okay. So what do you see, or what do you think we will see happening in the near future with this technology?
Erik Vogt: Well, it’s gonna keep getting incrementally better. I think the operators will get better at deploying these capabilities in more efficient and effective ways. Efficiency is one thing that factors in more at scale. So think about the compute load for each of these tasks.
How efficiently is the prompt transacting the instructions that you’re trying to get to? There is a cost associated with it. There’s time associated with it, and I think we’re seeing faster throughput. Another thing I’d say is that I would expect the tech and AI costs to start showing up in proposals as we start thinking about the cost of these. It’s still relatively minor compared to the human costs.
So often it’s just loaded in there and included in the weighted word. But I think when you start moving at orders of magnitude higher scale, then it starts to become more of an issue. That’s one. The second big trend, I’d say, is that the foundation models that most of our capabilities are built on top of are themselves evolving. And I’m loving the increased interest in quantifying the relative value of each of these. For example, Intento does their state of AI report, which compares at a very high, general level, just to give you an idea of the relative strengths of different machine translation models. I imagine we’ll start seeing more comparative, generic sorts of analysis out there that would show us the relative strength of the different foundation models. But really these are just entry points. They’re just the beginning of the conversation. Any given customer’s target is a unique situation, and it does require very careful consideration of the prompt optimization, the optimization of compute, and which number of steps yields the best possible outcome.
So these are all iterative and, in some ways, a little trial and error, but in general, you’re getting to the right answer faster. That’s the trend that we’re seeing.
Stephanie Harris-Yee: Okay. Well great. Well, thank you, Erik. I think this has given a really good overview of kind of where we are right at this moment and what we should be keeping an eye on as things develop into the future. So thank you so much. Any last, last things you wanna add?
Erik Vogt: This is a really exciting time to be in this industry. It’s really cool. I think that we’re seeing more and more interest in this ourselves. Like I said, some of the pilots, some of the experiments are startlingly good; at other times there’s surprising mistakes. So it’s an evolving story here, but it’s a fun, fun time to kind of be exploring this as we learn a new language around this different approach to how we use AI to get the best possible outcome.
There’s no guaranteed magic bullet yet, but anybody who isn’t using these tools is leaving a considerable amount of potential on the table.
Stephanie Harris-Yee: Yeah. Yeah. All right. Thank you so much.
Erik Vogt: Sure. Thank you. Talk soon, Steph.

