Two experts in the space recently shared guidance for companies developing and testing generative AI applications. They discussed market growth and use cases, as well as the crucial components for training and testing GenAI apps to reduce the risks of bias, toxicity and inaccuracy. Read on for a summary of key points from the webinar, Testing Generative AI Applications.
Chris Sheehan, SVP and General Manager of Strategic Accounts at Applause, leads the company’s AI practice. “Generative AI is very exciting technology,” he said. “But it does come with a lot of risks.” Over the last five months, Applause has tested large language model-based GenAI apps with over 15,000 testers in our community. We have learned a lot from that experience, as well as other AI tests, and have outlined some best practices for testing these applications.
Bret Kinsella, founder of Voicebot.ai and Synthedia, has worked in AI for the past decade, with increasing focus on generative AI over the last few years. Bret’s companies provide resources such as research and news on industry trends, including over 5,000 articles on voicebot.ai, one of the most-cited sources for any type of AI information and data worldwide.
Massive market growth on the horizon
Kinsella calls what’s going on in generative AI the most exciting and most interesting technology cycle he’s seen since 1997. ChatGPT was the fastest-adopted technology product ever, reaching 100 million monthly active users in just six or seven weeks (Threads surpassed it just after the webinar occurred). “We’re seeing the compression of interest around generative AI. ChatGPT and the ChatGPT moment is just one example,” he explained.
The entire generative AI market is expected to reach $1.3 trillion USD in 10 years, according to projections from various sources, including Bloomberg Intelligence. “Any market that large is a very important market… we’re starting from a fairly low base. It’s growing very quickly,” Kinsella said. “If it was only half as large, if this were a $650 billion dollar market, would it still matter? And the answer is absolutely yes. But I think there’s a lot of reasons to believe that that $1.3 trillion might actually even underestimate how big an impact this is going to have.”
Kinsella cites a threefold set of reasons why people may be underestimating the market. Generative AI introduces hyper-automation, hyper-creation and hyper-personalization on a massive scale.
While most people understand how automation will drive value over time, the increases in creation and personalization are just as compelling. Kinsella describes the creative component of generative AI as giving humans leverage for ideas: “They’re getting these co-pilots… essentially it’s an amplification of their productivity.” He cites that productivity boost as one of the primary drivers for adoption. “For 20 years, we’ve had about 1.5% productivity growth. And now we’re looking at something where we might be able to get 3% or 4% per year, because this is the most significant impact since the early days of the internet in terms of productivity.”
As for personalization, Kinsella said, “We’ve been promised personalization, one-to-one marketing for years, but it’s hard to do. It’s hard to create content at scale that appeals to everybody. But generative AI is going to be one of the tools that can do that.”
Use cases driving demand for generative AI
Within the generative AI market, the larger growth rates are in the software application layer, and the biggest category within the software application layer is the large language models. Kinsella has identified eight use cases driving market growth for companies developing generative AI applications:
text-to-speech: synthetic voice and speech, taking a digital format and presenting to the user in a new way
text-to-image: art solutions such as Midjourney or Stable Diffusion
text-to-text: writing assistance in creating content in a variety of formats ranging from poems and blog posts to video titles and SEO meta descriptions
text-to-code: software development assistance, generating an estimated 40-55% higher productivity
search: conversational search that draws on summarization, allowing users to more effectively refine results
summarization: effectively processes unstructured natural language data
creation: applying AI to development, testing and features for chatbots and voice assistants
production: driving chatbots and voice assistants with LLMs and generative AI to improve run time
Kinsella pointed out that while AI has automated processes behind the scenes across many industries for a long time, we’re shifting into an era where people have more control over their interactions with AI. “We’re moving from the automated era where everything was completely black box to the co-pilot era, and this is where humans are in the loop. They’re asking for something, when something comes back they maybe are refining that or asking for something else.”
Risks inherent in generative AI applications
Generative AI amplifies existing risks in applications, as well as introducing new ones. Sheehan outlined five different risk categories:
biased or toxic responses, often termed AI fairness, which could be harmful, derogatory, or promote unsafe behavior
inaccurate or inconsistent responses present erroneous information or may respond to the same inputs with widely varied outputs
misuse from bad actors to generate fake news, misinformation or deep fakes
legal and security risks, such as missing attribution, copyright violations or breaches of privacy
regulatory compliance to country- and region-specific requirements, which are evolving and often vary across regions
Testing to mitigate risk
Applause has identified four best practices to mitigate all of these risks and build high-quality GenAI apps, drawing on human testing and feedback (as opposed to human judgment programs and tools for reducing bias and toxicity). These practices are particularly relevant for large language models. The best practices:
adjust your existing testing and feedback processes
build the right testing team
assess user interaction, user feedback and functional bugs
consider accessibility and inclusivity
Adjusting your testing and feedback processes
According to Sheehan, “testing should be a strategic imperative if you’re building these apps. Don’t just take your existing processes that you have for a mobile app or a web app and say it’s going to work for a large language model application. You need to consider all of those risks that we talked about before. There are many, many nuances in these applications.”
Designing a testing program for LLMs starts with setting objectives. Sheehan often asks clients what they are trying to prioritize, what they are trying to optimize for and which risks concern them most. In addition to evaluating functional defects and UX concerns as companies would with other types of applications, with generative AI, testing must also detect and mitigate responses that may be perceived as unfair, inaccurate, or violating copyright or privacy concerns. Testing should also help fine-tune the model by providing a better understanding of real-world prompts and customer interactions. Once the organization has set priorities and determined the nature and scope of testing to occur, it’s time to think about the composition of the testing team.
Building the right testing team
Diversity in your testing team is essential in order to reduce bias and toxicity in LLMs. Sheehan stated that diversity can occur at an individual level or at a group level. He outlined baseline diversity as including testers with different ages, genders, and races or ethnicities. As a best practice, he recommended including testers with disabilities as well as those with different sexual orientations.
“Depending on the app, the model, the data that you’re using, you may even take it a step further where you look at different socioeconomic levels, you look at different education levels, you look at different comfort levels with using technology,” Sheehan said. Some applications may call for testers with specific domain expertise beyond excellent QA skills, strong critical thinking and creativity.
“If your application is in multiple countries, you do actually want testers from the particular countries because they are the ones that will give you perspective on the cultural context of when they input prompts and what the output is like,” Sheehan said. Testers must also grant the proper confidentiality and data privacy consents for their regions.
The size of the testing team depends on the application and the underlying model you’re using. “At a minimum, we recommend at least 50 people. Typically, we’re seeing testing teams in the hundreds,” Sheehan said. “Somewhere between 300 to 800 seems to be the sweet spot in the single country of building diversity in that population.”
Testing different elements of the application
Testing should cover at least three key areas: identifying functional defects, assessing the user interaction, and collecting feedback on responses and UX. Sheehan walked though each area in depth.
Sheehan explained that user interactions should cover a wide range of scenarios, including many different prompts and many different outputs from your diverse tester base. There are several different components to consider:
Natural usage encourages testers to use the application as they would in the real world. “That in and of itself is actually just valuable information to understand how your real-world users will use your application,” Sheehan said.
Prompt variation sets guidelines for testers on different types of prompts. For example, one week, testers may focus on creative prompts, while the next week they do more opinionated prompting or chain of thought promoting to test the reasoning skills of the model and application. Sheehan emphasized that it’s important to stress test the system with a lot of different types of prompts.
Adversarial testing calls for testers to try to generate biased or toxic content, allowing the company to determine how well it has mitigated risk in that area.
Collecting feedback at multiple levels
The first level of feedback should focus on the application’s responses, documenting whether responses are inaccurate, toxic or biased. Sheehan recommended building a feedback mechanism into the app itself, citing ChatGPT’s thumbs-up and thumbs-down options and dialog boxes as an example. “As a user in real time, if I see something that I think is toxic, I’m able to give the thumbs down. I’m able to immediately explain why. So that’s a great technique as you’re building your app,” Sheehan said. If that’s not possible, there are still ways to collect that feedback from testers, outside the app.
The next level of feedback examines the user experience across your diverse testing team. Understanding whether or not the users trust the application is critical. Sheehan said, “You want to be testing and understanding people’s feelings and thoughts. Is it trustworthy? And then are they satisfied with the answers? Do they have limitations? Do they have challenges? What is the NPS score as you’re testing?”
The final feedback component examines how the model improves over time. Sheehan explained that companies should incorporate plenty of exploratory testing and make sure that they’re testing over a sufficient time period to assess how the application changes.
“You are doing a lot of fine tuning to the model. You may be adding data, you may be adding features or changes. You want to see what happens over time,” said Sheehan. He suggested a minimum of a two- to three-week period, with a four- to five-week period as optimal. “We have tested some applications for up to eight weeks,” he said. “Again, it will depend a lot on the data, what your objectives are, but just don’t do a single shot testing.”
Testing should continue post-release to continue to monitor and mitigate risk as companies introduce new data and continue to fine-tune the model.
Ensuring accessibility and inclusivity
“This is actually just a general best practice for all digital applications,” Sheehan said. “You want to make sure that the application meets accessibility standards, so you have to do accessibility testing.” Sheehan cited the need to ensure a good and useful experience for users with either permanent or temporary disabilities. Incorporating testers with disabilities helps ensure a better UX not just for people with disabilities (PWD), but for all users.
In addition, including PWD in the testing helps capture feedback about whether some output is biased or toxic to this particular community. “You need that feedback loop,” Sheehan said.
Sheehan went on to share an example of the work Applause has done with one LLM-based chat application that launched in multiple countries. Read the case study.
The most common problems in generative AI applications
Sheehan reported that the vast majority — 60% to 70% — of the issues Applause testers have found in generative AI apps relate to biased or toxic content. Functional errors are the next most common defects, accounting for approximately 30% of issues. Sheehan said that level of functional defects is comparable to what typically occurs for new applications.
Sheehan also said that while Applause hasn’t quantifiably measured the number of inaccurate responses from LLM apps, as that feedback is often collected within the application being tested, anecdotally, testers report that they see inaccuracies diminishing over time.
As generative AI grows in popularity, thorough testing will remain essential for organizations looking to develop trustworthy applications.