The Keys to Assembling Your AI Datasets

Artificial intelligence (AI) is all the rage right now, but adopting it for the sake of optics won’t yield the results that matter for your organization. To succeed with AI, you need to identify a clear business case. This objective will determine the datasets you collect and define the parameters for your entire project.

Define the business case for AI

If you are struggling to define a business case for AI, you’re not alone. A Gartner survey shows that 35% of businesses struggle to identify use cases for AI. Because there is no one “AI business case” that applies to all organizations, you’ll need to define an objective that applies to a specific business scenario.

In other words, you need a concrete reason to use artificial intelligence, and that reason will differ from one organization to the next. PayPal, for example, uses AI to fight money laundering, while Netflix leverages AI to recommend shows its viewers will enjoy.

How should you define your own AI business case? Gartner suggests answering the following four questions to help define your objective:

  1. Why are you doing this project?
  2. Who is the solution for?
  3. What solution and technology framework will you employ?
  4. How will you deliver this project?

Assemble your datasets

Once you clearly define the purpose of your AI initiative, you can focus on the data your model will need to meet the business objective.

Select inputs and outputs

The function of an AI model is simple: it transforms specific inputs into outputs (also called targets). Once you determine your business case, you will know what those inputs and outputs should be. For example, a spam filter turns an input (an email) into one of two outputs: spam or not spam.

Your inputs and outputs should be simple. Don’t overthink them.
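
As a rough illustration, the spam filter's input and output can be written down as two small types. This is only a sketch: the `Email` fields, the `Label` names and the "free money" rule are assumptions made up for this example, and the rule stands in for a trained model.

```python
from dataclasses import dataclass
from enum import Enum


class Label(Enum):
    """The two possible outputs (targets) of the spam filter."""
    SPAM = "spam"
    NOT_SPAM = "not spam"


@dataclass
class Email:
    """The input: one email message (fields are illustrative)."""
    sender: str
    subject: str
    body: str


def classify(email: Email) -> Label:
    """Map one input to one output.

    The keyword rule below is a hypothetical placeholder for a trained model.
    """
    text = f"{email.subject} {email.body}".lower()
    return Label.SPAM if "free money" in text else Label.NOT_SPAM


message = Email("promo@example.com", "Claim your FREE MONEY", "Click here")
print(classify(message).value)
```

Writing the types first keeps the scope honest: if you cannot describe the input and the output this plainly, the business case probably needs more work.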

Identify relevant variables

A feature (or variable) is any attribute of the object you’re trying to analyze. Take the above example of an algorithm designed to weed out spam. Features can include words used in an email message, the sender’s address, the date it was sent, the presence of attachments and so on.

Your algorithm should use specific and relevant features to weed out spammy or dangerous messages. This requires you to identify the variables you want your model to pay the most attention to, such as the presence of explicit or spammy language and of suspicious attachments.
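
Here is a minimal sketch of how such features might be pulled out of an email. The word list, the attachment extensions and the feature names are all hypothetical choices for illustration; a real project would derive them from its own data.

```python
from dataclasses import dataclass, field

# Hypothetical lists, chosen only for illustration.
SUSPICIOUS_WORDS = {"winner", "free", "urgent", "prize"}
RISKY_EXTENSIONS = (".exe", ".scr", ".js")


@dataclass
class Email:
    sender: str
    body: str
    attachments: list[str] = field(default_factory=list)


def extract_features(email: Email) -> dict[str, float]:
    """Turn one email into a small set of numeric features."""
    # Lowercase each word and strip common trailing punctuation.
    words = [w.strip(".,!?:;").lower() for w in email.body.split()]
    return {
        "suspicious_word_count": float(sum(w in SUSPICIOUS_WORDS for w in words)),
        "has_attachment": float(bool(email.attachments)),
        "has_risky_attachment": float(
            any(name.lower().endswith(RISKY_EXTENSIONS) for name in email.attachments)
        ),
    }


email = Email(
    sender="promo@example.com",
    body="URGENT: you are a winner! Claim your free prize.",
    attachments=["invoice.EXE"],
)
print(extract_features(email))
```

Each feature is a number, which is the form most learning algorithms expect as input.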


Refine variables

When you’re refining your feature choices, winnow your selections down to the most relevant features rather than adding new ones. Remove features that turn out to be irrelevant to your task, which might include the length of an email, so your model trains on the features that matter.

Why is this important? Choosing a small set of relevant features helps keep your model from overfitting, that is, from learning correlations that hold only in your training data. That will save you grief during validation and testing.
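
One crude way to spot features that carry no signal is to compare their average value in spam and in non-spam examples. The sketch below is a simple heuristic, not a standard feature selection method; the function name, the `min_gap` threshold and the sample rows are assumptions for illustration. It also assumes the features are on a comparable scale (here, 0 or 1) and that both classes are present.

```python
from statistics import mean


def select_features(rows, labels, min_gap=0.25):
    """Keep features whose average value differs between the two classes.

    rows:   list of dicts mapping feature name -> numeric value
    labels: list of booleans, True for spam
    """
    spam = [r for r, is_spam in zip(rows, labels) if is_spam]
    ham = [r for r, is_spam in zip(rows, labels) if not is_spam]
    kept = []
    for name in rows[0]:
        gap = abs(mean(r[name] for r in spam) - mean(r[name] for r in ham))
        if gap >= min_gap:
            kept.append(name)
    return kept


# Made-up rows: risky attachments track the label, the weekday flag does not.
rows = [
    {"has_risky_attachment": 1, "sent_on_weekday": 1},
    {"has_risky_attachment": 1, "sent_on_weekday": 0},
    {"has_risky_attachment": 0, "sent_on_weekday": 1},
    {"has_risky_attachment": 0, "sent_on_weekday": 0},
]
labels = [True, True, False, False]
print(select_features(rows, labels))
```

In this toy data the weekday flag has the same average in both classes, so it is dropped. A check like this is only a first pass; whether a feature really helps is settled during validation.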

Label outputs

Your target, or output, is the piece in your dataset that you want to learn more about. For example, is an email spam? For an image recognition model, who (or what) is in the picture? When a model learns from examples, it can only learn the right answers if each output is labeled correctly.

During model training, your initial dataset should contain inputs and clearly labeled outputs. This is how your model learns to identify outputs correctly when it’s operating independently in the real world. If your initial targets are not labeled properly, your model won’t understand the correlations it’s supposed to make.
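
A quick check along these lines can catch missing or inconsistent labels before training starts. This is a sketch under assumptions: the `(text, label)` pair layout, the allowed label strings and the function name are invented for the example.

```python
ALLOWED_LABELS = {"spam", "not spam"}


def find_label_problems(examples):
    """Return (index, reason) pairs for examples that cannot be used for training."""
    problems = []
    for i, (text, label) in enumerate(examples):
        if not text.strip():
            problems.append((i, "empty input"))
        elif label is None:
            problems.append((i, "missing label"))
        elif label not in ALLOWED_LABELS:
            problems.append((i, f"unknown label: {label!r}"))
    return problems


examples = [
    ("Win a prize now", "spam"),
    ("Agenda for Monday", "not spam"),
    ("Quarterly report", None),   # never labeled
    ("Hello", "junk"),            # label outside the agreed set
]
for index, reason in find_label_problems(examples):
    print(index, reason)
```

A check like this only confirms that labels exist and come from the agreed set. Whether each label is actually correct still needs human review.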

Understanding the nuances of your datasets and how they apply to the bigger picture is half the battle. Now comes the hard part — ensuring you collect quality data, then train and test your model with it.

Without quality data, your algorithm will give you more problems than answers. A model can only be as reliable as the data it learns from, so prioritize quality from the start.

See what you need to ensure quality data and how to segment it for training and testing in our next blog post.

Published: January 8, 2020