Software Evaluation using Performance Metrics during Software Testing

Performance metrics are powerful tools for evaluating the usability of a software product during software testing. They can inform key decisions, such as whether a new software product is ready to launch or needs more testing. Performance metrics are always based on what participants do rather than what they say.

Anyone using technology has to interact with some type of interface to accomplish his or her goals. For example, a user of a website clicks on different links, a user of a word-processing application enters information via a keyboard, and a user of a DVD player pushes buttons on a remote control. No matter the technology, users are behaving or interacting with a product in some way. These behaviors form the basic foundation of performance metrics.

We must keep in mind that every type of user behavior is measurable in some way. For example, we can measure whether users clicking through a website found what they were looking for. We can measure how long it took users to enter and properly format a page of text in a word-processing application or how many incorrect buttons users pressed in trying to play a DVD. All performance metrics are calculated based on specific user behaviors.

Performance metrics rely not only on user behaviors but also on the use of scenarios or tasks. For example, if we want to measure success, the user needs to have specific tasks or goals in mind. The task may be to find the price of a CD or submit an expense report. Without tasks, performance metrics aren’t possible. We can’t measure success if the user is only aimlessly browsing a website or playing with a piece of software, because we have no way of knowing whether he or she was successful.

If users are making many errors, we know there are opportunities for improvement. If users are taking four times longer to complete a task than what was expected, efficiency can be greatly improved. Performance metrics are the best way of knowing how well users are actually using a product.

Performance metrics are also useful for estimating the magnitude of a specific usability issue. Often it is not enough to know that a particular issue exists. We usually also want to know how many people are likely to encounter the same issue after the product is released. For example, by calculating a success rate that includes a confidence interval, we can derive a reasonable estimate of how big a usability issue really is. By measuring task completion times, we can determine what percentage of our target audience will be able to complete a task within a specified amount of time. If only 20 percent of the target users are successful at a particular task, it should be fairly obvious that the task has a usability problem.
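The success-rate estimate described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a Wilson score interval (one common choice for small samples; the article does not name a specific method). The function name wilson_interval and the participant counts are hypothetical.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a success rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    # Clamp to the valid range for a proportion.
    return max(0.0, center - half), min(1.0, center + half)

# Hypothetical data: the same 20% observed success rate at two sample sizes.
low, high = wilson_interval(2, 10)
low100, high100 = wilson_interval(20, 100)
print(f"n=10:  {low:.0%} to {high:.0%}")
print(f"n=100: {low100:.0%} to {high100:.0%}")
```

The interval from 100 participants is much narrower than the one from 10, which is why the same observed success rate supports a firmer conclusion with a larger sample.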

How do senior managers view performance metrics?

Senior managers and other key stakeholders on a project usually sit up and pay attention to performance metrics, especially when they are presented effectively. Managers will want to know how many users are able to successfully complete a core set of tasks using a product. They see these performance metrics as a strong indicator of overall usability and a potential predictor of cost savings or increases in revenue.

Performance metrics are not a cure-all for every situation. As with other metrics, an adequate sample size is required. Although the statistics will work whether we have 2 or 100 participants, our confidence in the results will change dramatically depending on the sample size. If we are only concerned about identifying the lowest of the low-hanging fruit, performance metrics are probably not a good use of time or money. But if we have the time to collect data from at least eight participants, and ideally more, we should be able to derive meaningful performance metrics with reasonable confidence levels.

Overreliance on performance metrics can be a danger. When reporting task success or completion time, it is easy to lose sight of the underlying issues behind the data. Performance metrics tell the what very effectively but not the why. Performance data can point to tasks or parts of an interface that were particularly problematic for participants, but we will usually want to supplement it with other data, such as observational or self-reported data, to better understand why they were problems and how they might be fixed.

Types of performance metrics: There are five general types of performance metrics.

1) Task success metric: This is perhaps the most widely used performance metric. It measures how effectively users are able to complete a given set of tasks. Two different types of task success will be reviewed: binary success and levels of success.

Task success can be calculated for practically any usability study that includes tasks. It is almost a universal metric because it can be calculated for such a wide variety of things being tested, from websites to kitchen appliances. Task success is something that almost anyone can relate to. As long as the user has a well-defined task, we can measure success. It doesn’t require elaborate explanations of measurement techniques or statistics to get the point across.

Task success metrics are used when we are interested in whether participants are able to complete tasks using the product. Sometimes we might only be interested in whether a user is successful or not based on a strict set of criteria (binary success). Other times we might be interested in defining different levels of success based on the degree of completion, the experience in finding an answer, or the quality of the answer given.

If our participants can’t complete their tasks, then we know something is wrong. Seeing participants fail to complete a simple task can be pretty compelling evidence that something needs to be fixed.
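The two kinds of task success described above can be computed as follows. This is a rough sketch with made-up results. The three level names and their numeric scores are assumptions chosen for illustration, not a standard scale.

```python
from statistics import mean

# Hypothetical binary results for one task: 1 = success, 0 = failure.
binary_results = [1, 1, 0, 1, 0, 1, 1, 1]

# Hypothetical scoring scheme for levels of success (an assumption).
LEVEL_SCORES = {"complete": 1.0, "partial": 0.5, "failure": 0.0}
level_results = ["complete", "partial", "complete", "failure",
                 "partial", "complete", "failure", "complete"]

def binary_success_rate(results):
    """Share of participants who met the strict success criteria."""
    return sum(results) / len(results)

def level_success_score(results, scores=LEVEL_SCORES):
    """Average score when partial completion earns partial credit."""
    return mean(scores[r] for r in results)

print(f"Binary success rate: {binary_success_rate(binary_results):.0%}")
print(f"Levels-of-success score: {level_success_score(level_results):.3f}")
```

Binary success is easier to report, while a levels-based score keeps information about participants who got partway to the goal.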

2) Time-on-task metric: This is a common performance metric that measures how much time is required to complete a task.

Time-on-task (sometimes referred to as task completion time or simply task time) is an excellent way to measure the efficiency of any product. Time-on-task is helpful when we are concerned about how quickly users can perform tasks with the product. The time it takes a participant to perform a task says a lot about the usability of the product. In almost every situation, the faster a participant can complete a task, the better the experience. In fact, it would be pretty unusual for a user to complain that a task took less time than expected.

There are a couple of exceptions to the assumption that faster is better. One is a game where we may not want the participant to finish it too quickly. The main purpose of most games is the experience itself rather than the quick completion of a task. Another exception may be learning. For example, if we are putting together an online training course, slower may be better. It may be better that participants not rush through the course but spend more time completing their tasks.
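Time-on-task data can be summarized in a few ways. The sketch below uses hypothetical completion times in seconds and an assumed 60-second target. It reports the median and geometric mean because task times tend to be skewed by a few slow participants. That choice is an assumption of this example, not something the article prescribes.

```python
from statistics import geometric_mean, mean, median

# Hypothetical completion times in seconds for one task.
times = [34, 41, 45, 52, 58, 63, 90, 210]

def share_within(task_times, limit_seconds):
    """Fraction of participants who finished within the time limit."""
    return sum(t <= limit_seconds for t in task_times) / len(task_times)

TARGET_SECONDS = 60  # assumed target for illustration

print(f"Mean:           {mean(times):.1f} s")
print(f"Median:         {median(times):.1f} s")
print(f"Geometric mean: {geometric_mean(times):.1f} s")
print(f"Within target:  {share_within(times, TARGET_SECONDS):.0%}")
```

A single very slow participant (210 seconds here) pulls the arithmetic mean well above the median, so it helps to report more than one summary.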

3) Errors metric: This reflects the mistakes made during a task. A task might have a single error opportunity or multiple error opportunities, and some types of errors may be more important than others. Errors can be useful in pointing out particularly confusing or misleading parts of an interface.

Some usability professionals believe errors and usability issues are essentially the same thing. Although they are certainly related, they are actually quite different. A usability issue is the underlying cause of a problem, whereas one or more errors are a possible outcome. For example, if users are experiencing a problem in completing a purchase on an e-commerce website, the issue (or cause) may be confusing labeling of the products. The error, or the result of the issue, may be the act of choosing the wrong options for the product they want to buy. Essentially, errors are incorrect actions that may lead to task failure.
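For a task with multiple error opportunities, one simple way to summarize errors is as a rate per opportunity. The following sketch assumes every participant had the same number of error opportunities. The counts and function names are hypothetical.

```python
# Hypothetical number of errors each participant made on one task.
error_counts = [0, 2, 1, 0, 3, 1, 0, 1]

# Assumed number of places in the task where an error could occur.
OPPORTUNITIES_PER_PARTICIPANT = 5

def error_rate(counts, opportunities_per_participant):
    """Errors made divided by total error opportunities across participants."""
    total_opportunities = opportunities_per_participant * len(counts)
    return sum(counts) / total_opportunities

def share_with_errors(counts):
    """Fraction of participants who made at least one error."""
    return sum(c > 0 for c in counts) / len(counts)

print(f"Error rate: {error_rate(error_counts, OPPORTUNITIES_PER_PARTICIPANT):.0%}")
print(f"Participants with at least one error: {share_with_errors(error_counts):.0%}")
```

Neither number says which errors matter most, so a weighting by error severity may be worth adding when some errors are more important than others.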

4) Efficiency metric: This can be assessed by examining the amount of effort a user expends to complete a task, such as the number of clicks in a website or the number of button presses on a cell phone.

Time-on-task is often used as a measure of efficiency, but another way to measure efficiency is to look at the amount of effort required to complete a task. This is done by measuring the number of actions or steps that participants took in performing each task. An action can take many forms, such as clicking a link on a web page, pressing a button on a microwave oven or a mobile phone, or flipping a switch on an aircraft. Each action a participant performs represents a certain amount of effort. The more actions taken by a participant, the more effort involved. In most products, the goal is to minimize the number of discrete actions required to complete a task, thereby minimizing the amount of effort.

What do we mean by effort?
There are at least two types of effort: cognitive and physical.

a) Cognitive effort involves finding the right place to perform an action (e.g., finding a link on a web page), deciding what action is necessary (should I click this link?), and interpreting the results of the action.

b) Physical effort involves the physical activity required to take action, such as moving our mouse, inputting text on a keyboard, turning on a switch, and many others.

Efficiency metrics work well if we are concerned with not only the time it takes to complete a task but also the amount of cognitive and physical effort involved. For example, if we are designing an automobile navigation system, we need to make sure that it does not take much effort to interpret its navigation directions, since the driver’s attention must be focused on the road. It would be important to minimize both the physical and cognitive effort to use the navigation system.
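Counting actions, as described above, can be turned into a simple efficiency score by comparing each participant's action count with the fewest actions the task requires. The minimum of four actions, the sample counts, and the function name are assumptions made for this sketch.

```python
from statistics import mean

# Assumed fewest actions (clicks, button presses) needed for the task.
MINIMUM_ACTIONS = 4

# Hypothetical number of actions each participant actually took.
actions_per_participant = [4, 6, 5, 9, 4, 12, 7, 5]

def action_efficiency(actions_taken, minimum_actions):
    """Ratio of minimum to actual actions; 1.0 means no wasted effort."""
    if actions_taken <= 0 or minimum_actions <= 0:
        raise ValueError("action counts must be positive")
    return minimum_actions / actions_taken

mean_actions = mean(actions_per_participant)
efficiencies = [action_efficiency(a, MINIMUM_ACTIONS)
                for a in actions_per_participant]

print(f"Mean actions per participant: {mean_actions}")
print(f"Mean efficiency: {mean(efficiencies):.2f}")
```

A count of actions captures physical effort fairly directly. Cognitive effort is harder to observe and usually needs supplementary measures.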

5) Learnability metric: This is a way to measure how performance changes over time. It is useful if we want to examine how and when participants reach proficiency in using a product.

Most products, especially new ones, require some amount of learning. Usually learning does not happen in an instant but occurs over time as experience increases. Experience is based on the amount of time spent using a product and the variety of tasks performed. Learning is sometimes quick and painless, but it is at other times quite arduous and time consuming. Learnability is the extent to which something can be learned. It can be measured by looking at how much time and effort are required to become proficient with something. We believe that learnability is an important usability metric that does not receive as much attention as it should. It’s an essential metric if we need to know how someone develops proficiency with a product over time.

Let us analyze the following example. Assume we are usability specialists who have been asked to evaluate a time-keeping application for employees within the organization. We could go into the lab and test with ten participants, giving each one a set of core tasks. We might measure task success, time-on-task, errors, and even overall satisfaction. Using these metrics will allow us to get some sense of the usability of the application.

Although these metrics are useful, they can also be misleading. Because the use of a time-keeping application is not a one-time event, but happens with some degree of frequency, learnability is very important. What really matters is how much time and effort are required to become proficient using the time-keeping application. Yes, there may be some initial obstacles when first using the application, but what really matters is “getting up to speed.” It’s quite common in usability studies to only look at a participant’s initial exposure to something, but sometimes it’s more important to look at the amount of effort needed to become proficient.

Learning can happen over a short period of time or over longer periods of time. When learning happens over a short period of time, the participant tries out different strategies to complete the tasks. A short period of time might be several minutes, hours, or days. For example, if participants have to submit their time-sheets every day using a time-keeping application, they try to quickly develop some type of mental model about how the application works. In this case, memory is not a big factor in learnability; it is more about adapting strategies to maximize efficiency. Within a few hours or days, maximum efficiency is hopefully achieved.

Learning can also happen over a longer time period, such as weeks, months, or years. This is the case where there are significant gaps in time between each use. For example, if we only fill out an expense report every few months, learnability can be a significant challenge because we may have to relearn the application each time we use it. In this situation, memory is very important. The more time there is between experiences with the product, the greater the reliance on memory.
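Learnability can be examined by collecting the same metric over repeated trials and looking at how it changes. The sketch below uses hypothetical time-on-task data for three participants over five trials. The 5 percent improvement threshold used to detect a plateau is an arbitrary assumption, as are the function names.

```python
from statistics import mean

# Hypothetical time-on-task in seconds, one list of trials per participant.
trial_times = {
    "P1": [120, 85, 70, 66, 65],
    "P2": [150, 100, 80, 74, 72],
    "P3": [135, 95, 75, 70, 70],
}

def mean_by_trial(data):
    """Average time across participants for each trial."""
    return [mean(trial) for trial in zip(*data.values())]

def plateau_trial(means, threshold=0.05):
    """1-based trial after which the next trial improves by less than threshold.

    Returns None if every trial keeps improving by at least the threshold.
    """
    for i in range(1, len(means)):
        improvement = (means[i - 1] - means[i]) / means[i - 1]
        if improvement < threshold:
            return i
    return None

means = mean_by_trial(trial_times)
print("Mean time per trial:", [round(m, 1) for m in means])
print("Performance levels off at trial:", plateau_trial(means))
print(f"Last trial as share of first: {means[-1] / means[0]:.0%}")
```

For products used only every few months, the same analysis applies, but the gap between trials should match real usage so that the effect of memory shows up in the data.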