Taxonomy Validation


This is a brief discussion of the various information retrieval and task-based methodologies for analyzing the performance of taxonomies. The effectiveness of a taxonomy in facilitating information retrieval is measured in terms of recall and precision. Task-based usability testing validates user consensus around the meaning and context of taxonomy concepts and their labels.

What is a Taxonomy?

A taxonomy is the specification of the names and aliases for people, places, things, and anything else that is needed to allow search engines and other content applications to work better. Taxonomy is a method of organizing things into hierarchical trees. This method originated in biology as the model for categorizing all living things. 

Because taxonomies can become very complex, it is helpful to break them up into discrete pieces consisting of a common broad and shallow outline, to which narrow and deep specializations can be attached or linked. The broad outline covers the primary descriptive dimensions required to support the subject area. For health care, the broad outline could be: 

  • Body Parts and Systems, 
  • Health Care Facilities, 
  • Health Care Professionals, 
  • Health Conditions, 
  • Medical Specialties, 
  • Medical Supplies, and 
  • Medical Treatments. 

These broad areas correspond to the major trunks or branches of the taxonomy. These major branches are called taxonomy facets.

Evaluating Taxonomies for Information Retrieval

The effectiveness of a taxonomy in facilitating information retrieval can be evaluated with a range of techniques, and it is expressed in terms of recall and precision. Recall is the fraction of all possible occurrences of taxonomy concepts that are correctly recognized, and precision is the fraction of the recognized taxonomy concepts that are correct.
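The two measures can be illustrated with a minimal sketch. The function name and the sample concepts below are hypothetical, and the sketch assumes that a human reviewer has already judged which concepts are correct for the item being scored.

```python
def precision_recall(recognized, relevant):
    """Return (precision, recall) for one document or query.

    recognized: concepts the system tagged or retrieved
    relevant:   concepts a human reviewer judged to be correct (assumed available)
    """
    recognized, relevant = set(recognized), set(relevant)
    correct = recognized & relevant
    # Guard against division by zero when either set is empty.
    precision = len(correct) / len(recognized) if recognized else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: concepts recognized in one health care article.
recognized = {"Lupus", "Rheumatology", "Kidney", "Ohio"}
relevant = {"Lupus", "Rheumatology", "Kidney", "Dermatology", "Nephrology"}

precision, recall = precision_recall(recognized, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.75 recall=0.60
```

Here three of the four recognized concepts are correct (precision 0.75), and three of the five relevant concepts were found (recall 0.60).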

There is typically a tradeoff between recall and precision: tuning a system for higher precision tends to lower recall, and vice versa. The goal is to achieve the best possible precision and recall within a given subject area, and then to tune the system to give the desired balance between the two, based on the user requirements. One important aspect of information retrieval is that the notion of correct and incorrect is often subjective and depends on the specific application context. For example:

  • Should a search on the company name “” retrieve “XYZ Corp. is hoping to become the of Latin America”?
  • Should a search on the company name “IBM” retrieve “My computer is an IBM-compatible PC”? 
  • Should a search on the state “Ohio” retrieve “She was crowned Miss Ohio in the pageant”?

Evaluation methods range from ad hoc techniques to custom generated test collections. 

Ad Hoc Feedback

Ad hoc feedback can identify requirements, performance bottlenecks and known retrieval problems in the early stages of performance testing. For example, a search for medical specialties associated with the medical condition lupus should return useful medical specialties. Since lupus is a systemic condition, there is a requirement to treat systemic conditions differently from conditions that are associated with a specific site in the body. 

Ad hoc feedback is an iterative trial and error method of observation to obtain an initial level of concept recognition in search performance. This method is an essential part of determining the customer’s requirements and “comfort zone” with the technology.

Quality Control Script

It is a software engineering best practice to develop a script of test cases that simulate, as far as possible, all of the use cases. Quality assurance engineers develop the script based on the engineering specification, and then test the product to ensure that it meets the specification. This technique is particularly useful for testing systems integration issues, e.g., integrating a search engine and a taxonomy.

Developing a comprehensive set of instances for all use cases is difficult and sometimes impossible, but developing a script of representative instances is a feasible alternative. For taxonomy validation, the best practice is to obtain and analyze the query log of a comparable resource to identify a representative set of high volume, relevant queries, and the correct responses to them. This can then serve as a baseline for testing how the taxonomy maps to the representative queries in order to provide correct responses. 
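The query-log approach described above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the query log, the table of correct responses, and the alias lookup are all hypothetical stand-ins for whatever the comparable resource and the taxonomy tool actually provide.

```python
from collections import Counter


def top_queries(query_log, n):
    """Return the n most frequent queries after simple normalization."""
    counts = Counter(q.strip().lower() for q in query_log if q.strip())
    return [query for query, _ in counts.most_common(n)]


def run_script(baseline, lookup):
    """Run each (query -> expected concept) test case and collect failures."""
    failures = []
    for query, expected in baseline.items():
        actual = lookup(query)
        if actual != expected:
            failures.append((query, expected, actual))
    return failures


# Hypothetical query log from a comparable resource.
query_log = ["Lupus", "lupus ", "heart attack", "flu",
             "heart attack", "lupus", "shingles"]

# Hypothetical correct responses, as judged by a reviewer.
correct_responses = {
    "lupus": "Lupus",
    "heart attack": "Myocardial Infarction",
    "flu": "Influenza",
    "shingles": "Herpes Zoster",
}

# Hypothetical alias table exported from the taxonomy under test.
taxonomy_aliases = {"lupus": "Lupus", "heart attack": "Myocardial Infarction"}

baseline = {q: correct_responses[q] for q in top_queries(query_log, 3)}
failures = run_script(baseline, taxonomy_aliases.get)
print(failures)  # [('flu', 'Influenza', None)]
```

The failure list shows that the taxonomy has no alias for the high-volume query "flu", which is the kind of gap the script is meant to expose.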

Random Sampling

Random sampling of user queries is an established, statistically based methodology adapted from the social sciences that yields high-confidence results. For taxonomy validation, the best practice is to obtain the query log of a comparable resource, draw a random sample of user queries from it, and determine the correct responses to the sampled queries.
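A minimal sketch of this approach follows. The function names and the figures are hypothetical, and the confidence interval uses the simple normal approximation for a proportion, which is one common choice rather than the only valid one.

```python
import math
import random


def sample_queries(query_log, sample_size, seed=0):
    """Draw a reproducible random sample (without replacement) from the log."""
    rng = random.Random(seed)
    return rng.sample(query_log, min(sample_size, len(query_log)))


def proportion_with_margin(successes, n, z=1.96):
    """Return (proportion, margin of error) using the normal approximation.

    z=1.96 corresponds to a 95% confidence level.
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p, margin


# Hypothetical log of 1,000 queries; a reviewer then judges each sampled query.
query_log = [f"query {i}" for i in range(1000)]
sample = sample_queries(query_log, 100)

# Assume the reviewer found correct responses for 80 of the sampled queries.
rate, margin = proportion_with_margin(80, len(sample))
print(f"{rate:.0%} correct, +/- {margin:.1%}")  # 80% correct, +/- 7.8%
```

A larger sample narrows the margin of error, so the sample size can be chosen to match the level of confidence the customer requires.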

Standard Test Collection

The TREC (Text Retrieval Conference) supports research in information retrieval and related applications by providing large test collections, uniform scoring procedures, and a forum for organizations interested in comparing their results. TREC runs several tracks each year. The 2022 tracks, which remain relevant, are listed in Table 1.

2022 TREC Tracks | Description
Clinical Trials | Matching patients to relevant clinical trials.
Conversational Assistance | Training and evaluating models for conversational information seeking, such as ChatGPT.
CrisisFACTS | Temporal summarization technologies to support disaster-response managers’ use of online data sources during crisis events.
Deep Learning | Using very large data sets available for training information retrieval applications.
Fair Ranking | Evaluating systems according to how well they fairly rank documents.
Health Misinformation | Identifying retrieval methods that promote reliable and correct information over misinformation for health-related decision-making tasks.
NeuCLIR | Implementing optimized cross-language information retrieval applications.

Table 1. 2022 TREC tracks.

A comparison of the information retrieval application against an existing collection of similar and correctly tagged items such as a TREC test collection can be used to validate taxonomy information retrieval performance. The drawback of this approach is that the tagging requirements of the standard test collection might differ from the customer’s requirements, potentially causing misleading results.

Alternatively, a custom test collection could be created with content selected from the target collection. The drawback of this approach is that the target collection may not be homogeneous, comprehensive, or tagged with consistent metadata. 

Evaluating Taxonomies for Usability

Task-based usability testing is a set of methods that help validate user consensus around the meaning and context of taxonomy concepts and their labels. The goal of these tasks is to develop confidence that the taxonomy will be functional and effective, and to identify any changes that might be warranted based on the usability test results.

Online tools such as those available from Optimal Workshop, high fidelity prototypes such as those built with tools like Axure, and paper wireframes are all possible platforms for conducting task-based usability tests. Table 2 provides an overview of taxonomy usability testing methods. The primary taxonomy usability methods are described below.


Taxonomy Walk-through

In a taxonomy walk-through, the taxonomy is shown and explained to subject matter experts and/or end users. This method is usually employed with early, high-level versions of the taxonomy to confirm that the overall approach makes sense and that it appropriately addresses the use cases, e.g., comprehensive information retrieval, browsing for products, etc. Taxonomy walk-throughs also help to validate that the taxonomy has a consistent look and feel, and they are a first step in developing the editorial rules or style guide for taxonomy labels.

Card Sorting

Card sorting is a simple technique for guiding a group of subject matter experts and/or end users to generate or validate a taxonomy. Card sorting may be open or closed. Open card sorting is an early taxonomy development activity to identify potential groupings for taxonomy concepts. Closed card sorting is a later stage activity to validate that concepts have been grouped in a way that is commonly understood. For taxonomy development and validation, it is a best practice to run open and closed card sorts with concepts derived from the highest volume, highest value queries from website search logs, but other lists can be used as well. 

Open card sorting is a method to generate a taxonomy from a simple list of concepts. In open card sorting, a set of concepts is presented to the users who are asked to group them, and then to label each group. Patterns that emerge provide clues for how to organize the concepts into a commonly understood taxonomy. 
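One common way to look for emergent patterns is to count how often participants place two concepts in the same group. The sketch below assumes a hypothetical data layout (one list of groups per participant) and hypothetical cards; the group labels that participants assign are ignored here.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(sorts):
    """Count how many participants placed each pair of cards in the same group.

    sorts: one entry per participant; each entry is a list of groups,
           and each group is a set of card names.
    """
    pairs = Counter()
    for groups in sorts:
        for group in groups:
            # Sort the cards so that each pair has one canonical ordering.
            for a, b in combinations(sorted(group), 2):
                pairs[(a, b)] += 1
    return pairs


# Hypothetical results from three participants.
sorts = [
    [{"Cardiologist", "Nurse"}, {"Clinic", "Hospital"}],
    [{"Cardiologist", "Nurse"}, {"Clinic", "Hospital"}],
    [{"Cardiologist", "Hospital"}, {"Clinic", "Nurse"}],
]

pairs = co_occurrence(sorts)
for (a, b), count in pairs.most_common():
    print(f"{a} + {b}: {count} of {len(sorts)} participants")
```

Pairs that most participants group together, such as "Clinic" and "Hospital" in this example, are candidates for the same branch of the taxonomy.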

Closed card sorting is a method to validate that a defined taxonomy is commonly understood. In closed card sorting, a set of concepts is presented to the users, who are then asked to sort the concepts into a set of pre-defined groups. Further taxonomy refinements are recommended where there is no consensus around the grouping of concepts. 
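Consensus in a closed card sort can be estimated as the share of participants who chose the most popular group for each concept. The sketch below is illustrative only: the data layout, the sample cards, and the 0.75 threshold are assumptions, and the appropriate threshold depends on the project.

```python
from collections import Counter


def consensus(placements, threshold=0.75):
    """Report the most popular group and the level of agreement for each card.

    placements: {card: [group chosen by each participant]}
    Returns {card: (top_group, agreement, meets_threshold)}.
    """
    report = {}
    for card, groups in placements.items():
        top_group, votes = Counter(groups).most_common(1)[0]
        agreement = votes / len(groups)
        report[card] = (top_group, agreement, agreement >= threshold)
    return report


# Hypothetical placements by four participants.
placements = {
    "Stethoscope": ["Medical Supplies"] * 4,
    "Physical Therapy": ["Medical Treatments", "Medical Specialties",
                         "Medical Treatments", "Medical Specialties"],
}

for card, (group, agreement, ok) in consensus(placements).items():
    status = "consensus" if ok else "needs refinement"
    print(f"{card}: {group} ({agreement:.0%}) {status}")
```

In this example "Physical Therapy" is split evenly between two groups, which flags it for further taxonomy refinement.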

User Satisfaction Survey

Surveying subject matter experts and/or end users is an effective way to get feedback on an existing taxonomy. Online surveys are most effective for evaluating an existing application that uses a taxonomy, such as an online shopping website. Online surveys can also be used to poll end users about specific concepts or groups of concepts.

Tree Navigation

Tree navigation is an effective way to explore how subject matter experts and/or end users approach finding items in a target content collection. There are many different varieties of tree navigation tasks. 

From an overall taxonomy evaluation perspective, there is interest in observing how users explore various paths to reach an item and observing any patterns in that journey. Do users ultimately find what they are looking for? How many clicks does it take? How often do users backtrack? 

For website design and product merchandising, there is great interest in what the user clicks on first, so-called first-click analysis. There is also interest in the most efficient path to an item, i.e., how many clicks it takes to reach conversion, the desired outcome of the interaction. If the user completes the task, that counts as a conversion and therefore as a success.
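The questions above (task success, number of clicks, backtracking, and first click) can be answered from recorded click paths. The sketch below assumes a hypothetical log format in which each session is the ordered list of taxonomy nodes a user visited, and it treats any revisit of an already visited node as a backtrack.

```python
from collections import Counter


def summarize(paths, target):
    """Summarize tree navigation sessions for one task.

    paths:  list of sessions; each session is the ordered list of nodes visited
    target: the node that counts as completing the task
    """
    successes = [p for p in paths if p and p[-1] == target]
    # A revisit of a node already seen in the same session counts as a backtrack.
    backtracks = sum(len(p) - len(set(p)) for p in paths)
    first_clicks = Counter(p[0] for p in paths if p)
    return {
        "success_rate": len(successes) / len(paths) if paths else 0.0,
        "mean_clicks_to_success": (
            sum(len(p) for p in successes) / len(successes) if successes else None
        ),
        "backtracks": backtracks,
        "first_clicks": first_clicks,
    }


# Hypothetical sessions for the task "find information about lupus".
paths = [
    ["Health Conditions", "Lupus"],
    ["Body Parts and Systems", "Skin", "Body Parts and Systems",
     "Health Conditions", "Lupus"],
    ["Medical Treatments", "Medications"],
]

summary = summarize(paths, target="Lupus")
print(summary["success_rate"], summary["mean_clicks_to_success"],
      summary["backtracks"])
print(summary["first_clicks"].most_common())
```

Two of the three hypothetical users reach the target, one of them after a single backtrack, and the first-click counts show which top-level facet users try first.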

Content Tagging

Content tagging can be done both early and late in taxonomy development to validate whether the taxonomy supports tagging in a complete and consistent manner. Early on, it is helpful to try tagging some content with a draft of the taxonomy, which helps to validate whether the taxonomy “fits” the target content collection. The best practice is to include only concepts that are relevant to the collection; in other words, the taxonomy should not contain concepts that have no content in the collection. The assumption is that new concepts that emerge in the target collection can be added to the taxonomy later.
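A simple check of this “fit” compares the concepts in the draft taxonomy with the tags actually applied to sample content. The function name and data below are hypothetical; the sketch assumes tags are stored as a set of concept labels per content item.

```python
def taxonomy_fit(taxonomy, tagged_items):
    """Compare a draft taxonomy with the tags applied to sample content.

    taxonomy:     iterable of concept labels in the draft taxonomy
    tagged_items: {item id: set of concept labels applied to that item}
    Returns (concepts with no content, applied tags missing from the taxonomy).
    """
    taxonomy = set(taxonomy)
    used = set().union(*tagged_items.values())
    return sorted(taxonomy - used), sorted(used - taxonomy)


# Hypothetical draft taxonomy and tagging results.
taxonomy = {"Lupus", "Influenza", "Shingles"}
tagged_items = {
    "doc1": {"Lupus"},
    "doc2": {"Influenza", "Long COVID"},
}

unused, candidates = taxonomy_fit(taxonomy, tagged_items)
print("No content:", unused)          # No content: ['Shingles']
print("Candidates to add:", candidates)  # Candidates to add: ['Long COVID']
```

Concepts with no content are candidates for removal, while tags that fall outside the taxonomy are candidates for the later additions mentioned above.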

Content tagging done later in the taxonomy development life cycle validates that content can be tagged completely and consistently by various subject matter experts and/or end users. As with closed card sorting, the purpose of this testing is to determine whether there is consensus around the choice of concepts used to tag items in the target content collection. Content tagging also results in a body of examples of well-tagged content that can be used as training materials for subject matter experts and/or end users.
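Consistency between taggers can be quantified in several ways. The sketch below uses the Jaccard overlap between the tag sets that two people assign to the same item; this measure is an assumption chosen for simplicity, and the tagger data are hypothetical.

```python
def jaccard(a, b):
    """Return the overlap between two tag sets (1.0 means identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # Two empty tag sets are treated as full agreement.
    return len(a & b) / len(a | b)


def tagging_agreement(tagger_a, tagger_b):
    """Score agreement for every item that both taggers tagged.

    tagger_a, tagger_b: {item id: set of concept labels}
    """
    shared = tagger_a.keys() & tagger_b.keys()
    return {item: jaccard(tagger_a[item], tagger_b[item]) for item in sorted(shared)}


# Hypothetical tags from two subject matter experts.
tagger_a = {"doc1": {"Lupus", "Rheumatology"}, "doc2": {"Influenza"}}
tagger_b = {"doc1": {"Lupus"}, "doc2": {"Influenza"}}

scores = tagging_agreement(tagger_a, tagger_b)
print(scores)  # {'doc1': 0.5, 'doc2': 1.0}
```

Items with low scores point to concepts whose meaning or scope is not commonly understood, and items with high scores are good candidates for the training examples.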

Method | Process | Who | Requires | Validates
Web tool | Open card sort | Representative users | Rough taxonomy | Card sort analysis (emergent patterns)
Cards & markup | Delphi card sort | Representative users | Rough taxonomy | End result (consensus)
Walk-through | Show & explain | Stakeholders | Rough taxonomy | Approach; appropriateness to task
Walk-through | Check conformance to editorial rules | Taxonomist | Draft taxonomy; editorial rules | Consistent look & feel
Web tool | Closed card sort | End users (or surrogates) | Rough taxonomy; top queries, etc. | Consistent results
Web tool | User satisfaction survey | End users (or surrogates) | Draft taxonomy | Reaction to taxonomy
Tagging samples | Tag sample content with taxonomy | Taxonomist; indexers | Sample content; draft taxonomy (or better) | Content “fit”; fills out content inventory; training materials for people & algorithms; basis for quantitative methods
Web tool | Task-based scenario testing (Find it, etc.) | End users (or surrogates) | Draft taxonomy; tasks & answers | Tasks are completed successfully; time to complete task is reduced
User satisfaction | Survey | End users (or surrogates) | Draft taxonomy; UI mockup, search prototype | Reaction to taxonomy; reaction to new interface, reaction to search results

Table 2. Taxonomy usability testing methods.