Requirements & Capabilities
Taxonomy means different things to different people in different disciplines.
Biological taxonomy, for example, is intended to place organisms in one and only one category, similar to the call number or Dewey Decimal classification that identifies the one location for a book on the shelf of a library.
Such unitary classification is convenient but limiting, because things often belong in more than one category. For example, dogs are the species Canis familiaris, genus Canis, family Canidae, order Carnivora, class Mammalia, phylum Chordata, and kingdom Animalia in the Linnaean Classification. But dogs are also pets and farm animals. We call taxonomies that are designed to provide multiple contexts for things “faceted taxonomies,” and the classification of things into multiple categories “polyhierarchy.”
Figure 1 Comparison of biological and faceted taxonomies
Types of Schemes for Organizing Concepts
Table 1 Types of Semantic Schemes
|Synonym Ring||A set of words/phrases that can be used interchangeably for searching, e.g., Hypertension or High blood pressure.|
|Controlled Vocabulary||A list of preferred and variant terms, which may have defined hierarchical and associative relationships. A taxonomy is a type of controlled vocabulary.|
|Authority Files||A controlled vocabulary typically used for names of individuals, organizations, countries, and other named entities.|
|Taxonomy||A hierarchical scheme that organizes concepts into “is a” or “part of” trees that may be mono- or poly-hierarchical or faceted into discrete divisions.|
|Classification Scheme||An arrangement of knowledge that does not follow taxonomy rules but is usually enumerated, e.g., the Dewey Decimal Classification.|
|Thesaurus||A tool that controls synonyms and identifies the semantic relationships among terms.|
|Ontology||Resembles a faceted taxonomy but uses richer semantic relationships among terms and attributes and strict specification rules.|
Figure 2 Semantic schemes: Simple to complex
Another way to think about taxonomy is as a set of fields called a metadata scheme, with controlled values that are used to describe what content is about and why it is important.
The taxonomy is used to tag content with categories. Tagging makes content easier to find, provides ways to group large sets of search results (called search filtering), and enables web services like RSS feeds and personalization. This has been called “taxonomic metadata.” A taxonomy breaks up a long list of topics into groupings that are easy and natural for different audiences to use to tag and find information.
Steve Papa, the former CEO of Endeca, an early faceted search engine, coined the term “guided navigation” to describe the process of refining a rich metadata search result. Metadata-controlled vocabularies don’t need to be that large or complex to provide the granularity to accomplish this task; four metadata-controlled vocabularies of 10 values each have the same discriminatory power as one taxonomy of 10,000 values.
Figure 3 is a simple example of taxonomic metadata. Broad and shallow taxonomies (visualize 10 groups with 4 categories each) have great utility, and are easier to build, maintain, and apply than narrow and deep taxonomies (visualize 4 groups that are 10 levels deep).
Figure 3 Taxonomic metadata is a simple metadata scheme with just a few controlled vocabularies
Busch’s Golden Rule: Four metadata-controlled vocabularies of 10 values each have the same discriminatory power as one taxonomy of 10,000 values.
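The arithmetic behind this rule can be checked directly. The sketch below assumes four hypothetical facets (the facet names and placeholder values are illustrative only) and counts the distinct combinations available when one value is chosen from each facet.

```python
from itertools import product

# Four hypothetical facets, each with 10 placeholder values.
facet_names = ["Audience", "Content Type", "Location", "Topic"]
facets = {name: [f"{name} {i}" for i in range(1, 11)] for name in facet_names}

# Every combination of one value per facet is a distinct way to describe content.
combinations = list(product(*facets.values()))

# The number of values someone actually has to build and maintain.
total_values_to_maintain = sum(len(values) for values in facets.values())

print(len(combinations))         # 10 * 10 * 10 * 10
print(total_values_to_maintain)  # 10 + 10 + 10 + 10
```

Under these assumptions, 40 maintained values distinguish as many cases as a single 10,000-value list, which is the point of the rule.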
Business Taxonomy Problems
Many brick-and-mortar and online businesses have a large assortment of products. It’s a challenge to devise a high-level product taxonomy that organizes products effectively for merchandising and is also scalable and maintainable.
The product taxonomy should also be designed to specify the set of product attributes that need to be associated with products in that category. Complete and consistent product attributes are key to picking the specific product you want or need from a large product assortment.
How does taxonomy translate into a front-end interface? Taxonomic metadata is critical to empower the web search interface. Figure 5 shows how on bluefly.com (as on most clothing shopping sites) category, brand, size, and color are the key attributes available to narrow down your search in just a few clicks.
Figure 5 Searching for shoes on bluefly.com
How can a customer pick from more than 29,000 types of faucets without giving up? Figure 4 shows how customers can refine their search for faucets on homedepot.com by product attributes such as: Category, Price, Brand, Color/Finish, Number of Handles, Series Name, Water Filter, Faucet Spray, Handle Shape, Soap Dispenser, etc.
Figure 4 Searching for faucets on Homedepot.com
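The refinement behavior described for these sites can be sketched in a few lines. This is a minimal illustration, not how either site is implemented: the product records, brand names, and helper functions `refine` and `facet_counts` are all hypothetical.

```python
from collections import Counter

# Hypothetical product records; attribute names echo the facets listed above.
products = [
    {"name": "Faucet A", "Brand": "Acme", "Color/Finish": "Chrome", "Number of Handles": 1},
    {"name": "Faucet B", "Brand": "Acme", "Color/Finish": "Bronze", "Number of Handles": 2},
    {"name": "Faucet C", "Brand": "Zenith", "Color/Finish": "Chrome", "Number of Handles": 2},
    {"name": "Faucet D", "Brand": "Zenith", "Color/Finish": "Chrome", "Number of Handles": 1},
]


def refine(items, selected):
    """Keep only items whose attributes match every selected facet value."""
    return [
        item for item in items
        if all(item.get(facet) == value for facet, value in selected.items())
    ]


def facet_counts(items, facet):
    """Count the remaining items per value of one facet."""
    return Counter(item[facet] for item in items)


# Each click narrows the previous result set by one more attribute.
chrome = refine(products, {"Color/Finish": "Chrome"})
chrome_single = refine(chrome, {"Number of Handles": 1})

print([p["name"] for p in chrome_single])
print(dict(facet_counts(chrome, "Brand")))
```

The counts per facet value are what a search interface would typically display next to each filter so that users never click into an empty result.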
There are several technical standards related to taxonomies. These are important because standards enable systems to talk to each other, or interoperate without the need to do custom programming.
- ANSI/NISO Z39.19-2005 (R2010) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies.
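- ISO 25964. Thesauri and interoperability with other vocabularies.
- ISO 15836-1:2017 Information and documentation - The Dublin Core metadata element set. The Dublin Core is the de facto standard for cataloging content on the Web. There are 15 core elements: Contributor, Coverage, Creator, Date, Description, Format, Identifier, Language, Publisher, Relation, Rights, Source, Subject, Title, and Type.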
- Functional Requirements for Bibliographic Records (FRBR) is a conceptual entity-relationship model. Entities that are the foundation of the FRBR model are Work, Expression, Manifestation, and Item, and the relationships applied to them.
- Resource Description Framework (RDF) is a standard model for interchanging data on the Web. RDF extends the linking structure of the Web by using URIs to name the relationship between things as well as the two ends of the link, producing subject-predicate-object statements known as triples.
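To make the Dublin Core and RDF entries above more concrete, here is a minimal sketch that describes one document with three Dublin Core elements and prints each statement as an N-Triples line. The document URI and the literal values are hypothetical, and the `to_ntriples` helper handles only plain string literals. A real project would more likely use an RDF library.

```python
# Dublin Core element set namespace; the document URI below is hypothetical.
DC = "http://purl.org/dc/elements/1.1/"
doc = "https://example.org/docs/faucet-guide"

# Each statement is a (subject, predicate, object) triple.
triples = [
    (doc, DC + "title", "Choosing a Kitchen Faucet"),
    (doc, DC + "creator", "Example Author"),
    (doc, DC + "subject", "Faucets"),
]


def to_ntriples(subject, predicate, obj):
    """Serialize one triple with a plain string literal as an N-Triples line."""
    escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
    return f'<{subject}> <{predicate}> "{escaped}" .'


for triple in triples:
    print(to_ntriples(*triple))
```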
So far, we’ve talked about taxonomy from an end user’s perspective—how and why one would want to use this particular type of controlled vocabulary. In this section we talk about taxonomy from a technical perspective, in the context of data standards. First, some fundamental definitions related to taxonomy.
- Concept. A real or imaginary object that is expressed as Terms in the taxonomy.
- Controlled Vocabulary. A list of terms that have been explicitly enumerated. The terms are controlled and published by a designated authority or authoritative source. If multiple terms are used to mean the same thing, one of the terms is identified as the “Preferred Term” in the Controlled Vocabulary and the other terms are listed as synonyms or aliases.
- Facet. A grouping of Concepts of the same inherent category. Examples of categories that may be used for grouping Concepts into facets are: Audience, Channels, Components, Content Types, Functions, Industries, Intentions, Lifecycle, Location, Organization, Products, etc.
- Taxonomy. The core metadata elements and the Controlled Vocabularies required to find, use, and manage content in a collection.
Here are some definitions related to terms in a taxonomy, or a taxonomy data dictionary:
- UID. The unique identifier for the Concept.
- Entry Term. The Preferred Term that is used to label a Concept. An entry term is also known as a Descriptor.
- Broader Term (BT). A term to which another term or multiple terms is subordinate in a hierarchy.
- Narrower Term (NT). A term that is subordinate to another term or to multiple terms in a hierarchy.
- Used For Term (UF). A non-preferred term that is equivalent to the Entry Term. Used For Terms may be synonyms, aliases (such as abbreviations) and quasi-synonyms (such as more specific terms).
- Related Term (RT). A term that is associatively but not hierarchically linked to another term in a Controlled Vocabulary.
- Scope Note (SN). A note following a term explaining its source, rationale, coverage, specialized usage, or rules for assigning it.
Finally, these are the common taxonomy term relationships:
- Associative Relationship. A relationship between or among terms that leads from one term to other terms that are related to or associated with it. An Associative Relationship is a Related Term (RT) or cross-reference relationship.
- Equivalence Relationship. A relationship between or among terms in a Controlled Vocabulary that leads to one or more terms that are to be used instead of the term from which the reference is made. An Equivalence Relationship is a Used For Term (UF) relationship.
- Hierarchical Relationship. A relationship between or among terms in a Controlled Vocabulary that depicts broader (generic) to narrower (specific) or whole-part relationships. A Hierarchical relationship is a Broader Term (BT) to Narrower Term (NT) relationship.
Figure 6 is a simple example of a Concept (IBM) and some of its terms and their relationships.
Figure 6 A simple example of a Concept, and its terms and relationships
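The data dictionary and relationship types defined above map naturally onto a small data structure. The following is a sketch under stated assumptions: the `Concept` and `Vocabulary` classes, the UIDs, and the sample values are hypothetical and are not taken from Figure 6.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    uid: str                                            # UID
    entry_term: str                                     # Entry Term (Preferred Term)
    used_for: List[str] = field(default_factory=list)   # UF: equivalence
    broader: List[str] = field(default_factory=list)    # BT: UIDs of broader concepts
    narrower: List[str] = field(default_factory=list)   # NT: UIDs of narrower concepts
    related: List[str] = field(default_factory=list)    # RT: UIDs of associated concepts
    scope_note: str = ""                                # SN


class Vocabulary:
    def __init__(self) -> None:
        self.concepts: Dict[str, Concept] = {}

    def add(self, concept: Concept) -> None:
        self.concepts[concept.uid] = concept

    def link_hierarchy(self, narrower_uid: str, broader_uid: str) -> None:
        """Record a BT link and its reciprocal NT link together."""
        child, parent = self.concepts[narrower_uid], self.concepts[broader_uid]
        if broader_uid not in child.broader:
            child.broader.append(broader_uid)
        if narrower_uid not in parent.narrower:
            parent.narrower.append(narrower_uid)

    def lookup(self, term: str) -> Optional[Concept]:
        """Resolve an Entry Term or a Used For term to its Concept."""
        wanted = term.casefold()
        for concept in self.concepts.values():
            labels = [concept.entry_term, *concept.used_for]
            if any(wanted == label.casefold() for label in labels):
                return concept
        return None


vocab = Vocabulary()
vocab.add(Concept("C001", "Organizations"))
vocab.add(Concept("C002", "IBM", used_for=["International Business Machines"]))
vocab.link_hierarchy("C002", "C001")

found = vocab.lookup("international business machines")
print(found.entry_term)                 # preferred label for the synonym
print(vocab.concepts["C001"].narrower)  # reciprocal NT recorded on the parent
```

Storing broader terms as a list rather than a single value is a deliberate choice here: it leaves room for polyhierarchy, where one concept sits under several parents.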
Taxonomy Development Process
There are a variety of methods for developing a taxonomy as summarized in Table 2.
Taxonomy Strategies has used all of these methods at one time or another over many years, adapting them for each project based on our judgment of what will be most effective given the requirements and timeframe for the project as well as the organizational culture. Generally, we adopt a hybrid or best of breed approach to taxonomy development.
Table 2 Taxonomy Development Methods
|Method||Description||Pros & Cons|
|Automated Analysis||Analyze content using automated methods to identify key concepts.||Very good for testing, but not very good for taxonomy construction.|
|Workshopping||Guide stakeholder group in activities to identify key concepts.||Can be good for building up a team and getting buy-in, but is not a fast method.|
|Strawman||Prepare a best guess, then bring it to the table to discuss.||Can speed discussions; however, a strawman developed before any client input has been received will always be off-target.|
|Adapt Existing Vocabularies||Customize internal terminology, industry standards, etc.||A fast method that can reduce some of the acceptance problems of the strawman approach. However, existing vocabularies developed for one purpose (such as recognizing revenue across product lines) may be ill-suited for other purposes (such as allowing customers to search a website for product information).|
|Hybrid||Combination of some or all of these methods.||Allows a solution that builds on the advantages and minimizes the disadvantages mentioned above. However, it relies on having experienced consultants in order to make the proper choice of methods.|
Figure 7 Taxonomy development process - showing key components of a successful taxonomy development process
- Taxonomy Team. Projects require a team that will be dedicated to working on the project over a period of time. This should include a business sponsor and internal stakeholders, as well as a project manager and technical team who will do the bulk of the work.
- Identify the Business Case. Agreement on the business goals of the project is critical to obtaining executive sponsorship for your taxonomy project.
Examples of business goals:
- Improve search and browsing to reduce the amount of time employees spend looking for information.
- Reduce business silos, foster collaboration and content reuse, and reduce redundant work.
- Reduce the amount of time employees spend emailing basic information to each other.
- Build confidence that employees are getting the most up to date information, and increase employee loyalty by helping them stay “up to date” on the company.
Ask yourself: How will this taxonomy project help you save money or make money and mitigate risk? Identify the key costs and benefits, and build a simple model to calculate the return on investment.
For example: If you make content findable, how many minutes will that save per employee every day? How much is avoiding an inappropriate information disclosure worth in organizational credibility? How much is it worth to add useful years to expensive industrial equipment through proper operation and maintenance according to the manufacturer’s specification?
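A simple return-on-investment model of the kind suggested above might look like the sketch below. Every figure in it (headcount, minutes saved, hourly cost, workdays, project cost) is a hypothetical placeholder to be replaced with your own estimates.

```python
def annual_time_savings(employees, minutes_saved_per_day, hourly_cost, workdays_per_year=230):
    """Estimate the yearly value of time saved by making content easier to find."""
    hours_saved = employees * minutes_saved_per_day / 60 * workdays_per_year
    return hours_saved * hourly_cost


def simple_roi(annual_benefit, annual_cost):
    """Return on investment as a ratio: (benefit - cost) / cost."""
    return (annual_benefit - annual_cost) / annual_cost


# Hypothetical inputs: 1,000 employees each saving 6 minutes a day at $50/hour.
benefit = annual_time_savings(employees=1000, minutes_saved_per_day=6, hourly_cost=50.0)
roi = simple_roi(benefit, annual_cost=250_000)

print(f"Estimated annual benefit: ${benefit:,.0f}")
print(f"ROI: {roi:.1f}x")
```

Risk-mitigation benefits, such as an avoided disclosure, are harder to quantify and would need their own assumptions about likelihood and cost.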
- Planning and Research. Collecting quantitative and qualitative information about content and user behavior, and analysis of how users interact with content, is the foundation for a successful taxonomy project.
- Identify the specific target content that is to be focused on.
- Identify and gather a representative sample of content items.
- Gather any query logs, usage statistics (analytics), and usability surveys.
- Collect any existing user research.
- Collect any documentation related to audience personas, content organization, metadata, keywords, and any other guidelines or standards.
- Identify and gather any internal classifications (org charts, sales regions, records retention schedule, code of conduct, product lists, etc.); and any relevant industry standard classifications (UNSPSC, NAICS, USPS, regulated activities, etc.).
- Interview Stakeholders. Just like healthcare professionals, information professionals need you to tell them about your pain points. One-on-one interviews or group workshops with business, information management, IT representatives and sometimes customers are an efficient and effective way to gather information and documentation.
- Recruit people from business-critical functions such as marketing, public relations, product marketing, legal, etc. Include people who have credibility, are early adopters, hold large amounts of content, and are “squeaky wheels” or “fans.”
- Conduct 10-20 interviews.
- The goal is for stakeholders to be the review board during the taxonomy development process and beyond.
- Define Use Cases. Think through both the strategic and practical goals of the taxonomy to help define its scope. A use case can be one or more sentences in the language of the user that describes what the user needs to do. Use cases can be helpful in later validating whether the taxonomy will address its intended purpose and making adjustments during the process to refocus the work. Table 3 compares some intranet and public website use cases.
Table 3 Use Case Examples
|Intranet Use Case Examples||Public Website Use Case Examples|
|Content related to business areas or facilities by geographic location, by type, by specific facility, by access restrictions, by audience, etc.||Web content managers by content type, by topic, by location, etc.|
|Company-wide content by business function, topic, access rights, etc.||Public users seeking information by topic, location, etc.|
A business taxonomy should have no more than 6-10 broad divisions.
- Identify the types of actors (audiences, roles & access rights)
- Identify the types of content
- Identify the types of activities (business processes, applications & uses)
- Identify the types of named entities (products, services, projects, organizations, locations, etc.)
- Topics will be everything else.
Plan to reuse existing (especially internal) vocabularies for as many of the facets as possible. Plan to develop fully custom taxonomies for “Content Types” and “Topics.”
The Oracle taxonomy (Figure 8) is built entirely around their list of products; it has no explicit topics, only actors, content types, and named entities. For marketing purposes, these products are grouped by Product Line (Oracle Cloud Infrastructure, Oracle Cloud Applications, and Hardware and Software), Technology (Middleware, Database Technology, and Security), Applications (Customer Relationship Marketing, Retail, and Manufacturing), and Industry (Financial Services, Healthcare, and Automotive). At Oracle, product names are also carefully edited to be consistent when they are mentioned in marketing collateral. By simply recognizing a product name in content text, the content item can be categorized by the appropriate Technology, Application, and/or Industry. However, the Singapore Government taxonomy (Figure 9) is much more focused on topics.
Figure 8 Oracle.com high-level taxonomy
Figure 9 Singapore Government Taxonomy
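The product-name approach described for the Oracle taxonomy, where recognizing a product name in text is enough to assign categories, can be sketched as a lookup. The product names and their category assignments below are invented placeholders, not Oracle's actual product list, and `tag_by_product` is a hypothetical helper.

```python
import re

# Hypothetical product-to-category lookup table.
PRODUCT_CATEGORIES = {
    "Example Database": {"Technology": "Database Technology"},
    "Example Retail Suite": {"Application": "Retail"},
    "Example Claims Manager": {"Industry": "Healthcare"},
}


def tag_by_product(text):
    """Return facet -> sorted values for every known product name found in text."""
    tags = {}
    for product, categories in PRODUCT_CATEGORIES.items():
        # Whole-phrase, case-insensitive match on the controlled product name.
        pattern = r"\b" + re.escape(product) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            tags.setdefault("Product", set()).add(product)
            for facet, value in categories.items():
                tags.setdefault(facet, set()).add(value)
    return {facet: sorted(values) for facet, values in tags.items()}


sample_text = "Migrating to Example Database simplified our Example Claims Manager rollout."
print(tag_by_product(sample_text))
```

This only works when product names are written consistently in content, which is why the editorial control over product names mentioned above matters.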
- Build-Out Taxonomy Detail. If the core stakeholders approve, then build out the detailed taxonomy. Reuse any existing terminology resources that you can, because these will be familiar to the users and will smooth adoption of the new taxonomy.
- Get agreement on the broad divisions first, then build out the detailed taxonomy.
- Use existing terminologies whenever they are available for business functions, locations, products and services, etc., or consider adapting publicly available or published taxonomies. See: BARTOC.org (Basic Register of Thesauri, Ontologies & Classifications). Licensing a pre-existing taxonomy will cost less than developing a taxonomy from scratch, but a pre-existing taxonomy will rarely fit an organization’s needs and may require extensive customization. Table 4 lists free sources for common taxonomies.
- Only build a vocabulary when no alternative authoritative source exists.
- Only create categories for which there already is content, or for which there is likely to be content soon.
- Keep the taxonomy broad and shallow.
- Roll-up more specific terms into broader categories.
Table 4 Free Sources for Eight Common Taxonomies
|Organization||Organizational structure||SP 800-87, U.S. Government Manual, Your organizational structure, etc.|
|Content Type||Structured list of the various types of content being managed or used||Dublin Core Type Vocabulary, AGLS Document Type, Your records management policy, etc.|
|Industry||Broad market categories such as lines of business, life events, or industry codes||SIC, NAICS, Your market segments, etc.|
|Location||Place of operations or constituencies||GNIS, ISO 3166, UN Statistics Div, US Postal Service, your sales regions, etc.|
|Business Activity||Business activities or functions performed to accomplish mission and goals||Federal Enterprise Architecture Business Reference Model, enterprise ontology, your business functions, etc.|
|Topic||Business topics relevant to your mission and goals||Federal Register Thesaurus, NAL Agricultural Thesaurus, your research areas, etc.|
|Audience||Subset of constituents to whom a piece of content is directed or is intended to be used by||ERIC Thesaurus, IEEE LOM, your psycho-graphics or personas, etc.|
|Products & Services||Names of products/programs and services||ERP system, your products and services, etc.|
The NASA Taxonomy shown in Figure 10 is an example of an enterprise taxonomy because it’s intended to cover information management organization-wide.
Table 5 Validation Testing and Review Summary
|Walk-through||Show & explain|
|Walk-through||Check conformance to editorial rules|
|Usability Testing||Contextual analysis (card sorting, scenario testing, etc.)|
|Tagging Samples||Tag sample content with taxonomy|
- Migrate Content. Existing content may need to be retrofitted according to the new taxonomy. This is usually accomplished with a combination of automated and editorial efforts. Best practices include:
- Identify and dispose of Redundant, Obsolete, and Trivial content (ROT).
- Prioritize content to be tagged.
- Use business rules to automate content tagging. For example, tag landing pages of major sections, then lower-level pages inherit tags from top-level pages.
- Use workflow to enforce tagging. For example, require entry of simple tagging in order to submit an item into the content management system.
- Use templates to guide user tagging. Pre-populate template fields whenever possible. Use context-sensitive pick lists. Link to a taxonomy tool for more complex controlled vocabularies.
- Provide tagging feedback. For example, set goals and display statistics on how many pages a user has tagged.
- Maintain and Evolve. Implement methods to gather and handle taxonomy change requests according to an agreed service level. Evaluate how the taxonomy is performing by monitoring query logs and collection analytics.
- Review and Revise. Taxonomy is not a “one and done” activity. It requires maintenance and management to adapt to organizational changes.
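The business rule mentioned under Migrate Content, where lower-level pages inherit tags from the landing pages above them, can be sketched as follows. The URL paths, tag values, and the `inherited_tags` function are hypothetical, and the rule that a nearer ancestor overrides a more distant one is an assumption of this sketch.

```python
# Tags applied editorially to landing pages of major sections (hypothetical paths).
LANDING_PAGE_TAGS = {
    "/products": {"Content Type": "Product Information"},
    "/products/faucets": {"Topic": "Faucets"},
    "/support": {"Content Type": "Help"},
}


def inherited_tags(path):
    """Merge tags from every ancestor landing page, nearest ancestor winning."""
    parts = [part for part in path.split("/") if part]
    tags = {}
    # Walk from the top-level section down to the page itself.
    for depth in range(1, len(parts) + 1):
        prefix = "/" + "/".join(parts[:depth])
        tags.update(LANDING_PAGE_TAGS.get(prefix, {}))
    return tags


print(inherited_tags("/products/faucets/kitchen/model-123"))
print(inherited_tags("/about"))
```

Pages that fall outside any tagged section come back with no tags, which gives a simple way to list what still needs editorial attention.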
Taxonomy Construction Tools
There are different kinds of tools related to taxonomy. These are for taxonomy management, content tagging, and content management.
- Taxonomy Management. A taxonomy tool is an application for building, maintaining and governing changes made to a taxonomy scheme. These tools include Data Harmony, MultiTes, PoolParty, Web Protégé, Synaptica KMS, VocBench, and others. Table 6 is a list of taxonomy editing tools and vendors and their key characteristics.
- Content Tagging. Tagging tools are designed to populate metadata with taxonomy terms manually, automatically, or with some combination of manual and automated processes. Sometimes this is referred to as “enriching” content with metadata. These tools include Data Harmony, Expert.ai, Megaputer, PoolParty, SAS, SmartLogic, and others.
- Content Management. Content management applications combine a database with workflow to create, edit, collaborate, publish, and store digital content. These applications include Drupal, OpenText, SharePoint, WordPress, and many others.
Table 6 Taxonomy Editing Tools
|Vendor||Taxonomy Editing Tool||Key Characteristics||URL|
|Access Innovations, Inc.||Data Harmony||Complete platform for taxonomy management & automated tagging||https://www.accessinn.com/data-harmony/|
|Cambridge Semantics||Anzo||Complete platform for knowledge graphs||https://cambridgesemantics.com/anzo-platform/|
|Microsoft||Excel||Simple tool that everyone has||https://www.microsoft.com/en-us/microsoft-365/excel|
|Mondeca||Intelligent Taxonomy Manager||Complete platform for taxonomy management||https://mondeca.com/software/|
|MultiTes||MultiTes Pro||Inexpensive thesaurus management tool||https://multites.net/|
|Semantic Web Company||PoolParty||Complete platform for knowledge graphs||https://www.poolparty.biz/taxonomy-thesaurus-management|
|MarkLogic||Semaphore||Complete platform for taxonomy management & automated tagging||https://www.marklogic.com/product/semaphore/|
|Stanford University||Protege||Inexpensive ontology tool||https://protege.stanford.edu/|
|Synaptica||Graphite; KMS||Complete platform for vocabulary management or knowledge graphs||https://www.synaptica.com/|
|TopQuadrant||TopBraid EDG-VM||Complete platform for vocabulary management||https://www.topquadrant.com/vocabulary-management|
|Università degli Studi di Roma 'Tor Vergata'||VocBench||Open source vocabulary management platform||http://vocbench.uniroma2.it/|
Taxonomy Editing Tool Functions
These are some basic taxonomy editing functions that all tools should provide:
- Standard and Custom Fields. Standard fields are those that are specified by the relevant Technical Standards such as Z39.19 and ISO 25964 described above. These include the standard thesaurus fields such as preferred term and scope note (SN). It should also be possible to define custom fields to be used locally in a particular taxonomy, for example, Term Source or Editorial Note.
- Standard and Custom Relations (Intra-Vocabulary Relations). Standard relations are those that are specified in the relevant Technical Standards such as broader (BT), narrower (NT), and related (RT) terms. It should also be possible to define custom relations to be used locally such as IsA, PartOf, HasA, etc.
- Data Typing and Restrictions. Data typing is the capability to define certain characteristics of a field, such as requiring an alphabetic value or a numeric value. Restrictions are the capability to identify specific valid values, for example, from a list of predefined values such as True or False, or from a range of values such as “1/1/2000 to 12/31/2000”.
- Consistency Enforcement. Enforcement is the capability to require that a value be entered, and to specify whether a field accepts only a single value or multiple values.
- Flexible Reporting. There are standard formats that should be output by the tool. These should include generic formats such as CSV or XML, as well as specific displays such as thesaurus record or hierarchical tree.
- Flexible Importing. There should be standard import formats accepted by the tool, including a CSV import format where the top row specifies the data attributes, and the remaining rows elaborate the entries. There should also be an XML import format that conforms to SKOS and/or OWL.
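Several of the basic functions above (restrictions, consistency enforcement, and a CSV import whose top row names the attributes) can be illustrated together. This is a minimal sketch, not the behavior of any particular tool: the column names, the `FIELD_RULES` table, and the allowed status values are assumptions.

```python
import csv
import io

# Hypothetical field rules: a required flag and an optional set of allowed values.
FIELD_RULES = {
    "uid": {"required": True},
    "entry_term": {"required": True},
    "status": {"required": False, "allowed": {"Approved", "Candidate"}},
}


def import_terms(csv_text):
    """Read a CSV whose first row names the attributes; return (records, errors)."""
    records, errors = [], []
    reader = csv.DictReader(io.StringIO(csv_text))
    for line_no, row in enumerate(reader, start=2):  # row 1 is the header
        problems = []
        for name, rule in FIELD_RULES.items():
            value = (row.get(name) or "").strip()
            if rule.get("required") and not value:
                problems.append(f"{name} is required")
            if value and "allowed" in rule and value not in rule["allowed"]:
                problems.append(f"{name} has invalid value {value!r}")
        if problems:
            errors.append((line_no, problems))
        else:
            records.append(row)
    return records, errors


sample = """uid,entry_term,status
C001,Faucets,Approved
C002,,Approved
C003,Sinks,Draft
"""

records, errors = import_terms(sample)
print([record["uid"] for record in records])
print(errors)
```

Reporting errors by line number, rather than stopping at the first bad row, lets an editor fix a whole batch in one pass.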
These are more advanced taxonomy editing functions that most, but not all, tools should provide:
- Unicode. The tool should use and support the Unicode Standard for consistent encoding, representation, and handling of text in most languages, including multiple languages in the same vocabulary.
- Multiple Vocabulary Support. The tool should support the capability to build and maintain multiple discrete vocabularies in the same platform.
- Inter-Vocabulary Relations. The tool should support creating and managing relationships between concepts in separate vocabularies that may be on the same or different platforms.
- Unique IDs. The tool should generate and manage unique and persistent identifiers (URIs).
These are advanced taxonomy editing functions that not all tools provide:
- Workflow. The tool should provide a workflow engine or some other mechanism for users with specified roles to define and manage taxonomy entities throughout their lifecycle.
- Voting. The tool should provide a mechanism for users who have specified taxonomy management roles to review and approve or disapprove of taxonomy changes.
- Change Request Management. The tool should provide a mechanism to gather and prioritize taxonomy requests, and to report on their disposition to all identified stakeholders.
- Stylistic Rules Enforcement. The tool should have the capability to define more advanced rules to enforce term consistency, for example, ensuring terms are in direct vs. inverted order, or terms are spelled out rather than abbreviated, etc.
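Stylistic rules of the kind just described can often be expressed as simple pattern checks. The sketch below is illustrative only: the three rules and the `style_issues` function are assumptions, and the all-capitals test is a rough heuristic that will also flag legitimate acronyms.

```python
import re


def style_issues(term):
    """Return a list of stylistic rule violations for one candidate term."""
    issues = []
    # Inverted order, e.g. "Records, Medical" instead of "Medical Records".
    if re.search(r"^[^,]+,\s*\S", term):
        issues.append("inverted order")
    # A word of two or more capitals is treated as a possible abbreviation.
    if re.search(r"\b[A-Z]{2,}\b", term):
        issues.append("possible abbreviation")
    # Leading, trailing, or repeated internal whitespace.
    if term != " ".join(term.split()):
        issues.append("irregular whitespace")
    return issues


for candidate in ["Medical Records", "Records, Medical", "ACA Reassignment Notice"]:
    print(candidate, "->", style_issues(candidate))
```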
Table 7 groups taxonomy tools by functional area.
Table 7 Types of Taxonomy Functions
|Metadata Controlled Vocabulary|
Taxonomy Tool Examples
Figure 11 is the Synaptica KMS taxonomy tool showing the hierarchy browser on the left and standard term information on the right. Synaptica KMS is one of the most scalable taxonomy tools. It has been used to manage extremely large vocabularies in the medical domain.
Figure 11 Synaptica KMS taxonomy tool
Figure 12 PoolParty taxonomy tool
A much simpler and less expensive tool is MultiTes Pro. This is a Z39.19-compatible taxonomy editor that supports standard thesaurus fields. MultiTes offers these self-study tutorials:
- Getting Started with MultiTes Pro
- Navigating your thesaurus
- Importing data from text files
- Working with Subject Categories
- Working with Multilingual Thesauri
Figure 13 is the MultiTes Pro taxonomy tool. Since MultiTes was designed as a thesaurus tool, the main MultiTes display is an alphabetical list rather than a tree browser. The term record is a pop-up window. The term’s hierarchical context is viewed in a tab in the pop-up window.
Figure 13 MultiTes taxonomy tool
USE: Affordable Care Act
ACA Reassignment Notice
UF: Affordable Care Act Reassignment Notice
BT: Medicare Documents
Access to Medical Records
Accountable Care Organization
SN: A group of health care providers who give coordinated care, chronic disease management, and thereby improve the quality of care patients get. The organization's payment is tied to achieving health care quality goals and outcomes that result in cost savings. Read a fact sheet about accountable care organizations. (http://www.healthcare.gov/glossary/04262011a.pdf)
RT: Fee for Service
SN: The percentage of total average costs for covered benefits that a plan will cover. For example, if a plan has an actuarial value of 70%, on average, you would be responsible for 30% of the costs of all covered benefits. However, you could be responsible for a higher or lower percentage of the total costs of covered services for the year, depending on your actual health care needs and the terms of your insurance policy.
BT: Condition and Treatment
USE: Moving Residence
Figure 14 MultiTes import format
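A tagged text layout like the one excerpted above is straightforward to read programmatically. The parser below is a sketch based only on the excerpt, not on MultiTes documentation. It assumes that a line starting with USE, UF, BT, NT, RT, or SN followed by a colon belongs to the most recent term, and that any other non-blank line starts a new term.

```python
import re

# Field codes seen in the excerpt plus NT from the data dictionary (an assumption).
FIELD_LINE = re.compile(r"^(USE|UF|BT|NT|RT|SN):\s*(.*)$")


def parse_import(text):
    """Parse tagged term records into {term: {code: [values]}}."""
    records = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        match = FIELD_LINE.match(line)
        if match:
            if current is None:
                raise ValueError(f"field line before any term: {line!r}")
            code, value = match.groups()
            records[current].setdefault(code, []).append(value)
        else:
            # Any other non-blank line is treated as the start of a new term.
            current = line
            records.setdefault(current, {})
    return records


sample = """ACA Reassignment Notice
UF: Affordable Care Act Reassignment Notice
BT: Medicare Documents
Accountable Care Organization
RT: Fee for Service
"""

parsed = parse_import(sample)
print(parsed["ACA Reassignment Notice"]["BT"])
print(parsed["Accountable Care Organization"])
```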
MultiTes can export the thesaurus as a set of HTML files that operate as a website. This can be used to generate simple websites that can be styled further (see, for example, the NASA Taxonomy in Figure 10).
Figure 15 is an example of a MultiTes-generated website with the taxonomy developed for healthcare.gov. The earlier nomenclature was the federally facilitated exchange, or FFE (for health insurance). Using an out-of-the-box MultiTes feature, you can search healthcare.gov with any term in the FFE taxonomy.
Figure 15 MultiTes was used to build this healthcare.gov taxonomy
Scenarios to Evaluate Taxonomy Tools
These are some scenarios that are useful for evaluating taxonomy tools:
- Database Definition. How is the database created? Where is it stored? Is it Z39.19 and ISO 25964 compliant? Does the database, e.g., Ontotext, need to be licensed?
- Importing/Exporting Data. How are data imported? What file formats are supported? Can data files be in batches that are sequentially uploaded?
- Add, Edit, Delete Categories. How easily are categories added, edited, or deleted? Can categories be added, edited, or deleted in batches?
- Relationship Types. How are relationship types defined? What types are supported? How is polyhierarchy handled?
- Add, Edit, Delete Relationships. How easily are relationships added, edited, or deleted? Can relationships be added, edited, or deleted in batches? Does a change propagate to all instances?
- Reporting. How does the taxonomy tool report new, edited, deleted taxonomies and categories; new, edited, deleted relationship types and relationships; and mapped taxonomies and categories? How are the reports presented? What audit logs are available? Can changes be traced to users who suggested them? Is an “approval” step for changes available for administrators?
- User Access. Can the taxonomy tool integrate user accounts with existing authentication systems, e.g., Active Directory? Is there support for role-based access or defined group membership with configurable access? Is there a workflow to approve changes? What functionality is available or restricted based on a user’s security privileges?