Improving Information Management

with Terminology and Categorization Standards


The information that is produced, managed and used by business professionals has undergone a transformation from primarily static narrative documentation to dynamic content and data that are used to produce reports and visualizations.

Narrative reports and other types of documents are still very important artifacts, but today most of these are produced digitally, in serial versions that need to be managed, and may need to be synchronized with the related data sources.

These factors have created new challenges for information professionals to identify, describe and manage information items so that they can be found and used by business professionals to accomplish their daily tasks.

This paper was originally produced for the Transportation Research Board of the National Academies of Sciences.[1] It provides an overview of the current information management landscape and makes recommendations related to:

  • Developing a common categorization scheme for information management, and identifying enhancements in detail or scope of information that should be included in such schemes.
  • Strategies for developing a common terminology and categorization scheme that could be made available for use within a specific industry sector, or within an individual organization.

Types of Information that Need to be Managed

A common way to scope the types of information sources that an organization needs to manage is to look at the key business functions or activities of the organization and the types of information that are produced or consumed in accomplishing each function.

Figure 1 illustrates how the scope of information created as part of common business functions includes data or data-generated information formats such as Computer-Aided Design (CAD), Geographic or Geospatial Information Systems (GIS) and other computer-generated graphics. Often these data-based visualizations are generated for use as part of a document or presentation.

Function                       Information Types
Project Information            Engineering (Diagram, Plan, Drawing, etc.); Specification; Performance Test; Planning Study; Materials Research; Environmental Report
System Condition/Performance   Traffic Data; Safety Data; Performance Report
Research/Analysis              Research Report
Administration                 Financial; Contact Information
Inventory                      GIS Data; Asset Inventory Database

(In the original figure each information type is also mapped to one or more information formats: Documents, Data, CAD, GIS, and Graphics.)

Figure 1 Types of Business Information by Function

Use Cases to Maximize Information Use and Value

Information is created to support a business function or activity, although the intended use of the information may not always be readily apparent.

It is even more difficult to anticipate and envision future secondary and potential tertiary uses of that information. For example, information may be generated as part of an immediate operational activity such as accessioning assets, which is part of the asset management function. Later that same information may be analyzed to produce an asset maintenance plan. If business information is to realize its full value as a resource, it must be created and maintained in a form that will support such primary, secondary and tertiary activities.

In other words, the value of business information increases each time that it is used and reused. To maximize the potential value of business information, it is important to consider how that information should be structured to maximize its potential uses. It is also important to consider when it is appropriate to archive and/or purge it from an active collection.

One basis for deciding how to structure business information to maximize its usefulness is to consider the key information management applications in the business or industry sector. In particular, we want to consider the primary purpose of each of these applications and how it acquires, organizes, retrieves, secures and maintains information to support those specific objectives.

Relationship Between Data and Content Management

The specific types of resources that data and information applications handle have changed over time. Prior to the mid-1980s, most information was in databases and was primarily numerical in its basic format. While word processing was available before the adoption of the personal computer (PC), it was primarily a centralized operation organized along the lines of a typing pool. As the PC emerged, word processing became ubiquitous, and the production of text files and presentations led to an explosion of documents. The emergence of networked information and the World Wide Web (WWW) in the 1990s amplified this information explosion. In the 2000s the trend was accelerated by the widespread adoption of email, and in the 2010s by the adoption of social media.

As the types of business content and strategies for managing them change from numerical to documents to multimedia to analytics, the pendulum swings between data and content management.

Data management is the set of processes and technologies related to the collection and management of data so that it can effectively be used to accomplish tasks. For the purposes of business information management, we are primarily concerned with 1) ensuring the quality of authoritative data, 2) identifying and providing appropriate access to authoritative data to accomplish business functions, 3) generating visualizations or analyses based on processing data, and 4) synchronizing visualizations and narratives with source data. A key aspect of data quality is to ensure that data values exist and that they are consistent. Other aspects of data quality include accuracy, timeliness, validity, and completeness. When data quality needs to be maintained across data sources, extra work is required to obtain a set of consistent values, e.g., for the names of organizations such as government agencies or contractors. Data values that have been assembled and mapped from across multiple data sources are called reference data. Alternatively, a common or standard set of data values, such as ISO 3166[2] for identifying countries, may be used.
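
To make the reference data idea concrete, the following minimal Python sketch maps variant organization names drawn from multiple sources to a single canonical value. The names, variants, and function here are invented for illustration, not an authoritative list.

    # Minimal sketch of reference-data mapping: variant organization names
    # from multiple sources are normalized to one canonical value.
    # All names and variants below are illustrative only.

    REFERENCE_ORGS = {
        "federal highway administration": "Federal Highway Administration",
        "fhwa": "Federal Highway Administration",
        "caltrans": "California Department of Transportation",
        "california dot": "California Department of Transportation",
    }

    def normalize_org(raw_name: str) -> str:
        """Map a raw organization name to its canonical reference value.

        Unmatched names are returned unchanged and flagged for review.
        """
        key = raw_name.strip().lower()
        if key in REFERENCE_ORGS:
            return REFERENCE_ORGS[key]
        print(f"review needed: no reference value for {raw_name!r}")
        return raw_name

    # Records assembled from two hypothetical data sources:
    records = ["FHWA", "Caltrans", "California DOT", "Oregon DOT"]
    print([normalize_org(r) for r in records])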

Content management is the set of processes and technologies related to the collection, management and publication of information so that it can effectively be found and used to accomplish tasks. For the purposes of business information management, we are primarily concerned with 1) identifying and providing appropriate access to authoritative versions of content items, 2) providing adequate descriptions of content items so that they can be found and used, 3) linking to data sources for visualizations and analyses included in narratives, and 4) linking to related content items. Effective content management requires the generation of complete and consistent metadata values. While system-generated metadata such as unique identifiers and last modified dates is readily available, descriptive metadata such as topic frequently does not exist because it is not required by organizational business processes. With full-text search readily available, the value of author-generated metadata is not always considered worth the effort.

Content Structure

While the definitions of data and content management are closely aligned, the types and formats of information being managed are distinctly different. In the past, the two disciplines were quite distinct, but that is changing, since most information forms today include structures that are amenable to automated processing. Thus it is more useful to describe content items along a continuum from more to less structured.

In the present paradigm, every type of content item has some data or metadata associated with it. This trend is having an impact on both data and content management. For example, data management applications are typically structured to provide a view at the present time, to answer questions like "What is the current balance in a program fund account?" It is much more difficult to ask "What was the balance in a program fund account a year ago?" Similarly, computer file servers can tell you the last modified date of a piece of content. It is usually not possible for them to tell you the effective date of a piece of content unless a human editor has added that information. For example, the effective date of a regulation is normally different from the date of the legislation or the date of its announcement in the U.S. Code of Federal Regulations.

Content Processing

Another trend is the processing of content to identify any meaningful patterns that can be discovered about it. This can be based on patterns among the words and phrases in the text, extracting named entities such as locations, organizations or people that are mentioned in the text, and presenting these patterns using some form of visualization such as GIS maps. Analytics is the processing of content into a data representation, where all types and forms of content can be reduced to a set of data values.
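
As a rough illustration of this kind of content processing, the sketch below extracts named entities from a short passage using the open-source spaCy library. It assumes spaCy and its small English model are installed; the sample text is invented.

    # Hedged sketch of named entity extraction with spaCy. Assumes:
    #   pip install spacy && python -m spacy download en_core_web_sm
    import spacy

    nlp = spacy.load("en_core_web_sm")

    text = (
        "The Federal Highway Administration released traffic safety data "
        "for Interstate 80 near Sacramento, California in March."
    )

    doc = nlp(text)
    for ent in doc.ents:
        # ent.label_ is the entity type, e.g. ORG, GPE (location), DATE
        print(ent.text, "->", ent.label_)

Entities extracted this way can then be aggregated into patterns or plotted, for example on a GIS map, as described above.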

Today, managing content across an enterprise encompasses both data and content management. Sometimes this is referred to as enterprise content management (ECM), and sometimes simply as data management.

Linking Source Data to Published Analysis in Documents

One of the important challenges today is managing heterogeneous content, i.e., narrative content that may be based on structured data sets and include visualizations of that data. Providing dynamic methods to directly link narrative content to such source data is becoming necessary. It is no longer sufficient to manage such narrative content simply as a static content item. For example, a research report on highway safety that includes tables of data, charts and maps needs to be linked explicitly back to the data source so that further analysis of the same data set can readily be replicated, or new analyses performed.

Content Lifecycle, Workflow, Archiving

A content set will typically evolve over time through drafts and versions, and will often have annotations and commentary associated with it. Today’s information manager is faced with the requirement of managing and synchronizing multiple versions of overlapping sets of heterogeneous sources. This is a difficult task, which can be addressed through the use of versioning software available in document management systems. For example, a PowerPoint report on material properties of highway surfaces will typically be developed through many drafts and versions for different audiences such as engineers and budget analysts. It’s difficult to keep track of the multiple versions, and to determine which one is the most current, or which one is the official approved document of record.

Impact of Metadata Standards

Metadata standards provide the basic guidelines for the common description of content so that it can be found and used within and across applications, repositories and organizations. Metadata should be associated with all types of content items, including documents, data sets and visualizations. Metadata may be embedded in the content item or stored in a separate metadata database, with identifiers linking the content item to its metadata record. Metadata will be generated at the time the content item is created, as well as each time the content item is used throughout its lifecycle. Ideally, metadata should provide a longitudinal record covering the life of the content item.

Two ISO metadata standards are particularly relevant to business information: ISO 15836 (aka Dublin Core); and ISO 19115.

ISO 15836 (Dublin Core)

ISO 15836,[3] commonly known as the Dublin Core (after Dublin, Ohio, the site of the meeting where the standard originated), is the standard for describing content published on the web. Figure 2 is a representation of this ISO standard, which defines 15 properties for use in resource description. Dublin Core properties can be expressed as HTML meta tags or as RDFa, an HTML extension that is useful for marking up and publishing metadata as linked data, that is, publishing structured data so that it can be interlinked with data items from different data sources. (Linked data is the basis of the semantic web, discussed in the section Linked Data on the Web.) Dublin Core has been widely adopted in government and business as the basic set of properties for describing content items. These properties are available in commercial software such as Microsoft SharePoint, where they are represented as metadata columns. Business information should include descriptive metadata based on the Dublin Core.
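
As a hedged illustration, the Python sketch below generates a few of the 15 Dublin Core properties as HTML meta tags, one common way of embedding them in a webpage. The document values are invented for the example.

    # Illustrative sketch: expressing Dublin Core descriptive metadata as
    # HTML <meta> tags. The document values below are invented.

    dublin_core = {
        "DC.title": "Highway Surface Materials Performance Report",
        "DC.creator": "Office of Research",
        "DC.subject": "Pavement materials; Performance testing",
        "DC.date": "2015-06-01",
        "DC.type": "Text",
        "DC.format": "application/pdf",
        "DC.language": "en",
    }

    # Conventionally the schema is declared once, then each element follows.
    print('<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">')
    for name, value in dublin_core.items():
        print(f'<meta name="{name}" content="{value}">')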


Figure 2 Dublin Core Elements Grouped into Description Categories

ISO 19115 and FGDC

ISO 19115 is the standard for describing geographic information and services. The U.S. Federal Geographic Data Committee (FGDC) endorsed the geographic information metadata standard in 2010. This common metadata standard for the description of GIS data sets has enabled building the U.S. National Spatial Data Infrastructure (NSDI). Geographic data, imagery, applications, documents, websites and other resources have been cataloged for the NSDI Clearinghouse. This metadata catalog can be searched to find geographic data, maps, and online services. Following is a description of the NSDI from the FGDC Data and Services webpage:

“The Clearinghouse Network is a community of distributed data providers who publish collections of metadata that describe their map and data resources within their areas of responsibility, documenting data quality, characteristics, and accessibility. Each metadata collection is hosted by an organization to advertise their holdings within the NSDI. The metadata in each registered collection is harvested by the geo.data.gov catalog to provide quick assessment of the extent and properties of available geographic resources.”[4]

Organizations should participate in the NSDI Clearinghouse Network to help meet their GIS needs and leverage their GIS resources. This is one way to leverage the adoption and use of the FGDC metadata standard.

Impact of Open Data and Digital Government Initiatives

Under the past few administrations, the White House has promoted information management best practices in U.S. government agencies that take advantage of the current and emerging networked information ecosystems. These initiatives have aimed to improve customer service, efficiency, effectiveness, accountability, and transparency.

Digital Government Strategy

Early websites consisted of static HTML pages that were usually authored painstakingly by hand. Web content management (WCM) applications introduced authoring interfaces with features such as previewing how the HTML would render before a page was published to a website. WCM applications also provide templates for creating different types of content and for publishing those pages with a particular design. Such templates could query a database or content repository to find content items to populate presentation templates, enabling an early version of dynamic webpages. This model of separating content creation from content publishing has become a key part of enterprise content management strategies.

The U.S. Digital Government Strategy presents a conceptual model with three service layers: 1) the information layer, 2) the platform layer, and 3) the presentation layer[5]. The information layer contains structured, digital information such as traffic and highway safety data, plus unstructured information (content), such as fact sheets, press releases, and compliance guidance. The platform layer includes the systems and processes used to manage this information. The presentation layer defines the way that information is organized for presentation to users via websites, mobile applications, or other modes of delivery. These three layers as illustrated in Figure 3 separate information creation from information presentation thus allowing content and data to be created once, and then used in different ways.

Whether implemented as a portal, wiki, or service-oriented architecture, the intention is the same—to enable wider access to information at the local, regional, and national levels.


Figure 3 Digital Government Services Model [6]

Open Data

Prior to the World Wide Web, publishing government data (both datasets and bibliographic metadata) was done primarily by commercial publishers, and database publishing continues to this day. With the advent of the web, public government data began to be more widely and freely published, but on a voluntary basis.

The President's 2009 Memorandum on Transparency and Open Government and OMB Memorandum M-10-06, the Open Government Directive,[7] made it a priority for the public to be able to easily find, download, and use datasets that are generated and held by the Federal Government. Data.gov was created as a catalog to provide descriptions of these datasets. The federal open data policy has greatly accelerated the trend to make datasets from all levels of government publicly available. Under the data.gov model, U.S. public agencies at all levels of government have the opportunity to develop applications based on the datasets they hold, or to simply publish the datasets and let third parties develop those applications. For example, Figure 4 is a dynamically generated traffic map that is made available by a third party based on Caltrans (the California DOT) data.

Transparency, APIs, Third Party Applications

The U.S. digital government strategy is intended to facilitate the development of services that make use of the information layer through the federal open data policy and thus provide transparency. These services may be developed by the government agencies that produce the data, or they may be developed and deployed by other government agencies or parties outside government. There are several architectures for doing this: 1) download the data set and implement it as a stand-alone service on your own server, 2) dynamically query the information layer, or 3) link to the information layer. Dynamically querying the information layer can be done using the Web Services Description Language (WSDL) or, if that is not supported, via an application programming interface (API) or custom application. Linking to the information layer requires the agency to publish its information as a linked data service, that is, publishing structured data so that it can be interlinked with data items from different data sources. Linked data is the basis of the semantic web, which is discussed in the next section.
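
The sketch below illustrates the second architecture, dynamically querying the information layer over HTTP. The endpoint URL and field names are hypothetical stand-ins, not a real agency API.

    # Hedged sketch of the "dynamically query" architecture: fetch records
    # from an open-data service over HTTP. The endpoint and field names
    # are hypothetical, not a real agency API.
    import json
    import urllib.request

    API_URL = "https://data.example.gov/api/traffic-counts?region=district-4"  # hypothetical

    with urllib.request.urlopen(API_URL) as response:
        records = json.load(response)

    # Each record is assumed to carry a location and an hourly vehicle count.
    for rec in records[:5]:
        print(rec["location"], rec["vehicles_per_hour"])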


Figure 4 Traffic Data Map Generated Using Web API (www.trafficpredict.com)

Making APIs available to web developers has been a popular trend. The Google Maps API is well known and widely used to display data that has geospatial information on a Google map. Most mapping applications provide APIs for representing geospatial data using one or more of their map layers.

Figure 4 is an example of a mashup,[8] which combines traffic data with a GIS map representation. Several other information services make APIs available so that developers can access and reuse their data. For example, the New York Times makes APIs available for article search, Congressional information, location concepts, and many others.[9]

Impact of the Semantic Web

The semantic web is a movement that promotes common data formats on the web that enable the inclusion of semantic content in webpages, such as identification of the names of people, organizations, locations, events and other things. Semantic coding is similar to hyperlinking, but with much more functionality. For example, web browsers can recognize semantic content, so unstructured documents can behave more like structured data. Semantic content can enable dynamic linking of data sources to tables and visualizations in documents. For example, traffic and safety data can be linked to an analysis of highway safety in particular regions or related to different modes of transportation.

RDF Schema (RDFS) vocabulary description language

The Resource Description Framework (RDF) is a general-purpose language for representing information on the web, commonly serialized in XML. RDF Schema (RDFS) is an RDF extension that is used to describe groups of related resources and the relationships between these resources. A vocabulary description language is a way to represent the components of a schema and the relationships between these components. For example, Dublin Core or ISO 19115 can be represented in RDFS.
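
As a brief sketch of RDF in practice, the following Python fragment uses the open-source rdflib library (assuming it is installed) to describe a hypothetical report with Dublin Core properties and serialize it in RDF's XML syntax.

    # Minimal rdflib sketch: describe one resource with Dublin Core
    # properties in RDF. Assumes `pip install rdflib`; the report URI
    # and values are hypothetical.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import DC

    g = Graph()
    report = URIRef("http://example.org/reports/highway-safety-2015")  # hypothetical

    g.add((report, DC.title, Literal("Regional Highway Safety Analysis")))
    g.add((report, DC.creator, Literal("Office of Research")))
    g.add((report, DC.date, Literal("2015-06-01")))

    # Serialize in RDF's XML syntax; Turtle ("turtle") also works.
    print(g.serialize(format="xml"))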

Namespaces for XML Schema Elements and Attributes

A namespace is a specification for unique identifiers for labels. A namespace disambiguates labels that are otherwise the same (e.g., homographs) with unique and referenceable identifiers. For example, an application could query a schema via a web service when it is necessary or useful to interact with an application that uses that schema. Figure 5 shows the Dublin Core namespace, which is used to discover the semantics and syntax of the Dublin Core elements and other related resources called DCMI Terms. In this way, XML schema elements and attributes form a type of vocabulary, i.e., a controlled list of values with a specific meaning and purpose.

Term Name     URL
contributor   https://purl.org/dc/elements/1.1/contributor
coverage      https://purl.org/dc/elements/1.1/coverage
creator       https://purl.org/dc/elements/1.1/creator
date          https://purl.org/dc/elements/1.1/date
description   https://purl.org/dc/elements/1.1/description
format        https://purl.org/dc/elements/1.1/format
identifier    https://purl.org/dc/elements/1.1/identifier
language      https://purl.org/dc/elements/1.1/language
publisher     https://purl.org/dc/elements/1.1/publisher
relation      https://purl.org/dc/elements/1.1/relation
rights        https://purl.org/dc/elements/1.1/rights
source        https://purl.org/dc/elements/1.1/source
subject       https://purl.org/dc/elements/1.1/subject
title         https://purl.org/dc/elements/1.1/title
type          https://purl.org/dc/elements/1.1/type

Figure 5 DCMI Elements Namespace

Named Entity Vocabularies

The other types of vocabularies that are important for the semantic web are named entity vocabularies. Named entities are the names of people, organizations, locations, events, topics, and other things with proper names or other specific, controlled names. The RDFS vocabulary description language can also be used to describe the values in named entity vocabularies, the relationships among those values and related values within a particular vocabulary or within another namespace.

SKOS (Simple Knowledge Organization System) is the World Wide Web Consortium (W3C) specification for representing knowledge organization systems using RDF. SKOS makes the important distinction that the discrete components of a categorization scheme (often called nodes) are concepts, and that various labels and other types of information can be associated with each concept. Figure 6 provides a simple illustration of a concept, in this case the name of the U.S. Federal Highway Administration Research Library. The actual concept is an identifier, in this case the Library of Congress Name Authority File identifier for the U.S. Federal Highway Administration Research Library (http://id.loc.gov/authorities/names/no2012007308). Each lexical relationship, in this case equivalence relationships, can be represented as a subject-predicate-object triple, shown in the table at the bottom of the diagram and sketched in code after Figure 6. A SKOS representation can easily be extended to add more information about a concept by adding another row to the triple table with the concept's identifier as the Subject.

Subject                                            Predicate        Object
http://id.loc.gov/authorities/names/no2012007308   skos:prefLabel   United States. Federal Highway Administration. Research Library
http://id.loc.gov/authorities/names/no2012007308   skos:altLabel    U.S. Federal Highway Administration. Research Library
http://id.loc.gov/authorities/names/no2012007308   skos:altLabel    FHWA Research Library

Figure 6 Some Semantic Relationships for the U.S. Federal Highway Administration Research Library
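
The triples in Figure 6 can also be written directly in code. The following sketch uses rdflib's SKOS vocabulary (assuming rdflib is installed); the final triple shows how the description is extended by adding another statement about the same subject.

    # Sketch of the Figure 6 triples using rdflib's SKOS namespace.
    # Assumes `pip install rdflib`. The identifier is the Library of
    # Congress Name Authority File URI quoted in the text above.
    from rdflib import Graph, Literal, URIRef
    from rdflib.namespace import SKOS

    g = Graph()
    concept = URIRef("http://id.loc.gov/authorities/names/no2012007308")

    g.add((concept, SKOS.prefLabel,
           Literal("United States. Federal Highway Administration. Research Library")))
    g.add((concept, SKOS.altLabel,
           Literal("U.S. Federal Highway Administration. Research Library")))
    g.add((concept, SKOS.altLabel, Literal("FHWA Research Library")))

    # Extending the description is just another triple on the same subject:
    g.add((concept, SKOS.note, Literal("Example of extending a SKOS concept.")))

    print(g.serialize(format="turtle"))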

Linked Data on the Web

An emerging trend in library authority files and other types of authoritative lists of named entities (people, organizations, locations, events, topics, etc.) is to publish them on the web using uniform resource identifiers (URIs). The most common URIs are URLs, or webpage addresses. URIs are unique identifiers on the web, so assigning URIs to authority records allows them to be referenced persistently on the web. The idea is to enable organizations to publish and reference named entities on the web. For example, the Federal Highway Administration (FHWA) could publish and maintain the authoritative list of FHWA agency names, programs, projects, etc. By doing this, the authoritative list of FHWA names would be available to State DOTs as well as application developers, just as traffic and safety data are available for mashups.

Review of Terminology and Categorization Schemes

This section provides a brief overview of controlled vocabularies, which can be used to describe and categorize content to make it easier to find and use.

Overview of Types of Ways to Categorize Content Collections

Working with digital content, users can either 1) browse for content using a file manager or 2) search for content using a local search engine. Browsing for content relies on the method used for organizing and naming the file directories and the files themselves. These are often ad hoc even when files are kept on a shared file store. Searching relies on the way the internal search engine has been configured, including considerations such as what content has been indexed, whether indexing is full text or metadata-driven, how search results are presented, and whether refinements can be made to the query or the results. Often search engines have not been configured at all.

File Directory Methods

The most common way to organize content is to place it into a physical directory. This is similar to the single access method that is used with paper files where content items are filed in folders in filing cabinets. Everyone who has a PC has to decide how to set up their file manager directories and folders. When there are shared network drives, a standard method of naming directories and folders is usually established.

Folders are organized differently depending on the business activity. Here are some examples.

  • Records management is based on a record retention schedule. These schedules are typically set up by business function (accounting, administration, environment, finance, human resources, etc.), then by content type (annual report, best practice, correspondence, datasheet, handbook, form, etc.), and then chronologically by date.
  • Project management is based on work breakdown structure (WBS) usually by technical discipline, subdivided by task, and then chronologically.
  • Administrative files are often organized in chronological order by date, or simply in alphabetical order from A to Z.

Metadata Description Methods

In addition to placing content in a file directory, it is helpful to associate descriptive metadata with the content item. Metadata supports content retrieval for authors and content managers by helping them

  • Reliably determine whether a content item exists.
  • Determine ownership of a content item and whether it can be reused.
  • Enable alerts to new content, or subscription to a predefined query.
  • Keep content items up-to-date, accurate, and in compliance with regulations.

Metadata supports content publishing and general use such as

  • Faceted search based on metadata properties (see the sketch after this list).
  • Search optimization.
  • Dynamic content delivery based on standard categorization.
  • Content reuse in multiple distribution channels (web, mobile, RSS alerts, etc.)
  • Content reuse in FAQs (Frequently Asked Questions) on specific topics and other categories.
  • Orienting visitors who arrive from search engines on public websites, even when they land on a page fifteen layers deep.
  • Ensuring consistent values for analytics across channels.
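
As a minimal sketch of the faceted search case mentioned above, the following Python fragment filters a small collection by combining metadata properties. The items and values are invented for the example.

    # Minimal sketch of faceted search over descriptive metadata:
    # items are filtered by combining metadata property values.
    # The collection below is invented.

    items = [
        {"title": "Pavement Study", "type": "Research Report", "region": "West"},
        {"title": "Bridge Inspection", "type": "Performance Report", "region": "East"},
        {"title": "Safety Analysis", "type": "Research Report", "region": "East"},
    ]

    def facet_filter(collection, **facets):
        """Return items whose metadata matches every requested facet value."""
        return [
            item for item in collection
            if all(item.get(prop) == value for prop, value in facets.items())
        ]

    print(facet_filter(items, type="Research Report", region="East"))
    # -> [{'title': 'Safety Analysis', ...}]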

Strategies for Developing a Common Categorization Scheme

This section provides a framework for developing and maintaining a common categorization scheme that would meet the needs of a community of organizations based on emerging WWW standards and practices that enable linked data.

Requirements

Many categorization schemes were originally developed to support bibliographic information retrieval, but the information management landscape has changed since the 1990s. Today most business information is “born digital.” There is an enormous volume of data about assets, particularly related to their location and usage over time. While asset-based information has not yet been fully integrated, we can expect further integration over time, so that eventually there will be an up-to-date digital representation of every asset throughout its life, as well as a longitudinal record that can present a series of asset information snapshots over time. We should expect that reports making use of data and visualizations will be linked directly to those data sources. The two big future trends are 1) more data and 2) more integration. The key requirement to support these trends is metadata that will enable integration. That metadata will require common topical categories, as well as common sets of proper names for organizations, persons, computer programs, places, and other relevant named entities.

Flexibility

A business information categorization scheme must be applicable to all types of content in all presentation formats. Business information takes many forms and exists in many formats. Specifically, the categorization scheme must be readily applicable to database schemas, data set metadata, document properties, and static and time-based visualizations, as well as heterogeneous content forms that include one or more of these content types or need to be synchronized across types, such as the data set associated with a visualization in a document.

The scheme needs to be applicable to a collection, an item, and a component, and these should be able to inherit properties from the more general to the more specific level.

It should not matter where or how the container holding the values associated with a particular item is implemented—it may be embedded within the object or stored as an external database record that is referenced to the object using an identifier.
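
A minimal sketch of the external-record option, with invented field names: the metadata record lives apart from the content object and points back to it through a shared identifier.

    # Sketch of externally stored metadata: the record references the
    # content object through a shared identifier. Field names are
    # illustrative, not a prescribed schema.
    from dataclasses import dataclass

    @dataclass
    class MetadataRecord:
        content_id: str       # identifier shared with the content object
        title: str
        subject: str
        effective_date: str   # distinct from the file system's modified date

    # The same structure can describe a collection, an item, or a component;
    # more specific levels can inherit values from the level above.
    record = MetadataRecord(
        content_id="doc-00042",
        title="Materials Research Summary",
        subject="Pavement materials",
        effective_date="2015-01-15",
    )
    print(record)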

Semantics

The business information categorization scheme should be based on standards and best practices for representing semantics between and among categories including: 1) ANSI/NISO Z39.19-2005 (2010) Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies[10], 2) ISO 25964 Thesauri and Interoperability with other Vocabularies (-1:2011 Part 1: Thesauri for Information Retrieval[11] and -2:2013 Part 2: Interoperability with other Vocabularies[12]), as well as 3) SKOS (Simple Knowledge Organization System) the W3C specification on how to represent knowledge organization systems using RDF.

The scheme needs to support hierarchical, equivalent, and associative relationships.

Hierarchical means broader and narrower concepts. The scheme needs to handle multiple parent (broader) concepts, called polyhierarchy.

Equivalent means synonyms, quasi-synonyms, near synonyms, abbreviations, acronyms, and other alternate labels. The scheme needs to support identifying regional variations where one label might be preferred in one region while another label would be preferred in another region, e.g., Park and Ride vs. Fringe Parking.

Associative means related concepts, called Related Terms or RTs. OWL (Web Ontology Language) may also be used to specialize associative relationships.[13]
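
The three relationship types can be expressed directly in SKOS. The sketch below uses rdflib (assuming it is installed) with hypothetical concept URIs, reusing the Fringe Parking example discussed later in this paper.

    # Sketch of hierarchical, equivalent, and associative relationships
    # in SKOS terms, using rdflib. The scheme and concept URIs are
    # hypothetical.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import SKOS

    EX = Namespace("http://example.org/scheme/")  # hypothetical scheme
    g = Graph()

    # Hierarchical: broader/narrower; two broader triples make a polyhierarchy.
    g.add((EX.FringeParking, SKOS.broader, EX.Parking))
    g.add((EX.FringeParking, SKOS.broader, EX.PublicTransitAccess))

    # Equivalent: preferred and alternate labels, including regional variants.
    g.add((EX.FringeParking, SKOS.prefLabel, Literal("Fringe parking")))
    g.add((EX.FringeParking, SKOS.altLabel, Literal("Park and ride")))

    # Associative: related concepts (RTs).
    g.add((EX.FringeParking, SKOS.related, EX.CommuterRail))

    print(g.serialize(format="turtle"))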

Currency and Governance

A business information categorization scheme must be frequently updated to reflect the terminology needs of participating organizations and to remain current in a rapidly evolving area that overlaps many related disciplines and locales. At the same time, it is also important that the scheme is stable and is not changed merely in reaction to immediate events. A governance process is needed that 1) defines roles and responsibilities, 2) identifies appropriate policies and procedures, and 3) provides a communication plan that promulgates the scheme and its governance processes and communicates changes to all parties.
Localization

A business information categorization scheme must support localization so that a local organization can choose which of several alternative labels available for a concept to use, but still be able to align with the core concept. The relationship among alternative labels must be defined in a way that enables retrieval of federated search results from across different agencies, which may use different alternative labels. For example, while “Fringe parking” is the preferred Library of Congress Subject Heading, “Park and ride” or some variation of this label is frequently used instead. Figure 7 shows alternate labels for “Fringe parking” based on the Library of Congress Subject Headings linked data service.[14] The Transportation Research Thesaurus (TRT)[15] also has the category “Fringe parking” which is represented by the TRT unique notational code identifier “Brddf”.

Localization might also take the form of subsets of categories to support specific user communities, however that may be defined—by locale, by function, by expertise, by project, and any other sub-division.

Subject                                             Predicate        Object
http://id.loc.gov/authorities/subjects/sh85052028   skos:prefLabel   Fringe Parking
http://id.loc.gov/authorities/subjects/sh85052028   skos:altLabel    Park and Ride Systems
http://id.loc.gov/authorities/subjects/sh85052028   skos:altLabel    Park and Ride
http://id.loc.gov/authorities/subjects/sh85052028   skos:altLabel    Park & Ride
http://id.loc.gov/authorities/subjects/sh85052028   skos:altLabel    Park-n-Ride
trt:Brddf                                           skos:prefLabel   Fringe Parking
trt:Brddf                                           skos:altLabel    Park and Ride

Figure 7 Alternate Labels for the Concept Fringe Parking

Ease of Use

A business information categorization scheme must be easy to use for all stakeholders. All stakeholders need to be able to understand and use the categorization scheme to some extent. The expectation on the web is that categorization schemes need to be understandable without any training—they need to be as easy to use as Google. This doesn't mean that training, experience and subject matter expertise aren't important and valuable for obtaining more effective use of a categorization scheme, but it does mean that, on the surface, the scheme needs to be usable “out of the box”. The usability of a categorization scheme is usually measured by 1) discreteness of broad categories, 2) consistency in indexing information, and 3) consistency in finding information.

Card Sorting

The discreteness of categories can be measured by closed card sorting. In this test, commonly used terms selected from query logs and analytics are sorted into broad categories by representative users. The sorting results are compared to a baseline to measure consistency. This provides an independent assessment of how distinct the scheme’s broad categories are perceived to be. 70-80% consistency is considered a high usability validation for a categorization scheme. Card sorting is usually done using online tools with iterative sets of 15-20 participants sorting up to 50 terms. There is some debate in the usability community about the number of participants that are required to provide meaningful results, and when the benefit of added participants does not add meaningfully to the results.[16] Our opinion is that 15-20 participants provide meaningful results.
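
A minimal sketch of the scoring step, with invented terms, categories, and responses: each participant's assignments are compared to the baseline and the agreement is averaged.

    # Sketch of scoring a closed card sort against a baseline.
    # Terms, categories, and responses below are invented.

    baseline = {"guardrail": "Safety", "asphalt": "Materials", "ramp meter": "Operations"}

    participants = [
        {"guardrail": "Safety", "asphalt": "Materials", "ramp meter": "Safety"},
        {"guardrail": "Safety", "asphalt": "Materials", "ramp meter": "Operations"},
    ]

    def consistency(response, reference):
        """Fraction of terms a participant sorted into the baseline category."""
        matches = sum(response[term] == cat for term, cat in reference.items())
        return matches / len(reference)

    scores = [consistency(p, baseline) for p in participants]
    average = sum(scores) / len(scores)
    print(f"average consistency: {average:.0%}")  # 70-80% is considered high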

Indexing Consistency

Inter-indexer consistency can be measured by having representative users index a set of representative types of information—data, visualizations and documents. The values used to index each information item are compared to the baseline to measure completeness and consistency. Alternative values are identified, assessed and applied to the scoring scheme as appropriate. This provides an independent assessment of the categorization scheme’s usability for indexing. 70-80% consistency is considered a high usability validation. Indexing is usually done as a paper exercise with iterative sets of 15-20 participants indexing 5-10 information items.

Findability

Findability can be measured by having representative users look for, or describe how they would look for, a set of representative content items. The categories and sub-categories used to find content items are compared to a baseline to measure consistency. Alternative category paths are identified, assessed and applied to the scoring scheme as appropriate. This provides an independent assessment of the categorization scheme’s usability for finding information. 70-80% consistency is considered a high usability validation. Finding is done as a computer or paper-based exercise with iterative sets of 15-20 participants searching for 5-10 information items.

Microthesaurus

The Z39.19 standard for thesaurus construction briefly discusses the concept of a microthesaurus as a subset of a broader thesaurus that is created to be used in “specific indexing products.”[17] According to the standard, microthesaurus requirements include the following:

  • There should be a defined scope for a specialization of the broader thesaurus.
  • Terms and relationships extracted from the broader thesaurus should have integrity, that is, there should be no orphan terms.
  • Additional specific descriptors can be added, but should be mapped to the structure of the broader thesaurus.

Providing a method that enables stakeholders to generate and maintain microthesaurus subsets addresses the flexibility, localization and ease-of-use requirements discussed in the previous sections.

Microthesaurus Models

There are two broad models for implementing a microthesaurus: 1) a Centralized Model or 2) a Decentralized Service.

Centralized Model

In the centralized model, a new term property and set of controlled values is defined to code any thesaurus term so that it can be included in a microthesaurus. In this model, the broader thesaurus is itself, in effect, a microthesaurus, so each term must be coded. To define a microthesaurus, preferred and variant terms would be individually selected. In this model, term relationships generally remain unchanged from the broader thesaurus. Exceptions are handled automatically so that 1) broken hierarchical relationships are collapsed, 2) orphaned equivalent and associative term relationships are ignored, and 3) relationships are added for any new more specific terms. As shown in Figure 8, a thesaurus management tool with the capability to provide distributed access is usually required to implement such a centralized service. Solid lines indicate tight integration, and dotted lines indicate loose integration. (A small sketch of subset extraction follows Figure 8.)


Figure 8 Centralized Microthesaurus/Subset Model
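
As a rough sketch of the subset extraction this model implies, the following Python fragment collapses broken hierarchical links to the nearest ancestor flagged for the microthesaurus. The thesaurus data is invented.

    # Sketch of the centralized model's subset extraction: flagged terms
    # keep their relationships, and broken broader/narrower chains are
    # collapsed to the nearest flagged ancestor. Data below is invented.

    broader = {  # term -> broader term in the full thesaurus
        "park and ride": "fringe parking",
        "fringe parking": "parking",
        "parking": "facilities",
    }
    in_subset = {"park and ride", "parking", "facilities"}  # "fringe parking" excluded

    def collapsed_broader(term):
        """Walk up the hierarchy, skipping terms not flagged for the subset."""
        parent = broader.get(term)
        while parent is not None and parent not in in_subset:
            parent = broader.get(parent)  # collapse the broken link
        return parent

    for term in sorted(in_subset):
        print(term, "->", collapsed_broader(term))
    # "park and ride" collapses past the excluded "fringe parking" to "parking"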

Decentralized Service

Figure 9 illustrates a decentralized service that would be decoupled from (or only loosely coupled to) the broader thesaurus infrastructure. In this model, all or selected Global terms would be downloaded to a microthesaurus site. That site would have its own thesaurus management system, tools, and processes. No coding would be required to identify that a Global term is to be included in the microthesaurus. Exceptions to relationships would be handled by editors so that 1) broken hierarchical relationships would be rationalized, 2) orphaned equivalent and associative relationships would be resolved, and 3) relationships would be added for any new more specific terms. If more specific terms are outside the scope of the Global Terminology, then they would not need to be submitted as candidate terms. Instead, the Global Environment would maintain a catalog or registry of microthesauri.


Figure 9 Decentralized Microthesaurus/Subset Model

It would also be valuable to maintain a catalog of external terminologies for proper names from authoritative sources, such as Dun & Bradstreet for company names or the Library of Congress for authors, organizations, conferences, and other named entities.

The decentralized model is not only applicable to microthesaurus building, but also to distributing the work of identifying, building, and maintaining terminology subsets as a general vocabulary management model.

Community-Based Vocabulary Management Model

This section describes a model for how a community-based terminology management model could work.

What Does a Community-Based Model Look Like?

A community-based model for terminology management delegates responsibility for the creation and maintenance of terms and groups of terms to distributed responsible organizations. The central office functions as the overall terminology editor and coordinates the distributed effort by

  • Managing the terminology management environment.
  • Identifying who will be responsible for what term subsets.
  • Routing candidate term requests to distributed editors based on expertise, volume or other criteria.
  • Communicating editorial policies and answering editorial questions.
  • Providing training to distributed editors.
  • Communicating additions and changes to subscribing users.
  • Coordinating communication with the community at large.

Conclusions

This paper has provided an overview of the current information management landscape in the context of broader information management trends and developments. Throughout, the paper provides many examples of best practices that may be adapted for improving business information management, such as guidance for file formats, naming conventions, and information preservation strategies. We have discussed the convergence of data and content in this landscape, which is manifested most visibly in the combination of data and content from multiple sources on the World Wide Web. While data.gov is leading to wider publication of and access to raw data sets, rich categorization schemes that gather and map the semantics of business information should facilitate the interchange, combination and use of heterogeneous information sources.