An Introduction to Metadata
Chris Taylor
Revised 1 April 1999
1. What is Metadata?
Metadata is structured data which describes the characteristics of a resource. It shares many characteristics with the cataloguing that takes place in libraries, museums and archives. The prefix "meta" comes from the Greek and is used to denote something of a higher order or more fundamental kind. A metadata record consists of a number of pre-defined elements representing specific attributes of a resource, and each element can have one or more values. Below is an example of a simple metadata record:
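Title: An Introduction to Metadata
Creator: Taylor, Chris
Publisher: University of Queensland Library
Date: 1999-04-01

Here the resource being described is this document itself: each pre-defined element name is paired with a value drawn from the resource.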
Each metadata schema will usually have the following characteristics: a limited number of elements; the name of each element; and the meaning of each element.
Typically, the semantics are descriptive of the contents, location, physical attributes, type (e.g. text or image, map or model) and form (e.g. print copy, electronic file). Key metadata elements supporting access to published documents include the originator of a work, its title, when and where it was published and the subject areas it covers. Where the information is issued in analogue form, such as print material, additional metadata is provided to assist in the location of the information, e.g. the call numbers used in libraries. The resource community may also define some logical grouping of the elements or leave it to the encoding scheme. For example, Dublin Core may provide the core to which extensions may be added.
Some of the most popular metadata schemas include Dublin Core; AACR2 (Anglo-American Cataloguing Rules), which underpins library catalogues; GILS (Government Information Locator Service); and EAD (Encoded Archival Description).
While the syntax is not strictly part of the metadata schema, the metadata will be unusable unless it is encoded in a way that preserves the semantics of the schema and allows the record to be processed by a computer program. Important encoding schemes include HTML (HyperText Markup Language), XML (eXtensible Markup Language) and RDF (Resource Description Framework).
Metadata may be deployed in a number of ways: the simplest is for Web page creators to add the metadata as part of creating the page; metadata may instead be created directly in a database and linked to the resource; and, increasingly, it is being created by an agent or third party, particularly to develop subject-based gateways. These options are discussed further in section 9.
2. What is a search engine?
In a nutshell, search engines, such as Alta Vista and HotBot, consist of a software package that crawls the Web, then extracts and organises the data in a database. People can then submit a search query using a Web browser; the search engine locates the appropriate data in the database and displays it via the browser. Search engines are not to be confused with directories, such as Yahoo, which provide subject lists created by humans and which must be browsed. Search engines have three major elements: the spider (also called the crawler or robot), which visits Web pages, reads them and follows their links to other pages; the index, a database holding a copy of the data the spider extracts from every page it finds; and the search software, which sifts through the index to find matches for a query and ranks them in order of relevance.
Search engine software is also available to run on a local Web site. The software has the same basic components, but the spider just visits the local site or a limited number of sites in a community.
3. Why isn't an Internet search engine good enough?
The problem relates to the underlying nature of the World Wide Web. In the early 1990s, "surfing" the World Wide Web was popularised in the mass media. These days, the Web has grown far too large to be explored by browsing alone. The Web has become a two-edged sword: it is now very easy to publish information, but it is becoming more difficult to find relevant information [EC, p.4]. For outsiders and casual users, much of the useful material is difficult to locate and therefore is effectively unavailable [DC1, p.2].
At the global level, Internet search engines were developed to search across multiple Web sites. Unfortunately, these search engines have not been the panacea that some people had hoped for. Every search engine will give you good results some of the time and bad results some of the time, a pattern information scientists describe in terms of "high recall" and "low precision". High recall refers to the well known (and frustrating) experience of using an Internet search engine and receiving thousands of "hits"; it is popularly known as information overload. Low precision refers to the failure to locate the most useful documents among them. The search engine companies do not view the high hit rates as a problem. Indeed, they market their products on the basis of their coverage of the Web, not on the precision of the search results.
The Working Group on Government Information Navigation outlined a number of problems with Internet search engines.
The introduction of the <META> element as part of HTML coding was, in part, an attempt to encourage search engines to extract and index more structured data, such as a description and keywords. However, search engine support for <META> tags is patchy. It ranges from no support at all (Excite, Northern Light and Web Crawler), to some support by HotBot, support for older-style encoding by AltaVista, and reasonable support by Infoseek. Details are available from Search Engine Watch [SEW]. None currently supports metadata schemas. It is the proverbial chicken-and-egg situation: Web page authors and publishers do not invest in providing metadata if the indexing services do not utilise it, and harvesters do not collect metadata if there is not enough data available. The other problem is the malicious "spoofing" of search engines, making them return pages that are irrelevant to the search at hand or pages that rank higher than their content warrants.
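By way of illustration, an author wanting to supply this more structured data to the engines that do read <META> tags would add lines like the following to a page (the values here are invented for this document):

<META NAME="description" CONTENT="An introduction to metadata and its role in resource discovery on the Web">
<META NAME="keywords" CONTENT="metadata, Dublin Core, resource discovery, search engines">

A spoofed page simply loads these same tags, and the page body, with popular but irrelevant terms.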
Support for <META> tags by search engines designed for local Web servers also varies from non-existent to good. Commercial software includes Netscape Compass (used for the University of Queensland Web site), Verity Search97, Infoseek Ultraseek and Microsoft Site Server 3.0. While these products do not offer native support for Dublin Core, they have been implemented at many sites to underpin metadata searching.
A number of Australian products have entered the marketplace. MetaWeb, developed by the DSTC as part of the MetaWeb project [MetaWeb], and its successor HotMeta, developed for the Queensland Government, both offer native support for Dublin Core. Various Australian gateways are using or are planning to use these products, including MetaChem, AVEL and the Australian Digital Theses Project. KE Express is being configured to support an extended Dublin Core metadata schema for Agrigate, the agriculture information gateway.
4. Why use metadata?
The foregoing section has discussed the inadequacy of search engines in locating quality information resources. How does metadata solve the problem? A more formal definition of metadata offers a clue:
Metadata is data associated with objects which relieves their potential users of having full advance knowledge of their existence or characteristics. [DESIRE, p.2]
Information resources must be made visible in a way that allows people to tell whether the resources are likely to be useful to them. This is no less important in the online world, and in particular on the World Wide Web. Metadata is a systematic method for describing resources and thereby improving access to them. If a resource is worth making available, then it is worth describing with metadata, so as to maximise the ability to locate it.
Resource description is important because good descriptions of information resources are the most important determinant of whether people will find what they are looking for. [IMSC, chapter 6, p.8]
Metadata provides the essential link between the information creator and the information user.
While the primary aim of metadata is to improve resource discovery, metadata sets are also being developed for other reasons, including documenting the terms and conditions for the use of a resource (rights management), supporting the administration and archiving of resources, and rating content, for example to protect children from unsuitable material.
While this document concentrates on resource discovery and retrieval, these additional purposes for metadata should also be kept in mind.
5. Which Metadata schema?
There are literally hundreds of metadata schemas to choose from and the number is growing rapidly, as different communities seek to meet the specific needs of their members.
Recognising the need to define a simple metadata record that could sufficiently describe a wide range of electronic documents, the Online Computer Library Center (OCLC), of which the University of Queensland Library is currently the only full member in Australia, joined with the National Center for Supercomputing Applications (NCSA) to sponsor the first Metadata Workshop in Dublin, Ohio, in March 1995 [DC1]. The primary outcome of the workshop was a set of 13 elements (subsequently increased to 15) named the Dublin Metadata Core Element Set, known as Dublin Core. Dublin Core was proposed as the minimum number of metadata elements required to facilitate the discovery of document-like objects in a networked environment such as the Internet.
Below is a summary of the elements in Dublin Core. The metadata elements fall into three groups which roughly indicate the class or scope of information stored in them: (1) elements related mainly to the content of the resource, (2) elements related mainly to the resource when viewed as intellectual property, and (3) elements related mainly to the physical manifestation of the resource.

Content: Title, Subject, Description, Type, Source, Relation, Coverage
Intellectual property: Creator, Publisher, Contributor, Rights
Instantiation: Date, Format, Identifier, Language
A description of each element is given in Appendix 1. Below is an example of a Dublin Core record for a short poem, encoded as part of a Web page using the <META> tag:
<HTML>
<HEAD>
<TITLE>Song of the Open Road</TITLE>
<META NAME="DC.Title" CONTENT="Song of the Open Road">
<META NAME="DC.Creator" CONTENT="Nash, Ogden">
<META NAME="DC.Type" CONTENT="text">
<META NAME="DC.Date" CONTENT="1939">
<META NAME="DC.Format" CONTENT="text/html">
<META NAME="DC.Identifier" CONTENT="http://www.poetry.com/nash/open.html">
</HEAD>
<BODY>
I think that I shall never see
A billboard lovely as a tree.
Indeed, unless the billboards fall
I'll never see a tree at all.
</BODY>
</HTML>
The <META> tag is not normally displayed by Web browsers, but can be viewed by selecting "Page Source".
In addition to the 15 elements, three qualifying aspects have been accepted to enable the Dublin Core to function in an international context and also meet higher-level scientific and subject-specific resource discovery needs. These three Dublin Core qualifiers are LANG, which specifies the language of the element's value; SCHEME, which identifies the controlled vocabulary or formal standard from which the value is drawn; and TYPE (sub-element), which refines the meaning of the element.
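In HTML 4.0 encoding, the first two qualifiers map naturally onto attributes of the <META> tag. A sketch of how qualified elements might look (the values are invented for illustration):

<META NAME="DC.Date" SCHEME="ISO8601" CONTENT="1999-04-01">
<META NAME="DC.Subject" SCHEME="LCSH" CONTENT="Metadata">
<META NAME="DC.Title" LANG="fr" CONTENT="Introduction aux métadonnées">

The SCHEME value tells software (and people) that the date follows the ISO 8601 standard and that the subject term is drawn from the Library of Congress Subject Headings.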
6. Why Dublin Core?
The Dublin Core metadata schema offers the following advantages: simplicity, so that non-specialists can create useful records; commonly understood semantics; an international, cross-disciplinary consensus behind the element set; and extensibility, allowing the core to be augmented with additional elements for specialist needs.
Dublin Core has received widespread acceptance amongst the electronic information community and has become the de facto Internet metadata standard [AGLS, p.3]. The Resource Discovery Unit of the Distributed Systems Technology Centre (DSTC), based at The University of Queensland, is an acknowledged world leader in metadata research and deployment. It strongly supports the use of Dublin Core to improve resource discovery and delivery.
To date, the depth of implementation in individual sectors has been patchy. In Australia, much activity has taken place in the government sector, under the auspices of the Government Technology and Telecommunications Committee (GTTC). Dublin Core has been formally accepted as the standard for both the Australian Government Locator Service [AGLS] and the Queensland Government [DCILGP].
7. Which elements, sub-elements and schemes should I use?
There is no simple answer to this question. At a fundamental level, it becomes a compromise based on the needs of the intended users, the nature of the resources being described, and the time and expertise available to create and maintain the metadata. The bottom line is that a simple description is better than no description at all, as long as it can aid in the consistent discovery of resources.
The level of specificity in resource description is also important. The resources can be described individually or at a collection or aggregate level. It would be practically impossible to provide guidelines as to the appropriate level of specificity. Cataloguing librarians have been arguing the toss for years without reaching a consensus. Again, we should think in terms of what the University's customers want access to. As noted above, with the major search engines, it is possible to have too many records, such that our customers can't see the forest for the trees. Initially, it would be sensible to allow the creators to determine which resources deserve their own record. If a collection-level record is used, it is important to add as much information as possible to ensure appropriate retrieval.
Acting on customer feedback is also important. Monitoring the search terms input by customers is a well-proven technique for improving the quality and coverage of a database. The downside is that the assessment process is essentially a manual one.
8. What about using controlled terminology?
Consistent use of language within metadata descriptions can aid in the consistent discovery of resources. The primary tool for ensuring consistent language is a controlled vocabulary, such as a thesaurus. A number of metadata elements would benefit from controlled values, as the sketch below shows.
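To see the difference, compare a free-text subject value with one drawn from a controlled vocabulary (both lines are invented for illustration):

<META NAME="DC.Subject" CONTENT="jobs">
<META NAME="DC.Subject" SCHEME="LCSH" CONTENT="Employment">

Searchers who use the preferred term, or whose query is mapped to it, will retrieve every record described with it; free-text values scatter the same concept across "jobs", "work", "employment" and so on.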
There are many subject thesauri available. However, most are designed for specialist resource communities. For example, the Edinburgh Engineering Virtual Library (EEVL) originally selected the Engineering Information thesaurus, but decided that it was too complex for the purpose. Instead they developed a modified version to suit their specific needs.
Ultimately, as the AGLS Metadata User Manual notes, "… a common sense, author-based approach is still effective and yields a high return to agencies." [AGLS1, p.4].
In the absence of a suitable subject thesaurus, some may be tempted to create one from scratch. This temptation is to be resisted at all costs. History is studded with failed attempts at developing new thesauri. It is like establishing a small business: people don't seem to understand that starting is easy; finding the resources to keep the thesaurus current is the real trick. Keeping a thesaurus up to date requires a huge investment in resources that is very difficult to justify.
While not strictly a metadata issue, the mismatch between input and index terms has proven to be a major problem in retrieval from databases, particularly as a result of semantic problems, such as different spellings, singular and plural forms, etc. Although the basic query interfaces for search engines seem similar, there are important differences that affect the outcome of the search. For example, the query 'Mabo Legislation' could be interpreted by different engines as requesting resources that contain: the exact phrase 'Mabo Legislation'; both 'Mabo' and 'Legislation', anywhere in the document; or either 'Mabo' or 'Legislation'.
Obviously, these three different interpretations will produce different sets of results. Search engines differ in whether queries are case sensitive and how they handle singular versus plural forms of a word. Alternative spellings, for example, labour and labor, may have to be searched separately. The same applies to abbreviations, such as dept and department. This disconcerts the naive user and annoys the experienced user. The solution is a common query interface, or an intermediate query engine which takes a standard query and translates it into the specific forms required by the site search engine. This approach has been implemented at the University of Queensland Web site to improve access to its services and facilities.
9. Where will the metadata be stored?
Metadata may be deployed in a number of ways.

The simplest method is to ask Web page creators to add the metadata as part of creating the page. To support rapid retrieval, the metadata should be harvested on a regular basis by the site robot. This is currently by far the most popular method for deploying Dublin Core. An increasing range of software is being made available to assist in the addition of metadata to Web pages.

Creating metadata directly in a database and linking it to the resource is growing in popularity as an activity independent of the creation of the resources themselves. Increasingly, metadata is being created by an agent or third party, particularly to develop subject-based gateways. The University of Queensland Library is involved in a number of gateway projects: Agrigate, MetaChem and AVEL (as lead site).
10. Which encoding scheme?
For metadata attached to Web pages, there are currently two options: HTML 3.2 or HTML 4.0. While 4.0 is not yet as widely used as 3.2, it is no more complex and its encoding is far easier for a computer to manipulate. It therefore seems sensible to go with 4.0.
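As a sketch of the difference (the date value is invented), HTML 3.2 has no SCHEME attribute, so one common convention carries the qualifier inside CONTENT, while HTML 4.0 expresses it as a separate, machine-readable attribute and can declare the schema with a <LINK> element:

HTML 3.2:
<META NAME="DC.Date" CONTENT="(SCHEME=ISO8601) 1999-04-01">

HTML 4.0:
<LINK REL="schema.DC" HREF="http://purl.org/dc">
<META NAME="DC.Date" SCHEME="ISO8601" CONTENT="1999-04-01">

A program reading the 4.0 version can recognise the schema and the encoding scheme without having to parse conventions out of the element's value.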
XML is predicted to be the "next big thing" in Web development. RDF/XML is still under development [Iannella], but it promises much, including a common framework within which multiple metadata schemas can coexist; semantics that can be processed by machines, not just read by humans; and easier interchange of metadata between applications.
Its major drawback is that the full version cannot be embedded in HTML. The abbreviated version can be embedded in HTML, but few editors are available that support the creation of RDF.
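A minimal sketch of the abbreviated RDF syntax, describing the poem from section 5 (the namespace URIs follow the current W3C drafts and may change):

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:dc="http://purl.org/dc/elements/1.0/">
<rdf:Description about="http://www.poetry.com/nash/open.html"
  dc:title="Song of the Open Road"
  dc:creator="Nash, Ogden"
  dc:date="1939"/>
</rdf:RDF>

Here each Dublin Core element becomes an XML attribute on a single Description element; the full syntax would express each element as a child element instead.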
For metadata contained within a database, the encoding scheme is a lesser issue. What is important is its interoperability with other database schemas, to support cross-database searching and the sharing of metadata records.
In the context of Web indexing, there are currently two Webs in existence. The first is the "visible" Web, made up of static Web pages that can be harvested and indexed. The second is the "invisible" Web, made up of dynamic pages generated from a database. These pages can't be directly harvested by a robot and indexed; the records have to be exported from the database, which is not always a trivial matter. Even if they could all be harvested, the amount of data in a single, centralised database would be unmanageable.
One option is to interrogate multiple databases at the same time. There are proprietary systems that can do this, usually at great expense. Individual systems can also talk to one another if they conform to the US National Information Standards Organization (NISO) Z39.50 protocol [NISO]. The Z39.50 protocol for distributed information retrieval supports the searching of disparate databases, either singly or in combination, regardless of proprietary interfaces. Z39.50 supports a number of "profiles" in order to enable translation between the various databases. Recognising the value of searching distributed databases via Z39.50, work is taking place to create a Dublin Core profile [LeVan].

Unfortunately, most databases underpinning the Internet and local search engines do not support Z39.50. From a Z39.50 gateway, metadata could be searched separately or in combination with any number of other Z39.50 databases, including library catalogues and, increasingly, other metadata repositories. This scenario greatly expands structured resource discovery. HotOil, from the DSTC, is one product that supports Z39.50 searching.
11. How does one create metadata?
The more easily the metadata can be created and collected at the point of creation of a resource or at the point of publication, the more efficient the process and the more likely it is to take place. There are many such tools available and the number continues to grow, which makes it difficult to recommend specific tools. Examples include Reggie, the metadata editor developed by the DSTC; DC-dot, the Dublin Core generator from UKOLN; and the Nordic Metadata Project's Dublin Core template.
Ideally, metadata should be created using a purpose-built tool, with the manual creation of data kept to an absolute minimum. The tool should support the automatic extraction of as much data as possible from the resource itself, pick-lists for elements with controlled values, validation of the completed record, and output in the required encoding.
Much of the remaining data can be copied and pasted from the Web page. The resulting record may be inserted into a Web page or added to a metadata repository.
References

[AGLS] Australian Government Locator Service Implementation Plan: A Report by the Australian Government Locator Service Working Party (AGLS WG). December, 1997. http://www.aa.gov.au/AA_WWW/AGLSfinal.html
[AGLS1] The Australian Government Locator Service (AGLS) Manual for Users. Office of Government Information Technology and National Archives of Australia, July, 1998. http://www.ogit.gov.au/aglsindex.html
[DESIRE] Specification for resource description methods Part 1: A review of metadata: a survey of current resource description formats. Lorcan Dempsey and Rachel Heery. March, 1997. http://www.ukoln.ac.uk/metadata/desire/overview/
[IMSC] Management of Government Information as a National Strategic Resource. Report of the Information Management Steering Committee on Information Management in the Commonwealth Government, September, 1997. http://www.ogit.gov.au/publications/IMSC/executiv.htm
[SEW] Search Engine Watch. Search Engine Features Comparison. 1998. http://searchenginewatch.com/webmasters/features.html
[UQ] UQ Web Site Enhancement Project: Report. University of Queensland Web Site Enhancement Project Team (unpublished report). July, 1998.
Acknowledgements

I would particularly like to acknowledge the expert advice and contributions from the following people:
Appendix 1: Dublin Core Metadata schema
© University of Queensland Library