4.1.1 More about Metadata

By mchoate
Last modified: 2008-02-13 14:30:25

Update: See Reuters' Calais Semantic Web API, which is a step in the right direction.

Metadata is data about data: information about information that helps humans and computers find and understand it. Let’s say you type the word “Ford” into Google. When you do this, Google has no way to know whether you are looking for information about Ford Motor Company, cars, farm equipment, former President Gerald Ford, or any other topic that could conceivably make use of the word “Ford”. Even if you add more keywords and type in “President Gerald Ford”, Google cannot tell the difference between an article that is about Gerald Ford and one that is about Richard Nixon but contains a quote from Gerald Ford. That is, it can’t tell the difference unless the document contains appropriate metadata.
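To make the distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the Document class, its "subjects" field, and the two sample articles are invented, and no real search engine is claimed to work this way. It only shows why a full-text match cannot separate "about Gerald Ford" from "mentions Gerald Ford", while a subject field can.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    text: str
    # Hypothetical metadata field: what the document is actually about.
    subjects: set = field(default_factory=set)


# Invented sample articles, for illustration only.
docs = [
    Document(
        title="Ford takes office",
        text="Gerald Ford was sworn in as President today.",
        subjects={"Gerald Ford"},
    ),
    Document(
        title="Nixon resigns",
        text="Richard Nixon resigned today. Gerald Ford commented on the decision.",
        subjects={"Richard Nixon"},
    ),
]


def keyword_search(documents, phrase):
    """Brute-force full-text match: finds every document that mentions the phrase."""
    return [d.title for d in documents if phrase.lower() in d.text.lower()]


def subject_search(documents, subject):
    """Metadata match: finds only documents labeled as being about the subject."""
    return [d.title for d in documents if subject in d.subjects]


print(keyword_search(docs, "Gerald Ford"))  # both articles match
print(subject_search(docs, "Gerald Ford"))  # only the article about Ford matches
```

The keyword search returns both articles because both contain the name; the subject search returns only the first, because someone (or something) recorded what each article is about.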

If all of this terminology is confusing and you are wondering what exactly one would do with an RDF editor, I recommend visiting O’Reilly’s XML.com and reading the following articles:

What is RDF?

http://www.xml.com/pub/a/2001/01/24/rdf.html

"The most common daily use of metadata is to aid our discovery of things. But there are lots of other uses going on behind the scenes. The library and video store are storing other metadata that you don't see: how often the books and videos are being used, how much it cost to buy them, where to go for a replacement, etc. Running a library or a video store would be unthinkable without metadata. Similarly, the phone company, of course, uses its metadata, most obviously to print the Yellow Pages, but for many other internal management and administration tasks.
What About the Web?
The Web is a lot like a really really big library. There are millions of things out there, and if you know the URL (in effect a kind of call number) you can get them. Since the Web has books, movies, and pizza joints, the number of ways you might want to look things up includes all the things a library uses, plus all the things the video store uses, plus all the things the Yellow Pages use, and lots more.
The problem at the moment is that there is hardly any metadata on the Web. So how do we find things? Mostly by using dumb, brute force techniques. The dumb, brute force is supplied by the wandering web robots of search engine sites like Altavista, Infoseek, and Excite. These sites do the equivalent of going through the library, reading every book, and allowing us to look things up based on the words in the text. It's not surprising that people complain about search results, or that the robots are always way behind the growth and change of the Web.
In fact there is one metadata-based general purpose lookup facility: Yahoo! Yahoo doesn't use a robot. When you search through Yahoo, you're searching through human-generated subject categories and site labels. Compared to the amount of metadata that a library maintains for its books, Yahoo! is pitiful; but its popularity is clear evidence of the power of (even limited) metadata."

--Tim Bray

What Are Topic Maps?

http://www.xml.com/pub/a/2002/09/11/topicmaps.html

"Many years ago, I started looking into SGML and XML as a way to make information more manageable and findable, which was something I had been working on for a long time. It took me several years to discover that, although SGML and XML helped, they did not actually solve the problem. Later I discovered topic maps, and it seemed to me that here was the missing piece that would make it possible to really find what you were looking for. This article is about why I still think so.
What Topic Maps Do
When XML is introduced into an organization it is usually used for one of two purposes: either to structure the organization's documents or to make that organization's applications talk to other applications. These are both useful ways of using XML, but they will not help anyone find the information they are looking for. What changes with the introduction of XML is that the document processes become more controllable and can be automated to a greater degree than before, while applications can now communicate internally and externally. But the big picture, something that collects the key concepts in the organization's information and ties it all together, is nowhere to be found."

--Lars Marius Garshol

The Opportunity

All of this adds up to an important opportunity. According to Tim Bray, “People who have thought about these problems, including many librarians and webmasters, generally agree that the Web urgently needs metadata.” However, despite the obvious need, there are very few tools on the market to help in the creation of metadata. In a recent paper published on the website of the Metadata Generation Research Project, an initiative of the University of North Carolina’s School of Information and Library Science, Abe Crystal concludes:

For decentralized metadata creation to become a reality, better designs are needed to reduce users’ cognitive load and lower barriers to efficient metadata entry. We see many opportunities, which can be organized in three themes: integration, filtering/flagging, and context.

In other words, we need tools for metadata creation that are much easier to use.

Crystal came to this conclusion after observing six scientists at the National Institute of Environmental Health Sciences (NIEHS) as they attempted to create “Dublin Core” metadata using a general web application (“Dublin Core” refers to a standard set of commonly used metadata elements). The result? “Frequent backtracking and deviations from the expected linear progression through the application” which suggested “uncertainty or confusion”. While the participants were obviously intelligent, the wording used to describe metadata apparently confused them. For example, one scientist entered “Microsoft Word” as the value of the “source” element, rather than a URL or ISBN, as was expected. More amusingly, another scientist misinterpreted the instruction “enter one per line” on a multiline field and “entered the abbreviation ROC (for Report on Carcinogens) as R | O | C - one letter on each line.” (This reminds me of the time I was taking a test administered by computer. At the beginning of the test, the student was prompted to “press any key” to start. After a few moments, the woman behind me raised her hand and said, “I can’t find the ‘any’ key!”)
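The two mistakes described above are the kind of thing an entry form could flag as the user types. The following Python sketch is only an illustration of that idea, not anything taken from Crystal's paper or from the application the scientists used. The function names and the heuristics (what counts as a plausible "source", what counts as a suspicious multiline entry) are assumptions, and the ISBN check looks only at the shape of the value, not its checksum.

```python
import re
from urllib.parse import urlparse


def looks_like_source(value):
    """Heuristic check for a Dublin Core "source" value.

    Assumption: the form expects an http(s) URL or an ISBN-10/ISBN-13.
    """
    value = value.strip()
    parsed = urlparse(value)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return True
    # Drop an optional "ISBN" prefix, then spaces and hyphens.
    candidate = re.sub(r"^ISBN[:\s]*", "", value, flags=re.IGNORECASE)
    candidate = re.sub(r"[\s-]", "", candidate)
    return re.fullmatch(r"\d{9}[\dXx]|\d{13}", candidate) is not None


def parse_multiline(raw):
    """Split an "enter one per line" field into entries.

    Also returns a flag that is True when every entry is a single
    character, which suggests the instruction was misread.
    """
    entries = [line.strip() for line in raw.splitlines() if line.strip()]
    suspicious = len(entries) > 1 and all(len(e) == 1 for e in entries)
    return entries, suspicious


print(looks_like_source("Microsoft Word"))
print(looks_like_source("http://www.xml.com/pub/a/2001/01/24/rdf.html"))
print(parse_multiline("R\nO\nC"))
```

A form built along these lines would ask "did you mean ROC?" or "this does not look like a URL or ISBN" at the moment of entry, rather than leaving the error to be found later.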

Abe Crystal writes, “Key areas of current research rely heavily on effective metadata interfaces. In particular, implementation of digital libraries and Semantic Web applications will depend heavily on decentralized metadata creation, which implies the need for highly usable metadata systems.” While some tools exist that attempt to improve integration of metadata harvesting, he describes those tools as “fairly crude, and...[not] integrated with authoring interfaces.”

The complete paper can be found at: http://www.ils.unc.edu/%7Ejaneg/mgr/pubs.html

While Abe Crystal points to the need for an effective metadata creation tool in the sciences, media companies need such a tool, too.

Many newspapers use “controlled vocabularies” such as topic maps. The problem is that they all use different vocabularies, so it is very difficult to search effectively across publications. If standard topic maps are to be developed (as they most certainly will be), there will be a lot of work involved in converting legacy content. The News & Observer has electronic versions of news stories dating back to 1990, and they would all need to be updated. A related need comes from the fact that, in a typical newsroom workflow, most metadata is entered after publication (except for the obvious byline and similar information). Current workflows and software tools do not support the creation of metadata prior to publication, when it makes the most sense to do so. While some metadata is entered prior to publication, it is haphazard and the quality is poor. An ideal publication system for newspapers would include integrated support for metadata creation, embedded in the daily workflow.