Measuring The Semantic Web – Measurement Units

In the last post I introduced the concept from HyTime of a finite coordinate space (FCS) consisting of a number of axes each of which is tied to a measurement domain where the measurement domain might be physical or virtual. In this post we will take a look at Clause 9.2 of the standard which talks about how to define the units of measurement for a measurement domain.

Before we go there, let’s just pause a moment to think about the problem here. Take length as an example. For most folks going about their daily business, the metre is a good unit of length to use for most things – I’m 1.9m tall, my room is 3.5m by 2.5m. However, for some things we tend to prefer smaller length units, such as for the corner-to-corner size of my monitor (22″), the depth of my desk (40cm…I’m guessing that one), or the diameter of the jack that goes into my iPod (3.5mm). The folks that make the chips that power my PC work in nanometres, and cellular scientists probably dream in picometres. Physicists are the worst – depending on their speciality a picometre might be way too big or a kilometre way too small. If we want a measurement system that works, we need some way for all of these folks to express measurements in the units that make the most sense to them while still being able to map freely between them.

The HyTime answer to this is to provide a way to specify a measurement domain in terms of some basic unit of measurement, and then to define as many other units of measurement for that domain as are needed, each in terms of its ratio to the basic measurement unit. For physical measurements of space and time, HyTime provides definitions of measurement domains based on the Système International (SI) units of SI second and SI meter.

It’s probably easier to understand this if we take one of these definitions as an example (the following are all snippets from the text of Clause 9.2).

Firstly our measurement domain is defined in terms of a standard measurement unit (SMU). The standard measurement unit has an identifier, and this being HyTime and based on SGML, the identifier is a formal public identifier (FPI):

<!notation
   SImeter        -- Systeme International meter --
                  -- Reference unit of real length --

   PUBLIC "ISO/IEC 10744:1997//NOTATION
           Systeme International meter//EN"
>

If you come from the world of XML, just think of “ISO/IEC 10744:1997//NOTATION Systeme International meter//EN” as a funny-shaped URI and you’ll be doing fine. You can also think of this <!notation> tag as declaring a mapping from the short string “SImeter” to the longer identifier string. So this is essentially an SGML-flavoured namespace declaration.

Now we have a standard measurement unit defined and assigned an identifier we move on to use that identifier in a resource that describes our measurement domain and all of the different measurement units that can be used in it. In the HyTime specification each of these units is called a granule and a measurement domain can contain any number of granules. Every granule is defined as a multiple of some other granule in the same measurement domain (with at least one granule based on the SMU). The multiplier is expressed as a ratio of two numbers. These granule declarations are made in an SGML resource like this (I’ve left out many of the granules defined in the standard for brevity):

<measure smu=SImeter>
  <granule gn=um>                    1   1000 mm
  <granule gn=mm>                    1     10 cm
  <granule gn=cm>                    1     10 dm
  <granule gn=dm>                    1     10 meter
  <granule gn=meter>                 1      1 SImeter
  <granule gn=dam>                  10      1 meter
  <granule gn=hm>                   10      1 dam
  <granule gn=km>                   10      1 hm
  <granule gn=pica>                  1      6 inch
  <granule gn=barleycorn>            1      3 inch
  <granule gn=inch>                254    100 cm
</measure>

This being SGML, end-tags are optional, so if you are an XML-head you will have to imagine a </granule> inserted in the relevant places. For each granule, the granule’s magnitude is specified with two numbers and a reference to another granule. The two numbers give the ratio (numerator, then denominator) of one granule of the type being defined to the granule it references. So, a mm is 1/10 of a cm, which is in turn 1/10 of a dm, which is 1/10 of a meter, which is exactly 1 SImeter. Note that this principle works for more complex ratios such as 1 inch being 254/100 cm (i.e. 1″=2.54cm). In this way, the standard measurement unit of an SI meter (with its FPI) is used as a base for defining all units of length measurement in a way which allows a conversion between any two particular units of measurement.

To my mind, this is a really cool facility because it means that all units are defined by the mathematical relationship between them making it possible for a machine to simply compare measurements given in different units (indeed, it provides a means for a machine to determine *if* a pair of measurements can be converted).
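To make that concrete, here is a minimal Python sketch of how a machine might walk the granule chain. It is not anything defined by HyTime: the dictionary layout and the function names (in_smu, convert, convertible) are my own assumptions, and the table simply copies the ratios from the snippet above.

```python
from fractions import Fraction

# Granule table copied from the snippet above:
# name -> (numerator, denominator, granule the ratio refers to).
# The dictionary layout is an assumption made for this sketch.
GRANULES = {
    "um": (1, 1000, "mm"),
    "mm": (1, 10, "cm"),
    "cm": (1, 10, "dm"),
    "dm": (1, 10, "meter"),
    "meter": (1, 1, "SImeter"),
    "dam": (10, 1, "meter"),
    "hm": (10, 1, "dam"),
    "km": (10, 1, "hm"),
    "pica": (1, 6, "inch"),
    "barleycorn": (1, 3, "inch"),
    "inch": (254, 100, "cm"),
}
SMU = "SImeter"


def in_smu(granule, table=GRANULES, smu=SMU):
    """Return the exact size of one granule as a Fraction of the SMU.

    Raises KeyError if the chain of references never reaches the SMU.
    """
    ratio = Fraction(1)
    seen = set()
    while granule != smu:
        if granule in seen:
            raise KeyError(f"circular definition at {granule!r}")
        seen.add(granule)
        num, den, granule = table[granule]
        ratio *= Fraction(num, den)
    return ratio


def convert(value, source, target):
    """Convert a value from one granule to another in the same domain."""
    return Fraction(value) * in_smu(source) / in_smu(target)


def convertible(source, target):
    """True if both granules can be traced back to the same SMU."""
    try:
        in_smu(source)
        in_smu(target)
        return True
    except KeyError:
        return False
```

With this table, convert(22, "inch", "cm") comes out as Fraction(1397, 25), i.e. 55.88cm, and convertible("inch", "second") is False because "second" is not a granule in this domain. Passing integers or strings such as "1.9" keeps the arithmetic exact; a Python float would bring its binary rounding with it.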

When I published the first post in this series, Bob DuCharme gave me a pointer to an ontology for Quantities, Units, Dimensions and Types. This has a similar approach to defining units. I think that I prefer the HyTime approach of defining conversions as a ratio of integers, as it allows conversion factors that have no finite decimal representation to be stated exactly (cf. the definition of a barleycorn above as 1/3 of 1 inch). What the QUDT ontology also does, which is extremely cool, is allow the definition of what HyTime calls “measurement domains” as a combination of other measurement domains. For example, a measurement domain for speed or acceleration could be defined in terms of the measurement domains for length and time.
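To see why an integer ratio helps with a factor like 1/3, here is a tiny Python comparison. It is only an illustration: the four-decimal-place truncation is an arbitrary choice of mine and is not taken from QUDT or HyTime.

```python
from decimal import Decimal
from fractions import Fraction

exact = Fraction(1, 3)        # barleycorn in inches, as a ratio of integers
rounded = Decimal("0.3333")   # the same factor truncated to four decimal places

exact_total = 3 * exact       # three barleycorns, measured in inches
rounded_total = 3 * rounded   # the same sum using the truncated factor
```

Here exact_total is exactly 1 inch, while rounded_total falls just short of it, and that small error would compound with every further conversion step.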

Another interesting piece of work is this part of the Morfeo project. As with the QUDT ontology, it allows derivation of measurement domains. The wiki page I have linked to also discusses the problems of attaching measurement units to RDF property values, describing 4 different patterns with their pros and cons. For a topic map practitioner, these are instructive if we consider that an (unscoped) occurrence with a data resource is functionally equivalent to an RDF property with a literal value.

In the next post, I’ll try to start pulling together some thoughts on a Topic Maps ontology for defining measurement units and a pattern for expressing measurements.

Measuring Out The Semantic Web

Introduction

In his closing keynote at this year’s TMRA conference (you weren’t there? you should have been!), Steve Newcomb made reference to the wonders of ISO/IEC 10744, or HyTime to its friends.

HyTime is a monster standard – it is complex and so difficult to implement in its entirety that I believe only one person has ever tried. That said, the standard contains so much that is useful and generally required for a functioning Web, that many of its pieces got cannibalized, stripped down and turned into hacker-friendly W3C “standards” – XLink. Just as XML owes its very existence to SGML, so XLink and SMIL both need to look back to HyTime as an ancestor (albeit one that never gets invited to the family parties). Anyway, Steve talked about how there is still much left in HyTime that could be useful and in particular picked out Clause 9 – Scheduling as one such piece. IMHO he is right on the money.

The Problem Statement

One of the biggest problems on the semantic web or linked data web is that we have no way to communicate measurements or positions using a grounded algorithm. There is no way that a fully conformant Topic Maps processor, RDF processor or OWL reasoner can tell that a given property is actually a point on some axis. And there is no way that these processors could convert from one axis type to another (say feet to metres). To do all this with current technology you need to bake into your application some ontology-specific knowledge – something that tells you “Values of property X are always expressed in millimetres, and values of property Y are always expressed in seconds since midnight on 1st January 1970”.

It is staggering, when you think about it, to realise that we can’t do this yet on the Semantic Web. It’s even more staggering to see bold statements being made for the “Web of Linked Data” without addressing the basic problem of “How do I know what units this data is expressed in?”. I believe that this is where the carcass of HyTime can be picked over once more :-)

What I’m going to attempt

I have decided to go back to the HyTime standard and see what can be taken away from it for the benefit of those currently struggling with merging, comparing and meaningfully transforming Linked Data. HyTime is a massive spec, and as the Readers Guide To HyTime recommends, I’m not going to even attempt to read all 450 pages but will instead focus just on Clause 9. This part of the HyTime specification deals with the issue of describing the positioning and extent of an object in N-dimensional space and provides a mechanism for defining the units used for measuring along each axis.

My hope is that this facility can be used not only for defining the size and position of objects but also as a general purpose facility for expressing measurements of all kinds, and that is going to be my closed set of problems to address:

  1. Specifying a measurement with a value and units in a way that allows an application to automatically compare and convert measurements that use different scales or units.
  2. Specifying a location and/or extent of an object in N dimensions where each dimension has its own associated measurement domain.

Because this will probably take some time, and because I know you are too busy to read a three-screen-long thesis, and because I would actually value feedback as I go along, I’m going to break this exploration up into a number of separate posts. I’ll create a Category to group all the posts together as I go. Please feel free to hit up the links above and dive in there, and stick around for the next post where I’ll start to actually read this stuff.

What Do You Think ?

I would be interested in what others think about this. Do you know of some other existing ontology for measurements ? Have you tried to do something similar yourself (and if so what were your experiences) ? What chapter did you get to in the HyTime spec ? All comments, suggestions and peanuts from the gallery are welcome in the comments!

Announcing Metatribble – semantic annotations for web pages

It’s a pleasure to announce the first release of Metatribble, an attempt to implement some of the concepts of semantic annotation that I talked about in a previous post. Metatribble is currently packaged as a Ubiquity command and right now doesn’t do an awful lot except mark up interesting entities with RDFa, but even just that is kinda cool :-)

To this stage the project has been a collaboration between myself and Inigo Surguy, but making use of the tremendous work done by Jeni Tennison on the rdfquery plugin for jQuery – we are truly standing on the shoulders of giants here. We would welcome any comments/suggestions/offers of help.

Links to the project and its source code and the installer for Ubiquity can all be found on the Metatribble project homepage.

BTW, we are designating this release a zeta: because the concept of “beta” has been somewhat devalued by beta-ware such as Gmail, we feel we need to wrap around the alphabet.

rdfQuery + OpenCalais + Cloud Storage = Personal Knowledge Base ?

Last night’s Oxford SWiG meeting was interesting and sociable as usual. There were three great presentations – Jeni Tennison on rdfQuery, a jQuery-like Javascript library for parsing, querying and generating RDFa markup; Iain Emsley presented a WP plug-in that creates RDF graphs for blog posts showing a nice use of multiple ontologies; Laurian Gridinoc talked about the plans for PowerMagpie – with lots of ideas for navigation/presentation of large taxonomies and complex ontologies.

As usual though the real action was in the pub. One of the things we got to discussing was whether rdfQuery could be used to create stand-off markup on someone else’s content. Inigo Surguy pointed out that using tools such as Greasemonkey it should be pretty easy to get rdfQuery to scrape a page for the RDFa it contains and to add custom scripts to do something cool with that data. The problem comes when trying to persist any new RDF statements you might create. RDFa is a syntax for embedding RDF within HTML – so if you are in control of the page that you are adding the markup to, it is trivial to persist that markup simply by saving the modified file. If you are not in control of the page then you have some problems. The easy case is when the publisher of the page has already identified things that you might want to talk about and wrapped them in some RDFa. In this case you can simply add some more statements about those entities. What is harder is if the publisher of the page hasn’t marked up anything with RDFa. What is needed is a “bootstrap” mechanism to locate entities that you might want to talk about.

That is where OpenCalais comes in. The OpenCalais service takes content and locates entities within it, returning the content with markup added that identifies the entities within. Using some custom code interfacing to rdfQuery, it should be possible to turn the results from OpenCalais into RDFa, then you can do all the funky stuff you want with the RDF and serialize it to some persistent store (either on another web service such as the Talis platform or maybe to a local persistence mechanism such as Gears). Now, when you return to the page, your script again goes to OpenCalais to get the entities identified within and again turns this into RDF, but now you can smoosh in the RDF from your persistent store to retrieve all that cool markup you added.

What’s even better is that because OpenCalais has unique identifiers for the entities it recognizes, if you then visit another page that contains a reference to the same entity you should be able to pull in your extra markup automatically. I’m pretty sure that with this approach it should be possible to build up a personal knowledge store that can be merged into web pages as you view them. Combine that with some clever JavaScript to present the information and to allow you to extend the set of statements in the store, and you have something really rather cool.
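As a very rough sketch of the shape of that idea (in Python rather than browser JavaScript, purely for brevity), the flow might look like the following. Everything here is hypothetical: extract_entities is a stand-in for an entity extraction service and does not call OpenCalais, the example.org URIs are made up, and AnnotationStore is an in-memory substitute for whatever persistent store would really be used.

```python
from collections import defaultdict

# Hypothetical stand-in for an entity extraction service such as OpenCalais.
# A real service would be called over HTTP; here a fixed lookup table maps
# entity names to made-up URIs.
KNOWN_ENTITIES = {
    "HyTime": "http://example.org/entity/hytime",
    "Oxford": "http://example.org/entity/oxford",
}


def extract_entities(page_text):
    """Return the set of entity URIs whose names occur in the page text."""
    return {uri for name, uri in KNOWN_ENTITIES.items() if name in page_text}


class AnnotationStore:
    """In-memory stand-in for a persistent store of (subject, predicate, object) statements."""

    def __init__(self):
        self._by_subject = defaultdict(set)

    def add(self, subject, predicate, obj):
        """Record one statement about a subject."""
        self._by_subject[subject].add((subject, predicate, obj))

    def statements_for(self, subjects):
        """Return every stored statement about any of the given subjects."""
        found = set()
        for subject in subjects:
            found |= self._by_subject.get(subject, set())
        return found


def annotate(page_text, store):
    """Re-identify the entities on a page and pull back the stored statements about them."""
    return store.statements_for(extract_entities(page_text))
```

Because the statements are keyed by the entity identifier rather than by the page, a note added while reading one page that mentions HyTime comes back from annotate for any other page that mentions it too.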

Just need to code it now ;-)

Amazon throw down a 1TB gauntlet to the Topic Maps Community

It is very interesting to see that Amazon have now made available over 1TB of public data. It’s great that all of this data is now available in one place, ready shredded into queryable structures that allow developers to get to grips with it and start to do something really interesting. But wait a minute, if I want the DBPedia dump I have to go here…, if I want the Wikipedia English articles I need to go over there. If I want the US census data from 2000 it’s this place, if it’s the census data from 1990, it’s somewhere else. Oh and don’t even get me started on having to choose between Windows and Linux. These data sets are essentially separate database snapshots that you can load into your own EC2 instance in the Amazon cloud and then start processing.

…and that’s kind of disappointing. Having lots of open data is a great start, but it is only a start. And here is the challenge – there are no consistent semantics across these data sets, and there is a great deal of wet-ware time that needs to be invested in working out the linkages between them and in getting hold of some consistent notions of identity that could assist in merging. The easy way out is to pick and choose and make a “mash-up”, but there is nothing reusable in a mash-up, and a million mash-ups do not make a viable platform for building the really cool apps of the next decade on. Topic maps, on the other hand, has a model for reflecting a consistent notion of identity, and for reconciling different identity notions and different entity schemas.

There’s the challenge – can we integrate all of this data using topic maps ? Can we make use of the tools provided by the Amazon platform to build something even more cool – a cloud based index of the entities and relations in these data sets ? Because I believe that when we can do that we will really have 1TB (or given the expansion of the topic maps model probably 1PB ;-) of useful knowledge, rather than 1TB of bits and bytes that you can hack with a mash-up.

There’s the glove. Who’s going to pick it up ?

PSIs, Registries and Repositories – Bottom Up Or Top Down ?

Recent discussion on the topicmapmail mailing list has been on the creation and maintenance of Published Subject Indicators (PSIs). A PSI is a resource which describes a vocabulary (or part of a vocabulary) and provides URIs for terms in the vocabulary (called Published Subject Identifiers, which confusingly have the same acronym, PSI). The discussion has been provoked by the proposal to create a registry of PSIs – a task which I personally welcome.


New Paper: ‘Topic Map Patterns For Information Architecture’

A new paper has been added to the Publications section of the site. Topic Map Patterns For Information Architecture presents design patterns for modelling some common information organisation constructs from the world of Information Architecture. The paper presents and explains models for hierarchical and faceted classification systems as well as for thesauri.
