
Take a read through this interesting post on XML Dev. From bitter experience, I can second the assertion that to be sure that you have a valid WXS schema instance you need to validate with at least two parsers and preferably more. Worse still, we have XML editors that, because they use one or other of the common (broken) parsers or their own (usually broken) implementation, will reject valid instances or let you create invalid instances. How did we get here?

Not such a long time ago, XML was new. Lots of people wrote parsers, because it was an easy thing to do. Lots and lots of people used those parsers because many of them were open source and/or free. Bugs were found. Bugs were fixed. Now most people use one of a handful of XML parsers that, for DTD validation at least, are robust and reliable.

W3C XML Schema has been a recommendation now for two and a half years. Some implementations are older than the recommendation of course, but most are of the order of two years old. Why are there so many bugs? Could it be that not as many people are using the tools (and so reporting the bugs)? Possible, but not likely given that the W3C juggernaut is forcing even relatively sane people to use schemas in order to unbreak the mess that is Namespaces or because they need to use other standards that require schemas. So perhaps it's because the developers of these tools aren't fixing bugs? Have developers and software companies suddenly decided that parser bugs are "no big deal"? I find that hard to believe.

More likely is that W3C XML Schemas are just too complex to implement. My guess is that there are no "little bugs" left in the parsers, just nasty, hard-to-fix, deep-in-the-code bugs. The sort of bugs you get from trying to implement a specification as impenetrable as the 3 part monster from the W3C.

There is another way - RelaxNG and Schematron - now both under the wing of ISO in the DSDL work. Both have features that WXS does not. Both are much easier to understand and easier to write. The tools for these schema languages are, in my experience, robust and reliable. Of course, the reality is that WXS is here to stay and we have to deal with it - like it or not. But if you do have the luxury of a choice of schema languages, the practical programmer should take a good look at the tool sets available for these non-W3C languages and think hard before following the crowd.

A common modelling decision in creating a topic map is when to use an association with 3 or more roles (an n-ary association) and when to represent it as n-1 binary associations. Herewith a discussion on the relative merits of the two forms and some pointers (ok, opinions) on the Right Thing To Do.

In many cases in creating topic maps we are presented with the issue of how to represent n-way associations. Some examples could be:

  1. The members of a department (an association between one department and n-1 people)
  2. The books written by an author (an association between 1 author and n-1 books)
  3. The parts of a machine (an association between 1 whole and n-1 parts)
  4. A vote taken by a committee (an association between a decision and n-1 committee members)
  5. A murder depicted in an opera (an association between a victim, a murderer and a method of death)

The issue that comes up is whether to code these relationships in a single multi-legged association (an n-ary association) or several two-way associations (binary associations). There are trade-offs to be made, but in my opinion the first rule of thumb is:

Smaller is Better

Or more specifically, "More granular is better" - the smaller the statements we make, the more control we have over them. Breaking statements up without creating new topics gives us the ability to apply metadata to those statements individually and to query, traverse and modify one statement without any impact on or concern for the others.

Of course, there is a point of diminishing returns, and it comes when you need to start adding new classes of entity to your model to be able to split up n-way associations. In general, if you can break up an n-way association without creating new topics, do it. If you need to create a new topic to break up an association, it is likely that you are creating a topic that represents the fact of the association. If you end up having a need for that, then all well and good, but in most cases it is something to be avoided: once you start down this reification route, it's hard to know when to stop.

The second rule of thumb I follow is to ask:

"Is the association divisible without creating another topic?"

In other words, would it make sense to divide up the association into smaller (typically binary) associations?

A third useful rule of thumb is:

"Does the presence of one player of a given role have any bearing on the presence of the other players?"

In other words, if one player were removed, would the statement being made suddenly become untrue (rather than just incomplete)?
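As a rough illustration, the three rules of thumb can be written down as a small decision helper. This is only a sketch: the names `AssociationCheck` and `suggest_arity` are invented for this example, and the two yes/no answers still have to come from the modeller's own judgement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociationCheck:
    """Answers to the second and third rules of thumb (hypothetical names)."""
    # Rule 2: can the association be split without creating another topic?
    divisible_without_new_topic: bool
    # Rule 3: would removing one player make the statement untrue?
    players_interdependent: bool


def suggest_arity(check: AssociationCheck) -> str:
    """Return "binary" or "n-ary" for a candidate association."""
    # Rule 3: interdependent players belong in one association.
    if check.players_interdependent:
        return "n-ary"
    # Rule 2: splitting would force a new (reifying) topic, so keep it whole.
    if not check.divisible_without_new_topic:
        return "n-ary"
    # Rule 1: smaller is better.
    return "binary"


# Members of a department: independent players, no new topic needed.
department = suggest_arity(AssociationCheck(True, False))
# A committee vote: the members reach the decision collectively.
committee = suggest_arity(AssociationCheck(False, True))
```

The helper only records the order in which the questions are asked; it does not replace thinking about what the association actually means.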

So, with those three rules of thumb in hand... let's play the "Binary or N-Ary Game"!

  1. The members of a department (an association between one department and n-1 people)
    BINARY! - If Fred, Joe and Barney are members of the Finance Department, Fred and Joe will still be members after Barney retires. There is no dependency between the players of the 'member' role, so we can model this association as 3 binary associations rather than one four-way association.
  2. The books written by an author (an association between 1 author and n-1 books)
    BINARY! - 'Hunter S. Thompson wrote "Fear and Loathing in Las Vegas" and "Hell's Angels"'. These are independent facts and the statement as it stands is incomplete anyway (Thompson wrote more than those two books). In both English and Topic Maps, I can break this statement up into 'Hunter S. Thompson wrote "Fear and Loathing in Las Vegas"' and 'Hunter S. Thompson wrote "Hell's Angels"'. So I would model this as two binary associations rather than a single 3-way association.
  3. The parts of a machine (an association between 1 whole and n-1 parts)
    BINARY! or N-ARY! - If the meaning of the association is that it is a closed and complete list of all the components which make up the machine, it is reasonable to argue that without one of the components, the machine is not complete, so in this case we should use an N-ary association that explicitly groups together all the components. On the other hand, such part-whole relationships are often not complete (e.g. "The engine contains a fuel pump, spark plugs and a carburettor"), in which case the individual parts are independently related to the whole and so should be represented with binary associations.
  4. A vote taken by a committee (an association between a decision and n-1 committee members)
    N-ARY! - There is an example in the RDF Model And Syntax Specification (see section 3.5) which goes "The committee of Fred, Wilma, and Dino approved the resolution". The association between the resolution and Fred, Wilma and Dino cannot be subdivided as the decision was made collectively - we do not want to assert that one of the three made the decision, but instead that all three came to the decision (by some undocumented means). So in this case, an N-Ary association provides us with the necessary dependency between the committee members and the decision made.
  5. A murder depicted in an opera (an association between a victim, a murderer and a method of death)
    N-ARY! - No article on topic maps is complete without a reference to Italian Opera and this is no exception. The classic Ontopia topic map contains 4-way associations such as "Baron Scarpia was killed by stabbing by Tosca in the opera Tosca" - the role players are Baron Scarpia (playing the role of victim), Tosca the character (playing the role of perpetrator), stabbing (playing the role of cause of death) and Tosca the opera (playing the role of opera). To break this down into binary associations we would need to create a new topic of type murder, then we could say:
    • The victim of the murder was Baron Scarpia
    • The perpetrator of the murder was Tosca
    • The method of the murder was stabbing
    • The murder is depicted in Tosca (the opera)
    However, this requires us to create a new class of entity (the murder) and so to follow the rules of thumb above, and avoid this additional reification step, we instead use the n-ary association which adequately (for our purposes) expresses the dependencies between the four role players.
Conclusion

Modelling associations is best done with a bit of thought. Although the temptation is to just stuff as much as possible into a single association (especially when writing XTM syntax by hand), using small associations where possible gives you more flexibility in the long run as it allows greater control over attaching metadata to specific statements. More granular associations also enable a great deal more clarity. Allowing the author to be explicit about whether role players are interdependent or not is important and making use of standard topic map machinery to do that means that you need not be dependent on an ontology description to make clear what the topic map model is already capable of expressing. Thinking about the arity of associations at the time you are constructing your topic map ontology will reap benefits in the long run.
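To make the metadata argument concrete, here is a minimal sketch of the department example in plain Python. It is not XTM and not any topic map engine's API: the `Association` class and the `split_on_hub` helper are invented for illustration, and the "status" metadata key is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Association:
    """A typed association: (role type, player) pairs plus statement-level metadata."""
    assoc_type: str
    roles: list[tuple[str, str]]
    metadata: dict[str, str] = field(default_factory=dict)


def split_on_hub(assoc: Association, hub_role: str) -> list[Association]:
    """Split an n-ary association into binary ones, each pairing the single
    hub player (e.g. the department) with one of the other players."""
    hubs = [role for role in assoc.roles if role[0] == hub_role]
    if len(hubs) != 1:
        raise ValueError("expected exactly one player of the hub role")
    hub = hubs[0]
    return [
        # Each binary association gets its own copy of the metadata.
        Association(assoc.assoc_type, [hub, other], dict(assoc.metadata))
        for other in assoc.roles
        if other != hub
    ]


membership = Association(
    "membership",
    [("department", "Finance"), ("member", "Fred"),
     ("member", "Joe"), ("member", "Barney")],
)
binary = split_on_hub(membership, "department")

# Barney retires: annotate his statement alone; Fred's and Joe's are untouched.
binary[2].metadata["status"] = "retired"
```

With the single four-way association, the same annotation would have to be attached to the whole statement or to a newly created topic, which is exactly the reification step the rules of thumb try to avoid.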

Recent discussion on the topicmapmail mailing list has been on the creation and maintenance of Published Subject Indicators (PSIs). A PSI is a resource which describes a vocabulary (or part of a vocabulary) and provides URIs for terms in the vocabulary (called Published Subject Identifiers, which confusingly share the same acronym, PSI). The discussion has been provoked by the proposal to create a registry of PSIs - a task which I personally welcome.

It seems to me that there are at least two different issues here. One is about the creation of PSIs and the other is about raising the profile of a particular set of PSIs.

First of all, let's understand that there is nothing magic about a PSI in its technical aspects. It's just a URI that points to a resource that describes a vocabulary. The "magic" (if there is any) is in the processes that surround the maintenance of the PSI. The publisher that makes a PSI available is supposed to make a commitment to the stability of that PSI.

So what does stability mean? I think that it means two things:

1) Stability of presence - the PSI's URI is not going to go away within some meaningful time frame (although I hear discussions of stability over hundreds of years, my feeling is that in this business aiming for stability over a period of 5-10 years is a sufficiently Herculean task to gain the status of PSI)

2) Stability of meaning - that the PSI's URI will always be dereferenced to a description of a term that is consistent throughout the lifetime of the PSI (Not necessarily the same all the time - e.g. a PSI for a person might be continually updated to reflect his changing status - marriage, promotion, publications and so on)

Now, neither of these commitments requires a large investment in resources for those folks from the typical sem web community (it does leave out a large chunk of the world, but that is an issue that the IT industry as a whole must address). Nor does either of these commitments impose any constraints on users of the PSI. As a user of a PSI I am free to make my own value judgments about the stability of a PSI, and balance them against my judgment of its usefulness to me and the community that I am addressing with my applications. I may be uncomfortable using a PSI created by an individual whom I do not know, I may be uncomfortable using a PSI created by any individual, I may be unwilling to use a PSI created by a particular standards body or by a group I perceive as being unreliable (for whatever reason). The fact that ISO, OASIS, or the Spanish Knitting Association have put their imprimatur on a PSI is simply a factor in my judgment about the usefulness of this PSI to me.

There are good examples on the Web of vocabularies created by committee and by community. MARC is a committee-led vocabulary, as are HL7 and any number of XML vocabularies - each created by a formal group (perhaps a public and inclusive group, perhaps a private and closed group) and a formal process.

Community-led vocabularies grow more organically from a user base - for example the Friend Of A Friend vocabulary (FOAF) has grown both in terms of its use and indeed its size as users get interested in applying it. The same could be said for the many faces of RSS.

In general, it seems to me that successful community-led vocabularies are smaller in size and more tightly focussed in scope than committee-led vocabularies. In addition, with no organisational imprimatur to fall back on, community-led vocabularies survive or die on their uptake. That's not to say that the same dynamics do not also apply to committee-led vocabularies, but the organisation can provide some stability against the tide of user opinion.

So in terms of stability, a community-led vocabulary can be as stable as a committee-led vocabulary, and when one considers the other factors in the choice of vocabulary, the lighter weight, tighter focus and the ability to participate as a member of the user community may make a community-led vocabulary more attractive to some users.

Next we come to the issue of publicising a PSI. PSIs could be gathered together in a number of ways using existing web technology:

1) A centralised repository of PSIs - all subject descriptors are placed in a repository under a single common base URI. Some management process determines which PSIs are published and which are rejected.

2) A centralised registry of PSIs - PSI metadata is stored in a repository with a known address and a search interface which enables PSIs of interest to be located (either by human or machine users). A management process may be used to determine which PSIs are published, but it is not necessary in this case.

3) Informal publication - PSIs are announced on mailing lists and in weblogs or through other informal publication channels. Perhaps the author of a set of PSIs writes some articles on them, or publicises them through their use in a project with public visibility.

4) Search - PSI resources are flagged in some way (perhaps a specific META tag in the HTML representation of the resource) which enables an aware search engine to determine that a page is a resource containing Published Subject Indicators.
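As a sketch of how option (4) might look from the search engine's side, the snippet below checks an HTML page for a marker META tag using Python's standard `html.parser` module. No such tag has been agreed anywhere: the name `psi-resource` is purely an assumption made for this example, as are the class and function names.

```python
from html.parser import HTMLParser

# Hypothetical marker name; no specific META tag has been defined for PSIs.
PSI_META_NAME = "psi-resource"


class PSIFlagParser(HTMLParser):
    """Records whether the page carries the (assumed) PSI marker META tag."""

    def __init__(self):
        super().__init__()
        self.is_psi_resource = False

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attr_map.get("name") or "").lower() == PSI_META_NAME:
            self.is_psi_resource = True


def is_psi_resource(html: str) -> bool:
    """Return True if the HTML text contains the marker META tag."""
    parser = PSIFlagParser()
    parser.feed(html)
    parser.close()
    return parser.is_psi_resource


flagged = '<html><head><meta name="psi-resource" content="true"></head></html>'
plain = '<html><head><meta name="description" content="A weblog"></head></html>'
```

A real crawler would also need to decide what the tag's content means and whether to trust it, which is where the judgement calls about stability come back in.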

There are probably some other ways too. It is true that some of these forms are more restrictive than others for the creators of PSIs - particularly (1), which involves a process that could be open to abuse or to the perception of abuse. But what about the users? Again I believe we come to the issue of choice. Some users will only be comfortable with PSIs from a centralised repository - some may even be required to use those PSIs because of their toolset. But without choice in the matter, the Semantic Web will be a poorer place. Imagine the Web if Yahoo were the only search engine (or if Google were the only search engine, if Yahoo is your preference...). Diversity causes difficulties for some - and this is an opportunity for an enabling organisation such as OASIS to define the management structures for a centralised repository, or for an enterprising vendor to create such a repository as fits with their tool set. But with the SemWeb in its current nascent state, diversity is to be welcomed and the opportunity for all to participate as both publishers and users of PSIs is vital to its success. That is why I welcome recent proposals to create an open-source registry of PSIs with minimal management processes and look forward to participating in its development as a contributor and as a user.