Questions about use cases were not really answered during the sprint, but we decided to gather the information SUNCAT holds about contributing institutions together in a linked data format. Some more work on use cases for SUNCAT linked and/or open data is being done in the SUNCAT UK Discovery project.

SUNCAT uses the MARC organisational code for libraries when it is available. I was introduced to the work of Adrian Pohl and Felix Ostrowski from hbz in Germany, who have created an international directory of libraries and related organisations which covers the US codes from the Library of Congress and the German organisation codes. The information for the UK libraries is in a PDF (http://www.bl.uk/bibliographic/pdfs/marc_codes.pdf) at the moment, but it might be possible to collect the data from this format. Felix and Adrian presented their idea of adding RDFa to webpages containing information about libraries at ELAG2011 (“Your Website is your API – How to integrate your Library into the Web of data using RDFa”), and a representative from OCLC who attended the presentation immediately started implementing this in the WorldCat registry.

The Talis Platform hosting and consultancy blog posts “Linking and Cleaning Data” were a very useful illustration of the use of org:Organization, org:hasSite, and v:VCard for specifying the links between an organisation and its sites and the site addresses.

An organisation ontology was used to describe SUNCAT contributing libraries. There was discussion about whether a “library” should be modelled to represent a single library in one building or be an umbrella term for all an institution’s libraries.

I found the examples on “Howto – Describing libraries, their collections and services in RDF” on the hbz Semantic web wiki very helpful.

Vocabularies used:

The RDF Vocabulary (RDF):
http://www.w3.org/1999/02/22-rdf-syntax-ns#

The RDF Schema vocabulary (RDFS):
http://www.w3.org/2000/01/rdf-schema#

Friend of a Friend (FOAF):
http://xmlns.com/foaf/0.1/

DCMI Metadata Terms (DCT):
http://purl.org/dc/terms/

An Ontology for vCards (V) for representing address and contact information:
http://www.w3.org/2006/vcard/ns#

WGS84 Geo Positioning (GEO):
http://www.w3.org/2003/01/geo/wgs84_pos#

XML Schema (XSD):
http://www.w3.org/2001/XMLSchema#

OWL
http://www.w3.org/2002/07/owl#

SKOS
http://www.w3.org/2004/02/skos/core#

Ordnance Survey Postcode Ontology
http://data.ordnancesurvey.co.uk/ontology/postcode/

The rdf:about RDF/Turtle validator and Converter was useful for checking Turtle files.

There is a JISC MU list of organisations, which I used to enrich the SUNCAT institution data with JISC MU organisation identifiers by querying the SPARQL endpoint for JISC MU institutions, using the Perl CPAN module RDF::Query::Client.

Transforming the SUNCAT institution data into linked data has helped SUNCAT clean our data. The linked data can be used as an internal source of data for various SUNCAT configuration files, web pages, and contact information.

Overview of the linked data Buildings Visualisation demo using data from UoE open Repository.

The demo is here http://mab.edina.ac.uk/oarj-linkeddata/ed-rdf.html

The Map
The basis of the map is OpenStreetMap (http://www.openstreetmap.org/), a freely editable map of the whole world.

It is displayed using the OpenLayers JavaScript library (http://openlayers.org/).

OpenLayers is an open source library that can display map tiles and markers loaded from any source.

Use case outline.

The user workflow: the user does a search and navigates down to a building shape view.

The search tool is created by extending the OpenLayers ZoomBox control and overriding the zoomBox function.

OpenLayers.Util.extend(zoomBox, {
    // Override the default zoom behaviour: treat the drawn box as a search area.
    zoomBox: function (position) {
        this.notice(position);
    },

    notice: function (bounds) {
        // Convert the pixel corners of the box to map coordinates.
        var leftBottom = new OpenLayers.Pixel(bounds.left, bounds.bottom);
        var rightTop = new OpenLayers.Pixel(bounds.right, bounds.top);
        var southWest = map.getLonLatFromPixel(leftBottom);
        var northEast = map.getLonLatFromPixel(rightTop);

        // Reproject from the map projection to the display projection.
        southWest = southWest.transform(map.projection,
            map.displayProjection);
        northEast = northEast.transform(map.projection,
            map.displayProjection);
        var south = southWest.lat;
        var west = southWest.lon;
        var east = northEast.lon;
        var north = northEast.lat;
        // Only search when every edge of the box is a valid number.
        if (!isNaN(south) && !isNaN(north) && !isNaN(east) && !isNaN(west)) {
            addEdDataMarkersToLayer(east, west, north, south);
            // reset to the default control
            panel.activateControl(defaults);
        }
    }
});

When the user finishes area selection, a mouseup event is captured and a bounding box of the area is sent to the server.

On the server side the bounding box is used to build a SPARQL query to the UoE SPARQL endpoint. The SPARQL endpoint is at http://data.inf.ed.ac.uk/query
The underlying data source is 4Store.

The SPARQL query takes the form:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX ospc: <http://data.ordnancesurvey.co.uk/ontology/postcode/>
PREFIX vcard: <http://www.w3.org/2006/vcard/ns#>
PREFIX rooms: <http://vocab.deri.ie/rooms#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?building ?label ?street ?lat ?long ?postcode ?locality ?osmid WHERE {
?address geo:lat ?lat  .

?address geo:long ?long .
# WEST, EAST, SOUTH and NORTH are filled in on the server from the
# bounding box (bb.getWest(), bb.getEast(), bb.getSouth(), bb.getNorth())
FILTER ( xsd:double(?long) > WEST
&& xsd:double(?long) < EAST
&& xsd:double(?lat) > SOUTH
&& xsd:double(?lat) < NORTH ) .
?address vcard:street-address ?street .
?address vcard:locality ?locality .
?address foaf:based_near  ?osmid .
?address ospc:postcode ?postcode .
?building vcard:adr ?address .
?building rdfs:label ?label .
}

The FILTER clause does the basic bounding box query. The query would execute faster if the geometries were held in a spatial index and queried with a store's extended syntax.
With portability and a small dataset in mind, though, I used basic SPARQL to make sure it worked on both a Virtuoso RDF store and 4Store.
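As a minimal sketch of how the portable FILTER clause can be assembled from a bounding box, here is the same idea in JavaScript. The function name buildBoundingBoxFilter and the shape of the bbox object are assumptions made for illustration; the demo itself builds the query in Java on the server.

```javascript
// Build the bounding-box FILTER clause used in the SPARQL query.
// `bbox` is a hypothetical plain object {west, east, south, north}
// holding WGS84 degrees; it is not part of the original demo code.
function buildBoundingBoxFilter(bbox) {
  const keys = ["west", "east", "south", "north"];
  for (const key of keys) {
    // Reject anything that is not a finite number so that no
    // arbitrary text can end up inside the query string.
    if (typeof bbox[key] !== "number" || !Number.isFinite(bbox[key])) {
      throw new Error("Bounding box value '" + key + "' must be a finite number");
    }
  }
  return (
    "FILTER ( xsd:double(?long) > " + bbox.west +
    " && xsd:double(?long) < " + bbox.east +
    " && xsd:double(?lat) > " + bbox.south +
    " && xsd:double(?lat) < " + bbox.north + " ) ."
  );
}
```

Checking that each value is a finite number before concatenating it keeps malformed input out of the query, which matters because the box ultimately comes from the browser.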

It is a Java web app, and we are using the Jena library, which provides an abstraction for querying remote SPARQL endpoints.

// Parsing the query first means a syntax error is caught before anything is sent.
Query sparql = QueryFactory.create(geoQuery.toString());

// Run the query against the remote SPARQL endpoint.
QueryExecution x = QueryExecutionFactory.sparqlService(
        sparqlEndPoint, geoQuery.toString());
ResultSet results = x.execSelect();

List<JSONObject> jsonList = new ArrayList<JSONObject>();

int counter = 0;
while (results.hasNext()) {
    counter++;
    JSONObject json = new JSONObject();
    QuerySolution rs = results.nextSolution();
    // ... (remainder of the loop omitted)
}

The results of the search are passed back to the browser as JSON.

[{
"label" : "William Robertson Building",
"geo:lat" : 55.94384087597896,
"geo:long" : -3.1871938705444336,
"postcode" : "http://data.ordnancesurvey.co.uk/id/postcodeunit/EH89JY",
"locality" : "central",
"street" : "George Square, Edinburgh, EH8 9JY",
"osmid" : "5325199"
}, {
"label" : "Hugh Robson Building",
"geo:lat" : 55.94427947240619,
"geo:long" : -3.1899189949035645,
"postcode" : "http://data.ordnancesurvey.co.uk/id/postcodeunit/EH89XD",
"locality" : "central",
"street" : "George Square, Edinburgh, EH8 9XD",
"osmid" : "27595323"
}...

The JSON is parsed, and the geo:lat and geo:long RDF properties are used to place markers on the map.
The rest of the properties are used to build a popup for the marker.
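A minimal sketch of that parsing step, assuming the JSON shape shown above. The function name toMarkerData and the fields of the returned objects (lon, lat, popupText, osmid) are hypothetical names chosen for illustration, not the demo's actual code.

```javascript
// Turn the JSON search results into simple marker descriptions.
// Entries without numeric coordinates are skipped.
function toMarkerData(jsonText) {
  const buildings = JSON.parse(jsonText);
  return buildings
    .filter(function (b) {
      return Number.isFinite(b["geo:lat"]) && Number.isFinite(b["geo:long"]);
    })
    .map(function (b) {
      return {
        lon: b["geo:long"],
        lat: b["geo:lat"],
        // Text for the popup, built from whichever properties are present.
        popupText: [b.label, b.street].filter(Boolean).join(", "),
        osmid: b.osmid
      };
    });
}
```

Each returned object carries what is needed to position one marker and fill its popup; the osmid is kept so that the building shape can be requested later.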

var marker = feature.createMarker();

var markerClick = function(evt) {
    // Hide any popups that are already open.
    for ( var i = map.popups.length - 1; i >= 0; --i) {
        map.popups[i].hide();
    }
    if (this.popup == null) {
        this.popup = this.createPopup(this.closeBox);
        map.addPopup(this.popup);
        this.popup.show();
    } else {
        this.popup.toggle();
    }
    currentPopup = this.popup;

    var content = getContentFromTags(tags, featureShape);
    //show feature area

    currentPopup.setContentHTML(content);
    OpenLayers.Event.stop(evt);
};

If the RDF resolves to an OpenStreetMap feature shape for the building, a link is created on the popup allowing the user to display this shape.
OpenStreetMap shapes are used as an interim solution until we have full access to the highly detailed estates and buildings floor plans, which we could overlay instead.

Displaying the shape files on the map.

There are a variety of formats we can consume to display the building shapes on the map. We are currently using KML.

formats = {
'in' : {
wkt : new OpenLayers.Format.WKT(in_options),
geojson : new OpenLayers.Format.GeoJSON(in_options),
georss : new OpenLayers.Format.GeoRSS(in_options),
gml2 : new OpenLayers.Format.GML.v2(gmlOptionsIn),
gml3 : new OpenLayers.Format.GML.v3(gmlOptionsIn),
kml : new OpenLayers.Format.KML(kmlOptionsIn),
atom : new OpenLayers.Format.Atom(in_options)
},


Maybe in the future we could use the WKT format. The idea is to generate the buildings RDF with a dc:spatial property (WKT polygons) describing the building shapes, and to display them directly without the extra step via OpenStreetMap.
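To illustrate what such a WKT value could look like, here is a minimal sketch that turns a ring of coordinates into a WKT polygon string. The function name toWktPolygon and the [lon, lat] input layout are assumptions for the example; nothing like this exists in the demo yet.

```javascript
// Build a WKT POLYGON string from an outer ring of [lon, lat] pairs.
// WKT writes each point as "x y", so longitude comes first, and the
// ring must end on the point it started from.
function toWktPolygon(ring) {
  if (!Array.isArray(ring) || ring.length < 3) {
    throw new Error("A polygon ring needs at least three points");
  }
  const closed = ring.slice();
  const first = ring[0];
  const last = ring[ring.length - 1];
  // Close the ring if the caller did not repeat the first point.
  if (first[0] !== last[0] || first[1] !== last[1]) {
    closed.push(first);
  }
  const points = closed.map(function (p) {
    return p[0] + " " + p[1];
  });
  return "POLYGON((" + points.join(", ") + "))";
}
```

A string in this form is the kind of input the OpenLayers.Format.WKT parser listed above is meant to read.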

We use an AJAX call to retrieve the KML shape file and add it to a vector layer.

function getFeatureShapeInKml(shapeId) {
    var strUrl = "PlaceMarkServlet?osmid=" + shapeId;
    var strReturn = "";

    // Synchronous request, so strReturn is populated before we return.
    jQuery.ajax({
        url : strUrl,
        success : function(kml) {
            strReturn = kml;
        },
        async : false
    });

    return strReturn;
}

On the server, the OpenStreetMap-derived KML is queried with the OSM id, and the placemark is returned using a simple XPath query.

XPathFactory factory = XPathFactory.newInstance();
XPath xpath = factory.newXPath();
// Select the placemark whose osm_id matches the requested id.
String xpathExp = String.format(
        "/kml/Document/Folder/Placemark[ExtendedData/SchemaData/SimpleData[@name='osm_id']=%s]",
        osmId);

XPathExpression expr = xpath.compile(xpathExp);

Object result = expr.evaluate(doc, XPathConstants.NODESET);
NodeList nodes = (NodeList) result;
StringWriter sw = new StringWriter();
for (int i = 0; i < nodes.getLength(); i++) {
    // Serialise the matching placemark without an XML declaration.
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    StreamResult out = new StreamResult(sw);
    DOMSource source = new DOMSource(nodes.item(i));

    transformer.transform(source, out);

    // add placemark to return
    output = String.format(output, sw.toString());
}

This could be replaced with geometries derived from the university estates and buildings data instead.

Pages I found useful during this investigation:

SPARQL tutorial – http://www.cambridgesemantics.com/2008/09/sparql-by-example/

Installing Virtuoso on Ubuntu – http://ods.openlinksw.com/wiki/main/Main/VOSUbuntuNotes

RDF and Jena – http://jena.sourceforge.net/tutorial/RDF_API/





 

Four weeks ago I started working on customisation of the UoE Open Data Hub.

I started by looking at the Drupal CKAN module. I was not able to make much sense of the committed code. This is not surprising, as the code had not been formally released yet and I was new to Drupal.

Next, I was pointed to the CKAN Theming Gallery by colleagues on the project. This page had a wiki that pointed you to code examples from various projects – the wiki page has since been deprecated. I found the work done by the Canadian Datadotgc.ca folk on theming the most accessible for a newbie. Based on their ideas, I did some basic theming, which consisted of adding CSS and overriding templates for the home page and layout. The documentation on CKAN theming is evolving, and the latest quick guide for CKAN Theming can be found here. The advantage of theming is that you don’t alter the CKAN model code, which means upgrading CKAN versions should be easier, subject to the following caveats.

I was working on CKAN 1.4.3 and the UoE Open Data Hub site uses CKAN 1.4.1. I learned the following the hard way:

  • HTML tag ids have been added in the newer version. So if you are working on a newer version of CKAN than the deployed version, CSS changes will not necessarily be backward compatible.
  • Some Python variables changed in the newer version, so template pages used for theming will need to be checked against the default version of the altered templates in the CKAN distribution to discover the new variable names. This problem manifests itself as a Genshi Template error, e.g.,
    WebApp Error: <class 'genshi.template.eval.UndefinedError'>: "package_list_from_dict" not defined.
    In this case the Python variable which was “package_list” in CKAN 1.4.1 was changed to “package_list_from_dict” in CKAN 1.4.3:
    <!-- CKAN VERSION 1.4.1 -->
    <p><strong>Recently changed packages</strong></p>
    ${package_list(c.latest_packages)}

    <!-- CKAN VERSION 1.4.3 -->
    <p><strong>Recently changed packages</strong></p>
    ${package_list_from_dict(c.latest_packages)}

There is an agreed need to have an international collection of identifiable organisations… be this collated from national or local levels, or globally assigned.

There are several such lists available: the UK has a list of educational establishments at data.gov.uk, and it is believed that the US has a similar one at data.gov. OA-RJ has an international list of institutions derived from repositories and the organisations that run them.

There are issues that need to be addressed, however:

  • Anything that is UK focused is, frankly, a waste of time: Even on a global scale, having 196 [1] countries each post their own list with no consolidation of reference between them gives rise to two main problems:
    1. Finding the lists becomes a problem, and people need to know where each list is
    2. There are a significant number of organisations that are not geographically restricted to a single country (or, indeed, geographically located at all!)
  • The history of organisations needs to be tracked: They are born, merge, split, rename, and even die.
  • They have complex parent/child relationships (there are research centres, funded by NERC, completely housed within larger organisations).
  • They have (multiple?) geo-spatial locations.
  • There needs to be some real-world examples of use for the data.

There have been a number of previous JISC (and other, overseas) projects in this area, which should be pulled together & combined into a greater whole.

[1] 196 countries is the 195 independent [sovereign] states recognised by the US State Department, plus Taiwan. This ignores the very real situation where England produces its own list, with the expectation that Scotland, Ireland & Wales will produce similar lists (so that’s 199). Add in the expectation that the US will produce lists at (possibly) State level, and you’re up to 250 lists. Add in larger countries like Russia or China devolving lists down, and the problem of isolation and “un-discoverability” gets even worse! “Divide and Conquer” is definitely the way forward…. but not by secession and independent action – that would [in my view] be ignored by the larger community.

Fed up with plain black on white text when editing in vim?  I was, but not now!

Did you know Turtle syntax is really a subset of a broader syntax called Notation3?  I didn’t, but I do now! (proof at http://www.w3.org/DesignIssues/diagrams/n3/venn ).

There’s an N3 syntax highlighter for vim at http://www.vim.org/scripts/script.php?script_id=944 .

Just download the v1.1 file and put it in ~/.vim/syntax (you might need to make that directory) and then add the following to ~/.vim/filetype.vim

" RDF Notation 3 Syntax
augroup filetypedetect
au BufNewFile,BufRead *.n3  setfiletype n3
au BufNewFile,BufRead *.ttl  setfiletype n3
augroup END

 

(yes, including that spurious looking quotation mark at the beginning).

TA-DA!

Questions about use cases for the data; what the benefits are to end users or to libraries.
Something we have to work on formulating with data in hand. Morag mentions a 2005 document discussing SUNCAT use cases.

One suggested usage is deriving information about the focus and specialisms of libraries, by extending library subject metadata using journal/article subject metadata – so identifying the bent of universities through the holdings of their libraries.

Another immediate usage is linking bibliographic datasets of journal articles to journal issues and journal information found in SUNCAT. Medline is a useful example of a dataset that can be integrated – work on Linked, Open Medline metadata is happening through the OpenBiblio project.

SUNCAT holds a record for each institution, its library location, and this could helpfully be linked to the OARJ Linked Data for institutions, and the JISC CETIS PROD work collecting different sources of UK HE/FE information.

Sources in SUNCAT may have an OID which could be re-used as part of a URI. Journals both electronic and hardcopy also (though not always) have ISSNs.
There are restrictions on re-use of data licensed from the ISSN Network, but one can get some of it from other sources – CONSER is a North-America-focused example, with a bit of a scientific bent (thus useful for Medline).

SUNCAT uses OpenURL to search for journal articles and holdings data in institutional libraries. Libraries run an “OpenURL resolver” – often with a bit of proprietary software such as SFX – to map OpenURLs to stuff in their holdings. Would be interesting to find out more about the inside of an OpenURL resolver and how useful a Linked Data rendering of it would be…

Surprised to learn that university libraries often don’t maintain their own subscription database; journals are bought in “bundles” whose contents are shifting, and libraries depend on vendors to sell them back their processed catalogue data.

SUNCAT contains a dataset describing libraries, their affiliations and locations, held in a set of text files. This would be a good place to start with a simple Linked Data model that we can link up to the outcome of the previous LDFocus sprint, and then work on connecting up the library holdings data.

Starting a separate notepad for SUNCAT links. Should have done this earlier, but I have been busy with the new release of Unlock and the wrap-up of the Chalice project.

We’re preparing to start the second Linked Data Focus sprint next week (from May 16th) – working with the developers from the SUNCAT team, who are bibliographic data specialists.

Our notepad from the first sprint has a lot of links to relevant resources – introductions to RDF, tools in different languages, and descriptions of related work around academic institutions and communications.

This presentation by Jeni Tennison from the Pelagios workshop is also worth looking at for sensible advice about taking an existing data model into a Linked Data form. Ian took this sort of approach for the Open Access Repository Junction work – working through the different objects in a relational database model, thinking about how to decorate them with common RDF elements, then creating a vocabulary for the missing pieces. Some of the same questions about publishing and structuring Linked Data should come up; and in the middle of the sprint we’ll hold another Linked Data Learn-in at EDINA.

SUNCAT should have a fair bit more in common with existing Linked Data projects – particularly the JISC-supported OpenBiblio – and we’ll try to make links between SUNCAT-listed publications and some of their metadata. If we can get as far as then linking through to pre-prints in the institutional repositories found in OARJ, then I’ll be entirely satisfied.

One of the tenets of linked data is that IRIs should be resolvable (it’s the 4th or 5th star, depending on which notation you are looking at).

There are two approaches to doing this:

  1. Create a server specifically to handle the linked data
    eg: http://opendata.opendepot.org/organisation/EDINA
  2. Create a resolver underneath an existing server
    eg: http://opendepot.org/opendata/organisation/EDINA

The main consideration is probably how many data sets you are resolving, and what association you want to promote. For example, the University of Southampton are exposing all their data at the University level – so having a central resolver (http://opendata.southampton.ac.uk) makes sense for them.

For OARJ, I can use the OpenDepot.org association, so it was easier for me to create a resolver within the opendepot.org server. OARJ IRIs become something like http://opendepot.org/opendata/organisation/EDINA

The resolver script is http://opendepot.org/opendata/ and the standard Apache environment variable ‘PATH_INFO’ contains the rest of the IRI.

The code for the resolver is remarkably simple:

  use XML::LibXML;

  ## define $host
  ## get the full RDF document from the server: $dom
  ## get an XML Document that contains the RDF root element (complete with namespaces): $rdf

  # Get the <rdf:RDF> element from $rdf
  $child = $rdf->firstChild;

  # for all XPath stuff, we need to define the namespace
  $xpc = XML::LibXML::XPathContext->new;
  $xpc->registerNs('rdf',
                   'http://www.w3.org/1999/02/22-rdf-syntax-ns#');

  # We need the general "about" node with output
  $iri = "$host/reference/linked/1.0/oarj_ontology.rdf";

  # XPath queries go through the XML::LibXML::XPathContext object
  @nodes = $xpc->findnodes("/rdf:RDF/rdf:Description[\@rdf:about=\"$iri\"]", $dom);
  $child->appendChild($nodes[0]) if $nodes[0];

  # and now find the specific rdf:Description we want
  $class = &get_first_pathinfo_item;  # will be "organisation" or "network" or....
  $t =     &get_second_pathinfo_item; # will be the name of the record **IRI encoded!**

  $iri = "$host/opendata/$class/$t";

  @nodes=();
  @nodes = $xpc->findnodes("/rdf:RDF/rdf:Description[\@rdf:about=\"$iri\"]", $dom);
  $child->appendChild($nodes[0]) if $nodes[0];

  # print the assembled document ($rdf now holds the selected descriptions)
  print $rdf->toString;

Obviously there are wrappers around that, but it’s a good basis.
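One of those wrappers has to split PATH_INFO into the record class and the record name. A minimal sketch of that step, written in JavaScript rather than Perl: the function name iriFromPathInfo and its null return for incomplete paths are assumptions for illustration, not how the real resolver behaves.

```javascript
// Rebuild the full IRI of a record from the PATH_INFO part of the request.
// For example, "/organisation/EDINA" gives the class "organisation" and
// the name "EDINA". The name is left IRI-encoded, exactly as received.
function iriFromPathInfo(host, pathInfo) {
  const parts = pathInfo.split("/").filter(function (p) {
    return p.length > 0;
  });
  if (parts.length < 2) {
    // Not enough information to identify a single record.
    return null;
  }
  const recordClass = parts[0];
  const name = parts[1];
  return host + "/opendata/" + recordClass + "/" + name;
}
```

The returned string is the value that the rdf:about attribute is compared against in the XPath query above.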

One of the great things about Turtle format is that it is dead easy to write (see the blog post below on how easy it was.)

One of the great things about RDF/XML is that it is a well known format, rooted in XML, and very easily parsed… but not fun to create.

What is needed is an easy way to create RDF from Turtle… and there is – Any23.org

(I’m a Perl-man, so my example code is in Perl – YMMV)

  use File::Slurp;
  use LWP::UserAgent;
  use HTTP::Request;

  ## create turtle text as before, in $t

  # Write the file into web-server space
  write_file("$datadir/$turtle_filename", $t);
  print "turtle written\n";

  # ping the whole thing off to any23.org to transform into RDF
  my $ua  = LWP::UserAgent->new();
  my $query = "http://any23.org/?format=rdfxml&uri=http://$host/$path/$filename";
  my $res = $ua->get($query);
  my $content = "";
  if ($res->is_success) {
    $content = $res->content;
    write_file("$datadir/$rdf_filename", $content);
    print "rdf written\n";
  } ## end if ($res->is_success)
  else { print $res->status_line; }

Et voila!
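For comparison, here is a minimal JavaScript sketch of building the same request URL. It percent-encodes the document address before adding it to the query string, which the Perl above does not do. That the service accepts an encoded uri parameter is an assumption, and the function name buildAny23Url is hypothetical.

```javascript
// Build the any23.org request URL for converting a published document.
// `documentUrl` is the public address of the Turtle file; `format` is
// the output format, e.g. "rdfxml" as used in the Perl example.
function buildAny23Url(documentUrl, format) {
  return "http://any23.org/?format=" + encodeURIComponent(format) +
    "&uri=" + encodeURIComponent(documentUrl);
}
```

Encoding the address keeps any "&" or "?" inside it from being read as part of the outer query string.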

http://schemas.library.nhs.uk/ApplicationProfile/JournalHolding/ – JournalHoldings within the NHS. Quite a highly engineered vocabulary, and too specific for us to use directly, but perhaps a useful example.

http://purl.org/spar/fabio – the “FRBR-aligned bibliographic ontology”, one of the SPAR (Semantic Publishing and Referencing) ontologies being developed through the JISC Open Citations project.

http://opencitations.wordpress.com/2010/10/14/introducing-the-semantic-publishing-and-referencing-spar-ontologies/

http://bibliontology.com/ | http://purl.org/ontology/bibo/ – BIBO, the bibliographic ontology. Generic and with reasonably wide use but criticised by specialists.

See Also

Discussion of some technical issues during Sprint 1

General advice from Jeni Tennison – start by thinking about your domain objects, e.g. all the things your system is currently modelling, then decide which of them should have a URI. Only then start to think about describing their relations, and use others’ work wherever possible – we should only be minting our own vocabularies as a last resort…

© 2011 Linked Data Focus