Steve Cassidy: Academic Blog

Teacher Professional Development: Building Web Apps

2019-07-11T20:28:59+10:00

I’m running a workshop today for High School teachers in the local area. The goal of the workshop is to provide an example framework for building simple web applications that might be useful for students doing HSC projects. The style of application is based on the material we use in our second year Web Technology unit. In that unit we aim to have the students understand everything about what is going on ‘under the hood’, but we can use the framework in a more practical way to build applications.

Resources

Python Web Programming - my course notes for COMP249
‘Likes’ App - our first simple web app
FlowTow - a bigger image sharing app
App Framework - a framework for new apps

Amplifying Amplify

2018-04-06T00:00:00+10:00

Amplify is a project maintained by the State Library of NSW for crowdsourcing the transcription of Oral History recordings. It is being used to generate transcripts of some of their extensive catalogue of OH recordings. Amplify starts with the results of an Automatic Speech Recognition system and presents these to users of the platform for correction and validation. Amplify is great! I’d encourage you to go along and try out the correction process.

I’m interested in how OH transcripts might be enhanced with the application of Natural Language Processing techniques. As part of the Alveo project, we’re working on developing tools to help with recording, archiving and enhancing OH recordings. As a platform for experimenting with this I’ve built a little web application that takes transcripts from the State Library’s Amplify platform and puts them through a few NLP processes and then presents the results. This is now available online as Amplify Amplifier.

The app will process any transcript from Amplify that has a 100% completed rating. This should mean that it has been corrected and verified by users but there are some wrinkles in Amplify (that SLNSW know about) that mean that often errors slip through the net. The app applies three processes at the moment:

Topic segmentation using the NLTK implementation of Marti Hearst’s text-tiling algorithm. This chunks the interview into topics based on the distribution of words in the text.
For each topic, we find keywords that are more common in that topic than in the rest of the text using the TF-IDF metric. This is an attempt to identify the main concepts in each topic.
For each topic, we use the DB-Pedia Spotlight named entity linking service to find the names of people, places and concepts and link them to relevant entries in DBPedia (a machine readable version of Wikipedia).
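The keyword step above can be illustrated with a small sketch. It is not the code used in the app: it treats each topic as one document and scores words by a plain TF-IDF, with a made-up tokenizer and invented sample text. The function names `tokenize` and `topic_keywords` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def topic_keywords(topics, top_n=3):
    """Return the top_n highest-scoring TF-IDF words for each topic.

    Each topic is treated as one 'document'; a word scores highly when it
    is frequent in one topic but appears in few of the others.
    """
    token_lists = [tokenize(t) for t in topics]
    n_topics = len(token_lists)

    # Document frequency: in how many topics does each word occur?
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))

    results = []
    for tokens in token_lists:
        if not tokens:
            results.append([])
            continue
        counts = Counter(tokens)
        scores = {
            word: (count / len(tokens)) * math.log(n_topics / doc_freq[word])
            for word, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_n])
    return results


# Invented sample text standing in for three topic segments.
topics = [
    "we moved to the farm and the farm had sheep",
    "the school was small and the teacher was strict",
    "the war came and my brother joined the army",
]
for keywords in topic_keywords(topics, top_n=2):
    print(keywords)
```

Words that occur in every topic (such as ‘the’) get a score of zero, which is why a filler like ‘er’ only surfaces as a keyword when it is concentrated in one chunk.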

All of this is presented in the page with buttons to allow you to play each topic.

The results are interesting. The topics that Text Tiling finds do give an idea of the overall structure of the interview, although in many cases it is over-segmenting. Keywords give some idea of what each topic is about and sometimes seem to act as a nice summary; other times they are entirely un-useful, such as the single keyword ‘er’ - although I guess this means the speaker is hesitating a lot in that chunk.

Named entities are perhaps the most random aspect of the result. This is not too surprising since we are using a system trained on very different kinds of text in the context of Australian oral histories. So it will try to find the most common entity to link to a name or place and often end up with Harry Potter or some other popular icon rather than a local alternative.

There is a lot of scope here for improving the results that are generated. However, I’m interested in feedback from OH researchers and other readers of these transcripts to see if this kind of presentation is useful at all. How could this be improved? What other elements of the text could be brought out in a useful way?

I’ll continue to experiment with this and hopefully develop some more useful tools with some feedback from interested users.

Supporting accessibility and reproducibility in language research in the Alveo virtual laboratory

2017-02-28T20:56:57+11:00

Our paper discussing Alveo in the context of reproducibility in language sciences is now available in Computer Speech & Language: DOI:10.1016/j.csl.2017.01.003

Highlights

Reviews a number of publications in CSL regarding their practice in using and citing data collections.
Finds that authors are keen to identify and share data but that practices vary in how precise they are or how easy it is to get the data.
Reviews research workflows in speech and language, including the use of software tools.
Suggests a ‘hierarchy of needs’ for reproducibility in speech and language research.
Describes how the Alveo Virtual Laboratory supports a model of research that facilitates data sharing and citation of software tools.

Abstract

Reproducibility is an important part of scientific research and studies published in speech and language research usually make some attempt at ensuring that the work reported could be reproduced by other researchers. This paper looks at the current practice in the field relating to the citation and availability of both data and software methods. It is common to use widely available shared datasets in this field which helps to ensure that studies can be reproduced; however a brief survey of recent papers shows a wide range of styles of citation of data only some of which clearly identify the exact data used in the study. Similarly, practices in describing and sharing software artefacts vary considerably from detailed descriptions of algorithms to linked repositories. The Alveo Virtual Laboratory is a web based platform to support research based on collections of text, speech and video. Alveo provides a central repository for language data and provides a set of services for discovery and analysis of data. We argue that some of the features of the Alveo platform may make it easier for researchers to share their data more precisely and cite the exact software tools used to develop published results. Alveo makes use of ideas developed in other areas of science and we discuss these and how they can be applied to speech and language research.

Docker and MAUS

2016-10-20T16:48:55+11:00

Today’s problem was to write a wrapper for the MAUS automatic segmentation system in preparation for including it as a Galaxy tool. MAUS comes from Florian Schiel in Munich and is a collection of shell scripts and Linux executables that take a sound file and an orthographic transcription and generate an aligned phonetic segmentation. The core of this process is the HTK speech recognition system and getting it to work on anything other than Linux is a pain that is best not lived with.

Galaxy runs on a server and so the executables would be ok in the production environment, but to do development I need to be able to run things locally. The solution is to build a Docker container based on a base Linux container (Debian) with just the minimal set of tools installed to allow MAUS to run. This turned out to be very simple. The only pre-requisite that I needed to add was sox (for sound file manipulation); I also had to make sure that the container was able to run the 32 bit binaries that are included in the MAUS distribution, and then it all worked ok.

I’m getting used to working with Docker. So far all of the images I’ve worked with have been for web services so a significant issue has been forwarding the right ports to the local system. In this case, the container is intended just to run a single command and then exit, so the setup is much simpler. The only thing that is required is to share a directory on the local system with the container so that we can pass data files to MAUS and get the results back.

Here’s an example of running MAUS over a single audio file:

docker run -v `pwd`:/export  stevecassidy/maus \
    /home/maus/maus OUT=/export/test.TextGrid \
    OUTFORMAT=TextGrid \
    SIGNAL=/export/test/1_1119_2_22_001-ch6-speaker16.wav \
    BPF=/export/test/1_1119_2_22_001.bpf \
    LANGUAGE=aus

The Dockerfile is available on GitHub (stevecassidy/docker-maus) and the image is on Docker Hub (stevecassidy/maus). If you have docker installed, the above command should download the image from the hub the first time you run it.

The next step is to write some wrapper code to make it easier to incorporate this into a workflow. I already have Python code that wraps around the regular command line version so the choice I face is whether to put this inside the container or call the container from the script.
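As a rough sketch of the second option (calling the container from a script), the docker command shown earlier could be assembled in Python like this. The function names `maus_command` and `run_maus` are hypothetical and this is not the existing wrapper code; the image name, mount point and MAUS options are taken from the example above.

```python
import os
import subprocess


def maus_command(data_dir, wav, bpf, out, language="aus",
                 image="stevecassidy/maus"):
    """Build the docker command line for one MAUS run.

    wav, bpf and out are paths relative to data_dir, which is mounted
    inside the container as /export.
    """
    return [
        "docker", "run",
        "-v", f"{os.path.abspath(data_dir)}:/export",
        image,
        "/home/maus/maus",
        f"OUT=/export/{out}",
        "OUTFORMAT=TextGrid",
        f"SIGNAL=/export/{wav}",
        f"BPF=/export/{bpf}",
        f"LANGUAGE={language}",
    ]


def run_maus(data_dir, wav, bpf, out, **kwargs):
    """Run MAUS in the container and return the path of the result file."""
    subprocess.run(maus_command(data_dir, wav, bpf, out, **kwargs), check=True)
    return os.path.join(data_dir, out)


# Print the command rather than running it, so this works without docker.
print(" ".join(maus_command(
    ".",
    "test/1_1119_2_22_001-ch6-speaker16.wav",
    "test/1_1119_2_22_001.bpf",
    "test.TextGrid",
)))
```

Keeping the command construction separate from the call to `subprocess.run` means the wrapper can be checked without docker installed.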

A Galaxy Workflow for Acoustic Phonetic Analysis

2016-10-18T15:46:53+11:00

I’ve been working for a while now on adapting the Galaxy Workflow engine for use in speech analysis, specifically for acoustic phonetic analysis of vowel sounds. Galaxy is a system used in bioinformatics for constructing workflows to do genetic analysis and other things. As part of the Alveo project we’ve been building tools for doing text and speech analysis for Galaxy. My recent work has been specifically looking at acoustic phonetic analysis with the Emu library with a goal of reproducing some work on children’s vowels that I did with Catherine Watson many years ago.

I’ve just managed to get the first full workflow going. It extracts the hVd monophthongs for a single speaker from the Austalk corpus stored in Alveo, finds vowels in the TextGrid files via the python tgt package, uses the wrassp package from Emu to compute the formants and then plots them with the phonR library. The workflow is shown at the top of this post in the edit view of Galaxy - this shows the individual tools that are used for each step of the process. The end result is a PDF plot generated by R and reproduced here as a PNG image.

The plot here is for speaker 1_1308 from the Austalk corpus (a 35yr old male from Sydney, born in Adelaide). There is clearly a formant tracking error with the { vowel (had) in one instance (putting it inside the O cluster). Just to prove that it works with any speaker here’s a second example from speaker 1_366 (a 73yr old female from Adelaide); in this case something odd is going on with the i: vowel where F2 is zero for the two instances found.

This workflow relies on there being stored TextGrid files for the speaker in question. An obvious next step is to incorporate a forced-alignment tool to derive vowel segment boundaries given the transcription. I’m currently looking at options for doing this with MAUS and/or FAVE.

The hardest part of getting to this point has been dealing with the underlying Galaxy infrastructure. We had an earlier Galaxy installation for the Alveo project but it was based on an older version of Galaxy, which is a rapidly moving project. A significant change has been the use of Conda to manage package dependencies. I’ve been working on writing conda packages for the tools that I’ve used here so that I could write Galaxy tools that made use of them. This has been helped a lot by the Bioconda repository run by Björn Grüning, which automates building of these packages and makes them available via the bioconda channel (bioconda is the home of many biotechnology packages used in Galaxy; we’ve been invited to submit speech and NLP packages there since we’re working with Galaxy too). Once these dependencies are in place, I can write tools that use them and have them installed in a Galaxy instance. I’ve been using Docker to build a Galaxy flavour for Alveo which installs a set of tools useful for speech and text analysis. These include some of the tools we developed earlier in the Alveo project and add some new tools for speech analysis.

A major issue when writing tools for Galaxy is the level of granularity to pitch them at. For example, it would be possible to write a single tool script to carry out this entire workflow, but that then removes the possibility that parts of the workflow could be re-used for other tasks. For example, I also want to generate a plot of the durations of the vowel segments, so I need access to the intermediate result that is the output of the query step. One choice I made was to build a single tool to accept a list of vowel segments and run the wrassp formant tracker over them and then select the value at the vowel midpoint. This could be decomposed into smaller steps that would then allow the intermediate formant tracks to be re-used for other tasks. I’m still considering the best way to approach this.
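To make the granularity question concrete, here is a minimal sketch of what the smaller second step might look like on its own: given a formant track that has already been computed, pick the values at the vowel midpoint. The track layout (a list of `(time, f1, f2)` tuples) and the function name are assumptions for illustration; this is not the wrassp output format or the actual tool code.

```python
def value_at_midpoint(track, start, end):
    """Return (f1, f2) from the frame nearest the midpoint of a segment.

    track is a list of (time, f1, f2) tuples, times in seconds and
    formants in Hz; start and end are the segment boundaries in seconds.
    """
    midpoint = (start + end) / 2.0
    frames = [frame for frame in track if start <= frame[0] <= end]
    if not frames:
        raise ValueError("no frames inside the segment")
    nearest = min(frames, key=lambda frame: abs(frame[0] - midpoint))
    return nearest[1], nearest[2]


# Invented track values for a vowel between 0.10s and 0.30s.
track = [
    (0.10, 300, 2200),
    (0.15, 320, 2250),
    (0.20, 340, 2300),
    (0.25, 330, 2280),
    (0.30, 310, 2100),
]
print(value_at_midpoint(track, 0.10, 0.30))
```

With the steps split this way, the full track remains available as an intermediate result for other tools.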

I now have a version of my Galaxy server running with these tools installed. It is not quite ready for production but at some point it will be accessible at galaxy.alveo.edu.au to replace our older server. At that point I will be able to share this workflow with others so that you can run it yourself on other data from Austalk.

Mobile Apps for Aboriginal Languages

2016-10-05T14:44:29+11:00

My introduction to Darwin was on a borrowed bike, discovering the streets around CDU and eventually making my way to the city and the Mindil Beach markets for a Sunday evening feast of Gado-Gado, watching the sunset on the sand. I’m in Darwin for a workshop organised by Steven Bird to build mobile apps for “Keeping our Languages Strong”. While a lot of the work with Australian languages is aimed at preservation and documentation, Steven’s work is aimed more at maintaining the living languages within their communities.

The invitees to the workshop were a mixture of technologists like me, linguists, people working with the language communities and members of the communities themselves. The premise was to bring us together to imagine what mobile apps we might build in the context of Aboriginal languages and then maybe even try to build some demonstrations as a proof of concept in the week. The first two days explored possibilities; the next two left the hackers alone to try to build something; the final morning was a show and tell and reflection on what we’d managed to achieve.

My agenda coming to the workshop was to promote the use of the Alveo Virtual Laboratory as a repository for language data. I’ve never worked with Aboriginal languages and my main contact with them is through linguists who collected data for later study; Alveo provides a repository and we are keen to be able to help look after any collections of data that would benefit from the resources we provide. So I was looking for opportunities to help collect data and provide a gateway to storing it in a repository like Alveo for later study by linguists.

On Friday morning I realised that when we frame the task as one of language maintenance rather than preservation, the problems deserve a different treatment. The language isn’t something to be collected and shared back with the community but something that belongs to them that they might share with us. Helping them build an eco-system of tools that might help keep their language alive is the first priority, with documentation and preservation as side effects.

Exploring Ideas

Our first session as a group was aimed at exploring possibilities. What does a language app look like? What could be done? What might be popular in the community? Our group talked about an idea that had already had some thought put into it: a “Calling to Country” app that could play a welcome message in the local language when a visitor arrived in an area. Coupled with some knowledge of local significant sites this might provide an introduction to the local culture for tourists and visitors to an area.

We wondered if it might be possible to link in to Google Now and make a Card appear when you arrived in a place showing the welcome to country and linking to the content in the app. Based on some research this seems to be an option that is limited to a small number of installed apps at present but that might be opened up in future.

I talked about some ideas I’d had around dictionaries, how the relatedness of words might be captured and represented in some way, about the collective knowledge captured in Wikipedia and Wiktionary and how we might encourage people to collect this knowledge for Aboriginal languages. I have been wanting to apply semantic web technology to representing Aboriginal language dictionaries, in particular the Lemon model that is designed to link lexical entries and senses to ontologies like dbpedia. I think there is scope to link many language dictionaries together like this if they share common concepts. My description of these kinds of resource triggered some discussion about building an encyclopaedia linked to place that might capture stories and knowledge associated with different places such that someone might be able to find relevant entries based on where they are.

Networks for Counselling

Another surprising (to me) link was made by one member of the group who pulled out a collection of laminated work-sheets she had made to support her work in counselling. They showed different ‘prompt’ words arranged as a network (graph) showing how they might be related to each other; the size of each word was related to how likely it was to be relevant to a particular age group. The diagrams were effectively a semantic network of terms that could be used as a prompt when discussing issues in a counselling session. We talked about how we might build an app based around this idea, one that might be able to construct a suitable diagram based on information gathered from the patient. This then developed into a discussion of exploring self-help materials through this kind of interface so that this resource might be used by the community directly. All of these were great ideas but were not quite the language applications that we had been asked to come up with, so we shelved these ideas for future reference.

Community Search Engine

Our discussion of networks and Wikipedia led us to thinking about finding information in documents written in Aboriginal languages and the idea of a dedicated search engine for a collection of community documents. We envisaged a kind of repository where community members could lodge documents, perhaps tagged with place and topic tags, and these could then be made available via a search or browse interface to members of the community. We imagined developing this into a place-based encyclopaedia of community knowledge, managed by the community with an interface on mobile devices to make it accessible. We thought that it could be made available in some kind of offline peer-to-peer mode in communities to make up for a lack of network access.

This was the idea that we finished the day with; we developed some scenarios and thought about where the documents might originate, who might submit them and about a process of approval by the elders in a community before documents were published widely.

I think that this project would be relatively easy to implement at a technical level. We could implement a Solr index over a document collection and build some infrastructure to allow tagging of documents with locations and other category tags. The main problem with it is who is going to take responsibility for running the servers and keeping the service alive? Perhaps this is the problem with all of the apps that we’re proposing but it seemed to be a particular issue in this case.

Pictures and Words

Day two saw us re-convening around four apps that we had proposed the previous day. Our document search app was one of these but following some overnight thought and more discussion we decided that it was ‘too hard’. The context of this workshop was to be able to build something and to focus on mobile applications; the document repository idea was mainly a backend service, and the mobile part was just reading documents and so was perhaps not that interesting.

We re-focussed and talked over some of the other things that had come up the day before. We were still interested in networks and links between words and concepts and one idea that was mentioned was a mobile game that is apparently very popular in the communities already. Four Pictures One Word is a puzzle game that shows four pictures and asks the user to guess the one word that is common to them all. The game is played in English and is a free download - money changes hands when you can’t get the answer and want to buy a clue. Many of us in the group had never heard of this game but it is apparently hugely popular.

The attraction of this game was that it links together words into some kind of semantic category. We imagined a version in an Aboriginal language that would show pictures, perhaps with an associated spoken prompt, and ask for a response in the language. Our version of the game would have an interface to allow new games to be constructed and shared with others. In this way we allow the community to support the game in their own language and build resources over time. The collection of images and recordings would grow and could be used in many games. The obvious side effect is a collection of recordings of words in the language along with some semantic relations - even if those relations are not well defined.

We developed this idea into a full proposal and thought about how different users would work through the gameplay and game construction. The game itself is relatively straightforward and there is an existing game to copy from, so most of the thinking went into how new games would be made and the infrastructure we would need to store words and games on the back-end.

Part of the motivation for this game was the collection of language data for possible future linguistic analysis. To facilitate this we need to collect as much useful metadata as possible about the speaker, their location and the word or phrase being spoken. Balanced with this is the need to avoid collecting data that might be considered private and to avoid complicating the game creation process too much.

We presented our idea back to the larger group and got some great feedback. Our presentation included some sample games put together by the native speakers in our group. I was surprised at how well these worked in the presentation and the enthusiasm they generated; it seems that this really is an idea that might work in the community, at least from a game-play perspective.

Building

For the next couple of days the developers got together to explore the implementation of the ideas that had been generated. From our perspective, there were some common themes among the ideas that might be able to use common components in their implementation. From my point of view the most interesting of these was the need to store recordings of words and phrases for use in the games. There is a clear analog here to the capabilities of the Alveo system but importantly, the data that backs up these games is not a curated collection that can be shared with researchers - it is raw data collected by the language community that might one day be shared.

The data for our game can be structured very much like a dictionary with entries for lexical items that have associated textual transcriptions, images and sound recordings. A naive implementation would just lump these together in a single record but we can apply our understanding of lexical structure to develop a more sophisticated store. I look to the Lemon Ontology for a useful lexical model; it might represent:

each word as a LexicalEntry
the pronunciation of the word as a kind of LexicalForm
a definition or image associated with the word as attached to a LexicalSense

An advantage of this form would be the ability to support multiple languages referencing ‘meanings’ represented by images. It might also be possible to link to other ontologies to describe meanings, for example to dbpedia if there are relevant entities described there.
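The three-part structure listed above might be sketched as plain Python data classes. This is a simplification for illustration only, not the Lemon ontology itself: the field names (`written`, `audio_path`, `image_path`, `reference`) are assumptions, and the English and French words are placeholders for entries in any two languages.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LexicalForm:
    written: str                        # orthographic form of the word
    audio_path: Optional[str] = None    # recording of the pronunciation


@dataclass
class LexicalSense:
    image_path: Optional[str] = None    # picture standing for the meaning
    definition: Optional[str] = None
    reference: Optional[str] = None     # link to an external ontology entry


@dataclass
class LexicalEntry:
    language: str
    forms: List[LexicalForm] = field(default_factory=list)
    senses: List[LexicalSense] = field(default_factory=list)


# One sense (an image) shared by entries in two different languages.
shared = LexicalSense(image_path="images/dog.jpg")
english = LexicalEntry("en", [LexicalForm("dog", "audio/en/dog.wav")], [shared])
french = LexicalEntry("fr", [LexicalForm("chien", "audio/fr/chien.wav")], [shared])

print(english.senses[0] is french.senses[0])
```

Because the sense is a separate object, a game built in one language can reuse the images already collected for another.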

For the purposes of our hackathon session however, we compromised and explored a simpler storage model using Google’s Firebase cloud storage solution. Firebase is a JSON data store particularly useful for developing mobile applications. I hadn’t used it before so it was interesting to learn about this new technology. In Firebase you essentially store one large JSON object structure; the mobile app can then request that all or part of that is available locally and local changes are reflected instantly on the device. For this project we hacked together a data store that is reminiscent of the Alveo data model: each word is an item and can have associated documents that are the audio recording or depiction (image) associated with the word. Metadata on the items records the language, speaker details etc. This is more of an archive of recordings than a lexical structure but it serves the purpose of representing the resources for the game we are developing.
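The item-and-documents layout described above could look something like the following nested structure. The key names, identifiers and paths are invented for illustration and are not the store built at the workshop; the helper `documents_of_type` is also hypothetical.

```python
import json

# One large JSON-style object: items keyed by id, each with metadata
# and the documents (recordings, images) attached to it.
store = {
    "items": {
        "item-001": {
            "metadata": {
                "language": "example-language",
                "speaker": "speaker-01",
                "text": "example word",
            },
            "documents": {
                "doc-1": {"type": "audio", "path": "recordings/item-001.wav"},
                "doc-2": {"type": "image", "path": "images/item-001.jpg"},
            },
        }
    }
}


def documents_of_type(store, item_id, doc_type):
    """Return the paths of an item's documents that have the given type."""
    documents = store["items"][item_id]["documents"]
    return [doc["path"] for doc in documents.values() if doc["type"] == doc_type]


print(documents_of_type(store, "item-001", "audio"))
print(json.dumps(store["items"]["item-001"]["metadata"], indent=2))
```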

Ben Foley worked on the front end game implementation using the AngularJS framework. With the help of some boilerplate put together by Matt Bettinson he was able to generate a working version of the game while learning about Angular at the same time. The screenshot here shows a sample game in progress. We were able to demonstrate this at the final presentations and got great feedback from the group who were pleased to see their idea at least partially implemented in a real application.

The Next Steps

I think the workshop showed that there is a lot of scope for applications that deal with language data in a way that can support the language community rather than just treat them as a source of research materials. All of the ideas we generated were applications that would collect data and use it in a useful context: to inform, to entertain or to engage.

Between the different projects that we proposed there was a clear common need for a back-end store of data that was a hybrid of a lexicon and a repository of recordings and images. Alveo provides this kind of service but it isn’t appropriate to have these game-creation engines upload data directly to Alveo: our system is owned by researchers and keeps data as a resource for researchers - the need here is for a data store owned by the community and managed for or by them for the purpose of language maintenance. There should of course be a gateway from the community resources to allow for future research use of this data if it is deemed appropriate; but such usage should not be the primary goal of the data store.

While it would be possible to use the kind of custom data store that we hacked together for our game to support one or even a few different applications, it makes more sense to think about the design of a data store more generally to support as wide a range of applications as possible. The kind of data store hinted at above would model the lexical resources and recordings appropriately and could then be the basis of a wide range of use cases. We contrasted this approach with the common model that has been used in the past for building applications around Aboriginal languages. In many cases a group is funded to build an app and contracts to a development group, who use whatever technology they understand to store the data in a custom database. The disadvantage is that any data stored in this way is locked away and will not be available for other kinds of use. The goal of the kind of back-end we have in mind is that it could become a resource for future use; if we explicitly build in a more general interface, new ideas could flow from seeing the data in place as it grows.

My thanks to Steven Bird for the invitation to Darwin and to the CoEDL for the funding. It was an amazing experience and I hope we can build on it to progress some of these ideas in the future.

Galaxy Tool Generating Dataset Collections

2015-10-21T00:00:00+11:00

As part of the Alveo project we’ve been using the Galaxy Workflow Engine to provide a web-based user-friendly interface to some language processing tools. Galaxy was originally developed for Bioinformatics researchers but we’ve been able to adapt it for language tools quite easily. Galaxy tools are scripts or executable command line applications that read input data from files and write results out to new files. These files are presented as data objects in the Galaxy interface. Chains of tools can be run one after another to process data from input to final results.

One of the recent updates to Galaxy is the ability to group data objects together into datasets. These datasets can then form the input to a workflow which can be run for each object in the dataset. This is something we’ve wanted for Alveo for a long time since applying the same process to all files in a collection is a common requirement for language processing. After a bit of exploration I’ve worked out how to write a tool that generates a dataset and since the documentation for this is somewhat sparse and confusing, I thought I’d write up my findings.

To work through the issues I built the simplest tool I could that generated a collection of files: a python script that creates three files with a bit of random data. The script takes a single required option which is the name of the output directory.
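A script along these lines might look like the sketch below. It is a guess at the general shape rather than a copy of the script in the gist: the file names, the number of lines per file and the function names are all assumptions.

```python
import argparse
import os
import random
import tempfile


def write_sample_files(outdir, count=3, lines=5):
    """Create `count` small text files of random numbers in outdir."""
    os.makedirs(outdir, exist_ok=True)
    paths = []
    for i in range(1, count + 1):
        path = os.path.join(outdir, f"sample_{i}.txt")
        with open(path, "w") as handle:
            for _ in range(lines):
                handle.write(f"{random.random()}\n")
        paths.append(path)
    return paths


def main(argv=None):
    """Parse the output directory name and write the files into it."""
    parser = argparse.ArgumentParser(description="Generate a few files of random data")
    parser.add_argument("outdir", help="name of the output directory")
    args = parser.parse_args(argv)
    for path in write_sample_files(args.outdir):
        print(path)


# Demonstration: write into a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    main([os.path.join(tmp, "SampleDataset")])
```

Used as a real tool script, the demonstration at the end would be replaced by the usual `if __name__ == "__main__": main()` so that the directory name comes from the command line.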

To turn this into a Galaxy tool we need to write an XML configuration file (see the Gist below for the code). This has a section that defines the command line used to run the tool and the names of any input options. In this case the only input is a name for the resulting dataset.

One thing that I learned in getting to this solution is that when Galaxy runs a tool it does so in a newly created temporary directory; this means that there is no problem with the output from the tool overwriting the output of any other tool, so output filenames or directory names don’t need to be unique. However, I did find that this directory contains three temporary files generated by Galaxy (galaxy_1.ec, galaxy_1.sh and set_metadata_7OxS74.py). This tripped me up before I worked out that I needed to write files to a sub-directory.

The important part of the configuration file is the <outputs> section. This normally just lists the expected output of the tool, but in this case the tool is writing an unknown number of files to a directory. The output section of my tool configuration is:

<outputs>
  <collection type="list" label="$job_name" name="output1">
    <discover_datasets pattern="(?P&lt;name&gt;.*)" directory="SampleDataset" />
  </collection>
</outputs>

The <collection> tag says that we're expecting a collection of data (a dataset). The <discover_datasets> tag describes how Galaxy can find the elements of the dataset - in this case by finding files in the directory SampleDataset matching the regular expression “.*” (i.e. all files in this directory). The file name becomes the name of the data object.
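The effect of the pattern can be imitated in a few lines of Python. This is only an illustration of how a named group in the regular expression turns file names into element names; it is not Galaxy's own discovery code, and the `discover` function and sample file names are made up.

```python
import os
import re
import tempfile

# The same expression as in the tool XML: the whole file name is
# captured in a group called "name".
PATTERN = re.compile(r"(?P<name>.*)")


def discover(directory, pattern=PATTERN):
    """Map element names to paths for files whose names match pattern."""
    found = {}
    for filename in sorted(os.listdir(directory)):
        match = pattern.match(filename)
        if match:
            found[match.group("name")] = os.path.join(directory, filename)
    return found


# Demonstration with three empty files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for filename in ["sample_1.txt", "sample_2.txt", "sample_3.txt"]:
        open(os.path.join(tmp, filename), "w").close()
    for name in discover(tmp):
        print(name)
```

A narrower expression, for example one that ends in `\.txt`, would restrict the collection to files with that extension.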

The code for the python script and the XML file are in the gist below. Developing Galaxy tools is relatively easy, especially with the help of planemo - a collection of scripts that help you write, test and run your new tools. Once you have planemo installed, store these two files in a directory and run “planemo serve”; planemo will download a copy of Galaxy if you don’t already have one and run the server so that you can access Galaxy on http://127.0.0.1:9090.

https://gist.github.com/stevecassidy/0fa45ad5853faacb5f55

Updating the ICE Annotation System: Tagging, Parsing and Validation

2011-03-01T21:28:10+11:00

Authors: Deanna Wong, Steve Cassidy and Pam Peters

To appear in Corpora, expected publication in 2012. Manuscript available on request.

The textual markup scheme of the International Corpus of English (ICE) corpus project evolved continuously from 1989 on, more or less independent of the Text Encoding Initiative (TEI). It was intended to standardise the annotation of all the regional ICE corpora, in order to facilitate inter-comparisons of their linguistic content. However this goal has proved elusive because of gradual changes in the ICE annotation system, and additions to it made by those working on individual ICE corpora. Further, since the project pre-dates the development of XML-based markup standards, the format of the ICE markup does not match that in many modern corpora and can be difficult to manipulate. As a goal of the original project was interoperability of the various ICE corpora, it is important that the markup of existing and new ICE corpora can be converted into a common format that can serve their ongoing needs, while allowing older markup to be fully included. This paper describes the most significant variations in annotation, and focuses on several points of difficulty inherent in the system: especially the non-hierarchical treatment of the visual and structural elements of written texts, and of overlapping speech in spontaneous conversation. We report on our development of a parser to validate the existing ICE markup scheme and convert it to other formats. The development of this tool not only brings the Australian version into line with the current ICE standard, it also allows for proper validation of all annotation in any of the regional corpora. Once the corpora have been validated, they can be converted easily to a standardised XML format for alternate systems of corpus annotation, such as that developed by the TEI.

Notes on Conversion of GrAF to RDF

2011-02-19T04:32:47+11:00

The Graph Annotation Format (GrAF) is the XML data exchange format developed for the model of linguistic annotation described in the ISO Linguistic Annotation Framework (LAF). LAF is the abstract model of annotations represented as a graph structure; GrAF is an XML serialisation of the model intended for moving data between different tools. Both were developed by Nancy Ide and Keith Suderman at Vassar with input from the community involved in the ISO standardisation process around linguistic data.

Like the other candidate universal annotation models (e.g. Annotation Graphs and the model embodied by our own DADA system), LAF is a directed graph model. In this case, the graph consists of nodes, each associated with one or more annotations, connected by edges that represent relations between nodes (by default, the parent-child relation). This is almost exactly the same as the DADA model, but the minor differences have been tripping me up for a while as I’ve tried to understand LAF enough to write conversion filters to ingest data into the DADA system.

A visit to Vassar last week was the ideal opportunity to clear up my understanding. As a step towards updating our GrAF to DADA ingestion process which is implemented as an XSL stylesheet, I decided to write a stylesheet to convert GrAF into a fairly literal RDF model. This allowed me to think about the interpretation of GrAF structures independent of their translation to the current DADA model. I was concerned mainly with the structural elements of GrAF, rather than the annotation meta-data; this is equally important, but can be dealt with separately.

I armed myself with the latest version of the ISO LAF documentation and a copy of the manually annotated sub-corpus (MASC) of the American National Corpus. This is a nice sized data set where the automatically generated annotations have been manually checked and corrected.

GrAF is an XML format for standoff markup, meaning that the annotation is stored in a separate file to the source text rather than being embedded in the text as is normal in TEI for example. A single text has a number of associated XML annotation files, each containing a different kind of annotation. In the MASC corpus these include Penn Treebank, part of speech and named entity annotations. A single .anc file acts as a master reference and contains pointers to the raw text as well as the other XML files.

GrAF defines five main elements to represent annotation structures: nodes, annotations, edges, links and regions. The graph structure is made up of nodes and edges while regions define the parts of the source document being annotated. The link element relates a region to a node and the annotation element defines an annotation structure that can be attached to a node or an edge.

Identifiers

One thing that’s required for an RDF representation is that each entity is denoted by a unique identifier (a URI). Most but not all of the GrAF elements have identifiers denoted by the xml:id attribute so we can re-use this in the RDF representation prefixed with a suitable base URI. In choosing a base URI it makes sense to generate one that denotes the collection as a whole, something like http://www.anc.org/MASC/spoken/RindnerBonnie (not a working URI although DADA could make it so). So the first node in the Penn Treebank annotation for this document which has the xml:id ptb-n00000 would have the RDF identifier http://www.anc.org/MASC/spoken/RindnerBonnie#ptb-n00000.
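As a minimal sketch of the identifier scheme just described, written in Python rather than XSL, the function below joins a base URI and an xml:id with a fragment separator. The function name rdf_identifier is made up for illustration and is not part of the stylesheet.

```python
def rdf_identifier(base_uri: str, xml_id: str) -> str:
    """Join a collection base URI and an xml:id into a fragment-style URI."""
    # Drop any trailing '#' so we never produce a doubled separator.
    return f"{base_uri.rstrip('#')}#{xml_id}"


base = "http://www.anc.org/MASC/spoken/RindnerBonnie"
print(rdf_identifier(base, "ptb-n00000"))
```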

An implication of this is that all identifiers need to be unique within the collection of XML files. The GrAF specification doesn’t mandate this and the use of xml:id attributes will only ensure that the identifier is unique within the XML file. As it happens, many of the identifiers in the MASC corpus are made unique by being prefixed by the annotation set name (ptb in the example). Some are not unique however and so to generate useful RDF we need to either generate our own unique identifiers or fix the original data.
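One way to find the problem cases is to collect every xml:id across the annotation files for a document and report any that occur in more than one file. The following is only a sketch using Python's standard library; the file names and the tiny XML fragments are invented for the example and are not taken from MASC.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from typing import Dict, List

# ElementTree exposes xml:id under the reserved XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def duplicate_ids(documents: Dict[str, str]) -> Dict[str, List[str]]:
    """Map each xml:id seen more than once to the files it occurs in."""
    seen = defaultdict(list)
    for name, text in documents.items():
        for elem in ET.fromstring(text).iter():
            xml_id = elem.get(XML_ID)
            if xml_id is not None:
                seen[xml_id].append(name)
    return {i: files for i, files in seen.items() if len(files) > 1}


# Invented sample data: 'n1' is reused in both files.
docs = {
    "ptb.xml": '<graph><node xml:id="ptb-n0"/><node xml:id="n1"/></graph>',
    "vc.xml": '<graph><node xml:id="vc-n0"/><node xml:id="n1"/></graph>',
}
print(duplicate_ids(docs))
```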

One entity that doesn’t have an identifier is the annotation element. Annotations are connected to either a node or an edge and use a ‘ref’ attribute to indicate what they are attached to. To represent these in RDF, we generate a unique identifier for each one.

Types

We need RDF types for each of the entities being represented. The GrAF XML namespace URI can be repurposed to generate names for the types, e.g. http://www.xces.org/ns/GrAF/1.0/Node, abbreviated as graf:Node. We use capitalised names for RDF types as per the convention.

A second and more tricky type issue is that of denoting the different kinds of annotation that are used in the corpus. LAF avoids any reference to types because there is no consensus on what constitutes a type in this context. Instead it has the idea of an annotation set which gives a name to a group of annotations, for example Penn Treebank or Framenet. Each name has an associated URI defined in an annotationSet definition in either the annotation XML file or the corpus header. These aren’t formal namespace URIs, just a URI that would provide some information about the kind of annotation being used.

An annotation has the following form in the XML file:

    <a label="vchunk" ref="vc-n0" as="xces">
        <fs>
            <f name="voice" value="active"/>
            <f name="tense" value="SimPre"/>
            <f name="type" value="FVG"/>
        </fs>
    </a>

Something I’ve been a little confused about is the meaning of the ‘label’ attribute. Following some discussion, it seems that the label is a kind of annotation type and that we can think of it as being within a ‘namespace’ defined by the annotation set label (‘xces’ in this case). The three features listed in the feature set can also be thought of as being in the same namespace. Hence we can translate this to RDF as a resource of type graf:Annotation and introduce a property graf:type to denote the type of annotation:

<id35803001> a graf:Annotation;
    graf:type xces:vchunk;
    graf:annotates <id35993621>;
    xces:voice "active";
    xces:tense "SimPre";
    xces:type  "FVG" .

Note that we don’t translate the feature structure node into an RDF resource; feature structures map well directly to RDF properties and there is no sense in which the feature structure element has any status other than as a container for feature value pairs in the XML serialisation.
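The mapping from the XML annotation to the RDF resource can be sketched in a few lines of Python. This is an illustration only, not the XSL stylesheet: it assumes the element has no XML namespace (real GrAF files declare one), generates identifiers with a simple counter (a made-up scheme), and does no escaping of feature values.

```python
import xml.etree.ElementTree as ET
from itertools import count

SAMPLE = """<a label="vchunk" ref="vc-n0" as="xces">
    <fs>
        <f name="voice" value="active"/>
        <f name="tense" value="SimPre"/>
        <f name="type" value="FVG"/>
    </fs>
</a>"""

# Annotations carry no xml:id, so an identifier is generated (hypothetical scheme).
_counter = count(1)


def annotation_to_turtle(xml_text: str) -> str:
    """Render one GrAF <a> element as a Turtle resource of type graf:Annotation."""
    a = ET.fromstring(xml_text)
    prefix = a.get("as")  # annotation set label, used as the namespace prefix
    lines = [
        f"<ann{next(_counter)}> a graf:Annotation",
        f"    graf:type {prefix}:{a.get('label')}",
        f"    graf:annotates <{a.get('ref')}>",
    ]
    # Each feature becomes a property; the <fs> container is not a resource.
    for feat in a.iter("f"):
        lines.append(f'    {prefix}:{feat.get("name")} "{feat.get("value")}"')
    return " ;\n".join(lines) + " .\n"


print(annotation_to_turtle(SAMPLE))
```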

This all works well in most cases but there are a few instances in the MASC data that cause trouble. In a small number of files there is no annotation set associated with some annotations (e.g. in data/written/116CUL032-vc.xml). This means that there is no namespace to associate with the feature names. In the GrAF schema, the annotation set is marked as an optional attribute, so this is not an error. However, some way of assigning a default namespace to bare features like this is needed to convert to RDF. I’d argue that someone converting annotations to GrAF should be forced to make a decision and give a name (URI) to their annotation set; in this way, the ownership of annotations is clear and we won’t get confused between two uses of the same feature name by different people.

A second complication comes when a feature name or annotation set label is not a valid QName (XML element name). This makes the conversion to RDF/XML difficult although in some cases the name may still be a valid RDF identifier (URI). One example in the MASC data is a feature xmlns:xsi (e.g. in data/written/110CYL072-logical.xml), obviously translated literally from an XML instance. In this case, one could argue that the feature isn’t really an annotation on the source data and so shouldn’t be included, but it raises the issue of what a valid identifier should be. I think there’s a strong case for requiring all identifiers to be qualified names in the sense described by the XML Namespace standard, not just because I want to convert them easily to RDF, but because the concept of URI based names is so powerful in standards like this one. We already have an emerging data category registry (ISOCat) for names in the linguistic annotation space; this requirement would mesh well with the ISOCat facility to register names and would facilitate sharing of feature names and definitions.

In the style-sheet I’m writing now, I gloss over these two issues by generating a fake namespace URI where needed.
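One possible shape for such a workaround, sketched in Python rather than XSL, is shown below. It is not a transcription of the stylesheet: the fallback namespace URI, the example annotation set URI and the function name are all invented for illustration, and the name check is an ASCII-only simplification of the XML NCName rule.

```python
import re
from typing import Optional
from urllib.parse import quote

# Hypothetical placeholder namespace for annotations with no annotation set.
FALLBACK_NS = "http://example.org/graf/unknown#"

# ASCII-only approximation of an XML NCName (no colon allowed).
NCNAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.-]*")


def feature_uri(name: str, set_uri: Optional[str] = None) -> str:
    """Build a URI for a feature name, papering over both problem cases."""
    namespace = set_uri if set_uri else FALLBACK_NS
    # Names that are not usable as local names are percent-encoded instead.
    local = name if NCNAME.fullmatch(name) else quote(name, safe="")
    return namespace + local


print(feature_uri("voice"))
print(feature_uri("xmlns:xsi", "http://example.org/xces#"))
```

Percent-encoding keeps the result a usable URI, but it still cannot be written as a QName in RDF/XML, which is part of the argument above for requiring qualified names in the first place.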

Edges

In LAF, edges define relations between nodes and represent structural relations, mainly the parent-child relations needed to represent hierarchical structure. Edges can also have annotations attached to them and the main use-case for this is the need for relationship types other than the default parent-child; a co-reference relationship between two nodes would be represented by an edge with an attached annotation containing the type name as a feature value. Both of these cases are best represented in RDF by a regular relationship of an appropriate type. In the MASC corpus, there aren’t any examples of edges with attached annotations so all edges are converted to child relations by the stylesheet. As an illustration, a resource of type graf:Edge is also created; an annotation could be attached to this in the same way as it is to a node.
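The edge conversion can be sketched as follows, again in Python rather than XSL. The sketch assumes that edge elements carry from and to attributes naming the parent and child nodes, and the sample fragment and base URI are invented; check both assumptions against the GrAF schema before relying on them.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

GRAF = "http://www.xces.org/ns/GrAF/1.0/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

Triple = Tuple[str, str, str]


def edge_triples(graph_xml: str, base_uri: str) -> List[Triple]:
    """Turn each edge into a graf:child triple plus a graf:Edge resource."""
    triples: List[Triple] = []
    for edge in ET.fromstring(graph_xml).iter(f"{{{GRAF}}}edge"):
        # Assumption: 'from' is the parent node and 'to' is the child node.
        parent = f"{base_uri}#{edge.get('from')}"
        child = f"{base_uri}#{edge.get('to')}"
        triples.append((parent, GRAF + "child", child))
        edge_id = edge.get(XML_ID)
        if edge_id:
            # The edge itself is kept as a resource so annotations can attach to it.
            triples.append((f"{base_uri}#{edge_id}", RDF_TYPE, GRAF + "Edge"))
    return triples


# Invented fragment for illustration.
sample = (
    '<graph xmlns="http://www.xces.org/ns/GrAF/1.0/">'
    '<edge xml:id="ptb-e1" from="ptb-n1" to="ptb-n2"/>'
    "</graph>"
)
for triple in edge_triples(sample, "http://example.org/doc"):
    print(triple)
```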

Regions

Regions are the means by which nodes in the graph are attached to the source media that is being annotated. All regions in the MASC corpus are defined by two character offsets stored in the anchors attribute. The main issue with regions is not their representation in RDF but the choice of this kind of means of indicating location. I’ll leave that for another discussion as it doesn’t impact on the choices made here to generate RDF from GrAF.
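To make the anchors attribute concrete, here is a small Python sketch that resolves a region against the source text. It assumes the two offsets are zero-based with an exclusive end, which should be checked against the corpus documentation; the sample text is invented.

```python
def region_text(anchors: str, source: str) -> str:
    """Return the span of source text covered by a region's anchors value."""
    # The anchors attribute holds two whitespace-separated character offsets.
    start, end = (int(offset) for offset in anchors.split())
    return source[start:end]


text = "The quick brown fox"
print(region_text("4 9", text))
```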

Results

The most interesting result of this exercise is some insight into the design of GrAF and a better understanding on my part of the structures used in that format. However, we can also apply the stylesheet to the data in the MASC corpus to get a set of RDF/XML files. These can be fed into a triple store and queried with SPARQL.

To give an idea of the size of the data, the original XML files consist of 3,505,944 lines and 108M of text. This translates to 3,935,634 triples. I loaded this into a Sesame triple store and was able to browse the data easily using the workbench interface. Just as an illustration, a sample SPARQL query to find Penn Treebank annotations related by the child relation looks like:

PREFIX PTB:<http://www.cis.upenn.edu/~treebank/>
PREFIX graf:<http://www.xces.org/ns/GrAF/1.0/>
select ?parent ?plabel ?clabel
where {
        ?parent graf:child ?child .
        ?pann graf:annotates ?parent .
        ?pann graf:type PTB:tok .
        ?pann PTB:msd ?plabel .
        ?cann graf:annotates ?child .
        ?cann graf:type PTB:tok .
        ?cann PTB:msd ?clabel .
}

This runs reasonably quickly via the workbench web interface and returns a long list of results such as:

Parent	Plabel	Clabel
http://example.org/Article247_327/ptb-n00252	“PRP$”	“NN”
http://example.org/Article247_327/ptb-n00806	“PRP$”	“JJ”
http://example.org/Article247_327/ptb-n00973	“PRP$”	“JJ”
http://example.org/Article247_327/ptb-n00973	“PRP$”	“NNP”
http://example.org/Article247_327/ptb-n00370	“PRP$”	“JJ”
http://example.org/Article247_327/ptb-n00370	“PRP$”	“JJ”

Summary

This has been a useful exercise in understanding the structure of GrAF and hopefully illustrating some of the advantages of an RDF translation, in particular the usefulness of proper identifiers for each of the objects being described. I’ll take what I’ve learned here and modify the current GrAF ingestion scripts that are used to load annotations into the DADA triple store. Once that’s done I should be able to publish a sample DADA linked data interface to the MASC corpus. Watch this space for a link.

The stylesheet can be found in the DADA source tree: graf2rdf.xsl or check the DADA project on Bitbucket for a more recent version.

DADA Project Update

2011-02-07T07:13:24+11:00

The DADA project is developing software for managing language resources and exposing them on the web. Language resources are digital collections of language as audio, video and text used to study language and build technology systems. The project has been going for a while with some initial funding from the ARC to build the basic infrastructure and later from Macquarie University for some work on the Auslan corpus of Australian Sign Language collected by Trevor Johnston. Recently we have two projects which DADA will be part of, and so the pace of development has picked up a little.

The Australian National Corpus (AusNC) is an effort to build a centralised collection of resources of language in Australia. The core idea is to take whatever existing collections we can get permission to publish and make them available under a common technical infrastructure. Using some funding from HCSNet we built a small demonstration site that allowed free text search on two collections: the Australian Corpus of English and the Corpus of Oz Early English. We now have some funding to continue this work and expand both the size of the collection and the capability of the infrastructure that will support it. What we’ve already done is to separate the text in these corpora from their meta-data (descriptions of each text) and the annotation (denoting things within the texts). While the pilot allows searching on the text, the next steps will allow search using the meta-data (look for this in texts written after 1900) and the annotation (find this in the titles of articles). This project is funded by the Australian National Data Service (ANDS) and is a collaboration with Michael Haugh at Griffith.

The Big Australian Speech Corpus, more recently renamed AusTalk, is an ARC funded project to collect speech and video from 1000 Australian speakers for a new freely available corpus. The project involves many partners around the country each of whom will have a ‘black box’ recording station to collect audio and stereo video of subjects reading words and sentences, being interviewed and doing the Map task - a game designed to elicit natural speech between two people. Our part of the project is to provide the server infrastructure that will store the audio, video and annotation data that will make up the corpus. DADA will be part of this solution but the main driver is to be able to provide a secure and reliable store for the primary data as it comes in from the collection sites. An important feature of the collection is the meta-data that will describe the subjects in the recording. Some annotation of the data will be done automatically, for example some forced alignment of the read words and sentences. Later, we will move on to support manual annotation of some of the data - for example transcripts of the interviews and map task sessions. All of this will be published via the DADA server infrastructure to create a large, freely available research collection for Australian English.

Since the development of DADA now involves people outside Macquarie, we have started using a public Bitbucket repository for the code. As of this writing the code still needs some tidying and documentation before third parties can install and work on it, but we hope to have that done within a month. The public DADA demo site is down at the moment due to network upgrades at Macquarie (it’s only visible inside MQ) - I hope to have that fixed soon with some new sample data sets loaded up for testing. 2011 looks like it will be a significant year for DADA. We hope to end this year with a number of significant text, audio and video corpora hosted on DADA infrastructure and providing useful services to the linguistics and language technology communities.