2023-2024 Roadmap
(Based on discussion between Gilberto, Mark, and Bob)
NCI Thesaurus continues to grow. It is reaching the point where, even with large amounts of RAM, it is increasingly hard to edit content, classify it, and perform ad hoc SPARQL queries. This page discusses several enhancements and reworkings of the software intended to achieve four goals:
- increase the scalability of the system to allow terminologies of any size by not requiring that the entire content be in memory at once
- enhance the Protege server to support an RDF triple store and SPARQL queries over it, keeping it in sync with the terminology as it is edited
- make the EditTab customizations used at NCI into configurable, more generally useful components. This will allow other content developers to perform complex operations such as splits, copies, merges, and retirements without needing annotation properties specific to NCI
- develop a business rule capability that enables modelers to encode business rules using the ontology they are building. This will enable more flexible and auditable workflows among modelers.
This roadmap discusses these goals in detail and lays out an approach to design and implementation. We will reference GitHub issues and use GitHub's Projects support in parallel with this roadmap document so that stakeholders can read and follow the activity.
Protege 5 was not originally designed to support a client/server mode; this was added after the fact. Currently the client connects over HTTP to a server that is used mainly to provide a concurrent editing environment for modelers. An MVCC approach ensures that a strict linear sequence of versions is maintained: users are not allowed to commit changes to the server unless they are currently editing against the latest version (a sketch of this check follows the list below). The terminology is divided into two components:
- a baseline, a serialized version of an OWL file
- a set of incremental changes made to that baseline
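To make the commit rule above concrete, here is a minimal sketch of an MVCC-style server-side check, assuming a simple per-ontology revision counter and a `ChangeSet` type. All names are hypothetical; this is an illustration of the optimistic concurrency idea, not the actual Protege server code.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of an optimistic (MVCC-style) commit check.
// OntologyHistory and ChangeSet are hypothetical names.
class OntologyHistory {
    private final List<ChangeSet> changeSets = new ArrayList<>();
    private long headRevision = 0;

    // A client may only commit if it is editing against the latest revision.
    synchronized long commit(long clientRevision, ChangeSet changes) {
        if (clientRevision != headRevision) {
            throw new IllegalStateException(
                "Client is at revision " + clientRevision
                + " but server head is " + headRevision + "; update first.");
        }
        changeSets.add(changes);
        headRevision++;              // strict linear sequence of versions
        return headRevision;
    }

    // Changes a stale client needs to apply before it is allowed to commit.
    synchronized List<ChangeSet> changesSince(long revision) {
        return new ArrayList<>(changeSets.subList((int) revision, changeSets.size()));
    }
}

class ChangeSet { /* the actual ontology edits are omitted for brevity */ }
```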
The baseline is stored on both the client and the server. Periodically, all the changesets are incorporated into a new baseline through a process called squashing. Workflow managers do this work using the revision history plugin, which allows them to view the history and check for conflicts. Once the history is seen to be consistent, a new snapshot is created on the server. Clients use a checksum at login to check whether there is a new snapshot to fetch.
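The client-side checksum check could look roughly like the sketch below, which compares a SHA-256 digest of the local baseline against a checksum reported by the server; the hashing scheme and class names are assumptions, not the existing implementation.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical client-side check: fetch a new baseline only when the
// server's checksum differs from the checksum of the local copy.
class SnapshotChecker {
    static String sha256Of(Path baseline) throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        try (InputStream in = Files.newInputStream(baseline)) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) != -1) {
                md.update(buf, 0, n);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    static boolean needsNewSnapshot(Path localBaseline, String serverChecksum)
            throws IOException, NoSuchAlgorithmException {
        return !sha256Of(localBaseline).equalsIgnoreCase(serverChecksum);
    }
}
```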
The server also maintains configuration data for all the various ontologies under development, as well as user login/password and permissions data. Currently this is all stored in one large configuration file. A key goal is to effect a separation of concerns in order to make the server more robust and the clients more flexible.
The server also maintains some other files specific to NCI's use cases, e.g. a history file of edit operations and a code file used to generate unique codes that serve as IDs for new classes.
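As an illustration only, the code file could be as simple as a persisted counter; the actual NCI code format and file layout are not described here, so everything below is a hypothetical sketch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical generator of unique IDs for new classes, backed by a file
// that stores the last code handed out.
class CodeGenerator {
    private final Path codeFile;

    CodeGenerator(Path codeFile) {
        this.codeFile = codeFile;
    }

    // Read, increment, and persist the counter; each caller gets a fresh code.
    synchronized String nextCode(String prefix) throws IOException {
        long last = Files.exists(codeFile)
                ? Long.parseLong(Files.readString(codeFile).trim())
                : 0L;
        long next = last + 1;
        Files.writeString(codeFile, Long.toString(next));
        return prefix + next;   // e.g. a prefix such as "C" plus the counter
    }
}
```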
EditTab is a Protege plugin that adds considerable capabilities to the editors built into Protege. It supports larger, more complex operations on classes such as splits, merges, and retirements, which assist modelers in keeping the terminology consistent. Additionally, certain business rules are implemented that enforce constraints imposed by NCI, and there is integration with Lucene searching and a description logic classifier that further helps curate the editing operations.
A key concept in the use of EditTab is a project. Currently, projects are configured in the server's config file and are very specific to NCI. Protege has the notion of a metaproject, which is intended to be more general, and all of NCI's customizations are in a single array of optional properties.
However, it's clear that a certain amount of this advanced functionality is common to all terminology editing. Splits, merges, cloning, and retiring of classes are operations many need to use. So the goal here is twofold:
- Factor out the NCI specifics of the operations, e.g. the properties used to annotate the operations, leaving the core complex operations fully usable with no customizations (a sketch of what this could look like follows this list).
- Allow users to add these customizations as part of the process of configuring new projects.
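To make the first point above more concrete, here is a hypothetical sketch of a factored-out operation interface, with the project-specific annotation properties injected rather than hard-coded. The names (`ComplexOperation`, the role-to-IRI map) are assumptions for illustration, not existing EditTab code.

```java
import java.util.Map;
import org.semanticweb.owlapi.model.IRI;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLOntology;

// Hypothetical interface for a complex edit operation (split, merge, clone,
// retire) whose annotation behaviour is supplied by project configuration.
interface ComplexOperation {
    String name();                                   // e.g. "split", "merge"

    // Apply the operation; the optional annotation properties (if any) are
    // looked up by role, e.g. "mergeSource" -> some project-specific IRI.
    void apply(OWLOntology ontology,
               OWLClass target,
               Map<String, IRI> annotationProperties);
}
```

With this shape, an NCI project would bind roles such as "mergeSource" to its own annotation property IRIs, while a project that does not annotate operations would simply pass an empty map and still get the core behaviour.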
A typical workflow might be the following:
A new modeler starts by loading an OWL file in Protege in file mode. As they work with it they may decide they need splits or merges. In the preferences tab they enable these operations and possibly annotate them, choosing the properties they wish to use as a kind of metaproperties for the operations. They then save this configuration as a potential project config file.
When they wish to begin collaborating with other modelers and concurrently editing the ontology, they use the metaproject admin plugin to create a new project. In addition to specifying the OWL file for the baseline, they also specify the config file they've created. This then becomes the shared config that all modelers see when they log in and open the project.
We've already discussed this use case in the need for optional project properties, having EditTab work in standalone mode, and adding preferences for EditTab.
The SPARQL query plugin has proven very useful for curating and repairing the terminology because it allows for ad hoc queries. In practice, a fixed set of queries has been developed and is run as needed; the plugin incorporates these as bookmarks. Occasionally the need arises for new ones, which are then added to the bookmarks list.
The main drawback is that it builds a triple store in memory. This adds considerably to the memory required, since Protege already keeps the entire terminology in memory. Additionally, when the classifier needs to be run, it also builds a new object graph in memory.
The proposed solution is to run a Virtuoso triple store on the server, shared by all the modelers. Queries are read-only and will be executed directly against the Virtuoso triple store. On the client side we will maintain the same API, keeping our future options open. One upside of this approach is that it simplifies the client: when edits are committed to the server, the server will update the triple store to keep it in sync, and the clients will not need to worry about this. This will clearly improve memory usage.
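To illustrate what keeping the same client API might look like once queries run remotely, here is a sketch that executes a SELECT query against a remote SPARQL endpoint (such as Virtuoso's) using Apache Jena ARQ. The endpoint URL and query are placeholders, and the plugin's real API may differ.

```java
import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

// Hypothetical client call: the query runs on the server-side triple store
// (e.g. Virtuoso's SPARQL endpoint), so nothing is loaded into client memory.
public class RemoteSparqlExample {
    public static void main(String[] args) {
        String endpoint = "http://example.org:8890/sparql";   // placeholder URL
        String queryString =
            "SELECT ?cls ?label WHERE { "
            + "  ?cls a <http://www.w3.org/2002/07/owl#Class> ; "
            + "       <http://www.w3.org/2000/01/rdf-schema#label> ?label "
            + "} LIMIT 10";

        Query query = QueryFactory.create(queryString);
        try (QueryExecution qexec = QueryExecutionFactory.sparqlService(endpoint, query)) {
            ResultSet results = qexec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.next();
                System.out.println(row.get("cls") + "  " + row.get("label"));
            }
        }
    }
}
```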
Looking at the editing rules currently enforced by EditTab, they are considerable. Many of them have influenced the UI design and the overall workflow in EditTab. It has been a long-standing desire to come up with a more generic solution and refactor these business rules.
One approach would be to define a few interfaces that allow EditTab to call out to third-party code at key points in the process, e.g. before committing edits, passing the class or classes in question and some context. This plugin approach was used in earlier systems (TDE) with some success. Another approach would be to use a declarative framework based on the OWL language so that the constraints would be encoded in the ontology itself, e.g. using SHACL.
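A minimal sketch of the callout (plugin) approach, assuming a hypothetical pre-commit hook interface that EditTab would invoke with the edited classes and some context; the names and the context object are illustrative only.

```java
import java.util.List;
import java.util.Set;
import org.semanticweb.owlapi.model.OWLClass;
import org.semanticweb.owlapi.model.OWLOntology;

// Hypothetical pre-commit hook: third-party rule implementations are given
// the classes being edited plus some context, and may veto the commit.
interface BusinessRule {
    // Returns a list of violation messages; an empty list means the edit is allowed.
    List<String> checkBeforeCommit(OWLOntology ontology,
                                   Set<OWLClass> editedClasses,
                                   EditContext context);
}

// Placeholder for whatever context EditTab would pass (user, operation type, etc.).
class EditContext {
    final String operation;   // e.g. "merge", "retire"
    final String user;

    EditContext(String operation, String user) {
        this.operation = operation;
        this.user = user;
    }
}
```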
This is by far the most challenging task, mainly due to the design of Protege, which is intended to edit multiple ontologies in the same running instance. NCI Thesaurus alone will easily take up most of the resources on an average desktop, even with 16 GB of RAM. Moreover, Protege has both its own protege-owl layer and the OWL API; it's not clear both are necessary. This is one of the major goals this year.
The best approach to handling larger ontologies is to focus on the most typical use cases, which demonstrate how users work and how much of the ontology needs to be in memory at any one time. A well-built ontology has a natural taxonomic structure, so we have a navigation widget that allows the user to walk the hierarchy. At any point in the process the user only needs to see the classes on one page. As they select a class, the right-hand panel displays its property and logical detail. The lion's share of the data is in the form of annotation properties, many of which are complex and themselves have other properties on them. This data doesn't need to be in RAM unless the editor is viewing or editing it.
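As a sketch of this lazy-loading idea, the detail for a class could be fetched from the server only when the user selects it in the hierarchy, and cached afterwards; all names below are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical cache of per-class detail: the full annotation data for a
// class is fetched from the server only when the user first selects it.
class ClassDetailCache {
    interface DetailFetcher {
        ClassDetail fetch(String classIri);   // e.g. an HTTP call to the server
    }

    private final Map<String, ClassDetail> cache = new ConcurrentHashMap<>();
    private final DetailFetcher fetcher;

    ClassDetailCache(DetailFetcher fetcher) {
        this.fetcher = fetcher;
    }

    // Called when the user selects a class in the navigation widget.
    ClassDetail detailsFor(String classIri) {
        return cache.computeIfAbsent(classIri, fetcher::fetch);
    }
}

class ClassDetail { /* annotation properties, qualifiers, etc. */ }
```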
Another use case is searching via Lucene. When a user performs a search, the results are presented as a list of classes. Additional data is only required if the user selects something from the list to view or edit.
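For the search use case, here is a sketch using Lucene directly, assuming the index stores a `label` field for matching and an `iri` field identifying the class; the field names, analyzer, and index location are assumptions rather than the plugin's actual schema.

```java
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;

// Hypothetical search: return only the IRIs of matching classes; detailed
// data is fetched later, only for the class the user actually selects.
public class ClassSearch {
    public static List<String> search(String indexDir, String userQuery) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(indexDir)))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("label", new StandardAnalyzer()).parse(userQuery);
            List<String> iris = new ArrayList<>();
            for (ScoreDoc hit : searcher.search(query, 50).scoreDocs) {
                Document doc = searcher.doc(hit.doc);
                iris.add(doc.get("iri"));
            }
            return iris;
        }
    }
}
```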
So a good deal of the data can be fetched on demand. The only time the entire ontology is needed in memory is during classification, and none of the annotation data is needed for that: only the object graph, i.e. the IDs representing the classes and the object properties that link the classes to one another.
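Using the OWL API, that idea might look like the sketch below: copy only the logical axioms into a fresh ontology and hand that to the reasoner, leaving annotation data out. The choice of reasoner factory is left open, and this is an illustration rather than the planned implementation.

```java
import java.util.HashSet;
import java.util.Set;
import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.OWLAxiom;
import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.OWLOntologyCreationException;
import org.semanticweb.owlapi.model.OWLOntologyManager;
import org.semanticweb.owlapi.reasoner.InferenceType;
import org.semanticweb.owlapi.reasoner.OWLReasoner;
import org.semanticweb.owlapi.reasoner.OWLReasonerFactory;

// Sketch: classify against a copy of the ontology that contains only the
// logical axioms (the "object graph"), leaving annotation data out of memory.
public class LogicalOnlyClassification {
    public static OWLReasoner classifyWithoutAnnotations(
            OWLOntology full, OWLReasonerFactory reasonerFactory)
            throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        // getLogicalAxioms() excludes annotation assertion axioms.
        Set<OWLAxiom> logicalAxioms = new HashSet<>(full.getLogicalAxioms());
        OWLOntology logicalOnly = manager.createOntology(logicalAxioms);
        OWLReasoner reasoner = reasonerFactory.createReasoner(logicalOnly);
        reasoner.precomputeInferences(InferenceType.CLASS_HIERARCHY);   // classify
        return reasoner;
    }
}
```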
Several different approaches have been prototyped and researched, which will be described here: