Blog Taxonomy

Posts tagged with 'jena'

UUID's in Apache Jena

kinow @ Aug 11, 2018 19:02:16

In this post I won’t talk about what UUID’s are, or how they work in Java. Here’s a great article on that. Or see the always reliable Wikipedia article about it (or, if you would rather, read RFC 4122).

I found out that Jena had UUID implementations after writing a previous post. I then decided to look into which UUID’s Jena has, and where these UUID’s are used. This way I would either understand why Jena needed UUID’s, or just be better prepared in case I ever stumbled upon a change in Jena that required related work.

Jena Core’s org.apache.jena.shared.uuid

This package is small and simply contains: factories,

  • UUIDFactory: interface for a factory of UUID’s
  • UUID_V1_Gen: a factory for UUID_V1
  • UUID_V4_Gen: a factory for UUID_V4

and UUID implementations,

  • JenaUUID: abstract base class for UUID implementations
  • UUID_nil: a special UUID, nil, filled with zeroes
  • UUID_V1: UUID V1
  • UUID_V4: UUID V4

and a utility class

  • LibUUID: with methods to create a Random and to create byte[] seeds.

JenaUUID contains a method to return a JenaUUID as a Java UUID. It is used in the command line utility juuid, for transaction ID’s, and when a new dataset is created. For the new dataset, Fuseki will create files in a temporary location. The name of the temporary location is created using an instance of JenaUUID.

UUID_V1 and UUID_V4

Jena’s UUID_V1 is an implementation of Version 1 (time based), variant 2 (DCE), which means it uses the MAC address and a timestamp to generate the universally unique ID’s.

It uses NetworkInterface.getNetworkInterfaces() to retrieve the MAC address of the node running Jena. When using localhost, the MAC address is not available, so it resorts to using a random number.
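A minimal sketch of that lookup and fallback, using only the JDK. This is not Jena's code: the class name NodeIdSketch and the 6-byte length check are assumptions made for illustration.

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.security.SecureRandom;
import java.util.Collections;
import java.util.Enumeration;

// Hypothetical helper, not part of Jena.
final class NodeIdSketch {

    /** Returns a 6-byte node id: the first 6-byte MAC address found, else random bytes. */
    static byte[] nodeId() {
        try {
            Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
            if (interfaces != null) {
                for (NetworkInterface ni : Collections.list(interfaces)) {
                    // Returns null when the interface has no hardware address (e.g. loopback).
                    byte[] mac = ni.getHardwareAddress();
                    if (mac != null && mac.length == 6) {
                        return mac;
                    }
                }
            }
        } catch (SocketException e) {
            // No usable interface information: fall through to the random node id.
        }
        byte[] random = new byte[6];
        new SecureRandom().nextBytes(random);
        // RFC 4122 asks for the multicast bit to be set on random node ids,
        // so they can never collide with a real network card address.
        random[0] |= 0x01;
        return random;
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : nodeId()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println("node id: " + hex);
    }
}
```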


And Jena’s UUID_V4 is an implementation of Version 4 (random), variant 2 (DCE), which means it uses random numbers to generate the universally unique ID’s.

The factory for V4 will have a random for the most significant bits, and for the least significant bits of the UUID (also including version and variant). The random for the factory is created by LibUUID#makeRandom(). This method returns a SecureRandom with two seeds, one being random, and the other created with LibUUID#makeSeed().

UUID_V4 uses a SecureRandom created locally but with the seed also set by LibUUID#makeSeed(). The seed returned by this method may use the MAC address, but will also use the os.version, user.name, java.version, number of active threads, total memory, free memory, and the hash code of a newly created Object.
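To make the idea concrete, here is a rough sketch of building a seed from those JVM and operating system values and feeding it to a SecureRandom. It is not a copy of LibUUID: the class name SeedSketch and the SHA-256 hashing step are my own assumptions, and the MAC address part is left out.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;

// Hypothetical helper, loosely modelled on the description above.
final class SeedSketch {

    /** Mixes environment values into a 32-byte seed (hashing is an assumption of this sketch). */
    static byte[] makeSeed() {
        StringBuilder sb = new StringBuilder();
        sb.append(System.getProperty("os.version"));
        sb.append(System.getProperty("user.name"));
        sb.append(System.getProperty("java.version"));
        sb.append(Thread.activeCount());
        sb.append(Runtime.getRuntime().totalMemory());
        sb.append(Runtime.getRuntime().freeMemory());
        sb.append(new Object().hashCode());
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            return digest.digest(sb.toString().getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 should be available on every JVM", e);
        }
    }

    /** Creates a SecureRandom and adds the environment seed to it. */
    static SecureRandom makeRandom() {
        SecureRandom random = new SecureRandom();
        // Per the SecureRandom Javadoc, setSeed supplements the existing seed.
        random.setSeed(makeSeed());
        return random;
    }

    public static void main(String[] args) {
        System.out.println("seed length: " + makeSeed().length);
        System.out.println("sample: " + makeRandom().nextLong());
    }
}
```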

Transaction ID UUID (TxnIdUuid) — uses JenaUUID

Jena contains two implementations of TxnId (transaction identifiers),

  • TxnIdSimple: transaction IDs are created with a counter within each JVM
  • TxnIdUuid: transaction IDs are created using JenaUUID.
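The difference between the two strategies can be sketched with plain JDK classes. The names below are hypothetical, and the UUID variant uses java.util.UUID (version 4) only to keep the example free of dependencies, whereas TxnIdUuid goes through JenaUUID.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of the two id strategies, not Jena's TxnId classes.
final class TxnIdSketch {

    private static final AtomicLong COUNTER = new AtomicLong(0);

    /** Counter-based id: cheap and ordered, but only unique within this JVM. */
    static long nextCounterId() {
        return COUNTER.incrementAndGet();
    }

    /** UUID-based id: no ordering, but unique across JVMs for practical purposes. */
    static UUID nextUuidId() {
        return UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println("counter: " + nextCounterId() + ", " + nextCounterId());
        System.out.println("uuid:    " + nextUuidId());
    }
}
```

A counter restarts whenever the JVM restarts, which is presumably why a UUID-based option exists as well.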

The first thing that caught my eye in this class was the inconsistency with the name - which is quite normal in large projects such as Jena.

As TxnIdUuid calls JenaUUID#generate(), it will use the default factory, UUID_V1_Gen. Then it will call asUUID to return a Java UUID object with the same value.

Create a new dataset in Fuseki (ActionDatasets) — uses JenaUUID

When you create a new dataset in Fuseki, as explained in the previous post, Fuseki will create some temporary files and folders. For at least one folder, it will use an instance of JenaUUID, in ActionDatasets#execPostContainer().

Blank node IDs (BlankNodeId) — uses Java’s UUID

Blank nodes in Jena need an identifier too. It is possible to configure Jena to return a JVM bound counter (similarly to how TxnIdSimple works); otherwise blank node identifiers will be generated with java.util.UUID.randomUUID().

I wonder why the transaction ID’s use Jena’s JenaUUIDs, but the blank node IDs use Java’s UUID? They are compatible anyway.

Other methods related to blank nodes also use Java’s UUID,

  • BlankNodeAllocatorFixedSeedHash
  • BlankNodeAllocatorHash#freshSeed().

SPARQL functions, and NodeFunctions — uses Java’s UUID

SPARQL 1.1 contains functions UUID and STRUUID. Apache Jena provides these two functions, and users can use them in queries such as

SELECT (UUID() AS ?uuid) (StrUUID() AS ?strUuid) WHERE { }

(before these existed, users had to call extra functions in a different namespace).

The function implementations use NodeFunctions methods struuid and uuid. Both methods in NodeFunctions use Java’s UUID, and not JenaUUID.
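In SPARQL 1.1, UUID() returns a fresh IRI in the urn:uuid scheme and STRUUID() returns only the string form of a UUID. A small sketch of the two shapes built on Java's UUID follows; the class and method names are hypothetical and this is not the NodeFunctions code.

```java
import java.util.UUID;

// Hypothetical sketch of the two result shapes.
final class SparqlUuidSketch {

    /** The STRUUID() shape: just the 36-character UUID string. */
    static String strUuid() {
        return UUID.randomUUID().toString();
    }

    /** The UUID() shape: the same kind of string, wrapped as a urn:uuid IRI. */
    static String uuidIri() {
        return "urn:uuid:" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        System.out.println(strUuid());
        System.out.println("<" + uuidIri() + ">");
    }
}
```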

Files and directories for databases / datasets — uses Java’s UUID

Files and directories created in Jena use Java’s UUID,

  • BufferAllocatorMapped#getNewTemporaryFile()
  • TDBBuilder#create methods and ComponentIdMgr constructor
  • AbstractDataBag#getNewTemporaryFile().
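The general pattern looks roughly like the sketch below: a random UUID in the file name makes clashes between concurrent writers very unlikely. The class name, method signature and file name layout are assumptions for illustration, not the code of the classes listed above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

// Hypothetical helper showing UUID-based temporary file names.
final class TempNameSketch {

    /** Creates an empty file named prefix-UUID.tmp inside dir and returns its path. */
    static Path newTemporaryFile(Path dir, String prefix) {
        Path candidate = dir.resolve(prefix + "-" + UUID.randomUUID() + ".tmp");
        try {
            // createFile fails if the file already exists, so a clash would not go unnoticed.
            return Files.createFile(candidate);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("sketch");
        System.out.println(newTemporaryFile(dir, "databag"));
    }
}
```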

Conclusion

In Jena there are places where instances of JenaUUID are used to produce a UUID, and other places where Java’s UUID is used.

Java’s UUID provides a variant 2 version 4 (random DCE), which is equivalent to UUID_V4. But there is no equivalent of UUID_V1, the default used in Jena.
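This is easy to check with the JDK alone. The sketch below (class name is mine) inspects the version and variant of what java.util.UUID generates: randomUUID() gives version 4, the only other factory, nameUUIDFromBytes(), gives version 3, and there is no time-based generator.

```java
import java.nio.charset.StandardCharsets;
import java.util.UUID;

// Small check of what java.util.UUID can generate.
final class JavaUuidSketch {

    static String describe(UUID uuid) {
        return uuid + " version=" + uuid.version() + " variant=" + uuid.variant();
    }

    public static void main(String[] args) {
        // Random UUID: version 4, variant 2.
        System.out.println(describe(UUID.randomUUID()));
        // Name-based UUID (MD5): version 3, variant 2.
        System.out.println(describe(UUID.nameUUIDFromBytes("jena".getBytes(StandardCharsets.UTF_8))));
    }
}
```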

And even though UUID_V4 and UUID are compatible, I believe Jena’s version is using a seed with so many JVM and operating system related settings (os.version, free memory, etc) in order to have a unique seed per node running Jena, independent of whether there are multiple JVM’s in the same node.

But to be honest, I am still not sure which one I would have to use, nor if there are cases where I should pick one over the other…

EDIT: Apache Jena’s lead dev replied with a bit of history about the project too (:

What happens when you create a new dataset in Apache Jena Fuseki

kinow @ May 29, 2018 18:59:04

Last post was about what happens when you upload a Turtle file to Apache Jena Fuseki. And now today’s post will be about what happens when you create a new dataset in Apache Jena Fuseki.

In theory, that happens before you upload a Turtle file, but this post series won’t follow a logical order. It will be more based on what I find interesting.

Oh, the dataset created is an in-memory dataset. Here’s a simplified sequence diagram. Again, these articles are more brain-dumps, used by myself for later reference.

ActionDatasets#execPostContainer() (Fuseki Core)

ActionDatasets, as per the name, handles HTTP requests related to datasets. Such as when you create a new dataset. It is also an HTTP action. The concept in Jena, as far as I could tell, is similar to Jenkins’ actions, but simpler, without the UI/Jelly/Groovy part.

This class also has a few static fields, which hold system information. One variable is actually called system. (wonder how well it works if you try to deploy Jena Fuseki with multiple JVM’s).

Its first task is to create a UUID, using JenaUUID (from Jena Core). This class looks very interesting, wonder how it works.

Then it creates a DatasetDescriptionRegistry, which is a registry to keep track of the datasets created. There is also some validation of parameters and state check, and then the transaction is started (system.begin()).

Model / ModelFactory (Core)

I used Model and ModelFactory before when working with ontologies and Protégé. ActionDatasets will create a Model.

An RDF Model. An RDF model is a set of Statements. Methods are provided for creating resources, properties and literals and the Statements which link them, for adding statements to and removing them from a model, for querying a model and set operations for combining models.

It also gets a StreamRDF from the model (i.e. model.getGraph()), which will be used later by the RDFParser.

ActionDatasets#assemblerFromForm() (Fuseki)

In #assemblerFromForm(), it will create a template, and then use RDFParser to parse the template and load it into the StreamRDF. The template looks like this:

# Licensed under the terms of http://www.apache.org/licenses/LICENSE-2.0

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .

## ---------------------------------------------------------------
## Updatable in-memory dataset.

<#service1> rdf:type fuseki:Service ;
    # URI of the dataset -- http://host:port/aaa
    fuseki:name                        "aaa" ;
    fuseki:serviceQuery                "sparql" ;
    fuseki:serviceQuery                "query" ;
    fuseki:serviceUpdate               "update" ;
    fuseki:serviceUpload               "upload" ;
    fuseki:serviceReadWriteGraphStore  "data" ;     
    fuseki:serviceReadGraphStore       "get" ;
    fuseki:dataset                     <#dataset> ;
    .

# Transactional, in-memory dataset. Initially empty.
<#dataset> rdf:type ja:DatasetTxnMem .

Persisting the dataset

Jena Fuseki will create a local copy of the template, using RIOT’s RDFDataMgr. For my environment, running from Eclipse, the file location was /home/kinow/Development/java/jena/jena/jena-fuseki2/jena-fuseki-core/run/system_files/902154aa-2bb6-11b2-8053-024232e7b374.
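The file name in that path is itself a UUID, and java.util.UUID can tell us what kind. A quick sketch (class name is mine) that parses the name from my environment and prints its version, variant and node field:

```java
import java.util.UUID;

// Inspects the UUID used as the file name in the path above.
final class FileNameUuidSketch {

    static final String FILE_NAME = "902154aa-2bb6-11b2-8053-024232e7b374";

    /** Formats the 48-bit node field of a version 1 UUID as 12 hex digits. */
    static String nodeHex(UUID uuid) {
        return String.format("%012x", uuid.node());
    }

    public static void main(String[] args) {
        UUID uuid = UUID.fromString(FILE_NAME);
        System.out.println("version: " + uuid.version()); // 1, time based
        System.out.println("variant: " + uuid.variant()); // 2
        System.out.println("node:    " + nodeHex(uuid));  // last group of the UUID
    }
}
```

The name parses as a version 1, variant 2 UUID.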

Fuseki now will look for exactly one Service Name statement (http://jena.apache.org/fuseki#name). The name will be validated for things like blank space, empty, ‘/’, etc.
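A rough sketch of what such a validation could look like. The class name, method name and exact set of rules are assumptions based on the checks mentioned above, not Fuseki's actual code.

```java
// Hypothetical validator for dataset names.
final class DatasetNameSketch {

    /** Rejects null, empty, whitespace-only names, names with spaces, and a bare "/". */
    static boolean isValid(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        if (name.trim().isEmpty()) {
            return false; // only whitespace
        }
        if (name.contains(" ")) {
            return false; // blank space inside the name
        }
        if (name.equals("/")) {
            return false; // a bare slash is not a usable name
        }
        return true;
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"aaa", "/aaa", "", "a b", "/"}) {
            System.out.println("'" + candidate + "' -> " + isValid(candidate));
        }
    }
}
```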

Once the validation passes, it will persist the file somewhere like /home/kinow/Development/java/jena/jena/jena-fuseki2/jena-fuseki-core/run/configuration/aaa.ttl. But not without first checking if the file already exists.

As this is a brand new file, the Model instance will be written to the file now.
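For comparison, the JDK can combine the "does it exist?" check and the write into one step. The sketch below is a swapped-in technique, not what Fuseki does: it uses StandardOpenOption.CREATE_NEW, and the class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical helper: write a configuration file only if it is not there yet.
final class WriteOnceSketch {

    /** Returns true if the file was created and written, false if it already existed. */
    static boolean writeIfAbsent(Path file, String content) {
        try {
            // CREATE_NEW makes the existence check and the creation a single operation.
            Files.write(file, content.getBytes(StandardCharsets.UTF_8),
                    StandardOpenOption.CREATE_NEW, StandardOpenOption.WRITE);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempDirectory("sketch").resolve("aaa.ttl");
        System.out.println(writeIfAbsent(file, "# first"));  // created
        System.out.println(writeIfAbsent(file, "# second")); // already there
    }
}
```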

DataAccessPoint (ARQ)

Funny, when I wrote it I immediately put this class under Fuseki Core, but it is actually in ARQ. Why not in Fuseki? Looks like ARQ has a Web layer too.

Fuseki’s FusekiBuilder#buildDataAccessPoint() creates the DataAccessPoint. The DataAccessPoint’s Javadocs say: “A name in the URL space of the server”.

  • DataService contains operations, and endpoints
  • Services are added to endpoints, such as an endpoint for the Quads_RW, the REST_Quads_RW from previous post
  • The name and the data service are used to create the DataAccessPoint

The current HttpAction in the request contains a reference to Fuseki’s DataAccessPointRegistry. Fuseki’s DataAccessPointRegistry extends Atlas’ Registry, which uses a ConcurrentHashMap (again, how does it work with multiple JVM’s?).

The new DataAccessPoint is registered with Fuseki’s registry (not the other registry).
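A registry backed by a ConcurrentHashMap can be sketched in a few lines. This is a generic illustration with hypothetical names, not the Atlas Registry API. It also hints at the answer to my multiple JVM question: the map lives on the heap of one JVM, so a second JVM would have its own, separate registry.

```java
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical in-memory registry, safe for concurrent use within one JVM.
final class RegistrySketch<K, V> {

    private final ConcurrentMap<K, V> entries = new ConcurrentHashMap<>();

    /** Registers the value unless the key is taken; returns true if it was added. */
    boolean register(K key, V value) {
        return entries.putIfAbsent(key, value) == null;
    }

    Optional<V> get(K key) {
        return Optional.ofNullable(entries.get(key));
    }

    boolean isRegistered(K key) {
        return entries.containsKey(key);
    }

    void remove(K key) {
        entries.remove(key);
    }

    int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        RegistrySketch<String, String> registry = new RegistrySketch<>();
        System.out.println(registry.register("/aaa", "in-memory dataset"));
        System.out.println(registry.register("/aaa", "another dataset"));
        System.out.println(registry.get("/aaa").orElse("missing"));
    }
}
```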

And then, finally, the transaction is committed, the response is prepared (a 200 OK in text/plain). And a null is returned, which means empty response.

And here are some of the logs produced during this experiment.

[2018-05-28 21:11:43] Config     INFO  Load configuration: file:///home/kinow/Development/java/jena/jena/jena-fuseki2/jena-fuseki-core/run/configuration/ds2.ttl
[2018-05-28 21:11:43] Config     INFO  Load configuration: file:///home/kinow/Development/java/jena/jena/jena-fuseki2/jena-fuseki-core/run/configuration/p1.ttl
[2018-05-28 21:11:43] Config     INFO  Register: /ds2
[2018-05-28 21:11:43] Config     INFO  Register: /p1
[2018-05-28 21:11:43] Server     INFO  Started 2018/05/28 21:11:43 NZST on port 3030
[2018-05-28 21:11:52] Admin      INFO  [1] GET http://localhost:3030/$/server
[2018-05-28 21:11:52] Admin      INFO  [1] 200 OK (11 ms)
[2018-05-28 21:12:41] Admin      INFO  [2] GET http://localhost:3030/$/server
[2018-05-28 21:12:41] Admin      INFO  [2] 200 OK (3 ms)
[2018-05-28 21:22:33] Admin      INFO  [3] POST http://localhost:3030/$/datasets
[2018-05-28 21:46:25] Admin      INFO  [3] Create database : name = /aaa
[2018-05-28 22:16:17] Admin      INFO  [3] 200 OK (3,224.390 s)
[2018-05-28 22:16:18] Admin      INFO  [4] GET http://localhost:3030/$/server
[2018-05-28 22:16:18] Admin      INFO  [4] 200 OK (8 ms)

So that’s that.

Happy hacking !

What happens when you upload a Turtle file in Apache Jena Fuseki

kinow @ May 27, 2018 22:42:55

I am working on an issue for Skosmos that involves preparing some Turtle files and uploading them using Apache Jena Fuseki’s web interface.

The issue is now pending feedback, which gives me a moment to have fun with something else. So I decided to dig down the rabbit hole and start learning more about certain parts of the Apache Jena code base.

This post will be useful to myself in the future, as notes in a series, so that I remember how things work - you never know, right? But hopefully this will be useful to others wanting to understand more about the code of Apache Jena.

Where upload happens - SPARQL_Upload and Upload (Fuseki Core)

Knowing a bit of the code base, I went straight to the SPARQL_Upload class, from the Fuseki Core module. Set up a couple of breakpoints, uploaded my file, but nothing. Then tried on its package-neighbour class, Upload.

Actually, it is easier to understand seeing the class hierarchy, and knowing that when I run the application in Eclipse, it is running with Jetty, serving servlets (there is no framework like Wicket, Struts, etc, involved).

Several filters are applied to the HTTP request too, like Cross Origin, Shiro, and the FusekiFilter. The latter looks at the request to see if it includes a dataset. If a dataset is found - it is in our case - then it hands the request over to the right class to handle it.

REST_Quads_RW will take care of the upload action, using the Upload class where my breakpoint stopped.

Upload#incomingData() (Fuseki Core)

Upload#incomingData() starts by checking the Content Type from the request. In my case it is a multipart/form-data. Then it calls its other method #fileUploadWorker().

#fileUploadWorker() creates a ServletFileUpload, from Apache Commons FileUpload. With that, it opens a stream for the file, retrieves its name and other information, such as the content type.

Ah, the content type is interesting too. It defaults to RDFXML, but what’s interesting is the comment.

if ( lang == null )
    // Desperate.
    lang = RDFLanguages.RDFXML ;

Well, in this case we are getting a Lang:Turtle. So it now knows that it has a Turtle file, but it still needs to parse it.
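The fallback logic can be sketched without Jena. The enum and the lookup table below are hypothetical stand-ins for RDFLanguages; only the media type strings and the "default to RDF/XML" rule come from real life.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of picking a syntax from a content type, with a last-resort default.
final class LangGuessSketch {

    enum Lang { TURTLE, NTRIPLES, RDFXML }

    private static final Map<String, Lang> BY_CONTENT_TYPE = new HashMap<>();
    static {
        BY_CONTENT_TYPE.put("text/turtle", Lang.TURTLE);
        BY_CONTENT_TYPE.put("application/n-triples", Lang.NTRIPLES);
        BY_CONTENT_TYPE.put("application/rdf+xml", Lang.RDFXML);
    }

    static Lang langFor(String contentType) {
        Lang lang = null;
        if (contentType != null) {
            // Drop parameters such as "; charset=utf-8" before the lookup.
            String base = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
            lang = BY_CONTENT_TYPE.get(base);
        }
        if (lang == null) {
            // Desperate, as in the original: assume RDF/XML.
            lang = Lang.RDFXML;
        }
        return lang;
    }

    public static void main(String[] args) {
        System.out.println(langFor("text/turtle; charset=utf-8"));
        System.out.println(langFor("application/octet-stream"));
    }
}
```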

ActionLib#parse() (Fuseki Core)

Upload calls ActionLib#parse(), which uses RDFParserBuilder to build a parser. It applies a nice fluent API design when doing that.

RDFParser.create()
    .errorHandler(errorHandler)
    .source(input)
    .lang(lang)
    .base(base)
    .parse(dest);

Side note to self: the RDFParser has a canUse flag. It seems to indicate the parser can be used just once, though it actually looks like it works until the stream is closed…

So RDFParserBuilder will call RDFParser, which in turn will use the classes LangTurtle and LangTurtleBase.

LangTurtle (ARQ)

ARQ is a low level module in Jena, responsible for parsing queries, and also some of the interaction with graphs and datasets.

LangTurtle extends LangTurtleBase. Their task starts by populating the prefixMap, which contains all those prefixes used in queries like rdfs, void, skos, etc.

Then it will keep parsing triples until it finds an EOF. For every triple, after the Predicate-Object-List is found, it calls LangTurtle#emit().

The #emit() method creates a Triple object (Jena Core, graph package). It also uses a StreamRDFCountingBase to keep track of statistics to display back to the user.

StreamRDFCountingBase extends StreamRDFWrapper, and wraps - as per name - other StreamRDF’s, such as ParserOutputDataset.
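The wrapper idea is a plain decorator: count, then pass the triple on. A minimal sketch with hypothetical names (TripleSink and CountingSink are not Jena types, and triples are reduced to three strings):

```java
// Hypothetical, much reduced stand-in for a stream of triples.
interface TripleSink {
    void triple(String subject, String predicate, String object);
}

// Counts every triple and forwards it to the wrapped sink.
final class CountingSink implements TripleSink {

    private final TripleSink other;
    private long count = 0;

    CountingSink(TripleSink other) {
        this.other = other;
    }

    @Override
    public void triple(String subject, String predicate, String object) {
        count++;
        other.triple(subject, predicate, object);
    }

    long count() {
        return count;
    }

    public static void main(String[] args) {
        CountingSink sink = new CountingSink((s, p, o) -> System.out.println(s + " " + p + " " + o + " ."));
        sink.triple("<#a>", "<#knows>", "<#b>");
        sink.triple("<#b>", "<#knows>", "<#c>");
        System.out.println("triples seen: " + sink.count());
    }
}
```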

ParserOutputDataset holds a reference to the DatasetGraph and also to the prefixMap populated earlier in LangTurtle. For each Triple that we have it will call the DatasetGraph#add method, creating a new Quad with the default graph name.
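In other words, a triple becomes a quad by adding a graph name. A simplified sketch of that step follows. All types here are hypothetical stand-ins, and the default graph name is a placeholder value, not the constant Jena uses.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of turning parsed triples into quads in the default graph.
final class DefaultGraphSketch {

    // Placeholder name for the default graph (an assumption of this sketch).
    static final String DEFAULT_GRAPH = "urn:example:default-graph";

    static final class Triple {
        final String subject, predicate, object;
        Triple(String subject, String predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    static final class Quad {
        final String graph;
        final Triple triple;
        Quad(String graph, Triple triple) {
            this.graph = graph;
            this.triple = triple;
        }
    }

    static final class DatasetSketch {
        private final List<Quad> quads = new ArrayList<>();
        void add(Quad quad) { quads.add(quad); }
        List<Quad> quads() { return quads; }
    }

    /** What the parser output does per triple: wrap it in a default-graph quad and add it. */
    static void send(DatasetSketch dataset, Triple triple) {
        dataset.add(new Quad(DEFAULT_GRAPH, triple));
    }

    public static void main(String[] args) {
        DatasetSketch dataset = new DatasetSketch();
        send(dataset, new Triple("<#a>", "<#knows>", "<#b>"));
        System.out.println(dataset.quads().size() + " quad(s) in " + dataset.quads().get(0).graph);
    }
}
```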

Conclusion

Finally, readers and streams are closed. An UploadDetails object is created holding stats collected in StreamRDFCountingBase, which are also used for logging.

Upload#incomingData() will return the UploadDetails. If there are no errors then the transaction will be committed. It involves again classes from ARQ and TDB (for journaling), but that will be for another post.

The final method called in the Upload class will be detailsJson(), which returns the object as JSON. This JSON string is then finally returned to the user.

So that’s it. Probably the next step will be to learn how DatasetGraph works, or maybe more about transactions in Jena.

Happy hacking !

Learning more about SPARQL and Jena internals

kinow @ Apr 28, 2018 18:20:28

O Corvo

Recently a pull request for Apache Jena that I started three years ago got merged. Even though it has been three years since that pull request, there are still many parts of the project code base that I am not familiar with.

And not only the code, but there are also many concepts about SPARQL, other standards used in Jena, and internals about triple stores.

The following list contains some presentations and posts that I am reading right now, while I try to improve my knowledge of SPARQL and Jena internals.

Contributing to Apache Jena

kinow @ Jan 01, 2015 18:49:03

As I mentioned in my previous post, I am using Apache Jena for a customer’s project. I had never used any triple store, nor a SPARQL Endpoint server before. But since I am involved with the Apache Software Foundation, and since the company itself is using several Apache components, it was only natural for Jena to be our first choice.

It has served us very well so far. At the moment we have less than 100 queries per day, but the project is still under development and we expect 1000 queries per day by the first quarter of 2015 and 1000000 near the end of 2015. We also have few entries in TDB, but expect to grow this number to a few million before 2016.

When I work for companies and we use Open Source Software (OSS) in a project, I always prepare assessment reports to include in the deliveries. In this report I justify the choice of Open Source Software (as well as commercial software). Sometimes I am lucky to work for a company that asks me to include hours to work on OSS :-)

I use Trello to triage issues in OSS projects (and for several other things). I have a board with several cards for Open Source. About a month ago I set up one for Jena and listed the issues that I thought I could contribute to.

Jena Trello card

I annotate easy issues with a “lhf” suffix for Low Hanging Fruit issues, and delete issues from the card once I submit a patch or update it (and include it in another card for the TupiLabs reports).

Most of the issues I included in the card for Jena had been created over two years ago, and hadn’t been updated in a while. When you test these issues against the current code, usually you find that some of them have already been fixed. Other issues included documentation problems, and minor features. I didn’t find any blocker issue that would prevent us from using Jena in production.

Jena JIRA activity summary

The picture above shows the past 30 days activity summary in JIRA for Jena. The red line shows issues created, and the green line issues resolved. Andy Seaborne was very active in the past days and fixed several issues that were too old and had already been fixed in the trunk, and kindly merged patches and pull requests.

Some issues like JENA-632 will take a longer time to fix, but I’m getting used to Jena’s source code, and at the same time getting more confident about using it in production - especially with a supportive OSS community. We are using Jena for RDF with Hadoop, and I learned that I can replace some custom Writables by others in the Jena Hadoop submodule.

By the way, even though this project ends in April, I intend to continue contributing to Jena. There are a lot of parts of the code that I would love to be able to understand and contribute to, in particular the Graph database, optimization techniques for SPARQL queries, the grammars used in the project, and Fuseki v2 and its testing harness (as well as the test coverage).

If you are looking for an interesting project to get you started with semantics, linked data, RDF, and even graphs and database querying, try contributing to Jena. I bet you’ll have a lot of fun!

Happy hacking and happy 2015!