Blog Taxonomy

Posts tagged with 'apache software foundation'

What happens when you create a new dataset in Apache Jena Fuseki

kinow @ May 29, 2018 18:59:04

The last post was about what happens when you upload a Turtle file to Apache Jena Fuseki. Today’s post is about what happens when you create a new dataset in Apache Jena Fuseki.

In theory, that happens before you upload a Turtle file, but this post series won’t follow a logical order. It will be more based on what I find interesting.

Oh, the dataset created is an in-memory dataset. Here’s a simplified sequence diagram. Again, these articles are more brain-dumps, used by myself for later reference.

ActionDatasets#execPostContainer() (Fuseki Core)

ActionDatasets, as per the name, handles HTTP requests related to datasets, such as the one sent when you create a new dataset. It is also an HTTP action. The concept in Jena, as far as I could tell, is similar to Jenkins’ actions, but simpler, without the UI/Jelly/Groovy part.

This class also has a few static fields, which hold system information. One variable is actually called system (I wonder how well it works if you try to deploy Jena Fuseki with multiple JVMs).

Its first task is to create a UUID, using JenaUUID (from Jena Core). This class looks very interesting, I wonder how it works.

Then it creates a DatasetDescriptionRegistry, which is a registry to keep track of the datasets created. There is also some validation of parameters and state check, and then the transaction is started (system.begin()).

Model / (Core)

I used Model and ModelFactory before when working with ontologies and Protégé. ActionDatasets will create a Model.

An RDF Model. An RDF model is a set of Statements. Methods are provided for creating resources, properties and literals and the Statements which link them, for adding statements to and removing them from a model, for querying a model and set operations for combining models.

It also gets a StreamRDF from the model (i.e. model.getGraph()), which will be used later by the RDFParser.

ActionDatasets#assemblerFromForm() (Fuseki)

In #assemblerFromForm(), it will create a template, and then use RDFParser to parse the template and load it into the StreamRDF. The template looks like this:

# Licensed under the terms of http://www.apache.org/licenses/LICENSE-2.0

@prefix :        <#> .
@prefix fuseki:  <http://jena.apache.org/fuseki#> .
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

@prefix rdfs:    <http://www.w3.org/2000/01/rdf-schema#> .
@prefix tdb:     <http://jena.hpl.hp.com/2008/tdb#> .
@prefix ja:      <http://jena.hpl.hp.com/2005/11/Assembler#> .

## ---------------------------------------------------------------
## Updatable in-memory dataset.

<#service1> rdf:type fuseki:Service ;
    # URI of the dataset -- http://host:port/aaa
    fuseki:name                        "aaa" ;
    fuseki:serviceQuery                "sparql" ;
    fuseki:serviceQuery                "query" ;
    fuseki:serviceUpdate               "update" ;
    fuseki:serviceUpload               "upload" ;
    fuseki:serviceReadWriteGraphStore  "data" ;     
    fuseki:serviceReadGraphStore       "get" ;
    fuseki:dataset                     <#dataset> ;
    .

# Transactional, in-memory dataset. Initially empty.
<#dataset> rdf:type ja:DatasetTxnMem .

Persisting the dataset

Jena Fuseki will create a local copy of the template, using RIOT’s RDFDataMgr. For my environment, running from Eclipse, the file location was /home/kinow/Development/java/jena/jena/jena-fuseki2/jena-fuseki-core/run/system_files/902154aa-2bb6-11b2-8053-024232e7b374.

Fuseki now will look for exactly one Service Name statement (http://jena.apache.org/fuseki#name). The name will be validated for things like blank space, empty, ‘/’, etc.
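
To make the name validation step concrete, here is a minimal sketch in plain Java. It is not the Fuseki code: the class DatasetNameCheck and its exact rules are assumptions, covering only the checks mentioned above (empty name, blank space, '/').

```java
// Hypothetical helper, not part of Fuseki: it only mirrors the checks named in this post.
final class DatasetNameCheck {

    // Returns true when the name is non-empty and has no whitespace or '/' in it.
    static boolean isValid(String name) {
        if (name == null || name.isEmpty()) {
            return false; // empty name
        }
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (Character.isWhitespace(c) || c == '/') {
                return false; // blank space or path separator inside the name
            }
        }
        return true;
    }

    public static void main(String[] args) {
        for (String candidate : new String[] {"aaa", "", "a b", "a/b"}) {
            System.out.println("'" + candidate + "' valid? " + isValid(candidate));
        }
    }
}
```

The real validation may accept or reject more than this, for example it may treat a leading '/' differently.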

Once the validation passes, it will persist the file somewhere like /home/kinow/Development/java/jena/jena/jena-fuseki2/jena-fuseki-core/run/configuration/aaa.ttl. But not without first checking whether the file already exists.

As this is a brand new file, the Model instance will now be written to the file.

DataAccessPoint (ARQ)

Funny, when I wrote it I immediately put this class under Fuseki Core, but it is actually in ARQ. Why not in Fuseki? Looks like ARQ has a Web layer too.

Fuseki’s FusekiBuilder#buildDataAccessPoint() creates the DataAccessPoint. The DataAccessPoint’s Javadocs say: “A name in the URL space of the server”.

  • DataService contains operations, and endpoints
  • Services are added to endpoints, such as an endpoint for Quads_RW (the REST_Quads_RW from the previous post)
  • The name and the data service are used to create the DataAccessPoint

The current HttpAction in the request contains a reference to Fuseki’s DataAccessPointRegistry. Fuseki’s DataAccessPointRegistry extends Atlas’ Registry, which uses a ConcurrentHashMap (again, how does it work with multiple JVMs?).

The new DataAccessPoint is registered with Fuseki’s registry (not the other registry).
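
A registry of this shape can be sketched with plain Java. SimpleRegistry below is a hypothetical stand-in, not the Atlas or Fuseki class, and its method names are made up for illustration. It also hints at the multiple-JVM question: the map lives on one JVM's heap, so another process would start with its own empty copy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for a name -> value registry; not the Atlas or Fuseki class.
final class SimpleRegistry<K, V> {

    // Thread-safe within one JVM, but not shared with any other JVM.
    private final Map<K, V> entries = new ConcurrentHashMap<>();

    // Registers the value unless the key is already taken; returns true when it was added.
    boolean register(K key, V value) {
        return entries.putIfAbsent(key, value) == null;
    }

    boolean isRegistered(K key) {
        return entries.containsKey(key);
    }

    V get(K key) {
        return entries.get(key);
    }

    public static void main(String[] args) {
        SimpleRegistry<String, String> registry = new SimpleRegistry<>();
        System.out.println(registry.register("/aaa", "in-memory dataset")); // first registration
        System.out.println(registry.register("/aaa", "another dataset"));   // name already taken
        System.out.println(registry.get("/aaa"));
    }
}
```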

And then, finally, the transaction is committed and the response is prepared (a 200 OK in text/plain). A null is returned, which means an empty response.

And here are some of the logs produced during this experiment.

[2018-05-28 21:11:43] Config     INFO  Load configuration: file:///home/kinow/Development/java/jena/jena/jena-fuseki2/jena-fuseki-core/run/configuration/ds2.ttl
[2018-05-28 21:11:43] Config     INFO  Load configuration: file:///home/kinow/Development/java/jena/jena/jena-fuseki2/jena-fuseki-core/run/configuration/p1.ttl
[2018-05-28 21:11:43] Config     INFO  Register: /ds2
[2018-05-28 21:11:43] Config     INFO  Register: /p1
[2018-05-28 21:11:43] Server     INFO  Started 2018/05/28 21:11:43 NZST on port 3030
[2018-05-28 21:11:52] Admin      INFO  [1] GET http://localhost:3030/$/server
[2018-05-28 21:11:52] Admin      INFO  [1] 200 OK (11 ms)
[2018-05-28 21:12:41] Admin      INFO  [2] GET http://localhost:3030/$/server
[2018-05-28 21:12:41] Admin      INFO  [2] 200 OK (3 ms)
[2018-05-28 21:22:33] Admin      INFO  [3] POST http://localhost:3030/$/datasets
[2018-05-28 21:46:25] Admin      INFO  [3] Create database : name = /aaa
[2018-05-28 22:16:17] Admin      INFO  [3] 200 OK (3,224.390 s)
[2018-05-28 22:16:18] Admin      INFO  [4] GET http://localhost:3030/$/server
[2018-05-28 22:16:18] Admin      INFO  [4] 200 OK (8 ms)

So that’s that.

Happy hacking !

What happens when you upload a Turtle file in Apache Jena Fuseki

kinow @ May 27, 2018 22:42:55

I am working on an issue for Skosmos that involves preparing some Turtle files and uploading them using Apache Jena Fuseki’s web interface.

The issue is now pending feedback, which gives me a moment to have fun with something else. So I decided to dig down the rabbit hole and start learning more about certain parts of the Apache Jena code base.

This post is a note to my future self, part of a series, so that I remember how things work (you never know, right?). But hopefully it will also be useful to others wanting to understand more about the code of Apache Jena.

Where upload happens - SPARQL_Upload and Upload (Fuseki Core)

Knowing a bit of the code base, I went straight to the SPARQL_Upload class, from the Fuseki Core module. Set up a couple of breakpoints, uploaded my file, but nothing. Then tried on its package-neighbour class, Upload.

Actually, it is easier to understand seeing the class hierarchy, and knowing that when I run the application in Eclipse, it is running with Jetty, serving servlets (there is no framework like Wicket, Struts, etc, involved).

Several filters are applied to the HTTP request too, like Cross Origin, Shiro, and the FusekiFilter. The latter looks at the request to see if it includes a dataset. If a dataset is found (it is in our case), then it hands the request over to the right class to handle it.

REST_Quads_RW will take care of the upload action, using the Upload class where my breakpoint stopped.

Upload#incomingData() (Fuseki Core)

Upload#incomingData() starts by checking the Content Type from the request. In my case it is a multipart/form-data. Then it calls its other method #fileUploadWorker().

#fileUploadWorker() creates a ServletFileUpload, from Apache Commons FileUpload. With that, it opens a stream for the file, retrieves its name and other information, such as the content type.

Ah, the content type is interesting too. It defaults to RDFXML, but what’s interesting is the comment.

if ( lang == null )
    // Desperate.
    lang = RDFLanguages.RDFXML ;

Well, in this case we are getting a Lang:Turtle. So it now knows that it has a Turtle file, but it still needs to parse it.

ActionLib#parse() (Fuseki Core)

Upload calls ActionLib#parse(), which uses RDFParserBuilder to build a parser. It applies a nice fluent API design when doing that.

RDFParser.create()
    .errorHandler(errorHandler)
    .source(input)
    .lang(lang)
    .base(base)
    .parse(dest);

Side note to self: the RDFParser has a canUse flag. It seems to indicate the parser can be used just once. Though it looks like it actually works until the stream is closed…

So RDFParserBuilder will call RDFParser, which in turn will use the classes LangTurtle and LangTurtleBase.

LangTurtle (ARQ)

ARQ is a low level module in Jena, responsible for parsing queries, and also some of the interaction with graphs and datasets.

LangTurtle extends LangTurtleBase. Their task starts by populating the prefixMap, which contains all those prefixes used in queries like rdfs, void, skos, etc.

Then it will keep parsing triples until it finds an EOF. For every triple, after the Predicate-Object-List is found, it calls LangTurtle#emit().

The #emit() method creates a Triple object (Jena Core, graph package). There is also a StreamRDFCountingBase to keep track of statistics to display back to the user.

StreamRDFCountingBase extends StreamRDFWrapper, and wraps (as per the name) other StreamRDF instances, such as ParserOutputDataset.

ParserOutputDataset holds a reference to the DatasetGraph and also to the prefixMap populated earlier in LangTurtle. For each Triple that we have it will call the DatasetGraph#add method, creating a new Quad with the default graph name.
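
The wrapping described above is a decorator: one sink counts and forwards, the other stores. Here is a minimal sketch in plain Java, with no Jena on the classpath. All the types below (TripleSink, DatasetSink, CountingSink) and the default graph name are hypothetical, simplified stand-ins for StreamRDF, ParserOutputDataset and StreamRDFCountingBase.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-ins for the Jena types named in this post.
final class CountingSinkSketch {

    // Plays the role of StreamRDF: something that receives triples.
    interface TripleSink {
        void triple(String subject, String predicate, String object);
    }

    // Plays the role of ParserOutputDataset: stores each triple as a quad in the default graph.
    static final class DatasetSink implements TripleSink {
        static final String DEFAULT_GRAPH = "urn:x-example:default-graph"; // made-up name
        final List<String[]> quads = new ArrayList<>();

        @Override
        public void triple(String subject, String predicate, String object) {
            quads.add(new String[] {DEFAULT_GRAPH, subject, predicate, object});
        }
    }

    // Plays the role of StreamRDFCountingBase: counts, then forwards to the wrapped sink.
    static final class CountingSink implements TripleSink {
        private final TripleSink other;
        private long count;

        CountingSink(TripleSink other) {
            this.other = other;
        }

        @Override
        public void triple(String subject, String predicate, String object) {
            count++;
            other.triple(subject, predicate, object);
        }

        long count() {
            return count;
        }
    }

    public static void main(String[] args) {
        DatasetSink dataset = new DatasetSink();
        CountingSink counting = new CountingSink(dataset);
        counting.triple("ex:a", "rdfs:label", "\"A\"");
        counting.triple("ex:b", "rdfs:label", "\"B\"");
        System.out.println(counting.count() + " triples, " + dataset.quads.size() + " quads stored");
    }
}
```

The parser only ever sees the outer sink, which is why the statistics can be collected without the dataset code knowing about them.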

Conclusion

Finally, readers and streams are closed. An UploadDetails object is created holding stats collected in StreamRDFCountingBase, which are also used for logging.

Upload#incomingData() will return the UploadDetails. If there are no errors then the transaction will be committed. It involves again classes from ARQ and TDB (for journaling), but that will be for another post.

The final method called in the Upload class will be detailsJson(), which returns the object as JSON. This JSON string is then finally returned to the user.

So that’s it. Probably the next step will be to learn how DatasetGraph works, or maybe more about transactions in Jena.

Happy hacking !

Learning more about SPARQL and Jena internals

kinow @ Apr 28, 2018 18:20:28

O Corvo

Recently a pull request for Apache Jena that I started three years ago got merged. Even though it has been three years since that pull request, there are still many parts of the project code base that I am not familiar with.

And not only the code, but there are also many concepts about SPARQL, other standards used in Jena, and internals about triple stores.

The following list contains some presentations and posts that I am reading right now, while I try to improve my knowledge of SPARQL and Jena internals.

Exif Odd Offsets

kinow @ Dec 25, 2017 21:43:33

A file format like JPEG may contain metadata in JFIF, Exif, or a vendor proprietary format. The Exif format is based on, or uses parts of, the TIFF format.

Within an Exif metadata block, you should see directories, with several entries. The entries have fields like description, value, and also an offset. The offset indicates the offset to the next entry.

The Exif specification requires implementers to keep the offset an even number, within 4 bytes.

I recently worked on IMAGING-205, a ticket about odd offsets in files with Exif metadata. The issue was that when a file was rewritten with Apache Commons Imaging, the entries were rearranged, so an image that initially had no odd offsets could end up with them.

The fix was simply checking for odd offsets, adding +1, and later it would be put within the 4 bytes limit.
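
The check itself is tiny. Here is a minimal sketch of the idea in plain Java. It is not the actual Commons Imaging patch; EvenOffsetSketch and toEvenOffset are hypothetical names used only for illustration.

```java
// Hypothetical helper, not the Commons Imaging code: pads an odd offset to the next even one.
final class EvenOffsetSketch {

    // Returns the offset unchanged when it is even, otherwise the next even value.
    static long toEvenOffset(long offset) {
        return (offset % 2 != 0) ? offset + 1 : offset;
    }

    public static void main(String[] args) {
        System.out.println(toEvenOffset(134)); // already even
        System.out.println(toEvenOffset(135)); // odd, so padded by one byte
    }
}
```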

A screen shot of Eclipse with source code
Locating the bug

One interesting point, however, is that although this is in the standard, not all software that reads and writes Exif follows the specification. So it is quite common to find images with odd offsets.

Which means you could take a picture with your phone that contains some Exif metadata, and be surprised to get warnings about odd offsets when you analyze it with exiftool. Most viewers handle odd and even offsets, so it should work for most cases, unless you have a strict reader/viewer.

Happy hacking!

♥ Open Source

Watch out for Locales when using NumberFormat with currencies

kinow @ Dec 02, 2017 22:51:00

In Java you have NumberFormat to help you format and parse numbers for any locale. That said, here’s some code.

BigDecimal negative = new BigDecimal("-1234.56");

DecimalFormat nf = (DecimalFormat) NumberFormat.getCurrencyInstance(Locale.UK);
String formattedNegative = nf.format(negative);

System.out.println(formattedNegative);

The output for this code is -£1,234.56. That’s expected, as the locale is set to UK, so the currency symbol used is for British Pounds. And as the number is negative, you get that minus sign as a prefix. For Japanese locale you’d get -¥1,235, and for Brazilian locale you’d get -R$ 1.234,56.

So far so good.

What about the following code, with nothing different except for the locale set to US?

BigDecimal negative = new BigDecimal("-1234.56");

DecimalFormat nf = (DecimalFormat) NumberFormat.getCurrencyInstance(Locale.US); // <--- US now
String formattedNegative = nf.format(negative);

System.out.println(formattedNegative);

Some could intuitively expect -$1,234.56. However, the output is actually ($1,234.56).

A DecimalFormat has separate prefixes and suffixes for positive and negative numbers. In some locales a prefix can be empty or, as in the case of the US locale, quite different from what you might expect.
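
One way to see what a locale will do before relying on it is to ask the DecimalFormat for its negative prefix and suffix. The sketch below uses only standard java.text calls. The class name CurrencyAffixes and the describe helper are made up for this example, and the exact strings printed depend on the JDK and its locale data, so no output is shown.

```java
import java.math.BigDecimal;
import java.text.DecimalFormat;
import java.text.NumberFormat;
import java.util.Locale;

// Hypothetical helper for inspecting how a currency format decorates negative numbers.
final class CurrencyAffixes {

    // Builds a one-line description of the negative prefix, suffix and a formatted sample.
    static String describe(Locale locale) {
        NumberFormat nf = NumberFormat.getCurrencyInstance(locale);
        if (!(nf instanceof DecimalFormat)) {
            return locale + ": not a DecimalFormat";
        }
        DecimalFormat df = (DecimalFormat) nf;
        return locale + ": negative prefix='" + df.getNegativePrefix()
                + "', negative suffix='" + df.getNegativeSuffix()
                + "', sample=" + df.format(new BigDecimal("-1234.56"));
    }

    public static void main(String[] args) {
        // Output varies with the JDK version and locale data in use.
        System.out.println(describe(Locale.UK));
        System.out.println(describe(Locale.US));
    }
}
```

If the code that consumes the formatted string assumes a leading minus sign, a check like this in a test makes the assumption explicit.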

Learned about this peculiarity from NumberFormat while working on VALIDATOR-433 for Apache Commons Validator.

Happy hacking!