Blog Taxonomy

Posts tagged with 'opensource'

Cylc Scheduler Internals - Part 3

kinow @ Aug 18, 2018 18:27:37

This is part 3 in a series of posts about Cylc internals. Part 1 covered the beginning of the workflow, and part 2 documented what happens from the moment the method configure() is called. This post continues right after the configure() method returns, moving on to the next method: run().

NB: this is a post to remember things, not really expecting to give someone enough information to be able to hack the Cylc Scheduler (though you can and would have fun!).

At this point, the Suite Server program must have been initialized, with the objects that it requires, and with everything configured. So this method is the one that starts the whole work on the tasks & proxies.

The runahead points, i.e. what are the next available cycle points, are calculated and scheduled. In the scheduling, tasks are queued for execution.

Most of the interesting action takes place when process_task_pool() is called, and in the submit_task_jobs(). The latter is a method from TaskJobManager, and it is here where - in this case - my shell command is executed through a temporary shell script file.

The graph was created from the initial execution of a suite that was starting from scratch (they can also be reinitialized). If there are multiple tasks waiting, or if a suite was restarted, the diagram would look considerably different.

You can download the source file for the diagram used in this post, and edit it with draw.io.

Use of Logging in Java Image Processing libraries

kinow @ Aug 12, 2018 17:55:44

For IMAGING-154 I was trying to think of a solution for the existing Debug class. This class was a subject of discussion during a previous 1.0 release vote thread.

Initially I tried simply changing the class a bit and making it configurable, so that we could keep it, as there is a valid use case for a class that collects information while the image processing algorithms are applied.

And for me, the next step would naturally be to remove the use of System.out, as the Debug class would now have a PrintStream that could be System.out or something else.
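A minimal sketch of that idea, assuming a hypothetical ConfigurableDebug class (this is not the actual Debug class from Commons Imaging): the PrintStream is handed in by the caller, so System.out becomes just one possible target.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Hypothetical helper for illustration; not the Commons Imaging Debug class.
final class ConfigurableDebug {
    private final PrintStream out;

    // The caller decides where the debug output goes.
    ConfigurableDebug(PrintStream out) {
        this.out = out;
    }

    // Convenience factory that keeps the old behaviour of printing to stdout.
    static ConfigurableDebug toStdout() {
        return new ConfigurableDebug(System.out);
    }

    void debug(String message) {
        out.println(message);
    }

    public static void main(String[] args) {
        // Collect the debug output in memory instead of printing it directly.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        ConfigurableDebug debug = new ConfigurableDebug(new PrintStream(buffer, true));
        debug.debug("reading header");
        System.out.print(buffer.toString());
    }
}
```

With something like this, a test can capture the output in a buffer, while a command line tool can still pass System.out.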

That’s when I realized that the current version of Apache Commons Imaging (née Sanselan) uses System.out when debugging, but also when it wants to enable a “verbose” mode in some classes.

I am using this post to collect a few cases, and check what other image processing libraries are doing with regard to logging.

Library                      Logger
Java ImageIO                 No logging (exceptions only)
im4java                      No logging (exceptions only)
opencv JNI                   No logging (exceptions only)
GeoSolutions ImageIO-Ext     java.util.logging
Catalano                     java.util.logging
Apache PDFBox JBIG2          Custom Logger
ImageJ2                      SciJava Logger
Fiji                         SLF4J + logback
OpenJPEG                     Custom Logger
imgscalar                    System.out
Apache Commons Imaging       System.out
Sanselan                     System.out
Marvin                       System.out
Processing                   System.out
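For comparison, here is a rough sketch of how a “verbose” mode could look with java.util.logging instead of System.out. The class and method names (VerboseWithJul, readHeader) are made up for illustration and are not taken from any of the libraries in the table.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical class, for illustration only.
final class VerboseWithJul {
    private static final Logger LOGGER =
            Logger.getLogger(VerboseWithJul.class.getName());

    // Switches the "verbose" mode on by lowering the logger level to FINE.
    // The default console handler only prints INFO and above, so a
    // dedicated handler with level FINE is attached as well.
    static void enableVerbose() {
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);
        LOGGER.addHandler(handler);
        LOGGER.setUseParentHandlers(false);
        LOGGER.setLevel(Level.FINE);
    }

    static boolean isVerbose() {
        return LOGGER.isLoggable(Level.FINE);
    }

    // Stand-in for a parsing step that used to call System.out.println.
    static int readHeader(byte[] data) {
        LOGGER.log(Level.FINE, "header length: {0}", data.length);
        return data.length;
    }

    public static void main(String[] args) {
        enableVerbose();
        readHeader(new byte[] {1, 2, 3});
    }
}
```

The trade-off is that the verbosity is then controlled by the logging configuration rather than by a boolean flag on each class.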

UUID's in Apache Jena

kinow @ Aug 11, 2018 19:02:16

In this post I won’t talk about what UUID’s are, or how they work in Java. Here’s a great article on that. Or see the always reliable Wikipedia article about it (or, if you would rather, read RFC 4122).

I found out that Jena had UUID implementations after writing a previous post, and then decided to look into which UUID’s Jena has and where they are used. This way I would either understand why Jena needs UUID’s, or just be better informed in case I ever stumbled upon a change in Jena that required related work.

Jena Core’s org.apache.jena.shared.uuid

This package is small and simply contains: factories,

  • UUIDFactory: interface for a factory of UUID’s
  • UUID_V1_Gen: a factory for UUID_V1
  • UUID_V4_Gen: a factory for UUID_V4

and UUID implementations,

  • JenaUUID: abstract base class for UUID implementations
  • UUID_nil: a special UUID, nil, filled with zeroes
  • UUID_V1: UUID V1
  • UUID_V4: UUID V4

and a utility class

  • LibUUID: with methods to create a Random and to create byte[] seeds.

JenaUUID contains a method to return a JenaUUID as a Java UUID. It is used in the command line utility juuid, for transaction ID’s, and when a new dataset is created. For the new dataset, Fuseki will create files in a temporary location. The name of the temporary location is created using an instance of JenaUUID.

UUID_V1 and UUID_V4

Jena’s UUID_V1 is an implementation of Version 1 (time based), variant 2 (DCE), which means it uses the MAC address and a timestamp to generate the universally unique ID’s.

It uses NetworkInterface.getNetworkInterfaces() to retrieve the MAC address of the node running Jena. When using localhost, the MAC address is not available, so it resorts to using a random number.
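The lookup with a random fallback can be sketched roughly as follows. This is not Jena’s code: the class NodeId and the method nodeBytes() are hypothetical names, and only standard JDK calls are used. Setting the multicast bit on a random node ID follows the recommendation in RFC 4122.

```java
import java.net.NetworkInterface;
import java.net.SocketException;
import java.security.SecureRandom;
import java.util.Collections;
import java.util.Enumeration;

// Hypothetical helper; a sketch of "MAC address, or random bytes if unavailable".
final class NodeId {

    // Returns 6 bytes: the first non-loopback MAC address found,
    // or random bytes when no hardware address is available.
    static byte[] nodeBytes() {
        try {
            Enumeration<NetworkInterface> interfaces = NetworkInterface.getNetworkInterfaces();
            if (interfaces != null) {
                for (NetworkInterface nic : Collections.list(interfaces)) {
                    if (nic.isLoopback()) {
                        continue; // the loopback interface has no MAC address
                    }
                    byte[] mac = nic.getHardwareAddress();
                    if (mac != null && mac.length == 6) {
                        return mac;
                    }
                }
            }
        } catch (SocketException e) {
            // Fall through to the random node ID below.
        }
        byte[] random = new byte[6];
        new SecureRandom().nextBytes(random);
        random[0] |= 0x01; // RFC 4122: set the multicast bit on random node IDs
        return random;
    }

    public static void main(String[] args) {
        StringBuilder hex = new StringBuilder();
        for (byte b : nodeBytes()) {
            hex.append(String.format("%02x", b));
        }
        System.out.println(hex);
    }
}
```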


Jena’s UUID_V4 is an implementation of Version 4 (random), variant 2 (DCE), which means it uses random numbers to generate the universally unique ID’s.

The factory for V4 will have a random for the most significant bits, and for the least significant bits of the UUID (also including version and variant). The random for the factory is created by LibUUID#makeRandom(). This method returns a SecureRandom with two seeds, one being random, and the other created with LibUUID#makeSeed().
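To make the “also including version and variant” part concrete, here is a small sketch that fills both halves with random bits and then overwrites the version and variant fields, as laid out in RFC 4122. The class RandomUuidSketch is a hypothetical name and the result is wrapped in a java.util.UUID; Jena’s own factory may differ in the details.

```java
import java.security.SecureRandom;
import java.util.UUID;

// Hypothetical sketch of a version 4, variant 2 UUID built from random bits.
final class RandomUuidSketch {

    static UUID generate(SecureRandom random) {
        long msb = random.nextLong();
        long lsb = random.nextLong();
        // Version: bits 12-15 of the most significant half are set to 0100 (4).
        msb = (msb & 0xFFFFFFFFFFFF0FFFL) | 0x0000000000004000L;
        // Variant: the two top bits of the least significant half are set to 10.
        lsb = (lsb & 0x3FFFFFFFFFFFFFFFL) | 0x8000000000000000L;
        return new UUID(msb, lsb);
    }

    public static void main(String[] args) {
        UUID uuid = generate(new SecureRandom());
        System.out.println(uuid + " version=" + uuid.version() + " variant=" + uuid.variant());
    }
}
```

That leaves 122 random bits out of 128, with the remaining 6 bits fixed by the version and the variant.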

UUID_V4 uses a SecureRandom created locally but with the seed also set by LibUUID#makeSeed(). The seed returned by this method may use the MAC address, but will also use the os.version, user.name, java.version, number of active threads, total memory, free memory, and the hash code of a newly created Object.
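A seed along those lines could be assembled like the sketch below. It only illustrates the ingredients listed above; the class SeedSketch, the order of the fields and the byte encoding are assumptions, and the MAC address part is left out. It is not a copy of LibUUID#makeSeed().

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;

// Hypothetical sketch: mixes JVM and operating system details into a seed.
final class SeedSketch {

    static byte[] makeSeed() {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.write(property("os.version"));
            out.write(property("user.name"));
            out.write(property("java.version"));
            out.writeInt(Thread.activeCount());
            out.writeLong(Runtime.getRuntime().totalMemory());
            out.writeLong(Runtime.getRuntime().freeMemory());
            out.writeInt(new Object().hashCode());
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    private static byte[] property(String name) {
        return System.getProperty(name, "").getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        SecureRandom random = new SecureRandom();
        // setSeed supplements the existing seed, it does not replace it.
        random.setSeed(makeSeed());
        System.out.println(random.nextLong());
    }
}
```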

Transaction ID UUID (TxnIdUuid) — uses JenaUUID

Jena contains two implementations of TxnId (transaction identifiers),

  • TxnIdSimple: transaction IDs are created with a counter within each JVM
  • TxnIdUuid: transaction IDs are created using JenaUUID.
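The difference between the two strategies can be sketched as below. The names (TxnIdSketch, CounterId, UuidId) are hypothetical, and the UUID variant uses java.util.UUID for brevity, which produces version 4 UUID’s, whereas TxnIdUuid goes through JenaUUID.

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of two ways to create transaction identifiers.
final class TxnIdSketch {

    interface Id {
        String label();
    }

    // Unique only within one JVM: a shared counter is incremented per ID.
    static final class CounterId implements Id {
        private static final AtomicLong COUNTER = new AtomicLong(0);
        private final long value = COUNTER.incrementAndGet();

        @Override
        public String label() {
            return "txn:" + value;
        }
    }

    // Intended to be unique across JVM's and machines.
    static final class UuidId implements Id {
        private final UUID value = UUID.randomUUID();

        @Override
        public String label() {
            return "txn:" + value;
        }
    }

    public static void main(String[] args) {
        System.out.println(new CounterId().label());
        System.out.println(new UuidId().label());
    }
}
```

A counter is cheap and readable in logs, but two JVM’s will happily produce the same values, which is where a UUID based identifier helps.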

The first thing that caught my eye in this class was the inconsistency in the name, which is quite normal in large projects such as Jena.

As TxnIdUuid calls JenaUUID#generate(), it will use the default factory, UUID_V1_Gen. Then it will call asUUID to return a Java UUID object but with the same UUID.

Create a new dataset in Fuseki (ActionDatasets) — uses JenaUUID

When you create a new dataset in Fuseki, as explained in the previous post, Fuseki will create some temporary files and folders. For at least one folder, it will use an instance of JenaUUID, in ActionDatasets#execPostContainer().

Blank node IDs (BlankNodeId) — uses Java’s UUID

Blank nodes in Jena need an identifier too. It is possible to configure Jena to return a JVM bound counter (similar to how TxnIdSimple works); otherwise blank node identifiers are generated with java.util.UUID.randomUUID().

I wonder why the transaction ID’s use Jena’s JenaUUIDs, but the blank node IDs use Java’s UUID? They are compatible anyway.

Other methods related to blank nodes also use Java’s UUID,

  • BlankNodeAllocatorFixedSeedHash
  • BlankNodeAllocatorHash#freshSeed().

SPARQL functions, and NodeFunctions — uses Java’s UUID

SPARQL 1.1 contains functions UUID and STRUUID. Apache Jena provides these two functions, and users can use them in queries such as

SELECT (UUID() AS ?uuid) (StrUUID() AS ?strUuid) WHERE { }

(before that, users had to call extra functions in a different namespace).

The function implementations use NodeFunctions methods struuid and uuid. Both methods in NodeFunctions use Java’s UUID, and not JenaUUID.

Files and directories for databases / datasets — uses Java’s UUID

Files and directories created in Jena use Java’s UUID,

  • BufferAllocatorMapped#getNewTemporaryFile()
  • TDBBuilder#create methods and ComponentIdMgr constructor
  • AbstractDataBag#getNewTemporaryFile().

Conclusion

In Jena there are places where instances of JenaUUID are used to produce a UUID, and other places where Java’s UUID is used.

Java’s UUID provides a variant 2 version 4 (random DCE), which is equivalent to UUID_V4. But there is no equivalent of UUID_V1, the default used in Jena.
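This is easy to check with plain java.util.UUID, as in the sketch below (JavaUuidCheck is a hypothetical class name). randomUUID() reports version 4 and variant 2, a nil UUID can be built from two zero longs, and a version 1 UUID can be parsed and inspected even though the JDK offers no generator for it. The version 1 value used here is the DNS namespace UUID from RFC 4122.

```java
import java.util.UUID;

// Hypothetical sketch that inspects version and variant with java.util.UUID.
final class JavaUuidCheck {

    // The nil UUID: all 128 bits set to zero.
    static final UUID NIL = new UUID(0L, 0L);

    // A version 1 UUID taken from RFC 4122 (the DNS namespace UUID).
    static final UUID RFC_V1 = UUID.fromString("6ba7b810-9dad-11d1-80b4-00c04fd430c8");

    public static void main(String[] args) {
        UUID random = UUID.randomUUID();
        System.out.println("random: version=" + random.version() + " variant=" + random.variant());
        System.out.println("parsed: version=" + RFC_V1.version() + " variant=" + RFC_V1.variant());
        System.out.println("nil:    " + NIL);
    }
}
```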

And even though UUID_V4 and UUID are compatible, I believe Jena’s version is using a seed with so many JVM and operating system related settings (os.version, free memory, etc) in order to have a unique seed per node running Jena, independent of whether there are multiple JVM’s in the same node.

But to be honest, I am still not sure which one I would have to use, nor if there are cases where I should pick one over the other…

EDIT: Apache Jena’s lead dev replied with a bit of history about the project too (:

Creating a Docker container to run as a command

kinow @ Aug 04, 2018 22:00:36

For the past two weeks at work I have been assigned to work on PHP projects. Though I used PHP some time ago, especially with CodeIgniter and Laravel, I have not used it in a few years, and have been doing mostly Java nowadays.

The complete project setup was done by co-workers. I had a PHP project, using Symfony, several bundles and libraries, and Postgres. But it required just running a few commands to set up AWS settings, and then fire up Docker Compose.

Besides Docker Compose starting a web container, and another Postgres container, we used the web container to run Composer. In the past, I would see the PHP project mapped as a folder in the running container, and then composer would be executed in the host.

I noticed then that I could do the same to an image I created in 2016 to run Cylc. The previous version would be used to start a container in a terminal, then run some commands while maintaining the state within the container.

The new version now maps a volume to the version of cylc, installs the dependencies, and then allows the state to be maintained solely in the mapped volume.

Furthermore, the entrypoint of the image is the cylc command, which means that you do not have to open a terminal in the container in order to run commands (or change the command/entrypoint).

...
# we have the state stored only in /opt/cylc
VOLUME ["/opt/cylc", "/tmp", "/run", "/var/run"]
ENV PATH=/opt/cylc/bin:$PATH
WORKDIR /opt/cylc
# here's the trick!
ENTRYPOINT ["cylc"]

It uses the same approach as the PHP image I am using for Composer, with a few additional improvements from the Best practices for writing Dockerfiles documentation.

$ docker run -ti -v "$(pwd -P)"/cylc:/opt/cylc cylc validate /opt/cylc/etc/examples/tutorial/oneoff/basic/
Valid for cylc-7.7.2
$ docker run -ti -v "$(pwd -P)"/cylc:/opt/cylc cylc register /opt/cylc/etc/examples/tutorial/oneoff/basic/
REGISTER /opt/cylc/etc/examples/tutorial/oneoff/basic/
$ docker run -ti -v "$(pwd -P)"/cylc:/opt/cylc cylc start --non-daemon --debug /opt/cylc/etc/examples/tutorial/oneoff/basic/
            ._.                                                       
            | |                 The Cylc Suite Engine [7.7.2]         
._____._. ._| |_____.           Copyright (C) 2008-2018 NIWA          
| .___| | | | | .___|  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
| !___| !_! | | !___.  This program comes with ABSOLUTELY NO WARRANTY;
!_____!___. |_!_____!  see `cylc warranty`.  It is free software, you 
      .___! |           are welcome to redistribute it under certain  
      !_____!                conditions; see `cylc conditions`.       
2018-07-30T12:10:32Z INFO - Suite starting: server=3714d1a17999:43040 pid=1
2018-07-30T12:10:32Z INFO - Cylc version: 7.7.2
2018-07-30T12:10:32Z INFO - Run mode: live
2018-07-30T12:10:32Z INFO - Initial point: 1
2018-07-30T12:10:32Z INFO - Final point: 1
2018-07-30T12:10:32Z INFO - Cold Start 1
2018-07-30T12:10:32Z DEBUG - [hello.1] -released to the task pool
2018-07-30T12:10:32Z DEBUG - BEGIN TASK PROCESSING
2018-07-30T12:10:32Z DEBUG - [hello.1] -waiting => queued
2018-07-30T12:10:32Z DEBUG - 1 task(s) de-queued
2018-07-30T12:10:32Z INFO - [hello.1] -submit-num=1, owner@host=localhost
2018-07-30T12:10:32Z DEBUG - [hello.1] -queued => ready
2018-07-30T12:10:32Z DEBUG - END TASK PROCESSING (took 0.00642895698547 seconds)
2018-07-30T12:10:32Z DEBUG - ['cylc', 'jobs-submit', '--debug', '--', '/opt/cylc/etc/examples/tutorial/oneoff/basic/log/job', '1/hello/01']
2018-07-30T12:10:33Z DEBUG - [client-connect] root@3714d1a17999:cylc-message privilege='full-control' 7b05596a-5971-4733-82de-28528f702ff0
2018-07-30T12:10:33Z INFO - [client-command] put_messages root@3714d1a17999:cylc-message 7b05596a-5971-4733-82de-28528f702ff0
2018-07-30T12:10:33Z DEBUG - [jobs-submit cmd] cylc jobs-submit --debug -- /opt/cylc/etc/examples/tutorial/oneoff/basic/log/job 1/hello/01
    [jobs-submit ret_code] 0
    [jobs-submit out]
    [TASK JOB SUMMARY]2018-07-30T12:10:32Z|1/hello/01|0|61
    [TASK JOB COMMAND]2018-07-30T12:10:32Z|1/hello/01|[STDOUT] 61
2018-07-30T12:10:33Z INFO - [hello.1] -(current:ready) submitted at 2018-07-30T12:10:32Z
2018-07-30T12:10:33Z INFO - [hello.1] -job[01] submitted to localhost:background[61]
2018-07-30T12:10:33Z DEBUG - [hello.1] -ready => submitted
2018-07-30T12:10:33Z INFO - [hello.1] -health check settings: submission timeout=None
2018-07-30T12:10:33Z DEBUG - BEGIN TASK PROCESSING
2018-07-30T12:10:33Z DEBUG - 0 task(s) de-queued
2018-07-30T12:10:33Z DEBUG - [hello.1] -forced spawning
2018-07-30T12:10:33Z DEBUG - END TASK PROCESSING (took 0.000992059707642 seconds)
2018-07-30T12:10:33Z INFO - [hello.1] -(current:submitted)> started at 2018-07-30T12:10:33Z
2018-07-30T12:10:33Z DEBUG - [hello.1] -submitted => running
2018-07-30T12:10:33Z INFO - [hello.1] -health check settings: execution timeout=None
2018-07-30T12:10:34Z DEBUG - BEGIN TASK PROCESSING
2018-07-30T12:10:34Z DEBUG - 0 task(s) de-queued
2018-07-30T12:10:34Z DEBUG - END TASK PROCESSING (took 0.000984191894531 seconds)
2018-07-30T12:10:43Z DEBUG - [client-connect] root@3714d1a17999:cylc-message privilege='full-control' 524d658e-9932-4135-9523-3c2ace3990cf
2018-07-30T12:10:43Z INFO - [client-command] put_messages root@3714d1a17999:cylc-message 524d658e-9932-4135-9523-3c2ace3990cf
2018-07-30T12:10:43Z INFO - [hello.1] -(current:running)> succeeded at 2018-07-30T12:10:43Z
2018-07-30T12:10:43Z DEBUG - [hello.1] -running => succeeded
2018-07-30T12:10:43Z INFO - Suite shutting down - AUTOMATIC
2018-07-30T12:10:44Z INFO - DONE
$

If you have a utility on your computer that requires a few dependencies, but that can store its state/data in a mapped volume, the same kind of container may be useful.

Cylc Scheduler Internals - Part 2

kinow @ Jul 27, 2018 00:25:37

This is part 2 in a series of posts about Cylc internals. Part 1 covered the beginning of the workflow. Here we continue from the moment the method configure() is called.

NB: this is a post to remember things, not really expecting to give someone enough information to be able to hack the Cylc Scheduler (though you can and would have fun!).

The configure method is responsible for configuring the Suite Server Program, which means it interacts with the configuration singletons to retrieve the necessary configuration for the program.

It also interacts with other objects that store state, and also creates basic data structures and objects required for a suite program (e.g. Queues).

This is also where the contact file is created, as well as the HTTP server.

You can download the source file for the diagram used in this post, and edit it with draw.io.