Running Cylc tasks on PBS Torque with Docker

A few days ago I saw a post on the Cylc Google Group about the permissions of files generated by Cylc in an environment with PBS.

For context, Cylc is an Open Source meta-scheduler, written in Python, that allows you to define cycle points with dependencies. These cycle points can be simple incremental integer numbers, or ISO 8601 periods or points (e.g. run every 5 minutes, from 10 days ago until next year). Cylc takes care of creating an execution schedule for you, and delegates it to a system that runs your workflow. I work full time on this amazing Open Source tool!
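
To make the idea of cycle points concrete, here is a minimal Python sketch that generates date-time points at a fixed interval. It does not use Cylc's own API; the function name cycle_points and the dates are made up for illustration.

```python
from datetime import datetime, timedelta


def cycle_points(start, stop, step):
    """Yield every cycle point from start to stop (inclusive), step apart."""
    point = start
    while point <= stop:
        yield point
        point += step


# Hypothetical schedule: one cycle point every 5 minutes (ISO 8601 period PT5M).
points = list(cycle_points(
    datetime(2018, 12, 22, 0, 0),
    datetime(2018, 12, 22, 0, 20),
    timedelta(minutes=5),
))
for p in points:
    # Print each point in an ISO 8601 style, e.g. 2018-12-22T00:05Z
    print(p.strftime("%Y-%m-%dT%H:%MZ"))
```

A real suite would attach tasks and dependencies to each of these points; the sketch only shows how a period expands into a list of points.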

Such a system could be the local computer (running jobs in the background), or a batch system such as at or PBS. PBS was created for NASA to manage job execution, taking into consideration cluster resources, and using queues, priorities, and other features useful for HPC. PBS was later acquired by Altair; an Open Source version, OpenPBS, was created and later abandoned. There is also a fork called PBS Torque. I first encountered PBS at the University of São Paulo, in Brazil, where they had a PBS Torque cluster.

Running PBS Torque with Docker

Even though I have access to an environment with Cylc and PBS, I decided to give it a try and see how hard it would be to reproduce the issue with Docker. One thing that I like about this approach is the possibility of sharing the work with others online. I believe it improves communication and agility, and can be useful for posterity.

I had some experience with PBS Torque on Docker, because of some old work for another Open Source project called BioUno. So I started by testing the image I had used before, agaveapi/torque.

You can get PBS Torque up and running with one line if you have Docker and a good Internet connection. Docker is in the same category as NPM, Maven, etc., in that it downloads what it needs from the Internet (though I find the way Maven manages common dependencies saner).

$ docker run -d -h docker.example.com -p 10022:22 --privileged --name torque agaveapi/torque
f18f2cc2a2f90f56d1b74370060a272d5faf5b39dfc295a2025c78e469950194

From here you can submit a job once you have switched to the testuser user inside the container.

$ docker exec -t -i torque /bin/bash
bash-4.1# su - testuser
[testuser@docker ~]$ qsub /home/testuser/torque.submit
0.docker.example.com
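
The string printed by qsub is the job identifier. A minimal sketch for splitting it, assuming the sequence_number.server_name layout seen in the output above; parse_job_id is a hypothetical helper, not part of Torque or Cylc.

```python
def parse_job_id(job_id):
    """Split a job identifier into (sequence number, server name)."""
    sequence, _, server = job_id.strip().partition(".")
    if not sequence or not server:
        raise ValueError("unexpected job id: %r" % job_id)
    return sequence, server


print(parse_job_id("0.docker.example.com"))
```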

Running Cylc with Docker

You can also start an Ubuntu container with Cylc in one line.

$ docker run -t -i -v ~/Development/python/workspace/cylc-docker/standalone/cylc:/opt/cylc --entrypoint /bin/bash kinow/cylc-standalone:0.1 
root@58118e44f171:/opt/cylc# cylc check-software
Checking your software...

Individual results:
==========================================================================================
Package (version requirements)                                     Outcome (version found)
==========================================================================================
                                   *REQUIRED SOFTWARE*                                   
Python (2.6+, <3)................................FOUND & min. version MET (2.7.12.final.0)

             *OPTIONAL SOFTWARE for the GUI & dependency graph visualisation*             
/usr/lib/python2.7/dist-packages/gtk-2.0/gtk/__init__.py:57: GtkWarning: could not open display
  warnings.warn(str(e), _gtk.Warning)
Python:pygtk (2.0+)......................................FOUND & min. version MET (2.24.0)
graphviz (any)..............................................................FOUND (2.38.0)
Python:pygraphviz (any)......................................................FOUND (1.3.1)

                       *OPTIONAL SOFTWARE for the HTML User Guide*                       
ImageMagick (any)............................................................NOT FOUND (-)

                  *OPTIONAL SOFTWARE for the HTTPS communications layer*                  
Python:urllib3 (any).........................................................NOT FOUND (-)
Python:OpenSSL (any)........................................................FOUND (18.0.0)
Python:requests (2.4.2+).....................................................NOT FOUND (-)

                       *OPTIONAL SOFTWARE for the LaTeX User Guide*                       
TeX:framed (any).............................................................NOT FOUND (-)
TeX (3.0+)...........................................FOUND & min. version MET (3.14159265)
TeX:preprint (any)...........................................................NOT FOUND (-)
TeX:tex4ht (any).............................................................NOT FOUND (-)
TeX:tocloft (any)............................................................NOT FOUND (-)
TeX:texlive (any)............................................................NOT FOUND (-)
==========================================================================================

Summary:
                               ****************************                               
                                  Core requirements: ok                                  
                                Full-functionality: not ok                                
                               **************************** 

Running PBS Torque and Cylc together with Docker

The repository https://github.com/kinow/cylc-docker contains some Dockerfiles for running Cylc, including a new one where I combined the PBS Torque image with the Cylc standalone image, with SSH between both. There is a docker-compose.yml configuration to initialize a mini cluster of two computers, with SSH key trust configured, and a shared volume/file system.

$ cd cylc-docker/pbs
$ ssh-keygen -t rsa -f ./id_rsa -N "" -q
$ export CYLC_SSH_PUBKEY=$(cat id_rsa.pub)
$ echo "CYLC_SSH_PUBKEY=${CYLC_SSH_PUBKEY}" >> .env
$ docker-compose up
Creating pbs  ... done
Creating cylc ... done
Attaching to cylc, pbs
cylc    | /usr/lib/python2.7/dist-packages/supervisor/options.py:297: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
cylc    |   'Supervisord is running as root and it is searching '
cylc    | 2018-12-22 02:29:06,526 CRIT Supervisor running as root (no user in config file)
cylc    | 2018-12-22 02:29:06,526 WARN No file matches via include "/etc/supervisor/conf.d/*.conf"
pbs     | /usr/lib/python2.6/site-packages/supervisor/options.py:295: UserWarning: Supervisord is running as root and it is searching for its configuration file in default locations (including its current working directory); you probably want to specify a "-c" argument specifying an absolute path to a configuration file for improved security.
pbs     |   'Supervisord is running as root and it is searching '
cylc    | 2018-12-22 02:29:06,535 INFO RPC interface 'supervisor' initialized
cylc    | 2018-12-22 02:29:06,535 CRIT Server 'unix_http_server' running without any HTTP authentication checking
cylc    | 2018-12-22 02:29:06,535 INFO supervisord started with pid 1
pbs     | 2018-12-22 02:29:07,480 CRIT Supervisor running as root (no user in config file)
pbs     | 2018-12-22 02:29:07,483 INFO supervisord started with pid 1
pbs     | 2018-12-22 02:29:08,487 INFO spawned: 'pbsmom' with pid 10
pbs     | 2018-12-22 02:29:08,490 INFO spawned: 'sshd' with pid 11
pbs     | 2018-12-22 02:29:08,493 INFO spawned: 'pbssched' with pid 12
pbs     | 2018-12-22 02:29:08,497 INFO spawned: 'pbsserver' with pid 13
pbs     | 2018-12-22 02:29:08,499 INFO spawned: 'trqauthd' with pid 14
pbs     | 2018-12-22 02:29:08,761 INFO exited: pbssched (exit status 0; not expected)
pbs     | 2018-12-22 02:29:08,794 INFO gave up: pbssched entered FATAL state, too many start retries too quickly
pbs     | 2018-12-22 02:29:10,078 INFO success: pbsmom entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
pbs     | 2018-12-22 02:29:10,079 INFO success: sshd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
pbs     | 2018-12-22 02:29:10,079 INFO success: pbsserver entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
pbs     | 2018-12-22 02:29:10,079 INFO success: trqauthd entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)

This should start two containers, cylc and pbs, with an example suite.rc.

$ docker-compose exec cylc /bin/bash
bash-4.1# su - testuser
testuser@f85bd666bced:~$ cylc register pbs1 ./suites/pbs1/
REGISTER pbs1: ./suites/pbs1/

The example suite has two tasks, one of which (a) runs on the remote PBS node. In the output of running the suite below, you can tell the tasks are being executed on different computers, by different owners, through Cylc and PBS.

Task a has [a.1] -submit-num=1, owner@host=pbs in the logs, and task b has [b.1] -submit-num=1, owner@host=7ce742eb050c.

testuser@7ce742eb050c:~$ cylc run --no-detach pbs1
            ._.                                                       
            | |              The Cylc Suite Engine [7.8.0-dirty]      
._____._. ._| |_____.           Copyright (C) 2008-2018 NIWA          
| .___| | | | | .___|   & British Crown (Met Office) & Contributors.  
| !___| !_! | | !___.  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____!  This program comes with ABSOLUTELY NO WARRANTY;
      .___! |          see `cylc warranty`.  It is free software, you 
      !_____!           are welcome to redistribute it under certain  
2018-12-22T02:36:36Z INFO - Suite server: url=https://7ce742eb050c:43037/ pid=90
2018-12-22T02:36:36Z INFO - Run: (re)start=0 log=1
2018-12-22T02:36:36Z INFO - Cylc version: 7.8.0-dirty
2018-12-22T02:36:36Z INFO - Run mode: live
2018-12-22T02:36:36Z INFO - Initial point: 1
2018-12-22T02:36:36Z INFO - Final point: 1
2018-12-22T02:36:36Z INFO - Cold Start 1
2018-12-22T02:36:38Z INFO - [a.1] -submit-num=1, owner@host=pbs
2018-12-22T02:36:39Z INFO - [a.1] -(current:ready) submitted at 2018-12-22T02:36:39Z
2018-12-22T02:36:39Z INFO - [a.1] -health check settings: submission timeout=None
2018-12-22T02:36:40Z INFO - [a.1] -(current:submitted)> started at 2018-12-22T02:36:39Z
2018-12-22T02:36:40Z INFO - [a.1] -health check settings: execution timeout=None
2018-12-22T02:36:40Z INFO - [a.1] -(current:running)> succeeded at 2018-12-22T02:36:40Z
2018-12-22T02:36:41Z INFO - [b.1] -submit-num=1, owner@host=7ce742eb050c
2018-12-22T02:36:42Z INFO - [client-command] poll_tasks testuser@7ce742eb050c:cylc-poll d03e9798-14bb-4ef3-86a6-c01ea3a380ee
2018-12-22T02:36:42Z INFO - Command succeeded: poll_tasks([u'a'], poll_succ=False)
2018-12-22T02:36:42Z INFO - Processing 1 queued command(s)
	+	poll_tasks([u'a'], poll_succ=False)
2018-12-22T02:36:42Z INFO - [b.1] -(current:ready) submitted at 2018-12-22T02:36:42Z
2018-12-22T02:36:42Z INFO - [b.1] -health check settings: submission timeout=None
2018-12-22T02:36:42Z INFO - [b.1] -(current:submitted)> started at 2018-12-22T02:36:42Z
2018-12-22T02:36:42Z INFO - [b.1] -health check settings: execution timeout=None
2018-12-22T02:36:43Z INFO - [b.1] -(current:running)> succeeded at 2018-12-22T02:36:43Z
2018-12-22T02:36:43Z INFO - Suite shutting down - AUTOMATIC
2018-12-22T02:36:44Z INFO - DONE
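
The owner@host entries in the output above can also be extracted programmatically. This is a small sketch, assuming the log line layout shown above stays the same; SUBMIT_RE and task_hosts are names invented for this example and are not part of Cylc.

```python
import re

# Matches lines such as:
#   2018-12-22T02:36:38Z INFO - [a.1] -submit-num=1, owner@host=pbs
SUBMIT_RE = re.compile(
    r"\[(?P<task>[^.\]]+)\.(?P<point>[^\]]+)\] -submit-num=(?P<num>\d+), "
    r"owner@host=(?P<host>\S+)"
)


def task_hosts(log_lines):
    """Map 'task.point' to the owner@host value of its latest submission."""
    hosts = {}
    for line in log_lines:
        match = SUBMIT_RE.search(line)
        if match:
            key = "%s.%s" % (match.group("task"), match.group("point"))
            hosts[key] = match.group("host")
    return hosts


log = [
    "2018-12-22T02:36:38Z INFO - [a.1] -submit-num=1, owner@host=pbs",
    "2018-12-22T02:36:39Z INFO - [a.1] -(current:ready) submitted at 2018-12-22T02:36:39Z",
    "2018-12-22T02:36:41Z INFO - [b.1] -submit-num=1, owner@host=7ce742eb050c",
]
print(task_hosts(log))
```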

Reproducing the issue

All right, so by now we have a working cluster of two computers that share ~/cylc-run via Docker volumes (in the real world this is normally done via NFS).

The issue was related to the retrieval of remote logs, so we need to change ~/.cylc/global.rc to get log retrieval working.

# File: global.rc

[hosts]
[[pbs]]
retrieve job logs command = rsync -v -rltgoD --chmod=Du=rwx,Dgo=rx,Fu=rw,Fgo=r

The first thing I noticed while debugging was that the command I had copied contained an extra space, so it failed to execute. However, nothing appeared in the logs (!).
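
A silent failure like this is easier to spot when the caller records the exit status and stderr of the command. The following is a generic Python 3 sketch of that pattern, not Cylc's actual implementation; run_logged and the logger name are hypothetical.

```python
import logging
import subprocess
import sys

logging.basicConfig(level=logging.INFO)
LOG = logging.getLogger("retrieve-logs")


def run_logged(command):
    """Run command (a list of arguments) and log stderr if it fails."""
    proc = subprocess.run(command, capture_output=True, text=True)
    if proc.returncode != 0:
        LOG.error("command failed (exit %d): %s\n%s",
                  proc.returncode, " ".join(command), proc.stderr.strip())
    return proc.returncode


# Stand-in for a broken retrieval command: exits with status 3 after
# writing to stderr. A real call would pass the rsync argument list instead.
status = run_logged(
    [sys.executable, "-c", "import sys; sys.stderr.write('boom'); sys.exit(3)"]
)
```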

Then I managed to print the rsync command executed.

rsync -v -rltgoD --chmod=Du=rwx,Dgo=rx,Fu=rw,Fgo=r '--rsh=ssh -oBatchMode=yes -oConnectTimeout=10' -v --include=/1 --include=/1/a --include=/1/a/01 '--include=/1/a/01/**' '--exclude=/**' 'pbs:$HOME/cylc-run/pbs1/log/job/' /home/testuser/cylc-run/pbs1/log/job/

The thing that got me here was that it wasn't rsync'ing the logs from the PBS Torque spool, but rather the logs that were already synced via the Docker shared volume (or NFS in other environments).

The Cylc suite contains a PBS directive -W umask=0077, which forces the logs to be readable only by the owner. You can confirm this in the listing of the output logs below, where job.err and job.out are -rw-------. But only because it is using an empty directory.

testuser@7ce742eb050c:~$ ls -lah /home/testuser/cylc-run/pbs1/log/job/1/a/01/
total 24K
drwxrwxr-x 2 testuser testuser 4.0K Dec 22 02:36 .
drwxrwxr-x 3 testuser testuser 4.0K Dec 22 02:34 ..
-rwxrwxr-x 1 testuser testuser 1.3K Dec 22 02:36 job
-rw-rw-r-- 1 testuser testuser  266 Dec 22 02:36 job-activity.log
-rw------- 1 testuser testuser    0 Dec 22 02:36 job.err
-rw------- 1 testuser testuser  147 Dec 22 02:36 job.out
-rw-rw-r-- 1 testuser testuser  231 Dec 22 02:36 job.status
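
The -rw------- on job.err and job.out in the listing above follows from plain umask arithmetic. A minimal sketch, assuming the files are created with the usual requested mode of 0666; mode_after_umask is a helper written just for this example.

```python
import stat

DEFAULT_FILE_MODE = 0o666  # mode usually requested when creating a regular file


def mode_after_umask(requested, umask):
    """Return the permission bits left after clearing the bits set in umask."""
    return requested & ~umask


mode = mode_after_umask(DEFAULT_FILE_MODE, 0o077)
# stat.filemode renders the bits the same way ls -l does.
print(oct(mode), stat.filemode(stat.S_IFREG | mode))
```

The same arithmetic with a umask of 0002 gives -rw-rw-r--, which would be consistent with the permissions of job.status and job-activity.log in the listing (the 0002 value is an assumption about the container, not something shown above).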

Even though the rsync command was executed, it had no effect. We can confirm it by running the same command manually on the cylc node.

testuser@7ce742eb050c:~$ rsync -v -rltgoD --chmod=Du=rwx,Dgo=rx,Fu=rw,Fgo=r '--rsh=ssh -oBatchMode=yes -oConnectTimeout=10' -v --include=/1 --include=/1/a --include=/1/a/01 '--include=/1/a/01/**' '--exclude=/**' 'pbs:$HOME/cylc-run/pbs1/log/job/' /home/testuser/cylc-run/pbs1/log/job/
opening connection using: ssh -oBatchMode=yes -oConnectTimeout=10 pbs rsync --server --sender -vvlogDtre.iLsfx . "$HOME/cylc-run/pbs1/log/job/"  (10 args)
receiving incremental file list
[sender] showing directory 1 because of pattern /1
delta-transmission enabled
[sender] showing directory 1/a because of pattern /1/a
[sender] hiding directory 1/b because of pattern /**
[sender] showing directory 1/a/01 because of pattern /1/a/01
[sender] hiding file 1/a/NN because of pattern /**
[sender] showing file 1/a/01/job.out because of pattern /1/a/01/**
[sender] showing file 1/a/01/job-activity.log because of pattern /1/a/01/**
[sender] showing file 1/a/01/job.status because of pattern /1/a/01/**
[sender] showing file 1/a/01/job because of pattern /1/a/01/**
[sender] showing file 1/a/01/job.err because of pattern /1/a/01/**
1/a/01/job is uptodate
1/a/01/job-activity.log is uptodate
1/a/01/job.err is uptodate
1/a/01/job.out is uptodate
1/a/01/job.status is uptodate
total: matches=0  hash_hits=0  false_alarms=0 data=0

sent 97 bytes  received 935 bytes  688.00 bytes/sec
total size is 1,888  speedup is 1.83
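
The "is uptodate" lines above come from rsync's default quick check, which looks at file size and last-modified time. The sketch below imitates that check in Python to show that a permission difference alone does not make a file look out of date; quick_check_up_to_date is a simplified stand-in, not rsync's real code, and the timestamp is arbitrary.

```python
import os
import tempfile


def quick_check_up_to_date(src, dst):
    """Return True if dst exists with the same size and mtime as src.

    Permissions are deliberately not compared.
    """
    if not os.path.exists(dst):
        return False
    s, d = os.stat(src), os.stat(dst)
    return s.st_size == d.st_size and int(s.st_mtime) == int(d.st_mtime)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "src.out")
    dst = os.path.join(tmp, "dst.out")
    for path in (src, dst):
        with open(path, "w") as handle:
            handle.write("hello\n")
    os.utime(src, (1545446200, 1545446200))
    os.utime(dst, (1545446200, 1545446200))
    os.chmod(src, 0o644)
    os.chmod(dst, 0o600)  # different permissions, same size and mtime
    same = quick_check_up_to_date(src, dst)
    missing = quick_check_up_to_date(src, os.path.join(tmp, "nope"))
    print(same, missing)
```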

rsync realizes that the files are already up to date, and does not change their permissions. If we run this against a different, empty folder, rsync will adjust the permissions. In the next example, a directory named test is created first, and the rsync command is changed to use this new local folder.

testuser@7ce742eb050c:~$ mkdir ~/test
testuser@7ce742eb050c:~$ rsync -v -rltgoD --chmod=Du=rwx,Dgo=rx,Fu=rw,Fgo=r '--rsh=ssh -oBatchMode=yes -oConnectTimeout=10' -v --include=/1 --include=/1/a --include=/1/a/01 '--include=/1/a/01/**' '--exclude=/**' 'pbs:$HOME/cylc-run/pbs1/log/job/' /home/testuser/test                  
opening connection using: ssh -oBatchMode=yes -oConnectTimeout=10 pbs rsync --server --sender -vvlogDtre.iLsfx . "$HOME/cylc-run/pbs1/log/job/"  (10 args)
receiving incremental file list
[sender] showing directory 1 because of pattern /1
delta-transmission enabled
[sender] showing directory 1/a because of pattern /1/a
[sender] hiding directory 1/b because of pattern /**
[sender] showing directory 1/a/01 because of pattern /1/a/01
[sender] hiding file 1/a/NN because of pattern /**
[sender] showing file 1/a/01/job.out because of pattern /1/a/01/**
[sender] showing file 1/a/01/job-activity.log because of pattern /1/a/01/**
[sender] showing file 1/a/01/job.status because of pattern /1/a/01/**
[sender] showing file 1/a/01/job because of pattern /1/a/01/**
[sender] showing file 1/a/01/job.err because of pattern /1/a/01/**
./
1/
1/a/
1/a/01/
1/a/01/job
1/a/01/job-activity.log
1/a/01/job.err
1/a/01/job.out
1/a/01/job.status
total: matches=0  hash_hits=0  false_alarms=0 data=1888

sent 177 bytes  received 3,022 bytes  6,398.00 bytes/sec
total size is 1,888  speedup is 0.59
testuser@7ce742eb050c:~$ ls -lah test/1/a/01/
total 24K
drwxr-xr-x 2 testuser testuser 4.0K Dec 22 02:36 .
drwxr-xr-x 3 testuser testuser 4.0K Dec 22 02:34 ..
-rw-r--r-- 1 testuser testuser 1.3K Dec 22 02:36 job
-rw-r--r-- 1 testuser testuser  266 Dec 22 02:36 job-activity.log
-rw-r--r-- 1 testuser testuser    0 Dec 22 02:36 job.err
-rw-r--r-- 1 testuser testuser  147 Dec 22 02:36 job.out
-rw-r--r-- 1 testuser testuser  231 Dec 22 02:36 job.status

It is possible now to confirm that rsync has synced the folders, and instead of the previous -rw------- permissions, both job.out and job.err have -rw-r--r-- now.
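
To see why the new files end up as -rw-r--r-- and the directories as drwxr-xr-x, the --chmod value from the command can be evaluated by hand. The sketch below is a simplified evaluator that handles only the = operator and the D/F prefixes used in this post; apply_chmod is a hypothetical helper, and real rsync accepts a fuller chmod syntax.

```python
import stat

PERM_BITS = {"r": 0o444, "w": 0o222, "x": 0o111}
WHO_MASK = {"u": 0o700, "g": 0o070, "o": 0o007, "a": 0o777}


def apply_chmod(spec, mode, is_dir):
    """Apply a comma-separated spec of 'who=perms' clauses to mode.

    Only the '=' operator is handled; a leading D or F restricts a clause
    to directories or files, as in the rsync command above.
    """
    for clause in spec.split(","):
        if clause[0] in "DF":
            if (clause[0] == "D") != is_dir:
                continue  # clause targets the other kind of entry
            clause = clause[1:]
        who, perms = clause.split("=")
        mask = 0
        for w in who:
            mask |= WHO_MASK[w]
        bits = 0
        for p in perms:
            bits |= PERM_BITS[p]
        # Replace the bits of the selected classes, keep the others.
        mode = (mode & ~mask) | (bits & mask)
    return mode


SPEC = "Du=rwx,Dgo=rx,Fu=rw,Fgo=r"
file_mode = apply_chmod(SPEC, 0o600, is_dir=False)
dir_mode = apply_chmod(SPEC, 0o775, is_dir=True)
print(stat.filemode(stat.S_IFREG | file_mode))
print(stat.filemode(stat.S_IFDIR | dir_mode))
```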

Conclusion

This was a good experiment, where I managed to publish some images to Docker Hub for Cylc, though they still lack maturity, and are probably not suitable for production use. There are two other images in the same GitHub repository for Cylc, but these are even less mature.

The Cylc PBS image can be used by people familiar with Docker (i.e. who know what to kick, and how). It provides a quick way to test and troubleshoot issues with Cylc and PBS.

One issue identified during this analysis is that the log retrieval command may fail in Cylc, but without any output in the logs.

I believe the problem reported on the Google Group could be due to rsync not applying the new chmod values to existing files, but only to new ones. The documentation for --chmod does not say much; however, if you look under --perms:

In summary: to give destination files (both old and new) the source permissions, use --perms. To give new files the destination-default permissions (while leaving existing files unchanged), make sure that the --perms option is off and use --chmod=ugo=rwX (which ensures that all non-masked bits get enabled).
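
The rule in that quote can be written down as a tiny model. This is only a sketch of my reading of the documentation, not rsync's implementation: destination_mode and force_644 are invented names, and the umask of 0002 is an assumption about the container. Note that -rltgoD is the archive mode (-a, which expands to -rlptgoD) without -p, so --perms is off in the command configured in global.rc.

```python
def destination_mode(source_mode, existing_mode, perms, chmod=None, umask=0o022):
    """Model the permissions a destination file ends up with.

    source_mode   -- permissions of the file on the sending side
    existing_mode -- permissions of the destination file, or None if new
    perms         -- True if --perms is on
    chmod         -- optional function standing in for --chmod
    umask         -- receiving side's umask (only used for new files)
    """
    sent = chmod(source_mode) if chmod else source_mode
    if perms:
        return sent              # old and new files get the sent mode
    if existing_mode is not None:
        return existing_mode     # existing files are left unchanged
    return sent & ~umask         # new files: sent mode, masked by umask


def force_644(mode):
    """Stand-in for --chmod=Fu=rw,Fgo=r applied to a regular file."""
    return 0o644


# --perms off, as with -rltgoD in the command from global.rc.
existing = destination_mode(0o600, 0o600, perms=False, chmod=force_644)
new = destination_mode(0o600, None, perms=False, chmod=force_644, umask=0o002)
with_perms = destination_mode(0o600, 0o600, perms=True, chmod=force_644)
print(oct(existing), oct(new), oct(with_perms))
```

Under this model the existing file keeps 0600 while a new file gets 0644, which would be consistent with the two rsync runs shown earlier.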

My understanding is that there is a difference in how some settings in rsync work for new files and for existing files. I think the documentation could be clearer about it. The cylc container's rsync version was 3.1.1. I compiled the latest version (3.1.3) on my Ubuntu LTS, copied it across to the container, and confirmed the behaviour is the same (might message the rsync devs later to confirm).

Finally, I think there is room for improvement in the current log retrieval strategies in Cylc, and hopefully I will be able to later translate the fuzzy ideas I have in my mind right now to a simple description to discuss it with other developers, and maybe improve this feature in the future.

Live long and prosper!

♥ Open Source