Blog Taxonomy

Posts tagged with 'python'

A couple of class diagrams of JupyterHub

kinow @ Oct 06, 2018 21:43:44

Started on a new project last Monday. One of the tasks in this project involves a new design for the Web layer. And as the application is quite similar to JupyterHub, we are all learning more about its internal API and general system design.

This post contains only two class diagrams created with PyCharm. One is actually a SQLAlchemy ORM diagram, below.

And the class diagram (from which I removed object, and which I tried to make simpler to interpret).

I enjoyed the parts of the code and the part of the system design that I could read about so far. But that’s all for now, until I have more time to learn more about the project and the code.

p.s. there are spawners and other implementation classes in other GitHub repositories… so a more complete diagram may come later on

Cylc Scheduler Internals - Part 3

kinow @ Aug 18, 2018 18:27:37

This is part 3 in a series of posts about Cylc internals. Part 1 covered the beginning of the workflow, and part 2 documented what happens from the moment the method configure() is called. This post continues right after the configure() method returns, going on with the next method: run().

NB: this is a post to remember things, not really expecting to give someone enough information to be able to hack the Cylc Scheduler (though you can and would have fun!).

At this point, the Suite Server program must have been initialized, with the objects that it requires, and with everything configured. So this method is the one that starts the whole work on the tasks & proxies.

The runahead points, i.e. what are the next available cycle points, are calculated and scheduled. In the scheduling, tasks are queued for execution.

Most of the interesting action takes place when process_task_pool() is called, and in the submit_task_jobs(). The latter is a method from TaskJobManager, and it is here where - in this case - my shell command is executed through a temporary shell script file.
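To make the "temporary shell script file" idea concrete, here is a minimal sketch of running a shell command that way. It is not the code from TaskJobManager; the helper name run_via_temp_script is hypothetical, and it assumes a POSIX system where sh is available.

```python
import os
import subprocess
import tempfile


def run_via_temp_script(command):
    """Write `command` to a temporary shell script, run it, return its stdout.

    Hypothetical helper for illustration only; not part of Cylc.
    """
    # delete=False so the file still exists after the `with` block closes it
    with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as script:
        script.write("#!/bin/sh\n")
        script.write(command + "\n")
        path = script.name
    try:
        result = subprocess.run(
            ["sh", path], capture_output=True, text=True, check=True
        )
        return result.stdout
    finally:
        # Always clean up the temporary script
        os.remove(path)
```

Calling run_via_temp_script("echo hello") returns whatever the command wrote to standard output.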

The graph was created from the initial execution of a suite that was starting from scratch (they can also be reinitialized). If there are multiple tasks waiting, or if a suite was restarted, the diagram would look considerably different.

You can download the source file for the diagram used in this post, and edit it with draw.io.

Cylc Scheduler Internals - Part 2

kinow @ Jul 27, 2018 00:25:37

This is part 2 in a series of posts about Cylc internals. Part 1 covered the beginning of the workflow. Here we have the continuation, from the moment the method configure() is called.

NB: this is a post to remember things, not really expecting to give someone enough information to be able to hack the Cylc Scheduler (though you can and would have fun!).

The configure method is responsible for configuring the Suite Server Program, which means it interacts with the configuration singletons to retrieve the necessary configuration for the program.
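For readers not familiar with the term, a configuration singleton can be sketched roughly as below. This is a generic illustration of the pattern, not Cylc's classes; the class name, the get_inst method and the setting key are all hypothetical.

```python
class ConfigSingletonSketch:
    """Generic singleton holding configuration (hypothetical, not Cylc code)."""

    _instance = None

    def __init__(self):
        # Placeholder setting; a real program would load it from a file
        self.settings = {"example setting": "example value"}

    @classmethod
    def get_inst(cls):
        # Create the single instance on first use, then always reuse it
        if cls._instance is None:
            cls._instance = cls()
        return cls._instance
```

Every caller that asks for ConfigSingletonSketch.get_inst() receives the same object, so the configuration is loaded once and shared.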

It also interacts with other objects that store state, and also creates basic data structures and objects required for a suite program (e.g. Queues).

This is also where the contact file is created, as well as the HTTP server.

You can download the source file for the diagram used in this post, and edit it with draw.io.

Multithreaded code and Pandas

kinow @ Jul 22, 2018 13:22:14

Pandas provides high-performance data structures in Python. I think in Java there are similar data structures in projects like Apache Commons Collections, Google Guava, and also Trove.

In Java libraries, thread-safety is often treated as a must-have feature, probably because it is quite common for a Java program to have more than one thread, especially if the code runs in some sort of web container.

I recently learned that Pandas, on the other hand, does not guarantee any thread-safety. I found that out while reading an issue about a race condition in the IndexEngine, and after preparing a pull request for it.

from concurrent.futures import ThreadPoolExecutor
import pandas as pd

# Build a date index, then read is_unique from two threads at once.
index = pd.date_range('2001', '2020')
with ThreadPoolExecutor(2) as p:
    assert all(p.map(lambda idx: idx.is_unique, [index] * 2))

When you create an index like that, it will delegate most of the hard work to the IndexEngine. Inside the IndexEngine, the values passed for the index are stored, and then an empty Hashtable is created (as well as several flags for the state of the object, such as unique, which defines whether the index has unique elements or not).

Once a user accesses a property like is_unique, the Hashtable mapping is populated and the flags are updated: the unique flag is set to true if all elements are unique, or false otherwise. If the user never needs that operation, the mapping is not populated until it is really needed.
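The lazy behaviour described here can be sketched in plain Python. This is only a simplified stand-in, not the real IndexEngine; the class name LazyUniqueEngine and its attributes are hypothetical.

```python
class LazyUniqueEngine:
    """Simplified stand-in for an index engine (hypothetical, not pandas code)."""

    def __init__(self, values):
        self.values = list(values)
        self.mapping = {}       # stays empty until first needed
        self.unique = False     # only meaningful once populated
        self.populated = False

    def _populate(self):
        # Fill the mapping and update the flags in a single pass
        self.unique = True
        for position, value in enumerate(self.values):
            if value in self.mapping:
                self.unique = False
            self.mapping[value] = position
        self.populated = True

    @property
    def is_unique(self):
        # The expensive work happens only on first access
        if not self.populated:
            self._populate()
        return self.unique
```

Note that is_unique mutates the object the first time it is read, which is exactly the kind of shared state that becomes a problem with multiple threads.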

I believe it is done that way for performance. The cost, however, is that the state is shared among calls, which makes it harder to use this API in a multithreaded environment - though still possible by moving the synchronization to your caller code.
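Moving the synchronization to the caller could look like the sketch below. To keep it runnable without Pandas, it uses a small hypothetical stand-in object (SharedLazyValue) with a lazily computed is_unique; the lock and function names are also made up for the example.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class SharedLazyValue:
    """Hypothetical stand-in for an object with lazily computed shared state."""

    def __init__(self, values):
        self._values = list(values)
        self._unique = None  # computed on first access

    @property
    def is_unique(self):
        if self._unique is None:
            self._unique = len(set(self._values)) == len(self._values)
        return self._unique


shared = SharedLazyValue(range(1000))
shared_lock = threading.Lock()  # one lock guarding the shared object


def is_unique_synchronized(obj):
    # Only one thread at a time may trigger the lazy computation
    with shared_lock:
        return obj.is_unique


with ThreadPoolExecutor(2) as pool:
    results = list(pool.map(is_unique_synchronized, [shared] * 2))
```

The same wrapper idea should apply to a Pandas index shared between threads, at the price of serializing those accesses.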

Some Apache projects do the same, asking users to synchronize, serialize, or handle certain corner cases on their side. There is a huge cost associated with maintaining an Open Source project that promises thread-safety.

Maybe Dask provides an alternative for using Pandas with multiple threads, but I have not used it yet.

♥ Open Source

NB: even though Pandas is not thread-safe, it does not mean you should not use it. Just use it with care when working with multiple threads.

Cylc Scheduler Internals - Part 1

kinow @ Jul 14, 2018 22:42:47

This is the first post in a series of three (or maybe four later) based on diagrams I collected while debugging the Cylc scheduler. The scheduler is called by the cylc start utility.

NB: this is a post to remember things, not really expecting to give someone enough information to be able to hack the Cylc Scheduler (though you can and would have fun!).

Instead of going at length on what happens (and there is quite a bit happening when you run cylc start my.suite), I will use the following diagram, followed by a few paragraphs to highlight certain parts. The code used was based on Cylc 7.7.1.

When a Cylc command like cylc start is invoked, it actually gets translated into bin/cylc-$command_name. cylc-start is a shell script that simply calls cylc-run. cylc-run then imports scheduler_cli… but what you need to know is that in the end scheduler_cli will create an instance of Scheduler with the right constructor arguments, and call its start method.
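The translation from a command name to bin/cylc-$command_name can be pictured with a tiny sketch. This is not how Cylc implements it; the helper name and the installation directory used in the example are hypothetical.

```python
import os


def resolve_command_script(command_name, cylc_dir):
    """Return the path of bin/cylc-<command_name> under cylc_dir.

    Hypothetical helper for illustration only; not part of Cylc.
    """
    return os.path.join(cylc_dir, "bin", "cylc-" + command_name)
```

With a made-up installation directory such as /opt/cylc, resolve_command_script("start", "/opt/cylc") points at the cylc-start script inside its bin directory.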

After that point, you are at the left-most lifeline, on the Scheduler constructor (i.e. the __init__ method of the scheduler.py’s Scheduler class).

If you follow the method calls - which are hopefully easy to understand and follow - you will find that the constructor merely creates a few objects, prepares the suite information, and the suite database.

Then the start method kicks things off, interacting with previously created objects, but also with some singletons for logging and configuration. Oh, that Cylc banner is also printed here (in case you would like to customize it as in SpringBoot).

After that, if you are running the suite as a daemon, it will be daemonized by forking the current process, all done in Python.

One important step that happens here is the initialization of the HTTP server. This server will be used to communicate with the Suite Server Program. It will be listening for connections, with the right endpoints available, only after the configure method.

Lastly, we have configure and run methods, which are two very important methods to be discussed in the next part of this series, as they are quite extensive, and deserve their own diagrams.

You can download the source file for the diagram used in this post, and edit it with draw.io.