Implementing a process proxy

A process proxy implementation is necessary if you want to interact with a resource manager that is not currently supported, or if you want to extend some existing behaviors. For example, we’ve recently had contributions that interact with Kubernetes Custom Resource Definitions by extending KubernetesProcessProxy to accomplish a slightly different task.

Resource managers in which there’s been some interest include the Slurm Workload Manager and Apache Mesos. In the end, it’s really a matter of having access to an API and the ability to apply “tags” or “labels” in order to discover where the kernel is running within the managed cluster. Once you have that information, it becomes a matter of implementing the appropriate methods to control the kernel’s lifecycle.

Important!

Before continuing, it is important to consider timeframes. You may want to implement a Kernel Provisioner rather than a Process Proxy, since provisioners are available to the general Jupyter framework!

The Enterprise Gateway 4.0 release is slated to adopt Kernel Provisioners, but Enterprise Gateway must remain on a down-level jupyter_client release (< 7.x) until that time because Enterprise Gateway (and process proxies) are currently incompatible with jupyter_client 7 and later.

That said, if you and your organization plan to stay on Enterprise Gateway 2.x or 3.x for the next couple of years, then implementing a process proxy may be in your best interest. Fortunately, the two constructs are nearly identical: Kernel Provisioners are essentially Process Proxies properly integrated into the Jupyter framework, thereby eliminating the need for various KernelManager hooks.

General approach

Please refer to the Process Proxy section in the System Architecture pages for descriptions and the structure of existing process proxies. The following is a general guideline for implementing a process proxy.

  1. Identify and understand how to decorate your “job” within the resource manager. In Hadoop YARN, this is done by using the kernel’s ID as the application name, i.e., setting the --name parameter to ${KERNEL_ID}. In Kubernetes, we apply the kernel’s ID to the kernel-id label on the pod (a sketch of the Kubernetes case follows this list).

  2. Today, all invocations of kernels into resource managers use a shell or Python script mechanism configured into the argv stanza of the kernelspec (an example argv stanza appears after this list). If you take this approach, you’ll need to modify that script to integrate with your resource manager.

  3. Determine how to interact with the resource manager’s API to discover the kernel and the host on which it’s running. This interaction should occur immediately following Enterprise Gateway’s receipt of the kernel’s connection information in its response from the kernel launcher. This extra step, performed within confirm_remote_startup() (sketched below), is necessary to get the appropriate host name as reflected in the resource manager’s API.

  4. Determine how to monitor the “job” using the resource manager API. This will become part of the poll() implementation (sketched below) to determine if the kernel is still running. It should be as quick as possible since it occurs every 3 seconds; if this is an expensive call, you may need to make adjustments, such as skipping the call every so often.

  5. Determine how to terminate “jobs” using the resource manager API (sketched below). This will become part of the termination sequence, but it’s probably only necessary if the message-based shutdown does not work (i.e., a last resort).
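
To illustrate step 1, here is a minimal sketch of applying the kernel’s ID as a label in Kubernetes using the official kubernetes Python client. The pod name, image, and namespace are placeholders; Enterprise Gateway’s actual launch scripts perform considerably more setup.

```python
# Sketch: create a kernel pod carrying the kernel's ID as a label so the
# process proxy can discover it later.  Name/image/namespace are placeholders.
from kubernetes import client, config


def launch_kernel_pod(kernel_id: str, namespace: str = "enterprise-gateway"):
    config.load_incluster_config()  # or load_kube_config() outside a cluster
    pod = client.V1Pod(
        metadata=client.V1ObjectMeta(
            name=f"kernel-{kernel_id}",
            labels={"kernel-id": kernel_id},  # the discovery "tag"
        ),
        spec=client.V1PodSpec(
            containers=[client.V1Container(name="kernel", image="my-kernel-image")]
        ),
    )
    return client.CoreV1Api().create_namespaced_pod(namespace, pod)
```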
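
For step 2, the argv stanza delegates the launch to a script. A sketch follows; the script path is a placeholder, and the parameter names mirror those used by Enterprise Gateway’s existing remote kernelspecs:

```json
{
  "argv": [
    "/usr/local/share/jupyter/kernels/my_kernel/bin/run.sh",
    "--RemoteProcessProxy.kernel-id",
    "{kernel_id}",
    "--RemoteProcessProxy.response-address",
    "{response_address}"
  ]
}
```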
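
For step 3, here is a rough sketch of confirm_remote_startup() on a RemoteProcessProxy subclass. The rm_client module and its find_job() call stand in for whatever API your resource manager provides; handle_timeout() and receive_connection_info() are helpers inherited from RemoteProcessProxy.

```python
import socket

import my_rm_client as rm_client  # hypothetical client for your resource manager


def confirm_remote_startup(self):
    """Discover where the kernel landed, then wait for its connection info."""
    ready_to_connect = False
    while not ready_to_connect:
        self.handle_timeout()  # gives up if the startup window is exceeded
        job = rm_client.find_job(tag=self.kernel_id)  # locate via the kernel's ID
        if job and job.host:
            # Record the host as reflected in the resource manager's API...
            self.assigned_host = job.host
            self.assigned_ip = socket.gethostbyname(job.host)
            # ...then attempt to receive the kernel's connection information.
            ready_to_connect = self.receive_connection_info()
```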
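
For step 4, poll() mirrors Popen.poll() semantics: None means the kernel is still running. The state names and rm_client call are again assumptions:

```python
def poll(self):
    """Return None while the kernel's "job" is running, False otherwise."""
    state = rm_client.job_state(tag=self.kernel_id)  # assumed API -- keep it cheap
    if state in {"PENDING", "RUNNING"}:  # assumed state names
        return None
    return False
```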
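
And for step 5, a last-resort kill() might simply cancel the job and let poll() confirm the outcome; cancel_job() is an assumed call:

```python
def kill(self):
    """Terminate the "job" forcibly -- used only when message-based shutdown fails."""
    rm_client.cancel_job(tag=self.kernel_id)  # assumed API
    # Defer to poll() for the result; None here would mean it's still running.
    return self.poll()
```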

Tip

Because kernel IDs are globally unique, they serve as ideal identifiers for discovering where in the cluster the kernel is running.

You will likely need to provide implementations for launch_process(), poll(), wait(), send_signal(), and kill(), although, depending on where your process proxy resides in the class hierarchy, some implementations may be reused.
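
As a sketch of the overall shape, a remote process proxy skeleton might look like the following. The synchronous signatures shown here follow the 2.x style (recent releases make these methods coroutines), and everything below is illustrative rather than a drop-in implementation:

```python
from enterprise_gateway.services.processproxies.processproxy import RemoteProcessProxy


class MyClusterProcessProxy(RemoteProcessProxy):
    """Skeleton process proxy for a hypothetical resource manager."""

    def launch_process(self, kernel_cmd, **kwargs):
        # Submit the kernel, tagged with self.kernel_id, to the resource manager,
        # then confirm its startup (see confirm_remote_startup()).
        return super().launch_process(kernel_cmd, **kwargs)

    def poll(self):
        ...  # is the "job" still running?  (see the sketch above)

    def wait(self):
        ...  # block until the "job" exits

    def send_signal(self, signum):
        ...  # forward signals (e.g., interrupts) to the kernel

    def kill(self):
        ...  # terminate the "job" via the resource manager (last resort)

    def confirm_remote_startup(self):
        ...  # discover the kernel's host, then receive its connection info
```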

For example, if your process proxy is going to service remote kernels, you should consider deriving your implementation from the RemoteProcessProxy class. If this is the case, then you’ll need to implement confirm_remote_startup().

Likewise, if your process proxy is based on containers, you should consider deriving your implementation from the ContainerProcessProxy. If this is the case, then you’ll need to implement get_container_status() and terminate_container_resources() rather than confirm_remote_startup(), etc.
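
The container-based equivalent, again as an illustrative skeleton:

```python
from enterprise_gateway.services.processproxies.container import ContainerProcessProxy


class MyContainerProcessProxy(ContainerProcessProxy):
    """Skeleton for a container-based orchestrator (sketch)."""

    def get_container_status(self, iteration):
        # Query the orchestrator for the container's status; once the host/IP
        # is known, record it so the connection can be established.
        ...

    def terminate_container_resources(self):
        # Remove the container and any associated resources created at launch.
        ...
```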

Once the process proxy has been implemented, construct an appropriate kernel specification that references your process proxy and iterate until you are satisfied with how your remote kernels behave.
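
The kernel specification references your class via the metadata.process_proxy stanza; the class name below is a placeholder, and this stanza is combined with an argv stanza like the one shown earlier:

```json
{
  "metadata": {
    "process_proxy": {
      "class_name": "my_package.my_proxies.MyClusterProcessProxy"
    }
  }
}
```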