Distributed deployments#

This section describes how to deploy Enterprise Gateway to manage kernels across a distributed set of hosts. In this scenario, a resource manager is not used; instead, SSH is used to launch kernels on the remote hosts. This functionality is accomplished via the DistributedProcessProxy.

Steps required to complete deployment on a distributed cluster are:

  1. Install Enterprise Gateway on the “primary node” of the cluster.

  2. Install the desired kernels.

  3. Install and configure the server and desired kernel specifications (see below).

  4. Launch Enterprise Gateway.
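
For step 4, a minimal launch from the primary node might look like the following (the options shown are illustrative; see the configuration documentation for the full set):

# Launch Enterprise Gateway, listening on all interfaces.
jupyter enterprisegateway --ip=0.0.0.0 --port_retries=0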

The DistributedProcessProxy uses a fixed set of host names and, by default, selects the next host using a simple round-robin algorithm (see Specifying a load-balancing algorithm below). You can still experience bottlenecks on a given node that receives requests to start “large” kernels, but otherwise you will be better off than when all kernels are started on a single node or as local processes, which is the default for Jupyter Notebook and JupyterLab when not configured to use Enterprise Gateway.

The following sample kernelspecs are configured to use the DistributedProcessProxy:

  • python_distributed

  • spark_python_yarn_client

  • spark_scala_yarn_client

  • spark_R_yarn_client

Important!

The DistributedProcessProxy utilizes SSH between the Enterprise Gateway server and the remote host. As a result, you must ensure passwordless SSH is configured from the Enterprise Gateway server to every host in the remote hosts configuration.
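
If passwordless SSH is not already in place, a typical setup generates a key pair on the Enterprise Gateway server and copies the public key to each remote host. A minimal sketch, assuming OpenSSH and placeholder host names:

# On the Enterprise Gateway server: generate a key pair without a passphrase.
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa

# Copy the public key to every host in the remote hosts configuration.
ssh-copy-id host1.acme.com
ssh-copy-id host2.acme.com

# Verify that login no longer prompts for a password.
ssh host1.acme.com hostname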

The set of remote hosts used by the DistributedProcessProxy is derived from two places:

  • The configuration option EnterpriseGatewayApp.remote_hosts, whose default value comes from the environment variable EG_REMOTE_HOSTS, which itself defaults to ‘localhost’.

  • The config option can be overridden on a per-kernel basis if the process_proxy stanza contains a config stanza with a remote_hosts entry. If present, this value will be used instead (see the examples below).
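
For example, the global set of hosts can be configured as follows (host names are placeholders):

Configuration:

c.EnterpriseGatewayApp.remote_hosts = ['host1.acme.com', 'host2.acme.com']

Environment:

export EG_REMOTE_HOSTS=host1.acme.com,host2.acme.com

A per-kernel override lives in the kernelspec’s process_proxy stanza. A sketch (the comma-separated value format is an assumption; adjust to your deployment):

{
  "metadata": {
    "process_proxy": {
      "class_name": "enterprise_gateway.services.processproxies.distributed.DistributedProcessProxy",
      "config": {
        "remote_hosts": "host3.acme.com,host4.acme.com"
      }
    }
  }
}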

Tip

Entries in the remote hosts configuration should be fully qualified domain names (FQDN). For example, host1.acme.com, host2.acme.com

Important!

All kernel specifications configured to use the DistributedProcessProxy must reside on every node referenced in the remote hosts configuration! This differs from YARN cluster mode, where only the Python and R kernel packages are required on each node, not the entire kernel specification.

The following installs the sample python_distributed kernel specification associated with the 3.2.3 release on a given node. This step must be repeated for each node and for each desired kernel specification.

# Download the kernelspecs archive associated with the desired release.
wget https://github.com/jupyter-server/enterprise_gateway/releases/download/v3.2.3/jupyter_enterprise_gateway_kernelspecs-3.2.3.tar.gz
KERNELS_FOLDER=/usr/local/share/jupyter/kernels
# Ensure the target directory exists, then extract only the python_distributed kernelspec.
mkdir -p $KERNELS_FOLDER/python_distributed
tar -zxvf jupyter_enterprise_gateway_kernelspecs-3.2.3.tar.gz --strip 1 --directory $KERNELS_FOLDER/python_distributed/ python_distributed/
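
After extraction, you can confirm the kernel specification is visible on that node (assuming Jupyter is installed there):

jupyter kernelspec list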

Tip

You may find it easier to install all kernel specifications on each node, then remove the directories corresponding to specifications you’re not interested in using.

Specifying a load-balancing algorithm#

Jupyter Enterprise Gateway provides two load-balancing algorithms for distributing kernels across the configured set of hosts: round-robin and least-connection.

Round-robin#

The round-robin algorithm simply uses an index into the set of configured hosts, incrementing the index on each kernel startup so that it points to the next host in the configured set. To specify the use of round-robin, use one of the following:

Command-line:

--EnterpriseGatewayApp.load_balancing_algorithm=round-robin

Configuration:

c.EnterpriseGatewayApp.load_balancing_algorithm="round-robin"

Environment:

export EG_LOAD_BALANCING_ALGORITHM=round-robin

Since round-robin is the default load-balancing algorithm, this option is not necessary.
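
Conceptually, the selection amounts to nothing more than an incrementing index over the configured hosts. A minimal Python sketch (illustrative only, not Enterprise Gateway’s actual implementation; host names are placeholders):

# Illustrative sketch of round-robin host selection.
hosts = ["host1.acme.com", "host2.acme.com", "host3.acme.com"]
_next_index = 0

def next_host():
    """Return the next host, advancing the shared index on each kernel start."""
    global _next_index
    host = hosts[_next_index % len(hosts)]
    _next_index += 1
    return host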

Least-connection#

The least-connection algorithm tracks the hosts that are currently servicing kernels spawned by the Enterprise Gateway instance. Using this information, Enterprise Gateway selects the host servicing the fewest kernels. It does not take any other load information into account, nor whether another Enterprise Gateway instance is using the same set of hosts. To specify the use of least-connection, use one of the following:

Command-line:

--EnterpriseGatewayApp.load_balancing_algorithm=least-connection

Configuration:

c.EnterpriseGatewayApp.load_balancing_algorithm="least-connection"

Environment:

export EG_LOAD_BALANCING_ALGORITHM=least-connection
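
Conceptually, least-connection simply picks the host on which this gateway instance currently has the fewest kernels. A minimal Python sketch (illustrative only, not Enterprise Gateway’s actual implementation):

# Illustrative sketch of least-connection host selection.
kernel_counts = {"host1.acme.com": 0, "host2.acme.com": 0, "host3.acme.com": 0}

def next_host():
    """Pick the host servicing the fewest kernels and record the new kernel."""
    host = min(kernel_counts, key=kernel_counts.get)
    kernel_counts[host] += 1
    return host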

Pinning a kernel to a host#

A kernel start request can pin the kernel to a specific remote host by setting the KERNEL_REMOTE_HOST environment variable within the request’s body. When specified, the configured load-balancing algorithm is bypassed and the kernel is started on the specified host.
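
For example, a start request issued directly against the REST API might pin a kernel as follows (the gateway address, kernel name, and host name are placeholders):

curl -X POST http://gateway.acme.com:8888/api/kernels \
  -H 'Content-Type: application/json' \
  -d '{"name": "python_distributed", "env": {"KERNEL_REMOTE_HOST": "host2.acme.com"}}'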

YARN Client Mode#

YARN client mode kernel specifications can be considered distributed mode kernels: they happen to invoke spark-submit from different nodes in the cluster, but use the DistributedProcessProxy to manage their lifecycle.

YARN Client kernel specifications require the following environment variable to be set within their env entries:

  • SPARK_HOME must point to the Apache Spark installation path

SPARK_HOME=/usr/hdp/current/spark2-client    # For an HDP distribution

In addition, they will leverage the aforementioned remote hosts configuration.

After that, you should have a kernel.json that looks similar to the one below:

{
  "language": "python",
  "display_name": "Spark - Python (YARN Client Mode)",
  "metadata": {
    "process_proxy": {
      "class_name": "enterprise_gateway.services.processproxies.distributed.DistributedProcessProxy"
    }
  },
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client",
    "PYSPARK_PYTHON": "/opt/conda/bin/python",
    "PYTHONPATH": "${HOME}/.local/lib/python3.6/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip",
    "SPARK_YARN_USER_ENV": "PYTHONUSERBASE=/home/yarn/.local,PYTHONPATH=${HOME}/.local/lib/python3.6/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip,PATH=/opt/conda/bin:$PATH",
    "SPARK_OPTS": "--master yarn --deploy-mode client --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false",
    "LAUNCH_OPTS": ""
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh",
    "--RemoteProcessProxy.kernel-id",
    "{kernel_id}",
    "--RemoteProcessProxy.response-address",
    "{response_address}",
    "--RemoteProcessProxy.public-key",
    "{public_key}"
  ]
}

Make any necessary adjustments, such as updating SPARK_HOME or other environment- and path-specific configurations.

Tip

Each node of the cluster will typically be configured in the same manner relative to directory hierarchies and environment variables. As a result, you may find it easier to get kernel specifications working on one node, then, after confirming their operation, copy them to other nodes and update the remote-hosts configuration to include the other nodes. You will still need to install the kernels themselves on each node.
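
For instance, once a kernel specification is confirmed working on the primary node, something like the following could push it to the remaining nodes (host names and paths are placeholders):

# Copy a working kernelspec from the primary node to each remaining host.
for host in host2.acme.com host3.acme.com; do
  scp -r /usr/local/share/jupyter/kernels/spark_python_yarn_client "$host:/usr/local/share/jupyter/kernels/"
done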

Spark Standalone#

Although Enterprise Gateway does not provide sample kernelspecs for Spark standalone, here are the steps necessary to convert a yarn_client kernelspec to standalone.

  • Make a copy of the source yarn_client kernelspec into an applicable standalone directory (see the sketch following this list).

  • Edit the kernel.json file:

    • Update the display_name to, e.g., Spark - Python (Spark Standalone).

    • Update the --master option in SPARK_OPTS to point to the Spark master (e.g., spark://<master-host>:7077) and remove the --deploy-mode client option.

    • Remove the --conf spark.yarn.submit.waitAppCompletion=false entry from SPARK_OPTS.

    • Update the argv stanza to reference run.sh in the appropriate directory.
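
A sketch of the copy step, assuming the sample kernel specifications live under /usr/local/share/jupyter/kernels:

cd /usr/local/share/jupyter/kernels
# Seed the standalone kernelspec from the YARN client copy, then edit its
# kernel.json per the steps above.
cp -r spark_python_yarn_client spark_python_spark_standalone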

After that, you should have a kernel.json that looks similar to the one below:

{
  "language": "python",
  "display_name": "Spark - Python (Spark Standalone)",
  "metadata": {
    "process_proxy": {
      "class_name": "enterprise_gateway.services.processproxies.distributed.DistributedProcessProxy"
    }
  },
  "env": {
    "SPARK_HOME": "/usr/hdp/current/spark2-client",
    "PYSPARK_PYTHON": "/opt/conda/bin/python",
    "PYTHONPATH": "${HOME}/.local/lib/python3.6/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip",
    "SPARK_YARN_USER_ENV": "PYTHONUSERBASE=/home/yarn/.local,PYTHONPATH=${HOME}/.local/lib/python3.6/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip,PATH=/opt/conda/bin:$PATH",
    "SPARK_OPTS": "--master spark://127.0.0.1:7077  --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID}",
    "LAUNCH_OPTS": ""
  },
  "argv": [
    "/usr/local/share/jupyter/kernels/spark_python_spark_standalone/bin/run.sh",
    "--RemoteProcessProxy.kernel-id",
    "{kernel_id}",
    "--RemoteProcessProxy.response-address",
    "{response_address}",
    "--RemoteProcessProxy.public-key",
    "{public_key}"
  ]
}