Distributed deployments#
This section describes how to deploy Enterprise Gateway to manage kernels across a distributed set of hosts. In this case, a resource manager is not used, but, rather, SSH is used to distribute the kernels. This functionality is accomplished via the DistributedProcessProxy
.
Steps required to complete deployment on a distributed cluster are:
Install Enterprise Gateway on the “primary node” of the cluster.
Install and configure the server and desired kernel specifications (see below)
The DistributedProcessProxy
simply uses a fixed set of host names and selects the next host using a simple round-robin algorithm (see the Roadmap for making this pluggable). In this case, you can still experience bottlenecks on a given node that receives requests to start “large” kernels, but otherwise, you will be better off compared to when all kernels are started on a single node or as local processes, which is the default for Jupyter Notebook and JupyterLab when not configured to use Enterprise Gateway.
The following sample kernelspecs are configured to use the DistributedProcessProxy
:
python_distributed
spark_python_yarn_client
spark_scala_yarn_client
spark_R_yarn_client
Important!
The DistributedProcessProxy
utilizes SSH between the Enterprise Gateway server and the remote host. As a result, you must ensure passwordless SSH is configured between hosts.
The set of remote hosts used by the DistributedProcessProxy
are derived from two places.
The configuration option
EnterpriseGatewayApp.remote_hosts
, whose default value comes from the env variable EG_REMOTE_HOSTS - which, itself, defaults to ‘localhost’.The config option can be overridden on a per-kernel basis if the process_proxy stanza contains a config stanza where there’s a
remote_hosts
entry. If present, this value will be used instead.
Tip
Entries in the remote hosts configuration should be fully qualified domain names (FQDN). For example, host1.acme.com, host2.acme.com
Important!
All the kernel specifications configured to use the DistributedProcessProxy
must be on all nodes to which there’s a reference in the remote hosts configuration! With YARN cluster node, only the Python and R kernel packages are required on each node, not the entire kernel specification.
The following installs the sample python_distributed
kernel specification relative to the 3.2.3 release on the given node. This step must be repeated for each node and each kernel specification.
wget https://github.com/jupyter-server/enterprise_gateway/releases/download/v3.2.3/jupyter_enterprise_gateway_kernelspecs-3.2.3.tar.gz
KERNELS_FOLDER=/usr/local/share/jupyter/kernels
tar -zxvf jupyter_enterprise_gateway_kernelspecs-3.2.3.tar.gz --strip 1 --directory $KERNELS_FOLDER/python_distributed/ python_distributed/
Tip
You may find it easier to install all kernel specifications on each node, then remove directories corresponding to specification you’re not interested in using.
Specifying a load-balancing algorithm#
Jupyter Enterprise Gateway provides two ways to configure how kernels are distributed across the configured set of hosts: round-robin or least-connection.
Round-robin#
The round-robin algorithm simply uses an index into the set of configured hosts, incrementing the index on each kernel startup so that it points to the next host in the configured set. To specify the use of round-robin, use one of the following:
Command-line:
--EnterpriseGatewayApp.load_balancing_algorithm=round-robin
Configuration:
c.EnterpriseGatewayApp.load_balancing_algorithm="round-robin"
Environment:
export EG_LOAD_BALANCING_ALGORITHM=round-robin
Since round-robin is the default load-balancing algorithm, this option is not necessary.
Least-connection#
The least-connection algorithm tracks the hosts that are currently servicing kernels spawned by the Enterprise Gateway instance. Using this information, Enterprise Gateway selects the host with the least number of kernels. It does not consider other information, or whether there is another Enterprise Gateway instance using the same set of hosts. To specify the use of least-connection, use one of the following:
Command-line:
--EnterpriseGatewayApp.load_balancing_algorithm=least-connection
Configuration:
c.EnterpriseGatewayApp.load_balancing_algorithm="least-connection"
Environment:
export EG_LOAD_BALANCING_ALGORITHM=least-connection
Pinning a kernel to a host#
A kernel’s start request can specify a specific remote host on which to run by specifying that host in the KERNEL_REMOTE_HOST
environment variable within the request’s body. When specified, the configured load-balancing algorithm will be by-passed and the kernel will be started on the specified host.
YARN Client Mode#
YARN client mode kernel specifications can be considered distributed mode kernels. They just happen to use spark-submit
from different nodes in the cluster but use the DistributedProcessProxy
to manage their lifecycle.
YARN Client kernel specifications require the following environment variable to be set within their env
entries:
SPARK_HOME
must point to the Apache Spark installation path
SPARK_HOME:/usr/hdp/current/spark2-client #For HDP distribution
In addition, they will leverage the aforementioned remote hosts configuration.
After that, you should have a kernel.json
that looks similar to the one below:
{
"language": "python",
"display_name": "Spark - Python (YARN Client Mode)",
"metadata": {
"process_proxy": {
"class_name": "enterprise_gateway.services.processproxies.distributed.DistributedProcessProxy"
}
},
"env": {
"SPARK_HOME": "/usr/hdp/current/spark2-client",
"PYSPARK_PYTHON": "/opt/conda/bin/python",
"PYTHONPATH": "${HOME}/.local/lib/python3.6/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip",
"SPARK_YARN_USER_ENV": "PYTHONUSERBASE=/home/yarn/.local,PYTHONPATH=${HOME}/.local/lib/python3.6/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip,PATH=/opt/conda/bin:$PATH",
"SPARK_OPTS": "--master yarn --deploy-mode client --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID} --conf spark.yarn.submit.waitAppCompletion=false",
"LAUNCH_OPTS": ""
},
"argv": [
"/usr/local/share/jupyter/kernels/spark_python_yarn_client/bin/run.sh",
"--RemoteProcessProxy.kernel-id",
"{kernel_id}",
"--RemoteProcessProxy.response-address",
"{response_address}",
"--RemoteProcessProxy.public-key",
"{public_key}"
]
}
Make any necessary adjustments such as updating SPARK_HOME
or other environment and path specific configurations.
Tip
Each node of the cluster will typically be configured in the same manner relative to directory hierarchies and environment variables. As a result, you may find it easier to get kernel specifications working on one node, then, after confirming their operation, copy them to other nodes and update the remote-hosts configuration to include the other nodes. You will still need to install the kernels themselves on each node.
Spark Standalone#
Although Enterprise Gateway does not provide sample kernelspecs for Spark standalone, here are the steps necessary to convert a yarn_client
kernelspec to standalone.
Make a copy of the source
yarn_client
kernelspec into an applicablestandalone
directory.Edit the
kernel.json
file:Update the display_name with e.g.
Spark - Python (Spark Standalone)
.Update the
--master
option in the SPARK_OPTS to point to the spark master node rather than indicate--deploy-mode client
.Update
SPARK_OPTS
and remove thespark.yarn.submit.waitAppCompletion=false
.Update the
argv
stanza to referencerun.sh
in the appropriate directory.
After that, you should have a kernel.json
that looks similar to the one below:
{
"language": "python",
"display_name": "Spark - Python (Spark Standalone)",
"metadata": {
"process_proxy": {
"class_name": "enterprise_gateway.services.processproxies.distributed.DistributedProcessProxy"
}
},
"env": {
"SPARK_HOME": "/usr/hdp/current/spark2-client",
"PYSPARK_PYTHON": "/opt/conda/bin/python",
"PYTHONPATH": "${HOME}/.local/lib/python3.6/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip",
"SPARK_YARN_USER_ENV": "PYTHONUSERBASE=/home/yarn/.local,PYTHONPATH=${HOME}/.local/lib/python3.6/site-packages:/usr/hdp/current/spark2-client/python:/usr/hdp/current/spark2-client/python/lib/py4j-0.10.6-src.zip,PATH=/opt/conda/bin:$PATH",
"SPARK_OPTS": "--master spark://127.0.0.1:7077 --name ${KERNEL_ID:-ERROR__NO__KERNEL_ID}",
"LAUNCH_OPTS": ""
},
"argv": [
"/usr/local/share/jupyter/kernels/spark_python_spark_standalone/bin/run.sh",
"--RemoteProcessProxy.kernel-id",
"{kernel_id}",
"--RemoteProcessProxy.response-address",
"{response_address}",
"--RemoteProcessProxy.public-key",
"{public_key}"
]
}