Backend.AI Documentation

Latest API version: v6.20220615

Backend.AI is an enterprise-grade development and service backend for a wide range of AI-powered applications. Its core technology is tailored for operating high density computing clusters including GPUs and heterogeneous accelerators.

From the user’s perspective, Backend.AI is a cloud-like, GPU-powered HPC/DL application host (“Google Colab on your machine”). It runs arbitrary user code safely in resource-constrained containers. It hosts various programming languages and runtimes, such as Python 2/3, R, PHP, C/C++, Java, JavaScript, Julia, Octave, Haskell, Lua, and Node.js, as well as AI-oriented libraries such as TensorFlow, Keras, Caffe, and MXNet.

From the admin’s perspective, Backend.AI streamlines the process of assigning computing nodes, GPUs, and storage space to individual research team members. With detailed policy-based idle checks and resource limits, you no longer have to worry about exceeding the capacity of the cluster when demand is high.

Using its plugin architecture, Backend.AI also offers advanced features such as fractional sharing of GPUs and site-specific SSO integrations for enterprise customers of various sizes.

Backend.AI Concepts

Here we describe the key concepts that are required to understand and follow this documentation.

_images/server-architecture.svg

The diagram of a typical multi-node Backend.AI server architecture

Fig. 1 shows a typical Backend.AI server-side architecture, where the components shown are those you need to install and configure.

Each border-connected group of components is intended to be run on the same server, but you may split them into multiple servers or merge different groups into a single server as you need. For example, you can run separate servers for the nginx reverse-proxy and the Backend.AI manager or run both on a single server. In the development setup, all these components run on a single PC such as your laptop.

Service Components

Public-facing services

Manager and Webserver

Backend.AI manager is the central governor of the cluster. It accepts user requests, creates/destroys sessions, and routes code execution requests to appropriate agents and sessions. It also collects the output of sessions and returns it to the users.

Backend.AI agent is a small daemon installed onto individual worker servers to control them. It manages and monitors the lifecycle of kernel containers and mediates the input/output of sessions. Each agent also reports the resource capacity and status of its server so that the manager can assign new sessions to idle servers for load balancing.

The primary networking requirements are:

  • The manager server (HTTPS port 443) should be exposed to the public Internet or to a network that your clients can access.

  • The manager, agents, and all other database/storage servers should reside in the same local private network where traffic between them is transparently allowed.

  • For high-volume big-data processing, you may want to separate the storage network using a secondary network interface on each server, such as InfiniBand or RoCE adapters.

App Proxy

Backend.AI App Proxy is a proxy to mediate the traffic between user applications and clients like browsers. It provides the central place to set the networking and firewall policy for the user application traffic.

It has two operation modes:

  • Port mapping: Individual app instances are mapped to a TCP port taken from a pre-configured port range.

  • Wildcard subdomain: Individual app instances are mapped to a system-generated subdomain under the given top-level domain, as sketched below.
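
The following Python sketch contrasts the two modes; the function names and URL shapes are illustrative assumptions, not the actual App Proxy implementation:

# A sketch contrasting the two App Proxy addressing modes. The function
# names and URL shapes are illustrative assumptions.

def port_mapping_url(proxy_host: str, mapped_port: int) -> str:
    # Port mapping: each app instance gets a TCP port from a fixed range.
    return f"http://{proxy_host}:{mapped_port}"

def wildcard_subdomain_url(app_id: str, top_level_domain: str) -> str:
    # Wildcard subdomain: each app instance gets a generated subdomain.
    return f"https://{app_id}.{top_level_domain}"

print(port_mapping_url("apps.example.com", 10250))             # http://apps.example.com:10250
print(wildcard_subdomain_url("app-3f2a", "apps.example.com"))  # https://app-3f2a.apps.example.com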

Depending on the session type and application launch configuration, the App Proxy may require an authenticated HTTP session for HTTP-based applications. For instance, you may enforce authentication for interactive development apps like Jupyter while allowing anonymous access for AI model service APIs.

Storage Proxy

Backend.AI Storage Proxy is a proxy to offload the large file transfers from the manager. It also provides an abstraction of underlying storage vendor’s acceleration APIs since many storage vendors offer vendor-specific APIs for filesystem operations like scanning of directories with millions of files. Using the storage proxy, we apply our abstraction models for such filesystem operations and quota management specialized to each vendor API.

FastTrack (Enterprise only)

Backend.AI FastTrack is an add-on service running on top of the manager that features a slick GUI to design and run pipelines of computation tasks. It makes it easier to monitor the progress of various MLOps pipelines running concurrently, and allows sharing such pipelines in portable ways.

Resource Management

Sokovan Orchestrator

Backend.AI Sokovan is the central cluster-level scheduler running inside the manager. It monitors the resource usage of agents and assigns new containers from the job queue to the agents.

Each resource group may have a separate scheduling policy and options. The scheduling algorithm may be extended using a common abstract interface. A scheduler implementation accepts the list of currently running sessions, the list of pending sessions in the job queue, and the current resource usage of the target agents. It then outputs the choice of a pending session to start and the assignment of an agent to host it.
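
Conceptually, the interface looks like the following Python sketch; the class and method names here are illustrative assumptions, not the actual plugin API (refer to the manager source code for the real definitions):

from abc import ABC, abstractmethod
from typing import Any, Optional, Sequence

class Scheduler(ABC):
    """A conceptual sketch of the pluggable scheduler interface."""

    @abstractmethod
    def pick_session(
        self,
        total_capacity: Any,              # total resource slots of the resource group
        pending_sessions: Sequence[Any],  # sessions waiting in the job queue
        running_sessions: Sequence[Any],  # sessions currently running
    ) -> Optional[Any]:
        """Return the pending session to start next, or None to wait."""

    @abstractmethod
    def assign_agent(
        self,
        possible_agents: Sequence[Any],   # candidate agents with their current usage
        picked_session: Any,
    ) -> Optional[Any]:
        """Return the agent that will host the picked session."""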

Agent

Backend.AI Agent is a small daemon running on each compute node, such as a GPU server. Its main job is to control and monitor containers via Docker, but it also includes an abstraction over various “compute process” backends. It publishes various types of container-related events so that the manager can react to status updates of containers.

When the manager assigns a new container, the agent decides the device-level resource mappings for the container considering optimal hardware layouts such as NUMA and the PCIe bus locations of accelerator and network devices.

Internal services

Event bus

Backend.AI uses Redis to keep track of various real-time information and notify system events to other service components.

Control Panel (Enterprise only)

Backend.AI Control Panel is an add-on service to the manager for advanced management and monitoring. It provides a dedicated superadmin GUI, featuring batch creation and modification of users, detailed configuration of various resource policies, and more.

Forklift (Enterprise only)

Backend.AI Forklift is a standalone service that eases building new Backend.AI-compatible container images from scratch or importing existing ones.

Reservoir (Enterprise only)

Backend.AI Reservoir is an add-on service to provide open source package mirrors for air-gapped setups.

Container Registry

Backend.AI supports integration with several common container registry solutions, while open source users may also rely on our official registry service with prebuilt images at https://cr.backend.ai:

  • Docker’s vanilla open-source registry

    • It is simplest to set up but does not provide advanced access controls and namespacing over container images.

  • Harbor v2 (recommended)

    • It provides a full-fledged container registry service including ACLs with project/user memberships, cloning from/to remote registries, on-premise and cloud deployments, security analysis, and more.

Computing

Sessions and kernels

Backend.AI spawns sessions to host various kinds of computation with associated computing resources. Each session may have one or more kernels. We call sessions with multiple kernels “cluster sessions”.

A kernel represents an isolated unit of computation such as a container, a virtual machine, a native process, or even a Kubernetes pod, depending on the Agent’s backend implementation and configuration. The most common form of a kernel is a Docker container. Container- or VM-based kernels are also associated with base images, the most common form of which is the OCI container image.

Kernel roles in a cluster session

In a cluster session with multiple kernels, each kernel has a role. By default, the first container takes the “main” role while the others take the “sub” role. All kernels are given unique hostnames like “main1”, “sub1”, “sub2”, …, and “subN” (the cluster size is N+1 in this case). A non-cluster session has one “main1” kernel only.

All interactions with a session are routed to its “main1” kernel, while the “main1” kernel is allowed to access all other kernels via a private network.
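
As a plain illustration of the naming rule described above, the kernel hostnames for a session of a given cluster size can be derived as follows:

def cluster_hostnames(cluster_size: int) -> list[str]:
    # The first kernel takes the "main" role; the rest take the "sub" role.
    return ["main1"] + [f"sub{i}" for i in range(1, cluster_size)]

print(cluster_hostnames(1))  # ['main1']  (a non-cluster session)
print(cluster_hostnames(4))  # ['main1', 'sub1', 'sub2', 'sub3']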

Session templates

A session template is a predefined set of parameters to create a session, which can be overridden by the caller. It may define additional kernel roles for a cluster session, with different base images and resource specifications.
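
For illustration, a template can be thought of as a parameter set like the sketch below; the field names are hypothetical, not the actual template schema:

# A hypothetical session template expressed as a Python dict.
# The field names are illustrative; consult the actual template schema.
session_template = {
    "name": "pytorch-cluster",
    "image": "cr.backend.ai/stable/python-pytorch:1.11-py38-cuda11.3",
    "resources": {"cpu": "4", "mem": "16g"},
    "cluster_size": 3,  # one "main" kernel plus two "sub" kernels
}

# The caller may override any predefined parameter at session creation:
overrides = {"resources": {"cpu": "8", "mem": "32g"}}
effective_params = {**session_template, **overrides}  # shallow merge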

Session types

There are several classes of sessions for different purposes, each with a different set of features.

Features by the session type

Each session type (Compute-Interactive, Compute-Batch, Inference, and System) supports a different subset of the following features: code execution, service ports, dependencies, session results, and clustering.

A compute session is the most generic form of session to host computations. It has two operation modes: interactive and batch.

Interactive compute session

Interactive compute sessions are used to run various interactive applications and development tools, such as Jupyter Notebooks and web-based terminals. Users are expected to control their lifecycles (e.g., terminating them), while Backend.AI offers configuration knobs for administrators to set idle timeouts with various criteria.

There are two major ways to interact with an interactive compute session: service ports and the code execution API.

Service ports

TODO: port mapping diagram

Code execution

TODO: execution API state diagram

Batch compute session

Batch compute sessions are used to host a “run-to-completion” script with a finite execution time. A batch session has two result states, SUCCESS or FAILED, determined by whether the main program’s exit code is zero or not.
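
The mapping from the main program’s exit code to the result state is simply:

def batch_session_result(exit_code: int) -> str:
    # SUCCESS if and only if the main program exits with code 0.
    return "SUCCESS" if exit_code == 0 else "FAILED"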

Dependencies between compute sessions

Pipelining

Inference session

Service endpoint and routing

Auto-scaling

System session

SFTP access

Scheduling

Backend.AI keeps track of sessions using a state machine that represents their various lifecycle stages.

TODO: session/kernel state diagram

TODO: two-level scheduler architecture diagram

See also

Resource groups

Session selection strategy
Heuristic FIFO

The default session selection strategy is heuristic FIFO. It mostly works like a FIFO queue, selecting the oldest pending session first, but offers an option to enable head-of-line (HoL) blocking avoidance logic.

The HoL blocking problem happens when the oldest pending session requires too many resources to be scheduled while subsequent pending sessions would fit within the available cluster resources. Those subsequent sessions never get a chance to start until the oldest pending session (the “blocker”) is either cancelled or enough running sessions terminate to release more cluster resources.

When enabled, the HoL blocking avoidance logic keeps track of the number of scheduling attempts for each pending session and pushes back the pending sessions whose retry counts exceed a certain threshold. This option must be explicitly enabled by the administrators or during installation.
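
The following Python sketch illustrates the idea; the threshold value, session fields, and data structures are illustrative assumptions, not the actual manager implementation:

from collections import defaultdict

MAX_SCHEDULING_RETRIES = 5            # hypothetical threshold

retry_counts = defaultdict(int)       # session id -> failed scheduling attempts

def pick_session(pending_sessions, fits):
    """Pick the next session from the queue (ordered oldest-first).

    `fits(session)` reports whether the session's requested resources fit
    in the currently available cluster resources.
    """
    for session in pending_sessions:
        if fits(session):
            return session
        retry_counts[session.id] += 1
        if retry_counts[session.id] <= MAX_SCHEDULING_RETRIES:
            # Behave like a plain FIFO: keep waiting for the head session.
            return None
        # The blocker exceeded the retry threshold: push it back and
        # consider the next pending session instead.
    return None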

Dominant resource fairness (DRF)
Agent selection strategy
Concentrated
Dispersed
Custom

Resource Management

Resource slots

Backend.AI abstracts each different type of computing resource as a “resource slot”. Resource slots are distinguished by their names, each consisting of two parts: the device name and the slot name.

Resource slot name    Device name    Slot name
cpu                   cpu            (implicitly defined as root)
mem                   mem            (implicitly defined as root)
cuda.device           cuda           device
cuda.shares           cuda           shares
cuda.mig-2c10g        cuda           mig-2c10g

Each resource slot has a slot type as follows:

  • COUNT: The value is an integer or decimal representing how many of the device(s) are available/allocated. It may also represent fractions of a device. Examples: cpu, cuda.device, cuda.shares.

  • BYTES: The value is an integer representing how many bytes of the resource are available/allocated. Example: mem.

  • UNIQUE: Each device can only be allocated as a whole, exclusively to a single kernel. Example: cuda.mig-10g.
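
Putting the names and types together, a session’s resource request can be thought of as a mapping from slot names to typed values. A minimal sketch (the cuda.* slots assume the corresponding compute plugins are installed; fractional sharing and MIG support are enterprise features):

resource_request = {
    "cpu": "8",             # COUNT: eight CPU cores
    "mem": "32g",           # BYTES: 32 GiB of main memory
    "cuda.shares": "0.5",   # COUNT: half of one GPU (fractional sharing)
    "cuda.mig-2c10g": "1",  # UNIQUE: one whole MIG slice, exclusively
}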

Compute plugins

Backend.AI administrators may install one or more compute plugins on each agent. Without any plugins, only the intrinsic cpu and mem resource slots are available.

Each compute plugin may declare one or more resource slots. The plugin is invoked upon startup of the agent to get the list of devices and the resource slots to report. Administrators can inspect the per-agent accelerator details provided by the compute plugins in the control panel.

The most well-known compute plugin is cuda_open, which is included in the open source version. It declares the cuda.device resource slot, which represents each NVIDIA GPU as one unit.

There is a special compute plugin, mock, to simulate non-existent devices. Developers may write a local configuration to declare an arbitrary set of devices and resource slots to test the schedulers and the frontend. It is useful for developing integrations with new hardware devices before you have the actual devices at hand.

Resource groups

A resource group is a logical group of Agents with an independent scheduler. Each agent belongs to exactly one resource group. It self-reports which resource group to join when sending its heartbeat messages, but the specified resource group must exist in advance.

See also

Scheduling

User Management

Users

Backend.AI user accounts support two authentication modes: session and keypair. The session mode uses the normal username and password with browser sessions (e.g., when using the Web UI), while the keypair mode uses a pair of access and secret keys for programmatic access.

Projects

There may be multiple projects created by administrators, and users may belong to one or more projects. Administrators may configure project-level resource policies such as a storage quota shared by all project vfolders and project-level artifacts.

When a user who belongs to multiple projects creates a new session, they must choose which project to use so that the corresponding resource policies apply.

Cluster Networking

Single-node cluster session

If a session is created with multiple containers with the single-node option, all containers are created on a single agent. The containers share a private bridge network in addition to the default network, so that they can interact with each other privately. There are no firewall restrictions in this private bridge network.

Multi-node cluster session

For even larger-scale computation, you may create a multi-node cluster session that spans multiple agents. In this case, the manager auto-configures a private overlay network so that the containers can interact with each other. There are no firewall restrictions in this private overlay network.

Detection of clustered setups

There is a concept called cluster role. The current version of Backend.AI creates homogeneous cluster sessions by replicating the same resource configuration and the same container image, but we have plans to add heterogeneous cluster sessions that have different resource and image configurations for each cluster role. For instance, a Hadoop cluster may have two types of containers: name nodes and data nodes, where they could be mapped to main and sub cluster roles.

All interactive apps are executed only in the main1 container, which is always present in both cluster and non-cluster sessions. It is the user application’s responsibility to connect with and utilize the other containers in a cluster session. To ease this process, Backend.AI injects the following environment variables into the containers and sets up randomly generated SSH keypairs between the containers so that each container can ssh into the others without additional prompts:

  • BACKENDAI_CLUSTER_SIZE: The number of containers in this cluster session. Example: 4

  • BACKENDAI_CLUSTER_HOSTS: A comma-separated list of container hostnames in this cluster session. Example: main1,sub1,sub2,sub3

  • BACKENDAI_CLUSTER_REPLICAS: Comma-separated key:value pairs of cluster roles and the replica counts for each role. Example: main:1,sub:3

  • BACKENDAI_CLUSTER_HOST: The container hostname of the current container. Example: main1

  • BACKENDAI_CLUSTER_IDX: The one-based index of the current container among the containers sharing the same cluster role. Example: 1

  • BACKENDAI_CLUSTER_ROLE: The name of the current container’s cluster role. Example: main

  • BACKENDAI_CLUSTER_LOCAL_RANK: The zero-based global index of the current container within the entire cluster session. Example: 0
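
Inside a container, a program can inspect these variables to discover its peers and its own position in the cluster, for example:

import os

size = int(os.environ["BACKENDAI_CLUSTER_SIZE"])
hosts = os.environ["BACKENDAI_CLUSTER_HOSTS"].split(",")
me = os.environ["BACKENDAI_CLUSTER_HOST"]
role = os.environ["BACKENDAI_CLUSTER_ROLE"]

# Only the "main" kernel drives the job; the other kernels are reachable
# over the private cluster network (e.g., via the pre-configured ssh).
if role == "main":
    peers = [h for h in hosts if h != me]
    print(f"{me}: coordinating {size - 1} peers: {peers}")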

Storage Management

Virtual folders

Backend.AI abstracts network storage as a set of “virtual folders” (aka “vfolders”), which provide persistent file storage to users and projects.

When creating a new session, users may connect vfolders to it with read-only or read-write permissions. If a shared vfolder limits the permission to read-only, then the user may connect it with the read-only permission only. Virtual folders are mounted into compute session containers at /home/work/{name} so that user programs can access the virtual folder contents like a local directory. The mount path inside containers may be customized (e.g., /workspace) for compatibility with existing scripts and code. Currently it is not possible to unmount or delete a vfolder while there are running sessions connected to it. For cluster sessions having multiple kernels (containers), the connected vfolders are mounted to all kernels at the same location with the same permission.
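
For example, a program inside a session can use a mounted vfolder like any local directory (here data is a hypothetical vfolder connected to the session):

from pathlib import Path

# "data" is a hypothetical vfolder connected to this session; it appears
# under /home/work like a local directory.
workdir = Path("/home/work/data")

# Files written here persist in the vfolder across sessions, unlike files
# in the container's ephemeral filesystem.
(workdir / "results.txt").write_text("accuracy=0.93\n")
print(sorted(p.name for p in workdir.iterdir()))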

For a multi-node setup, the storage volume mounts must be synchronized across all Agent nodes and the Storage Proxy node(s) using the same mount path (e.g., /mnt or /vfroot). For a single-node setup, you may simply use an empty local directory, like our install-dev.sh script (link) does.

From the perspective of the storage, all vfolders from different Backend.AI users and projects share a single common UID and GID. This allows flexible permission sharing between users and projects, while keeping the Linux ownership of the files and directories consistent when they are accessed by multiple different Backend.AI users.

User-owned vfolders

Users may create one or more virtual folders of their own to store data files, libraries, and program code. Superadmins may limit the maximum number of vfolders owned by a user.

Project-owned vfolders

Project admins and superadmins may create a vfolder that is automatically shared with all members of the project, with a specific read-only or read-write permission.

Note

If allowed, users and projects may create and access vfolders in multiple different storage volumes, but vfolder names must be unique across all storage volumes for each user and project.

VFolder invitations and permissions

Users and project administrators may invite other users to collaborate on a vfolder. Once the invitee accepts the request, they get the designated read-only or read-write permission on the shared vfolder.

Volume-level permissions

Superadmins may set additional action privileges for each storage volume, such as whether to allow or block mounting the vfolders in compute sessions, cloning the vfolders, and so on.

Auto-mount vfolders

If a user-owned vfolder’s name starts with a dot, it is automatically mounted under /home/work for all sessions created by the user. A good use case is the .config and .local directories, which keep your local configurations and user-installed packages (e.g., pip install --user) persistent across all your sessions.

Quota scopes

New in version 23.03.

Quota scopes implement per-user and per-project storage usage limits. Currently, hard limits specified in bytes are supported. There are two main schemes to set up this feature.

Storage with per-directory quota
_images/vfolder-dir-quota.svg

Quota scopes and vfolders with storage solutions supporting per-directory quota

For each storage volume, each user and project has its own dedicated quota scope directory, as shown in Fig. 2. The storage solution must support per-directory quota, at least for a single level (like NetApp’s QTree). We recommend this configuration for filesystems like CephFS, Weka.io, or custom-built storage servers using ZFS or XFS, where Backend.AI Storage Proxy can be installed directly onto the storage servers.

Storage with per-volume quota
_images/vfolder-volume-quota.svg

Quota scopes and vfolders with storage solutions supporting per-volume quota

Unfortunately, there are many cases where we cannot rely on per-directory quota support in storage solutions, due to limitations of the underlying filesystem implementation or the lack of direct access to the storage vendor APIs.

In this case, we may assign a dedicated storage volume to each user and project as in Fig. 3, which naturally limits the space usage by the volume size. Another option is not to configure quota limits at all, but we don’t recommend this in production setups.

The shortcoming is that we may need to frequently mount/unmount the network volumes as we create or remove users and projects, which may cause unexpected system failures due to stale file descriptors.

Note

For shared vfolders, the quota usage is accounted to the original owner of the vfolder, either a user or a project.

Warning

For both schemes, the administrator should mind the storage solution’s system limits, such as the maximum number of volumes and quota sets, because such limits may impose a hidden cap on the maximum number of users and projects in Backend.AI.

Configuration

Shared config

Most cluster-level configurations are stored in an Etcd service. The Etcd server is also used for service discovery; when new agents boot up, they register themselves to the cluster manager via etcd. For production deployments, we recommend using an Etcd cluster composed of an odd number (3, 5, or more) of nodes to keep high availability.

Local config

Each service component has a TOML-based local configuration. It defines node-specific settings such as the agent name, the resource group to which it belongs, specific system limits, and the IP address and TCP port(s) to bind their service traffic to.

The configuration files are named after the service components, like manager.toml, agent.toml, and storage-proxy.toml. The search paths are: the current working directory, ~/.config/backend.ai, and /etc/backend.ai.

See also

The sample configurations in our source repository. Inside each component directory, sample.toml contains the full configuration schema and descriptions.

Monitoring

Dashboard (Enterprise only)

Backend.AI Dashboard is an add-on service that displays various real-time and historical performance metrics, including the number of sessions, cluster power usage, and GPU utilization.

Alerts (Enterprise only)

Administrators may configure automatic alerts based on thresholds over the monitored metrics, delivered via an external messaging service such as email or SMS.

FAQ

vs. Notebooks

  • Apache Zeppelin, Jupyter Notebook

    • Role: Notebook-style document + code frontends

    • Value: Familiar to data scientists and researchers, but hard to avoid insecure host resource sharing

  • Backend.AI

    • Role: Pluggable backend to any frontends

    • Value: Built for multi-tenancy: scalable and better isolation

vs. Orchestration Frameworks

  • Amazon ECS, Kubernetes

    • Target: Long-running interactive services

    • Value: Load balancing, fault tolerance, incremental deployment

  • Amazon Lambda, Azure Functions

    • Target: Stateless, light-weight, short-lived functions

    • Value: Serverless, zero-management

  • Backend.AI

    • Target: Stateful batch computations mixed with interactive applications

    • Value: Low-cost high-density computation, maximization of hardware potential

vs. Big-data and AI Frameworks

  • TensorFlow, Apache Spark, Apache Hive

    • Role: Computation runtime

    • Value: Difficult to install, configure, and operate at scale

  • Amazon ML, Azure ML, GCP ML

    • Role: Managed MLaaS

    • Value: Highly scalable but platform-dependent; still requires a systems engineering background

  • Backend.AI

    • Role: Host of computation runtimes

    • Value: Pre-configured, versioned, reproducible, customizable (open-source)

(All product names and trademarks are the property of their respective owners.)

Installation Guides

Install from Source

Note

For production deployments, we recommend creating separate virtualenvs for individual services and installing the pre-built wheel distributions, following Install from Packages.

Setting Up Manager and Agent (single node, all-in-one)

Check out Development Setup.

Setting Up Additional Agents (multi-node)

Updating manager configuration for multi-nodes

Since scripts/install-dev.sh assumes a single-node all-in-one setup, it configures the etcd and Redis addresses to be 127.0.0.1.

You need to update the etcd configuration of the Redis address so that additional agent nodes can connect to the Redis server using the address advertised via etcd:

$ ./backend.ai mgr etcd get config/redis/addr
127.0.0.1:xxxx
$ ./backend.ai mgr etcd put config/redis/addr MANAGER_IP:xxxx  # use the port number read above

where MANAGER_IP is an IP address of the manager node accessible from other agent nodes.

Installing additional agents in different nodes

First, you need to initialize a working copy of the core repository for each additional agent node. As our scripts/install-dev.sh does not yet provide an “agent-only” installation mode, you need to manually perform the same repository cloning along with the pyenv, Python, and Pants setup procedures as the script does.

Note

Since we use a mono-repo for the core packages, there is no way to clone only the agent sources. Just clone the entire repository and configure/run only the agent. Ensure that you also pull the LFS files and submodules when you clone it manually.

Once Pants is up and working, run pants export to populate the virtualenvs and install dependencies.

Then configure agent.toml by copying it from configs/agent/halfstack.toml and editing it as follows:

  • agent.toml

    • [etcd].addr.host: Replace with MANAGER_IP

    • [agent].rpc-listen-addr.host: Replace with AGENT_IP

    • [container].bind-host: Replace with AGENT_IP

    • [watcher].service-addr.host: Replace with AGENT_IP

where AGENT_IP is an IP address of this agent node accessible from the manager and MANAGER_IP is an IP address of the manager node accessible from this agent node.

Now execute ./backend.ai ag start-server to connect this agent node to an existing manager.

We assume that the agent and manager nodes reside in the same local network, where all TCP ports are open to each other. If this is not the case, you should configure the firewalls to open all the port numbers appearing in agent.toml.

There are more complicated setup scenarios such as splitting network planes for control and container-to-container communications, but we provide assistance with them for enterprise customers only.

Setting Up Accelerators

Ensure that your accelerator is properly set up using vendor-specific installation methods.

Clone the accelerator plugin package into the plugins directory if necessary, or just use one of the existing ones in the mono-repo.

You also need to configure agent.toml’s [agent].allow-compute-plugins with the full package path (e.g., ai.backend.accelerator.cuda_open) to activate them.

Setting Up Shared Storage

To make vfolders work properly with multiple nodes, you must enable and configure Linux NFS to share the manager node’s vfroot/local directory under the working copy and mount it at the same path in all agent nodes.

It is recommended to unify the UID and GID of the storage-proxy service, all of the agent services across nodes, container UID and GID (configurable in agent.toml), and the NFS volume.

Configuring Overlay Networks for Multi-node Training (Optional)

Note

All other features of Backend.AI except multi-node training work without this configuration. The Docker Swarm mode is used to configure overlay networks to ensure privacy between cluster sessions, while the container monitoring and configuration is done by Backend.AI itself.

Currently, the cross-node inter-container overlay routing is controlled via Docker Swarm’s overlay networks. On the manager node, you need to create a Swarm. On the agent nodes, you need to join the Swarm. Then restart all manager and agent daemons to make it work.

Install from Packages

This guide covers how to install Backend.AI from the official release packages. You can build a fully-functional Backend.AI cluster with open-source packages.

Backend.AI consists of a variety of components, including open-source core components, pluggable extensions, and enterprise modules. Some of the major components are:

  • Backend.AI Manager : API gateway and resource management. The Manager delegates workload requests to Agents and storage/file requests to the Storage Proxy.

  • Backend.AI Agent : Installs on compute nodes (usually GPU nodes) to start and manage workload execution. It sends periodic heartbeat signals to the Manager to register itself as a worker node. Even if the connection to the Manager is temporarily lost, pre-initiated workloads continue to run.

  • Backend.AI Storage Proxy : Handles requests related to storage and files. It offloads the Manager’s burden of handling long-running file I/O operations. It embeds a plugin backend structure that provides dedicated features for each storage type.

  • Backend.AI Webserver : A web server that provides persistent user web sessions. After the initial login, users can use the Backend.AI features without re-authenticating. It also serves the statically built graphical user interface in an Enterprise environment.

  • Backend.AI Web UI : A web application with a graphical user interface. Users can enjoy the easy-to-use interface to launch their secure execution environments and use apps like Jupyter and Terminal. It can be served as statically built JavaScript via the Webserver, and it is also offered as desktop applications for many operating systems and architectures.

Most components can be installed on a single management node, except Agent, which is usually installed on dedicated computing nodes (often GPU servers). However, this is not a strict rule, and Agent can also be installed on the management node.

It is also possible to configure a high-availability (HA) setup with three or more management nodes, although this is not the focus of this guide.

Setup OS Environment

Backend.AI and its associated components share common requirements and configurations for proper operation. This section explains how to configure the OS environment.

Note

This section assumes the installation on Ubuntu 20.04 LTS.

Create a user account for operation

We will create a user account bai to install and operate Backend.AI services. Set the UID and GID to 1100 to prevent conflicts with other users or groups. sudo privilege is required, so add bai to the sudo group.

$ username="bai"
$ password="secure-password"
$ sudo adduser --disabled-password --uid 1100 --gecos "" $username
$ echo "$username:$password" | sudo chpasswd
$ sudo usermod -aG sudo bai

If you do not want to expose the password in the shell history, remove the --disabled-password option so that adduser prompts for the password interactively, and skip the chpasswd line.

Log in as the bai user and continue the installation.

Install Docker engine

Backend.AI requires Docker Engine to create compute sessions with the Docker container backend, and some service components are deployed as containers, so installing Docker Engine is required. Ensure docker-compose-plugin is installed as well to use the docker compose command.

After the installation, add the bai user to the docker group so that you do not have to prefix every Docker command with sudo.

$ sudo usermod -aG docker bai

Log out and log in again to apply the group membership change.

Optimize sysctl/ulimit parameters

This is not an essential step, but is recommended to optimize the performance and stability of operating Backend.AI. Refer to the guide in the Manager repository for the details of the kernel parameters and ulimit settings. Depending on the Backend.AI services you install, the optimal values may vary. Each service installation section provides guidance with the values, if needed.

Note

Modern systems may have already set the optimal parameters. In that case, you can skip this step.

To cleanly separate the configurations, you may follow the steps below.

  • Save the resource limit parameters in /etc/security/limits.d/99-backendai.conf.

    root hard nofile 512000
    root soft nofile 512000
    root hard nproc 65536
    root soft nproc 65536
    bai hard nofile 512000
    bai soft nofile 512000
    bai hard nproc 65536
    bai soft nproc 65536
    
  • Logout and login again to apply the resource limit changes.

  • Save the kernel parameters in /etc/sysctl.d/99-backendai.conf.

    fs.file-max=2048000
    net.core.somaxconn=1024
    net.ipv4.tcp_max_syn_backlog=1024
    net.ipv4.tcp_slow_start_after_idle=0
    net.ipv4.tcp_fin_timeout=10
    net.ipv4.tcp_window_scaling=1
    net.ipv4.tcp_tw_reuse=1
    net.ipv4.tcp_early_retrans=1
    net.ipv4.ip_local_port_range="10000 65000"
    net.core.rmem_max=16777216
    net.core.wmem_max=16777216
    net.ipv4.tcp_rmem=4096 12582912 16777216
    net.ipv4.tcp_wmem=4096 12582912 16777216
    vm.overcommit_memory=1
    
  • Apply the kernel parameters with sudo sysctl -p /etc/sysctl.d/99-backendai.conf.

Prepare required Python versions and virtual environments

Prepare a Python distribution whose version meets the requirements of the target package. Backend.AI 22.09, for example, requires Python 3.10. The latest information on Python version compatibility can be found here.

There can be several ways to prepare a specific Python version. Here, we will be using a standalone, statically built Python.

(Alternative) Use pyenv to manually build and select a specific Python version

If you prefer, you can use pyenv and pyenv-virtualenv instead.

Install pyenv and pyenv-virtualenv. Then, install the Python version that is needed:

$ pyenv install "${YOUR_PYTHON_VERSION}"

Note

You may need to install suggested build environment to build Python from pyenv.

Then, you can create multiple virtual environments, one per service. To create a virtual environment for Backend.AI Manager 22.09.x and automatically activate it, for example, you may run:

$ mkdir "${HOME}/manager"
$ cd "${HOME}/manager"
$ pyenv virtualenv "${YOUR_PYTHON_VERSION}" bai-22.09-manager
$ pyenv local bai-22.09-manager
$ pip install -U pip setuptools wheel

You also need to make pip available to the Python installation with the latest wheel and setuptools packages, so that any non-binary extension packages can be compiled and installed on your system.

Configure network aliases

Although not required, using network aliases instead of IP addresses can make setup and operation easier. Edit the /etc/hosts file on each node and append contents like the example below to access each server by its network alias.

##### BEGIN for Backend.AI services #####
10.20.30.10 bai-m1   # management node 01
10.20.30.20 bai-a01  # agent node 01 (GPU 01)
10.20.30.22 bai-a02  # agent node 02 (GPU 02)
##### END for Backend.AI services #####

Note that the IP addresses should be accessible from the other nodes if you are installing on multiple servers.

Mount a shared storage

Having a shared storage volume makes it easy to save and manage data inside a Backend.AI compute environment. If you have dedicated storage, mount it with the name of your choice under the /vfroot/ directory on each server. You must mount it at the same path on all management and compute nodes.

Detailed mount procedures may vary depending on the storage type or vendor. For a usual NFS, adding the configurations to /etc/fstab and executing sudo mount -a will do the job.

Note

It is recommended to unify the UID and GID of the Storage Proxy service, all of the Agent services across nodes, container UID and GID (configurable in agent.toml), and the NFS volume.

If you do not have dedicated storage or are installing on a single server, you can use a local directory. Just create the directory /vfroot/local.

$ sudo mkdir -p /vfroot/local
$ sudo chown -R ${UID}:${GID} /vfroot
Setup accelerators

If there are accelerators (e.g., GPUs) on the server, you have to install the vendor-specific drivers and libraries to make sure the accelerators are properly set up and working. Please refer to the vendor documentation for details.

  • To integrate NVIDIA GPUs,

    • Install the NVIDIA driver and CUDA toolkit.

    • Install the NVIDIA container toolkit (nvidia-docker2).

Pull container images

For compute nodes, you need to pull some container images that are required for creating a compute session. Lablup provides a set of open container images and you may pull the following starter images:

docker pull cr.backend.ai/stable/filebrowser:21.02-ubuntu20.04
docker pull cr.backend.ai/stable/python:3.9-ubuntu20.04
docker pull cr.backend.ai/stable/python-pytorch:1.11-py38-cuda11.3
docker pull cr.backend.ai/stable/python-tensorflow:2.7-py38-cuda11.3

Prepare Database

Backend.AI makes use of PostgreSQL as its main database. Launch the service using docker compose by generating the file $HOME/halfstack/postgres-cluster-default/docker-compose.yaml and populating it with the following YAML. Feel free to adjust the volume paths and port settings. Please refer to the latest configuration (it is a symbolic link, so follow the filename in it) if needed.

version: "3"
x-base: &base
   logging:
      driver: "json-file"
      options:
         max-file: "5"
         max-size: "10m"

services:
   backendai-pg-active:
      <<: *base
      image: postgres:15.1-alpine
      restart: unless-stopped
      command: >
         postgres
         -c "max_connections=256"
         -c "max_worker_processes=4"
         -c "deadlock_timeout=10s"
         -c "lock_timeout=60000"
         -c "idle_in_transaction_session_timeout=60000"
      environment:
         - POSTGRES_USER=postgres
         - POSTGRES_PASSWORD=develove
         - POSTGRES_DB=backend
         - POSTGRES_INITDB_ARGS="--data-checksums"
      healthcheck:
         test: ["CMD", "pg_isready", "-U", "postgres"]
         interval: 10s
         timeout: 3s
         retries: 10
      volumes:
         - "${HOME}/.data/backend.ai/postgres-data/active:/var/lib/postgresql/data:rw"
      ports:
         - "8100:5432"
      networks:
         half_stack:
      cpu_count: 4
      mem_limit: "4g"

networks:
    half_stack:

Execute the following command to start the service container. The project ${USER} is added for operational convenience.

$ cd ${HOME}/halfstack/postgres-cluster-default
$ docker compose up -d
$ # -- To terminate the container:
$ # docker compose down
$ # -- To see the container logs:
$ # docker compose logs -f

Prepare Cache Service

Backend.AI makes use of Redis as its main cache service. Launch the service using docker compose by generating the file $HOME/halfstack/redis-cluster-default/docker-compose.yaml and populating it with the following YAML. Feel free to adjust the volume paths and port settings. Please refer to the latest configuration (it is a symbolic link, so follow the filename in it) if needed.

version: "3"
x-base: &base
   logging:
      driver: "json-file"
      options:
         max-file: "5"
         max-size: "10m"

services:
   backendai-halfstack-redis:
      <<: *base
      image: redis:6.2-alpine
      restart: unless-stopped
      command: >
         redis-server
         --requirepass develove
         --appendonly yes
      volumes:
         - "${HOME}/.data/backend.ai/redis-data:/data:rw"
      healthcheck:
         test: ["CMD", "redis-cli", "--raw", "incr", "ping"]
         interval: 10s
         timeout: 3s
         retries: 10
      ports:
         - "8110:6379"
      networks:
         - half_stack
      cpu_count: 1
      mem_limit: "2g"

networks:
   half_stack:

Execute the following command to start the service container. The project ${USER} is added for operational convenience.

$ cd ${HOME}/halfstack/redis-cluster-default
$ docker compose up -d
$ # -- To terminate the container:
$ # docker compose down
$ # -- To see the container logs:
$ # docker compose logs -f

Prepare Config Service

Backend.AI makes use of Etcd as its main config service. Launch the service using docker compose by generating the file $HOME/halfstack/etcd-cluster-default/docker-compose.yaml and populating it with the following YAML. Feel free to adjust the volume paths and port settings. Please refer to the latest configuration (it is a symbolic link, so follow the filename in it) if needed.

version: "3"
x-base: &base
   logging:
      driver: "json-file"
      options:
         max-file: "5"
         max-size: "10m"

services:
   backendai-halfstack-etcd:
      <<: *base
      image: quay.io/coreos/etcd:v3.4.15
      restart: unless-stopped
      command: >
         /usr/local/bin/etcd
         --name etcd-node01
         --data-dir /etcd-data
         --listen-client-urls http://0.0.0.0:2379
         --advertise-client-urls http://0.0.0.0:8120
         --listen-peer-urls http://0.0.0.0:2380
         --initial-advertise-peer-urls http://0.0.0.0:8320
         --initial-cluster etcd-node01=http://0.0.0.0:8320
         --initial-cluster-token backendai-etcd-token
         --initial-cluster-state new
         --auto-compaction-retention 1
      volumes:
         - "${HOME}/.data/backend.ai/etcd-data:/etcd-data:rw"
      healthcheck:
         test: ["CMD", "etcdctl", "endpoint", "health"]
         interval: 10s
         timeout: 3s
         retries: 10
      ports:
         - "8120:2379"
         # - "8320:2380"  # listen peer (only if required)
      networks:
         - half_stack
      cpu_count: 1
      mem_limit: "1g"

networks:
   half_stack:

Execute the following command to start the service container. The project ${USER} is added for operational convenience.

$ cd ${HOME}/halfstack/etcd-cluster-default
$ docker compose up -d
$ # -- To terminate the container:
$ # docker compose down
$ # -- To see the container logs:
$ # docker compose logs -f

Install Backend.AI Manager

Refer to Prepare required Python versions and virtual environments to set up Python and a virtual environment for the service.

Install the latest version of Backend.AI Manager for the current Python version:

$ cd "${HOME}/manager"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-manager

If you want to install a specific version:

$ pip install -U backend.ai-manager==${BACKEND_PKG_VERSION}
Local configuration

Backend.AI Manager uses a TOML file (manager.toml) for its local configuration. Refer to the manager.toml sample file for a detailed description of each section and item. A configuration example would be:

[etcd]
namespace = "local"
addr = { host = "bai-m1", port = 8120 }
user = ""
password = ""

[db]
type = "postgresql"
addr = { host = "bai-m1", port = 8100 }
name = "backend"
user = "postgres"
password = "develove"

[manager]
num-proc = 2
service-addr = { host = "0.0.0.0", port = 8081 }
# user = "bai"
# group = "bai"
ssl-enabled = false

heartbeat-timeout = 30.0
pid-file = "/home/bai/manager/manager.pid"
disabled-plugins = []
hide-agents = true
# event-loop = "asyncio"
# importer-image = "lablup/importer:manylinux2010"
distributed-lock = "filelock"

[docker-registry]
ssl-verify = false

[logging]
level = "INFO"
drivers = ["console", "file"]

[logging.pkg-ns]
"" = "WARNING"
"aiotools" = "INFO"
"aiopg" = "WARNING"
"aiohttp" = "INFO"
"ai.backend" = "INFO"
"alembic" = "INFO"

[logging.console]
colored = true
format = "verbose"

[logging.file]
path = "./logs"
filename = "manager.log"
backup-count = 10
rotation-size = "10M"

[debug]
enabled = false
enhanced-aiomonitor-task-info = true

Save the contents to ${HOME}/.config/backend.ai/manager.toml. Backend.AI will automatically recognize the location. Adjust each field to conform to your system.

Global configuration

Etcd (cluster) stores globally shared configurations for all nodes. Some of them should be populated prior to starting the service.

Note

It might be a good idea to create a backup of the current Etcd configuration before modifying the values. You can do so by simply executing:

$ backend.ai mgr etcd get --prefix "" > ./etcd-config-backup.json

To restore the backup:

$ backend.ai mgr etcd delete --prefix ""
$ backend.ai mgr etcd put-json "" ./etcd-config-backup.json

The commands below should be executed at ${HOME}/manager directory.

To list a specific key prefix from Etcd, for example, the config key:

$ backend.ai mgr etcd get --prefix config

Now, configure Redis access information. This should be accessible from all nodes.

$ backend.ai mgr etcd put config/redis/addr "bai-m1:8110"
$ backend.ai mgr etcd put config/redis/password "develove"

Set the container registry. The following configures Lablup’s open registry (cr.backend.ai). You can set your own registry with a username and password if needed. This can also be configured via the GUI.

$ backend.ai mgr etcd put config/docker/image/auto_pull "tag"
$ backend.ai mgr etcd put config/docker/registry/cr.backend.ai "https://cr.backend.ai"
$ backend.ai mgr etcd put config/docker/registry/cr.backend.ai/type "harbor2"
$ backend.ai mgr etcd put config/docker/registry/cr.backend.ai/project "stable"
$ # backend.ai mgr etcd put config/docker/registry/cr.backend.ai/username "bai"
$ # backend.ai mgr etcd put config/docker/registry/cr.backend.ai/password "secure-password"

Also, populate the Storage Proxy configuration to the Etcd:

$ # Allow project (group) folders.
$ backend.ai mgr etcd put volumes/_types/group ""
$ # Allow user folders.
$ backend.ai mgr etcd put volumes/_types/user ""
$ # Default volume host. The name of the volume proxy here is "bai-m1" and volume name is "local".
$ backend.ai mgr etcd put volumes/default_host "bai-m1:local"
$ # Set the "bai-m1" proxy information.
$ # User (browser) facing API endpoint of Storage Proxy.
$ # Cannot use host alias here. It should be user-accessible URL.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/client_api "http://10.20.30.10:6021"
$ # Manager facing internal API endpoint of Storage Proxy.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/manager_api "http://bai-m1:6022"
$ # Random secret string which is used by Manager to communicate with Storage Proxy.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/secret "secure-token-to-authenticate-manager-request"
$ # Option to disable SSL verification for the Storage Proxy.
$ backend.ai mgr etcd put volumes/proxies/bai-m1/ssl_verify "false"

Check if the configuration is properly populated:

$ backend.ai mgr etcd get --prefix volumes

Note that you have to change the secret to a unique random string for secure communication between the Manager and Storage Proxy. The most recent set of parameters can be found in sample.etcd.volumes.json.
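
For example, you can generate such a random secret with Python’s standard library:

import secrets

# Generate a unique random secret for authenticating Manager requests
# to the Storage Proxy.
print(secrets.token_urlsafe(32))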

To enable access to the volumes defined by the Storage Proxy for every user, you need to update the allowed_vfolder_hosts column of the domains table to hold the storage volume reference (e.g., bai-m1:local). You can do this by issuing an SQL statement directly inside the PostgreSQL container:

$ vfolder_host_val='{"bai-m1:local": ["create-vfolder", "modify-vfolder", "delete-vfolder", "mount-in-session", "upload-file", "download-file", "invite-others", "set-user-specific-permission"]}'
$ docker exec -it bai-backendai-pg-active-1 psql -U postgres -d backend \
      -c "UPDATE domains SET allowed_vfolder_hosts = '${vfolder_host_val}' WHERE name = 'default';"
Populate the database with initial fixtures

You need to prepare an alembic.ini file under ${HOME}/manager to manage the database schema. Copy the sample halfstack.alembic.ini and save it as ${HOME}/manager/alembic.ini. Adjust the sqlalchemy.url field if the database connection information differs from the default. You may need to change localhost to bai-m1.

Populate the database schema and initial fixtures. Copy the example JSON files (example-keypairs.json and example-resource-presets.json) as keypairs.json and resource-presets.json and save them under ${HOME}/manager/. Customize them to have unique keypairs and passwords for your initial superadmin and sample user accounts for security.

$ backend.ai mgr schema oneshot
$ backend.ai mgr fixture populate ./keypairs.json
$ backend.ai mgr fixture populate ./resource-presets.json
Sync the information of container registry

You need to scan the image catalog and metadata from the container registry into the Manager. This is required to display the list of compute environments in the user web GUI (Web UI). You can run the following command to sync the information with Lablup’s public container registry:

$ backend.ai mgr image rescan cr.backend.ai
Run Backend.AI Manager service

You can run the service:

$ cd "${HOME}/manager"
$ python -m ai.backend.manager.server

Check if the service is running. The default Manager API port is 8081, but it can be configured from manager.toml:

$ curl bai-m1:8081
{"version": "v6.20220615", "manager": "22.09.6"}

Press Ctrl-C to stop the service.

Register systemd service

The service can be registered as a systemd daemon. This is recommended so that the service runs automatically after the host machine reboots, although it is entirely optional.

First, create a runner script at ${HOME}/bin/run-manager.sh:

#! /bin/bash
set -e

if [ -z "$HOME" ]; then
   export HOME="/home/bai"
fi

# -- If you have installed using static python --
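# (Keep only the activation block that matches your installation method.)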
source .venv/bin/activate

# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
   export PYENV_ROOT="$HOME/.pyenv"
   export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

if [ "$#" -eq 0 ]; then
   exec python -m ai.backend.manager.server
else
   exec "$@"
fi

Make the script executable:

$ chmod +x "${HOME}/bin/run-manager.sh"

Then, create a systemd service file at /etc/systemd/system/backendai-manager.service:

[Unit]
Description=Backend.AI Manager
Requires=network.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/home/bai/bin/run-manager.sh
PIDFile=/home/bai/manager/manager.pid
User=1100
Group=1100
WorkingDirectory=/home/bai/manager
TimeoutStopSec=5
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072

[Install]
WantedBy=multi-user.target

Finally, enable and start the service:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-manager

$ # To check the service status
$ sudo systemctl status backendai-manager
$ # To restart the service
$ sudo systemctl restart backendai-manager
$ # To stop the service
$ sudo systemctl stop backendai-manager
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-manager -f

Install Backend.AI Agent

If there are dedicated compute nodes (often GPU nodes) in your cluster, the Backend.AI Agent service should be installed on the compute nodes, not on the management node.

Refer to Prepare required Python versions and virtual environments to set up Python and a virtual environment for the service.

Install the latest version of Backend.AI Agent for the current Python version:

$ cd "${HOME}/agent"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-agent

If you want to install a specific version:

$ pip install -U backend.ai-agent==${BACKEND_PKG_VERSION}
Setting Up Accelerators

Note

You can skip this section if your system does not have H/W accelerators.

Backend.AI supports various H/W accelerators. To integrate them with Backend.AI, you need to install the corresponding accelerator plugin package. Before installing the package, make sure that the accelerator is properly set up using vendor-specific installation methods.

The most popular accelerator today is the NVIDIA GPU. To install the open-source CUDA accelerator plugin, run:

$ pip install -U backend.ai-accelerator-cuda-open

Note

Backend.AI’s fractional GPU sharing is available only in the enterprise version; it is not supported in the open-source version.

Local configuration

Backend.AI Agent uses a TOML file (agent.toml) for its local configuration. Refer to the agent.toml sample file for a detailed description of each section and item. A configuration example would be:

[etcd]
namespace = "local"
addr = { host = "bai-m1", port = 8120 }
user = ""
password = ""

[agent]
mode = "docker"
# NOTE: You cannot use network alias here. Write the actual IP address.
rpc-listen-addr = { host = "10.20.30.10", port = 6001 }
# id = "i-something-special"
scaling-group = "default"
pid-file = "/home/bai/agent/agent.pid"
event-loop = "uvloop"
# allow-compute-plugins = ["ai.backend.accelerator.cuda_open"]

[container]
port-range = [30000, 31000]
kernel-uid = 1100
kernel-gid = 1100
bind-host = "bai-m1"
advertised-host = "bai-m1"
stats-type = "docker"
sandbox-type = "docker"
jail-args = []
scratch-type = "hostdir"
scratch-root = "./scratches"
scratch-size = "1G"

[watcher]
service-addr = { host = "bai-a01", port = 6009 }
ssl-enabled = false
target-service = "backendai-agent.service"
soft-reset-available = false

[logging]
level = "INFO"
drivers = ["console", "file"]

[logging.console]
colored = true
format = "verbose"

[logging.file]
path = "./logs"
filename = "agent.log"
backup-count = 10
rotation-size = "10M"

[logging.pkg-ns]
"" = "WARNING"
"aiodocker" = "INFO"
"aiotools" = "INFO"
"aiohttp" = "INFO"
"ai.backend" = "INFO"

[resource]
reserved-cpu = 1
reserved-mem = "1G"
reserved-disk = "8G"

[debug]
enabled = false
skip-container-deletion = false
asyncio = false
enhanced-aiomonitor-task-info = true
log-events = false
log-kernel-config = false
log-alloc-map = false
log-stats = false
log-heartbeats = false
log-docker-events = false

[debug.coredump]
enabled = false
path = "./coredumps"
backup-count = 10
size-limit = "64M"

You may need to configure [agent].allow-compute-plugins with the full package path (e.g., ai.backend.accelerator.cuda_open) to activate them.

Save the contents to ${HOME}/.config/backend.ai/agent.toml. Backend.AI will automatically recognize the location. Adjust each field to conform to your system.

Run Backend.AI Agent service

You can run the service:

$ cd "${HOME}/agent"
$ python -m ai.backend.agent.server

You should see a log message like “started handling RPC requests at …”.

There is an add-on service, Agent Watcher, that can be used to monitor and manage the Agent service. It is not required for running the Agent service, but it is recommended in production environments.

$ cd "${HOME}/agent"
$ python -m ai.backend.agent.watcher

Press Ctrl-C to stop both services.

Register systemd service

The service can be registered as a systemd daemon. This is recommended so that the service runs automatically after the host machine reboots, although it is entirely optional.

It is better to set [container].stats-type = "cgroup" in agent.toml for improved metric collection; this mode is only available with root privileges.

First, create a runner script at ${HOME}/bin/run-agent.sh:

#! /bin/bash
set -e

if [ -z "$HOME" ]; then
   export HOME="/home/bai"
fi

# -- If you have installed using static python --
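# (Keep only the activation block that matches your installation method.)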
source .venv/bin/activate

# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
   export PYENV_ROOT="$HOME/.pyenv"
   export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

if [ "$#" -eq 0 ]; then
   exec python -m ai.backend.agent.server
else
   exec "$@"
fi

Create a runner script for Watcher at ${HOME}/bin/run-watcher.sh:

#! /bin/bash
set -e

if [ -z "$HOME" ]; then
   export HOME="/home/bai"
fi

# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
   export PYENV_ROOT="$HOME/.pyenv"
   export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

if [ "$#" -eq 0 ]; then
   exec python -m ai.backend.agent.watcher
else
   exec "$@"
fi

Make the scripts executable:

$ chmod +x "${HOME}/bin/run-agent.sh"
$ chmod +x "${HOME}/bin/run-watcher.sh"

Then, create a systemd service file at /etc/systemd/system/backendai-agent.service:

[Unit]
Description=Backend.AI Agent
Requires=backendai-watcher.service
After=network.target remote-fs.target backendai-watcher.service

[Service]
Type=simple
ExecStart=/home/bai/bin/run-agent.sh
PIDFile=/home/bai/agent/agent.pid
WorkingDirectory=/home/bai/agent
TimeoutStopSec=5
KillMode=process
KillSignal=SIGINT
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072

[Install]
WantedBy=multi-user.target

And for Watcher at /etc/systemd/system/backendai-watcher.service:

[Unit]
Description=Backend.AI Agent Watcher
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/home/bai/bin/run-watcher.sh
WorkingDirectory=/home/bai/agent
TimeoutStopSec=3
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target

Finally, enable and start the service:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-watcher
$ sudo systemctl enable --now backendai-agent

$ # To check the service status
$ sudo systemctl status backendai-agent
$ # To restart the service
$ sudo systemctl restart backendai-agent
$ # To stop the service
$ sudo systemctl stop backendai-agent
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-agent -f

Install Backend.AI Storage Proxy

Refer to Prepare required Python versions and virtual environments to setup Python and virtual environment for the service.

Install the latest version of Backend.AI Storage Proxy for the current Python version:

$ cd "${HOME}/storage-proxy"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-storage-proxy

If you want to install a specific version:

$ pip install -U backend.ai-storage-proxy==${BACKEND_PKG_VERSION}
Local configuration

Backend.AI Storage Proxy uses a TOML file (storage-proxy.toml) to configure the local service. Refer to the storage-proxy.toml sample file for a detailed description of each section and item. A configuration example would be:

[etcd]
namespace = "local"
addr = { host = "bai-m1", port = 8120 }
user = ""
password = ""

[storage-proxy]
node-id = "i-bai-m1"
num-proc = 2
pid-file = "/home/bai/storage-proxy/storage_proxy.pid"
event-loop = "uvloop"
scandir-limit = 1000
max-upload-size = "100g"

# Used to generate JWT tokens for download/upload sessions
secret = "secure-token-for-users-download-upload-sessions"
# The download/upload session tokens are valid for:
session-expire = "1d"

user = 1100
group = 1100

[api.client]
# Client-facing API
service-addr = { host = "0.0.0.0", port = 6021 }
ssl-enabled = false

[api.manager]
# Manager-facing API
service-addr = { host = "0.0.0.0", port = 6022 }
ssl-enabled = false

# Used to authenticate managers
secret = "secure-token-to-authenticate-manager-request"

[debug]
enabled = false
asyncio = false
enhanced-aiomonitor-task-info = true

[logging]
# One of: "NOTSET", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"
# Set the global logging level.
level = "INFO"

# Multi-choice of: "console", "logstash", "file"
# For each choice, there must be a "logging.<driver>" section
# in this config file as exemplified below.
drivers = ["console", "file"]

[logging.pkg-ns]
"" = "WARNING"
"aiotools" = "INFO"
"aiohttp" = "INFO"
"ai.backend" = "INFO"

[logging.console]
# If set true, use ANSI colors if the console is a terminal.
# If set false, always disable the colored output in console logs.
colored = true

# One of: "simple", "verbose"
format = "simple"

[logging.file]
path = "./logs"
filename = "storage-proxy.log"
backup-count = 10
rotation-size = "10M"

[volume]

[volume.local]
backend = "vfs"
path = "/vfroot/local"

# If there are NFS volumes
# [volume.nfs]
# backend = "vfs"
# path = "/vfroot/nfs"

Save the contents to ${HOME}/.config/backend.ai/storage-proxy.toml. Backend.AI will automatically recognize the location. Adjust each field to conform to your system.

Run Backend.AI Storage Proxy service

You can run the service:

$ cd "${HOME}/storage-proxy"
$ python -m ai.backend.storage.server

Press Ctrl-C to stop the service.

Register systemd service

The service can be registered as a systemd daemon. It is recommended to automatically run the service after rebooting the host machine, although this is entirely optional.

First, create a runner script at ${HOME}/bin/run-storage-proxy.sh:

#! /bin/bash
set -e

if [ -z "$HOME" ]; then
   export HOME="/home/bai"
fi

# -- If you have installed using static python --
source .venv/bin/activate

# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
   export PYENV_ROOT="$HOME/.pyenv"
   export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

if [ "$#" -eq 0 ]; then
   exec python -m ai.backend.storage.server
else
   exec "$@"
fi

Make the scripts executable:

$ chmod +x "${HOME}/bin/run-storage-proxy.sh"

Then, create a systemd service file at /etc/systemd/system/backendai-storage-proxy.service:

[Unit]
Description=Backend.AI Storage Proxy
Requires=network.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/home/bai/bin/run-storage-proxy.sh
PIDFile=/home/bai/storage-proxy/storage-proxy.pid
WorkingDirectory=/home/bai/storage-proxy
User=1100
Group=1100
TimeoutStopSec=5
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072

[Install]
WantedBy=multi-user.target

Finally, enable and start the service:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-storage-proxy

$ # To check the service status
$ sudo systemctl status backendai-storage-proxy
$ # To restart the service
$ sudo systemctl restart backendai-storage-proxy
$ # To stop the service
$ sudo systemctl stop backendai-storage-proxy
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-storage-proxy -f

Install Backend.AI Webserver

Refer to Prepare required Python versions and virtual environments to setup Python and virtual environment for the service.

Install the latest version of Backend.AI Webserver for the current Python version:

$ cd "${HOME}/webserver"
$ # Activate a virtual environment if needed.
$ pip install -U backend.ai-webserver

If you want to install a specific version:

$ pip install -U backend.ai-webserver==${BACKEND_PKG_VERSION}
Local configuration

Backend.AI Webserver uses a config file (webserver.conf) to configure the local service. Refer to the webserver.conf sample file for a detailed description of each section and item. A configuration example would be:

[service]
ip = "0.0.0.0"
port = 8080
# Not active in open-source edition.
wsproxy.url = "http://10.20.30.10:10200"

# Set or enable it when using reverse proxy for SSL-termination
# force_endpoint_protocol = "https"

mode = "webui"
enable_signup = false
allow_signup_without_confirmation = false
always_enqueue_compute_session = false
allow_project_resource_monitor = false
allow_change_signin_mode = false
mask_user_info = false
enable_container_commit = false
hide_agents = true
directory_based_usage = false

[resources]
open_port_to_public = false
allow_non_auth_tcp = false
allow_preferred_port = false
max_cpu_cores_per_container = 255
max_memory_per_container = 1000
max_cuda_devices_per_container = 8
max_cuda_shares_per_container = 8
max_shm_per_container = 256
# Maximum per-file upload size (bytes)
max_file_upload_size = 4294967296

[environments]
# allowlist = ""

[ui]
brand = "Backend.AI"
menu_blocklist = "pipeline"

[api]
domain = "default"
endpoint = "http://bai-m1:8081"
text = "Backend.AI"
ssl-verify = false

[session]
redis.host = "bai-m1"
redis.port = 8110
redis.db = 5
redis.password = "develove"
max_age = 604800  # 1 week
flush_on_startup = false
login_block_time = 1200  # 20 min (in sec)
login_allowed_fail_count = 10
max_count_for_preopen_ports = 10

[license]

[webserver]

[logging]
# One of: "NOTSET", "DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"
# Set the global logging level.
level = "INFO"

# Multi-choice of: "console", "logstash", "file"
# For each choice, there must be a "logging.<driver>" section
# in this config file as exemplified below.
drivers = ["console", "file"]

[logging.console]
# If set true, use ANSI colors if the console is a terminal.
# If set false, always disable the colored output in console logs.
colored = true

# One of: "simple", "verbose"
format = "verbose"

[logging.file]
# The log file path and filename pattern.
# All messages are wrapped in single-line JSON objects.
# Rotated logs may have additional suffixes.
# For production, "/var/log/backend.ai" is recommended.
path = "./logs"
filename = "webserver.log"

# Set the maximum number of rotated log files to keep.
# The oldest log files are deleted when there are more than this number of files.
backup-count = 10

# The log file size to begin rotation.
rotation-size = "10M"

[logging.logstash]
# The endpoint to publish logstash records.
endpoint = { host = "localhost", port = 9300 }

# One of: "zmq.push", "zmq.pub", "tcp", "udp"
protocol = "tcp"

# SSL configs when protocol = "tcp"
ssl-enabled = true
ssl-verify = true

# Specify additional package namespaces to include in the logs
# and their individual log levels.
# Note that the actual logging level applied is the conjunction of the global logging level and the
# logging levels specified here for each namespace.
[logging.pkg-ns]
"" = "WARNING"
"aiotools" = "INFO"
"aiohttp" = "INFO"
"ai.backend" = "INFO"

[debug]
enabled = false

[plugin]

[pipeline]

Save the contents to ${HOME}/.config/backend.ai/webserver.conf.

Run Backend.AI Webserver service

You can run the service by specifying the config file path with the -f option:

$ cd "${HOME}/webserver"
$ python -m ai.backend.web.server -f ${HOME}/.config/backend.ai/webserver.conf

Press Ctrl-C to stop the service.

Register systemd service

The service can be registered as a systemd daemon. It is recommended to automatically run the service after rebooting the host machine, although this is entirely optional.

First, create a runner script at ${HOME}/bin/run-webserver.sh:

#! /bin/bash
set -e

if [ -z "$HOME" ]; then
   export HOME="/home/bai"
fi

# -- If you have installed using static python --
source .venv/bin/activate

# -- If you have installed using pyenv --
if [ -z "$PYENV_ROOT" ]; then
   export PYENV_ROOT="$HOME/.pyenv"
   export PATH="$PYENV_ROOT/bin:$PATH"
fi
eval "$(pyenv init --path)"
eval "$(pyenv virtualenv-init -)"

if [ "$#" -eq 0 ]; then
   exec python -m ai.backend.web.server -f ${HOME}/.config/backend.ai/webserver.conf
else
   exec "$@"
fi

Make the scripts executable:

$ chmod +x "${HOME}/bin/run-webserver.sh"

Then, create a systemd service file at /etc/systemd/system/backendai-webserver.service:

[Unit]
Description=Backend.AI Webserver
Requires=network.target
After=network.target remote-fs.target

[Service]
Type=simple
ExecStart=/home/bai/bin/run-webserver.sh
PIDFile=/home/bai/webserver/webserver.pid
WorkingDirectory=/home/bai/webserver
User=1100
Group=1100
TimeoutStopSec=5
KillMode=process
KillSignal=SIGTERM
PrivateTmp=false
Restart=on-failure
RestartSec=10
LimitNOFILE=5242880
LimitNPROC=131072

[Install]
WantedBy=multi-user.target

Finally, enable and start the service:

$ sudo systemctl daemon-reload
$ sudo systemctl enable --now backendai-webserver

$ # To check the service status
$ sudo systemctl status backendai-webserver
$ # To restart the service
$ sudo systemctl restart backendai-webserver
$ # To stop the service
$ sudo systemctl stop backendai-webserver
$ # To check the service log and follow
$ sudo journalctl --output cat -u backendai-webserver -f
Check user GUI access via web

You can check the access to the web GUI by opening the URL http://<host-ip-or-domain>:8080 in your web browser. If all goes well, you will see the login page.

_images/webserver-login.png

Enter the email and password you set in the previous step to verify that login works.

_images/webserver-summary-page-after-login.png

You can use almost every feature from the web GUI, but launching compute session apps like Terminal and/or Jupyter Notebook is not possible from the web in the open-source edition. You can instead use the GUI desktop client to fully use the GUI features.

You can download the GUI desktop client from the web GUI in the Summary page. Please use the “Download Backend.AI Web UI App” at the bottom of the page.

_images/webserver-dashboard-download-desktop-app.png

Or, you can download from the following release page: https://github.com/lablup/backend.ai-webui/releases

Web UI (user GUI) guide can be found at https://webui.docs.backend.ai/.

Install on Clouds

The minimal instance configuration:

  • 1x SSL certificate with a private key for your own domain (for production)

  • 1x manager instance (e.g., t3.xlarge on AWS)

    • For an HA setup, you may replicate multiple manager instances running in different availability zones and put a load balancer in front of them.

  • Nx agent instances (e.g., t3.medium / p2.xlarge on AWS – for minimal testing)

    • If you spawn multiple agents, it is recommended to use a placement group to improve locality for each availability zone.

  • 1x PostgreSQL instance (e.g., AWS RDS)

  • 1x Redis instance (e.g., AWS ElastiCache)

  • 1x etcd cluster

    • For HA setup, it should consist of 5 separate instances distributed across availability zones.

  • 1x cloud file system (e.g., AWS EFS, Azure FileShare)

  • All should be in the same virtual private network (e.g., AWS VPC).

Install on Premise

The minimal server node configuration:

  • 1x SSL certificate with a private key for your own domain (for production)

  • 1x manager server

  • Nx agent servers

  • 1x PostgreSQL server

  • 1x Redis server

  • 1x etcd cluster

    • For HA setup, it should consist of 5 separate server nodes.

  • 1x network-accessible storage server (NAS) with NFS/SMB mounts

    • All should be in the same private network (LAN).

  • Depending on the cluster size, several service/database daemons may run on the same physical server.

Install monitoring and logging tools

Backend.AI can use several third-party monitoring and logging services. Using them is completely optional.

Guide variables

⚠️ Prepare the values of the following variables before working with this page, and replace their occurrences as you follow the guide.

Name          Description
{DDAPIKEY}    The Datadog API key
{DDAPPKEY}    The Datadog application key
{SENTRYURL}   The private Sentry report URL

Install Datadog agent

Datadog is a third-party service for monitoring server resource usage.

$ DD_API_KEY={DDAPIKEY} bash -c "$(curl -L https://raw.githubusercontent.com/DataDog/dd-agent/master/packaging/datadog-agent/source/install_agent.sh)"

Install Raven (Sentry client)

Raven is the official client package of Sentry; it reports detailed contextual information such as stack traces and package versions when an unhandled exception occurs.

$ pip install "raven>=6.1"

Environment specifics: WSL v2

Backend.AI supports running on WSL (Windows Subsystem for Linux) version 2. However, you need to configure some special options so that the WSL distribution can interact with the Docker Desktop service.

Configuration of Docker Desktop for Windows

Turn on WSL Integration on Settings → Resources → WSL INTEGRATION. In most cases, this is already configured when you install Docker Desktop for Windows.

Configuration of WSL

  1. Create or modify /etc/wsl.conf using sudo in the WSL shell.

  2. Write down this and save.

    [automount]
    root = /
    options = "metadata"
    
  3. Run wsl --shutdown in a PowerShell prompt to restart the WSL distribution and ensure your wsl.conf updates are applied.

  4. Enter the WSL shell again. If the change is applied, paths will appear like /c/some/path instead of /mnt/c/some/path.

  5. Run sudo mount --make-rshared / in the WSL shell. Otherwise, your container creation from Backend.AI will fail with an error message like aiodocker.exceptions.DockerError: DockerError(500, 'path is mounted on /d but it is not a shared mount.').

Installation of Backend.AI

Now you may run the installer in the WSL shell.

User Guides

Install User Programs in Session Containers

Sometimes you need new programs or libraries that are not installed in your environment. If so, you can install the new program into your environment.

NOTE: Newly installed programs are not tied to a specific environment; they are installed in the user directory.

Install packages with linuxbrew

If you are a macOS user or a researcher/developer who occasionally installs UNIX programs, you may be familiar with Homebrew (https://brew.sh). You can install new programs using linuxbrew in Backend.AI.

Creating a user linuxbrew directory

Directories that begin with a dot are automatically mounted when the session starts. Create a linuxbrew directory that will be automatically mounted so that programs you install with linuxbrew can be used in all sessions.

Create .linuxbrew in the Storage section.

With CLI:

$ backend.ai vfolder create .linuxbrew

Let’s check if they are created correctly.

$ backend.ai vfolder list

Also, you can create a directory with the same name using the GUI console.

Installing linuxbrew

Start a new session for installation. Choose your environment and allocate the necessary resources. Generally, you don’t need to allocate a lot of resources, but if you need to compile or install a GPU-dependent library, you need to adjust the resource allocation to your needs.

In general, 1 CPU / 4GB RAM is enough.

$ sh -c "$(curl -fsSL https://raw.githubusercontent.com/Linuxbrew/install/master/install.sh)"
Testing linuxbrew

Enter the brew command to verify that linuxbrew is installed. In general, to use linuxbrew you need to add the path where linuxbrew is installed to the PATH variable.

After adding the installation path to your PATH, enter the following command to verify that it is installed correctly.

$ brew
Setting linuxbrew environment variables automatically

To correctly reference the binaries and libraries installed by linuxbrew, add the configuration to .bashrc. You can add settings from the settings tab.
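
For example, you could append a line like the following to your .bashrc (a sketch assuming linuxbrew is installed under ~/.linuxbrew, the dot-directory created above):

export PATH="$HOME/.linuxbrew/bin:$HOME/.linuxbrew/sbin:$PATH"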

Example: Installing and testing htop

To test the program installation, let’s install a program called htop. htop is a program that extends the top command, allowing you to monitor the running computing environment in a variety of ways.

Let’s install it with the following command:

$ brew install htop

If there are any libraries needed for the htop program, they will be installed automatically.

Now let’s run:

$ htop

From the run screen, you can press q to return to the terminal.

Deleting the linuxbrew environment

To reset all programs installed with linuxbrew, just delete everything in the .linuxbrew directory.

Note: If you want to remove a program by selecting it, use the brew uninstall [PROGRAM_NAME] command.

$ rm -rf ~/.linuxbrew/*

Install packages with miniconda

Some environments support miniconda. In this case, you can use miniconda (https://docs.conda.io/projects/conda/en/latest/user-guide/install/) to install the packages you want.

Creating a user miniconda-required directory

Directories that begin with a dot are automatically mounted when the session starts. Create .conda and .continuum directories that will be automatically mounted so that packages you install with miniconda can be used in all sessions.

Create .conda and .continuum in the Storage section.

With CLI:

$ backend.ai vfolder create .conda
$ backend.ai vfolder create .continuum

Let’s check if they are created correctly.

$ backend.ai vfolder list

Also, you can create the directories with the same names using the GUI console.

Testing miniconda

Make sure miniconda is preinstalled in your environment; package installation using miniconda is only available in such environments.

$ conda
Example: Installing and testing htop

To test the program installation, let’s install a program called htop. htop is a program that extends the top command, allowing you to monitor the running computing environment in a variety of ways.

Let’s install it with the following command:

$ conda install -c conda-forge htop

If there are any libraries needed for the htop program, they will be installed automatically.

Now let’s run:

$ htop

From the run screen, you can press q to return to the terminal.

Developer Guides

Development Setup

Currently, Backend.AI is developed and tested only on *NIX-compatible platforms (Linux or macOS).

The development setup uses a mono-repository for the backend stack and a side-by-side repository checkout of the frontend stack. In contrast, the production setup uses per-service independent virtual environments and relies on a separately provisioned app proxy pool.

There are three ways to run both the backend and frontend stacks for development, as demonstrated in Fig. 4, Fig. 5, and Fig. 6. The installation guide on this page, using scripts/install-dev.sh, covers all three cases because the only difference is how you launch the Web UI from the mono-repo.

_images/dev-setup.svg

A standard development setup of Backend.AI open source components

_images/dev-setup-app.svg

A development setup of Backend.AI open source components for Electron-based desktop app

_images/dev-setup-staticwebui.svg

A development setup of Backend.AI open source components with pre-built web UI from the backend.ai-app repository

Installation from Source

To ease the developer on-boarding experience, we provide an automated script that installs all server-side components in the editable state with just one command.

Prerequisites

Install the following according to your host operating system.

Warning

To avoid conflicts with your system Python (such as the macOS/Xcode-provided versions), our default pants.toml is configured to search only pyenv-provided Python versions.

Note

In some cases, locale conflicts between the terminal client and the remote host may cause encoding errors when installing Backend.AI components due to Unicode characters in README files. Please keep correct locale configurations to prevent such errors.

Running the install-dev script
$ git clone https://github.com/lablup/backend.ai bai-dev
$ cd bai-dev
$ ./scripts/install-dev.sh

Note

The script requires sudo to check and install several system packages such as build-essential.

This script bootstraps Pants and creates the halfstack containers using docker compose, with fixture population. At the end of execution, the script shows several command examples for launching the service daemons such as the manager and agent. You may execute this script multiple times if you encounter prerequisite errors, after resolving them. Also check out additional options using the -h / --help option, such as installing the CUDA mockup plugin together.

Changed in version 22.09: We have migrated from per-package repositories to a semi-mono repository that contains all Python-based components except plugins. This has completely changed the installation instructions with the introduction of Pants.

Note

To install multiple instances/versions of development environments using this script, just clone the repository in another location and run scripts/install-dev.sh inside that directory.

It is important to name these working-copy directories differently so that docker compose can distinguish the containers for each setup.

Unless you customize all port numbers with the options of scripts/install-dev.sh, you should run docker compose -f docker-compose.halfstack.current.yml down and then docker compose -f docker-compose.halfstack.current.yml up -d when switching between multiple working copies.

Note

By default, the script pulls the docker images for our standard Python kernel and TensorFlow CPU-only kernel. To try out other images, you have to pull them manually afterwards.

Note

Currently there are many limitations on running deep learning images on ARM64 platforms, because users need to rebuild the whole computation library stack, although more supported images will come in the future.

Note

To install the webui in an editable state, try the --editable-webui flag when running scripts/install-dev.sh.

Tip

Using the agent’s cgroup-based statistics without the root privilege (Linux-only)

To allow Backend.AI to collect sysfs/cgroup resource usage statistics, the Python executable must have the following Linux capabilities: CAP_SYS_ADMIN, CAP_SYS_PTRACE, and CAP_DAC_OVERRIDE.

$ sudo setcap \
>   cap_sys_ptrace,cap_sys_admin,cap_dac_override+eip \
>   $(readlink -f $(pyenv which python))
Verifying Installation

Refer to the instructions displayed after running scripts/install-dev.sh. We recommend using tmux to open multiple terminals in a single SSH session. Your terminal app may provide a tab interface, but when using remote servers, tmux is more convenient because you don’t have to set up a new SSH connection whenever you add a new terminal.

Ensure the halfstack containers are running:

$ docker compose -f docker-compose.halfstack.current.yml up -d

Open a terminal for manager and run:

$ ./backend.ai mgr start-server --debug

Open another terminal for agent and run:

$ ./backend.ai ag start-server --debug

Open yet another terminal for client and run:

$ source ./env-local-admin-api.sh  # Use the generated local endpoint and credential config.
$ # source ./env-local-user-api.sh  # You may choose an alternative credential config.
$ ./backend.ai config
$ ./backend.ai run python --rm -c 'print("hello world")'
∙ Session token prefix: fb05c73953
✔ [0] Session fb05c73953 is ready.
hello world
✔ [0] Execution finished. (exit code = 0)
✔ [0] Cleaned up the session.
$ ./backend.ai ps
Resetting the environment

Shut down all docker containers using docker compose -f docker-compose.halfstack.current.yml down and delete the entire working copy directory. That’s all.

You may need sudo to remove the directories mounted as halfstack container volumes because Docker auto-creates them with the root privilege.
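
A sketch of the full reset (assuming the working copy was cloned to bai-dev as in the clone step above):

$ docker compose -f docker-compose.halfstack.current.yml down
$ cd ..
$ sudo rm -rf bai-dev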

Daily Workflows

Check out Daily Development Workflows for your reference.

Daily Development Workflows

About Pants

Since 22.09, we have migrated to Pants as our primary build system and dependency manager for the mono-repository of Python components.

Pants is a graph-based async-parallel task executor written in Rust and Python. It is tailored to building programs with explicit and auto-inferred dependency checks and aggressive caching.

Key concepts
  • The command pattern:

    $ pants [GLOBAL_OPTS] GOAL [GOAL_OPTS] [TARGET ...]
    
  • Goal: an action to execute

    • You may think of this as the root node of the task graph executed by Pants.

  • Target: objectives for the action, usually expressed as path/to/dir:name

    • The targets are declared/defined by path/to/dir/BUILD files.

  • The global configuration is at pants.toml.

  • Recommended reading: https://www.pantsbuild.org/docs/concepts

Inspecting build configurations
  • Display all targets

    $ pants list ::
    
    • This list includes the full enumeration of individual targets auto-generated by collective targets (e.g., python_sources() generates multiple python_source() targets by globbing the sources pattern)

  • Display all dependencies of a specific target (i.e., all targets required to build this target)

    $ pants dependencies --transitive src/ai/backend/common:src
    
  • Display all dependees of a specific target (i.e., all targets affected when this target is changed)

    $ pants dependees --transitive src/ai/backend/common:src
    

Note

Pants statically analyzes the source files to enumerate all its imports and determine the dependencies automatically. In most cases this works well, but sometimes you may need to manually declare explicit dependencies in BUILD files.
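
For example, a hypothetical BUILD stanza declaring an explicit dependency could look like the following (a sketch; the target address is illustrative):

python_sources(
    name="src",
    dependencies=[
        # Declared manually because it is not auto-inferred
        # (e.g., a module loaded dynamically at runtime).
        "src/ai/backend/common:src",
    ],
)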

Running lint and check

Run lint/check for all targets:

$ pants lint ::
$ pants check ::

To run lint/check for a specific target or a set of targets:

$ pants lint src/ai/backend/common:: tests/common::
$ pants check src/ai/backend/manager::

Currently, running mypy with Pants is slow because mypy cannot utilize its own cache, as Pants invokes mypy per file due to its own dependency management scheme. (e.g., checking all sources takes more than a minute!) This performance issue is being tracked at pantsbuild/pants#10864. For now, try using a smaller target of the files you work on, and use the --changed-since option to select only the changed targets.
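
For example, the following would run the check goal only against the files changed relative to the main branch (a sketch using Pants’ global --changed-since option):

$ pants --changed-since=main check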

Running formatters

If you encounter failures from ruff, you may run the following to automatically fix the import ordering issues.

$ pants fix ::

If you encounter failures from black, you may run the following to automatically fix the code style issues.

$ pants fmt ::

Running unit tests

Here are various methods to run tests:

$ pants test ::
$ pants test tests/manager/test_scheduler.py::
$ pants test tests/manager/test_scheduler.py:: -- -k test_scheduler_configs
$ pants test tests/common::            # Run common/**/test_*.py
$ pants test tests/common:tests        # Run common/test_*.py
$ pants test tests/common/redis::      # Run common/redis/**/test_*.py
$ pants test tests/common/redis:tests  # Run common/redis/test_*.py

You may also try --changed-since option like lint and check.

To specify extra environment variables for tests, use the --test-extra-env-vars option:

$ pants test \
>   --test-extra-env-vars=MYVARIABLE=MYVALUE \
>   tests/common:tests

Running integration tests

$ ./backend.ai test run-cli user,admin

Building wheel packages

To build a specific package:

$ pants \
>   --tag="wheel" \
>   package \
>   src/ai/backend/common:dist
$ ls -l dist/*.whl

If the package content varies by the target platform, use:

$ pants \
>   --tag="wheel" \
>   --tag="+platform-specific" \
>   --platform-specific-resources-target=linux_arm64 \
>   package \
>   src/ai/backend/runner:dist
$ ls -l dist/*.whl

Using IDEs and editors

Pants has an export goal to auto-generate a virtualenv that contains all external dependencies installed in a single place. This is very useful when you use IDEs and editors.

To (re-)generate the virtualenv(s), run:

$ pants export --resolve=RESOLVE_NAME  # you may add multiple --resolve options

You may display the available resolve names with the following command (it works with Python 3.11 or later):

$ python -c 'import tomllib,pathlib;print("\n".join(tomllib.loads(pathlib.Path("pants.toml").read_text())["python"]["resolves"].keys()))'

Similarly, you can export all virtualenvs at once:

$ python -c 'import tomllib,pathlib;print("\n".join(tomllib.loads(pathlib.Path("pants.toml").read_text())["python"]["resolves"].keys()))' | sed 's/^/--resolve=/' | xargs ./pants export

Then configure your IDEs/editors to use dist/export/python/virtualenvs/python-default/PYTHON_VERSION/bin/python as the interpreter for your code, where PYTHON_VERSION is the interpreter version specified in pants.toml.

As of Pants 2.16, you must export the virtualenvs by their individual lockfiles using the --resolve option, as all tools are unified to use the same custom resolve subsystem of Pants and the :: target no longer works properly. For example:

$ pants export --resolve=python-default --resolve=mypy

To make LSP (language server protocol) services like PyLance to detect our source packages correctly, you should also configure PYTHONPATH to include the repository root’s src directory and plugins/*/ directories if you have added Backend.AI plugin checkouts.
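
For example, you could extend PYTHONPATH like the following shell sketch (the plugin directory name is illustrative):

$ export PYTHONPATH="$PWD/src:$PWD/plugins/backend.ai-accelerator-cuda-mock:$PYTHONPATH"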

For linters and formatters, configure the tool executable paths to indicate dist/export/python/virtualenvs/RESOLVE_NAME/PYTHON_VERSION/bin/EXECUTABLE. For example, ruff’s executable path is dist/export/python/virtualenvs/ruff/3.11.6/bin/ruff.

Currently we have the following Python tools to configure in this way:

  • ruff: Provides fast linting (combining pylint, flake8, and isort), fixing (auto-fix for some linting rules and isort), and formatting (black)

  • mypy: Validates the type annotations and performs a static analysis

    Tip

    For a long list of arguments or list/tuple items, you could explicitly add a trailing comma to force Ruff/Black to insert line-breaks after every item even when the line length does not exceed the limit (100 characters).

    Tip

    You may disable auto-formatting on a specific region of code using # fmt: off and # fmt: on comments, though this is strongly discouraged except when manual formatting gives better readability, such as numpy matrix declarations.

  • pytest: The unit test runner framework.

  • coverage-py: Generates reports about which source lines were visited during execution of a pytest session.

  • towncrier: Generates the changelog from news fragments in the changes directory when making a new release.

VSCode

Install the following extensions:

  • Python (ms-python.python)

  • Pylance (ms-python.vscode-pylance) (optional but recommended)

  • Mypy (ms-python.mypy-type-checker)

  • Ruff (charliermarsh.ruff)

  • For other standard Python extensions like Flake8, isort, and Black, disable them only for the Backend.AI workspace to prevent interference with Ruff’s own linting, fixing, and formatting.

Set the workspace settings for the Python extension for code navigation and auto-completion:

Setting ID                       Recommended value
python.analysis.autoSearchPaths  true
python.analysis.extraPaths       ["dist/export/python/virtualenvs/python-default/3.11.6/lib/python3.11/site-packages"]
python.analysis.importFormat     "relative"
editor.formatOnSave              true
editor.codeActionsOnSave         {"source.fixAll": true}

Set the following keys in the workspace settings to configure Python tools:

Setting ID                        Example value
mypy-type-checker.interpreter     ["dist/export/python/virtualenvs/mypy/3.11.6/bin/python"]
mypy-type-checker.importStrategy  "fromEnvironment"
ruff.interpreter                  ["dist/export/python/virtualenvs/ruff/3.11.6/bin/python"]
ruff.importStrategy               "fromEnvironment"

Note

Changed in July 2023

After applying the VSCode Python Tool migration, we no longer recommend configuring the python.linting.*Path and python.formatting.*Path keys.

Vim/NeoVim

There is a large variety of plugins, and heavy Vimmers usually know what to do.

We recommend using ALE or CoC plugins to have automatic lint highlights, auto-formatting on save, and auto-completion support with code navigation via LSP backends.

Warning

Note that it is recommended to enable only one linter/formatter at a time (either ALE or CoC) with proper configurations, to avoid duplicate suggestions and error reports.

When using ALE, it is recommended to have a directory-local vimrc as follows. First, add set exrc in your user-level vimrc. Then put the following in .vimrc (or .nvimrc for NeoVim) in the build root directory:

let s:cwd = getcwd()
let g:ale_python_mypy_executable = s:cwd . '/dist/export/python/virtualenvs/mypy/3.11.6/bin/mypy'
let g:ale_python_ruff_executable = s:cwd . '/dist/export/python/virtualenvs/ruff/3.11.6/bin/ruff'
let g:ale_linters = { "python": ['ruff', 'mypy'] }
let g:ale_fixers = {'python': ['ruff']}
let g:ale_fix_on_save = 1

When using CoC, run :CocInstall coc-pyright @yaegassy/coc-ruff and :CocLocalConfig after opening a file in the local working copy to initialize Pyright functionalities. In the local configuration file (.vim/coc-settings.json), you may put the linter/formatter configurations just like VSCode (see the official reference).

{
  "coc.preferences.formatOnType": false,
  "coc.preferences.willSaveHandlerTimeout": 5000,
  "ruff.enabled": true,
  "ruff.autoFixOnSave": true,
  "ruff.useDetectRuffCommand": false,
  "ruff.builtin.pythonPath": "dist/export/python/virtualenvs/ruff/3.11.6/bin/python",
  "ruff.serverPath": "dist/export/python/virtualenvs/ruff/3.11.6/bin/ruff-lsp",
  "python.pythonPath": "dist/export/python/virtualenvs/python-default/3.11.6/bin/python",
  "python.linting.mypyEnabled": true,
  "python.linting.mypyPath": "dist/export/python/virtualenvs/mypy/3.11.6/bin/mypy",
}

To activate Ruff (a Python linter and fixer), run :CocCommand ruff.builtin.installServer after opening any Python source file to install the ruff-lsp server.

Switching between branches

When each branch has different external package requirements, you should run pants export after git switch-ing between such branches and before running any code.

Sometimes, you may see bogus “glob” warnings from Pants because it sees a stale cache. In that case, run pgrep pantsd | xargs kill and it will be fine.

Running entrypoints

To run a Python program within the unified virtualenv, use the ./py helper script. It automatically passes additional arguments transparently to the Python executable of the unified virtualenv.

./backend.ai is an alias of ./py -m ai.backend.cli.

Examples:

$ ./py -m ai.backend.storage.server
$ ./backend.ai mgr start-server
$ ./backend.ai ps

Working with plugins

To develop Backend.AI plugins together, the repository offers a special location ./plugins where you can clone plugin repositories and a shortcut script scripts/install-plugin.sh that does this for you.

$ scripts/install-plugin.sh lablup/backend.ai-accelerator-cuda-mock

This is equivalent to:

$ git clone \
>   https://github.com/lablup/backend.ai-accelerator-cuda-mock \
>   plugins/backend.ai-accelerator-cuda-mock

These plugins are auto-detected by scanning setup.cfg of plugin subdirectories by the ai.backend.plugin.entrypoint module, even without explicit editable installations.

Writing test cases

Mostly it is the same as before: use the standard pytest practices. There are a few key differences, though:

  • Tests are executed in parallel in the unit of test modules.

  • Therefore, session-level fixtures may be executed multiple times during a single run of pants test.

Warning

If you interrupt (Ctrl+C, SIGINT) a run of pants test, it will immediately kill all pytest processes without fixture cleanup. This may accumulate unused Docker containers in your system, so it is a good practice to run docker ps -a periodically and clean up dangling containers.

To interactively run tests, see Debugging test cases (or interactively running test cases).

Here are considerations for writing Pants-friendly tests:

  • Ensure that it runs in an isolated/mocked environment and minimize external dependency.

  • If required, use the environment variable BACKEND_TEST_EXEC_SLOT (an integer value) to uniquely define TCP port numbers and other resource identifiers to allow parallel execution, as sketched after this list. Refer to the Pants docs.

  • Use ai.backend.testutils.bootstrap to populate a single-node Redis/etcd/Postgres container as fixtures of your test cases. Import the fixture and use it like a plain pytest fixture.

    • These fixtures create those containers with OS-assigned public port numbers and give you a tuple of the container ID and an ai.backend.common.types.HostPortPair for use in test code. In manager and agent tests, you could just refer to local_config to get a pre-populated local configuration with those port numbers.

    • In this case, you may encounter flake8 complaining about unused imports and redefinition. Use # noqa: F401 and # noqa: F811 respectively for now.
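
As referenced above, a minimal sketch of deriving a unique TCP port from BACKEND_TEST_EXEC_SLOT inside a test (the base port 20000 is an arbitrary assumption, not a Backend.AI convention):

import os

# Each parallel test process gets a distinct slot number from Pants,
# so offsetting a base port by the slot avoids bind conflicts.
slot = int(os.environ.get("BACKEND_TEST_EXEC_SLOT", "0"))
test_port = 20000 + slot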

Warning

About using /tmp in tests

If your Docker service is installed using Snap (e.g., Ubuntu 20.04 or later), it cannot access the system /tmp directory because Snap applies a private “virtualized” tmp directory to the Docker service.

You should use other locations under the user’s home directory (or preferably .tmp in the working copy directory) to avoid mount failures for the developers/users in such platforms.

It is okay to use the system /tmp directory if they are not mounted inside any containers.

Writing documentation

  • Create a new pyenv virtualenv based on Python 3.10.

    $ pyenv virtualenv 3.10.9 venv-bai-docs
    
  • Activate the virtualenv and run:

    $ pyenv activate venv-bai-docs
    $ pip install -U pip setuptools wheel
    $ pip install -U -r docs/requirements.txt
    
  • You can build the docs as follows:

    $ cd docs
    $ pyenv activate venv-bai-docs
    $ make html
    
  • To locally serve the docs:

    $ cd docs
    $ python -m http.server --directory=_build/html
    

(TODO: Use Pants’ own Sphinx support when pantsbuild/pants#15512 is released.)

Advanced Topics

Adding new external dependencies
  • Add the package version requirements to the unified requirements file (./requirements.txt).

  • Update the module_mapping field in the root build configuration (./BUILD) if the package name and its import name differ. (A sketch follows after this list.)

  • Update the type_stubs_module_mapping field in the root build configuration if the package provides a type stubs package separately.

  • Run:

    $ pants generate-lockfiles
    $ pants export
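
As referenced in the module_mapping step above, a hypothetical entry in the root BUILD file might look like the following (a sketch; python-dateutil is an illustrative package whose import name, dateutil, differs from its distribution name):

python_requirements(
    name="reqs",
    module_mapping={
        # distribution name -> list of importable module names
        "python-dateutil": ["dateutil"],
    },
)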
    
Merging lockfile conflicts

When you work on a branch that adds a new external dependency and the main branch has also another external dependency addition, merging the main branch into your branch is likely to make a merge conflict on python.lock file.

In this case, you can just do the following, since the lockfile can be regenerated after merging the requirements.txt and BUILD files.

$ git merge main
... it says a conflict on python.lock ...
$ git checkout --theirs python.lock
$ pants generate-lockfiles --resolve=python-default
$ git add python.lock
$ git commit
Resetting Pants

If Pants behaves strangely, you could simply reset all its runtime-generated files by:

$ pgrep pantsd | xargs -r kill
$ rm -r /tmp/*-pants/ .pants.d .pids ~/.cache/pants

After this, re-running any Pants command will automatically reinitialize itself and all cached data as necessary.

Note that you may find out the concrete path inside /tmp from .pants.rc’s local_execution_root_dir option set by install-dev.sh.

Warning

If you have run pants or the installation script with sudo, some of the above directories may be owned by root and running pants as the user privilege would not work. In such cases, remove the directories with sudo and retry.

Resolving the error message ‘Pants is not available for your platform’ when installing Backend.AI with Pants

When installing Backend.AI, you may encounter the following error message saying ‘Pants is not available for your platform’ if you have installed Pants 2.17 or older along with prior versions of Backend.AI.

[INFO] Bootstrapping the Pants build system...
Pants system command is already installed.
Failed to fetch https://binaries.pantsbuild.org/tags/pantsbuild.pants/release_2.19.0: [22] HTTP response code said error (The requested URL returned error: 404)
Bootstrapping Pants 2.19.0 using cpython 3.9.15
Installing pantsbuild.pants==2.19.0 into a virtual environment at /home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/venvs/2.19.0
    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 65.4/65.4 KB 3.3 MB/s eta 0:00:00
ERROR: Could not find a version that satisfies the requirement pantsbuild.pants==2.19.0 (from versions: 0.0.17, 0.0.18, 0.0.20, 0.0.21, 0.0.22, ... (a long list of versions) ..., 2.17.0,
2.17.1rc0, 2.17.1rc1, 2.17.1rc2, 2.17.1rc3, 2.17.1, 2.18.0.dev0, 2.18.0.dev1, 2.18.0.dev3, 2.18.0.dev4, 2.18.0.dev5, 2.18.0.dev6, 2.18.0.dev7, 2.18.0a0)
ERROR: No matching distribution found for pantsbuild.pants==2.19.0
Install failed: Command '['/home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/venvs/2.19.0/bin/python', '-sE', '-m', 'pip', '--disable-pip-versi
on-check', '--no-python-version-warning', '--log', PosixPath('/home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/venvs/2.19.0/pants-install.log'
), 'install', '--quiet', '--find-links', 'file:///home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/find_links/2.19.0/e430175b/index.html', '--p
rogress-bar', 'off', 'pantsbuild.pants==2.19.0']' returned non-zero exit status 1.
More information can be found in the log at: /home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/bindings/logs/install.log

Error: Isolates your Pants from the elements.

Please select from the following boot commands:

<default>: Detects the current Pants installation and launches it.
bootstrap-tools: Introspection tools for the Pants bootstrap process.
pants: Runs a hermetic Pants installation.
pants-debug: Runs a hermetic Pants installation with a debug server for debugging Pants code.
update: Update scie-pants.

You can select a boot command by passing it as the 1st argument or else by setting the SCIE_BOOT environment variable.

ERROR: Failed to establish atomic directory /home/aaa/.cache/nce/bad1ad5b44f41a6ca9c99a135f9af8849a3b93ec5a018c7b2d13acaf0a969e3a/locks/install-a4f15e2d2c97473883ec33b4ee0f9d11f99dcf5bee63
8b1cc7a0270d55d0ec8d. Population of work directory failed: Boot binding command failed: exit status: 1

[ERROR] Cannot proceed the installation because Pants is not available for your platform!

To resolve this error, reinstall or upgrade Pants. As of the Pants 2.18.0 release, it is distributed via GitHub releases instead of the Python Package Index.

Resolving missing directories error when running Pants
ValueError: Failed to create temporary directory for immutable inputs: No such file or directory (os error 2) at path "/tmp/bai-dev-PN4fpRLB2u2xL.j6-pants/immutable_inputsvIpaoN"

If you encounter errors like the above when running daily Pants commands like lint, you may manually create the parent directory. For the above example, run:

$ mkdir -p /tmp/bai-dev-PN4fpRLB2u2xL.j6-pants/

If this workaround does not work, back up your current working files and reinstall by running scripts/delete-dev.sh and then scripts/install-dev.sh.

Changing or updating the Python runtime for Pants

When you run scripts/install-dev.sh, it automatically creates .pants.bootstrap to explicitly set a specific pyenv Python version to run Pants.

If you have removed/upgraded this specific Python version from pyenv, you also need to update .pants.bootstrap accordingly.

Debugging test cases (or interactively running test cases)

When your tests hang, you can try adding the --debug flag to the pants test command:

$ pants test --debug ...

so that Pants runs the designated test targets serially and interactively. This means that you can directly observe the console output and press Ctrl+C to gracefully shut down the tests with fixture cleanup. You can also apply additional pytest options such as --fulltrace and -s by passing them after the target arguments and -- when executing the pants test command.

Installing a subset of mono-repo packages in the editable mode for other projects

Sometimes, you need to editable-install a subset of packages into other projects’ directories. For instance, you could mount the client SDK and its internal dependencies into a Docker container for development.

In this case, we recommend doing it as follows:

  1. Run the following command to build a wheel from the current mono-repo source:

    $ pants --tag=wheel package src/ai/backend/client:dist
    

    This will generate dist/backend.ai_client-{VERSION}-py3-none-any.whl.

  2. Run pip install -U {MONOREPO_PATH}/dist/{WHEEL_FILE} in the target environment.

    This will populate the package metadata and install its external dependencies. The target environment may be a separate virtualenv or a container being built. For container builds, you need to COPY the wheel file first and then install it.

  3. Check the internal dependency directories to link by running the following command:

    $ pants dependencies --transitive src/ai/backend/client:src \
    >   | grep src/ai/backend | grep -v ':version' | cut -d/ -f4 | uniq
    cli
    client
    plugin
    
  4. Link these directories in the target environment.

    For example, if it is a Docker container, you could add -v {MONOREPO_PATH}/src/ai/backend/{COMPONENT}:/usr/local/lib/python3.10/site-packages/ai/backend/{COMPONENT} to the docker create or docker run commands for all the component directories found in the previous step.

    If it is a local checkout with a pyenv-based virtualenv, you could replace $(pyenv prefix)/lib/python3.10/site-packages/ai/backend/{COMPONENT} directories with symbolic links to the mono-repo’s component source directories.
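
For the pyenv-based case, the symbolic-link replacement could look like the following sketch (repeat for each component directory found in step 3):

$ cd "$(pyenv prefix)/lib/python3.10/site-packages/ai/backend"
$ rm -rf client
$ ln -s {MONOREPO_PATH}/src/ai/backend/client client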

Boosting the performance of Pants commands

Since Pants uses temporary directories for aggressive caching, you could make the .tmp directory under the working copy root a tmpfs partition:

$ sudo mount -t tmpfs -o size=4G tmpfs .tmp
  • To make this persistent across reboots, add the following line to /etc/fstab:

    tmpfs /path/to/dir/.tmp tmpfs defaults,size=4G 0 0
    
  • The size should be more than 3GB. (Running pants test :: consumes about 2GB.)

  • To change the size at runtime, you could simply remount it with a new size option:

    $ sudo mount -t tmpfs -o remount,size=8G tmpfs .tmp
    
Making a new release
  • Update the ./VERSION file to set a new version number. (Remove the trailing newline, e.g., using set noeol in Vim. This is also configured in ./.editorconfig)

  • Run LOCKSET=tools/towncrier ./py -m towncrier to auto-generate the changelog.

    • You may append --draft to see a preview of the changelog update without actually modifying the filesystem.

    • (WIP: lablup/backend.ai#427).

  • Make a new git commit with the commit message: “release: <version>”.

  • Make an annotated tag to the commit with the message: “Release v<version>” or “Pre-release v<version>” depending on the release version.

  • Push the commit and tag. The GitHub Actions workflow will build the packages and publish them to PyPI.
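
A sketch of the commit, tagging, and push steps (the version number is illustrative):

$ git commit -am "release: 24.03.0"
$ git tag -a v24.03.0 -m "Release v24.03.0"
$ git push origin main --tags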

Backporting to legacy per-pkg repositories
  • Use git diff and git apply instead of git cherry-pick.

    • To perform a three-way merge for conflicts, add the -3 option to the git apply command.

    • You may need to rewrite some codes as the package structure differs. (The new mono repository has more fine-grained first party packages divided from the backend.ai-common package.)

  • When referring to PR/issue numbers in commits for per-pkg repositories, write them like lablup/backend.ai#NNN instead of #NNN.

Adding New Kernel Images

Overview

Backend.AI supports running Docker containers to execute user-requested computations in a resource-constrained and isolated environment. Most Docker container images can be imported as Backend.AI kernels with appropriate metadata annotations.

  1. Prepare a Docker image based on Ubuntu 16.04/18.04, CentOS 7.6, or Alpine 3.8.

  2. Create a Dockerfile that does:

  • Install the OpenSSL library in the image for the kernel runner (if not installed).

  • Add metadata labels.

  • Add service definition files.

  • Add a jail policy file.

  3. Build a derivative image using the Dockerfile.

  4. Upload the image to a Docker registry to use with Backend.AI.

Kernel Runner

Every Backend.AI kernel should run a small daemon called “kernel runner”. It communicates with the Backend.AI Agent running in the host via ZeroMQ, and manages user code execution and in-container service processes.

The kernel runner provides runtime-specific implementations for various code execution modes such as the query mode and the batch mode, compatible with a number of well-known programming languages. It also manages the process lifecycles of service-port processes.

To decouple the development and update cycles for Docker images and the Backend.AI Agent, we don’t install the kernel runner inside images. Instead, Backend.AI Agent mounts a special “krunner” volume as /opt/backend.ai inside containers. This volume includes a customized static build of Python. The kernel runner daemon package is mounted as one of the site packages of this Python distribution as well. The agent also uses /opt/kernel as the directory for mounting other self-contained single-binary utilities. This way, image authors do not have to bother with installing Python and Backend.AI specific software. All dirty jobs like volume deployment, its content updates, and mounting for new containers are automatically managed by Backend.AI Agent.

Since the customized Python build and binary utilities need to be built for specific Linux distributions, we only support Docker images built on top of Alpine 3.8+, CentOS 7+, and Ubuntu 16.04+ base images. Note that these three base distributions practically cover all commonly available Docker images.

Image Prerequisites

For glibc-based (most) Linux kernel images, you don’t have to add anything to the existing container image as we use a statically built Python distribution with precompiled wheels to run the kernel runner. The only requirement is that it should be compatible with manylinux2014 or later.

For musl-based Linux kernel images (e.g., Alpine), you have to install libffi and sqlite-libs at minimum. Please also refer to the Dockerfile to build a minimal compatible image.

Metadata Labels

Any Docker image based on Alpine 3.17+, CentOS 7+, or Ubuntu 16.04+ which satisfies the above prerequisites may become a Backend.AI kernel image if you add the following image labels:

  • Required Labels

    • ai.backend.kernelspec: 1 (this will be used for future versioning of the metadata specification)

    • ai.backend.features: A list of constant strings indicating which Backend.AI kernel features are available for the kernel.

      • batch: Can execute user programs passed as files.

      • query: Can execute user programs passed as code snippets while keeping the context across multiple executions.

      • uid-match: As of 19.03, this must always be specified.

      • user-input: The query/batch mode supports interactive user inputs.

    • ai.backend.resource.min.*: The minimum amount of resources to launch this kernel. At minimum, you must define the CPU cores (cpu) and the main memory (mem). In memory size values, you may use binary scale-suffixes such as m for MiB, g for GiB, etc.

    • ai.backend.base-distro: Either “ubuntu16.04” or “alpine3.8”. Note that Ubuntu 18.04-based kernels also need to use “ubuntu16.04” here.

    • ai.backend.runtime-type: The type of kernel runner to use. (One of the directories in the ai.backend.kernels namespace.)

      • python: This runtime is for Python-based kernels, allowing the given Python executable accessible via the query and batch mode, also as a Jupyter kernel service.

      • app: This runtime does not support code execution in the query/batch modes but just manages the service port processes. For custom kernel images with their own service ports for their main applications, this is the most frequently used runtime type for derivative images.

      • For the full list of available runtime types, check out the lang_map variable at the ai.backend.kernels module code

    • ai.backend.runtime-path: The path to the language runtime executable.

  • Optional Labels

    • ai.backend.role: COMPUTE (default if unspecified) or INFERENCE

    • ai.backend.service-ports: A list of port mapping declaration strings for services supported by the image. (See the next section for details) Backend.AI manages the host-side port mapping and network tunneling via the API gateway automagically.

    • ai.backend.endpoint-ports: Comma-separated name(s) of the service port(s) to be bound to the service endpoint. (At least one is required in inference sessions)

    • ai.backend.model-path: The path to mount the target model’s target version storage folder. (Required in inference sessions)

    • ai.backend.envs.corecount: A comma-separated string list of environment variable names. They are set to the number of available CPU cores to the kernel container. It allows the CPU core restriction to be enforced to legacy parallel computation libraries. (e.g., JULIA_CPU_CORES, OPENBLAS_NUM_THREADS)
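
Putting the required labels together, a Dockerfile snippet could look like the following (a sketch; all values are illustrative and must be adjusted to your image):

LABEL ai.backend.kernelspec="1" \
      ai.backend.features="batch query uid-match user-input" \
      ai.backend.resource.min.cpu="1" \
      ai.backend.resource.min.mem="256m" \
      ai.backend.base-distro="ubuntu16.04" \
      ai.backend.runtime-type="python" \
      ai.backend.runtime-path="/usr/local/bin/python"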

Service Ports

As of Backend.AI v19.03, service ports are our preferred way to run computation workloads inside Backend.AI kernels. It provides tunneled access to Jupyter Notebooks and other daemons running in containers.

As of Backend.AI v19.09, Backend.AI provides SSH (including SFTP and SCP) and ttyd (web-based xterm shell) as intrinsic services for all kernels. “Intrinsic” means that image authors do not have to do anything to support/enable the services.

As of Backend.AI v20.03, image authors may define their own service ports using service definition JSON files installed at /etc/backend.ai/service-defs in their images.

Port Mapping Declaration

A custom service port requires two definitions: first, a port mapping declaration in the ai.backend.service-ports image label; second, a service definition file that specifies how to start the service process.

A port mapping declaration is composed of three values: the service name, the protocol, and the container-side port number. The label may contain multiple port mapping declarations separated by commas, like the following example:

jupyter:http:8080,tensorboard:http:6006

The name may be an arbitrary non-empty ASCII alphanumeric string. We use kebab-case for it. The protocol may be one of tcp, http, and pty; currently most services use http.

Note that a few port numbers are reserved for Backend.AI itself and the intrinsic service ports. TCP ports 2000 and 2001 are reserved for the query mode, 2002 and 2003 for the native pseudo-terminal mode (stdin and stdout combined with stderr), 2200 for the intrinsic SSH service, and 7681 for the intrinsic ttyd service.

Up to Backend.AI 19.09, this was the only method to define a service port for images, and the service-specific launch sequences were all hard-coded in the ai.backend.kernel module.

Service Definition DSL

Now the image author should define the service launch sequences using a DSL (domain-specific language). The service definitions are written as JSON files in the container’s /etc/backend.ai/service-defs directory. The file names must match the name parts of the port mapping declarations.

For example, a sample service definition file for the “jupyter” service (hence its filename must be /etc/backend.ai/service-defs/jupyter.json) looks like:

{
    "prestart": [
      {
        "action": "write_tempfile",
        "args": {
          "body": [
            "c.NotebookApp.allow_root = True\n",
            "c.NotebookApp.ip = \"0.0.0.0\"\n",
            "c.NotebookApp.port = {ports[0]}\n",
            "c.NotebookApp.token = \"\"\n",
            "c.FileContentsManager.delete_to_trash = False\n"
          ]
        },
        "ref": "jupyter_cfg"
      }
    ],
    "command": [
        "{runtime_path}",
        "-m", "jupyterlab",
        "--no-browser",
        "--config", "{jupyter_cfg}"
    ],
    "url_template": "http://{host}:{port}/"
}

A service definition is composed of three major fields: prestart that contains a list of prestart actions, command as a list of template-enabled strings, and an optional url_template as a template-enabled string that defines the URL presented to the end-user on CLI or used as the redirection target on GUI with wsproxy.

The “template-enabled” strings may have references to a contextual set of variables in curly braces. All variable substitutions follow Python’s brace-style formatting syntax and rules.
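
For illustration, the substitution semantics match Python’s str.format() with keyword arguments; the following minimal sketch (with made-up variable values) shows how references like {ports[0]} and {runtime_path} resolve:

# A minimal illustration of the brace-style substitution semantics
# used by service definitions. The variable values are made-up examples.
template = "c.NotebookApp.port = {ports[0]}\n"
print(template.format(ports=[8080]))  # -> c.NotebookApp.port = 8080

command_item = "{runtime_path}"
print(command_item.format(runtime_path="/usr/local/bin/python"))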

Available predefined variables

There are a few predefined variables as follows:

  • ports: A list of TCP ports used by the service. Most services have only one port. An item in the list may be referenced using bracket notation like {ports[0]}.

  • runtime_path: A string representing the full path to the runtime, as specified in the ai.backend.runtime-path image label.

Available prestart actions

A prestart action is composed of two mandatory fields action and args (see the list below), and an optional field ref. The ref field defines a variable that stores the result of the action and can be referenced in later parts of the service definition file where the arguments are marked as “template-enabled”.

  • write_file

    Arguments:

      • body: a list of string lines (template-enabled)

      • filename: a string representing the file name (template-enabled)

      • mode: an optional octal number as a string, representing the UNIX file permission (default: “755”)

      • append: an optional boolean. If set true, the file is opened in appending mode.

    Returns: None

  • write_tempfile

    Arguments:

      • body: a list of string lines (template-enabled)

      • mode: an optional octal number as a string, representing the UNIX file permission (default: “755”)

    Returns: the generated file path

  • mkdir

    Arguments:

      • path: the directory path (template-enabled) where parent directories are auto-created

    Returns: None

  • run_command

    Arguments:

      • command: the command-line argument list as passed to the exec syscall (template-enabled)

    Returns: a dictionary with two fields, out and err, which contain the console output decoded as UTF-8

  • log

    Arguments:

      • body: a string to send as kernel log (template-enabled)

      • debug: a boolean to lower the logging level to DEBUG (default is INFO)

    Returns: None

Warning

The run_command action should return quickly; otherwise, it increases the session creation latency. If you need to run a background process, you must use the process’s own options to daemonize it or wrap it as a background shell command (["/bin/sh", "-c", "... &"]).

Interpretation of URL template

The url_template field is used by the client SDK and wsproxy to build the actual URL presented to the end-user (or to the end-user’s web browser as the redirection target). Its template variables are therefore not parsed when starting the service, but are parsed and interpolated by the clients. There are only three fixed variables: {protocol}, {host}, and {port}.

Here is a sample service-definition that utilizes the URL template:

{
  "command": [
    "/opt/noVNC/utils/launch.sh",
    "--vnc", "localhost:5901",
    "--listen", "{ports[0]}"
  ],
  "url_template": "{protocol}://{host}:{port}/vnc.html?host={host}&port={port}&password=backendai&autoconnect=true"
}

Jail Policy

(TODO: jail policy syntax and interpretation)

Adding Custom Jail Policy

To write a new policy implementation, extend the jail policy interface in Go and embed it inside your jail build. Existing jail policies serve as good references.

Example: An Ubuntu-based Kernel

FROM ubuntu:16.04

# Add commands for image customization
RUN apt-get install ...

# Backend.AI specifics
RUN apt-get install libssl
LABEL ai.backend.kernelspec=1 \
      ai.backend.resource.min.cpu=1 \
      ai.backend.resource.min.mem=256m \
      ai.backend.envs.corecount="OPENBLAS_NUM_THREADS,OMP_NUM_THREADS,NPROC" \
      ai.backend.features="batch query uid-match user-input" \
      ai.backend.base-distro="ubuntu16.04" \
      ai.backend.runtime-type="python" \
      ai.backend.runtime-path="/usr/local/bin/python" \
      ai.backend.service-ports="jupyter:http:8080"
COPY service-defs/*.json /etc/backend.ai/service-defs/
COPY policy.yml /etc/backend.ai/jail/policy.yml

Custom startup scripts (aka custom entrypoint)

When the image has preopen service ports and/or an endpoint port, Backend.AI automatically sets up application proxy tunnels as if the listening applications are already started.

To initialize and start such applications, put a shell script as /opt/container/bootstrap.sh when building the image. This per-image bootstrap script is executed as root by the agent-injected entrypoint.sh.

Warning

Since Backend.AI overrides the command and the entrypoint of container images to run the kernel runner regardless of the image content, setting CMD or ENTRYPOINT in the Dockerfile has no effect. You should use /opt/container/bootstrap.sh to migrate existing entrypoint/command wrappers.

Warning

/opt/container/bootstrap.sh must return immediately to prevent the session from staying in the PREPARING status. This means that it should run service applications in the background via daemonization.

To run a process with the user privilege, use su-exec, which is also injected by the agent:

/opt/kernel/su-exec "${LOCAL_GROUP_ID}:${LOCAL_USER_ID}" /path/to/your/service

Implementation details

The query mode I/O protocol

The input is a ZeroMQ multipart message with two payloads. The first payload should contain a unique identifier for the code snippet (usually a hash of it), but currently it is ignored (reserved for future caching implementations). The second payload should contain a UTF-8 encoded source code string.

The reply is a ZeroMQ multipart message with a single payload, containing a UTF-8 encoded string of the following JSON object:

{
    "stdout": "hello world!",
    "stderr": "oops!",
    "exceptions": [
        ["exception-name", ["arg1", "arg2"], false, null]
    ],
    "media": [
        ["image/png", "data:image/base64,...."]
    ],
    "options": {
        "upload_output_files": true
    }
}

Each item in exceptions is an array composed of four items: exception name, exception arguments (optional), a boolean indicating if the exception is raised outside the user code (mostly false), and a traceback string (optional).

Each item in media is an array of two items: MIME-type and the data string. Specific formats are defined and handled by the Backend.AI Media module.

The options field is optional. If upload_output_files is true (default), the agent uploads the files generated by user code in the working directory (/home/work) to an AWS S3 bucket and makes their URLs available in the front-end.
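
As an illustration, the following minimal Python sketch exchanges one query-mode request and reply using pyzmq. Only the two-payload request and the single-payload JSON reply are prescribed by the protocol above; the REQ socket type and the connection address are assumptions for demonstration:

import hashlib
import json

import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.REQ)            # the socket type here is an assumption
sock.connect("tcp://127.0.0.1:2000")  # 2000 is the reserved query-mode port

code = "print('hello world!')"
# First payload: a unique snippet identifier (currently ignored by the kernel).
# Second payload: the UTF-8 encoded source code.
code_id = hashlib.sha256(code.encode("utf8")).hexdigest().encode("ascii")
sock.send_multipart([code_id, code.encode("utf8")])

# The reply is a single payload holding a UTF-8 encoded JSON object.
result = json.loads(sock.recv_multipart()[0].decode("utf8"))
print(result["stdout"])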

The pseudo-terminal mode protocol

If you want to allow users to have real-time interactions with your kernel using web-based terminals, you should implement the PTY mode as well. A good example is our “git” kernel runner.

The key concept is the separation of the “outer” daemon and the “inner” target program (e.g., a shell). The outer daemon should wrap the inner program inside a pseudo-tty. As the outer daemon is completely hidden from the end-users during terminal interaction, its programming language may differ from that of the inner program. The challenge is that you need to implement piping of ZeroMQ sockets from/to pseudo-tty file descriptors. It is up to you how you implement the outer daemon, but if you choose Python for it, we recommend using asyncio or similar event loop libraries such as tornado and Twisted to multiplex sockets and file descriptors for both input/output directions. When piping the messages, the outer daemon should not apply any specific transformation; it should send and receive all raw data/control byte sequences transparently because the front-end (e.g., terminal.js) is responsible for interpreting them. Currently we use the PUB/SUB ZeroMQ socket types, but this may change later.

Optionally, you may run the query-mode loop side-by-side. For example, our git kernel supports terminal resizing and pinging commands as query-mode inputs. There is no fixed specification for such commands yet, but the current CodeOnWeb uses the following:

  • %resize <rows> <cols>: resize the pseudo-tty’s terminal to fit with the web terminal element in user browsers.

  • %ping: just a no-op command to prevent kernel idle timeouts while the web terminal is open in user browsers.

A best practice (not mandatory but recommended) for PTY mode kernels is to automatically respawn the inner program if it terminates (e.g., the user has exited the shell) so that the users are not locked in a “blank screen” terminal.
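
The following Python sketch outlines the outer-daemon piping described above using asyncio and pyzmq. The bind addresses and the mapping of ports 2002/2003 to the output/input directions are assumptions for illustration; only the PUB/SUB socket types and the transparent raw-byte piping follow the current protocol:

import asyncio
import os
import pty

import zmq
import zmq.asyncio

async def main():
    # Spawn the inner target program (a shell here) under a pseudo-tty.
    pid, master_fd = pty.fork()
    if pid == 0:
        os.execvp("/bin/sh", ["/bin/sh"])

    ctx = zmq.asyncio.Context()
    out_sock = ctx.socket(zmq.PUB)
    out_sock.bind("tcp://*:2002")  # assumed: combined stdout/stderr side
    in_sock = ctx.socket(zmq.SUB)
    in_sock.bind("tcp://*:2003")   # assumed: stdin side
    in_sock.subscribe(b"")

    loop = asyncio.get_running_loop()

    def pipe_output():
        # Forward raw bytes from the pty to the clients without transformation.
        data = os.read(master_fd, 4096)
        asyncio.ensure_future(out_sock.send(data))

    loop.add_reader(master_fd, pipe_output)
    while True:
        # Forward raw client input into the pty, also without transformation.
        data = await in_sock.recv()
        os.write(master_fd, data)

asyncio.run(main())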

Using Mocked Accelerators

For developers who do not have access to physical accelerator devices such as CUDA GPUs, we provide a mock-up plugin to simulate the system configuration with such devices, allowing development and testing of accelerator-related features in various components including the web UI.

Configuring the mock-accelerator plugin

Check out the examples in the configs/accelerator directory.

Here is a description of each field:

  • slot_name: The resource slot’s main key name. The plugin’s resource slot name has the form "<slot_name>.<subtype>", where the subtype may be device (default) or shares (for the fractional allocation mode). For CUDA MIG devices, it becomes a string that includes the slice size derived from the device memory size, such as 10g-mig.

    To configure the fractional allocation mode, you should also specify the etcd accelerator plugin configuration like the following JSON, where unit_mem and unit_proc are used as the divisors to calculate the 1.0 fraction:

    {
       "config": {
         "plugins": {
           "accelerator": {
             "<slot_name>": {
               "allocation_mode": "fractional",
               "unit_mem": 1073741824,
               "unit_proc": 10
             }
           }
         }
       }
    }
    

    In the above example, 10 subprocessors and 1 GiB of device memory are regarded as a 1.0 fractional device. You may store the configuration as a JSON file and put it into the etcd configuration tree like:

    $ ./backend.ai mgr etcd put-json '' mydevice-fractional-mode.json
    
  • device_plugin_name: The class name to use as the actual implementation. Currently there are two: CUDADevice and MockDevice.

  • formats.<subtype>: The tables for per-subtype formatting details

    • display_icon: The device icon type displayed in the UI.

    • display_unit: The resource slot unit displayed in the UI, alongside the amount numbers.

    • human_readable_name: The device name displayed in the UI.

    • description: The device description displayed in the UI.

    • number_format: The number formatting string used for the UI.

      • binary: A boolean flag to indicate whether to use the binary suffixes (divided by 2^(10n) instead of 10^(3n))

      • round_length: The number of fractional digits used to round the numeric value of this resource slot. If zero, the number is treated as an integer.

  • devices: The list of mocked device declarations

    • mother_uuid: The unique ID of the device, which may be randomly generated

    • model_name: The model name to report to the manager as metadata

    • numa_node: The NUMA node index to place this device.

    • subproc_count: The number of sub-processing cores (e.g., the number of streaming multi-processors of CUDA GPUs)

    • memory_size: The size of on-device memory represented as human-readable binary sizes

    • is_mig_devices: (CUDA-specific) whether this device is a MIG slice or a full device

Activating the mock-accelerator plugin

Add "ai.backend.accelerator.mock" to the agent.toml’s [agent].allowed-compute-plugins field. Then restart the agent.

Version Numbering

  • Version numbering uses x.y.z format (where x, y, z are integers).

  • Mostly, we follow the calendar versioning scheme.

  • x.y is a release branch name (major releases every 6 months).

    • When y is smaller than 10, we prepend a zero, like 05, in the version numbers (e.g., 20.09.0).

    • When referring to the version in other Python packages as requirements, you need to strip the leading zeros (e.g., 20.9.0 instead of 20.09.0) because Python setuptools normalizes the version integers (see the snippet after this list).

  • x.y.z is a release tag name (patch releases).

  • When releasing x.y.0:

    • Create a new x.y branch, do all bugfix/hotfix there, and make x.y.z releases there.

    • All fixes must be first implemented on the main branch and then cherry-picked back to x.y branches.

      • When cherry-picking, use the -e option to edit the commit message.
        Append Backported-From: main and Backported-To: X.Y lines after one blank line at the end of the existing commit message.

    • Change the version number of main to x.(y+1).0.dev0

    • There are no strict rules about alpha/beta/rc builds yet. We will elaborate as we scale up.
      Once used, alpha versions will have aN suffixes, beta versions bN suffixes, and RC versions rcN suffixes, where N is an integer.

  • New development should go on the main branch.

    • main: commit here directly if your change is self-complete as a single commit.

    • Use both short-lived and long-running feature branches freely, but ensure their names differ from release branches and tags.

  • The major/minor (x.y) version of Backend.AI subprojects will go together to indicate compatibility. Currently manager/agent/common versions progress this way, while client SDKs have their own version numbers and the API specification has a different vN.yyyymmdd version format.

    • Generally backend.ai-manager 1.2.p is compatible with backend.ai-agent 1.2.q (where p and q are same or different integers)

      • As of 22.09, this is no longer guaranteed. All server-side core component versions must match exactly, as we release them at once from the mono-repo, even for components without any code changes.

    • Clients are guaranteed to be backward-compatible with servers that share the same API specification version.
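
As an aside, the zero-stripping rule mentioned above can be verified with the packaging library, which implements the same PEP 440 normalization that setuptools applies:

from packaging.version import Version

# PEP 440 normalization strips leading zeros in release segments,
# so the calendar version "20.09.0" equals "20.9.0".
assert Version("20.09.0") == Version("20.9.0")
print(Version("20.09.0"))  # -> 20.9.0 (the normalized form)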

Upgrading

You can upgrade the installed Python packages using the pip install -U ... command, along with their dependencies.

If you have cloned the stable version of source code from git, then pull and check out the next x.y release branch. It is recommended to re-run pip install -U -r requirements.txt as dependencies might be updated.

For the manager, ensure that your database schema is up-to-date by running alembic upgrade head. If you set up your development environment with Pants and the install-dev.sh script, keep your database schema up-to-date via ./py -m alembic upgrade head instead of the plain alembic command above.

Also check if any manual etcd configuration scheme change is required, though we will try to keep it compatible and automatically upgrade when first executed.

Migration Guides

Upgrading from 20.03 to 20.09

(TODO)

Migrating from the Docker Hub to cr.backend.ai

As of November 2020, the Docker Hub has begun to limit the retention time and the pull rate of public images. Since Backend.AI uses a number of Docker images with a variety of access frequencies, we decided to migrate to our own container registry, https://cr.backend.ai.

If there are active users of the Backend.AI cluster, it is strongly recommended to set a maintenance period to prevent new sessions from starting during the migration. The registry migration does not affect existing running sessions, though the Docker images on the agent nodes can only be removed after terminating all existing containers started from the old images, and there will be a brief disconnection of service ports as the manager needs to be restarted.

  1. Update your Backend.AI installation to the latest version (manager 20.03.11 or 20.09.0b2) to get support for Harbor v2 container registries.

  2. Save the following JSON snippet as registry-config.json.

    {
      "config": {
        "docker": {
          "registry": {
            "cr.backend.ai": {
              "": "https://cr.backend.ai",
              "type": "harbor2",
              "project": "stable,community"
            }
          }
        }
      }
    }
    
  3. Run the following using the manager CLI on one of the manager nodes:

    $ sudo systemctl stop backendai-manager  # stop the manager daemon (may differ by setup)
    $ backend.ai mgr etcd put-json '' registry-config.json
    $ backend.ai mgr image rescan cr.backend.ai
    $ sudo systemctl start backendai-manager  # start the manager daemon (may differ by setup)
    
    • The agents will automatically pull the images since the image references have changed, even when the new images are actually identical to the existing ones. To avoid long waiting times when starting sessions, it is recommended to pull the essential images yourself by running the docker pull command on the agent nodes.

    • Now the images are categorized with an additional path prefix, such as stable and community. More prefixes may be introduced in the future, and some prefixes may be made available only to a specific set of users/user groups, with dedicated credentials.

      For example, lablup/python:3.6-ubuntu18.04 is now referred to as cr.backend.ai/stable/python:3.6-ubuntu18.04.

    • If you have configured image aliases, you need to update them manually as well, using the backend.ai mgr image alias command. This does not affect existing sessions running with old aliases.

  4. Update the allowed docker registries policy for each domain using the backend.ai mgr dbshell command. Remove “index.docker.io” from the existing values and replace “…” below with your own domain names and additional registries.

    SELECT name, allowed_docker_registries FROM domains;  -- check the current config
    UPDATE domains SET allowed_docker_registries = '{cr.backend.ai,...}' WHERE name = '...';
    
  5. Now you may start new sessions using the images from the new registry.

  6. After terminating all existing sessions using the old images from the Docker Hub (i.e., images whose names start with lablup/ prefix), remove the image metadata and registry configuration using the manager CLI:

    $ backend.ai mgr etcd delete --prefix images/index.docker.io
    $ backend.ai mgr etcd delete --prefix config/docker/registry/index.docker.io
    
  7. Run docker rmi commands to clean up the pulled images in the agent nodes. (Automatic/managed removal of images will be implemented in the future versions of Backend.AI)

Backend.AI Manager Reference

Manager API Common Concepts

API and Document Conventions

HTTP Methods

We use the standard HTTP/1.1 methods (RFC-2616), such as GET, POST, PUT, PATCH and DELETE, with some additions from WebDAV (RFC-3253) such as the REPORT method to send JSON objects in request bodies with GET semantics.

If your client runs under a restrictive environment that only allows a subset of above methods, you may use the universal POST method with an extra HTTP header like X-Method-Override: REPORT, so that the Backend.AI gateway can recognize the intended HTTP method.
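
For instance, a listing query that semantically belongs to REPORT could be tunneled through POST as in the following sketch (the endpoint path and request body are hypothetical):

import requests

# The override header tells the Backend.AI gateway to treat this POST
# as a REPORT request carrying a JSON body with GET semantics.
# The endpoint path and body below are hypothetical examples.
resp = requests.post(
    "https://your.backend.ai.endpoint/folders",
    headers={
        "X-Method-Override": "REPORT",
        "Content-Type": "application/json",
    },
    json={"paging": {"size": 20, "index": 0}},
)
print(resp.status_code)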

Parameters in URI and JSON Request Body

The parameters with colon prefixes (e.g., :id) are part of the URI path and must be encoded using a proper URI-compatible encoding scheme such as encodeURIComponent(value) in Javascript and urllib.parse.quote(value, safe='~()*!.\'') in Python 3+.

Other parameters should be set as a key-value pair of the JSON object in the HTTP request body. The API server accepts both UTF-8 encoded bytes and standard-compliant Unicode-escaped strings in the body.

HTTP Status Codes and JSON Response Body

The API responses always contain a root JSON object, regardless of success or failures.

For successful responses (HTTP status 2xx), the root object has a varying set of key-value pairs depending on the API.

For failures (HTTP status 4xx/5xx), the root object contains at least two keys: type, which uniquely identifies the failure reason as a URI, and title, a human-readable error message. Some failures may return extra structured information as additional key-value pairs. We use the RFC 7807-style problem detail format for the JSON response body.

JSON Field Notation

A dot-separated field name means a nested object. If the field name is a pure integer, it means a list item.

  • a: The attribute a of the root object. (e.g., 123 at {"a": 123})

  • a.b: The attribute b of the object a on the root. (e.g., 456 at {"a": {"b": 456}})

  • a.0: An item in the list a on the root. 0 means an arbitrary array index, not the specific item at index zero. (e.g., any of 13, 57, 24, and 68 at {"a": [13, 57, 24, 68]})

  • a.0.b: The attribute b of an item in the list a on the root. (e.g., any of 1, 2, and 3 at {"a": [{"b": 1}, {"b": 2}, {"b": 3}]})

JSON Value Types

This documentation uses a type annotation style similar to Python’s typing module, but with minor intuitive differences such as lower-cased generic type names and the wildcard written as an asterisk (*) instead of Any.

The common types are array (JSON array), object (JSON object), int (integer-only subset of JSON number), str (JSON string), and bool (JSON true or false). tuple and list are aliases to array. Optional values may be omitted or set to null.

We also define several custom types:

  • decimal: Fractional numbers represented as str so as not to lose precision. (e.g., to express money amounts)

  • slug: Similar to str, but the values should contain only alphanumeric characters, hyphens, and underscores. Hyphens and underscores should have at least one alphanumeric neighbor and cannot appear as the prefix or suffix.

  • datetime: ISO 8601 timestamps in str, e.g., "YYYY-mm-ddTHH:MM:SS.ffffff+HH:MM". It may include optional timezone information; if the timezone is not included, the value is assumed to be UTC. The sub-second part has at most 6 digits (microseconds).

  • enum[*]: Only allows a fixed/predefined set of possible values of the given parametrized type.

API Versioning

A version string of the Backend.AI API has two parts: a major revision (prefixed with v) and a minor release date after a dot following the major revision. For example, v23.20250101 indicates the 23rd major revision with a minor release on January 1st, 2025.

We keep backward compatibility between minor releases within the same major version. Therefore, all API query URLs are prefixed with the major revision, such as /v2/kernel/create. Minor releases may introduce new parameters and response fields, but no URL changes. Accessing an unsupported major revision returns HTTP 404 Not Found.

Changed in version v3.20170615: The version prefix in API queries is deprecated (yet still supported currently). For example, users should now call /kernel/create rather than /v2/kernel/create.

A client must specify the API version in the HTTP request header named X-BackendAI-Version. To check the latest minor release date of a specific major revision, try a GET query to the URL with only the major revision part (e.g., /v2). The API server will return a JSON string in the response body containing the full version. When querying the API version, you do not have to specify the authorization header and the rate-limiting is enforced per the client IP address. Check out more details about Authentication and Rate Limiting.
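
For illustration, a minimal version check using the requests library could look like the following sketch (the endpoint address is a placeholder):

import requests

# No authorization header is needed for the version check;
# rate limiting applies per client IP address.
resp = requests.get("https://your.backend.ai.endpoint/v2")
print(resp.json()["version"])  # e.g., "v2.20170315"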

Example version check response body:

{
   "version": "v2.20170315"
}

JSON Object References

Paging Query Object

It describes how many items to fetch for object listing APIs. If index exceeds the number of pages calculated by the server, an empty list is returned.

  • size (int): The number of items per page. If set to zero or this object is entirely omitted, all items are returned and index is ignored.

  • index (int): The page number to show, zero-based.

Paging Info Object

It contains the paging information based on the paging query object in the request.

  • pages (int): The number of total pages.

  • count (int): The number of all items.

KeyPair Item Object

  • accessKey (slug): The access key part.

  • isActive (bool): Indicates if the keypair is active or not.

  • totalQueries (int): The number of queries done via this keypair. It may have a stale value.

  • created (datetime): The timestamp when the keypair was created.

KeyPair Properties Object

  • isActive (bool): Indicates if the keypair is activated or not. If not activated, all authentication using the keypair returns 401 Unauthorized. When changed from true to false, existing running sessions continue to run, but any requests to create new sessions are refused. (default: true)

  • concurrency (int): The maximum number of concurrent sessions allowed for this keypair. (default: 5)

  • ML.clusterSize (int): Sets the number of instances clustered together when launching new machine learning sessions. (default: 1)

  • ML.instanceMemory (int, MiB): Sets the memory limit of each instance in the cluster launched for new machine learning sessions. (default: 8)

The enterprise edition offers the following additional properties:

  • cost.automatic (bool): If set true, enables automatic cost optimization (BETA). With supported kernel types, it automatically suspends or resizes sessions so as not to exceed the configured cost limit per day. (default: false)

  • cost.dailyLimit (str): The string representation of a money amount as decimals. The currency is fixed to USD. (default: "50.00")

Service Port Object

  • name (slug): The name of the service provided by the container. See also: Terminal Emulation

  • protocol (str): The type of network protocol used by the container service.

Batch Execution Query Object

  • build (str): The bash command to build the main program from the given uploaded files.

    If this field is not present, an empty string, or null, the build step is skipped.

    If this field is the constant string "*", a default build script provided by the kernel is used. For example, the C kernel’s default Makefile adds all C source files under the working directory and compiles them into the ./main executable, with commonly used C/link flags: "-pthread -lm -lrt -ldl".

  • exec (str): The bash command to execute the main program.

    If this field is not present, an empty string, or null, the server only performs the build step, and options.buildLog is assumed to be true (the given value is ignored).

  • clean (str): The bash command to clean the intermediate files produced during the build phase. The clean step comes before the build step if specified, so that the build step can (re)start fresh.

    If this field is not present, an empty string, or null, the clean step is skipped.

    Unlike the build and exec commands, the default behavior for "*" is to do nothing, to prevent deletion of files unrelated to the build by bugs or mistakes.

Note

A client can distinguish whether the current output is from the build phase or the execution phase by whether it has received build-finished status or not.

Note

All shell commands are by default executed under /home/work. The common environment is:

TERM=xterm
LANG=C.UTF-8
SHELL=/bin/bash
USER=work
HOME=/home/work

but individual kernels may have additional environment settings.

Warning

The shell does NOT have access to sudo or the root privilege. However, some kernels may allow installation of language-specific packages in the user directory.

Also, your build script and the main program are executed inside Backend.AI Jail, meaning that some system calls are blocked by our policy. Since the ptrace syscall is blocked, you cannot use native debuggers such as gdb.

This limitation, however, is subject to change in the future.

Example:

{
  "build": "gcc -Wall main.c -o main -lrt -lz",
  "exec": "./main"
}
Execution Result Object

  • runId (str): The user-provided run identifier. If the user has NOT provided it, this will be set by the API server upon the first execute API call. In that case, the client should use it for subsequent execute API calls during the same run.

  • status (enum[str]): One of "continued", "waiting-input", "finished", "clean-finished", "build-finished", or "exec-timeout". See more details at Code Execution Model.

  • exitCode (int | null): The exit code of the last process. This field has a valid value only when the status is "finished", "clean-finished", or "build-finished". Otherwise it is set to null.

    For batch-mode kernels and query-mode kernels without global context support, exitCode is the return code of the last executed child process in the kernel. In the execution step of a batch-mode run, this is always 127 (a common UNIX shell convention for “command not found”) when the build step has failed.

    For query-mode kernels with global context support, this value is always zero, regardless of whether the user code has caused an exception or not.

    A negative value (which cannot happen with normal process termination) indicates a Backend.AI-side error.

  • console (list[object]): A list of Console Item Objects.

  • options (object): An object containing extra display options. If there are no options indicated by the kernel, this field is null. When result.status is "waiting-input", it has a boolean field is_password so that you can use different types of text boxes for user inputs.

  • files (list[object]): A list of Execution Result File Objects that represent the files generated in the /home/work/.output directory of the container during the code execution.

Console Item Object

  • (root) ([enum, *]): A tuple of the item type and the item content. The type may be "stdout", "stderr", and others.

    See more details at Handling Console Output.

Execution Result File Object

  • name (str): The name of a file created during the execution.

  • url (str): The URL of the created file uploaded to AWS S3.

Container Stats Object

  • cpu_used (int, msec): The total time the kernel was running.

  • mem_max_bytes (int, bytes): The maximum memory usage.

  • mem_cur_bytes (int, bytes): The current memory usage.

  • net_rx_bytes (int, bytes): The total amount of data received through the network.

  • net_tx_bytes (int, bytes): The total amount of data transmitted through the network.

  • io_read_bytes (int, bytes): The total amount of data read via IO.

  • io_write_bytes (int, bytes): The total amount of data written via IO.

  • io_max_scratch_size (int, bytes): Currently unused field.

  • io_cur_scratch_size (int, bytes): Currently unused field.

Creation Config Object

  • environ (object): A dictionary object specifying additional environment variables. The values must be strings.

  • mounts (list[str]): An optional list of the names of virtual folders that belong to the current API key. These virtual folders are mounted under /home/work. For example, if the virtual folder name is abc, you can access it at /home/work/abc.

    If the name contains a colon in the middle, the second part of the string indicates the alias location in the kernel’s file system, relative to /home/work.

    You may mount up to 5 folders for each session.

  • clusterSize (int): The number of instances bundled for this session.

  • resources (Resource Slot Object): The resource slot specification for each container in this session.

    New in version v4.20190315.

  • instanceMemory (int, MiB): The maximum memory allowed per instance. The value is capped by the per-kernel image limit. Additional charges may apply on the public API service.

    Deprecated since version v4.20190315.

  • instanceCores (int): The number of CPU cores. The value is capped by the per-kernel image limit. Additional charges may apply on the public API service.

    Deprecated since version v4.20190315.

  • instanceGPUs (float): The fraction of GPU devices (1.0 means a whole device). The value is capped by the per-kernel image limit. Additional charges may apply on the public API service.

    Deprecated since version v4.20190315.

Resource Slot Object

  • cpu (str | int): The number of CPU cores.

  • mem (str | int): The amount of main memory in bytes. When the slot object is used as an input to an API, it may be represented using the binary scale suffixes such as k, m, g, t, p, e, z, and y, e.g., “512m”, “512M”, “512MiB”, “64g”, “64G”, “64GiB”, etc. When the slot object is used as an output of an API, this field is always represented as the unscaled number of bytes in a string. (See the parsing sketch after this list.)

    Warning: When parsing this field as JSON, you must check whether your JSON library or programming language supports large integers. For instance, most modern Javascript engines support integers only up to 2^53 - 1 (8 PiB minus one byte), which is defined as the Number.MAX_SAFE_INTEGER constant; beyond that, you need a third-party big-number library. To prevent unexpected side-effects, Backend.AI always returns this field as a string.

  • cuda.device (str | int): The number of CUDA devices. Only available when the server is configured to use the CUDA agent plugin.

  • cuda.shares (str): The virtual share of CUDA devices represented as fractional decimals. Only available when the server is configured to use the CUDA agent plugin with the fractional allocation mode (enterprise edition only).

  • tpu.device (str | int): The number of TPU devices. Only available when the server is configured to use the TPU agent plugin (cloud edition only).

  • (others) (str): More resource slot types may be available depending on the server configuration and agent plugins. There are two types for an arbitrary slot: “count” (the default) and “bytes”.

    For “count” slots, you may put an arbitrary positive real number, but fractions may be truncated depending on the plugin implementation.

    For “bytes” slots, the interpretation and representation follow those of the mem field.
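
For reference, here is a minimal sketch for converting the human-readable binary-scale memory strings above into byte counts, assuming only well-formed inputs:

# Converts strings such as "512m", "512MiB", or "64g" into byte counts.
# A minimal sketch assuming well-formed inputs.
BINARY_SUFFIXES = {
    'k': 2**10, 'm': 2**20, 'g': 2**30, 't': 2**40,
    'p': 2**50, 'e': 2**60, 'z': 2**70, 'y': 2**80,
}

def parse_mem(value):
    if isinstance(value, int):
        return value
    s = value.strip().lower().rstrip('ib')  # drop the optional "iB" tail
    if s and s[-1] in BINARY_SUFFIXES:
        return int(s[:-1]) * BINARY_SUFFIXES[s[-1]]
    return int(s)

assert parse_mem("512m") == parse_mem("512MiB") == 512 * 2**20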

Resource Preset Object

  • name (str): The name of this preset.

  • resource_slots (Resource Slot Object): The pre-configured combination of resource slots. If it contains slot types that are not currently used/activated in the cluster, they are removed when returned via the /resource/* REST APIs.

  • shared_memory (int, bytes): The pre-configured shared memory size. Clients can send humanized strings like ‘2g’, ‘128m’, or ‘534773760’, and they will be automatically converted into bytes.

Virtual Folder Creation Result Object

  • id (UUID): An internally-used unique identifier of the created vfolder. Currently it has no use on the client side.

  • name (str): The name of the created vfolder, as given by the client.

  • host (str): The host name where the vfolder is created.

  • user (UUID): The user who has the ownership of this vfolder.

  • group (UUID): The group that has the ownership of this vfolder.

New in version v4.20190615: user and group fields.

Virtual Folder List Item Object

  • name (str): The human-readable name set at creation.

  • id (slug): The unique ID of the folder.

  • host (str): The host name where this folder is located.

  • is_owner (bool): True if the client user is the owner of this folder; false if the folder is shared from a group or another user.

  • permission (enum): The requesting user’s permission for this folder. (One of “ro”, “rw”, and “wd”, which represent read-only, read-write, and write-delete respectively. Currently “rw” and “wd” have no difference.)

  • user (UUID): The user ID if the owner of this item is a user vfolder. Otherwise, null.

  • group (UUID): The group ID if the owner of this item is a group vfolder. Otherwise, null.

  • type (enum): The owner type of the vfolder. One of “user” or “group”.

New in version v4.20190615: user, group, and type fields.

Virtual Folder Item Object

  • name (str): The human-readable name set at creation.

  • id (UUID): The unique ID of the folder.

  • host (str): The host name where this folder is located.

  • is_owner (bool): True if the client user is the owner of this folder; false if the folder is shared from a group or another user.

  • num_files (int): The number of files in this folder.

  • permission (enum): The requesting user’s permission for this folder.

  • created_at (datetime): The date and time when the folder was created.

  • last_used (datetime): The date and time when the folder was last used.

  • user (UUID): The user ID if the owner of this item is a user. Otherwise, null.

  • group (UUID): The group ID if the owner of this item is a group. Otherwise, null.

  • type (enum): The owner type of the vfolder. One of “user” or “group”.

New in version v4.20190615: user, group, and type fields.

Virtual Folder File Object

  • filename (str): The filename.

  • mode (int): The file’s mode (permission) bits as an integer.

  • size (int): The file’s size.

  • ctime (int): The timestamp when the file was created.

  • mtime (int): The timestamp when the file was last modified.

  • atime (int): The timestamp when the file was last accessed.

Virtual Folder Invitation Object

  • id (UUID): The unique ID of the invitation. Use this when making API requests referring to this invitation.

  • inviter (str): The inviter’s user ID (email) of the invitation.

  • permission (str): The permission that the invited user will have.

  • state (str): The current state of the invitation.

  • vfolder_id (UUID): The unique ID of the vfolder where the user is invited.

  • vfolder_name (str): The name of the vfolder where the user is invited.

Fstab Object

  • content (str): The retrieved content (multi-line string) of fstab.

  • node (str): The node type, either “agent” or “manager”.

  • node_id (str): The node’s unique ID.

New in version v4.20190615.

Authentication

Access Tokens and Secret Key

To make requests to the API server, a client needs to have a pair of an API access key and a secret key. You may get one from our cloud service or from the administrator of your Backend.AI cluster.

The server uses the access key to identify each client and the secret key to verify the integrity of API requests as well as to authenticate the client.

Warning

For security reasons (to avoid exposure of your API access key and secret key to arbitrary Internet users), we highly recommend setting up a server-side proxy to our API service if you are building a public-facing front-end service using Backend.AI.

For local deployments, you may create a master dummy pair in the configuration (TODO).

Common Structure of API Requests

The request HTTP headers and body are composed as follows:

  • Method: GET / REPORT / POST / PUT / PATCH / DELETE

  • Query String: If your access key has the administrator privilege, your client may optionally specify another user’s access key as the owner_access_key parameter of the URL query string (in addition to other API-specific parameters, if any) to change the access-key scope applied to accessing and manipulating keypair-specific resources such as kernels and vfolders.

    New in version v4.20190315.

  • Content-Type: Should always be application/json.

  • Authorization: The signature information generated as described in the section Signing API Requests.

  • Date: The date/time of the request formatted in RFC 822 or ISO 8601. If no timezone is specified, UTC is assumed. The deviation from the server-side clock must be within 15 minutes.

  • X-BackendAI-Date: Same as Date. May be omitted if Date is present.

  • X-BackendAI-Version: vX.yyyymmdd where X is the major version and yyyymmdd is the minor release date of the specified API version. (e.g., 20160915)

  • X-BackendAI-Client-Token: An optional, client-generated random string to allow the server to distinguish repeated duplicate requests. It is important for keeping idempotent semantics across multiple retries for intermittent failures. (Not implemented yet)

  • Body: JSON-encoded request parameters

Common Structure of API Responses

The response HTTP headers and body are composed as follows:

  • Status code: API-specific HTTP-standard status codes. Responses commonly used throughout all APIs include 200, 201, 204, 400, 401, 403, 404, 429, and 500, but are not limited to these.

  • Content-Type: application/json and its variants (e.g., application/problem+json for errors)

  • Link: Web link headers specified as in RFC 5988. Only optionally used when returning a collection of objects.

  • X-RateLimit-*: The rate-limiting information (see Rate Limiting).

  • Body: JSON-encoded results

Signing API Requests

Each API request must be signed with a signature. First, the client should generate a signing key derived from its API secret key, and a string to sign derived by canonicalizing the HTTP request.

Generating a signing key

Here is Python code that derives the signing key from the secret key. The key is signed in a nested manner against the current date (without time) and then the API endpoint address.

import hashlib, hmac
from datetime import datetime

SECRET_KEY = b'abc...'

def sign(key, msg):
  # One step of HMAC-SHA256 chaining.
  return hmac.new(key, msg, hashlib.sha256).digest()

def get_sign_key():
  t = datetime.utcnow()
  # Sign the current UTC date first, then the API endpoint address.
  k1 = sign(SECRET_KEY, t.strftime('%Y%m%d').encode('utf8'))
  k2 = sign(k1, b'your.sorna.api.endpoint')
  return k2
Generating a string to sign

The string to sign is generated from the following request-related values:

  • HTTP Method (uppercase)

  • URI including query strings

  • The value of Date (or X-BackendAI-Date if Date is not present) formatted in ISO 8601 (YYYYmmddTHHMMSSZ) using the UTC timezone.

  • The canonicalized header/value pair of Host

  • The canonicalized header/value pair of Content-Type

  • The canonicalized header/value pair of X-BackendAI-Version

  • The hex-encoded hash value of body as-is. The hash function must be same to the one given in the Authorization header (e.g., SHA256).

To generate a string to sign, the client should join the above values using the newline ("\n", ASCII 10) character. All non-ASCII strings must be encoded with UTF-8. To canonicalize a pair of HTTP header/value, first trim all leading/trailing whitespace characters ("\n", "\r", " ", "\t"; or ASCII 10, 13, 32, 9) of its value, and join the lowercased header name and the value with a single colon (":", ASCII 58) character.

The success example in Example Requests and Responses makes a string to sign as follows (where the newlines are "\n"):

GET
/v2
20160930T01:23:45Z
host:your.sorna.api.endpoint
content-type:application/json
x-sorna-version:v2.20170215
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855

In this example, the hash value e3b0c4... is generated from an empty string using the SHA256 hash function since there is no body for GET requests.
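
Putting the canonicalization rules together, a small helper that constructs the string to sign could look like the following sketch (the function name is ours for illustration):

import hashlib

def build_string_to_sign(method, uri, date, host, content_type, api_version, body=b''):
    def canon(name, value):
        # Lowercase the header name, trim the value, join with a colon.
        return name.lower() + ':' + value.strip()
    return '\n'.join([
        method.upper(),
        uri,
        date,
        canon('Host', host),
        canon('Content-Type', content_type),
        canon('X-BackendAI-Version', api_version),
        hashlib.sha256(body).hexdigest(),  # hash of the (possibly empty) body
    ])

str_to_sign = build_string_to_sign(
    'GET', '/v2', '20160930T01:23:45Z',
    'your.sorna.api.endpoint', 'application/json', 'v2.20170215')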

Then, the client should calculate the signature using the derived signing key and the generated string with the hash function, as follows:

import hashlib, hmac

str_to_sign = 'GET\n/v2...'
sign_key = get_sign_key()  # see "Generating a signing key"
m = hmac.new(sign_key, str_to_sign.encode('utf8'), hashlib.sha256)
signature = m.hexdigest()
Attaching the signature

Finally, the client now should construct the following HTTP Authorization header:

Authorization: BackendAI signMethod=HMAC-SHA256, credential=<access-key>:<signature>
Example Requests and Responses

For the examples here, we use a dummy access key and secret key:

  • Example access key: AKIAIOSFODNN7EXAMPLE

  • Example secret key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

Success example for checking the latest API version
GET /v2 HTTP/1.1
Host: your.sorna.api.endpoint
Date: 20160930T01:23:45Z
Authorization: BackendAI signMethod=HMAC-SHA256, credential=AKIAIOSFODNN7EXAMPLE:022ae894b4ecce097bea6eca9a97c41cd17e8aff545800cd696112cc387059cf
Content-Type: application/json
X-BackendAI-Version: v2.20170215
HTTP/1.1 200 OK
Content-Type: application/json
Content-Language: en
Content-Length: 31
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1999
X-RateLimit-Reset: 897065

{
   "version": "v2.20170215"
}
Failure example with a missing authorization header
GET /v2/kernel/create HTTP/1.1
Host: your.sorna.api.endpoint
Content-Type: application/json
X-BackendAI-Date: 20160930T01:23:45Z
X-BackendAI-Version: v2.20170215
HTTP/1.1 401 Unauthorized
Content-Type: application/problem+json
Content-Language: en
Content-Length: 139
X-RateLimit-Limit: 2000
X-RateLimit-Remaining: 1998
X-RateLimit-Reset: 834821

{
   "type": "https://sorna.io/problems/unauthorized",
   "title": "Unauthorized access",
   "detail": "Authorization header is missing."
}

Rate Limiting

The API server imposes a rate limit to prevent clients from overloading the server. The limit is applied to the last N minutes at ANY moment (N is 15 minutes by default).

For public non-authorized APIs such as version checks, the server uses the client’s IP address as seen by the server to impose rate limits. Due to this, please keep in mind that large-scale NAT-based deployments may encounter the rate limits sooner than expected. For authorized APIs, it uses the access key in the authorization header to impose rate limits. The rate limit counts both successful and failed requests.

Upon a valid request, the HTTP response contains the following header fields to help the clients flow-control their requests.

  • X-RateLimit-Limit: The maximum allowed number of requests during the rate-limit window.

  • X-RateLimit-Remaining: The number of further requests currently allowed.

  • X-RateLimit-Window: The constant value representing the window size in seconds. (e.g., 900 means 15 minutes)

Changed in version v3.20170615: Deprecated X-RateLimit-Reset and the transitional X-Retry-After, as we have implemented a rolling counter that measures the API call counts of the last 15 minutes at any moment.

When the limit is exceeded, further API calls will get HTTP 429 “Too Many Requests”. If the client seems to be DDoS-ing, the server may block the client forever without prior notice.
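
For instance, clients can use these headers to flow-control themselves. A minimal sketch (with an assumed plain GET request) that backs off when the remaining quota is exhausted:

import time

import requests

def call_with_backoff(url, headers):
    # Inspect the rate-limit headers and pause before retrying
    # when the quota is exhausted or a 429 response is received.
    resp = requests.get(url, headers=headers)
    remaining = int(resp.headers.get('X-RateLimit-Remaining', '1'))
    window = int(resp.headers.get('X-RateLimit-Window', '900'))
    if resp.status_code == 429 or remaining == 0:
        time.sleep(window / 10)  # a simple heuristic back-off
        resp = requests.get(url, headers=headers)
    return resp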

Manager REST API

The Backend.AI REST API is for running instant compute sessions at scale in clouds or on-premise clusters.

Session Management

Here are the API calls to create and manage compute sessions.

Creating Session
  • URI: /session (/session/create also works for legacy)

  • Method: POST

Creates a new session or returns an existing session, depending on the parameters.

Parameters

  • image (str): The kernel runtime type in the form of a Docker image name and tag. For legacy, the API also recognizes the lang field when image is not present.

    Changed in version v4.20190315.

  • clientSessionToken (slug): A client-provided session token, which must be unique among the currently non-terminated sessions owned by the requesting access key. Clients may reuse the token if the previous session with the same token has been terminated.

    It may contain ASCII alphabets, numbers, and hyphens in the middle. The length must be between 4 and 64 characters inclusive. It is useful for aliasing the session with a human-friendly name.

  • enqueueOnly (bool): (optional) If set true, the API returns immediately after queueing the session creation request to the scheduler. Otherwise, the manager waits until the session actually starts. (default: false)

    New in version v4.20190615.

  • maxWaitSeconds (int): (optional) The maximum duration to wait until the session starts after being queued, in seconds. If zero, the manager waits indefinitely. (default: 0)

    New in version v4.20190615.

  • reuseIfExists (bool): (optional) If set true, the API returns without creating a new session when a non-terminated session with the same ID and the same image already exists; in this case the config options are ignored. If set false and such a session exists, the manager returns the error “session already exists”. (default: true)

    New in version v4.20190615.

  • group (str): (optional) The name of a user group (aka “project”) to launch the session within. (default: "default")

    New in version v4.20190615.

  • domain (str): (optional) The name of a domain to launch the session within. (default: "default")

    New in version v4.20190615.

  • config (object): (optional) A Creation Config Object to specify the kernel configuration including resource requirements. If not given, the kernel is created with the minimum required resource slots defined by the target image.

  • tag (str): (optional) A per-session, user-provided tag for administrators to keep track of additional information for each session, such as which sessions belong to which users.

Example:

{
  "image": "python:3.6-ubuntu18.04",
  "clientSessionToken": "mysession-01",
  "enqueueOnly": false,
  "maxWaitSeconds": 0,
  "reuseIfExists": true,
  "domain": "default",
  "group": "default",
  "config": {
    "clusterSize": 1,
    "environ": {
      "MYCONFIG": "XXX",
    },
    "mounts": ["mydata", "mypkgs"],
    "resources": {
      "cpu": "2",
      "mem": "4g",
      "cuda.devices": "1",
    }
  },
  "tag": "example-tag"
}
Response

  • 200 OK: The session is already running and you may reuse it.

  • 201 Created: The session is successfully created.

  • 401 Invalid API parameters: There are invalid or malformed values in the API parameters.

  • 406 Not acceptable: The requested resource limits exceed the server’s own limits.

Response fields:

  • sessId (slug): The session ID used for later API calls, which is the same as the value of clientSessionToken. It is randomly generated by the server if clientSessionToken is not provided.

  • status (str): The status of the created kernel. This is always "PENDING" if enqueueOnly is set true. In other cases, it may be "RUNNING" (the normal case), "ERROR", or even "TERMINATED" depending on what happens during session startup.

    New in version v4.20190615.

  • servicePorts (list[object]): The list of Service Port Objects. This field becomes an empty list if enqueueOnly is set true, because the final service ports are determined when the session becomes ready after scheduling.

    Note: In most cases the service ports are the same as those specified in the image metadata, but the agent may add shared services for all sessions.

    Changed in version v4.20190615.

  • created (bool): True if the session is freshly created.

Example:

{
  "sessId": "mysession-01",
  "status": "RUNNING",
  "servicePorts": [
    {"name": "jupyter", "protocol": "http"},
    {"name": "tensorboard", "protocol": "http"}
  ],
  "created": true
}
Getting Session Information
  • URI: /session/:id

  • Method: GET

Retrieves information about a session. For performance reasons, the returned information may not be real-time; it is usually updated every few seconds on the server side.

Parameters

  • :id (slug): The session ID.

Response

  • 200 OK: The information is successfully returned.

  • 404 Not Found: There is no such session.

Response fields:

  • lang (str): The kernel’s programming language.

  • age (int, msec): The time elapsed since the kernel started.

  • memoryLimit (int, KiB): The memory limit of the kernel in KiB.

  • numQueriesExecuted (int): The number of times the kernel has been accessed.

  • cpuCreditUsed (int, msec): The total time the kernel was running.

Destroying Session
  • URI: /session/:id

  • Method: DELETE

Terminates a session.

Parameters

  • :id (slug): The session ID.

Response

  • 204 No Content: The session is successfully destroyed.

  • 404 Not Found: There is no such session.

Response fields:

  • stats (object): The Container Stats Object of the kernel at the time of deletion.

Restarting Session
  • URI: /session/:id

  • Method: PATCH

Restarts a session. The idle time of the session is reset, but other properties such as the age and CPU credit continue to accumulate. All global states such as global variables and module imports are also reset.

Parameters

  • :id (slug): The session ID.

Response

  • 204 No Content: The session is successfully restarted.

  • 404 Not Found: There is no such session.

Code Execution (Query Mode)

Executing Snippet
  • URI: /session/:id

  • Method: POST

Executes a snippet of user code in the specified session. Each execution request to the same session may have side-effects on subsequent executions. For instance, setting a global variable in one request and reading the variable in another request is completely legal. It is the job of the user (or the front-end) to guarantee the correct execution order of multiple interdependent requests. When the session is terminated or restarted, all such volatile states vanish.

Parameters

  • :id (slug): The session ID.

  • mode (str): A constant string "query".

  • code (str): A string of user-written code. All non-ASCII data must be encoded in UTF-8 or any format acceptable by the session.

  • runId (str): A client-side unique identifier for this particular run. For more details about the concept of a run, see Code Execution Model. If not given, the API server assigns a random one in the first response, and the client must use it for the same run afterwards.

Example:

{
  "mode": "query",
  "code": "print('Hello, world!')",
  "runId": "5facbf2f2697c1b7"
}
Response

  • 200 OK: The session has responded with the execution result. The response body contains a JSON object as described below.

Response fields:

  • result (object): An Execution Result Object.

Note

Even when the user code raises exceptions, such queries are treated as successful executions; i.e., a failure of this API means that our API subsystem had errors, not the user code.

Warning

If the user code tries to breach the system, causes crashes (e.g., segmentation fault), or runs too long (timeout), the session is automatically terminated. In such cases, you will get incomplete console logs with the "finished" status earlier than expected. Depending on the situation, result.stderr may also contain specific error information.

Here we demonstrate a few example results when various Python code snippets are executed.

Example: Simple return.

print("Hello, world!")
{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "finished",
    "console": [
      ["stdout", "Hello, world!\n"]
    ],
    "options": null
  }
}

Example: Runtime error.

a = 123
print('what happens now?')
a = a / 0
{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "finished",
    "console": [
      ["stdout", "what happens now?\n"],
      ["stderr", "Traceback (most recent call last):\n  File \"<input>\", line 3, in <module>\nZeroDivisionError: division by zero"],
    ],
    "options": null
  }
}

Example: Multimedia output.

Media outputs are also mixed with other console outputs according to their execution order.

import matplotlib.pyplot as plt
a = [1,2]
b = [3,4]
print('plotting simple line graph')
plt.plot(a, b)
plt.show()
print('done')
{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "finished",
    "console": [
      ["stdout", "plotting simple line graph\n"],
      ["media", ["image/svg+xml", "<?xml version=\"1.0\" ..."]],
      ["stdout", "done\n"]
    ],
    "options": null
  }
}

Example: Continuation results.

import time
for i in range(5):
    print(f"Tick {i+1}")
    time.sleep(1)
print("done")
{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "continued",
    "console": [
      ["stdout", "Tick 1\nTick 2\n"]
    ],
    "options": null
  }
}

Here you should make another API query with an empty code field.

{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "continued",
    "console": [
      ["stdout", "Tick 3\nTick 4\n"]
    ],
    "options": null
  }
}

Again.

{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "finished",
    "console": [
      ["stdout", "Tick 5\ndone\n"],
    ],
    "options": null
  }
}

Example: User input.

print("What is your name?")
name = input(">> ")
print(f"Hello, {name}!")
{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "waiting-input",
    "console": [
      ["stdout", "What is your name?\n>> "]
    ],
    "options": {
      "is_password": false
    }
  }
}

You should make another API query with the code field set to the user input.

{
  "result": {
    "runId": "5facbf2f2697c1b7",
    "status": "finished",
    "console": [
      ["stdout", "Hello, Lablup!\n"]
    ],
    "options": null
  }
}
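
Putting the above examples together, a client typically loops on the execute call until the status becomes "finished". Below is a minimal sketch in JavaScript; callExecute() and promptUser() are hypothetical helpers that respectively POST the JSON body to /session/:id with proper authentication and collect user input from the UI.

// Drive a single query-mode run to completion (a sketch, not a full client).
async function runQuery(sessionId, code, onConsole) {
  // The first call carries the user code; the server assigns a runId
  // if the client does not provide one.
  let resp = await callExecute(sessionId, { mode: "query", code });
  const runId = resp.result.runId;
  while (true) {
    resp.result.console.forEach(onConsole);  // render console items in order
    if (resp.result.status === "finished") break;
    if (resp.result.status === "waiting-input") {
      const answer = await promptUser(resp.result.options);  // hypothetical UI helper
      resp = await callExecute(sessionId, { mode: "input", code: answer, runId });
    } else {  // "continued"
      resp = await callExecute(sessionId, { mode: "continue", code: "", runId });
    }
  }
}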
Auto-completion
  • URI: /session/:id/complete

  • Method: POST

Parameters

Parameter

Type

Description

:id

slug

The session ID.

code

str

A string containing the code until the current cursor position.

options.post

str

A string containing the code after the current cursor position.

options.line

str

A string containing the content of the current line.

options.row

int

An integer indicating the line number (0-based) of the cursor.

options.col

int

An integer indicating the column number (0-based) in the current line of the cursor.

Example:

{
  "code": "pri",
  "options": {
    "post": "\nprint(\"world\")\n",
    "line": "pri",
    "row": 0,
    "col": 3
  }
}
Response

HTTP Status Code

Description

200 OK

The session has responded with the execution result. The response body contains a JSON object as described below.

Fields

Type

Values

result

list[str]

An ordered list containing the possible auto-completion matches as strings. This may be empty if the current session does not implement auto-completion or no matches have been found.

Selecting a match and merging it into the code text are up to the front-end implementation.

Example:

{
  "result": [
    "print",
    "printf"
  ]
}
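
The request body is straightforward to derive from an editor buffer. The following sketch builds it from the full editor text and a 0-based cursor position; the splitting logic reproduces the example above.

// Build the /session/:id/complete request body from the editor text
// and a 0-based cursor position (row, col).
function buildCompletionRequest(text, row, col) {
  const lines = text.split("\n");
  const line = lines[row];
  const before = lines.slice(0, row).concat(line.slice(0, col)).join("\n");
  const after = [line.slice(col)].concat(lines.slice(row + 1)).join("\n");
  return { code: before, options: { post: after, line, row, col } };
}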
Interrupt
  • URI: /session/:id/interrupt

  • Method: POST

Parameters

Parameter

Type

Description

:id

slug

The session ID.

Response

HTTP Status Code

Description

204 No Content

Sent the interrupt signal to the session. Note that this does not guarantee the effectiveness of the interruption.

Code Execution (Batch Mode)

Some sessions provide the batch mode, which offers an explicit build step required for multi-module programs or compiled programming languages. In this mode, you first upload files prior to execution.

Uploading files
  • URI: /session/:id/upload

  • Method: POST

Parameters

Upload files to the session. You may upload multiple files at once using multi-part form-data encoding in the request body (RFC 1867/2388). The uploaded files are placed under the /home/work directory (which is the home directory for all sessions by default), and existing files are always overwritten. If the filename has a directory part, non-existing directories will be auto-created. The path may be either absolute or relative, but only sub-directories under /home/work are allowed to be created.

Hint

This API is for uploading frequently-changing source files prior to batch-mode execution. All files uploaded via this API are deleted when the session terminates. Use virtual folders to store and access larger, persistent, static data and library files for your codes.

Warning

You cannot upload files to mounted virtual folders using this API directly. However, you may copy/move the generated files to virtual folders in your build script or the main program for later uses.

There are several limits on this API:

The maximum size of each file

1 MiB

The number of files per upload request

20

Response

HTTP Status Code

Description

204 No Content

Success.

400 Bad Request

Returned when one of the uploaded files exceeds the size limit or there are too many files.
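
Below is a browser-side sketch of the multipart upload, assuming a hypothetical authFetch() wrapper that attaches the required authentication headers; the form field name is an assumption for illustration.

// Upload up to 20 files (each up to 1 MiB) via multipart form-data.
async function uploadSessionFiles(sessionId, files) {
  const form = new FormData();
  for (const file of files) {
    form.append("src", file, file.name);  // field name is an assumption
  }
  const resp = await authFetch(`/session/${sessionId}/upload`, {
    method: "POST",
    body: form,  // the browser sets the multipart boundary automatically
  });
  if (resp.status !== 204) throw new Error(`upload failed: ${resp.status}`);
}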

Executing with Build Step
  • URI: /session/:id

  • Method: POST

Parameters

Parameter

Type

Description

:id

slug

The session ID.

mode

enum[str]

A constant string "batch".

code

str

Must be an empty string "".

runId

str

A client-side unique identifier string for this particular run. For more details about the concept of a run, see Code Execution Model. If not given, the API server will assign a random one in the first response and the client must use it for the same run afterwards.

options

object

Batch Execution Query Object.

Example:

{
  "mode": "batch",
  "options": "{batch-execution-query-object}",
  "runId": "af9185c5fb0eacb2"
}
Response

HTTP Status Code

Description

200 OK

The session has responded with the execution result. The response body contains a JSON object as described below.

Fields

Type

Values

result

object

Execution Result Object.

Listing Files

Once files are uploaded to the session or generated during the execution of the code, you may need to identify which files actually exist in the current session. In this case, use this API to get the list of files in your compute session.

  • URI: /session/:id/files

  • Method: GET

Parameters

Parameter

Type

Description

:id

slug

The session ID.

path

str

Path inside the session (default: /home/work).

Response

HTTP Status Code

Description

200 OK

Success.

404 Not Found

There is no such path.

Fields

Type

Values

files

str

Stringified JSON containing the list of files.

folder_path

str

Absolute path inside session.

errors

str

Any errors that occurred while scanning the specified path.
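
Note that the files field is itself a JSON-encoded string, so clients need a second parse step. A minimal sketch, again assuming a hypothetical authFetch() wrapper:

// List files under a path inside the session.
async function listSessionFiles(sessionId, path = "/home/work") {
  const resp = await authFetch(
    `/session/${sessionId}/files?path=${encodeURIComponent(path)}`);
  const data = await resp.json();
  // The "files" field is a stringified JSON document.
  return { files: JSON.parse(data.files), folderPath: data.folder_path };
}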

Downloading Files

Download files from your compute session.

The response contents are multipart with tarfile binaries. Post-processing, such as unpacking and saving them, should be handled by the client.

  • URI: /session/:id/download

  • Method: GET

Parameters

Parameter

Type

Description

:id

slug

The session ID.

files

list[str]

File paths inside the session container to download. (maximum 5 files at once)

Response

HTTP Status Code

Description

200 OK

Success.

Code Execution (Streaming)

The streaming mode provides a lightweight and interactive method to connect with the session containers.

Code Execution
  • URI: /stream/session/:id/execute

  • Method: GET upgraded to WebSockets

This is a real-time streaming version of Code Execution (Batch Mode) and Code Execution (Query Mode), which use long polling via HTTP.

(under construction)

New in version v4.20181215.

Terminal Emulation
  • URI: /stream/session/:id/pty?app=:service

  • Method: GET upgraded to WebSockets

This endpoint provides a duplex continuous stream of JSON objects via the native WebSocket. Although WebSocket supports binary streams, we currently rely on TEXT messages carrying JSON payloads only, to avoid quirks in typed array support in Javascript across different browsers.

The service name should be taken from the list of service port objects returned by the session creation API.

Note

We do not provide any legacy WebSocket emulation interfaces such as socket.io or SockJS. You need to set up your own proxy if you want to support legacy browser users.

Changed in version v4.20181215: Added the service query parameter.

Parameters

Parameter

Type

Description

:id

slug

The session ID.

:service

slug

The service name to connect.

Client-to-Server Protocol

The endpoint accepts the following four types of input messages.

Standard input stream

All ASCII (and UTF-8) inputs must be encoded as base64 strings. The characters may include control characters as well.

{
  "type": "stdin",
  "chars": "<base64-encoded-raw-characters>"
}
Terminal resize

Set the terminal size to the given number of rows and columns. You should calculate them yourself.

For instance, in web browsers you may do simple math by measuring the width and height of a temporarily created, invisible HTML element that contains only a single ASCII character and has the same (monospace) font styles as the terminal container element.

{
  "type": "resize",
  "rows": 25,
  "cols": 80
}
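
The measurement technique described above can be sketched as follows; it assumes the terminal container uses a monospace font so that one probe character represents the cell size.

// Estimate the terminal grid size by measuring a single character in a
// hidden element styled with the same font as the terminal container.
function measureTerminalGrid(termElem) {
  const probe = document.createElement("span");
  probe.textContent = "W";
  probe.style.font = getComputedStyle(termElem).font;
  probe.style.position = "absolute";
  probe.style.visibility = "hidden";
  document.body.appendChild(probe);
  const { width, height } = probe.getBoundingClientRect();
  probe.remove();
  return {
    cols: Math.floor(termElem.clientWidth / width),
    rows: Math.floor(termElem.clientHeight / height),
  };
}
// Usage: ws.send(JSON.stringify({ type: "resize", ...measureTerminalGrid(el) }));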
Ping

Use this to keep the session alive (preventing it from being auto-terminated by idle timeouts) by sending pings periodically while the user-side browser is open.

{
  "type": "ping"
}
Restart

Use this to restart the session without affecting the working directory and usage counts. Useful when your foreground terminal program does not respond for whatever reason.

{
  "type": "restart"
}
Server-to-Client Protocol
Standard output/error stream

Since the terminal is an output device, all stdout/stderr outputs are merged into a single stream as we see in real terminals. This means there is no way to distinguish stdout and stderr on the client side, unless your session applies some special formatting to distinguish them (e.g., make all stderr outputs red).

The terminal output is compatible with xterm (including 256-color support).

{
  "type": "out",
  "data": "<base64-encoded-raw-characters>"
}
Server-side errors
{
  "type": "error",
  "data": "<human-readable-message>"
}
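
A minimal browser-side client tying the two protocols together might look like the sketch below; the service name and the term object (an xterm-compatible emulator) are placeholders, and the URL is assumed to be already authorized (e.g., via session cookies). Note that atob()/btoa() only cover Latin-1; non-ASCII input should be converted to UTF-8 bytes before base64-encoding.

// Minimal pty client: send keystrokes as base64 "stdin" messages and
// print decoded "out" messages.
const ws = new WebSocket(
  `wss://${location.host}/stream/session/mysession-01/pty?app=my-tty-service`);
ws.onmessage = (ev) => {
  const msg = JSON.parse(ev.data);
  if (msg.type === "out") {
    term.write(atob(msg.data));          // term: your terminal emulator
  } else if (msg.type === "error") {
    console.error("pty error:", msg.data);
  }
};
function sendKeys(chars) {
  ws.send(JSON.stringify({ type: "stdin", chars: btoa(chars) }));
}
setInterval(() => ws.send(JSON.stringify({ type: "ping" })), 30000);  // keep-alive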

Event Monitoring

Session Lifecycle Events
  • URI: /events/session

  • Method: GET

Provides a continuous message-by-message JSON object stream of session lifecycle events. It uses HTML5 Server-Sent Events (SSE). Browser-based clients may use the EventSource API for convenience.

New in version v4.20190615: First properly implemented in this version, deprecating prior unimplemented interfaces.

Changed in version v5.20191215: The URI is changed from /stream/session/_/events to /events/session.

Parameters

Parameter

Type

Description

sessionId

slug

The session ID to monitor the lifecycle events. If set to "*", the API will stream events from all sessions visible to the client, depending on the client’s role and permissions.

ownerAccessKey

str

(optional) The access key of the owner of the specified session, since different access keys (users) may share the same session ID for different session instances. You can specify this only when the client is either a domain admin or a superadmin.

group

str

The group name to filter the lifecycle events. If set to "*", the API will stream events from all sessions visible to the client, depending on the client’s role and permissions.

Responses

The response is a continuous stream of UTF-8 text lines following the text/event-stream format. Each event is composed of the event type and data, where the data part is encoded as JSON.

Possible event names (more events may be added in the future):

Event Name

Description

session_preparing

The session has just been scheduled from the job queue and has received an agent resource allocation.

session_pulling

The session begins pulling the session image (usually from a Docker registry) to the scheduled agent.

session_creating

The session is being created as containers (or other entities in different agent backends).

session_started

The session becomes ready to execute codes.

session_terminated

The session has terminated.

When using the EventSource API, you should add event listeners as follows:

const sse = new EventSource('/events/session', {
  withCredentials: true,
});
sse.addEventListener('session_started', (e) => {
  console.log('session_started', JSON.parse(e.data));
});

Note

The EventSource API must be used with the session-based authentication mode (when the endpoint is a console-server) which uses the browser cookies. Otherwise, you need to manually implement the event stream parser using the standard fetch API running against the manager server.
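
For the manual case, a minimal fetch-based parser might look like the following sketch; authHeaders() is a hypothetical helper producing the required authentication headers.

// Parse a text/event-stream response manually using fetch.
async function streamSessionEvents(sessionId, onEvent) {
  const resp = await fetch(`/events/session?sessionId=${sessionId}`, {
    headers: authHeaders(),
  });
  const reader = resp.body.getReader();
  const decoder = new TextDecoder();
  let buf = "";
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buf += decoder.decode(value, { stream: true });
    let sep;
    while ((sep = buf.indexOf("\n\n")) >= 0) {  // events end with a blank line
      const raw = buf.slice(0, sep);
      buf = buf.slice(sep + 2);
      let event = "message", data = "";
      for (const line of raw.split("\n")) {
        if (line.startsWith("event:")) event = line.slice(6).trim();
        else if (line.startsWith("data:")) data += line.slice(5).trim();
      }
      if (data) onEvent(event, JSON.parse(data));
    }
  }
}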

The event data contains a JSON string like this (more fields may be added in the future):

Field Name

Description

sessionId

The source session ID.

ownerAccessKey

The access key that owns the session.

reason

A short string that describes why the event happened. This may be null or an empty string.

result

Only present for session_terminated events. Only meaningful for batch-type sessions. Either one of: "UNDEFINED", "SUCCESS", "FAILURE"

{
  "sessionId": "mysession-01",
  "ownerAccessKey": "MYACCESSKEY",
  "reason": "self-terminated",
  "result": "SUCCESS"
}
Background Task Progress Events
  • URI: /events/background-task

  • Method: GET for server-side events

New in version v5.20191215.

Parameters

Parameter

Type

Description

taskId

UUID

The background task ID to monitor the progress and completion.

Responses

The response is a continuous stream of UTF-8 text lines following the text/event-stream format. Each event is composed of the event type and data, where the data part is encoded as JSON. Possible event names (more events may be added in the future):

Event Name

Description

task_updated

Updates for the progress. This can be generated many times during the background task execution.

task_done

The background task is successfully completed.

task_failed

The background task has failed. Check the message field and/or query the error logs API for error details.

task_cancelled

The background task is cancelled in the middle. Usually this means that the server is being shut down for maintenance.

server_close

This event indicates explicit server-initiated close of the event monitoring connection, which is raised just after the background task is either done/failed/cancelled. The client should not reconnect because there is nothing more to monitor about the given task.

The event data (per-line JSON objects) include the following fields:

Field Name

Type

Description

task_id

str

The background task ID.

current_progress

int

The current progress value. Only meaningful for task_updated events. If total_progress is zero, this value should be ignored.

total_progress

int

The total progress count. Only meaningful for task_updated events. The scale may be an arbitrary positive integer. If the total count is not defined, this may be zero.

message

str

An optional human-readable message indicating what the task is doing. It may be null. For example, it may contain the name of agent or scaling group being worked on for image preload/unload APIs.

Check out the session lifecycle events API for example client-side Javascript implementations to handle text/event-stream responses.

If you make a request for a task that has already finished, it may return either “404 Not Found” (the result has expired or the task ID is invalid) or a single event which is one of task_done, task_failed, or task_cancelled, followed by immediate disconnection of the response. Currently, the results for finished tasks may be archived up to one day (24 hours).

Service Ports (aka Service Proxies)

The service ports API provides WebSocket-based authenticated and encrypted tunnels to network-facing services (“container services”) provided by the kernel container. The main advantage of this feature is that all application-specific network traffic is wrapped as a standard WebSocket API (no need to open extra ports of the manager). It also hides the container from the client and the client from the container, offering an extra level of security.

_images/service-ports.svg

The diagram showing how tunneling of TCP connections via WebSockets works.

As Fig. 7 shows, all TCP traffic to a container service can be sent over a WebSocket connection to the following API endpoints. A single WebSocket connection corresponds to a single TCP connection to the service, and there may be multiple concurrent WebSocket connections representing multiple TCP connections to the service. It is the client’s responsibility to accept arbitrary TCP connections from users (e.g., web browsers) with proper authorization for multi-user setups and to wrap them as WebSocket connections to the following APIs.

When the first connection is initiated, the Backend.AI Agent running the designated kernel container signals the kernel runner daemon in the container to start the designated service. It briefly waits for the in-container port to open and then delivers the first packet to the service. After initialization, all WebSocket payloads are delivered back and forth just like normal TCP packets. Note that the WebSocket message type must be BINARY.

The container service sees the packets as coming from the manager and never knows the real origin of the packets unless the service-level protocol itself conveys such client-side information. Likewise, the client never knows the container’s IP address (though the port numbers are included in the service port objects returned by the session creation API).
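
For example, the client-side wrapping of TCP connections can be sketched in Node.js with the ws package as below; makeAuthorizedUrl() is a hypothetical helper that builds a tcpproxy URL with proper authorization, and the service name is illustrative.

// Accept local TCP connections (e.g., from an SSH client) and tunnel
// each one through its own WebSocket connection.
const net = require("net");
const WebSocket = require("ws");

net.createServer((sock) => {
  const ws = new WebSocket(makeAuthorizedUrl("sshd"));  // one WS per TCP conn
  ws.binaryType = "nodebuffer";  // payloads must be BINARY messages
  ws.on("open", () => {
    sock.on("data", (chunk) => ws.send(chunk));
    ws.on("message", (chunk) => sock.write(chunk));
  });
  ws.on("close", () => sock.destroy());
  sock.on("close", () => ws.close());
}).listen(2222);  // point the TCP client at localhost:2222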

Note

Currently non-TCP (e.g., UDP) services are not supported.

Service Proxy (HTTP)
  • URI: /stream/kernel/:id/httpproxy?app=:service

  • Method: GET upgraded to WebSockets

The service proxy API allows clients to directly connect to service daemons running inside compute sessions, such as Jupyter and TensorBoard.

The service name should be taken from the list of service port objects returned by the session creation API.

New in version v4.20181215.

Parameters

Parameter

Type

Description

:id

slug

The kernel ID.

:service

slug

The service name to connect.

Service Proxy (TCP)
  • URI: /stream/kernel/:id/tcpproxy?app=:service

  • Method: GET upgraded to WebSockets

This is the TCP version of service proxy, so that client users can connect to native services running inside compute sessions, such as SSH.

The service name should be taken from the list of service port objects returned by the session creation API.

New in version v4.20181215.

Parameters

Parameter

Type

Description

:id

slug

The kernel ID.

:service

slug

The service name to connect.

Resource Presets

Resource presets provide a simple storage for pre-configured resource slots and a dynamic checker for allocatability of given presets before actually calling the kernel creation API.

To add/modify/delete resource presets, you need to use the admin GraphQL API.

New in version v4.20190315.

Listing Resource Presets

Returns the list of admin-configured resource presets.

  • URI: /resource/presets

  • Method: GET

Parameters

None.

Response

HTTP Status Code

Description

200 OK

The preset list is returned.

Fields

Type

Values

presets

list[object]

The list of Resource Preset Object

Checking Allocatability of Resource Presets

Returns the current keypair’s and scaling group’s resource limits in addition to the list of admin-configured resource presets. It also checks the allocatability of the resource presets and adds an allocatable boolean field to each preset item.

  • URI: /resource/check-presets

  • Method: POST

Parameters

None.

Response

HTTP Status Code

Description

200 OK

The preset list is returned.

401 Unauthorized

The client is not authorized.

Fields

Type

Values

keypair_limits

Resource Slot Object

The maximum amount of total resource slots allowed for the current access key. It may contain infinity values as the string “Infinity”.

keypair_using

Resource Slot Object

The amount of total resource slots used by the current access key.

keypair_remaining

Resource Slot Object

The amount of total resource slots remaining for the current access key. It may contain infinity values as the string “Infinity”.

scaling_group_remaining

Resource Slot Object

The amount of total resource slots remaining for the current scaling group. It may contain infinity values as the string “Infinity” if the server is configured for auto-scaling.

presets

list[object]

The list of Resource Preset Object, but with an extra boolean field allocatable which indicates if the given resource slot is actually allocatable considering the keypair’s resource limits and the scaling group’s current usage.
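
For example, a front-end that only offers presets the user can actually allocate may filter on this field; authFetch() is again a hypothetical authenticated fetch wrapper.

// Fetch presets with allocatability info and keep only allocatable ones.
async function allocatablePresets() {
  const resp = await authFetch("/resource/check-presets", { method: "POST" });
  const data = await resp.json();
  return data.presets.filter((p) => p.allocatable);
}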

Virtual Folders

Virtual folders provide access to shared, persistent, and reused files across different sessions.

You can mount virtual folders when creating new sessions, and use them like a plain directory on the local filesystem. Of course, reads/writes to virtual folder contents may have degraded performance compared to the main scratch directory (usually /home/work in most kernels) as internally it uses a networked file system.

Also, you may share your virtual folders with other users by inviting them and granting them proper permission. Currently, there are three levels of permissions: read-only, read-write, and read-write-delete. They are represented by the short strings 'ro', 'rw', and 'wd', respectively. The owner of a virtual folder has read-write-delete permission for the folder.

Listing Virtual Folders

Returns the list of virtual folders created by the current keypair.

  • URI: /folders

  • Method: GET

Parameters

Parameter

Type

Description

all

bool

(optional) If this parameter is True, it returns all virtual folders, including those that do not belong to the current user. Only available for superadmin (default: False).

group_id

UUID | str

(optional) If this parameter is set, it returns the virtual folders that belong to the specified group. It has no effect for user-type virtual folders.

Response

HTTP Status Code

Description

200 OK

Success

Fields

Type

Values

(root)

list[object]

A list of Virtual Folder List Item Object

Example:

[
   {
      "name": "myfolder",
      "id": "b4b1b16c-d07f-4f1f-b60e-da9449aa60a6",
      "host": "local:volume1",
      "usage_mode": "general",
      "created_at": "2020-11-28 13:30:30.912056+00",
      "is_owner": "true",
      "permission": "rw",
      "user": "dfa9da54-4b28-432f-be29-c0d680c7a412",
      "group": null,
      "creator": "admin@lablup.com",
      "user_email": "admin@lablup.com",
      "group_name": null,
      "ownership_type": "user",
      "unmanaged_path": null,
      "cloneable": "false",
   }
]
Listing Virtual Folder Hosts

Returns the list of available host names where the current keypair can create new virtual folders.

New in version v4.20190315.

  • URI: /folders/_/hosts

  • Method: GET

Parameters

None

Response

HTTP Status Code

Description

200 OK

Success

Fields

Type

Values

default

str

The default virtual folder host

allowed

list[str]

The list of available virtual folder hosts

Example:

{
  "default": "seoul:nfs1",
  "allowed": ["seoul:nfs1", "seoul:nfs2", "seoul:cephfs1"]
}
Creating a Virtual Folder
  • URI: /folders

  • Method: POST

Creates a virtual folder associated with the current API key.

Parameters

Parameter

Type

Description

name

str

The human-readable name of the virtual folder

host

str

(optional) The name of the virtual folder host

usage_mode

str

(optional) The purpose of the virtual folder. Allowed values are general, model, and data (default: general).

permission

str

(optional) The default share permission of the virtual folder. The owner of the virtual folder always has wd permission regardless of this parameter. Allowed values are ro, rw, and wd (default: rw).

group_id

UUID | str

(optional) If this parameter is set, it creates a group-type virtual folder. If empty, it creates a user-type virtual folder.

quota

int

(optional) Set the quota of the virtual folder in bytes. Note, however, that the quota is only supported on xfs filesystems. Other filesystems that do not support per-directory quota will ignore this parameter.

Example:

{
  "name": "My Data",
  "host": "seoul:nfs1"
}
Response

HTTP Status Code

Description

201 Created

The folder is successfully created.

400 Bad Request

The name is malformed or duplicates one of your existing virtual folders.

406 Not acceptable

You have exceeded internal limits of virtual folders. (e.g., the maximum number of folders you can have.)

Fields

Type

Values

id

slug

The unique folder ID used for later API calls

name

str

The human-readable name of the created virtual folder

host

str

The name of the virtual folder host where the new folder is created

Example:

{
  "id": "aef1691db3354020986d6498340df13c",
  "name": "My Data",
  "host": "nfs1",
  "usage_mode": "general",
  "permission": "rw",
  "creator": "admin@lablup.com",
  "ownership_type": "user",
  "user": "dfa9da54-4b28-432f-be29-c0d680c7a412",
  "group": "",
}
Getting Virtual Folder Information
  • URI: /folders/:name

  • Method: GET

Retrieves information about a virtual folder. For performance reasons, the returned information may not be real-time; usually it is updated every few seconds on the server side.

Parameters

Parameter

Type

Description

name

str

The human-readable name of the virtual folder

Response

HTTP Status Code

Description

200 OK

The information is successfully returned.

404 Not Found

There is no such folder or you may not have proper permission to access the folder.

Fields

Type

Values

(root)

object

Virtual Folder Item Object

Deleting Virtual Folder
  • URI: /folders/:name

  • Method: DELETE

This immediately deletes all contents of the given virtual folder and makes the folder unavailable for future mounts.

Danger

If there are running kernels that have mounted the deleted virtual folder, those kernels are likely to break!

Warning

There is NO way to get back the contents once this API is invoked.

Parameters

Parameter

Description

name

The human-readable name of the virtual folder

Response

HTTP Status Code

Description

204 No Content

The folder is successfully destroyed.

404 Not Found

There is no such folder or you may not have proper permission to delete the folder.

Rename a Virtual Folder
  • URI: /folders/:name/rename

  • Method: POST

Rename a virtual folder associated with the current API key.

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder

new_name

str

New virtual folder name

Response

HTTP Status Code

Description

201 Created

The folder is successfully renamed.

404 Not Found

There is no such folder or you may not have proper permission to rename the folder.

Listing Files in Virtual Folder

Returns the list of files in a virtual folder associated with current keypair.

  • URI: /folders/:name/files

  • Method: GET

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder

path

str

Path inside the virtual folder (default: root)

Response

HTTP Status Code

Description

200 OK

Success.

404 Not Found

There is no such path or you may not have proper permission to access the folder.

Fields

Type

Values

files

list[object]

List of Virtual Folder File Object

Uploading a File to Virtual Folder

Upload a local file to a virtual folder associated with the current keypair. Internally, the Manager delegates the upload to a Backend.AI Storage-Proxy service. A JSON web token is used for the authentication of the request.

  • URI: /folders/:name/request-upload

  • Method: POST

Warning

If a file with the same name already exists in the virtual folder, it will be overwritten without warning.

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder

path

str

Path of the local file to upload

size

int

The total size of the local file to upload

Response

HTTP Status Code

Description

200 OK

Success.

Fields

Type

Values

token

str

A JSON web token for the authentication of the upload session with the Storage-Proxy service.

url

str

The request URL for the Storage-Proxy. The client should use this URL to upload the file.
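
The overall flow is a two-step handshake: obtain a token and URL from the Manager, then transfer the file to the Storage-Proxy. The sketch below assumes a plain POST with the token as a query parameter, which is an illustration only; the actual transfer protocol of the Storage-Proxy may differ by deployment. authFetch() is hypothetical as before.

// Two-step vfolder upload via the Storage-Proxy (protocol details assumed).
async function uploadToVFolder(folderName, file) {
  const resp = await authFetch(`/folders/${folderName}/request-upload`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ path: file.name, size: file.size }),
  });
  const { token, url } = await resp.json();
  await fetch(`${url}?token=${token}`, { method: "POST", body: file });  // assumed
}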

Creating New Directory in Virtual Folder

Create a new directory in the virtual folder associated with the current keypair. This API can recursively create parent directories if they do not exist (see the parents parameter).

  • URI: /folders/:name/mkdir

  • Method: POST

Warning

If a directory with the same name already exists in the virtual folder, it may be overwritten without warning.

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder.

path

str

The relative path of a new folder to create inside the virtual folder

parents

bool

If True, the parent directories will be created if they do not exist.

exist_ok

bool

If a directory with the same name already exists, do not raise an error.

Response

HTTP Status Code

Description

201 Created

Success.

400 Bad Request

A file (not a directory) with the same name already exists.

404 Not Found

There is no such folder or you may not have proper permission to write into folder.

Downloading a File or a Directory from a Virtual Folder

Download a file or a directory from a virtual folder associated with the current keypair. Internally, the Manager delegates the download to a Backend.AI Storage-Proxy service. A JSON web token is used for the authentication of the request.

New in version v4.20190315.

  • URI: /folders/:name/request-download

  • Method: POST

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder

path

str

The path to a file or a directory inside the virtual folder to download.

archive

bool

If this parameter is True and path is a directory, the directory will be archived into a zip file on the fly (default: False).

Response

HTTP Status Code

Description

200 OK

Success.

404 Not Found

File not found or you may not have proper permission to access the folder.

Fields

Type

Values

token

str

A JSON web token for the authentication of the download session with the Storage-Proxy service.

url

str

The request URL for the Storage-Proxy. The client should use this URL to download the file.

Deleting Files in Virtual Folder

This deletes files inside a virtual folder.

Warning

There is NO way to get back the files once this API is invoked.

  • URI: /folders/:name/delete-files

  • Method: DELETE

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder

files

list[str]

File paths inside the virtual folder to delete

recursive

bool

If True, directories are deleted recursively (default: False).

Response

HTTP Status Code

Description

200 OK

Success.

400 Bad Request

You tried to delete a folder without setting the recursive option to True.

404 Not Found

There is no such folder or you may not have proper permission to delete the file in the folder.

Rename a File in Virtual Folder

Rename a file inside a virtual folder.

  • URI: /folders/:name/rename-file

  • Method: POST

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder

target_path

str

The relative path of the target file or directory

new_name

str

The new name of the file or directory

is_dir

bool

A flag that indicates whether target_path is a directory or not

Response

HTTP Status Code

Description

200 OK

Success.

400 Bad Request

You tried to rename a directory without setting the is_dir option to True.

404 Not Found

There is no such folder or you may not have proper permission to rename the file in the folder.

Listing Invitations for Virtual Folder

Returns the list of pending invitations that the requesting user has received. This displays the invitations sent to the requesting user by other users.

  • URI: /folders/invitations/list

  • Method: GET

Parameters

This API does not need any parameter.

Response

HTTP Status Code

Description

200 OK

Success.

Fields

Type

Values

invitations

list[object]

A list of Virtual Folder Invitation Object

Creating an Invitation

Invite other users to share a virtual folder with proper permissions. If a user is already invited, then this API does not create a new invitation or update the permission of the existing invitation.

  • URI: /folders/:name/invite

  • Method: POST

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder

perm

str

The permission to grant to the invitee

emails

list[slug]

A list of user emails to invite

Response

HTTP Status Code

Description

200 OK

Success.

400 Bad Request

No invitee is given.

404 Not Found

There is no invitation.

Fields

Type

Values

invited_ids

list[slug]

A list of invited user emails

Accepting an Invitation

Accept an invitation and receive the permission to the virtual folder as specified in the invitation.

  • URI: /folders/invitations/accept

  • Method: POST

Parameters

Parameter

Type

Description

inv_id

slug

The unique invitation ID

Response

HTTP Status Code

Description

200 OK

Success.

400 Bad Request

The name of the target virtual folder duplicates one of your existing virtual folders.

404 Not Found

There is no such invitation.

Rejecting an Invitation

Reject an invitation.

  • URI: /folders/invitations/delete

  • Method: DELETE

Parameters

Parameter

Type

Description

inv_id

slug

The unique invitation ID

Response

HTTP Status Code

Description

200 OK

Success.

404 Not Found

There is no such invitation.

Fields

Type

Values

msg

str

Detail message for the invitation deletion

Listing Sent Invitations

Returns the list of virtual folder invitations the requesting user has sent. This does not include invitations that have already been accepted or rejected.

  • URI: /folders/invitations/list-sent

  • Method: GET

Parameters

This API does not need any parameter.

Response

HTTP Status Code

Description

200 OK

Success.

Fields

Type

Values

invitations

list[object]

A list of Virtual Folder Invitation Object

Updating an Invitation

Update the permission of an already-sent, but not accepted or rejected, invitation.

  • URI: /folders/invitations/update/:inv_id

  • Method: POST

Parameters

Parameter

Type

Description

:inv_id

str

The unique invitation ID

perm

str

The permission to grant to the invitee

Response

HTTP Status Code

Description

200 OK

Success.

400 Bad Request

No permission is given.

404 Not Found

There is no invitation.

Fields

Type

Values

msg

str

An update message string

Leave a Shared Virtual Folder

Leave a shared virtual folder.

You cannot leave a group vfolder or a vfolder that the requesting user owns.

  • URI: /folders/:name/leave

  • Method: POST

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder

Response

HTTP Status Code

Description

200 OK

Success.

404 Not Found

There is no virtual folder.

Fields

Type

Values

msg

str

A result message string

Listing Users Sharing Virtual Folders

Returns the list of users who share the requester’s virtual folders.

  • URI: /folders/_/shared

  • Method: GET

Parameters

Parameter

Type

Description

vfolder_id

str

(optional) The unique virtual folder ID used to list shared users. If not specified, all users who share any virtual folder created by the requester are returned.

Response

HTTP Status Code

Description

200 OK

Success.

Fields

Type

Values

shared

list[object]

A list of information about shared users.

Example:

[
   {
      "vfolder_id": "aef1691db3354020986d6498340df13c",
      "vfolder_name": "My Data",
      "shared_by": "admin@lablup.com",
      "shared-to": {
         "uuid": "dfa9da54-4b28-432f-be29-c0d680c7a412",
         "email": "user@lablup.com"
      },
      "perm": "ro"
   }
]
Updating the permission of a shared virtual folder

Update the permission of a user for a shared virtual folder.

  • URI: /folders/_/shared

  • Method: POST

Parameters

Parameter

Type

Description

vfolder

UUID

The unique virtual folder ID

user

UUID

The unique user ID

perm

str

The permission to update for the user on vfolder

Response

HTTP Status Code

Description

200 OK

Success.

400 Bad Request

No permission or user is given.

404 Not Found

There is no virtual folder.

Fields

Type

Values

msg

str

An update message string

Share a Group Virtual Folder with Individual Users

Share a group virtual folder with users, overriding the default permission.

This directly creates vfolder_permission relation(s) without creating invitation(s). Only group virtual folders are allowed to be shared directly.

This API can be useful when you want to share a group virtual folder with every group member with read-only permission, while allowing some users read-write permission.

NOTE: This API is only available for group virtual folders.

  • URI: /folders/:name/share

  • Method: POST

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder

permission

str

Overriding permission to share the group virtual folder

emails

list[str]

A list of user emails to share

Response

HTTP Status Code

Description

201 Created

Success.

400 Bad Request

No permission or email is given.

404 Not Found

There is no virtual folder.

Fields

Type

Values

shared_emails

list[str]

A list of user emails with whom the virtual folder is successfully shared

Unshare a Group Virtual Folder from Users

Unshare a group virtual folder from users.

NOTE: This API is only available for group virtual folders.

  • URI: /folders/:name/unshare

  • Method: DELETE

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder

emails

list[str]

A list of user emails to unshare

Response

HTTP Status Code

Description

200 OK

Success.

400 Bad Request

No email is given.

404 Not Found

There is no virtual folder.

Fields

Type

Values

unshared_emails

list[str]

A list of user emails from whom the virtual folder is successfully unshared

Clone a Virtual Folder

Clone a virtual folder.

  • URI: /folders/:name/clone

  • Method: POST

Parameters

Parameter

Type

Description

:name

str

The human-readable name of the virtual folder

cloneable

bool

If True, the cloned virtual folder will also be cloneable.

target_name

str

The name of the new virtual folder

target_host

str

The target host volume of the new virtual folder

usage_mode

str

(optional) The purpose of the new virtual folder. Allowed values are general, model, and data (default: general).

permission

str

(optional) The default share permission of the new virtual folder. The owner of the virtual folder always have wd permission regardless of this parameter. Allowed values are ro, rw, and wd (default: rw).

Response

HTTP Status Code

Description

200 OK

Success.

400 Bad Request

No target name, target host, or permission is given.

403 Forbidden

The source virtual folder is not permitted to be cloned.

404 Not Found

There is no virtual folder.

Fields

Type

Values

(root)

list[object]

Virtual Folder List Item Object

Example:

{
   "name": "my cloned folder",
   "id": "b4b1b16c-d07f-4f1f-b60e-da9449aa60a6",
   "host": "local:volume1",
   "usage_mode": "general",
   "created_at": "2020-11-28 13:30:30.912056+00",
   "is_owner": "true",
   "permission": "rw",
   "user": "dfa9da54-4b28-432f-be29-c0d680c7a412",
   "group": null,
   "creator": "admin@lablup.com",
   "user_email": "admin@lablup.com",
   "group_name": null,
   "ownership_type": "user",
   "unmanaged_path": null,
   "cloneable": "false"
}

Code Execution Model

The core of the user API is the execute call which allows clients to execute user-provided codes in isolated compute sessions (aka kernels). Each session is managed by a kernel runtime, whose implementation is language-specific. A runtime is often a containerized daemon that interacts with the Backend.AI agent via our internal ZeroMQ protocol. In some cases, kernel runtimes may be just proxies to other code execution services instead of actual executor daemons.

Inside each compute session, a client may perform multiple runs. Each run is for executing different code snippets (the query mode) or different sets of source files (the batch mode). The client often has to call the execute API multiple times to finish a single run. It is completely legal to mix query-mode runs and batch-mode runs inside the same session, given that the kernel runtime supports both modes.

To distinguish different runs which may be overlapped, the client must provide the same run ID to all execute calls during a single run. The run ID should be unique for each run and can be an arbitrary random string. If the run ID is not provided by the client at the first execute call of a run, the API server will assign a random one and inform the client of it via the first response. Normally, if two or more runs overlap, they are processed in FIFO order using an internal queue. But they may be processed in parallel if the kernel runtime supports parallel processing. Note that the API server may raise a timeout error and cancel the run if the waiting time exceeds a certain limit.

In the query mode, the runtime context (e.g., global variables) is usually preserved for subsequent runs, but this is not guaranteed by the API itself; it is up to the kernel runtime implementation.

_images/run-state-diagram.svg

The state diagram of a “run” with the execute API.

The execute API accepts 4 arguments: mode, runId, code, and options (opts). It returns an Execution Result Object encoded as JSON.

Depending on the value of status field in the returned Execution Result Object, the client must perform another subsequent execute call with appropriate arguments or stop. Fig. 8 shows all possible states and transitions between them via the status field value.

If status is "finished", the client should stop.

If status is "continued", the client should make another execute API call with the code field set to an empty string and the mode field set to "continue". Continuation happens when the user code runs longer than a few seconds to allow the client to show its progress, or when it requires extra step to finish the run cycle.

If status is "clean-finished" or "build-finished" (this happens at the batch-mode only), the client should make the same continuation call. Since cleanup is performed before every build, the client will always receive "build-finished" after "clean-finished" status. All outputs prior to "build-finished" status return are from the build program and all future outputs are from the executed program built. Note that even when the exitCode value is non-zero (failed), the client must continue to complete the run cycle.

If status is "waiting-input", you should make another execute API call with the code field set to the user-input text and the mode field set to "input". This happens when the user code calls interactive input() functions. Until you send the user input, the current run is blocked. You may use modal dialogs or other input forms (e.g., HTML input) to retrieve user inputs. When the server receives the user input, the kernel’s input() returns the given value. Note that each kernel runtime may provide different ways to trigger this interactive input cycle or may not provide at all.

When each call returns, the console field in the Execution Result Object has the console logs captured since the previous call. Check out the following section for details.
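
The transitions above can be condensed into a small driver loop. The sketch below generalizes the query-mode loop shown earlier to cover the batch-mode statuses as well; callExecute() and getInput() are hypothetical helpers for the authenticated execute call and for collecting user input.

// A generic driver for the run state machine shown in Fig. 8.
async function driveRun(sessionId, mode, code, opts, onConsole, getInput) {
  let resp = await callExecute(sessionId, { mode, code, options: opts });
  const runId = resp.result.runId;
  while (true) {
    resp.result.console.forEach(onConsole);
    switch (resp.result.status) {
      case "finished":
        return resp.result;
      case "waiting-input":
        resp = await callExecute(sessionId,
          { mode: "input", code: await getInput(resp.result.options), runId });
        break;
      default:  // "continued", "clean-finished", "build-finished"
        resp = await callExecute(sessionId, { mode: "continue", code: "", runId });
    }
  }
}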

Handling Console Output

The console output consists of a list of tuple pairs of item type and item data. The item type is one of "stdout", "stderr", "media", "html", or "log".

When the item type is "stdout" or "stderr", the item data is the standard I/O stream outputs as (non-escaped) UTF-8 string. The total length of either streams is limited to 524,288 Unicode characters per each execute API call; all excessive outputs are truncated. The stderr often includes language-specific tracebacks of (unhandled) exceptions or errors occurred in the user code. If the user code generates a mixture of stdout and stderr, the print ordering is preserved and each contiguous block of stdout/stderr becomes a separate item in the console output list so that the client user can reconstruct the same console output by sequentially rendering the items.

Note

The text in the stdout/stderr items may contain arbitrary terminal control sequences such as ANSI color codes and cursor/line manipulations. It is the user’s job to strip them out or implement some sort of terminal emulation.

Tip

Since the console texts are not escaped, the client user should take care of rendering and escaping depending on the UI implementation. For example, use a <pre> element, replace newlines with <br>, or apply the white-space: pre CSS style when rendering as HTML. An easy way to escape the text safely is to use the insertAdjacentText() DOM API.

When the item type is "media", the item data is a pair of the MIME type and the content data. If the MIME type is text-based (e.g., "text/plain") or XML-based (e.g., "image/svg+xml"), the content is just a string that represent the content. Otherwise, the data is encoded as a data URI format (RFC 2397). You may use backend.ai-media library to handle this field in Javascript on web-browsers.

When the item type is "html", the item data is a partial HTML document string, such as a table to show tabular data. If you are implementing a web-based front-end, you may use it directly to the standard DOM API, for instance, consoleElem.insertAdjacentHTML(value, "beforeend").

When the item type is "log", the item data is a 4-tuple of the log level, the timestamp in the ISO 8601 format, the logger name and the log message string. The log level may be one of "debug", "info", "warning", "error", or "fatal". You may use different colors/formatting by the log level when printing the log message. Not every kernel runtime supports this rich logging facility.

Manager GraphQL API

Backend.AI GraphQL API is for developing in-house management consoles.

There are two modes of operation:

  1. Full admin access: you can query all information of all users. It requires a privileged keypair.

  2. Restricted owner access: you can query only your own information. The server processes your request in this mode if you use your own plain keypair.

Warning

The Admin API only accepts authenticated requests.

Tip

To test and debug with the Admin API easily, try the proxy mode of the official Python client. It provides an insecure (non-SSL, non-authenticated) local HTTP proxy where all the required authorization headers are attached from the client configuration. Using this, you do not have to add any custom header configurations to your favorite API development tools such as GraphiQL.
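
Through such a proxy, issuing a query is a plain HTTP POST of a JSON body. In the sketch below, the proxy URL and the endpoint path are placeholders that depend on your client configuration.

// Issue a GraphQL query through a local client proxy (URLs are placeholders).
async function queryDomains() {
  const resp = await fetch("http://localhost:8081/admin/graphql", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      query: "query { domains(is_active: true) { name description } }",
    }),
  });
  return await resp.json();
}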

Domain Management

Query Schema
type Domain {
  name: String
  description: String
  is_active: Boolean
  created_at: DateTime
  modified_at: DateTime
  total_resource_slots: JSONString  # ResourceSlot
  allowed_vfolder_hosts: [String]
  allowed_docker_registries: [String]
  integration_id: String
  scaling_groups: [String]
}

type Query {
  domain(name: String): Domain
  domains(is_active: Boolean): [Domain]
}
Mutation Schema
input DomainInput {
  description: String
  is_active: Boolean
  total_resource_slots: JSONString  # ResourceSlot
  allowed_vfolder_hosts: [String]
  allowed_docker_registries: [String]
  integration_id: String
}

input ModifyDomainInput {
  name: String
  description: String
  is_active: Boolean
  total_resource_slots: JSONString  # ResourceSlot
  allowed_vfolder_hosts: [String]
  allowed_docker_registries: [String]
  integration_id: String
}

type CreateDomain {
  ok: Boolean
  msg: String
  domain: Domain
}

type ModifyDomain {
  ok: Boolean
  msg: String
}

type DeleteDomain {
  ok: Boolean
  msg: String
}

type Mutation {
  create_domain(name: String!, props: DomainInput!): CreateDomain
  modify_domain(name: String!, props: ModifyDomainInput!): ModifyDomain
  delete_domain(name: String!): DeleteDomain
}

Scaling Group Management

Query Schema
type ScalingGroup {
  name: String
  description: String
  is_active: Boolean
  created_at: DateTime
  driver: String
  driver_opts: JSONString
  scheduler: String
  scheduler_opts: JSONString
}

type Query {
  scaling_group(name: String): ScalingGroup
  scaling_groups(name: String, is_active: Boolean): [ScalingGroup]
  scaling_groups_for_domain(domain: String!, is_active: Boolean): [ScalingGroup]
  scaling_groups_for_user_group(user_group: String!, is_active: Boolean): [ScalingGroup]
  scaling_groups_for_keypair(access_key: String!, is_active: Boolean): [ScalingGroup]
}
Mutation Schema
input ScalingGroupInput {
  description: String
  is_active: Boolean
  driver: String!
  driver_opts: JSONString
  scheduler: String!
  scheduler_opts: JSONString
}

input ModifyScalingGroupInput {
  description: String
  is_active: Boolean
  driver: String
  driver_opts: JSONString
  scheduler: String
  scheduler_opts: JSONString
}

type CreateScalingGroup {
  ok: Boolean
  msg: String
  scaling_group: ScalingGroup
}

type ModifyScalingGroup {
  ok: Boolean
  msg: String
}

type DeleteScalingGroup {
  ok: Boolean
  msg: String
}

type AssociateScalingGroupWithDomain {
  ok: Boolean
  msg: String
}

type AssociateScalingGroupWithKeyPair {
  ok: Boolean
  msg: String
}

type AssociateScalingGroupWithUserGroup {
  ok: Boolean
  msg: String
}

type DisassociateAllScalingGroupsWithDomain {
  ok: Boolean
  msg: String
}

type DisassociateAllScalingGroupsWithGroup {
  ok: Boolean
  msg: String
}

type DisassociateScalingGroupWithDomain {
  ok: Boolean
  msg: String
}

type DisassociateScalingGroupWithKeyPair {
  ok: Boolean
  msg: String
}

type DisassociateScalingGroupWithUserGroup {
  ok: Boolean
  msg: String
}

type Mutation {
  create_scaling_group(name: String!, props: ScalingGroupInput!): CreateScalingGroup
  modify_scaling_group(name: String!, props: ModifyScalingGroupInput!): ModifyScalingGroup
  delete_scaling_group(name: String!): DeleteScalingGroup
  associate_scaling_group_with_domain(domain: String!, scaling_group: String!): AssociateScalingGroupWithDomain
  associate_scaling_group_with_user_group(scaling_group: String!, user_group: String!): AssociateScalingGroupWithUserGroup
  associate_scaling_group_with_keypair(access_key: String!, scaling_group: String!): AssociateScalingGroupWithKeyPair
  disassociate_scaling_group_with_domain(domain: String!, scaling_group: String!): DisassociateScalingGroupWithDomain
  disassociate_scaling_group_with_user_group(scaling_group: String!, user_group: String!): DisassociateScalingGroupWithUserGroup
  disassociate_scaling_group_with_keypair(access_key: String!, scaling_group: String!): DisassociateScalingGroupWithKeyPair
  disassociate_all_scaling_groups_with_domain(domain: String!): DisassociateAllScalingGroupsWithDomain
  disassociate_all_scaling_groups_with_group(user_group: String!): DisassociateAllScalingGroupsWithGroup
}

Resource Preset Management

Query Schema
type ResourcePreset {
  name: String
  resource_slots: JSONString
  shared_memory: BigInt
}

type Query {
  resource_preset(name: String!): ResourcePreset
  resource_presets(): [ResourcePreset]
}
Mutation Schema
input CreateResourcePresetInput {
  resource_slots: JSONString
  shared_memory: String
}

type CreateResourcePreset {
  ok: Boolean
  msg: String
  resource_preset: ResourcePreset
}

input ModifyResourcePresetInput {
  resource_slots: JSONString
  shared_memory: String
}

type ModifyResourcePreset {
  ok: Boolean
  msg: String
}

type DeleteResourcePreset {
  ok: Boolean
  msg: String
}

type Mutation {
  create_resource_preset(name: String!, props: CreateResourcePresetInput!): CreateResourcePreset
  modify_resource_preset(name: String!, props: ModifyResourcePresetInput!): ModifyResourcePreset
  delete_resource_preset(name: String!): DeleteResourcePreset
}

Agent Monitoring

Query Schema
type Agent {
  id: ID
  status: String
  status_changed: DateTime
  region: String
  scaling_group: String
  available_slots: JSONString  # ResourceSlot
  occupied_slots: JSONString   # ResourceSlot
  addr: String
  first_contact: DateTime
  lost_at: DateTime
  live_stat: JSONString
  version: String
  compute_plugins: JSONString
  compute_containers(status: String): [ComputeContainer]

  # legacy fields
  mem_slots: Int
  cpu_slots: Float
  gpu_slots: Float
  tpu_slots: Float
  used_mem_slots: Int
  used_cpu_slots: Float
  used_gpu_slots: Float
  used_tpu_slots: Float
  cpu_cur_pct: Float
  mem_cur_bytes: Float
}

type Query {
  agent_list(
   limit: Int!,
   offset: Int!
   order_key: String,
   order_asc: Boolean,
   scaling_group: String,
   status: String,
 ): PaginatedList[Agent]
}

User Management

Query Schema
type User {
  uuid: UUID
  username: String
  email: String
  password: String
  need_password_change: Boolean
  full_name: String
  description: String
  is_active: Boolean
  created_at: DateTime
  domain_name: String
  role: String
  groups: [UserGroup]
}

type UserGroup {  # shorthand reference to Group
  id: UUID
  name: String
}

type Query {
  user(domain_name: String, email: String): User
  user_from_uuid(domain_name: String, user_id: String): User
  users(domain_name: String, group_id: String, is_active: Boolean): [User]
}
Mutation Schema
input UserInput {
  username: String!
  password: String!
  need_password_change: Boolean!
  full_name: String
  description: String
  is_active: Boolean
  domain_name: String!
  role: String
  group_ids: [String]
}

input ModifyUserInput {
  username: String
  password: String
  need_password_change: Boolean
  full_name: String
  description: String
  is_active: Boolean
  domain_name: String
  role: String
  group_ids: [String]
}

type CreateUser {
  ok: Boolean
  msg: String
  user: User
}

type ModifyUser {
  ok: Boolean
  msg: String
  user: User
}

type DeleteUser {
  ok: Boolean
  msg: String
}

type Mutation {
  create_user(email: String!, props: UserInput!): CreateUser
  modify_user(email: String!, props: ModifyUserInput!): ModifyUser
  delete_user(email: String!): DeleteUser
}

Group Management

Query Schema
type Group {
  id: UUID
  name: String
  description: String
  is_active: Boolean
  created_at: DateTime
  modified_at: DateTime
  domain_name: String
  total_resource_slots: JSONString  # ResourceSlot
  allowed_vfolder_hosts: [String]
  integration_id: String
  scaling_groups: [String]
}

type Query {
  group(id: String!): Group
  groups(domain_name: String, is_active: Boolean): [Group]
}
Mutation Schema
input GroupInput {
  description: String
  is_active: Boolean
  domain_name: String!
  total_resource_slots: JSONString  # ResourceSlot
  allowed_vfolder_hosts: [String]
  integration_id: String
}

input ModifyGroupInput {
  name: String
  description: String
  is_active: Boolean
  domain_name: String
  total_resource_slots: JSONString  # ResourceSlot
  user_update_mode: String
  user_uuids: [String]
  allowed_vfolder_hosts: [String]
  integration_id: String
}

type CreateGroup {
  ok: Boolean
  msg: String
  group: Group
}

type ModifyGroup {
  ok: Boolean
  msg: String
}

type DeleteGroup {
  ok: Boolean
  msg: String
}

type Mutation {
  create_group(name: String!, props: GroupInput!): CreateGroup
  modify_group(name: String!, props: ModifyGroupInput!): ModifyGroup
  delete_group(name: String!): DeleteGroup
}

KeyPair Management

Query Schema
type KeyPair {
  user_id: String
  access_key: String
  secret_key: String
  is_active: Boolean
  is_admin: Boolean
  resource_policy: String
  created_at: DateTime
  last_used: DateTime
  concurrency_used: Int
  rate_limit: Int
  num_queries: Int
  user: UUID
  ssh_public_key: String
  vfolders: [VirtualFolder]
  compute_sessions(status: String): [ComputeSession]
}

type Query {
  keypair(domain_name: String, access_key: String): KeyPair
  keypairs(domain_name: String, email: String, is_active: Boolean): [KeyPair]
}
Mutation Schema
input KeyPairInput {
  is_active: Boolean
  resource_policy: String
  concurrency_limit: Int
  rate_limit: Int
}

input ModifyKeyPairInput {
  is_active: Boolean
  is_admin: Boolean
  resource_policy: String
  concurrency_limit: Int
  rate_limit: Int
}

type CreateKeyPair {
  ok: Boolean
  msg: String
  keypair: KeyPair
}

type ModifyKeyPair {
  ok: Boolean
  msg: String
}

type DeleteKeyPair {
  ok: Boolean
  msg: String
}

type Mutation {
  create_keypair(props: KeyPairInput!, user_id: String!): CreateKeyPair
  modify_keypair(access_key: String!, props: ModifyKeyPairInput!): ModifyKeyPair
  delete_keypair(access_key: String!): DeleteKeyPair
}

KeyPair Resource Policy Management

Query Schema
type KeyPairResourcePolicy {
  name: String
  created_at: DateTime
  default_for_unspecified: String
  total_resource_slots: JSONString  # ResourceSlot
  max_concurrent_sessions: Int
  max_containers_per_session: Int
  idle_timeout: BigInt
  max_vfolder_count: Int
  max_vfolder_size: BigInt
  allowed_vfolder_hosts: [String]
}

type Query {
  keypair_resource_policy(name: String): KeyPairResourcePolicy
  keypair_resource_policies(): [KeyPairResourcePolicy]
}
Mutation Schema
input CreateKeyPairResourcePolicyInput {
  default_for_unspecified: String!
  total_resource_slots: JSONString!
  max_concurrent_sessions: Int!
  max_containers_per_session: Int!
  idle_timeout: BigInt!
  max_vfolder_count: Int!
  max_vfolder_size: BigInt!
  allowed_vfolder_hosts: [String]
}

input ModifyKeyPairResourcePolicyInput {
  default_for_unspecified: String
  total_resource_slots: JSONString
  max_concurrent_sessions: Int
  max_containers_per_session: Int
  idle_timeout: BigInt
  max_vfolder_count: Int
  max_vfolder_size: BigInt
  allowed_vfolder_hosts: [String]
}

type CreateKeyPairResourcePolicy {
  ok: Boolean
  msg: String
  resource_policy: KeyPairResourcePolicy
}

type ModifyKeyPairResourcePolicy {
  ok: Boolean
  msg: String
}

type DeleteKeyPairResourcePolicy {
  ok: Boolean
  msg: String
}

type Mutation {
  create_keypair_resource_policy(name: String!, props: CreateKeyPairResourcePolicyInput!): CreateKeyPairResourcePolicy
  modify_keypair_resource_policy(name: String!, props: ModifyKeyPairResourcePolicyInput!): ModifyKeyPairResourcePolicy
  delete_keypair_resource_policy(name: String!): DeleteKeyPairResourcePolicy
}

Compute Session Monitoring

As of Backend.AI v20.03, compute sessions are composed of one or more containers, while interactions with sessions only occur with the master container when using REST APIs. The GraphQL API allows users and admins to check the details of sessions and the containers belonging to them.

Changed in version v5.20191215.

Query Schema

ComputeSession provides information about the whole session, including user-requested parameters when creating sessions.

type ComputeSession {
  # identity and type
  id: UUID
  name: String
  type: String
  tag: String

  # image
  image: String
  registry: String
  cluster_template: String  # reserved for future release

  # ownership
  domain_name: String
  group_name: String
  group_id: UUID
  user_email: String
  user_id: UUID
  access_key: String
  created_user_email: String  # reserved for future release
  created_user_uuid: UUID     # reserved for future release

  # status
  status: String
  status_changed: DateTime
  status_info: String
  created_at: DateTime
  terminated_at: DateTime
  startup_command: String
  result: String

  # resources
  resource_opts: JSONString
  scaling_group: String
  service_ports: JSONString   # only available in master
  mounts: List[String]            # shared by all kernels
  occupied_slots: JSONString  # ResourceSlot; sum of belonging containers

  # statistics
  num_queries: BigInt

  # owned containers (aka kernels)
  containers: List[ComputeContainer]  # full list of owned containers

  # pipeline relations
  dependencies: List[ComputeSession]  # full list of dependency sessions
}

The sessions may be queried one by one using the compute_session field on the root query schema, or as a paginated list using the compute_session_list field.

type Query {
  compute_session(
    id: UUID!,
  ): ComputeSession

  compute_session_list(
    limit: Int!,
    offset: Int!,
    order_key: String,
    order_asc: Boolean,
    domain_name: String,  # super-admin can query sessions in any domain
    group_id: String,     # domain-admins can query sessions in any group
    access_key: String,   # admins can query sessions of other users
    status: String,
  ): PaginatedList[ComputeSession]
}

ComputeContainer provides information about the individual containers that belong to the given session. Note that the client must assume that id is different from container_id, because agents may be configured to use non-Docker backends.

Note

The container ID in the GraphQL queries and REST APIs is different from the actual Docker container ID. The Docker container IDs can be queried using the container_id field of ComputeContainer objects. If the agents are configured to use non-Docker-based backends, container_id may be a completely arbitrary identifier.

type ComputeContainer {
  # identity
  id: UUID
  role: String      # "master" is reserved, other values are defined by cluster templates
  hostname: String  # used by sibling containers in the same session
  session_id: UUID

  # image
  image: String
  registry: String

  # status
  status: String
  status_changed: DateTime
  status_info: String
  created_at: DateTime
  terminated_at: DateTime

  # resources
  agent: String               # super-admin only
  container_id: String
  resource_opts: JSONString
  # NOTE: mounts are same in all containers of the same session.
  occupied_slots: JSONString  # ResourceSlot

  # statistics
  live_stat: JSONString
  last_stat: JSONString
}

In the same way, the containers may be queried one by one using the compute_container field on the root query schema, or as a paginated list using compute_container_list for a single session.

Note

The container ID of the master container of each session is the same as the session ID.

type Query {
  compute_container(
    id: UUID!,
  ): ComputeContainer

  compute_container_list(
    limit: Int!,
    offset: Int!,
    session_id: UUID!,
    role: String,
  ): PaginatedList[ComputeContainer]
}
Query Example
query(
  $limit: Int!,
  $offset: Int!,
  $ak: String,
  $status: String,
) {
  compute_session_list(
    limit: $limit,
    offset: $offset,
    access_key: $ak,
    status: $status,
  ) {
    total_count
    items {
      id
      name
      type
      user_email
      status
      status_info
      status_changed
      containers {
        id
        role
        agent
      }
    }
  }
}
API Parameters

Using the above GraphQL query, clients may send the following JSON object as the request:

{
  "query": "...",
  "variables": {
    "limit": 10,
    "offset": 0,
    "ak": "AKIA....",
    "status": "RUNNING"
  }
}
API Response
{
  "compute_session_list": {
    "total_count": 1,
    "items": [
      {
        "id": "12c45b55-ce3c-418d-9c58-223bbba307f1",
        "name": "mysession",
        "type": "interactive",
        "user_email": "user@lablup.com",
        "status": "RUNNING",
        "status_info": null,
        "status_updated": "2020-02-16T15:47:28.997335+00:00",
        "containers": [
          {
            "id": "12c45b55-ce3c-418d-9c58-223bbba307f1",
            "role": "master",
            "agent": "i-agent01"
          },
          {
            "id": "12c45b55-ce3c-418d-9c58-223bbba307f2",
            "role": "slave",
            "agent": "i-agent02"
          },
          {
            "id": "12c45b55-ce3c-418d-9c58-223bbba307f3",
            "role": "slave",
            "agent": "i-agent03"
          }
        ]
      }
    ]
  }
}

Virtual Folder Management

Query Schema
type VirtualFolder {
  id: UUID
  host: String
  name: String
  user: UUID
  group: UUID
  unmanaged_path: String
  max_files: Int
  max_size: Int
  created_at: DateTime
  last_used: DateTime
  num_files: Int
  cur_size: BigInt
}

type Query {
  vfolder_list(
    limit: Int!,
    offset: Int!,
    order_key: String,
    order_asc: Boolean,
    domain_name: String,
    group_id: String,
    access_key: String,
  ): PaginatedList[VirtualFolder]
}

Image Management

Query Schema
type Image {
  name: String
  humanized_name: String
  tag: String
  registry: String
  digest: String
  labels: [KVPair]
  aliases: [String]
  size_bytes: BigInt
  resource_limits: [ResourceLimit]
  supported_accelerators: [String]
  installed: Boolean
  installed_agents: [String]  # super-admin only
}
type Query {
  image(reference: String!): Image

  images(
    is_installed: Boolean,
    is_operation: Boolean,
    domain: String,         # only settable by super-admins
    group: String,
    scaling_group: String,  # null to take union of all agents from allowed scaling groups
  ): [Image]
}

The image list is automatically filtered by: 1) the allowed Docker registries of the current user’s domain, 2) whether at least one agent in the union of all agents from the allowed scaling groups for the current user’s group has the image or not. The second condition applies only when the value of group is given explicitly. If scaling_group is not null, then only the agents in the given scaling group are checked for image availability instead of taking the union of all agents from the allowed scaling groups.

If the requesting user is a super-admin, clients may set the filter conditions as they want. If the filter conditions are not specified by the super-admin, the query behaves like in v19.09 and prior versions.

New in version v5.20191215: domain, group, and scaling_group filters are added to the images root query field.

Changed in version v5.20191215: images query returns the images currently usable by the requesting user as described above. Previously, it returned all etcd-registered images.

Mutation Schema
type RescanImages {
  ok: Boolean
  msg: String
  task_id: String
}

type PreloadImage {
  ok: Boolean
  msg: String
  task_id: String
}

type UnloadImage {
  ok: Boolean
  msg: String
  task_id: String
}

type ForgetImage {
  ok: Boolean
  msg: String
}

type AliasImage {
  ok: Boolean
  msg: String
}

type DealiasImage {
  ok: Boolean
  msg: String
}

type Mutation {
  rescan_images(registry: String!): RescanImages
  preload_image(reference: String!, target_agents: String!): PreloadImage
  unload_image(reference: String!, target_agents: String!): UnloadImage
  forget_image(reference: String!): ForgetImage
  alias_image(alias: String!, target: String!): AliasImage
  dealias_image(alias: String!): DealiasImage
}

All these mutations are only allowed for super-admins.

The query parameter target_agents takes a special expression to indicate a set of agents.

The mutations that return task_id may take an arbitrarily long time to complete, so receiving the response does not necessarily mean that the requested task is finished. To monitor the progress and actual completion, clients should use the background task API with the task_id value.

New in version v5.20191215: forget_image, preload_image and unload_image are added to the root mutation.

Changed in version v5.20191215: rescan_images now returns immediately and its completion must be monitored using the new background task API.

Basics of GraphQL

The Admin API uses a single GraphQL endpoint for both queries and mutations.

https://api.backend.ai/admin/graphql

For more information about GraphQL concepts and syntax, please refer to the official GraphQL documentation (https://graphql.org/learn/).

HTTP Request Convention

A client must use the POST HTTP method. The server accepts a JSON-encoded body with an object containing two fields: query and variables, pretty much like other GraphQL server implementations.

Warning

Currently the API gateway does not support schema discovery, which is often used by API development tools such as Insomnia and GraphiQL.

Field Naming Convention

We do NOT automatically camel-case our field names. All field names follow the underscore style, which is common in the Python world as our server-side framework uses Python.

Common Object Types

ResourceLimit represents a range (min, max) of a specific resource slot (key). The max value may be the string constant “Infinity” if not specified, as illustrated in the sketch after the type definition below.

type ResourceLimit {
  key: String
  min: String
  max: String
}
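
For instance, a client consuming ResourceLimit values may want to normalize the “Infinity” sentinel before numeric comparisons. The following is a minimal sketch, assuming the min/max values of the given slot are decimal strings:

import math

def limit_max(limit: dict) -> float:
    # ResourceLimit.max may be the string constant "Infinity" when unspecified.
    if limit["max"] == "Infinity":
        return math.inf
    return float(limit["max"])

print(limit_max({"key": "cpu", "min": "1", "max": "8"}))         # 8.0
print(limit_max({"key": "cpu", "min": "1", "max": "Infinity"}))  # inf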

KVPair is used to represent a mapping data structure with arbitrary (runtime-determined) key-value pairs, in contrast to other data types in GraphQL which have a set of predefined static fields.

type KVPair {
  key: String
  value: String
}
Pagination Convention

GraphQL itself does not enforce how to pass pagination information when querying multiple objects of the same type.

We use a pagination convention as described below:

interface Item {
  id: UUID
  # other fields are defined by concrete types
}

interface PaginatedList(
  offset: Integer!,
  limit: Integer!,
  # some concrete types define ordering customization fields:
  #   order_key: String,
  #   order_asc: Boolean,
  # other optional filter condition may be added by concrete types
) {
  total_count: Integer
  items: [Item]
}

offset and limit are interpreted as SQL’s offset and limit clauses. For the first page, set the offset to zero and the limit to the page size. The items field may contain from zero up to limit items. Use the total_count field to determine how many pages there are. Fields that support pagination are suffixed with _list in our schema.
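
For example, the page count and the per-page offsets can be derived from total_count and limit as follows (the numbers here are hypothetical):

import math

total_count = 53  # taken from the first query's response
limit = 20        # the page size

num_pages = math.ceil(total_count / limit)             # 3
offsets = [page * limit for page in range(num_pages)]  # [0, 20, 40]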

Custom Scalar Types
  • UUID: A hexadecimally formatted (8-4-4-4-12 alphanumeric characters connected via single hyphens) UUID value represented as a String

  • DateTime: An ISO-8601 formatted date-time value represented as String

  • BigInt: GraphQL’s integer is officially 32-bit only, so we define a “big integer” type which can represent values from -9007199254740991 (-(2^53)+1) to 9007199254740991 ((2^53)-1), or ±(8 PiB - 1 byte). This range is regarded as a “safe” (i.e., comparable without losing precision) integer range in most JavaScript implementations, which represent numbers in the IEEE-754 double (64-bit) format.

  • JSONString: It contains a stringified JSON value, whereas the whole query result is already a JSON object. A client must parse the value again to get the object representation, as shown in the example below.
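
For example, a client may decode a JSONString field such as occupied_slots like this (the response values here are hypothetical):

import json

item = {
    "name": "mysession",
    # A JSONString field arrives as a string inside the already-parsed response.
    "occupied_slots": "{\"cpu\": \"8\", \"mem\": \"34359738368\"}",
}
occupied = json.loads(item["occupied_slots"])
print(occupied["cpu"])  # "8"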

Authentication

The admin API shares the same authentication method as the user API.

Versioning

As we use GraphQL, there is no explicit versioning. Check out the descriptions for each API for its own version history.

Backend.AI REST API Reference

Backend.AI Agent Reference

RPC Interface for Kernel Management

Docker Backend

Kubernetes Backend

Accelerators (aka Compute Plugins)

Backend.AI Storage Proxy Reference

Storage Proxy Manager-facing API

Storage Proxy Client-facing API

Backend.AI Client SDK for Python

Python 3.8 or higher is required.

You can download its official installer from python.org, or use a 3rd-party package/version manager such as homebrew, miniconda, or pyenv. It works on Linux, macOS, and Windows.

We recommend creating a virtual environment for an isolated, unobtrusive installation of the client SDK library and tools.

$ python3 -m venv venv-backend-ai
$ source venv-backend-ai/bin/activate
(venv-backend-ai) $

Then install the client library from PyPI.

(venv-backend-ai) $ pip install -U pip setuptools
(venv-backend-ai) $ pip install backend.ai-client

Set your API keypair as environment variables:

(venv-backend-ai) $ export BACKEND_ACCESS_KEY=AKIA...
(venv-backend-ai) $ export BACKEND_SECRET_KEY=...

And then try the first commands:

(venv-backend-ai) $ backend.ai --help
...
(venv-backend-ai) $ backend.ai ps
...

Check out more details with the below table of contents.

Installation

Linux/macOS

We recommend using pyenv to manage your Python versions and virtual environments to avoid conflicts with other Python applications.

Create a new virtual environment (Python 3.8 or higher) and activate it on your shell. Then run the following commands:

pip install -U pip setuptools
pip install -U backend.ai-client-py

Create a shell script my-backendai-env.sh like:

export BACKEND_ACCESS_KEY=...
export BACKEND_SECRET_KEY=...
export BACKEND_ENDPOINT=https://my-precious-cluster
export BACKEND_ENDPOINT_TYPE=api

Run this shell script before using the backend.ai command.

Note

The console-server users should set BACKEND_ENDPOINT_TYPE to session. For details, check out the client configuration document.

Windows

We recommend using the Anaconda Navigator to manage your Python environments with a slick GUI app.

Create a new environment (Python 3.8 or higher) and launch a terminal (command prompt). Then run the following commands:

python -m pip install -U pip setuptools
python -m pip install -U backend.ai-client-py

Create a batch file my-backendai-env.bat like:

chcp 65001
set PYTHONIOENCODING=UTF-8
set BACKEND_ACCESS_KEY=...
set BACKEND_SECRET_KEY=...
set BACKEND_ENDPOINT=https://my-precious-cluster
set BACKEND_ENDPOINT_TYPE=api

Run this batch file before using the backend.ai command.

Note that this batch file switches your command prompt to use the UTF-8 codepage for correct display of special characters in the console logs.

Verification

Run the backend.ai ps command and check if it says “there are no compute sessions running” or something similar.

If you encounter error messages about “ACCESS_KEY”, then check if your batch/shell scripts have the correct environment variable names.

If you encounter network connection error messages, check if the endpoint server is configured correctly and accessible.

Client Configuration

The configuration for Backend.AI API includes the endpoint URL prefix, API keypairs (access and secret keys), and a few others.

There are two ways to set the configuration:

  1. Setting environment variables before running your program that uses this SDK. This applies to the command-line interface as well.

  2. Manually creating an APIConfig instance and creating sessions with it.

The configurable environment variables are:

  • BACKEND_ENDPOINT

  • BACKEND_ENDPOINT_TYPE

  • BACKEND_ACCESS_KEY

  • BACKEND_SECRET_KEY

  • BACKEND_VFOLDER_MOUNTS

Please refer to the parameter descriptions of APIConfig’s constructor for what each environment variable means and what value format it should use.

Command Line Interface

Configuration

Note

Please consult the detailed usage in the help of each command (use -h or --help argument to display the manual).

Check out the client configuration for configurations via environment variables.

Session Mode

When the endpoint type is "session", you must explicitly log in to and log out from the console server.

$ backend.ai login
Username: myaccount@example.com
Password:
✔ Login succeeded.

$ backend.ai ...  # any commands

$ backend.ai logout
✔ Logout done.
API Mode

After setting up the environment variables, just run any command:

$ backend.ai ...
Checking out the current configuration

Run the following command to list your current active configurations.

$ backend.ai config

Compute Sessions

Note

Please consult the detailed usage in the help of each command (use -h or --help argument to display the manual).

Listing sessions

List the sessions owned by you with various status filters. The most recently status-changed sessions are listed first. To prevent overloading the server, the result is limited to the first 10 sessions, and a separate --all option is provided to paginate through further sessions.

backend.ai ps

The ps command is an alias of the following admin session list command. If you have the administrator privilege, you can list sessions owned by other users by adding the --access-key option here.

backend.ai admin session list

Both commands offer options to set the status filter as follows. For other options, please consult the output of --help.

  • (no option): PENDING, PREPARING, RUNNING, RESTARTING, TERMINATING, RESIZING, SUSPENDED, and ERROR.

  • --running: PREPARING, PULLING, and RUNNING.

  • --dead: CANCELLED and TERMINATED.

Both commands offer options to specify which fields of sessions should be printed as follows.

  • (no option): Session ID, Owner, Image, Type, Status, Status Info, Last updated, and Result.

  • --id-only: Session ID.

  • --detail: Session ID, Owner, Image, Type, Status, Status Info, Last updated, Result, Tag, Created At, Occupied Resource, Used Memory (MiB), Max Used Memory (MiB), and CPU Using (%).

  • -f, --format: Fields specified by the user.

Note

Fields for the -f/--format option are specified as a comma-separated list of field names.

Available field names for this option are: id, status, status_info, created_at, last_updated, result, image, type, task_id, tag, occupied_slots, used_memory, max_used_memory, cpu_using.

For example:

backend.ai admin session list --format id,status,cpu_using
Running simple sessions

The following command spawns a Python session and immediately executes the code passed via the -c argument. The --rm option makes the client automatically terminate the session after the execution finishes.

backend.ai run --rm -c 'print("hello world")' python:3.6-ubuntu18.04

Note

By default, you need to specify the language with a full version tag like python:3.6-ubuntu18.04. Depending on the Backend.AI admin’s language alias settings, this can be shortened to just python. To find out the defined language aliases, contact the admin of the Backend.AI server.

The following command spawns a Python session and executes the code in the uploaded ./myscript.py file, using the shell command specified in the --exec option.

backend.ai run --rm --exec 'python myscript.py arg1 arg2' \
           python:3.6-ubuntu18.04 ./myscript.py

Please note that your run command may wait for a very long time due to queueing when the cluster resources are not sufficiently available.

To avoid indefinite waiting, you may add --enqueue-only to return immediately after posting the session creation request.

Note

When using --enqueue-only, the code is NOT executed and the relevant options are ignored. This makes the run command behave the same as the start command.

Or, you may use the --max-wait option to limit the maximum waiting time. If the session starts within the given --max-wait seconds, it works normally; if not, it returns without executing the code, as if --enqueue-only were used.

To watch what is happening behind the scenes until the session starts, try backend.ai events <sessionID> to receive the lifecycle events such as its scheduling and preparation steps.

Running sessions with accelerators

Use one or more -r options to specify resource requirements when using backend.ai run and backend.ai start commands.

For instance, the following command spawns a Python TensorFlow session with 4 CPU cores, 8 GiB of main memory, and 2 fractional GPU (fGPU) shares (cuda.shares=2) to execute the ./mygpucode.py file inside it.

backend.ai run --rm \
           -r cpu=4 -r mem=8g -r cuda.shares=2 \
           python-tensorflow:1.12-py36 ./mygpucode.py
Terminating or cancelling sessions

Without the --rm option, your session remains alive for a configured amount of idle timeout (the default is 30 minutes). You can see such sessions using the backend.ai ps command. Use the following command to manually terminate them via their session IDs. You may specify multiple session IDs to terminate them at once.

backend.ai rm <sessionID> [<sessionID>...]

If you terminate PENDING sessions which are not scheduled yet, they are cancelled.

Container Applications

Note

Please consult the detailed usage in the help of each command (use -h or --help argument to display the manual).

Starting a session and connecting to its Jupyter Notebook

The following command first spawns a Python session named “mysession” without running any code immediately, and then executes a local proxy which connects to the “jupyter” service running inside the session via the local TCP port 9900. The start command shows the application services provided by the created compute session so that you can choose one in the subsequent app command. In the start command, you can specify detailed resource options using the -r option and storage mounts using the -m option.

backend.ai start -t mysession python
backend.ai app -b 9900 mysession jupyter

Once executed, the app command waits for the user to open the displayed address using an appropriate application. For the jupyter service, use your favorite web browser just as you would use regular Jupyter Notebooks. To stop the app command, press Ctrl+C or send the SIGINT signal.

Accessing sessions via a web terminal

All Backend.AI sessions expose an intrinsic application named "ttyd". It is an web application that embeds xterm.js-based full-screen terminal that runs on web browsers.

backend.ai start -t mysession ...
backend.ai app -b 9900 mysession ttyd

Then open http://localhost:9900 to access the shell in a fully functional web terminal in your browser. The default shell is /bin/bash for Ubuntu/CentOS-based images and /bin/ash for Alpine-based images, with a fallback to /bin/sh.

Note

This shell access does NOT grant you root access. All compute session processes are executed with user privileges.

Accessing sessions via native SSH/SFTP

Backend.AI offers direct access to compute sessions (containers) via SSH and SFTP, by auto-generating host identity and user keypairs for all sessions. All Backend.AI sessions expose an intrinsic application named "sshd", like "ttyd".

To connect your sessions with SSH, first prepare your session and download an auto-generated SSH keypair named id_container. Then start the service port proxy (“app” command) to open a local TCP port that proxies the SSH/SFTP traffic to the compute sessions:

$ backend.ai start -t mysess ...
$ backend.ai session download mysess id_container
$ mv id_container ~/.ssh
$ backend.ai app mysess sshd -b 9922

In another terminal on the same PC, run your ssh client like:

$ ssh -o StrictHostKeyChecking=no \
>     -o UserKnownHostsFile=/dev/null \
>     -i ~/.ssh/id_container \
>     work@localhost -p 9922
Warning: Permanently added '[127.0.0.1]:9922' (RSA) to the list of known hosts.
f310e8dbce83:~$

This SSH port is also compatible with SFTP to browse the container’s filesystem and to upload/download large-sized files.

You could add the following to your ~/.ssh/config to avoid typing the extra options every time.

Host localhost
  User work
  IdentityFile ~/.ssh/id_container
  StrictHostKeyChecking no
  UserKnownHostsFile /dev/null
$ ssh localhost -p 9922

Warning

Since the SSH keypair is auto-generated every time you launch a new compute session, you need to download and keep it separately for each session.

To use your own SSH private key across all your sessions without downloading the auto-generated one every time, create a vfolder named .ssh and put an authorized_keys file that includes your public key into it. The keypair and the .ssh directory permissions will be automatically updated by Backend.AI when the session launches.

$ ssh-keygen -t rsa -b 2048 -f id_container
$ cat id_container.pub > authorized_keys
$ backend.ai vfolder create .ssh
$ backend.ai vfolder upload .ssh authorized_keys

Storage Management

Note

Please consult the detailed usage in the help of each command (use -h or --help argument to display the manual).

Backend.AI abstracts shared network storages into per-user slices called “virtual folders” (aka “vfolders”), which can be shared between users and user group members.

Creating vfolders and managing them

The command-line interface provides a set of subcommands under backend.ai vfolder to manage vfolders and files inside them.

To list accessible vfolders including your own ones and those shared by other users:

$ backend.ai vfolder list

To create a virtual folder named “mydata1”:

$ backend.ai vfolder create mydata1 mynas

The second argument mynas corresponds to the name of a storage host. To list the storage hosts that you are allowed to use:

$ backend.ai vfolder list-hosts

To delete the vfolder completely:

$ backend.ai vfolder delete mydata1
File transfers and management

To upload a file from the current working directory into the vfolder:

$ backend.ai vfolder upload mydata1 ./bigdata.csv

To download a file from the vfolder into the current working directory:

$ backend.ai vfolder download mydata1 ./bigresult.txt

To list files in a specific path inside the vfolder:

$ backend.ai vfolder ls mydata1 .

To delete files in the vfolder:

$ backend.ai vfolder rm mydata1 ./bigdata.csv

Warning

All file uploads and downloads overwrite existing files and all file operations are irreversible.

Running sessions with storages

The following command spawns a Python session where the virtual folder “mydata1” is mounted. The execution options are omitted in this example. Then, it downloads the ./bigresult.txt file (generated by your code) from the “mydata1” virtual folder.

$ backend.ai vfolder upload mydata1 ./bigdata.csv
$ backend.ai run --rm -m mydata1 python:3.6-ubuntu18.04 ...
$ backend.ai vfolder download mydata1 ./bigresult.txt

In your code, you may access the virtual folder via /home/work/mydata1 (where the default current working directory is /home/work) just like a normal directory. If you want to mount a vfolder at another path, prefix the vfolder path with ‘/’.

By reusing the same vfolder in subsequent sessions, you do not have to download the results and upload them again as inputs for the next sessions; you can just keep them in the storage.

Creating default files for kernels

Backend.AI has a feature called ‘dotfiles’, which are created in every kernel a user spawns. As you can guess, a dotfile’s path should start with “.”. The following command creates a dotfile named .aws/config with permission 755. This file will be created under /home/work every time the user spawns a Backend.AI kernel.

$ backend.ai dotfile create .aws/config < ~/.aws/config

Advanced Code Execution

Note

Please consult the detailed usage in the help of each command (use -h or --help argument to display the manual).

Running concurrent experiment sessions

In addition to the single-shot code execution described in Running simple sessions, the run command offers concurrent execution of multiple sessions with different parameters interpolated into the execution command specified in the --exec option and the environment variables specified via the -e / --env options.

To define variables interpolated in the --exec option, use --exec-range. To define variables interpolated in the --env options, use --env-range.

Here is an example with environment variable ranges that expands into 4 concurrent sessions.

backend.ai run -c 'import os; print("Hello world, {}".format(os.environ["CASENO"]))' \
    -r cpu=1 -r mem=256m \
    -e 'CASENO=$X' \
    --env-range=X=case:1,2,3,4 \
    lablup/python:3.6-ubuntu18.04

Both range options accept a special form of argument: “range expressions”. The front part of a range option value consists of the variable name used for interpolation and an equals sign (=). The rest of the range expression takes one of the following three forms:

  • case:CASE1,CASE2,...,CASEN: A list of discrete values. The values may be either strings or numbers.

  • linspace:START,STOP,POINTS: An inclusive numerical range with discrete points, in the same way as numpy.linspace(). For example, linspace:1,2,3 generates a list of three values: 1, 1.5, and 2.

  • range:START,STOP,STEP: A numerical range with the same semantics as Python’s range(). For example, range:1,6,2 generates a list of values: 1, 3, and 5.

If you specify multiple occurrences of range options in the run command, the client spawns sessions for all possible combinations of all values specified by each range.
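
Conceptually, this combination behavior is a Cartesian product over the parsed range values. The following Python sketch only illustrates the expansion and is not the actual client implementation; the variable names and values are hypothetical:

import itertools

# Two range expressions already parsed into value lists, e.g.:
#   --env-range=X=case:1,2,3,4 and --env-range=LR=linspace:0.01,0.02,2
ranges = {
    "X": [1, 2, 3, 4],
    "LR": [0.01, 0.02],
}

names = list(ranges)
for combo in itertools.product(*(ranges[n] for n in names)):
    env = dict(zip(names, combo))
    print(env)  # one session would be spawned per combination (8 in total)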

Note

When your resource limit and the cluster’s resource capacity cannot run all of the spawned sessions at the same time, some of the sessions may be queued and the command may take a long time to finish.

Warning

Until all cases finish, the client must keep its network connections to the server alive because this feature is implemented on the client side. Server-side batch job scheduling is under development!

Session Templates

Creating and starting session template

Users may define a commonly used set of session creation parameters as reusable templates.

A session template includes common session parameters such as resource slots, vfolder mounts, the kernel image to use, etc. It also supports an extra feature that automatically clones a Git repository upon startup as a bootstrap command.

The following sample shows what a session template looks like:

---
api_version: v1
kind: taskTemplate
metadata:
  name: template1234
  tag: example-tag
spec:
  kernel:
    environ:
      MYCONFIG: XXX
    git:
      branch: '19.09'
      commit: 10daee9e328876d75e6d0fa4998d4456711730db
      repository: https://github.com/lablup/backend.ai-agent
      destinationDir: /home/work/baiagent
    image: python:3.6-ubuntu18.04
  resources:
    cpu: '2'
    mem: 4g
  mounts:
    hostpath-test: /home/work/hostDesktop
    test-vfolder:
  sessionType: interactive

The backend.ai sesstpl command set provides the basic CRUD operations for user-specific session templates.

The create command accepts the YAML content either piped from the standard input or read from a file using the -f flag:

$ backend.ai sesstpl create < session-template.yaml
# -- or --
$ backend.ai sesstpl create -f session-template.yaml

Once the session template is uploaded, you may use it to start a new session:

$ backend.ai start-template <templateId>

substituting <templateId> with your template ID.

Other CRUD command examples are as follows:

$ backend.ai sesstpl update <templateId> < session-template.yaml
$ backend.ai sesstpl list
$ backend.ai sesstpl get <templateId>
$ backend.ai sesstpl delete <templateId>
Full syntax for task template
---
api_version or apiVersion: str, required
kind: Enum['taskTemplate', 'task_template'], required
metadata: required
  name: str, required
  tag: str (optional)
spec:
  type or sessionType: Enum['interactive', 'batch'] (optional), default=interactive
  kernel:
    image: str, required
    environ: map[str, str] (optional)
    run: (optional)
      bootstrap: str (optional)
      startup or startup_command or startupCommand: str (optional)
    git: (optional)
      repository: str, required
      commit: str (optional)
      branch: str (optional)
      credential: (optional)
        username: str
        password: str
      destination_dir or destinationDir: str (optional)
  mounts: map[str, str] (optional)
  resources: map[str, str] (optional)

Developer Guides

Client Session

Client Session Objects

This module is the first place to begin with your Python programs that use Backend.AI API functions.

The high-level API functions cannot be used alone – you must initiate a client session first because each session provides proxy attributes that represent API functions and run on the session itself.

To achieve this, during initialization session objects internally construct new types by combining the BaseFunction class with the attributes of each API function class, and bind the new types to themselves. Creating new types every time a new session instance is created may look weird, but it is the most convenient way to make the class methods of the API function classes work with specific session instances.

When designing your application, please note that session objects are intended to be long-lived, following the process’s lifecycle, rather than being created and disposed of for each API request.

class ai.backend.client.session.BaseSession(*, config=None, proxy_mode=False)

The base abstract class for sessions.

property proxy_mode: bool

If set True, it skips API version negotiation when opening the session.

abstractmethod open()

Initializes the session and performs version negotiation.

Return type:

Optional[Awaitable[None]]

abstractmethod close()

Terminates the session and releases underlying resources.

Return type:

Optional[Awaitable[None]]

property closed: bool

Checks if the session is closed.

property config: APIConfig

The configuration used by this session object.

class ai.backend.client.session.Session(*, config=None, proxy_mode=False)

A context manager for API client sessions that makes API requests synchronously. You may call simple request-response APIs like a plain Python function, but cannot use streaming APIs based on WebSocket and Server-Sent Events.

property closed: bool

Checks if the session is closed.

property config: APIConfig

The configuration used by this session object.

property proxy_mode: bool

If set True, it skips API version negotiation when opening the session.

open()

Initializes the session and performs version negotiation.

Return type:

None

close()

Terminates the session. It schedules the close() coroutine of the underlying aiohttp session and then enqueues a sentinel object to indicate termination. Then it waits for the worker thread to self-terminate by joining it.

Return type:

None

property worker_thread

The thread that internally executes the asynchronous implementations of the given API functions.

class ai.backend.client.session.AsyncSession(*, config=None, proxy_mode=False)

A context manager for API client sessions that makes API requests asynchronously. You may call all APIs as coroutines. WebSocket-based APIs and SSE-based APIs return special response types.

property closed: bool

Checks if the session is closed.

property config: APIConfig

The configuration used by this session object.

property proxy_mode: bool

If set True, it skips API version negotiation when opening the session.

open()

Initializes the session and performs version negotiation.

Return type:

Awaitable[None]

close()

Terminates the session and releases underlying resources.

Return type:

Awaitable[None]

Examples

Here are several examples to demonstrate the functional API usage.

Initialization of the API Client
Implicit configuration from environment variables
from ai.backend.client.session import Session

def main():
    with Session() as api_session:
        print(api_session.System.get_versions())

if __name__ == "__main__":
    main()
Explicit configuration
from ai.backend.client.config import APIConfig
from ai.backend.client.session import Session

def main():
    config = APIConfig(
        endpoint="https://api.backend.ai.local",
        endpoint_type="api",
        domain="default",
        group="default",  # the default project name to use
    )
    with Session(config=config) as api_session:
        print(api_session.System.get_versions())

if __name__ == "__main__":
    main()
Asyncio-native API session
import asyncio
from ai.backend.client.session import AsyncSession

async def main():
    async with AsyncSession() as api_session:
        print(api_session.System.get_versions())

if __name__ == "__main__":
    asyncio.run(main())

See also

The interface of API client session objects: ai.backend.client.session

Working with Compute Sessions

Note

From here, we omit the main() function structure in the sample code.

Listing currently running compute sessions
import functools
from ai.backend.client.session import Session

with Session() as api_session:
    fetch_func = functools.partial(
        api_session.ComputeSession.paginated_list,
        status="RUNNING",
    )
    current_offset = 0
    while True:
        result = fetch_func(page_offset=current_offset, page_size=20)
        if result.total_count == 0:
            # no items found
            break
        current_offset += len(result.items)
        for item in result.items:
            print(item)
        if current_offset >= result.total_count:
            # end of list
            break
Creating and destroying a compute session
from ai.backend.client.session import Session

with Session() as api_session:
    my_session = api_session.ComputeSession.get_or_create(
        "python:3.9-ubuntu20.04",      # registered container image name
        mounts=["mydata", "mymodel"],  # vfolder names
        resources={"cpu": 8, "mem": "32g", "cuda.device": 2},
    )
    print(my_session.id)
    my_session.destroy()
Accessing Container Applications

The launchable apps may vary across sessions. Here we illustrate an example that creates a ttyd (web-based terminal) app, which is available in all Backend.AI sessions.

Note

This example is only applicable to Backend.AI clusters with AppProxy v2 enabled and configured. AppProxy v2 only ships with the enterprise version of Backend.AI.

The ComputeSession.start_service() API
import requests

from ai.backend.client.session import Session

app_name = "ttyd"

with Session() as api_session:
    sess = api_session.ComputeSession.get_or_create(...)
    service_info = sess.start_service(app_name, login_session_token="dummy")
    app_proxy_url = f"{service_info['wsproxy_addr']}/v2/proxy/{service_info['token']}/{sess.id}/add?app={app_name}"
    resp = requests.get(app_proxy_url)
    body = resp.json()
    auth_url = body["url"]
    print(auth_url)  # opening this link in a browser will navigate the user to the terminal session

Set the value login_session_token to a dummy string like "dummy" as it is a trace of the legacy interface, which is no longer used.

Alternatively, in versions before 23.09.8, you may use the raw ai.backend.client.Request to call the server-side start_service API.

import asyncio

import aiohttp

from ai.backend.client.request import Request
from ai.backend.client.session import AsyncSession

app_name = "ttyd"

async def main():
    async with AsyncSession() as api_session:
        sess = api_session.ComputeSession.get_or_create(...)
        rqst = Request(
            "POST",
            f"/session/{sess.id}/start-service",
        )
        rqst.set_json({"app": app_name, "login_session_token": "dummy"})
        async with rqst.fetch() as resp:
            body = await resp.json()
            app_proxy_url = f"{body['wsproxy_addr']}/v2/proxy/{body['token']}/{sess.id}/add?app={app_name}"

        async with aiohttp.ClientSession() as client:
            async with client.get(app_proxy_url) as resp:
                body = await resp.json()
                auth_url = body["url"]
                print(auth_url)  # opening this link in a browser will navigate the user to the terminal session

if __name__ == "__main__":
    asyncio.run(main())
Code Execution via API
Synchronous mode
Snippet execution (query mode)

This is the minimal code to execute a code snippet with this client SDK.

import sys
from ai.backend.client.session import Session

with Session() as api_session:
    my_session = api_session.ComputeSession.get_or_create("python:3.9-ubuntu20.04")
    code = 'print("hello world")'
    mode = "query"
    run_id = None
    try:
        while True:
            result = my_session.execute(run_id, code, mode=mode)
            run_id = result["runId"]  # keeps track of this particular run loop
            for rec in result.get("console", []):
                if rec[0] == "stdout":
                    print(rec[1], end="", file=sys.stdout)
                elif rec[0] == "stderr":
                    print(rec[1], end="", file=sys.stderr)
                else:
                    handle_media(rec)
            sys.stdout.flush()
            if result["status"] == "finished":
                break
            else:
                mode = "continued"
                code = ""
    finally:
        my_session.destroy()

You need to take care of client_token because it determines whether to reuse kernel sessions or not. Backend.AI cloud has a timeout so that it terminates long-idle kernel sessions, but within the timeout, any kernel creation request with the same client_token lets Backend.AI cloud reuse the kernel.
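
As a minimal sketch of this behavior (the image name and token below are hypothetical example values), passing the same client_token within the idle timeout reuses the existing kernel session:

from ai.backend.client.session import Session

with Session() as api_session:
    # Repeated calls with the same client_token within the idle timeout
    # return the same kernel session instead of creating a new one.
    my_session = api_session.ComputeSession.get_or_create(
        "python:3.9-ubuntu20.04",
        client_token="my-experiment-1",  # hypothetical token
    )
    print(my_session.id)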

Script execution (batch mode)

You first need to upload the files after creating the session and construct an opts struct.

import sys
from ai.backend.client.session import Session

with Session() as session:
    compute_sess = session.ComputeSession.get_or_create("python:3.6-ubuntu18.04")
    compute_sess.upload(["mycode.py", "setup.py"])
    code = ""
    mode = "batch"
    run_id = None
    opts = {
        "build": "*",  # calls "python setup.py install"
        "exec": "python mycode.py arg1 arg2",
    }
    try:
        while True:
            result = compute_sess.execute(run_id, code, mode=mode, opts=opts)
            opts.clear()
            run_id = result["runId"]
            for rec in result.get("console", []):
                if rec[0] == "stdout":
                    print(rec[1], end="", file=sys.stdout)
                elif rec[0] == "stderr":
                    print(rec[1], end="", file=sys.stderr)
                else:
                    handle_media(rec)
            sys.stdout.flush()
            if result["status"] == "finished":
                break
            else:
                mode = "continued"
                code = ""
    finally:
        compute_sess.destroy()
Handling user inputs

Inside the while-loop for compute_sess.execute() above, change the if-block for result['status'] as follows (this also requires import getpass):

...
if result["status"] == "finished":
    break
elif result["status"] == "waiting-input":
    mode = "input"
    if result["options"].get("is_password", False):
        code = getpass.getpass()
    else:
        code = input()
else:
    mode = "continued"
    code = ""
...

A common gotcha is to miss setting mode = "input". Be careful!

Handling multi-media outputs

The handle_media() function used in the above examples would look like:

def handle_media(record):
    media_type = record[0]  # MIME-Type string
    media_data = record[1]  # content
    ...

The exact method to process media_data depends on the media_type. Currently the following behaviors are well-defined:

  • For (binary-format) images, the content is a dataURI-encoded string.

  • For SVG (scalable vector graphics) images, the content is an XML string.

  • For application/x-sorna-drawing, the content is a JSON string that represents a set of vector drawing commands to be replayed on the client side (e.g., by JavaScript in browsers)
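
Here is a minimal sketch of such a handler that persists the two well-defined image formats to local files; the file names, the base64-only dataURI assumption, and the fallback behavior are illustrative assumptions, not part of the SDK:

import base64

def handle_media(record):
    media_type = record[0]  # MIME-Type string
    media_data = record[1]  # content
    if media_type == "image/svg+xml":
        # SVG content is a plain XML string.
        with open("output.svg", "w") as f:
            f.write(media_data)
    elif media_type.startswith("image/") and media_data.startswith("data:"):
        # Binary images arrive as dataURI strings: "data:<mime>;base64,<payload>"
        # (this sketch assumes the base64-encoded form).
        _, _, payload = media_data.partition(",")
        with open("output.img", "wb") as f:
            f.write(base64.b64decode(payload))
    else:
        print(f"(unhandled media type: {media_type})")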

Asynchronous mode

The async version has all the sync-version interfaces as coroutines, but comes with additional features such as stream_execute(), which streams the execution results via WebSockets, and stream_pty() for interactive terminal streaming.

import asyncio
import getpass
import json
import sys
import aiohttp
from ai.backend.client.session import AsyncSession

async def main():
    async with AsyncSession() as api_session:
        compute_sess = await api_session.ComputeSession.get_or_create(
            "python:3.6-ubuntu18.04",
            client_token="mysession",
        )
        code = 'print("hello world")'
        mode = "query"
        try:
            async with compute_sess.stream_execute(code, mode=mode) as stream:
                # no need for explicit run_id since WebSocket connection represents it!
                async for result in stream:
                    if result.type != aiohttp.WSMsgType.TEXT:
                        continue
                    result = json.loads(result.data)
                    for rec in result.get("console", []):
                        if rec[0] == "stdout":
                            print(rec[1], end="", file=sys.stdout)
                        elif rec[0] == "stderr":
                            print(rec[1], end="", file=sys.stderr)
                        else:
                            handle_media(rec)
                    sys.stdout.flush()
                    if result["status"] == "finished":
                        break
                    elif result["status"] == "waiting-input":
                        mode = "input"
                        if result["options"].get("is_password", False):
                            code = getpass.getpass()
                        else:
                            code = input()
                        await stream.send_text(code)
                    else:
                        mode = "continued"
                        code = ""
        finally:
            await compute_sess.destroy()

if __name__ == "__main__":
    asyncio.run(main())

New in version 19.03.

Working with model service

In addition to a working AppProxy v2 deployment, model service requires a resource group configured to accept inference workloads.

Starting model service
from ai.backend.client.session import Session

with Session() as api_session:
    compute_sess = api_session.Service.create(
        "python:3.6-ubuntu18.04",
        "Llama2-70B",
        1,
        service_name="Llama2-service",
        resources={"cuda.shares": 2, "cpu": 8, "mem": "64g"},
        open_to_public=False,
    )

If you set open_to_public=True, the endpoint accepts anonymous traffic without the authentication token (see below).

Making request to model service endpoint
import requests

from ai.backend.client.session import Session

with Session() as api_session:
    compute_sess = api_session.Service.create(...)
    service_info = compute_sess.info()
    endpoint = service_info["url"]  # this value can be None if no successful inference service deployment has been made

    token_info = compute_sess.generate_api_token("3600s")
    token = token_info["token"]
    headers = {"Authorization": f"BackendAI {token}"}  # providing token is not required for public model services
    resp = requests.get(f"{endpoint}/v1/models", headers=headers)

The token returned by the generate_api_token() method is a JSON web token (JWT), which conveys all the information required to authenticate the inference request. Once generated, it cannot be revoked. A token may have its own expiration date/time. The lifetime of a token is configured by the user who deploys the inference model, and currently there are no intrinsic minimum/maximum limits on the lifetime.

New in version 23.09.

Testing

Unit Tests

Unit tests perform function-by-function tests to ensure their individual functionality. This test suite runs without depending on the server side, and thus it is executed in Travis CI for every push.

How to run
$ python -m pytest -m 'not integration' tests
Integration Tests

Integration tests combine multiple invocations of high-level interfaces to make underlying API requests to a running gateway server to test the full functionality of the client as well as the manager.

They are marked as “integration” by applying the @pytest.mark.integration decorator to each test case.

Warning

The integration tests actually make changes to the target gateway server and agents. If some tests fail, those changes may remain in an inconsistent state and require manual recovery, such as resetting the database and populating the fixtures again, though the test suite tries to clean them up properly.

So, DO NOT RUN it against your production server.

Prerequisite

Please refer to the README of the manager and agent repositories to set them up. To avoid an indefinite waiting time for pulling Docker images:

  • (manager) python -m ai.backend.manager.cli image rescan

  • (agent) docker pull

    • lablup/python:3.6-ubuntu18.04

    • lablup/lua:5.3-alpine3.8

The manager must also have at least the following active super-admin account in the default domain and the default group.

  • Example super-admin account:

    • User ID: admin@lablup.com

    • Password: wJalrXUt

    • Access key: AKIAIOSFODNN7EXAMPLE

    • Secret key: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

One or more testing-XXXX domains, one or more testing-XXXX groups, and one or more dummy users are created and used during the tests and destroyed after the tests finish. XXXX will be filled with random identifiers.

Tip

The halfstack configuration and the example-users.json, example-keypairs.json, and example-set-user-main-access-keys.json fixtures are compatible with this integration test suite.

How to run

Execute the gateway and at least one agent in their respective virtualenvs and hosts:

$ python -m ai.backend.gateway.server
$ python -m ai.backend.agent.server
$ python -m ai.backend.agent.watcher

Then run the tests:

$ export BACKEND_ENDPOINT=...
$ python -m pytest -m 'integration' tests

High-level Function Reference

Admin Functions

class ai.backend.client.func.admin.Admin

Provides the function interface for making admin GraphQL queries.

Note

Depending on the privilege of your API access key, you may or may not have access to querying/mutating server-side resources of other users.

classmethod await query(query, variables=None)

Sends the GraphQL query and returns the response.

Parameters:
  • query (str) – The GraphQL query string.

  • variables (Optional[Mapping[str, Any]]) – An optional key-value dictionary to fill the interpolated template variables in the query.

Return type:

Any

Returns:

The object parsed from the response JSON string.

Agent Functions

class ai.backend.client.func.agent.Agent

Provides a shortcut of Admin.query() that fetches various agent information.

Note

All methods in this function class require your API access key to have the admin privilege.

classmethod await paginated_list(status='ALIVE', scaling_group=None, *, fields=(FieldSpec(field_ref='id', humanized_name='ID', field_name='id', alt_name='id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status', humanized_name='Status', field_name='status', alt_name='status', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='scaling_group', humanized_name='Scaling Group', field_name='scaling_group', alt_name='scaling_group', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='available_slots', humanized_name='Available Slots', field_name='available_slots', alt_name='available_slots', formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={}), FieldSpec(field_ref='occupied_slots', humanized_name='Occupied Slots', field_name='occupied_slots', alt_name='occupied_slots', formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={})), page_offset=0, page_size=20, filter=None, order=None)

Lists the agents. You need an admin privilege for this operation.

Return type:

PaginatedResult

classmethod await detail(agent_id, fields=(FieldSpec(field_ref='id', humanized_name='ID', field_name='id', alt_name='id', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='status', humanized_name='Status', field_name='status', alt_name='status', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='scaling_group', humanized_name='Scaling Group', field_name='scaling_group', alt_name='scaling_group', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='addr', humanized_name='Addr', field_name='addr', alt_name='addr', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='region', humanized_name='Region', field_name='region', alt_name='region', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='first_contact', humanized_name='First Contact', field_name='first_contact', alt_name='first_contact', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='cpu_cur_pct', humanized_name='CPU Usage (%)', field_name='cpu_cur_pct', alt_name='cpu_cur_pct', formatter=<ai.backend.client.output.formatters.OutputFormatter object>, subfields={}), FieldSpec(field_ref='mem_cur_bytes', humanized_name='Used Memory (MiB)', field_name='mem_cur_bytes', alt_name='mem_cur_bytes', formatter=<ai.backend.client.output.formatters.MiBytesOutputFormatter object>, subfields={}), FieldSpec(field_ref='available_slots', humanized_name='Available Slots', field_name='available_slots', alt_name='available_slots', formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={}), FieldSpec(field_ref='occupied_slots', humanized_name='Occupied Slots', field_name='occupied_slots', alt_name='occupied_slots', formatter=<ai.backend.client.output.formatters.ResourceSlotFormatter object>, subfields={}), FieldSpec(field_ref='local_config', humanized_name='Local Config', field_name='local_config', alt_name='local_config', formatter=<ai.backend.client.output.formatters.NestedDictOutputFormatter object>, subfields={})))
Return type:

Sequence[dict]

Auth Functions

class ai.backend.client.func.auth.Auth

Provides the function interface for login session management and authorization.

classmethod await login(user_id, password, otp=None)

Logs in to the endpoint with the given user ID and password. It creates a server-side web session and returns a dictionary with an "authenticated" boolean field and JSON-encoded raw cookie data.

Return type:

dict

classmethod await logout()

Logs out from the endpoint. It clears the server-side web session.

Return type:

None

classmethod await update_password(old_password, new_password, new_password2)

Updates the user’s password. This API works only for the account owner.

Return type:

dict

classmethod await update_password_no_auth(domain, user_id, current_password, new_password)

Updates the user’s password. This is used only to update an EXPIRED password. This function sends a request to the manager.

Return type:

dict

classmethod await update_password_no_auth_in_session(user_id, current_password, new_password)

Updates the user’s password. This is used only to update an EXPIRED password. This function sends a request to the webserver.

Return type:

dict

Configuration

ai.backend.client.config.get_env(key, default=Undefined.token, *, clean=<function default_clean>)

Retrieves a configuration value from the environment variables. The given key is uppercased and prefixed by "BACKEND_" and then "SORNA_" if the former does not exist.

Parameters:
  • key (str) – The key name.

  • default (Union[str, Mapping, Undefined]) – The default value returned when there is no corresponding environment variable.

  • clean (Callable[[Any], TypeVar(T)]) – A single-argument function that is applied to the result of the lookup (both for successful lookups and for the default value on failures). The default returns the value as-is.

Return type:

TypeVar(T)

Returns:

The value processed by the clean function.
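
For example, a minimal usage sketch (the default value here mirrors the documented default endpoint):

from ai.backend.client.config import get_env

# Looks up BACKEND_ENDPOINT first, then the legacy SORNA_ENDPOINT,
# and falls back to the given default when neither is set.
endpoint = get_env("ENDPOINT", "https://api.cloud.backend.ai")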

ai.backend.client.config.get_config()

Returns the configuration for the current process. If there is no explicitly set APIConfig instance, it will generate a new one from the current environment variables and defaults.

Return type:

APIConfig

ai.backend.client.config.set_config(conf)

Sets the configuration used throughout the current process.

Return type:

None

class ai.backend.client.config.APIConfig(*, endpoint=None, endpoint_type=None, domain=None, group=None, storage_proxy_address_map=None, version=None, user_agent=None, access_key=None, secret_key=None, hash_type=None, vfolder_mounts=None, skip_sslcert_validation=None, connection_timeout=None, read_timeout=None, announcement_handler=None)

Represents a set of API client configurations. The access key and secret key are mandatory – they must be set either via environment variables or as explicit arguments.

Parameters:
  • endpoint (Union[URL, str]) – The URL prefix to make API requests via HTTP/HTTPS. If this is given as a str containing multiple URLs separated by commas, the underlying HTTP request-response facility performs client-side load balancing and automatic fail-over across them, assuming that all the URLs indicate a single, same cluster. API and CLI users will see network connection errors only when all of the given endpoints fail; intermittent failures of a subset of endpoints are hidden at the cost of slightly increased latency.

  • endpoint_type (str) – Either "api" or "session". If the endpoint type is "api" (the default if unspecified), it uses the access key and secret key in the configuration to access the manager API server directly. If the endpoint type is "session", it assumes the endpoint is a Backend.AI console server which provides cookie-based authentication with username and password. In the latter case, users need to use backend.ai login and backend.ai logout to manage their sign-in status, or the API equivalents login() and logout().

  • version (str) – The API protocol version.

  • user_agent (str) – A custom user-agent string which is sent to the API server as a User-Agent HTTP header.

  • access_key (str) – The API access key. If deliberately set to an empty string, the API requests will be made without signatures (anonymously).

  • secret_key (str) – The API secret key.

  • hash_type (str) – The hash type to generate per-request authentication signatures.

  • vfolder_mounts (Iterable[str]) – A list of vfolder names (that must belong to the given access key) to be automatically mounted upon any Kernel.get_or_create() calls.

DEFAULTS: Mapping[str, Union[str, Mapping]] = {'connection_timeout': '10.0', 'domain': 'default', 'endpoint': 'https://api.cloud.backend.ai', 'endpoint_type': 'api', 'group': 'default', 'hash_type': 'sha256', 'read_timeout': '0', 'storage_proxy_address_map': {}, 'version': 'v8.20240315'}

The default values for config parameters settable via environment variables except the access and secret keys.
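
A sketch of configuring the client explicitly instead of relying on environment variables; the endpoint and keys are placeholders:

from ai.backend.client.config import APIConfig, get_config, set_config

config = APIConfig(
    endpoint="https://api.example.com",  # placeholder endpoint
    access_key="AKIA-PLACEHOLDER",       # placeholder access key
    secret_key="SECRET-PLACEHOLDER",     # placeholder secret key
)
set_config(config)
assert get_config() is config  # subsequent API calls now use this config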

property endpoint: URL

The currently active endpoint URL. This may change if there are multiple configured endpoints and the current one is not accessible.

property endpoints: Sequence[URL]

All configured endpoint URLs.

property endpoint_type: str

The configured endpoint type.

property domain: str

The configured domain.

property group: str

The configured group.

property storage_proxy_address_map: Mapping[str, str]

The storage proxy address map for overriding.

property user_agent: str

The configured user agent string.

property access_key: str

The configured API access key.

property secret_key: str

The configured API secret key.

property version: str

The configured API protocol version.

property hash_type: str

The configured hash algorithm for API authentication signatures.

property vfolder_mounts: Sequence[str]

The configured auto-mounted vfolder list.

property skip_sslcert_validation: bool

Whether to skip SSL certificate validation for the API gateway.

property connection_timeout: float

The maximum allowed duration for making TCP connections to the server.

property read_timeout: float

The maximum allowed waiting time for the first byte of the response from the server.

property announcement_handler: Callable[[str], None] | None

The announcement handler to display server-set announcements.

KeyPair Functions

class ai.backend.client.func.keypair.KeyPair(access_key)

Provides interactions with keypairs.

classmethod await create(user_id, is_active=True, is_admin=False, resource_policy=Undefined.TOKEN, rate_limit=Undefined.TOKEN, fields=<default fields: access_key, secret_key>)

Creates a new keypair with the given options. You need an admin privilege for this operation.

Return type:

dict

classmethod await update(access_key, is_active=Undefined.TOKEN, is_admin=Undefined.TOKEN, resource_policy=Undefined.TOKEN, rate_limit=Undefined.TOKEN)

Updates an existing keypair with the given options. You need an admin privilege for this operation.

Return type:

dict

classmethod await delete(access_key)

Deletes an existing keypair with the given access key.

classmethod await list(user_id=None, is_active=None, fields=<default fields: user_id, access_key, secret_key, is_active, is_admin, created_at>)

Lists the keypairs. You need an admin privilege for this operation.

Return type:

Sequence[dict]

classmethod await paginated_list(is_active=None, domain_name=None, *, user_id=None, fields=<default fields: user_id, access_key, secret_key, is_active, is_admin, created_at>, page_offset=0, page_size=20, filter=None, order=None)

Lists the keypairs. You need an admin privilege for this operation.

Return type:

PaginatedResult[dict]

await info(fields=<default fields: user_id, access_key, secret_key, is_active, is_admin>)

Returns the keypair’s information such as resource limits.

Parameters:

fields (Sequence[FieldSpec]) – Additional per-keypair query fields to fetch.

Return type:

dict

New in version 18.12.

classmethod await activate(access_key)

Activates this keypair. You need an admin privilege for this operation.

Return type:

dict

classmethod await deactivate(access_key)

Deactivates this keypair. Deactivated keypairs cannot make any API requests unless activated again by an administrator. You need an admin privilege for this operation.

Return type:

dict
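
A sketch of typical admin-side keypair management, assuming an admin-privileged Session configured via environment variables and that the returned dicts are keyed by the requested field names:

from ai.backend.client.session import Session

with Session() as sess:
    # Create a keypair for an existing user (admin only).
    created = sess.KeyPair.create("user@example.com")
    print(created)

    # List active keypairs.
    for kp in sess.KeyPair.list(is_active=True):
        print(kp["user_id"], kp["access_key"])

    # Deactivate a keypair by its access key (placeholder shown here).
    sess.KeyPair.deactivate("AKIA-PLACEHOLDER")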

Manager Functions

class ai.backend.client.func.manager.Manager

Provides control over the gateway/manager servers.

New in version 18.12.

classmethod await status()

Returns the current status of the configured API server.

classmethod await freeze(force_kill=False)

Freezes the configured API server. API clients will no longer be able to create new compute sessions, nor create or modify vfolders, keypairs, etc. This is used to enter the server’s maintenance mode for unobtrusive manager and/or agent upgrades.

Parameters:

force_kill (bool) – If set to True, immediately and forcibly shuts down all running compute sessions. If not set, clients with running compute sessions can still interact with them, although they cannot create new compute sessions.

classmethod await unfreeze()

Unfreezes the configured API server so that it resumes to normal operation.

classmethod await get_announcement()

Gets the current announcement.

classmethod await update_announcement(enabled=True, message=None)

Updates (creates or deletes) the announcement.

Parameters:
  • enabled (bool) – If set to False, deletes the announcement.

  • message (str) – Announcement message. Required if enabled is True.

classmethod await scheduler_op(op, args)

Performs a scheduler operation.

Parameters:
  • op (str) – The name of the scheduler operation.

  • args (Any) – Arguments specific to the given operation.
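
A sketch of a maintenance-mode workflow combining the manager functions above, assuming an admin-privileged Session:

from ai.backend.client.session import Session

with Session() as sess:
    print(sess.Manager.status())
    # Announce the maintenance, freeze the API server, then restore it.
    sess.Manager.update_announcement(enabled=True, message="Maintenance at 22:00 UTC")
    sess.Manager.freeze()
    # ... perform manager/agent upgrades here ...
    sess.Manager.unfreeze()
    sess.Manager.update_announcement(enabled=False)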

Scaling Group Functions

class ai.backend.client.func.scaling_group.ScalingGroup(name)

Provides access to the scaling-group information relevant to the current user.

The scaling-group is an opaque server-side configuration which splits the whole cluster into several partitions, so that server administrators can apply different auto-scaling policies and operation standards to each partition of agents.

classmethod await list_available(group)

Lists the scaling groups available to the current user, considering the user, the user’s domain, and the designated user group.

classmethod await list(fields=<default fields: name, description, is_active, is_public, created_at, driver, scheduler, use_host_network, wsproxy_addr, wsproxy_api_token>)

Lists the scaling groups available to the current user, considering the user, the user’s domain, and the designated user group.

Return type:

Sequence[dict]

classmethod await detail(name, fields=<default fields: name, description, is_active, is_public, created_at, driver, driver_opts, scheduler, scheduler_opts, use_host_network, wsproxy_addr, wsproxy_api_token>)

Fetches information about a scaling group by its name.

Parameters:
  • name (str) – Name of the scaling group.

  • fields (Sequence[FieldSpec]) – Additional per-scaling-group query fields.

Return type:

dict

classmethod await create(name, *, description='', is_active=True, is_public=True, driver, driver_opts=Undefined.TOKEN, scheduler, scheduler_opts=Undefined.TOKEN, use_host_network=False, wsproxy_addr=None, wsproxy_api_token=None, fields=None)

Creates a new scaling group with the given options.

Return type:

dict

classmethod await update(name, *, description=Undefined.TOKEN, is_active=Undefined.TOKEN, is_public=Undefined.TOKEN, driver=Undefined.TOKEN, driver_opts=Undefined.TOKEN, scheduler=Undefined.TOKEN, scheduler_opts=Undefined.TOKEN, use_host_network=Undefined.TOKEN, wsproxy_addr=Undefined.TOKEN, wsproxy_api_token=Undefined.TOKEN, fields=None)

Updates an existing scaling group.

Return type:

dict

classmethod await delete(name)

Deletes an existing scaling group.

classmethod await associate_domain(scaling_group, domain)

Associates scaling_group with domain.

Parameters:
  • scaling_group (str) – The name of a scaling group.

  • domain (str) – The name of a domain.

classmethod await dissociate_domain(scaling_group, domain)

Dissociates scaling_group from domain.

Parameters:
  • scaling_group (str) – The name of a scaling group.

  • domain (str) – The name of a domain.

classmethod await dissociate_all_domain(domain)

Dissociates all scaling groups from domain.

Parameters:

domain (str) – The name of a domain.

classmethod await associate_group(scaling_group, group_id)

Associates scaling_group with group.

Parameters:
  • scaling_group (str) – The name of a scaling group.

  • group_id (str) – The ID of a group.

classmethod await dissociate_group(scaling_group, group_id)

Dissociates scaling_group from group.

Parameters:
  • scaling_group (str) – The name of a scaling group.

  • group_id (str) – The ID of a group.

classmethod await dissociate_all_group(group_id)

Dissociates all scaling groups from group.

Parameters:

group_id (str) – The ID of a group.
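
A sketch of creating a scaling group and associating it with a domain and a group, assuming admin privileges; the group name, driver, scheduler, and group ID are placeholders:

from ai.backend.client.session import Session

with Session() as sess:
    sess.ScalingGroup.create(
        "gpu-partition",       # placeholder scaling group name
        description="GPU nodes",
        driver="static",       # placeholder driver name
        scheduler="fifo",      # placeholder scheduler name
    )
    sess.ScalingGroup.associate_domain("gpu-partition", "default")
    sess.ScalingGroup.associate_group("gpu-partition", "00000000-placeholder-group-id")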

ComputeSession Functions

class ai.backend.client.func.session.ComputeSession(name, owner_access_key=None)

Provides various interactions with compute sessions in Backend.AI.

The term ‘kernel’ is now deprecated in favor of ‘compute session’. However, for historical reasons and to avoid confusion with client sessions, this API function class keeps its original naming for backward compatibility.

For multi-container sessions, all methods affect only the master container, except the destroy() and restart() methods. It is thus the user’s responsibility to distribute uploaded files to multiple containers, either by explicit copies or by virtual folders commonly mounted to all containers belonging to the same compute session.

classmethod await paginated_list(status=None, access_key=None, *, fields=<default fields: id, image, type, status, status_info, status_changed, result, abusing_reports>, page_offset=0, page_size=20, filter=None, order=None)

Fetches the list of sessions.

Parameters:
  • status (str) – Fetches sessions in a specific status (PENDING, SCHEDULED, PULLING, PREPARING, RUNNING, RESTARTING, RUNNING_DEGRADED, TERMINATING, TERMINATED, ERROR, CANCELLED).

  • fields (Sequence[FieldSpec]) – Additional per-session query fields to fetch.

Return type:

PaginatedResult[dict]

classmethod await get_or_create(image, *, name=None, type_='interactive', starts_at=None, enqueue_only=False, max_wait=0, no_reuse=False, dependencies=None, callback_url=None, mounts=None, mount_map=None, envs=None, startup_command=None, resources=None, resource_opts=None, cluster_size=1, cluster_mode=ClusterMode.SINGLE_NODE, domain_name=None, group_name=None, bootstrap_script=None, tag=None, architecture='x86_64', scaling_group=None, owner_access_key=None, preopen_ports=None, assign_agent=None)

Get-or-creates a compute session. If name is None, it creates a new compute session as long as the server has enough resources and your API key has remaining quota. If name is a valid string and there is an existing compute session with the same token and the same image, then it returns the ComputeSession instance representing the existing session.

Parameters:
  • image (str) – The image name and tag for the compute session. Example: python:3.6-ubuntu. Check out the full list of available images in your server using (TODO: new API).

  • name (str) –

    A client-side (user-defined) identifier to distinguish the session among currently running sessions. It may be used to seamlessly reuse the session already created.

    Changed in version 19.12.0: Renamed from clientSessionToken.

  • type

    Either "interactive" (default) or "batch".

    New in version 19.09.0.

  • enqueue_only (bool) –

    Just enqueue the session creation request and return immediately, without waiting for its startup. (default: false to preserve the legacy behavior)

    New in version 19.09.0.

  • max_wait (int) –

    The time to wait for session startup. If the cluster resource is being fully utilized, this waiting time can be arbitrarily long due to job queueing. If the timeout is reached, the returned status field becomes "TIMEOUT". Even in this case, the session may still start later.

    New in version 19.09.0.

  • no_reuse (bool) –

    Raises an explicit error if a session with the same image and the same name already exists, instead of returning its information.

    New in version 19.09.0.

  • mounts (List[str]) – The list of vfolder names that belong to the current API access key.

  • mount_map (Mapping[str, str]) – A mapping of custom mount paths for vfolders, where each key is a vfolder name and each value is its mount path. Default mounts and relative paths are placed under /home/work; to use different locations, specify absolute paths. The target mount paths must not overlap with Linux system directories. vfolders whose names have a dot (.) prefix are not affected.

  • envs (Mapping[str, str]) – The environment variables which always bypass the jail policy.

  • resources (Mapping[str, str | int]) – The resource specification. (TODO: details)

  • cluster_size (int) –

    The number of containers in this compute session. Must be at least 1.

    New in version 19.09.0.

    Changed in version 20.09.0.

  • cluster_mode (ClusterMode) –

    Sets the clustering mode: whether to spawn the new session’s multiple containers across distributed nodes or within a single node.

    New in version 20.09.0.

  • tag (str) – An optional string to annotate extra information.

  • owner – An optional access key that owns the created session. (Only available to administrators)

Return type:

ComputeSession

Returns:

The ComputeSession instance.
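
A sketch of creating (or reusing) a session and cleaning it up afterwards; the image reference and resource values are placeholders:

from ai.backend.client.session import Session

with Session() as sess:
    compute = sess.ComputeSession.get_or_create(
        "python:3.6-ubuntu",            # placeholder image reference
        name="my-experiment",
        resources={"cpu": 2, "mem": "4g"},
    )
    try:
        print(compute.get_info())
    finally:
        compute.destroy()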

classmethod await create_from_template(template_id, *, name=Undefined.TOKEN, type_=Undefined.TOKEN, starts_at=None, enqueue_only=Undefined.TOKEN, max_wait=Undefined.TOKEN, dependencies=None, callback_url=Undefined.TOKEN, no_reuse=Undefined.TOKEN, image=Undefined.TOKEN, mounts=Undefined.TOKEN, mount_map=Undefined.TOKEN, envs=Undefined.TOKEN, startup_command=Undefined.TOKEN, resources=Undefined.TOKEN, resource_opts=Undefined.TOKEN, cluster_size=Undefined.TOKEN, cluster_mode=Undefined.TOKEN, domain_name=Undefined.TOKEN, group_name=Undefined.TOKEN, bootstrap_script=Undefined.TOKEN, tag=Undefined.TOKEN, scaling_group=Undefined.TOKEN, owner_access_key=Undefined.TOKEN)

Get-or-creates a compute session from a template. Any other parameters provided will override the template’s values, including vfolder mounts (they replace, not append!). If name is None, it creates a new compute session as long as the server has enough resources and your API key has remaining quota. If name is a valid string and there is an existing compute session with the same token and the same image, then it returns the ComputeSession instance representing the existing session.

Parameters:
  • template_id (str) – The task template to apply to the compute session.

  • image (str | Undefined) – The image name and tag for the compute session. Example: python:3.6-ubuntu. Check out the full list of available images in your server using (TODO: new API).

  • name (str | Undefined) –

    A client-side (user-defined) identifier to distinguish the session among currently running sessions. It may be used to seamlessly reuse the session already created.

    Changed in version 19.12.0: Renamed from clientSessionToken.

  • type

    Either "interactive" (default) or "batch".

    New in version 19.09.0.

  • enqueue_only (bool | Undefined) –

    Just enqueue the session creation request and return immediately, without waiting for its startup. (default: false to preserve the legacy behavior)

    New in version 19.09.0.

  • max_wait (int | Undefined) –

    The time to wait for session startup. If the cluster resource is being fully utilized, this waiting time can be arbitrarily long due to job queueing. If the timeout is reached, the returned status field becomes "TIMEOUT". Even in this case, the session may still start later.

    New in version 19.09.0.

  • no_reuse (bool | Undefined) –

    Raises an explicit error if a session with the same image and the same name already exists, instead of returning its information.

    New in version 19.09.0.

  • mounts (Union[List[str], Undefined]) – The list of vfolder names that belong to the current API access key.

  • mount_map (Union[Mapping[str, str], Undefined]) – A mapping of custom mount paths for vfolders, where each key is a vfolder name and each value is its mount path. Default mounts and relative paths are placed under /home/work; to use different locations, specify absolute paths. The target mount paths must not overlap with Linux system directories. vfolders whose names have a dot (.) prefix are not affected.

  • envs (Union[Mapping[str, str], Undefined]) – The environment variables which always bypass the jail policy.

  • resources (Union[Mapping[str, str | int], Undefined]) – The resource specification. (TODO: details)

  • cluster_size (int | Undefined) –

    The number of containers in this compute session. Must be at least 1.

    New in version 19.09.0.

    Changed in version 20.09.0.

  • cluster_mode (ClusterMode | Undefined) –

    Sets the clustering mode: whether to spawn the new session’s multiple containers across distributed nodes or within a single node.

    New in version 20.09.0.

  • tag (str | Undefined) – An optional string to annotate extra information.

  • owner – An optional access key that owns the created session. (Only available to administrators)

Return type:

ComputeSession

Returns:

The ComputeSession instance.

await destroy(*, forced=False, recursive=False)

Destroys the compute session. Since the server literally kills the container(s), all ongoing executions are forcibly interrupted.

await restart()

Restarts the compute session. The server force-destroys the current running container(s), but keeps their temporary scratch directories intact.

await rename(new_id)

Renames the session ID of the running compute session.

await commit()

Commits a running session to a tar file on the agent host.

await interrupt()

Tries to interrupt the current ongoing code execution. This may fail without any explicit errors depending on the code being executed.

await complete(code, opts=None)

Gets the auto-completion candidates from the given code string, as if a user has pressed the tab key just after the code in IDEs.

Depending on the language of the compute session, this feature may not be supported. Unsupported sessions return an empty list.

Parameters:
  • code (str) – An (incomplete) code text.

  • opts (dict) – Additional information about the current cursor position, such as row, col, line and the remainder text.

Return type:

Iterable[str]

Returns:

An ordered list of strings.

await get_info()

Retrieves brief information about the compute session.

await get_logs()

Retrieves the console log of the compute session container.

await get_dependency_graph()

Retrieves the root node of the compute session’s dependency graph.

await get_status_history()

Retrieves the status transition history of the compute session.

await execute(run_id=None, code=None, mode='query', opts=None)

Executes a code snippet directly in the compute session or sends a set of build/clean/execute commands to the compute session.

For more details about using this API, please refer to the official API documentation.

Parameters:
  • run_id (str) – A unique identifier for a particular run loop. In the first call, it may be None so that the server auto-assigns one. Subsequent calls must use the returned runId value to request continuation or to send user inputs.

  • code (str) – A code snippet as string. In the continuation requests, it must be an empty string. When sending user inputs, this is where the user input string is stored.

  • mode (str) – A constant string which is one of "query", "batch", "continue", and "user-input".

  • opts (dict) – A dict for specifying additional options. Mainly used in the batch mode to specify build/clean/execution commands. See the API object reference for details.

Returns:

An execution result object
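
A sketch of the query-mode run loop; the result keys used here (runId, status, console) follow the official API documentation but should be treated as assumptions:

from ai.backend.client.session import Session

with Session() as sess:
    compute = sess.ComputeSession.get_or_create("python:3.6-ubuntu")
    run_id = None
    code = 'print("hello")'
    while True:
        mode = "query" if run_id is None else "continue"
        result = compute.execute(run_id, code, mode=mode)
        run_id = result["runId"]
        for record in result.get("console", []):
            print(record)
        if result["status"] == "finished":
            break
        code = ""  # continuation requests must send an empty code string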

await upload(files, basedir=None, show_progress=False)

Uploads the given list of files to the compute session. You may refer to them in batch-mode execution or from code executed on the server afterwards.

Parameters:
  • files (Sequence[str | Path]) –

    The list of file paths on the client side. If the paths include directories, their location in the compute session is calculated from their paths relative to basedir, and all intermediate parent directories are created automatically if they do not exist.

    For example, if a file path is /home/user/test/data.txt (or test/data.txt) where basedir is /home/user (or the current working directory is /home/user), the uploaded file is located at /home/work/test/data.txt in the compute session container.

  • basedir (Union[str, Path, None]) – The directory prefix where the files reside. The default value is the current working directory.

  • show_progress (bool) – Displays a progress bar during uploads.

await download(files, dest='.', show_progress=False)

Downloads the given list of files from the compute session.

Parameters:
  • files (Sequence[str | Path]) – The list of file paths in the compute session. If they are relative paths, the path is calculated from /home/work in the compute session container.

  • dest (str | Path) – The destination directory in the client-side.

  • show_progress (bool) – Displays a progress bar during downloads.

await list_files(path='.')

Gets the list of files in the given path inside the compute session container.

Parameters:

path (str | Path) – The directory path in the compute session.
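
A sketch of the file-transfer round trip following the path rules described above:

from ai.backend.client.session import Session

with Session() as sess:
    compute = sess.ComputeSession.get_or_create("python:3.6-ubuntu")
    # Uploads ./test/data.txt to /home/work/test/data.txt in the container.
    compute.upload(["test/data.txt"], show_progress=True)
    print(compute.list_files("test"))
    # Downloads it back into ./downloaded/ on the client side.
    compute.download(["test/data.txt"], dest="./downloaded")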

await get_abusing_report()

Retrieves the abuse reports of the session’s sibling kernels.

await start_service(app, *, port=Undefined.TOKEN, envs=Undefined.TOKEN, arguments=Undefined.TOKEN, login_session_token=Undefined.TOKEN)

Starts an application from the Backend.AI session and returns the credentials to access the App Proxy endpoint.

Return type:

Mapping[str, Any]

listen_events(scope='*')

Opens the stream of the kernel lifecycle events. Only the master kernel of each session is monitored.

Return type:

SSEContextManager

Returns:

a StreamEvents object.

stream_events(scope='*')

Opens the stream of the kernel lifecycle events. Only the master kernel of each session is monitored.

Return type:

SSEContextManager

Returns:

a StreamEvents object.

stream_pty()

Opens a pseudo-terminal of the kernel (if supported) streamed via websockets.

Return type:

WebSocketContextManager

Returns:

a StreamPty object.

stream_execute(code='', *, mode='query', opts=None)

Executes a code snippet in the streaming mode. Since the returned websocket represents a run loop, there is no need to specify run_id explicitly.

Return type:

WebSocketContextManager
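
A sketch of consuming lifecycle events with AsyncSession; that the SSE stream is async-iterable and yields objects with event and data attributes is an assumption here:

import asyncio

from ai.backend.client.session import AsyncSession

async def watch(name: str) -> None:
    async with AsyncSession() as sess:
        compute = await sess.ComputeSession.get_or_create(
            "python:3.6-ubuntu", name=name,
        )
        async with compute.listen_events() as events:
            async for ev in events:  # assumption: async-iterable stream
                print(ev.event, ev.data)
                if ev.event == "session_terminated":
                    break

asyncio.run(watch("my-experiment"))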

class ai.backend.client.func.session.StreamPty(session, underlying_response, **kwargs)

A derivative class of WebSocketResponse which provides additional functions to control the terminal.

Session Template Functions

class ai.backend.client.func.session_template.SessionTemplate(template_id, owner_access_key=None)
classmethod await create(template, domain_name=None, group_name=None, owner_access_key=None)
Return type:

SessionTemplate

classmethod await list_templates(list_all=False)
Return type:

List[Mapping[str, str]]

await get(body_format='yaml')
Return type:

str

await put(template)
Return type:

Any

await delete()
Return type:

Any

Virtual Folder Functions

class ai.backend.client.func.vfolder.VFolder(name, id=None)
classmethod await create(name, host=None, unmanaged_path=None, group=None, usage_mode='general', permission='rw', cloneable=False)
classmethod await delete_by_id(oid)
classmethod await list(list_all=False)
classmethod await paginated_list(group=None, *, fields=<default fields: host, name, status, created_at, creator, group, permission, ownership_type>, page_offset=0, page_size=20, filter=None, order=None)

Fetches the list of vfolders. Domain admins can only fetch the vfolders of their own domain.

Parameters:
  • group (str) – Fetch vfolders in a specific group.

  • fields (Sequence[FieldSpec]) – Additional per-vfolder query fields to fetch.

Return type:

PaginatedResult[dict]

classmethod await paginated_own_list(*, fields=<default fields: host, name, status, created_at, creator, group, permission, ownership_type>, page_offset=0, page_size=20, filter=None, order=None)

Fetches the list of own vfolders.

Parameters:

fields (Sequence[FieldSpec]) – Additional per-vfolder query fields to fetch.

Return type:

PaginatedResult[dict]

classmethod await paginated_invited_list(*, fields=<default fields: host, name, status, created_at, creator, group, permission, ownership_type>, page_offset=0, page_size=20, filter=None, order=None)

Fetches the list of invited vfolders.

Parameters:

fields (Sequence[FieldSpec]) – Additional per-vfolder query fields to fetch.

Return type:

PaginatedResult[dict]

classmethod await paginated_project_list(*, fields=<default fields: host, name, status, created_at, creator, group, permission, ownership_type>, page_offset=0, page_size=20, filter=None, order=None)

Fetches the list of project (group) vfolders.

Parameters:

fields (Sequence[FieldSpec]) – Additional per-vfolder query fields to fetch.

Return type:

PaginatedResult[dict]

classmethod await list_hosts()
classmethod await list_all_hosts()
classmethod await list_allowed_types()
await info()
await delete()
await purge()
Return type:

Mapping[str, Any]

await recover()
await restore()
await delete_trash()
Return type:

Mapping[str, Any]

await rename(new_name)
await download(relative_paths, *, basedir=None, dst_dir=None, chunk_size=16777216, show_progress=False, address_map=None, max_retries=20)
Return type:

None

await upload(sources, *, basedir=None, recursive=False, dst_dir=None, chunk_size=16777216, address_map=None, show_progress=False)
Return type:

None

await mkdir(path, parents=False, exist_ok=False)
Return type:

ResultSet

await rename_file(target_path, new_name)
await move_file(src_path, dst_path)
await delete_files(files, recursive=False)
await list_files(path='.')
await invite(perm, emails)
classmethod await invitations()
classmethod await accept_invitation(inv_id)
classmethod await delete_invitation(inv_id)
classmethod await get_fstab_contents(agent_id=None)
classmethod await get_performance_metric(folder_host)
classmethod await list_mounts()
classmethod await mount_host(name, fs_location, options=None, edit_fstab=False)
classmethod await umount_host(name, edit_fstab=False)
await share(perm, emails)
await unshare(emails)
await leave(shared_user_uuid=None)
await clone(target_name, target_host=None, usage_mode='general', permission='rw')
await update_options(name, permission=None, cloneable=None)
classmethod await list_shared_vfolders()
classmethod await shared_vfolder_info(vfolder_id)
classmethod await update_shared_vfolder(vfolder, user, perm=None)
classmethod await change_vfolder_ownership(vfolder, user_email)
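
A sketch of a vfolder round trip (create, upload, then mount into a session); the vfolder name and file paths are placeholders:

from ai.backend.client.session import Session

with Session() as sess:
    sess.VFolder.create("datasets")  # placeholder vfolder name
    vf = sess.VFolder("datasets")
    vf.upload(["local/data.csv"])
    print(vf.list_files("."))
    # Mount the vfolder into a new compute session.
    compute = sess.ComputeSession.get_or_create(
        "python:3.6-ubuntu", mounts=["datasets"],
    )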

Low-level SDK Reference

Base Function

This module defines a few utilities that hide the complexity of supporting both synchronous and asynchronous API functions, using some tricks with Python metaclasses.

Unless you are contributing to the client SDK, you probably won’t have to use this module directly.

class ai.backend.client.func.base.APIFunctionMeta(name, bases, attrs, **kwargs)

Converts all methods marked with api_function() into session-aware methods that are either plain Python functions or coroutines.

mro()

Return a type’s method resolution order.

class ai.backend.client.func.base.BaseFunction
@ai.backend.client.func.base.api_function(meth)

Marks the wrapped method as an API function method.

Request API

This module provides low-level API request/response interfaces based on aiohttp.

Depending on the session object from which the request is made, Request and Response adapt their behavior: they work as plain Python functions or return awaitables.

class ai.backend.client.request.Request(method='GET', path=None, content=None, *, content_type=None, params=None, reporthook=None, override_api_version=None)

The API request object.

with async with fetch(**kwargs) as Response

Sends the request to the server and reads the response.

You may use this method with either a plain synchronous Session or an AsyncSession. Both of the following patterns are valid:

from ai.backend.client.request import Request
from ai.backend.client.session import Session

with Session() as sess:
  rqst = Request('GET', ...)
  with rqst.fetch() as resp:
    print(resp.text())

from ai.backend.client.request import Request
from ai.backend.client.session import AsyncSession

async with AsyncSession() as sess:
  rqst = Request('GET', ...)
  async with rqst.fetch() as resp:
    print(await resp.text())
Return type:

FetchContextManager

async with connect_websocket(**kwargs) as WebSocketResponse or its derivatives

Creates a WebSocket connection.

Return type:

WebSocketContextManager

Warning

This method only works with AsyncSession.

property content: bytes | bytearray | str | StreamReader | IOBase | None

Retrieves the content in its original form. Internal code should NOT use this, as it incurs duplicate encoding/decoding.

set_content(value, *, content_type=None)

Sets the content of the request.

Return type:

None

set_json(value)

A shortcut for set_content() with JSON objects.

Return type:

None
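
A sketch of composing a raw POST request with a JSON body; the API path here is hypothetical:

from ai.backend.client.request import Request
from ai.backend.client.session import Session

with Session() as sess:
    rqst = Request('POST', '/folders')  # hypothetical API path
    rqst.set_json({'name': 'datasets', 'host': 'local'})
    with rqst.fetch() as resp:
        print(resp.text())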

attach_files(files)

Attaches a list of files represented as AttachedFile.

Return type:

None

connect_events(**kwargs)

Creates a Server-Sent Events connection.

Return type:

SSEContextManager

Warning

This method only works with AsyncSession.

class ai.backend.client.request.Response(session, underlying_response, *, async_mode=False, **kwargs)
class ai.backend.client.request.WebSocketResponse(session, underlying_response, **kwargs)

A high-level wrapper of aiohttp.ClientWebSocketResponse.

class ai.backend.client.request.FetchContextManager(session, rqst_ctx_builder, *, response_cls=<class 'ai.backend.client.request.Response'>, check_status=True)

The context manager returned by Request.fetch().

It provides both synchronous and asynchronous context manager interfaces.

class ai.backend.client.request.WebSocketContextManager(session, ws_ctx_builder, *, on_enter=None, response_cls=<class 'ai.backend.client.request.WebSocketResponse'>)

The context manager returned by Request.connect_websocket().

class ai.backend.client.request.AttachedFile(filename, stream, content_type)

A struct that represents an attached file to the API request.

Parameters:
  • filename (str) – The name of the file to store. It may include a path, and the server will create parent directories if required.

  • stream (Any) – A file-like object that allows stream-reading bytes.

  • content_type (str) – The content type for the stream. For arbitrary binary data, use “application/octet-stream”.

content_type

Alias for field number 2

count(value, /)

Return number of occurrences of value.

filename

Alias for field number 0

index(value, start=0, stop=9223372036854775807, /)

Return first index of value.

Raises ValueError if the value is not present.

stream

Alias for field number 1

Exceptions

class ai.backend.client.exceptions.BackendError

Exception type to catch all ai.backend-related errors.

add_note()

Exception.add_note(note) – add a note to the exception

with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class ai.backend.client.exceptions.BackendAPIError(status, reason, data)

Exceptions returned by the API gateway.

add_note()

Exception.add_note(note) – add a note to the exception

with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.

class ai.backend.client.exceptions.BackendClientError

Exceptions from the client library, such as argument validation errors and connection failures.

add_note()

Exception.add_note(note) – add a note to the exception

with_traceback()

Exception.with_traceback(tb) – set self.__traceback__ to tb and return self.
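
A sketch of the typical exception-handling pattern; that BackendAPIError exposes its status, reason, and data constructor arguments as attributes is an assumption:

from ai.backend.client.exceptions import BackendAPIError, BackendClientError
from ai.backend.client.session import Session

with Session() as sess:
    try:
        print(sess.Manager.status())
    except BackendAPIError as e:
        # Server-side error returned by the API gateway.
        print(e.status, e.reason, e.data)
    except BackendClientError as e:
        # Client-side error such as a connection failure.
        print("client error:", e)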

Type Definitions

class ai.backend.client.types.Undefined(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)

A special type to represent an undefined value.

ai.backend.client.types.undefined

A placeholder to signify an undefined value as a singleton object of Undefined and distinguish it from a null (None) value.
