Paper title

A Kubernetes 'Bridge' operator between cloud and external resources

Authors

Boris Lublinsky, Elise Jennings, Viktória Spišaková

Abstract

Many scientific workflows require dedicated compute resources, including HPC clusters with optimized software, quantum resources, and dedicated hardware cluster systems such as Ray. At the same time, many scientific workflows today are built on Kubernetes, leveraging growing support for workflows and supporting tools. To address the growing demand to support workflows on both cloud and dedicated compute resources, we present the Bridge Operator, a software extension for container orchestration in Kubernetes that facilitates the submission and monitoring of long-running processes on external systems that have their own cluster resource managers (SLURM, LSF, quantum services, and Ray). The Bridge Operator consists of a custom Kubernetes controller that employs a Kubernetes Custom Resource Definition to manage applications. We present controller logic to manage the interface between cloud container orchestration and the external resource workload manager, a resource definition to submit HTTP/HTTPS requests to the external resource, and a controller pod that communicates with the external resource manager to submit and manage job execution. The implementation mirrors the external resource in Kubernetes pods, so the operator can use these pods as proxies to control the external system. The implementation is agnostic to the choice of resource manager but assumes the system exposes an HTTP/HTTPS API for its control and management. The Bridge Operator automates the role of a human operator running jobs on a black-box external resource as part of a complex hybrid workflow on the cloud.
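The proxy-pod idea in the abstract (submit a job over HTTP/HTTPS, then poll the external resource manager until the job finishes) can be illustrated with a minimal Python sketch. This is not the Bridge Operator's actual code: the `BridgeJobSpec` fields, the `/jobs` endpoint paths, the state names, and the in-memory `FakeResourceManager` are all hypothetical, chosen only so that the example runs without a real cluster.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical terminal states; real resource managers (SLURM, LSF, Ray)
# each use their own state vocabulary.
TERMINAL_STATES = {"DONE", "FAILED", "KILLED"}


@dataclass
class BridgeJobSpec:
    """Illustrative stand-in for fields a custom resource might carry."""
    resource_url: str            # base URL of the external HTTP/HTTPS API (assumed)
    script: str                  # job script to run on the external system (assumed)
    poll_interval_s: float = 1.0


class FakeResourceManager:
    """In-memory stand-in for an external job API, for demonstration only."""

    def __init__(self, states: List[str]) -> None:
        self._states = list(states)
        self.submitted: List[Tuple[str, Dict[str, str]]] = []

    def post(self, url: str, payload: Dict[str, str]) -> Dict[str, str]:
        # Record the submission and hand back a made-up job id.
        self.submitted.append((url, payload))
        return {"id": "job-1"}

    def get(self, url: str) -> Dict[str, str]:
        # Step through the scripted states, then keep returning the last one.
        state = self._states.pop(0) if len(self._states) > 1 else self._states[0]
        return {"state": state}


def run_proxy(spec: BridgeJobSpec, client,
              sleep: Callable[[float], None] = time.sleep) -> List[str]:
    """Submit the job, then poll until a terminal state is observed.

    Returns the sequence of distinct states seen, which a controller could
    mirror into the status of the Kubernetes pod acting as the proxy.
    """
    job_id = client.post(f"{spec.resource_url}/jobs", {"script": spec.script})["id"]
    history: List[str] = []
    while True:
        state = client.get(f"{spec.resource_url}/jobs/{job_id}")["state"]
        if not history or history[-1] != state:
            history.append(state)        # record state transitions only
        if state in TERMINAL_STATES:
            return history
        sleep(spec.poll_interval_s)


if __name__ == "__main__":
    manager = FakeResourceManager(["PENDING", "RUNNING", "RUNNING", "DONE"])
    spec = BridgeJobSpec("https://hpc.example.org/api", "run.sh", poll_interval_s=0.0)
    print(run_proxy(spec, manager))
```

Passing the client and the `sleep` function as parameters keeps the loop independent of any particular resource manager, in the spirit of the abstract's claim that the design only assumes an HTTP/HTTPS control API.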
