NetApp Apache Spark on Red Hat OpenShift – Image and Operator

For the most efficient management of Apache Spark™ on OpenShift, our Support team recommends the Red Hat certified NetApp Operator and the certified Image. You can find a description of these by searching for “NetApp Spark OpenShift” in the Red Hat Ecosystem Catalog, and the Operator can be deployed from the OpenShift OperatorHub. The Image contains the software for running Spark applications: approved versions of Spark and libraries for connecting to data sources such as AWS S3. The Operator automates the management of Spark applications on Kubernetes.

Image

Deployment of Image:

  1. On the Image page, choose the latest image from the Tag dropdown menu. 
  2. Follow the instructions described under “Get this image”.
  3. You can find older images under the Tag dropdown menu.

Use of Image: 

  1. NetApp’s default image has been packaged to ensure that all necessary dependencies and configurations are correctly set up. 
  2. We regularly update the image to include the latest security patches and improvements to reduce the risk of vulnerabilities. 
  3. Our image includes an entrypoint.sh script that manages the startup process and runs Spark as a non-root user. This approach enhances security and ensures that your Spark containers run with the least privileges necessary.

Custom Image:

If you require a specific set of software, configurations, or dependencies for your image, you can build a custom image suited to your application’s environment. To create your custom image, use one of the NetApp Spark images as a base. Once your custom image has been built, tag it and push it to your local private repository.
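A minimal sketch of this workflow is shown below. The base image reference, the files being added, and the registry and image names are placeholders only; replace them with the NetApp Spark base image you are using and the details of your own private repository.

    # Dockerfile: extend a NetApp Spark base image with your own dependencies
    # (the base image reference below is a placeholder)
    FROM <netapp-spark-base-image>:<tag>

    # Example: add application-specific jars or configuration
    COPY extra-jars/ /opt/spark/jars/

    # Build, tag, and push the image to your private repository
    docker build -t <registry-url>/spark/custom-spark:<tag> .
    docker push <registry-url>/spark/custom-spark:<tag>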

OpenShift requires credentials to access the private repository image. Below are the steps to create a secret with your username and password.   

  1. Navigate to Secrets within the Workloads section of the sidebar, then click Create and choose “Image pull secret”.

  2. Add your required details and click Create. If your private repository is Amazon ECR, you can obtain the password with the command: aws ecr get-login-password --region <region-name>.

  3. Add the image pull secret to the Spark application YAML file to authenticate with the repository and use the credentials while submitting applications, as shown in the sketch below.
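The snippet below is a minimal sketch of the relevant section of a SparkApplication YAML. The image reference and the secret name (my-registry-secret) are examples only and should match the image you pushed and the Image pull secret you created above.

    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: custom-image-app
      namespace: spark
    spec:
      image: "<registry-url>/spark/custom-spark:<tag>"   # your custom image
      imagePullSecrets:
        - my-registry-secret                             # the Image pull secret created above
      # remaining SparkApplication fields (type, mainClass, driver, executor, etc.) omitted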

Operator

The NetApp Spark-on-OpenShift Operator is available to OpenShift customers in their OperatorHub console.

Deployment of Operator:

  1. Log in with Administrator privileges to your Red Hat OpenShift web console.
  2. Search for “NetApp Spark” in the OperatorHub Filter by keyword box to locate “NetApp Supported Apache Spark on OpenShift”.
  3. Create a new namespace/project called “spark” for your Operator deployment and follow the Red Hat instructions to install the Operator into your new “spark” namespace. 
  4. Create a SparkOperator custom resource using the sample YAML provided in the OpenShift console under the SparkOperator tab. 
  5. Create a SparkApplication custom resource using the sample YAML provided in the OpenShift console under the SparkApplication tab. A minimal example is sketched below.
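The following is a minimal sketch of a SparkApplication resource, modeled on the upstream Spark Operator examples. The image reference, Spark version, service account, and resource sizes are placeholders; the sample YAML provided in the console may differ.

    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: SparkApplication
    metadata:
      name: spark-pi
      namespace: spark
    spec:
      type: Scala
      mode: cluster
      image: "<netapp-spark-image>:<tag>"   # placeholder: use the certified NetApp image
      mainClass: org.apache.spark.examples.SparkPi
      mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples.jar"
      sparkVersion: "3.5.0"                 # placeholder: match the Spark version in your image
      restartPolicy:
        type: Never
      driver:
        cores: 1
        memory: "512m"
        serviceAccount: spark               # a service account with the required permissions
      executor:
        cores: 1
        instances: 1
        memory: "512m"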

Spark Operator Permissions 

To ensure the NetApp Apache Spark Operator runs effectively on Red Hat OpenShift, specific permissions are needed to manage the lifecycle of Spark applications, custom resources, and necessary Kubernetes resources while supporting optional batch scheduling capabilities.  

The following RBAC (Role-Based Access Control) permissions are required by the Spark Operator. We have divided the permissions into cluster-wide permissions (ClusterRole) and namespaced permissions (Role). 

Cluster-Wide Permissions 

Cluster-wide permissions, which grant the Operator the access it needs across the cluster to manage Spark applications, are split into the ClusterRole and ClusterRoleBinding sections. The permissions listed here can be viewed and edited in the Spark Operator’s YAML file used in the installation process.

 ClusterRole permissions 

The ClusterRole grants the following permissions: 

Object | apiGroups | Resources | Verbs
Pod Management | "" (core) | pods | * (all actions: create, get, list, watch, update, delete, etc.)
Service and ConfigMap Management | "" (core) | services, configmaps | create, get, delete, update, patch
Node Access | "" (core) | nodes | get
Event Management | "" (core) | events | create, update, patch
Resource Quota Management | "" (core) | resourcequotas | get, list, watch
Custom Resource Definition (CRD) Management | apiextensions.k8s.io | customresourcedefinitions | create, get, update, delete
Webhook Configuration | admissionregistration.k8s.io | mutatingwebhookconfigurations, validatingwebhookconfigurations | create, get, update, delete
Spark Application Management | sparkoperator.k8s.io | sparkapplications, sparkapplications/status, scheduledsparkapplications, scheduledsparkapplications/status | * (all actions: create, get, list, watch, update, delete, etc.)
Batch Scheduler (optional; if the Volcano batch scheduler is enabled) | scheduling.incubator.k8s.io, scheduling.sigs.dev, scheduling.volcano.sh | podgroups | * (all actions: create, get, list, watch, update, delete, etc.)

A sample of the permissions section in the Cluster Role YAML is shown below: 
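Based on the table above, the rules section of the ClusterRole can be expected to look roughly like the following sketch; the exact layout in the installed YAML may differ.

    rules:
      - apiGroups: [""]
        resources: ["pods"]
        verbs: ["*"]
      - apiGroups: [""]
        resources: ["services", "configmaps"]
        verbs: ["create", "get", "delete", "update", "patch"]
      - apiGroups: [""]
        resources: ["nodes"]
        verbs: ["get"]
      - apiGroups: [""]
        resources: ["events"]
        verbs: ["create", "update", "patch"]
      - apiGroups: [""]
        resources: ["resourcequotas"]
        verbs: ["get", "list", "watch"]
      - apiGroups: ["apiextensions.k8s.io"]
        resources: ["customresourcedefinitions"]
        verbs: ["create", "get", "update", "delete"]
      - apiGroups: ["admissionregistration.k8s.io"]
        resources: ["mutatingwebhookconfigurations", "validatingwebhookconfigurations"]
        verbs: ["create", "get", "update", "delete"]
      - apiGroups: ["sparkoperator.k8s.io"]
        resources: ["sparkapplications", "sparkapplications/status", "scheduledsparkapplications", "scheduledsparkapplications/status"]
        verbs: ["*"]
      # Optional batch scheduler (Volcano) permissions
      - apiGroups: ["scheduling.incubator.k8s.io", "scheduling.sigs.dev", "scheduling.volcano.sh"]
        resources: ["podgroups"]
        verbs: ["*"]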

ClusterRoleBinding permissions:

The ClusterRoleBinding associates the ClusterRole with a ServiceAccount, granting the following: 

  • ServiceAccount:
             Name: {{ include "spark-operator.serviceAccountName" . }}
             Namespace: {{ .Release.Namespace }}
  • Role Reference:
             ClusterRole Name: {{ include "spark-operator.fullname" . }}
             API Group: rbac.authorization.k8s.io

A sample of the permissions section in the Cluster Role Binding YAML is shown below: 
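A sketch of the corresponding ClusterRoleBinding, using the same Helm template values listed above, might look like this:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: {{ include "spark-operator.fullname" . }}
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: ClusterRole
      name: {{ include "spark-operator.fullname" . }}
    subjects:
      - kind: ServiceAccount
        name: {{ include "spark-operator.serviceAccountName" . }}
        namespace: {{ .Release.Namespace }}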

Namespace-Scoped Permissions:

Why Use Namespaced RBAC? 

Namespaced RBAC provides the following advantages: 

  • Reduced Attack Surface: By limiting secrets and job deletion permissions to a specific namespace, exposure of sensitive resources is minimized. 
  • Granular Control: Namespace-scoped RBAC ensures that different components operate with isolated privileges, reducing the risk of privilege escalation. 
  • Compliance and Security: Aligns with security best practices, improving governance and reducing audit risks.  
  • Secrets and Jobs Permissions: By moving from cluster-wide to namespace-scoped, security is improved due to limiting sensitive resource access. 
  • Namespace: Permissions are applied within the namespace specified during the Helm release, typically spot-system or any targeted namespace. 

This extended RBAC setup provides a balance between operational flexibility and security, ensuring that the Spark Operator can perform necessary tasks while maintaining strict control over sensitive operations at the namespace level. 

This topic describes the permissions required by the Spark operator at the namespace level. 

Role 

Certain cluster-wide privileges have been reduced to namespace-specific permissions to follow the principle of least privilege and improve security.

Object | apiGroups | Resources | Verbs
Secrets Management | "" (core) | secrets | create, get, delete, update
Job Deletion | batch | jobs | delete

A sample of the permissions section in the Role YAML is shown below: 
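Based on the table above, the rules section of the Role can be sketched as follows; the metadata values are taken from the Helm template and may differ in your installation.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: {{ include "spark-operator.fullname" . }}
      namespace: {{ .Release.Namespace }}
    rules:
      - apiGroups: [""]
        resources: ["secrets"]
        verbs: ["create", "get", "delete", "update"]
      - apiGroups: ["batch"]
        resources: ["jobs"]
        verbs: ["delete"]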

Role Binding: Namespace-Scoped Association 

The RoleBinding associates the above Role with the specified ServiceAccount, limiting permissions to the namespace scope. 

  • ServiceAccount:
           Name: {{ include "spark-operator.serviceAccountName" . }}
           Namespace: {{ .Release.Namespace }}
  • Role Reference:
           Role Name: {{ include "spark-operator.fullname" . }}
           API Group: rbac.authorization.k8s.io

A sample of the permissions section in the Role Binding YAML is shown below:
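A sketch of the corresponding RoleBinding, again using the template values listed above, might look like this:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: RoleBinding
    metadata:
      name: {{ include "spark-operator.fullname" . }}
      namespace: {{ .Release.Namespace }}
    roleRef:
      apiGroup: rbac.authorization.k8s.io
      kind: Role
      name: {{ include "spark-operator.fullname" . }}
    subjects:
      - kind: ServiceAccount
        name: {{ include "spark-operator.serviceAccountName" . }}
        namespace: {{ .Release.Namespace }}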

Use of Operator 

  1. The Spark Operator handles the entire lifecycle of Spark applications from submission to completion, including retries and failure handling.
  2. The Operator is used for running and managing Apache Spark applications on Kubernetes. You can schedule Spark applications using the ScheduledSparkApplication CustomResourceDefinition (CRD); the sample YAML is provided in the OpenShift console under the ScheduledSparkApplication tab. To launch a scheduled application, navigate to the ScheduledSparkApplication tab and click Create ScheduledSparkApplication, using the sample YAML provided. By default, the application triggers its jobs every 5 minutes; you can change the schedule as required. A minimal example is sketched below.
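The following is a minimal sketch of a ScheduledSparkApplication that runs every 5 minutes. As with the earlier example, the image reference, Spark version, service account, and resource settings are placeholders, and the sample YAML in the console may differ.

    apiVersion: "sparkoperator.k8s.io/v1beta2"
    kind: ScheduledSparkApplication
    metadata:
      name: spark-pi-scheduled
      namespace: spark
    spec:
      schedule: "*/5 * * * *"                 # run every 5 minutes; adjust as required
      concurrencyPolicy: Allow
      template:
        type: Scala
        mode: cluster
        image: "<netapp-spark-image>:<tag>"   # placeholder
        mainClass: org.apache.spark.examples.SparkPi
        mainApplicationFile: "local:///opt/spark/examples/jars/spark-examples.jar"
        sparkVersion: "3.5.0"                 # placeholder
        restartPolicy:
          type: Never
        driver:
          cores: 1
          memory: "512m"
          serviceAccount: spark
        executor:
          cores: 1
          instances: 1
          memory: "512m"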

By Instaclustr Support