Resilience

Monitoring and Detection

NetApp monitors environments continuously to detect and respond to issues quickly.

24/7 Automated Monitoring

All nodes are monitored around the clock for application health and availability. When a monitored metric exceeds its defined threshold, a ticket is automatically created for investigation and action.
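The threshold rule described above can be sketched as follows. This is a minimal illustration, not NetApp's implementation: the metric names, limits, and the ticket dictionary shape are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    metric: str   # hypothetical metric name, e.g. "disk_used_pct"
    limit: float  # a ticket is raised only when the sample exceeds this value


def breaches(samples: dict[str, float], thresholds: list[Threshold]) -> list[str]:
    """Return the names of metrics whose current sample exceeds its limit."""
    return [
        t.metric
        for t in thresholds
        if t.metric in samples and samples[t.metric] > t.limit
    ]


def open_tickets(node: str, samples: dict[str, float],
                 thresholds: list[Threshold]) -> list[dict]:
    """Build one ticket record (hypothetical shape) per breached metric."""
    return [
        {"node": node, "metric": m, "value": samples[m]}
        for m in breaches(samples, thresholds)
    ]
```

In this sketch a sample equal to its limit does not raise a ticket, matching the wording "exceeds its defined threshold".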

Daily Health Checks

Human-reviewed daily checks cover disk capacity, backup status, SLA performance, and repair activity. Results are reviewed by the Technical Operations team, with issues tracked through the Customer Support Portal.

Intrusion Detection

An intrusion detection capability is deployed across all managed nodes. The system monitors the following on every customer node against a maintained allowlist of expected activity:

  • Running processes
  • Network connections
  • SSH sessions

Any unexpected activity — including unrecognised processes or connections — triggers an immediate alert to the on-call team. Unresolved alerts are escalated to senior engineers and, where necessary, to Security Operations. Changes to the allowlist require security team approval.
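Conceptually, an allowlist check of this kind is a set difference between what is observed on a node and what is expected. The sketch below assumes hypothetical category names and sample values; it does not describe the actual detection tooling.

```python
def scan_node(observed: dict[str, set[str]],
              allowlist: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per category, every observed item that is not on the allowlist.

    Categories (hypothetical keys) mirror the list above: running processes,
    network connections, and SSH sessions. A category missing from the
    allowlist is treated as allowing nothing.
    """
    findings: dict[str, set[str]] = {}
    for category, items in observed.items():
        unexpected = items - allowlist.get(category, set())
        if unexpected:
            findings[category] = unexpected
    return findings


# Hypothetical example data for one node.
ALLOWLIST = {
    "processes": {"cassandra", "sshd", "node_agent"},
    "connections": {"10.0.0.5:9042"},
    "ssh_sessions": set(),
}
OBSERVED = {
    "processes": {"cassandra", "sshd", "cryptominer"},
    "connections": {"10.0.0.5:9042"},
    "ssh_sessions": {"unknown-user@203.0.113.7"},
}
ALERTS = scan_node(OBSERVED, ALLOWLIST)
```

Any non-empty result would correspond to an alert to the on-call team; an empty dictionary means all observed activity was expected.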

Endpoint detection and response is deployed on all management hosts. A centralised security monitoring platform ingests audit logs from all cloud providers and alerts on suspicious activity. Enhanced runtime scanning is deployed on PCI-enabled clusters.

File Integrity Monitoring

File integrity monitoring is in place on management infrastructure and PCI-enabled customer cluster nodes.

Logging

Administrative and operational activity on managed clusters is logged, including:

  • Actions performed via the console and API
  • Access by operations staff to cluster nodes

Logs are automatically exported to a separate system to prevent undetected modification or loss. Administrative activity is reviewed on a regular basis by the Security and Compliance team, with actions reconciled against approved tickets. Any exceptions are documented and tracked to resolution.
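The reconciliation step can be pictured as matching each logged administrative action to an approved ticket and listing whatever does not match. The record fields and ticket identifiers below are hypothetical and serve only to illustrate the idea.

```python
def unreconciled(actions: list[dict], approved_tickets: set[str]) -> list[dict]:
    """Return logged actions that cannot be tied to an approved ticket.

    An action with no ticket reference, or with a reference that is not in
    the approved set, is treated as an exception to document and track.
    """
    return [a for a in actions if a.get("ticket") not in approved_tickets]


# Hypothetical log records and approved ticket IDs.
ACTIONS = [
    {"actor": "ops-1", "action": "ssh node-3", "ticket": "TCK-1"},
    {"actor": "ops-2", "action": "restart service", "ticket": None},
    {"actor": "ops-3", "action": "resize cluster", "ticket": "TCK-9"},
]
EXCEPTIONS = unreconciled(ACTIONS, approved_tickets={"TCK-1"})
```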

For PCI-enabled clusters and clusters with audit logging enabled, logs are reviewed daily for suspicious activity.

Incident Response

Triage and Escalation

Alerts from monitoring, intrusion detection, or customer reports are evaluated to determine severity and next steps. Technical issues escalate to senior engineers; potential security incidents escalate to Security Operations. All actions are tracked through a centralised incident response process.

Security Advisories

When vulnerabilities are identified in the open-source technologies on the platform, NetApp publishes security advisories with impact analysis, mitigation guidance, and remediation timelines. These are available on the NetApp Instaclustr documentation site. For PCI customers, NetApp coordinates upgrades directly to ensure compliance obligations are met within required timeframes.

Customer Communication

Broad issues affecting multiple customers are reported on the status page. Cluster-specific issues are communicated through your nominated support contacts. Changes that may affect security, availability, or confidentiality are communicated via release notes.

Post-Incident Review

NetApp requires post-incident reviews for all unplanned customer production downtime. Lessons learned feed back into processes and controls.

Customers can report security concerns via the support portal or directly to [email protected]. These reports receive the same priority as internally detected issues.

Business Continuity and Backup

NetApp Instaclustr is designed for resilience at both the cluster and management-plane level.

Backups

Daily backups are performed for each technology on the platform and retained for seven days. Backup completion is monitored via the daily health check process. Backups are encrypted at rest and stored in the same region as the service nodes. Retention is extendable based on deployment model.
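As an illustration of a seven-day retention window, the sketch below selects the backups that have aged out. The boundary handling (a backup taken exactly seven days ago is still kept) is an assumption, as are the function and constant names.

```python
from datetime import date, timedelta

RETENTION_DAYS = 7  # default retention stated above


def expired_backups(backup_dates: list[date], today: date,
                    retention_days: int = RETENTION_DAYS) -> list[date]:
    """Return backup dates older than the retention window.

    Assumption: a backup dated exactly `retention_days` ago is still retained.
    """
    cutoff = today - timedelta(days=retention_days)
    return [d for d in backup_dates if d < cutoff]


# Ten consecutive daily backups, 1 to 10 January 2024 (hypothetical dates).
DAILY = [date(2024, 1, day) for day in range(1, 11)]
TO_DELETE = expired_backups(DAILY, today=date(2024, 1, 10))
```

Passing a larger `retention_days` models the extended retention mentioned above.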

Restore Testing

NetApp performs data restoration testing quarterly on all Generally Available major versions and also performs operational restores on a subset of customer clusters. Each test creates a new cluster, loads the restored data, and validates that the cluster operates correctly with the expected data.

Fault-Tolerant Architecture

Hosted client clusters are required by default to be configured with a minimum of three nodes (two for PostgreSQL and Cadence). This ensures systems are fault-tolerant against the failure of a single node. Higher levels of fault tolerance are available based on customer requirements.

Each technology applies replication and distribution strategies:

  • Nodes are spread across multiple availability zones
  • Data is replicated across different physical locations
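Spreading nodes across availability zones can be pictured as a round-robin assignment. The sketch below is a simplified illustration with hypothetical zone and node names; actual placement and replication are handled by each technology and the cloud provider.

```python
def place_nodes(node_count: int, zones: list[str]) -> dict[str, list[str]]:
    """Assign nodes to zones in round-robin order so that no zone holds
    more than one node above any other zone."""
    if not zones:
        raise ValueError("at least one availability zone is required")
    placement: dict[str, list[str]] = {zone: [] for zone in zones}
    for i in range(node_count):
        placement[zones[i % len(zones)]].append(f"node-{i + 1}")
    return placement


def nodes_left_after_zone_loss(placement: dict[str, list[str]]) -> int:
    """Worst-case number of nodes still running if any single zone fails."""
    total = sum(len(nodes) for nodes in placement.values())
    return total - max(len(nodes) for nodes in placement.values())


# Three nodes across three hypothetical zones: one node per zone.
THREE_NODE = place_nodes(3, ["az-a", "az-b", "az-c"])
```

With three nodes in three zones, losing any one zone leaves two nodes running, which corresponds to the single-node fault tolerance described above.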

Resilient Management Plane

The management database is backed up daily with continuous write-ahead log backup, stored in at least two regions with the main backup being immutable. The management database is actively replicated to a secondary management plane, which can be used in the short term while the primary is restored.

No Single Point of Failure

Customer environments run independently from each other and from the management environment. In the short to medium term, clusters continue operating even if the management environment fails catastrophically. The cloud providers are the only runtime dependencies for customer systems.

NetApp tests and refines the Business Continuity Plan annually, with approval from senior leadership.