Zoning for cluster storage in pictures

NetApp AFF & FAS storage systems can be combined into a cluster of up to 12 nodes (6 HA pairs) for SAN protocols. Let’s take a look at zoning and connections using an example with 4 nodes (2 HA pairs) in the image below.

For simplicity, we will discuss the connection of a single host to the storage cluster. In this example we connect each storage node to the server, and each node is connected with double links for reliability.

It is easy to notice that only two paths are going to be Preferred in this example (solid yellow arrows).

Since NetApp FAS/AFF systems implement a shared-nothing architecture, disk drives are assigned to a node; the disks on a node are grouped into RAID groups; one or a few RAID groups are combined into a plex; and usually one plex forms an aggregate. (In some cases two plexes form an aggregate; in that case both plexes must have identical RAID configuration, think of it as an analogy to RAID 1.) On top of aggregates you create FlexVol volumes. Each FlexVol volume is a separate WAFL file system and can serve NAS files (SMB/NFS), SAN LUNs (iSCSI, FC, FCoE), or Namespaces (NVMe-oF). A FlexVol can have multiple qtrees, and each qtree can store LUNs or files. Read more in the series of articles How ONTAP Memory works.

Each drive belongs to and is served by a node. A RAID group belongs to and is served by a node. All objects on top of those also belong to and are served by a single node, including aggregates, FlexVols, qtrees, LUNs & Namespaces.

At any given time a disk can belong to only a single node, and in case of a node failure the HA partner takes over its disks, aggregates, and all the objects on top of them. Note that a “disk” in ONTAP can be an entire physical disk as well as a partition on a disk. Read more about disks and ADP (disk partitioning) here.
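The ownership model above can be sketched in a few lines of Python (a toy illustration, not ONTAP code; node and aggregate names are made up): every aggregate is owned by exactly one node, and on node failure the HA partner takes over all of its aggregates and the objects on top of them.

```python
# Toy model of ONTAP HA takeover (illustrative names only).
ha_pairs = {"Node1": "Node2", "Node2": "Node1"}   # each node's HA partner
owner = {"aggr1": "Node1", "aggr2": "Node2"}      # aggregate -> owning node

def takeover(failed_node):
    """Return aggregate ownership after the HA partner takes over."""
    partner = ha_pairs[failed_node]
    return {aggr: (partner if node == failed_node else node)
            for aggr, node in owner.items()}

print(takeover("Node1"))  # {'aggr1': 'Node2', 'aggr2': 'Node2'}
```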

Though a LUN or Namespace belongs to a single node, it is possible to access it through the HA partner or even through other nodes. The most optimal path is always through the node which owns the LUN or Namespace. If that node has more than one port, all ports on that node are considered optimal paths (also known as Active/Optimized paths). Normally it is a good idea to have more than one optimal path to a LUN.


ALUA (Asymmetric Logical Unit Access) is a protocol which helps hosts access LUNs through optimal paths; it also allows paths to a LUN to change automatically if it moves to another controller. ALUA is used in both FCP and iSCSI protocols. Similarly to ALUA, ANA (Asymmetric Namespace Access) is a protocol for NVMe over Fabrics transports like FC-NVMe and others.

A host can use one or a few paths to a LUN, depending on the host multipathing configuration and the portset configuration on the ONTAP cluster.

Since a LUN belongs to a single storage node and ONTAP provides online migration capabilities between nodes, your network configuration must provide access to the LUN from all the nodes, just in case. Read more in the series of articles How ONTAP cluster works.

According to NetApp best practices, zoning is quite simple:

  • Create one zone for each initiator (host) port on each fabric
  • Each zone must have one initiator port and all the target (storage node) ports.

Keeping one initiator per zone eliminates “crosstalk” between initiators.
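The best-practice scheme above is easy to express in code. Below is a minimal Python sketch (all WWPNs and names are made-up examples) that builds one zone per host port, each containing that initiator plus every target LIF on the same fabric:

```python
# Sketch of single-initiator zoning: one zone per host port per fabric,
# with the initiator WWPN plus all target LIF WWPNs. Names are illustrative.

def build_zones(host_ports, node_lifs):
    """host_ports: {host_port_name: wwpn}; node_lifs: {lif_name: wwpn}."""
    zones = {}
    for port_name, wwpn in host_ports.items():
        # each zone: this initiator + every target LIF on the fabric
        zones[f"Zone_{port_name}"] = [wwpn] + list(node_lifs.values())
    return zones

fabric_a = build_zones(
    {"Host_1-A-Port_A": "21:00:00:24:FF:01:02:03"},
    {f"LIF{n}-A": f"20:0{n}:00:A0:98:03:A4:6E" for n in range(1, 5)},
)
print(fabric_a)
```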

Example for Fabric A, Zone for “Host_1-A-Port_A”:

Device        Physical port   WWPN used in zone
Host 1        Port A          Port A
ONTAP Node1   Port A          LIF1-A (NAA-2)
ONTAP Node2   Port A          LIF2-A (NAA-2)
ONTAP Node3   Port A          LIF3-A (NAA-2)
ONTAP Node4   Port A          LIF4-A (NAA-2)

Example for Fabric B, Zone for “Host_1-B-Port_B”:

Device        Physical port   WWPN used in zone
Host 1        Port B          Port B
ONTAP Node1   Port B          LIF1-B (NAA-2)
ONTAP Node2   Port B          LIF2-B (NAA-2)
ONTAP Node3   Port B          LIF3-B (NAA-2)
ONTAP Node4   Port B          LIF4-B (NAA-2)

Here is how the zoning from the tables above looks:

Vserver or SVM

An SVM in an ONTAP cluster lives on all the nodes in the cluster. SVMs are separated from one another and are used for creating a multi-tenant environment. Each SVM can be managed by a separate group of people or a separate company, and one will not interfere with another; in fact, they will not know about each other’s existence at all. Each SVM is like a separate physical storage system box. Read more about SVM, Multi-Tenancy and Non-Disruptive Operations here.

Logical Interface (LIF)

Each SVM has its own WWNN in case of FCP, its own IQN in case of iSCSI, or its own NQN in case of NVMe-oF. SVMs can share a physical storage node port: each SVM assigns its own network addresses (WWPN, IP, or NVMe identifiers) to a physical port, and normally one network address per physical port. Therefore, if a few SVMs exist, a single physical storage node port might carry a few WWPN network addresses, each assigned to a different SVM. NPIV is a crucial functionality which must be enabled on the FC switch for an ONTAP cluster to function properly with the FC protocol.

Unlike ordinary virtual machines (e.g., on ESXi or KVM), each SVM “exists” on all the nodes in the cluster, not just on a single node; the picture below shows two SVMs on a single node just for simplicity.

Make sure that each node has at least one LIF; in this case host multipathing will be able to find an optimal path and always access a LUN through the optimal route, even if the LUN migrates to another node. Each port has its own assigned “physical address”, which you cannot change, plus network addresses. Here is an example of what network & physical addresses look like in case of the iSCSI protocol. Read more about SAN LIFs here and about SAN protocols like FC, iSCSI, NVMe-oF here.

Zoning recommendations

For ONTAP 9, 8 & 7, NetApp recommends zones with a single initiator and multiple targets.

For example, in case of FCP, each physical port has its own physical WWPN (WWPN 3 in the image above) which should not be used at all; instead, the WWPN addresses assigned to LIFs (WWPN 1 & 2 in the image above) must be used for zoning and host connections. Physical addresses look like 50:0A:09:8X:XX:XX:XX:XX; addresses of this type are numbered according to NAA-3 (IEEE Network Address Authority 3), are assigned to a physical port, and should not be used at all. Example: 50:0A:09:82:86:57:D5:58. You can see NAA-3 addresses listed on network switches, but they should not be used.

When you create zones on a fabric, you should use addresses of the form 2X:XX:00:A0:98:XX:XX:XX; addresses of this type are numbered according to NAA-2 (IEEE Network Address Authority 2) and are assigned to your LIFs. Thanks to NPIV technology, the physical N_Port can register additional WWPNs, which means NPIV must be enabled on your switch in order for ONTAP to serve data over the FCP protocol to your servers. Example: 20:00:00:A0:98:03:A4:6E

  • Block 00:A0:98 is the original OUI block for ONTAP
  • Block D0:39:EA is the newly added OUI block for ONTAP
  • Block 00:A0:B8 is used on NetApp E-Series hardware
  • Block 00:80:E5 is reserved for future use.
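Based on the address patterns described above, here is a small Python sketch (illustrative only, not a NetApp tool) that checks whether a WWPN looks like a zoneable LIF address (NAA-2 with an ONTAP OUI) rather than a physical-port address:

```python
# Distinguish the physical-port WWPN pattern (5X:..., NAA-3, do not zone)
# from the LIF WWPN pattern (2X:XX:<OUI>:..., NAA-2, use for zoning).

ONTAP_OUIS = {"00:A0:98", "D0:39:EA"}   # ONTAP OUI blocks listed above

def wwpn_usable_for_zoning(wwpn):
    """Return True if the WWPN looks like an ONTAP LIF address (NAA-2)."""
    octets = wwpn.upper().split(":")
    if len(octets) != 8:
        raise ValueError("WWPN must have 8 octets")
    naa_nibble = octets[0][0]           # first hex digit encodes the NAA type
    oui = ":".join(octets[2:5])         # OUI sits in octets 3-5 of a NAA-2 WWPN
    return naa_nibble == "2" and oui in ONTAP_OUIS

print(wwpn_usable_for_zoning("20:00:00:A0:98:03:A4:6E"))  # LIF address -> True
print(wwpn_usable_for_zoning("50:0A:09:82:86:57:D5:58"))  # physical port -> False
```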

Read more


Please note in this article I described my own understanding of the internal organization of ONTAP systems. Therefore, this information might be either outdated, or I simply might be wrong in some aspects and details. I will greatly appreciate any of your contribution to make this article better, please leave any of your ideas and suggestions about this topic in the comments below.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only.

ONTAP & Antivirus NAS protection

NetApp ONTAP supports antivirus integration known as Off-box Antivirus Scanning, or VSCAN. With VSCAN, the storage system has each new file scanned by an antivirus system, which increases corporate data security.

ONTAP supports the following antivirus software:

  • Symantec
  • Trend Micro
  • Computer Associates
  • Kaspersky
  • McAfee
  • Sophos

Also, ONTAP supports FPolicy technology, which can prevent a file from being written or read based on its extension or content header.

This time I’d like to discuss an example of CIFS (SMB) integration with antivirus system McAfee.


In this example I’m going to show how to set up integration with McAfee. Here are the minimum requirements for McAfee; they are approximately the same for other AVs:

  • MS Windows Server 2008 or higher
  • NetApp storage with ONTAP 8 or higher
  • SMB v2 or higher (CIFS v1.0 not supported)
  • NetApp ONTAP AV Connector (Download page)
  • McAfee VirusScan Enterprise for Storage (VSEfS)
  • For more details see NetApp Support Matrix Tool.

Diagram of antivirus integration with ONTAP system.


To set up such an integration, we will need to configure the following software components:



We need to set up McAfee VSEfS, which can work in two modes: as an independent product or as managed by McAfee ePolicy Orchestrator (McAfee ePO). In this article, I will discuss how to configure it as an independent product. To set up & configure VSEfS, we need already installed and configured:

  • McAfee VirusScan Enterprise (VSE). Download VSE
  • McAfee ePolicy Orchestrator (ePO); not needed if VirusScan is used as an independent product.

SCAN Server

First, we need to configure a few SCAN servers to balance the workload between them. I will install each SCAN server on a separate Windows Server with McAfee VSE, McAfee VSEfS, and the ONTAP AV Connector. In this article, we will create three SCAN servers: SCAN1, SCAN2, SCAN3.

Active Directory

At the next step, we need to create the user scanuser in our domain; in this example the domain will be NetApp.


Once ONTAP is up, we need to create a cluster management LIF and an SVM management LIF, set up AD integration, and configure file shares and data LIFs for the SMB protocol. Here, we will have the NCluster-mgmt LIF for cluster management and SVM01-mgmt for SVM management.

NCluster::> network interface create -vserver NCluster -home-node NCluster-01 -home-port e0M -role data -protocols none -lif NCluster-mgmt -address -netmask
NCluster::> network interface create -vserver SVM01 -home-node NCluster-01 -home-port e0M -role data -protocols none -lif SVM01-mgmt -address -netmask
NCluster::> domain-tunnel create -vserver SVM01
NCluster::> security login create -username NetApp\scanuser -application ontapi -authmethod domain -role readonly -vserver NCluster
NCluster::> security login create -username NetApp\scanuser -application ontapi -authmethod domain -role readonly -vserver SVM01

ONTAP AV Connector

On each SCAN server, we will install the ONTAP AV Connector. At the end of the installation, I will add the AD login & password for the user scanuser.


Then configure management LIFs

Start → All Programs → NetApp → ONTAP AV Connector → Configure ONTAP Management LIFs

In the “Management LIF” field we will add the DNS name or IP address of NCluster-mgmt or SVM01-mgmt. The Account field we fill with NetApp\scanuser. Then press “Test,” and “Update” or “Save” once the test has finished.


McAfee Network Appliance Filer AV Scanner Administrator Account

Assuming you have already installed McAfee on the three SCAN servers: on each SCAN server, log in as an administrator, open the VirusScan Console from the Windows taskbar, then open Network Appliance Filer AV Scanner and choose the tab called “Network Appliance Filers.” In the field “This Server is processing scan request for these filers,” press the “Add” button, put “” in the address field, and then also add the scanuser credentials.


Returning to ONTAP console

We configure off-box scanning, then enable it, and create and apply scan policies. SCAN1, SCAN2, and SCAN3 are the Windows servers with McAfee VSE, VSEfS, and the ONTAP AV Connector installed.
First, we create a pool of AV servers:

NCluster::> vserver vscan scanner-pool create -vserver SVM01 -scanner-pool POOL1 -servers SCAN1,SCAN2,SCAN3 -privileged-users NetApp\scanuser 
NCluster::> vserver vscan scanner-pool show
                                     Scanner Pool   Privileged        Scanner
Vserver   Pool      Owner    Servers        Users             Policy
--------  --------  -------  -------------  ----------------  -------
SVM01     POOL1     vserver  SCAN1,         NetApp\scanuser   idle
                             SCAN2,
                             SCAN3

NCluster::> vserver vscan scanner-pool show -instance
                             Vserver: SVM01
                        Scanner Pool: POOL1
                      Applied Policy: idle
                      Current Status: off
           Scanner Pool Config Owner: vserver
List of IPs of Allowed Vscan Servers: SCAN1, SCAN2, SCAN3
            List of Privileged Users: NetApp\scanuser

Second, we apply a scanner policy:

NCluster::> vserver vscan scanner-pool apply-policy -vserver SVM01 -scanner-pool POOL1 -scanner-policy primary
NCluster::> vserver vscan enable -vserver SVM01
NCluster::> vserver vscan connection-status show
                      Connected     Connected
Vserver   Node        Server-Count  Servers
--------  ----------  ------------  ------------------------
SVM01     NClusterN1  3             SCAN1, SCAN2, SCAN3

NCluster::> vserver vscan on-access-policy show
          Policy        Policy                     File-Ext  Policy
Vserver   Name          Owner    Protocol  Paths   Excluded  Status
--------  ------------  -------  --------  ------  --------  ------
NCluster  default_CIFS  cluster  CIFS      -       -         off
SVM01     default_CIFS  cluster  CIFS      -       -         on


No additional licensing is needed on the ONTAP side to enable and use FPolicy & off-box antivirus scanning; this is basic functionality available in any ONTAP system. However, you might need to license additional functionality on the antivirus side, so please check with your antivirus vendor.


Here are some advantages of integrating the storage system with your corporate AV: NAS integration with antivirus allows you to have one antivirus system on your desktops and another for your NAS shares. There is no need to do NAS scanning on workstations and waste their limited resources. All NAS data is protected: there is no way for a user, even with advanced privileges, to connect to the file share bypassing antivirus protection and put unscanned files there.

ONTAP and ESXi 6.x tuning

This article will be useful to those who own an ONTAP system and ESXi environment.

ESXi tuning can be divided into the following parts:

  • SAN network configuration optimization
  • NFS network configuration optimization
  • Hypervisor optimization
  • Guest OS optimization
  • Compatibility for software, firmware, and hardware

There are a few documents which you should use when tuning ESXi for NetApp ONTAP:

TR-4597 VMware vSphere with ONTAP

SAN network

In this section, we will describe configurations for the iSCSI, FC, and FCoE SAN protocols.


ONTAP 9 has ALUA always enabled for the FC, FCoE and iSCSI protocols. If the ESXi host has correctly detected ALUA, the Storage Array Type plug-in will show VMW_SATP_ALUA. With ONTAP it is allowed to use either the Most Recently Used or the Round Robin load balancing algorithm.

Round Robin will show better results if you have more than one path to a controller. In the case of Microsoft Cluster with RDM drives, the Most Recently Used algorithm is recommended. Read more about Zoning for ONTAP clusters.

Storage   ALUA      Protocols       ESXi policy     Algorithm
ONTAP 9   Enabled   FC/FCoE/iSCSI   VMW_SATP_ALUA   Round Robin or Most
                                                    Recently Used

Let’s check policy and algorithm applied to a Datastore:

# esxcli storage nmp device list
Device Display Name: NETAPP Fibre Channel Disk (naa.60a980004434766d452445797451376b)
Storage Array Type: VMW_SATP_ALUA
Storage Array Type Device Config: {implicit_support=on;explicit_support=off; explicit_allow=on;alua_followover=on;{TPG_id=1,TPG_state=ANO}{TPG_id=0,TPG_state=AO}}
Path Selection Policy: VMW_PSP_RR
Path Selection Policy Device Config: {policy=rr,iops=1000,bytes=10485760,useANO=0; lastPathIndex=0: NumIOsPending=0,numBytesPending=0}
Path Selection Policy Device Custom Config:
Working Paths: vmhba2:C0:T6:L119, vmhba1:C0:T7:L119
Is Local SAS Device: false
Is USB: false
Is Boot USB Device: false
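The TPG_state values in the device config above tell you which target port groups are Active/Optimized (AO) versus Active/Non-Optimized (ANO). Here is a short Python sketch (illustrative, not a VMware tool) that extracts the optimized groups from such a config string:

```python
# Pick the Active/Optimized (AO) target port groups out of a
# VMW_SATP_ALUA device-config string like the one shown above.
import re

def optimized_tpgs(satp_config):
    """Return TPG ids whose ALUA state is AO (Active/Optimized)."""
    pairs = re.findall(r"TPG_id=(\d+),TPG_state=(\w+)", satp_config)
    return [tpg for tpg, state in pairs if state == "AO"]

cfg = "{TPG_id=1,TPG_state=ANO}{TPG_id=0,TPG_state=AO}"
print(optimized_tpgs(cfg))  # ['0']
```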

Ethernet Network

An Ethernet network can be used for both NFS and iSCSI protocols.

Jumbo frames

Whether you are using iSCSI or NFS, it is recommended to use jumbo frames on networks of 1Gbps or greater. When you set up jumbo frames, MTU 9000 must be configured end-to-end: on the storage ports, on the switches, and on the ESXi virtual switches and VMkernel ports.

ESXi & MTU9000

When you are setting up a virtual machine, to achieve the best network performance you’d better use the VMXNET3 virtual adapter, since it supports both speeds greater than 1Gbps and MTU 9000, while the E1000e virtual adapter supports MTU 9000 but speeds only up to 1Gbps. Also, E1000e by default sets MTU 9000 for all VMs except Linux. Flexible virtual adapters support only MTU 1500.

To achieve maximum network throughput, connect your VM to a virtual switch which also has MTU 9000.
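As a sketch, end-to-end MTU 9000 on an ESXi host could be configured and verified like this (vSwitch0, vmk1 and the storage LIF IP are assumed names; adjust for your environment):

```shell
# Set MTU 9000 on a standard vSwitch (assumed name: vSwitch0)
esxcli network vswitch standard set --vswitch-name=vSwitch0 --mtu=9000

# Set MTU 9000 on the VMkernel interface used for NFS/iSCSI (assumed: vmk1)
esxcli network ip interface set --interface-name=vmk1 --mtu=9000

# Verify end-to-end with a large, non-fragmented ping to the storage LIF
# (8972 = 9000 minus IP and ICMP headers)
vmkping -d -s 8972 <storage_LIF_IP>
```

If the vmkping fails while a normal ping works, some device in the path is still at MTU 1500.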


ONTAP storage systems support VAAI (vSphere APIs for Array Integration). VAAI hardware acceleration, or hardware offload APIs, is a set of APIs enabling communication between VMware vSphere ESXi hosts and storage devices: instead of the ESXi host copying data from storage, modifying it in host memory and putting it back to storage over the network, with VAAI some of the operations can be done by the storage itself, triggered by API calls from the ESXi host. VAAI is enabled by default for SAN protocols but not for NAS. For NAS VAAI to work you need to install a vib kernel module called NetAppNFSVAAI on each ESXi host. Do not expect VAAI to solve all your problems, but some performance will definitely improve. NetApp VSC can also help with NetAppNFSVAAI installation. For NFS VAAI to function, you have to set up your NFS share on the storage properly and meet a few criteria:

  1. On the ONTAP storage, set the NFS export policy so ESXi servers can access it
  2. The RO, RW and Superuser fields must have SYS or ANY values in the export policy for your volume
  3. You have to enable both NFSv3 and NFSv4 protocols, even if NFSv4 will not be used
  4. Parent volumes in your junction path have to be readable. In most cases, this means the root volume (vsroot) of your SVM needs at least the Superuser field set to SYS. Moreover, it is recommended to prohibit write access to the SVM root volume.
  5. The vStorage feature has to be enabled for your SVM (vserver)


#cma320c-rtp::> export-policy rule show -vserver svm01 -policyname vmware_access -ruleindex 2
(vserver export-policy rule show)
Vserver: svm01
Policy Name: vmware_access <--- Applied to Exported Volume
Rule Index: 2
Access Protocol: nfs3 <---- needs to be 'nfs' or 'nfs3,nfs4'
Client Match Spec:
RO Access Rule: sys
RW Access Rule: sys
User ID To Which Anonymous Users Are Mapped: 65534
Superuser Security Flavors: sys
Honor SetUID Bits In SETATTR: true

#cma320c-rtp::> export-policy rule show -vserver svm01 -policyname root_policy -ruleindex 1
(vserver export-policy rule show)
Vserver: svm01
Policy Name: root_policy <--- Applied to SVM root volume
Rule Index: 1
Access Protocol: nfs <--- like requirement 1, set to nfs or nfs3,nfs4
Client Match Spec:
RO Access Rule: sys
RW Access Rule: never <--- this can be never for security reasons
User ID To Which Anonymous Users Are Mapped: 65534
Superuser Security Flavors: sys <--- this is required for VAAI to be set, even in the parent volumes like vsroot
Honor SetUID Bits In SETATTR: true
Allow Creation of Devices: true

#cma320c-rtp::> nfs modify -vserver svm01 -vstorage enabled

ESXi host

First of all, let’s not forget that it is a good idea to leave 4GB of memory to the hypervisor itself. Also, we need to tune some network values for ESXi:

Parameter           Protocol   Value for ESXi 6.x with ONTAP 9.x
NFS41.MaxVolumes    NFS 4.1    256
NFS.MaxQueueDepth   NFS        64 (if you have only AFF, then 128 or even 256)

We can do it in a few ways:

  • The easiest way, again, is to use VSC, which will configure these values for you
  • Command Line Interface (CLI) on ESXi hosts
  • With the GUI interface of vSphere Client/vCenter Server
  • Remote CLI tool from VMware.
  • VMware Management Appliance (VMA)
  • Applying Host Profile

Let’s set up these values manually in command line:

# For Ethernet-based protocols like iSCSI/NFS
esxcfg-advcfg -s 32 /Net/TcpipHeapSize
esxcfg-advcfg -s 512 /Net/TcpipHeapMax

# For NFS protocol
esxcfg-advcfg -s 256 /NFS/MaxVolumes
esxcfg-advcfg -s 10 /NFS/HeartbeatMaxFailures
esxcfg-advcfg -s 12 /NFS/HeartbeatFrequency
esxcfg-advcfg -s 5 /NFS/HeartbeatTimeout
esxcfg-advcfg -s 64 /NFS/MaxQueueDepth

# For NFS v4.1 protocol
esxcfg-advcfg -s 256 /NFS41/MaxVolumes

# For iSCSI/FC/FCoE SAN protocols
esxcfg-advcfg -s 32 /Disk/QFullSampleSize
esxcfg-advcfg -s 8 /Disk/QFullThreshold

And now let’s check those settings:

# For Ethernet-based protocols like iSCSI/NFS
esxcfg-advcfg -g /Net/TcpipHeapSize
esxcfg-advcfg -g /Net/TcpipHeapMax

# For NFS protocol
esxcfg-advcfg -g /NFS/MaxVolumes
esxcfg-advcfg -g /NFS/HeartbeatMaxFailures
esxcfg-advcfg -g /NFS/HeartbeatFrequency
esxcfg-advcfg -g /NFS/HeartbeatTimeout
esxcfg-advcfg -g /NFS/MaxQueueDepth

# For NFS v4.1 protocol
esxcfg-advcfg -g /NFS41/MaxVolumes

# For iSCSI/FC/FCoE SAN protocols
esxcfg-advcfg -g /Disk/QFullSampleSize
esxcfg-advcfg -g /Disk/QFullThreshold


NetApp usually recommends using the default settings. However, in some cases VMware, NetApp or an application vendor can ask you to modify them. Read more in the VMware KB. Example:

# Set value for Qlogic on 6.0
esxcli system module parameters set -p qlfxmaxqdepth=64 -m qlnativefc
# View value for Qlogic on ESXi 6.0
esxcli system module list | grep qln


NetApp Virtual Storage Console (VSC) is free software which helps you set recommended values on ESXi hosts and Guest OS. VSC also helps with basic storage management, like datastore creation from vCenter, and is a mandatory tool for VVOLs with ONTAP. VSC is available only for the vSphere Web Client and supports vCenter 6.0 and newer.

VASA Provider

VASA Provider is free software which lets your vCenter know about storage specifics and capabilities, like disk types (SAS/SATA/SSD), storage thin provisioning, enabled or disabled storage caching, deduplication and compression. VASA Provider integrates with VSC and allows you to create storage profiles. VASA Provider is also a mandatory tool for VVOLs. NetApp VASA Provider, VSC and the Storage Replication Adapter for SRM are bundled in a single virtual appliance available to all NetApp customers.

Space Reservation — UNMAP

UNMAP functionality allows freeing space on a datastore and the storage system after data has been deleted from VMFS or inside the Guest OS; this process is known as space reclamation. There are two independent processes:

  1. First space reclamation form: ESXi sends UNMAP to the storage system when data has been deleted from a VMFS datastore. For this type of reclamation to work, the storage LUN has to be thin provisioned, and space-allocation functionality needs to be enabled on the NetApp LUN. Reclamation of this type can happen in two cases:
    • A VM or VMDK has been deleted
    • Data was deleted from the Guest OS file system and the space was already reclaimed on VMFS; in other words, Guest OS UNMAP has already happened.
  2. Second space reclamation form: UNMAP from the Guest OS when data is deleted on the Guest OS file system, to free space on the VMware datastore (either NFS or SAN). This type of reclamation has nothing to do with the underlying storage system and does not require any storage tuning or setup, but it does need Guest OS tuning and some additional requirements for the feature to function.

The two space reclamation forms are not tied to one another, and you can set up only one of them, but for the best space efficiency you want both.

First space reclamation form: From ESXi host to storage system

Historically, VMware introduced the first space reclamation form (from VMFS to the storage LUN) in ESXi 5.0, with space reclamation happening automatically and nearly online. However, it wasn’t the best idea because it immediately hit storage performance. So, with 5.x/6.0 VMware disabled automatic space reclamation and you have to run it manually. With VVOLs on ESXi 6.x space reclamation works automatically, and with ESXi 6.5 and VMFS6 it also works automatically, but in both cases asynchronously (not an online process).

On ONTAP, space reclamation (space allocation) is disabled by default. The LUN must be offline to change the setting:

lun modify -vserver vsm01 -volume vol2 -lun lun1_myDatastore -state offline
lun modify -vserver vsm01 -volume vol2 -lun lun1_myDatastore -space-allocation enabled
lun modify -vserver vsm01 -volume vol2 -lun lun1_myDatastore -state online

If you are using an NFS datastore, this space reclamation is not needed, because with NAS the functionality is available by design. UNMAP is needed only for SAN environments; it was definitely one of SAN’s disadvantages compared to NAS.

This type of reclamation occurs automatically in ESXi 6.5 (within up to 12 hours) and can also be initiated manually.

esxcli storage vmfs reclaim config get -l DataStoreOnNetAppLUN
   Reclaim Granularity: 248670 Bytes
   Reclaim Priority: low
esxcli storage vmfs reclaim config set -l DataStoreOnNetAppLUN -p high

Second space reclamation form: UNMAP from Guest OS

Since a VM’s VMDK file is basically a block device, the UNMAP mechanism can be applied there too. Starting with 6.0, VMware introduced this capability: first with Windows in the VVOL environment on ESXi 6.0 (automatic space reclamation from the Guest OS) and manual space reclamation for Windows machines on ordinary datastores; later, automatic space reclamation from the Guest OS (Windows and Linux) on ordinary datastores arrived in ESXi 6.5.

Setting it up to function properly might be trickier than you think. The hardest thing about making this UNMAP work is simply complying with the requirements; once you do, it is easy to make it happen. So, you need to have:

  • Virtual Hardware Version 11
  • vSphere 6.0*/6.5
  • VMDK disks must be thin provisioned
  • The file system of the Guest OS must support UNMAP
    • Linux with SPC-4 support or Windows Server 2012 and later

* If you have ESXi 6.0, then CBT must be disabled, which means that in a real production environment you are not going to have Guest OS UNMAP, since no production can live without proper backups (backup software leverages CBT to function)

Moreover, if we are adding ESXi UNMAP to the storage system, a few more requirements need to be honored:

  • LUN on the storage system must be thinly provisioned (in ONTAP it can be enabled/disabled on the fly)
  • Enable UNMAP in ONTAP
  • Enable UNMAP on Hypervisor

Never use Thin virtual disks on Thin LUN

For many years all storage vendors said not to use thin virtual disks on thin LUNs, and now it is a requirement for making space reclamation from the Guest OS work.


UNMAP is supported in Windows starting with Windows Server 2012. To make Windows reclaim space from a VMDK, NTFS must use an allocation unit of 64KB. To check UNMAP settings, issue the following command:

fsutil behavior query disabledeletenotify

DisableDeleteNotify = 0 (Disabled) means UNMAP is going to report to the hypervisor to reclaim space.

Linux Guest OS SPC-4 support

First, let’s check whether our virtual disk is thin or thick:

sg_vpd -p lbpv
Logical block provisioning VPD page (SBC):
Unmap command supported (LBPU): 1

1 means we have a thin virtual disk. If you got 0, then your virtual disk is thick (lazy or eager zeroed); neither is supported with UNMAP. Let’s go further and check that we have SPC-4:

sg_inq -d /dev/sda
standard INQUIRY:
PQual=0 Device_type=0 RMB=0 version=0x06 [SPC-4]
Vendor identification: VMware
Product identification: Virtual disk
Product revision level: 2.0

We need SPC-4 for UNMAP to work automatically. Let’s check that the Guest OS notifies SCSI about reclaimed blocks:

grep . /sys/block/sdb/queue/discard_max_bytes

A value greater than 0 means we are good. Now let’s issue a manual discard and see whether the device accepts it:

sg_unmap --lba=0 --num=2048 /dev/sda
# or
blkdiscard --offset 0 --length=2048 /dev/sda

If you are getting “blkdiscard: /dev/sda: BLKDISCARD ioctl failed: Operation not supported”, then UNMAP doesn’t work properly. If we do not get an error, we can remount our filesystem with the “-o discard” option to make UNMAP automatic.

mount /dev/sda /mnt/netapp_unmap -o discard

Guest OS

You need to check your Guest OS configuration for at least two reasons:

  1. To gain maximum performance
  2. To make sure that, if one controller goes down, your Guest OS survives the takeover timeout

Disk alignment: making sure you get maximum performance

Disk misalignment is an infrequent situation, but you still may run into it. There are two levels where you can get this type of problem:

  1. When you created a LUN in ONTAP with the wrong geometry (for example, Windows 2003) and then used it with Linux. This type of problem can occur only in a SAN environment, and it is very simple to avoid: when you create a LUN in ONTAP, make sure you choose the proper LUN geometry. This problem happens between storage and hypervisor.
  2. Inside a virtual machine. This can happen in both SAN and NAS environments.

To understand how it works, let’s take a look at a properly aligned configuration.

Fully aligned configuration

In this image the upper block belongs to the Guest OS, the block in the middle belongs to ESXi, and the lower block represents the ONTAP storage system.

First case: Misalignment with VMFS

This happens when your VMFS file system is misaligned with your storage system; it occurs if you create a LUN in ONTAP with geometry that does not match VMware. It is very easy to fix: just create a new LUN in ONTAP with VMware geometry, create a new VMFS datastore on it, move your VMs to the new one, and destroy the old one.

Second case: Misalignment inside your guest OS

This is also a very rare problem, which you can hit with very old Linux distributions, or Windows 2003 and older. However, we are here to discuss all the possible problems to understand better how it works, right? This type of problem can occur on NFS datastores and on VMFS datastores leveraging SAN protocols, and also with RDM and VVOLs. It usually happens with virtual machines using a non-optimally aligned MBR in the Guest OS, or with Guest OSes which were previously converted from physical machines. How to identify and fix misalignment in the Guest OS you can find in the NetApp KB.

Misalignment on two levels simultaneously

Of course, if you are very unlucky, you can get both simultaneously: on the VMFS level and the Guest OS level. Later in this article, we will discuss how to identify such a problem from the storage system side.


NetApp ONTAP storage systems consist of one or a few building blocks called HA pairs. Each HA pair consists of two controllers; in the event of a failure of one controller, the second will take over and continue to serve clients. Takeover is a relatively fast process in ONTAP: in new All-Flash FAS (AFF) configurations it takes from 2 to 15 seconds, while with hybrid FAS systems it can take up to 60 seconds. 60 seconds is the absolute maximum after which NetApp guarantees failover to be completed on FAS systems; it usually finishes in 2-15 seconds. These numbers should not scare you, because your VMs will survive this window as long as your timeouts are set equal to or greater than 60 seconds, and the default VMware Tools value for your VMs is 180 seconds anyway. Moreover, since your ONTAP cluster can contain different models, generations and disk types, it is a good idea to use the worst-case scenario, which is 60 seconds.

Guest OS   Updated Guest OS tuning for SAN
           (ESXi 5 and newer, or ONTAP 8.1 and newer)
Windows    disk timeout = 60
Linux      disk timeout = 60

Default values for Guest OS on NFS datastores are tolerable, and there is no need to change them. However, I would recommend testing a takeover anyway, to be sure how things behave during such events.

You can configure these values manually or with the NetApp Virtual Storage Console (VSC) utility, which provides scripts that help reduce the effort of manually updating the Guest OS tunings.


You can change the Windows registry and reboot the Guest OS. The timeout in Windows is set in seconds, in hexadecimal format.

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\Disk] “TimeOutValue”=dword:0000003c
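As a quick sanity check of the value above: 60 seconds in hexadecimal is 0x3c, which matches the dword written to the registry. For example, in Python:

```python
# 60 decimal -> hexadecimal, padded to the 8-digit dword form used in the registry
print(f"{60:08x}")  # -> 0000003c
```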


The timeout in Linux is set in seconds, in decimal format. To change it, you need to modify a udev rule in your Linux OS. The location of udev rules may vary from one Linux distribution to another.

# RedHat systems
ACTION=="add", BUS=="scsi", SYSFS{vendor}=="VMware, " , SYSFS{model}=="VMware Virtual S", RUN+="/bin/sh -c 'echo 60 >/sys$DEVPATH/device/timeout'"

# Debian systems
ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware " , ATTRS{model}=="Virtual disk ", RUN+="/bin/sh -c 'echo 60 >/sys$DEVPATH/device/timeout'"

# Ubuntu and SUSE systems
ACTION=="add", SUBSYSTEMS=="scsi", ATTRS{vendor}=="VMware, " , ATTRS{model}=="VMware Virtual S", RUN+="/bin/sh -c 'echo 60 >/sys$DEVPATH/device/timeout'"

VMware Tools automatically sets a udev rule with a 180-second timeout. You should double-check it with:

find /sys/class/scsi_device/*/device/timeout -exec grep -H . '{}' \;
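If you want to script that check, here is a hedged sketch in Python; it assumes the same sysfs layout as the find command above and simply flags devices whose timeout is below the 60-second minimum (function name is mine):

```python
import glob

def disks_below_timeout(min_timeout=60,
                        pattern="/sys/class/scsi_device/*/device/timeout"):
    """Return sysfs timeout paths whose value is below the required minimum."""
    bad = []
    for path in glob.glob(pattern):
        with open(path) as f:
            if int(f.read().strip()) < min_timeout:
                bad.append(path)
    return bad
```

Run it on a VM after a reboot to confirm the udev rule actually took effect on every disk.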


NetApp publishes the Interoperability Matrix Tool (IMT), and VMware has the Hardware Compatibility List (HCL). Check both and stick to the versions that both vendors support to reduce potential problems in your environment. Before any upgrade, make sure you stay within the compatibility lists.


Most of this article is theoretical knowledge that you probably will not need, because nowadays nearly all parameters are either assigned automatically or set by default, and you will most likely never see misalignment in new real-life installations. But if something goes wrong, the information in this article will shed light on some under-the-hood aspects of ONTAP architecture and help you dig in and figure out the cause of your problem.

Correct settings for your VMware environment with NetApp ONTAP give you not just better performance but also ensure you will not get into trouble in the event of a storage failover or network link outage. Make sure you follow NetApp, VMware and application vendor recommendations. When you set up your infrastructure from scratch, always test performance and availability: simulate storage failover and network link outages. Testing will help you understand your infrastructure's baseline performance and its behavior in critical situations, and will help you find weak points. Stay within the compatibility lists; that will not guarantee you never run into trouble, but it reduces risk and keeps you supportable by both vendors.

Continue to read

How memory works in ONTAP: Write Allocation, Tetris, Summary (Part 3)

Write Allocation

NetApp has changed this part of the WAFL architecture a lot since its first release. Increased demand for performance, multi-core CPUs and new types of media were the forces driving continuous improvement of Write Allocation.

Each thread can be processed separately, depending on data type, write/read pattern, block size, the media it will be written to, and other characteristics. Based on that, WAFL can decide how and where data will be placed and what type of media it should use.

For example, for flash media it is important to write data with the smallest block size the media can operate on, to ensure cells do not wear out and are utilized evenly. Another example is when data is separated from metadata and stored on different tiers of storage. In reality, how ONTAP optimizes data in the Write Allocation module is a very large, separate topic.


The growing number of CPU cores forced NetApp to develop new algorithms for process parallelization over time. Volume affinities (and other affinities) are algorithms that allow multiple processes to execute in parallel without blocking each other, though sometimes a serial process that blocks the others is still necessary. For example, when two hosts work on the same storage with two different volumes, they can write and modify data in parallel. If those two hosts start writing to a single volume, ONTAP becomes a broker: it serializes access to that volume, giving it to one host at a time, and only grants access to the other once the first is done. Write operations for a volume are always executed on a single core; otherwise, because cores can be loaded unequally, you could end up corrupting your data. Each of these affinities reduced locking of other processes and increased parallelization as well as multi-core CPU utilization. Volume affinities are called "Waffinity" because each volume is a separate WAFL file system, so the words "WAFL" and "affinity" were combined.

  • Classical Waffinity (2006)
  • Hierarchical Waffinity (2011)
  • Hybrid Waffinity (2016)
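The per-volume serialization described above can be pictured as a broker holding one lock per volume: writers to different volumes run in parallel, while writers to the same volume take turns. A minimal sketch (the class and method names are mine, not ONTAP internals):

```python
import threading
from collections import defaultdict

class VolumeBroker:
    """Toy model: one lock per volume serializes same-volume writers."""

    def __init__(self):
        self._locks = defaultdict(threading.Lock)

    def write(self, volume: str, apply_write):
        # Same volume -> callers queue behind one lock; different
        # volumes -> different locks, so they proceed in parallel.
        with self._locks[volume]:
            return apply_write()
```

Hierarchical Waffinity extends the same idea to a tree of lock domains (aggregate above volume), which is why aggregate-level work can block all of its volumes.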


If, for instance, ONTAP needs to work at the aggregate level, the whole set of volumes living on that aggregate will stop getting write operations for some time, because aggregate operations lock volume operations; this is just one example of the locking mechanisms NetApp has been improving throughout ONTAP's lifespan. What would be the most natural solution in such a case? Multiple volumes, multiple aggregates, and even multiple nodes; but then, instead of a single bucket of space, you would have multiple volumes spread across multiple nodes and aggregates. That is where FlexGroup comes into the picture: it joins all the (constituent) volumes into a single logical space, visible to clients as one volume or file share. Before FlexGroup, ONTAP was very well optimized for workloads with small random blocks and even sequential reads; now, thanks to FlexGroup, ONTAP is optimized for sequential operations too, mainly benefiting sequential writes.



From the WAFL module, data is delivered to the RAID module, which processes it and writes it in one transaction (known as a stripe) to the disks, including parity data on the parity drives.

Because data is written to the disks in full stripes, there is no need to read back and recalculate parity for the parity drives: the RAID module prepared everything in memory. That is the reason why, in practice, parity drives are always less utilized than data drives, unlike with traditional RAID-6 and RAID-4. This approach avoids rewriting data in place: new data goes to new empty space, and metadata links are simply moved to the new location. It also means the system does not have to read a stripe into memory and recalculate parity after a single block changes, so system memory is used more rationally. More about RAID-DP in TR-3298.


Tetris and IO redirection

Tetris is a WAFL mechanism for write and read optimization. Tetris collects data for each CP and compiles it into chains, so that blocks from a single host are assembled into bigger chunks (this process is also known as IO redirection). On the other hand, this simple logic also enables predictive reads, because there is little difference between reading, say, 5 KB or 8 KB of data, or 13 KB versus 16 KB. A predictive read is a form of read caching, and when the time comes to decide what data should be read in advance, the answer comes naturally: the data most likely to be read next is the data that was written together with the first part. The decision about what extra data to read has, in effect, already been made.
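Chain assembly can be illustrated with a few lines that merge adjacent block offsets into longer runs, which is the gist of what coalescing buys you before a sequential destage. Purely illustrative, not WAFL code:

```python
def coalesce(offsets, chunk=1):
    """Merge block offsets into (start, length) chains of adjacent blocks."""
    chains = []
    for off in sorted(offsets):
        last = chains[-1] if chains else None
        if last and off == last[0] + last[1]:
            # Adjacent to the previous chain: extend it instead of
            # starting a new (seek-inducing) write.
            chains[-1] = (last[0], last[1] + chunk)
        else:
            chains.append((off, chunk))
    return chains

print(coalesce([0, 1, 2, 5, 6, 9]))  # [(0, 3), (5, 2), (9, 1)]
```

Six scattered one-block writes collapse into three sequential runs; the longer the runs, the cheaper the destage and the more useful a predictive read of the same run becomes.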


Read Cache

MBUF is used for both read and write caching, and all reads, without exception, are inevitably copied into the cache. From the same cache, hosts can read data that has just been written. When the CPU cannot find data in the system cache for a host, it looks in another cache tier if available, and only then on disk. Once data is found on disk, the system places it into MBUF. If that piece of data is not accessed for a while, the CPU can de-stage it to a second memory tier like Flash Pool or Flash Cache.

It is important to note that the system evicts unaccessed data from the cache at a very granular, 4 KB block level. Another important thing is that the cache in ONTAP systems is deduplication-aware: if a block already exists in MBUF or in the second-tier cache, it will not be copied there again.
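The deduplication-aware behavior boils down to "key the cache by content, not by address." A toy sketch (class name is mine, and a real implementation would use block fingerprints, not full SHA-256 over every insert):

```python
import hashlib

class DedupCache:
    """Toy content-addressed cache: identical blocks are stored once."""

    def __init__(self):
        self.blocks = {}        # content hash -> block data
        self.stored_writes = 0  # how many blocks actually landed in cache

    def insert(self, block: bytes) -> str:
        key = hashlib.sha256(block).hexdigest()
        if key not in self.blocks:
            # Content already cached elsewhere is never copied again.
            self.blocks[key] = block
            self.stored_writes += 1
        return key
```

Inserting the same 4 KB block twice consumes cache space only once, which is exactly why a deduplicated working set gets an effectively larger cache.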

Why is NVRAM/NVMEM size not always important?

In NetApp FAS/AFF and other ONTAP-based systems, NVRAM/NVMEM is used exclusively for NVLOGs, not as a write buffer, so it does not have to be as big as system memory is in other systems.

ONTAP NVRAM vs competitors

Battery and boot flash drive

As I mentioned before, hardware systems like FAS & AFF have a battery installed in each controller to power NVRAM/NVMEM and maintain its data in memory for 72 hours. Just after a controller loses power, data from memory is destaged to the boot flash drive. Once ONTAP boots, the data is restored back.

Flash media and WAFL

As I mentioned in previous articles in the series, ONTAP always writes to new space, for a number of architectural reasons. However, it turns out flash media needs precisely that. Though some people predicted WAFL's death because of flash media, WAFL works quite well on it: the "always write to new space" technique not only optimizes garbage collection and wears flash memory cells evenly, it also shows quite competitive performance.


The system memory architecture is built not just to optimize performance, but to offer high data protection, availability and write optimization. Rich ONTAP functionality, the unique use of NVRAM/NVMEM, and a rich software integration ecosystem qualitatively differentiate NetApp storage systems from others.

Continue to read

NVLOG & Write-Through (Part 2)

How ONTAP cluster works?

Zoning for ONTAP Cluster


Please note that in this article I describe my own understanding of the internal organization of system memory in ONTAP systems. Therefore, this information might be outdated, or I might simply be wrong in some aspects and details. I will greatly appreciate any contribution you make to improve this article; please leave your ideas and suggestions about this topic in the comments below.

How memory works in ONTAP: NVLOG and Write-Through (Part 2)

NVRAM/NVMEM & Write-Through

It is important to note that NVRAM/NVMEM technology is widely used in many storage systems, but NetApp ONTAP systems use it for NVLOGs (hardware journaling), while others use it as a block device for write cache (at the disk driver level, as a disk cache), and that simple fact makes a real difference in storage architecture. Building its architecture around NVLOGs allows an ONTAP system not to switch into Write-Through mode when one controller in an HA pair dies.

That simple statement is tough to explain in simple words, but let me try. Write-Through is a storage system mode that bypasses the write buffer: all writes go straight to the disks. Disabling the write buffer is a bad idea for many reasons: all the optimizations, all the magic, all the intelligent handling of your data happens in the write buffers. For example, consider the problems you might experience with a storage system in Write-Through mode on HDDs. HDDs are significantly slower than memory, so the write buffer lets you collect random operations, glue them together in memory, and later destage them sequentially to your HDDs as one big chunk in a single operation, which is far easier for HDDs to process. The memory cache is essentially used to "trick" your hosts: it acknowledges writes before the data is actually placed on disk, improving performance. With flash media, the write buffer lets you lay data out so as not to wear out your memory cells. Memory is also very often used as the place where checksums for RAID (or other data protection schemes) are prepared. Bottom line: Write-Through is terrible for storage performance, and all storage vendors try to avoid it. When might a storage architecture need to switch to Write-Through? When it cannot be certain the write cache will protect the data. The simplest example: the battery backing your write cache dies.

Let's examine another, more complex scenario: what if you have an HA pair and the battery on only one controller dies? Since all A-brand storage systems do HA, your writes should be protected. But what happens if your HA pair loses one controller and the surviving one still has a working battery? Many of you might think, following the logic above, that the storage system will not switch to Write-Through, right? Well, the answer is "it depends." In the ONTAP world, NVLOGs in the dedicated NVRAM/NVMEM device are used only for data protection: data is always kept exactly as it was placed there, in unchanged form, with no architectural ability to modify it. The only operations allowed are writing to an empty half and, when needed, clearing all the NVLOGs from the other half. With this architecture there is no need to switch an ONTAP system to Write-Through even with only a single controller left, whereas other architectures, even those that also use NVRAM/NVMEM, store all the data in one place.

Both ONTAP and other vendors' systems use memory for data optimization; in other words, they change data from its original state. And changing your data is a big threat to data consistency once only one controller survives, even if the battery on that controller functions properly. That is why all those other storage systems have to switch to Write-Through: there is no way to guarantee your data will not be corrupted in the middle of optimization, especially after it has been half-processed by the RAID algorithm, if the surviving node then reboots unexpectedly. Therefore, all the other platforms and systems, all the NetApp AFF/FAS competitors I know of, will switch to Write-Through mode once only one node is left. There are, obviously, some tricks: some vendors let you disable Write-Through in such a situation, but of course it is not recommended; they just give you the ability to make a bad choice on your own, one that will lead to data corruption across the entire storage if the surviving node reboots unexpectedly. Another example is HPE 3Par: in a 4-node configuration, if you lose only one controller the system continues to function normally, but once you lose 50% of your nodes it switches to Write-Through, and the same happens in a 2-node configuration.

Thanks to the fact that ONTAP stores data in NVLOGs as it arrived, in unchanged form, it is possible to roll back an earlier unfinished transaction whose data was already half-processed by RAID, restore the data back to MBUF from the NVLOGs, and finish the transaction. Each transaction that writes new data to disk from MBUF is executed as part of a system snapshot called a CP. Any transaction can easily be rolled back: after the single surviving controller boots, it restores data from NVLOGs to MBUF, processes it with RAID in memory again, rolls back the last unfinished CP, and writes the data to disk. This allows ONTAP systems to always stay consistent (from the storage system's perspective) and never switch to Write-Through mode.
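The rollback-and-replay idea can be sketched in a few lines. This is my own toy model, not ONTAP code: the journal keeps raw writes exactly as received, the "disk" only changes at a CP, and recovery after a crash simply replays the journal.

```python
class JournaledStore:
    """Toy NVLOG model: raw writes are journaled, destaged only at a CP."""

    def __init__(self):
        self.disk = {}    # committed state as of the last consistency point
        self.nvlog = []   # raw writes, in unchanged form, since the last CP

    def write(self, key, value):
        # A write is acknowledged as soon as it is journaled.
        self.nvlog.append((key, value))

    def consistency_point(self):
        # Process and destage everything as one transaction; only then
        # is it safe to clear the journal.
        for key, value in self.nvlog:
            self.disk[key] = value
        self.nvlog.clear()

    def recover(self):
        # After a crash mid-CP: discard the partial result (the disk still
        # holds the previous CP) and replay the untouched raw journal.
        self.consistency_point()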

Continue to read

How ONTAP cluster works?

Zoning for ONTAP Cluster



How memory works in ONTAP: NVRAM/NVMEM, MBUF and CP (Part 1)

In this article, I'd like to describe how NVRAM, cache and system memory work in ONTAP systems.

System Memory

System memory in any AFF/FAS system (and other OEM systems running ONTAP) consists of two types of memory: ordinary (volatile) memory, and non-volatile memory (hence the "NV" prefix) connected to a battery on the controller. Some systems use NVRAM and others NVMEM; both serve the same purpose. The only difference is that NVMEM is installed on the memory bus, while NVRAM is a small board with memory on it, connected to the PCIe bus. Either NVRAM or NVMEM is used by ONTAP to store NVLOGs. Ordinary memory is used for system needs, mostly as the Memory Buffer (MBUF); in other words, as cache.

In a normally functioning storage system, data is de-staged from the MBUF, not from NVRAM/NVMEM. However, the trigger for de-staging might be, among others, that NVRAM/NVMEM is half-full. See the CP triggers section below.



Data from hosts is always placed into the MBUF first. Then the data is copied from MBUF to NVMEM/NVRAM with a Direct Memory Access (DMA) request, exactly as the host sent it, in the unchanged form of logs, just like a DB log or a journaling file system. DMA does not consume CPU cycles; that is why NetApp says the system has a hardware-journaled file system. As soon as the data is acknowledged to be in NVRAM/NVMEM, the system sends a write acknowledgment to the host. After a Consistency Point (CP) occurs, data from MBUF is de-staged to the disks, and the system clears NVRAM/NVMEM without using the NVLOGs at all. So in a normally functioning High Availability (HA) system, ONTAP does not use the NVLOGs: it erases them each time a CP occurs and then writes new NVLOGs. ONTAP uses NVLOGs only in special events or configurations; for instance, to restore data "as it was" back to MBUF after an unexpected reboot.

Memory Buffer: Write operations

The first place where all writes land is always MBUF. From MBUF the data is copied to NVRAM/NVMEM with a DMA call; after that the WAFL module allocates a range of blocks where the data from MBUF will be written, which is simply called Write Allocation. It might sound simple, but it is kind of a big deal and has been constantly optimized by NetApp. However, just before allocating space for your data, the system plays Tetris! Yes, I'm talking about the same puzzle-matching game you might have played in your childhood. One of Write Allocation's jobs is to make sure all the "Tetris" data is written to disk in one unbroken sequence whenever possible.

WAFL also does some additional data optimizations depending on the type of data, where it goes, what type of media it will be written to, etc. After the WAFL module gets an acknowledgment from NVRAM/NVMEM that the data is secured, the RAID module processes the data from MBUF, adds checksums to each block (known as block/zone checksums), and calculates and writes parity data on the parity disks. It is important to note that some data in MBUF contains commands which can be "extracted" before they are delivered to the RAID module; for example, a command may ask the storage system to format some space in a preassigned repeating pattern, or to move chunks of data. Such commands might consume a small amount of space in NVRAM/NVMEM but generate a large amount of data when executed.



Each HA pair in ONTAP consists of two nodes (controllers), and each node holds a copy of the data from its HA partner. This architecture allows hosts to switch to the surviving node and continue to be served without noticeable disruption.

To be more precise, each NVRAM/NVMEM is divided into two pieces: one stores NVLOGs from the local controller, and the other stores the copy of NVLOGs from the HA partner. Each piece is further divided into halves. Each time the system fills the first local half of NVRAM/NVMEM, a CP event is generated; while the CP runs, the local controller uses the second local half for new operations. After the second half fills with logs, the system switches back to the first, already emptied half, and repeats the cycle.
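That fill-and-switch cycle is essentially a double buffer. A toy sketch (class and names are mine, purely illustrative) modeling one local piece of NVRAM split into two halves:

```python
class NvlogHalves:
    """Toy double buffer: fill one half, trigger a CP, switch halves."""

    def __init__(self, half_capacity):
        self.halves = [[], []]
        self.active = 0            # which half receives new NVLOGs
        self.capacity = half_capacity
        self.cp_count = 0

    def log(self, entry):
        self.halves[self.active].append(entry)
        if len(self.halves[self.active]) >= self.capacity:
            self.cp_count += 1                 # CP destages the full half...
            self.active ^= 1                   # ...while logging moves over
            self.halves[self.active].clear()   # emptied by the previous CP
```

With a half capacity of two entries, four writes produce exactly two CPs and end back on the first half, which mirrors the cycle described above.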


Consistency Points

Like many modern file systems, WAFL is a journaling FS, which it uses to keep consistency and protect data. However, unlike general-purpose journaling file systems, WAFL needs no time or special FS checks to roll back and verify that the FS is consistent. If a controller unexpectedly reboots, the last unfinished CP transaction is simply not confirmed: similar to a snapshot, the last one is just deleted, and the data from the NVLOGs is used to create a new consistency point once the controller boots. A CP transaction is confirmed only once the entire transaction has been written to disk and the root inode has been changed with new pointers to the new data blocks.

It turns out NetApp's snapshot technology was so successful that it is used almost everywhere in ONTAP. Let me remind you that each CP contains data already processed by the WAFL module and then by the RAID module. A CP is itself a snapshot: before data processed by WAFL and RAID is destaged from MBUF to disk, ONTAP creates a system snapshot of the aggregate it is going to write to; to be more precise, it just copies the root inode pointers, capturing the last active file system state. Then ONTAP writes the new data to the aggregate. Once the data is successfully written as part of the CP transaction, ONTAP changes the root inode pointers and clears the NVLOGs. In case of failure, even if both controllers reboot simultaneously, the last system snapshot is rolled back, the data is restored from the NVLOGs, processed again by the WAFL and RAID modules, and destaged back to disk on the next CP as soon as the controllers come back online.

If only one controller suddenly switches off or reboots, the surviving controller restores data from its own copy of the NVLOGs and finishes the earlier unsuccessful CP; applications are transparently switched to the surviving controller and continue to run after a small pause, as if there were no disruption at all. Once the CP succeeds, as part of the CP transaction, ONTAP changes the root inode with pointers to the new data and creates a new system snapshot capturing the newly added data and the pointers to the old data. In this way, ONTAP always maintains data consistency on the storage system without ever needing to switch to Write-Through mode.

ONTAP 9 performs CPs separately for each aggregate, whereas previously a CP was controller-wide. With per-aggregate CPs, slow aggregates no longer influence other aggregates in the system.


An inode contains information about a file, known as metadata, but an inode can store a small amount of data too. Inodes have a hierarchical structure, and each inode can store up to 4 KB of information. If a file is small enough for both its data and its metadata to fit into an inode, then only a single 4 KB block is used for it. A directory is actually also a file on the WAFL file system, so a real-world example where an inode stores both metadata and the data itself is an empty directory or an empty file. But what if the file does not fit into an inode? Then the inode stores pointers to other inodes, and those inodes in turn can store pointers to further inodes or addresses of data blocks. Currently, WAFL has a 5-level hierarchy limit. Sometimes inodes and data blocks are referred to as files in deep-dive technical documentation about WAFL. As a result, each file on a FlexVol file system can store no more than 16 TiB. Each FlexVol volume is a separate WAFL file system and has its own volume root inode.
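As a sanity check on that 16 TiB figure: with 4 KiB WAFL blocks, 16 TiB works out to exactly 2^32 addressable blocks. (The real limit comes from the pointer-tree layout; this is just the back-of-the-envelope arithmetic.)

```python
BLOCK = 4 * 1024          # WAFL block size, 4 KiB
LIMIT = 16 * 1024**4      # FlexVol per-file limit, 16 TiB

blocks = LIMIT // BLOCK
print(blocks == 2**32)    # True: 2^44 bytes / 2^12 bytes per block = 2^32 blocks
```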


An interesting nuance: the reason Write Anywhere File Layout got the word "Anywhere" in its name is probably that metadata can live anywhere in the FS, mixed in with data blocks, while traditional file systems usually store their metadata in a dedicated, fixed-size area on disk. Here is the list of metadata that can be stored alongside the data:

  • Volume where the inode resides
  • Locking information
  • Mode and type of file
  • Number of links to the file
  • Owner’s user and group ids
  • Number of bytes in the file
  • Access and modification times
  • Time the inode itself was last modified
  • Addresses of the file’s blocks on disk
  • Permission: UNIX bits or Windows Access Control List (ACLs)
  • Qtree ID.

Events generating CP

A CP is an event generated by one of the following conditions:

  • 10 seconds have passed since the last CP
  • The system filled the first half of NVRAM
  • Local MBUF is filled (known as the high watermark). This really happens because MBUF is usually much bigger than NVMEM/NVRAM, or when commands in MBUF generate a large amount of new data before or in the WAFL/RAID modules
  • The halt command was executed on the controller to stop it
  • Others
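Those triggers can be condensed into a single predicate. A sketch with illustrative names; the real implementation and thresholds are internal to ONTAP:

```python
import time

def should_trigger_cp(last_cp_time, nvram_half_full, mbuf_high_watermark,
                      halt_requested, now=None, interval=10.0):
    """Fire a CP when any of the listed conditions holds."""
    now = time.monotonic() if now is None else now
    return (now - last_cp_time >= interval   # timer expired
            or nvram_half_full               # first NVRAM half is full
            or mbuf_high_watermark           # MBUF hit its high watermark
            or halt_requested)               # controller is being halted
```

Because the conditions are OR-ed, a busy system never waits out the 10-second timer: whichever threshold is hit first wins.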

CP conditions can indirectly point at system problems; for example, when there are not enough HDD drives to maintain performance, you will see back-to-back ("B" or "b") CPs. See also the Knowledge Base FAQ: Consistency Point.

NVRAM/NVMEM and MetroCluster

To protect data from a split-brain scenario in MetroCluster (MCC), a host writing data to the system gets an acknowledgment only after the data has been acknowledged by the local HA partner and by the remote MCC partner (when the MCC comprises 4 nodes).


HA interconnect

Data synchronization between local HA partners happens over the HA interconnect. If the two controllers of an HA pair are located in two separate chassis, the HA interconnect is an external connection (in some models it can run over InfiniBand or Ethernet, with ports usually named cNx, i.e., c0a, c0b, etc., as in FAS3240 systems, for example). If the two controllers are placed in a single HA chassis, the HA interconnect is internal and there are no visible connections. Some controllers can be used in both configurations: HA in a single chassis, or HA with each controller in its own chassis. Such controllers have dedicated HA interconnect ports, often named cNx (i.e., c0a, c0b, etc.), but when such a controller is used in a single-chassis configuration, those ports are unused (and cannot be repurposed for anything else), and communication is established internally through the chassis backplane.

Controllers vs Nodes

A storage system is formed out of one or a few HA pairs. Each HA pair consists of two controllers, sometimes called nodes. "Controller" and "node" are very similar and often interchangeable terms; the difference is that controllers are physical devices, while nodes are the OS instances running on the controllers. Controllers in an HA pair are connected via the HA interconnect. With hardware appliances like AFF and FAS systems, each hard disk is connected simultaneously to both controllers in an HA pair. Tech documents often refer to them as "controller A" and "controller B." Even though hard drives in AFF & FAS systems physically have one connector, that connector comprises two ports, and each port of each drive is connected to one of the controllers. So if you ever dig deep into the node shell console and enter the disk show command, you will see disks named like 0c.00.XX, where 0c is the current port through which that disk is connected to the controller that "owns" it, 00 is the ID of the disk shelf, and XX is the position of the drive in the shelf. At any given time, only one controller owns a disk or a partition on a disk; owning a disk or partition means that controller serves data to hosts from it. The HA partner steps in only when the owner of the disk or partition dies; each controller in ONTAP has, and serves, its own drives/partitions. This architecture is known as "share nothing." There are two types of HA policies: SFO (storage failover) and CFO (controller failover). CFO is used for root aggregates and SFO for data aggregates; CFO does not change disk ownership in the aggregate, while SFO does.

ToasterA*> disk show -v
  DISK       OWNER                  POOL   SERIAL NUMBER
------------ -------------          -----  ----------------
0c.00.1      unowned                  -     WD-WCAW30485556
0c.00.2      ToasterA  (142222203)  Pool0   WD-WCAW30535000
0c.00.11     ToasterB  (142222204)  Pool0   WD-WCAW30485887
0c.00.6      ToasterB  (142222204)  Pool0   WD-WCAW30481983

But since each drive in a hardware appliance like a FAS & AFF system is connected to both controllers, each controller can address each disk. If you manually change 0c in this example to 0d, the port through which the drive is reachable from the other controller, the system will be able to address the drive.

ToasterB*> disk assign 0d.00.1 -s 142222203
Thu Mar 1 09:18:09 EST [ToasterB:diskown.changingOwner:info]: 
changing ownership for disk 0d.00.1 (S/N WD-WCAW30485556) from (ID 1234) to ToasterA (ID 142222203)

Software-defined ONTAP storage (ONTAP Select & Cloud Volumes ONTAP) works much like MetroCluster, because by definition it has no "special" equipment; in particular, it has no dual-ported drives connected to both servers (nodes). So instead of both nodes in an HA pair connecting to a single drive, ONTAP Select (and Cloud Volumes ONTAP) copies data from one controller to the second and keeps a copy of the data on each node. That is the price, the flip side, of commodity equipment.

Technically, it would be possible to connect a single external storage, for instance over iSCSI, to each storage node and avoid the unnecessary data duplication, but that option is not available in SDS ONTAP at the moment.

Mailbox “disk”

While it sounds like a disk, it is really not a disk, but rather a tiny special area on a disk consuming a few KB. That mailbox area is used to send messages from one HA partner to another. The mailbox disk is a mechanism that gives ONTAP an additional level of assurance for its HA capabilities. Mailbox disks are used to determine the state of the HA partner, in a way similar to email: from time to time each controller posts a message to its own (local) mailbox disks saying it is alive, healthy and well, while reading from the partner's mailbox. If the timestamp of the last message from the partner is too old, the surviving node takes over. In this way, if the HA interconnect is unavailable for some reason, or a controller freezes, the partner determines the state of the second controller using the mailbox disks and performs the takeover. If a disk holding a mailbox dies, ONTAP chooses a new disk.
By default, mailboxes reside on two disks: one data and one parity disk for RAID 4, or one parity and one double-parity disk for RAID-DP, usually in the first aggregate, which is usually the system root aggregate.

Cluster1::*> storage failover mailbox-disk show -node node1
Node    Location  Index Disk Name     Physical Location   Disk UUID
------- --------- ----- ------------- ------------------ -------------------
node1    local    0      1.0.4         local        20000000:8777E9D6:[...]
         local    1      1.0.6         partner      20000000:8777E9DE:[...]
         partner  0      1.0.1         local        20000000:877BA634:[...]
         partner  1      1.0.2         partner      20000000:8777C1F2:[...]
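The heartbeat logic behind the mailbox mechanism is easy to model. A hedged sketch (function name, parameters and the 5-second threshold are mine; real ONTAP timings and decision logic differ):

```python
def partner_should_takeover(partner_last_post, now, max_age=5.0):
    """True when the partner's last mailbox heartbeat is older than max_age.

    partner_last_post / now are timestamps in seconds (e.g. a monotonic clock).
    """
    return (now - partner_last_post) > max_age
```

The point of routing this through disks rather than the HA interconnect alone is that a frozen controller stops posting even if the interconnect itself looks healthy, so staleness of the mailbox timestamp is an independent liveness signal.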


When a node in an HA pair, whether software-defined or hardware, dies, the surviving one "takes over" and continues to serve the offline node's data to the hosts. With a hardware appliance, the surviving node will also change disk ownership from the HA partner to itself.

Active/Active and Active/Passive configurations

In an Active/Active configuration with the ONTAP architecture, each controller has its own drives and serves data to hosts; in this case, each controller has at least one data aggregate. In an Active/Passive configuration, the passive node does not serve data to hosts and has disk drives only for the root aggregate (for internal system needs). In both Active/Active and Active/Passive configurations, each node needs one root aggregate to function properly. Aggregates are formed out of one or a few RAID groups, and each RAID group consists of a few disk drives or partitions. All the drives or partitions in an aggregate must be owned by a single node.

Continue to read

How ONTAP cluster works?

Zoning for ONTAP Cluster



Which kind of Data Protection is SnapMirror? (Part 2)

I face this question over and over again in different forms. To answer it, we need to understand what kinds of data protection exist. The first part of this article is "How to make a Metro-HA from DR (Part 1)?"

High Availability

This type of data protection tries its best to keep your data available all the time. If you have an HA service, it will continue to work even if one or a few components fail, which means your Recovery Point Objective (RPO) is always 0 with HA, and your Recovery Time Objective (RTO) is near 0. Whatever that RTO number is, we assume that our service, and the applications using it, will survive a failure (maybe with a small pause), continue to function, and not return an error to their clients. An essential part of any HA solution is automatic switchover between two or more components, so your applications transparently switch to the surviving components and continue to interact with them instead of the failed ones. With HA, application timeouts should be set (typically up to 180 seconds) so that RTO is equal to or lower than those timeouts. HA solutions are built to stay within application timeouts, so that upstream services see only a short pause, not an error. Whenever RPO is not 0, the data protection is by definition not an HA solution. The biggest problem with HA solutions is that they are limited by the distance over which the components can communicate: the bigger the gap between them, the more time they need to keep all your data fully synchronous and ready to take over from the failed part.
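The RPO/RTO distinction above can be expressed as a small rule of thumb. This is purely illustrative; the function and the 180-second default are taken from the article's definitions, not from any industry standard:

```python
def classify_protection(rpo_seconds: float, rto_seconds: float,
                        automatic_switchover: bool,
                        app_timeout_s: float = 180.0) -> str:
    """Rough classification following the article's definitions:
    HA needs RPO == 0, automatic switchover, and an RTO within the
    application timeout; anything else falls into the DR category."""
    if rpo_seconds == 0 and automatic_switchover and rto_seconds <= app_timeout_s:
        return "HA"
    return "DR"

print(classify_protection(0, 30, True))      # HA: RPO 0, fast automatic switchover
print(classify_protection(0, 3600, False))   # DR: manual switchover, long RTO
print(classify_protection(900, 3600, False)) # DR: non-zero RPO
```

Note how the second case fails the HA test even with RPO 0: without automatic switchover within application timeouts, it is DR by the article's definition.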

In the context of NetApp FAS/AFF/ONTAP systems, HA can be a local HA pair or a MetroCluster stretched between two sites up to 700 km apart.


Disaster Recovery

The second type of data protection is DR. What is the difference between DR and HA, if both protect data? By definition, DR is the kind of data protection that starts with the assumption that your data is already unavailable and your HA solution has failed for some reason. Why does DR assume your data is unavailable and your infrastructure service is disrupted? The answer is "by definition." With DR, your RPO may or may not be 0, but your RTO is always greater than 0, which means you will get an error accessing your data; there will be a disruption in your service. DR assumes, by definition, that there is no fully automatic and transparent switchover.

Because HA and DR are both data protection techniques, people often confuse them, mix them up and miss the difference, or, vice versa, try to contrast them and choose between them. But after this explanation of what they are and how they differ, you can see that you cannot replace one with the other: they do not compete but rather complement each other.

In the context of NetApp systems, SnapMirror technology is strongly associated with DR capabilities.


Backup & Archive data protection

Backup is another type of data protection. Backup is an even lower level of data protection than DR; it lets you access your data at any time from the backup site in order to restore it to the production site. An essential property of backup data is that it is never altered. Therefore, with backup, we assume we restore data back to the original or another place but never modify the backed-up data itself, which means we never run DR on our backup data. In the context of NetApp AFF/FAS/ONTAP systems, the backup solutions are local Snapshots (of a kind) and the SnapVault D2D replication technology. In ONTAP Cluster-Mode (version 8.3 and newer), SnapVault became XDP, just another engine for SnapMirror; with XDP, SnapMirror is capable of Unified Replication for both DR and Backup. With archives, you do not have online access to your backups, so you need some time to bring them online before you can restore them to the source or another location. A tape library or NetApp Cloud Backup are examples of archive solutions.


Is SnapMirror an HA or a DR data protection technology?

There is no straightforward answer; to answer the question, we have to consider the details.

SnapMirror comes in two flavors. Asynchronous SnapMirror transfers data to a secondary site from time to time; it is obviously a DR technology, because you cannot switch to the DR site automatically since you do not have the latest version of your data. That means that before you start your applications, you might need to prepare them first. For instance, you might need to apply DB logs to your database so that your "not quite latest" version of the data becomes the latest one. Alternatively, you might need to choose one snapshot out of the last few to restore, because the latest one might contain corrupted data, say, a virus. Again, by definition, a DR scenario assumes that you will not switch instantly: it assumes you already have downtime, and it assumes you might need manual interaction, a script, or some modifications before you are able to start and run your services, which requires some downtime.

Synchronous SnapMirror (SM-S) also has two modes: Strict (full synchronous) mode and Relaxed synchronous mode. The problem with synchronous replication, as with HA solutions, is that the longer the distance between the two sites, the more time is needed to replicate the data. And the longer it takes to transfer the data and acknowledge it to the first system, the longer your application waits for the confirmation from your system.

Relaxed mode tolerates lags and network outages and automatically re-synchronizes after network communication is restored, which means it is also a DR solution, because it allows RPO to be non-zero.

Strict mode, by definition, does not tolerate network outages, which means it ensures your RPO is always 0, which makes it closer to HA.

Does it mean Synchronous SnapMirror in Strict mode is an HA solution?

Well, not precisely. Synchronous SnapMirror in Strict mode can also be part of a DR solution. For instance, you can replicate all of a database's data asynchronously to a DR site and replicate only the DB logs synchronously. This reduces network traffic between the two locations, provides a small overall RPO, and lets you replay the synchronously replicated logs to bring the entire DB to RPO 0. In such a scenario, the RTO will not be very big, while your DR site can be located very far away from the primary. See the scenarios for how SnapMirror Sync can be combined with SnapMirror Async to build a more robust DR solution.

To comply with the HA definition, you need not only an RPO of 0 but also the ability to switch over automatically with an RTO no higher than the timeouts of your applications and services.

Can SM-S Strict mode switchover between sites automatically?

The answer is "not YET." For automatic switchover between sites, NetApp has an entirely different technology called MetroCluster, which is a Metro-HA technology. Any MetroCluster or local HA system should be accompanied by DR, Backup and Archive technologies to provide the best data protection possible.

Will SM-S become HA?

I personally believe that NetApp will make it possible in the future to automatically switch over between two sites with SM-S. Most probably it will build on the SVM-DR feature to replicate not only data but also network interfaces and configuration, and for that, SM-S will need some kind of Tiebreaker, as in MCC, but those pieces are not there yet. In my personal opinion, this kind of technology will most probably (and should) be positioned as an online data migration technology across the NetApp Data Fabric rather than as a (Metro-)HA solution.

Why should SM-S not be positioned as an HA solution?

A few reasons:

1) NetApp already has MetroCluster (MCC) technology, and for many, many years it was and still is a superior Metro-HA technology, proven to be stable, reliable and performant.

2) MCC has now become easier, simpler and smaller, which removes the main reasons you would want HA on top of SnapMirror. Since we already have MCC over IP (MCC-IP), it is theoretically possible that it will run even on the smallest AFF systems someday.

That said, my own sense is that in some cases SM-S might be used as an HA solution someday.

How are HA, DR & Backup solutions applied in practice?

As you remember, HA, DR & Backup solutions do not compete with but rather complement each other to provide full data protection. In a perfect world without budget constraints, where you need the highest possible, fully covered data protection, you would need HA, DR, Backups and Archive: HA located in one place or geo-distributed as far as possible (up to 700 km), and on top of that, DR and Backups. For Backups, you would probably place the site as far away as possible, for instance on the other side of the country or even on another continent. In these circumstances, you can run Synchronous SnapMirror only for some of your data, like DB logs, and Async for the rest to an intermediate DR site (up to 10 ms of network RTT latency), and from that intermediate site replicate all the data asynchronously, or as backup protection, to another continent. And from the DR and/or Backup sites we can archive to a tape library, NetApp Cloud Backup, or another archive solution.



HA, DR, Backup and Archive are different types of data protection which complement each other. In the best-case scenario, a company should have not only an HA solution for its data but also DR, Backup and Archive, or at least HA and Backup; it always depends on business needs, the business's willingness to pay for a given level of protection, and an understanding of the risks of not protecting the data properly.

When it comes to really big data, is commodity storage the best option?

Commodity hardware is cheap, right? Well yes, but when it comes to petabytes of data, it becomes more expensive.

Let's think about how many servers you need to store 1 PetaByte of data. It is simple: you need 3 PB of storage, because HDFS, the native file system for Big Data frameworks, creates three copies of your data and spreads those pieces across the cluster randomly.

How many server nodes do you need? The biggest 3.5" NL-SAS HDD you can find nowadays is 12TB (actually 10.91 TiB), the biggest 2.5" SAS HDD is 2.4TB (2.18 TiB), and the biggest 2.5" SSD out there is 32TB (more like 30 TiB), but not all servers support it; the nearest commonly supported drive is 1.6TB (1.46 TiB). So the 2.5" SSD has the most compact data footprint and is the most performant, but also the most expensive.

To get 1 PiB of usable storage with HDFS, we will need 3 PiB of raw capacity, which is (sorted from most to fewest drives):

  • 3PiB/1.46TiB = 2104 drives with 1.6TB SSD
  • 3PiB/2.18TiB = 1409 drives with 2.4TB SAS
  • 3PiB/5.46TiB = 563 drives with 6TB NL-SAS
  • 3PiB/9.09TiB = 338 drives with 10TB SSD
  • 3PiB/10.91TiB = 281 drives with 12TB NL-SAS
  • 3PiB/30TiB = 102 drives with 32TB SSD (not supported by most servers yet)
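The drive counts above are straightforward division. A quick sketch to reproduce them, rounding up since you cannot buy a fraction of a drive (the article's own figures round down in a couple of entries):

```python
import math

RAW_TIB = 3 * 1024  # 3 PiB of raw capacity for 1 PiB usable with HDFS

# Usable TiB per drive, from the list above.
drives = {
    "1.6TB SSD":    1.46,
    "2.4TB SAS":    2.18,
    "6TB NL-SAS":   5.46,
    "10TB SSD":     9.09,
    "12TB NL-SAS": 10.91,
    "32TB SSD":    30.0,
}

for name, tib in drives.items():
    print(f"{name}: {math.ceil(RAW_TIB / tib)} drives")
```

Running this confirms the magnitude of the problem: even with the largest drives, you are in the hundreds of spindles, and with common 1.6TB SSDs, over two thousand.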

How many 2.5" drives can you put in a rack server? About ten drives in a 1U rack server, or up to 24-26 drives in 2U. When it comes to 3.5" NL-SAS, you can fit a maximum of 12 drives in a 2U rack server. Having 10-26 SSD drives per server is a good way to fully utilize the performance potential of the SSDs.

  1. In this case, you'll need either 2104/26 = 81 2U servers with 1.6TB SSDs, or 1409/26 = 54 2U servers with 2.4TB SAS SFF drives, and you might end up with far more servers and computing power than you actually need in your Big Data server farm. Moreover, once you go beyond about 20 nodes, you usually need more than a couple of network switches.
  2. Alternatively, you would need 281/12 = 23 2U servers with 12TB NL-SAS, or 563/12 = 47 servers with 6TB NL-SAS LFF drives, and that number might give you less computing power than you need or, on the contrary, too much.

And let me remind you that, at the time this article was written, those drive sizes are the best-case scenario, since the biggest drives usually do not have the best $/TB price. Therefore, in real Big Data clusters you will normally find smaller drives, so the drive count is higher than in this article's calculations, and thus even more servers are needed.

There are high-density servers like:

  • Cisco S3260, which can hold up to 56 NL-SAS 3.5" LFF drives (maximum supported capacity is 6TB NL-SAS or 10TB SSD)
  • Alternatively, HPE disk enclosures with 96 LFF or 200 SFF drives, which can be connected to a server

However, if you put SSDs in a server with 56 SFF drive slots, you would theoretically need 3 servers in the case of 32TB SSDs (only two are needed, but the minimum is three), and that number might not have enough CPU & RAM to run your tasks; besides, the majority of servers still do not support 32TB drives. With 38 servers (2104/56) in the case of 1.6TB SSDs, you might have a good ratio for utilizing the full potential of the SSDs, but too much computing power for your Big Data farm. Again, with only 6 servers (338/56 ≈ 6) with 10TB SSDs, you will not be able to utilize the full potential of the drives themselves. And with NL-SAS drives, there might not be enough CPU & RAM for your Big Data cluster with only 5 servers (281/56 ≈ 5), and you get an extremely slow storage subsystem.

If you put SSDs in a server with 200 SFF drive slots, you would theoretically need 3 servers in the case of 32TB SSDs (only one is needed, but the minimum is three), and that number might not have enough CPU & RAM to run your tasks and fully utilize SSD performance; but again, the majority of servers still do not support 32TB drives. With only ~11 servers (2104/200) in the case of 1.6TB SSDs, you also cannot utilize the full potential of the SSDs themselves. Needless to say, the situation gets even worse for SSD performance utilization with only 3 servers (338/200 ≈ 2, but 3 is the minimum) with 10TB SSDs, and 3 servers might not provide enough computing power. And with NL-SAS drives, there might not be enough CPU & RAM for your Big Data cluster with only 3 servers (281/96 ≈ 3), and you get an extremely slow storage subsystem.
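The server counts in the two paragraphs above follow one pattern: divide the drive count by the slots per chassis and round up, with a floor of three nodes, the common practical minimum for an HDFS cluster (the article's own figures occasionally round down instead). A sketch:

```python
import math

MIN_NODES = 3  # common practical minimum for an HDFS cluster

def servers_needed(drive_count: int, slots_per_server: int) -> int:
    """Number of servers required to hold the drives,
    honoring the minimum cluster size."""
    return max(MIN_NODES, math.ceil(drive_count / slots_per_server))

print(servers_needed(2104, 26))   # 81: 2U servers full of 1.6TB SSDs
print(servers_needed(563, 12))    # 47: 2U servers with 6TB NL-SAS
print(servers_needed(102, 200))   # 3: 32TB SSDs in 200-slot enclosures
```

The spread between 81 servers and the 3-node minimum is exactly the coupling problem the article describes: the storage medium dictates the compute footprint.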

Do you see how the storage medium and capacity determine your Big Data server farm?

The idea of separating computing from storage comes naturally as a way to make Big Data more flexible.

Additional HDFS overheads

When you choose a strategy to reduce costs as much as you can, you might choose slow NL-SAS drives and high-density servers, and obviously you will try to choose a server that supports a lot of CPU & memory. In this case, when it comes to cluster expansion for storage, CPU or memory, you will have to buy another big server with a lot of CPU, memory and storage to keep the nodes in your HDFS farm more or less equal, whether you actually need those resources or not. In other words, high-density servers increase the granularity of your server farm expansion and force you to buy resources you might not need.

Also, 12TB or 6TB drives might seem like a good choice in TB/$, but they consume far more electricity and are extremely slow compared to SSDs, so NL-SAS is not suitable for some workloads, like Machine Learning & Deep Learning.

HDFS has in its architecture a Checkpoint Node, which copies metadata out of the NameNode's RAM hourly or even daily. This means that if the NameNode (and the Backup Node, if you have one) collapses for any reason, you will lose all changes made after the last metadata backup, even though your data is still there.


There is another, most annoying, thing coming from the HDFS architecture. When you have 23 or 81 servers and HDFS creates three copies of your data, it throws them onto the cluster nodes randomly. What does it mean that the cluster stores data randomly? Let's calculate the probability of finding a single piece of information on a given server: in the best-case scenario it is 3/23, and in the worst case 3/81.

Of course, your cluster will try to run your tasks on nodes that have (almost) all the required data, as part of the data locality strategy, but what is the probability that all the data a task needs sits on a single server? The more data pieces a given task needs, the less likely it is that all of them are on one server, making the probability even lower than 3/23 (or 3/81). You might say that the situation is not as bad as I am painting it, because you run more than one task on more than one server, thus increasing the chance of having your data locally. However, the problem is that files bigger than the HDFS block size (64MB by default in older Hadoop versions) are broken into blocks and stored separately across the nodes of the cluster, so there might not even be a single node that stores all the pieces of a file, which lowers the base probability of local data access further. Moreover, if the server that has the data needed for your task is currently running another task and is fully loaded, while other servers are idle but do not have the required piece of data, you get inefficient cluster resource utilization. In other words, the HDFS architecture increases the probability of node-to-node network traffic the more nodes you have in the cluster and the bigger your files are.

The more nodes you have and the bigger your files, the higher the probability of requesting data from other nodes, increasing cross-switch network traffic.
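The locality odds above can be sketched numerically. With replication factor 3 and random placement, a single block lands on a given node with probability roughly 3/N, and a task that needs k blocks finds all of them on one node with roughly (3/N)^k. This is a simplifying model that ignores rack-aware placement and placement correlation, so treat it as an illustration of the trend, not an exact figure:

```python
def p_all_local(nodes: int, blocks_needed: int, replicas: int = 3) -> float:
    """Approximate probability that one given node holds every block a
    task needs, assuming independent uniform random placement."""
    return (replicas / nodes) ** blocks_needed

print(f"{p_all_local(23, 1):.3f}")   # one block, 23 nodes: 3/23 ≈ 0.130
print(f"{p_all_local(81, 1):.3f}")   # one block, 81 nodes: 3/81 ≈ 0.037
print(f"{p_all_local(23, 4):.6f}")   # four blocks: locality collapses fast
```

The exponential drop with the number of blocks is the article's point: bigger files split into more blocks, so full data locality on any single node quickly becomes negligible and traffic moves onto the network.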

Three copies

Three copies are not efficient because of:

  1. Network Congestion
  2. High levels of IO over server system bus
  3. Poor disk space utilization

Data replication causes additional memory consumption on the servers, and memory problems are a large share of support calls. A degraded server causes performance degradation due to data rebalancing: the cluster rebalances whenever a storage node runs low on free space.

The placement of replicas is critical to HDFS reliability and performance. Optimizing replica placement distinguishes HDFS from most other distributed file systems, and it is a feature that needs a lot of tuning and experience.

NFS with Big Data

With NAS storage like NetApp FAS/AFF systems, you have only about 35% space overhead, compared to 200% overhead in HDFS (replication factor three), and you can scale storage space and computing power separately. This reduces unneeded switches and server resources and allows customers to choose servers based on CPU & memory characteristics, eliminating storage from consideration.

And yes, 30 TiB drives are supported in AFF systems. With only 24x 32TB SSDs in a NetApp AFF system you can get ~1 PiB of effective capacity, assuming 2:1 data reduction, which gives an extremely small physical footprint in the data center and low power consumption.
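The space math behind the two claims above can be checked quickly. The ~35% NAS overhead and the 2:1 reduction ratio are the article's figures; actual results depend on the RAID layout and on how reducible the data is:

```python
# Raw capacity needed to store 1 PiB of user data:
usable_tib = 1024

hdfs_raw = usable_tib * 3      # replication factor 3 -> 200% overhead
nas_raw = usable_tib * 1.35    # ~35% overhead (RAID, spares), per the article

print(hdfs_raw)                # 3072 TiB raw for HDFS
print(nas_raw)                 # 1382.4 TiB raw for NAS

# Effective capacity of 24x 32TB (~30 TiB usable) SSDs at 2:1 reduction,
# after subtracting the same ~35% overhead:
effective_tib = 24 * 30 * 2 / 1.35
print(round(effective_tib))    # ~1067 TiB, i.e. roughly 1 PiB effective
```

So the raw-capacity gap is better than 2x in favor of the NAS layout, and a single 24-drive shelf of large SSDs does land in the ~1 PiB effective range under the stated assumptions.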

A dedicated NAS system aggregates all the drives into a single pool and can expand flexible volumes on the fly, without any need for cluster rebalancing.

NetApp In-Place Analytics Module is a plugin that allows using NAS as primary or secondary storage for Big Data solutions.


NetApp ONTAP systems can replicate your data set for disaster recovery purposes and then replicate only new, changed blocks as deltas to the secondary site, which is essential for big data sets. NFS, unlike HDFS, allows modifying files when needed; on the other hand, if you need to make sure your golden image of the data is never modified, you can use thin clones (FlexClone) to guarantee nothing happens to your original data.

A unique feature like FabricPool allows using SSD drives as primary storage for frequently accessed data while transparently destaging cold data to cheap, S3-compatible storage (and bringing it back), further reducing storage costs while still serving hot, frequently accessed data from SSDs. Data reduction capabilities like deduplication can significantly reduce the data footprint without losing performance, even on the smallest systems.

When you have to optimize two characteristics (computing & storage) within a single solution, it will always lead to inefficiency and compromise.


When it comes to really big data, HDFS simply kills the solution because of the replication factor of three and the underlying architecture. Storage must be separated from the Big Data cluster to make it more flexible and, surprisingly, even cheaper than commodity hardware, especially when it comes to petabytes of data.

How to make a Metro-HA from DR? (Part 1)

This is indeed a frequently asked question, asked in many different forms, for example: can a NetApp DR solution automatically switch sites on a DR event with a FAS2000/A200 system?

As you might guess, in the NetApp world Metro-HA is called MetroCluster (MCC), and DR is called Asynchronous SnapMirror. (Read about SnapMirror Synchronous in Part 2.)

It is the same sort of question as asking, "Can you build a MetroCluster-like solution based on an A200/FAS2000 with async SnapMirror, without buying a MetroCluster; is there an out-of-the-box solution?" The short answer is no, you cannot. There are a few good reasons for that:

  • First of all, DR & HA (or Metro-HA) protect against different kinds of failures and are therefore designed to behave and work quite differently, although both are data protection technologies. MetroCluster is basically an HA solution stretched between two sites (up to 300 km for hardware MCC, or up to 10 km for MetroCluster SDS); it is not a DR solution
  • MetroCluster is based on another technology called SyncMirror; it requires additional PCI cards, models higher than the A200/FAS2000, and there are some other requirements too.

Data Protection technologies comparison

Async SnapMirror, on the other hand, is designed to provide Disaster Recovery, not Metro-HA. DR means you store point-in-time data (snapshots) for cases like logical data corruption, so you have the ability to choose which snapshot to restore. Moreover, that ability also means responsibility, because you, or another human, must decide which one to select and restore. So there is no "automatic, out-of-the-box" switchover to a DR site with async SnapMirror, as there is with MCC. Once you have many snapshots, you have many options, which makes it hard for a program or a system to decide which one to switch to. SnapMirror also provides many options for backup & restore:

  • Different platforms on the main & DR sites (in MCC both systems must be the same model)
  • A different number & types of drives (in MCC mirrored aggregates must be the same size & drive type)
  • Fan-out & cascade replicas (MCC has only two sites)
  • Replication can run over L3; there is no L2 requirement (MCC works only over L2)
  • You can replicate separate volumes or an entire SVM (with exclusions for some volumes if necessary). With MCC you replicate the entire storage system config and selected aggregates
  • Many snapshots (though MCC can contain snapshots, it switches only between the Active FS on both sites).

All these options give async SnapMirror a lot of flexibility, and they mean your storage system would need very complex logic to switch between sites automatically. Long story short, it is impossible to build a single solution with logic that satisfies every customer, every possible configuration and every application at once. In other words, with a solution as flexible as async SnapMirror, switchover is in many cases done manually.

At the end of the day, an automatic or semi-automatic switchover is possible

At the end of the day, an automatic or semi-automatic switchover is possible, but it must be done very carefully, with knowledge of the environment and an understanding of the customer's precise situation, and customized for:

  • Different environments
  • Different protocols
  • Different applications.

MetroCluster, on the other hand, can automatically switch over between sites in case of a site failure, but it operates only on the active file system and solves only the data availability problem, not data corruption. If your data has been (logically) corrupted, say by a virus, a MetroCluster switchover will not help, but Snapshots & SnapMirror will. Unlike SnapMirror, MetroCluster has strict, deterministic environmental requirements, only two sites between which the system can switch, and it works only with the active file system (no snapshots). In such a deterministic environment, it is possible to determine the surviving site and switch to it automatically with a tiebreaker. A tiebreaker is a piece of software with built-in logic that makes the site-switchover decision.


SVM DR does not replicate some of the SVM's configuration to the DR site, so you must configure it manually or prepare a script that does it for you in case of a disaster.

Do not mix up Metro-HA (MetroCluster) & DR; they are two separate and not mutually exclusive data protection technologies. You can have both MetroCluster & DR, and big companies usually run both MetroCluster & SnapMirror, because they have the budgets, business requirements & approvals for that. The same logic applies not only to NetApp systems but to all storage vendors.

The solution

In this particular case, a customer with a FAS2000/A200 & async SnapMirror can have only DR, so after a disaster event on the primary site, storage must be mounted to hosts on the DR site manually, though it is possible to set up and configure your own script, with logic suitable for your environment, that switches between sites automatically or semi-automatically. For this purpose, tools like NetApp Workflow Automation and the Backup/Restore ONTAP SMB shares PowerShell script can help do the job. You might also be interested in a VMware SRM configuration with NetApp's SRA (Storage Replication Adapter), which gives you a relatively easy way to switch between sites.

The second part of this article is "Which kind of Data Protection is SnapMirror? (Part 2)".

A very quick article about a customer who has NetApp storage systems

As the title says, this will be a very quick article about a customer who has NetApp FAS systems.


In 2014 this customer bought their first two FAS3220 systems, with NSE encryption, running 7-Mode ONTAP at the time.

Then in 2015 they bought one AFF8040 and one FAS8040 (HDD-based, made "hybrid" by adding a few SSDs taken from the AFF8040), both also with NSE encryption and clustered ONTAP (cDOT), formed into a single cluster. The All-Flash system shows no noticeable performance impact with inline efficiencies enabled.

Then they migrated all of their VMware infrastructure to the new storage systems, upgraded the old systems to cDOT, joined the upgraded systems to the cluster with the FAS8040 & AFF8040, and non-disruptively moved some of the slow workloads back to the 3220, this time with NetApp's LUN move & volume move, far faster than it would have been with VMware Storage vMotion.

And then in 2017 they bought an AFF A700 without encryption. All the systems work happily, monitored and managed as a single cluster, with data migrating non-disruptively across all the nodes over its life cycle, while they get at least 2:1 data reduction on the AFF systems (cross-volume deduplication is not enabled yet) and 1.5:1 on the hybrid & HDD-only systems.

Now, in 2018, four years after they got their first FAS system, they are planning to retire the old FAS3220 controllers, buy new low-end FAS2700 controllers (which will probably be as fast as or faster than the 3220), and connect the old disk shelves to them using a simple MiniSAS HD to QSFP adapter cable. Then, as always, they will join all FAS & AFF systems into a single cluster again, be able to upgrade to ONTAP 9.3 and beyond, and be able to utilize new ONTAP functionality like FabricPool or inline aggregate-level deduplication.

And with the A700, after a simple ONTAP upgrade, they will be able to use FC-NVMe in the existing cluster whenever they are ready.

I have three rhetorical questions as a takeaway:

  1. Which storage vendor would allow you to keep in a single cluster: nodes from different models (3220, 8040, A700 and 2700); Low-End, Mid-Range & High-End systems; different types of systems (All-Flash, HDD-only & hybrid); four different generations (from the 3220 to the 2700); and some systems with encryption, some without?
  2. Which storage vendor would allow you to upgrade the same hardware through a huge, breakthrough major software release (the move from 7-Mode to cDOT); to reconnect your old disk shelves between Low-End, Mid-Range & High-End systems; and to reconnect old disk shelves across different generations & models?
  3. Can you name one?