Zoning for cluster storage in pictures

NetApp AFF & FAS storage systems can be combined into a cluster of up to 12 nodes (6 HA pairs) for SAN protocols. Let’s take a look at zoning and connections using an example with 4 nodes (2 HA pairs), shown in the image below.

For simplicity, we will discuss the connection of a single host to the storage cluster. In this example, every storage node is connected to the host, and each node is connected with two links for redundancy.

It is easy to notice that only two paths will be Preferred in this example (solid yellow arrows).

Since NetApp FAS/AFF systems implement a “share nothing” architecture, disk drives are assigned to a node, disks on a node are grouped into RAID groups, and one or a few RAID groups are combined into a plex. Usually one plex forms an aggregate (in some cases two plexes form an aggregate; in that case both plexes must have identical RAID configuration, think of it as an analogy to RAID 1). On aggregates you create FlexVol volumes. Each FlexVol volume is a separate WAFL file system and can serve NAS files (SMB/NFS), SAN LUNs (iSCSI, FC, FCoE) or namespaces (NVMeoF). A FlexVol can have multiple qtrees, and each qtree can store LUNs or files. Read more in the series of articles How ONTAP Memory works.

Each drive belongs to and is served by a node. A RAID group belongs to and is served by a node. All objects built on top of them also belong to and are served by a single node, including aggregates, FlexVol volumes, qtrees, LUNs & namespaces.

At any given time a disk can belong to only one node, and in case of a node failure the HA partner takes over its disks, aggregates, and all the objects on top of them. Note that a “disk” in ONTAP can be an entire physical disk as well as a partition on a disk. Read more about disks and ADP (disk partitioning) here.

Though a LUN or namespace belongs to a single node, it is possible to access it through the HA partner or even through other nodes. The most optimal paths always go through the node which owns the LUN or namespace. If the owning node has more than one port, all ports on that node are considered optimal (preferred) paths. Normally it is a good idea to have more than one optimal path to a LUN.

ALUA & ANA

ALUA (Asymmetric Logical Unit Access) is a protocol which helps hosts access LUNs through optimal paths; it also allows paths to a LUN to change automatically if the LUN moves to another controller. ALUA is used with both the FCP and iSCSI protocols. Similarly to ALUA, ANA (Asymmetric Namespace Access) is a protocol for NVMe over Fabrics transports such as FC-NVMe.

A host can use one or several paths to a LUN, depending on the host multipathing configuration and the portset configuration on the ONTAP cluster.

Since a LUN belongs to a single storage node and ONTAP provides online migration between nodes, your network configuration must provide access to the LUN from all the nodes, just in case. Read more in the series of articles How ONTAP cluster works.

According to NetApp best practices, zoning is quite simple:

  • Create one zone for each initiator (host) port on each fabric
  • Each zone must have one initiator port and all the target (storage node) ports.

Keeping one initiator per zone eliminates “cross talk” between initiators.

Example for Fabric A, Zone for “Host_1-A-Port_A”:

  Node          Port    WWPN
  Host 1        Port A  Port A
  ONTAP Node1   Port A  LIF1-A (NAA-2)
  ONTAP Node2   Port A  LIF2-A (NAA-2)
  ONTAP Node3   Port A  LIF3-A (NAA-2)
  ONTAP Node4   Port A  LIF4-A (NAA-2)

Example for Fabric B, Zone for “Host_1-B-Port_B”:

  Node          Port    WWPN
  Host 1        Port B  Port B
  ONTAP Node1   Port B  LIF1-B (NAA-2)
  ONTAP Node2   Port B  LIF2-B (NAA-2)
  ONTAP Node3   Port B  LIF3-B (NAA-2)
  ONTAP Node4   Port B  LIF4-B (NAA-2)
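
For illustration only, a single-initiator zone like “Host_1-A-Port_A” from the Fabric A table might be defined on a Cisco MDS switch roughly as sketched below; the VSAN number and all WWPNs are hypothetical placeholders, and Brocade syntax would differ:

    zone name Host_1-A-Port_A vsan 10
      ! initiator: Host 1, HBA Port A
      member pwwn 21:00:00:24:ff:11:22:33
      ! targets: LIF1-A through LIF4-A on ONTAP Node1-Node4
      member pwwn 20:01:00:a0:98:11:22:33
      member pwwn 20:02:00:a0:98:11:22:33
      member pwwn 20:03:00:a0:98:11:22:33
      member pwwn 20:04:00:a0:98:11:22:33
    zoneset name Fabric_A vsan 10
      member Host_1-A-Port_A
    zoneset activate name Fabric_A vsan 10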

Here is how the zoning from the tables above looks:

Vserver or SVM

An SVM in an ONTAP cluster lives on all the nodes in the cluster. SVMs are separated from one another and are used for creating a multi-tenant environment. Each SVM can be managed by a separate group of people or a separate company, and one will not interfere with another. In fact, they will not know about each other’s existence at all; each SVM looks like a separate physical storage system. Read more about SVMs, Multi-Tenancy and Non-Disruptive Operations here.

Logical Interface (LIF)

Each SVM has its own WWNN in the case of FCP, its own IQN in the case of iSCSI, and its own identifiers in the case of NVMeoF. SVMs can share a physical storage node port: each SVM assigns its own network addresses (WWPN or IP) to a physical port, and normally each SVM assigns one network address to one physical port. Therefore a single physical storage node port might carry several WWPN network addresses, each assigned to a different SVM, if several SVMs exist. NPIV is crucial functionality which must be enabled on the FC switches for an ONTAP cluster to serve the FC protocol properly.

Unlike ordinary virtual machines (e.g. on ESXi or KVM), each SVM “exists” on all the nodes in the cluster, not just on a single node; the picture below shows two SVMs on a single node just for simplicity.

Make sure that each node has at least one LIF; in that case host multipathing will be able to find an optimal path and always access a LUN through the optimal route, even if the LUN migrates to another node. Each port has its own fixed “physical address”, which you cannot change, plus network addresses. Here is an example of how network & physical addresses look in the case of the iSCSI protocol. Read more about SAN LIFs here and about SAN protocols like FC, iSCSI, NVMeoF here.
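
As a sketch (the SVM, LIF and port names here are hypothetical), you can verify that every node hosts at least one SAN LIF and list the LIF WWPNs that should be used for zoning with commands like these:

    ::> network interface show -vserver svm1 -data-protocol fcp -fields home-node,home-port,wwpn
    ::> network interface create -vserver svm1 -lif svm1_fc_n3a -role data -data-protocol fcp -home-node node3 -home-port 0a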

Zoning recommendations

For ONTAP 9, 8 & 7, NetApp recommends having a single initiator and multiple targets per zone.

For example, in the case of FCP, each physical port has its own physical WWPN (WWPN 3 in the image above), which should not be used at all; instead, the WWPN addresses assigned to LIFs (WWPN 1 & 2 in the image above) must be used for zoning and host connections. Physical addresses look like 50:0A:09:8X:XX:XX:XX:XX; this type of address follows NAA-3 (IEEE Network Address Authority 3), is assigned to a physical port, and should not be used at all. Example: 50:0A:09:82:86:57:D5:58. You can see NAA-3 addresses listed on network switches, but they should not be used.

When you create zones on a fabric, you should use addresses like 2X:XX:00:A0:98:XX:XX:XX; this type of address follows NAA-2 (IEEE Network Address Authority 2) and is assigned to your LIFs. Thanks to NPIV technology, a physical N_Port can register these additional WWPNs, which means NPIV must be enabled on your switch for ONTAP to serve data over the FCP protocol to your servers. Example: 20:00:00:A0:98:03:A4:6E

  • Block 00:A0:98 is the original OUI block for ONTAP
  • Block D0:39:EA is the newly added OUI block for ONTAP
  • Block 00:A0:B8 is used on NetApp E-Series hardware
  • Block 00:80:E5 is reserved for future use.

Read more

Disclaimer

Please note that in this article I have described my own understanding of the internal organization of ONTAP systems. Therefore, this information might be either outdated, or I simply might be wrong in some aspects and details. I will greatly appreciate any contribution to make this article better; please leave your ideas and suggestions about this topic in the comments below.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only.

Why use NetApp snapshots even when you do not have Premium bundle software?

If you are extremely lazy and do not want to read any further, the short answer is: use snapshots to improve RPO, use ndmpcopy to restore files and LUNs, and use SnapCreator for app-consistent snapshots.

The Premium bundle includes a good deal of software on top of the Base software in each ONTAP system, such as:

  • SnapCenter
  • SnapRestore
  • FlexClone
  • And others.

So, without the Premium bundle and with only the Basic software, we have two issues:

  • You can create snapshots, but without SnapRestore or FlexClone you cannot restore from them quickly
  • And without SnapCenter you cannot make application-consistent snapshots.

And some people ask, “Should I use NetApp snapshots in such circumstances?”

And my answer is: Yes, you can, and you should use ONTAP snapshots.

Here is the explanation of why and how:

Snapshots without SnapRestore

Why use NetApp storage hardware snapshots? Because they have no performance penalty and no such thing as snapshot consolidation, which causes a performance impact on other platforms. NetApp snapshots work pretty well and have other advantages too. Even though restoring data captured in snapshots is not as fast as with SnapRestore or FlexClone, you can create snapshots very fast. And most of the time you need to restore something very seldom, so fast creation of snapshots with slow restoration still gives you a better RPO compared to a full backup. Of course, I have to admit that this improves RPO only for cases when your data is logically corrupted and no physical damage was done to the storage, because if your storage is physically damaged, snapshots will not help. With ONTAP you can have up to 1023 snapshots per volume, and you can create them as fast as you need with no performance degradation whatsoever, which is pretty awesome.
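
A minimal sketch of manual and scheduled snapshot creation with only the Basic software (the SVM, volume and policy names are hypothetical):

    ::> volume snapshot create -vserver svm1 -volume vol1 -snapshot before_app_upgrade
    ::> volume snapshot show -vserver svm1 -volume vol1
    ::> volume snapshot policy create -vserver svm1 -policy hourly_daily -enabled true -schedule1 hourly -count1 24 -schedule2 daily -count2 14
    ::> volume modify -vserver svm1 -volume vol1 -snapshot-policy hourly_daily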

Snapshots with NAS 

If we are speaking about a NAS environment without the SnapRestore license, you can always go to the .snapshot folder and copy any previous version of a file you need to restore. Also, you can use the ndmpcopy command to restore a file, a folder or even a whole volume inside the storage without involving a host.
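
For example, a single file could be restored on the storage side from the node shell with ndmpcopy, roughly like this (the node, volume, snapshot and file names are hypothetical):

    ::> system node run -node node1 ndmpcopy /vol/vol1/.snapshot/hourly.0/docs/report.docx /vol/vol1/docs/report.docx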

Snapshots with SAN 

If we are speaking about a SAN environment without the SnapRestore license, you cannot simply copy a single file back onto your LUN. There are two stages if you need to restore something on a LUN:

  1. You copy the entire LUN from a snapshot
  2. And then you can either:
    • Restore the entire LUN in place of the last active version of your LUN
    • Or copy data from the copied LUN to the active LUN.

To do that, you can use either the ndmpcopy or lun copy commands to perform the first stage. And if you want to restore only some files from the old version of the LUN captured in a snapshot, you need to map the LUN copy to a host and copy the required data back to the active LUN.
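
A sketch of the two-stage LUN restore described above, using lun copy from a snapshot and then mapping the copy to a host (all paths, snapshot, igroup and volume names are hypothetical):

    ::> lun copy start -vserver svm1 -source-path /vol/vol1/.snapshot/hourly.0/lun1 -destination-path /vol/vol1/lun1_restored
    ::> lun copy show
    ::> lun mapping create -vserver svm1 -path /vol/vol1/lun1_restored -igroup host1_igroup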

Application consistent storage snapshots 

Why do you need application consistency in the first place? Sometimes, in an environment like a NAS file share with documents, you do not need it at all. But if you are running applications like Oracle DB, MS SQL or VMware, you’d better have application consistency.

Imagine you have a Windows machine and you pull out the hard drive while Windows is running. Let’s forget for a moment that Windows will stop working, that is not the point here, and focus on the data protection side. The same thing happens when you create a storage snapshot: the data captured in that snapshot will be similarly incomplete. Will the pulled-out hard drive be a proper copy of your data? Only sort of, because some data will be lost from host memory and your file system will probably not be consistent. Even though you will likely be able to recover a journaled file system, your application data may be damaged in ways that are hard to repair, precisely because of the data lost from host memory. If you try to restore from such a copy, Windows might not start, or it might start after a file system check, but your applications, especially databases, definitely will not like such a backup. So you need something to prepare your OS & applications before the snapshot is created.

As you may know, application-consistent storage hardware snapshots can be created by backup software like Veeam, Commvault and many others, or you can trigger storage snapshot creation yourself with a relatively simple Ansible or PowerShell script. You can also create application-consistent snapshots with the free NetApp SnapCreator framework. Unlike SnapCenter, it does not have simple and straightforward GUI wizards that walk you through integrating with your application; most of the time you have to write a small script for your application to benefit from online, application-consistent snapshots. Another downside is that SnapCreator is not officially supported software. But at the end of the day, it is a relatively easy setup, and it will definitely pay off once you finish setting it up.
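
As a rough sketch of the do-it-yourself approach (this is not NetApp’s official tooling): quiesce the application, take the snapshot over SSH to the cluster management LIF, then resume. The host names, user names and the quiesce/unquiesce scripts below are placeholders you would replace with your application’s own freeze/thaw procedure:

    #!/bin/sh
    # hypothetical application-consistent snapshot wrapper
    ssh dbadmin@dbhost "app_quiesce.sh"       # put the application into backup mode
    ssh admin@cluster1 "volume snapshot create -vserver svm1 -volume db_vol -snapshot app_consistent.$(date +%Y%m%d%H%M)"
    ssh dbadmin@dbhost "app_unquiesce.sh"     # resume normal application operation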

List of other software features available in Basic software

This Basic ONTAP functionality might also be useful:

  • Horizontal scaling, non-disruptive operations such as online volume & LUN migration, and non-disruptive upgrades by adding new nodes to the cluster
  • API automation
  • FPolicy file screening
  • Create snapshots to improve RPO
  • Storage efficiencies: Deduplication, Compression, Compaction
  • By default ONTAP deduplicates data across the active file system and all the snapshots on the volume. Savings from snapshot data sharing scale with the number of snapshots: the more snapshots you have, the more savings you get
  • Storage Multi-Tenancy
  • QoS Maximum
  • External key manager for Encryption
  • Host-based MAX Data software which works with ONTAP & SAN protocols
  • You can buy FlexArray license to virtualize 3rd party storage systems
  • If you have an All-Flash system, you can purchase an additional FabricPool license, which is especially useful with snapshots because it destages cold data to cheap storage like AWS S3, Google Cloud, Azure Blob, IBM Cloud, Alibaba Cloud or an on-premises StorageGRID system, etc.

Summary

Even the Basic software provides rich functionality on your ONTAP system, so you definitely should use NetApp snapshots and set up application integration to make your snapshots application-consistent. With hardware NetApp storage snapshots, you can have up to 1023 snapshots per volume and create them as fast as you need without sacrificing storage performance, so snapshots will improve your RPO. Application consistency with SnapCreator or any other 3rd-party backup software will give you confidence that the snapshots can be restored when needed.

NetApp in containerization era

This is not really a technical article like I usually write, but rather a short list of topics for one direction NetApp has been developing a lot recently: containers. Containerization is getting more and more popular nowadays, and I noticed NetApp is investing heavily in it, so I identified four main directions in that field. Let’s name a few NetApp products using containerization technology:

  1. E-Series with built-in containers
  2. Containerization of existing NetApp software:
    • ActiveIQ PAS
  3. Trident plugin for ONTAP, SF & E-Series (NAS & SAN):
    • NetApp Trident plug-in for Jenkins
    • CI & HCI:
      • ONTAP AI: NVIDIA, AFF systems, Docker containers & Trident with NFS
      • FlexPod DataCenter & FlexPod SF with Docker EE
      • FlexPod DataCenter with IBM Cloud Private (ICP) with Kubernetes orchestration
      • NetApp HCI with RedHat OpenShift Container Platform
      • NetApp integration with Robin Systems Container Orchestration
    • Oracle, PostgreSQL & MongoDB in containers with Trident
    • Integration with Moby Project
    • Integration with Mesosphere Project
  4. Cloud-native services & Software:
    • NetApp Kubernetes Services (NKS)
      • NKS in public clouds: AWS, Azure, GCP
      • NKS on NetApp HCI platform
    • SaaS Backup for Service Providers

Documents, Howtos, Architectures, News & Best Practices:

Is it a full list of NetApp’s efforts towards containerization?

I bet it is far from complete. Post your thoughts and links to documents, news, howtos, architectures and best-practice guides in the comments below to expand this list if I missed something!

New NetApp platform for ONTAP 9.6 (Part 3) AFF C190

NetApp introduced the C190 for small business, following the new A320 platform with ONTAP 9.6.

C190

This new All-Flash system has:

  • Fixed format, with no ability to connect additional disk shelves:
    • Only 960 GB SSD drives, installed only in the controller chassis
    • Only 4 configs: with 8, 12, 18 or 24 drives
      • Effective capacity respectively: 13, 24, 40 or 55 TB
    • Supports ONTAP 9.6 GA and higher
    • The C190 is built in the same chassis as the A220, so per HA pair you’ll get:
      • 4x 10Gbps SFP cluster ports
      • 8x UTA ports (10 Gbps or FC 16Gbps)
      • There is a model with 10GBASE-T ports instead of UTA & cluster interconnect ports (12 ports total). Obviously BASE-T ports do not support FCP protocol
  • There will be no more “usable capacity” sizing from NetApp; only “effective capacity” will be provided:
    • With dedup, compression, compaction and 24 x 960 GB drives the system provides ~50 TiB effective capacity. 50 TiB is a fairly conservative number because it corresponds to less than ~3:1 data reduction
    • The deduplication of data shared with snapshots, introduced in previous ONTAP versions, allows gaining even better efficiency
    • And of course FabricPool tiering can help to save much space
  • C190 comes with Flash bundle which adds to Basic software:
    • SnapMirror/SnapVault for replication
    • SnapRestore for fast restoration from snapshots
    • SnapCenter for App integration with storage snapshots
    • FlexClone for thin cloning.

A fixed configuration with built-in drives is, I personally think, an excellent idea in general, taking into account the wide variety of SSD capacities available nowadays, with even more to come. Is this the future format for all flash storage systems? For now, the C190 supports only 960 GB SSD drives, while the new mid-range A320 system can have more than one disk shelf.

The fixed configuration allows NetApp to manufacture & deliver the systems to clients faster and to reduce costs. The C190 will cost sub-$25k in its minimum configuration, according to NetApp.

Also, in my opinion, the C190 can more or less cover the market space left after the end-of-sale (EOS) announcement for the hardware and virtual AltaVault appliances (AVA, recently known under the new name “Cloud Backup”), thanks to FabricPool tiering. Cloud Backup appliances are still available through the AWS & Azure marketplaces. This is especially the case now that FabricPool in ONTAP 9.6 no longer has a hard-coded ratio for how much data the system can store in the cloud compared to the hot tier and allows write-through with the “All” policy.

It turns out that information about storage capacity is “consumed” more comfortably in the form of effective capacity. All this usable capacity, garbage collection and other storage overheads, RAID and system reserves are too complicated, so hey, why not? I bet the idea of showing only effective capacity was influenced by vendors like Pure, which have very effective marketing for sure.

Cons

  • MetroCluster over IP is not supported on the C190, while the entry-level A220 & FAS2750 systems do support MCC-IP with ONTAP 9.6
  • The C190 requires ONTAP 9.6, and ONTAP 9.6 does not support 7MTT.

Read more

Disclaimer

All product names, logos, and brands are the property of their respective owners. All company, product, and service names used in this website are for identification purposes only. No one is sponsoring this article.

How does the ONTAP cluster work? (Part 4)

This article is part of the series How does the ONTAP cluster work? The previous series of articles, How ONTAP Memory works, will also be a good addition to this one.

Data protocols

ONTAP is considered a unified storage system, meaning it supports both block (FC, FCoE, NVMeoF and iSCSI) & file (NFS, pNFS, CIFS/SMB) protocols for its clients. SDS versions of ONTAP (ONTAP Select & Cloud Volumes ONTAP) do not support the FC, FCoE or FC-NVMe protocols because of their software-defined nature.

Physical ports, VLANs and ifgroups are all considered “ports” in ONTAP. Each port can host multiple LIFs. If a port has at least one VLAN, then LIFs can be created only on the VLANs, no longer on the port itself. If a port is part of an ifgroup, LIFs can be created only on top of the ifgroup, not on the port itself. If a port is part of an ifgroup on top of which one or a few VLANs are created, LIFs can be created only on top of those VLANs, not on the ifgroup or the physical port.
It is a very common configuration to have two ports as part of an ifgroup with a few VLANs created on top for different protocols. Here are two very popular examples; in these examples the storage system is configured with:

  • SMB for PC users (MTU 1500)
    • Popular example: User home directories
  • SMB for Windows Servers (MTU 9000)
    • Use case: MS SQL & Hyper-V
  • NFS for VMware & Linux Servers (MTU 9000)
    • Use-case: VMware NFS Datastore
  • iSCSI for VMware, Linux & Windows Servers (MTU 9000)
    • Block device for OS boot and some other configs, like Oracle ASM

Example 1 is for customers who want to use all of the Ethernet-based protocols but have only 2 ports per node. Example 2 is preferable, with dedicated Ethernet ports for iSCSI traffic, if a storage node has a sufficient number of ports.

Notice that Example 1 has two iSCSI VLANs, A & B. They are not strictly necessary, but I would use two in order to increase the number of connections over the ifgrp and improve load balancing; it is up to the storage & network admins. Normally iSCSI-A & iSCSI-B would each use a separate IP subnet. See the ifgroup section for the network load-balancing explanation.
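
A sketch of how an Example 1-style layout might be built on one node is shown below; the port names, VLAN IDs and IP addressing are hypothetical, the same configuration would be repeated on every node, and in ONTAP 9 the MTU is set on the broadcast domain rather than on individual VLAN ports:

    ::> network port ifgrp create -node node1 -ifgrp a0a -distr-func ip -mode multimode_lacp
    ::> network port ifgrp add-port -node node1 -ifgrp a0a -port e0c
    ::> network port ifgrp add-port -node node1 -ifgrp a0a -port e0d
    ::> network port vlan create -node node1 -vlan-name a0a-10
    ::> network port vlan create -node node1 -vlan-name a0a-20
    ::> network port vlan create -node node1 -vlan-name a0a-30
    ::> network interface create -vserver svm1 -lif nfs_n1 -role data -data-protocol nfs -home-node node1 -home-port a0a-20 -address 192.168.20.11 -netmask 255.255.255.0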

VLANs

VLANs in ONTAP allow separating IP networks from one another and are often used with the NFS, CIFS and iSCSI protocols, though multiple IP addresses are allowed on a single port or VLAN. A VLAN can be added on a physical Ethernet port or on an ifgroup.

ifgroup

An interface group (ifgroup) is a collection of a few ports (typically of the same speed). Ports in a single ifgroup must be located on a single node. An ifgroup provides network redundancy for Ethernet. Each ifgroup is perceived and used as a physical port. One of the most notable & most used types of ifgroup is Dynamic multimode, which enables the LACP protocol on the ports, so if one port in the group dies, another will be used fully transparently to upper-level protocols like NFS, SMB and iSCSI. The most notable feature of LACP is the ability to distribute data across links in an attempt to load all the links in the ifgroup equally.

Physical port names start with the letter “e”, followed by the number of the PCIe slot (0 means on-board ports), and then a letter starting with “a” which represents the index of the port in that slot. Ifgroup (virtual aggregated) port names start with “a” (it can be any letter), then a number, and then another letter, for example a1a, to keep the same naming format as physical ports, even though the number and the ending letter no longer represent anything in particular (i.e. PCIe slot or port position); they are used only as an index to distinguish the ifgroup from other ports.

Unfortunately, LACP load distribution is far from perfect: the more hosts communicate with an ifgroup, the higher the probability of distributing traffic equally across the network ports. LACP uses a single static formula based on the source and destination information of a network packet; there is no intelligent analysis and decision-making as, for example, in SAN protocols, and there is no feedback from the lower Ethernet level to upper-level protocols. Also, LACP is often used in conjunction with Multi-Chassis EtherChannel functionality (vPC is another commercial name) to distribute links across a few switches and provide switch redundancy, which requires some additional effort in switch configuration and from the switches themselves. SAN protocols, on the other hand, do not need the switches to provide this type of redundancy because it is done at the upper protocol level. This is the primary reason why the SMB and NFS protocols developed their own extensions to provide similar functionality: to distribute load across links more intelligently and equally and to be aware of network path status.

One day these protocols will fully remove the need for Ethernet LACP & Multi-Chassis EtherChannel: pNFS, NFS Session Trunking (NFS multipathing), SMB Multichannel and SMB Continuous Availability. Until then, we are going to use ifgroups in configurations which do not support those protocols.

NFS

NFS was the first protocol available in ONTAP. The latest versions of ONTAP 9 support NFSv2, NFSv3, NFSv4 (4.0 and 4.1) and pNFS. Starting with 9.5, ONTAP supports 4-byte UTF-8 sequences in names for files and directories.
A network switch is technically not required for NFS traffic and direct host connection is possible, but a network switch is used in nearly all configurations to provide an additional level of network redundancy and to make it easier to add new hosts when needed.
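
A minimal sketch of enabling NFS with pNFS on an SVM (names are hypothetical):

    ::> vserver nfs create -vserver svm1 -v3 enabled -v4.1 enabled -v4.1-pnfs enabled
    ::> vserver nfs show -vserver svm1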

SMB

ONTAP supports SMB 2.0 and higher, up to SMB 3.1. Starting with ONTAP 9.4, SMB Multichannel, which provides functionality similar to multipathing in SAN protocols, is supported. Starting with ONTAP 8.2, the SMB protocol supports Continuous Availability (CA) with SMB 3.0 for Microsoft Hyper-V and SQL Server. SMB is a session-based protocol and by default does not tolerate session breaks, so SMB CA helps to tolerate unexpected session loss, for example when a network port goes down. ONTAP supports SMB encryption, which is also known as sealing. Accelerated AES instructions (Intel AES NI) encryption is supported with SMB 3.0 and later. Starting with ONTAP 9.6, FlexGroup volumes support SMB CA and thus support MS SQL & Hyper-V on FlexGroup.
A network switch is technically not required for SMB traffic and direct host connection is possible, but a network switch is used in nearly all configurations to provide an additional level of network redundancy and to make it easier to add new hosts when needed.
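
A sketch of creating an SMB server and a continuously available share for Hyper-V or SQL Server workloads (the domain, share and volume names are hypothetical):

    ::> vserver cifs create -vserver svm1 -cifs-server SMB01 -domain example.local
    ::> vserver cifs share create -vserver svm1 -share-name hyperv_vms -path /hyperv_vol -share-properties oplocks,browsable,changenotify,continuously-available
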
Here is the hierarchy visualization & logical representation of the NAS protocols in an ONTAP cluster and the corresponding commands (highlighted in grey):

Each NAS LIF can accept both NFS & SMB traffic, but usually engineers tend to separate them onto different LIFs.

FCP

ONTAP on physical appliances supports the SCSI-based FCoE as well as FC protocols, depending on HBA port speed. Both FC & FCoE are known under the single umbrella name FCP. An igroup is a collection of WWPN addresses of initiator hosts which are allowed to access storage LUNs; a WWPN address identifies an interface on an FC or FCoE port. Typically you need to add all the initiator host ports to the igroup for multipathing to work properly. ONTAP uses N_Port ID Virtualization (NPIV) for FC data and therefore requires FC network switches (typically at least two) which also support NPIV; direct FC connections from host to storage are not supported.
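
A sketch of a typical FCP provisioning flow on the ONTAP side (the SVM, volume, LUN and igroup names and the WWPNs are hypothetical):

    ::> vserver fcp create -vserver svm1
    ::> lun create -vserver svm1 -path /vol/vol1/lun1 -size 500g -ostype vmware
    ::> lun igroup create -vserver svm1 -igroup esx_hosts -protocol fcp -ostype vmware -initiator 21:00:00:24:ff:11:22:33,21:00:00:24:ff:11:22:34
    ::> lun mapping create -vserver svm1 -path /vol/vol1/lun1 -igroup esx_hosts
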
Here is the hierarchy visualization & logical representation of the FCP protocols in an ONTAP cluster and the corresponding commands (highlighted in grey):

Read more about ONTAP Zoning here.

iSCSI

iSCSI is another SCSI-based protocol, encapsulated in IP/Ethernet transport. NetApp also supports the Data Center Bridging (DCB) protocol on some models, depending on the Ethernet port chips. A network switch is technically not required for iSCSI traffic and direct host connection is possible if the number of storage ports allows it, though a network switch is recommended to make it easier to add new hosts when needed.
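
The iSCSI equivalent is sketched below (addresses and names are hypothetical); the igroup holds IQNs instead of WWPNs:

    ::> vserver iscsi create -vserver svm1
    ::> network interface create -vserver svm1 -lif iscsi_n1_a -role data -data-protocol iscsi -home-node node1 -home-port a0a-30 -address 192.168.30.11 -netmask 255.255.255.0
    ::> lun igroup create -vserver svm1 -igroup linux_hosts -protocol iscsi -ostype linux -initiator iqn.1994-05.com.redhat:host1
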
Here is the hierarchy visualization & logical representation of the iSCSI protocol in an ONTAP cluster and the corresponding commands (highlighted in grey):

NVMeoF

NVMe over Fabrics (NVMeoF) refers to the ability to use the NVMe protocol over existing network infrastructure like Ethernet (converged or traditional), TCP, Fibre Channel or InfiniBand as a transport (as opposed to running NVMe over PCIe without additional encapsulation). NVMe is a SAN block data protocol; plain NVMe, in contrast, does not use an extra transport layer and devices are connected directly to the PCIe bus.

Starting with ONTAP 9.5, the NVMe ANA protocol is supported, which provides multipathing functionality to NVMe similarly to ALUA. ANA for NVMe is currently supported only with SUSE Enterprise Linux 15, VMware 6.7 and Windows Server 2012/2016. FC-NVMe without ANA is supported with SUSE Enterprise Linux 12 SP3 and RedHat Enterprise Linux 7.6.

FC-NVMe

NVMeoF is supported only on All-Flash FAS systems, and not on the entry-level A200 and A220 systems due to the lack of 32 Gb FC ports. A subsystem in NVMe is used for the same purpose as an igroup: it lists the initiator host NQN addresses which are allowed to access a namespace. A namespace in this context is very similar to a LUN in FCP or iSCSI. Do not mix up the namespace term in NVMe with the single namespace of an ONTAP cluster.
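
A sketch of FC-NVMe provisioning, assuming ONTAP 9.4 or later (the SVM, volume, subsystem names and the host NQN are hypothetical):

    ::> vserver nvme create -vserver svm1
    ::> vserver nvme namespace create -vserver svm1 -path /vol/nvme_vol/ns1 -size 200g -ostype linux
    ::> vserver nvme subsystem create -vserver svm1 -subsystem linux_sub1 -ostype linux
    ::> vserver nvme subsystem host add -vserver svm1 -subsystem linux_sub1 -host-nqn nqn.2014-08.org.nvmexpress:uuid:host1
    ::> vserver nvme subsystem map add -vserver svm1 -subsystem linux_sub1 -path /vol/nvme_vol/ns1
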
Here is the hierarchy visualization & logical representation of the FC-NVMe protocol in an ONTAP cluster and the corresponding commands (highlighted in grey):

Spirit of this article

This article explains the principles, architecture, NetApp’s unique approaches, and maybe even the spirit of ONTAP clusterization. Each configuration, model and appliance has its nuances, which are left out of the scope of this article. I’ve tried to give a general direction of the ideas behind NetApp’s innovative technologies, while (trying) not to put too many details into the equation, to keep it simple but not lose important system architecture details. This is far from a complete story about ONTAP; for example, to make complex information easier to consume, I didn’t mention the 7-Mode Transition Tool (7MTT) required for the transition to Clustered ONTAP, nor did I go into a detailed explanation of WAFL. Therefore some things might be a bit different in your case.

Summary

Clustered ONTAP was first introduced around 2010 in version 8. After almost 10 years of hardening, (clustered) ONTAP has become a mature, highly scalable and flexible solution, with not just unique functionality unprecedented on the market in a single product but also impressive performance, thanks to its WAFL versatility & clusterization capabilities.

Continue to read

How ONTAP Memory work

Zoning for cluster storage in pictures

Disclaimer

Please note that in this article I have described my own understanding of the internal organization of ONTAP systems. Therefore, this information might be either outdated, or I simply might be wrong in some aspects and details. I will greatly appreciate any contribution to make this article better; please leave your ideas and suggestions about this topic in the comments below.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only.

How does the ONTAP cluster work? (part 3)

This article is part of the series How does the ONTAP cluster work? The previous series of articles, How ONTAP Memory works, will also be a good addition to this one.

Horizontal Scaling Clusterization

Horizontal-scaling ONTAP clusterization came from the Spinnaker acquisition and is often referred to by NetApp as “Single Namespace,” “Horizontal Scaling Cluster,” “ONTAP Storage System Cluster,” “Scale-out Cluster,” or just “ONTAP Cluster.” This type of clusterization is often confused with an HA pair or even with MetroCluster functionality, so to distinguish it from the others in ONTAP I will call it the third type of clusterization. While MetroCluster and HA are data availability and even data protection technologies, single-namespace clusterization provides neither data protection nor data availability: if there is a hardware failure, the third type of clusterization is not involved in mitigating the problem. ONTAP forms the (third type of) cluster out of one or a few HA pairs (multiple single nodes are not supported in a single cluster) and adds Non-Disruptive Operations (NDO) functionality to the ONTAP system, such as non-disruptive online data migration across nodes in the cluster, non-disruptive hardware upgrades and online IP address migration. Data migration for NDO operations in an ONTAP cluster requires dedicated Ethernet ports; they are called cluster interconnect interfaces, and the HA interconnect interfaces are not used for this purpose. Cluster interconnect and HA interconnect interfaces could not share the same ports until the A320 system.

The cluster interconnect in a single HA pair can use directly connected cluster interconnect ports (switch-less config), while systems with 4 or more nodes require two dedicated Ethernet cluster interconnect switches. An ONTAP cluster can consist only of an even number of nodes (they must be configured as HA pairs), except for a single-node cluster. A single-node cluster ONTAP system is also called non-HA (or stand-alone); in such a configuration other cluster nodes cannot be added to the single-node cluster, but a single-node cluster can be converted to an HA system, and then other HA pairs can be added. An ONTAP cluster is managed as a single pane of glass with built-in management through a web-based GUI, CLI (SSH and PowerShell) and API. The ONTAP cluster provides a Single Namespace for NDO operations through Storage Virtual Machines (SVMs). Single Namespace in an ONTAP system is the name for a collection of techniques used by the (third type of) cluster to provide a level of abstraction and separate the front-end network connectivity of data protocols like FC, FCoE, FC-NVMe, iSCSI, NFS and SMB from the data in volumes, and therefore provide data virtualization. This virtualization provides online data mobility across cluster nodes while clients connected over data protocols can still access their data.

The general idea behind the single namespace is to trick clients into thinking they are connected to a single device, while in reality they are connected to a cluster consisting of a bunch of nodes. This ”trick” works in different ways with different protocols. For example, with the FC protocol each node’s FC port gets a unique WWPN address while the cluster SVM has a single WWNN; in this way a client connected to the cluster considers it a single FC node with multiple active ports, some of them reported as optimized for traffic and some as non-optimized. It works in a very similar way with iSCSI and NVMeoF. With FC & iSCSI, the ALUA protocol is used to switch between ports & links in case one becomes unavailable. With pNFS the cluster operates in a way similar to SAN protocols: nodes in the cluster have interfaces with IP addresses, so clients perceive the SVM as a single node with multiple active interfaces. ONTAP reports as optimized the ports which have direct access to the node serving the volume with the data, and only if those ports are not available will clients use other active non-optimized ports. Protocols like NFS v3, NFS v4, SMB v2, and SMB v3 do not have capabilities like SAN ALUA or pNFS, but ONTAP has another trick up its sleeve: the Single Namespace provides several techniques for non-disruptive IP address migration for data protocols in case a node or port dies.

The SMB Continuous Availability (CA) extension to SMB v3 allows clients not to drop connections and provides transparent failover in case a port, link or node goes down; therefore MS SQL & Hyper-V servers on a file share can survive. Without CA support in the SMB protocol, clients will get session disruption when an IP address migrates to another port, so only user file shares are recommended in that case. Since NFS v3 is a stateless protocol, if an IP address migrates to another port or node, clients will not experience interruption. In the case of SAN protocols, interfaces do not migrate; rather, a new path is selected.

Indirect data path

In some cases, when data resides on one controller and is accessed through another controller, indirect data access occurs. For example, hosts access a network address through LIF 3A on node 3, while the volume with the data is located on data_aggregate01 on node 1. In this scenario node 3 will access node 1 through the cluster interconnect interfaces & switches, get the data from node 1, and return it to the hosts that requested it from the LIF 3A interface located on node 3 ports; in some cases the controller which owns the data, in this example node 1, can reply directly to the hosts. This functionality was introduced as part of the single namespace strategy (the third type of clusterization) to always provide hosts with access to data regardless of the location of the network addresses and volumes. Indirect data access occurs rarely in the cluster, in situations like a LIF migrated by the admin or the cluster to another port, a volume migrated to another node, or a node port going down. Though an indirect data path adds a small amount of latency to operations, in most cases it can be ignored. Some protocols like pNFS, SAN ALUA and NVMe ANA can automatically detect the situation and switch their primary path to ports on the node with direct access to the data. Again, this (third) type of clusterization is not a data protection mechanism, but rather online data migration functionality.

Heterogeneous cluster

A cluster (the third type of clusterization) can consist of different HA pairs: AFF and FAS, different models and generations, different performance and disks, and can include up to 24 nodes with NAS protocols or 12 nodes with SAN protocols. SDS systems can’t be intermixed with physical AFF or FAS appliances. The main purpose of the third type of clusterization is not data protection but rather non-disruptive operations like online volume migration or IP address migration between all the nodes in the cluster.

Storage Virtual Machine

Also known as a Vserver or SVM. A Storage Virtual Machine (SVM) is a layer of abstraction, and alongside other functions, it virtualizes and separates the physical front-end data network from the data located on FlexVol volumes. It is used for Non-Disruptive Operations and Multi-Tenancy. SVMs live on the nodes; on the image below they are pictured on the disk shelves around the volumes just to demonstrate that each volume belongs to only a single SVM.

Multi-Tenancy

ONTAP provides two techniques for multi-tenancy functionality: SVMs and IPspaces. On the one hand, SVMs are like KVM virtual machines: they provide a virtualization abstraction from the physical storage. On the other hand, they are quite different, because unlike ordinary virtual machines they do not allow running third-party binary code (as, for example, Pure storage systems do); they just provide a virtualized environment and storage resources. Also, SVMs, unlike ordinary virtual machines, do not run on a single node; an SVM runs as a single entity on the whole cluster (or at least that is how it looks to the system admin). SVMs divide the storage system into slices, so a few divisions or even organizations can share the storage system without knowing about or interfering with each other, while using the same ports, data aggregates and nodes in the cluster, and using separate FlexVol volumes and LUNs. Each SVM can run its own front-end data protocols, its own set of users, and its own network addresses and management IP. With IPspaces, tenants can even use the same IP addresses and networks on the same storage system without interference and network conflicts. Each ONTAP system must run at least one data SVM to function, but may run more. There are a few levels of ONTAP management: the cluster admin level has all the privileges. Each data SVM provides to its owner a vsadmin user which has nearly the full functionality of the cluster admin level but lacks management of the physical layer, such as RAID group configuration, aggregate configuration and physical network port configuration. A vsadmin can manage the logical objects inside its SVM: create, delete and configure LUNs, FlexVol volumes and network interfaces/addresses, so two SVMs in a cluster can’t interfere with each other. One SVM cannot create, delete, change or even see the objects of another SVM, so for SVM owners such an environment looks as if they are the only users of the entire storage cluster. Multi-Tenancy is free functionality in ONTAP. On the image below SVMs are pictured on top of the ONTAP cluster, but in reality SVMs are part of the ONTAP OS.
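
A sketch of carving out a tenant with its own IPspace and SVM and handing it over to a tenant administrator (all names are hypothetical):

    ::> network ipspace create -ipspace tenant_a
    ::> vserver create -vserver svm_tenant_a -ipspace tenant_a -rootvolume svm_tenant_a_root -aggregate aggr1 -rootvolume-security-style unix
    ::> security login password -vserver svm_tenant_a -username vsadmin
    ::> security login unlock -vserver svm_tenant_a -username vsadmin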

Non Disruptive Operations

There are a few Non-Disruptive Operations (NDO) and Non-Disruptive Upgrade (NDU) capabilities in a (clustered) ONTAP system. NDO data operations include: data aggregate relocation between the nodes of an HA pair, FlexVol volume online migration (known as the Volume Move operation) across aggregates and nodes within the cluster, and LUN migration (known as the LUN Move operation) between FlexVol volumes within the cluster. LUN Move and Volume Move operations use the cluster interconnect interfaces for data transfer (HA-CI is not used for such operations). SVMs behave differently during network NDO operations depending on the front-end data protocol. To bring latency back to its original level, FlexVol volumes and LUNs should be located on the same node as the network address through which the clients access the data, so a network address can be created (for SAN) or moved (for NAS protocols). NDO operations are free functionality.
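
A sketch of the two most common NDO data operations (volume, LUN, aggregate and path names are hypothetical):

    ::> volume move start -vserver svm1 -volume vol1 -destination-aggregate node2_aggr1
    ::> volume move show
    ::> lun move start -vserver svm1 -source-path /vol/vol1/lun1 -destination-path /vol/vol2/lun1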

NAS LIF

For NAS front-end data protocols there are NFSv2, NFSv3, NFSv4, CIFSv1, SMBv2 and SMBv3, which do not provide network redundancy within the protocol itself, so they rely on storage and switch functionality for this. ONTAP supports Ethernet port channels and LACP on its Ethernet network ports at the L2 layer (known in ONTAP as an interface group or ifgrp) within a single node. ONTAP also provides non-disruptive network failover between nodes in the cluster at the L3 layer by migrating Logical Interfaces (LIFs) and their associated IP addresses, similarly to VRRP, to a surviving node and back home when the failed node is restored.

Newer versions of NAS protocols do have built-in multipathing functionality: extensions to the NFS v4.1 protocol like pNFS (supported starting with ONTAP 9.0) and NFS Session Trunking (NFS multipathing, not yet supported with ONTAP 9.6), and the SMB v3 extension called SMB Multichannel (available starting with ONTAP 9.4), allow automatic switching between paths in case of a network link failure, while SMB Continuous Availability (CA) helps to preserve sessions without interruption. Unfortunately, all of these capabilities have limited support from clients. Until they become more popular, ONTAP will rely on its built-in NAS LIF migration capabilities to move interfaces with their assigned network addresses to a surviving node & port.

FailoverGroup

A failover group is functionality available only to NAS protocols and applied only to Ethernet ports/VLANs. SAN interfaces cannot (and do not need to) migrate online across cluster ports like NAS LIFs, and therefore do not have failover group functionality. A failover group is a prescription for a LIF describing where it should migrate if its home port goes down and whether it should return home automatically. By default a failover group is equal to a broadcast domain. It is good practice to specify manually between which ports a LIF can migrate, especially when a few VLANs are used, so the LIF will not migrate to another VLAN. A failover group can be assigned to multiple LIFs, but each LIF can be assigned to only a single failover group.
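
A sketch of pinning a NAS LIF to a custom failover group so that it only migrates between ports in the same VLAN (all names are hypothetical):

    ::> network interface failover-groups create -vserver svm1 -failover-group fg_nfs_vlan20 -targets node1:a0a-20,node2:a0a-20
    ::> network interface modify -vserver svm1 -lif nfs_n1 -failover-group fg_nfs_vlan20 -failover-policy broadcast-domain-wide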

Broadcast Domain

A broadcast domain is a list of all the Ethernet ports in the cluster using the same MTU size, and therefore it applies only to Ethernet ports. In many cases it is a good idea to separate ports of different speeds, unless of course the storage administrator wants to mix lower-speed & higher-speed ports in conjunction with a failover group to prescribe LIF migration from high-speed ports to slower ports in case the former become unavailable; that is rarely the case and is usually needed only on systems with a minimal number of ports. Each Ethernet port can be assigned to only one broadcast domain. If a port is part of an ifgroup, then the ifgroup port is assigned to a broadcast domain. If a port or an ifgroup has VLANs, then a broadcast domain is assigned to each VLAN on that port or ifgroup.
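
A sketch of a broadcast domain grouping same-MTU ports (the IPspace, port names and MTU are hypothetical); note that in ONTAP 9 the MTU is set on the broadcast domain:

    ::> network port broadcast-domain create -ipspace Default -broadcast-domain bd_nfs_9000 -mtu 9000 -ports node1:a0a-20,node2:a0a-20
    ::> network port broadcast-domain show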

SAN LIF

For front-end SAN data protocols, the ALUA feature is used for network load balancing and redundancy with FCP and iSCSI: all the ports on the node where the data is located are reported to clients as active optimized (preferred), and if there is more than one such port, ALUA will make sure hosts load-balance between them. It works similarly with ANA for NVMe. All other network ports on all the other nodes in the cluster are reported by ONTAP to hosts as active non-optimized, so in case one port or an entire node goes down, the client will still have access to its data using a non-optimized path. Starting with ONTAP 8.3, Selective LUN Mapping (SLM) was introduced to reduce the number of unnecessary paths to a LUN; it removes the non-optimized paths through all cluster nodes except the HA partner of the node owning the LUN, so the cluster reports to the host only the paths from the HA pair where the LUN is located. Because ONTAP provides ALUA/ANA functionality for SAN protocols, SAN network LIFs do not migrate as with NAS protocols. When a volume or LUN migration finishes, it is transparent to the storage system’s clients thanks to the ONTAP architecture, and it can cause temporary or permanent indirect data access through the ONTAP cluster interconnect (HA-CI is not used in such situations), which will slightly increase latency for the clients. SAN LIFs are used for the FC, FCoE, iSCSI & FC-NVMe protocols.
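
A sketch of checking and adjusting Selective LUN Mapping, assuming the reporting-nodes options introduced with SLM in ONTAP 8.3 (paths and names are hypothetical); before moving a volume to a different HA pair you would add the destination nodes as reporting nodes:

    ::> lun mapping show -vserver svm1 -path /vol/vol1/lun1 -fields reporting-nodes
    ::> lun mapping add-reporting-nodes -vserver svm1 -path /vol/vol1/lun1 -igroup esx_hosts -destination-aggregate node3_aggr1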

iSCSI LIFs can live on the same ifgroup, port or VLAN as NAS LIFs, since both use Ethernet ports.

The image below pictures an ONTAP cluster with 4 nodes (2 HA pairs) and a host accessing data over a SAN protocol.

Read more about Zoning for ONTAP clusters here.

VIP LIF

VIP (Virtual IP) LIFs, also known as BGP LIFs, require a top-of-rack BGP router. VIP data LIFs are used over Ethernet in NAS environments. VIP LIFs automatically load-balance traffic based on routing metrics and avoid inactive, unused links and ports, unlike what usually happens with NAS protocols. VIP LIFs provide distribution across all the LIFs in the cluster, not limited to a single node as with NAS LIFs, and they provide smarter load balancing than the hash algorithms used by Ethernet port channels & LACP with interface groups. VIP LIF interfaces are tested and can be used with MCC and SVM-DR, and they provide a more reliable, predictable and faster switch to surviving links & paths than NAS LIFs, but they require BGP routers.

Management interfaces

The node management LIF can migrate, with its associated IP address, across the Ethernet ports of a single node and is available only while ONTAP is running on that node. Usually the node management interface is placed on the e0M port of the node; the node management IP is sometimes used by the cluster admin to reach a particular node’s cluster shell in rare cases where commands have to be issued from that node.

The cluster management LIF, with its associated IP address, is available only while the entire cluster is up & running; by default it can migrate across Ethernet ports, is often located on one of the e0M ports of one of the cluster nodes, and is used by the cluster administrator for storage management. Management interfaces are used for API communication, the HTML GUI and SSH console management; by default SSH connects the administrator to the cluster shell.

The Service Processor (SP) or BMC interface is available only on hardware appliances like FAS & AFF, and each system has either an SP or a BMC. SP/BMC allows out-of-band SSH console communication with a small embedded computer installed on the controller main board and, similarly to IPMI or IP KVM, makes it possible to connect to, monitor & manage the controller even if it does not boot the ONTAP OS. With SP/BMC it is possible to forcibly reboot or halt a controller and to monitor fans, temperature, etc.; once connected to the SP/BMC console over SSH, the administrator can switch to the cluster shell through it by issuing the system console command. Each controller has one SP/BMC interface, which does not migrate like some other management interfaces. Usually, e0M and the SP both live on a single management (wrench) physical Ethernet port, but each has its own dedicated MAC address. Node LIFs, the cluster management LIF & SP/BMC often use the same IP subnet.

The SVM management LIF, similarly to the cluster management LIF, can migrate across all the Ethernet ports on the nodes of the cluster and is dedicated to managing a single SVM. The SVM management LIF does not have GUI capability and can facilitate only API communication & SSH console management; it can live on an e0M port but is often placed by administrators on a data port, usually on a dedicated management VLAN, and can be in a different IP subnet from the node & cluster LIFs.

Cluster interfaces

Each cluster interconnect LIF usually lives on a dedicated Ethernet port and cannot share ports with management and data interfaces. Cluster interconnect interfaces are used for horizontal-scaling functionality, for example when a LUN or a volume migrates from one node of the cluster to another; a cluster interconnect LIF, similarly to node management LIFs, can migrate only between ports of a single node. A few cluster interconnect interfaces can coexist on a single port, but usually this happens only temporarily, during cluster port re-cabling. Inter-cluster LIFs, on the other hand, can live on and share the same Ethernet ports with data LIFs and are used for SnapMirror replication; inter-cluster LIFs, similarly to node management & cluster interconnect LIFs, can only migrate between ports of a single node.

Continue to read

How ONTAP Memory work

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I have described my own understanding of the internal organization of ONTAP systems. Therefore, this information might be either outdated, or I simply might be wrong in some aspects and details. I will greatly appreciate any contribution to make this article better; please leave your ideas and suggestions about this topic in the comments below.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only.

How does the ONTAP cluster work? (part 2)

This article is part of the series How does the ONTAP cluster work? The previous series of articles, How ONTAP Memory works, will also be a good addition to this one.

High Availability

I will call HA the first type of clusterization; its main purpose is data availability. Even though a single HA pair consists of two nodes (or controllers), NetApp has designed it in such a way that it appears as a single storage system from the clients’ point of view. HA configurations in ONTAP use several techniques to present the two nodes of the pair as a single system. This allows the storage system to provide its clients with nearly uninterrupted access to their data should a node fail unexpectedly.

For example: on the network level, ONTAP will temporarily migrate the IP address of the failed node to the surviving node, and where applicable it will also temporarily switch ownership of disk drives from the downed node to the surviving node. On the data level, the contents of the disks that are assigned to the downed node will automatically be available for use via the surviving node.

An aggregate can include only disks owned by a single node; therefore each aggregate is owned by one node, and the objects on top of it, such as FlexVol volumes, LUNs and file shares, are served by a single controller (until FlexGroup). Each node in an HA pair can have its own disks and aggregates and serve them independently; such configurations are therefore called Active/Active, where both nodes are utilized simultaneously even though they are not serving the same data. In case one node fails, the other will take over and serve its partner’s disks, aggregates and FlexVol volumes as well as its own. HA configurations where only one controller has data aggregates are called Active/Passive: the passive node has only a root aggregate and simply waits to take over in case the active node fails.

Once the downed node of the HA pair has booted and is up and running, a “giveback” command is issued, by default, to return disks, aggregates and FlexVol resources back to the original node.
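
A sketch of how takeover and giveback look from the cluster shell (node names are hypothetical):

    ::> storage failover show
    ::> storage failover takeover -ofnode node1
    ::> storage failover giveback -ofnode node1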

Share nothing architecture

There are storage architectures where each node serves the same data with symmetrical client access; ONTAP is not one of them. In ONTAP, only one node serves each disk/aggregate/FlexVol at a time, and if that node fails, another takes over. ONTAP uses an architecture known as share-nothing, meaning no special equipment is really needed for such an architecture. Even though in hardware appliances we can see ”special” devices like NVRAM/NVDIMM and disks with dual ports, each node in an HA pair runs its own instance of ONTAP on a separate controller, and only the NVLogs are shared, over the HA interconnect between the HA partners. Though ONTAP uses these special devices in its hardware appliances, the SDS version of ONTAP works perfectly well without them: NVLogs are still replicated between HA partners, and instead of having dual-ported disk drives accessed by two controllers, ONTAP SDS can simply replicate the data and keep two copies of it, as in MCC configurations. Share-nothing architectures are particularly useful in scale-out clusters: you can add controllers of different models, configurations and disks, and even with slightly different OS versions if needed.

On the contrary, storage systems with symmetrical data access are often built on monolithic architectures, which in turn suit only SAN protocols. While symmetric access and a monolithic architecture might sound ”cool” and seem to give more performance at first sight, in practice share-nothing architectures have shown no less performance. Monolithic architectures have also shown plenty of inflexibility and disadvantages. For example, when the industry moved to flash media, it turned out that disks are no longer the performance bottleneck, but controllers and CPUs are. This means you need to add more nodes to your storage system to increase performance. That problem can be solved with scale-out clusters, but monolithic architectures are particularly bad on that front. Let me explain why.

First of all, if you have a monolithic architecture with symmetric data access, each controller needs to have access to each disk, and when you are adding new controllers you need to rearrange the disk shelf connections to facilitate that. Second, all the controllers in such a cluster have to be the same model with the same firmware, so these clusters become very hard to maintain and to extend with new nodes; plus such architectures are usually very limited in the maximum number of controllers you can add. Another example, closer to practice than theory: imagine that after 3 years you need to add a new node to your cluster to increase performance; the market most probably offers more powerful controllers at the same price you paid for your old controllers, but you can only add the old controller model to the cluster.

Due to its monolithic nature, it becomes a very complex architecture to scale. Most vendors of course try to hide this underlying complexity from their customers to simplify the use of such systems, and some A-brand systems are excellent on that front. But still, monolithic inflexibility makes such systems complex at a low level and thus very expensive, because they require specially designed hardware, main boards and special buses. Share-nothing architectures, on the other hand, need no modifications for commodity servers to be used as storage controllers or hardware storage appliances, and neither scalability nor performance is a problem for them.

HA interconnect

High-availability clusters (HA clusters) are the first type of clusterization introduced in ONTAP systems (that’s why I call them the first). The first, second and third types of ONTAP clusterization are not official or well-known industry terms; I’m using them only to differentiate ONTAP capabilities while keeping them under the same umbrella, because at some level they are all clusterization technologies. HA aims to ensure an agreed level of operations. People often confuse HA with the horizontal-scaling ONTAP clusterization that came from the Spinnaker acquisition; therefore NetApp, in its documentation for clustered ONTAP systems, refers to an HA configuration as an HA pair rather than an HA cluster. I will refer to the horizontal-scaling (Spinnaker) ONTAP clusterization as the third type of clusterization to make it even more difficult. I am just kidding; by doing so I’m drawing parallels between all three types of clusterization so you will easily find the differences between them.

An HA pair uses network connectivity between the two partners called the High Availability interconnect (HA-IC). The HA interconnect can use Ethernet as the communication medium (in some older systems you might find InfiniBand). The HA interconnect is used for non-volatile memory log (NVLog) replication between the two nodes of an HA pair over RDMA, to ensure an agreed level of operations during events like unexpected reboots. Usually, ONTAP assigns dedicated, non-sharable HA ports for the HA interconnect, which can be external or built into the storage chassis (and not visible from the outside). The HA-IC should not be confused with the inter-cluster or intra-cluster interconnects that are used for SnapMirror; inter-cluster and intra-cluster interfaces can coexist with interfaces used for data protocols on data ports. Also, HA-IC traffic should not be confused with cluster interconnect traffic, which is used for horizontal scaling & online data migration across a multi-node cluster; these two interfaces usually live on different ports. HA-IC interfaces are visible only at the node shell level. Starting with the A320, HA-IC and cluster interconnect traffic use the same ports.
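
To illustrate why the HA-IC matters for write acknowledgment, here is a hedged Python sketch of the idea that a write is acknowledged only after its NVLog entry exists on both HA partners, with the data flushed to disk later at a consistency point (CP); the functions and structures are invented for this sketch and are not ONTAP internals.

```python
# Illustrative only: how an incoming write relates to NVLog mirroring over the HA-IC.
# Function and structure names are invented for this sketch, not ONTAP internals.

class Node:
    def __init__(self, name):
        self.name = name
        self.nvlog = []       # stands in for NVRAM/NVDIMM log entries
        self.partner = None   # HA partner reachable over the HA-IC

def client_write(node, data):
    """Acknowledge the client only after the NVLog entry exists on BOTH HA partners."""
    node.nvlog.append(data)            # 1. log the write locally
    node.partner.nvlog.append(data)    # 2. mirror the NVLog entry over the HA-IC (RDMA)
    return "ack"                       # 3. acknowledge; disks are updated later at a CP

def consistency_point(node):
    """At a CP the buffered writes are flushed to disk and the NVLog section is freed."""
    flushed, node.nvlog = node.nvlog, []
    return flushed

node1, node2 = Node("node1"), Node("node2")
node1.partner, node2.partner = node2, node1

client_write(node1, "block A")
# If node1 reboots unexpectedly before its next CP, node2 still holds the NVLog
# entry for "block A" and can replay it after takeover, so no acknowledged write is lost.
```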

MetroCluster

MetroCluster is free functionality for ONTAP systems that provides metro high availability with synchronous replication between two sites; the configuration might require some additional equipment. There can be only two sites. To distinguish the "old" MetroCluster in 7-Mode from the "new" MetroCluster in Cluster-Mode, the latter is shortened to MCC. I will call MCC the second type of ONTAP clusterization. The primary purpose of MCC is to provide data protection and availability across two geographical locations and to switch clients from one site to the other in case of a disaster so they can continue to access the data.

MetroCluster (MCC) adds another level of data availability on top of HA configurations. It was initially supported only on FAS and AFF storage systems; later an SDS version of MetroCluster was introduced with the ONTAP Select & Cloud Volumes ONTAP products. An MCC configuration consists of two sites (each site can have a single node or an HA pair) that together form the MetroCluster. The distance between sites can reach up to 300 km (186 miles) or even 700 km (about 435 miles), which is why it is called a geo-distributed system. Plex and SyncMirror are the critical underlying technologies for MetroCluster that synchronize data between the two sites. In MCC configurations NVLogs are also replicated between the storage systems at both sites; in this article I will refer to this traffic as MetroCluster traffic, to distinguish it from HA interconnect and cluster interconnect traffic.

MetroCluster uses the RAID SyncMirror (RSM) and plex technique: on one site a number of disks form one or more RAID groups aggregated into a plex, while the second site has the same number of disks of the same type and RAID configuration aggregated into a second plex, and one plex replicates data to the other. Alongside NVLogs, ONTAP replicates Configuration Replication Service (CRS) metadata. NVLogs are replicated from one system to the other as part of the SyncMirror process; on the destination system the NVLogs are restored to the MBUF (memory buffer) and dumped to disks as part of the next CP, so from a logical point of view it looks like data is synchronously replicated between the two plexes. To simplify things, NetApp usually shows one plex synchronously replicating to the other, but in reality the NVLogs are synchronously replicated between the non-volatile memories of the two sites. The two plexes form an aggregate, and in case of a disaster at one site, the second site provides read-write access to the data. MetroCluster supports FlexArray technology and ONTAP SDS.

As part of the third type of clusterization, individual data volumes, LUNs, and LIFs can migrate online across storage nodes in a MetroCluster, but only within the site where the data originated. It is not possible to migrate individual volumes, LUNs, or LIFs across sites using cluster capabilities; instead the MetroCluster switchover operation (the second type of clusterization) switches an entire half of the cluster, with all its data, volumes, LIFs, and storage configuration, from one site to the other, so that clients and applications access all the data from the other location.

With MCC it is possible to have one or more storage nodes per site: one node per site is known as a 2-node configuration (or two-pack configuration), two nodes per site as a 4-node configuration, and four nodes per site as an 8-node configuration. The local HA partner (if there is one) and the remote partner must be the same model; in 2-node and 4-node configurations all nodes must be the same model & configuration. In an MCC configuration, one remote and one local storage node form a Disaster Recovery pair (DR pair) across the two sites, while two local nodes (if there is a partner) form a local HA pair. Thus each node synchronously replicates the data in its non-volatile memory to up to two nodes: one remote and one local (if there is one). In other words, the 4-node configuration consists of two HA pairs, and in this case NVLogs are replicated to the remote DR partner and to the local HA partner as in a normal non-MCC HA system, while in the 2-node configuration NVLogs are replicated only to the remote partner.
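
Following the description above, here is a small illustrative Python sketch of which nodes receive a copy of a node's NVLog in 2-node versus 4-node MCC configurations; the node names and the function are made up for this example.

```python
# Illustrative only: which nodes receive a copy of a node's NVLog in MCC,
# following the description above. Names are invented for this sketch.

def nvlog_replication_targets(node, local_ha_partner, remote_dr_partner):
    """Return the nodes that receive this node's NVLog."""
    targets = [remote_dr_partner]        # MetroCluster traffic: always the remote DR partner
    if local_ha_partner is not None:     # HA-IC traffic: only if a local HA partner exists
        targets.append(local_ha_partner)
    return targets

# 2-node MCC: one node per site, no local HA partner
print(nvlog_replication_targets("siteA-node1", None, "siteB-node1"))
# -> ['siteB-node1']

# 4-node MCC: each node has a local HA partner and a remote DR partner
print(nvlog_replication_targets("siteA-node1", "siteA-node2", "siteB-node1"))
# -> ['siteB-node1', 'siteA-node2']
```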

8-Node MCC

An 8-node MCC configuration consists of two almost independent 4-node MCCs (each with two HA pairs); as in the 4-node configuration, each storage node has exactly one remote DR partner and one local HA partner. The only difference between two completely independent 4-node MCCs and an 8-node MetroCluster is that the 8-node configuration shares the cluster interconnect switches, so the entire 8-node cluster is seen by clients as a single namespace and the system administrator can move data online between all the nodes of the MetroCluster within a local site. An example of an 8-node MCC is four AFF A700 nodes and four FAS8200 nodes, where two A700 nodes and two FAS8200 nodes are at one site and the second half at the other site.

MCC network transport: FC & IP

MCC can use two network transports for synchronization: FC or IP. Most FC configurations require dedicated FC-VI ports, usually located on an FC-VI card, but some FAS/AFF models can convert on-board FC ports to FC-VI mode. IP requires iWARP interfaces, which live on Ethernet ports (25 GbE or higher), usually on a dedicated iWARP card. Some models, like the entry-level A220, can use onboard ports and share them with cluster interconnect traffic, while MCC-FC does not support entry-level systems.

MCC: Fabric & Stretched

Fabric configurations are configurations with switches, while stretched configurations are configurations without a switch. Both the Fabric and Stretched terms usually apply only to the FC network transport, because the IP transport always requires a switch. Stretched configurations can use only 2 nodes in a MetroCluster. With stretched MCC-FC it is possible to build a 2-node cluster stretched up to 300 meters (984 feet) without a switch; such configurations require special optical cables with multiple fibers in them, because all controllers and all disk shelves must be cross-connected. To reduce the number of fibers, a stretched configuration can use FC-SAS bridges: the disk shelves connect to the bridges, and then the controllers and the FC-SAS bridges are cross-connected. The second option for reducing the number of required fiber links is to use FlexArray technology instead of NetApp disk shelves.

Fabric MCC-FC

FAS and AFF systems with ONTAP software versions 9.2 and older utilize FC-VI ports, and for long distances require four Fibre Channel switches dedicated solely to the MetroCluster (2 at each site) and 2 FC-SAS bridges per disk shelf stack, thus a minimum of 4 bridges in total for the 2 sites, plus a minimum of 2 dark fiber ISL links, with optional DWDMs for long distances. Fabric MCC requires FC-SAS bridges. 4-node and 8-node configurations also require a pair of cluster interconnect switches.

MCC-IP

Starting with ONTAP 9.3, MetroCluster over IP (MCC-IP) was introduced, with no need for the dedicated back-end Fibre Channel switches, FC-SAS bridges, or dedicated dark fiber ISL that were previously required for MCC-FC configurations. In such a configuration the disk shelves are directly connected to the controllers, and the cluster switches carry both MetroCluster (iWARP) and cluster interconnect traffic. Initially only A700 & FAS9000 systems supported MCC-IP. MCC-IP is available only in 4-node configurations: a 2-node highly available system at each site, two sites in total. With ONTAP 9.4, MCC-IP supports the A800 system and Advanced Drive Partitioning in the form of Root-Data-Data (RD2) partitioning for AFF systems, also known as ADPv2. ADPv2 is supported only on all-flash systems. MCC-IP configurations support a single disk shelf per site where SSD drives are partitioned with ADPv2. MetroCluster over IP requires Ethernet cluster switches with ISL SFP modules installed to connect to the remote location and utilizes iWARP cards in each storage controller for synchronous replication. Starting with ONTAP 9.5, MCC-IP supports distances up to 700 km, the SVM-DR feature, and the AFF A300 and FAS8200 systems. Beginning with ONTAP 9.6, MCC-IP supports the entry-level A220 and FAS2750 systems; in these systems the MCC (iWARP), HA, and cluster interconnect interfaces live on the onboard cluster interconnect ports, while mid-range and high-end systems still require a dedicated iWARP card.
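
As a rough illustration of the Root-Data-Data (ADPv2) idea, here is a simplified Python sketch that splits one SSD into a small root partition and two equal data partitions; the partition sizing (5% root) and the ownership assignment are my own simplifying assumptions for the sketch, not exact ONTAP sizing rules.

```python
# Illustrative only: the Root-Data-Data (ADPv2) idea — each SSD is split into one small
# root partition and two data partitions. Sizes and the ownership scheme below are
# simplifying assumptions for this sketch, not exact ONTAP sizing rules.

def adp_v2_layout(ssd_size_gb, root_fraction=0.05):
    """Split one SSD into a root partition and two equal data partitions."""
    root = ssd_size_gb * root_fraction
    data = (ssd_size_gb - root) / 2
    return {"root": root, "data1": data, "data2": data}

def assign_partitions(layout, node_a, node_b):
    """One possible assignment: each HA partner owns one data partition of the drive."""
    return {node_a: layout["data1"], node_b: layout["data2"]}

layout = adp_v2_layout(960)   # a 960 GB SSD, as in the plex example below
print(layout)                 # {'root': 48.0, 'data1': 456.0, 'data2': 456.0}
print(assign_partitions(layout, "node1", "node2"))
```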

Plex

Similar to RAID 1, plexes in ONTAP systems can keep mirrored data in two places, but while conventional RAID 1 must exist within the bounds of one storage system, two plexes can be distributed between two storage systems. Each aggregate consists of one or two plexes. Ordinary HA or single-node storage systems have only one plex per aggregate, while SyncMirror local or MetroCluster configurations have two plexes per aggregate. Each plex includes underlying storage space from one or more NetApp RAID groups, or from LUNs of third-party storage systems (see FlexArray), combined in a single plex similarly to RAID 0. If an aggregate consists of two plexes, one plex is considered the master and the second the slave; the slave must have the same RAID configuration and drives. For example, if we have an aggregate consisting of two plexes where the master plex consists of 21 data and 3 parity 1.8 TB SAS 10k drives in RAID-TEC, then the slave plex must also consist of 21 data and 3 parity 1.8 TB SAS 10k drives in RAID-TEC. A second example, with hybrid aggregates: if we have an aggregate consisting of two plexes where the master plex contains one RAID group of 17 data and 3 parity 1.8 TB SAS 10k drives configured as RAID-TEC and a second RAID group of 2 data and 2 parity 960 GB SSDs in RAID-DP, then the slave plex must have the same configuration: one RAID group of 17 data and 3 parity 1.8 TB SAS 10k drives configured as RAID-TEC and a second RAID group of 2 data and 2 parity 960 GB SSDs in RAID-DP. MetroCluster configurations use SyncMirror technology for synchronous data replication.
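
To make the symmetry requirement easier to see, here is a toy Python check that the two plexes of a mirrored aggregate carry identical RAID configurations, using the hybrid-aggregate example above; the data structures are invented for this illustration.

```python
# Illustrative only: a toy check that two plexes of a mirrored aggregate have identical
# RAID configurations, as described above. The data structures are invented for this sketch.

# Each RAID group is described as (raid_type, data_disks, parity_disks, disk_model)
MASTER_PLEX = [
    ("RAID-TEC", 17, 3, "1.8TB SAS 10k"),
    ("RAID-DP",   2, 2, "960GB SSD"),
]
SLAVE_PLEX = [
    ("RAID-TEC", 17, 3, "1.8TB SAS 10k"),
    ("RAID-DP",   2, 2, "960GB SSD"),
]

def plexes_symmetrical(plex_a, plex_b):
    """Both plexes must contain the same RAID groups: same type, disk counts, and drives."""
    return sorted(plex_a) == sorted(plex_b)

print(plexes_symmetrical(MASTER_PLEX, SLAVE_PLEX))   # True: a valid mirrored aggregate
print(plexes_symmetrical(MASTER_PLEX, [("RAID-DP", 21, 2, "1.8TB SAS 10k")]))  # False
```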

There are two SyncMirror options: MetroCluster and Local SyncMirror; both use the same plex technique for synchronous replication of data between two plexes. Local SyncMirror keeps both plexes in a single controller and is often used as additional protection against the failure of an entire disk shelf in a storage system. MetroCluster allows data to be replicated between two storage systems.

MetroCluster SDS

MetroCluster SDS (MC SDS) is a feature of the ONTAP Select software. Similarly to MetroCluster on FAS/AFF systems, it synchronously replicates data between two sites using plexes & SyncMirror and automatically switches to the surviving node transparently to users and applications. MetroCluster SDS works as an ordinary HA pair, so data volumes, LUNs, and LIFs can be moved online between aggregates and controllers at both sites. This is different from traditional MetroCluster on FAS/AFF systems, where data can be moved across storage cluster nodes only within the site where it was located initially; in traditional MetroCluster the only way for applications to access data locally at the remote site is to disable an entire site, a process called switchover, whereas in MC SDS an ordinary HA takeover occurs. MCC supports 2-node, 4-node, and even 8-node configurations, while MC SDS supports only a 2-node configuration. MetroCluster SDS uses ONTAP Deploy as the mediator (in the FAS and AFF world this software is known as the MetroCluster Tiebreaker); ONTAP Deploy comes bundled with ONTAP Select and is generally used for deploying clusters and for installing and monitoring licenses.

Continue to read

How ONTAP Memory works

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I have described my own understanding of the internal organization of ONTAP systems. Therefore, this information might be outdated, or I simply might be wrong in some aspects and details. I will greatly appreciate any contribution you make to improve this article; please leave your ideas and suggestions about this topic in the comments below.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only.