How does the ONTAP cluster work? (part 2)

This article is part of the series How does the ONTAP cluster work? The previous series of articles, How ONTAP Memory work, is also a good companion to this one.

High Availability

I will call HA the first type of clusterization; its main purpose is data availability. Even though a single HA pair consists of two nodes (or controllers), NetApp has designed it in such a way that it appears as a single storage system from the client's point of view. HA configurations in ONTAP use several techniques to present the two nodes of the pair as a single system. This allows the storage system to provide its clients with nearly uninterrupted access to their data should a node fail unexpectedly.

For example, on the network level, ONTAP will temporarily migrate the IP addresses of the failed node to the surviving node, and where applicable it will also temporarily switch ownership of disk drives from the downed node to the surviving node. On the data level, the contents of the disks that are assigned to the downed node automatically become available through the surviving node.
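
For illustration, here is a hedged sketch of how an administrator could observe this from the ONTAP cluster shell; the cluster and node names (cluster1, node1) are hypothetical, and exact output columns vary by ONTAP version:

    cluster1::> storage failover show
    (shows for each node whether takeover is possible and whether it is currently taken over by its partner)
    cluster1::> network interface show -fields home-node,curr-node,is-home
    (after a takeover, LIFs homed on the failed node report is-home as false and curr-node pointing to the surviving node)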

An aggregate can include only disks owned by a single node; therefore, each aggregate is owned by one node, and the upper-level objects such as FlexVol volumes, LUNs, and file shares are served by a single controller (until FlexGroup). Each node in an HA pair can have its own disks and aggregates and serve them independently; such configurations are called Active/Active, because both nodes are utilized simultaneously even though they are not serving the same data. If one node fails, the other takes over and serves its partner's disks, aggregates, and FlexVol volumes in addition to its own. HA configurations where only one controller has data aggregates are called Active/Passive: the passive node has only a root aggregate and simply waits to take over in case the active node fails.
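
As a rough, hedged sketch (the cluster name is hypothetical, and field names can differ slightly between ONTAP versions), disk and aggregate ownership per node can be inspected like this:

    cluster1::> storage disk show -fields owner
    (each disk is listed with the node that currently owns it)
    cluster1::> storage aggregate show
    (each data aggregate is listed against its owning node; in an Active/Passive layout one node owns only its root aggregate)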

Once the downed node of the HA pair has booted and is up and running, a "giveback" command is issued, by default automatically, to return disks, aggregates, and FlexVol resources to the original node.
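
A hedged example of the giveback workflow from the cluster shell; the node name is hypothetical, and automatic giveback may already be enabled by default on your release:

    cluster1::> storage failover show-giveback
    (check whether a giveback is pending or vetoed)
    cluster1::> storage failover giveback -ofnode node1
    (manually return disks and aggregates to their home node)
    cluster1::> storage failover modify -node node1 -auto-giveback true
    (let ONTAP perform the giveback automatically after the node recovers)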

Shared-nothing architecture

There are storage architectures where each node serves the same data with symmetrical client access; ONTAP is not one of them. In ONTAP only one node serves each disk/aggregate/FlexVol at a time, and if that node fails, another takes over. ONTAP uses an architecture known as shared-nothing, meaning no special equipment is really needed for it. Even though in hardware appliances we can see "special" devices like NVRAM/NVDIMM and disks with dual ports, each node in an HA pair runs its own instance of ONTAP on a separate controller, and only the NVLogs are shared, over the HA interconnect (HA-IC) connections between the HA partners. Though ONTAP uses these special devices in its hardware appliances, the SDS version of ONTAP can work without them perfectly well: NVLogs are still replicated between HA partners, and instead of having two ports on each disk drive and accessing the same drives from two controllers, ONTAP SDS can simply replicate the data and keep two copies of it, as in MCC configurations. Shared-nothing architectures are particularly useful in scale-out clusters: you can add controllers of different models and configurations, with different disks, and even with a slightly different OS version if needed.

On the contrary, storage systems with symmetrical data access are often built on monolithic architectures, which in turn are suitable only for SAN protocols. While symmetrical access and a monolithic architecture might sound "cool" and seem to give more performance at first sight, in practice shared-nothing architectures have shown no less performance. Monolithic architectures, on the other hand, have shown plenty of inflexibility and disadvantages. For example, when the industry moved to flash media, it turned out that disks are no longer the performance bottleneck; controllers and CPUs are. This means you need to add more nodes to your storage system to increase performance. The problem can be solved with scale-out clusters, but monolithic architectures are particularly bad on that front. Let me explain why.

First of all, in a monolithic architecture with symmetric data access each controller needs access to every disk, so when you add new controllers you have to rearrange the disk shelf connections to facilitate that. Second, all the controllers in such a cluster have to be the same model with the same firmware, so these clusters become very hard to maintain and to expand with new nodes; in addition, such architectures are usually very limited in the maximum number of controllers you can add to the cluster. Another example, closer to practice than to theory: imagine that after three years you need to add a new node to your cluster to increase performance. By then the market most probably offers more powerful controllers at the same price you paid for your old ones, yet you can add only the old controller model to the cluster.

Due to its monolithic nature, such an architecture becomes very complex to scale. Most vendors, of course, try to hide this underlying complexity from their customers to simplify the use of such systems, and some A-Brand systems are excellent on that front. But still, monolithic inflexibility makes such systems complex at the low level and thus very expensive, because they require specially designed hardware, main boards, and special buses. Shared-nothing architectures, on the other hand, need no modifications for commodity servers to be used as storage controllers or hardware storage appliances, while neither scalability nor performance is a problem for them.

HA interconnect

High-availability clusters (HA clusters) are the first type of clusterization introduced in ONTAP systems (that's why I call it the first). The first, second and third types of ONTAP clusterization are not official or well-known industry terms; I'm using them only to differentiate ONTAP capabilities while keeping them under the same umbrella, because on some level they are all clusterization technologies. HA aims to ensure an agreed level of operations. People often confuse HA with the horizontal-scaling ONTAP clusterization that came from the Spinnaker acquisition; therefore, NetApp, in its documentation for Clustered ONTAP systems, refers to an HA configuration as an HA pair rather than as an HA cluster. I will refer to horizontal-scaling ONTAP (Spinnaker) clusterization as the third type of clusterization to make it even more difficult. I am just kidding; by doing so I'm drawing parallels between all three types of clusterization so you can easily spot the differences between them.

An HA pair uses network connectivity between the partners called the High Availability interconnect (HA-IC). The HA interconnect can use Ethernet (in some older systems you might find InfiniBand) as the communication medium. The HA interconnect is used for non-volatile memory log (NVLogs) replication between the two nodes of an HA pair, using RDMA technology, to ensure an agreed level of operations during events like unexpected reboots. Usually, ONTAP assigns dedicated, non-shareable HA ports for the HA interconnect, which can be external or built into the storage chassis (and not visible from the outside). We should not confuse the HA-IC with the inter-cluster or intra-cluster interconnects that are used for SnapMirror; inter-cluster and intra-cluster interfaces can coexist with interfaces used for data protocols on data ports. Also, HA-IC traffic should not be confused with the Cluster Interconnect traffic used for horizontal scaling and online data migration across a multi-node cluster; usually these two interfaces live on different ports. HA-IC interfaces are visible only at the node shell level. Starting with the A320, HA-IC and Cluster interconnect traffic use the same ports.
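
Since HA-IC details surface at the node shell rather than the cluster shell, one hedged way to peek at them on a hardware appliance (the adapters actually listed depend on the platform) is the verbose sysconfig output:

    cluster1::> system node run -node local -command "sysconfig -v"
    (on most FAS/AFF models the output includes the HA interconnect adapter among the listed devices)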

MetroCluster

MetroCluster is free functionality for ONTAP systems that provides metro high availability with synchronous replication between two sites; the configuration might require some additional equipment. There can be only two sites. To distinguish the "old" MetroCluster in 7-Mode from the "new" MetroCluster in Cluster-Mode, the latter is shortened to MCC. I will call MCC the second type of ONTAP clusterization. The primary purpose of MCC clusterization is to provide data protection and data availability across two geographical locations and to switch clients from one site to the other in case of a disaster so they can continue to access the data.

MetroCluster (MCC) adds a level of data availability on top of HA configurations and was initially supported only with FAS and AFF storage systems; later an SDS version of MetroCluster was introduced with the ONTAP Select & Cloud Volumes ONTAP products. An MCC configuration consists of two sites (each site can have a single node or an HA pair), which together form the MetroCluster. The distance between sites can reach up to 300 km (186 miles) or even 700 km (435 miles); therefore, it is called a geo-distributed system. Plex and SyncMirror are the key underlying technologies for MetroCluster, which synchronize data between the two sites. In MCC configurations NVLogs are also replicated between the storage systems at the two sites; in this article I will refer to this traffic as MetroCluster traffic, to distinguish it from HA interconnect and Cluster interconnect traffic.
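
A hedged sketch of checking overall MetroCluster health from the cluster shell (the cluster name siteA is hypothetical, and MCC must already be configured):

    siteA::> metrocluster show
    (shows the local and remote cluster, configuration state and mode)
    siteA::> metrocluster check run
    siteA::> metrocluster check show
    (runs and then displays the built-in consistency checks across both sites)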

MetroCluster uses RAID SyncMirror (RSM) and the plex technique: at one site a number of disks form one or more RAID groups aggregated into a plex, while the second site has the same number of disks of the same type and RAID configuration aggregated into a second plex, and one plex replicates data to the other. Alongside NVLogs, ONTAP replicates Configuration Replication Service (CRS) metadata. NVLogs are replicated from one system to the other as part of the SyncMirror process; on the destination system the NVLogs are restored to the memory buffer (MBUF) and dumped to disks as part of the next CP process, so from the logical point of view it looks as if data is synchronously replicated between the two plexes. To simplify things, NetApp usually shows one plex synchronously replicating to the other, but in reality the NVLogs are synchronously replicated between the non-volatile memories of the two sites. The two plexes form an aggregate, and in case of a disaster at one site, the second site provides read-write access to the data. MetroCluster supports FlexArray technology and ONTAP SDS.

As part of the third type of clusterization, individual data volumes, LUNs and LIFs can migrate online across storage nodes in a MetroCluster, but only within the site where the data originated: it is not possible to migrate individual volumes, LUNs or LIFs across sites using cluster capabilities. The only option is a MetroCluster switchover operation (the second type of clusterization), which switches an entire half of the cluster, with all its data, volumes, LIFs and storage configuration, from one site to the other, so clients and applications access all the data from the other location.
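
To make the difference concrete, here is a hedged sketch of both operations; the vserver, volume and aggregate names are hypothetical, and on MCC-FC the healing phases are run manually before switchback (newer MCC-IP releases automate parts of this):

    siteA::> volume move start -vserver svm1 -volume vol1 -destination-aggregate aggr2_siteA_node2
    (third type of clusterization: online volume move, only between aggregates within the same site)
    siteA::> metrocluster switchover
    (second type of clusterization: switches the partner site's data, volumes and LIFs over to the local site)
    siteA::> metrocluster heal -phase aggregates
    siteA::> metrocluster heal -phase root-aggregates
    siteA::> metrocluster switchback
    (once the failed site is repaired, heal the aggregates and switch back)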

With MCC it is possible to have one or more storage nodes per site: one node per site is known as a 2-node configuration (or two-pack configuration), two nodes per site as a 4-node configuration, and an 8-node configuration has four nodes per site. The local HA partner (if one exists) and the remote partner must be the same model: in 2- or 4-node configurations, all nodes must be the same model and configuration. In an MCC configuration, one remote and one local storage node form a Disaster Recovery Pair (DR Pair) across the two sites, while two local nodes (if there is a partner) form a local HA pair; thus each node synchronously replicates the data in its non-volatile memory to two nodes: one remote and one local (if there is one). In other words, the 4-node configuration consists of two HA pairs, and in this case NVLogs are replicated both to the remote site and to the local HA partner as in a normal non-MCC HA system, while in a 2-node configuration NVLogs are replicated only to the remote partner.

An MCC configuration with one node at each site is called a two-pack (or 2-node) MCC configuration.

8-Node MCC

An 8-node MCC configuration consists of two almost independent 4-node MCCs (each 4-node with two HA pairs); as in the 4-node configuration, each storage node has only one remote partner and only one local HA partner. The only difference between two completely independent 4-node MCCs and an 8-node MetroCluster configuration is that the 8-node configuration shares cluster interconnect switches, so the entire 8-node cluster is seen by clients as a single namespace, and the system administrator can move data online between all the nodes in the MetroCluster within the local site. An example of an 8-node MCC is four nodes of AFF A700 and four nodes of FAS8200, where two nodes of the A700 and two nodes of the FAS8200 are at one site and the second half at the other site.
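
A hedged way to see how nodes and DR pairs are grouped (an 8-node MetroCluster shows two DR groups, a 4-node configuration only one):

    siteA::> metrocluster node show
    (lists each DR group with its clusters, nodes and DR partner relationships)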

MCC network transport: FC & IP

MCC can use two network transports for synchronization: FC or IP. Most FC configurations require dedicated FC-VI ports, usually located on an FC-VI card, but some FAS/AFF models can convert on-board FC ports to FC-VI mode. IP requires iWARP interfaces, which live on Ethernet ports (25 GbE or higher) usually available on an iWARP card. Some models, like the entry-level A220, can use onboard ports and share them with cluster interconnect traffic, while MCC-FC does not support entry-level systems.

MCC: Fabric & Stretched

Fabric configurations are configurations with switches, while stretched configurations are configurations without a switch. Both the Fabric and Stretched terms usually apply only to the FC network transport, because the IP transport always requires a switch. Stretched configs can use only 2 nodes in a MetroCluster. With MCC-FC stretched configs it is possible to build a 2-node cluster stretched up to 300 meters (984 feet) without a switch; such configurations require special optical cables with multiple fibers in them, because of the need to cross-connect all controllers and all disk shelves. To reduce the number of fibers, stretched configurations can use FC-SAS bridges to connect the disk shelves and then cross-connect the controllers to the FC-SAS bridges; the second option to reduce the number of required fiber links is to use FlexArray technology instead of NetApp disk shelves.

Fabric MCC-FC

FAS and AFF systems with ONTAP software versions 9.2 and older utilize FC-VI ports and, for long distances, require four Fibre Channel switches dedicated only to MetroCluster (2 at each site) and 2 FC-SAS bridges per disk shelf stack (thus a minimum of 4 in total for 2 sites), plus a minimum of 2 dark fiber ISL links, with optional DWDMs for long distances. Fabric MCC requires FC-SAS bridges. 4-node and 8-node configurations require a pair of cluster interconnect switches.

MCC-IP

Starting with ONTAP 9.3, MetroCluster over IP (MCC-IP) was introduced with no need for the dedicated back-end Fibre Channel switches, FC-SAS bridges or dedicated dark fiber ISLs that MCC-FC configurations previously required. In such a configuration disk shelves are directly connected to the controllers, and the cluster switches are used for MetroCluster (iWARP) and Cluster interconnect traffic. Initially, only A700 & FAS9000 systems supported MCC-IP. MCC-IP is available only in 4-node configurations: a 2-node highly available system at each site, with two sites in total. With ONTAP 9.4, MCC-IP supports the A800 system and Advanced Drive Partitioning in the form of Root-Data-Data (RD2) partitioning for AFF systems, also known as ADPv2. ADPv2 is supported only on all-flash systems. MCC-IP configurations support a single disk shelf per site where SSD drives are partitioned with ADPv2. MetroCluster over IP requires Ethernet cluster switches with installed ISL SFP modules to connect to the remote location and utilizes iWARP cards in each storage controller for synchronous replication. Starting with ONTAP 9.5, MCC-IP supports distances up to 700 km, the SVM-DR feature, and the AFF A300 and FAS8200 systems. Beginning with ONTAP 9.6, MCC-IP supports the entry-level systems A220 and FAS2750; in these systems the MCC (iWARP), HA, and Cluster interconnect interfaces live on the onboard cluster interconnect ports, while mid-range and high-end systems still require a dedicated iWARP card.
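
As a sketch of where the MCC-IP replication interfaces appear in the CLI (hedged: these commands belong to the MetroCluster IP configuration workflow and their exact options vary by release):

    siteA::> metrocluster configuration-settings interface show
    (lists the MetroCluster IP / iWARP interfaces configured on each node)
    siteA::> metrocluster configuration-settings connection show
    (shows the replication connections established between the two sites)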

Plex

Similar to RAID-1, plexes in ONTAP systems keep mirrored data in two places, but while conventional RAID-1 must exist within the bounds of one storage system, two plexes can be distributed between two storage systems. Each aggregate consists of one or two plexes. Ordinary HA or single-node storage systems have only one plex for each aggregate, while SyncMirror local or MetroCluster configurations have two plexes for each aggregate. Each plex includes underlying storage space from one or more NetApp RAID groups, or from LUNs of third-party storage systems (see FlexArray), combined within the plex similarly to RAID-0. If an aggregate consists of two plexes, one plex is considered the master and the second the slave; the slave must have the same RAID configuration and drives. For example, if we have an aggregate consisting of two plexes where the master plex consists of 21 data and 3 parity 1.8 TB SAS 10k drives in RAID-TEC, then the slave plex must also consist of 21 data and 3 parity 1.8 TB SAS 10k drives in RAID-TEC. A second example, with hybrid aggregates: if we have an aggregate consisting of two plexes where the master plex contains one RAID group of 17 data and 3 parity 1.8 TB SAS 10k drives configured as RAID-TEC and a second RAID group of 2 data and 2 parity 960 GB SSDs in RAID-DP, then the slave plex must have the same configuration: one RAID group of 17 data and 3 parity 1.8 TB SAS 10k drives configured as RAID-TEC and a second RAID group of 2 data and 2 parity 960 GB SSDs in RAID-DP. MetroCluster configurations use SyncMirror technology for synchronous data replication.
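
As a hedged sketch, a mirrored aggregate with two plexes can be created in a single step on a system where SyncMirror/MetroCluster is in place; the aggregate name, node and disk count here are hypothetical:

    cluster1::> storage aggregate create -aggregate aggr1_data -node node1 -diskcount 48 -raidtype raid_tec -mirror true
    (with -mirror true, ONTAP builds plex0 and plex1 with identical RAID layouts, drawing matching disks from the two pools)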

There are two SyncMirror options: MetroCluster and Local SyncMirror; both use the same plex technique for synchronous replication of data between two plexes. Local SyncMirror creates both plexes in a single controller and is often used as additional protection against the failure of an entire disk shelf in a storage system. MetroCluster allows data to be replicated between two storage systems.
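
For Local SyncMirror on an existing aggregate, a hedged sketch (the aggregate name is hypothetical, and enough matching spare disks must exist in the second pool):

    cluster1::> storage aggregate mirror -aggregate aggr1_data
    (adds a second plex and starts synchronizing it with the existing one)
    cluster1::> storage aggregate plex show -aggregate aggr1_data
    (shows plex0 and plex1 with their online/resyncing state)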

MetroCluster SDS

MetroCluster SDS (MC SDS) is a feature of the ONTAP Select software; similarly to MetroCluster on FAS/AFF systems, it allows data to be synchronously replicated between two sites using Plex & SyncMirror and automatically switches to the surviving node transparently to users and applications. MetroCluster SDS works as an ordinary HA pair, so data volumes, LUNs and LIFs can be moved online between aggregates and controllers at both sites, which is different from traditional MetroCluster on FAS/AFF systems, where data can be moved across storage cluster nodes only within the site where the data was located initially. In traditional MetroCluster the only way for applications to access data locally at the remote site is to disable an entire site; this process is called switchover, whereas in MC SDS an ordinary HA takeover occurs. MCC supports 2-, 4- and even 8-node configurations, while MC SDS supports only a 2-node configuration. MetroCluster SDS uses ONTAP Deploy as the mediator (in the FAS and AFF world this built-in software is known as the MetroCluster tiebreaker); ONTAP Deploy comes bundled with ONTAP Select and is generally used for deploying clusters, installing and monitoring licenses.

Continue to read

How ONTAP Memory work

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I described my own understanding of the internal organization of ONTAP systems. Therefore, this information might be outdated, or I might simply be wrong in some aspects and details. I will greatly appreciate any contribution you make to improve this article; please leave your ideas and suggestions about this topic in the comments below.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only.
