Zoning for cluster storage in pictures

NetApp AFF & FAS storage systems can be combined into a cluster of up to 12 nodes (6 HA pairs) for SAN protocols. Let’s take a look at zoning and connections using an example with 4 nodes (2 HA pairs) in the image below.

For simplicity, we will discuss the connection of a single host to the storage cluster. In this example we connect each node to the server, and each storage node is connected with two links for reliability.

It is easy to notice that only two paths are going to be Preferred in this example (solid yellow arrows).

Since NetApp FAS/AFF systems implement a “share nothing” architecture, disk drives are assigned to a node, the disks on a node are grouped into RAID groups, and one or a few RAID groups are combined into a plex. Usually one plex forms an aggregate (in some cases two plexes form an aggregate; in that case both plexes must have identical RAID configurations, think of it as an analogy to RAID 1). On aggregates you create FlexVol volumes. Each FlexVol volume is a separate WAFL file system and can serve NAS files (SMB/NFS), SAN LUNs (iSCSI, FC, FCoE) or Namespaces (NVMe-oF). A FlexVol can contain multiple Qtrees, and each Qtree can store a LUN or files. Read more in the series of articles How ONTAP Memory works.
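
As a rough illustration of this hierarchy in the ONTAP CLI (a sketch only; the cluster, SVM, aggregate and volume names are hypothetical):

NCluster::> storage aggregate show
NCluster::> volume create -vserver SVM1 -volume vol1 -aggregate aggr1_node1 -size 1TB
NCluster::> lun create -vserver SVM1 -path /vol/vol1/lun1 -size 500GB -ostype linux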

Each drive belongs to and is served by a node. A RAID group belongs to and is served by a node. All objects on top of those belong to and are served by a single node as well, including aggregates, FlexVols, Qtrees, LUNs & Namespaces.

At any given time a disk can belong to only a single node, and in case of a node failure the HA partner takes over the disks, aggregates, and all the objects on top of them. Note that a “disk” in ONTAP can be an entire physical disk as well as a partition on a disk. Read more about disks and ADP (disk partitioning) here.

Though a LUN or Namespace belongs to a single node, it is possible to access it through the HA partner or even through other nodes. The most optimal path is always through the node which owns the LUN or Namespace. If that node has more than one port, all paths through that node's ports are considered optimal, as opposed to the non-optimized paths through the other nodes. Normally it is a good idea to have more than one optimal path to a LUN.

ALUA & ANA

ALUA (Asymmetric Logical Unit Access) is a protocol which helps hosts access LUNs through optimal paths; it also allows paths to a LUN to change automatically if the LUN moves to another controller. ALUA is used with both the FCP and iSCSI protocols. Similarly to ALUA, ANA (Asymmetric Namespace Access) is the equivalent protocol for NVMe over Fabrics transports such as FC-NVMe.

A host can use one or more paths to a LUN; how many depends on the host multipathing configuration and the portset configuration on the ONTAP cluster.
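
Here is a rough sketch of how an igroup and a portset limit which LIFs present a LUN to a host (all names and the initiator WWPN are hypothetical):

NCluster::> lun igroup create -vserver SVM1 -igroup host1 -protocol fcp -ostype linux -initiator 10:00:00:90:fa:12:34:56
NCluster::> lun portset create -vserver SVM1 -portset ps_host1 -protocol fcp
NCluster::> lun portset add -vserver SVM1 -portset ps_host1 -port-name lif1_a,lif2_a
NCluster::> lun igroup bind -vserver SVM1 -igroup host1 -portset ps_host1
NCluster::> lun map -vserver SVM1 -path /vol/vol1/lun1 -igroup host1

Without a portset, the LUN is advertised on all SAN LIFs of the SVM; with the portset bound to the igroup, only the listed LIFs present the LUN to that host.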

Since a LUN belongs to a single storage node and ONTAP provides online migration capabilities between nodes, your network configuration must provide access to the LUN from all the nodes, just in case. Read more in the series of articles How ONTAP cluster works.

According to NetApp best practices, zoning is quite simple:

  • Create one zone for each initiator (host) port on each fabric
  • Each zone must have one initiator port and all the target (storage node) ports.

Keeping one initiator per zone eliminates “cross talk” between initiators.

Example for Fabric A, Zone for “Host_1-A-Port_A”:

Node          Port     WWPN
Host 1        Port A   Port A
ONTAP Node1   Port A   LIF1-A (NAA-2)
ONTAP Node2   Port A   LIF2-A (NAA-2)
ONTAP Node3   Port A   LIF3-A (NAA-2)
ONTAP Node4   Port A   LIF4-A (NAA-2)

Example for Fabric B, Zone for “Host_1-B-Port_B”:

Node          Port     WWPN
Host 1        Port B   Port B
ONTAP Node1   Port B   LIF1-B (NAA-2)
ONTAP Node2   Port B   LIF2-B (NAA-2)
ONTAP Node3   Port B   LIF3-B (NAA-2)
ONTAP Node4   Port B   LIF4-B (NAA-2)

Here is how the zoning from the tables above looks:
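
As a rough sketch, on a Brocade fabric the Fabric A zone from the table above might be created like this (Cisco MDS syntax differs); the aliases, zone and config names, and all WWPNs here are hypothetical:

FabricA:admin> alicreate "Host_1_Port_A", "10:00:00:90:fa:12:34:56"
FabricA:admin> alicreate "SVM1_LIF1_A", "20:01:00:a0:98:03:a4:6e"
FabricA:admin> alicreate "SVM1_LIF2_A", "20:02:00:a0:98:03:a4:6e"
FabricA:admin> alicreate "SVM1_LIF3_A", "20:03:00:a0:98:03:a4:6e"
FabricA:admin> alicreate "SVM1_LIF4_A", "20:04:00:a0:98:03:a4:6e"
FabricA:admin> zonecreate "Host_1-A-Port_A", "Host_1_Port_A; SVM1_LIF1_A; SVM1_LIF2_A; SVM1_LIF3_A; SVM1_LIF4_A"
FabricA:admin> cfgcreate "Fabric_A_cfg", "Host_1-A-Port_A"
FabricA:admin> cfgenable "Fabric_A_cfg"

The same is repeated on Fabric B with the host's B-side port and the LIFx-B WWPNs.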

Vserver or SVM

An SVM in an ONTAP cluster lives on all the nodes in the cluster. SVMs are separated from one another and are used to create a multi-tenant environment. Each SVM can be managed by a separate group of people or a separate company, and one will not interfere with another; in fact, they will not know about each other's existence at all, because each SVM behaves like a separate physical storage system. Read more about SVMs, Multi-Tenancy and Non-Disruptive Operations here.

Logical Interface (LIF)

Each SVM has its own WWNN in the case of FCP, its own IQN in the case of iSCSI, and its own identifiers in the case of NVMe-oF. SVMs can share a physical storage node port: each SVM assigns its own network addresses (WWPN, IP, or NVMe identifiers) to a physical port, and normally each SVM assigns one network address per physical port. Therefore a single physical storage node port might carry several WWPN network addresses, each assigned to a different SVM, if several SVMs exist. NPIV is a crucial feature which must be enabled on the FC switch for an ONTAP cluster to serve the FC protocol properly.

Unlike ordinary virtual machines (e.g., on ESXi or KVM), each SVM “exists” on all the nodes in the cluster, not just on a single node; the picture below shows two SVMs on a single node just for simplicity.

Make sure that each node has at least one LIF; in this case host multipathing will be able to find an optimal path and always access a LUN through the optimal route, even if the LUN migrates to another node. Each port has its own assigned “physical address”, which you cannot change, as well as network addresses. Here is an example of what network & physical addresses look like in the case of the iSCSI protocol. Read more about SAN LIFs here and about SAN protocols like FC, iSCSI, NVMe-oF here.
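
A quick way to see the network addresses of an SVM's iSCSI LIFs and the physical ports they currently sit on is something like this (a sketch; the SVM and cluster names are hypothetical):

NCluster::> vserver iscsi show -vserver SVM1
NCluster::> network interface show -vserver SVM1 -data-protocol iscsi -fields address,curr-node,curr-port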

Zoning recommendations

For ONTAP 9, 8 & 7, NetApp recommends having a single initiator and multiple targets in each zone.

For example, in the case of FCP, each physical port has its own physical WWPN (WWPN 3 in the image above) which should not be used at all; instead, the WWPN addresses assigned to the LIFs (WWPN 1 & 2 in the image above) must be used for zoning and host connections. Physical addresses look like 50:0A:09:8X:XX:XX:XX:XX; this type of address is numbered according to NAA-3 (IEEE Network Address Authority 3), is assigned to a physical port, and should not be used at all. Example: 50:0A:09:82:86:57:D5:58. You can see NAA-3 addresses listed on network switches, but they should not be used.

When you create zones on a fabric, you should use addresses like 2X:XX:00:A0:98:XX:XX:XX; this type of address is numbered according to NAA-2 (IEEE Network Address Authority 2) and assigned to your LIFs. Thanks to NPIV technology, the physical N_Port can register additional WWPNs, which means NPIV must be enabled on your switch for ONTAP to serve data over the FCP protocol to your servers. Example: 20:00:00:A0:98:03:A4:6E

  • Block 00:A0:98 is the original OUI block for ONTAP
  • Block D0:39:EA is the newly added OUI block for ONTAP
  • Block 00:A0:B8 is used on NetApp E-Series hardware
  • Block 00:80:E5 is reserved for future use.
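
To double-check which WWPNs belong to LIFs (NAA-2, used for zoning) and which belong to the physical adapters (NAA-3, not used for zoning), something like the following should help (a sketch; the node name is hypothetical):

NCluster::> vserver fcp portname show
NCluster::> network fcp adapter show -node NCluster-01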

Disclaimer

Please note that in this article I described my own understanding of the internal organization of ONTAP systems. Therefore, this information might be outdated, or I simply might be wrong in some aspects and details. I will greatly appreciate any contribution to make this article better, so please leave your ideas and suggestions on this topic in the comments below.

All product names, logos, and brands are property of their respective owners. All company, product and service names used in this website are for identification purposes only.

ONTAP & Antivirus NAS protection

NetApp ONTAP supports antivirus integration known as off-box antivirus scanning, or VSCAN. With VSCAN, the storage system scans each new file with an antivirus system, which increases corporate data security.

ONTAP supports the following antivirus software:

  • Symantec
  • Trend Micro
  • Computer Associates
  • Kaspersky
  • McAfee
  • Sophos

Also, ONTAP supports FPolicy technology, which can prevent a file from being written or read based on its extension or content header.

This time I’d like to discuss an example of CIFS (SMB) integration with the McAfee antivirus system.

AV-1

In this example I’m going to show how to set up the integration with McAfee. Here are the minimum requirements for McAfee; they are approximately the same for other AVs:

  • MS Windows Server 2008 or higher
  • NetApp storage with ONTAP 8 or higher
  • SMB v2 or higher (CIFS v1.0 not supported)
  • NetApp ONTAP AV Connector (Download page)
  • McAfee VirusScan Enterprise for Storage (VSEfS)
  • For more details see NetApp Support Matrix Tool.
AV-2

Diagram of antivirus integration with an ONTAP system.

Preparation

To set up such an integration, we will need to configure the following software components:

AV-3

VSEfS

We need to set up McAfee VSEfS, which can work in two modes: as an independent product or as managed by McAfee ePolicy Orchestrator (McAfee ePO). In this article, I will discuss how to configure it as an independent product. To set up & configure VSEfS, we will need the following already installed and configured:

  • McAfee VirusScan Enterprise (VSE). Download VSE
  • McAfee ePolicy Orchestrator (ePO), not needed if VirusScan is used as an independent product.

SCAN Server

First, we need to configure a few SCAN servers to balance the workload between them. I will install each SCAN server on a separate Windows Server with McAfee VSE, McAfee VSEfS, and the ONTAP AV Connector. In this article, we will create three SCAN servers: SCAN1, SCAN2, SCAN3.

Active Directory

Next, we need to create the user scanuser in our domain; in this example the domain is NetApp.

ONTAP

After ONTAP has started, we need to create a cluster management LIF and an SVM management LIF, set up AD integration, and configure file shares and data LIFs for the SMB protocol. Here, we will have the NCluster-mgmt LIF for cluster management and SVM01-mgmt for SVM management.

NCluster::> network interface create -vserver NCluster -home-node NCluster-01 -home-port e0M -role data -protocols none -lif NCluster-mgmt -address 10.0.0.100 -netmask 255.0.0.0
NCluster::> network interface create -vserver SVM01 -home-node NCluster-01 -home-port e0M -role data -protocols none -lif SVM01-mgmt -address 10.0.0.105 -netmask 255.0.0.0
NCluster::> security login domain-tunnel create -vserver SVM01
NCluster::> security login create -username NetApp\scanuser -application ontapi -authmethod domain -role readonly -vserver NCluster
NCluster::> security login create -username NetApp\scanuser -application ontapi -authmethod domain -role readonly -vserver SVM01

ONTAP AV Connector

On each SCAN server, we will install the ONTAP AV Connector. At the end of the installation, I will add the AD login & password for the user scanuser.

AV-4

Then configure the management LIFs:

Start → All Programs → NetApp → ONTAP AV Connector → Configure ONTAP Management LIFs

In the “Management LIF” field we will add the DNS name or IP address of NCluster-mgmt or SVM01-mgmt. In the Account field, we will fill in NetApp\scanuser. Then press “Test,” and once the test has finished, press “Update” or “Save.”

AV-5

McAfee Network Appliance Filer AV Scanner Administrator Account

Assuming you have already installed McAfee on the three SCAN servers: on each SCAN server, log in as an administrator, open the VirusScan Console from the Windows taskbar, then open Network Appliance Filer AV Scanner and choose the tab called “Network Appliance Filers.” In the field “This Server is processing scan request for these filers”, press the “Add” button, put “127.0.0.1” into the address field, and then add the scanuser credentials.

AV-6

Returning to the ONTAP console

Now we configure off-box scanning, enable it, and create and apply scan policies. SCAN1, SCAN2, and SCAN3 are the Windows servers with McAfee VSE, VSEfS, and the ONTAP AV Connector installed.
First, we create a pool of AV servers:

NCluster::> vserver vscan scanner-pool create -vserver SVM01 -scanner-pool POOL1 -servers SCAN1,SCAN2,SCAN3 -privileged-users NetApp\scanuser 
NCluster::> vserver vscan scanner-pool show
                                 Scanner Pool  Privileged       Scanner
Vserver    Pool     Owner        Servers       Users            Policy
---------- -------- ------------ ------------- ---------------- -------
SVM01      POOL1    vserver      SCAN1, SCAN2, NetApp\scanuser  idle
                                 SCAN3

NCluster::> vserver vscan scanner-pool show -instance
                             Vserver: SVM01
                        Scanner Pool: POOL1
                      Applied Policy: idle
                      Current Status: off
           Scanner Pool Config Owner: vserver
List of IPs of Allowed Vscan Servers: SCAN1, SCAN2, SCAN3
            List of Privileged Users: NetApp\scanuser

Second, we apply a scanner policy:

NCluster::> vserver vscan scanner-pool apply-policy -vserver SVM01 -scanner-pool POOL1 -scanner-policy primary
NCluster::> vserver vscan enable -vserver SVM01
NCluster::> vserver vscan connection-status show
                              Connected    Connected
Vserver    Node               Server-Count Servers
---------- ------------------ ------------ ------------------------
SVM01      NClusterN1         3            SCAN1, SCAN2, SCAN3

NCluster::> vserver vscan on-access-policy show
           Policy    Policy              Paths    File-Ext Policy
Vserver    Name      Owner    Protocol   Excluded Excluded Status
---------- --------- -------- ---------- -------- -------- ------
NCluster   default_  cluster  CIFS       -        -        off
           CIFS
SVM01      default_  cluster  CIFS       -        -        on
           CIFS
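
If the on-access policy for a data SVM still shows as off, it can be enabled roughly like this (a sketch; the default_CIFS policy name is taken from the output above):

NCluster::> vserver vscan on-access-policy enable -vserver SVM01 -policy-name default_CIFS
NCluster::> vserver vscan on-access-policy show -vserver SVM01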

Licensing

There is no additional licensing needed on the ONTAP side to enable and use FPolicy & off-box antivirus scanning; this is basic functionality available in any ONTAP system. However, you might need to license additional functionality on the antivirus side, so please check with your antivirus vendor.

Summary

Here are some advantages of integrating the storage system with your corporate AV: NAS integration with antivirus allows you to have one antivirus system on your desktops and another for your NAS shares. There is no need to do NAS scanning on workstations and waste their limited resources. All NAS data is protected, and there is no way for a user with advanced privileges to connect to the file share without antivirus protection and put unscanned files there.

NetApp in the containerization era

This is not really a technical article as I usually write, but rather a short list of topics for one direction NetApp has been developing a lot recently: containers. Containerization is getting more and more popular nowadays, and I noticed NetApp heavily invests in it, so I identified four main directions in that field. Let’s name a few NetApp products using containerization technology:

  1. E-Series with built-in containers
  2. Containerization of existing NetApp software:
    • ActiveIQ PAS
  3. Trident plugin for ONTAP, SF & E-Series (NAS & SAN):
    • NetApp Trident plug-in for Jenkins
    • CI & HCI:
      • ONTAP AI: NVIDIA, AFF systems, Docker containers & Trident with NFS
      • FlexPod DataCenter & FlexPod SF with Docker EE
      • FlexPod DataCenter with IBM Cloud Private (ICP) with Kubernetes orchestration
      • NetApp HCI with RedHat OpenShift Container Platform
      • NetApp integration with Robin Systems Container Orchestration
    • Oracle, PostgreSQL & MongoDB in containers with Trident
    • Integration with Moby Project
    • Integration with Mesosphere Project
  4. Cloud-native services & Software:
    • NetApp Kubernetes Services (NKS)
      • NKS in public clouds: AWS, Azure, GCP
      • NKS on NetApp HCI platform
    • SaaS Backup for Service Providers

Is it a full list of NetApp’s efforts towards containerization?

I bet it is far from complete. Post your thoughts and links to documents, news, how-tos, architectures, and best practice guides in the comments below to expand this list if I missed something!

How memory works in ONTAP: Write Allocation, Tetris, Summary (Part 3)

Write Allocation

NetApp has changed this part of the WAFL architecture a lot since its first release. Increased demand for performance, multi-core CPUs, and new types of media were the forces driving continued improvement of Write Allocation.

Each thread can be processed separately depending on the data type, the write/read pattern, the data block size, the media where it will be written, and other characteristics. Based on that, WAFL can decide how and where data will be placed and what type of media it should use.

For example, for flash media it is important to write data taking into account the smallest block size the media can operate with, to ensure cells do not wear out and are evenly utilized. Another example is when data is separated from metadata and stored on different tiers of storage. In reality, Write Allocation and the way ONTAP optimizes data in this module is a very, very big, separate topic of its own.

Affinities

The increase in the number of CPU cores forced NetApp to develop new algorithms for process parallelization over time. Volume affinities (and other affinities) are algorithms which allow multiple processes to execute in parallel so that they do not block each other, though sometimes it is necessary to execute a serial process which blocks others. For example, when two hosts work on the same storage with two different volumes, they can write & modify data in parallel. If those two hosts start to write to a single volume, ONTAP becomes a broker, serializes access to that volume, and gives access to one host at a time; only when it is done does it give access to the other. Write operations to a volume are always executed on a single core; otherwise, because each core can be loaded unequally, you could end up messing with your data. Each of these affinities decreased the locking of other processes and increased parallelization as well as multi-core CPU utilization. Volume affinities are called “Waffinity” because each volume is a separate WAFL file system, so the words WAFL and affinity were combined.

  • Classical Waffinity (2006)
  • Hierarchical Waffinity (2011)
  • Hybrid Waffinity (2016)

FlexGroup

If, for instance, ONTAP needs to work on an aggregate level, the whole bunch of volumes living on that aggregate will stop getting write operations for some time, because aggregate operations lock volume operations; this is just one example of the locking mechanisms NetApp has been improving throughout the ONTAP lifespan. What would be the most natural solution in such a case? To have multiple volumes, multiple aggregates, and even multiple nodes; but then, instead of a single bucket of space, you have multiple volumes spread across multiple nodes & aggregates. That’s where FlexGroup comes into the picture: it joins all the (constituent) volumes into a single logical space visible to clients as a single volume or file share. Before FlexGroup, ONTAP was very well optimized for workloads with random & small blocks and even sequential reads; now, thanks to FlexGroup, ONTAP is also optimized for sequential operations and mainly benefits from sequential writes.
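
A FlexGroup is created like an ordinary volume, just spread over a list of aggregates; a rough sketch (the SVM, volume and aggregate names are hypothetical):

NCluster::> volume create -vserver SVM1 -volume fg1 -aggr-list aggr1_node1,aggr1_node2 -aggr-list-multiplier 4 -size 100TB -junction-path /fg1
NCluster::> volume show -vserver SVM1 -volume fg1*

The second command lists the FlexGroup together with its constituent member volumes, which ONTAP distributes across the listed aggregates and therefore across nodes.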

flexgroups

RAID

From the WAFL module, data is delivered to the RAID module, which processes it and writes it to the disks in single transactions (known as stripes), including parity data written to the parity drives.

Because data is written to the disks in full stripes, there is no need to read back existing data and parity to recalculate parity for the parity drives: the system has already prepared everything in the RAID module for us. That is the reason why, in practice, parity disk drives are always less utilized than data drives, unlike what happens with traditional RAID-6 and RAID-4. This also allows the system to avoid re-writes of data, placing new data in new empty space and simply moving metadata links to the new place. It means the system does not have to read data into its memory and recalculate a new parity for a stripe after a single block change, and therefore system memory is used more rationally. More about RAID-DP in TR-3298.

Tetris and IO redirection

Tetris is a WAFL mechanism for write & read optimization. Tetris collects data for each CP and compiles it into chains, so blocks from a single host are assembled into bigger chunks (this process is also known as IO redirection). On the other hand, this simple logic also enables predictive read operations, because there is practically no difference between reading, for example, 5 KB or 8 KB of data, or 13 KB versus 16 KB. A predictive read is a form of read caching. So, when the time comes to decide what data should be read in advance, the answer comes naturally: the data most likely to be read right after the first part is the data that was written together with that first part. The decision about what extra data to read has, in a sense, already been made at write time.

Read Cache

The MBUF is used for both read and write caching, and all reads, without exception, are inevitably copied to the cache. From the same cache, hosts can read just-written data. When the CPU cannot find data for a host in the system cache, it looks for it in another cache, if available, and only then on disks. Once data is found on disks, the system places it into the MBUF. If that piece of data is not accessed for a while, the CPU can de-stage it to a second memory tier like Flash Pool or Flash Cache.

It is important to note that the system evicts unaccessed data from the cache very granularly, in 4 KB blocks. Another important thing is that the cache in ONTAP systems is deduplication-aware: if a block already exists in the MBUF or in the second-tier cache, it will not be copied there again.

Why is NVRAM/NVMEM size not always important?

In NetApp FAS/AFF and other ONTAP-based systems, NVRAM/NVMEM is used exclusively for NVLOGs, not as a write buffer, so it does not have to be as big as the system memory used for write caching in other systems.

ONTAP NVRAM vs competitors

Battery and boot flash drive

As I mentioned before, hardware systems like FAS & AFF have a battery installed in each controller to power the NVRAM/NVMEM and maintain data in memory for up to 72 hours. After a controller loses power, data from memory is destaged to the boot flash drive. Once ONTAP boots, the data is restored back.

Flash media and WAFL

As I mentioned in previous articles in the series, ONTAP always writes to new space, for a number of architectural reasons. It turns out flash media needs precisely that. Though some people predicted the death of WAFL because of flash media, WAFL works on that media quite well: the “always write to new space” technique not only optimizes garbage collection and wears flash memory cells out evenly, but also shows quite competitive performance.

Summary

System memory is architecturally built not just to optimize performance but also to offer high data protection, availability, and write optimization. Rich ONTAP functionality, the unique way NVRAM/NVMEM is used, and a rich software integration ecosystem qualitatively differentiate NetApp storage systems from others.

Continue to read

NVRAM/NVMEM, MBUF & CP (Part 1)
NVLOG & Write-Through (Part 2)

How ONTAP cluster works?

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I describe my own understanding of the internal organization of system memory in ONTAP systems. Therefore, this information might be outdated, or I simply might be wrong in some aspects and details. I will greatly appreciate any contribution to make this article better, so please leave your ideas and suggestions on this topic in the comments below.

How memory works in ONTAP: NVLOG and Write-Through (Part 2)

NVRAM/NVMEM & Write-Through

It is important to note that NVRAM/NVMEM technology is widely used in many storage systems, but NetApp ONTAP systems use it for NVLOGs (hardware journaling), while others use it as a block device for write cache (at the disk driver level, or as a disk cache), and that simple fact makes a real difference in storage architectures. The NVLOG-based architecture allows an ONTAP system not to switch into Write-Through mode when one controller in an HA pair is dead.

That simple statement is tough to explain in simple words. Let me try. Write-Through is a storage system mode which does not use the write buffer: all writes go straight to the disks, which means the write buffer is disabled, and that is a bad idea for many reasons. All the optimizations, all the magic and all the intelligent processing happens to your data in the write buffers, so disabling write buffers is always a bad idea. For example, consider some problems you might experience with storage in Write-Through mode if you are using HDDs. HDD drives are significantly slower, and memory is always far faster than drives, so in the write buffer you can optimize random operations, glue them together in memory and later destage them sequentially to your HDD drives as one big chunk of data in one operation, which is much easier for HDD drives to process. The memory cache is basically used to “trick” your hosts and send them an acknowledgment before the data is actually placed on disks, and in this way improve performance. In the case of flash media, you can optimize how data is written so as not to wear out your memory cells. Memory is also very often used as the place to prepare checksums for RAID (or other types of data protection). So, the bottom line: Write-Through is terrible for your storage system performance, and all storage vendors try to avoid that scenario in their systems. When might a storage system architecture need to switch to Write-Through? When you are uncertain that your write cache will protect you. The simplest example is when the battery for your write cache is dead.

Let’s examine another, more complex scenario: what if you have an HA pair and the battery on only one controller dies? Well, since all storage systems from all A-brands do HA, your writes should be protected. What happens if your HA pair loses one controller and the surviving one still has a working battery? Many of you might think, according to the logic described above, that your storage system will not switch to Write-Through, right? Well, the answer to that question is “it depends.” In the ONTAP world, NVLOGs are used only for data protection purposes in a dedicated NVRAM/NVMEM device, and data is always kept there exactly as it was placed, in an unchanged state, with no architectural ability to modify it; the only thing which is architecturally allowed is to write data to an empty half and then, when needed, clear all the NVLOGs from the first one. So in this architecture there is no need to switch your ONTAP system to Write-Through even when only a single controller is working. Other architectures, even though they also use NVRAM/NVMEM, keep all the data in one place.

Both ONTAP and other storage vendors' systems use memory for data optimization; in other words, they change data from its original state. And changing your data is a big threat to data consistency once you have only one controller surviving, even if the battery on that controller functions properly. That is why all the other storage systems have to switch to Write-Through: there is no way to guarantee your data will not be corrupted while it is in the middle of data optimization, especially after your data has been (half-)processed by the RAID algorithm and the surviving node has an unexpected reboot. Therefore, all other platforms and systems, all the NetApp AFF/FAS competitors I know of, will switch to Write-Through mode once there is only one node left. There are obviously some tricks: some vendors allow you to disable Write-Through once you get into such a situation, but of course it is not recommended; they just give you the ability to make a bad choice on your own, and it will lead to data corruption on the entire storage once the surviving node unexpectedly reboots. Another example is HPE 3PAR systems, where in a 4-node configuration, if you lose only one controller, your system will continue to function normally, but once you lose 50% of your nodes, it again switches to Write-Through; the same happens if you have a 2-node configuration.

Thanks to the fact that ONTAP stores data in NVLOGs as it was, in unchanged form, it is possible to roll back an earlier unfinished transaction of data that has already been half-processed by RAID, restore the data back to the MBUF from the NVLOGs and finish that transaction. Each transaction that writes new data from the MBUF to disk drives is executed as part of a system snapshot called a CP. Each transaction can be easily rolled back: after the single surviving controller boots, it restores data from the NVLOGs to the MBUF, processes it with RAID in memory again, rolls back the last unfinished CP and writes the data to disks, which allows ONTAP systems to always be consistent (from the storage system perspective) and never switch to Write-Through mode.

Continue to read

How ONTAP cluster works?

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I describe my own understanding of the internal organization of system memory in ONTAP systems. Therefore, this information might be outdated, or I simply might be wrong in some aspects and details. I will greatly appreciate any contribution to make this article better, so please leave your ideas and suggestions on this topic in the comments below.

How memory works in ONTAP: NVRAM/NVMEM, MBUF and CP (Part 1)

In this article, I’d like to describe how NVRAM, cache, and system memory work in ONTAP systems.

System Memory

System memory for any AFF/FAS system (and other OEM systems running ONTAP) consists of two types of memory: ordinary (volatile) memory and non-volatile memory (that’s why we use the “NV” prefix) connected to a battery on the controller. Some systems use NVRAM and others use NVMEM; both are used for the same purpose, the only difference being that NVMEM is installed on the memory bus, while NVRAM is a small board with memory installed on it, connected to the PCIe bus. Both NVRAM and NVMEM are used by ONTAP to store NVLOGs. Ordinary memory is used for system needs and mostly as the Memory Buffer (MBUF), in other words the cache.

In a normally functioning storage system, data de-staging happens from the MBUF, not from NVRAM/NVMEM. However, the trigger for data de-staging might be the fact that NVRAM/NVMEM is half-full, among others. See the CP triggers section below.

NVRAM/NVMEM & NVLOGs

Data from hosts is always placed into the MBUF first. Then, from the MBUF, a Direct Memory Access (DMA) request copies the data to NVMEM/NVRAM exactly as the host sent it, in the unchanged form of logs, just like a DB log or a journaling file system. DMA does not consume CPU cycles; that’s why NetApp says this system has a hardware journaling file system. As soon as the data is acknowledged to be in NVRAM/NVMEM, the system sends a data-received acknowledgment to the host. After a Consistency Point (CP) occurs, data from the MBUF is de-staged to the disks, and the system clears NVRAM/NVMEM without using the NVLOGs at all in a normally functioning system. NVLOGs are used only in special events or configurations. So, ONTAP does not use NVLOGs in a normally functioning High Availability (HA) system; it erases the logs each time a CP occurs and then writes new NVLOGs. For instance, ONTAP will use NVLOGs to restore data “as it was” back to the MBUF in case of an unexpected reboot.

Memory Buffer: Write operations

The first place where all writes land is always the MBUF. Then data is copied from the MBUF to NVRAM/NVMEM with a DMA call, after which the WAFL module allocates a range of blocks where the data from the MBUF will be written; that is simply called Write Allocation. It might sound simple, but it is kind of a big deal and is constantly being optimized by NetApp. However, just before it allocates space for your data, the system compiles Tetris! Yes, I’m talking about the same kind of Tetris puzzle-matching game you might have played in your childhood. So, one of Write Allocation’s jobs is to make sure to write all the Tetris data in one unbroken sequence to the disks whenever possible.

WAFL also performs some additional data optimizations depending on the type of the data, where it goes, what type of media it is going to be written to, etc. After the WAFL module gets an acknowledgment from NVRAM/NVMEM that the data is secured, the RAID module processes the data from the MBUF, adds checksums to each block (known as Block/Zone checksums) and calculates and writes parity data to the parity disks. It is important to note that some data in the MBUF contains commands which can be “extracted” before they are delivered to the RAID module; for example, some commands ask the storage system to format some space with a preassigned repeating pattern, and others ask the storage to move chunks of data. Those commands might consume a small amount of space in NVRAM/NVMEM but might generate a large amount of data when executed.

NVMEM/NVRAM in HA

Each HA pair with ONTAP consists of two nodes (controllers), and each node has a copy of the data from its neighboring HA partner. This architecture allows hosts to be switched to the surviving node and continue to be served without noticeable disruption.

To be more precise, each NVRAM/NVMEM is divided into two pieces: one stores NVLOGs from the local controller and the other stores the copy of NVLOGs from the HA partner controller. Each piece is in turn divided into halves. So, each time the system fills the first local half of NVRAM/NVMEM, a CP event is generated; while it is processed, the local controller uses the second local half for new operations, and after the second half is filled with logs, the system switches back to the first, already emptied, half and repeats the cycle.

Consistency Points

Like many modern file systems, WAFL is a journaling FS, and the journal is used to keep consistency and protect data. However, unlike general-purpose journaling file systems, WAFL does not need time or special FS checks to roll back & make sure the FS is consistent. Once a controller unexpectedly reboots, the last unfinished CP transaction is not confirmed; similarly to a snapshot, that last unfinished state is simply discarded, and the data from the NVLOGs is used to create a new consistency point once the controller boots after the unexpected reboot. A CP transaction is confirmed only once the whole transaction has been entirely written to the disks and the root inode has been changed with new pointers to the newly added data blocks.

It turns out that NetApp snapshot technology was so successful that it is used literally almost everywhere in ONTAP. Let me remind you that each CP contains data already processed by the WAFL and then by the RAID module. A CP is itself a snapshot: before data that has already been processed by WAFL & RAID is destaged from the MBUF to disks, ONTAP creates a system snapshot of the aggregate where it is going to write the data. Then ONTAP writes the new data to the aggregate. Once the data has been successfully written as part of the CP transaction, ONTAP changes the root inode pointers and clears the NVLOGs. To be more precise, the snapshot created before the CP data is written simply copies the root inode pointers and represents the last active file system state. In case of a failure, even if both controllers reboot simultaneously, the last system snapshot will be rolled back, the data will be restored from the NVLOGs, processed again by the WAFL and RAID modules and destaged back to disks in the next CP as soon as the controllers get back online.

In case only one controller suddenly switches off or reboots, the second, surviving controller will restore the data from its own copy of the NVRAM/NVLOGs and finish the earlier unsuccessful CP; applications will be transparently switched to the surviving controller and will continue to run after a small pause as if there had been no disruption at all. Once a CP is successful, as part of the CP transaction ONTAP changes the root inode with pointers to the new data and creates a new system snapshot which captures the newly added data and pointers to the old data. In this way, ONTAP always maintains data consistency on the storage system without ever needing to switch to Write-Through mode.

ONTAP 9 performs CPs separately for each aggregate, while previously they were controller-wide. With CP separation per aggregate, slow aggregates no longer influence other aggregates in the system.

iNodes

An inode contains information about a file, known as metadata, but an inode can store a small amount of data too. Inodes have a hierarchical structure. Each inode can store up to 4 KB of information. If a file is small enough to fit into an inode together with its metadata, then only that 4 KB block is used for the inode; a directory is actually also a file on the WAFL file system, so real-world examples of an inode storing both metadata and the data itself are an empty directory or an empty file. However, what if the file does not fit into an inode? Then the inode stores pointers to other inodes, and those inodes can store pointers to further inodes or addresses of data blocks. Currently, WAFL has a 5-level hierarchy limit. Sometimes inodes and data blocks are referred to as files in deep-dive technical documentation about WAFL. As a result, each file on a FlexVol file system can store no more than 16 TiB. Each FlexVol volume is a separate WAFL file system and has its own volume root inode.

The reason why Write Anywhere File Layout got the word Anywhere in its name is that metadata can be anywhere in the FS

An interesting nuance: the reason why Write Anywhere File Layout probably got the word Anywhere in its name is that metadata can be anywhere in the FS, mixed in with data blocks, while other, traditional file systems usually store their metadata in a dedicated area of the disk which usually has a fixed size. Here is the list of metadata information which can be stored alongside the data:

  • Volume where the inode resides
  • Locking information
  • Mode and type of file
  • Number of links to the file
  • Owner’s user and group ids
  • Number of bytes in the file
  • Access and modification times
  • Time the inode itself was last modified
  • Addresses of the file’s blocks on disk
  • Permission: UNIX bits or Windows Access Control List (ACLs)
  • Qtree ID.
Write_Anywhere_File_Layout_iNODEs

Events generating CP

A CP is an event generated by one of the following conditions:

  • 10 seconds have passed since the last CP
  • The system filled one half of its piece of NVRAM
  • The local MBUF is filled (known as the High Watermark). This rarely happens, because the MBUF is usually far bigger than NVMEM/NVRAM; it can occur when commands in the MBUF generate a lot of new data when executed in the WAFL/RAID modules
  • When you execute the halt command on the controller to stop it
  • Others

The CP type might indirectly point to some system problems; for example, when there are not enough HDD drives to maintain performance, you will see back-to-back (“B” or “b”) CPs. See also the Knowledge Base FAQ: Consistency Point.
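
One hedged way to watch CPs and their trigger types live is the nodeshell sysstat tool, for example (the node name is hypothetical); the “CP ty” column shows the trigger, e.g., T for timer, F for NV-log full, B/b for back-to-back, H for high watermark:

NCluster::> system node run -node NCluster-01 -command "sysstat -x 1"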

NVRAM/NVMEM and MetroCluster

To protect data from a split-brain scenario in MetroCluster (MCC), a host which writes data to the system gets an acknowledgment only after the data has been acknowledged by the local HA partner and by the remote MCC partner (in the case of a 4-node MCC).

MetroCluster_local_and_DR_pare_memory_replication

HA interconnect

Data synchronization between local HA partners happens over the HA interconnect. If the two controllers in an HA pair are located in two separate chassis, the HA interconnect is an external connection (in some models it can be over InfiniBand or Ethernet, and the ports are usually named cNx, i.e., c0a, c0b, etc., for example on FAS3240 systems). If the two controllers in an HA pair are placed in a single HA chassis, the HA interconnect is internal and there are no visible connections. Some controllers can be used in both configurations: HA in a single chassis, or HA with each controller in its own chassis. Such controllers have dedicated HA interconnect ports, often named cNx (i.e., c0a, c0b, etc., for example on FAS3240 systems), but when such a controller is used in a single-chassis configuration, those ports are not used (and cannot be used for other purposes) and the communication is established internally through the chassis backplane.

Controllers vs Nodes

A storage system is formed out of one or a few HA pairs. Each HA pair consists of two controllers, sometimes called nodes. Controller and node are very similar and often interchangeable terms; the difference is that controllers are physical devices, while nodes are the OS instances running on the controllers. Controllers in an HA pair are connected with the HA interconnect. In hardware appliances like AFF and FAS systems, each hardware disk is connected simultaneously to both controllers in an HA pair. Controllers are often referred to in tech documents as “controller A” and “controller B.” Even though hard drives in AFF & FAS systems physically have a single connector, that connector carries two ports, and each port of each drive is connected to one of the two controllers. So if you ever dig deep into the node shell console and enter the disk show command, you’ll see disks named like 0c.00.XX, where 0c is the current port through which that disk is connected to the controller which “owns” it, XX is the position of the drive in the disk shelf, and 00 is the ID of the disk shelf.

At any given time only one controller owns a disk or a partition on a disk. When a controller owns a disk or a partition, it means that this controller serves data to hosts from that disk or partition; the HA partner is used only if the owner of the disk or partition dies. Therefore each controller in ONTAP has its own drives/partitions and serves only its own drives/partitions; this architecture is known as “share nothing.” There are two types of HA policies: SFO (storage failover) and CFO (controller failover). CFO is used for root aggregates and SFO for data aggregates. CFO does not change disk ownership in the aggregate, while SFO does change disk ownership in the aggregates.

ToasterA*> disk show -v
  DISK       OWNER                  POOL   SERIAL NUMBER
------------ -------------          -----  ----------------
0c.00.1      unowned                  -     WD-WCAW30485556
0c.00.2      ToasterA  (142222203)  Pool0   WD-WCAW30535000
0c.00.11     ToasterB  (142222204)  Pool0   WD-WCAW30485887
0c.00.6      ToasterB  (142222204)  Pool0   WD-WCAW30481983

But since each drive in a hardware appliance like a FAS & AFF system is connected to both controllers, each controller can address each disk. And if you manually change 0c in this example to 0d, the port through which the drive is visible to the other controller, that controller will be able to address the drive:

ToasterB*> disk assign 0d.00.1 -s 142222203
Thu Mar 1 09:18:09 EST [ToasterB:diskown.changingOwner:info]: 
changing ownership for disk 0d.00.1 (S/N WD-WCAW30485556) from (ID 1234) to ToasterA (ID 142222203)
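
The example above uses the nodeshell; a rough clustershell equivalent for checking and changing disk ownership might look like this (a sketch; the disk and node names are hypothetical):

NCluster::> storage disk show -fields owner
NCluster::> storage disk assign -disk 1.0.1 -owner NCluster-01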

Software-defined ONTAP storage (ONTAP Select & Cloud Volumes ONTAP) works much like MetroCluster in this respect, because by definition it has no “special” equipment; in particular, it doesn’t have special 2-port drives connected to both servers (nodes). So instead of connecting both nodes in an HA pair to a single drive, ONTAP Select (and Cloud Volumes ONTAP) copies data from one controller to the second one and keeps two copies of the data, one on each node. That is the price, the other side, of commodity equipment.

Technically, it would be possible to connect a single external storage device, for instance over iSCSI, to each storage node and avoid the unnecessary data duplication, but that option is not available in SDS ONTAP at the moment.

Mailbox “disk”

While it sounds like a disk, it is really not a disk but rather a tiny special area on a disk which consumes a few KB. That mailbox area is used to send messages from one HA partner to another. The mailbox disk is a mechanism which gives ONTAP an additional level of protection for its HA capabilities. Mailbox disks are used to determine the state of the HA partner, in a way similar to email: from time to time, each controller posts a message to its own (local) mailbox disks saying that it is alive, healthy and well, while it reads from the partner’s mailbox. If the timestamp of the last message from the partner is too old, the surviving node will take over. In this way, if the HA interconnect is unavailable for some reason, or a controller freezes, the partner will determine the state of the second controller using the mailbox disks and perform the takeover. If a disk with a mailbox dies, ONTAP chooses a new disk.
By default, mailbox disks reside on two disks: one data and one parity disk for RAID 4, or one parity and one double-parity disk for RAID-DP, and they usually reside in the first aggregate, which is usually the system root aggregate.

Cluster1::*> storage failover mailbox-disk show -node node1
Node    Location  Index Disk Name     Physical Location   Disk UUID
------- --------- ----- ------------- ------------------ -------------------
node1    local    0      1.0.4         local        20000000:8777E9D6:[...]
         local    1      1.0.6         partner      20000000:8777E9DE:[...]
         partner  0      1.0.1         local        20000000:877BA634:[...]
         partner  1      1.0.2         partner      20000000:8777C1F2:[...]

Takeover

When a node in an HA pair, whether software-defined or hardware, dies, the surviving one will “take over” and continue to serve the data of the offline node to the hosts. With a hardware appliance, the surviving node will also change disk ownership from the HA partner to itself.
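
Failover state can be checked, and a takeover and giveback triggered manually, roughly like this (a sketch; the node name is hypothetical, and a manual takeover is normally done only for maintenance):

NCluster::> storage failover show
NCluster::> storage failover takeover -ofnode NCluster-02
NCluster::> storage failover giveback -ofnode NCluster-02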

Active/Active and Active/Passive configurations

In an Active/Active configuration with the ONTAP architecture, each controller has its own drives and serves data to hosts; in this case each controller has at least one data aggregate. In an Active/Passive configuration the passive node does not serve data to hosts and has disk drives only for its root aggregate (for internal system needs). Both Active/Active and Active/Passive configurations need one root aggregate per node for the node to function properly. Aggregates are formed out of one or a few RAID groups, and each RAID group consists of a few disk drives or partitions. All the drives or partitions of an aggregate have to be owned by a single node.
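
Creating a data aggregate for an active node might look roughly like this (a sketch; the aggregate and node names, disk count and RAID type are hypothetical):

NCluster::> storage aggregate create -aggregate aggr1_node1 -node NCluster-01 -diskcount 23 -raidtype raid_dp
NCluster::> storage aggregate show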

Continue to read

How ONTAP cluster works?

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I describe my own understanding of the internal organization of system memory in ONTAP systems. Therefore, this information might be outdated, or I simply might be wrong in some aspects and details. I will greatly appreciate any contribution to make this article better, so please leave your ideas and suggestions on this topic in the comments below.

Which kind of Data Protection is SnapMirror? (Part 2)

I’m facing this question over and over again in different forms. To answer it, we need to understand what kinds of data protection exist. The first part of this article: How to make a Metro-HA from DR (Part 1).

High Availability

This type of data protection tries to do its best to make your data available all the time. If you have an HA service, it will continue to work even if one or even a few components fail, which means your Recovery Point Objective (RPO) is always 0 with HA, and your Recovery Time Objective (RTO) is near 0. Whatever the RTO number is, we assume that our service, and the applications using that service, will survive the failure (maybe with a small pause), continue to function and not return an error to their clients. An essential part of any HA solution is automatic switchover between two or more components, so your applications transparently switch to the surviving elements and continue to interact with them instead of the failed one. With HA, your RTO should be equal to or lower than the timeouts set for your applications (typically up to 180 seconds); HA solutions are built in a way not to hit those application timeouts, so that applications do not return an error to upstream services, just a short pause. Whenever the RPO is not 0, it instantly means that the data protection is not an HA solution. The biggest problem with HA solutions is that they are limited by the distance over which the components can communicate: the bigger the gap between them, the more time they need to keep all your data fully synchronous across all of them and ready to take over from the failed part.

In the context of NetApp FAS/AFF/ONTAP systems, HA can be a local HA pair or a MetroCluster stretched between two sites up to 700 km apart.

Slide1.png
Slide2

Disaster Recovery

The second type of data protection is DR. What is the difference between DR and HA, they are both for data protection, right? By definition, DR is the kind of data protection which starts with the assumption that you are already in a situation where your data is not available and your HA solution has failed for some reason. Why does DR assume your data is not available and that you have a disruption in your infrastructure service? The answer is “by definition.” With DR you might have an RPO of 0 or not, but your RTO is always not 0, which means you will get an error accessing your data; there will be a disruption in your service. DR assumes, by definition, that there is no fully automatic and transparent switchover.

Because HA and DR are both data protection techniques, people often confuse them, mix them up and do not see the difference, or, vice versa, try to contrast them and choose between them. But now, after explaining what they are and how they differ, you might already guess that you cannot replace one with the other: they do not compete but rather complement each other.

In the context of NetApp systems, SnapMirror technology is strongly associated with DR capabilities.

Slide3.png

Backup & Archive data protection

Backup is another type of data protection. Backup is an even lower level of data protection than DR: it allows you to access your data at any time from the backup site for restoration to a production site. An essential property of backup data is that it must not be altered. Therefore, with Backup, we assume data is restored back to the original or to another place, but the backed-up data itself is never modified, which means you do not run DR on your backup data. In the context of NetApp AFF/FAS/ONTAP systems, backup solutions are local Snapshots (of a kind) and SnapVault D2D replication technology. In ONTAP Cluster-Mode (version 8.3 and newer) SnapVault became XDP, just another engine for SnapMirror; with XDP, SnapMirror is capable of Unified Replication for both DR and Backup. With Archives you do not have online access to your backups, so you need some time to bring them online before you can restore them back to the source or another location. A tape library or NetApp Cloud Backup are examples of archive solutions.

NetApp_SnapMirror.png

Is SnapMirror an HA or a DR data protection technology?

There is no straightforward answer to that; to answer the question we have to consider the details.

SnapMirror comes in two flavors. Asynchronous SnapMirror transfers data to a secondary site from time to time; it is obviously a DR technology, because you cannot switch to the DR site automatically since you do not have the latest version of your data. That means that before you start your applications, you might need to prepare them first. For instance, you might need to apply DB logs to your database, so your “not the latest version of the data” becomes the latest one. Alternatively, you might need to choose one snapshot out of the last few, because the latest one might contain data corrupted by a virus, for instance. Again, by definition a DR scenario assumes that you will not switch to DR instantly; it assumes you already have downtime, and it assumes you might need manual interaction, a script, or some modifications before you are able to start & run your services, which requires some downtime.
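
A minimal sketch of an asynchronous SnapMirror relationship, created from the destination cluster (the cluster, SVM and volume names are hypothetical, and the destination volume is assumed to already exist as type DP):

DstCluster::> snapmirror create -source-path SVM1:vol1 -destination-path SVM1_dr:vol1_dst -type XDP -policy MirrorAllSnapshots -schedule hourly
DstCluster::> snapmirror initialize -destination-path SVM1_dr:vol1_dst
DstCluster::> snapmirror show -destination-path SVM1_dr:vol1_dst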

Synchronous SnapMirror (SM-S) also has two modes: a strict, fully synchronous mode and a relaxed synchronous mode. The problem with synchronous replication, similarly to HA solutions, is that the longer the distance between the two sites, the more time is needed to replicate the data; and the longer it takes for the data to be transferred and confirmed back to the first system, the longer your application waits for the acknowledgment.

The relaxed mode allows lags and network outages and, after network communication is restored, automatically re-syncs, which means it is also a DR solution because it allows the RPO to be non-zero.

Strict mode does not tolerate network outages by definition, which means it ensures your RPO is always 0, which makes it somewhat closer to HA.
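
Assuming an ONTAP release with SnapMirror Synchronous and its built-in Sync and StrictSync policies (which roughly correspond to the relaxed and strict modes described above), the two flavors might be configured like this (a sketch; all names are hypothetical):

DstCluster::> snapmirror create -source-path SVM1:db_logs -destination-path SVM1_dr:db_logs_dst -type XDP -policy StrictSync
DstCluster::> snapmirror create -source-path SVM1:db_data -destination-path SVM1_dr:db_data_dst -type XDP -policy Sync
DstCluster::> snapmirror initialize -destination-path SVM1_dr:db_logs_dst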

Does it mean Synchronous SnapMirror in Strict mode is an HA solution?

Well, not precisely. Synchronous SnapMirror in strict mode can also be part of a DR solution. For instance, you might have a DB with all the data replicated asynchronously to a DR site and only the DB logs replicated synchronously; this way we reduce network traffic between the two locations, keep the overall RPO small, and can use the synchronously replicated DB logs to bring the entire database to an RPO of 0. In such a scenario the RTO will not be very large, and it allows the DR site to be located very far away from the primary. See the scenarios for how SnapMirror Sync can be combined with SnapMirror Async to build a more robust and beneficial DR solution.

To comply with the HA definition, you need not only an RPO of 0 but also the ability to switch over automatically, with an RTO no higher than the timeouts of your applications & services.

Can SM-S in strict mode switch over between sites automatically?

The answer is “not YET.” For automatic switchover between sites, NetApp has an entirely different technology called MetroCluster, which is a Metro-HA technology. Any MetroCluster or local HA system should be complemented with DR, Backup & Archive technologies to provide the best data protection possible.

Will SM-S become HA?

I personally believe that NetApp will make it possible in the future to automatically switch over between two sites with SM-S. Most probably it will build on the SVM-DR feature to replicate not only data but also network interfaces and configuration, and to do that SM-S will need some kind of Tiebreaker like in MCC, but those pieces are not there yet. In my personal opinion, this kind of technology is most probably going to (and should) be positioned as an online data migration technology across the NetApp Data Fabric rather than as a (Metro-)HA solution.

Why should SM-S not be positioned as an HA solution?

A few reasons:

1) NetApp already has MetroCluster (MCC) technology, and for many, many years it was and still is a superior Metro-HA technology proven to be stable, reliable, and performant.

2) MCC has now become easier, simpler, and smaller, and those are basically the only reasons you would want HA on top of SnapMirror. Since we already have MCC over IP (MC-IP), it is theoretically possible that it will run even on the smallest AFF systems someday.

According to my own sense of how things will go, in some cases SM-S might be used as an HA solution someday.

How are HA, DR & Backup solutions applied in practice?

As you remember, HA, DR & Backup solutions do not compete with but rather complement each other to provide full data protection. In a perfect world without money constraints, where you need to provide the highest possible and fully covered data protection, you would need HA, DR, Backups, and Archive: HA located in one place or geo-distributed as far as possible (up to 700 km), and on top of that DR and Backups. For Backups, you would probably want to place the site as far away as possible, for instance on the other side of the country or even on another continent. In these circumstances, you can do Synchronous SnapMirror only for some of your data, like DB logs, and Async for the rest to an intermediate DR site (up to about 10 ms of network RTT latency), and from that intermediate site replicate all the data asynchronously, or as Backup protection, to another continent. And from the DR and/or Backup sites we can archive to a tape library, NetApp Cloud Backup, or another archive solution.

Slide4

Summary

HA, DR, Backup, and Archive are different types of data protection which complement each other. In the best-case scenario, any company should have not only an HA solution for its data but also DR, Backup, and Archive, or at least HA & Backup; but it always depends on business needs, the business's willingness to pay for some level of protection, and its understanding of the risks involved in not protecting the data properly.