How memory works in ONTAP: Write Allocation, Tetris, Summary (Part 3)

Write Allocation

NetApp has changed this part of the WAFL architecture a lot since its first release. Increased demand for performance, multi-core CPUs, and new types of media have been the forces driving continued improvement of Write Allocation.

Each stream of writes can be processed separately, depending on data type, read/write pattern, block size, the media it will be written to, and other characteristics. Based on that, WAFL decides how and where the data will be placed and which type of media it should use.

For example, for flash media it is important to write data in multiples of the smallest block size the media can operate on, so that cells are not worn out prematurely and are utilized evenly. Another example is separating data from metadata and storing them on different tiers of storage. In reality, Write Allocation is a very big, separate topic; how ONTAP optimizes data in this module deserves its own discussion.
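To make the idea concrete, here is a minimal, purely illustrative sketch of media-aware placement. The tier names, block sizes, and the placement rule are my own assumptions for illustration; this is not ONTAP's actual Write Allocation logic.

# Illustrative sketch only: not ONTAP's real Write Allocation.
FLASH_PROGRAM_UNIT = 4096      # assumed smallest unit flash can program
HDD_CHUNK = 64 * 1024          # assumed chunk we prefer to write to HDD at once

def plan_write(payload: bytes, is_metadata: bool, media: str) -> dict:
    """Return a hypothetical placement decision for one write."""
    tier = "ssd" if is_metadata else media   # e.g. keep metadata on the fastest tier
    unit = FLASH_PROGRAM_UNIT if tier == "ssd" else HDD_CHUNK
    # Pad the write up to a multiple of the media's preferred unit, so flash
    # cells are programmed in whole units and wear is spread evenly.
    padded = -(-len(payload) // unit) * unit
    return {"tier": tier, "write_size": padded}

print(plan_write(b"x" * 5000, is_metadata=False, media="ssd"))
# {'tier': 'ssd', 'write_size': 8192}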

Affinities

The growing number of CPU cores forced NetApp to develop new algorithms for process parallelization over time. Volume affinities (and other affinities) are algorithms which allow multiple processes to execute in parallel without blocking each other, though sometimes it is still necessary to execute a serial process which blocks the others. For example, when two hosts work on the same storage with two different volumes, they can write and modify data in parallel. If those two hosts start to write to a single volume, ONTAP becomes a broker: it serializes access to that volume, gives access to one host at a time, and only when that host is done gives access to the other. Write operations for a given volume are always executed on a single core; otherwise, because each core can be loaded unequally, you could end up messing up your data. Each of these affinities reduced the locking of other processes, increased parallelization, and improved multi-core CPU utilization. Volume affinities are called "Waffinity" because each volume is a separate WAFL file system, so the words WAFL and affinity were combined. A minimal sketch of this per-volume serialization follows the list below.

  • Classical Waffinity (2006)
  • Hierarchical Waffinity (2011)
  • Hybrid Waffinity (2016)
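Here is a toy model of the idea behind per-volume affinities: writes to different volumes run in parallel, while writes to the same volume are serialized. The lock-per-volume scheme below is a simplification I chose for illustration, not the actual Waffinity scheduler.

# Toy model of per-volume affinity: not the real Waffinity scheduler.
import threading
from concurrent.futures import ThreadPoolExecutor

volume_locks = {v: threading.Lock() for v in ("vol_a", "vol_b", "vol_c")}
volumes = {v: [] for v in volume_locks}

def write(volume, block):
    # Only one writer at a time may modify a given volume's state;
    # writers to different volumes do not block each other.
    with volume_locks[volume]:
        volumes[volume].append(block)

with ThreadPoolExecutor(max_workers=4) as pool:
    # host1 and host2 hit different volumes -> fully parallel
    # host3 and host4 hit the same volume  -> serialized by its lock
    for host, vol in [("host1", "vol_a"), ("host2", "vol_b"),
                      ("host3", "vol_c"), ("host4", "vol_c")]:
        pool.submit(write, vol, f"data from {host}".encode())

print({v: len(blocks) for v, blocks in volumes.items()})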

FlexGroup

If, for instance, ONTAP needs to work at the aggregate level, the whole set of volumes living on that aggregate will stop receiving write operations for some time, because aggregate operations lock volume operations; this is just one example of the locking mechanisms NetApp has been improving throughout ONTAP's lifespan. What would the most natural solution be in such a case? To have multiple volumes, multiple aggregates, and even multiple nodes; but then, instead of a single bucket of space, you have multiple volumes spread over multiple nodes and aggregates. That is where FlexGroup comes into the picture: it joins all the (constituent) volumes into a single logical space, visible to clients as a single volume or file share. Before FlexGroup, ONTAP was already very well optimized for workloads with small random blocks and even sequential reads; with FlexGroup, ONTAP is also optimized for sequential operations and benefits mostly from sequential writes.
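The sketch below shows the FlexGroup idea at its simplest: one logical namespace backed by many constituent FlexVol volumes spread over nodes and aggregates. The placement rule (put each new file on the least-used constituent) is a deliberate simplification of the real ingest heuristics.

# Toy sketch of a FlexGroup namespace; placement rule is simplified.
class FlexGroupSketch:
    def __init__(self, constituents):
        self.used = {name: 0 for name in constituents}   # constituent -> bytes placed
        self.files = {}                                  # path -> constituent

    def create(self, path: str, size: int) -> str:
        member = min(self.used, key=self.used.get)       # least-full constituent
        self.used[member] += size
        self.files[path] = member
        return member

fg = FlexGroupSketch([f"node{n}_vol{v}" for n in (1, 2) for v in (1, 2)])
for i in range(6):
    fg.create(f"/share/file{i}", size=100)
print(fg.used)   # files end up spread across all four constituents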

[Figure: FlexGroups]

RAID

From the WAFL module, data is delivered to the RAID module, which processes it and writes it in one transaction (known as a full stripe) to the disks, including parity data for the parity drives.

Because data is written to the disks in full stripes, there is no need to read back old data to recalculate parity for the parity drives: the system has already prepared everything in the RAID module. That is also the reason why, in practice, parity drives are always less utilized than data drives, unlike what happens with traditional RAID-6 and RAID-4. Writing anywhere also avoids overwriting data in place: new data is placed into new, empty space and the metadata pointers are simply moved to the new location. This means the system does not have to read a stripe into memory and recalculate its parity after a single block change, and system memory is used more rationally. More about RAID-DP in TR-3298.
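Here is a minimal sketch of why full-stripe writes are cheap for parity RAID: when WAFL hands RAID a complete stripe, parity is just an XOR over blocks already sitting in memory, with no read-modify-write of old data or old parity. RAID-DP adds a second, diagonal parity; only row parity is shown here for brevity.

# Minimal full-stripe parity sketch (row parity only).
from functools import reduce

def xor_blocks(blocks):
    """XOR same-sized blocks byte by byte to produce a parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

stripe = [bytes([d]) * 4096 for d in (1, 2, 3, 4)]   # 4 data blocks, one per data disk
row_parity = xor_blocks(stripe)                      # written to the parity disk

# Any single lost data block can be rebuilt from the survivors plus parity:
rebuilt = xor_blocks([stripe[0], stripe[1], stripe[3], row_parity])
assert rebuilt == stripe[2]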


Tetris and IO redirection

Tetris is a WAFL mechanism for write and read optimization. Tetris collects data for each CP and compiles it into chains, so blocks from a single host are assembled into bigger chunks (this process is also known as IO redirection). On the other hand, this simple logic enables predictive reads, because there is little practical difference between reading, for example, 5 KB or 8 KB of data, or 13 KB or 16 KB. A predictive read is a form of read caching. So when the time comes to decide which data should be read in advance, the answer comes naturally: the data most likely to be read right after the first part is the data that was written right after the first part. The decision about which extra data to read has, in effect, already been made.
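The toy function below illustrates the "Tetris" idea: per CP, gather the dirty blocks that belong together and merge adjacent block numbers into longer chains, so they can be written (and later read back) as one sequential run. This is a simplification for illustration, not the real WAFL code.

# Toy illustration of building chains from dirty blocks.
def build_chains(dirty_blocks):
    """Group consecutive block numbers into contiguous chains."""
    chains, run = [], []
    for block in sorted(dirty_blocks):
        if run and block != run[-1] + 1:
            chains.append(range(run[0], run[-1] + 1))
            run = []
        run.append(block)
    if run:
        chains.append(range(run[0], run[-1] + 1))
    return chains

# Blocks dirtied by one host during a CP, arriving in random order:
print(build_chains({7, 3, 4, 5, 12, 13, 6}))
# [range(3, 8), range(12, 14)]  -> two sequential runs instead of 7 scattered writes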


Read Cache

MBUF is used for both read and write caching, and all reads, without exception, are inevitably copied to the cache. From the same cache, hosts can read data that has just been written. When the CPU cannot find data in the system cache for a host, it looks in a second-level cache if one is available, and only then on the disks. Once the data is found on disk, the system places it into MBUF. If that piece of data is not accessed for a while, the CPU can destage it to a second memory tier such as Flash Pool or Flash Cache.

It is important to note that the system evicts unaccessed data from the cache very granularly, at the level of 4 KB blocks. Another important point is that the cache in ONTAP systems is deduplication-aware: if a block already exists in MBUF or in the second-tier cache, it will not be copied there again.
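Here is a minimal sketch of a deduplication-aware block cache: blocks are keyed by a content fingerprint, so a block that already sits in the cache is never stored a second time, and eviction happens per 4 KB block by last access. This is purely illustrative, not ONTAP's actual cache implementation.

# Illustrative dedup-aware block cache with per-block LRU eviction.
import hashlib
from collections import OrderedDict

BLOCK_SIZE = 4096

class DedupAwareCache:
    def __init__(self, max_blocks):
        self.max_blocks = max_blocks
        self.blocks = OrderedDict()          # fingerprint -> 4 KB block

    def insert(self, block):
        fp = hashlib.sha256(block).hexdigest()
        if fp in self.blocks:                # already cached: just refresh LRU order
            self.blocks.move_to_end(fp)
            return
        self.blocks[fp] = block
        if len(self.blocks) > self.max_blocks:
            self.blocks.popitem(last=False)  # evict the least recently used block

cache = DedupAwareCache(max_blocks=2)
a = b"A" * BLOCK_SIZE
cache.insert(a)
cache.insert(a)                              # duplicate: no second copy stored
cache.insert(b"B" * BLOCK_SIZE)
print(len(cache.blocks))                     # 2 distinct blocks cached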

Why is NVRAM/NVMEM size not always important?

In NetApp FAS/AFF and other ONTAP-based systems, NVRAM/NVMEM is used exclusively for NVLOGs, not as a write buffer, so it does not have to be as big as the write-buffering system memory in other systems.

[Figure: ONTAP NVRAM vs competitors]

Battery and boot flash drive

As I mentioned before, hardware systems like FAS and AFF have a battery installed in each controller to power NVRAM/NVMEM and maintain the data in memory for 72 hours. Just after a controller loses power, the data from memory is destaged to the boot flash drive. Once ONTAP boots, the data is restored back.

Flash media and WAFL

As I mentioned in previous articles in this series, ONTAP always writes to new space, for a number of architectural reasons. It turns out that flash media needs precisely that. Though some people predicted the death of WAFL because of flash media, WAFL works on that media quite well: the "always write to new space" technique not only optimizes garbage collection and wears flash memory cells out evenly, it also shows quite competitive performance.

Summary

System memory is architecturally built not just to optimize performance, but also to offer high data protection, availability, and write optimization. Rich ONTAP functionality, the unique way NVRAM/NVMEM is used, and a rich software integration ecosystem qualitatively differentiate NetApp storage systems from others.

Continue to read

NVRAM/NVMEM, MBUF & CP (Part 1)
NVLOG & Write-Through (Part 2)

How ONTAP cluster works?

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I describe my own understanding of the internal organization of system memory in ONTAP systems. Therefore, this information might be outdated, or I might simply be wrong in some aspects and details. I will greatly appreciate any contribution you can make to improve this article; please leave your ideas and suggestions on this topic in the comments below.

How memory works in ONTAP: NVLOG and Write-Through (Part 2)

NVRAM/NVMEM & Write-Through

It is important to note that NVRAM/NVMEM technology is widely used in many storage systems, but NetApp ONTAP systems use it for NVLOGs (hardware journaling), while others use it as a block device for write cache (at the disk driver level, as a disk cache), and that simple fact makes a real difference in storage architecture. Having its architecture built around NVLOGs allows an ONTAP system not to switch into Write-Through mode when one controller in an HA pair dies.

That simple statement is tough to explain in simple words; let me try. Write-Through is a storage system mode that does not use the write buffer: all writes go straight to the disks. In other words, it disables the write buffer, which is a bad idea for many reasons. All the optimizations, all the magic, and all the intelligent processing happen to your data in the write buffers, so disabling them is always a bad idea. Consider, for example, the problems you might experience with a storage system in Write-Through mode if you are using HDDs. HDDs are significantly slower than memory, so in the write buffer you can optimize random operations by gluing them together in memory and later destaging them sequentially to your HDDs as one big chunk of data in a single operation, which is much easier for HDDs to process. The memory cache is basically used to "trick" your hosts by sending them an acknowledgment before the data is actually placed on disk, and in this way improve performance. In the case of flash media, you can optimize your data to be written in a way that does not wear out your memory cells. Memory is also very often used as the place to prepare checksums for RAID (or other types of data protection). So, the bottom line: Write-Through is terrible for your storage system's performance, and all storage vendors try to avoid that scenario in their systems. When might a storage system architecture need to switch to Write-Through? When you are uncertain that your write cache will protect you. The simplest example is when the battery for your write cache is dead.
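To make the cost of losing the write buffer concrete, here is a back-of-the-envelope sketch: in Write-Through every host write becomes a device I/O, while a write-back buffer can coalesce adjacent 4 KB writes into a few large sequential I/Os before destaging. The numbers and the coalescing rule are illustrative only.

# Illustrative comparison of write-through vs. write-back coalescing.
def device_ios_write_through(writes):
    return len(writes)                      # one disk I/O per host write

def device_ios_write_back(writes):
    # Coalesce contiguous block addresses collected in the buffer into runs.
    runs, previous = 0, None
    for addr in sorted(writes):
        if previous is None or addr != previous + 1:
            runs += 1                       # a new sequential run begins
        previous = addr
    return runs

pending = [10, 11, 12, 13, 50, 51, 52, 90]  # block numbers dirtied by hosts
print(device_ios_write_through(pending))    # 8 small random I/Os
print(device_ios_write_back(pending))       # 3 larger sequential I/Os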

Let's examine another, more complex scenario: what if you have an HA pair and the battery on only one controller dies? Well, since all storage systems from all the A-brands do HA, your writes should be protected. And what happens if your HA pair loses one controller and the surviving one still has a working battery? Many of you might think, according to the logic described above, that your storage system will not switch to Write-Through, right? Well, the answer is "it depends." In the ONTAP world, NVLOGs are used only for data protection purposes in a dedicated NVRAM/NVMEM device: the data is always kept there exactly as it was placed, in unchanged form, with no architectural ability to modify it. The only thing that is architecturally allowed is to write data to an empty half and then, when needed, clear all the NVLOGs from the first one. With this architecture there is no need to switch an ONTAP system to Write-Through even when only a single controller is working. Meanwhile, in other architectures, even though they also use NVRAM/NVMEM, all the data is stored in one place.

Both ONTAP and other storage vendors' systems use memory for data optimization; in other words, they change data from its original state. And changing your data is a big threat to data consistency once only one controller survives, even if the battery on that controller functions properly. That is why all the other storage systems have to switch to Write-Through: there is no way to guarantee your data will not be corrupted while it is in the middle of optimization, especially after it has been half-processed by the RAID algorithm and the surviving node takes an unexpected reboot. Therefore, all other platforms and systems, all the NetApp AFF/FAS competitors I know of, will switch to Write-Through mode once only one node is left. There are obviously some tricks: some vendors allow you to disable Write-Through once you get into such a situation, but of course it is not recommended; they just give you the ability to make a bad choice on your own, one that will lead to data corruption on the entire storage system once the surviving node reboots unexpectedly. Another example is HPE 3PAR systems, where in a 4-node configuration, if you lose only one controller, the system continues to function normally, but once you lose 50% of your nodes it again switches to Write-Through; the same happens in a 2-node configuration.

Thanks to the fact that ONTAP stores data in NVLOGs as it was, in unchanged form, it is possible to roll back an earlier unfinished transaction of data that has already been half-processed by RAID, restore the data back to MBUF from the NVLOGs, and finish that transaction. Each transaction that writes new data from MBUF to the disk drives is executed as part of a system snapshot called a CP. Each transaction can be easily rolled back: after the single controller boots, it restores the data from NVLOGs to MBUF, processes it again with RAID in memory, rolls back the last unfinished CP, and writes the data to disks. This allows ONTAP systems to always be consistent (from the storage system's perspective) and never switch to Write-Through mode.

Continue to read

How ONTAP cluster works?

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I describe my own understanding of the internal organization of system memory in ONTAP systems. Therefore, this information might be outdated, or I might simply be wrong in some aspects and details. I will greatly appreciate any contribution you can make to improve this article; please leave your ideas and suggestions on this topic in the comments below.

How memory works in ONTAP: NVRAM/NVMEM, MBUF and CP (Part 1)

In this article, I'd like to describe how NVRAM, cache, and system memory work in ONTAP systems.

System Memory

System memory in any AFF/FAS (and in systems from other OEM vendors running ONTAP) consists of two types of memory: ordinary (volatile) memory and non-volatile memory (hence the "NV" prefix), which is connected to a battery on the controller. Some systems use NVRAM and others use NVMEM; both serve the same purpose. The only difference is that NVMEM is installed on the memory bus, while NVRAM is a small board with memory installed on it, connected to the PCIe bus. Both NVRAM and NVMEM are used by ONTAP to store NVLOGs. Ordinary memory is used for system needs and mostly as the Memory Buffer (MBUF), in other words the cache.

Data destaging in a normally functioning storage system happens from the MBUF, not from NVRAM/NVMEM. However, the trigger for destaging might be the fact that NVRAM/NVMEM is half full, among others. See the CP triggers section below.


NVRAM/NVMEM & NVLOGs

Data from hosts is always placed into the MBUF first. Then, with a Direct Memory Access (DMA) request, the data is copied from MBUF to NVMEM/NVRAM exactly as the host sent it, in the unchanged form of logs, just like a DB log or a journaling file system. DMA does not consume CPU cycles; that is why NetApp says this system does hardware journaling. As soon as the data is acknowledged to be in NVRAM/NVMEM, the system sends a write acknowledgment to the host. After a Consistency Point (CP) occurs, the data from MBUF is destaged to the disks and the system clears NVRAM/NVMEM, without using the NVLOGs at all in a normally functioning system. NVLOGs are used only in special events or configurations. So, ONTAP does not use NVLOGs in a normally functioning High Availability (HA) system: it erases the logs each time a CP occurs and then writes new NVLOGs. For instance, ONTAP will use the NVLOGs to restore data "as it was" back to MBUF in case of an unexpected reboot.
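The sketch below is a highly simplified model of this write path: data lands in MBUF, is copied unchanged into the NVLOG journal, the host is acknowledged, and at the next CP the MBUF contents are destaged and the NVLOG is cleared; after an unexpected reboot, the NVLOG is replayed back into MBUF. All names here are illustrative, not real ONTAP interfaces.

# Simplified model of MBUF + NVLOG behavior; names are illustrative.
class ControllerSketch:
    def __init__(self):
        self.mbuf = []      # volatile memory buffer (the cache)
        self.nvlog = []     # battery-backed journal of writes "as the host sent them"
        self.disk = []

    def host_write(self, payload):
        self.mbuf.append(payload)
        self.nvlog.append(payload)          # DMA copy, unchanged form
        return "ACK"                        # acknowledged once the NVLOG holds it

    def consistency_point(self):
        self.disk.extend(self.mbuf)         # destage from MBUF, not from NVLOG
        self.mbuf.clear()
        self.nvlog.clear()                  # journal no longer needed for this CP

    def crash_and_reboot(self):
        self.mbuf = list(self.nvlog)        # replay NVLOG to rebuild lost MBUF state

node = ControllerSketch()
node.host_write(b"block-1")
node.crash_and_reboot()                     # power loss before the CP
node.consistency_point()
print(node.disk)                            # [b'block-1'] -- nothing was lost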

Memory Buffer: Write operations

The first place where all writes land is always MBUF. Then the data is copied from MBUF to NVRAM/NVMEM with a DMA call; after that, the WAFL module allocates a range of blocks where the data from MBUF will be written, which is simply called Write Allocation. It might sound simple, but it is kind of a big deal and is constantly being optimized by NetApp. However, just before it allocates space for your data, the system compiles Tetris! Yes, I'm talking about the same kind of puzzle-matching Tetris game you might have played in your childhood. One of Write Allocation's jobs is to make sure all the Tetris data is written to the disks in one unbroken sequence whenever possible.

WAFL also does some additional data optimization depending on the type of data, where it goes, what type of media it is going to be written to, etc. After the WAFL module gets an acknowledgment from NVRAM/NVMEM that the data is secured, the RAID module processes the data from MBUF, adds checksums to each block (known as block/zone checksums), and calculates and writes parity data to the parity disks. It is important to note that some data in MBUF contains commands which can be expanded before they are delivered to the RAID module; for example, commands that ask the storage system to format some space with a predefined repeated pattern, or commands that ask the storage to move chunks of data. Such commands might consume a small amount of space in NVRAM/NVMEM but might generate a large amount of data when executed.


NVMEM/NVRAM in HA

Each HA pair with ONTAP consists of two nodes (controllers), and each node has a copy of the data from its neighboring HA partner. This architecture allows hosts to be switched to the surviving node and continue to be served without noticeable disruption.

To be more precise, each NVRAM/NVMEM is divided into two pieces: one to store NVLOGs from the local controller and another to store a copy of the NVLOGs from the HA partner controller. Each piece is in turn divided into halves. Each time the system fills the first local half of NVRAM/NVMEM, a CP event is generated; while the CP is in progress, the local controller uses the second local half for new operations, and after the second half is filled with logs, the system switches back to the first, already emptied half, and the cycle repeats.
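Here is a toy model of that half-and-half scheme: new log entries go into the active half; when it fills up, a CP is triggered and logging immediately continues in the other half while the first one is being emptied. Sizes and exact behavior are illustrative assumptions.

# Toy model of the two-half NVLOG cycle; not real ONTAP internals.
class NvlogHalvesSketch:
    def __init__(self, half_capacity):
        self.halves = [[], []]
        self.active = 0
        self.half_capacity = half_capacity
        self.cp_events = 0

    def log(self, entry):
        self.halves[self.active].append(entry)
        if len(self.halves[self.active]) >= self.half_capacity:
            self.cp_events += 1                     # "half full" CP trigger
            self.active ^= 1                        # switch to the other half
            self.halves[self.active].clear()        # it was emptied by the previous CP

nv = NvlogHalvesSketch(half_capacity=3)
for i in range(10):
    nv.log(f"write {i}".encode())
print(nv.cp_events, nv.active)                      # 3 CPs so far, now logging in half 1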


Consistency Points

Like many modern file systems, WAFL is a journaled FS, and the journal is used to maintain consistency and protect data. However, unlike general-purpose journaling file systems, WAFL does not need time or a special FS check to roll back and make sure the FS is consistent. Once a controller reboots unexpectedly, the last unfinished CP transaction is simply not confirmed: similarly to a snapshot, the last one is just discarded, and the data from the NVLOGs is used to create a new consistency point once the controller boots after the unexpected reboot. A CP transaction is confirmed only once the entire transaction has been written to the disks and the root inode has been updated with pointers to the new data blocks.

It turns out that NetApp snapshot technology was so successful that it is used literally almost everywhere in ONTAP. Let me remind you that each CP contains data already processed by WAFL and then by the RAID module. A CP is also a snapshot: before data that has already been processed by WAFL and RAID is destaged from MBUF to disks, ONTAP creates a system snapshot of the aggregate where it is going to write the data; to be more precise, it just copies the root inode pointers, which captures the last active file system state. Then ONTAP writes the new data to the aggregate. Once the data has been successfully written as part of the CP transaction, ONTAP changes the root inode pointers and clears the NVLOGs. In case of failure, even if both controllers reboot simultaneously, the last system snapshot will be rolled back, the data will be restored from the NVLOGs, processed again by the WAFL and RAID modules, and destaged to disks at the next CP as soon as the controllers come back online.

In case only one controller suddenly switches off or reboots, the surviving controller will restore the data from its own copy of the NVLOGs and finish the earlier unsuccessful CP; applications will be transparently switched to the surviving controller and will continue to run after a small pause, as if there had been no disruption at all. Once a CP is successful, as part of the CP transaction, ONTAP changes the root inode with pointers to the new data and creates a new system snapshot which captures the newly added data and the pointers to the old data. In this way, ONTAP always maintains data consistency on the storage system without ever needing to switch to Write-Through mode.
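The following minimal copy-on-write sketch shows why a CP either fully happens or is fully rolled back: new blocks are written to free space first, and only then is the root inode pointer switched to the new tree in one final step; a crash before that switch simply leaves the previous consistent image in place. This illustrates the principle, not WAFL's actual on-disk format.

# Copy-on-write CP sketch: illustrative only.
class CowFilesystemSketch:
    def __init__(self):
        self.blocks = {0: {"files": {}}}     # block address -> contents
        self.root = 0                        # root inode points at block 0
        self.next_addr = 1

    def consistency_point(self, new_files, crash_before_commit=False):
        # 1. Build the new file-system image in unused blocks.
        image = dict(self.blocks[self.root]["files"], **new_files)
        addr = self.next_addr
        self.blocks[addr] = {"files": image}
        self.next_addr += 1
        if crash_before_commit:
            return                           # old root still valid: nothing to fsck
        # 2. Advance the root pointer in one step -- the CP is now committed.
        self.root = addr

    def active_files(self):
        return self.blocks[self.root]["files"]

fs = CowFilesystemSketch()
fs.consistency_point({"a.txt": b"v1"})
fs.consistency_point({"a.txt": b"v2"}, crash_before_commit=True)
print(fs.active_files())                     # still {'a.txt': b'v1'} -- consistent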

ONTAP 9 performs CPs separately for each aggregate, while previously they were controller-wide. With CPs separated per aggregate, slow aggregates no longer influence the other aggregates in the system.

iNodes

An inode contains information about a file, known as metadata, but an inode can store a small amount of data too. Inodes have a hierarchical structure, and each inode can store up to 4 KB of information. If a file is small enough to fit into an inode together with its metadata, then only a single 4 KB block is used for that inode. A directory is actually also a file on the WAFL file system, so one real-world example where an inode stores both the metadata and the data itself is an empty directory or an empty file. However, what if the file does not fit into an inode? Then the inode stores pointers to other inodes, and those inodes can in turn store pointers to other inodes or addresses of data blocks. Currently, WAFL has a 5-level hierarchy limit. Sometimes inodes and data blocks are referred to as files in deep-dive technical documentation about WAFL. As a consequence of this hierarchy, each file on a FlexVol file system can be no larger than 16 TiB. Each FlexVol volume is a separate WAFL file system and has its own volume root inode.
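Here is a toy model of that hierarchy: a tiny file (or an empty directory) fits entirely inside a single 4 KB inode, while a larger file stores pointers to other inodes, which in turn point to data blocks, up to a limited number of levels. Pointer counts and the traversal are illustrative; this is not WAFL's real on-disk layout.

# Toy inode/pointer-tree model; not the real WAFL layout.
INODE_SIZE = 4096
MAX_LEVELS = 5          # the hierarchy depth limit mentioned above

class InodeSketch:
    def __init__(self, data=b"", children=None):
        self.inline_data = data if len(data) <= INODE_SIZE else b""
        self.children = children or []       # pointers to other inodes / data blocks

def read_file(node, level=0):
    """Walk the pointer tree and concatenate the data blocks."""
    if level > MAX_LEVELS:
        raise ValueError("hierarchy deeper than the allowed limit")
    if node.inline_data or not node.children:
        return node.inline_data
    return b"".join(read_file(child, level + 1) for child in node.children)

tiny = InodeSketch(data=b"fits in the inode itself")
big = InodeSketch(children=[InodeSketch(data=b"block-1 "), InodeSketch(data=b"block-2")])
print(read_file(tiny))
print(read_file(big))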

The reason why Write Anywhere File Layout got the word "Anywhere" in its name is that metadata can be anywhere in the FS

An interesting nuance: the reason why Write Anywhere File Layout probably got the word "Anywhere" in its name is that metadata can be anywhere in the FS, mixed in with the data blocks, while other traditional file systems usually store their metadata in a dedicated area of the disk, which usually has a fixed size. Here is the list of metadata that can be stored alongside the data:

  • Volume where the inode resides
  • Locking information
  • Mode and type of file
  • Number of links to the file
  • Owner's user and group IDs
  • Number of bytes in the file
  • Access and modification times
  • Time the inode itself was last modified
  • Addresses of the file’s blocks on disk
  • Permissions: UNIX bits or Windows Access Control Lists (ACLs)
  • Qtree ID.
[Figure: Write Anywhere File Layout inodes]

Events generating CP

A CP is an event generated by one of the following conditions:

  • 10 seconds passed by since the last CP
  • The system filled the first half of its local portion of NVRAM/NVMEM
  • Local MBUF is filled (known as the High Watermark). This happens rarely, because MBUF is usually much bigger than NVMEM/NVRAM; it can occur when commands in MBUF generate a lot of new data during execution, before or in the WAFL/RAID modules
  • When you execute the halt command on a controller to stop it
  • Others

The CP trigger condition might indirectly point at some system problems; for example, when there are not enough HDDs to maintain performance, you will see back-to-back ("B" or "b") CPs. See also the Knowledge Base FAQ: Consistency Point.
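The conditions above can be condensed into a tiny decision function. The thresholds below are made-up illustrative values, not ONTAP's real internals.

# Condensed, illustrative CP-trigger check.
def cp_trigger(seconds_since_last_cp, nvlog_half_full, mbuf_used_fraction, halt_requested):
    if halt_requested:
        return "halt"
    if nvlog_half_full:
        return "nvlog half full"
    if mbuf_used_fraction >= 0.9:            # assumed high-watermark threshold
        return "mbuf high watermark"
    if seconds_since_last_cp >= 10:
        return "timer"
    return None

print(cp_trigger(3.2, nvlog_half_full=True, mbuf_used_fraction=0.4, halt_requested=False))
# 'nvlog half full'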

NVRAM/NVMEM and MetroCluster

To protect data from a split-brain scenario in MetroCluster (MCC), hosts writing data to the system get an acknowledgment only after the data has been acknowledged by the local HA partner and by the remote MCC partner (in case the MCC consists of 4 nodes).
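A small sketch of that acknowledgment rule for a 4-node MetroCluster: the host write is acknowledged only after the NVLOG copy is confirmed by both the local HA partner and the remote DR partner. The function and partner callables are illustrative.

# Illustrative MetroCluster acknowledgment rule.
def metrocluster_write(payload, local_ha_partner, remote_dr_partner):
    copies_confirmed = (
        local_ha_partner(payload) and        # mirrored over the HA interconnect
        remote_dr_partner(payload)           # mirrored to the remote MCC site
    )
    return "ACK" if copies_confirmed else "RETRY"

# Toy partners that always confirm receipt of the NVLOG copy:
print(metrocluster_write(b"block", lambda d: True, lambda d: True))   # 'ACK'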

[Figure: MetroCluster local and DR pair memory replication]

HA interconnect

Data synchronization between the partners of a local HA pair happens over the HA interconnect. If the two controllers of an HA pair are located in two separate chassis, the HA interconnect is an external connection (on some models it can run over InfiniBand or Ethernet, with ports usually named c0a, c0b, etc., for example on FAS3240 systems). If the two controllers of an HA pair are placed in a single HA chassis, the HA interconnect is internal and there are no visible connections. Some controllers can be used in both configurations, HA in a single chassis or HA with each controller in its own chassis; such controllers have dedicated HA interconnect ports, but when the controller is used in a single-chassis configuration those ports are not used (and cannot be used for other purposes), and communication is established internally through the chassis backplane.

Controllers vs Nodes

A storage system is formed out of one or a few HA pairs. Each HA pair consists of two controllers, sometimes called nodes. Controller and node are very similar and often interchangeable terms; the difference is that controllers are physical devices, while nodes are the OS instances running on the controllers. Controllers in an HA pair are connected with the HA interconnect. In hardware appliances like AFF and FAS systems, each hard disk is connected simultaneously to both controllers of the HA pair. Controllers are often referred to in technical documents as "controller A" and "controller B." Even though hard drives in AFF and FAS systems physically have a single connector, that connector comprises two ports, and each port of each drive is connected to one of the controllers. So if you ever dig deep into the node shell console and enter the disk show command, you'll see disks named like 0c.00.XX, where 0c is the port through which the disk is connected to the controller that "owns" it, 00 is the ID of the disk shelf, and XX is the position of the drive in the shelf. At any given time, only one controller owns a disk or a partition on a disk. When a controller owns a disk or a partition, it means that controller serves data to hosts from that disk or partition. The HA partner is used only when the owner of the disk or partition dies; therefore, each controller in ONTAP has its own drives/partitions and serves its own drives/partitions, an architecture known as "shared nothing." There are two types of HA policies: SFO (storage failover) and CFO (controller failover). CFO is used for root aggregates and SFO for data aggregates. CFO does not change disk ownership in the aggregate, while SFO does.

ToasterA*> disk show -v
  DISK       OWNER                  POOL   SERIAL NUMBER
------------ -------------          -----  ----------------
0c.00.1      unowned                  -     WD-WCAW30485556
0c.00.2      ToasterA  (142222203)  Pool0   WD-WCAW30535000
0c.00.11     ToasterB  (142222204)  Pool0   WD-WCAW30485887
0c.00.6      ToasterB  (142222204)  Pool0   WD-WCAW30481983

But since each drive in a hardware appliance like a FAS or AFF system is connected to both controllers, each controller can address each disk. If you manually change 0c in this example to 0d, the port through which the drive is reachable from the other controller, the system will be able to address the drive.

ToasterB*> disk assign 0d.00.1 -s 142222203
Thu Mar 1 09:18:09 EST [ToasterB:diskown.changingOwner:info]: 
changing ownership for disk 0d.00.1 (S/N WD-WCAW30485556) from (ID 1234) to ToasterA (ID 142222203)

Software-defined ONTAP storage (ONTAP Select and Cloud Volumes ONTAP) works much like MetroCluster, because by definition it uses no "special" equipment; in particular, it does not have dual-ported drives connected to both servers (nodes). So instead of connecting both nodes of an HA pair to a single drive, ONTAP Select (and Cloud Volumes ONTAP) copies data from one controller to the second and keeps a copy of the data on each node. That is the price, the other side, of commodity equipment.

Technically, it would be possible to connect a single external storage device, for instance over iSCSI, to each storage node and avoid the unnecessary data duplication, but that option is not available in SDS ONTAP at the moment.

Mailbox “disk”

While it sounds like a disk, it is really not a disk but rather a tiny special area on a disk that consumes a few KB. That mailbox area is used to send messages from one HA partner to another. The mailbox disk is a mechanism which gives ONTAP an additional level of assurance for its HA capabilities. Mailbox disks are used to determine the state of the HA partner, somewhat like email: each controller periodically posts a message to its own (local) mailbox disks saying that it is alive, healthy, and well, while it reads from the partner's mailbox. If the timestamp of the last message from the partner is too old, the surviving node will take over. In this way, if the HA interconnect is unavailable for some reason, or a controller freezes, the partner will determine the state of the second controller using the mailbox disks and perform the takeover. If a disk holding a mailbox dies, ONTAP chooses a new disk.
By default, the mailbox data resides on two disks: one data and one parity disk for RAID 4, or one parity and one double-parity disk for RAID-DP, and usually on the first aggregate, which is usually the system root aggregate.

Cluster1::*> storage failover mailbox-disk show -node node1
Node    Location  Index Disk Name     Physical Location   Disk UUID
------- --------- ----- ------------- ------------------ -------------------
node1    local    0      1.0.4         local        20000000:8777E9D6:[...]
         local    1      1.0.6         partner      20000000:8777E9DE:[...]
         partner  0      1.0.1         local        20000000:877BA634:[...]
         partner  1      1.0.2         partner      20000000:8777C1F2:[...]
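Here is a tiny sketch of the mailbox heartbeat idea: each node periodically writes a timestamped "I am alive" record to its own mailbox area and reads the partner's; if the partner's last record is older than a timeout, a takeover is initiated. The timeout value and record format are assumptions for illustration.

# Illustrative mailbox heartbeat check; not ONTAP's real mechanism.
import time

TAKEOVER_TIMEOUT = 15.0                      # seconds; illustrative value

def post_heartbeat(mailbox, node):
    mailbox[node] = time.monotonic()         # "posted a message: alive and well"

def partner_needs_takeover(mailbox, partner):
    last_seen = mailbox.get(partner)
    if last_seen is None:
        return True                          # never heard from the partner
    return time.monotonic() - last_seen > TAKEOVER_TIMEOUT

mailbox_disk = {}                            # stands in for the on-disk mailbox area
post_heartbeat(mailbox_disk, "node-A")
print(partner_needs_takeover(mailbox_disk, "node-A"))   # False: fresh heartbeat
print(partner_needs_takeover(mailbox_disk, "node-B"))   # True: no record at all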

Takeover

When a node in an HA pair, whether software-defined or hardware, dies, the surviving one will "take over" and continue to serve the offline node's data to the hosts. With a hardware appliance, the surviving node will also change disk ownership from the HA partner to itself.

Active/Active and Active/Passive configurations

In an Active/Active configuration with the ONTAP architecture, each controller has its own drives and serves data to hosts; in this case, each controller has at least one data aggregate. In an Active/Passive configuration, the passive node does not serve data to hosts and has disk drives only for its root aggregate (for internal system needs). In both Active/Active and Active/Passive configurations, each node needs one root aggregate to function properly. Aggregates are formed out of one or a few RAID groups, and each RAID group consists of a few disk drives or partitions. All the drives or partitions in an aggregate have to be owned by a single node.

Continue to read

How ONTAP cluster works?

Zoning for ONTAP Cluster

Disclaimer

Please note that in this article I describe my own understanding of the internal organization of system memory in ONTAP systems. Therefore, this information might be outdated, or I might simply be wrong in some aspects and details. I will greatly appreciate any contribution you can make to improve this article; please leave your ideas and suggestions on this topic in the comments below.