ZFS Caching

ZFS is an advanced file system that offers many beneficial features, such as pooled storage, data scrubbing, and very large capacity limits. One of its most useful features, though, is the way it caches reads and writes. ZFS allows for tiered caching of data using system memory and, optionally, fast SSDs.

The first level of caching in ZFS is the Adaptive Replacement Cache (ARC), which lives in RAM. Once all the space in the ARC is in use, data that is being pushed out of it can be kept in a second tier, the Level 2 Adaptive Replacement Cache (L2ARC).

With the ARC and L2ARC, along with the ZIL (ZFS Intent Log) and SLOG (separate log), there is often some confusion about what role each one actually fills.

The ARC and its extension, the L2ARC, are straight-up read caches. They exist to speed up reads so that the system doesn't need to go searching through slow spinning disks every time it needs to find data. For writes, the ZIL is a log and isn't actually a write cache (though it is often referred to as one), even if it lives on a separate log device (SLOG). This article will attempt to shed some light on what all these acronyms do and how implementing them might benefit your server. First we'll talk about writes, which means the ZIL and SLOG.

Writes

When ZFS receives a write request, it doesn't immediately start writing to disk. It caches its writes in RAM and then sends them out in Transaction Groups (TXGs) at set intervals (5 seconds by default). A file system that works this way is called a transactional file system.

This benefits performance, as all writes going to disk are better organized and therefore much easier for spinning disks to process. It also benefits data consistency, as partial writes will not occur in the event of a power failure. Instead of committing a partial write, ZFS discards all the data in the unfinished TXG.
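
The all-or-nothing behaviour of a transaction group can be illustrated with a small toy model. This is only a sketch of the idea, not how ZFS is implemented internally; the class and method names (ToyPool, commit_txg, crash) are made up for this example.

```python
class ToyPool:
    """Toy model of transaction-group batching (not real ZFS internals)."""

    def __init__(self):
        self.on_disk = {}    # state that survives a power failure
        self.open_txg = {}   # dirty data that exists only in RAM

    def write(self, key, value):
        # Writes are collected in RAM until the next commit.
        self.open_txg[key] = value

    def commit_txg(self):
        # Build the new state on the side, then switch to it in one step,
        # so a half-applied group is never visible.
        new_state = dict(self.on_disk)
        new_state.update(self.open_txg)
        self.on_disk = new_state
        self.open_txg = {}

    def crash(self):
        # Power loss: RAM contents vanish, committed state is untouched.
        self.open_txg = {}


pool = ToyPool()
pool.write("a", 1)
pool.commit_txg()      # first group reaches disk
pool.write("a", 2)
pool.write("b", 3)
pool.crash()           # second group is lost as a whole
print(pool.on_disk)    # {'a': 1}
```

After the simulated crash the pool holds the last committed state in full: neither of the two writes from the unfinished group is visible.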

To understand the basics of how ZFS handles writes, you will need to understand the difference between synchronous and asynchronous writes. So let's dive in.

Asynchronous Writes

Asynchronous writes: data is immediately cached in RAM and seen as completed to the client, then later written to disk.

When the client sends out a write request, it is immediately stored in RAM. The server then acknowledges to the client that the write has been completed. Once the data is in RAM, the server keeps accepting requests without writing anything to the disks until the whole transaction group is written out together. Asynchronous writes are very fast from the end user's perspective because the data only needs to be stored in high-speed RAM to be seen as completed.

The issue? Although the data on disk will still be consistent, if a power failure does occur then everything in the open transaction group will be lost, because it only exists in volatile memory. Synchronous writes are meant to close that gap by guaranteeing that acknowledged data survives a power failure, but they come at the cost of performance.

Synchronous Writes (without a separate logging device)

Synchronous writes: the data must be written to persistent storage media before the write is acknowledged as complete.

When the client sends out a synchronous write request, the data still goes to RAM first, just like an asynchronous write, but the server will not acknowledge that the write has completed until it has been logged in the ZFS intent log (ZIL). Once the ZIL has been updated, the write is committed and acknowledged. By default the ZIL exists as a portion of your storage pool, which means the drive heads need to physically move both to update the ZIL and to store the data as part of the pool, further impacting performance. Waiting for slower storage media (HDDs) causes performance issues, especially with small random writes.

ZFS's solution to the slowdown caused by synchronous writes, without giving up their protection against data loss, is to place the ZIL on a separate, faster, persistent storage device (SLOG), typically an SSD.

Synchronous Writes with a SLOG

When the ZIL is housed on an SSD, the client's synchronous write requests are logged in the ZIL much more quickly. This way, if the data in RAM were lost because of a power failure, the system would check the ZIL the next time it came back up and find the data it was looking for.

Alternatively, the data may be written to disk immediately, with only pointers to it saved in the ZIL. Until the next TXG goes through, the pool's metadata does not yet show where that data lives, so the ZIL is the only record of its location. The next TXG then updates the metadata to point to the correct location. If a power failure occurred before that happened, the server would check the ZIL to find out where the data was. One thing to note with an SLOG is that it is generally best to mirror it, so that it can still do its job of protecting data in the event of a power failure.
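
The difference between the two kinds of write, and the replay of the log after a power failure, can be sketched with a toy model. This is a simplified illustration under assumed names (ToySyncPool, intent_log, crash_and_recover), not the real ZIL format or recovery code, and it only models the first case above, where the data itself is stored in the log.

```python
class ToySyncPool:
    """Toy model of async vs. sync writes with an intent log."""

    def __init__(self):
        self.on_disk = {}      # committed pool state (persistent)
        self.open_txg = {}     # dirty data in RAM (volatile)
        self.intent_log = []   # ZIL stand-in (persistent)

    def write(self, key, value, sync=False):
        self.open_txg[key] = value
        if sync:
            # A sync write is logged on persistent media before the ack.
            self.intent_log.append((key, value))
        return "ack"

    def commit_txg(self):
        self.on_disk = {**self.on_disk, **self.open_txg}
        self.open_txg = {}
        # Once the group is on disk, its log records are no longer needed.
        self.intent_log = []

    def crash_and_recover(self):
        self.open_txg = {}                 # RAM is lost
        for key, value in self.intent_log: # replay what was logged
            self.open_txg[key] = value
        self.commit_txg()


pool = ToySyncPool()
pool.write("x", "async data")            # acknowledged from RAM only
pool.write("y", "sync data", sync=True)  # acknowledged after logging
pool.crash_and_recover()
print(pool.on_disk)  # {'y': 'sync data'}
```

Only the synchronous write survives the simulated power failure. Note that in normal operation the log is written but never read, which is why it is a log rather than a cache.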

So how much does an SLOG help performance?

The performance impact of an SLOG will depend on the application. For small synchronous IO there will be a large improvement, and there could be a fair improvement on sequential IO as well. Workloads that issue a lot of synchronous writes, such as database servers or VM hosting, can also benefit. However, the SLOG's primary function is not as a performance boon, but to save data that would otherwise be lost in the event of a power failure. For mission critical applications, it could potentially be quite costly to lose the 5 seconds of data that would have been sent over in the next transaction group. That's also why an SLOG isn't truly a cache; it is a log, like its name suggests. It is written to on every synchronous write, but it is only read back after an unexpected power failure or crash.
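
To put a rough number on "5 seconds of data", you can multiply your write rate by the transaction group interval. The sketch below assumes a steady write rate and the default 5 second interval; the 200 MiB/s figure is a hypothetical example, not a measurement, and a busy system may commit a group sooner than the interval.

```python
MIB = 1024 ** 2


def data_at_risk_bytes(write_rate_bytes_per_s, txg_interval_s=5.0):
    """Upper-bound estimate of unlogged async data held only in RAM."""
    return write_rate_bytes_per_s * txg_interval_s


rate = 200 * MIB                    # hypothetical steady write rate
at_risk = data_at_risk_bytes(rate)
print(at_risk / MIB)                # 1000.0 (MiB)
```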

If the 5 seconds of data you might lose is vital, then it is possible to force all writes to be performed as synchronous (sync=always) in exchange for a performance loss. If none of that data is mission critical, then sync can be disabled (sync=disabled) and all writes can simply use RAM as a cache, at the risk of losing a transaction group. The default, sync=standard, leaves the decision to each write: it is treated as synchronous only when the application requests it.
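
The three behaviours map onto the sync property of a dataset, which is changed with zfs set. The helper below is a minimal sketch that only builds the command line and does not run it; the dataset name tank/db and the function name are hypothetical.

```python
import shlex

SYNC_MODES = ("standard", "always", "disabled")


def zfs_set_sync_command(dataset, mode):
    """Build (but do not run) a 'zfs set sync=...' command."""
    if mode not in SYNC_MODES:
        raise ValueError(f"sync must be one of {SYNC_MODES}, got {mode!r}")
    return ["zfs", "set", f"sync={mode}", dataset]


cmd = zfs_set_sync_command("tank/db", "always")  # hypothetical dataset
print(shlex.join(cmd))  # zfs set sync=always tank/db
```

Returning a list rather than a string means the command could later be handed to subprocess.run without any shell quoting concerns, but whether and when to run it is left to the administrator.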

An unofficial requirement when picking a device for an SLOG is to choose drives that perform well at a queue depth of one. Because synchronous writes do not arrive in the large batches most SSDs are best at, a standard SSD may actually cause a performance loss. Intel Optane drives are generally considered among the best drives for use as an SLOG, due to their high speeds at low queue depth and their power-loss protection, which lets in-flight writes complete in the event of a power failure. Power-loss protection in your SLOG device is important if you want it to be able to fulfil its purpose of saving data.

Reads

Just like writes, ZFS caches reads in system RAM. ZFS calls its read cache the "adaptive replacement cache" (ARC). It is a modified version of IBM's ARC, and it is smarter than the average read cache thanks to the more complex algorithm it uses.

ARC

The ARC functions by storing the most recently used and most frequently used data in RAM. Unlike the ZIL, it is a true cache, as the data held in the ARC in memory also exists in the storage pool on disk. It is only in the ARC to help speed up read performance, which it generally does an excellent job of. A large ARC can take up a lot of RAM, but the ARC will release memory as other applications need it, and its maximum size can be set to whatever you think is optimal for your system.

ARC splits its space between most recently used and most frequently used data, and it shifts the balance toward one or the other whenever a cold hit occurs. A cold hit (often called a ghost hit) occurs when a piece of data is requested that was previously cached, but has already been pushed out to allow the ARC to store new data. ZFS keeps track of what data was stored in the cache after it is removed, which is what makes it possible to recognize cold hits. As new data comes in, data that hasn't been used in a while, or that has not been used as much as the new data, will be pushed out.
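
The balancing act described above can be sketched in a few dozen lines. The code below is a simplified version of the published ARC algorithm (Megiddo and Modha), not the modified variant ZFS ships: it tracks keys only, uses integer arithmetic for the adjustment step, and all names (ToyARC, t1, t2, b1, b2, p) follow the paper rather than ZFS. The access method returns "ghost" for what this article calls a cold hit.

```python
from collections import OrderedDict


class ToyARC:
    """Simplified ARC: two cached lists plus two ghost lists of keys."""

    def __init__(self, capacity):
        self.c = capacity
        self.p = 0                # target size of the recency list t1
        self.t1 = OrderedDict()   # cached, seen once recently
        self.t2 = OrderedDict()   # cached, seen at least twice
        self.b1 = OrderedDict()   # ghosts: keys evicted from t1
        self.b2 = OrderedDict()   # ghosts: keys evicted from t2

    def _replace(self, key):
        # Evict from t1 if it is over its target size, otherwise from t2.
        if self.t1 and (len(self.t1) > self.p or
                        (key in self.b2 and len(self.t1) == self.p)):
            old, _ = self.t1.popitem(last=False)
            self.b1[old] = None
        else:
            old, _ = self.t2.popitem(last=False)
            self.b2[old] = None

    def access(self, key):
        """Return 'hit', 'ghost' or 'miss' for one read of key."""
        if key in self.t1 or key in self.t2:
            self.t1.pop(key, None)
            self.t2.pop(key, None)
            self.t2[key] = None           # promote to the frequency list
            return "hit"
        if key in self.b1:
            # Evicted too early from the recency side: grow its share.
            step = max(len(self.b2) // len(self.b1), 1)
            self.p = min(self.c, self.p + step)
            self._replace(key)
            del self.b1[key]
            self.t2[key] = None
            return "ghost"
        if key in self.b2:
            # Evicted too early from the frequency side: shrink t1's share.
            step = max(len(self.b1) // len(self.b2), 1)
            self.p = max(0, self.p - step)
            self._replace(key)
            del self.b2[key]
            self.t2[key] = None
            return "ghost"
        # Complete miss: make room if needed, then insert into t1.
        l1 = len(self.t1) + len(self.b1)
        total = l1 + len(self.t2) + len(self.b2)
        if l1 == self.c:
            if len(self.t1) < self.c:
                self.b1.popitem(last=False)
                self._replace(key)
            else:
                self.t1.popitem(last=False)
        elif total >= self.c:
            if total == 2 * self.c:
                self.b2.popitem(last=False)
            self._replace(key)
        self.t1[key] = None
        return "miss"


cache = ToyARC(capacity=3)
results = [cache.access(k) for k in [1, 2, 1, 3, 4, 2]]
print(results)   # ['miss', 'miss', 'hit', 'miss', 'miss', 'ghost']
print(cache.p)   # 1
```

In this run, key 2 is pushed out to make room for key 4 and then requested again. The cache recognizes it from the ghost list and raises p from 0 to 1, giving recently used data a larger share from then on.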

The more RAM your system has the better, as it will just give you more read performance. There will be physical and cost limitations to adding more ARC due to motherboard RAM slots and budget constraints. And unfortunately, it is impossible to actually download more RAM as much as you may try. If your ARC is full without having a high enough hit rate and your system already has a large amount of RAM, you may want to consider adding a level 2 ARC (L2ARC).

L2ARC

L2ARC lives on an SSD rather than in much quicker RAM. It is still far faster than spinning disks, though, so when the hit rate for the ARC is low, adding an L2ARC could have some performance benefits. Instead of going to the HDDs to find data, the system will look in RAM first and then on the SSD. As a rule of thumb, an L2ARC is usually considered if the ARC hit rate is below 90% on a system that already has 64 GB of RAM or more.
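
The rule of thumb above is easy to check with a few lines of code. The sketch below assumes a Linux system running OpenZFS, where ARC counters are exposed as "name type data" rows in /proc/spl/kstat/zfs/arcstats; the sample text is made up for illustration, and the function names are hypothetical.

```python
def parse_kstat(text):
    """Parse 'name type data' rows into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Header rows do not match this shape and are skipped.
        if len(parts) == 3 and parts[2].isdigit():
            stats[parts[0]] = int(parts[2])
    return stats


def arc_hit_ratio(stats):
    total = stats["hits"] + stats["misses"]
    return stats["hits"] / total if total else 0.0


def l2arc_worth_considering(hit_ratio, ram_gib,
                            min_ratio=0.90, min_ram_gib=64):
    """Apply the article's rule of thumb (thresholds are adjustable)."""
    return hit_ratio < min_ratio and ram_gib >= min_ram_gib


# Made-up sample in the arcstats layout, not real measurements.
SAMPLE = """\
name                            type data
hits                            4    850000
misses                          4    150000
"""

ratio = arc_hit_ratio(parse_kstat(SAMPLE))
print(round(ratio, 2))                              # 0.85
print(l2arc_worth_considering(ratio, ram_gib=128))  # True
```

On a real system you would pass the contents of the arcstats file instead of SAMPLE. Keep in mind that these counters accumulate since boot, so a freshly started server will show a lower hit rate than it settles at.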

The L2ARC only starts to fill once your ARC is full and is clearing space to make room for new data that its algorithm has deemed more important. The data being pushed out is what gets moved to the L2ARC, so it could take a long time for the L2ARC to fill up.

Note: L2ARC does not need to be mirrored like an SLOG should be, because all of the data the L2ARC stores still exists in the pool.

Also, the L2ARC uses a certain amount of RAM to keep track of what data is stored on it, so upgrading RAM is generally better done first, before considering an L2ARC. SSDs are more expensive than HDDs but far cheaper than RAM, so there may be a price/performance trade-off to weigh when making this decision as well.
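
The RAM cost of an L2ARC can be estimated as one small header per cached block. The sketch below shows the arithmetic only: the per-block header size differs between ZFS versions, so the 100 bytes used here is a placeholder assumption rather than a measured value, and the device size and average block size are hypothetical too.

```python
GIB = 1024 ** 3


def l2arc_header_ram_bytes(l2arc_bytes, avg_block_bytes, header_bytes):
    """Estimate the RAM used to index a full L2ARC device."""
    blocks = l2arc_bytes // avg_block_bytes
    return blocks * header_bytes


overhead = l2arc_header_ram_bytes(
    l2arc_bytes=512 * GIB,       # hypothetical cache device size
    avg_block_bytes=16 * 1024,   # hypothetical average block size
    header_bytes=100,            # ASSUMPTION: placeholder, check your version
)
print(overhead / GIB)  # 3.125
```

The smaller the average block, the more headers are needed, so a large L2ARC in front of a small-block workload takes the biggest bite out of the ARC it is meant to extend.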

Summary

ZIL is where synchronous writes are logged before the confirmation is sent back to the client. By default it exists as part of your storage pool.
SLOG is a separate device for the ZIL to exist on. It could improve performance for some specific uses. However, its primary function is to save data that would otherwise be lost in case of a power failure.
ARC is a portion of RAM used to cache data to speed up read performance.
L2ARC is an extension of the ARC, typically on an SSD. It is meant to increase the size of the ARC while getting around the physical constraints of adding more RAM.