Maximizing the value of a VMware caching solution involves more than developing algorithms to determine the hottest data and figuring out ways to keep that data in the cache. It also means getting the most out of that expensive flash capacity supporting the VMware cache implementation. This requires understanding VMware data types and eliminating redundant data in the cache. It’s also important to understand how data is written to NAND flash to get the best performance from the flash medium and to minimize the impact of those writes on flash endurance.
SSDs are more expensive than hard drives, so it’s important to use flash capacity efficiently. This means being more selective about which data is subject to cache acceleration. One way to do this is to take a more granular approach in the data selection process. By comparing smaller data objects, less cache-worthy data can be weeded out, saving space on the flash cache for other data.
As an example, some caching solutions allow putting an entire LUN into cache. But this wouldn’t be a very efficient use of the SSD resource since the entire LUN may contain some VM data that need not be accelerated. Other caching solutions allow individual virtual disks to be cached. But this can be less than ideal as well, since each virtual disk can still have multiple snapshots, not all of which are equally important from a caching point of view. This is particularly true in the case of a “linked clone” VDI deployment with a common base, as described below. In these types of situations, caching individual VMDK files can be more effective.
When a snapshot is taken, VMware makes the original VMDK file read-only and creates a new VMDK to record the subsequent changes made to that virtual disk. Over time, the virtual disk becomes a chain of these sequentially created VMDK files. Each additional VMDK represents snapshot data that may not be important from a caching point of view, but is needed to recreate the state of the original virtual disk at each point in time that a snapshot was taken. So each virtual disk can end up containing many VMDK files that are important for data protection but do not all need to be cached.
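The chain described above can be sketched as a simple data structure. This is an illustrative model only; the class and method names (`VirtualDisk`, `take_snapshot`) are not VMware APIs, and the delta-file naming merely mimics vSphere's convention.

```python
# Illustrative model of a virtual disk as a chain of VMDK files.
# Not a VMware API; names and behavior are a simplified sketch.

class Vmdk:
    def __init__(self, name, read_only=False):
        self.name = name
        self.read_only = read_only

class VirtualDisk:
    def __init__(self, base_name):
        # The base VMDK starts out writable.
        self.chain = [Vmdk(base_name)]

    def take_snapshot(self, delta_name):
        # The current head of the chain becomes read-only;
        # all subsequent changes go to a new delta VMDK.
        self.chain[-1].read_only = True
        self.chain.append(Vmdk(delta_name))

    def active_vmdk(self):
        # Only the newest link in the chain accepts writes.
        return self.chain[-1]

disk = VirtualDisk("vm1.vmdk")
disk.take_snapshot("vm1-000001.vmdk")
disk.take_snapshot("vm1-000002.vmdk")
# The chain now has three links; only the last is writable, and the
# earlier links hold snapshot data needed for point-in-time recovery.
```

A caching layer that understands this chain can treat each link differently, rather than caching the whole virtual disk as one opaque object.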
Caching applications that can discriminate down to the VMDK level, and that understand the relationship of these VMDK files in the context of snapshots, can provide the flexibility to ignore the VMDK files that don’t need to be accelerated. This fine-grained control in the caching process creates a ‘more for less’ situation: more data can be accelerated with an existing amount of flash, or less flash can be used to accelerate an existing data set.
Caching solutions like SanDisk’s FlashSoft for VMware vSphere give users this kind of flexibility, allowing them to accelerate just the base VMDK, just the snapshot VMDK files or the entire virtual disk (the base and all snapshots). A typical use case for this kind of caching granularity is with linked clones in a VDI deployment.
In this scenario, the original VMDK would be the ‘golden copy’ and would typically be stored in flash, with snapshots taken for each virtual desktop. Since the original VMDK is usually manually allocated to an SSD partition, it doesn’t need to be accelerated through the caching software as well. But the linked clones, representing the data for each individual desktop, do need to be in the cache. So again, being able to populate the cache at the VMDK level results in much better cache efficiency. This time the efficiency comes from omitting the redundancy of caching the base VMDK for each virtual desktop, while still caching each of the relevant snapshot VMDKs.
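A VMDK-level admission policy like the one described for linked clones can be sketched as follows. This is a hypothetical illustration, not FlashSoft's actual API; the assumption that delta VMDKs carry a `-0000NN` suffix follows vSphere's usual naming convention but is hard-coded here purely for the example.

```python
# Illustrative VMDK-level cache-admission policy (not FlashSoft's actual API).
# Assumes snapshot delta files follow vSphere's "-0000NN" naming convention.

def should_cache(vmdk_name, policy):
    """Decide whether a VMDK belongs in the cache.

    policy: 'base'      -> accelerate only the base VMDK
            'snapshots' -> accelerate only the snapshot (delta) VMDKs
            'all'       -> accelerate the entire virtual disk
    """
    is_delta = "-0000" in vmdk_name
    if policy == "all":
        return True
    if policy == "base":
        return not is_delta
    if policy == "snapshots":
        return is_delta
    raise ValueError("unknown policy: %s" % policy)

# Linked-clone VDI case: the golden base image already lives on an SSD
# partition, so only the per-desktop deltas need cache acceleration.
vmdks = ["golden.vmdk", "golden-000001.vmdk", "golden-000002.vmdk"]
cached = [v for v in vmdks if should_cache(v, "snapshots")]
```

Here the base image is admitted zero times instead of once per desktop, which is exactly the redundancy the paragraph above describes eliminating.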
Dynamic cache space allocation is another way to improve how efficiently a caching solution uses flash. In a VMware environment, virtual machines can be created in a short time, each potentially requiring its own flash capacity. Because different VMs cannot “borrow” idle cache space from each other, caching applications that allocate a fixed amount of the SSD to each VM end up wasting a large amount of available flash space. They also create the need to de-allocate that space when VMs are destroyed.
Even worse is the situation when all the cache space on the SSD has been allocated, and a new VM is created, or a VM is migrated to the host. Reallocation of the cache will typically require the destruction of all existing caches on the SSD, the re-creation of new static caches for all VMs on the host and a restart of all VMs. Caching solutions can address this problem by automatically allocating only as much flash capacity as needed when VMs are created, and then returning that space to the cache pool when they’re decommissioned.
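Dynamic allocation from a shared pool can be sketched in a few lines. This is a minimal illustration of the idea, not any vendor's implementation; capacities are in gigabytes purely for readability.

```python
# Minimal sketch of dynamic cache-space allocation from a shared flash pool.
# VMs draw only what they need and return it when decommissioned, so no
# fixed per-VM carve-out sits idle and no pool-wide rebuild is required.

class CachePool:
    def __init__(self, capacity_gb):
        self.capacity = capacity_gb
        self.allocations = {}          # vm_id -> allocated GB

    def free_space(self):
        return self.capacity - sum(self.allocations.values())

    def allocate(self, vm_id, needed_gb):
        # Grant space only if it fits; nothing is pre-reserved per VM.
        if needed_gb > self.free_space():
            return False
        self.allocations[vm_id] = needed_gb
        return True

    def release(self, vm_id):
        # Space returns to the pool when a VM is destroyed or migrated away.
        self.allocations.pop(vm_id, None)

pool = CachePool(100)
pool.allocate("vm1", 60)
pool.allocate("vm2", 40)
pool.release("vm1")          # 60 GB immediately available for new VMs
```

The key contrast with static carve-outs is the `release` path: freed capacity becomes available to the next VM without tearing down or restarting the caches of other VMs on the host.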
Flash has finite endurance, meaning it eventually wears out, so extending that endurance is an important part of maximizing a flash investment. NAND flash has some unique characteristics that require a different write and erase process than is used for magnetic disk storage. Flash is written in pages but must be erased in much larger block-sized increments. This creates several overhead processes to conduct the actual erasure and to manage the data that’s not being erased. Since flash can only sustain a fixed number of these program/erase cycles, it’s important to minimize their occurrence.
By writing data mostly in a sequential manner (writing to the SSD device as a “circular buffer”), a caching solution can reduce the erase activity the NAND flash cells are exposed to. This can increase the endurance of the flash cells and improve write performance of the cache as well.
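The circular-buffer write pattern can be illustrated with a small sketch. This is a conceptual model, not SanDisk's implementation: real caching software writes to raw flash blocks, while this example just tracks a sequentially advancing head over a fixed set of slots.

```python
# Sketch of treating the cache device as a circular buffer: writes always
# go to the next slot in sequence and wrap at the end, so the flash device
# sees large sequential writes instead of scattered random overwrites.

class CircularLog:
    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.head = 0                   # next slot to write; only ever advances
        self.blocks = [None] * num_blocks

    def append(self, data):
        slot = self.head
        self.blocks[slot] = data        # overwrite-in-place never happens mid-buffer
        self.head = (self.head + 1) % self.num_blocks
        return slot

log = CircularLog(3)
log.append("a")    # slot 0
log.append("b")    # slot 1
log.append("c")    # slot 2
log.append("d")    # wraps: slot 0 is reused, oldest data is evicted first
```

Because whole regions are overwritten in order, the SSD's controller can erase blocks ahead of the write head instead of erasing and rewriting partially valid blocks, which is what reduces write amplification and wear.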
Maximizing a flash investment focuses mainly on the efficiency of the cache and the way it writes data to the underlying flash storage area. But an important operational characteristic that can impact the return on that cache investment is cache persistence, or how the caching solution handles cache rebuilds.
When servers or applications are restarted the cache needs to be rebuilt or repopulated with the current data. Since the cache is only keeping a copy of data, this rebuild process isn’t like a restore in the backup sense, but it does require that the metadata be available so that the appropriate data can be accessed from the primary data store.
Caching solutions that store operational metadata on the same non-volatile flash device where the data resides (not just in RAM) can provide the cache persistence needed to greatly improve uptime for the system. This means not purging the cache until absolutely necessary. For example, if a VM or the host is rebooted without changing the VM’s contents outside the boot cycle, the cache for the VM is not purged. That way it continues to improve application performance even after the VM or the host reboots.
By avoiding unnecessary purging of cached data, cache rebuild times can be reduced, providing consistent application performance improvements for a longer period of time. On the other hand, by purging the cached data when it should be purged, data coherence can be protected. This is another area where granularity matters. By purging the data that belongs only to the relevant VMDK files on a host, it’s possible to ensure that cache contents are as ‘warm’ as possible for unaffected VMDK files on that host.
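Per-VMDK purging can be sketched with cache entries keyed by their source VMDK. Again this is an illustrative model, not a vendor API: a real cache indexes flash blocks, while this example just filters an in-memory map.

```python
# Illustrative per-VMDK purge: invalidate only the entries belonging to the
# affected VMDK files, leaving the rest of the cache warm. Cache keys here
# are (vmdk_name, block_number) tuples, an assumption for the sketch.

def purge_vmdks(cache, vmdk_names):
    affected = set(vmdk_names)
    return {key: val for key, val in cache.items() if key[0] not in affected}

cache = {
    ("vm1.vmdk", 0): "a",
    ("vm1.vmdk", 1): "b",
    ("vm2.vmdk", 0): "c",
}
# Purging vm1's VMDK leaves vm2's cached blocks warm.
warm = purge_vmdks(cache, ["vm1.vmdk"])
```

Coarser-grained designs would have to drop all three entries here; VMDK-level granularity preserves the unaffected block and the performance benefit that goes with it.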
Flash is a great way to increase application performance, especially in a VMware environment. Flash does a good job of handling the “I/O Blender” that’s created when multiple VMs on a host or cluster start accessing the same storage resources. Flash is expensive so caching is often used to help maximize that precious capacity. But how can you make sure that a caching solution is making the best use of that flash?
Solutions that cache at the VMDK level can reduce the redundancy created by the snapshot process and increase the effective size of the cache. Flash efficiency and endurance can be improved as well by caching solutions that leverage sequential data write buffering and dynamic flash allocation. Together these technologies can help a caching solution maximize a flash investment in a VMware environment.