It used to be that the storage media was the slowest component of the storage infrastructure. Now, thanks to flash, it is the fastest. Bottlenecks begin long before IO leaves the storage system and traverses the network. In fact, the interface between the flash controller and each NAND chip, known as the Common Flash Memory Interface (CFI), has significantly more bandwidth than the PCIe interface used by NVMe flash drives. Networks will get faster, but they will remain one of the major roadblocks for organizations looking to optimize NVMe performance.
The bandwidth of a four-lane (x4) PCIe interface is less than a third of the total bandwidth going into an 8-channel flash controller and less than a sixth of what a 16-channel flash controller will support. PCIe speeds will double as Gen4 and Gen5 come to market, but CFI clock speeds will double as well. Improved CFI clock speeds will be adopted faster than future PCIe architectures because the flash vendor is in control of the whole process.
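The ratios above can be checked with a rough calculation. The sketch below assumes a PCIe Gen3 link (8 GT/s per lane with 128b/130b encoding) and a hypothetical 1,600 MB/s per flash channel. Neither figure is stated in this article, and the function names are illustrative only.

```python
# Illustrative assumptions, not figures taken from the article.
PCIE_GEN3_GT_PER_LANE = 8.0      # GT/s per lane (PCIe Gen3 signalling rate)
PCIE_GEN3_ENCODING = 128 / 130   # 128b/130b line-code efficiency
ASSUMED_CHANNEL_MB_S = 1600.0    # hypothetical per-channel flash rate, MB/s


def pcie_bandwidth_mb_s(lanes,
                        gt_per_lane=PCIE_GEN3_GT_PER_LANE,
                        encoding=PCIE_GEN3_ENCODING):
    """One-direction PCIe bandwidth in MB/s, ignoring packet overhead."""
    megabits_per_lane = gt_per_lane * 1000 * encoding
    return lanes * megabits_per_lane / 8


def flash_bandwidth_mb_s(channels, channel_mb_s=ASSUMED_CHANNEL_MB_S):
    """Aggregate bandwidth between the controller and its flash channels."""
    return channels * channel_mb_s


def host_to_flash_ratio(lanes, channels):
    """Fraction of the internal flash bandwidth the host link can carry."""
    return pcie_bandwidth_mb_s(lanes) / flash_bandwidth_mb_s(channels)


if __name__ == "__main__":
    for channels in (8, 16):
        ratio = host_to_flash_ratio(4, channels)
        print(f"x4 PCIe vs {channels}-channel controller: {ratio:.2f}")
```

Under these assumptions the x4 link carries a little under a third of the 8-channel total and a little under a sixth of the 16-channel total. A different per-channel rate would shift the ratios accordingly.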
To get around the problem of network bottlenecks, IT architects have tried several approaches. The most common is to move storage inside the application server, which eliminates the storage network but introduces other problems. Direct attached storage architectures seldom use capacity as efficiently as shared storage networks. They also create a single point of failure, since all storage is contained within the application server. Data can be mirrored or replicated to another host for redundancy, but doing so reintroduces network latency and consumes twice as much storage capacity.
Another challenge with the direct attached workaround is that applications often need to be “sharded” so that different servers can operate on different parts of the application or so that user loads can be distributed. Both use cases introduce massive amounts of complexity.
While direct attached storage does eliminate network latency, it does nothing to relieve the internal bottleneck between the PCIe bus and the CFI. It can also complicate the effective utilization of the CPUs in those systems.
Another approach is to use an advanced network and continue to share storage. Until now, the only option was to keep increasing network bandwidth. The problem is that each jump in bandwidth is more expensive than the last, and it requires a higher quality cable infrastructure because light loss matters more as bandwidth increases. In many cases bandwidth is not the real issue; network latency is. Next generation networking technology like NVMe over Fabrics (NVMe-oF) helps by reducing latency. NVMe-oF approaches the latency of direct attached storage but, again, it is expensive and still doesn’t circumvent the performance gap between PCIe and CFI. It also pushes the storage physically farther from the compute via the network cabling, which in turn complicates CPU utilization.
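A simple model shows why latency, rather than bandwidth, tends to dominate small IOs. The sketch below treats the time for one IO as a fixed latency plus the time the data spends on the wire. The link speeds and latency values used in the example are hypothetical, chosen only to illustrate the shape of the trade-off.

```python
def transfer_time_us(size_bytes, latency_us, bandwidth_gbit_s):
    """Time for one IO in microseconds: fixed latency plus time on the wire."""
    bits_per_us = bandwidth_gbit_s * 1000  # 1 Gbit/s moves 1000 bits per microsecond
    return latency_us + size_bytes * 8 / bits_per_us


if __name__ == "__main__":
    io_size = 4096  # a 4 KiB IO

    # Hypothetical values: same latency, four times the bandwidth.
    slow_link = transfer_time_us(io_size, latency_us=100, bandwidth_gbit_s=25)
    fast_link = transfer_time_us(io_size, latency_us=100, bandwidth_gbit_s=100)

    # Hypothetical values: same bandwidth, one tenth of the latency.
    low_latency = transfer_time_us(io_size, latency_us=10, bandwidth_gbit_s=25)

    print(f"25 Gbit/s, 100 us latency:  {slow_link:.2f} us")
    print(f"100 Gbit/s, 100 us latency: {fast_link:.2f} us")
    print(f"25 Gbit/s, 10 us latency:   {low_latency:.2f} us")
```

With these example values, quadrupling the bandwidth saves about one microsecond per 4 KiB IO, while cutting the latency saves roughly ninety.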
The reality is that there will more than likely always be a performance gap between the CFI bus and the PCIe bus, which means the organization will not be able to tap into the full performance of current and future generations of flash or to manage data growth properly.
A way to solve this problem, instead of merely working around it, is to perform the computation within the flash controller architecture so that the performance of the CFI is fully available. The process is similar to a map/reduce process where a large data set is queried, leveraging the high performance of flash and the CFI. Then the smaller result is sent across the standard network for further analysis. Computing on the storage device (computational storage) enables organizations to take full advantage of flash performance with almost no modification to their applications or system hardware.
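The map/reduce-style flow described above can be sketched as a small simulation that counts how many bytes cross the host interface in each case. Everything here is hypothetical: `SimulatedDrive`, `query_on_device`, and the 64-byte record size are stand-ins for illustration and do not represent any vendor's API.

```python
from dataclasses import dataclass

RECORD_BYTES = 64  # hypothetical fixed record size


@dataclass
class SimulatedDrive:
    """Stand-in for a flash drive that counts bytes sent to the host."""
    records: list
    bytes_to_host: int = 0

    def read_all(self):
        """Conventional path: every record is shipped to the host."""
        self.bytes_to_host += len(self.records) * RECORD_BYTES
        return list(self.records)

    def query_on_device(self, predicate):
        """Computational storage path: filter on the drive, ship only matches."""
        matches = [r for r in self.records if predicate(r)]
        self.bytes_to_host += len(matches) * RECORD_BYTES
        return matches


def compare(records, predicate):
    """Run the same query both ways; return results and bytes moved for each."""
    host_side = SimulatedDrive(records)
    host_result = [r for r in host_side.read_all() if predicate(r)]

    on_device = SimulatedDrive(records)
    device_result = on_device.query_on_device(predicate)

    return host_result, host_side.bytes_to_host, device_result, on_device.bytes_to_host


if __name__ == "__main__":
    readings = list(range(100))  # values 0..99
    host_result, host_bytes, device_result, device_bytes = compare(
        readings, lambda value: value > 90
    )
    print(f"same answer: {host_result == device_result}")
    print(f"host-side filter moved {host_bytes} bytes")
    print(f"on-device filter moved {device_bytes} bytes")
```

Both paths return the same matches; the difference is that the on-device query sends only the matching records across the host interface, which is the effect the paragraph above describes.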
In our next blog we dive deeper into computational storage: how it works and which use cases are the most likely to benefit.
Sponsored by NGD Systems