Ready, speaking to ComputerWeekly.com, said he believed the current storage architecture has to change, and that non-volatile memory express (NVMe) flash could not realise its potential with the storage controller in the I/O path.
“Up until now storage has been the bottleneck,” he said. “Now it’s everything else. It’s been a while since we heard of the central processing unit (CPU) being the bottleneck but now it is.
“You have to get rid of the controller. There’s no way you can have a controller with NVMe. Having a controller is unnecessary. This is as big a shift as we have seen since the introduction of the SAN itself.”
Ready’s comments followed Scale Computing’s announcement that, in testing, it had achieved 150µs latency with a four-node Scale HC3 cluster using NVMe with NAND flash, and latency as low as 20µs with NVMe and Intel Optane 3D XPoint media.*
The same NAND/NVMe-equipped cluster achieved IOPS of 2.6 million.
That compares with latencies of several hundred microseconds up to milliseconds and above for “standard” flash drives, ie those connected via small computer system interface (SCSI) and serial-attached SCSI (SAS). Existing flash products are likely to see input/output operations per second (IOPS) of a few hundred thousand.
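The latency and IOPS figures quoted for the NAND/NVMe cluster can be related through Little's law, which says the average number of requests in flight equals throughput multiplied by average latency. The sketch below is illustrative arithmetic only. It assumes the 150µs and 2.6 million IOPS figures come from the same steady-state run and that the latency is a mean, neither of which Scale has stated; the function and variable names are hypothetical.

```python
# Little's law: mean requests in flight = throughput x mean latency.
# Assumption: both figures describe the same steady-state test run.

def outstanding_ios(iops: float, latency_s: float) -> float:
    """Return the average number of I/Os in flight for a given load."""
    return iops * latency_s

cluster_iops = 2.6e6      # IOPS quoted for the four-node NAND/NVMe cluster
nand_latency_s = 150e-6   # 150 microseconds, as quoted
vm_count = 24             # virtual machines in the test, per the footnote

in_flight = outstanding_ios(cluster_iops, nand_latency_s)
per_vm = in_flight / vm_count

print(f"I/Os in flight across the cluster: about {in_flight:.0f}")
print(f"I/Os in flight per virtual machine: about {per_vm:.1f}")
```

Under those assumptions the quoted numbers imply a few hundred I/Os in flight across the cluster at any moment, which is the kind of parallelism NVMe's deep queues are designed to absorb.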
Scale says it can achieve such low latencies by eliminating the controller, storage protocols and file systems from the mix.
“Legacy workloads are often poorly written and don’t take advantage of multi-threading,” said Ready. “Every piece of data goes through file systems and jumps around in the VMs.”
“Also there is context switching between user space and kernel space which has existed for all time and it didn’t matter because 20µs for a call didn’t matter. Now it does, especially with Optane, 3D XPoint.”
Scale’s tests with (as yet publicly unavailable) HyperCore-Direct hardware were done with Windows 7, chosen, said Ready, for its perceived performance challenges and because many users still use older Windows platforms.
“With Windows 7 virtual machines you have the virtualisation stack, the virtual controller that sits on a file system, for example virtual machine file system (VMFS), and a file system again on the virtual operating system. Then maybe it’s on a NetApp box with its write anywhere file layout (WAFL) file system too,” he said.
“Every time data traverses a file system the cost is 100µs or 200µs.”
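Ready's per-traversal figure can be turned into a rough latency budget for the layered stack he describes. The sketch below simply multiplies the quoted 100µs to 200µs cost by the three file systems he names (VMFS, the guest file system and WAFL). It assumes each I/O crosses every layer exactly once and that the costs add linearly, which is a simplification rather than anything Scale has measured; the names used are hypothetical.

```python
# Rough latency budget for a layered file system stack.
# Assumption: each I/O traverses every layer once and costs add linearly.

FS_COST_US = (100, 200)  # per-traversal cost range quoted by Ready, in microseconds

layers = [
    "VMFS (hypervisor datastore)",
    "guest file system (Windows 7 VM)",
    "WAFL (NetApp array)",
]

def stack_overhead_us(n_layers: int, cost_range=FS_COST_US) -> tuple[int, int]:
    """Return the (low, high) file system overhead in microseconds."""
    low_cost, high_cost = cost_range
    return n_layers * low_cost, n_layers * high_cost

low, high = stack_overhead_us(len(layers))
print(f"{len(layers)} file system layers add {low}-{high} µs per I/O")
# prints "3 file system layers add 300-600 µs per I/O"
```

On that estimate the file system layers alone would cost more than the 150µs Scale quotes for NAND over NVMe, and many times the 20µs it quotes for Optane.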
Doing away with storage controller functionality
According to Ready, Scale’s performance results with NVMe are possible because it can do away with storage controller functionality, in particular the need to translate protocols (especially SCSI).
“As far as the application is concerned there is no protocol,” he said. “The application and the virtualisation stack believe they’re talking to direct-attached storage. But actually blocks are shared throughout the cluster. Work is handled by [Scale’s software-defined storage] Scribe and protocols are eliminated.
“We get near bare metal NVMe performance with a streamlined data path from app to NVMe. It’s all contained in user space, with direct block access from applications to drives.”
NVMe is a peripheral component interconnect express (PCIe) based protocol written for flash. It allows huge increases in the number of input/output (I/O) queues and the depth of those queues compared to existing SCSI and SATA storage protocols and enables flash to operate at orders of magnitude greater performance.
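To put numbers on the queueing difference, the sketch below compares commonly cited upper limits: a single queue of 32 commands for SATA drives behind AHCI, against roughly 64,000 I/O queues of roughly 64,000 commands each for NVMe. These limits are not taken from this article, they are specification ceilings rather than what shipping drives expose, and the class and variable names are hypothetical.

```python
# Compare theoretical command queueing limits of SATA (AHCI) and NVMe.
# Assumption: figures are commonly cited specification maximums, not device values.
from dataclasses import dataclass

@dataclass(frozen=True)
class QueueModel:
    name: str
    queues: int   # number of I/O queues the interface allows
    depth: int    # commands each queue can hold

    @property
    def max_outstanding(self) -> int:
        """Upper bound on commands that can be queued at once."""
        return self.queues * self.depth

AHCI_SATA = QueueModel("SATA (AHCI)", queues=1, depth=32)
NVME = QueueModel("NVMe", queues=65_535, depth=65_536)

for model in (AHCI_SATA, NVME):
    print(f"{model.name:12} {model.queues:>6} queue(s) x "
          f"{model.depth:>6} commands = {model.max_outstanding:,}")

ratio = NVME.max_outstanding // AHCI_SATA.max_outstanding
print(f"NVMe ceiling is about {ratio:,} times the AHCI ceiling")
```

In practice drives and drivers use only a fraction of the NVMe ceiling, typically one queue pair per CPU core, but that is still enough to let many cores submit I/O in parallel without contending for a single queue.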
But NVMe faces obstacles as a shared storage medium. It can operate at bare-metal speeds as server flash or in the storage controller, but when you try to make it work as part of a shared storage setup behind a controller, you lose I/O performance.
That’s because the functions of the controller are – or have been – vital to shared storage. At a basic level the controller is responsible for translating protocols and physical addressing, with the associated tasks of configuration and provisioning of capacity, plus the basics of Raid data protection.
* Scale’s test results are claimed to come from a four-node HyperCore-Direct configuration with a mixed (90% read/10% write) random workload to 24 virtual machines.