ABSTRACT
SSDs were designed to emulate HDDs so they could integrate easily into existing compute infrastructure. As applications place ever-greater stress on SSDs, the adverse side effects of HDD emulation are amplified. These side effects are becoming significant, even intolerable, and a new approach is needed.
EXECUTIVE SUMMARY
Organizations invest billions of dollars annually to improve database and application responsiveness. These efforts seem never-ending and often amount to shaving a few milliseconds (ms) off storage-related latency, inching toward the ideal goal of zero.
A recent paper published by analyst firm IDC and Lumen (Edge Computing Solutions Powering the 4th Industrial Revolution) indicated that 90 percent of business leaders need a latency of 10 ms or less to ensure the success of their applications, and 75 percent require a latency of 5 ms or less for edge initiatives.
In a 2022 OCP conference presentation, Meta cited latency stalls of 1 s every minute and 10 s every 3 days across its SSD fleet, causing a significant impact on services. This issue is pervasive across the data center market and is driving multi-billion-dollar industry initiatives in search of a solution.
BACKGROUND
SSDs have become a critical component of data center infrastructure. They have also become a significant and ever-increasing part of the data center budget, with their rapid adoption driven by a solid value proposition. Yet more value remains to be gained beyond simply substituting SSDs for HDDs.
The decade of the 1990s saw a revolutionary change in IT applications and technical infrastructure. Early in the decade, computer applications were predominantly record-keeping and online transaction systems. Application data was primarily text and stored on hard drives or tape. The user communities were usually small, particularly outside the U.S., and accessed the application and associated data with dumb terminals directly attached to a mainframe or minicomputer. Remote users connected to the computer through a dialup modem.
The look and feel of computer applications transformed dramatically during the 1990s. By the decade’s end, most enterprise applications were browser-based. The new applications required large amounts of data to be stored, processed, and moved across global networks. These demands drove radical advancements throughout the computing infrastructure. Innovation in network technology allowed applications to be accessed through ubiquitous broadband connections, enabling large and geographically diverse user communities to use a single application. Most data migrated from tape to HDDs.
The new millennium saw the broad adoption of data-centric computing infrastructure that enabled the collection, processing, and dissemination of content-rich information. The proliferation of complex applications drove growth and performance requirements on the storage infrastructure, especially for cloud-based applications and software-as-a-service (SaaS). The SSDs introduced into data center environments around 2008 continue displacing HDDs at an increasing pace. The higher-priced SSDs are justified when performance matters, since they deliver orders of magnitude better performance and latency than HDDs.
CHALLENGES
Architectural Constraints
SSD architecture hasn’t materially changed since its introduction. It consists of non-volatile memory connected to a specialized controller providing a storage interface, with everything packaged using an industry-standard form factor. The essential functions of the controller include memory management, storage interfacing, and HDD emulation.
SSD architecture has remained similar across significant advances in memory (e.g., storage class memory), interfaces (e.g., PCIe), protocols (e.g., NVMe), and packaging (e.g., EDSFF). Despite these innovations, SSD evaluations remain based on historical HDD selection criteria and best practices.
One of the significant hurdles to introducing SSDs was the decades of design and investment around HDDs as the primary storage device, so SSDs were intentionally designed to emulate HDDs. The rationale behind this decision was to accelerate adoption by minimizing integration efforts. Otherwise, using SSDs might require fundamental changes to hardware and software, including storage controllers, hypervisors, operating systems, databases, applications, and more.
As modern databases and applications push storage to its performance limits, the constraints of SSD architecture and the side effects of HDD emulation are becoming more disruptive. As costly examples, over-provisioning storage and serializing data streams are now standard practices for mitigating these challenges. Industry associations collaborate with subject matter experts to develop workarounds, including Zoned Namespaces (ZNS), Flexible Data Placement (FDP), and Software-Enabled Flash (SEF).
A permanent and more compelling solution will involve a new approach to controller firmware, device evaluation, qualification methodology, testing tools, and performance metrics.
Operational Factors
Almost all SSDs use internal NAND flash memory for persistent storage. However, this solid-state memory works very differently from the moving magnetic heads and magnetic disks of the HDDs it emulates. The operations that internal and external controllers perform to make solid-state memory emulate magnetic storage are complex, making it difficult (if not impossible) to hide all the differences between the technologies.
The side effects can be challenging to detect when the system, databases, and applications are not stressing the available performance and capacity of the SSDs. However, several issues arise as SSDs approach their design limits. These operational issues typically occur with real-world production workloads and may remain undetectable during benchmark testing.
Some of the most severe SSD operational issues include:
- Inconsistent performance and latency, especially when drives approach full capacity
- Intermittent latency spikes, sometimes exceeding multiple seconds
- Significant performance drop-off after weeks or months of usage
- Premature warranty wear-out based on rated endurance specifications
- Service life reductions that are workload-related
- Early device failures in production environments
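Many of these issues surface only when latency is tracked continuously at high percentiles rather than averaged. As a rough illustration (a minimal sketch, not a vendor tool), the following Python example issues a sustained stream of small synchronous writes against an assumed file path on the SSD under test and reports per-interval latency percentiles, so intermittent stalls and gradual slow-downs show up instead of disappearing into an average. The path, block size, and interval length are assumptions to adjust for your system; fsync latency through a filesystem is only a proxy for device write latency.

```python
import os
import statistics
import time

# Assumed target on the SSD under test; adjust for your system.
TARGET = "/mnt/ssd_under_test/latency_probe.bin"
BLOCK = b"\0" * 4096            # 4 KiB synchronous writes
WRITES_PER_INTERVAL = 1000

def run(intervals=60):
    fd = os.open(TARGET, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        for i in range(intervals):
            lat_ms = []
            for _ in range(WRITES_PER_INTERVAL):
                t0 = time.perf_counter()
                os.write(fd, BLOCK)
                os.fsync(fd)    # force the write down to the device
                lat_ms.append((time.perf_counter() - t0) * 1000)
            lat_ms.sort()
            # Report the median, 99th percentile, and worst case per interval;
            # spikes and drift that an average would hide are visible here.
            print(f"interval {i:3d}: "
                  f"p50={statistics.median(lat_ms):7.2f} ms  "
                  f"p99={lat_ms[int(0.99 * len(lat_ms))]:7.2f} ms  "
                  f"max={max(lat_ms):7.2f} ms")
    finally:
        os.close(fd)

if __name__ == "__main__":
    run()
```

Running a probe like this for hours on a nearly full, preconditioned drive is far more revealing than a short burst test on a fresh one.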
Technical Considerations
SSDs use complex controller algorithms to move data to other areas of NAND flash, making space available for future data. These extra data movement operations are interleaved with read and write commands from the host, generating additional background traffic that causes inconsistent performance and latency. Over time, these internal data movements create bottlenecks that degrade performance, latency, and endurance.
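To make the mechanism concrete, below is a deliberately simplified toy model of a flash translation layer, written in Python. It is a sketch built on stated assumptions (block and page counts, over-provisioning ratio, greedy victim selection, uniform random overwrites), not any controller's actual firmware. It shows how host overwrites leave stale copies behind, how garbage collection must copy still-valid pages before erasing a block, and how those copies inflate total NAND writes beyond what the host issued (write amplification).

```python
import random

# Toy flash-translation-layer model: NAND is organized into erase blocks of
# pages; overwrites invalidate old page locations, and garbage collection (GC)
# copies the remaining valid pages out of a victim block before erasing it.
PAGES_PER_BLOCK = 64
TOTAL_BLOCKS = 128
OVERPROVISION = 0.07                              # spare capacity hidden from the host
LOGICAL_PAGES = int(TOTAL_BLOCKS * PAGES_PER_BLOCK * (1 - OVERPROVISION))

def simulate(host_writes=200_000):
    block_of = {}                                 # logical page -> physical block
    pages = [set() for _ in range(TOTAL_BLOCKS)]  # valid logical pages per block
    free = list(range(TOTAL_BLOCKS))
    open_block, used, nand_writes = free.pop(), 0, 0

    def program(lpn):
        nonlocal open_block, used, nand_writes
        if used == PAGES_PER_BLOCK:               # open block full: take a fresh one
            open_block, used = free.pop(), 0
        old = block_of.get(lpn)
        if old is not None:
            pages[old].discard(lpn)               # the old copy becomes garbage
        block_of[lpn] = open_block
        pages[open_block].add(lpn)
        used += 1
        nand_writes += 1

    def garbage_collect():
        closed = [b for b in range(TOTAL_BLOCKS)
                  if b != open_block and b not in free]
        victim = min(closed, key=lambda b: len(pages[b]))   # fewest valid pages
        for lpn in list(pages[victim]):
            program(lpn)                          # background copy = extra NAND write
        free.append(victim)                       # victim can now be erased and reused

    for _ in range(host_writes):
        while len(free) < 2:                      # keep headroom for GC's own writes
            garbage_collect()
        program(random.randrange(LOGICAL_PAGES))  # random 4 KiB host overwrite

    print(f"host writes: {host_writes}, NAND writes: {nand_writes}, "
          f"write amplification: {nand_writes / host_writes:.2f}")

if __name__ == "__main__":
    simulate()
```

Even this toy model reports a write amplification well above 1, and the effect grows as over-provisioning shrinks and the drive fills, which is why performance and endurance degrade together.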
There are three overriding reasons why SSD issues have persisted:
- Users accept SSD performance because it is far better than the HDDs they replaced.
- Systems have many performance bottlenecks, so SSDs are not always critical path items.
- Standard benchmark testing does not reveal the production performance limitations of SSDs.
The prevalent use of legacy HDD-based qualification strategies for SSDs is a core contributor to these enduring performance issues. The physical properties of an HDD result in a simple, single-threaded, reliable performance model that is largely independent of workload. Thus, the simple streaming and random IO test suite used for HDDs fully characterizes their performance.
Conversely, SSD behavior is multi-threaded and interleaved with background processes, so it changes dramatically with the production workload. Therefore, the results obtained from applying simplistic HDD performance tests to SSDs are very misleading. Standard benchmarking and qualification tools do not expose the performance variability seen under production workloads, so the need to optimize SSDs against real workloads is not even apparent.
In addition to the complexities of HDD emulation, SSDs are being designed, optimized, and measured against the wrong criteria, which conceals the actual problem.
REAL WORLD SOLUTIONS
Some organizations may have applications that do not place high demands on storage performance, or they may have bottlenecks elsewhere in the system. These systems may be fine with commodity SSDs.
However, organizations that stress their storage systems with applications such as virtual machines and containers, artificial intelligence and machine learning, low-latency edge applications, and big data or fast data analytics are probably using SSDs that are not optimized for their environments.
Burlywood’s workload aware™ SSDs offer customers a Plug and Play NVMe 2.0 solution to simplify edge, core, and cloud performance optimization. They employ patented FlashOS™ technology designed for optimal operation on real workloads instead of standard benchmarks. FlashOS™ analyzes workload behavior at the flash controller level, then adapts for an optimal fit with the environment. The Burlywood solution consistently delivers higher ongoing performance, lower latency without spikes, and better reliability with longer lifespans.
Along with the FlashOS solution, Burlywood also provides a set of tools and services to help users move away from the HDD benchmark mindset to the workload-centric paradigm. These tools and expertise help you to:
- Gain insight into the effectiveness of your current SSDs under a workload
- Measure and analyze your production workload characteristics (illustrated after this list)
- Define a test suite to measure essential criteria of performance for that workload
- Use SSDs that are explicitly designed to deliver the performance, latency, and endurance required for your system under your workload
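As a flavor of what workload measurement involves (a generic illustration only, not Burlywood's tooling), the Python sketch below samples Linux's /proc/diskstats counters for an assumed device name and reports the read/write mix, IOPS, and average I/O size per interval. Profiles like this, rather than canned sequential/random tests, are the raw material for a workload-centric test suite.

```python
import time

# Assumptions: Linux, a block device named "nvme0n1", and the classic
# /proc/diskstats field layout (reads, sectors read, writes, sectors written).
DEVICE = "nvme0n1"          # adjust for your system
SECTOR = 512                # /proc/diskstats reports 512-byte sectors

def read_counters(device):
    with open("/proc/diskstats") as f:
        for line in f:
            fields = line.split()
            if fields[2] == device:
                reads, read_sectors = int(fields[3]), int(fields[5])
                writes, write_sectors = int(fields[7]), int(fields[9])
                return reads, read_sectors, writes, write_sectors
    raise ValueError(f"device {device!r} not found in /proc/diskstats")

def profile(interval_s=5, intervals=12):
    prev = read_counters(DEVICE)
    for _ in range(intervals):
        time.sleep(interval_s)
        cur = read_counters(DEVICE)
        d_reads, d_rsec, d_writes, d_wsec = (c - p for c, p in zip(cur, prev))
        prev = cur
        total = d_reads + d_writes
        if total == 0:
            print("idle")
            continue
        # Read/write mix, request rate, and average transfer sizes per interval.
        print(f"IOPS={total / interval_s:8.1f}  "
              f"read%={100 * d_reads / total:5.1f}  "
              f"avg read KiB={d_rsec * SECTOR / max(d_reads, 1) / 1024:6.1f}  "
              f"avg write KiB={d_wsec * SECTOR / max(d_writes, 1) / 1024:6.1f}")

if __name__ == "__main__":
    profile()
```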
CONCLUSION
The IT industry is not yet ready to replace the emulation strategy used by SSDs, despite its increasingly disruptive side effects. However, there is no longer any reason to continue suffering its impacts on performance, latency, endurance, reliability, and service life. Instead, you can explore the Burlywood solution, including SSDs that analyze and adapt to workloads, plus testing tools and services to characterize the outcomes you can achieve.