Splunk SmartStore enables organizations to achieve a high degree of compute/storage elasticity and makes longer data retention possible without compromising performance or breaking the budget. A critical component of SmartStore is the storage system it uses for the remote storage tier, which retains warm/cold data. The term “remote” doesn’t imply distance; it simply means storage that is not embedded in the Splunk indexers.
The traditional Splunk cluster tightly integrates storage and compute; each node stores hot, warm, and cold data. The goal of SmartStore is not to create a fourth tier of storage but to move warm and cold data to a remote tier, separating storage from compute. With SmartStore, 80%–90% of the Splunk dataset should reside on the more cost-effective remote tier, lowering the cost of deploying both compute and storage while maintaining the performance for which Splunk is known.
While SmartStore uses the AWS S3 API to plug into the remote storage tier, IT planners need to use caution in selecting the storage system for that tier. The goal for this tier is not only to provide cost-effective long-term retention of data but also to deliver fast search performance. The remote storage tier is not supposed to replace the hot data tier, but it shouldn’t be so slow that it is unusable, since many searches will span both the hot and warm tiers. In fact, the better the remote tier performs, the more likely users are to accept its use, enabling the organization to further drive down costs.
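For context, SmartStore’s remote tier is defined in Splunk’s indexes.conf as a remote volume that indexes reference. A minimal sketch follows; the bucket name, endpoint, and index name are illustrative placeholders, not recommendations:

```ini
# indexes.conf -- minimal SmartStore remote volume (names are illustrative)
[volume:remote_store]
storageType = remote
path = s3://smartstore-example-bucket
# For an on-premises S3-compatible object store, point at its endpoint:
remote.s3.endpoint = https://objectstore.example.com:443
remote.s3.access_key = <access key>
remote.s3.secret_key = <secret key>

[example_index]
remotePath = volume:remote_store/$_index_name
homePath = $SPLUNK_DB/example_index/db
coldPath = $SPLUNK_DB/example_index/colddb
thawedPath = $SPLUNK_DB/example_index/thaweddb
```

Because the volume speaks the S3 API, the same stanza works whether the endpoint is AWS itself or an on-premises S3-compatible object store.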
Requirement 1: On-premises Option
The first requirement for the storage infrastructure acting as the remote tier is that it be available on-premises, on the same corporate network as the Splunk indexers. With SmartStore, organizations may see their environments scale from terabytes to petabytes. These organizations also want to retain data for long periods so Splunk can provide richer analysis. The combination of very high capacity and long retention times makes a cloud-only storage option less appealing for some organizations. While cloud storage has a low upfront cost, the recurring cost of egress fees and of storing petabytes of data for years, if not decades, may become prohibitive. The storage system should integrate with the cloud for disaster recovery and data mobility, but it should not be solely dependent on it.
Requirement 2: Cloud-Like Scalability
The remote tier should leverage a modular design so that the organization can experience cloud-like scalability even though the data is on-premises. If the organization has to manage dozens of storage systems to store the Splunk environment’s capacity, then the cost advantage of “owning” the storage is quickly eroded. Organizations should look for S3-compatible object storage systems that can provide non-disruptive and incremental expansion as Splunk datasets grow from terabytes to petabytes.
The remote tier storage system should also be free from hardware vendor lock-in. Since the remote tier may store data for decades, the organization needs the flexibility to intermix and change hardware platforms and storage media as new technology becomes available. The more open the object storage environment is, the more flexible it is long-term.
Requirement 3: High Data Durability
Splunk protects its hot buckets through data replication; the protection of the remote tier’s data, though, is the responsibility of the storage system. The capacity requirements of the remote tier may make replicating all data on that tier financially impossible. The organization needs the option to use a more data-efficient technology like erasure coding, a parity-based data protection scheme that maintains data availability during media or node failures.
Erasure coding, though, places extra demand on node CPUs and on internode network communication. IT planners need to look for an object storage solution that places minimal demands on storage nodes and on the storage cluster’s network. Additionally, most organizations that use Splunk and SmartStore operate multiple locations, so the erasure coding function should be able to span multiple geographic regions automatically, amplifying data durability.
Requirement 4: Fast Search Even on the Remote Tier
The primary job of the remote tier is to feed Splunk’s SSD-based indexer nodes. The remote tier should be able to move large amounts of data to an indexer node’s cache quickly. To meet these expectations, the remote tier storage system needs to deliver data from all available nodes in the storage cluster in parallel.
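The benefit of pulling from all nodes at once can be sketched in a few lines of Python. The `fetch_object` function below is a hypothetical stand-in for an S3 GET against one storage node (it only simulates latency), not a Splunk or S3 API:

```python
# Sketch: fetching a bucket's objects from many storage nodes in
# parallel rather than serially. fetch_object() is a hypothetical
# stand-in for an S3 GET against one node; it simulates I/O latency.
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_object(node):
    time.sleep(0.1)                     # simulated network/disk latency
    return f"data-from-{node}".encode()

nodes = [f"storage-node-{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
    parts = list(pool.map(fetch_object, nodes))
parallel_time = time.perf_counter() - start

# The eight 0.1 s fetches overlap, so wall-clock time stays near 0.1 s
# instead of the ~0.8 s a serial loop would take.
```

The same principle is why a scale-out object store that streams from every node at once can refill an indexer’s cache far faster than a single-headed array.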
With SmartStore, the remote tier plays a critical role in Splunk analysis. If it can deliver acceptable performance as it interacts with Splunk indexers, then the organization can afford to analyze more data and use compute power more efficiently. The remote tier, holding potentially 90% of the organization’s Splunk data assets, needs to scale to meet the demand and protect the data once it is there.
Our next blog provides an analysis of SwiftStack, a storage solution that exceeds the above requirements. In the meantime, register for our on-demand webinar with SwiftStack, “Rearchitecting Storage for the Next Wave of Splunk Data Growth,” and receive our latest eBook, “Doing More With Splunk Data…For Less”.