Splunk is software for searching, monitoring and analyzing machine-generated data. While there is a lot of talk about “big data initiatives”, this is the big data that organizations have right now. Use cases include predictive analytics for IT operations, security information and event management (including fraud detection) and web click tracking. Essentially, organizations want to monetize the raw data they are already collecting. For Splunk to really deliver, though, it needs a storage infrastructure that can keep pace with its compute power.
A Splunk architecture typically has three layers. The first is the “Searcher” tier, which hosts the cluster master (or controller) and serves as the front end where users generate search requests. The third layer is the “Forwarder” tier, made up of any system that can forward data into the Splunk cluster. The middle tier, the “Indexer”, is where the storage happens; here, both capacity management and I/O performance are critical.
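The Forwarder-to-Indexer relationship is usually defined in a forwarder’s outputs.conf. A minimal sketch is below; the host names are illustrative examples, and 9997 is simply the conventional Splunk receiving port:

```ini
# outputs.conf on a forwarder -- illustrative only; host names are examples
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
# The forwarder load-balances its data stream across the listed indexers
server = idx1.example.com:9997, idx2.example.com:9997
```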
Understanding Splunk Storage Management
Splunk manages storage by placing data into what it calls buckets. A bucket is essentially a directory. When data first arrives at the Indexer tier, it is written to a hot bucket. Generally, Splunk is directed to store hot buckets on an all-flash array. Each hot bucket is assigned a user-defined size limit and age limit; once it reaches either of these limits, the hot bucket is “rolled” to warm, which is another directory typically located on a separate, reasonably fast hard disk-based storage system. After another set of user-defined limits is reached, the warm bucket is “rolled” to cold, which is typically either a high-capacity NAS system or an object store. There is also an option to move data, again after a user-defined set of parameters is met, to frozen, which is often either tape or the cloud.
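These limits live in Splunk’s indexes.conf. A minimal sketch of one index stanza follows; the paths and numeric values are illustrative examples, not recommendations:

```ini
# indexes.conf -- illustrative stanza for one index;
# paths and values are examples, not recommendations
[web_clicks]
# Hot and warm buckets live under homePath (e.g., on an all-flash array)
homePath   = /flash/splunk/web_clicks/db
# Cold buckets roll to slower, higher-capacity storage
coldPath   = /nas/splunk/web_clicks/colddb
# Required path for data restored ("thawed") from frozen
thawedPath = /nas/splunk/web_clicks/thaweddb

# Size and age limits that trigger the hot-to-warm roll
maxDataSize    = auto_high_volume   # roughly 10 GB per bucket on 64-bit systems
maxHotSpanSecs = 86400              # roll hot buckets after 24 hours

# Number of warm buckets to keep before the oldest rolls to cold
maxWarmDBCount = 300

# Age at which cold buckets are frozen (deleted, or archived if a
# coldToFrozenDir is set)
frozenTimePeriodInSecs = 15552000   # 180 days
coldToFrozenDir = /archive/splunk/web_clicks
```

Note that homePath covers both hot and warm buckets, which is why the hot-to-warm roll is cheap, while the warm-to-cold roll can mean copying data to a different storage system.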
The Challenge with Splunk Storage Architectures
Splunk does an admirable job of moving data between storage tiers. The challenge is that all of this data management consumes compute that would otherwise be applied to analyzing data. There is also the time involved in moving data back and forth between these bucket types. If the default architecture is used (an all-flash array for hot data, a performance hard disk system for warm and a capacity hard disk system for cold), the time it takes to move data between systems can be significant.
A solution might be to centralize all the tiers onto a single storage system and let that system move data as needed. A hybrid flash array that can integrate with Splunk would work well in this situation. It would off-load data management from Splunk, and while data would still move between tiers, that movement would happen within the system instead of having to traverse a network.
Most big data projects include a laborious step: figuring out how to collect data. Splunk, by contrast, provides analysis on data that the organization has already been capturing for years, maybe decades. But it does present a few storage challenges that IT professionals need to be aware of. Join us in our upcoming webinar, “How To Design High Performance, Cost Effective Splunk Storage”, to learn more about Splunk storage basics, the challenges Splunk creates, the flaws behind typical Splunk storage designs and ideas on how to overcome them.