Object stores have found a home in the cloud and in data centers, becoming the repository for long-lived, high-value data. They abstract away the location of an object, giving the store the flexibility to implement higher forms of redundancy. This redundancy protects not only against device (e.g. hard disk) failure but also against failures of entire nodes or even entire data centers. With erasure coding and proper system design, object storage can deliver very high durability and availability economically.
Abstracting away object location also allows object stores to scale to sizes and topologies difficult to achieve with filesystems. The user of an object store may have no idea where their data is physically stored. What looks like a single object store may actually be distributed across locations in multiple cities to achieve greater reliability against fires, earthquakes and other natural disasters. This durability could tremendously increase the capacity requirements of the underlying hardware, but smart erasure coding algorithms allow durability to be achieved using less capacity than by mirroring the data.
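The capacity savings of erasure coding over mirroring can be shown with simple arithmetic. The sketch below compares the raw bytes stored per byte of user data under both schemes; the specific parameters (triple mirroring, a 10+4 code) are illustrative assumptions, not any particular product's configuration.

```python
# Storage overhead of redundancy schemes: raw bytes stored per byte of user
# data. Example parameters are assumptions for illustration only.

def replication_overhead(copies: int) -> float:
    """Mirroring stores one full copy per replica."""
    return float(copies)

def erasure_overhead(data_shards: int, parity_shards: int) -> float:
    """A k+m erasure code stores (k+m)/k bytes per user byte and
    survives the loss of any m shards."""
    return (data_shards + parity_shards) / data_shards

# Triple mirroring survives 2 failures at 3x capacity.
print(replication_overhead(3))        # 3.0
# A 10+4 erasure code survives 4 failures at only 1.4x capacity.
print(erasure_overhead(10, 4))        # 1.4
```

Both schemes here tolerate multiple failures, but the erasure code does so at less than half the raw capacity of triple mirroring.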
Oftentimes, retention periods for data are determined by legal and other compliance constraints. One might expect data not subject to compliance requirements to be deleted sooner. However, some data—like geological and genetic data—has value indefinitely. Many companies are finding the business value of their old data can be very high. Rather than risk deleting valuable data, companies want to store most of their data forever.
Other data is of high intrinsic value. Examples include:
- Data sets used for machine learning, which are typically expensive to acquire. Consider autonomous driving: there are literally years of recorded driving experiences under all manner of conditions.
- In geological exploration, depth-sounding data from scheduled explosions on land or at sea.
- In movie production, image and sound data recorded at very high fidelity in unique locations, costing millions of dollars to access.
These unstructured data sets can represent a tremendous investment and are examples of data which does not expire. The requirements for data durability are very high, making object stores a natural location for this data.
With success come greater demands – and object storage is no exception. Object storage is being pulled in two opposing directions simultaneously: toward colder, cheaper data storage and toward hotter, more responsive workloads.
Colder and cheaper
The demand for storage capacity is growing at a compound annual growth rate (CAGR) of more than 20%. Becoming the long-term repository puts an emphasis on becoming cheaper and deeper without losing data durability. The overall cost of storage needs to be reduced – not just media cost, but the TCO (total cost of ownership). This includes every associated expense of acquiring and owning a piece of equipment, including the costs of acquisition, maintenance, power, cooling, the enclosing building, and the land it sits on.
Most object stores today are hard-disk based, which provides good performance and reliability but with significant power and physical footprint costs. To lower TCO, object stores are incorporating tape – yes, tape lives! Tape has lower media costs and, unlike disk, requires minimal power and cooling when not being accessed.
But tape isn’t just slow disk. In fact, for sequential I/O, tape outperforms disk for both reading and writing. However, tape works (and fails) differently from disk. The latencies to access data on tape cannot be ignored. Best-practice implementations present tape as a separate tier, allowing applications to help manage access to their data.
Reaping the full advantages of tape in an object store also requires a deep understanding of how to properly manage and treat it. The object store must account for and survive failure modes which are unique to tape. It must also manage access patterns to reduce tape latencies and wear. All of the required complexity is implemented below the object interface, saving the user from experiencing it.
Tape tiers are ideal for large amounts of data stored for long periods of time. As tape excels at sequential access, large individual objects will perform best. However, a well-implemented object store will group small objects into larger sequential streams to and from tape.
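The grouping of small objects into larger sequential streams can be sketched as a simple pack-and-index scheme. This is a minimal illustration of the technique, assuming a hypothetical container format, not the layout any particular object store uses on tape.

```python
# A minimal sketch of small-object aggregation for a tape tier: many small
# objects are packed into one large container that can be written to tape as
# a single sequential stream, with an index recording each object's offset
# and length so it can be located on read-back.

def pack_objects(objects: dict) -> tuple:
    container = bytearray()
    index = {}  # key -> (offset, length)
    for key, payload in objects.items():
        index[key] = (len(container), len(payload))
        container += payload
    return bytes(container), index

def read_object(container: bytes, index: dict, key: str) -> bytes:
    offset, length = index[key]
    return container[offset:offset + length]

objs = {"a": b"alpha", "b": b"bravo", "c": b"charlie"}
blob, idx = pack_objects(objs)
assert read_object(blob, idx, "b") == b"bravo"
```

The index would itself be stored durably (e.g. on disk) so that retrieving one small object requires only a single seek into the container on tape.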
Hot, hotter, hottest!
While long-term repositories require colder and cheaper data storage, some cloud applications require much lower latency and very high bandwidth. Machine learning and video post production both fall into the latter category.
Focus: Machine learning
Machine learning (ML) applications have a voracious appetite for data. Much of it is large sequential access (e.g. video and audio). However, by object count, there is a great deal that is small I/O. ML applications are heavily read-oriented as the algorithms traverse vast amounts of data multiple times during training. Any delays reading data – whether large or small – can stall calculations. Most object stores are designed for large aggregate throughput, not for low latency of any particular request. This mismatch can result in poor interaction with the ML application.
For these kinds of applications, hard-disk based object stores may not be fast enough and tape is certainly out of the question. Solid state storage, while expensive, is needed to feed the ML beast.
Focus: Video post production
Video processing workflows can put tremendous demands on storage. The source material is high-definition images, often stored without compression and with each frame stored as a separate file or object. Multiple streams of images may be combined into the final result, which multiplies the required read speeds. The video editor needs to see the result of their edits played back in real time, without dropped frames, to get the feedback they require. With the ever-increasing video resolutions in use, this requires very high read speeds (8K video at 30 frames per second requires more than 5 GB/sec uncompressed).
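The bandwidth figure above can be checked with back-of-the-envelope arithmetic. The sketch below assumes 8K UHD frames (7680×4320) stored as uncompressed RGB at 16 bits per channel – one plausible high-fidelity format, not a universal standard.

```python
# Back-of-the-envelope read bandwidth for uncompressed 8K playback.
# Assumes 8K UHD (7680x4320), 3 color channels, 16 bits per channel.

width, height = 7680, 4320
channels = 3
bytes_per_channel = 2          # 16-bit samples
fps = 30

frame_bytes = width * height * channels * bytes_per_channel
stream_bytes_per_sec = frame_bytes * fps
print(f"{frame_bytes / 1e6:.0f} MB per frame")            # ~199 MB
print(f"{stream_bytes_per_sec / 1e9:.1f} GB/s per stream")  # ~6.0 GB/s
```

A single stream at these assumptions needs roughly 6 GB/s; combining multiple streams in an edit multiplies that figure.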
Thus, a tier of solid-state storage may also be needed. Like tape, solid-state storage has unique operational characteristics. At the present time solid state is comparatively expensive but irreplaceable for certain workloads. Solid-state storage tiers are ideal as a cache for holding tightly focused data sets while they’re being used by applications which do not tolerate latency.
Object store conundrums
An object store built from solid-state storage, hard disks, and tape poses several challenges in two key areas:
- First, the object store is large – certainly petabytes, possibly exabytes in size. Yet objects are likely in the range of kilobytes to megabytes in size. There are potentially trillions or even quadrillions of objects in an exabyte-scale object store. How do we know what’s in the data store? How do we identify and select complete subsets of information in a pool of data this vast?
- Second, what determines which objects go on solid-state storage, on hard disks, or on tape? How is that decision made? Further, some workloads require that the object store optimize the latency of accessing entire data sets, rather than individual objects. What’s the best way to select the sets of data that are needed for a particular task?
The data catalog
The first part of the solution is a catalog of the object store contents. This catalog serves as both a replacement for human memory and a mechanism for selecting subsets of the data.
Since the data in an object store may exist long past the people who put it there, we need a replacement for human memory. To do this, data needs to be classified as it comes into the object store.
The initial classification would include the standard attributes (who entered the object into the object store, when it was entered, how big it is, what type of data it is, who can use it, when the data was collected, and so on) and might include domain-specific classification as well (for example, optical character recognition of video surveillance data to find and record license plates).
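A classification record along these lines might be sketched as follows. The field names, object key, and values are illustrative assumptions; a real catalog schema would be richer and site-specific.

```python
# A sketch of an object's initial classification record. All field names and
# example values are hypothetical, chosen to mirror the attributes listed above.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class Classification:
    key: str                    # object identifier in the store
    ingested_by: str            # who entered the object
    ingested_at: datetime       # when it was entered
    size_bytes: int             # how big it is
    content_type: str           # what type of data it is
    readers: list               # who can use it
    collected_at: Optional[datetime] = None      # when the data was collected
    domain_tags: dict = field(default_factory=dict)  # domain-specific labels

rec = Classification(
    key="surveys/nd/1994/line-017.segy",         # hypothetical object key
    ingested_by="geo-ingest",
    ingested_at=datetime(2021, 3, 4, tzinfo=timezone.utc),
    size_bytes=48_000_000,
    content_type="application/segy",
    readers=["exploration-team"],
)
# Classification must remain malleable: tags can be added long after ingest.
rec.domain_tags["region"] = "north-dakota"
```

Keeping `domain_tags` open-ended reflects the point below: the ways we classify data will change in ways users can’t predict when the data is added.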
Over time, the uses of data and the information we can extract from it will change and improve. This requires the classification information to be malleable in ways users can’t predict when data is added.
The data catalog must provide mechanisms for selecting a tightly focused subset of data. For example, for a study on the effects of fracking, we might need all depth soundings recorded in the state of North Dakota from 1990 through 2020. The catalog needs to support rich search semantics to allow searching through time, geography, and diverse data types.
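A toy query against an in-memory catalog illustrates the fracking-study example. The entry fields (`type`, `region`, `year`) are assumptions for illustration, not a real catalog API.

```python
# Select all depth soundings recorded in North Dakota from 1990 through 2020
# from a toy in-memory catalog. Field names are illustrative assumptions.

catalog = [
    {"key": "s1", "type": "depth-sounding", "region": "north-dakota", "year": 1995},
    {"key": "s2", "type": "depth-sounding", "region": "texas",        "year": 2001},
    {"key": "s3", "type": "depth-sounding", "region": "north-dakota", "year": 2018},
    {"key": "v1", "type": "video",          "region": "north-dakota", "year": 2010},
]

hits = [
    entry["key"] for entry in catalog
    if entry["type"] == "depth-sounding"
    and entry["region"] == "north-dakota"
    and 1990 <= entry["year"] <= 2020
]
print(hits)   # ['s1', 's3']
```

A production catalog would run the equivalent query over trillions of entries, which is why rich, indexed search semantics are a core requirement rather than a convenience.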
The second part of the solution is data management – a collection of capabilities that ensure the data is where the user needs it when they need it there. Once any items of interest are identified, the system needs to ensure the data (or copies of it) are located to optimize processing. Conversely, the data needs to be where it can be kept safe for the least cost when it’s not in use.
- For example: the automatic (i.e. policy-driven) placement of data according to its classification(s); the movement of data from tier to tier based on recent access patterns; and the augmentation of a data set’s classifications based on its current classification(s).
- These capabilities must also be available ad hoc to the user. Once a set of data has been identified to work on, the user must be able to put it (or a copy of it) on solid-state storage for, say, a week. Users must be able to delete objects (or, if compliance requirements dictate otherwise, users must be prevented from doing so).
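Policy-driven placement plus an ad hoc override might look like the following sketch. The tier names, thresholds, and the "pin to flash with an expiry" rule are assumptions chosen to illustrate the idea, not any product’s policy engine.

```python
# A sketch of policy-driven tier selection with an ad hoc "pin to solid-state"
# override that expires. Tier names and thresholds are illustrative assumptions.
from datetime import datetime, timedelta, timezone
from typing import Optional

def select_tier(size_bytes: int, days_since_access: int,
                pinned_until: Optional[datetime] = None) -> str:
    now = datetime.now(timezone.utc)
    if pinned_until is not None and pinned_until > now:
        return "solid-state"        # user pinned the data set for a task
    if days_since_access <= 7:
        return "hard-disk"          # recently active working set
    if size_bytes >= 64_000_000 or days_since_access > 90:
        return "tape"               # cold or large: sequential-friendly
    return "hard-disk"

# A user pins a cold data set to flash for a week of analysis.
pin = datetime.now(timezone.utc) + timedelta(days=7)
assert select_tier(1_000_000, 400, pinned_until=pin) == "solid-state"
# Without the pin, policy sends cold data to tape and warm data to disk.
assert select_tier(128_000_000, 200) == "tape"
assert select_tier(1_000_000, 2) == "hard-disk"
```

When the pin expires, the same policy naturally migrates the data back to the cheapest tier that satisfies its classification.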
The durability and scalability of object stores make them an attractive option for organizations that need to store large volumes of valuable data, particularly for disaster recovery. With long-term repositories requiring cheaper storage and other applications requiring low-latency, high-bandwidth storage, today’s object stores may be built on a mix of hard disks, tape, and solid-state storage. This introduces challenges and complexities around how to find objects in an exabyte-scale object store and how to determine which objects are stored on each storage medium. But with effective data cataloging and data management, organizations can overcome these complexities to reap the durability and scalability benefits of object stores for their high-value unstructured data.