Object Stores and HCP: An Interview with Bob Primmer (Part 1)

I have read a number of my colleague Bob Primmer’s publications on object storage and our Hitachi Content Platform. I thought he would be the best person to help us understand how object storage and HCP help us address the explosion of data.

Can you tell us how you have come to be the expert in this area?

I’ve worked in software my whole career, both as a developer and also within product management. In 2002, I began by working at EMC on Centera, the first commercial instantiation of an object store. I then worked on Atmos in the early stages and subsequently came to HDS in 2009 as an architect in the Hitachi Content Platform (HCP) engineering team. I worked with a team that set about transforming HCP from an archive product to more of a general-purpose object store that would be suitable for cloud applications. I then took over the product management teams for HCP, Hitachi Data Ingestor (HDI), and HCP Anywhere.

Our strategy for addressing the explosion of data is based upon virtualization. We virtualize block and file infrastructure through our VSP, HUS VM and HNAS platforms to help customers reduce operational costs and leverage their capital assets. How does HCP support virtualization?

Abstraction or virtualization is fundamental to object stores. HCP provides storage virtualization at the hardware and software layers, both to the end user (client applications) and the storage administrator. In the process, HCP shifts the client model, essentially presenting storage as a service. Such abstraction allows for comparatively naïve users and administrators as HCP takes care of the detail of how and where data is stored, protected, geo-replicated, de-duplicated, versioned, garbage collected, etc. In this respect I think that HCP is the next evolution in the virtualization that you’ve described, where virtualization reaches the point of simply pointing to a service access point, requesting the data, and having it presented without any detail of the underlying mechanisms (hardware or software) involved in producing that data.

You mention an object store. What is an object and what is an object store?

While there isn’t a universal definition of what constitutes an object in the context of storage, generally an object is considered to be the union of the user data and metadata. Distilled to its simplest definition, an object store is a database for unstructured data, typically comprised of two components, metadata and user data. There are two types of metadata, system metadata (SMD) and custom metadata (CMD). We store custom metadata as files (serialized as XML) in a distributed file system along with the user data, referred to as data objects or blobs.
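As a rough illustration of that description, the sketch below models an object as user data plus two kinds of metadata. The class name `StoredObject`, the `<annotations>`/`<entry>` XML layout, and the choice of size and SHA-256 hash as system metadata are assumptions made for the example; they are not HCP's actual schema or API.

```python
import hashlib
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """Hypothetical model of an object: user data plus metadata."""
    name: str
    data: bytes
    custom_metadata: dict = field(default_factory=dict)

    def system_metadata(self) -> dict:
        # System metadata is derived by the store rather than supplied by the user.
        # Size and content hash are illustrative choices only.
        return {
            "size": str(len(self.data)),
            "sha256": hashlib.sha256(self.data).hexdigest(),
        }

    def custom_metadata_xml(self) -> str:
        # Serialize the user's key-value annotations as a small XML document
        # (assumed layout, sorted by key for a stable result).
        root = ET.Element("annotations")
        for key, value in sorted(self.custom_metadata.items()):
            ET.SubElement(root, "entry", key=key).text = value
        return ET.tostring(root, encoding="unicode")


photo = StoredObject(
    name="beach.jpg",
    data=b"\xff\xd8 raw bytes",
    custom_metadata={"person": "Alice", "location": "Boston"},
)
print(photo.system_metadata())
print(photo.custom_metadata_xml())
```

The point of the split is that the system metadata can always be recomputed from the data, while the custom metadata carries information only the user or application knows.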

What is the importance of metadata and how is it used?

Metadata allows us to add structure to unstructured data by providing the connective tissue to logically bind otherwise discrete objects. Additionally metadata allows us to add semantic meaning to opaque data objects. For example, if you upload a picture to an object store purely for data storage reasons, that has one level of value, namely data durability and availability. However, if you can annotate that object with key-value pairs such as who is in the picture and where it was taken, you’ve added semantic meaning to the object that can be queried and manipulated programmatically to connect this picture to other objects with similar traits (such as pictures with the same people in them, or taken at the same location).

Object stores vary significantly in the degree of flexibility they allow with metadata. HCP has evolved this ability quite a bit over the last three years. Today HCP allows the user (client application) to annotate the user data with arbitrary text in the form of key-value pairs. HCP uses XML as the serialization format for persisting metadata and XPath for structured queries against that metadata. This allows the system to return specific answers to user queries, rather than a collection of potential matches.

To allow sophisticated searches against the combined metadata (system and custom), we first index the sum of the metadata. We use XPath to extract the key-value pairs from the custom metadata to create an inverted index. Once this index has been created, users can query the metadata in one of two ways:

Through the GUI – which allows them to do a Google-like query
Programmatically through the MQE (Metadata Query Engine) interface.
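The indexing step described above can be sketched in a few lines. This is a minimal illustration, not the MQE interface: the XML layout, the sample documents, and the helper names `build_inverted_index` and `query` are all hypothetical, and the path expression uses only the limited XPath subset supported by Python's standard `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical custom metadata documents, keyed by object name.
CUSTOM_METADATA = {
    "beach.jpg": "<annotations><entry key='person'>Alice</entry>"
                 "<entry key='location'>Boston</entry></annotations>",
    "park.jpg": "<annotations><entry key='person'>Alice</entry>"
                "<entry key='person'>Bob</entry>"
                "<entry key='location'>Tokyo</entry></annotations>",
    "office.jpg": "<annotations><entry key='person'>Bob</entry>"
                  "<entry key='location'>Boston</entry></annotations>",
}


def build_inverted_index(documents):
    """Map each (key, value) pair to the set of objects annotated with it."""
    index = defaultdict(set)
    for object_name, xml_text in documents.items():
        root = ET.fromstring(xml_text)
        # XPath-style expression: every <entry> element that has a key attribute.
        for entry in root.findall(".//entry[@key]"):
            index[(entry.get("key"), entry.text)].add(object_name)
    return index


def query(index, **criteria):
    """Return the objects that match every key=value criterion."""
    matches = [index.get(pair, set()) for pair in criteria.items()]
    return set.intersection(*matches) if matches else set()


index = build_inverted_index(CUSTOM_METADATA)
print(sorted(query(index, person="Alice")))
print(sorted(query(index, person="Bob", location="Boston")))
```

Because the index maps exact key-value pairs to objects, a query returns the specific objects that satisfy it rather than a ranked list of possible matches.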

How do we obtain metadata in the first place? Applications are not likely to provide metadata that allows their data to be accessed by other applications.

This is a tough problem. Applications that write to a structured database (DB) typically do so in a well-defined manner, leveraging SQL, that readily allows other applications to query that DB, independent of the creating application. In the unstructured world it’s not nearly so clean. Often the only way to access the full set of information associated with unstructured data is through the creating application, which greatly limits the value the business can derive from this data. Since unstructured data is by far the predominant data form in the 21st century, this is a non-trivial problem for companies if they wish to derive business value from their accumulated data mass. For HCP we have developed some standard packages for ingesting and indexing data such as Hitachi Clinical Repository (HCR) as you described in your previous blog post.

Are object stores a replacement for file systems?

A common misperception is that object stores are a replacement for file systems; instead, they are an augmentation. The file system is tightly coupled to the operating system and provides a well-established mechanism for organizing files within a hierarchy of directories. By contrast, an object store focuses on changing the presentation layer to the storage consumer (typically an application) through a simplified interface (REST) while achieving enormous scale by aggregating many file systems into a single, higher-order grouping.

Can you describe how an object store changes the presentation layer?

The figure below presents an abstract view of the storage stack for a single node (storage server). The function of each superior layer in the stack is to aggregate and abstract the layer beneath, permitting greater sophistication and specialization in each layer without increasing complexity to clients of upper layers. The object storage layer creates a distributed storage service to client applications without requiring the clients to manage data distribution.

In this sense it provides a similar function to Hadoop. The point of Hadoop is to simplify distributed programming by having Hadoop worry about the messy details of splitting up jobs over many servers (the “map” function), handling errors and combining the results into a single, simple form (the “reduce” function). Similarly, an object store worries about the detail of how to break up and distribute data over many storage servers while presenting to the client the veneer of a single, simple form (a data object) wholly contained within a global namespace.
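To make the idea concrete, here is a toy sketch of a store that splits data across several nodes while the client sees only named objects in one namespace. Everything in it is an assumption for illustration: the class `TinyObjectStore`, the fixed-size chunking, and the hash-modulo placement are not how HCP places data, and a real system would also handle replication, failures, and rebalancing.

```python
import hashlib


class TinyObjectStore:
    """Toy store: chunks are spread over nodes, clients see whole objects."""

    def __init__(self, node_count, chunk_size=4):
        # Each "node" is just an in-memory dict standing in for a storage server.
        self.nodes = [dict() for _ in range(node_count)]
        self.chunk_size = chunk_size
        # Global namespace: object name -> ordered list of (node index, chunk id).
        self.manifest = {}

    def _node_for(self, chunk_id):
        # Illustrative placement: hash the chunk id and take it modulo node count.
        digest = hashlib.sha256(chunk_id.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % len(self.nodes)

    def put(self, name, data):
        placements = []
        for offset in range(0, len(data), self.chunk_size):
            chunk_id = f"{name}#{offset}"
            node = self._node_for(chunk_id)
            self.nodes[node][chunk_id] = data[offset:offset + self.chunk_size]
            placements.append((node, chunk_id))
        self.manifest[name] = placements

    def get(self, name):
        # The client never learns which nodes held which chunks.
        return b"".join(self.nodes[node][cid] for node, cid in self.manifest[name])


store = TinyObjectStore(node_count=3)
store.put("photos/beach.jpg", b"hello object store")
print(store.get("photos/beach.jpg"))
print([len(node) for node in store.nodes])  # chunks held per node
```

The client calls only `put` and `get` with an object name; the split into chunks and the choice of node stay inside the store, which is the abstraction the interview describes.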

Thanks, Bob, for helping us understand what an object store is and the value it provides as a service for storing data without the need to worry about the underlying infrastructure. HCP provides virtualization of hardware and software to the application client as well as the storage administrator, making it easier to scale, use, and administer.

I will follow up this interview in another blog post to understand how we can ingest different types of data using different protocols and still be able to do a common query to find the content we need among billions of objects.