Fork me on GitHub

basho

Concepts

This section is a high level overview of concepts, technology choices, and implementation details that are at work in Riak.

Basics

Languages Used

Written in Erlang and C primarily with a small amount of Javascript.

Clients

Basho supported Drivers are available for Erlang, Python, Java, PHP, Javascript, and Ruby.

History

Riak is based on technology originally developed by Basho Technologies to run a Salesforce automation business. There was more interest in the datastore technology than the applications built on it so Basho decided to build a business around Riak itself.

Influences

Riak is heavily influenced by Dr. Eric Brewer’s CAP Theorem and Amazon’s Dynamo paper. Most of the core team comes from Akamai which also explains their focus on operational ease and fault tolerance.

Enterprise vs. Open Source

Riak comes in two flavors: open source and enterprise. Enterprise is a superset of the open source version with a few features added. The current enterprise features are SNMP support, inter-datacenter replication (not to be confused with intra-datacenter replication, which is supported in the open source version), web-based administration interface, and top-tier support.

Data Storage

Buckets and keys

Buckets and keys are the only way to organize data inside of Riak. User data is stored and referenced by bucket/key pairs.

Links and Metadata

Bucket/key entries, hereafter referred to as Riak objects, can store links pointing to other entries. These links can be walked via Riak’s HTTP interface directly or as part of a map/reduce job. Riak objects and buckets can store user-defined metadata in addition to the actual data payload. The metadata is exposed as HTTP headers in the HTTP interface and as associative arrays inside map/reduce jobs.

Local Disk Storage

As of the 0.12 release, Bitcask is the default backend for Riak. Bitcask is a simple yet powerful local key/value store that serves as Riak’s low latency, high throughput storage back end.

Pluggable Backends

Riak uses an API to interact with its storage subsystem. The API allows Riak to support multiple backends which can be “plugged in” as needed. Riak currently ships with backends for Bitcask, dets, ets, Erlang’s balanced trees (gb_trees), and writing directly to the filesystem. Riak also supports per-bucket backends. You can read more about configuring backends on the “storage_backends” section of the Configuration Files page.

Other Backends

Basho also develops a Riak backend named Innostore based on the embeddable version of InnoDB. Due to licensing restrictions, Innostore is provided separately. You can read more about configuring Innostore here.

Clustering

Nodes, Vnodes, and Partitions

Central to any Riak cluster is a 160-bit integer space which is divided into equally-sized partitions.

Physical servers, referred to in the cluster as “nodes,” run a certain number of virtual nodes, or “vnodes”. Each vnode will claim a partition on the ring. The number of active vnodes is determined by the number of physical nodes in the cluster at any given time.

As a rule, each node in the cluster is responsible for 1/(total number of physical nodes) of the ring. You can determine the number of vnodes on each node by calculating (number of partitions)/(number of nodes). More simply put, a ring with 32 partitions, composed of four physical nodes, will have approximately eight vnodes per node. This setup is represented in the diagram below.

Nodes can be added and removed from the cluster dynamically and Riak will redistribute the data accordingly.

Distribution

Riak is designed, from the ground up, to run in a distributed environment. Core operations, such as read/writing data and executing map/reduce jobs, actually become faster when more Riak nodes are added to a cluster.

Riak & CAP Theorem

Riak’s guiding design principle is Dr. Eric Brewer’s CAP Theorem. The CAP theorem defines distributed systems in terms of three desired properties: Consistency, Availability, and Partition (failure) tolerance. The theorem states you can only rely on having two of the three properties at any time.

Riak chooses to focus on the A and P of CAP. This choice puts Riak in the eventually consistent camp. However, the window for “eventually consistent” is in terms of milliseconds which can be good enough for many applications.

No master node

All nodes in a Riak cluster are equal. Each node is fully capable of serving any client request. This is possible due to the way Riak uses consistent hashing to distribute data around the cluster.

Storage implications

Riak communicates bucket information around the cluster using a gossip protocol. In general, large numbers of buckets within a Riak cluster is not a problem. In practice, there are two potential restrictions on the maxmimum number of buckets a Riak cluster can handle.

First, buckets which use a non-standard set of properties will force Riak to gossip more data around the cluster. The additional data can slow processing and place an upper limit on performance. Second, some backends, namely Innostore, store each bucket as a separate entity. This can cause a node to run out of resources such as file handles. These resource restrictions might not impact performance but they can represent another limit on the maximum number of buckets.

Replication

Riak controls how many replicas of data are kept via a setting called the “N value”. This value has a per-node default but can be overridden on each bucket. Riak objects inherit the N value of their parent bucket. All nodes in the same cluster should agree on and use the same N value.

For example, here is a depiction of what happens when n_val = 3 (This is the default setting). When you store a datum in a bucket with an N value of three, the datum will replicated to three separate partitions on the Riak Ring.

Hinted Handoff

Riak uses a technique called ‘hinted handoff’ to compensate for failed nodes in a cluster. Neighbors of a failed node will pick up the slack and perform the work of the failed node allowing the cluster to continue processing as usual. This can be considered a form of self-healing.

Dynamic Growth

Riak will automatically re-balance data as nodes join and leave the cluster.

Reading Data

Fetching

Riak objects can be fetched directly if the client knows the bucket and key. This is the fastest way to get data out of Riak.

R Value

Riak allows the client to supply an “R value” on each direct fetch. The R value represents the number of Riak nodes which must return results for a read before the read is considered successful. This allows Riak to provide read availability even when nodes are down or laggy.

Read failure tolerance

Subtracting R from N will tell you the number of down or laggy nodes a Riak cluster can tolerate before becoming unavailable for reads. For example, an 8 node cluster with an N of 8 and a R of 1 will be able to tolerate up to 7 nodes being down before becoming unavailable for reads.

Link walking

Riak can also return objects based on links stored on the object. Link walking can be used to return a set of related objects from a single request.

Writing and Updating Data

Vector clocks

Each update to a Riak object is tracked by a vector clock. Vector clocks allow Riak to determine causal ordering and detect conflicts in a distributed system.

Conflict resolution

Riak has two ways of resolving update conflicts on Riak objects. Riak can allow the last update to automatically “win” or Riak can return both versions of the object to the client. This gives the client the opportunity to resolve the conflict on its own.

W Value

Riak’s API allows the client to supply a “W value” on each update. The W value represents the number of Riak nodes which must report success before an update is considered complete. This allows Riak to provide write availability even when nodes are down or laggy.

Write failure tolerance

Subtracting W from N will tell you the number of down or laggy nodes a Riak cluster can tolerate before becoming unavailable for writes. For example, an 8 node cluster with an N of 8 and a W of 2 will be able to tolerate up to 6 nodes being down before becoming unavailable for writes.

MapReduce

MapReduce in Riak allows you to process your data in real-time in parallel utilizing the hardware resources of your entire cluster. MapReduce jobs are described in JSON using a set of nested hashes describing the inputs, phases, and timeout for a job. A job can consist of an arbitrary number of Map and Reduce phases. For this reason, MapReduce in Riak can be thought of as a real-time “mini-Hadoop”. A job is submitted via HTTP and the results are returned in JSON-encoded form. (A Protocol Buffers interface is also supported.)

Inputs

MapReduce jobs operate on a set of objects, specified using MapReduce inputs. The objects can be explicitly specified as a list of bucket/key pairs, or implicitly specified by providing a secondary index query or search query. The output of the query is used as the input of the MapReduce job.

Phases

Map/Reduce jobs are chains of functions. Riak combines the function along with information about what to do with the output and the language the function is written into a ‘phase’. The output of one phase becomes the input for the next. Riak can accumulate the output of any phase or set of phases and return the data to the client as part of that job’s output.

Map Phase

Map phases are responsible for gathering data from Riak bucket/key entries. Map phases can operate on entire buckets or lists of bucket/key pairs. The fetching of data is highly parallel and scales with the number of nodes in the cluster. Riak caches the results of map fetches to reduce the load on the storage backends. This means subsequent map fetches for the same data will be faster after the initial fetch.

Link Phase

Link phases are a specialized version of map phases. A link phase fetches Riak objects based on a link walk. Link phases can be used to perform map/reduce processing on sets of related objects.

Reduce Phase

Reduce phases can perform arbitrary processing on the data retrieved from map or link phases. Unlike CouchDB, Riak reduce phases are not required to return a single answer.

Query Languages

Map/Reduce queries can be written in Erlang or Javascript.

Erlang

Erlang functions have complete access to the Riak Erlang API.

Javascript

Mozilla’s Spidermonkey engine provides the runtime environment. Pre-defined Javascript functions run almost as fast as Erlang functions. Javascript functions are currently permitted read-only access to Riak.

Watch a screencast overview of MapReduce in Riak

Secondary Indexes

Version 1.0 adds support for Secondary Indexes in Riak. This feature allows an application to tag a Riak object with one or more field/value pairs. The object is indexed under these field/value pairs, and the application can later query the index to retrieve a list of matching keys.

Schema Free

Indexes are set on an object-by-object basis, there is no schema. The indexes are defined at the time the object is written. To change the indexes for an object, simply write the object with a different set of indexes.

Real-Time and Atomic

Indexing is real-time and atomic; the results show up in queries immediately after the write operation completes, and all indexing occurs on the partition where the object lives, so the object and its indexes stay in sync.

Query via HTTP, Protocol Buffers, or MapReduce

Indexes can be stored and queried via the HTTP interface or the Protocol Buffers interface. Additionally, index results can feed directly into a Map/Reduce operation, allowing further filtering and processing of index query results.

The Riak API

Data Storage

The guys who wrote Riak are also responsible for the Erlang REST framework Webmachine, so it’s not surprising Riak uses REST for its API. Storage operations use HTTP PUTs or POSTs and fetches use HTTP GETs. Storage operations are submitted to a pre-defined URL which defaults to ‘/riak’.

Clients can set the Content-Type header for each Riak object. Riak will replay the header when the object is fetched. This is nice since it allows Riak to be content agnostic.