Intro
Graphite is a popular metrics collection, storage & graphing tool. Unfortunately, clear info on how it works in a cluster configuration is hard to come by. I’ve done a lot of reading on the topic and want to share the result.
Note that I haven’t actually put this into practice yet (though I plan to at my day job in the near future). However, I’ll be working with folks who’ve deployed clusters to hammer out any misinformation ASAP.
I sourced these notes from community blog posts, the docs, and reading the Graphite/Carbon/Whisper source code. Specifically:
- Richard Crowley’s ‘Federated Graphite’ blog post - one reason I wrote my post is that his, while well put together, still leaves gaps in the big picture.
- Tuning Graphite for 3m points/min - a somewhat old Q&A thread on Launchpad that’s still at least somewhat relevant.
- Mailing list post mentioning CLUSTER_SERVERS confusion
- The official docs
- The code
Overview
Components
A Graphite setup consists of the following components:
- A collector that speaks Graphite wire protocols.
- One or more Carbon daemons (typically carbon-cache, the workhorse).
  - Carbon-cache’s primary responsibility is accepting metrics over the wire & writing them into Whisper db files (see below).
  - While doing so, it caches metrics in memory to help write throughput (and smooth over disk contention).
  - Because of this, at any given moment datapoints that have “arrived” at a given node may exist either on-disk OR in a carbon-cache’s cache. This affects how the webapp queries for them (see below).
  - When multiple carbon-caches are needed (to take advantage of multiple cores or nodes), a carbon-relay process load-balances across them.
- An on-disk tree of Whisper database files.
  - These files are the canonical resource for metric data; all metric data should eventually (and ideally, quickly) arrive here.
- At least one webapp install.
  - The webapp is the canonical query endpoint for its node’s metric store.
  - It handles all necessary queries to disk, carbon-caches, and, if necessary, webapps on other nodes, to end up with the most up-to-date set of datapoints for a given metric.
  - The webapp ships with a graph composer & dashboard, but for the purposes of this document, its primary use is that of query API endpoint.
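As an aside, the “wire protocols” mentioned above include a trivially simple plaintext protocol: one `metric.path value timestamp` line per datapoint, sent (by default) to the line receiver on port 2003. A minimal collector-side sketch in Python; the function names are my own, not part of any Graphite library:

```python
import socket
import time

def format_datapoint(path, value, timestamp=None):
    """Render one datapoint as a Carbon plaintext-protocol line."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_datapoint(path, value, host="localhost", port=2003):
    """Ship a single datapoint to a carbon-cache or carbon-relay."""
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(format_datapoint(path, value).encode("ascii"))
    finally:
        sock.close()
```

Carbon also speaks a higher-throughput pickle protocol on a separate port, which relays use when talking to caches; the plaintext form above is just the simplest way in.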
Configuration
Given that there may be multiple carbon-caches per node and/or multiple nodes’ worth of caches & webapps, Graphite has a handful of settings that inform these components of each other’s locations:
- `CLUSTER_SERVERS` (`local_settings.py`): tells the webapp which other webapps to query.
- `CARBONLINK_HOSTS` (`local_settings.py`): tells the webapp which carbon-caches to query.
- `DESTINATIONS` (`carbon.conf`): tells carbon-relays which carbon-caches are in the cluster, determining the hash ring layout when in consistent-hash mode (more on this later) or simply the available hosts when in rule-oriented mode (ditto).
- `WHISPER_DIR` (`carbon.conf`): which on-disk location holds the Whisper files. This tells the carbon-caches where to write, and the webapp where to read.
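To make the relationships concrete, here’s a hypothetical single-node layout showing where each setting lives. Every hostname, port and instance name below is invented for illustration; consult the shipped example configs for the real defaults:

```
# carbon.conf: one relay hashing across two local cache instances.
# DESTINATIONS entries are ip:port:instance triples.
[relay]
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2004:a, 127.0.0.1:2104:b

[cache:a]
PICKLE_RECEIVER_PORT = 2004
CACHE_QUERY_PORT = 7002

[cache:b]
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102

# local_settings.py: the webapp queries the same two caches
# (matching the relay's DESTINATIONS) plus any sibling webapps.
CARBONLINK_HOSTS = ["127.0.0.1:7002:a", "127.0.0.1:7102:b"]
CLUSTER_SERVERS = []  # empty on a single-node setup
```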
The write algorithm
Writes into a Graphite system depend on how your cluster is set up, but typically look like this:
- A collector writes metrics over the wire to some Carbon daemon.
- That daemon may be a carbon-cache directly responsible for writing
metrics to disk, in which case, you’re done.
  - carbon-cache uses `WHISPER_DIR` to determine where metrics live.
- It might instead be a carbon-relay that doesn’t write to disk but balances writes across multiple backend carbon-caches (which then write to disk as above), depending on method/mode:
  - Consistent-hash mode creates a hash ring out of `DESTINATIONS` for even balancing.
  - Rules-based mode sends to one or more `DESTINATIONS` members according to additional configuration hinging on the metric paths themselves.
- It’s also possible to add carbon-aggregator daemons to the mix, which function as combination aggregation + relay services. We don’t cover these here as they’re orthogonal to primary metric collection.
- Depending on how well a single carbon-relay performs and how many collectors you have, you may need an additional load balancing layer between your collectors and (now multiple) carbon-relays.
  - We haven’t dug into this angle as much yet; presumably there’s a lot of room for flexibility when multiple relays are in play, re: whether they all have the same `DESTINATIONS` lists, or split things up amongst themselves.
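For the curious, consistent hashing itself can be sketched in a few lines of Python. This is a simplified illustration of the general technique, not Carbon’s actual implementation (which lives in carbon’s hashing module and differs in details such as key format and replica count):

```python
import bisect
from hashlib import md5

class HashRing(object):
    """Toy consistent-hash ring: each node is placed at many pseudo-random
    positions on a ring; a metric maps to the next node clockwise."""

    def __init__(self, nodes, replicas=100):
        entries = []
        for node in nodes:
            for i in range(replicas):
                entries.append((self._hash("%s:%d" % (node, i)), node))
        entries.sort()
        self.positions = [pos for pos, _ in entries]
        self.nodes = [node for _, node in entries]

    @staticmethod
    def _hash(key):
        # First 4 bytes of the md5 digest, as an integer ring position.
        return int(md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def get_node(self, metric):
        """Walk clockwise from the metric's position to the next node."""
        i = bisect.bisect(self.positions, self._hash(metric)) % len(self.positions)
        return self.nodes[i]
```

The key property is that the mapping is a pure function of the metric name and the node list, so any process holding the same list (relay or webapp) agrees on which cache owns which metric.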
The read/query algorithm
Graphite’s webapp does the following when it receives a query request for metric X:
- Seeks an on-disk Whisper file from `WHISPER_DIR` for X and queries it for the timeframe involved.
- If that query fails (no on-disk metrics for that timeframe for X) and `CLUSTER_SERVERS` is set, sends the request to all `CLUSTER_SERVERS` (each of which recurses through this algorithm), then iterates through the responses and returns the first match.
  - In a typical sharded setup, there would only be one valid result.
- At this point, the data is either from the local disk, or from some other webapp (which itself may have gotten it from disk, or elsewhere, as below).
- Next, it communicates with whichever carbon-cache is expected to be caching any not-on-disk-yet data for X, by way of its `CARBONLINK_HOSTS`.
  - Specifically, a single carbon-cache in that list should be responsible for the metric X; which carbon-cache this is depends on the mode we’re in: hashing or rules.
  - This carbon-cache selection uses the same algorithm as carbon-relay, if that is in play. Thus, `CARBONLINK_HOSTS` must always match `DESTINATIONS` if hashing is involved, or the hashes will mismatch and the webapp will frequently look in the wrong place.
- If carbon-cache had any data for X, it is merged with the on-disk (or other-webapp-queried) data to present the final time series, which is returned to the client.
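Putting the four steps together, the read path can be sketched as follows. This is illustrative stand-in code, not graphite-web’s actual implementation: plain dicts play the roles of the Whisper store, sibling webapps and the carbonlink query, and real series are arrays of (timestamp, value) pairs rather than dicts:

```python
def merge_series(on_disk, cached):
    """Overlay cached (not-yet-written) points onto on-disk points;
    cached values win for timestamps present in both."""
    merged = dict(on_disk or {})
    merged.update(cached or {})
    return merged

def fetch(metric, local_disk, cluster_servers, carbonlink):
    """Sketch of the webapp read path, with dicts standing in for
    Whisper files, sibling webapps and the owning carbon-cache."""
    # 1. Try the local Whisper store (WHISPER_DIR).
    series = local_disk.get(metric)
    # 2. On a miss, ask each CLUSTER_SERVERS sibling; first match wins.
    if series is None:
        for sibling in cluster_servers:
            series = sibling.get(metric)
            if series is not None:
                break
    # 3. Ask the carbon-cache responsible for this metric (selected via
    #    the hash ring over CARBONLINK_HOSTS in real life) for any
    #    datapoints not yet flushed to disk.
    cached = carbonlink.get(metric)
    # 4. Merge and return the final time series.
    return merge_series(series, cached)
```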
Use case breakdown
We’ll examine how the webapp fulfills queries in detail via a series of example setups, from simple to complex.
Generally, we expect resource starvation to begin with single-core CPU saturation by carbon-cache, then move on to disk throughput once multiple carbon-caches are in play. Unless one’s disks are quite slow, it seems unlikely to occur in the reverse order (a single carbon-cache saturating disk throughput), so we don’t consider that scenario.
The base case
One node with one carbon-cache and one webapp, plenty of resources (CPU, etc) available.
- Writes
  - Collector submits to the carbon-cache.
  - Carbon-cache writes near-instantly to `WHISPER_DIR`.
- Reads
  - Query reaches webapp for metric X.
  - Webapp checks `WHISPER_DIR`, pulls up submitted data.
    - `CLUSTER_SERVERS` is empty, so no network requests are made.
  - Webapp checks `CARBONLINK_HOSTS` for cached data.
    - That list just contains the one local carbon-cache; no data is cached, so none is returned.
  - Those two results are merged; on-disk + nothing = on-disk. Done.
Base case with active cache
Same as above, but the disk is slow, or carbon-cache’s write batch settings are higher than the incoming data (i.e. it has to wait a while before it has enough data to justify a write).
- Writes
  - Unchanged, except the carbon-cache now usually has data in its cache which is not on disk yet.
- Reads
  - Unchanged, except now the webapp receives non-empty data when it asks `CARBONLINK_HOSTS`, which is then merged with any on-disk data.
Single node, multiple carbon-caches
We’ve reached the point where a single carbon-cache is maxing out one core, but our disks are still keeping up (i.e. no need to go multi-node yet).
- Writes
  - Instead of metrics going to a single carbon-cache, we now have a carbon-relay listening on the port the carbon-cache used to listen on, and N carbon-caches listening on unique ports on the same node.
  - This carbon-relay, depending on settings (again, hashing vs rules), forwards each data point to one of the backend carbon-caches.
  - That carbon-cache then writes to disk, and depending on disk utilization, probably has some data in its cache most of the time.
  - All carbon-caches are still writing to the same directory tree in `WHISPER_DIR`, but if one’s rules are good or one is using hashing, each file “belongs” to only one carbon-cache, so no write conflicts arise. (Carbon also has a lock setting for when this is not the case.)
- Reads
  - Still the same, except now the webapp has multiple backend carbon-caches in `CARBONLINK_HOSTS` (not just the one).
  - However, because any given metric X will live in only one of the carbon-caches, the webapp still only queries one of them by using the hash ring.
    - OPEN QUESTION: If one is using relay-rules instead, it looks like the webapp won’t work right (the code seems to always assume hashing); perhaps the expectation is that one utilizes `CARBONLINK_HOSTS` differently when rules are in effect?
  - Assuming hashing, the order matters: `CARBONLINK_HOSTS` must match the relay’s `DESTINATIONS` exactly.
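To see why the ordering matters, here’s a toy demonstration using simple modulo hashing (deliberately not Carbon’s real ring, but the failure mode is the same): if the webapp’s host list doesn’t match the relay’s, reads can land on a cache that never saw the writes.

```python
from hashlib import md5

def pick_host(metric, hosts):
    """Toy stand-in for hash-ring selection: deterministically map a
    metric onto one host from an ordered list. Not Carbon's algorithm."""
    digest = int(md5(metric.encode("utf-8")).hexdigest(), 16)
    return hosts[digest % len(hosts)]

destinations = ["cache-a:2004", "cache-b:2104", "cache-c:2204"]
carbonlink = ["cache-b:2104", "cache-a:2004", "cache-c:2204"]  # reordered!

# The relay writes foo.bar to one cache; a webapp using the
# differently-ordered list may well query a different one and
# miss any not-yet-flushed datapoints.
written_to = pick_host("foo.bar", destinations)
queried = pick_host("foo.bar", carbonlink)
```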
Multiple nodes, multiple caches on each, multiple relays
Now a single disk array can’t keep up and we have to shard across multiple disk arrays. Assuming a single relay can still handle the traffic involved, we can do the following:
- Create multiple identical instances of the previous solution: self-contained nodes, each with one relay, multiple caches and one webapp.
  - These nodes, as before, have local caches listed in their local relay’s `DESTINATIONS` and their local webapp’s `CARBONLINK_HOSTS`, preserving the hash ring.
- We then add a single relay on another node (or, if there are CPUs to spare, one of the existing nodes, adjusting ports appropriately) and place the other relays in its `DESTINATIONS`.
  - Optionally, we may set `REPLICATION_FACTOR` to 2 or above to gain some redundancy within the cluster.
That setup gives us the following write and read flows:
- Writes
  - The top carbon-relay shards across the backend nodes’ relays, so that each individual node becomes responsible for one segment of the hash ring.
  - Once metrics arrive at the backend relays, the write pattern looks the same as the previous scenario, with another level of sharding across the local caches.
- Reads
  - We now have multiple webapps in play, so each must have its siblings in `CLUSTER_SERVERS`. However, each still has only local-to-node caches in `CARBONLINK_HOSTS`, again per the previous scenario.
  - Reads may go to any webapp (possibly load-balanced), which will speak to its siblings in `CLUSTER_SERVERS` in the first phase of data discovery:
    - If the metric is found locally, `CLUSTER_SERVERS` is not consulted.
    - If it’s not found locally, all `CLUSTER_SERVERS` are queried and the results tallied.
      - These sibling webapps will be pulling in cached metrics from their local carbon-caches during this step, according to each’s `CARBONLINK_HOSTS`.
    - The webapp in question queries its local `CARBONLINK_HOSTS` and merges that with the result.
    - Done.
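A hypothetical sketch of the extra settings this scenario introduces; all hostnames and ports are invented:

```
# carbon.conf on the top-level relay: its DESTINATIONS are the
# per-node relays, not carbon-caches.
[relay]
RELAY_METHOD = consistent-hashing
DESTINATIONS = node1.example.com:2014:a, node2.example.com:2014:a

# local_settings.py on node1's webapp: siblings in CLUSTER_SERVERS,
# but only local-to-node caches in CARBONLINK_HOSTS.
CLUSTER_SERVERS = ["node2.example.com:80"]
CARBONLINK_HOSTS = ["127.0.0.1:7002:a", "127.0.0.1:7102:b"]
```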
Side scenario: duplicating metrics with even more relays
A quick side note: the `REPLICATION_FACTOR` setting mentioned above can also be used in conjunction with additional levels of relaying to fully duplicate data, e.g. across data centers or simply for extra high availability.
For example:
- Create multiple (let’s say two) clusters as in the previous scenario, including their “top level” relay.
  - Make sure they’re truly independent; their webapps’ `CLUSTER_SERVERS` settings should only include the other nodes in that local cluster, not in the sibling cluster.
- Add a new, actually-top-level relay whose `DESTINATIONS` is those two previously-top-level relays, and set its `REPLICATION_FACTOR` to 2.
- Now metrics entering this new top-level relay are duplicated to both backend clusters, which otherwise perform identically to the previous scenario.