Intro
Graphite is a popular metrics collection, storage & graphing tool. Unfortunately, clear info on how it works in a cluster configuration is hard to come by. I’ve done a lot of reading on the topic and want to share the result.
Note that I haven’t actually put this into practice yet (though I plan to at my day job in the near future). However, I’ll be working with folks who’ve deployed clusters to hammer out any misinformation ASAP.
I sourced these notes from community blog posts, the docs, and reading the Graphite/Carbon/Whisper source code. Specifically:
- Richard Crowley’s ‘Federated Graphite’ blog post - one reason I wrote my post is that his, while well put together, still leaves gaps in the big picture.
- Tuning Graphite for 3m points/min - a somewhat old Q&A thread on Launchpad that’s still at least somewhat relevant.
- Mailing list post mentioning CLUSTER_SERVERS confusion
- The official docs
- The code
Overview
Components
A Graphite setup consists of the following components:
- A collector that speaks Graphite wire protocols.
- One or more Carbon daemons (typically carbon-cache, the workhorse).
  - Carbon-cache’s primary responsibility is accepting metrics over the wire & writing them into Whisper db files (see below).
  - While doing so, it caches metrics in memory to help write throughput (and smooth over disk contention).
  - Because of this, at any given moment datapoints that have “arrived” at a given node may exist either on-disk OR in a carbon-cache’s cache. This affects how the webapp queries for them (see below).
  - When multiple carbon-caches are needed (to take advantage of multiple cores or nodes), a carbon-relay process load-balances across them.
- An on-disk tree of Whisper database files.
  - These files are the canonical resource for metric data; all metric data should eventually (and ideally, quickly) arrive here.
- At least one webapp install.
  - The webapp is the canonical query endpoint for its node’s metric store.
  - It handles all necessary queries to disk, carbon-caches, and, if necessary, webapps on other nodes, to end up with the most up-to-date set of datapoints for a given metric.
  - The webapp ships with a graph composer & dashboard, but for the purposes of this document, its primary use is that of query API endpoint.
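As an aside, the “wire protocols” mentioned above include a trivially simple plaintext protocol: one `metric.path value timestamp` line per datapoint, sent (by default) to the line receiver on port 2003. A minimal collector-side sketch in Python; the function names are my own, not part of any Graphite library:

```python
import socket
import time

def format_datapoint(path, value, timestamp=None):
    """Render one datapoint as a Carbon plaintext-protocol line."""
    if timestamp is None:
        timestamp = int(time.time())
    return "%s %s %d\n" % (path, value, timestamp)

def send_datapoint(path, value, host="localhost", port=2003):
    """Ship a single datapoint to a carbon-cache or carbon-relay."""
    sock = socket.create_connection((host, port))
    try:
        sock.sendall(format_datapoint(path, value).encode("ascii"))
    finally:
        sock.close()
```

Carbon also speaks a higher-throughput pickle protocol on a separate port, which relays use when talking to caches; the plaintext form above is just the simplest way in.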
Configuration
Given that there may be multiple carbon-caches per node and/or multiple nodes’ worth of caches & webapps, Graphite has a handful of settings that inform these components of each other’s locations:
- `CLUSTER_SERVERS` (`local_settings.py`): tells the webapp which other webapps to query.
- `CARBONLINK_HOSTS` (`local_settings.py`): tells the webapp which carbon-caches to query.
- `DESTINATIONS` (`carbon.conf`): tells carbon-relays which carbon-caches are in the cluster, determining the hash ring layout when in consistent-hash mode (more on this later) or simply the available hosts when in rule-oriented mode (ditto).
- `WHISPER_DIR` (`carbon.conf`): which on-disk location holds the Whisper files. This tells the carbon-caches where to write, and the webapp where to read.
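To make the relationships concrete, here’s a hypothetical single-node layout showing where each setting lives. Every hostname, port and instance name below is invented for illustration; consult the shipped example configs for the real defaults:

```
# carbon.conf: one relay hashing across two local cache instances.
# DESTINATIONS entries are ip:port:instance triples.
[relay]
RELAY_METHOD = consistent-hashing
DESTINATIONS = 127.0.0.1:2004:a, 127.0.0.1:2104:b

[cache:a]
PICKLE_RECEIVER_PORT = 2004
CACHE_QUERY_PORT = 7002

[cache:b]
PICKLE_RECEIVER_PORT = 2104
CACHE_QUERY_PORT = 7102

# local_settings.py: the webapp queries the same two caches
# (matching the relay's DESTINATIONS) plus any sibling webapps.
CARBONLINK_HOSTS = ["127.0.0.1:7002:a", "127.0.0.1:7102:b"]
CLUSTER_SERVERS = []  # empty on a single-node setup
```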
The write algorithm
Writes into a Graphite system depend on how your cluster is set up, but typically look like this:
- A collector writes metrics over the wire to some Carbon daemon.
- That daemon may be a carbon-cache directly responsible for writing
metrics to disk, in which case, you’re done.
  - carbon-cache uses `WHISPER_DIR` to determine where metrics live.
- It might instead be a carbon-relay that doesn’t write to disk but balances writes across multiple backend carbon-caches (which then write to disk as above), depending on method/mode:
  - Consistent-hash mode creates a hash ring out of `DESTINATIONS` for even balancing.
  - Rules-based mode sends to one or more `DESTINATIONS` members according to additional configuration hinging on the metric paths themselves.
- It’s also possible to add carbon-aggregator daemons to the mix, which function as combination aggregation + relay services. We don’t cover these here as they’re orthogonal to primary metric collection.
- Depending on how well a single carbon-relay performs and how many collectors you have, you may need an additional load balancing layer between your collectors and (now multiple) carbon-relays.
  - We haven’t dug into this angle as much yet; presumably there’s a lot of room for flexibility when multiple relays are in play, re: whether they all have the same `DESTINATIONS` lists, or split things up amongst themselves.
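For the curious, consistent hashing itself can be sketched in a few lines of Python. This is a simplified illustration of the general technique, not Carbon’s actual implementation (which lives in carbon’s hashing module and differs in details such as key format and replica count):

```python
import bisect
from hashlib import md5

class HashRing(object):
    """Toy consistent-hash ring: each node is placed at many pseudo-random
    positions on a ring; a metric maps to the next node clockwise."""

    def __init__(self, nodes, replicas=100):
        entries = []
        for node in nodes:
            for i in range(replicas):
                entries.append((self._hash("%s:%d" % (node, i)), node))
        entries.sort()
        self.positions = [pos for pos, _ in entries]
        self.nodes = [node for _, node in entries]

    @staticmethod
    def _hash(key):
        # First 4 bytes of the md5 digest, as an integer ring position.
        return int(md5(key.encode("utf-8")).hexdigest()[:8], 16)

    def get_node(self, metric):
        """Walk clockwise from the metric's position to the next node."""
        i = bisect.bisect(self.positions, self._hash(metric)) % len(self.positions)
        return self.nodes[i]
```

The key property is that the mapping is a pure function of the metric name and the node list, so any process holding the same list (relay or webapp) agrees on which cache owns which metric.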
The read/query algorithm
Graphite’s webapp does the following when it receives a query request for metric X:
- Seeks an on-disk Whisper file from `WHISPER_DIR` for X and queries it for the timeframe involved.
- If that query fails (no on-disk metrics for that timeframe for X) and `CLUSTER_SERVERS` is set, sends the request to all `CLUSTER_SERVERS` (each of which recurses through this algorithm), then iterates through the responses and returns the first match.
  - In a typical sharded setup, there would only be one valid result.
- At this point, the data is either from the local disk, or from some other webapp (which itself may have gotten it from disk, or elsewhere, as below).
- Next, it communicates with whichever carbon-cache is expected to be caching any not-on-disk-yet data for X, by way of its `CARBONLINK_HOSTS`.
  - Specifically, a single carbon-cache in that list should be responsible for the metric X; which carbon-cache this is depends on the mode we’re in: hashing or rules.
  - This carbon-cache selection uses the same algorithm as carbon-relay, if that is in play. Thus, `CARBONLINK_HOSTS` must always match `DESTINATIONS` if hashing is involved, or the hashes will mismatch and the webapp will frequently look in the wrong place.
- If carbon-cache had any data for X, it is merged with the on-disk (or other-webapp-queried) data to present the final time series, which is returned to the client.
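Putting the four steps together, the read path can be sketched as follows. This is illustrative stand-in code, not graphite-web’s actual implementation: plain dicts play the roles of the Whisper store, sibling webapps and the carbonlink query, and real series are arrays of (timestamp, value) pairs rather than dicts:

```python
def merge_series(on_disk, cached):
    """Overlay cached (not-yet-written) points onto on-disk points;
    cached values win for timestamps present in both."""
    merged = dict(on_disk or {})
    merged.update(cached or {})
    return merged

def fetch(metric, local_disk, cluster_servers, carbonlink):
    """Sketch of the webapp read path, with dicts standing in for
    Whisper files, sibling webapps and the owning carbon-cache."""
    # 1. Try the local Whisper store (WHISPER_DIR).
    series = local_disk.get(metric)
    # 2. On a miss, ask each CLUSTER_SERVERS sibling; first match wins.
    if series is None:
        for sibling in cluster_servers:
            series = sibling.get(metric)
            if series is not None:
                break
    # 3. Ask the carbon-cache responsible for this metric (selected via
    #    the hash ring over CARBONLINK_HOSTS in real life) for any
    #    datapoints not yet flushed to disk.
    cached = carbonlink.get(metric)
    # 4. Merge and return the final time series.
    return merge_series(series, cached)
```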
Use case breakdown
We’ll examine how the webapp fulfills queries in detail via a series of example setups, from simple to complex.
Generally, we expect resource starvation to begin with single-core CPU saturation by carbon-cache, then move on to disk throughput once multiple carbon-caches are in play. Unless one’s disks are quite slow, it seems unlikely to occur in the reverse order (a single carbon-cache saturating disk throughput), so we don’t consider that scenario.
The base case
One node with one carbon-cache and one webapp, plenty of resources (CPU, etc) available.
- Writes
  - Collector submits to the carbon-cache.
  - Carbon-cache writes near-instantly to `WHISPER_DIR`.
- Reads
  - Query reaches webapp for metric X.
  - Webapp checks `WHISPER_DIR`, pulls up submitted data.
    - `CLUSTER_SERVERS` is empty, so no network requests are made.
  - Webapp checks `CARBONLINK_HOSTS` for cached data.
    - That list just contains the one local carbon-cache; no data is cached, so none is returned.
  - Those two results are merged; on-disk + nothing = on-disk. Done.
Base case with active cache
Same as above, but the disk is slow, or carbon-cache’s write batch settings are higher than the incoming data (i.e. it has to wait a while before it has enough data to justify a write).
- Writes
  - Unchanged, except the carbon-cache now usually has data in its cache which is not on disk yet.
- Reads
  - Unchanged, except now the webapp receives non-empty data when it asks `CARBONLINK_HOSTS`, which is then merged with any on-disk data.
Single node, multiple carbon-caches
We’ve reached the point where a single carbon-cache is maxing out one core, but our disks are still keeping up (i.e. no need to go multi-node yet).
- Writes
  - Instead of metrics going to a single carbon-cache, we now have a carbon-relay listening on the port the carbon-cache used to listen on, and N carbon-caches listening on unique ports on the same node.
  - This carbon-relay, depending on settings (again, hashing vs rules), forwards each data point to one of the backend carbon-caches.
  - That carbon-cache then writes to disk, and depending on disk utilization, probably has some data in its cache most of the time.
  - All carbon-caches are still writing to the same directory tree in `WHISPER_DIR`, but if one’s rules are good or one is using hashing, each file “belongs” to only one carbon-cache, so no write conflicts arise. (Carbon also has a lock setting for when this is not the case.)
- Reads
  - Still the same, except now the webapp has multiple backend carbon-caches in `CARBONLINK_HOSTS` (not just the one).
  - However, because any given metric X will live in only one of the carbon-caches, the webapp still only queries one of them by using the hash ring.
    - OPEN QUESTION: If one is using relay-rules instead, it looks like the webapp won’t work right (the code seems to always assume hashing); perhaps the expectation is that one utilizes `CARBONLINK_HOSTS` differently when rules are in effect?
  - Assuming hashing, the order matters: `CARBONLINK_HOSTS` must match the relay’s `DESTINATIONS` exactly.
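To see why the ordering matters, here’s a toy demonstration using simple modulo hashing (deliberately not Carbon’s real ring, but the failure mode is the same): if the webapp’s host list doesn’t match the relay’s, reads can land on a cache that never saw the writes.

```python
from hashlib import md5

def pick_host(metric, hosts):
    """Toy stand-in for hash-ring selection: deterministically map a
    metric onto one host from an ordered list. Not Carbon's algorithm."""
    digest = int(md5(metric.encode("utf-8")).hexdigest(), 16)
    return hosts[digest % len(hosts)]

destinations = ["cache-a:2004", "cache-b:2104", "cache-c:2204"]
carbonlink = ["cache-b:2104", "cache-a:2004", "cache-c:2204"]  # reordered!

# The relay writes foo.bar to one cache; a webapp using the
# differently-ordered list may well query a different one and
# miss any not-yet-flushed datapoints.
written_to = pick_host("foo.bar", destinations)
queried = pick_host("foo.bar", carbonlink)
```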
Multiple nodes, multiple caches on each, multiple relays
Now a single disk array can’t keep up and we have to shard across multiple disk arrays. Assuming a single relay can still handle the traffic involved, we can do the following:
- Create multiple identical instances of the previous solution: self-contained nodes, each with one relay, multiple caches and one webapp.
  - These nodes, as before, have local caches listed in their local relay’s `DESTINATIONS` and their local webapp’s `CARBONLINK_HOSTS`, preserving the hash ring.
- We then add a single relay on another node (or, if there are CPUs to spare, one of the existing nodes, adjusting ports appropriately) and place the other relays in its `DESTINATIONS`.
  - Optionally, we may set `REPLICATION_FACTOR` to 2 or above to gain some redundancy within the cluster.
That setup gives us the following write and read flows:
- Writes
  - The top carbon-relay shards across the backend nodes’ relays, so that each individual node becomes responsible for one segment of the hash ring.
  - Once metrics arrive at the backend relays, the write pattern looks the same as the previous scenario, with another level of sharding across the local caches.
- Reads
  - We now have multiple webapps in play, so each must have its siblings in `CLUSTER_SERVERS`. However, each still has only local-to-node caches in `CARBONLINK_HOSTS`, again per the previous scenario.
  - Reads may go to any webapp (possibly load-balanced), which will speak to its siblings in `CLUSTER_SERVERS` in the first phase of data discovery:
    - If the metric is found locally, `CLUSTER_SERVERS` is not consulted.
    - If it’s not found locally, all `CLUSTER_SERVERS` are queried and the results tallied.
      - These sibling webapps will be pulling in cached metrics from their local carbon-caches during this step, according to each’s `CARBONLINK_HOSTS`.
    - The webapp in question queries its local `CARBONLINK_HOSTS` and merges that with the result.
    - Done.
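A hypothetical sketch of the extra settings this scenario introduces; all hostnames and ports are invented:

```
# carbon.conf on the top-level relay: its DESTINATIONS are the
# per-node relays, not carbon-caches.
[relay]
RELAY_METHOD = consistent-hashing
DESTINATIONS = node1.example.com:2014:a, node2.example.com:2014:a

# local_settings.py on node1's webapp: siblings in CLUSTER_SERVERS,
# but only local-to-node caches in CARBONLINK_HOSTS.
CLUSTER_SERVERS = ["node2.example.com:80"]
CARBONLINK_HOSTS = ["127.0.0.1:7002:a", "127.0.0.1:7102:b"]
```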
Side scenario: duplicating metrics with even more relays
A quick side note: the `REPLICATION_FACTOR` setting mentioned above can also be used in conjunction with additional levels of relaying to fully duplicate data, e.g. across data centers or simply for extra high availability.
For example:
- Create multiple (let’s say two) clusters as in the previous scenario, including their “top level” relay.
  - Make sure they’re truly independent; their webapps’ `CLUSTER_SERVERS` settings should only include the other nodes in that local cluster, not in the sibling cluster.
- Add a new, actually-top-level relay whose `DESTINATIONS` is those two previously-top-level relays, and set its `REPLICATION_FACTOR` to 2.
- Now metrics entering this new top-level relay are duplicated to both backend clusters, which otherwise perform identically to the previous scenario.