How Ceph Stores Data

Hey everyone, Brett Kelly here for another Tuesday Tech Tip. So today we're talking Ceph again, and in particular how it stores data. In the past I've talked, kind of high level, about how it stores its data: replication, erasure coding (and we all know about that). But that's kind of high level, right? Replication is “give me ‘n’ number of copies”; erasure coding is “cut it up into chunks and give me some parity”.

But we still get this question a lot, from customers and from everyone else: how does it actually store it behind the scenes? Where does it go? Like, I know it's creating replicas, but you told me I've got a big sea of hard drives here across the nodes, so how do I know that the two copies in my replicated 2 pool aren't going on the same hard drive? Right? I assume that is not how it works, but how does that work?

And we get questions like that all the time, so what we wanted to do for this tech tip is hop over to my computer. I’m gonna do a little screen cap, and I'm just gonna walk you through where the data gets stored. I'm gonna show you a little bit behind the scenes of a CRUSH map, and the idea of the rules by which Ceph determines where the objects in your data should live, such that everything's always safe. So, let's go over to my computer and I will get into it.

Okay, so we're gonna get started with a pretty small Ceph cluster here, just three nodes with two OSD disks in each one, all hard drives. The idea here is just to easily show the data distribution. So, we've spent a lot of time talking about how Ceph can present and store your data as either file, block, or object. We've talked about how it keeps it safe by keeping replicas of the data, or with erasure coding, how it splits it up into “K” number of data chunks and “M” number of parity chunks and distributes them around the cluster, similar to a RAID. But we never really talked about how Ceph quite keeps things safe behind the scenes. We know it's just a bunch of disks and a bunch of storage servers, no RAID. What happens if a disk fails? I know I've got copies, but what happens if my copies are all on the same hard drive? How does Ceph prevent this from happening? So that's what I want to talk about today.
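To put rough numbers on replication versus erasure coding, here is a minimal Python sketch. The function names are made up for illustration, and it only counts raw space used and how many lost copies or chunks a pool can survive without losing data; it ignores real-world details such as a pool's min_size setting.

```python
def replicated_pool(copies):
    """Replication: every object is stored `copies` times in full."""
    return {
        "raw_per_usable": float(copies),   # raw bytes written per byte of data
        "losses_survived": copies - 1,     # one surviving copy is enough to rebuild
    }


def erasure_coded_pool(k, m):
    """Erasure coding: k data chunks plus m parity chunks per object."""
    return {
        "raw_per_usable": (k + m) / k,     # only the parity chunks are overhead
        "losses_survived": m,              # any k of the k+m chunks can rebuild
    }


if __name__ == "__main__":
    print("3x replication:", replicated_pool(3))
    print("EC k=4, m=2:   ", erasure_coded_pool(4, 2))
```

Either way the protection only means something if the copies or chunks land on different disks and hosts, which is the question the rest of this walkthrough answers.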

So, the key driver of all that is called the CRUSH algorithm, and that's “Controlled Replication Under Scalable Hashing”. Now, the CRUSH algorithm essentially determines how to store or retrieve the data in the cluster. How does it do that? Well, it stores to and retrieves from the data locations. So what are these data locations? Well, the data locations are the physical hard drives, or SSDs, or NVMe, whatever your physical storage media is, organized into OSD hosts, and it is best represented as a tree. And this thing is called the CRUSH map. It's the organization, it's the visualization, it's how we as people, and how the CRUSH algorithm, can visualize our storage cluster, if you will.

So if we look at my screen here (I'm on the Ceph dashboard), we're looking at the default CRUSH map that was created when I built this cluster. Now you can get pretty crazy with custom CRUSH maps, but we're just gonna stay with the default here to get an understanding of what's going on. So as you can see, we've got our default root, and that's just the root of our tree of all our possible storage locations. I have three hosts, and in each host there are two physical OSD hard drives.
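As a rough picture of that tree, here is a small Python sketch that models the default map as nested dictionaries. The host names (host-a, host-b, host-c) are hypothetical, and the OSD pairs per host follow the pairs mentioned later in this walkthrough; on a real cluster, “ceph osd tree” prints the actual hierarchy.

```python
# Toy model of the default CRUSH map: one root, three hosts, two OSDs per host.
# Host names are hypothetical placeholders, not the names used in the demo.
crush_map = {
    "default": {
        "host-a": [1, 5],
        "host-b": [2, 3],
        "host-c": [0, 4],
    }
}


def render_tree(tree):
    """Return the map as indented lines: root, then hosts, then OSDs."""
    lines = []
    for root, hosts in tree.items():
        lines.append(f"root {root}")
        for host, osds in hosts.items():
            lines.append(f"  host {host}")
            for osd in osds:
                lines.append(f"    osd.{osd}")
    return lines


if __name__ == "__main__":
    print("\n".join(render_tree(crush_map)))
```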

So this is where the data's gonna go, but these are all individual hard drives. If I write to a three-replica storage pool that's living on this cluster, how do I know that all three copies aren't gonna go onto OSD.0 here? Because there are two problems with that - one, it's not very safe at all, and two, it's just uneven data. We want a clean, even distribution, and we do not want our copies all on the same storage media, or, going even further, even on the same host.

So how does Ceph - how does CRUSH do this? Well, it uses the concept of a CRUSH rule, and the CRUSH rule is exactly that - the rule that tells the CRUSH algorithm how to place the data in the CRUSH map, here, in the tree, such that everything is very safe. So, let me hop over into this screen here, and I'll show you.

Let's look at the command here to create a replicated rule. The concept I'm gonna get into applies to erasure coding as well, but it's just really easy to visualize with replication, so we'll focus on that. So with this command we're gonna create a new replicated CRUSH rule. It needs a name (everything needs a name), and then the root. The root is our default root here; again, I said you can get pretty crazy with custom maps, but for the time being we'll just use the default root. And the magic here - the really important one, the one that answers the question I was asking, “how do I keep my data dispersed?” - is the failure domain type.
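For reference, the general form of that command is “ceph osd crush rule create-replicated <name> <root> <failure-domain-type> [<class>]”. The Python sketch below only assembles the argument list so the pieces are easy to see; the helper function and the two rule names are hypothetical, not necessarily the ones used in the demo, and nothing here talks to a cluster.

```python
import shlex


def create_replicated_rule_cmd(name, root="default",
                               failure_domain="host", device_class=None):
    """Build the argv for `ceph osd crush rule create-replicated`."""
    argv = ["ceph", "osd", "crush", "rule", "create-replicated",
            name, root, failure_domain]
    if device_class:
        argv.append(device_class)  # optional: restrict the rule to one class
    return argv


# Hypothetical rule names, chosen only for this illustration.
host_rule = create_replicated_rule_cmd("replicated_hdd_host", device_class="hdd")
osd_rule = create_replicated_rule_cmd("replicated_hdd_osd",
                                      failure_domain="osd", device_class="hdd")

if __name__ == "__main__":
    print(shlex.join(host_rule))
    print(shlex.join(osd_rule))
```

The two rules differ only in the failure domain argument, which is the part that decides how far apart the copies have to be.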

So this failure domain type can be a number of different options, but we'll just talk about two right now to keep it simple. It can either be “host”, or it can be “OSD”. What that means is, if I make this replicated rule with my failure domain set to “host”, and then I write to a pool that has this rule applied to it with three replication, that guarantees that each copy of the data, 1, 2, and 3, has to go to a unique host in the CRUSH map, so one copy on each of my three hosts here. Within each host it could land on either of its OSDs (1 and 5, 2 and 3, or 0 and 4), but no copy of this object can ever live on the same host as another one. That's what this failure domain type does. And then the last option here is “class”, the device class, whether it's hard drive, SSD, or NVMe. That's more about performance, not so much about keeping your data safe. So, that's the failure domain type.

The second one we could use here is the OSD failure domain, and what that would mean is the same idea, but no more than one copy of the data would live on the same OSD. So, we're safe against drive failure, but not so much against host failure. So you don't often see people in the field actually using OSD as the failure domain level; that's not a very safe way to build a cluster. As a matter of fact, as these things get bigger, it's better to abstract out even further, organize your hosts into data centers or rooms, and spread the data out that way.
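To make the difference between the two failure domains concrete, here is a toy Python sketch. It is not the real CRUSH algorithm: it uses a simple highest-hash-wins ranking (rendezvous hashing) as a stand-in, and the host names are hypothetical. It only illustrates the constraint each failure domain places on where copies may land.

```python
import hashlib

# Hypothetical hosts; OSD pairs per host as described in the walkthrough.
HOSTS = {"host-a": [1, 5], "host-b": [2, 3], "host-c": [0, 4]}


def score(obj, item):
    """Deterministic pseudo-random score for an (object, item) pair."""
    digest = hashlib.sha256(f"{obj}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def host_of(osd):
    return next(h for h, osds in HOSTS.items() if osd in osds)


def place(obj, copies, failure_domain):
    """Pick `copies` OSDs for `obj`, unique per host or only unique per OSD."""
    if failure_domain == "host":
        ranked = sorted(HOSTS, key=lambda h: score(obj, h), reverse=True)
        if copies > len(ranked):
            raise ValueError("not enough hosts for that many copies")
        # One OSD from each chosen host, so no two copies share a host.
        return [max(HOSTS[h], key=lambda o: score(obj, f"osd.{o}"))
                for h in ranked[:copies]]
    if failure_domain == "osd":
        osds = [o for group in HOSTS.values() for o in group]
        if copies > len(osds):
            raise ValueError("not enough OSDs for that many copies")
        # Distinct OSDs only; two of them may sit in the same host.
        return sorted(osds, key=lambda o: score(obj, f"osd.{o}"),
                      reverse=True)[:copies]
    raise ValueError(f"unknown failure domain: {failure_domain}")


if __name__ == "__main__":
    for name in ("object1", "object2"):
        for domain in ("host", "osd"):
            osds = place(name, 3, domain)
            print(name, domain, osds, [host_of(o) for o in osds])
```

With “host”, every placement spans three different hosts; with “osd”, the three OSDs are distinct but can share a host, which is why it protects against a drive failure and not a host failure.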

But that's really the point I wanted to make. How does Ceph guarantee that the data does not end up living on the same host or the same disk, or, as we extrapolate further, maybe the same room or the same data center? It's this failure domain type of the CRUSH rule that's applied to the storage pool, which tells the CRUSH algorithm how to behave, how to keep the data safe, and where to pull it back from.

Okay, so that was a lot of talk, so let's just show you what I'm talking about here. I've already made a couple of rules, so I'm gonna hop over to the pool screen, I'm gonna make a new pool and I'm just gonna call it “test_pool”. And we're gonna make a replicated pool - I'm just gonna pick a small number of placement groups (doesn't matter so much for this), and I'm gonna use one of the rules that I have. The “replicated_rule” gets made by default, replicated HDD is another one I have here, and let me just hit this. So what this is saying is it's a replicated rule, my device class is hard drive, so only put data from this pool on hard drives, and my failure domain is “host”. So let's make it a rep 3, and we'll just put RBD on it to silence the warning. Yeah, so let's create the pool. We'll just take a minute for it to go active and clean, okay. So, there's our pool.

Okay, so we have our test pool created: replicated, device class set so it only lives on hard drives, and failure domain set at the host. So, what I'm going to do is I have a file created here, just “hello world”, fun classic example, and we're going to put this into the cluster. So I'm gonna say “rados -p test_pool put object1 hello.txt”. Alright, so it's in there. “rados -p test_pool ls”, here we go. And if we look at “ceph status”, here's our one massive object of six bytes.

So, that's all well and good, it’s in the cluster, but how do you know it's following our rule? So, let's take a look at “ceph osd map” and then “test_pool object1”, and I’m gonna fix the format. Okay, so what we're looking at here is this (my mouse sucks), this “up” block. What that's telling us is those are the three OSDs that each hold a copy of this data, so right away you can see our 3 rep is following the rules and we have 3 copies of the data. These are the OSD IDs, which is how we recognize the OSD disks, so let's find out if it is correctly mapped out.

So we're looking at 3, 4, and 5: 4 is on host OSD 2, 3 is on host OSD 3, and 5 is on host OSD 1, so right there they're all living on three unique hosts. Great, looks like it's following the rule. So let's do this, let's go “ceph osd pool set test_pool crush_rule”. I had another CRUSH rule made; it’s the exact same rule, except the failure domain is at the OSD level now rather than the host. So, let's set this rule, “HDD OSD”.

Okay, so that's there, and what we're gonna do now is just put that same object back onto the pool under a separate name and see where it maps to: “object2 hello.txt”, and I have to put “put”, to tell it what to do. There it is, 2 objects of 12 bytes, and let's map that out: “ceph osd map test_pool object2 -f json-pretty”.

Alright, so, here is our “object2”, made on the pool after we changed the rule to the OSD failure domain, and we see 1, 5, and 3. So 1 and 5 are on the same host, and that goes to show you right there - that's kind of a way to illustrate the rule. That failure domain is really the key part that keeps your data safe. Of course this is a simple example with just host and OSD. A lot of large clusters span rooms or data centers, and you can extrapolate these maps out bigger: have a data center level under your default root, then many hosts below it, and organize data out that way. So it's an incredibly scalable way of storing data; that's why Ceph is such an amazing project. Hope this gave you a little more insight into where your data is actually going, and that you can trust that it's safe and sound.
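The check done by eye above can also be sketched in a few lines of Python. The JSON strings below are trimmed, illustrative stand-ins for the “ceph osd map ... -f json-pretty” output that keep only the “up” list with the OSD IDs seen in the demo; the host names are hypothetical, and on a real cluster the OSD-to-host mapping would come from “ceph osd tree”.

```python
import json

# Hypothetical host names; OSD pairs per host as described in the walkthrough.
OSD_TO_HOST = {1: "host-a", 5: "host-a",
               2: "host-b", 3: "host-b",
               0: "host-c", 4: "host-c"}


def hosts_for(osd_map_json):
    """Translate the `up` OSD list of an osd-map result into host names."""
    up = json.loads(osd_map_json)["up"]
    return [OSD_TO_HOST[osd] for osd in up]


def respects_host_domain(osd_map_json):
    """True when no two copies of the object share a host."""
    hosts = hosts_for(osd_map_json)
    return len(set(hosts)) == len(hosts)


# Trimmed, illustrative stand-ins for the two mappings seen in the demo.
object1 = '{"up": [3, 4, 5]}'   # written under the host failure domain rule
object2 = '{"up": [1, 5, 3]}'   # written under the OSD failure domain rule

if __name__ == "__main__":
    for name, raw in (("object1", object1), ("object2", object2)):
        print(name, hosts_for(raw), "host-safe:", respects_host_domain(raw))
```

The first mapping passes because all three copies sit on different hosts; the second fails because OSDs 1 and 5 share a host.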

So there it is, a little deeper look into what Ceph's actually doing behind the scenes in keeping your data safe and how it follows the rules to keep everything dispersed. Again, yeah, hope you learned something, hope you enjoyed that.

As always, reach out with questions or comments on any of our social media, we'd love to hear from you.

Fun fact time - as always, I'm out of fun facts now, so I'm gonna go to my old trusty friend Reddit here. I was browsing the other day and found a good thread, it was like “what's your favorite fun fact to pull out”, they're always fun to read through, and found a couple good ones. But one of them actually - the reason why it is “Top 40 Hits” is that jukeboxes could only fit 40 records on them so the owners would use the top 40 list as a way to know which records to play, as the more popular ones will get played more, and thus get more money. Honestly it just sounds like a teeny storage cluster.

Alright, well thanks for watching and we'll see you all next week, thanks.