CEPH Webinar - Questions and Answers

➤ Will you continue to offer GlusterFS and TrueNAS support besides the recent uptake of CEPH?

We will continue to support existing GlusterFS setups and FreeNAS/TrueNAS alongside Ceph deployments.

➤ How does CEPH compare NFS performance wise compared to a classical ZFS based server? Many/smaller files access performance?

Great question. In general, a single ZFS server with a single client accessing the storage will have lower latency than a clustered filesystem. Where Ceph really shines is its parallelism and its ability to scale from terabytes to petabytes and even exabytes with no single point of failure. Where you will see Ceph beat a single ZFS server is at scale: with hundreds to thousands of users hitting the storage, a Ceph cluster will outperform a single ZFS server.

➤ Looking for more info on performance tuning with CEPH (like ARC, L2ARC & ZIL in ZFS).

Sounds like you are looking for an accelerated caching option. You can accelerate a Ceph cluster by utilizing either NVMe or SSD storage in combination with an open-source caching tool called open-CAS-Linux.

This combines fast storage with the slower spinning storage for lower-latency applications. You can learn more about this in a video we put together about open-CAS and Ceph: https://www.youtube.com/watch?v=yfkwI2EhHPk&t=357s

➤ Open source backup client recommendations backup 135tb zfs?

If you are using ZFS on both sides, the answer is definitely ZFS send/receive. If they are different file systems, rsync is rock solid.

➤ What amount of OSD’s do you guys recommend per host?

The more the better, but there are some factors to keep in mind. It comes down to which is more important to you: (a) performance/$ or (b) capacity/$. To start with the minimum, we recommend at least 5 drives (OSDs) per host. To maximize performance/$, the sweet spot is 30 OSDs. To maximize capacity/$, the sweet spot is 60 OSDs. 45Drives OSDs are a good balance between the two.

➤ Is Ceph hardware agnostic? Do the servers need to be identical?

Ceph is hardware agnostic, but you want to be careful not to have large variation in the density of the storage nodes. Your storage nodes should be as closely weighted as possible (i.e., the same amount of total storage in each). Essentially, you want to avoid clusters that mix servers with 30 bays and servers with 60 bays.

We do support mixed-server clusters (including non-45Drives servers). For example, we have a couple of Storinator and Supermicro server clusters out in the world.

➤ Can pools be nested? E.g. can I have two OSD level failure domain erasure coded pools combined into a host failure domain replica pool?

No, pools cannot be nested as described. Even if you could, this wouldn’t be an ideal setup in terms of efficiency.

Let's assume we are building a 3 server Ceph cluster. Each server (or node) has 100TB of storage, and the cluster would see raw storage capacity of 300TB total.

If we did a simple 3-replica pool, you would have a usable storage capacity of 100TB (33%).
If we were able to create a 4+2 erasure coded pool (66%) on the OSD failure domain and then replicate it 3 times at the host level, you would only see a usable storage capacity of about 66TB, so only 22% effective. Even if we used 2 replicas instead of 3, you would see a usable capacity of about 100TB, which is no better than the simple 3-replica pool.
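
The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the helper name `usable_tb` is made up for this example, and it uses the exact 4/6 ratio for the 4+2 profile rather than the rounded 66%, so the results differ slightly from rounded figures.

```python
from fractions import Fraction

RAW_TB = 300  # 3 nodes x 100TB each, as in the example above


def usable_tb(raw_tb, *efficiencies):
    """Multiply raw capacity by the storage efficiency of each layer."""
    usable = Fraction(raw_tb)
    for efficiency in efficiencies:
        usable *= efficiency
    return usable


replica3 = Fraction(1, 3)  # 3 full copies of the data
replica2 = Fraction(1, 2)  # 2 full copies of the data
ec_4_2 = Fraction(4, 6)    # 4 data chunks out of 6 total chunks

simple_rep3 = usable_tb(RAW_TB, replica3)
nested_rep3 = usable_tb(RAW_TB, ec_4_2, replica3)  # hypothetical nested layout
nested_rep2 = usable_tb(RAW_TB, ec_4_2, replica2)  # hypothetical nested layout

print(float(simple_rep3))            # 100.0
print(round(float(nested_rep3), 1))  # 66.7
print(float(nested_rep2))            # 100.0
```

Stacking the two schemes multiplies their overheads, which is why the nested layout never beats the plain 3-replica pool on usable capacity.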

The closest approximate setup would be to create a pool with a rule that mixes two failure domains, OSD and host.

Let's assume we want to make an erasure coded pool with a 4+2 profile (4 data chunks and 2 parity chunks) with only 3 servers. This gives 66% efficiency and tolerates the failure of 2 disks (if the failure domain is OSD) or 2 servers (if the failure domain is host). A failure domain of OSD is playing a little fast and loose with redundancy, as you can only lose 2 disks from your total cluster of 3 servers. A failure domain of host sounds great but needs 6 unique servers in the cluster.

The compromise here is to create a 4+2 erasure coded pool that spreads the chunks evenly across the three hosts, with 2 chunks per host.

The CRUSH rule would look like this:

rule rgw4plus2_hdd {
	id 6
	type erasure
	min_size 3
	max_size 6
	step set_chooseleaf_tries 5
	step set_choose_tries 100
	step take default class hdd
	step choose indep 3 type host
	step choose indep 2 type osd
	step emit
}

The two "step choose" lines are the important bits: this rule states that when placing data into the pool, CRUSH is to pick 3 hosts and place two chunks of the data on each.

A config like this tolerates the failure of any 2 OSDs or of 1 whole host without risk of data loss, and still maintains 66% storage efficiency.
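
To see why this layout survives a host failure, here is a small Python model of where the 6 chunks of a single object end up. It is a simplified sketch, not how CRUSH itself is implemented: the host and OSD names are invented, and it only counts surviving chunks against the 4 needed to rebuild the data.

```python
from itertools import combinations

K, M = 4, 2  # 4 data chunks + 2 parity chunks
HOSTS = ["host1", "host2", "host3"]  # hypothetical host names
CHUNKS_PER_HOST = 2

# Mimic the outcome of the rule: pick 3 hosts, then 2 OSDs on each host.
placement = [
    (host, f"{host}-osd{i}") for host in HOSTS for i in range(CHUNKS_PER_HOST)
]
assert len(placement) == K + M


def osds_on(host):
    """Return the set of OSDs holding a chunk on the given host."""
    return {osd for h, osd in placement if h == host}


def survives(failed_osds):
    """Data is recoverable while at least K chunks remain readable."""
    remaining = [osd for _, osd in placement if osd not in failed_osds]
    return len(remaining) >= K


all_osds = [osd for _, osd in placement]

any_one_host = all(survives(osds_on(h)) for h in HOSTS)
any_two_osds = all(survives(set(pair)) for pair in combinations(all_osds, 2))
host_plus_one = all(
    survives(osds_on(h) | {osd})
    for h in HOSTS
    for osd in all_osds
    if osd not in osds_on(h)
)

print(any_one_host, any_two_osds, host_plus_one)  # True True False
```

Losing a whole host removes exactly 2 chunks, which is the most a 4+2 profile can absorb; one more OSD failure on top of that leaves only 3 chunks.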

➤ How would you recommend connecting to the cluster from the internet? The application is a huge POSTGRES database used as the base for a search engine.

Put all storage on an internal network - don’t connect your cluster directly to the internet.

You will want to look at using a proxy service that can communicate with the storage on the back end and with the users on the front end, but does not allow the users to communicate directly with your storage.

➤ Do you have any experience in latency difference/performance between 10GBASE-T and 10GBASE-LR/SR in the Ceph OSD machines?

It would really depend on the application consuming the storage. If it is really latency sensitive (e.g., database, VM storage), you will notice it. If you are streaming large sequential files (video, high-res imaging), then you most likely won't notice.

*10GBASE-LR/SR does add a few nanoseconds of latency due to the electrical-to-optical conversion and back.

➤ If CANARIE network access is possible (100Gb/s). How do we ensure multiple sites can have a working copy of scientific data and allow for site specific data? At some point, the site specific may be moved over to the multi-site datasets.

There are two options to look at here: a single spanned cluster or multiple independent clusters.

Single spanned cluster
- 1 big logical cluster that spans multiple locations. This is the “simplest” solution, as it just functions as one big cluster. All data access types can be used (filesystem, block, and object). Ceph is synchronous in nature, so the caveat is that you need a low-latency, high-bandwidth link between each site.
Multiple independent clusters
- 2 or more independent clusters that can communicate with each other and keep data in sync. A caveat to this is that you are limited to object (S3) or block (RBD) only.
Check out these two videos, which give a little detail on the options here:
- https://www.youtube.com/watch?v=X5GIix2x3Pk&t=1s
- https://www.youtube.com/watch?v=-Q_W9MKv2JE&t=1s

In both cases above, you are able to have both multisite and site-specific data. Which option fits best will depend on whether file system, block, object, or a mix is required.

➤ Can the gateways be virtual?

Yes, gateways can be virtual. Keep in mind that the Metadata Server (MDS), which CephFS needs in order to function, is very RAM intensive, as the service is essentially a big cache of filesystem metadata. This can be restrictive in some virtualized environments where system memory is at a premium.

➤ Can you create a cluster from an existing server, and retain the data on the existing server? What if you have two servers, could you retain the data on both and create the cluster?

Short answer, no. (also answered live)

Ceph will need to entirely reformat the HDD/SSD when it creates an OSD, so unfortunately all data currently on a server will need to be wiped.

45Drives offers a data migration service for those in this scenario who need to temporarily offload data when migrating to a cluster.

See more detail here: https://www.45drives.com/products/data-migration/

➤ How does the cluster behave if some OSD’s are failing or reading/writing very slow but are not completely dead yet?

Ceph is highly parallel, so a large percentage of data will not be affected. For the pieces of your dataset that are affected by a dying HDD, Ceph will report “slow operations” (or “slow ops”) on the affected OSDs. From this an alert is sent, and an admin can mark the OSD out of the cluster. At this point, Ceph will regenerate the data from that slow OSD elsewhere in the cluster, and the admin can replace the failing disk at their convenience.

Ceph also has a module called “diskprediction” whose job is to estimate when a disk will fail. It has a function where it can kick out an OSD it has deemed about to fail. Note that this feature is new as of the current stable release of Ceph (v14 Nautilus), and since it is based on gathered data, it will get more and more accurate as new releases become available.

➤ Can some OSD’s cause a performance decrease for a complete cluster?

See above answer

➤ Do additional drives need to be the same vendor or size?

They do not need to be the same vendor or the same size. However, when it comes to disk size, you want to be careful not to have large variation in the density of the storage nodes. Your storage nodes should be as closely weighted as possible (i.e., the same amount of total storage in each).

So feel free to have a mixed capacity of disks but try to have them evenly balanced throughout all storage servers.

➤ How does Georeplication work?

https://www.youtube.com/watch?v=X5GIix2x3Pk

See the video above for multisite clusters.

For the filesystem (CephFS), we have built a tool that utilizes rsync and CephFS “ctime” to intelligently send data to a remote filesystem. Note that filesystem geo-replication is one-way replication.

https://github.com/45Drives/cephgeorep

➤ Can the system control an onsite, and an offsite data set in one go?

See answer for “If CANARIE network access is possible (100Gb/s). How do we ensure multiple sites can have a working copy of scientific data and allow for site specific data? At some point, the site specific may be moved over to the multi-site datasets.”

➤ I noticed that samba is not in the dashboard, is that a separate function?

It is currently managed via Ansible. It is on the roadmap for the next release of Ceph, Pacific (v16).

➤ What happens if you lose some servers due to a power issue, does Ceph automatically heal when they come back online?

Yes, Ceph will automatically heal after an unexpected power loss. The way Ceph monitors deal with quorum prevents split brain, and the cluster will start up again and sort itself out without human intervention. We have had several customers lose power unexpectedly and be pleasantly surprised that the storage cluster came back to a healthy state with no intervention.
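
As a rough illustration of why a majority-based quorum prevents split brain, the check below models the rule that only a strict majority of monitors may form a quorum. The function name `has_quorum` is hypothetical and is not a Ceph API; it just shows the counting argument.

```python
def has_quorum(reachable_mons, total_mons):
    """A group of monitors can form quorum only with a strict majority."""
    return reachable_mons > total_mons // 2


TOTAL_MONS = 5  # assumed monitor count for this example

# A network split leaves 3 monitors on one side and 2 on the other.
majority_side = has_quorum(3, TOTAL_MONS)
minority_side = has_quorum(2, TOTAL_MONS)

print(majority_side, minority_side)  # True False
```

Because two disjoint groups can never both hold a strict majority, at most one side of a split keeps serving the cluster map, and the other side waits until it can rejoin.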

➤ How granular are the permissions? We have a design where our folders in a given user folder have one set to Dropbox (write only) one set to Read only and other set to modify.

The detailed answer depends on whether you are using file, block, or object, but long story short, all three are very granular and can accomplish what is asked. It sounds like your solution is using a filesystem. CephFS is a POSIX-compliant filesystem, and permissions are configured the same way as on standard Linux filesystems. It can also be extended to have SMB/CIFS and NFS access. SMB/CIFS can be controlled with Windows ACLs, and NFS will retain whatever Unix permissions are set on the underlying CephFS share.

➤ With the throughput graphs, when I see a slowdown, can I drill down and find out if maybe one of my hosts, or one of my hosts' drives is the cause of the slowdown?

Yes, all individual device (CPU, Memory usage, Disk utilization and throughput) metrics are exposed and organized by host.

You could have a quick glance at the overall hosts screen, see that, say, host1 is acting up, then look at the host-specific metrics and isolate whether the cause is CPU, system memory, network, or a specific disk or group of disks.

➤ Is Ceph the complete storage solution? How do existing NAS platforms like FreeNAS/TrueNAS interact with Ceph for storage?

Ceph is just about as much of a complete storage solution as any storage solution can be. What we mean by this is that Ceph can do block storage (native RBD/iSCSI), file system (native CephFS/SMB/NFS/FTP, etc.), and object storage (S3/Swift). Not only that, it can scale practically without limit in both size and performance, which comes with the massive advantage of keeping a single namespace. FreeNAS/TrueNAS are built as single servers with ZFS as the underlying file system/RAID/volume manager and so are not designed to scale past a single server. Unfortunately, you cannot cluster FreeNAS/TrueNAS.

➤ Do you guys provide any Cyber security feature? If the server or hard drive get infected?

Ceph can encrypt data at rest by using dm-crypt on the OSDs, or self-encrypting drives (SEDs). Data on the wire (communication between the Ceph nodes) is encrypted via the Ceph-specific messenger v2 (msgr2) protocol.

Ceph supports snapshotting, and data can be recovered by rolling back to older snapshots.

➤ What restrictions will EU based customers face in your support portfolio (different continent)?

No onsite support is available. Other than that, you have full access to our support team.

More details here: https://www.45drives.com/support/

➤ What is the difference between Cephadm and ansible? How does it affect manageability of the Ceph cluster?

Cephadm is a deployment tool created for the latest release of Ceph, Octopus (version 15). It deploys and manages the Ceph services by using containers instead of bare-metal RPM installs. Cephadm was created by the Ceph developers to replace third-party deployment tools like Ansible or Puppet and to ease the learning curve of building a Ceph cluster for the first time. So, you don’t have to learn how to use Ansible and Ceph at once.

In my opinion, Ansible deployments of Ceph are not going away, whether or not you choose to deploy containerized with Cephadm. Ansible is too powerful and flexible an automation framework to not be useful. But I will say the Cephadm tool is an awesome piece of software built by the Ceph developers, so hats off to them!

➤ How easy is to add a node to the Ceph cluster?

Very easy. Using the Ceph-Ansible playbooks, once a server is racked and its networking is configured, you can add the new node or nodes seamlessly and watch your usable storage grow, completely transparent to the end users.

➤ Your launch of 90 bay server when can we expect?

It is not currently on our roadmap

➤ Is off site replication of data available?

See the above multisite answers

➤ Do you have an iLO or DRAC like device for the Storinators?

All of our Storinators use IPMI for remote out of band management.

➤ While scaling, do you also add more gateways along the way?

If you need to scale out capacity you only need to add more storage nodes.

If your user count increases, or if you want more parallel access into the cluster, then scale gateways as well.

➤ Can somehow Ceph match GlusterFS in performance for same amount of data?

CephFS will outperform GlusterFS as it scales larger and larger. Ceph is built to be massively more parallel than GlusterFS. CephFS also uses centralized metadata where GlusterFS does not, and it allows the size of the metadata cache to grow with the amount of data, which GlusterFS does not.

You will find that CephFS will scale linearly as data increases, where GlusterFS will eventually level off and fall off after passing the PB range.

➤ I am ready to jump in, but also feel the need to deeply understand the CEPH stack. Do you offer tech support on a break/fix model? CEPH training for our techs?

45Drives is very flexible on how you can use your support hours when you purchase a cluster. If you would like to use some of those hours for training, we would be happy to jump into a live demo with you, run through day-to-day administration, and show you how to do any normal tasks. For anything more complex, that’s what our support team is here to help you with!

➤ What's the purpose of the gateway? It seems more like a bottleneck to the storage servers than anything else. Why use a gateway at all?

Gateways are only for those systems that are not speaking native RADOS/librados to Ceph.

➤ Is it suitable for HD video editing, I.e. regarding latency /throughput performance?

For 1080p HD editing on a well-architected Ceph cluster, you should have no issue editing directly from your SMB/NFS network shares. On a per-client basis you can expect slightly better than a single HDD, meaning sequential read/write speeds of between 350 and 400MB/s. With a Ceph cluster using SSDs, you can expect slightly better than a single SSD.

➤ How much RAM/CPU per TB?

We recommend at least 1 hyper-threaded core per OSD. By default, each OSD will also use 4GB of RAM. However, we have found it acceptable to decrease that 4GB number for many clusters, depending on the use case. We will never recommend lower than 2GB.
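
As a quick sizing aid, the guideline above can be turned into a small calculation. This is a sketch under stated assumptions: the function name `osd_host_minimums` is made up for this example, and the result covers the OSD daemons only, not the operating system or any other services running on the host.

```python
def osd_host_minimums(osd_count, gb_per_osd=4.0):
    """Estimate minimum cores and RAM for the OSDs on one host.

    Follows the guideline above: 1 hyper-threaded core per OSD and
    4GB of RAM per OSD by default, never going below 2GB per OSD.
    """
    if osd_count < 1:
        raise ValueError("osd_count must be at least 1")
    if gb_per_osd < 2:
        raise ValueError("less than 2GB of RAM per OSD is not recommended")
    return {"cores": osd_count, "osd_ram_gb": osd_count * gb_per_osd}


print(osd_host_minimums(30))                # {'cores': 30, 'osd_ram_gb': 120.0}
print(osd_host_minimums(60, gb_per_osd=2))  # {'cores': 60, 'osd_ram_gb': 120}
```

Note that the guideline scales with the number of OSDs rather than with raw terabytes, so larger drives do not by themselves increase the core or RAM requirement.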

➤ Can Ceph storage work for ESXi cluster or Kubernetes clusters?

When using Ceph as your back end for something like ESXi, you would need to use another layer. VMware does not speak Ceph. However, VMware does speak NFS and iSCSI. If your main use case for Ceph would be as datastores for ESXi, we would recommend using PetaSAN. It is an open source appliance that uses Ceph underneath but takes advantage of SUSE Linux’s Ceph-iSCSI project. PetaSAN is purpose-built to act as a highly available SAN built on Ceph storage.

Ceph can be deployed as containers natively in Kubernetes using a project called Rook. Rook uses the power of the Kubernetes platform to deliver its services via a Kubernetes Operator.

➤ Is it possible to use both host and drive failure domains? If so, does it make sense to use both?

Yes, and yes, with the right use case.

➤ Are Ceph snapshots stable and performant for long term or persistent use? Do snapshot operations operate on metadata (e.g. NetApp WAFL or ReFS) or are they more like LVM or ESXi where deltas must be copied back?

CephFS snapshots are stable, and we deploy them in many customers' production clusters. CephFS snapshots are copy-on-write and are created by making a directory inside a hidden .snap directory. The snapshot instantly creates a read-only copy of whichever subtree of your file system you choose, covering all data at that level and under that directory. For example, if you want to make a snapshot of the entire file system, you would make your snapshot directory inside the .snap directory at the root of the file system, and so on.

Snapshots are asynchronous and when a snapshot is created, the buffered data is flushed out lazily. This makes the snapshot very fast.
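
Since a snapshot is just a directory created under .snap, taking one from a script takes a single mkdir call. The sketch below assumes a mounted CephFS with snapshots enabled and the default ".snap" directory name; the function name and the mount path in the comment are hypothetical.

```python
import os


def create_cephfs_snapshot(directory, snap_name):
    """Snapshot `directory` by making a subdirectory inside its .snap dir.

    On CephFS the .snap directory is virtual and exists in every
    directory; on other filesystems this call fails unless .snap exists.
    """
    snap_path = os.path.join(directory, ".snap", snap_name)
    os.mkdir(snap_path)
    return snap_path


# Example (hypothetical mount point, not run here):
# create_cephfs_snapshot("/mnt/cephfs/projects", "before-upgrade")
```

Removing the snapshot is the reverse operation: removing that same directory from .snap.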

➤ How do you handle tiering with nvme, ssd, hd, etc. Can you control rules for how this data is placed on a folder, file type, or time-based?

There is no automatic tiering in place with Ceph, but there are a few ways you can make use of different tiers of storage. The first, and most basic, is to build different pools based on each tier of storage and give each a separate use. The next would be to use faster media such as SSD or NVMe as journals (DB/WAL) to improve the latency and performance of your slower HDD-based OSDs. The next would be to use the faster tier of media completely as a cache. Here at 45Drives we take advantage of something called OpenCAS Linux. The OpenCAS framework was paid for and developed by Intel, who then open sourced it. From that, OpenCAS Linux was built. OpenCAS allows you to take faster media such as SSD/NVMe and pair it with HDD storage. It builds a new virtual block device made up of the SSD and the HDD, which can act as a read/write cache or as a read-only or write-only cache if you’d like.

➤ Does each node need to be the exact same build\config as the others?

They don’t have to be completely the same, however, you will want to keep them as close as you possibly can. The lowest-performing node in the cluster will drag performance down for the entire cluster.

➤ ZFS excels in snapshots and snapshot-based replication. Do you support ZFS on RBD?

Yes! CERN does this too!

➤ Do you have a recommended design/architecture for large scale backups; for example, a 1PB pool? Any recommended software for backups at this size?

It all depends on what you are trying to back up. Whether it is a file system, RBD’s, or object storage.

For example, if it is a file system, we would recommend using our purpose-built daemon, which we call Ceph Georep. It allows you to georeplicate your file system to a remote location asynchronously and continuously.

If it is block storage (RBDs), I would recommend looking at RBD mirroring, where you have a second cluster that your RBDs get asynchronously replicated to.

If it is object storage, you may want to look at Ceph multi-site. It allows you to have multiple Ceph clusters that replicate all of their object data to stay completely in sync if you want to configure it in that way.

➤ What version of Ceph is 45Drives deploying for new installs?

Nautilus

➤ Does a Ceph cluster need backups? Is it resilient enough to not need backups? Versioning?

Ceph is about as resilient and rock-solid as storage can be. However, nothing is resilient enough to not need backups. If your data is critical and you cannot risk losing it, you will always want to adhere to the 3-2-1 rule. This dictates that you should have 3 copies of your data, 2 of which can remain on-site (Ceph handles this part) and 1 off-site. This saves you in the event your building gets hit by an asteroid :)