If you follow best practices for deployment and maintenance, Ceph becomes a much easier beast to tame and operate. Here’s a look at some of the most fundamental and useful Ceph commands we use on a day to day basis to manage our own internal Ceph clusters, and those of our customers.
1. status
First and foremost is ceph -s, or ceph status, which is typically the first command you’ll want to run on any Ceph cluster. The output consolidates many other command outputs into one single pane of glass that provides an instant view into cluster health, size, usage, activity, and any immediate issues that may be occurring.
HEALTH_OK is the one to look for; it’s an immediate sign that you can sleep at night, as opposed to HEALTH_WARN or HEALTH_ERR, which could indicate drive or node failure or worse.
Other key things to look for are how many OSDs you have in vs out, how many other services you have running, such as rgw or cephfs, and how they’re doing.
$ ceph -s
cluster:
id: 7c9d43ce-c945-449a-8a66-5f1407c7e47f
health: HEALTH_OK
services:
mon: 1 daemons, quorum danny-mon (age 2h)
mgr: danny-mon(active, since 2h)
osd: 36 osds: 36 up (since 2h), 36 in (since 2h)
rgw: 1 daemon active (danny-mgr)
task status:
data:
pools: 6 pools, 2208 pgs
objects: 187 objects, 1.2 KiB
usage: 2.3 TiB used, 327 TiB / 330 TiB avail
pgs: 2208 active+clean
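If the status is anything other than HEALTH_OK, a couple of closely related commands are worth reaching for straight away. Output varies from cluster to cluster, so treat this as a quick sketch rather than a prescription:
$ ceph health detail      # expand each warning or error into its specific cause
$ ceph -w                 # watch cluster events stream in live while you investigate
$ ceph -s -f json-pretty  # the same status, but machine-readable for scripts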
2. osd tree
Next up is ceph osd tree, which provides a list of every OSD along with its class, weight, status, the node it lives on, and any reweight or priority. In the case of an OSD failure this is the first place you’ll want to look, since it points you to the right node if you need to check OSD logs or investigate a local hardware failure. OSDs are typically weighted against each other based on size, so a 1TB OSD will have twice the weight of a 500GB OSD, in order to ensure that the cluster fills its OSDs at an equal rate.
If there’s an issue with a particular OSD in your tree, or you are running a very large cluster and want to quickly check a single OSD’s details without grepping or scrolling through a wall of text first, you can also use ceph osd find. This command will enable you to identify an OSD’s IP address, rack location and more with a single command.
$ ceph osd tree
ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF
-1 329.69476 root default
-3 109.89825 host danny-1
0 hdd 9.15819 osd.0 up 1.00000 1.00000
1 hdd 9.15819 osd.1 up 1.00000 1.00000
2 hdd 9.15819 osd.2 up 1.00000 1.00000
3 hdd 9.15819 osd.3 up 1.00000 1.00000
4 hdd 9.15819 osd.4 up 1.00000 1.00000
5 hdd 9.15819 osd.5 up 1.00000 1.00000
6 hdd 9.15819 osd.6 up 1.00000 1.00000
-7 109.89825 host danny-2
12 hdd 9.15819 osd.12 up 1.00000 1.00000
13 hdd 9.15819 osd.13 up 1.00000 1.00000
14 hdd 9.15819 osd.14 up 1.00000 1.00000
15 hdd 9.15819 osd.15 up 1.00000 1.00000
16 hdd 9.15819 osd.16 up 1.00000 1.00000
17 hdd 9.15819 osd.17 up 1.00000 1.00000
-5 109.89825 host danny-3
24 hdd 9.15819 osd.24 up 1.00000 1.00000
25 hdd 9.15819 osd.25 up 1.00000 1.00000
26 hdd 9.15819 osd.26 up 1.00000 1.00000
27 hdd 9.15819 osd.27 up 1.00000 1.00000
28 hdd 9.15819 osd.28 up 1.00000 1.00000
$ ceph osd find 37
{
"osd": 37,
"ip": "172.16.4.68:6804/636",
"crush_location": {
"datacenter": "pa2.ssdr",
"host": "lxc-ceph-main-front-osd-03.ssdr",
"physical-host": "store-front-03.ssdr",
"rack": "pa2-104.ssdr",
"root": "ssdr"
}
}
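Since CRUSH weight roughly tracks raw capacity in TiB (which is why the roughly 9.16 TiB drives above carry a weight of 9.15819), it’s often handy to see utilisation next to the tree, or to adjust a weight after swapping a drive for a different size. The OSD ID and weight below are purely illustrative:
$ ceph osd df tree                       # the tree layout plus per-OSD usage and PG counts
$ ceph osd crush reweight osd.0 9.15819  # set the CRUSH weight of a single OSD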
3. df
Similar to the *nix df command, which tells us how much space is free on most Unix and Linux systems, Ceph has its own df command, ceph df, which provides an overview and breakdown of the amount of storage in the cluster, how much is used versus available, and how that breaks down across our pools and storage classes.
Filling a cluster to the brim is a very bad idea with Ceph – you should add more storage well before you get to the 90% mark, and ensure that you add it in a sensible way to allow for redistribution. This is particularly important if your cluster has lots of client activity on a regular basis.
$ ceph df
RAW STORAGE:
CLASS SIZE AVAIL USED RAW USED %RAW USED
hdd 330 TiB 327 TiB 2.3 TiB 2.3 TiB 0.69
TOTAL 330 TiB 327 TiB 2.3 TiB 2.3 TiB 0.69
POOLS:
POOL ID PGS STORED OBJECTS USED %USED MAX AVAIL
.rgw.root 1 32 1.2 KiB 4 768 KiB 0 104 TiB
default.rgw.control 2 32 0 B 8 0 B 0 104 TiB
default.rgw.meta 3 32 0 B 0 0 B 0 104 TiB
default.rgw.log 4 32 0 B 175 0 B 0 104 TiB
default.rgw.buckets.index 5 32 0 B 0 0 B 0 104 TiB
default.rgw.buckets.data 6 2048 0 B 0 0 B 0 104 TiB
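Ceph also enforces its own thresholds well before a cluster is completely full; on most releases the defaults are nearfull at 85%, backfillfull at 90% and full at 95%, but it’s worth confirming the values on your own cluster before relying on them:
$ ceph osd dump | grep ratio        # shows full_ratio, backfillfull_ratio and nearfull_ratio
$ ceph osd set-nearfull-ratio 0.85  # adjust the nearfull warning threshold if required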
4. osd pool ls detail
This is a useful one for getting a quick view of pools, but with a lot more information about their particular configuration. Ideally we need to know if a pool is erasure coded or triple-replicated, what crush rule we have in place, what the min_size is, how many placement groups are in a pool, and what application we’re using this particular pool for.
$ ceph osd pool ls detail
pool 1 '.rgw.root' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 64 flags hashpspool stripe_width 0 application rgw
pool 2 'default.rgw.control' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 68 flags hashpspool stripe_width 0 application rgw
pool 3 'default.rgw.meta' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 73 flags hashpspool stripe_width 0 application rgw
pool 4 'default.rgw.log' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 71 flags hashpspool stripe_width 0 application rgw
pool 5 'default.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode warn last_change 76 flags hashpspool stripe_width 0 application rgw
pool 6 'default.rgw.buckets.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 2048 pgp_num 2048 autoscale_mode warn last_change 83 lfor 0/0/81 flags hashpspool stripe_width 0 application rgw
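If anything in that listing needs changing, individual pool settings can be read and written with ceph osd pool get and ceph osd pool set. The pool name below comes from the listing above, but the values are only examples, so check what makes sense for your cluster before applying anything:
$ ceph osd pool get default.rgw.buckets.data all         # every setting for a single pool
$ ceph osd pool set default.rgw.buckets.data min_size 2  # minimum replicas needed to serve I/O
$ ceph osd pool set default.rgw.buckets.data pg_num 4096 # grow the placement group count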
5. osd crush rule dump
At the heart of any Ceph cluster are the CRUSH rules. CRUSH is Ceph’s placement algorithm, and the rules help us define how we want to place data across the cluster – be it drives, nodes, racks or datacentres. For example, if we need at least one copy of our image store’s data at each of our sites, we’d assign a CRUSH rule to the image store pool that mandates that behaviour, regardless of how many nodes we have at each site.
crush rule dump is a good way to quickly get a list of our crush rules and how we’ve defined them in the cluster. If we want to then make changes, we have a whole host of crush commands we can use to make modifications, or we can download and decompile the crush map to manually edit it, recompile it and push it back up to our cluster.
$ ceph osd crush rule dump
[
{
"rule_id": 0,
"rule_name": "replicated_rule",
"ruleset": 0,
"type": 1,
"min_size": 1,
"max_size": 10,
"steps": [
{
"op": "take",
"item": -1,
"item_name": "default"
},
{
"op": "chooseleaf_firstn",
"num": 0,
"type": "host"
},
{
"op": "emit"
}
]
}
]
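To go from inspecting rules to changing them, there are two common routes: create a rule directly from the CLI, or pull the CRUSH map down, edit it by hand and push it back. The rule name and device class below are only examples:
$ ceph osd crush rule create-replicated fast-hosts default host ssd  # new rule: root default, failure domain host, ssd class only
$ ceph osd getcrushmap -o crush.bin    # download the compiled CRUSH map
$ crushtool -d crush.bin -o crush.txt  # decompile it to editable text
$ vi crush.txt                         # make changes by hand
$ crushtool -c crush.txt -o crush.new  # recompile
$ ceph osd setcrushmap -i crush.new    # push the new map to the cluster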
6. versions
With a distributed cluster running in production, upgrading everything at once and praying for the best is clearly not the best approach. For this reason, each cluster-wide daemon in Ceph has its own version and can be upgraded independently. This means that we can upgrade daemons on a gradual basis and bring our cluster up to date with little or no disruption to service.
As long as we keep our versions somewhat close to one another, daemons with differing versions will work alongside each other perfectly happily. This does mean that we could potentially have hundreds of different daemons and respective versions to manage during an upgrade process. Enter ceph versions – a very easy way to see how many instances of each daemon are running a specific version.
$ ceph versions
{
"mon": {
"ceph version 14.2.15-2-g7407245e7b (7407245e7b329ac9d475f61e2cbf9f8c616505d6) nautilus (stable)": 1
},
"mgr": {
"ceph version 14.2.15-2-g7407245e7b (7407245e7b329ac9d475f61e2cbf9f8c616505d6) nautilus (stable)": 1
},
"osd": {
"ceph version 14.2.15-2-g7407245e7b (7407245e7b329ac9d475f61e2cbf9f8c616505d6) nautilus (stable)": 36
},
"mds": {},
"rgw": {
"ceph version 14.2.15-2-g7407245e7b (7407245e7b329ac9d475f61e2cbf9f8c616505d6) nautilus (stable)": 1
},
"overall": {
"ceph version 14.2.15-2-g7407245e7b (7407245e7b329ac9d475f61e2cbf9f8c616505d6) nautilus (stable)": 39
}
}
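If the counts don’t add up during an upgrade, individual daemons can also be queried directly. The OSD ID below is just an example, and the per-service breakdown assumes a reasonably recent release:
$ ceph tell osd.0 version  # ask a single OSD what it is running
$ ceph tell osd.* version  # or every OSD in one go
$ ceph osd versions        # version counts for OSDs only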
7. auth print-key
If we have lots of different clients using our cluster, we’ll need to get our keys off the cluster so they can authenticate. ceph auth print-key is a pretty handy way of quickly viewing any key, rather than fishing through configuration files. Another useful and related command is ceph auth list, which will show us a full list of all the authentication keys across the cluster, for both clients and daemons, and what their respective capabilities are.
$ ceph auth print-key client.admin
AQDgrLhg3qY1ChAAzzZPHCw2tYz/o+2RkpaSIg==
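Creating keys for new clients follows the same pattern. As a rough sketch (the client name, pool and capabilities here are made up, so tailor them to your own setup):
$ ceph auth get-or-create client.myapp mon 'allow r' osd 'allow rw pool=mypool'
$ ceph auth list  # every entity in the cluster and its capabilities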
8. crash ls
Daemon crashed? There could be all sorts of reasons why this may have happened, but ceph crash ls is the first place we want to look. We’ll get an idea of what’s crashed and where, so we’ll be able to diagnose further. Often these will be minor warnings or easy-to-address errors, but crashes can also indicate more serious problems. Related useful commands are ceph crash info <id>, which gives more info on the crash ID in question, and ceph crash archive-all, which will archive all of our crashes if they’re warnings we’re not worried about, or issues that we’ve already dealt with.
$ ceph crash ls
1 daemons have recently crashed
osd.9 crashed on host danny-1 at 2021-03-06 07:28:12.665310Z
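From there, the crash ID reported by ceph crash ls feeds straight into the related commands mentioned above (the ID below is just a placeholder):
$ ceph crash info <crash-id>     # full backtrace and metadata for one crash
$ ceph crash archive <crash-id>  # silence a single crash we've already dealt with
$ ceph crash archive-all         # or archive everything currently listed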
9. osd flags
There are a number of OSD flags that are incredibly useful. For a full list, see OSDMAP_FLAGS, but the most common ones are:
pauserd, pausewr – Read and write requests will no longer be answered.
noout – Ceph won’t consider OSDs as out of the cluster if the daemon fails for some reason.
nobackfill, norecover, norebalance – Recovery and rebalancing are disabled.
We can see below how to set these flags with the ceph osd set command, and also how this shows up in our health messaging. Another useful related trick is taking multiple OSDs out at once with a simple bash expansion.
$ ceph osd out {7..11}
marked out osd.7. marked out osd.8. marked out osd.9. marked out osd.10. marked out osd.11.
$ ceph osd set noout
noout is set
$ ceph osd set nobackfill
nobackfill is set
$ ceph osd set norecover
norecover is set
$ ceph osd set norebalance
norebalance is set
$ ceph osd set nodown
nodown is set
$ ceph osd set pause
pauserd,pausewr is set
$ ceph health detail
HEALTH_WARN pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
OSDMAP_FLAGS pauserd,pausewr,nodown,noout,nobackfill,norebalance,norecover flag(s) set
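Once maintenance is finished, the same flags are cleared with ceph osd unset, and the OSDs we marked out earlier can be brought back in, again using a bash expansion:
$ ceph osd unset pause
$ ceph osd unset nodown
$ ceph osd unset norebalance
$ ceph osd unset norecover
$ ceph osd unset nobackfill
$ ceph osd unset noout
$ ceph osd in {7..11}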
10. pg dump
All data in Ceph is placed into placement groups, which provide an abstraction layer – a bit like data buckets (not S3 buckets) – for our storage, and allow the cluster to easily decide how to distribute data and best react to failures. It’s often useful to get a granular look at how our placement groups are mapped across our OSDs, or the other way around. We can do both with pg dump, and while many of the placement group commands can be very verbose and difficult to read, ceph pg dump osds does a good job of distilling this into a single pane.
$ ceph pg dump osds
dumped osds
OSD_STAT USED AVAIL USED_RAW TOTAL HB_PEERS PG_SUM PRIMARY_PG_SUM
31 70 GiB 9.1 TiB 71 GiB 9.2 TiB [0,1,2,3,4,5,6,8,9,12,13,14,15,16,17,18,19,20,21,22,23,30,32] 175 72
13 70 GiB 9.1 TiB 71 GiB 9.2 TiB [0,1,2,3,4,5,6,7,8,9,10,11,12,14,24,25,26,27,28,29,30,31,32,33,34,35] 185 66
25 77 GiB 9.1 TiB 78 GiB 9.2 TiB [0,1,2,3,4,5,6,12,13,14,15,16,17,18,19,20,21,22,23,24,26] 180 64
32 83 GiB 9.1 TiB 84 GiB 9.2 TiB [0,1,2,3,4,5,6,7,12,13,14,15,16,17,18,19,20,21,22,23,31,33] 181 73
23 102 GiB 9.1 TiB 103 GiB 9.2 TiB [0,1,2,3,4,5,6,7,8,9,10,11,22,24,25,26,27,28,29,30,31,32,33,34,35] 191 69
18 77 GiB 9.1 TiB 78 GiB 9.2 TiB [0,1,2,3,4,5,6,7,8,9,10,11,17,19,24,25,26,27,28,29,30,31,32,33,34,35] 188 67
11 64 GiB 9.1 TiB 65 GiB 9.2 TiB [10,12,21,28,29,31,32,33,34,35] 0 0
8 90 GiB 9.1 TiB 91 GiB 9.2 TiB [1,2,7,9,14,15,21,27,30,33] 2 0
14 70 GiB 9.1 TiB 71 GiB 9.2 TiB [0,1,2,3,4,5,6,7,8,9,10,11,13,15,24,25,26,27,28,29,30,31,32,33,34,35] 177 64
33 77 GiB 9.1 TiB 78 GiB 9.2 TiB [0,1,2,3,4,5,6,12,13,14,15,16,17,18,19,20,21,22,23,32,34] 187 80
3 89 GiB 9.1 TiB 90 GiB 9.2 TiB [2,4,8,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35] 303 74
30 77 GiB 9.1 TiB 78 GiB 9.2 TiB [0,1,2,3,4,5,6,9,12,13,14,15,16,17,18,19,20,21,22,23,29,31] 179 76
15 71 GiB 9.1 TiB 72 GiB 9.2 TiB [0,1,2,3,4,5,6,7,8,10,11,14,16,24,25,26,27,28,29,30,31,32,33,34,35] 178 72
7 70 GiB 9.1 TiB 71 GiB 9.2 TiB [6,8,15,17,30,31,32,33,34,35] 0 0
28 90 GiB 9.1 TiB 91 GiB 9.2 TiB [0,1,2,3,4,5,6,7,9,12,13,14,15,16,17,18,19,20,21,22,23,27,29] 188 73
16 77 GiB 9.1 TiB 78 GiB 9.2 TiB [0,1,2,3,4,5,6,7,8,9,10,11,15,17,24,25,26,27,28,29,30,31,32,33,34,35] 183 66
1 77 GiB 9.1 TiB 78 GiB 9.2 TiB [0,2,8,9,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35] 324 70
26 77 GiB 9.1 TiB 78 GiB 9.2 TiB [0,1,2,3,4,5,6,12,13,14,15,16,17,18,19,20,21,22,23,25,27] 186 61
22 89 GiB 9.1 TiB 90 GiB 9.2 TiB [0,1,2,3,4,5,6,7,8,9,11,21,23,24,25,26,27,28,29,30,31,32,33,34,35] 178 80
0 103 GiB 9.1 TiB 104 GiB 9.2 TiB [1,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35] 308 83
5 70 GiB 9.1 TiB 71 GiB 9.2 TiB [4,6,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35] 312 69
21 77 GiB 9.1 TiB 78 GiB 9.2 TiB [0,1,2,3,4,5,6,7,8,9,10,11,20,22,24,25,26,27,28,29,30,31,32,33,34,35] 187 63
4 96 GiB 9.1 TiB 97 GiB 9.2 TiB [3,5,10,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35] 305 77
34 96 GiB 9.1 TiB 97 GiB 9.2 TiB [0,1,2,3,4,5,6,8,9,12,13,14,15,16,17,18,19,20,21,22,23,33,35] 189 73
17 96 GiB 9.1 TiB 97 GiB 9.2 TiB [0,1,2,3,4,5,6,7,8,9,10,11,16,18,24,25,26,27,28,29,30,31,32,33,34,35] 185 72
24 77 GiB 9.1 TiB 78 GiB 9.2 TiB [0,1,2,3,4,5,6,10,12,13,14,15,16,17,18,19,20,21,22,23,25] 186 73
10 76 GiB 9.1 TiB 77 GiB 9.2 TiB [4,9,11,15,17,18,25,29,34,35] 1 0
27 89 GiB 9.1 TiB 90 GiB 9.2 TiB [0,1,2,3,4,5,6,10,12,13,14,15,16,17,18,19,20,21,22,23,26,28] 185 75
2 77 GiB 9.1 TiB 78 GiB 9.2 TiB [1,3,8,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35] 310 62
19 77 GiB 9.1 TiB 78 GiB 9.2 TiB [0,1,2,3,4,5,6,7,8,9,10,11,18,20,24,25,26,27,28,29,30,31,32,33,34,35] 184 77
20 77 GiB 9.1 TiB 78 GiB 9.2 TiB [0,1,2,3,4,5,6,7,8,9,10,11,19,21,24,25,26,27,28,29,30,31,32,33,34,35] 183 69
35 96 GiB 9.1 TiB 97 GiB 9.2 TiB [0,1,2,3,4,5,6,12,13,14,15,16,17,18,19,20,21,22,23,34] 187 78
9 77 GiB 9.1 TiB 78 GiB 9.2 TiB [1,8,10,12,13,16,21,23,32,35] 1 0
6 83 GiB 9.1 TiB 84 GiB 9.2 TiB [5,7,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35] 323 58
12 89 GiB 9.1 TiB 90 GiB 9.2 TiB [0,1,2,3,4,5,6,8,9,10,11,13,24,25,26,27,28,29,30,31,32,33,34,35] 189 78
29 64 GiB 9.1 TiB 65 GiB 9.2 TiB [0,1,2,3,4,5,6,9,12,13,14,15,16,17,18,19,20,21,22,23,28,30] 185 74
sum 2.8 TiB 327 TiB 2.9 TiB 330 TiB
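For drilling into individual placement groups rather than the per-OSD summary, a few related commands are worth knowing; the PG ID below is illustrative, and the pool name is taken from the earlier listing:
$ ceph pg map 6.1a                             # which OSDs does this PG map to?
$ ceph pg ls-by-osd osd.0                      # every PG hosted on one OSD
$ ceph pg ls-by-pool default.rgw.buckets.data  # every PG belonging to one pool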
With these essential commands, you’re well-equipped to handle daily Ceph cluster management.
Just as kids learn how to add, subtract, divide and multiply on paper before being given the convenience of a calculator, it’s important for any Ceph administrator to understand these critical Ceph commands. But once they’re under your belt, why not make cluster management even simpler, or delegate routine tasks to less experienced members of the team, with our robust private cloud platform, HyperCloud?