Description of problem
======================

When I reboot all machines of the RHSC 2.0 managed cluster (including the
RHSC machine itself), OSDs of the cluster get reassigned randomly between two
independent cluster hierarchies in the CRUSH cluster map.

(Yes, there are *two independent cluster hierarchies* in the CRUSH map, but
that is not the issue discussed in this BZ - see BZ 1354586 for details
instead.)

Version-Release
===============

On RHSC 2.0 server:

rhscon-ceph-0.0.27-1.el7scon.x86_64
rhscon-core-0.0.28-1.el7scon.x86_64
rhscon-core-selinux-0.0.28-1.el7scon.noarch
rhscon-ui-0.0.42-1.el7scon.noarch
ceph-ansible-1.0.5-23.el7scon.noarch
ceph-installer-1.0.12-3.el7scon.noarch

On Ceph Storage nodes:

rhscon-agent-0.0.13-1.el7scon.noarch
ceph-osd-10.2.2-5.el7cp.x86_64

How reproducible
================

100 %

That said, the severity of the problem (how many OSDs are relocated in the
cluster map) is random: sometimes only a single OSD is shifted, while in
another case all OSDs are affected - which makes a drastic difference.

Steps to Reproduce
==================

1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the Ceph cluster.
3. Create a new Ceph cluster named 'alpha'.
4. Create an rbd (along with a new backing pool) in the cluster.
5. Check the CRUSH cluster map.
6. Reboot all machines.
7. Check the CRUSH cluster map again.

Actual results
==============

After the initial setup (step #5), the cluster map looks ok, like this:

~~~
# ceph -c /etc/ceph/alpha.conf osd tree
ID  WEIGHT  TYPE NAME                                                UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10 0.03998 root general
 -6 0.00999     host mbukatov-usm1-node2.os1.phx2.redhat.com-general
  1 0.00999         osd.1                                                 up  1.00000          1.00000
 -7 0.00999     host mbukatov-usm1-node3.os1.phx2.redhat.com-general
  2 0.00999         osd.2                                                 up  1.00000          1.00000
 -8 0.00999     host mbukatov-usm1-node1.os1.phx2.redhat.com-general
  0 0.00999         osd.0                                                 up  1.00000          1.00000
 -9 0.00999     host mbukatov-usm1-node4.os1.phx2.redhat.com-general
  3 0.00999         osd.3                                                 up  1.00000          1.00000
 -1       0 root default
 -2       0     host mbukatov-usm1-node1
 -3       0     host mbukatov-usm1-node2
 -4       0     host mbukatov-usm1-node3
 -5       0     host mbukatov-usm1-node4
~~~

But after the reboot, I see a different cluster map:

~~~
# ceph -c /etc/ceph/alpha.conf osd tree
ID  WEIGHT  TYPE NAME                                                UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10 0.00999 root general
 -6       0     host mbukatov-usm1-node2.os1.phx2.redhat.com-general
 -7 0.00999     host mbukatov-usm1-node3.os1.phx2.redhat.com-general
  2 0.00999         osd.2                                                 up  1.00000          1.00000
 -8       0     host mbukatov-usm1-node1.os1.phx2.redhat.com-general
 -9       0     host mbukatov-usm1-node4.os1.phx2.redhat.com-general
 -1 0.02998 root default
 -2 0.00999     host mbukatov-usm1-node1
  0 0.00999         osd.0                                                 up  1.00000          1.00000
 -3 0.00999     host mbukatov-usm1-node2
  1 0.00999         osd.1                                                 up  1.00000          1.00000
 -4       0     host mbukatov-usm1-node3
 -5 0.00999     host mbukatov-usm1-node4
  3 0.00999         osd.3                                                 up  1.00000          1.00000
~~~

Note that each OSD ends up *randomly* assigned to the correct machine in
either the 1st or the 2nd hierarchy.

Expected results
================

All OSDs stay in their original cluster hierarchy, so that the output of
`ceph osd tree` is the same before and after the reboot.

Additional info
===============

Why is this a problem?
Let's see the CRUSH placement rules used in a pool of the cluster:

~~~
# ceph -c /etc/ceph/alpha.conf osd pool get rbd_pool crush_ruleset
crush_ruleset: 1
# ceph -c /etc/ceph/alpha.conf osd crush rule dump general
{
    "rule_id": 1,
    "rule_name": "general",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -10,
            "item_name": "general"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
~~~

The "take" placement rule uses the root bucket (bucket with id "-10") of the
1st cluster hierarchy. This means that OSDs which were relocated into the
other hierarchy (the one with root id "-1" in this case) are not reachable by
the rule (no data would be written there, no PG would be located there ...).

While the Ceph design can handle one missing OSD without disrupting the
operation of the whole cluster, the redistribution of OSDs between the
hierarchies appears to be random, so we can end up with all OSDs relocated,
which effectively makes the cluster unusable. See for example such a case
which happened on my other cluster (with the same RHSC 2.0 and Ceph 2.0
builds):

~~~
# ceph -c /etc/ceph/alpha.conf osd tree
ID  WEIGHT  TYPE NAME                                            UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10       0 root general
 -6       0     host dhcp-126-84.lab.eng.brq.redhat.com-general
 -7       0     host dhcp-126-83.lab.eng.brq.redhat.com-general
 -8       0     host dhcp-126-85.lab.eng.brq.redhat.com-general
 -9       0     host dhcp-126-82.lab.eng.brq.redhat.com-general
 -1 0.03998 root default
 -2 0.00999     host dhcp-126-82
  0 0.00999         osd.0                                             up  1.00000          1.00000
 -3 0.00999     host dhcp-126-83
  1 0.00999         osd.1                                             up  1.00000          1.00000
 -4 0.00999     host dhcp-126-84
  2 0.00999         osd.2                                             up  1.00000          1.00000
 -5 0.00999     host dhcp-126-85
  3 0.00999         osd.3                                             up  1.00000          1.00000
# ceph -c /etc/ceph/alpha.conf health
HEALTH_ERR 200 pgs are stuck inactive for more than 300 seconds; 200 pgs stale; 200 pgs stuck stale; 184 pgs stuck unclean; recovery 33/54 objects misplaced (61.111%); pool 'full_pool' is full
[root@dhcp-126-79 ~]# ceph -c /etc/ceph/alpha.conf status
    cluster c402a6ae-960c-4ccf-9543-47f731038a33
     health HEALTH_ERR
            200 pgs are stuck inactive for more than 300 seconds
            200 pgs stale
            200 pgs stuck stale
            184 pgs stuck unclean
            recovery 33/54 objects misplaced (61.111%)
            pool 'full_pool' is full
     monmap e3: 3 mons at {dhcp-126-79=10.34.126.79:6789/0,dhcp-126-80=10.34.126.80:6789/0,dhcp-126-81=10.34.126.81:6789/0}
            election epoch 16, quorum 0,1,2 dhcp-126-79,dhcp-126-80,dhcp-126-81
     osdmap e82: 4 osds: 4 up, 4 in; 184 remapped pgs
            flags sortbitwise
      pgmap v1182: 384 pgs, 3 pools, 572 bytes data, 18 objects
            149 MB used, 40766 MB / 40915 MB avail
            33/54 objects misplaced (61.111%)
                 200 stale+active+clean
                 184 active+remapped
~~~

As one can see in the `ceph status` output, this cluster is stuck and can't
recover from this on its own. For this reason, I consider this BZ a high
severity issue.

For details about the CRUSH cluster map, see:
http://docs.ceph.com/docs/master/rados/operations/crush-map/
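For the record, a relocated OSD can be pushed back into its intended bucket
by hand. This is only a minimal sketch of such a manual recovery, assuming
the bucket names and the 0.00999 weight shown in the `ceph osd tree` output
above; one command like this would be needed for every relocated OSD:

~~~
# sketch only: re-place osd.0 under the "general" hierarchy with the weight
# it had before the reboot (names/weight taken from the osd tree output above)
ceph -c /etc/ceph/alpha.conf osd crush set osd.0 0.00999 \
    root=general host=mbukatov-usm1-node1.os1.phx2.redhat.com-general
~~~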
Created attachment 1178488 [details] crushmap dump before reboot
Created attachment 1178489 [details] crushmap dump after reboot (3 OSDs relocated)
Note: crushmap dumps were created via:

~~~
ceph -c /etc/ceph/alpha.conf osd getcrushmap -o ceph-crushmap.compiled
crushtool -d ceph-crushmap.compiled -o ceph-crushmap
~~~
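Side note: I have not verified this on the affected setup, but the relocation
on boot looks like the usual "update CRUSH location on OSD start" behaviour
of the ceph-osd init scripts. If that is indeed the mechanism involved, a
sketch of a possible workaround would be to add the following to the [osd]
section of /etc/ceph/alpha.conf on the storage nodes, so that OSDs keep the
CRUSH location assigned by RHSC across reboots:

~~~
[osd]
# assumption: stops ceph-osd from re-registering its CRUSH location on startup
osd crush update on start = false
~~~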
*** This bug has been marked as a duplicate of bug 1349513 ***