Bug 1354603 - CRUSH cluster hierarchy gets corrupted so that entire cluster could be stuck in unusable state
Keywords:
Status: CLOSED DUPLICATE of bug 1349513
Alias: None
Product: Red Hat Storage Console
Classification: Red Hat Storage
Component: Ceph
Version: 2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 2
Assignee: Shubhendu Tripathi
QA Contact: sds-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-07-11 16:39 UTC by Martin Bukatovic
Modified: 2016-07-12 04:34 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-07-12 04:34:44 UTC
Embargoed:


Attachments
crushmap dump before reboot (2.44 KB, text/plain)
2016-07-11 17:12 UTC, Martin Bukatovic
crushmap dump after reboot (3 OSDs relocated) (2.44 KB, text/plain)
2016-07-11 17:13 UTC, Martin Bukatovic


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1349513 0 unspecified CLOSED Pool PGs stuck in creating state in ceph 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1353897 0 high CLOSED storage profiles not working 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1354586 0 unspecified CLOSED CRUSH cluster map contains 2 independent cluster hierarchies 2021-02-22 00:41:40 UTC

Internal Links: 1349513 1353897 1354586

Description Martin Bukatovic 2016-07-11 16:39:15 UTC
Description of problem
======================

When I reboot all machines of the RHSC 2.0 managed cluster (including the RHSC
machine itself), OSDs of the cluster get reassigned randomly between two
independent cluster hierarchies in the CRUSH cluster map.

(Yes, there are *two independent cluster hierarchies* in the CRUSH map - but that
is not the issue discussed in this BZ - see BZ 1354586 for details.)

Version-Release
===============

On RHSC 2.0 server:

rhscon-ceph-0.0.27-1.el7scon.x86_64
rhscon-core-0.0.28-1.el7scon.x86_64
rhscon-core-selinux-0.0.28-1.el7scon.noarch
rhscon-ui-0.0.42-1.el7scon.noarch
ceph-ansible-1.0.5-23.el7scon.noarch
ceph-installer-1.0.12-3.el7scon.noarch

On Ceph Storage nodes:

rhscon-agent-0.0.13-1.el7scon.noarch
ceph-osd-10.2.2-5.el7cp.x86_64

How reproducible
================

100 %

That said, the severity of the problem (how many OSDs are relocated in the
cluster map) is random: sometimes only a single OSD is shifted, while in other
cases all OSDs are affected - which makes a drastic difference.
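
As a quick convenience (not something used in the original report), one can count
how many OSDs ended up under each root bucket with a bit of awk over the
`ceph osd tree` output shown in the Actual results below:

~~~
# sketch: count OSD rows per root bucket to see how many OSDs were relocated
ceph -c /etc/ceph/alpha.conf osd tree | awk '
    $3 == "root"  { root = $4; count[root] += 0 }   # remember the current root bucket
    $3 ~ /^osd\./ { count[root]++ }                 # count OSD rows under that root
    END           { for (r in count) print r ": " count[r] " OSDs" }'
~~~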

Steps to Reproduce
==================

1. Install RHSC 2.0 following the documentation.
2. Accept a few nodes for the ceph cluster.
3. Create a new ceph cluster named 'alpha'.
4. Create an rbd (along with a new backing pool) in the cluster.
5. Check the CRUSH cluster map (see the commands after this list).
6. Reboot all machines.
7. Check the CRUSH cluster map again.
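
The check in steps 5 and 7 can be done with `ceph osd tree` (as in the outputs
below), or by dumping the full crushmap with the same commands that produced the
attached dumps (see comment 3):

~~~
# quick view of the CRUSH hierarchy
ceph -c /etc/ceph/alpha.conf osd tree

# full crushmap dump (same commands as in comment 3)
ceph -c /etc/ceph/alpha.conf osd getcrushmap -o ceph-crushmap.compiled
crushtool -d ceph-crushmap.compiled -o ceph-crushmap
~~~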

Actual results
==============

After the initial setup (step #5), the cluster map looks OK, like this:

~~~
# ceph -c /etc/ceph/alpha.conf osd tree
ID  WEIGHT  TYPE NAME                                                UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10 0.03998 root general
 -6 0.00999     host mbukatov-usm1-node2.os1.phx2.redhat.com-general
  1 0.00999         osd.1                                                 up  1.00000          1.00000
 -7 0.00999     host mbukatov-usm1-node3.os1.phx2.redhat.com-general
  2 0.00999         osd.2                                                 up  1.00000          1.00000
 -8 0.00999     host mbukatov-usm1-node1.os1.phx2.redhat.com-general
  0 0.00999         osd.0                                                 up  1.00000          1.00000
 -9 0.00999     host mbukatov-usm1-node4.os1.phx2.redhat.com-general
  3 0.00999         osd.3                                                 up  1.00000          1.00000
 -1       0 root default
 -2       0     host mbukatov-usm1-node1
 -3       0     host mbukatov-usm1-node2
 -4       0     host mbukatov-usm1-node3
 -5       0     host mbukatov-usm1-node4
~~~

But after the reboot, I see a different cluster map:

~~~
# ceph -c /etc/ceph/alpha.conf osd tree
ID  WEIGHT  TYPE NAME                                                UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10 0.00999 root general
 -6       0     host mbukatov-usm1-node2.os1.phx2.redhat.com-general
 -7 0.00999     host mbukatov-usm1-node3.os1.phx2.redhat.com-general
  2 0.00999         osd.2                                                 up  1.00000          1.00000
 -8       0     host mbukatov-usm1-node1.os1.phx2.redhat.com-general
 -9       0     host mbukatov-usm1-node4.os1.phx2.redhat.com-general
 -1 0.02998 root default
 -2 0.00999     host mbukatov-usm1-node1
  0 0.00999         osd.0                                                 up  1.00000          1.00000
 -3 0.00999     host mbukatov-usm1-node2
  1 0.00999         osd.1                                                 up  1.00000          1.00000
 -4       0     host mbukatov-usm1-node3
 -5 0.00999     host mbukatov-usm1-node4
  3 0.00999         osd.3                                                 up  1.00000          1.00000
~~~

Note that each OSD is *randomly* assigned to the correct machine in either the
1st or the 2nd hierarchy.

Expected results
================

All OSDs stay in their original cluster hierarchy so that the output of `ceph osd
tree` is the same before and after the reboot.
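
One way to verify this expectation (just a sketch; the file names are arbitrary):

~~~
ceph -c /etc/ceph/alpha.conf osd tree > osd-tree.before
# ... reboot all machines ...
ceph -c /etc/ceph/alpha.conf osd tree > osd-tree.after
diff -u osd-tree.before osd-tree.after   # expected: no output, i.e. no differences
~~~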

Additional info
===============

Why is this a problem? Let's look at the CRUSH placement rule used by a pool
of the cluster:

~~~
# ceph -c /etc/ceph/alpha.conf osd pool get rbd_pool crush_ruleset
crush_ruleset: 1
# ceph -c /etc/ceph/alpha.conf osd crush rule dump general
{
    "rule_id": 1,
    "rule_name": "general",
    "ruleset": 1,
    "type": 1,
    "min_size": 1,
    "max_size": 10,
    "steps": [
        {
            "op": "take",
            "item": -10,
            "item_name": "general"
        },
        {
            "op": "chooseleaf_firstn",
            "num": 0,
            "type": "host"
        },
        {
            "op": "emit"
        }
    ]
}
~~~

The "take" placement rule uses root bucket (bucket with id "-10") of the 1st
cluster hierarchy. This means that OSD's which were relocated into the other
hierarchy (with id "-1" in this case) would not be reachable (no data would
be written there, no PG would be located there ...).
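
For reference, the same rule in decompiled crushmap form (the format of the
attached dumps) looks roughly like this; a sketch of typical `crushtool -d`
output, not copied from the attachments:

~~~
rule general {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take general                   # the root bucket with id -10
        step chooseleaf firstn 0 type host
        step emit
}
~~~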

While the Ceph design can handle one missing OSD without disrupting the operation
of the whole cluster, the problem is that the redistribution of OSDs between the
cluster hierarchies seems to be random, so we could end up with all OSDs
relocated, which effectively makes the cluster unusable.

See for example the following case, which happened on my other cluster (with the
same RHSC 2.0 and Ceph 2.0 builds):

~~~
# ceph -c /etc/ceph/alpha.conf osd tree
ID  WEIGHT  TYPE NAME                                           UP/DOWN REWEIGHT PRIMARY-AFFINITY
-10       0 root general
 -6       0     host dhcp-126-84.lab.eng.brq.redhat.com-general
 -7       0     host dhcp-126-83.lab.eng.brq.redhat.com-general
 -8       0     host dhcp-126-85.lab.eng.brq.redhat.com-general
 -9       0     host dhcp-126-82.lab.eng.brq.redhat.com-general
 -1 0.03998 root default
 -2 0.00999     host dhcp-126-82
  0 0.00999         osd.0                                            up  1.00000          1.00000
 -3 0.00999     host dhcp-126-83
  1 0.00999         osd.1                                            up  1.00000          1.00000
 -4 0.00999     host dhcp-126-84
  2 0.00999         osd.2                                            up  1.00000          1.00000
 -5 0.00999     host dhcp-126-85
  3 0.00999         osd.3                                            up  1.00000          1.00000
# ceph -c /etc/ceph/alpha.conf health
HEALTH_ERR 200 pgs are stuck inactive for more than 300 seconds; 200 pgs stale; 200 pgs stuck stale; 184 pgs stuck unclean; recovery 33/54 objects misplaced (61.111%); pool 'full_pool' is full
[root@dhcp-126-79 ~]# ceph -c /etc/ceph/alpha.conf status
    cluster c402a6ae-960c-4ccf-9543-47f731038a33
     health HEALTH_ERR
            200 pgs are stuck inactive for more than 300 seconds
            200 pgs stale
            200 pgs stuck stale
            184 pgs stuck unclean
            recovery 33/54 objects misplaced (61.111%)
            pool 'full_pool' is full
     monmap e3: 3 mons at {dhcp-126-79=10.34.126.79:6789/0,dhcp-126-80=10.34.126.80:6789/0,dhcp-126-81=10.34.126.81:6789/0}
            election epoch 16, quorum 0,1,2 dhcp-126-79,dhcp-126-80,dhcp-126-81
     osdmap e82: 4 osds: 4 up, 4 in; 184 remapped pgs
            flags sortbitwise
      pgmap v1182: 384 pgs, 3 pools, 572 bytes data, 18 objects
            149 MB used, 40766 MB / 40915 MB avail
            33/54 objects misplaced (61.111%)
                 200 stale+active+clean
                 184 active+remapped
~~~

As one can see in the `ceph status` output, this cluster is stuck and can't
recover from this on its own. For this reason, I consider this BZ high
severity.

For details about the CRUSH cluster map, see:

http://docs.ceph.com/docs/master/rados/operations/crush-map/

Comment 1 Martin Bukatovic 2016-07-11 17:12:26 UTC
Created attachment 1178488 [details]
crushmap dump before reboot

Comment 2 Martin Bukatovic 2016-07-11 17:13:07 UTC
Created attachment 1178489 [details]
crushmap dump after reboot (3 OSDs relocated)

Comment 3 Martin Bukatovic 2016-07-11 17:15:15 UTC
Note: crushmap dumps were created via:

~~~
ceph -c /etc/ceph/alpha.conf osd getcrushmap -o ceph-crushmap.compiled
crushtool -d ceph-crushmap.compiled -o ceph-crushmap
~~~
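
As a side note, the reverse direction (recompiling a map and injecting it back
into the cluster) would use the standard counterparts of these commands; a sketch
only, this was not done as part of this BZ:

~~~
crushtool -c ceph-crushmap -o ceph-crushmap.compiled
ceph -c /etc/ceph/alpha.conf osd setcrushmap -i ceph-crushmap.compiled
~~~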

Comment 4 Nishanth Thomas 2016-07-12 04:34:44 UTC

*** This bug has been marked as a duplicate of bug 1349513 ***

