Bug 1311986 - OSD never gets marked out after being marked down when its room crush bucket contains only one host bucket
Status: CLOSED NOTABUG
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RADOS
Version: 1.3.0
Hardware: x86_64 Linux
Priority: medium    Severity: medium
Target Milestone: rc
Target Release: 1.3.3
Assigned To: Samuel Just
QA Contact: ceph-qe-bugs
 
Reported: 2016-02-25 07:59 EST by Vikhyat Umrao
Modified: 2017-07-30 11:08 EDT
CC: 3 users

Doc Type: Bug Fix
Last Closed: 2016-02-26 00:45:57 EST
Type: Bug


Attachments
example test crushmap (1.74 KB, text/plain)
2016-02-25 08:00 EST, Vikhyat Umrao

Description Vikhyat Umrao 2016-02-25 07:59:04 EST
Description of problem:
An OSD never gets marked out when its host is the only host bucket inside another crush bucket (a *room* in this case).

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3
Also reproducible on upstream firefly 0.80.10

How reproducible:
Always 

Steps to Reproduce:
1. Create a three-OSD cluster, with each OSD on a different node.
2. Change the default CRUSH map to something like the one below (a sketch of one way to apply such a change follows these steps):

ID WEIGHT  TYPE NAME                  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 1.34999 root default                                                 
-5 0.89999     room test                                                
-2 0.45000         host ceh-node-5                                   
 0 0.45000             osd.0               up  1.00000          1.00000 
-3 0.45000         host ceh-node-6                                   
 1 0.45000             osd.1               up  1.00000          1.00000 
-6 0.45000     room test1                                               
-4 0.45000         host ceh-node-7                                   
 2 0.45000             osd.2             down  1.00000          1.00000 

We modified the default CRUSH map of three OSDs on three nodes by adding two room buckets, *test* and *test1*, as shown above; the failure domain here is *host*.

root default {
        id -1           # do not change unnecessarily
        # weight 1.350
        alg straw2
        hash 0  # rjenkins1
        item test weight 0.900
        item test1 weight 0.450
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

3. If I stop any OSD under the *test* bucket, that OSD gets marked out after 300 seconds, which is the default timeout for marking a down OSD out. However, if I stop the OSD under *test1*, which contains only one *host* bucket, that OSD never gets marked out, even after 300 seconds.
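
For reference, here is a minimal sketch of one way such a CRUSH map change can be applied; the file names are arbitrary and the bucket layout is the one shown above:

# Export and decompile the current CRUSH map
ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt

# Edit crushmap.txt: declare the "room test" and "room test1" buckets,
# move the host buckets under them, and reference both rooms from
# "root default" (as in the decompiled snippet above)
vi crushmap.txt

# Recompile and inject the modified map
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new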

Actual results:

The OSD never gets marked out.

Expected results:

The OSD should get marked out after 300 seconds.
Comment 1 Vikhyat Umrao 2016-02-25 08:00 EST
Created attachment 1130532 [details]
example test crushmap
Comment 5 Vikhyat Umrao 2016-02-25 08:48:33 EST
I am working on reproducing the issue with debug_mon = 10 and debug_ms = 1 for both scenarios so we can check what is happening. I will update the Bugzilla.
Comment 6 Vikhyat Umrao 2016-02-25 09:19:41 EST
Ah, it looks like I found a clue in the logs:

mon.ceh-node-5@0(leader).osd e66 tick entire containing rack subtree for osd.2 is down; resetting timer

^^ If we compare the log message above with the code:

File : src/mon/OSDMonitor.cc

 // is this an entire large subtree down?
        if (g_conf->mon_osd_down_out_subtree_limit.length()) {
          int type = osdmap.crush->get_type_id(g_conf->mon_osd_down_out_subtree_limit);
          if (type > 0) {
            if (osdmap.containing_subtree_is_down(g_ceph_context, o, type, &down_cache)) {
              dout(10) << "tick entire containing " << g_conf->mon_osd_down_out_subtree_limit
                       << " subtree for osd." << o << " is down; resetting timer" << dendl;
              // reset timer, too.
              down_pending_out[o] = now;
              continue;
            }
          }
        }

File : src/common/config_opts.h

OPTION(mon_osd_down_out_subtree_limit, OPT_STR, "rack")   // smallest crush unit/type that we will not automatically mark out

From the code above it is clear that the *smallest crush unit/type that we will not automatically mark out* defaults to *rack*, while in our test the bucket type is *room*.

And if we check the type hierarchy in the CRUSH map:

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

*room* is type 7, well above type 3 *rack*.

I expect that automatic mark-out is still allowed up to *type 2 chassis* when an entire subtree is down, whereas from *type 3 rack* upwards the subtree is not marked out automatically.
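
As a quick cross-check on a running cluster (using the same monitor admin socket path as elsewhere in this report; the grep range is approximate), the configured limit and the CRUSH type IDs can be confirmed like this:

# Current value of the limit (defaults to "rack")
ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config show | grep mon_osd_down_out_subtree_limit

# CRUSH type table as the cluster sees it; "room" (type 7) is above
# "rack" (type 3), so a fully down room resets the down-out timer
# instead of letting its OSD be marked out
ceph osd crush dump | grep -A 24 '"types"'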
Comment 7 Vikhyat Umrao 2016-02-25 09:21:02 EST
> 
> I expect that automatic mark-out is still allowed up to *type 2 chassis*
> when an entire subtree is down, whereas from *type 3 rack* upwards the
> subtree is not marked out automatically.

I will test with *type 2 chassis* and see how it behaves.
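
For the retest, one way to switch the bucket type is to re-edit the decompiled map from the earlier sketch and change the two room declarations to chassis, for example:

# Turn "room test" / "room test1" into chassis buckets and re-inject the map
sed -i 's/^room /chassis /' crushmap.txt
crushtool -c crushmap.txt -o crushmap.new
ceph osd setcrushmap -i crushmap.new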
Comment 8 Vikhyat Umrao 2016-02-25 10:48:03 EST
(In reply to Vikhyat Umrao from comment #7)
> > 
> I will test with *type 2 chassis* and see how it behaves.

I have tested with the *chassis* buckets and it works as expected, as described in comment #6.

# ceph osd tree
ID WEIGHT  TYPE NAME                  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 1.34999 root default                                                 
-5 0.89999     chassis test                                             
-2 0.45000         host ceh-node-5                                   
 0 0.45000             osd.0               up  1.00000          1.00000 
-3 0.45000         host ceh-node-6                                   
 1 0.45000             osd.1               up  1.00000          1.00000 
-6 0.45000     chassis test1                                            
-4 0.45000         host ceh-node-7                                   
 2 0.45000             osd.2               up  1.00000          1.00000 


# date; /etc/init.d/ceph stop osd; date
Thu Feb 25 20:41:54 IST 2016
=== osd.2 === 
Stopping Ceph osd.2 on ceh-node-7...kill 117686...kill 117686...done
Thu Feb 25 20:41:56 IST 2016

^^ OSD was stopped at  Thu Feb 25 20:41:56 IST 2016.


2016-02-25 20:46:56.198641 7f7d95f81700  0 log_channel(cluster) log [INF] : osd.2 out (down for 301.929220)
2016-02-25 20:46:56.240175 7f7d96f83700  1 mon.ceh-node-5@0(leader).osd e82 e82: 3 osds: 2 up, 2 in

^^ And after five minutes the OSD was marked out.


# ceph osd tree
ID WEIGHT  TYPE NAME                  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 1.34999 root default                                                 
-5 0.89999     chassis test                                             
-2 0.45000         host ceh-node-5                                   
 0 0.45000             osd.0               up  1.00000          1.00000 
-3 0.45000         host ceh-node-6                                   
 1 0.45000             osd.1               up  1.00000          1.00000 
-6 0.45000     chassis test1                                            
-4 0.45000         host ceh-node-7                                   
 2 0.45000             osd.2             down        0          1.00000  <=================
Comment 11 Vikhyat Umrao 2016-02-26 00:43:43 EST
Some more testing of the option *mon_osd_down_out_subtree_limit*:

[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config show | grep mon_osd_down_out_subtree_limit
    "mon_osd_down_out_subtree_limit": "rack",

Here we can also confirm that the default is *rack*, as set in the code.

[root@ceh-node-5 ~]# ceph tell mon.ceh-node-5 injectargs --mon_osd_down_out_subtree_limit="datacenter"
Error ENOSYS: injectargs:You cannot change mon_osd_down_out_subtree_limit using injectargs.


[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config set mon_osd_down_out_subtree_limit "datacenter"
{
    "error": "error setting 'mon_osd_down_out_subtree_limit' to 'datacenter': (38) Function not implemented"
}


^^ So this option cannot be changed via "ceph tell" or the admin socket ("daemon"); you need to set it in ceph.conf and restart the mon process.

[root@ceh-node-5 ~]# vi /etc/ceph/ceph.conf

mon_osd_down_out_subtree_limit  = "datacenter" 

Set it to *datacenter* (or any type above *room*) if you want automatic mark-out to be allowed for *room* buckets.
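
For completeness, the option normally lives in the [mon] (or [global]) section of ceph.conf on each monitor host, for example:

# /etc/ceph/ceph.conf on the monitor hosts
[mon]
    # smallest CRUSH type that will NOT be automatically marked out;
    # anything above "room" lets a fully down room bucket still be marked out
    mon_osd_down_out_subtree_limit = datacenter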

[root@ceh-node-5 ~]# /etc/init.d/ceph restart mon 
=== mon.ceh-node-5 === 
=== mon.ceh-node-5 === 
Stopping Ceph mon.ceh-node-5 on ceh-node-5...kill 133081...done
=== mon.ceh-node-5 === 
Starting Ceph mon.ceh-node-5 on ceh-node-5...
Running as unit run-4815.service.
Starting ceph-create-keys on ceh-node-5...

[root@ceh-node-5 ~]# ceph -s
    cluster 87f816c5-48e4-44ca-8794-abe79293b37f
     health HEALTH_WARN
            clock skew detected on mon.ceh-node-6, mon.ceh-node-7
     monmap e3: 3 mons at {ceh-node-5=192.168.12.27:6789/0,ceh-node-6=192.168.12.28:6789/0,ceh-node-7=192.168.12.29:6789/0}
            election epoch 50, quorum 0,1,2 ceh-node-5,ceh-node-6,ceh-node-7
     osdmap e86: 3 osds: 3 up, 3 in
      pgmap v221: 64 pgs, 1 pools, 0 bytes data, 0 objects
            106 MB used, 1396 GB / 1396 GB avail
                  64 active+clean

[root@ceh-node-5 ~]# ceph mon dump
dumped monmap epoch 3
epoch 3
fsid 87f816c5-48e4-44ca-8794-abe79293b37f
last_changed 2016-02-25 13:18:14.636456
created 0.000000
0: 192.168.12.27:6789/0 mon.ceh-node-5
1: 192.168.12.28:6789/0 mon.ceh-node-6
2: 192.168.12.29:6789/0 mon.ceh-node-7

[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config show | grep mon_osd_down_out_subtree_limit
    "mon_osd_down_out_subtree_limit": "datacenter",



[root@ceh-node-7 ~]# date; /etc/init.d/ceph stop osd; date
Fri Feb 26 10:33:21 IST 2016
=== osd.2 === 
Stopping Ceph osd.2 on ceh-node-7...kill 119334...kill 119334...done
Fri Feb 26 10:33:23 IST 2016
[root@ceh-node-7 ~]# 


^^ Now the OSD in this subtree has been brought down; the mon log shows:

# tail -f /var/log/ceph/ceph-mon.ceh-node-5.log
2016-02-26 10:33:22.189315 7f5cefc18700  1 mon.ceh-node-5@0(leader).osd e87 e87: 3 osds: 2 up, 3 in
2016-02-26 10:33:22.222367 7f5cefc18700  0 log_channel(cluster) log [INF] : osdmap e87: 3 osds: 2 up, 3 in
2016-02-26 10:33:22.239862 7f5cefc18700  0 log_channel(cluster) log [INF] : pgmap v222: 64 pgs: 18 stale+active+clean, 46 active+clean; 0 bytes data, 106 MB used, 1396 GB / 1396 GB avail
2016-02-26 10:33:23.230361 7f5cefc18700  1 mon.ceh-node-5@0(leader).osd e88 e88: 3 osds: 2 up, 3 in
2016-02-26 10:33:23.271759 7f5cefc18700  0 log_channel(cluster) log [INF] : osdmap e88: 3 osds: 2 up, 3 in

^^ Check the log: the OSD is down now, but not yet out.

Wait for 5 minutes (300 seconds) 


2016-02-26 10:38:25.100057 7f5cee442700  0 log_channel(cluster) log [INF] : osd.2 out (down for 302.894443)
2016-02-26 10:38:25.142557 7f5cefc18700  1 mon.ceh-node-5@0(leader).osd e89 e89: 3 osds: 2 up, 2 in
2016-02-26 10:38:25.167256 7f5cefc18700  0 log_channel(cluster) log [INF] : osdmap e89: 3 osds: 2 up, 2 in

^^ OSD is declared out.

ID WEIGHT  TYPE NAME                  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 1.34999 root default                                                 
-5 0.89999     room test                                                
-2 0.45000         host ceh-node-5                                   
 0 0.45000             osd.0               up  1.00000          1.00000 
-3 0.45000         host ceh-node-6                                   
 1 0.45000             osd.1               up  1.00000          1.00000 
-6 0.45000     room test1                                               
-4 0.45000         host ceh-node-7                                   
 2 0.45000             osd.2             down  0                1.00000
Comment 12 Vikhyat Umrao 2016-02-26 00:45:57 EST
- Based on all of the analysis above, this is NOTABUG.
- It is just a configuration issue.

- Either change your CRUSH map to use bucket types no higher than *chassis*, since the default *mon_osd_down_out_subtree_limit* is *rack*,

- or adjust *mon_osd_down_out_subtree_limit* to match your CRUSH map, as shown in comment #11.

- Closing as NOTABUG.
