Bug 1311986 - OSD never get marked out after making it down with one host bucket in room crush bucket
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 1.3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 1.3.3
Assignee: Samuel Just
QA Contact: ceph-qe-bugs
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-02-25 12:59 UTC by Vikhyat Umrao
Modified: 2019-10-10 11:20 UTC
CC: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-26 05:45:57 UTC
Embargoed:


Attachments
example test crushmap (1.74 KB, text/plain)
2016-02-25 13:00 UTC, Vikhyat Umrao

Description Vikhyat Umrao 2016-02-25 12:59:04 UTC
Description of problem:
The OSD never gets marked out when its host is the only host bucket inside another CRUSH bucket (a room, in this case).

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3
Also reproduced on upstream firefly 0.80.10

How reproducible:
Always 

Steps to Reproduce:
1. Create a three-OSD cluster, with each OSD on a different node.
2. Change the default CRUSH map to something like the following:

ID WEIGHT  TYPE NAME                  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 1.34999 root default                                                 
-5 0.89999     room test                                                
-2 0.45000         host ceh-node-5                                   
 0 0.45000             osd.0               up  1.00000          1.00000 
-3 0.45000         host ceh-node-6                                   
 1 0.45000             osd.1               up  1.00000          1.00000 
-6 0.45000     room test1                                               
-4 0.45000         host ceh-node-7                                   
 2 0.45000             osd.2             down  1.00000          1.00000 

We modified the default CRUSH map of three OSDs on three nodes by adding two room buckets, *test* and *test1*, as shown above; the failure domain in the rule is *host* (see the CLI sketch after step 3 for one way to build such a hierarchy).

root default {
        id -1           # do not change unnecessarily
        # weight 1.350
        alg straw2
        hash 0  # rjenkins1
        item test weight 0.900
        item test1 weight 0.450
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

3. If I stop one of the OSDs under the *test* bucket, it gets marked out after 300 seconds, the default mark-out timeout. But if I stop the OSD under *test1*, which contains only a single *host* bucket, the OSD never gets marked out, even well past 300 seconds.
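
For reference, one way to build the hierarchy used in step 2 with the CLI (bucket and host names are the ones from this report; editing the decompiled crushmap with crushtool works equally well):

# ceph osd crush add-bucket test room
# ceph osd crush add-bucket test1 room
# ceph osd crush move test root=default
# ceph osd crush move test1 root=default
# ceph osd crush move ceh-node-5 room=test
# ceph osd crush move ceh-node-6 room=test
# ceph osd crush move ceh-node-7 room=test1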

Actual results:

The OSD never gets marked out.

Expected results:

The OSD should get marked out after the 300-second timeout, like any other down OSD.

Comment 1 Vikhyat Umrao 2016-02-25 13:00:18 UTC
Created attachment 1130532 [details]
example test crushmap

Comment 5 Vikhyat Umrao 2016-02-25 13:48:33 UTC
I am reproducing the issue with debug_mon = 10 and debug_ms = 1 for both scenarios so we can see what is happening. Will update the Bugzilla.
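
For reference, one way to raise those debug levels on the leader mon at runtime (a sketch; the levels can also be set in ceph.conf under [mon] followed by a mon restart):

# ceph tell mon.ceh-node-5 injectargs '--debug_mon 10 --debug_ms 1'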

Comment 6 Vikhyat Umrao 2016-02-25 14:19:41 UTC
Ah, looks like I found a clue in the logs:

mon.ceh-node-5@0(leader).osd e66 tick entire containing rack subtree for osd.2 is down; resetting timer

^^ Compare the log line above with the code:

File : src/mon/OSDMonitor.cc

 // is this an entire large subtree down?
        if (g_conf->mon_osd_down_out_subtree_limit.length()) {
          int type = osdmap.crush->get_type_id(g_conf->mon_osd_down_out_subtree_limit);
          if (type > 0) {
            if (osdmap.containing_subtree_is_down(g_ceph_context, o, type, &down_cache)) {
              dout(10) << "tick entire containing " << g_conf->mon_osd_down_out_subtree_limit
                       << " subtree for osd." << o << " is down; resetting timer" << dendl;
              // reset timer, too.
              down_pending_out[o] = now;
              continue;
            }
          }
        }

File : src/common/config_opts.h

OPTION(mon_osd_down_out_subtree_limit, OPT_STR, "rack")   // smallest crush unit/type that we will not automatically mark out

If we check the code above, it is clear that the *smallest crush unit/type that we will not automatically mark out* defaults to *rack*, while in our test the containing bucket is a *room*. Because room *test1* contains only a single host with a single OSD, stopping that OSD makes the entire containing subtree down, so the monitor keeps resetting the down-out timer and the OSD is never marked out.

And if we check the type hierarchy in the crushmap:

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

*room* is type 7, well above type 3 *rack* in the hierarchy.

So automatic mark-out of an entirely-down subtree should still work for buckets up to *type 2 chassis*; from *type 3 rack* upwards, the monitor will not mark the OSDs out automatically.

Comment 7 Vikhyat Umrao 2016-02-25 14:21:02 UTC
> 
> So automatic mark-out of an entirely-down subtree should still work for
> buckets up to *type 2 chassis*; from *type 3 rack* upwards, the monitor
> will not mark the OSDs out automatically.

I will test with *type 2 chassis* and see how it behaves.

Comment 8 Vikhyat Umrao 2016-02-25 15:48:03 UTC
(In reply to Vikhyat Umrao from comment #7)
> > 
> I will test with *type 2 chassis* and see how it behaves.

I have tested it with a *chassis* bucket and it works as expected, as described in comment#6:

# ceph osd tree
ID WEIGHT  TYPE NAME                  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 1.34999 root default                                                 
-5 0.89999     chassis test                                             
-2 0.45000         host ceh-node-5                                   
 0 0.45000             osd.0               up  1.00000          1.00000 
-3 0.45000         host ceh-node-6                                   
 1 0.45000             osd.1               up  1.00000          1.00000 
-6 0.45000     chassis test1                                            
-4 0.45000         host ceh-node-7                                   
 2 0.45000             osd.2               up  1.00000          1.00000 


# date; /etc/init.d/ceph stop osd; date
Thu Feb 25 20:41:54 IST 2016
=== osd.2 === 
Stopping Ceph osd.2 on ceh-node-7...kill 117686...kill 117686...done
Thu Feb 25 20:41:56 IST 2016

^^ The OSD was stopped at Thu Feb 25 20:41:56 IST 2016.


2016-02-25 20:46:56.198641 7f7d95f81700  0 log_channel(cluster) log [INF] : osd.2 out (down for 301.929220)
2016-02-25 20:46:56.240175 7f7d96f83700  1 mon.ceh-node-5@0(leader).osd e82 e82: 3 osds: 2 up, 2 in

^^ and after five minutes (300 seconds) the OSD was marked out.


# ceph osd tree
ID WEIGHT  TYPE NAME                  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 1.34999 root default                                                 
-5 0.89999     chassis test                                             
-2 0.45000         host ceh-node-5                                   
 0 0.45000             osd.0               up  1.00000          1.00000 
-3 0.45000         host ceh-node-6                                   
 1 0.45000             osd.1               up  1.00000          1.00000 
-6 0.45000     chassis test1                                            
-4 0.45000         host ceh-node-7                                   
 2 0.45000             osd.2             down        0          1.00000  <=================

Comment 11 Vikhyat Umrao 2016-02-26 05:43:43 UTC
Some more testing of the *mon_osd_down_out_subtree_limit* option:

[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config show | grep mon_osd_down_out_subtree_limit
    "mon_osd_down_out_subtree_limit": "rack",

Here we can also confirm that the default is *rack*, as set in the code.

[root@ceh-node-5 ~]# ceph tell mon.ceh-node-5 injectargs --mon_osd_down_out_subtree_limit="datacenter"
Error ENOSYS: injectargs:You cannot change mon_osd_down_out_subtree_limit using injectargs.


[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config set mon_osd_down_out_subtree_limit "datacenter"
{
    "error": "error setting 'mon_osd_down_out_subtree_limit' to 'datacenter': (38) Function not implemented"
}


^^ This option cannot be changed at runtime via "ceph tell" injectargs or the admin socket ("daemon" interface); you need to add it to ceph.conf and restart the mon process.

[root@ceh-node-5 ~]# vi /etc/ceph/ceph.conf

mon_osd_down_out_subtree_limit  = "datacenter" 

Set it to a type above *room*, such as *datacenter*, if you want OSDs under a *room* bucket to still be marked out automatically.
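
For clarity, a minimal ceph.conf stanza for this (assuming the option is placed under the [mon] section; [global] also works, and the quotes around the value are not required):

[mon]
mon_osd_down_out_subtree_limit = datacenter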

[root@ceh-node-5 ~]# /etc/init.d/ceph restart mon 
=== mon.ceh-node-5 === 
=== mon.ceh-node-5 === 
Stopping Ceph mon.ceh-node-5 on ceh-node-5...kill 133081...done
=== mon.ceh-node-5 === 
Starting Ceph mon.ceh-node-5 on ceh-node-5...
Running as unit run-4815.service.
Starting ceph-create-keys on ceh-node-5...

[root@ceh-node-5 ~]# ceph -s
    cluster 87f816c5-48e4-44ca-8794-abe79293b37f
     health HEALTH_WARN
            clock skew detected on mon.ceh-node-6, mon.ceh-node-7
     monmap e3: 3 mons at {ceh-node-5=192.168.12.27:6789/0,ceh-node-6=192.168.12.28:6789/0,ceh-node-7=192.168.12.29:6789/0}
            election epoch 50, quorum 0,1,2 ceh-node-5,ceh-node-6,ceh-node-7
     osdmap e86: 3 osds: 3 up, 3 in
      pgmap v221: 64 pgs, 1 pools, 0 bytes data, 0 objects
            106 MB used, 1396 GB / 1396 GB avail
                  64 active+clean

[root@ceh-node-5 ~]# ceph mon dump
dumped monmap epoch 3
epoch 3
fsid 87f816c5-48e4-44ca-8794-abe79293b37f
last_changed 2016-02-25 13:18:14.636456
created 0.000000
0: 192.168.12.27:6789/0 mon.ceh-node-5
1: 192.168.12.28:6789/0 mon.ceh-node-6
2: 192.168.12.29:6789/0 mon.ceh-node-7

[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config show | grep mon_osd_down_out_subtree_limit
    "mon_osd_down_out_subtree_limit": "datacenter",



[root@ceh-node-7 ~]# date; /etc/init.d/ceph stop osd; date
Fri Feb 26 10:33:21 IST 2016
=== osd.2 === 
Stopping Ceph osd.2 on ceh-node-7...kill 119334...kill 119334...done
Fri Feb 26 10:33:23 IST 2016
[root@ceh-node-7 ~]# 


^^ The OSD in the *room test1* subtree has been stopped again; now watch the mon log:

# tail -f /var/log/ceph/ceph-mon.ceh-node-5.log
2016-02-26 10:33:22.189315 7f5cefc18700  1 mon.ceh-node-5@0(leader).osd e87 e87: 3 osds: 2 up, 3 in
2016-02-26 10:33:22.222367 7f5cefc18700  0 log_channel(cluster) log [INF] : osdmap e87: 3 osds: 2 up, 3 in
2016-02-26 10:33:22.239862 7f5cefc18700  0 log_channel(cluster) log [INF] : pgmap v222: 64 pgs: 18 stale+active+clean, 46 active+clean; 0 bytes data, 106 MB used, 1396 GB / 1396 GB avail
2016-02-26 10:33:23.230361 7f5cefc18700  1 mon.ceh-node-5@0(leader).osd e88 e88: 3 osds: 2 up, 3 in
2016-02-26 10:33:23.271759 7f5cefc18700  0 log_channel(cluster) log [INF] : osdmap e88: 3 osds: 2 up, 3 in

^^ From the log, the OSD is now down but not yet out.

Wait for 5 minutes (300 seconds) 


2016-02-26 10:38:25.100057 7f5cee442700  0 log_channel(cluster) log [INF] : osd.2 out (down for 302.894443)
2016-02-26 10:38:25.142557 7f5cefc18700  1 mon.ceh-node-5@0(leader).osd e89 e89: 3 osds: 2 up, 2 in
2016-02-26 10:38:25.167256 7f5cefc18700  0 log_channel(cluster) log [INF] : osdmap e89: 3 osds: 2 up, 2 in

^^ OSD is declared out.

ID WEIGHT  TYPE NAME                  UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 1.34999 root default                                                 
-5 0.89999     room test                                                
-2 0.45000         host ceh-node-5                                   
 0 0.45000             osd.0               up  1.00000          1.00000 
-3 0.45000         host ceh-node-6                                   
 1 0.45000             osd.1               up  1.00000          1.00000 
-6 0.45000     room test1                                               
-4 0.45000         host ceh-node-7                                   
 2 0.45000             osd.2             down  0                1.00000

Comment 12 Vikhyat Umrao 2016-02-26 05:45:57 UTC
- Based on all of the above analysis, this is NOTABUG.
- It is just a configuration issue.

- Either change your crushmap to use bucket types no higher than *chassis*, since the default *mon_osd_down_out_subtree_limit* is *rack* (see the crushtool sketch at the end of this comment).

- Or modify *mon_osd_down_out_subtree_limit* to match your crushmap, as shown in comment#11.

- Closing with NOTABUG.
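
For the first workaround, a rough sketch of converting the *room* buckets from this report to *chassis* with crushtool (the file names are placeholders):

# ceph osd getcrushmap -o crush.compiled
# crushtool -d crush.compiled -o crush.txt
  (edit crush.txt: change "room test" and "room test1" to "chassis test" and "chassis test1")
# crushtool -c crush.txt -o crush.new
# ceph osd setcrushmap -i crush.new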

