Description of problem:
An OSD is never marked out when its host is the only host bucket inside another CRUSH bucket.

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3
Also upstream firefly 0.80.10

How reproducible:
Always

Steps to Reproduce:
1. Create a three-OSD cluster, with each OSD on a different node.

2. Change the default crush map to something like the following:

ID WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.34999 root default
-5 0.89999     room test
-2 0.45000         host ceh-node-5
 0 0.45000             osd.0            up  1.00000          1.00000
-3 0.45000         host ceh-node-6
 1 0.45000             osd.1            up  1.00000          1.00000
-6 0.45000     room test1
-4 0.45000         host ceh-node-7
 2 0.45000             osd.2          down  1.00000          1.00000

Here the default crush map for three OSDs on three nodes was modified by adding two room buckets, *test* and *test1*, as shown above; the failure domain is *host*.

root default {
        id -1           # do not change unnecessarily
        # weight 1.350
        alg straw2
        hash 0  # rjenkins1
        item test weight 0.900
        item test1 weight 0.450
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

3. If I stop any OSD in the *test* bucket, it gets marked out after 300 seconds, which is the default interval for marking an OSD out. But if I stop the OSD in *test1*, which contains only one *host* bucket, it never gets marked out, even after 300 seconds.

Actual results:
The OSD never gets marked out.

Expected results:
The OSD should get marked out.
Created attachment 1130532 [details] example test crushmap
I am working on reproducing the issue with debug_mon = 10 and debug_ms = 1 for both scenarios so we can check what is happening. I will update the Bugzilla.
Ah, it looks like I found a clue in the logs:

mon.ceh-node-5@0(leader).osd e66 tick entire containing rack subtree for osd.2 is down; resetting timer

^^ Compare the log line above with the code:

File: src/mon/OSDMonitor.cc

    // is this an entire large subtree down?
    if (g_conf->mon_osd_down_out_subtree_limit.length()) {
      int type = osdmap.crush->get_type_id(g_conf->mon_osd_down_out_subtree_limit);
      if (type > 0) {
        if (osdmap.containing_subtree_is_down(g_ceph_context, o, type, &down_cache)) {
          dout(10) << "tick entire containing " << g_conf->mon_osd_down_out_subtree_limit
                   << " subtree for osd." << o << " is down; resetting timer" << dendl;
          // reset timer, too.
          down_pending_out[o] = now;
          continue;
        }
      }
    }

File: src/common/config_opts.h

    OPTION(mon_osd_down_out_subtree_limit, OPT_STR, "rack")  // smallest crush unit/type that we will not automatically mark out

From the code above it is clear that the *smallest crush unit/type that we will not automatically mark out* is *rack*, while in our test it is *room*. Checking the type hierarchy in the crushmap:

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

room is type 7, well above type 3 (rack). So an entirely-down subtree may be marked out automatically only up to *type 2 chassis*; from *type 3 rack* upwards, automatic mark-out is not allowed.
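To make the monitor's decision easier to reason about, here is a small Python sketch of the comparison above. This is not Ceph code: the type table is the default CRUSH hierarchy quoted above, and the function name is ours. It answers: given the type of the smallest entirely-down bucket containing a down OSD, will the down-out timer keep being reset?

```python
# Default CRUSH bucket types, as listed in the crushmap above.
CRUSH_TYPES = {
    "osd": 0, "host": 1, "chassis": 2, "rack": 3, "row": 4,
    "pdu": 5, "pod": 6, "room": 7, "datacenter": 8, "region": 9, "root": 10,
}

def auto_out_blocked(down_bucket_type, subtree_limit="rack"):
    """Return True if an OSD whose smallest entirely-down containing bucket
    has type `down_bucket_type` will never be marked out automatically.

    Mirrors the OSDMonitor tick logic: the timer is reset whenever the
    down OSD sits inside a fully-down subtree whose type is at or above
    the mon_osd_down_out_subtree_limit type id."""
    limit_id = CRUSH_TYPES.get(subtree_limit, -1)
    if limit_id <= 0:  # unknown name or "osd": the limit never triggers
        return False
    return CRUSH_TYPES[down_bucket_type] >= limit_id
```

With the default limit of "rack", a fully-down *room* (type 7 >= 3) blocks auto-out — which is exactly why osd.2 in the single-host room *test1* is never marked out — while a fully-down *chassis* (type 2 < 3) does not.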
> I hope it is allowed till *type 2 chassis* to mark out automatically if
> entire subtree is down. as from *type 3 rack* it is not allowed
> automatically mark out.

I will test with a *type 2 chassis* bucket and see how it behaves.
(In reply to Vikhyat Umrao from comment #7)
> I will test with *type 2 chassis* and see how it behaves.

I have tested it with a *chassis* bucket and it works as expected, per comment#6.

# ceph osd tree
ID WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.34999 root default
-5 0.89999     chassis test
-2 0.45000         host ceh-node-5
 0 0.45000             osd.0            up  1.00000          1.00000
-3 0.45000         host ceh-node-6
 1 0.45000             osd.1            up  1.00000          1.00000
-6 0.45000     chassis test1
-4 0.45000         host ceh-node-7
 2 0.45000             osd.2            up  1.00000          1.00000

# date; /etc/init.d/ceph stop osd; date
Thu Feb 25 20:41:54 IST 2016
=== osd.2 ===
Stopping Ceph osd.2 on ceh-node-7...kill 117686...kill 117686...done
Thu Feb 25 20:41:56 IST 2016

^^ The OSD was stopped at Thu Feb 25 20:41:56 IST 2016.

2016-02-25 20:46:56.198641 7f7d95f81700  0 log_channel(cluster) log [INF] : osd.2 out (down for 301.929220)
2016-02-25 20:46:56.240175 7f7d96f83700  1 mon.ceh-node-5@0(leader).osd e82 e82: 3 osds: 2 up, 2 in

^^ And after five minutes the OSD was marked out.

# ceph osd tree
ID WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.34999 root default
-5 0.89999     chassis test
-2 0.45000         host ceh-node-5
 0 0.45000             osd.0            up  1.00000          1.00000
-3 0.45000         host ceh-node-6
 1 0.45000             osd.1            up  1.00000          1.00000
-6 0.45000     chassis test1
-4 0.45000         host ceh-node-7
 2 0.45000             osd.2          down        0          1.00000 <=================
Some more testing for the option *mon_osd_down_out_subtree_limit*:

[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config show | grep mon_osd_down_out_subtree_limit
    "mon_osd_down_out_subtree_limit": "rack",

Here too we can see that the default is *rack*, as given in the code.

[root@ceh-node-5 ~]# ceph tell mon.ceh-node-5 injectargs --mon_osd_down_out_subtree_limit="datacenter"
Error ENOSYS: injectargs:You cannot change mon_osd_down_out_subtree_limit using injectargs.

[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config set mon_osd_down_out_subtree_limit "datacenter"
{ "error": "error setting 'mon_osd_down_out_subtree_limit' to 'datacenter': (38) Function not implemented"}

^^ So this option cannot be changed via "ceph tell" or the admin socket; you have to add it to ceph.conf and restart the mon process.

[root@ceh-node-5 ~]# vi /etc/ceph/ceph.conf
mon_osd_down_out_subtree_limit = "datacenter"

Set it above *room* (e.g. *datacenter*) if OSDs inside a fully-down *room* bucket should still be marked out automatically.

[root@ceh-node-5 ~]# /etc/init.d/ceph restart mon
=== mon.ceh-node-5 ===
=== mon.ceh-node-5 ===
Stopping Ceph mon.ceh-node-5 on ceh-node-5...kill 133081...done
=== mon.ceh-node-5 ===
Starting Ceph mon.ceh-node-5 on ceh-node-5...
Running as unit run-4815.service.
Starting ceph-create-keys on ceh-node-5...
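Since the option cannot be injected at runtime, any automation has to edit ceph.conf and then restart the monitor. A minimal sketch of the file edit, assuming a standard INI-style ceph.conf (the helper name is ours, not a Ceph API; the restart step above is still required afterwards):

```python
import configparser

def set_subtree_limit(conf_path, limit):
    # Hypothetical helper: write mon_osd_down_out_subtree_limit into the
    # [mon] section of a ceph.conf-style INI file. Does NOT restart the
    # monitor; that must still be done separately, as shown above.
    cfg = configparser.ConfigParser()
    cfg.read(conf_path)
    if not cfg.has_section("mon"):
        cfg.add_section("mon")
    cfg.set("mon", "mon_osd_down_out_subtree_limit", limit)
    with open(conf_path, "w") as f:
        cfg.write(f)
```

For example, set_subtree_limit("/etc/ceph/ceph.conf", "datacenter") followed by a mon restart reproduces the manual procedure above.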
[root@ceh-node-5 ~]# ceph -s
    cluster 87f816c5-48e4-44ca-8794-abe79293b37f
     health HEALTH_WARN
            clock skew detected on mon.ceh-node-6, mon.ceh-node-7
     monmap e3: 3 mons at {ceh-node-5=192.168.12.27:6789/0,ceh-node-6=192.168.12.28:6789/0,ceh-node-7=192.168.12.29:6789/0}
            election epoch 50, quorum 0,1,2 ceh-node-5,ceh-node-6,ceh-node-7
     osdmap e86: 3 osds: 3 up, 3 in
      pgmap v221: 64 pgs, 1 pools, 0 bytes data, 0 objects
            106 MB used, 1396 GB / 1396 GB avail
                  64 active+clean

[root@ceh-node-5 ~]# ceph mon dump
dumped monmap epoch 3
epoch 3
fsid 87f816c5-48e4-44ca-8794-abe79293b37f
last_changed 2016-02-25 13:18:14.636456
created 0.000000
0: 192.168.12.27:6789/0 mon.ceh-node-5
1: 192.168.12.28:6789/0 mon.ceh-node-6
2: 192.168.12.29:6789/0 mon.ceh-node-7

[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config show | grep mon_osd_down_out_subtree_limit
    "mon_osd_down_out_subtree_limit": "datacenter",

^^ Now bring down the OSD in this subtree:

[root@ceh-node-7 ~]# date; /etc/init.d/ceph stop osd; date
Fri Feb 26 10:33:21 IST 2016
=== osd.2 ===
Stopping Ceph osd.2 on ceh-node-7...kill 119334...kill 119334...done
Fri Feb 26 10:33:23 IST 2016
[root@ceh-node-7 ~]#

# tail -f /var/log/ceph/ceph-mon.ceh-node-5.log
2016-02-26 10:33:22.189315 7f5cefc18700  1 mon.ceh-node-5@0(leader).osd e87 e87: 3 osds: 2 up, 3 in
2016-02-26 10:33:22.222367 7f5cefc18700  0 log_channel(cluster) log [INF] : osdmap e87: 3 osds: 2 up, 3 in
2016-02-26 10:33:22.239862 7f5cefc18700  0 log_channel(cluster) log [INF] : pgmap v222: 64 pgs: 18 stale+active+clean, 46 active+clean; 0 bytes data, 106 MB used, 1396 GB / 1396 GB avail
2016-02-26 10:33:23.230361 7f5cefc18700  1 mon.ceh-node-5@0(leader).osd e88 e88: 3 osds: 2 up, 3 in
2016-02-26 10:33:23.271759 7f5cefc18700  0 log_channel(cluster) log [INF] : osdmap e88: 3 osds: 2 up, 3 in

^^ Check the log: the OSD is now down, but not yet out.
Wait for 5 minutes (300 seconds):

2016-02-26 10:38:25.100057 7f5cee442700  0 log_channel(cluster) log [INF] : osd.2 out (down for 302.894443)
2016-02-26 10:38:25.142557 7f5cefc18700  1 mon.ceh-node-5@0(leader).osd e89 e89: 3 osds: 2 up, 2 in
2016-02-26 10:38:25.167256 7f5cefc18700  0 log_channel(cluster) log [INF] : osdmap e89: 3 osds: 2 up, 2 in

^^ The OSD is declared out.

ID WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.34999 root default
-5 0.89999     room test
-2 0.45000         host ceh-node-5
 0 0.45000             osd.0            up  1.00000          1.00000
-3 0.45000         host ceh-node-6
 1 0.45000             osd.1            up  1.00000          1.00000
-6 0.45000     room test1
-4 0.45000         host ceh-node-7
 2 0.45000             osd.2          down        0          1.00000
- Given all the analysis above, this is NOTABUG.
- It is just a configuration issue.
- Either change your crushmap to use bucket types no higher than *chassis*, since the default *mon_osd_down_out_subtree_limit* is *rack*.
- Or modify *mon_osd_down_out_subtree_limit* to match your crushmap, as shown in comment#11.
- Closing with NOTABUG.