| Summary: | OSD never gets marked out after going down, with a single host bucket in a room crush bucket | | |
|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | Vikhyat Umrao <vumrao> |
| Component: | RADOS | Assignee: | Samuel Just <sjust> |
| Status: | CLOSED NOTABUG | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 1.3.0 | CC: | ceph-eng-bugs, dzafman, kchai |
| Target Milestone: | rc | | |
| Target Release: | 1.3.3 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-02-26 05:45:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |
Created attachment 1130532 [details]
example test crushmap
I am working on reproducing the issue with debug_mon = 10 and debug_ms = 1 for both scenarios so we can check what is happening; I will update the Bugzilla. It looks like I found a clue in the logs:
mon.ceh-node-5@0(leader).osd e66 tick entire containing rack subtree for osd.2 is down; resetting timer
^^ Comparing the log line above with the code:
File: src/mon/OSDMonitor.cc

    // is this an entire large subtree down?
    if (g_conf->mon_osd_down_out_subtree_limit.length()) {
      int type = osdmap.crush->get_type_id(g_conf->mon_osd_down_out_subtree_limit);
      if (type > 0) {
        if (osdmap.containing_subtree_is_down(g_ceph_context, o, type, &down_cache)) {
          dout(10) << "tick entire containing " << g_conf->mon_osd_down_out_subtree_limit
                   << " subtree for osd." << o << " is down; resetting timer" << dendl;
          // reset timer, too.
          down_pending_out[o] = now;
          continue;
        }
      }
    }
File : src/common/config_opts.h
OPTION(mon_osd_down_out_subtree_limit, OPT_STR, "rack") // smallest crush unit/type that we will not automatically mark out
The code above makes it clear that the *smallest crush unit/type that we will not automatically mark out* defaults to *rack*, while in our test the bucket type is *room*.
and if we check the hierarchy in crushmap :
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
*room* is type 7, well above type 3 *rack*.
I believe auto mark-out is allowed for fully-down subtrees up to *type 2 chassis*; from *type 3 rack* upward, a fully-down subtree is never automatically marked out.
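The comparison the monitor effectively makes can be sketched as follows. This is a simplified model, not the actual Ceph implementation: it assumes the decision reduces to comparing the CRUSH type id of the OSD's fully-down containing bucket against the type id of `mon_osd_down_out_subtree_limit`.

```python
# Simplified model (NOT the real OSDMonitor code) of the subtree-limit check.
# Type ids taken from the crushmap hierarchy quoted above.
CRUSH_TYPES = {"osd": 0, "host": 1, "chassis": 2, "rack": 3, "row": 4,
               "pdu": 5, "pod": 6, "room": 7, "datacenter": 8,
               "region": 9, "root": 10}

def auto_out_blocked(containing_bucket_type, subtree_limit="rack"):
    """True if a fully-down subtree of this type suppresses auto mark-out."""
    return CRUSH_TYPES[containing_bucket_type] >= CRUSH_TYPES[subtree_limit]

# With the default limit "rack": a fully-down room blocks auto-out,
# a fully-down chassis does not.
print(auto_out_blocked("room"))                              # True  -> never out
print(auto_out_blocked("chassis"))                           # False -> out after 300s
print(auto_out_blocked("room", subtree_limit="datacenter"))  # False -> out after 300s
```

This matches the observed behavior: the single-host *room* subtree keeps resetting the mark-out timer under the default limit, while a *chassis* subtree does not.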
>
> I believe auto mark-out is allowed for fully-down subtrees up to *type 2
> chassis*; from *type 3 rack* upward, a fully-down subtree is never
> automatically marked out.
I will test with *type 2 chassis* and see how it behaves.
(In reply to Vikhyat Umrao from comment #7)
> I will test with *type 2 chassis* and see how it behaves.

I have tested it with a *chassis* bucket and it works as expected, as described in comment #6.

# ceph osd tree
ID WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.34999 root default
-5 0.89999     chassis test
-2 0.45000         host ceh-node-5
 0 0.45000             osd.0            up  1.00000          1.00000
-3 0.45000         host ceh-node-6
 1 0.45000             osd.1            up  1.00000          1.00000
-6 0.45000     chassis test1
-4 0.45000         host ceh-node-7
 2 0.45000             osd.2            up  1.00000          1.00000

# date; /etc/init.d/ceph stop osd; date
Thu Feb 25 20:41:54 IST 2016
=== osd.2 ===
Stopping Ceph osd.2 on ceh-node-7...kill 117686...kill 117686...done
Thu Feb 25 20:41:56 IST 2016

^^ OSD was stopped at Thu Feb 25 20:41:56 IST 2016.

2016-02-25 20:46:56.198641 7f7d95f81700  0 log_channel(cluster) log [INF] : osd.2 out (down for 301.929220)
2016-02-25 20:46:56.240175 7f7d96f83700  1 mon.ceh-node-5@0(leader).osd e82 e82: 3 osds: 2 up, 2 in

^^ And after five minutes the OSD was marked out.

# ceph osd tree
ID WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.34999 root default
-5 0.89999     chassis test
-2 0.45000         host ceh-node-5
 0 0.45000             osd.0            up  1.00000          1.00000
-3 0.45000         host ceh-node-6
 1 0.45000             osd.1            up  1.00000          1.00000
-6 0.45000     chassis test1
-4 0.45000         host ceh-node-7
 2 0.45000             osd.2          down        0          1.00000 <=================

Some more testing for the option *mon_osd_down_out_subtree_limit*:
[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config show | grep mon_osd_down_out_subtree_limit
"mon_osd_down_out_subtree_limit": "rack",
Here too we can confirm the default is *rack*, as given in the code.
[root@ceh-node-5 ~]# ceph tell mon.ceh-node-5 injectargs --mon_osd_down_out_subtree_limit="datacenter"
Error ENOSYS: injectargs:You cannot change mon_osd_down_out_subtree_limit using injectargs.
[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config set mon_osd_down_out_subtree_limit "datacenter"
{
"error": "error setting 'mon_osd_down_out_subtree_limit' to 'datacenter': (38) Function not implemented"
}
^^ This option cannot be changed via "ceph tell" or the admin daemon; you need to set it in ceph.conf and restart the mon process.
[root@ceh-node-5 ~]# vi /etc/ceph/ceph.conf
mon_osd_down_out_subtree_limit = "datacenter"
With the limit raised to *datacenter*, a fully-down *room* subtree is now allowed to be marked out automatically.
[root@ceh-node-5 ~]# /etc/init.d/ceph restart mon
=== mon.ceh-node-5 ===
=== mon.ceh-node-5 ===
Stopping Ceph mon.ceh-node-5 on ceh-node-5...kill 133081...done
=== mon.ceh-node-5 ===
Starting Ceph mon.ceh-node-5 on ceh-node-5...
Running as unit run-4815.service.
Starting ceph-create-keys on ceh-node-5...
[root@ceh-node-5 ~]# ceph -s
cluster 87f816c5-48e4-44ca-8794-abe79293b37f
health HEALTH_WARN
clock skew detected on mon.ceh-node-6, mon.ceh-node-7
monmap e3: 3 mons at {ceh-node-5=192.168.12.27:6789/0,ceh-node-6=192.168.12.28:6789/0,ceh-node-7=192.168.12.29:6789/0}
election epoch 50, quorum 0,1,2 ceh-node-5,ceh-node-6,ceh-node-7
osdmap e86: 3 osds: 3 up, 3 in
pgmap v221: 64 pgs, 1 pools, 0 bytes data, 0 objects
106 MB used, 1396 GB / 1396 GB avail
64 active+clean
[root@ceh-node-5 ~]# ceph mon dump
dumped monmap epoch 3
epoch 3
fsid 87f816c5-48e4-44ca-8794-abe79293b37f
last_changed 2016-02-25 13:18:14.636456
created 0.000000
0: 192.168.12.27:6789/0 mon.ceh-node-5
1: 192.168.12.28:6789/0 mon.ceh-node-6
2: 192.168.12.29:6789/0 mon.ceh-node-7
[root@ceh-node-5 ~]# ceph --admin-daemon /var/run/ceph/ceph-mon.ceh-node-5.asok config show | grep mon_osd_down_out_subtree_limit
"mon_osd_down_out_subtree_limit": "datacenter",
[root@ceh-node-7 ~]# date; /etc/init.d/ceph stop osd; date
Fri Feb 26 10:33:21 IST 2016
=== osd.2 ===
Stopping Ceph osd.2 on ceh-node-7...kill 119334...kill 119334...done
Fri Feb 26 10:33:23 IST 2016
[root@ceh-node-7 ~]#
^^ The OSD in this subtree has now been brought down. Check the monitor log:
# tail -f /var/log/ceph/ceph-mon.ceh-node-5.log
2016-02-26 10:33:22.189315 7f5cefc18700 1 mon.ceh-node-5@0(leader).osd e87 e87: 3 osds: 2 up, 3 in
2016-02-26 10:33:22.222367 7f5cefc18700 0 log_channel(cluster) log [INF] : osdmap e87: 3 osds: 2 up, 3 in
2016-02-26 10:33:22.239862 7f5cefc18700 0 log_channel(cluster) log [INF] : pgmap v222: 64 pgs: 18 stale+active+clean, 46 active+clean; 0 bytes data, 106 MB used, 1396 GB / 1396 GB avail
2016-02-26 10:33:23.230361 7f5cefc18700 1 mon.ceh-node-5@0(leader).osd e88 e88: 3 osds: 2 up, 3 in
2016-02-26 10:33:23.271759 7f5cefc18700 0 log_channel(cluster) log [INF] : osdmap e88: 3 osds: 2 up, 3 in
^^ The log shows the OSD is now down, but not yet out.
Wait for 5 minutes (the default mon_osd_down_out_interval of 300 seconds):
2016-02-26 10:38:25.100057 7f5cee442700 0 log_channel(cluster) log [INF] : osd.2 out (down for 302.894443)
2016-02-26 10:38:25.142557 7f5cefc18700 1 mon.ceh-node-5@0(leader).osd e89 e89: 3 osds: 2 up, 2 in
2016-02-26 10:38:25.167256 7f5cefc18700 0 log_channel(cluster) log [INF] : osdmap e89: 3 osds: 2 up, 2 in
^^ OSD is declared out.
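The five-minute grace period observed above matches the monitor's down-out bookkeeping: it remembers when each OSD was seen down and marks it out once the elapsed time exceeds mon_osd_down_out_interval (default 300 seconds), unless the subtree-limit check keeps resetting that timestamp. A simplified sketch, not the real OSDMonitor tick code:

```python
# Simplified model of the monitor's down-out timer (NOT the actual Ceph code).
# Assumes mon_osd_down_out_interval = 300 and that subtree_blocked() mirrors
# the mon_osd_down_out_subtree_limit check shown earlier.
DOWN_OUT_INTERVAL = 300.0

def tick(down_pending_out, now, subtree_blocked):
    """Return the set of OSD ids to mark out on this tick.

    down_pending_out maps osd id -> time it was seen down; if the OSD's
    entire containing subtree is down, the timestamp is reset so the OSD
    is never marked out automatically.
    """
    marked_out = set()
    for osd, down_since in list(down_pending_out.items()):
        if subtree_blocked(osd):
            down_pending_out[osd] = now   # reset timer, as in OSDMonitor::tick
        elif now - down_since > DOWN_OUT_INTERVAL:
            marked_out.add(osd)
    return marked_out

# osd.2 down since t=0; while the subtree check blocks, it never goes out.
pending = {2: 0.0}
print(tick(pending, 301.0, lambda o: True))    # set() -> timer was reset
print(tick(pending, 700.0, lambda o: False))   # {2}   -> marked out
```

This is why the *room* case never marks osd.2 out: every tick resets `down_pending_out[2]`, so the 300-second threshold is never crossed.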
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.34999 root default
-5 0.89999 room test
-2 0.45000 host ceh-node-5
0 0.45000 osd.0 up 1.00000 1.00000
-3 0.45000 host ceh-node-6
1 0.45000 osd.1 up 1.00000 1.00000
-6 0.45000 room test1
-4 0.45000 host ceh-node-7
2 0.45000 osd.2 down 0 1.00000
- With all the above analysis, this is NOTABUG.
- It is just a configuration issue.
- Either change your crushmap to use buckets no larger than *chassis*, since the default *mon_osd_down_out_subtree_limit* is *rack*,
- or modify *mon_osd_down_out_subtree_limit* to match your crushmap, as shown in comment #11.
- Closing with NOTABUG.
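As an illustration of the first workaround, a crushmap using *chassis* buckets (type 2, below the default *rack* limit) instead of *room* buckets would look roughly like the following. The bucket names and weights mirror the test cluster above; treat this as a sketch, not the exact decompiled map:

```
# decompiled crushmap fragment (sketch): hosts grouped under chassis
# buckets so fully-down subtrees stay below the default "rack" limit
chassis test {
        id -5           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item ceh-node-5 weight 0.450
        item ceh-node-6 weight 0.450
}
chassis test1 {
        id -6           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item ceh-node-7 weight 0.450
}
root default {
        id -1           # do not change unnecessarily
        alg straw2
        hash 0  # rjenkins1
        item test weight 0.900
        item test1 weight 0.450
}
```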
Description of problem:
OSD never gets marked out when a single host bucket is placed inside another crush bucket (e.g. room).

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.3
Also upstream firefly 0.80.10

How reproducible:
Always

Steps to Reproduce:
1. Create a three-OSD cluster, with each OSD on a different node.
2. Change the default crush map to something like the following:

ID WEIGHT  TYPE NAME               UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 1.34999 root default
-5 0.89999     room test
-2 0.45000         host ceh-node-5
 0 0.45000             osd.0            up  1.00000          1.00000
-3 0.45000         host ceh-node-6
 1 0.45000             osd.1            up  1.00000          1.00000
-6 0.45000     room test1
-4 0.45000         host ceh-node-7
 2 0.45000             osd.2          down  1.00000          1.00000

We modified the default crush map of three OSDs on three nodes by adding the two room buckets *test* and *test1* as given above; the failure domain here is *host*:

root default {
        id -1           # do not change unnecessarily
        # weight 1.350
        alg straw2
        hash 0  # rjenkins1
        item test weight 0.900
        item test1 weight 0.450
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}

3. If I stop any OSD from the *test* bucket, the OSD gets marked out after 300 seconds, the default mark-out interval. But if I stop the OSD from *test1*, which contains only one *host* bucket, the OSD never gets marked out after 300 seconds.

Actual results:
The OSD never gets marked out.

Expected results:
The OSD should get marked out.