Bug 1222509

Summary: 1.3.0: mon fails to come up. dies as soon as started.
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Harish NV Rao <hnallurv>
Component: RADOS
Assignee: Kefu Chai <kchai>
Status: CLOSED ERRATA
QA Contact: ceph-qe-bugs <ceph-qe-bugs>
Severity: urgent
Docs Contact: John Wilkins <jowilkin>
Priority: unspecified
Version: 1.3.0
CC: ceph-eng-bugs, dzafman, flucifre, gmeno, hnallurv, kchai, kdreyer, rgowdege, sjust, vakulkar
Target Milestone: rc
Target Release: 1.3.1
Hardware: x86_64
OS: Linux
Fixed In Version: ceph-0.94.1-15.el7cp
Doc Type: Bug Fix
Last Closed: 2015-07-16 22:20:08 UTC
Type: Bug
Bug Blocks: 1230323
Attachments:
- mon log
- ceph mon log
- client file containing GET and POST of crush map for which mon fails
- ceph mon log when a customized client script was used to PUT a modified crushmap

Description Harish NV Rao 2015-05-18 11:53:05 UTC
Created attachment 1026671 [details]
mon log

Description of problem:
-----------------------
After encountering the issue mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1222505, the mon process on the first cluster keeps failing to start. It dies as soon as it is started.

bt:
   -1> 2015-05-18 05:48:37.500613 7fe040ff1700  5 mon.Mon@0(leader).paxos(paxos active c 78062..78623) is_readable = 1 - now=2015-05-18 05:48:37.500613 lease_expire=0.000000 has v0 lc 78623
     0> 2015-05-18 05:48:37.502668 7fe040ff1700 -1 *** Caught signal (Aborted) **
 in thread 7fe040ff1700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: /usr/bin/ceph-mon() [0x9017e2]
 2: (()+0xf130) [0x7fe04791f130]
 3: (gsignal()+0x37) [0x7fe0463395d7]
 4: (abort()+0x148) [0x7fe04633acc8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fe046c3d9b5]
 6: (()+0x5e926) [0x7fe046c3b926]
 7: (()+0x5e953) [0x7fe046c3b953]
 8: (()+0x5eb73) [0x7fe046c3bb73]
 9: (std::__throw_logic_error(char const*)+0x77) [0x7fe046c90717]
 10: (char* std::string::_S_construct<char const*>(char const*, char const*, std::allocator<char> const&, std::forward_iterator_tag)+0xa1) [0x7fe046c9c561]
 11: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(char const*, std::allocator<char> const&)+0x38) [0x7fe046c9c918]
 12: (CrushTreeDumper::dump_item_fields(CrushWrapper const*, CrushTreeDumper::Item const&, ceph::Formatter*)+0xb9) [0x6175c9]
 13: (OSDMap::print_tree(std::ostream*, ceph::Formatter*) const+0x10d9) [0x787ab9]
 14: (OSDMonitor::preprocess_command(MMonCommand*)+0xe55) [0x60a0b5]
 15: (OSDMonitor::preprocess_query(PaxosServiceMessage*)+0x20b) [0x60f75b]
 16: (PaxosService::dispatch(PaxosServiceMessage*)+0x833) [0x5cacd3]
 17: (Monitor::handle_command(MMonCommand*)+0x147c) [0x591a4c]
 18: (Monitor::dispatch(MonSession*, Message*, bool)+0xf9) [0x594c89]
 19: (Monitor::_ms_dispatch(Message*)+0x1a6) [0x595936]
 20: (Monitor::ms_dispatch(Message*)+0x23) [0x5b5403]
 21: (DispatchQueue::entry()+0x64a) [0x8a1d9a]
 22: (DispatchQueue::DispatchThread::entry()+0xd) [0x79bd9d]
 23: (()+0x7df5) [0x7fe047917df5]
 24: (clone()+0x6d) [0x7fe0463fa1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Not sure whether purging on one cluster led to this.

Version-Release number of selected component (if applicable): 0.94.1


How reproducible:


Steps to Reproduce:


Actual results:
The mon process keeps failing to come up; it dies whenever it is started.

Expected results:
successful start of mon

Additional info:
----------------
calamari admin: 10.8.128.6 (octo lab machines)
ssh username:cephuser, passwd:junk123
# Admin node:
10.8.128.6	 Admin
#cluster 1:
10.8.128.33	Mon
10.8.128.40	osd0
10.8.128.86	osd1
10.8.128.29	osd2
#cluster2:
10.8.128.76	c2Mon
10.8.128.89	c2osd0
10.8.128.90	c2osd1
10.8.128.91	 c2osd2

Comment 2 Ken Dreyer (Red Hat) 2015-05-18 15:37:09 UTC
Kefu, would you mind assisting with this bug (or re-assigning as appropriate)?

Comment 3 Kefu Chai 2015-05-18 16:16:35 UTC
Ken, will look at it tomorrow.

Comment 4 Harish NV Rao 2015-05-18 16:21:46 UTC
Kefu, if time permits, can you please take a look at this bug today? My testing is blocked because of this bug. Please help by providing a workaround for this.

Comment 5 Kefu Chai 2015-05-19 01:55:32 UTC
Sorry for blocking your testing, Harish. I was about to call it a night yesterday; I will investigate this issue today.

Comment 6 Kefu Chai 2015-05-19 14:13:37 UTC
Harish, do we have the debug symbols for ceph-mon-0.94.1-8.el7cp.x86_64? Normally one can install them using

 debuginfo-install ceph-mon-0.94.1-8.el7cp.x86_64

but it seems I cannot install them on `Mon`:

# debuginfo-install ceph-mon-0.94.1-8.el7cp.x86_64
Loaded plugins: auto-update-debuginfo, priorities, product-id
enabling rhel-7-server-optional-debug-rpms
enabling rhel-7-server-debug-rpms
enabling rhel-7-server-extras-debug-rpms
Could not find debuginfo for main pkg: 1:ceph-mon-0.94.1-8.el7cp.x86_64
Package boost-debuginfo-1.53.0-23.el7.x86_64 already installed and latest version
Package boost-debuginfo-1.53.0-23.el7.x86_64 already installed and latest version
Package glibc-debuginfo-2.17-78.el7.x86_64 already installed and latest version
Package gcc-debuginfo-4.8.3-9.el7.x86_64 already installed and latest version
Could not find debuginfo pkg for dependency package leveldb-1.12.0-5.el7cp.x86_64
Package nspr-debuginfo-4.10.6-3.el7.x86_64 already installed and latest version
Package nss-debuginfo-3.16.2.3-5.el7.x86_64 already installed and latest version
Package gcc-debuginfo-4.8.3-9.el7.x86_64 already installed and latest version
Could not find debuginfo pkg for dependency package gperftools-libs-2.1-1.el7.x86_64
Package util-linux-debuginfo-2.23.2-21.el7.x86_64 already installed and latest version
No debuginfo packages available to install

Comment 7 Kefu Chai 2015-05-19 14:16:09 UTC
Note to self: to create a coredump, run

 /usr/bin/ceph-mon -i Mon --pid-file /var/run/ceph/mon.Mon.pid -c /etc/ceph/ceph.conf --cluster ceph -f

Comment 8 Ken Dreyer (Red Hat) 2015-05-19 15:30:54 UTC
Hi Kefu and Harish, Puddle's debuginfo repos are not defined by default on these hosts, but you can define them by running, e.g.:

yum-config-manager --add-repo http://puddle.ceph.redhat.com/puddles/RHCeph/1.3-RHEL-7/2015-05-05.3/Server-RH7-CEPH-MON-1.3/x86_64/debuginfo/

This will create a rather long-titled .repo file in /etc/yum.repos.d/.

At that point, "yum install ceph-debuginfo" will work.

Comment 9 Harish NV Rao 2015-05-19 15:44:24 UTC
Thanks Ken!! I was planning to ask your help on the same!

Comment 10 Kefu Chai 2015-05-19 17:04:18 UTC
Harish, it's likely you fed the monitor a (bad) crush map:

/var/log/ceph$ zgrep setcrushmap *
ceph.audit.log-20150519.gz:2015-05-18 05:14:12.613798 mon.0 10.8.128.33:6789/0 224816 : audit [INF] from='client.? 10.8.128.33:0/1023292' entity='client.admin' cmd=[{"prefix": "osd setcrushmap"}]: dispatch
ceph.audit.log-20150519.gz:2015-05-18 05:14:12.613798 mon.0 10.8.128.33:6789/0 224816 : audit [INF] from='client.? 10.8.128.33:0/1023292' entity='client.admin' cmd=[{"prefix": "osd setcrushmap"}]: dispatch

Can you recall doing that? If yes, did you do it on purpose or by accident?

Comment 11 Kefu Chai 2015-05-19 17:14:11 UTC
@Ken, thanks a lot! I am able to reproduce this issue in my local env now.

@Harish, it's very likely due to a bad crushmap, so I'd like to know:

# if we fed monitor with a crush map or not
# if yes, where the crush map came from. probably, in addition to making the monitor more tolerant of this sort of crush map, we will need to fix the tool which generates the crush map, or improve our documentation.

Comment 12 Harish NV Rao 2015-05-19 18:42:56 UTC
Kefu,

>> # if we fed monitor with a crush map or not
I was testing calamari REST APIs yesterday. I tried "GET" of crush_map. But i dont remember "PUT" of the crush_map being done. I am not sure doing "PUT" of the same crush_map got from GET without changing any contents will  corrupt the running crushmap.

>># if yes, where the crush map came from. probably, in addition to making the monitor more tolerant of this sort of crush map, we will need to fix the tool which generates the crush map, or improve our documentation.

Yes, this has to be done.

Comment 13 Harish NV Rao 2015-05-19 18:43:42 UTC
Kefu, can you please tell me how to make the mon come up permanently?

Comment 14 Harish NV Rao 2015-05-19 19:48:40 UTC
kefu, can you please provide the workaround for this problem so i can start testing on this setup?

Comment 15 Christina Meno 2015-05-19 23:21:06 UTC
(In reply to Harish NV Rao from comment #12)
> Kefu,
> 
> >> # if we fed monitor with a crush map or not
> I was testing calamari REST APIs yesterday. I tried "GET" of crush_map. But
> i dont remember "PUT" of the crush_map being done. I am not sure doing "PUT"
> of the same crush_map got from GET without changing any contents will 
> corrupt the running crushmap.

I do not understand this at all. Are you saying that you didn't perform a PUT? I don't have context for what the issue is here.

Would you please provide the CRUSH-map in question?

Comment 16 Kefu Chai 2015-05-20 04:25:50 UTC
> kefu, can you please provide the workaround for this problem so i can start testing on this setup?

Harish, I recreated the storage for mon.Mon with the following steps:

mv /var/lib/ceph/mon/ceph-Mon{,.old} # backup the old monitor storage, keyring and etc.
ceph-monstore-tool /var/lib/ceph/mon/ceph-Mon.old get monmap -- --out /tmp/monmap.1 # extract the monmap from the old monitor storage, so we can import it back into monitor storage later on
/usr/bin/ceph-mon -i Mon --mkfs --monmap /tmp/monmap.1 --keyring /var/lib/ceph/mon/ceph-Mon.old/keyring  --conf /etc/ceph/ceph.conf # re-create the monitor storage
ceph-authtool /var/lib/ceph/mon/ceph-Mon/keyring --import-keyring /etc/ceph/ceph.client.admin.keyring # add the client.admin to monitor's keyring
# and manually add permission settings for client.admin to /var/lib/ceph/mon/ceph-Mon/keyring, so it looks like
=====>8======
[mon.]
        key = AQCZa0pVAAAAABAAE0MUdAPjG0M3Z5umrHjn3A==
        caps mon = "allow *"
[client.admin]
        key = AQAAbkpV2KM3AxAALqb07IIUBg8zldPwVfS5Og==
        caps mon = "allow *"
=====8<======
mv /var/lib/ceph/mon/backup.ceph-Mon/sysvinit{,.backup} # so we don't have two mon.Mons when init.d/ceph tries to start/stop/status the mon processes.
/etc/init.d/ceph start # start the configured instances on this machine, i.e. mon.Mon
ceph osd tree # check the crush map, well no osd is found, let's add them back
ceph-monstore-tool /var/lib/ceph/mon/backup.ceph-Mon get osdmap -- --version 81 --out /tmp/osdmap.81 # export the most recent good crushmap
osdmaptool --export-crush  /tmp/crushmap.81 /tmp/osdmap.81
ceph osd setcrushmap -i /tmp/crushmap.81 # and feed to monitor
ceph osd tree # et voilà !

You can revert what I did by overwriting /var/lib/ceph/mon/ceph-Mon with /var/lib/ceph/mon/ceph-Mon.old.

Comment 17 Kefu Chai 2015-05-20 14:52:20 UTC
> I was testing calamari REST APIs yesterday. I tried "GET" of crush_map. But i dont remember "PUT" of the crush_map being done. I am not sure doing "PUT" of the same crush_map got from GET without changing any contents will  corrupt the running crush map.

I am echoing Greg's query. Harish, if you can reproduce this issue and attach the crush map you "GET", that would be very helpful in finding out where the problematic crush map came from.

Comment 18 Harish NV Rao 2015-05-20 16:44:59 UTC
I will try to reproduce this issue and provide the required logs.

Comment 19 Ken Dreyer (Red Hat) 2015-05-21 22:30:05 UTC
When we have a clearer reproduction case we can move forward on this.

Comment 20 Harish NV Rao 2015-05-22 10:52:39 UTC
sure. I will be starting API testing next week and will try to reproduce the issue.

Comment 21 Kefu Chai 2015-05-22 16:41:22 UTC
Thanks Harish. In the meanwhile, we are working on enabling the monitor to detect 1) an empty crush map, and 2) a crush map not covering all OSDs in the osdmap.
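
Until that detection lands, a rough manual cross-check for case 2) before injecting a map could look like the sketch below (sketch only; paths are examples, and it assumes the hammer-style "device N osd.N" lines in the decompiled map):

 # /tmp/crushmap.txt is the decompiled map you are about to inject
 ceph osd ls | sort -n > /tmp/osds.in-osdmap        # OSD ids currently in the osdmap
 awk '$1 == "device" {sub(/^osd\./, "", $3); print $3}' /tmp/crushmap.txt | sort -n > /tmp/osds.in-crush
 diff /tmp/osds.in-osdmap /tmp/osds.in-crush        # any difference deserves a closer look before setcrushmap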

Comment 22 Harish NV Rao 2015-05-25 08:55:40 UTC
Created attachment 1029386 [details]
ceph mon log

Comment 23 Harish NV Rao 2015-05-25 08:57:40 UTC
Kefu, the issue was reproduced when a crushmap that had just "#begin    #end" was POSTed via the Django REST Framework. The above attachment contains the mon log.

Comment 24 Harish NV Rao 2015-05-25 09:35:20 UTC
Created attachment 1029400 [details]
client file containing GET and POST of crush map for which mon fails.

Comment 25 Harish NV Rao 2015-05-25 09:52:23 UTC
Hi Kefu,

The mon crashed when I used a client script to do a PUT operation too. This client script is basically a script in the Calamari test suite: https://github.com/ceph/calamari/blob/master/tests/http_client.py. I modified parts of this script for doing GET and POST as well [see above attachment].

I am not sure whether that is the right way. I found this modified py file as a quick fix for doing GET and POST operation via script which may not be right. Please check the file and confirm if it's ok to use it.

Are there any other Ceph REST API client that i can use? 

Regards,
Harish

Comment 26 Harish NV Rao 2015-05-25 09:54:25 UTC
Created attachment 1029402 [details]
ceph mon log when a customized client script was used to PUT a modified crushmap

Comment 27 Kefu Chai 2015-05-25 16:46:01 UTC
the attached python looks suspicious:


response = c.post('cluster/cbc3750d-763e-42f2-a929-b7043734f4f0/crush_map', '# begin crush map. \ntunable choose_local_tries 0\n....')

where c is an instance of requests.Session, so the script is sending a POST request to 'cluster/cbc3750d-763e-42f2-a929-b7043734f4f0/crush_map?# begin crush map ....'. in other words, the crush map is posted as the query string, (not sure if it is escaped, though). see http://docs.python-requests.org/en/latest/api/?highlight=session#requests.Session.request .

> I am not sure whether that is the right way. I found this modified py file as a quick fix for doing GET and POST operation via script which may not be right. Please check the file and confirm if it's ok to use it.

> Are there any other Ceph REST API client that i can use? 
@greg, is it how this REST API is supposed to be used? could you help confirm this? thanks.



unlike the empty crush map we extracted from osdmap#85 in http://tracker.ceph.com/issues/11680, this crush map for the POST looks decent:

# begin crush map.
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 68
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host osd0 {
	id -2		# do not change unnecessarily
	# weight 1.820
	alg straw
	hash 0	# rjenkins1
	item osd.0 weight 0.910
	item osd.2 weight 0.910
}
host osd1 {
	id -3		# do not change unnecessarily
	# weight 1.820
	alg straw
	hash 0	# rjenkins1
	item osd.1 weight 0.910
	item osd.3 weight 0.910
}
root default {
	id -1		# do not change unnecessarily
	# weight 3.640
	alg straw
	hash 0	# rjenkins1
	item osd0 weight 1.820
	item osd1 weight 1.820
}

# rules
rule replicated_ruleset {
	ruleset 0
	type replicated
	min_size 1
	max_size 10
	step take default
	step chooseleaf firstn 0 type host
	step emit
}

# end crush map



and the attached log of https://bugzilla.redhat.com/attachment.cgi?id=1029402 shows something different:

2015-05-25 05:33:57.741829 7ffe076b2700  0 log_channel(audit) log [DBG] : from='client.? 10.12.27.14:0/1004185' entity='client.admin' cmd=[{"prefix": "osd tree", "epoch": 15, "format": "json"}]: dispatch
2015-05-25 05:33:57.748845 7ffe076b2700 -1 osd/OSDMap.h: In function 'unsigned int OSDMap::get_weight(int) const' thread 7ffe076b2700 time 2015-05-25 05:33:57.742429
osd/OSDMap.h: 374: FAILED assert(o < max_osd)

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7b3a95]
 2: /usr/bin/ceph-mon() [0x773223]
 3: (OSDMap::print_tree(std::ostream*, ceph::Formatter*) const+0x192a) [0x78835a]
 4: (OSDMonitor::preprocess_command(MMonCommand*)+0xe55) [0x60a105]
 5: (OSDMonitor::preprocess_query(PaxosServiceMessage*)+0x20b) [0x60f7ab]
 6: (PaxosService::dispatch(PaxosServiceMessage*)+0x833) [0x5cad23]
 7: (Monitor::handle_command(MMonCommand*)+0x147c) [0x591a9c]
 8: (Monitor::dispatch(MonSession*, Message*, bool)+0xf9) [0x594cd9]
 9: (Monitor::_ms_dispatch(Message*)+0x1a6) [0x595986]
 10: (Monitor::ms_dispatch(Message*)+0x23) [0x5b5453]
 11: (DispatchQueue::entry()+0x64a) [0x8a240a]
 12: (DispatchQueue::DispatchThread::entry()+0xd) [0x79bded]
 13: (()+0x7df5) [0x7ffe0da10df5]
 14: (clone()+0x6d) [0x7ffe0c4f31ad]

which means the printed OSD did not exist in the osdmap at that moment. Harish, may i know what test you are working on?

Comment 28 Harish NV Rao 2015-05-26 07:34:20 UTC
>which means the printed OSD did not exist in the osdmap at that moment. Harish, may i know what test you are working on?

I am working on a test where I input a wrong/incorrect/invalid crushmap using POST via a client script.

Here is what the script is doing:

1. GET the recent crushmap

2. POST the map with following changes:
    a) the value of tunable choose_total_tries changed to 68
    b) the bucket for osd2 removed
    c) bucket osd0 is added with:
	item osd.0 weight 0.910
	item osd.2 weight 0.910
    d) bucket osd1 is added with:
	item osd.1 weight 0.910
	item osd.3 weight 0.910
    Note: osd.3 does not exist actually in the system. osd.2 is actually on the osd2 host. 
    e) under #devices, added : device 3 osd.3
    In my view, the above modifications should result in an incorrect crushmap, which the system should detect and reject.

3. Once POSTed via script, the mon crashed.


Please note that I also use the browser-based Django REST Framework v2.3.12 to test the GET and POST.

There are some tests which need GET and POST to be done via a script, hence the attached script.

Comment 29 Harish NV Rao 2015-05-26 07:39:48 UTC
Kefu, please note that the script, when used with a valid crush map for POSTing, does not kill the mon. The POST operation completes successfully - subsequent GETs show the modified values.

Comment 30 Kefu Chai 2015-05-26 13:51:58 UTC
Harish, thanks for explaining your tests in such detail =)

> Note: osd.3 does not exist actually in the system.

So this is the case where an OSD exists in the crush map but not in the osdmap. We will be able to detect this.

>  osd.2 is actually on the osd2 host. 

This needs some cross-checking between the osdmap and crush before accepting the crush map. IIRC, crush will blindly follow the rules to select a list of OSDs for a given op, so neither the client nor the server side will notice this problem, aside from the selected OSDs possibly not being the expected ones.

And I am not sure if we need to reject such a crush map.

> Kefu, please note that the script, when used with a valid crush map for POSTing, does not kill the mon. The POST operation completes successfully - subsequent GETs show the modified values.

Yeah, I see. The "ceph osd tree" command kills the mon if a bad crush map was sent to it beforehand.

Comment 31 Kefu Chai 2015-05-26 15:53:41 UTC
>  osd.2 is actually on the osd2 host. 

It seems our init script can fix this automatically by calling ceph-osd-prestart.sh or ceph-crush-location.

But this is done when the OSD daemon is started; after that, I guess the monitor is on its own.

Comment 32 Christina Meno 2015-05-26 19:15:08 UTC
(In reply to Kefu Chai from comment #27)
> the attached python looks suspicious:
> 
> response = c.post('cluster/cbc3750d-763e-42f2-a929-b7043734f4f0/crush_map',
> '# begin crush map. \ntunable choose_local_tries 0\n....')
> 
> where c is an instance of requests.Session, so the script is sending a POST
> request to 'cluster/cbc3750d-763e-42f2-a929-b7043734f4f0/crush_map?# begin
> crush map ....'. in other words, the crush map is posted as the query
> string, (not sure if it is escaped, though).
> 
> @greg, is it how this REST API is supposed to be used? could you help
> confirm this? thanks.
> 
> [crush map, mon log and the rest of comment #27 snipped]
Kefu
The CRUSH map is not sent as a query parameter; see an example session:
(venv)[root@vpm180 ubuntu]# CALAMARI_CONF=/etc/calamari/calamari.conf DJANGO_SETTINGS_MODULE=calamari_web.settings django-admin.py runserver 0.0.0.0:8000
Validating models...

0 errors found
May 26, 2015 - 14:08:33
Django version 1.5.1, using settings 'calamari_web.settings'
Development server is running at http://0.0.0.0:8000/
Quit the server with CONTROL-C.
> /opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/v2.py(106)replace()
-> return Response(self.client.update(fsid, CRUSH_MAP, None, request.DATA))
(Pdb) request.DATA
'# begin crush map\r\ntunable choose_local_tries 0\r\ntunable choose_local_fallback_tries 0\r\ntunable choose_total_tries 50\r\ntunable chooseleaf_descend_once 1\r\ntunable straw_calc_version 1\r\n\r\n# devices\r\ndevice 0 osd.0\r\ndevice 1 osd.1\r\n\r\n# types\r\ntype 0 osd\r\ntype 1 host\r\ntype 2 chassis\r\ntype 3 rack\r\ntype 4 row\r\ntype 5 pdu\r\ntype 6 pod\r\ntype 7 room\r\ntype 8 datacenter\r\ntype 9 region\r\ntype 10 root\r\n\r\n# buckets\r\nhost vpm050 {\r\n\tid -2\t\t# do not change unnecessarily\r\n\t# weight 0.190\r\n\talg straw\r\n\thash 0\t# rjenkins1\r\n\titem osd.0 weight 0.190\r\n}\r\nroot default {\r\n\tid -1\t\t# do not change unnecessarily\r\n\t# weight 0.190\r\n\talg straw\r\n\thash 0\t# rjenkins1\r\n\titem vpm050 weight 0.190\r\n}\r\nhost vpm041 {\r\n\tid -3\t\t# do not change unnecessarily\r\n\t# weight 0.190\r\n\talg straw\r\n\thash 0\t# rjenkins1\r\n\titem osd.1 weight 0.190\r\n}\r\nrack rack_contains_vpm041 {\r\n\tid -4\t\t# do not change unnecessarily\r\n\t# weight 0.190\r\n\talg straw\r\n\thash 0\t# rjenkins1\r\n\titem vpm041 weight 0.190\r\n}\r\nhost vpm041-SSD {\r\n\tid -5\t\t# do not change unnecessarily\r\n\t# weight 0.000\r\n\talg straw\r\n\thash 0\t# rjenkins1\r\n}\r\n\r\n# rules\r\nrule replicated_ruleset {\r\n\truleset 0\r\n\ttype replicated\r\n\tmin_size 1\r\n\tmax_size 10\r\n\tstep take default\r\n\tstep chooseleaf firstn 0 type host\r\n\tstep emit\r\n}\r\n\r\n# end crush map'
(Pdb) request
<rest_framework.request.Request object at 0x7f92b124a950>
(Pdb) request.QUERY_PARAMS
<QueryDict: {}>
(Pdb)

Comment 33 Kefu Chai 2015-05-27 12:06:25 UTC
@Greg, thanks a lot!

@Harish, it seems you might want to update your Python script. To be specific, use something like:

 response = c.post('cluster/<uuid>/crush_map', data='#begin ....')

instead of 

 response = c.post('cluster/<uuid>/crush_map', '#begin ....')


The latter basically POSTs an empty body to the API server. From the Ceph monitor's point of view, it will be injected with an empty crush map.

I don't think that's what you intended in the first place, but it could also serve as a negative test case anyway.
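
For cross-checking outside the Python client, a rough curl equivalent of the recommended form could look like the sketch below (the host, fsid and content type are placeholders, and the real API requires an authenticated session, which is omitted here):

 curl -X POST -H 'Content-Type: text/plain' \
      --data-binary @/tmp/crushmap.txt \
      http://calamari.example.com/api/v2/cluster/<fsid>/crush_map   # the map goes in the request body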

Comment 34 Ken Dreyer (Red Hat) 2015-05-28 20:01:35 UTC
Hi Kefu,

QE's evaluation of this bug is that it's easy to reproduce (just POST an empty crushmap to Calamari), so there is a likelihood of customers accidentally triggering this bug in the field.

QE is also concerned that once the bad crushmap is in place, it is not trivial to get the monitor to start up again. (There is no documented way to recover, other than what you wrote in this BZ.)

Can you help me understand a couple things?

1) Will the patches in https://github.com/ceph/ceph/pull/4726 prevent a user from injecting an empty crushmap? I realize the PR is still undergoing review, so do you recommend that we take the patches in that PR downstream for the 1.3.0 release? Or would you like to wait for more review upstream? I'm trying to understand what sort of timeframe we're looking at for this, and how confident you feel about those going in.

2) If we don't take any patches in Ceph to fix this for 1.3.0, can you confirm that the steps in comment #16 are the ones that we should document for customers who experience this issue?

Comment 35 Kefu Chai 2015-05-29 14:52:48 UTC
Ken, sorry for keeping you waiting.

> 1) Will the patches in https://github.com/ceph/ceph/pull/4726 prevent a user from injecting an empty crush map?

True. I also added a test to cover this case.


> so do you recommend that we take the patches in that PR downstream for the 1.3.0 release?

I plan to backport this fix to hammer after it is merged into master. If 1.3.0 cannot pick up the next hammer release, I would recommend taking the patches downstream.

> Or would you like to wait for more review upstream?

I am scheduling a rados QA run; hopefully we will get the result in a day or two. I will ask Sam or Loic, who are more experienced with the teuthology testbed, how long we should wait in general for a QA run, and get back to you. Ordinarily, I would just wait for Sam/Sage to pick it up in his run; although that QA cycle is pretty long, it gives us more confidence that the change won't cause a regression. Personally, I am confident about this change, but better safe than sorry. Anyway, I will try to get my own QA run finished sooner so we can merge it earlier.

> If we don't take any patches in Ceph to fix this for 1.3.0, can you confirm that the steps in comment #16 are the ones that we should document for customers who experience this issue?

the steps in comment#16 are only for the reference of our QE. in which, joao helped to identify the latest good version # of crush map. but i am not sure that our customer is able to find out the version # without good knowledge of monitor and such.

<quote>
ceph-monstore-tool /var/lib/ceph/mon/backup.ceph-Mon get osdmap -- --version 81 --out /tmp/osdmap.81 # export the most recent good crushmap
</quote>

So a user cannot blindly repeat it in the hope of bringing a monitor back online; hence it is not recommended as a practice for end users.

Comment 36 Harish NV Rao 2015-05-29 16:03:22 UTC
Kefu, would it be possible to document a procedure to bring the system back into a working/healthy condition that can be used by end users too?

If this bug is not going to be fixed in 1.3.0, then such procedure is needed if customer encounters the crash.

Comment 37 Ken Dreyer (Red Hat) 2015-05-29 22:38:21 UTC
(In reply to Harish NV Rao from comment #36)
> If this bug is not going to be fixed in 1.3.0, then such procedure is needed
> if customer encounters the crash.

Agreed with Harish - let's pursue getting this fix properly tested upstream so we can ship it after 1.3.0, and in the meantime let's try to document as much as we can about this problem.

John W., how do we get the ball rolling on setting up a new KB article for this?

Comment 38 Greg Farnum 2015-05-29 23:12:09 UTC
(Discussed this in irc and was asked to provide a summary.)
Based on Joao's comment on the downstream ticket it looks like the system is susceptible to this via the CLI as well. That said, my take is that there are two distinct issues:
1) We are unfriendly to the user by allowing them to inject a crushmap that will not map any of their data.
2) We crash if you try and print an empty crush map.

Those should both get fixed upstream; if we were in a huge hurry the quick fix is probably to resolve (2). But I think this is repairable by simply injecting a new valid crushmap (if you already have one available you shouldn't need to rip it out of the monstore nor replace the monitors, I don't think?) and so shouldn't block release...
(You also need to stop any clients which are invoking the print request on the monitors; not doing so is probably why the cluster looks to die on restart as well.)

Comment 39 Vasu Kulkarni 2015-05-30 00:51:12 UTC
One thing we should recommend is to back up the original crushmap:

#ceph osd getcrushmap -o orig_compiled_crushmap

and in case an empty compiled crushmap is loaded, the user can restore the backup. I couldn't hit the crash using the CLI (but could see the map was messed up), and this step needs further testing:

#touch emptyfile
#crushtool -c emptyfile -o empty_compiled_crushmap
#ceph osd setcrushmap -i empty_compiled_crushmap 
set crush map

# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY 
 0      0 osd.0          up  1.00000          1.00000 
 1      0 osd.1          up  1.00000          1.00000 
 2      0 osd.2          up  1.00000          1.00000 
 3      0 osd.3          up  1.00000          1.00000 
 4      0 osd.4          up  1.00000          1.00000 
 5      0 osd.5          up  1.00000          1.00000 

#Restore Original

# ceph osd setcrushmap -i orig_compiled_crushmap 
set crush map

# ceph osd tree
ID WEIGHT  TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY 
-1 6.00000 root default                                             
-3 6.00000     rack localrack                                       
-2 6.00000         host localhost                                   
 0 1.00000             osd.0           up  1.00000          1.00000 
 1 1.00000             osd.1           up  1.00000          1.00000 
 2 1.00000             osd.2           up  1.00000          1.00000 
 3 1.00000             osd.3           up  1.00000          1.00000 
 4 1.00000             osd.4           up  1.00000          1.00000 
 5 1.00000             osd.5           up  1.00000          1.00000

Comment 40 Kefu Chai 2015-05-30 02:11:13 UTC
> I couldn't hit the crash using the CLI (but could see the map was messed up), and this step needs further testing

> ceph osd tree

ceph osd tree --format json

will bring down the monitor. "--format json" prints more info about the buckets and items in the map, which is what causes the crash.

Comment 43 Federico Lucifredi 2015-06-03 01:56:05 UTC
Per Monday's manager review meeting, this is not a 1.3.0 release blocker. Pushing to 1.3.1 or an async update as priority is determined.

Comment 44 Federico Lucifredi 2015-06-03 16:53:49 UTC
The 1.3.0 release notes should point to the risk of propagating an invalid (or empty) crushmap via Calamari. Users should exercise caution and validate the crush map before distributing it to the cluster.

Fixing the "empty crushmap" path will be an urgent errata.
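
For the release notes, a rough validation pass of that sort could look like the sketch below (paths are examples; it catches syntax errors and gives a dry run of the rules, but it will not flag every bad map, e.g. one referencing an OSD that is not in the osdmap):

 ceph osd getcrushmap -o /tmp/crushmap.orig               # keep a backup of the current compiled map
 crushtool -d /tmp/crushmap.orig -o /tmp/crushmap.txt     # decompile for editing
 # ... edit /tmp/crushmap.txt ...
 crushtool -c /tmp/crushmap.txt -o /tmp/crushmap.new      # recompile; fails on syntax errors
 crushtool -i /tmp/crushmap.new --test --show-statistics  # dry-run the mappings before injecting
 ceph osd setcrushmap -i /tmp/crushmap.new                # inject only if the above looks sane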

Comment 45 Ken Dreyer (Red Hat) 2015-06-03 20:01:19 UTC
(In reply to Kefu Chai from comment #35)
> the steps in comment#16 are only for the reference of our QE. in which, joao
> helped to identify the latest good version # of crush map. but i am not sure
> that our customer is able to find out the version # without good knowledge
> of monitor and such.
> 
> <quote>
> ceph-monstore-tool /var/lib/ceph/mon/backup.ceph-Mon get osdmap -- --version
> 81 --out /tmp/osdmap.81 # export the most recent good crushmap
> </quote>

Yeah, this is tricky. It's not clear to me how Joao chose that 81 number :)

Can you please provide the steps for a user to recover from this situation, and the docs team (John Wilkins, Monti Lawrence) can get this published into a document?

Comment 46 Kefu Chai 2015-06-04 14:51:07 UTC
Ken, there is an upstream ticket filed for this actually: http://tracker.ceph.com/issues/11815 . I will play with ceph-monstore-tool and talk with Joao and Greg, and then update you with what I have.

Comment 47 Harish NV Rao 2015-06-10 09:15:28 UTC
Kefu, any update on the procedure to recover the system after mon failure?

Comment 50 Kefu Chai 2015-06-12 01:21:25 UTC
Sorry, Ken, I missed Loïc's comment. It is now merged, and I am preparing the backports.

Comment 51 Kefu Chai 2015-06-12 08:56:16 UTC
Ken, the fix is merged in master, and it is being backported to hammer[0]. As for the date we can expect it in hammer, it will likely be July; a more precise date depends on the state of the backports[1] for that release.


----
[0] http://tracker.ceph.com/issues/11975
[1] http://tracker.ceph.com/projects/ceph/issues?query_id=78

Comment 52 Ken Dreyer (Red Hat) 2015-06-12 15:49:51 UTC
Ok cool, thank you!

Please create a wip- branch based on the tip of rhcs-0.94.1-ubuntu in GitHub, and that will allow us to have patches that we can cleanly cherry-pick in the packaging downstream to fix this bug.

Comment 53 Kefu Chai 2015-06-15 04:48:32 UTC
Pushed to wip-11680-rhcs-v0.94.1-ubuntu.

Comment 54 Kefu Chai 2015-06-15 11:24:04 UTC
Joao, Greg and Loïc had a fruitful discussion today, so we believe a CLI tool would help to bring the monitor back online if a bad crush map is injected into the monitor without being identified by crushtool. See http://tracker.ceph.com/issues/11815

Comment 55 Kefu Chai 2015-06-24 14:39:47 UTC
Enhanced the ceph-monstore-tool and pulled together a script to help with this issue; it is pending review at https://github.com/ceph/ceph/pull/5052 .

Comment 62 rakesh-gm 2015-07-14 05:52:05 UTC
Hello,

The bug is verified by performing the following test cases:

1. Put an empty crush map and check whether the mon process crashes.
   Result with the errata fix: the mon process does not crash and refuses to set the empty crush map.

2. Add a non-existent device to the crush map (see the sketch at the end of this comment).
   Result with the errata fix: the mon process does not crash and refuses to set this crush map.

The tests have been performed using the command line (i.e. crushtool) as well as the web API (Django framework).

So the bug will be moved to the verified state.

However, during this process I have come across another bug where an invalid crush map is being set: if the OSDs are interchanged under the hosts in the crush map, the crush map gets accepted, but there is no crash. Details will be given in the new ticket.
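
For reference, a rough command-line sketch of test case 2 above (paths are examples; "osd.9" stands for any OSD id that does not exist in the cluster):

 ceph osd getcrushmap -o /tmp/cm
 crushtool -d /tmp/cm -o /tmp/cm.txt
 # edit /tmp/cm.txt: add a line such as "device 9 osd.9" to the "# devices" section
 crushtool -c /tmp/cm.txt -o /tmp/cm.bad
 ceph osd setcrushmap -i /tmp/cm.bad   # with the errata fix, the monitor rejects this map
 ceph -s                               # and the monitor stays up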

Comment 64 errata-xmlrpc 2015-07-16 22:20:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1240