Bug 1222509
| Summary: | 1.3.0: mon fails to come up. dies as soon as started. | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Harish NV Rao <hnallurv> |
| Component: | RADOS | Assignee: | Kefu Chai <kchai> |
| Status: | CLOSED ERRATA | QA Contact: | ceph-qe-bugs <ceph-qe-bugs> |
| Severity: | urgent | Docs Contact: | John Wilkins <jowilkin> |
| Priority: | unspecified | ||
| Version: | 1.3.0 | CC: | ceph-eng-bugs, dzafman, flucifre, gmeno, hnallurv, kchai, kdreyer, rgowdege, sjust, vakulkar |
| Target Milestone: | rc | ||
| Target Release: | 1.3.1 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | ceph-0.94.1-15.el7cp | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2015-07-16 22:20:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1230323 | ||
| Attachments: | |||
|
Description
Harish NV Rao
2015-05-18 11:53:05 UTC
Kefu, would you mind assisting with this bug (or re-assigning as appropriate)?

Ken, will look at it tomorrow.

Kefu, if time permits, can you please take a look at this bug today? My testing is blocked because of this bug. Please help by providing a workaround.

Sorry for blocking your testing, Harish. I was about to call it a night yesterday; will investigate this issue today.

Harish, do we have the debug symbols for ceph-mon-0.94.1-8.el7cp.x86_64? Normally one is able to install them using debuginfo-install ceph-mon-0.94.1-8.el7cp.x86_64, but it seems I cannot install them on `Mon`:

# debuginfo-install ceph-mon-0.94.1-8.el7cp.x86_64
Loaded plugins: auto-update-debuginfo, priorities, product-id
enabling rhel-7-server-optional-debug-rpms
enabling rhel-7-server-debug-rpms
enabling rhel-7-server-extras-debug-rpms
Could not find debuginfo for main pkg: 1:ceph-mon-0.94.1-8.el7cp.x86_64
Package boost-debuginfo-1.53.0-23.el7.x86_64 already installed and latest version
Package boost-debuginfo-1.53.0-23.el7.x86_64 already installed and latest version
Package glibc-debuginfo-2.17-78.el7.x86_64 already installed and latest version
Package gcc-debuginfo-4.8.3-9.el7.x86_64 already installed and latest version
Could not find debuginfo pkg for dependency package leveldb-1.12.0-5.el7cp.x86_64
Package nspr-debuginfo-4.10.6-3.el7.x86_64 already installed and latest version
Package nss-debuginfo-3.16.2.3-5.el7.x86_64 already installed and latest version
Package gcc-debuginfo-4.8.3-9.el7.x86_64 already installed and latest version
Could not find debuginfo pkg for dependency package gperftools-libs-2.1-1.el7.x86_64
Package util-linux-debuginfo-2.23.2-21.el7.x86_64 already installed and latest version
No debuginfo packages available to install

Note for myself: to create a coredump, run

/usr/bin/ceph-mon -i Mon --pid-file /var/run/ceph/mon.Mon.pid -c /etc/ceph/ceph.conf --cluster ceph -f

Hi Kefu and Harish, Puddle's debuginfo repos are not defined by default on these hosts, but you can define them by running e.g.:

yum-config-manager --add-repo http://puddle.ceph.redhat.com/puddles/RHCeph/1.3-RHEL-7/2015-05-05.3/Server-RH7-CEPH-MON-1.3/x86_64/debuginfo/

This will create a rather long-titled .repo file in /etc/yum.repos.d/. At that point, "yum install ceph-debuginfo" will work.

Thanks Ken!! I was planning to ask your help on the same!

Harish, it's likely you fed the monitor a (bad) crushmap:
/var/log/ceph$ zgrep setcrushmap *
ceph.audit.log-20150519.gz:2015-05-18 05:14:12.613798 mon.0 10.8.128.33:6789/0 224816 : audit [INF] from='client.? 10.8.128.33:0/1023292' entity='client.admin' cmd=[{"prefix": "osd setcrushmap"}]: dispatch
Can you recall it? If yes, did you do it on purpose or by accident?
@Ken, thanks a lot! I am able to reproduce this issue in my local env now.

@Harish, it's very likely due to a bad crushmap, so I'd like to know:
1. whether we fed the monitor a crush map or not
2. if yes, where the crush map came from

Probably, in addition to making the monitor more tolerant of this sort of crush map, we will need to fix the tool which generates the crush map, or improve our documentation.

Kefu,

>> whether we fed the monitor a crush map or not

I was testing calamari REST APIs yesterday. I tried "GET" of crush_map. But I don't remember a "PUT" of the crush_map being done. I am not sure whether doing a "PUT" of the same crush_map got from GET, without changing any contents, will corrupt the running crushmap.

>> if yes, where the crush map came from. Probably, in addition to making the monitor more tolerant of this sort of crush map, we will need to fix the tool which generates the crush map, or improve our documentation.

Yes, this has to be done. Kefu, can you please tell me how to make the mon come up permanently?

Kefu, can you please provide the workaround for this problem so I can start testing on this setup?

(In reply to Harish NV Rao from comment #12)
> Kefu,
>
> >> whether we fed the monitor a crush map or not
> I was testing calamari REST APIs yesterday. I tried "GET" of crush_map. But I don't remember a "PUT" of the crush_map being done. I am not sure whether doing a "PUT" of the same crush_map got from GET, without changing any contents, will corrupt the running crushmap.

I do not understand this at all. Are you saying that you didn't perform a PUT? I don't have context for what the issue is here. Would you please provide the CRUSH map in question?

> Kefu, can you please provide the workaround for this problem so I can start testing on this setup?
Harish, I recreated the storage for mon.Mon with the following steps:
mv /var/lib/ceph/mon/ceph-Mon{,.old} # backup the old monitor storage, keyring and etc.
ceph-monstore-tool /var/lib/ceph/mon/ceph-Mon.old get monmap -- --out /tmp/monmap.1 # extract the monmap from the old monitor storage, so we can import it back into monitor storage later on
/usr/bin/ceph-mon -i Mon --mkfs --monmap /tmp/monmap.1 --keyring /var/lib/ceph/mon/ceph-Mon.old/keyring --conf /etc/ceph/ceph.conf # re-create the monitor storage
ceph-authtool /var/lib/ceph/mon/ceph-Mon/keyring --import-keyring /etc/ceph/ceph.client.admin.keyring # add the client.admin to monitor's keyring
# and manually add permission settings for client.admin to /var/lib/ceph/mon/ceph-Mon/keyring, so it looks like
=====>8======
[mon.]
key = AQCZa0pVAAAAABAAE0MUdAPjG0M3Z5umrHjn3A==
caps mon = "allow *"
[client.admin]
key = AQAAbkpV2KM3AxAALqb07IIUBg8zldPwVfS5Og==
caps mon = "allow *"
=====8<======
mv /var/lib/ceph/mon/backup.ceph-Mon/sysvinit{,.backup} # so we don't have two mon.Mons when init.d/ceph tries to start/stop/status the mon processes.
/etc/init.d/ceph start # start the configured instances on this machine, i.e. mon.Mon
ceph osd tree # check the crush map, well no osd is found, let's add them back
ceph-monstore-tool /var/lib/ceph/mon/backup.ceph-Mon get osdmap -- --version 81 --out /tmp/osdmap.81 # export the most recent good crushmap
osdmaptool --export-crush /tmp/crushmap.81 /tmp/osdmap.81
ceph osd setcrushmap -i /tmp/crushmap.81 # and feed to monitor
ceph osd tree # et voilà !
You can revert what I did by overwriting /var/lib/ceph/mon/ceph-Mon with /var/lib/ceph/mon/ceph-Mon.old.
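As noted later in this thread, the tricky part of repeating the above procedure is knowing which osdmap epoch still holds a good crush map (81 in this case). Purely as an illustrative sketch, not an official recovery tool, the Python below walks osdmap epochs backwards using the same ceph-monstore-tool and osdmaptool invocations shown above, plus crushtool -d, and reports the newest epoch whose embedded crush map still lists devices and buckets. The store path, the starting epoch (85, the epoch mentioned elsewhere in this thread) and the "non-empty" heuristic are assumptions.

=====>8======
#!/usr/bin/env python
# Sketch: walk osdmap epochs backwards and report the newest one whose
# embedded crush map still contains devices and buckets. Uses only the
# commands already shown in this comment (ceph-monstore-tool, osdmaptool)
# plus crushtool -d. Store path, starting epoch and the "non-empty"
# heuristic are assumptions, not an official recovery procedure.
import subprocess

STORE = '/var/lib/ceph/mon/ceph-Mon.old'   # assumed backup of the mon store

def crush_text_of_osdmap(epoch):
    """Extract osdmap `epoch`, export its crush map and decompile it."""
    osdmap = '/tmp/osdmap.%d' % epoch
    crush = '/tmp/crushmap.%d' % epoch
    text = '/tmp/crushmap.%d.txt' % epoch
    subprocess.check_call(['ceph-monstore-tool', STORE, 'get', 'osdmap',
                           '--', '--version', str(epoch), '--out', osdmap])
    subprocess.check_call(['osdmaptool', '--export-crush', crush, osdmap])
    subprocess.check_call(['crushtool', '-d', crush, '-o', text])
    with open(text) as f:
        return f.read()

def latest_nonempty_crush(start_epoch):
    """Return the newest epoch <= start_epoch whose crush map has devices."""
    for epoch in range(start_epoch, 0, -1):
        try:
            text = crush_text_of_osdmap(epoch)
        except subprocess.CalledProcessError:
            continue   # the epoch may have been trimmed from the store
        # crude heuristic: a usable map lists devices and at least one bucket
        if 'device ' in text and '{' in text:
            return epoch
    return None

if __name__ == '__main__':
    # 85 = assumed current epoch (the one mentioned elsewhere in this thread)
    print(latest_nonempty_crush(85))
=====8<======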
> I was testing calamari REST APIs yesterday. I tried "GET" of crush_map. But I don't remember a "PUT" of the crush_map being done. I am not sure whether doing a "PUT" of the same crush_map got from GET, without changing any contents, will corrupt the running crush map.
I am echoing Greg's query. Harish, if you can reproduce this issue and attach the crush map you "GET", that would be very helpful for finding out where the problematic crush map came from.
I will try to reproduce this issue and provide the required logs.

When we have a clearer reproduction case we can move forward on this.

Sure. I will be starting API testing next week and will try to reproduce the issue.

Thanks Harish. In the meanwhile, we are working on enabling the monitor to detect 1) an empty crush map, 2) a crush map not covering all OSDs in the osdmap.

Created attachment 1029386 [details]
ceph mon log
Kefu, the issue was reproduced when a crushmap that had just "#begin #end" was POSTed via the Django REST Framework. The above attachment contains the mon log.

Created attachment 1029400 [details]
client file containing GET and POST of crush map for which mon fails.
Hi Kefu, the mon crashed when I used a client script to do a PUT operation too. This client script is basically a script from the Calamari test suite: https://github.com/ceph/calamari/blob/master/tests/http_client.py. I modified parts of this script for doing GET and POST as well [see above attachment]. I am not sure whether that is the right way; I put this modified .py file together as a quick fix for doing GET and POST operations via script, which may not be correct. Please check the file and confirm whether it's OK to use. Are there any other Ceph REST API clients that I can use?

Regards, Harish

Created attachment 1029402 [details]
ceph mon log when a customized client script was used to PUT a modified crushmap
The attached Python looks suspicious:
response = c.post('cluster/cbc3750d-763e-42f2-a929-b7043734f4f0/crush_map', '# begin crush map. \ntunable choose_local_tries 0\n....')
where c is an instance of requests.Session, so the script is sending a POST request to 'cluster/cbc3750d-763e-42f2-a929-b7043734f4f0/crush_map?# begin crush map ....'. In other words, the crush map is posted as the query string (not sure if it is escaped, though); see http://docs.python-requests.org/en/latest/api/?highlight=session#requests.Session.request .
> I am not sure whether that is the right way; I put this modified .py file together as a quick fix for doing GET and POST operations via script, which may not be correct. Please check the file and confirm whether it's OK to use.
> Are there any other Ceph REST API clients that I can use?
@Greg, is this how the REST API is supposed to be used? Could you help confirm this? Thanks.
Unlike the empty crush map we extracted from osdmap#85 in http://tracker.ceph.com/issues/11680, the crush map used for the POST looks decent:
# begin crush map.
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 68
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root
# buckets
host osd0 {
id -2 # do not change unnecessarily
# weight 1.820
alg straw
hash 0 # rjenkins1
item osd.0 weight 0.910
item osd.2 weight 0.910
}
host osd1 {
id -3 # do not change unnecessarily
# weight 1.820
alg straw
hash 0 # rjenkins1
item osd.1 weight 0.910
item osd.3 weight 0.910
}
root default {
id -1 # do not change unnecessarily
# weight 3.640
alg straw
hash 0 # rjenkins1
item osd0 weight 1.820
item osd1 weight 1.820
}
# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
And the attached log at https://bugzilla.redhat.com/attachment.cgi?id=1029402 shows something different:
2015-05-25 05:33:57.741829 7ffe076b2700 0 log_channel(audit) log [DBG] : from='client.? 10.12.27.14:0/1004185' entity='client.admin' cmd=[{"prefix": "osd tree", "epoch": 15, "format": "json"}]: dispatch
2015-05-25 05:33:57.748845 7ffe076b2700 -1 osd/OSDMap.h: In function 'unsigned int OSDMap::get_weight(int) const' thread 7ffe076b2700 time 2015-05-25 05:33:57.742429
osd/OSDMap.h: 374: FAILED assert(o < max_osd)
ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) [0x7b3a95]
2: /usr/bin/ceph-mon() [0x773223]
3: (OSDMap::print_tree(std::ostream*, ceph::Formatter*) const+0x192a) [0x78835a]
4: (OSDMonitor::preprocess_command(MMonCommand*)+0xe55) [0x60a105]
5: (OSDMonitor::preprocess_query(PaxosServiceMessage*)+0x20b) [0x60f7ab]
6: (PaxosService::dispatch(PaxosServiceMessage*)+0x833) [0x5cad23]
7: (Monitor::handle_command(MMonCommand*)+0x147c) [0x591a9c]
8: (Monitor::dispatch(MonSession*, Message*, bool)+0xf9) [0x594cd9]
9: (Monitor::_ms_dispatch(Message*)+0x1a6) [0x595986]
10: (Monitor::ms_dispatch(Message*)+0x23) [0x5b5453]
11: (DispatchQueue::entry()+0x64a) [0x8a240a]
12: (DispatchQueue::DispatchThread::entry()+0xd) [0x79bded]
13: (()+0x7df5) [0x7ffe0da10df5]
14: (clone()+0x6d) [0x7ffe0c4f31ad]
This means the printed OSD did not exist in the osdmap at that moment. Harish, may I know what test you are working on?
> This means the printed OSD did not exist in the osdmap at that moment. Harish, may I know what test you are working on?
I am working on a test where I input a wrong/incorrect/invalid crushmap using POST via a client script.
Here is what the script is doing:
1. GET the recent crushmap
2. POST the map with following changes:
a) the value of tunable choose_total_tries changed to 68
b) the bucket for osd2 removed
c) bucket osd0 is added with:
item osd.0 weight 0.910
item osd.2 weight 0.910
d) bucket osd1 is added with:
item osd.1 weight 0.910
item osd.3 weight 0.910
Note: osd.3 does not exist actually in the system. osd.2 is actually on the osd2 host.
e) under #devices, added : device 3 osd.3
In my view, the above modifications should result in an incorrect crushmap, which the system should detect and reject.
3. Once POSTed via script, the mon crashed.
Please note that I also use the browser-based Django REST Framework v2.3.12 to test the GET and POST.
Some tests need GET and POST to be done via a script, hence the attached script.
Kefu, please note that the script, when used with a valid crush map for POSTing, does not kill the mon. The POST operation completes successfully; subsequent GETs show the modified values.

Harish, thanks for explaining your tests in such detail =)

> Note: osd.3 does not exist actually in the system.

So this is the case where an OSD exists in crush but not in the osdmap. We will be able to detect this.

> osd.2 is actually on the osd2 host.

This needs some cross check between the osdmap and crush before accepting the crush map. IIRC, crush will blindly follow the rules to select a list of OSDs for a given op, so neither the client nor the server side will notice this problem, aside from the selected OSDs possibly not being the expected ones. And I am not sure whether we need to reject such a crush map.

> Kefu, please note that the script, when used with a valid crush map for POSTing, does not kill the mon. The POST operation completes successfully; subsequent GETs show the modified values.

Yeah, I see. "ceph osd tree" kills the mon only if a bad crush map was sent to it before.

> osd.2 is actually on the osd2 host.

It seems our init script can fix this automatically by calling ceph-osd-prestart.sh or ceph-crush-location, but this is done when the OSD daemon is started; after that, I guess the monitor is on its own.
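The cross check described above (confirming that every OSD referenced by a crush map actually exists in the osdmap before the map is accepted) can also be approximated on the client side before running `ceph osd setcrushmap`. Below is a minimal sketch of that idea, not the upstream monitor-side fix: it decompiles a compiled crush map with crushtool, collects the osd ids it references, and compares them against the output of `ceph osd ls`. The file name crushmap.bin is a placeholder.

=====>8======
#!/usr/bin/env python
# Sketch of the osdmap-vs-crushmap cross check discussed above: verify that
# every osd.N referenced in a compiled crush map exists in the cluster's
# osdmap before feeding the map to the monitor. Illustrative only, not the
# monitor-side fix; "crushmap.bin" is an assumed file name.
import os
import re
import subprocess
import tempfile

def osds_in_crushmap(compiled_map):
    """Decompile the crush map and collect every osd id it references."""
    tmp = tempfile.NamedTemporaryFile(suffix='.txt', delete=False)
    tmp.close()
    subprocess.check_call(['crushtool', '-d', compiled_map, '-o', tmp.name])
    with open(tmp.name) as f:
        text = f.read()
    os.unlink(tmp.name)
    return set(int(n) for n in re.findall(r'\bosd\.(\d+)', text))

def osds_in_osdmap():
    """Ask the cluster which osd ids actually exist (one id per line)."""
    out = subprocess.check_output(['ceph', 'osd', 'ls'])
    if not isinstance(out, str):
        out = out.decode('utf-8')
    return set(int(line) for line in out.split())

if __name__ == '__main__':
    unknown = osds_in_crushmap('crushmap.bin') - osds_in_osdmap()
    if unknown:
        print('crush map references OSDs missing from the osdmap: %s'
              % sorted(unknown))
    else:
        print('all OSDs in the crush map exist in the osdmap')
=====8<======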
(In reply to Kefu Chai from comment #27)
> the attached python looks suspicious:
>
> response = c.post('cluster/cbc3750d-763e-42f2-a929-b7043734f4f0/crush_map', '# begin crush map. \ntunable choose_local_tries 0\n....')
>
> where c is an instance of requests.Session, so the script is sending a POST request to 'cluster/cbc3750d-763e-42f2-a929-b7043734f4f0/crush_map?# begin crush map ....'. in other words, the crush map is posted as the query string [...]

Kefu, the CRUSH map is not sent as a query parameter. See an example session:

(venv)[root@vpm180 ubuntu]# CALAMARI_CONF=/etc/calamari/calamari.conf DJANGO_SETTINGS_MODULE=calamari_web.settings django-admin.py runserver 0.0.0.0:8000
Validating models...

0 errors found
May 26, 2015 - 14:08:33
Django version 1.5.1, using settings 'calamari_web.settings'
Development server is running at http://0.0.0.0:8000/
Quit the server with CONTROL-C.
> /opt/calamari/venv/lib/python2.7/site-packages/calamari_rest_api-0.1-py2.7.egg/calamari_rest/views/v2.py(106)replace()
-> return Response(self.client.update(fsid, CRUSH_MAP, None, request.DATA))
(Pdb) request.DATA
'# begin crush map\r\ntunable choose_local_tries 0\r\ntunable choose_local_fallback_tries 0\r\ntunable choose_total_tries 50\r\ntunable chooseleaf_descend_once 1\r\ntunable straw_calc_version 1\r\n\r\n# devices\r\ndevice 0 osd.0\r\ndevice 1 osd.1\r\n\r\n# types\r\ntype 0 osd\r\ntype 1 host\r\ntype 2 chassis\r\ntype 3 rack\r\ntype 4 row\r\ntype 5 pdu\r\ntype 6 pod\r\ntype 7 room\r\ntype 8 datacenter\r\ntype 9 region\r\ntype 10 root\r\n\r\n# buckets\r\nhost vpm050 {\r\n\tid -2\t\t# do not change unnecessarily\r\n\t# weight 0.190\r\n\talg straw\r\n\thash 0\t# rjenkins1\r\n\titem osd.0 weight 0.190\r\n}\r\nroot default {\r\n\tid -1\t\t# do not change unnecessarily\r\n\t# weight 0.190\r\n\talg straw\r\n\thash 0\t# rjenkins1\r\n\titem vpm050 weight 0.190\r\n}\r\nhost vpm041 {\r\n\tid -3\t\t# do not change unnecessarily\r\n\t# weight 0.190\r\n\talg straw\r\n\thash 0\t# rjenkins1\r\n\titem osd.1 weight 0.190\r\n}\r\nrack rack_contains_vpm041 {\r\n\tid -4\t\t# do not change unnecessarily\r\n\t# weight 0.190\r\n\talg straw\r\n\thash 0\t# rjenkins1\r\n\titem vpm041 weight 0.190\r\n}\r\nhost vpm041-SSD {\r\n\tid -5\t\t# do not change unnecessarily\r\n\t# weight 0.000\r\n\talg straw\r\n\thash 0\t# rjenkins1\r\n}\r\n\r\n# rules\r\nrule replicated_ruleset {\r\n\truleset 0\r\n\ttype replicated\r\n\tmin_size 1\r\n\tmax_size 10\r\n\tstep take default\r\n\tstep chooseleaf firstn 0 type host\r\n\tstep emit\r\n}\r\n\r\n# end crush map'
(Pdb) request
<rest_framework.request.Request object at 0x7f92b124a950>
(Pdb) request.QUERY_PARAMS
<QueryDict: {}>
(Pdb)

@Greg, thanks a lot!
@Harish, it seems you might want to update your Python script. To be specific, use something like:
response = c.post('cluster/<uuid>/crush_map', data='#begin ....')
instead of
response = c.post('cluster/<uuid>/crush_map', '#begin ....')
The latter basically POSTs an empty body to the API server; from the Ceph monitor's point of view, it will be injected with an empty crush map.
I don't think that's what you intended in the first place, but it could also serve as a negative test case anyway.
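For reference, here is a minimal sketch of the corrected client flow Kefu suggests: GET the current crush map and POST it back with the body passed via the data= keyword. The base URL and the assumption that the requests.Session has already been authenticated against the Calamari API are placeholders, not part of the attached script; the cluster id is the one used earlier in this thread.

=====>8======
#!/usr/bin/env python
# Minimal sketch of the corrected GET/POST flow, using the plain `requests`
# library. The base URL is a placeholder and the session is assumed to be
# already authenticated against the Calamari API (e.g. via the login helper
# in calamari's tests/http_client.py); the cluster id comes from this thread.
import requests

API = 'http://calamari-host/api/v2/'               # assumed endpoint
CLUSTER = 'cbc3750d-763e-42f2-a929-b7043734f4f0'   # fsid used above

session = requests.Session()
# ... authenticate the session here ...

# GET the current crush map.
resp = session.get(API + 'cluster/%s/crush_map' % CLUSTER)
resp.raise_for_status()
crush_text = resp.text

# POST it back: the map must travel in the request *body*, so pass it
# explicitly via the data= keyword, as recommended above.
resp = session.post(API + 'cluster/%s/crush_map' % CLUSTER, data=crush_text)
resp.raise_for_status()
print(resp.status_code)
=====8<======

Whether the endpoint expects POST or PUT for updates is not settled in this thread, so the method above is kept as in the attached script.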
Hi Kefu, QE's evaluation of this bug is that it's easy to reproduce (just POST an empty crushmap to Calamari), so there is a likelihood of customers accidentally triggering this bug in the field. QE is also concerned that once the crushmap is in place, it is not trivial to get the monitor to start up again. (There is no documented way to recover, other than what you wrote in this BZ.) Can you help me understand a couple of things?

1) Will the patches in https://github.com/ceph/ceph/pull/4726 prevent a user from injecting an empty crushmap? I realize the PR is still undergoing review, so do you recommend that we take the patches in that PR downstream for the 1.3.0 release? Or would you like to wait for more review upstream? I'm trying to understand what sort of timeframe we're looking at for this, and how confident you feel about those going in.

2) If we don't take any patches in Ceph to fix this for 1.3.0, can you confirm that the steps in comment #16 are the ones that we should document for customers who experience this issue?

Ken, sorry for keeping you waiting.

> 1) Will the patches in https://github.com/ceph/ceph/pull/4726 prevent a user from injecting an empty crush map?

True. I also added a test to cover this case.

> so do you recommend that we take the patches in that PR downstream for the 1.3.0 release?

I plan to backport this fix to hammer after it is merged into master. If 1.3.0 cannot pick up the next hammer release, I would recommend doing so.

> Or would you like to wait for more review upstream?

I am scheduling a rados QA run; hopefully we will get the result in a day or two. I will ask Sam or Loic, who are more experienced with the teuthology testbed, how long we should wait in general for a QA run, and get back to you. Ordinarily, I would just wait for Sam/Sage to pick it up in their run. Although the QA run cycle is pretty long, it gives us more confidence that the change won't cause a regression. Personally, I am confident about this change, but better safe than sorry. Anyway, I will try to get my own QA run finished sooner so we can merge it earlier.

> If we don't take any patches in Ceph to fix this for 1.3.0, can you confirm that the steps in comment #16 are the ones that we should document for customers who experience this issue?

The steps in comment #16 are only for the reference of our QE; in them, Joao helped to identify the latest good version # of the crush map, but I am not sure that our customers are able to find out the version # without good knowledge of the monitor and such.

<quote>
ceph-monstore-tool /var/lib/ceph/mon/backup.ceph-Mon get osdmap -- --version 81 --out /tmp/osdmap.81 # export the most recent good crushmap
</quote>

So a user cannot blindly repeat it in the hope of bringing a monitor back online; hence it is not recommended as a practice for end users.

Kefu, would it be possible to document the procedure to bring the system back into a working/healthy condition which can be used by end users too? If this bug is not going to be fixed in 1.3.0, then such a procedure is needed if a customer encounters the crash.

(In reply to Harish NV Rao from comment #36)
> If this bug is not going to be fixed in 1.3.0, then such a procedure is needed if a customer encounters the crash.

Agreed with Harish. Let's pursue getting this fix properly tested upstream so we can ship it after 1.3.0, and in the meantime let's try to document as much as we can about this problem. John W., how do we get the ball rolling on setting up a new KB article for this?
(Discussed this in irc and was asked to provide a summary.) Based on Joao's comment on the downstream ticket it looks like the system is susceptible to this via the CLI as well. That said, my take is that there are two distinct issues:

1) We are unfriendly to the user by allowing them to inject a crushmap that will not map any of their data.
2) We crash if you try and print an empty crush map.

Those should both get fixed upstream; if we were in a huge hurry the quick fix is probably to resolve (2). But I think this is repairable by simply injecting a new valid crushmap (if you already have one available you shouldn't need to rip it out of the monstore nor replace the monitors, I don't think?) and so shouldn't block release... (You also need to stop any clients which are invoking the print request on the monitors; not doing so is probably why the cluster looks to die on restart as well.)

One thing we should recommend is to back up the original crushmap:

# ceph osd getcrushmap -o orig_compiled_crushmap

In case an empty compiled crushmap is loaded, the user can restore the backup. I couldn't hit the crash using the CLI (but could see the map was messed up), and this step needs further testing:

# touch emptyfile
# crushtool -c emptyfile -o empty_compiled_crushmap
# ceph osd setcrushmap -i empty_compiled_crushmap
set crush map
# ceph osd tree
ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY
 0      0 osd.0          up  1.00000          1.00000
 1      0 osd.1          up  1.00000          1.00000
 2      0 osd.2          up  1.00000          1.00000
 3      0 osd.3          up  1.00000          1.00000
 4      0 osd.4          up  1.00000          1.00000
 5      0 osd.5          up  1.00000          1.00000

Restore the original:

# ceph osd setcrushmap -i orig_compiled_crushmap
set crush map
# ceph osd tree
ID WEIGHT  TYPE NAME              UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 6.00000 root default
-3 6.00000     rack localrack
-2 6.00000         host localhost
 0 1.00000             osd.0           up  1.00000          1.00000
 1 1.00000             osd.1           up  1.00000          1.00000
 2 1.00000             osd.2           up  1.00000          1.00000
 3 1.00000             osd.3           up  1.00000          1.00000
 4 1.00000             osd.4           up  1.00000          1.00000
 5 1.00000             osd.5           up  1.00000          1.00000

> I couldn't hit the crash using the CLI (but could see the map was messed up), and this step needs further testing
> ceph osd tree

"ceph osd tree --format json" will bring down the monitor. "--format json" prints more info about the buckets and items in the map, which causes the crash.

Per Monday's manager's review meeting, this is not a 1.3.0 release blocker. Pushing to 1.3.1 or an async update as priority is determined. The 1.3.0 release notes should point to the risk of propagating an invalid (or empty) crushmap via Calamari. Users should exercise caution and validate the crush map before distributing it to the cluster. Fixing the "empty crushmap" path will be an urgent errata.

(In reply to Kefu Chai from comment #35)
> The steps in comment #16 are only for the reference of our QE; in them, Joao helped to identify the latest good version # of the crush map, but I am not sure that our customers are able to find out the version # without good knowledge of the monitor and such.
>
> <quote>
> ceph-monstore-tool /var/lib/ceph/mon/backup.ceph-Mon get osdmap -- --version 81 --out /tmp/osdmap.81 # export the most recent good crushmap
> </quote>

Yeah, this is tricky. It's not clear to me how Joao chose that 81 number :) Can you please provide the steps for a user to recover from this situation, and the docs team (John Wilkins, Monti Lawrence) can get this published into a document?

Ken, there is an upstream ticket filed for this actually: http://tracker.ceph.com/issues/11815 .
I will play with ceph-monstore-tool and talk with Joao and Greg, and then update you with what I have.

Kefu, any update on the procedure to recover the system after the mon failure?

Sorry, Ken, I missed Loïc's comment, and now it's merged. I am preparing the backports.

Ken, the fix is merged in master, and it is being backported to hammer [0]. As for the date we can see it in hammer, it will likely be July; a more precise date depends on the state of the backports [1] for this release.

----
[0] http://tracker.ceph.com/issues/11975
[1] http://tracker.ceph.com/projects/ceph/issues?query_id=78

Ok cool, thank you! Please create a wip- branch based on the tip of rhcs-0.94.1-ubuntu in GitHub, and that will allow us to have patches that we can cleanly cherry-pick in the packaging downstream to fix this bug.

Pushed to wip-11680-rhcs-v0.94.1-ubuntu .

Joao, Greg and Loïc had a fruitful discussion today. We believe a CLI tool would help to bring the monitor back online if a bad crush map is injected into the monitor without being identified by crushtool. See http://tracker.ceph.com/issues/11815 .

Enhanced ceph-monstore-tool and pulled together a script to help with this issue, pending review at https://github.com/ceph/ceph/pull/5052 .

Hello, the bug is verified by performing the following test cases:

1. Put an empty crush map and check whether the mon process crashes. Result with the errata fix: the mon process does not crash and refuses to set this empty crush map.
2. Add a non-existent device to the crush map. Result with the errata fix: the mon process does not crash and refuses to set this crush map.

The tests have been performed using the command line (i.e. crushtool) and using the web API (Django framework) as well, so the bug will be moved to the verified state. However, during this process I have come across another bug where an invalid crush map is being set: if the OSDs are interchanged under the hosts in the crush map, the crush map gets accepted, but there is no crash. Details will be given in the new ticket.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:1240