| Summary: | [RADOS]:- Stale cluster interference causes mon start failures | ||
|---|---|---|---|
| Product: | Red Hat Ceph Storage | Reporter: | shylesh <shmohan> |
| Component: | RADOS | Assignee: | Brad Hubbard <bhubbard> |
| Status: | CLOSED ERRATA | QA Contact: | shylesh <shmohan> |
| Severity: | high | Docs Contact: | Bara Ancincova <bancinco> |
| Priority: | high | ||
| Version: | 1.3.2 | CC: | bhubbard, ceph-eng-bugs, dzafman, flucifre, hnallurv, kchai, kdreyer, rperiyas, sjust |
| Target Milestone: | rc | Keywords: | Patch |
| Target Release: | 1.3.3 | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | RHEL: ceph-0.94.7-5.el7cp Ubuntu: ceph_0.94.7-3redhat1trusty | Doc Type: | Bug Fix |
| Doc Text: |
.The monitor nodes exit gracefully after authenticating with an incorrect keyring
When a new cluster included monitor nodes that were previously part of another cluster, the monitor nodes terminated with a segmentation fault when attempting to authenticate with an incorrect keyring. With this update, the monitor nodes exit gracefully instead of crashing in the described scenario.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2016-09-29 12:56:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | |||
| Bug Blocks: | 1348597, 1372735 | ||
|
Description
shylesh
2016-02-27 16:02:47 UTC
*** Bug 1309556 has been marked as a duplicate of this bug. ***

Still investigating this and will update all relevant Bugzillas and trackers, but here are my initial findings.

I have tried with ceph-0.94.3-6.0.bz1312587.el7cp.x86_64, which is 0.94.3-6 with just this patch:
https://github.com/ceph/ceph/commit/7ffe6ff4e26db538335b8069ce4fd14fe4cdd76c

It does *not* fix the problem, although we get the following additional log lines, which show the patch is working, I guess:

2016-03-01 20:27:56.278634 7f9e16ef4700  0 mon.boxenY@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.6=0 cs=0 l=0 c=0
2016-03-01 20:27:56.278644 7f9e16ef4700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x5177000 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x4fbe3c0).accept: got bad authorizer

Interestingly, the peer address appears truncated in the ms_verify_authorizer line. My next step is to test with the latest Hammer release, 0.94.5-9.

The issue looks to be the same as http://tracker.ceph.com/issues/13826 and http://tracker.ceph.com/issues/9744 (see the stack in duplicate http://tracker.ceph.com/issues/13527).

Reproducer:

On boxenX:
$ ceph-deploy new boxenX boxenY
$ ceph-deploy mon create-initial

A single OSD is set up on boxenZ but may not be needed; will test this. Leave the boxenX mon process running.

On boxenY:
# service ceph stop
# mkdir ceph-deploy
# cd $_
# ceph-deploy purge boxenY
# ceph-deploy purge-data boxenY
# ceph-deploy new boxenY
# ceph-deploy mon create boxenY
# ceph -s    <--- This should fail

Check /var/log/ceph/ceph-mon*.log on boxenY:

2016-03-01 18:50:23.748791 7f8c2e911700  0 mon.boxenY@0(leader) e1 handle_command mon_command({"prefix": "auth get-or-create", "entity": "client.bootstrap-mds", "caps": ["mon", "allow profile bootstrap-mds"]} v 0) v1
2016-03-01 18:50:23.748823 7f8c2e911700  0 log_channel(audit) log [INF] : from='client.? 10.XX.XX.43:0/1031277' entity='client.admin' cmd=[{"prefix": "auth get-or-create", "entity": "client.bootstrap-mds", "caps": ["mon", "allow profile bootstrap-mds"]}]: dispatch
2016-03-01 18:50:23.764426 7f8c30114700  0 log_channel(audit) log [INF] : from='client.? 10.XX.XX.43:0/1031277' entity='client.admin' cmd='[{"prefix": "auth get-or-create", "entity": "client.bootstrap-mds", "caps": ["mon", "allow profile bootstrap-mds"]}]': finished
2016-03-01 18:50:24.577738 7f8c35a93700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-01 18:50:24.577744 7f8c35a93700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x3f8f000 sd=20 :6789 s=0 pgs=0 cs=0 l=0 c=0x3d46d60).accept we reset (peer sent cseq 2), sending RESETSESSION
2016-03-01 18:50:24.867140 7f8c3149f700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-03-01 18:50:24.867231 7f8c3149f700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-03-01 18:50:24.984399 7f8c3149f700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-03-01 18:50:24.984477 7f8c3149f700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-03-01 18:50:26.579586 7f8c35a93700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f8c35a93700
 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
 1: /usr/bin/ceph-mon() [0x902ca2]
 2: (()+0xf130) [0x7f8c34c83130]
 3: (PK11_DoesMechanism()+0x5f) [0x7f8c35122aef]
 4: (()+0x34dca) [0x7f8c35102dca]
 5: (()+0x4f36a) [0x7f8c3511d36a]
 6: (()+0x4f3bb) [0x7f8c3511d3bb]
 7: (PK11_CreateContextBySymKey()+0x22) [0x7f8c35101b82]
 8: /usr/bin/ceph-mon() [0x6b6f9a]
 9: (CryptoAESKeyHandler::decrypt(ceph::buffer::list const&, ceph::buffer::list&, std::string*) const+0x2a) [0x6b7dba]
 10: (void decode_decrypt_enc_bl<CephXServiceTicketInfo>(CephContext*, CephXServiceTicketInfo&, CryptoKey, ceph::buffer::list&, std::string&)+0xcd) [0x6ae9ad]
 11: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x3bb) [0x6a913b]
 12: (Monitor::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x450) [0x56f220]
 13: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x7f) [0x7964ff]
 14: (Pipe::accept()+0x1db1) [0x8d1721]
 15: (Pipe::reader()+0x19ff) [0x8d6cef]
 16: (Pipe::Reader::entry()+0xd) [0x8d996d]
 17: (()+0x7df5) [0x7f8c34c7bdf5]
 18: (clone()+0x6d) [0x7f8c3375e1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I then tried to start ceph-mon in the foreground and observed the following.

# /usr/bin/ceph-mon --debug_ms=5 --debug_mon=20 -i boxenY -c /etc/ceph/ceph.conf --cluster ceph -d
...
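The "NSS AES final round failed: -8190" error in the logs above is what happens when the stale monitor's old keyring is used to decrypt a ticket encrypted under the new cluster's key: the decryption produces garbage that fails the final integrity check, and the caller must treat that as a hard failure rather than use the output. The toy sketch below illustrates only that failure mode; the cipher, names, and error constant are illustrative stand-ins, not the Ceph or NSS API (real cephx uses AES via NSS).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative error code, mirroring the NSS value seen in the logs.
static const int DECRYPT_BAD_KEY = -8190;
// A two-byte magic prefix plays the role of the integrity/padding check.
static const uint8_t MAGIC0 = 0xAB, MAGIC1 = 0xCD;

// Toy "encrypt": frame the plaintext with a magic prefix, then XOR with the key.
std::vector<uint8_t> toy_encrypt(const std::vector<uint8_t>& key,
                                 const std::string& plain) {
    std::string framed;
    framed.push_back(static_cast<char>(MAGIC0));
    framed.push_back(static_cast<char>(MAGIC1));
    framed += plain;
    std::vector<uint8_t> out;
    for (size_t i = 0; i < framed.size(); ++i)
        out.push_back(static_cast<uint8_t>(framed[i]) ^ key[i % key.size()]);
    return out;
}

// Toy "decrypt": returns 0 on success, DECRYPT_BAD_KEY when the integrity
// check fails -- which is exactly what decrypting with a stale keyring does.
int toy_decrypt(const std::vector<uint8_t>& key,
                const std::vector<uint8_t>& cipher, std::string* plain) {
    std::string framed;
    for (size_t i = 0; i < cipher.size(); ++i)
        framed.push_back(static_cast<char>(cipher[i] ^ key[i % key.size()]));
    if (framed.size() < 2 ||
        static_cast<uint8_t>(framed[0]) != MAGIC0 ||
        static_cast<uint8_t>(framed[1]) != MAGIC1)
        return DECRYPT_BAD_KEY;  // garbage plaintext: reject, do not use it
    *plain = framed.substr(2);
    return 0;
}
```

The point of the sketch is the contract: when the return value is non-zero, `*plain` is untouched, so any caller that ignores the error and reads the output anyway is operating on uninitialized data, which is the root of the segfault in this bug.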
2016-03-01 19:44:28.100855 7f71df1a9700  1 -- 10.XX.XX.43:6789/0 >> :/0 pipe(0x47a0000 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x451e680).accept sd=19 10.XX.XX.41:35566/0
2016-03-01 19:44:28.101069 7f71df1a9700 10 mon.boxenY@0(leader) e1 ms_verify_authorizer 10.XX.XX.41:6789/0 mon protocol 2
2016-03-01 19:44:28.101135 7f71df1a9700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-01 19:44:28.101265 7f71df1a9700  2 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x47a0000 sd=19 :6789 s=2 pgs=1228 cs=1 l=0 c=0x451e680).reader couldn't read tag, (0) Success
2016-03-01 19:44:28.101286 7f71df1a9700  2 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x47a0000 sd=19 :6789 s=2 pgs=1228 cs=1 l=0 c=0x451e680).fault (0) Success
2016-03-01 19:44:28.101303 7f71df1a9700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x47a0000 sd=19 :6789 s=2 pgs=1228 cs=1 l=0 c=0x451e680).fault with nothing to send, going to standby
2016-03-01 19:44:30.101032 7f71defa7700  1 -- 10.XX.XX.43:6789/0 >> :/0 pipe(0x47a5000 sd=20 :6789 s=0 pgs=0 cs=0 l=0 c=0x451e940).accept sd=20 10.XX.XX.41:35567/0
*** Caught signal (Segmentation fault) **
 in thread 7f71defa7700
 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
 1: /usr/bin/ceph-mon() [0x902ca2]
 2: (()+0xf130) [0x7f71e7d20130]
 3: (tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)+0x37) [0x7f71e8707e67]
 4: (tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)+0x1a) [0x7f71e870812a]
 5: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x6d) [0x7f71e87081dd]
 6: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)+0x65) [0x7f71e870b235]
 7: (tc_new()+0xe8) [0x7f71e871b718]
 8: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7f71e709bc29]
 9: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7f71e709c86b]
 10: (std::string::reserve(unsigned long)+0x44) [0x7f71e709c914]
 11: (std::string::append(unsigned long, char)+0x46) [0x7f71e709ccf6]
 12: (PrebufferedStreambuf::overflow(int)+0x30) [0x794e10]
 13: (std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long)+0x36) [0x7f71e707dd76]
 14: (std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)+0x215) [0x7f71e7074ac5]
 15: (Monitor::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x1bf) [0x56ef8f]
 16: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x7f) [0x7964ff]
 17: (Pipe::accept()+0x1db1) [0x8d1721]
 18: (Pipe::reader()+0x19ff) [0x8d6cef]
 19: (Pipe::Reader::entry()+0xd) [0x8d996d]
 20: (()+0x7df5) [0x7f71e7d18df5]
 21: (clone()+0x6d) [0x7f71e67fb1ad]
2016-03-01 19:44:31.674135 7f71e19ae700 11 mon.boxenY@0(leader) e1 tick
2016-03-01 19:44:31.674146 7f71e19ae700 10 mon.boxenY@0(leader).pg v2 check_down_pgs

I initially thought the different stack traces were due to differences in debug output level, but now I can't reproduce the original stack, so I guess not.

Infernalis is still affected by this, but with master we get a much better result:
2016-03-02 13:50:20.352650 7fdd7d71a700  0 log_channel(cluster) log [INF] : mdsmap e1: 0/0/0 up
2016-03-02 13:50:20.368691 7fdd7d71a700  0 log_channel(cluster) log [INF] : osdmap e1: 0 osds: 0 up, 0 in
2016-03-02 13:50:20.384977 7fdd7d71a700  0 log_channel(cluster) log [INF] : pgmap v2: 64 pgs: 64 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2016-03-02 13:50:20.404568 7fdd7d71a700  1 mon.boxenX@0(leader).log v2 unable to write to '/var/log/ceph/ceph.log' for channel 'cluster': (2) No such file or directory
2016-03-02 13:50:20.637123 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:20.637129 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:20.637136 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df56000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dfc0300).accept: got bad authorizer
2016-03-02 13:50:21.429447 7fdd7d71a700  1 mon.boxenX@0(leader).log v3 unable to write to '/var/log/ceph/ceph.log' for channel 'cluster': (2) No such file or directory
2016-03-02 13:50:22.636882 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:22.636886 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:22.636892 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df8a000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c400).accept: got bad authorizer
2016-03-02 13:50:24.637030 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:24.637035 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:24.637042 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df8b400 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7d180).accept: got bad authorizer
2016-03-02 13:50:26.637132 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:26.637139 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:26.637145 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df5e000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7d180).accept: got bad authorizer
2016-03-02 13:50:28.637403 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:28.637408 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:28.637413 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df5f400 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c400).accept: got bad authorizer
2016-03-02 13:50:30.637549 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:30.637553 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:30.637559 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df60800 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c580).accept: got bad authorizer
2016-03-02 13:50:32.637675 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:32.637680 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:32.637690 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df5e000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c580).accept: got bad authorizer

So this is fixed in master; we just need to work out what the fix was and potentially backport it.

Simpler reproducer:
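The master behavior above shows what the eventual fix does: a failed ticket decryption is propagated up as "bad authorizer", the connection is rejected and logged, and the monitor keeps running. A much-simplified sketch of that control flow follows; the signatures and names here are hypothetical stand-ins for illustration, not the actual Monitor/cephx code.

```cpp
#include <string>

// Simplified ticket state; in the buggy path this could be read while
// only partially initialized after a decrypt failure.
struct TicketInfo {
    std::string entity;
    bool valid = false;
};

// Hypothetical stand-in for cephx_verify_authorizer(): on decrypt failure
// it returns false early instead of leaving `ticket` half-initialized.
bool verify_authorizer(bool decrypt_ok, TicketInfo* ticket) {
    if (!decrypt_ok) {
        // The essential change: the error path returns cleanly here.
        return false;
    }
    ticket->entity = "mon.";
    ticket->valid = true;
    return true;
}

// Hypothetical stand-in for the monitor-side check: a bad authorizer is
// logged and refused, so the connection gets "accept: got bad authorizer"
// instead of the process segfaulting.
bool ms_verify_authorizer(bool decrypt_ok, std::string* log) {
    TicketInfo ticket;
    if (!verify_authorizer(decrypt_ok, &ticket)) {
        *log = "ms_verify_authorizer bad authorizer from mon";
        return false;
    }
    return ticket.valid;
}
```

The design point is that the caller never touches `ticket` after a failed verify, which is why the monitor in the master logs keeps looping on "got bad authorizer" rather than crashing.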
$ ceph-deploy new boxenX boxenY
$ ceph-deploy mon create-initial

Move to boxenY:

$ sudo service ceph stop
$ sudo rm -rf --one-file-system /var/lib/ceph/*
$ sudo rm -rf --one-file-system /etc/ceph/*
$ sudo mkdir /var/lib/ceph
$ sudo mkdir /etc/ceph/
$ ceph-deploy new boxenY
$ ceph-deploy mon create-initial

This issue is fixed by the following commit:
https://github.com/mathslinux/ceph/commit/e9e05333ac7c64758bf14d80f6179e001c0fdbfd

I'd suggest pulling in all of https://github.com/ceph/ceph/pull/6698.

Thanks to Brad:
- the backport tracker: http://tracker.ceph.com/issues/14961
- the backport PR: https://github.com/ceph/ceph/pull/7922

Medium priority for 1.3.3 testing.

As per the steps in comment 10 (given below), verified the issue. The mon service is coming up; moving the bug to Verified state.

Steps followed:

$ ceph-deploy new boxenX boxenY
$ ceph-deploy mon create-initial

Move to boxenY:

$ sudo service ceph stop
$ sudo rm -rf --one-file-system /var/lib/ceph/*
$ sudo rm -rf --one-file-system /etc/ceph/*
$ sudo mkdir /var/lib/ceph
$ sudo mkdir /etc/ceph/
$ ceph-deploy new boxenY
$ ceph-deploy mon create-initial

Cluster status:

[root@magna111 ubuntu]# ceph -s
    cluster 9395c204-7cd8-4749-8d00-5a7a9fc7c515
     health HEALTH_ERR
            64 pgs stuck inactive
            64 pgs stuck unclean
            no osds
     monmap e1: 1 mons at {magna111=10.8.128.111:6789/0}
            election epoch 2, quorum 0 magna111
     osdmap e1: 0 osds: 0 up, 0 in
      pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating
[root@magna111 ubuntu]# ps aux | grep ceph
root     28513  0.0  0.0 115244  1460 ?  Ss  14:43  0:00 /bin/bash -c ulimit -n 32768; TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128MB /usr/bin/ceph-mon -i magna111 --pid-file /var/run/ceph/mon.magna111.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     28519  0.4  0.0 201828 15968 ?  Sl  14:43  0:00 /usr/bin/ceph-mon -i magna111 --pid-file /var/run/ceph/mon.magna111.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     29191  0.0  0.0 112648   972 pts/0  S+  14:43  0:00 grep --color=auto ceph
[root@magna111 ubuntu]# service ceph status
=== mon.magna111 ===
mon.magna111: running {"version":"0.94.9-2.el7cp"}

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1972.html