Description of problem:

A stale cluster that already existed interferes with the newly built cluster, because a couple of nodes in the newly built cluster were earlier part of the stale cluster. This leads to failures when starting a monitor in the new cluster.

Version-Release number of selected component (if applicable):

Stale cluster details
=====================

Package and OS
--------------
ubuntu@magna092:/var/log/ceph$ uname -a
Linux magna092 3.19.0-49-generic #55~14.04.1-Ubuntu SMP Fri Jan 22 11:24:31 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
ubuntu@magna092:/var/log/ceph$ dpkg -l | grep ceph
ii ceph                   0.94.5-5redhat1trusty  amd64 distributed storage and file system
ii ceph-common            0.94.5-5redhat1trusty  amd64 common utilities to mount and interact with a ceph storage cluster
ii libapache2-mod-fastcgi 2.4.7~0910052141-ceph1 amd64 Apache 2 FastCGI module for long-running CGI scripts
ii libcephfs1             0.94.5-5redhat1trusty  amd64 Ceph distributed file system client library
ii python-cephfs          0.94.5-5redhat1trusty  amd64 Python libraries for the Ceph libcephfs library

/etc/ceph/ceph.conf
-------------------
ubuntu@magna092:/var/log/ceph$ cat /etc/ceph/ceph.conf
[global]
osd crush location hook = /usr/bin/calamari-crush-location
fsid = abe5e41d-a747-401b-8612-e2e906fb00e2
mon_initial_members = magna013, magna023, magna092
mon_host = 10.8.128.13,10.8.128.23,10.8.128.92
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

[client.rgw.magna056]
rgw_frontends = "civetweb port=7480"

New cluster information
=======================

Package and OS
--------------
[ubuntu@magna003 ceph-config]$ cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.2 (Maipo)
[ubuntu@magna013 diamond]$ rpm -qa | grep ceph
mod_fastcgi-2.4.7-1.ceph.el6.x86_64
ceph-0.94.5-9.el7cp.x86_64
ceph-common-0.94.5-9.el7cp.x86_64
iozone-3.424-2_ceph.el6.x86_64
ceph-mon-0.94.5-9.el7cp.x86_64

/etc/ceph/ceph.conf
-------------------
[ubuntu@magna013 diamond]$ cat /etc/ceph/ceph.conf
[global]
osd crush location hook = /usr/bin/calamari-crush-location
fsid = b15107eb-c70f-40f2-8330-ffbf038f0dc3
mon_initial_members = magna013
mon_host = 10.8.128.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
debug mon = 20/5

How reproducible:
Not sure

Steps to Reproduce:
1. I created a cluster with 1 mon (10.8.128.13) and 3 osds (magna023, magna030, magna033).
2. These nodes were earlier part of some other cluster. They had been reinstalled with RHEL 6.6 and ceph 1.2.3, and I was unaware of the earlier cluster's existence.
3. I upgraded these nodes from RHEL 6.6 / ceph 1.2.3 to RHEL 7.2 / ceph 1.3.2.
4. After upgrading to ceph 1.3.2, I tried to restart the mon (10.8.128.13).

Actual results:
Starting the mon failed.

Expected results:
The mon should start.

Additional info:

Mon logs from 10.8.128.13
-------------------------
2016-02-26 03:47:02.911521 7f4b3b0b5700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-02-26 03:47:02.911534 7f4b3b0b5700 0 -- 10.8.128.13:6789/0 >> 10.8.128.92:6789/0 pipe(0x4037680 sd=31 :6789 s=0 pgs=0 cs=0 l=0 c=0x147c6940).accept we reset (peer sent cseq 1), sending RESETSESSION
2016-02-26 03:47:03.870754 7f4b3e830700 0 log [INF] : pgmap v52300: 192 pgs: 192 active+clean; 84378 MB data, 431 GB used, 7903 GB / 8334 GB avail; 23211 B/s wr, 22 op/s
2016-02-26 03:47:04.911646 7f4b3b0b5700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-02-26 03:47:04.911657 7f4b3b0b5700 0 -- 10.8.128.13:6789/0 >> 10.8.128.92:6789/0 pipe(0x45fb980 sd=31 :6789 s=0 pgs=0 cs=0 l=0 c=0x147c7b20).accept we reset (peer sent cseq 1), sending RESETSESSION
2016-02-26 03:47:04.992398 7f4b3e830700 0 log [INF] : pgmap v52301: 192 pgs: 192 active+clean; 84378 MB data, 431 GB used, 7903 GB / 8334 GB avail; 15165 B/s wr, 14 op/s
2016-02-26 03:47:06.457398 7f4b3e830700 0 log [INF] : pgmap v52302: 192 pgs: 192 active+clean; 84378 MB data, 431 GB used, 7903 GB / 8334 GB avail; 20708 B/s wr, 20 op/s
2016-02-26 03:47:06.911951 7f4b3b0b5700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-02-26 03:47:06.911965 7f4b3b0b5700 0 -- 10.8.128.13:6789/0 >> 10.8.128.92:6789/0 pipe(0x8a83680 sd=31 :6789 s=0 pgs=0 cs=0 l=0 c=0x147c5fa0).accept we reset (peer sent cseq 1), sending RESETSESSION
2016-02-26 03:47:07.512412 7f4b3e830700 0 log [INF] : pgmap v52303: 192 pgs: 192 active+clean; 84378 MB data, 431 GB used, 7903 GB / 8334 GB avail; 27919 B/s wr, 27 op/s
2016-02-26 03:47:08.651468 7f4b3e830700 0 log [INF] : pgmap v52304: 192 pgs: 192 active+clean; 84378 MB data, 431 GB used, 7903 GB / 8334 GB avail; 22533 B/s wr, 22 op/s
2016-02-26 03:47:08.911921 7f4b3b0b5700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-02-26 03:47:08.911929 7f4b3b0b5700 0 -- 10.8.128.13:6789/0 >> 10.8.128.92:6789/0 pipe(0xe8d9900 sd=31 :6789 s=0 pgs=0 cs=0 l=0 c=0x147c4420).accept we reset (peer sent cseq 1), sending RESETSESSION

10.8.128.92 is the host from the stale cluster that is still trying to connect to 10.8.128.13 (the current mon), and this is why the new mon fails to start.

Logs from 10.8.128.92 (mon from the stale cluster)
==================================================
2016-02-26 15:49:33.176223 7f134d72b700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-02-26 15:49:33.176395 7f134d72b700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-02-26 15:49:34.846914 7f134885f700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:34.846927 7f134885f700 0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=20 :57919 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:36.846963 7f134885f700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:36.846975 7f134885f700 0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=20 :57920 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:38.846902 7f134885f700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:38.846914 7f134885f700 0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=20 :57923 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:40.847042 7f134885f700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:40.847056 7f134885f700 0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=9 :57924 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:42.847194 7f134885f700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:42.847210 7f134885f700 0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=9 :57926 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:44.847477 7f134885f700 0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:44.847491 7f134885f700 0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=9 :57928 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:45.552597 7f134a964700 0 mon.magna092@2(probing).data_health(16) update_stats avail 94% total 916 GB, used 3173 MB, avail 867 GB

After killing the mon from the stale cluster (10.8.128.92), the new mon (10.8.128.13) from the new cluster started successfully.
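The two clusters above carry different fsids (abe5e41d-… for the stale cluster, b15107eb-… for the new one), which is how a node still holding the old cluster's identity can be spotted before it interferes. As an illustration only (not a step from this report; the temp files stand in for the real /etc/ceph/ceph.conf on each node), a quick fsid comparison might look like this:

```shell
#!/bin/sh
# Sketch: compare the fsid in a reused node's ceph.conf against the new
# cluster's fsid. The fsid values are copied from this report; the temp
# files are stand-ins for /etc/ceph/ceph.conf on the respective hosts.
old_conf=$(mktemp)   # stands in for the stale cluster's conf (magna092)
new_conf=$(mktemp)   # stands in for the new cluster's conf (magna013)
printf '[global]\nfsid = abe5e41d-a747-401b-8612-e2e906fb00e2\n' > "$old_conf"
printf '[global]\nfsid = b15107eb-c70f-40f2-8330-ffbf038f0dc3\n' > "$new_conf"

# Pull the fsid value out of a ceph.conf-style "fsid = <uuid>" line.
get_fsid() { awk -F' *= *' '$1 == "fsid" {print $2}' "$1"; }

if [ "$(get_fsid "$old_conf")" != "$(get_fsid "$new_conf")" ]; then
    echo "fsid mismatch: node still belongs to a different cluster"
fi
```

A mismatch like this one would not by itself stop the stale mon from dialing the new mon's address, but it flags which hosts still need their old state purged.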
*** Bug 1309556 has been marked as a duplicate of this bug. ***
Still investigating this and will update all relevant Bugzillas and trackers, but here are my initial findings.

I have tried with ceph-0.94.3-6.0.bz1312587.el7cp.x86_64, which is 0.94.3-6 with just this patch:
https://github.com/ceph/ceph/commit/7ffe6ff4e26db538335b8069ce4fd14fe4cdd76c

It does *not* fix the problem, although we get the following additional logs, which show the patch is working, I guess.

2016-03-01 20:27:56.278634 7f9e16ef4700 0 mon.boxenY@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.6=0 cs=0 l=0 c=0
2016-03-01 20:27:56.278644 7f9e16ef4700 0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x5177000 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x4fbe3c0).accept: got bad authorizer

Interestingly, the peer address appears truncated in the ms_verify_authorizer line. My next step is to test with the latest Hammer release, 0.94.5-9.

The issue looks to be the same as http://tracker.ceph.com/issues/13826 and http://tracker.ceph.com/issues/9744 (see the stack in duplicate http://tracker.ceph.com/issues/13527).

Reproducer:

On boxenX:
$ ceph-deploy new boxenX boxenY
$ ceph-deploy mon create-initial

A single OSD is set up on boxenZ but may not be needed; will test this. Leave the boxenX mon process running.

On boxenY:
# service ceph stop
# mkdir ceph-deploy
# cd $_
# ceph-deploy purge boxenY
# ceph-deploy purge-data boxenY
# ceph-deploy new boxenY
# ceph-deploy mon create boxenY
# ceph -s   <--- This should fail

Check /var/log/ceph/ceph-mon*.log on boxenY:

2016-03-01 18:50:23.748791 7f8c2e911700 0 mon.boxenY@0(leader) e1 handle_command mon_command({"prefix": "auth get-or-create", "entity": "client.bootstrap-mds", "caps": ["mon", "allow profile bootstrap-mds"]} v 0) v1
2016-03-01 18:50:23.748823 7f8c2e911700 0 log_channel(audit) log [INF] : from='client.? 10.XX.XX.43:0/1031277' entity='client.admin' cmd=[{"prefix": "auth get-or-create", "entity": "client.bootstrap-mds", "caps": ["mon", "allow profile bootstrap-mds"]}]: dispatch
2016-03-01 18:50:23.764426 7f8c30114700 0 log_channel(audit) log [INF] : from='client.? 10.XX.XX.43:0/1031277' entity='client.admin' cmd='[{"prefix": "auth get-or-create", "entity": "client.bootstrap-mds", "caps": ["mon", "allow profile bootstrap-mds"]}]': finished
2016-03-01 18:50:24.577738 7f8c35a93700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-01 18:50:24.577744 7f8c35a93700 0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x3f8f000 sd=20 :6789 s=0 pgs=0 cs=0 l=0 c=0x3d46d60).accept we reset (peer sent cseq 2), sending RESETSESSION
2016-03-01 18:50:24.867140 7f8c3149f700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-03-01 18:50:24.867231 7f8c3149f700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-03-01 18:50:24.984399 7f8c3149f700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-03-01 18:50:24.984477 7f8c3149f700 0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-03-01 18:50:26.579586 7f8c35a93700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f8c35a93700

 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
 1: /usr/bin/ceph-mon() [0x902ca2]
 2: (()+0xf130) [0x7f8c34c83130]
 3: (PK11_DoesMechanism()+0x5f) [0x7f8c35122aef]
 4: (()+0x34dca) [0x7f8c35102dca]
 5: (()+0x4f36a) [0x7f8c3511d36a]
 6: (()+0x4f3bb) [0x7f8c3511d3bb]
 7: (PK11_CreateContextBySymKey()+0x22) [0x7f8c35101b82]
 8: /usr/bin/ceph-mon() [0x6b6f9a]
 9: (CryptoAESKeyHandler::decrypt(ceph::buffer::list const&, ceph::buffer::list&, std::string*) const+0x2a) [0x6b7dba]
 10: (void decode_decrypt_enc_bl<CephXServiceTicketInfo>(CephContext*, CephXServiceTicketInfo&, CryptoKey, ceph::buffer::list&, std::string&)+0xcd) [0x6ae9ad]
 11: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x3bb) [0x6a913b]
 12: (Monitor::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x450) [0x56f220]
 13: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x7f) [0x7964ff]
 14: (Pipe::accept()+0x1db1) [0x8d1721]
 15: (Pipe::reader()+0x19ff) [0x8d6cef]
 16: (Pipe::Reader::entry()+0xd) [0x8d996d]
 17: (()+0x7df5) [0x7f8c34c7bdf5]
 18: (clone()+0x6d) [0x7f8c3375e1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I then tried to start ceph-mon in the foreground and observed the following.

# /usr/bin/ceph-mon --debug_ms=5 --debug_mon=20 -i boxenY -c /etc/ceph/ceph.conf --cluster ceph -d
...
2016-03-01 19:44:28.100855 7f71df1a9700 1 -- 10.XX.XX.43:6789/0 >> :/0 pipe(0x47a0000 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x451e680).accept sd=19 10.XX.XX.41:35566/0
2016-03-01 19:44:28.101069 7f71df1a9700 10 mon.boxenY@0(leader) e1 ms_verify_authorizer 10.XX.XX.41:6789/0 mon protocol 2
2016-03-01 19:44:28.101135 7f71df1a9700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-01 19:44:28.101265 7f71df1a9700 2 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x47a0000 sd=19 :6789 s=2 pgs=1228 cs=1 l=0 c=0x451e680).reader couldn't read tag, (0) Success
2016-03-01 19:44:28.101286 7f71df1a9700 2 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x47a0000 sd=19 :6789 s=2 pgs=1228 cs=1 l=0 c=0x451e680).fault (0) Success
2016-03-01 19:44:28.101303 7f71df1a9700 0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x47a0000 sd=19 :6789 s=2 pgs=1228 cs=1 l=0 c=0x451e680).fault with nothing to send, going to standby
2016-03-01 19:44:30.101032 7f71defa7700 1 -- 10.XX.XX.43:6789/0 >> :/0 pipe(0x47a5000 sd=20 :6789 s=0 pgs=0 cs=0 l=0 c=0x451e940).accept sd=20 10.XX.XX.41:35567/0
*** Caught signal (Segmentation fault) **
 in thread 7f71defa7700

 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
 1: /usr/bin/ceph-mon() [0x902ca2]
 2: (()+0xf130) [0x7f71e7d20130]
 3: (tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)+0x37) [0x7f71e8707e67]
 4: (tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)+0x1a) [0x7f71e870812a]
 5: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x6d) [0x7f71e87081dd]
 6: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)+0x65) [0x7f71e870b235]
 7: (tc_new()+0xe8) [0x7f71e871b718]
 8: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7f71e709bc29]
 9: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7f71e709c86b]
 10: (std::string::reserve(unsigned long)+0x44) [0x7f71e709c914]
 11: (std::string::append(unsigned long, char)+0x46) [0x7f71e709ccf6]
 12: (PrebufferedStreambuf::overflow(int)+0x30) [0x794e10]
 13: (std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long)+0x36) [0x7f71e707dd76]
 14: (std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)+0x215) [0x7f71e7074ac5]
 15: (Monitor::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x1bf) [0x56ef8f]
 16: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x7f) [0x7964ff]
 17: (Pipe::accept()+0x1db1) [0x8d1721]
 18: (Pipe::reader()+0x19ff) [0x8d6cef]
 19: (Pipe::Reader::entry()+0xd) [0x8d996d]
 20: (()+0x7df5) [0x7f71e7d18df5]
 21: (clone()+0x6d) [0x7f71e67fb1ad]
2016-03-01 19:44:31.674135 7f71e19ae700 11 mon.boxenY@0(leader) e1 tick
2016-03-01 19:44:31.674146 7f71e19ae700 10 mon.boxenY@0(leader).pg v2 check_down_pgs

I initially thought the different stack traces were due to differences in debug output level, but now I can't reproduce the original stack, so I guess not.
Infernalis is still affected by this, but with master we get a much better result.

2016-03-02 13:50:20.352650 7fdd7d71a700 0 log_channel(cluster) log [INF] : mdsmap e1: 0/0/0 up
2016-03-02 13:50:20.368691 7fdd7d71a700 0 log_channel(cluster) log [INF] : osdmap e1: 0 osds: 0 up, 0 in
2016-03-02 13:50:20.384977 7fdd7d71a700 0 log_channel(cluster) log [INF] : pgmap v2: 64 pgs: 64 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2016-03-02 13:50:20.404568 7fdd7d71a700 1 mon.boxenX@0(leader).log v2 unable to write to '/var/log/ceph/ceph.log' for channel 'cluster': (2) No such file or directory
2016-03-02 13:50:20.637123 7fdd836ae700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:20.637129 7fdd836ae700 0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:20.637136 7fdd836ae700 0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df56000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dfc0300).accept: got bad authorizer
2016-03-02 13:50:21.429447 7fdd7d71a700 1 mon.boxenX@0(leader).log v3 unable to write to '/var/log/ceph/ceph.log' for channel 'cluster': (2) No such file or directory
2016-03-02 13:50:22.636882 7fdd836ae700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:22.636886 7fdd836ae700 0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:22.636892 7fdd836ae700 0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df8a000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c400).accept: got bad authorizer
2016-03-02 13:50:24.637030 7fdd836ae700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:24.637035 7fdd836ae700 0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:24.637042 7fdd836ae700 0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df8b400 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7d180).accept: got bad authorizer
2016-03-02 13:50:26.637132 7fdd836ae700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:26.637139 7fdd836ae700 0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:26.637145 7fdd836ae700 0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df5e000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7d180).accept: got bad authorizer
2016-03-02 13:50:28.637403 7fdd836ae700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:28.637408 7fdd836ae700 0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:28.637413 7fdd836ae700 0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df5f400 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c400).accept: got bad authorizer
2016-03-02 13:50:30.637549 7fdd836ae700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:30.637553 7fdd836ae700 0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:30.637559 7fdd836ae700 0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df60800 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c580).accept: got bad authorizer
2016-03-02 13:50:32.637675 7fdd836ae700 0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:32.637680 7fdd836ae700 0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:32.637690 7fdd836ae700 0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df5e000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c580).accept: got bad authorizer

So this is fixed in master; we just need to work out what the fix was and potentially backport it.
Simpler reproducer.

$ ceph-deploy new boxenX boxenY
$ ceph-deploy mon create-initial

Move to boxenY:

$ sudo service ceph stop
$ sudo rm -rf --one-file-system /var/lib/ceph/*
$ sudo rm -rf --one-file-system /etc/ceph/*
$ sudo mkdir /var/lib/ceph
$ sudo mkdir /etc/ceph/
$ ceph-deploy new boxenY
$ ceph-deploy mon create-initial
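The cleanup half of this reproducer (the rm/mkdir steps on boxenY) can be rehearsed safely against a scratch directory instead of the real paths. This is a sketch under assumptions: the mon directory name and keyring file are a guessed example of typical leftover state, not taken from this report.

```shell
#!/bin/sh
# Dry-run of the node cleanup above, operating on a scratch tree instead
# of the real /var/lib/ceph and /etc/ceph. The "ceph-boxenY" mon dir and
# keyring are hypothetical stand-ins for leftover cluster state.
root=$(mktemp -d)
mkdir -p "$root/var/lib/ceph/mon/ceph-boxenY" "$root/etc/ceph"
touch "$root/var/lib/ceph/mon/ceph-boxenY/keyring" "$root/etc/ceph/ceph.conf"

# Equivalent of the reproducer's:
#   sudo rm -rf --one-file-system /var/lib/ceph/*
#   sudo rm -rf --one-file-system /etc/ceph/*
#   sudo mkdir /var/lib/ceph
#   sudo mkdir /etc/ceph/
rm -rf "$root/var/lib/ceph" "$root/etc/ceph"
mkdir -p "$root/var/lib/ceph" "$root/etc/ceph"

# Both directories now exist but are empty, so a fresh ceph-deploy run
# lays down new state with no trace of the old cluster's fsid or keys.
```

The point of the exercise is that both the monitor store under /var/lib/ceph and the conf/keyrings under /etc/ceph must go; leaving either behind keeps the old cluster's identity on the node.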
This issue is fixed by the following commit. https://github.com/mathslinux/ceph/commit/e9e05333ac7c64758bf14d80f6179e001c0fdbfd
I'd suggest pulling in all of https://github.com/ceph/ceph/pull/6698
Thanks to Brad:
- the backport tracker: http://tracker.ceph.com/issues/14961
- the backport PR: https://github.com/ceph/ceph/pull/7922
Medium priority for 1.3.3 testing.
Verified the issue as per the steps in comment 10 (given below). The mon service is coming up; moving the bug to Verified state.

Steps followed:

$ ceph-deploy new boxenX boxenY
$ ceph-deploy mon create-initial

Move to boxenY:

$ sudo service ceph stop
$ sudo rm -rf --one-file-system /var/lib/ceph/*
$ sudo rm -rf --one-file-system /etc/ceph/*
$ sudo mkdir /var/lib/ceph
$ sudo mkdir /etc/ceph/
$ ceph-deploy new boxenY
$ ceph-deploy mon create-initial

Cluster status:

[root@magna111 ubuntu]# ceph -s
    cluster 9395c204-7cd8-4749-8d00-5a7a9fc7c515
     health HEALTH_ERR
            64 pgs stuck inactive
            64 pgs stuck unclean
            no osds
     monmap e1: 1 mons at {magna111=10.8.128.111:6789/0}
            election epoch 2, quorum 0 magna111
     osdmap e1: 0 osds: 0 up, 0 in
      pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating
[root@magna111 ubuntu]# ps aux | grep ceph
root     28513  0.0  0.0 115244  1460 ?     Ss  14:43 0:00 /bin/bash -c ulimit -n 32768; TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128MB /usr/bin/ceph-mon -i magna111 --pid-file /var/run/ceph/mon.magna111.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     28519  0.4  0.0 201828 15968 ?     Sl  14:43 0:00 /usr/bin/ceph-mon -i magna111 --pid-file /var/run/ceph/mon.magna111.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     29191  0.0  0.0 112648   972 pts/0 S+  14:43 0:00 grep --color=auto ceph
[root@magna111 ubuntu]# service ceph status
=== mon.magna111 ===
mon.magna111: running {"version":"0.94.9-2.el7cp"}
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHSA-2016-1972.html