Bug 1312587 - [RADOS]:- Stale cluster interference causes mon start failures
Summary: [RADOS]:- Stale cluster interference causes mon start failures
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 1.3.2
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: 1.3.3
Assignee: Brad Hubbard
QA Contact: shylesh
Docs Contact: Bara Ancincova
URL:
Whiteboard:
Duplicates: 1309556
Depends On:
Blocks: 1348597 1372735
 
Reported: 2016-02-27 16:02 UTC by shylesh
Modified: 2022-07-09 09:25 UTC (History)
CC List: 9 users

Fixed In Version: RHEL: ceph-0.94.7-5.el7cp Ubuntu: ceph_0.94.7-3redhat1trusty
Doc Type: Bug Fix
Doc Text:
.The monitor nodes exit gracefully after authenticating with an incorrect keyring

When a new cluster included monitor nodes that were previously part of another cluster, the monitor nodes terminated with a segmentation fault when attempting to authenticate with an incorrect keyring. With this update, the monitor nodes exit gracefully instead of crashing in the described scenario.
Clone Of:
Environment:
Last Closed: 2016-09-29 12:56:48 UTC
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Ceph Project Bug Tracker 13527 0 None None None 2016-03-01 07:03:31 UTC
Ceph Project Bug Tracker 13826 0 None None None 2016-02-29 23:40:21 UTC
Ceph Project Bug Tracker 14961 0 None None None 2016-03-07 14:51:12 UTC
Red Hat Issue Tracker RHCEPH-4729 0 None None None 2022-07-09 09:25:43 UTC
Red Hat Product Errata RHSA-2016:1972 0 normal SHIPPED_LIVE Moderate: Red Hat Ceph Storage 1.3.3 security, bug fix, and enhancement update 2016-09-29 16:51:21 UTC

Description shylesh 2016-02-27 16:02:47 UTC
Description of problem:
An already existing stale cluster interferes with the newly built cluster because a couple of nodes in the new cluster were previously part of the stale cluster. This leads to failures when starting a monitor in the new cluster.
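
A hedged sketch (not from the original report) of how to spot this situation on a node: compare the fsid in /etc/ceph/ceph.conf against the intended cluster and look for leftover mon data directories; note that the stale and new cluster configurations quoted below carry different fsid values.

$ grep fsid /etc/ceph/ceph.conf       <--- should match the fsid of the cluster the node is meant to join
$ sudo ls /var/lib/ceph/mon/          <--- any leftover mon directory here belongs to a previous cluster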

Version-Release number of selected component (if applicable):

Stale cluster details
=====================
Package and OS
--------------
ubuntu@magna092:/var/log/ceph$ uname -a
Linux magna092 3.19.0-49-generic #55~14.04.1-Ubuntu SMP Fri Jan 22 11:24:31 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux




ubuntu@magna092:/var/log/ceph$ dpkg -l | grep ceph
ii  ceph                                 0.94.5-5redhat1trusty                 amd64        distributed storage and file system
ii  ceph-common                          0.94.5-5redhat1trusty                 amd64        common utilities to mount and interact with a ceph storage cluster
ii  libapache2-mod-fastcgi               2.4.7~0910052141-ceph1                amd64        Apache 2 FastCGI module for long-running CGI scripts
ii  libcephfs1                           0.94.5-5redhat1trusty                 amd64        Ceph distributed file system client library
ii  python-cephfs                        0.94.5-5redhat1trusty                 amd64        Python libraries for the Ceph libcephfs library



/etc/ceph/ceph.conf
-------------------
ubuntu@magna092:/var/log/ceph$ cat /etc/ceph/ceph.conf
[global]
osd crush location hook = /usr/bin/calamari-crush-location
fsid = abe5e41d-a747-401b-8612-e2e906fb00e2
mon_initial_members = magna013, magna023, magna092
mon_host = 10.8.128.13,10.8.128.23,10.8.128.92
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
[client.rgw.magna056]
rgw_frontends = "civetweb port=7480"




New cluster information
=========================
Package and OS
----------------
[ubuntu@magna003 ceph-config]$ cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.2 (Maipo)

[ubuntu@magna013 diamond]$ rpm -qa| grep ceph
mod_fastcgi-2.4.7-1.ceph.el6.x86_64
ceph-0.94.5-9.el7cp.x86_64
ceph-common-0.94.5-9.el7cp.x86_64
iozone-3.424-2_ceph.el6.x86_64
ceph-mon-0.94.5-9.el7cp.x86_64

[ubuntu@magna013 diamond]$ cat /etc/ceph/ceph.conf
[global]
osd crush location hook = /usr/bin/calamari-crush-location
fsid = b15107eb-c70f-40f2-8330-ffbf038f0dc3
mon_initial_members = magna013
mon_host = 10.8.128.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true

debug mon=20/5 
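
(The "debug mon=20/5" above is the debug level used on the new mon. A hedged sketch of raising it at runtime, assuming the injectargs admin command available in hammer; the same value can instead be added under [mon] in ceph.conf and picked up on restart.)

$ sudo ceph tell mon.magna013 injectargs '--debug-mon 20/5'    <--- hypothetical invocation, adjust the mon id as needed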


How reproducible:
Not sure

Steps to Reproduce:
1. I created a cluster with one mon (10.8.128.13) and three OSDs (magna023, magna030, magna033).
2. These nodes were earlier part of another cluster. They were reinstalled with RHEL 6.6 and Ceph 1.2.3, and I was unaware of the earlier cluster's existence.
3. I upgraded these nodes from RHEL 6.6 with Ceph 1.2.3 to RHEL 7.2 with Ceph 1.3.2.
4. After upgrading Ceph to 1.3.2, I tried to restart the mon (10.8.128.13); a sketch of the restart commands follows this list.
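
A minimal sketch of the restart attempted in step 4 (assuming the sysvinit wrapper shipped with ceph 1.3.2 on RHEL 7.2; not taken verbatim from the report):

# service ceph restart mon.magna013
# service ceph status mon.magna013    <--- in this bug the mon failed to come back up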

Actual results:
Starting the mon failed.

Expected results:
Mon should start.

Additional info:
Mon logs from 10.8.128.13
-------------------------
2016-02-26 03:47:02.911521 7f4b3b0b5700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-02-26 03:47:02.911534 7f4b3b0b5700  0 -- 10.8.128.13:6789/0 >> 10.8.128.92:6789/0 pipe(0x4037680 sd=31 :6789 s=0 pgs=0 cs=0 l=0 c=0x147c6940).accept we reset (peer sent cseq 1), sending RESETSESSION
2016-02-26 03:47:03.870754 7f4b3e830700  0 log [INF] : pgmap v52300: 192 pgs: 192 active+clean; 84378 MB data, 431 GB used, 7903 GB / 8334 GB avail; 23211 B/s wr, 22 op/s
2016-02-26 03:47:04.911646 7f4b3b0b5700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-02-26 03:47:04.911657 7f4b3b0b5700  0 -- 10.8.128.13:6789/0 >> 10.8.128.92:6789/0 pipe(0x45fb980 sd=31 :6789 s=0 pgs=0 cs=0 l=0 c=0x147c7b20).accept we reset (peer sent cseq 1), sending RESETSESSION
2016-02-26 03:47:04.992398 7f4b3e830700  0 log [INF] : pgmap v52301: 192 pgs: 192 active+clean; 84378 MB data, 431 GB used, 7903 GB / 8334 GB avail; 15165 B/s wr, 14 op/s
2016-02-26 03:47:06.457398 7f4b3e830700  0 log [INF] : pgmap v52302: 192 pgs: 192 active+clean; 84378 MB data, 431 GB used, 7903 GB / 8334 GB avail; 20708 B/s wr, 20 op/s
2016-02-26 03:47:06.911951 7f4b3b0b5700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-02-26 03:47:06.911965 7f4b3b0b5700  0 -- 10.8.128.13:6789/0 >> 10.8.128.92:6789/0 pipe(0x8a83680 sd=31 :6789 s=0 pgs=0 cs=0 l=0 c=0x147c5fa0).accept we reset (peer sent cseq 1), sending RESETSESSION
2016-02-26 03:47:07.512412 7f4b3e830700  0 log [INF] : pgmap v52303: 192 pgs: 192 active+clean; 84378 MB data, 431 GB used, 7903 GB / 8334 GB avail; 27919 B/s wr, 27 op/s
2016-02-26 03:47:08.651468 7f4b3e830700  0 log [INF] : pgmap v52304: 192 pgs: 192 active+clean; 84378 MB data, 431 GB used, 7903 GB / 8334 GB avail; 22533 B/s wr, 22 op/s
2016-02-26 03:47:08.911921 7f4b3b0b5700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-02-26 03:47:08.911929 7f4b3b0b5700  0 -- 10.8.128.13:6789/0 >> 10.8.128.92:6789/0 pipe(0xe8d9900 sd=31 :6789 s=0 pgs=0 cs=0 l=0 c=0x147c4420).accept we reset (peer sent cseq 1), sending RESETSESSION


10.8.128.92 is a host from the stale cluster that is still trying to connect to 10.8.128.13 (the current mon), which is why starting the new mon is failing.
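
A hedged way to confirm which peer is generating the unexpected cephx traffic on the new mon host (assuming ss from iproute is installed; netstat -tnp works similarly):

$ sudo ss -tnp | grep ':6789'    <--- connections from 10.8.128.92 point at the stale mon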


Logs from 10.8.128.92 (mon from stale cluster)
=======================================
2016-02-26 15:49:33.176223 7f134d72b700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-02-26 15:49:33.176395 7f134d72b700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-02-26 15:49:34.846914 7f134885f700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:34.846927 7f134885f700  0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=20 :57919 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:36.846963 7f134885f700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:36.846975 7f134885f700  0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=20 :57920 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:38.846902 7f134885f700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:38.846914 7f134885f700  0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=20 :57923 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:40.847042 7f134885f700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:40.847056 7f134885f700  0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=9 :57924 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:42.847194 7f134885f700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:42.847210 7f134885f700  0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=9 :57926 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:44.847477 7f134885f700  0 cephx: verify_reply couldn't decrypt with error: error decoding block for decryption
2016-02-26 15:49:44.847491 7f134885f700  0 -- 10.8.128.92:6789/0 >> 10.8.128.13:6789/0 pipe(0x505d000 sd=9 :57928 s=1 pgs=15 cs=1 l=0 c=0x49443c0).failed verifying authorize reply
2016-02-26 15:49:45.552597 7f134a964700  0 mon.magna092@2(probing).data_health(16) update_stats avail 94% total 916 GB, used 3173 MB, avail 867 GB




After killing the mon from the stale cluster (10.8.128.92), the new mon (10.8.128.13) from the new cluster started successfully.
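
A hedged sketch of that workaround, assuming the stale mon on 10.8.128.92 is managed by the Ubuntu sysvinit script and the old cluster's state on that node is no longer needed (a fuller purge appears in the reproducer in comment 10):

ubuntu@magna092:~$ sudo service ceph stop mon.magna092
ubuntu@magna092:~$ sudo rm -rf --one-file-system /var/lib/ceph/mon/ceph-magna092    <--- hypothetical path; only if the stale cluster's data can be discarded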

Comment 5 Samuel Just 2016-02-29 23:40:55 UTC
*** Bug 1309556 has been marked as a duplicate of this bug. ***

Comment 7 Brad Hubbard 2016-03-01 07:03:31 UTC
Still investigating this and will update all relevant Bugzillas and trackers, but here are my initial findings. I have tried with ceph-0.94.3-6.0.bz1312587.el7cp.x86_64, which is 0.94.3-6 with just this patch: https://github.com/ceph/ceph/commit/7ffe6ff4e26db538335b8069ce4fd14fe4cdd76c. It does *not* fix the problem, although we get the following additional logs, which show the patch is working, I guess.

2016-03-01 20:27:56.278634 7f9e16ef4700  0 mon.boxenY@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.6=0 cs=0 l=0 c=0
2016-03-01 20:27:56.278644 7f9e16ef4700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x5177000 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x4fbe3c0).accept: got bad authorizer

Interestingly the peer address appears truncated in the ms_verify_authorizer line. 

My next step is to test with latest Hammer release 0.94.5-9.

Issue looks to be the same as http://tracker.ceph.com/issues/13826 and http://tracker.ceph.com/issues/9744 (see stack in duplicate http://tracker.ceph.com/issues/13527)

Reproducer:
    
On boxenX

$ ceph-deploy new  boxenX boxenY
$ ceph-deploy mon create-initial

A single OSD is set up on boxenZ but may not be needed; I will test this.

Leave boxenX mon process running.

On boxenY

# service ceph stop
# mkdir ceph-deploy
# cd $_
# ceph-deploy purge boxenY
# ceph-deploy purge-data boxenY
# ceph-deploy new boxenY
# ceph-deploy mon create boxenY
# ceph -s   <--- This should fail

Check /var/log/ceph/ceph-mon*.log on boxenY

2016-03-01 18:50:23.748791 7f8c2e911700  0 mon.boxenY@0(leader) e1 handle_command mon_command({"prefix": "auth get-or-create", "entity": "client.bootstrap-mds", "caps": ["mon", "allow profile bootstrap-mds"]} v 0) v1
2016-03-01 18:50:23.748823 7f8c2e911700  0 log_channel(audit) log [INF] : from='client.? 10.XX.XX.43:0/1031277' entity='client.admin' cmd=[{"prefix": "auth get-or-create", "entity": "client.bootstrap-mds", "caps": ["mon", "allow profile bootstrap-mds"]}]: dispatch
2016-03-01 18:50:23.764426 7f8c30114700  0 log_channel(audit) log [INF] : from='client.? 10.XX.XX.43:0/1031277' entity='client.admin' cmd='[{"prefix": "auth get-or-create", "entity": "client.bootstrap-mds", "caps": ["mon", "allow profile bootstrap-mds"]}]': finished
2016-03-01 18:50:24.577738 7f8c35a93700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-01 18:50:24.577744 7f8c35a93700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x3f8f000 sd=20 :6789 s=0 pgs=0 cs=0 l=0 c=0x3d46d60).accept we reset (peer sent cseq 2), sending RESETSESSION
2016-03-01 18:50:24.867140 7f8c3149f700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-03-01 18:50:24.867231 7f8c3149f700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-03-01 18:50:24.984399 7f8c3149f700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd='mon_status' args=[]: dispatch
2016-03-01 18:50:24.984477 7f8c3149f700  0 log_channel(audit) log [DBG] : from='admin socket' entity='admin socket' cmd=mon_status args=[]: finished
2016-03-01 18:50:26.579586 7f8c35a93700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f8c35a93700

 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
 1: /usr/bin/ceph-mon() [0x902ca2]
 2: (()+0xf130) [0x7f8c34c83130]
 3: (PK11_DoesMechanism()+0x5f) [0x7f8c35122aef]
 4: (()+0x34dca) [0x7f8c35102dca]
 5: (()+0x4f36a) [0x7f8c3511d36a]
 6: (()+0x4f3bb) [0x7f8c3511d3bb]
 7: (PK11_CreateContextBySymKey()+0x22) [0x7f8c35101b82]
 8: /usr/bin/ceph-mon() [0x6b6f9a]
 9: (CryptoAESKeyHandler::decrypt(ceph::buffer::list const&, ceph::buffer::list&, std::string*) const+0x2a) [0x6b7dba]
 10: (void decode_decrypt_enc_bl<CephXServiceTicketInfo>(CephContext*, CephXServiceTicketInfo&, CryptoKey, ceph::buffer::list&, std::string&)+0xcd) [0x6ae9ad]
 11: (cephx_verify_authorizer(CephContext*, KeyStore*, ceph::buffer::list::iterator&, CephXServiceTicketInfo&, ceph::buffer::list&)+0x3bb) [0x6a913b]
 12: (Monitor::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x450) [0x56f220]
 13: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x7f) [0x7964ff]
 14: (Pipe::accept()+0x1db1) [0x8d1721]
 15: (Pipe::reader()+0x19ff) [0x8d6cef]
 16: (Pipe::Reader::entry()+0xd) [0x8d996d]
 17: (()+0x7df5) [0x7f8c34c7bdf5]
 18: (clone()+0x6d) [0x7f8c3375e1ad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
I then tried to start ceph-mon in the foreground and observed the following.

# /usr/bin/ceph-mon --debug_ms=5 --debug_mon=20 -i boxenY -c /etc/ceph/ceph.conf --cluster ceph -d
...
2016-03-01 19:44:28.100855 7f71df1a9700  1 -- 10.XX.XX.43:6789/0 >> :/0 pipe(0x47a0000 sd=19 :6789 s=0 pgs=0 cs=0 l=0 c=0x451e680).accept sd=19 10.XX.XX.41:35566/0
2016-03-01 19:44:28.101069 7f71df1a9700 10 mon.boxenY@0(leader) e1 ms_verify_authorizer 10.XX.XX.41:6789/0 mon protocol 2
2016-03-01 19:44:28.101135 7f71df1a9700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-01 19:44:28.101265 7f71df1a9700  2 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x47a0000 sd=19 :6789 s=2 pgs=1228 cs=1 l=0 c=0x451e680).reader couldn't read tag, (0) Success
2016-03-01 19:44:28.101286 7f71df1a9700  2 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x47a0000 sd=19 :6789 s=2 pgs=1228 cs=1 l=0 c=0x451e680).fault (0) Success
2016-03-01 19:44:28.101303 7f71df1a9700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x47a0000 sd=19 :6789 s=2 pgs=1228 cs=1 l=0 c=0x451e680).fault with nothing to send, going to standby
2016-03-01 19:44:30.101032 7f71defa7700  1 -- 10.XX.XX.43:6789/0 >> :/0 pipe(0x47a5000 sd=20 :6789 s=0 pgs=0 cs=0 l=0 c=0x451e940).accept sd=20 10.XX.XX.41:35567/0
*** Caught signal (Segmentation fault) **
 in thread 7f71defa7700
 ceph version 0.94.3 (95cefea9fd9ab740263bf8bb4796fd864d9afe2b)
 1: /usr/bin/ceph-mon() [0x902ca2]
 2: (()+0xf130) [0x7f71e7d20130]
 3: (tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)+0x37) [0x7f71e8707e67]
 4: (tcmalloc::CentralFreeList::FetchFromOneSpansSafe(int, void**, void**)+0x1a) [0x7f71e870812a]
 5: (tcmalloc::CentralFreeList::RemoveRange(void**, void**, int)+0x6d) [0x7f71e87081dd]
 6: (tcmalloc::ThreadCache::FetchFromCentralCache(unsigned long, unsigned long)+0x65) [0x7f71e870b235]
 7: (tc_new()+0xe8) [0x7f71e871b718]
 8: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7f71e709bc29]
 9: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7f71e709c86b]
 10: (std::string::reserve(unsigned long)+0x44) [0x7f71e709c914]
 11: (std::string::append(unsigned long, char)+0x46) [0x7f71e709ccf6]
 12: (PrebufferedStreambuf::overflow(int)+0x30) [0x794e10]
 13: (std::basic_streambuf<char, std::char_traits<char> >::xsputn(char const*, long)+0x36) [0x7f71e707dd76]
 14: (std::basic_ostream<char, std::char_traits<char> >& std::__ostream_insert<char, std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*, long)+0x215) [0x7f71e7074ac5]
 15: (Monitor::ms_verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x1bf) [0x56ef8f]
 16: (SimpleMessenger::verify_authorizer(Connection*, int, int, ceph::buffer::list&, ceph::buffer::list&, bool&, CryptoKey&)+0x7f) [0x7964ff]
 17: (Pipe::accept()+0x1db1) [0x8d1721]
 18: (Pipe::reader()+0x19ff) [0x8d6cef]
 19: (Pipe::Reader::entry()+0xd) [0x8d996d]
 20: (()+0x7df5) [0x7f71e7d18df5]
 21: (clone()+0x6d) [0x7f71e67fb1ad]
2016-03-01 19:44:31.674135 7f71e19ae700 11 mon.boxenY@0(leader) e1 tick
2016-03-01 19:44:31.674146 7f71e19ae700 10 mon.boxenY@0(leader).pg v2 check_down_pgs

I initially thought the different stack traces were due to differences in the debug output level, but now I can't reproduce the original stack, so I guess not.

Comment 9 Brad Hubbard 2016-03-02 00:17:56 UTC
Infernalis is still affected by this, but with master we get a much better result.

2016-03-02 13:50:20.352650 7fdd7d71a700  0 log_channel(cluster) log [INF] : mdsmap e1: 0/0/0 up
2016-03-02 13:50:20.368691 7fdd7d71a700  0 log_channel(cluster) log [INF] : osdmap e1: 0 osds: 0 up, 0 in
2016-03-02 13:50:20.384977 7fdd7d71a700  0 log_channel(cluster) log [INF] : pgmap v2: 64 pgs: 64 creating; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2016-03-02 13:50:20.404568 7fdd7d71a700  1 mon.boxenX@0(leader).log v2 unable to write to '/var/log/ceph/ceph.log' for channel 'cluster': (2) No such file or directory
2016-03-02 13:50:20.637123 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:20.637129 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:20.637136 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df56000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dfc0300).accept: got bad authorizer
2016-03-02 13:50:21.429447 7fdd7d71a700  1 mon.boxenX@0(leader).log v3 unable to write to '/var/log/ceph/ceph.log' for channel 'cluster': (2) No such file or directory
2016-03-02 13:50:22.636882 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:22.636886 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:22.636892 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df8a000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c400).accept: got bad authorizer
2016-03-02 13:50:24.637030 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:24.637035 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:24.637042 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df8b400 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7d180).accept: got bad authorizer
2016-03-02 13:50:26.637132 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:26.637139 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:26.637145 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df5e000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7d180).accept: got bad authorizer
2016-03-02 13:50:28.637403 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:28.637408 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:28.637413 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df5f400 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c400).accept: got bad authorizer
2016-03-02 13:50:30.637549 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:30.637553 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:30.637559 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df60800 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c580).accept: got bad authorizer
2016-03-02 13:50:32.637675 7fdd836ae700  0 cephx: verify_authorizer could not decrypt ticket info: error: NSS AES final round failed: -8190
2016-03-02 13:50:32.637680 7fdd836ae700  0 mon.boxenX@0(leader) e1 ms_verify_authorizer bad authorizer from mon 10.XX.XX.41:6789/0
2016-03-02 13:50:32.637690 7fdd836ae700  0 -- 10.XX.XX.43:6789/0 >> 10.XX.XX.41:6789/0 pipe(0x7fdd8df5e000 sd=18 :6789 s=0 pgs=0 cs=0 l=0 c=0x7fdd8dd7c580).accept: got bad authorizer

So this is fixed in master; we just need to work out what the fix was and potentially backport it.

Comment 10 Brad Hubbard 2016-03-02 02:55:14 UTC
Simpler reproducer.

$ ceph-deploy new boxenX boxenY
$ ceph-deploy mon create-initial

Move to boxenY

$ sudo service ceph stop
$ sudo rm -rf --one-file-system /var/lib/ceph/*
$ sudo rm -rf --one-file-system /etc/ceph/*
$ sudo mkdir /var/lib/ceph
$ sudo mkdir /etc/ceph/
$ ceph-deploy new boxenY
$ ceph-deploy mon create-initial

Comment 12 Brad Hubbard 2016-03-03 03:12:28 UTC
This issue is fixed by the following commit.

https://github.com/mathslinux/ceph/commit/e9e05333ac7c64758bf14d80f6179e001c0fdbfd

Comment 13 Brad Hubbard 2016-03-03 03:45:57 UTC
I'd suggest pulling in all of https://github.com/ceph/ceph/pull/6698
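
A hedged sketch of what a hammer backport of that change could look like, using the commit hash from comment 12 and assuming it is reachable from the upstream repository (otherwise fetch it from the fork linked in comment 12; the full PR may carry additional commits that should be picked too):

$ git clone https://github.com/ceph/ceph.git && cd ceph
$ git checkout -b hammer-bz1312587 origin/hammer    <--- hypothetical branch name
$ git cherry-pick -x e9e05333ac7c64758bf14d80f6179e001c0fdbfd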

Comment 15 Kefu Chai 2016-03-07 14:51:12 UTC
thanks to Brad, 

- the backport tracker: http://tracker.ceph.com/issues/14961
- the backport pr: https://github.com/ceph/ceph/pull/7922

Comment 18 Federico Lucifredi 2016-08-17 21:57:43 UTC
Medium priority for 1.3.3 testing.

Comment 21 Ramakrishnan Periyasamy 2016-09-15 14:56:03 UTC
Verified the issue as per the steps in comment 10 (given below). The mon service is coming up; moving the bug to Verified state.

Steps followed:

$ ceph-deploy new boxenX boxenY
$ ceph-deploy mon create-initial

Move to boxenY

$ sudo service ceph stop
$ sudo rm -rf --one-file-system /var/lib/ceph/*
$ sudo rm -rf --one-file-system /etc/ceph/*
$ sudo mkdir /var/lib/ceph
$ sudo mkdir /etc/ceph/
$ ceph-deploy new boxenY
$ ceph-deploy mon create-initial


Cluster status:
[root@magna111 ubuntu]# ceph -s
    cluster 9395c204-7cd8-4749-8d00-5a7a9fc7c515
     health HEALTH_ERR
            64 pgs stuck inactive
            64 pgs stuck unclean
            no osds
     monmap e1: 1 mons at {magna111=10.8.128.111:6789/0}
            election epoch 2, quorum 0 magna111
     osdmap e1: 0 osds: 0 up, 0 in
      pgmap v2: 64 pgs, 1 pools, 0 bytes data, 0 objects
            0 kB used, 0 kB / 0 kB avail
                  64 creating
[root@magna111 ubuntu]# ps aux | grep ceph
root     28513  0.0  0.0 115244  1460 ?        Ss   14:43   0:00 /bin/bash -c ulimit -n 32768; TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=128MB /usr/bin/ceph-mon -i magna111 --pid-file /var/run/ceph/mon.magna111.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     28519  0.4  0.0 201828 15968 ?        Sl   14:43   0:00 /usr/bin/ceph-mon -i magna111 --pid-file /var/run/ceph/mon.magna111.pid -c /etc/ceph/ceph.conf --cluster ceph -f
root     29191  0.0  0.0 112648   972 pts/0    S+   14:43   0:00 grep --color=auto ceph
[root@magna111 ubuntu]# service ceph status
=== mon.magna111 === 
mon.magna111: running {"version":"0.94.9-2.el7cp"}

Comment 27 errata-xmlrpc 2016-09-29 12:56:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-1972.html

