Bug 1209859

Summary: [GSS] "Cluster Updates Are Stale. The Cluster isn't updating Calamari. Please contact Administrator".
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: Vikhyat Umrao <vumrao>
Component: Calamari Assignee: Christina Meno <gmeno>
Calamari sub component: Back-end QA Contact: Warren <wusui>
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: dmick, flucifre, icolle, kdreyer, tmuthami, vumrao
Version: 1.2.3   
Target Milestone: rc   
Target Release: 1.2.3   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: calamari-server-1.2.3-13.el6cp, calamari-server-1.2.3-13.el7cp Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 1214399 Environment:
Last Closed: 2015-04-16 14:36:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1214399    

Description Vikhyat Umrao 2015-04-08 11:12:36 UTC
Description of problem:

One of our customers is getting the error "Cluster Updates Are Stale. The Cluster isn't updating Calamari. Please contact Administrator" after updating to Red Hat Ceph Storage 1.2.3.

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.2.3
calamari-server-1.2.3-11.el7cp.x86_64
calamari-clients-1.2.3-3.el7cp.x86_64

ceph-0.80.8-5.el7cp.x86_64                                
ceph-common-0.80.8-5.el7cp.x86_64                           
ceph-mon-0.80.8-5.el7cp.x86_64                           
ceph-osd-0.80.8-5.el7cp.x86_64      

How reproducible:
Always, for this customer

Comment 6 Christina Meno 2015-04-09 14:50:43 UTC
Vikhyat,

This is likely the agent failing to report in because of the upgrade.

For further diagnosis, I would like to see the results of
sudo salt-key -L

and
sudo salt '*' ceph.get_heartbeats

as issued from a shell on the node where Calamari is running.

This will establish which nodes should be reporting and then verify that they are reporting.
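
A minimal sketch of how the two outputs requested above could be captured into one file for attaching to the bug. The function name, the default report file name, and the SALT_KEY / SALT override variables are hypothetical, added only so the script can be dry-run on a machine without Salt; the two Salt commands themselves are the ones quoted in this comment.

```shell
#!/bin/sh
# Sketch only: collect the two requested diagnostics into a single report file.
# SALT_KEY and SALT are hypothetical overrides (default: the real commands via sudo).
collect_calamari_diag() {
    out=${1:-calamari-diag.txt}
    {
        echo "== salt-key -L =="
        # Lists accepted, unaccepted and rejected minion keys.
        ${SALT_KEY:-sudo salt-key} -L
        echo "== salt '*' ceph.get_heartbeats =="
        # Asks every minion for its Ceph heartbeat data.
        ${SALT:-sudo salt} '*' ceph.get_heartbeats
    } > "$out" 2>&1
    # Print the report path so the caller knows what to attach.
    echo "$out"
}
```

Running `collect_calamari_diag` on the Calamari node would then write both outputs, including any error text, to calamari-diag.txt in the current directory.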

If there are no results from the get_heartbeats command,
ssh into any of the nodes listed under "Accepted Keys:" in the salt-key -L output

and run:
sudo tail -f /var/log/salt/minion

If you see a message like:
2015-04-09 07:07:14,565 [salt.crypt       ][CRITICAL] The Salt Master server's public key did not authenticate!
The master may need to be updated if it is a version of Salt lower than 2014.1.11, or
If you are confident that you are connecting to a valid Salt Master, then remove the master public key and restart the Salt Minion.
The master public key can be found at:
/etc/salt/pki/minion/minion_master.pub

It means that the salt-master has rotated its keys and that we need to remove the stale ones.

The process would be, for each node reporting to Calamari:
sudo rm /etc/salt/pki/minion/minion_master.pub; sudo service salt-minion restart
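
As a rough sketch, the check and the cleanup above could be combined so that the cached master key is only removed on nodes whose minion log actually shows the authentication failure. The function names and the DRY_RUN, MINION_LOG and MASTER_PUB variables are hypothetical; the log path, key path and the two commands are the ones given in this comment.

```shell
#!/bin/sh
# Sketch only: helper names and the DRY_RUN switch are hypothetical.
MINION_LOG=${MINION_LOG:-/var/log/salt/minion}
MASTER_PUB=${MASTER_PUB:-/etc/salt/pki/minion/minion_master.pub}

# Succeeds if the given log file contains the authentication failure quoted above.
master_key_rejected() {
    grep -q "The Salt Master server's public key did not authenticate" "$1"
}

# Remove the cached master public key and restart the minion.
# With DRY_RUN=1 the commands are printed instead of executed.
reset_master_key() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sudo rm $MASTER_PUB"
        echo "sudo service salt-minion restart"
    else
        sudo rm "$MASTER_PUB"
        sudo service salt-minion restart
    fi
}

# Only touch the key when the log shows this specific failure.
check_and_reset() {
    if master_key_rejected "$MINION_LOG"; then
        reset_master_key
    else
        echo "no master key authentication failure found in $MINION_LOG"
    fi
}
```

Calling `check_and_reset` with DRY_RUN=1 first shows what would be run on a node before anything is deleted.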

Comment 12 Christina Meno 2015-04-10 15:44:56 UTC
OK, I have identified the fix.

I failed to back-port an upstream fix that hardens the socket matching code into the 1.2.3 release.

Fix is upstream here:
https://github.com/ceph/calamari/pull/268

Comment 15 Christina Meno 2015-04-10 21:08:11 UTC
After updating the calamari-server package,
sudo salt '*' saltutil.sync_modules
must be run from the Calamari node.

Comment 16 Tamil 2015-04-11 20:17:14 UTC
Tested on RHEL 7.1 and it looks good.

Comment 22 Christina Meno 2015-04-13 17:09:27 UTC
Vikhyat,

Please know that we discovered this issue while testing the fix.

https://bugzilla.redhat.com/show_bug.cgi?id=1211347

It has the potential to cause the fix to appear to not work.

If, having applied the fix for 1209859, the customer continues to see the error, check the date and time on the client machine.
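
One possible way to make the date and time check concrete is to compare epoch timestamps taken on two machines. This is only a sketch: the function name is hypothetical, and neither which machine counts as "the client" nor how much skew is tolerable is stated in this bug.

```shell
#!/bin/sh
# Sketch only: absolute difference, in seconds, between two epoch timestamps.
clock_skew_seconds() {
    diff=$(( $1 - $2 ))
    # Flip the sign so the result is always non-negative.
    if [ "$diff" -lt 0 ]; then
        diff=$(( 0 - diff ))
    fi
    echo "$diff"
}
```

For example, `clock_skew_seconds "$(date +%s)" "$(ssh HOST date +%s)"` would compare the local clock with that of a remote machine, where HOST is a placeholder for a node of your choosing.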

Comment 23 Tamil 2015-04-13 22:56:55 UTC
Warren tested the fix on RHEL 6.6 as well.

Comment 24 Vikhyat Umrao 2015-04-14 06:27:07 UTC
(In reply to Gregory Meno from comment #22)
> Vikhyat,
> 
> Please know that we discovered this issue while testing the fix.
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1211347
> 
> It has the potential to cause the fix to appear to not work.
> 
> If having applied the fix for 1209859 customer continues to see the error
> check date time on the client machine.

Thanks, Greg!

Comment 29 errata-xmlrpc 2015-04-16 14:36:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2015:0842