Description of problem:
One of our customers is getting the error "Cluster Updates Are Stale. The Cluster isn't updating Calamari. Please contact Administrator" after updating to Red Hat Ceph Storage 1.2.3.

Version-Release number of selected component (if applicable):
Red Hat Ceph Storage 1.2.3
calamari-server-1.2.3-11.el7cp.x86_64
calamari-clients-1.2.3-3.el7cp.x86_64
ceph-0.80.8-5.el7cp.x86_64
ceph-common-0.80.8-5.el7cp.x86_64
ceph-mon-0.80.8-5.el7cp.x86_64
ceph-osd-0.80.8-5.el7cp.x86_64

How reproducible:
Always, for this customer.
Vikhyat,

This is likely a failure of the agents to report in after the upgrade. For further diagnosis I would like to see the results of

    sudo salt-key -L

and

    sudo salt '*' ceph.get_heartbeats

as issued from a shell on the node where Calamari is running. The first command establishes which nodes should be reporting; the second verifies that they are reporting.

If there are no results from the get_heartbeats command, ssh into any of the nodes listed under "Accepted Keys:" in the salt-key -L output and run:

    sudo tail -f /var/log/salt/minion

If you see a message like:

    2015-04-09 07:07:14,565 [salt.crypt ][CRITICAL] The Salt Master server's public key did not authenticate!
    The master may need to be updated if it is a version of Salt lower than 2014.1.11, or
    If you are confident that you are connecting to a valid Salt Master, then remove the master public key and restart the Salt Minion.
    The master public key can be found at:
    /etc/salt/pki/minion/minion_master.pub

it means that the salt-master has rotated its keys and that the stale ones need to be removed. The process, for each node reporting to Calamari, would be:

    sudo rm /etc/salt/pki/minion/minion_master.pub; sudo service salt-minion restart
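For anyone checking several nodes, the log check above could be scripted roughly as follows. This is only a sketch: the function names minion_key_is_stale and print_key_reset_commands are made up for illustration and are not part of Salt or Calamari. The matched string and the paths are the ones quoted in this comment.

```shell
#!/bin/sh
# Sketch only: both function names below are hypothetical helpers,
# not part of Salt or Calamari.

# Returns 0 when the minion log contains the CRITICAL authentication
# failure quoted in this comment, non-zero otherwise.
# $1: path to the minion log (defaults to the path used above).
minion_key_is_stale() {
    grep -q "The Salt Master server's public key did not authenticate" \
        "${1:-/var/log/salt/minion}"
}

# Prints the remediation commands from this comment without running
# them, so they can be reviewed before being applied on a node.
print_key_reset_commands() {
    echo "sudo rm /etc/salt/pki/minion/minion_master.pub"
    echo "sudo service salt-minion restart"
}
```

On a node this might be used as: minion_key_is_stale && print_key_reset_commands. Printing instead of executing is a deliberate choice here, since removing the master key should only happen when the master is known to be valid.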
OK, I have identified the fix. I failed to back-port to the 1.2.3 release an upstream fix that hardens the socket-matching code. The fix is upstream here: https://github.com/ceph/calamari/pull/268
After updating the calamari-server package, the following must be run from the Calamari node:

    sudo salt '*' saltutil.sync_modules
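A rough sketch of the post-update step, followed by the ceph.get_heartbeats check requested earlier in this bug. The function name post_update_resync and the RUN variable are hypothetical, added only so the commands can be previewed before running them; the two salt commands themselves are the ones given in this bug.

```shell
#!/bin/sh
# Sketch only: post_update_resync and RUN are hypothetical names.
# Set RUN=echo to preview the commands instead of executing them.
post_update_resync() {
    # Step 1: sync the updated modules out to every minion.
    ${RUN:-} sudo salt '*' saltutil.sync_modules || return 1
    # Step 2: confirm that the nodes are reporting heartbeats again.
    ${RUN:-} sudo salt '*' ceph.get_heartbeats
}
```

Running RUN=echo post_update_resync on the Calamari node only prints the two commands; running post_update_resync without RUN executes them in order and stops if the sync fails.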
Tested on RHEL 7.1 and it looks good.
Vikhyat,

Please note that we discovered the following issue while testing the fix:

https://bugzilla.redhat.com/show_bug.cgi?id=1211347

It can make the fix appear not to work.

If the customer continues to see the error after applying the fix for 1209859, check the date and time on the client machine.
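One way to make the date/time check concrete, assuming it amounts to comparing the client machine's clock against the Calamari node's clock (that interpretation is an assumption, as is the helper name clock_skew_seconds):

```shell
#!/bin/sh
# Sketch only: clock_skew_seconds is a hypothetical helper.
# Prints the absolute difference, in seconds, between two epoch
# timestamps. The body runs in a subshell so a and b do not leak.
clock_skew_seconds() (
    a=$1
    b=$2
    if [ "$a" -ge "$b" ]; then
        echo $((a - b))
    else
        echo $((b - a))
    fi
)
```

For example, clock_skew_seconds "$(date +%s)" "$(ssh calamari-node date +%s)" run on the client, where calamari-node is a placeholder hostname. How much skew is acceptable is not stated in this bug.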
Warren tested the fix on RHEL 6.6 as well.
(In reply to Gregory Meno from comment #22)
> Vikhyat,
>
> Please know that we discovered this issue while testing the fix.
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1211347
>
> It has the potential to cause the fix to appear to not work.
>
> If having applied the fix for 1209859 customer continues to see the error
> check date time on the client machine.

Thanks Greg!
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2015:0842