Description of problem:
The diamond service was not started on one of the OSD nodes in the 1.3.3 RC build. On all the other Ceph nodes it was running. After running "salt '*' state.highstate" from the admin node, the diamond service was running on all nodes and graphs were generated in the Calamari UI.

Version-Release number of selected component (if applicable):
diamond-3.4.67-4.el7cp.noarch

How reproducible:
1/1

Steps to Reproduce:
1. Install Ceph and Calamari.
2. Run "ceph-deploy calamari connect" for all nodes in the cluster from the admin node.
3. Log into the Calamari UI and accept the keys.
4. The diamond service was not started on one of the OSD nodes out of the 6 nodes in total (3 MON and 3 OSD), although the diamond package was installed.
5. Graphs are not generated for that particular node in the Calamari UI.

Actual results:
The diamond service is not started on one of the OSD nodes.

Expected results:
The diamond service should be started on all cluster nodes.

Additional info:
Tried the following workaround: run "salt '*' state.highstate" from the admin node. After this, the diamond service was running on all nodes and graphs were generated in the Calamari UI.
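A minimal sketch of the workaround as it would be run from the admin node; the `service.status` check is an illustrative assumption for confirming the result and is not part of the original report:

  # Re-apply the highstate so salt (re)configures and starts diamond on every minion:
  sudo salt '*' state.highstate
  # Optionally confirm that diamond now reports as running (True) on each node:
  sudo salt '*' service.status diamond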
Andrew, would you please take a look at this setup today?
(In reply to Gregory Meno from comment #5)
> Andrew, would you please take a look at this setup today?

Yeah, I can take a look.
I was able to get diamond started on all nodes and reporting to the web UI with the following steps:

On the Admin/Calamari node:
- Deleted all minion keys (deauthorizing all nodes) with `sudo salt-key -D -y`

On all the OSD and MON nodes:
- `sudo yum remove diamond salt-minion`
- `sudo rm -rf /etc/salt`
- `sudo rm /var/lock/subsys/diamond`

On the Admin/Calamari node:
- Ran `ceph-deploy calamari connect` for all nodes in the cluster
- Logged into the Web UI and accepted the keys

I've left the cluster in this working state for you to inspect and verify. I'm not exactly sure why diamond did not start on one of the OSDs before, but I suspect that node might not have been properly cleaned up from previous tests.
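A short verification sketch after the reconnect, run from the Admin/Calamari node; these are generic salt checks added for illustration, not steps from the original comment:

  # The keys for all MON/OSD minions should show up as Accepted:
  sudo salt-key -L
  # Every minion should respond to a ping from the master:
  sudo salt '*' test.ping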
I have installed Ceph and Calamari on a new setup (all servers re-imaged) and am still observing the same problem. After accepting the keys in the Calamari UI, the diamond service does not start on some nodes.

How reproducible: 2/2

This time the diamond service did not start on 3 servers (1 MON and 2 OSD machines). Andrew has taken a look at this setup.
I took a look at the nodes where diamond did not start and found this in the salt logs:

Sep 29 15:14:21 magna107 salt-minion[30739]: [WARNING ] The minion function caused an exception
Sep 29 15:14:21 magna107 salt-minion[30739]: Traceback (most recent call last):
Sep 29 15:14:21 magna107 salt-minion[30739]: File "/usr/lib/python2.7/site-packages/salt/minion.py", line 796, in _thread_return
Sep 29 15:14:21 magna107 salt-minion[30739]: return_data = func(*args, **kwargs)
Sep 29 15:14:21 magna107 salt-minion[30739]: File "/usr/lib/python2.7/site-packages/salt/modules/state.py", line 275, in highstate
Sep 29 15:14:21 magna107 salt-minion[30739]: force=kwargs.get('force', False)
Sep 29 15:14:21 magna107 salt-minion[30739]: File "/usr/lib/python2.7/site-packages/salt/state.py", line 2497, in call_highstate
Sep 29 15:14:21 magna107 salt-minion[30739]: self.load_dynamic(matches)
Sep 29 15:14:21 magna107 salt-minion[30739]: File "/usr/lib/python2.7/site-packages/salt/state.py", line 2081, in load_dynamic
Sep 29 15:14:21 magna107 salt-minion[30739]: refresh=False)
Sep 29 15:14:21 magna107 salt-minion[30739]: File "/usr/lib/python2.7/site-packages/salt/modules/saltutil.py", line 343, in sync_all
Sep 29 15:14:21 magna107 salt-minion[30739]: ret['modules'] = sync_modules(saltenv, False)
Sep 29 15:14:21 magna107 salt-minion[30739]: File "/usr/lib/python2.7/site-packages/salt/modules/saltutil.py", line 228, in sync_modules
Sep 29 15:14:21 magna107 salt-minion[30739]: ret = _sync('modules', saltenv)
Sep 29 15:14:21 magna107 salt-minion[30739]: File "/usr/lib/python2.7/site-packages/salt/modules/saltutil.py", line 82, in _sync
Sep 29 15:14:21 magna107 salt-minion[30739]: os.makedirs(mod_dir)
Sep 29 15:14:21 magna107 salt-minion[30739]: File "/usr/lib64/python2.7/os.py", line 157, in makedirs
Sep 29 15:14:21 magna107 salt-minion[30739]: mkdir(name, mode)
Sep 29 15:14:21 magna107 salt-minion[30739]: OSError: [Errno 17] File exists: '/var/cache/salt/minion/extmods/modules'

This traceback looks identical to the one found in this upstream issue: http://tracker.ceph.com/issues/8780#note-1

This seems like a salt issue to me. I'd recommend adding documentation noting that running "salt '*' state.highstate" from the admin node works around the problem when this situation occurs. Also, simply restarting the salt-minion on the affected nodes seems to fix the issue.
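A minimal sketch of the two remediations mentioned above, assuming a RHEL 7 node and the stock salt CLI; the cache path is the one from the traceback:

  # On the affected node (e.g. magna107), restart the minion to clear the stuck state:
  sudo systemctl restart salt-minion
  # Back on the Admin/Calamari node, re-run the highstate so diamond is configured and started:
  sudo salt '*' state.highstate
  # The directory salt failed to create should now be present under the minion cache:
  ls -ld /var/cache/salt/minion/extmods/modules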
QE ran "salt '*' state.highstate" from the admin node and it resolved the issue. This BZ needs to be release-noted for 1.3.3 with the workaround mentioned above.
Doc text looks good
The doc text looks good to me as well.
Looks like it got added to the 1.3.3 release notes