Description of problem:

Noticed multiple entries for the same gateway host in the NVMe GW map:

[ceph: root@ceph-sunilkumar-01-cl763k-node1-installer /]# ceph nvme-gw show rbd ''
{
    "epoch": 47,
    "pool": "rbd",
    "group": "",
    "num gws": 2,
    "Anagrp list": "[ 1 2 ]"
}
{
    "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp",
    "anagrp-id": 1,
    "last-gw_map-epoch-valid": 1,
    "Availability": "UNAVAILABLE",
    "ana states": " 1: STANDBY , 2: STANDBY ,"
}
{
    "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.sntxmb",
    "anagrp-id": 2,
    "last-gw_map-epoch-valid": 1,
    "Availability": "AVAILABLE",
    "ana states": " 1: ACTIVE , 2: ACTIVE ,"
}

Version-Release number of selected component (if applicable):
NVMe: 1.2.5-2
NVMe-CLI: 1.2.5-2
Ceph: 18.2.1-159

How reproducible:

Steps to Reproduce:
1. Deploy a Ceph cluster.
2. Configure the NVMe service and host namespaces.
3. Delete the service and add it back on the same node; this results in a new entry in the NVMe GW map.

Actual results:
Two gateway-map entries for the same gateway host (node6), each with a different gw-id.

Expected results:
Multiple entries for the same host should be avoided.

Additional info:
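For reference, the remove/re-add cycle from step 3 can be approximated with standard cephadm orchestrator commands. This is only a rough sketch: the placement spec is an assumption, while the pool and gateway host names are taken from the output above.

# Deploy the NVMe-oF gateway service for pool "rbd" on the gateway node
ceph orch apply nvmeof rbd --placement="ceph-sunilkumar-01-cl763k-node6"

# ... configure subsystems, namespaces and hosts via the nvmeof CLI ...

# Remove the service and apply it again on the same node
ceph orch rm nvmeof.rbd
ceph orch apply nvmeof rbd --placement="ceph-sunilkumar-01-cl763k-node6"

# Check the gateway map; only one entry per gateway host is expected
ceph nvme-gw show rbd ''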
This seems to be a case of error handling during nvmeof deployment. We can see in the cephadm logs that the deployment failed a few times because it could not pull the container image from the registry. If this is right, I think we can lower the severity and target the fix for the next major release.

This is what I see in the cephadm logs:

1. On the installer node (10.0.209.130), in a cephadm shell, I ran "ceph log last 200 cephadm" and I see this in the log:

***********************************************************************************************************************************************
2024-05-02T09:02:03.084988+0000 mgr.ceph-sunilkumar-01-cl763k-node1-installer.zahonj (mgr.15735) 402 : cephadm [INF] Upgrade: Finalizing container_image settings
2024-05-02T09:02:03.174969+0000 mgr.ceph-sunilkumar-01-cl763k-node1-installer.zahonj (mgr.15735) 403 : cephadm [INF] Upgrade: Complete!
2024-05-06T15:49:18.171118+0000 mgr.ceph-sunilkumar-01-cl763k-node1-installer.zahonj (mgr.15735) 185376 : cephadm [INF] Redeploy service nvmeof.rbd
2024-05-06T15:49:18.976417+0000 mgr.ceph-sunilkumar-01-cl763k-node1-installer.zahonj (mgr.15735) 185377 : cephadm [INF] Deploying daemon nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp on ceph-sunilkumar-01-cl763k-node6
2024-05-06T15:49:30.592650+0000 mgr.ceph-sunilkumar-01-cl763k-node1-installer.zahonj (mgr.15735) 185385 : cephadm [ERR] cephadm exited with an error code: 1, stderr: Redeploy daemon nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp ...
Creating ceph-nvmeof config...
Write file: /var/lib/ceph/1551a2a8-084b-11ef-bfc1-fa163ef4350d/nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp/ceph-nvmeof.conf
Non-zero exit code 1 from systemctl start ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp
systemctl: stderr Job for ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" and "journalctl -xeu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" for details.
Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 11148, in <module>
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 11136, in main
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6881, in command_deploy_from
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6899, in _common_deploy
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6982, in _dispatch_deploy
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 3960, in deploy_daemon
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 4203, in deploy_daemon_units
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 2220, in call_throws
RuntimeError: Failed command: systemctl start ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp: Job for ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service failed because the control process exited with error code.
See "systemctl status ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" and "journalctl -xeu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" for details. Traceback (most recent call last): File "/usr/share/ceph/mgr/cephadm/serve.py", line 1105, in _check_daemons self.mgr._daemon_action(daemon_spec, action=action) File "/usr/share/ceph/mgr/cephadm/module.py", line 2343, in _daemon_action return self.wait_async( File "/usr/share/ceph/mgr/cephadm/module.py", line 704, in wait_async return self.event_loop.get_result(coro, timeout) File "/usr/share/ceph/mgr/cephadm/ssh.py", line 64, in get_result return future.result(timeout) File "/lib64/python3.9/concurrent/futures/_base.py", line 446, in result return self.__get_result() File "/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result raise self._exception File "/usr/share/ceph/mgr/cephadm/serve.py", line 1339, in _create_daemon out, err, code = await self._run_cephadm( File "/usr/share/ceph/mgr/cephadm/serve.py", line 1627, in _run_cephadm raise OrchestratorError( orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr: Redeploy daemon nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp ... Creating ceph-nvmeof config... Write file: /var/lib/ceph/1551a2a8-084b-11ef-bfc1-fa163ef4350d/nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp/ceph-nvmeof.conf Non-zero exit code 1 from systemctl start ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp systemctl: stderr Job for ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service failed because the control process exited with error code. systemctl: stderr See "systemctl status ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" and "journalctl -xeu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" for details. Traceback (most recent call last): File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code exec(code, run_globals) File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 11148, in <module> File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 11136, in main File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6881, in command_deploy_from File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6899, in _common_deploy File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6982, in _dispatch_deploy File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 3960, in deploy_daemon File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 4203, in deploy_daemon_units File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 2220, in call_throws RuntimeError: Failed command: systemctl start ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp: Job for ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service failed because the control process exited with error code. See "systemctl status ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" and "journalctl -xeu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" for details. 
***********************************************************************************************************************************************

As indicated in the log above, I went to node 6 and ran the command shown in the cephadm log, "journalctl -eu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service", and I can see this:

***********************************************************************************************************************************************
[root@ceph-sunilkumar-01-cl763k-node6 ~]# journalctl -eu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service
May 06 11:56:57 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:56:57 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 bash[1858468]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 bash[1858468]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:57:08 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:08 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 bash[1858559]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 bash[1858559]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:57:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Start request repeated too quickly.
May 06 11:57:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:37 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:57:37 ceph-sunilkumar-01-cl763k-node6 bash[1858969]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:57:37 ceph-sunilkumar-01-cl763k-node6 bash[1858969]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:57:37 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:57:38 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:38 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 bash[1859083]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 bash[1859083]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:58 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:57:58 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:58 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:57:59 ceph-sunilkumar-01-cl763k-node6 bash[1859198]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:57:59 ceph-sunilkumar-01-cl763k-node6 bash[1859198]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:57:59 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:57:59 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:59 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 bash[1859303]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 bash[1859303]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:58:10 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:58:10 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 bash[1859417]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 bash[1859417]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:58:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
lines 939-1000/1000 (END)
************************************************************************************************************************************************************

Note this line in the log above: "Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>"

So it looks like the issue here was that there were several failed attempts to deploy the nvmeof gateway because the image was not fully available, so each attempt failed.

Conclusion so far: I think we should defer this issue, while running another test on a clean cluster to verify that images can be fully downloaded without issues on the first try.

Also, I discussed this with Adam King, and he said it is recommended to set mgr/cephadm/use_repo_digest to false when setting up your clusters, because he has seen local repos have trouble handling image digests properly in the past.
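For that follow-up test, one possible way to confirm the image is actually reachable from the gateway node before redeploying, and to apply Adam's recommendation, is sketched below. The registry host and image tag are copied from the log above; the --tls-verify=false flag assumes the local registry is insecure, so adjust for your environment.

# From the gateway node, check that the local registry serves the expected tag
curl -sk http://ceph-sunilkumar-01-cl763k-node7:5000/v2/ibm-ceph/nvmeof-rhel9/tags/list
podman pull --tls-verify=false ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2

# Adam King's recommendation for clusters that use a local registry
ceph config set mgr mgr/cephadm/use_repo_digest false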
Thanks Aviv. I agree that there were issues pulling the image, but in my view, strictly disallowing a second entry with a different ID for the same gateway node in the same gateway group of the NVMe GW map would still be a real improvement. This situation could occur at any time: when a registry is failing or in an outage, or when the image cannot be downloaded/pulled on Ceph clusters with network latency (timeout issues), especially in private clouds.
Considering the namespace allocation against an invalid entry, as described in the BZ below, resetting the target release to 7.1.
https://bugzilla.redhat.com/show_bug.cgi?id=2279862
Fixed by https://gitlab.cee.redhat.com/ceph/ceph/-/commit/7837cc865312b562228519e3efdf658a3cde4193
Could not reproduce the issue; every time a new NVMe GW came up, I did not see a duplicate entry for the same node in the NVMe MON map.

Tried the following scenarios to make the daemon fail during deployment (roughly sketched below):
- Hosted a private registry and redeployed nvmeof.service without the gateway nodes having access to that registry.
- Removed the service and added it back.

Verified on Ceph 18.2.1-185 with:
registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:1.2.9-1
registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:1.2.9-1
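For context, the two scenarios above roughly correspond to the commands below. This is only an illustrative sketch: blocking registry access is environment-specific and not shown, and <gateway-node> is a placeholder.

# Scenario 1: force a redeploy while the gateway node cannot reach the private registry,
# so the daemon fails at the container pull stage
ceph orch redeploy nvmeof.rbd

# Scenario 2: remove the service and add it back
ceph orch rm nvmeof.rbd
ceph orch apply nvmeof rbd --placement="<gateway-node>"

# In both cases, confirm there is still only one entry per gateway host
ceph nvme-gw show rbd ''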
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:3925