Description of problem:

Noticed multiple entries for the same gateway host in the NVMe GW map:

[ceph: root@ceph-sunilkumar-01-cl763k-node1-installer /]# ceph nvme-gw show rbd ''
{
    "epoch": 47,
    "pool": "rbd",
    "group": "",
    "num gws": 2,
    "Anagrp list": "[ 1 2 ]"
}
{
    "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp",
    "anagrp-id": 1,
    "last-gw_map-epoch-valid": 1,
    "Availability": "UNAVAILABLE",
    "ana states": " 1: STANDBY , 2: STANDBY ,"
}
{
    "gw-id": "client.nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.sntxmb",
    "anagrp-id": 2,
    "last-gw_map-epoch-valid": 1,
    "Availability": "AVAILABLE",
    "ana states": " 1: ACTIVE , 2: ACTIVE ,"
}

Version-Release number of selected component (if applicable):
NVMe: 1.2.5-2
NVMe-CLI: 1.2.5-2
Ceph: 18.2.1-159

How reproducible:

Steps to Reproduce:
1. Deploy a Ceph cluster.
2. Configure the NVMe service and host namespaces.
3. Delete the service and add it back on the same node; this results in a new entry in the NVMe GW map.

Actual results:
Two gateway-map entries for the same gateway host (node6), each with a different gw-id.

Expected results:
Multiple entries for the same host should be avoided.

Additional info:
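For reference, the remove/re-add cycle from step 3 can be approximated with standard cephadm orchestrator commands. This is only a rough sketch: the placement spec is an assumption, while the pool and gateway host names are taken from the output above.

# Deploy the NVMe-oF gateway service for pool "rbd" on the gateway node
ceph orch apply nvmeof rbd --placement="ceph-sunilkumar-01-cl763k-node6"

# ... configure subsystems, namespaces and hosts via the nvmeof CLI ...

# Remove the service and apply it again on the same node
ceph orch rm nvmeof.rbd
ceph orch apply nvmeof rbd --placement="ceph-sunilkumar-01-cl763k-node6"

# Check the gateway map; only one entry per gateway host is expected
ceph nvme-gw show rbd ''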
This seems to be a case of error handling during nvmeof deployment. We can see in the cephadm logs that the deployment failed a few times because it could not pull the container image from the registry. If this is right, I think we can lower the severity and target the fix for the next major release.

This is what I see in the cephadm logs:

1. On the installer node (10.0.209.130), in a cephadm shell, I ran "ceph log last 200 cephadm" and I see this in the log:

***********************************************************************************************************************************************
2024-05-02T09:02:03.084988+0000 mgr.ceph-sunilkumar-01-cl763k-node1-installer.zahonj (mgr.15735) 402 : cephadm [INF] Upgrade: Finalizing container_image settings
2024-05-02T09:02:03.174969+0000 mgr.ceph-sunilkumar-01-cl763k-node1-installer.zahonj (mgr.15735) 403 : cephadm [INF] Upgrade: Complete!
2024-05-06T15:49:18.171118+0000 mgr.ceph-sunilkumar-01-cl763k-node1-installer.zahonj (mgr.15735) 185376 : cephadm [INF] Redeploy service nvmeof.rbd
2024-05-06T15:49:18.976417+0000 mgr.ceph-sunilkumar-01-cl763k-node1-installer.zahonj (mgr.15735) 185377 : cephadm [INF] Deploying daemon nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp on ceph-sunilkumar-01-cl763k-node6
2024-05-06T15:49:30.592650+0000 mgr.ceph-sunilkumar-01-cl763k-node1-installer.zahonj (mgr.15735) 185385 : cephadm [ERR] cephadm exited with an error code: 1, stderr: Redeploy daemon nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp ...
Creating ceph-nvmeof config...
Write file: /var/lib/ceph/1551a2a8-084b-11ef-bfc1-fa163ef4350d/nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp/ceph-nvmeof.conf
Non-zero exit code 1 from systemctl start ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp
systemctl: stderr Job for ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service failed because the control process exited with error code.
systemctl: stderr See "systemctl status ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" and "journalctl -xeu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" for details.
Traceback (most recent call last):
  File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 11148, in <module>
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 11136, in main
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6881, in command_deploy_from
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6899, in _common_deploy
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6982, in _dispatch_deploy
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 3960, in deploy_daemon
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 4203, in deploy_daemon_units
  File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 2220, in call_throws
RuntimeError: Failed command: systemctl start ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp: Job for ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service failed because the control process exited with error code.
See "systemctl status ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" and "journalctl -xeu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" for details. Traceback (most recent call last): File "/usr/share/ceph/mgr/cephadm/serve.py", line 1105, in _check_daemons self.mgr._daemon_action(daemon_spec, action=action) File "/usr/share/ceph/mgr/cephadm/module.py", line 2343, in _daemon_action return self.wait_async( File "/usr/share/ceph/mgr/cephadm/module.py", line 704, in wait_async return self.event_loop.get_result(coro, timeout) File "/usr/share/ceph/mgr/cephadm/ssh.py", line 64, in get_result return future.result(timeout) File "/lib64/python3.9/concurrent/futures/_base.py", line 446, in result return self.__get_result() File "/lib64/python3.9/concurrent/futures/_base.py", line 391, in __get_result raise self._exception File "/usr/share/ceph/mgr/cephadm/serve.py", line 1339, in _create_daemon out, err, code = await self._run_cephadm( File "/usr/share/ceph/mgr/cephadm/serve.py", line 1627, in _run_cephadm raise OrchestratorError( orchestrator._interface.OrchestratorError: cephadm exited with an error code: 1, stderr: Redeploy daemon nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp ... Creating ceph-nvmeof config... Write file: /var/lib/ceph/1551a2a8-084b-11ef-bfc1-fa163ef4350d/nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp/ceph-nvmeof.conf Non-zero exit code 1 from systemctl start ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp systemctl: stderr Job for ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service failed because the control process exited with error code. systemctl: stderr See "systemctl status ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" and "journalctl -xeu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" for details. Traceback (most recent call last): File "/usr/lib64/python3.9/runpy.py", line 197, in _run_module_as_main return _run_code(code, main_globals, None, File "/usr/lib64/python3.9/runpy.py", line 87, in _run_code exec(code, run_globals) File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 11148, in <module> File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 11136, in main File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6881, in command_deploy_from File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6899, in _common_deploy File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 6982, in _dispatch_deploy File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 3960, in deploy_daemon File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 4203, in deploy_daemon_units File "/tmp/tmpf82_u822.cephadm.build/__main__.py", line 2220, in call_throws RuntimeError: Failed command: systemctl start ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp: Job for ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service failed because the control process exited with error code. See "systemctl status ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" and "journalctl -xeu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service" for details. 
***********************************************************************************************************************************************

As indicated in the log above, I went to node 6 and ran the command shown in the cephadm log, "journalctl -eu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service", and I can see this:

***********************************************************************************************************************************************
[root@ceph-sunilkumar-01-cl763k-node6 ~]# journalctl -eu ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service
May 06 11:56:57 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:56:57 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 bash[1858468]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 bash[1858468]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:57:07 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:57:08 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:08 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 bash[1858559]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 bash[1858559]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:18 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:57:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Start request repeated too quickly.
May 06 11:57:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:37 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:57:37 ceph-sunilkumar-01-cl763k-node6 bash[1858969]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:57:37 ceph-sunilkumar-01-cl763k-node6 bash[1858969]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:57:37 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:57:38 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:38 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 bash[1859083]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 bash[1859083]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:48 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:58 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:57:58 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:57:58 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:57:59 ceph-sunilkumar-01-cl763k-node6 bash[1859198]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:57:59 ceph-sunilkumar-01-cl763k-node6 bash[1859198]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:57:59 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:57:59 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:57:59 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 bash[1859303]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 bash[1859303]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:58:09 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:58:10 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:58:10 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Scheduled restart job, restart counter is at>
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Starting Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d...
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 bash[1859417]: Trying to pull ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2...
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 bash[1859417]: Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Control process exited, code=exited, status=>
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: ceph-1551a2a8-084b-11ef-bfc1-fa163ef4350d.ceph-sunilkumar-01-cl763k-node6.qhkdcp.service: Failed with result 'exit-code'.
May 06 11:58:20 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Failed to start Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
May 06 11:58:29 ceph-sunilkumar-01-cl763k-node6 systemd[1]: Stopped Ceph nvmeof.rbd.ceph-sunilkumar-01-cl763k-node6.qhkdcp for 1551a2a8-084b-11ef-bfc1-fa163ef4350d.
lines 939-1000/1000 (END)
************************************************************************************************************************************************************

Note this line in the log above: "Error: initializing source docker://ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2: reading manifest 1.2.5-2 in ceph-sunil>"

So it looks like the issue here was that there were several failed attempts to deploy the nvmeof gateway because the image was not fully available, so each attempt failed.

Conclusion so far: I think we should defer this issue, while running another test on a clean cluster to verify that images can be fully downloaded without issues on the first try.

Also, I discussed this with Adam King, and he said it is recommended to set mgr/cephadm/use_repo_digest to false when setting up your clusters, because he has seen local repos have trouble handling image digests properly in the past.
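For that follow-up test, one possible way to confirm the image is actually reachable from the gateway node before redeploying, and to apply Adam's recommendation, is sketched below. The registry host and image tag are copied from the log above; the --tls-verify=false flag assumes the local registry is insecure, so adjust for your environment.

# From the gateway node, check that the local registry serves the expected tag
curl -sk http://ceph-sunilkumar-01-cl763k-node7:5000/v2/ibm-ceph/nvmeof-rhel9/tags/list
podman pull --tls-verify=false ceph-sunilkumar-01-cl763k-node7:5000/ibm-ceph/nvmeof-rhel9:1.2.5-2

# Adam King's recommendation for clusters that use a local registry
ceph config set mgr mgr/cephadm/use_repo_digest false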
Thanks Aviv. I agree that there were issues pulling the image, but in my view, strictly disallowing a second entry with a different ID for the same gateway node in the same gateway group of the NVMe GW map would still be a real improvement. This situation could occur at any time: when a registry is failing or in an outage, or when the image cannot be downloaded/pulled on Ceph clusters with network latency (timeout issues), especially in private clouds.
Considering the namespace allocation against an invalid entry, as described in the BZ below, resetting the target release to 7.1.
https://bugzilla.redhat.com/show_bug.cgi?id=2279862
Fixed by https://gitlab.cee.redhat.com/ceph/ceph/-/commit/7837cc865312b562228519e3efdf658a3cde4193
Could not reproduce the issue; every time a new NVMe GW came up, I did not see a duplicate entry for the same node in the NVMe MON map.

Tried the following scenarios to make the daemon fail during deployment (roughly sketched below):
- Hosted a private registry and redeployed nvmeof.service without the gateway nodes having access to that registry.
- Removed the service and added it back.

Verified on Ceph 18.2.1-185 with:
registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof:1.2.9-1
registry-proxy.engineering.redhat.com/rh-osbs/ceph-nvmeof-cli:1.2.9-1
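For context, the two scenarios above roughly correspond to the commands below. This is only an illustrative sketch: blocking registry access is environment-specific and not shown, and <gateway-node> is a placeholder.

# Scenario 1: force a redeploy while the gateway node cannot reach the private registry,
# so the daemon fails at the container pull stage
ceph orch redeploy nvmeof.rbd

# Scenario 2: remove the service and add it back
ceph orch rm nvmeof.rbd
ceph orch apply nvmeof rbd --placement="<gateway-node>"

# In both cases, confirm there is still only one entry per gateway host
ceph nvme-gw show rbd ''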
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2024:3925