Bug 2278778

Summary: [7.1 Upgrade] : Upgrade of MGR in staggered approach also started upgrading NVMeoF service
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Sunil Kumar Nagaraju <sunnagar>
Component: Cephadm
Assignee: Adam King <adking>
Status: CLOSED ERRATA
QA Contact: Sunil Kumar Nagaraju <sunnagar>
Severity: high
Docs Contact: Akash Raj <akraj>
Priority: unspecified
Version: 7.1
CC: adking, akraj, cephqe-warriors, tserlin, vereddy
Target Milestone: ---
Target Release: 7.1
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-18.2.1-168.el9cp
Doc Type: No Doc Update
Last Closed: 2024-06-13 14:32:32 UTC
Type: Bug
Regression: ---
Clone Of: ---
Cloned As: 2310839 (view as bug list)
Bug Depends On:
Bug Blocks: 2267614, 2298578, 2298579, 2310839

Description Sunil Kumar Nagaraju 2024-05-03 02:45:48 UTC
Created attachment 2030955 [details]
gw-crash log

Description of problem:

I was upgrading a 4-node NVMe-oF gateway (GW) setup using the staggered upgrade approach and noticed two issues, but could not understand the crash.

Issue-1: I initially proceeded with a staggered upgrade of only the MGR daemons, but noticed that after the MGRs were updated, the nvmeof daemons also started upgrading. Is that expected?
Without the staggered approach, the upgrade procedure follows the enforced order:
mgr -> mon -> crash -> osd -> mds -> rgw -> rbd-mirror -> cephfs-mirror -> ceph-exporter -> iscsi -> nfs -> nvmeof
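
For reference, a staggered upgrade restricted to the MGR daemons is typically started with a command along these lines (a rough sketch; the image name is a placeholder, not the exact build reference used here):

  ceph orch upgrade start --image <18.2.1-159-image> --daemon-types mgr
  ceph orch upgrade status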

Issue-2: I am noticing crashes in the GW daemons.
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    1/ 5 mgrc
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    1/ 5 dpdk
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    1/ 5 eventtrace
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    1/ 5 prioritycache
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 test
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 cephfs_mirror
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 cephsqlite
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_onode
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_odata
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_omap
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_tm
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_t
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_cleaner
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_epm
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_lba
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_fixedkv_tree
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_cache
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_journal
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_device
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 seastore_backref
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 alienstore
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    1/ 5 mclock
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    0/ 5 cyanstore
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    1/ 5 ceph_exporter
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:    1/ 5 memstore
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   -2/-2 (syslog threshold)
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   99/99 (stderr threshold)
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]: --- pthread ID / name mapping for recent threads ---
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   7fc34bb13640 / ms_dispatch
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   7fc34db17640 / msgr-worker-2
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   7fc34e318640 / msgr-worker-1
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   7fc34eb19640 / msgr-worker-0
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   7fc34fb795c0 / ceph-nvmeof-mon
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   max_recent       500
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   max_new         1000
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   log_file /var/lib/ceph/crash/2024-05-02T15:58:31.864166Z_3148cde5-2a00-4b9e-a9a7-f0bb9341803d/log
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]: --- end dump of recent events ---
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]: 2024-05-02T15:58:31.863+0000 7fc34fb795c0  0 nvmeofgw int NVMeofGwMonitorClient::init() Complete.
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 systemd-coredump[134690]: Process 134655 (ceph-nvmeof-mon) of user 0 dumped core.
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]: [02-May-2024 15:58:31] ERROR server.py:42: GatewayServer: SIGCHLD received signum=17
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]: [02-May-2024 15:58:31] ERROR server.py:108: GatewayServer exception occurred:
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]: Traceback (most recent call last):
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   File "/remote-source/ceph-nvmeof/app/control/__main__.py", line 43, in <module>
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:     gateway.serve()
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 165, in serve
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:     self._start_monitor_client()
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 223, in _start_monitor_client
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:     self._wait_for_group_id()
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 145, in _wait_for_group_id
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:     self.monitor_event.wait()
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   File "/usr/lib64/python3.9/threading.py", line 581, in wait
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:     signaled = self._cond.wait(timeout)
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   File "/usr/lib64/python3.9/threading.py", line 312, in wait
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:     waiter.acquire()
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:   File "/remote-source/ceph-nvmeof/app/control/server.py", line 54, in sigchld_handler
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]:     raise SystemExit(f"Gateway subprocess terminated {pid=} {exit_code=}")
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]: SystemExit: Gateway subprocess terminated pid=18 exit_code=-6
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]: [02-May-2024 15:58:31] INFO server.py:392: Aborting (client.nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node6.vtwjfa) pid 18...
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-vtwjfa[134635]: [02-May-2024 15:58:31] INFO server.py:129: Exiting the gateway process.
May 02 11:58:31 ceph-sunilkumar-00-bjcvqj-node6 ceph-a77239d0-0874-11ef-9ae9-fa163ea1f4b1-nvmeof-rbd-ceph-sunilkumar-00-bjcvqj-node6-


Version-Release number of selected component (if applicable):
Upgrading:
  Ceph: IBM Ceph 18.2.1-149 to 18.2.1-159
  NVMe-oF: 1.2.4-1 to 1.2.5-2



How reproducible:


Steps to Reproduce:
1. Deploy a cluster on 18.2.1-149, configure 4 NVMe-oF GWs, and run IO.
2. Start an upgrade of only the MGR daemons using the staggered upgrade approach (see the command sketch after this list).
3. Observe that the NVMeoF services also start upgrading after the MGR upgrade completes.
4. Stop the upgrade process and start it again; this resulted in all GWs crashing.
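
A rough sketch of the orchestrator commands behind steps 2-4 (the image name is a placeholder for the actual target build, not the exact reference used):

  ceph orch upgrade start --image <18.2.1-159-image> --daemon-types mgr
  ceph orch upgrade status
  ceph orch upgrade stop
  ceph orch upgrade start --image <18.2.1-159-image>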


Additional info:

[ceph: root@ceph-sunilkumar-00-bjcvqj-node1-installer /]# ceph -s
  cluster:
    id:     a77239d0-0874-11ef-9ae9-fa163ea1f4b1
    health: HEALTH_WARN
            4 failed cephadm daemon(s)

  services:
    mon: 3 daemons, quorum ceph-sunilkumar-00-bjcvqj-node1-installer,ceph-sunilkumar-00-bjcvqj-node2,ceph-sunilkumar-00-bjcvqj-node3 (age 10h)
    mgr: ceph-sunilkumar-00-bjcvqj-node1-installer.xmrfdw(active, since 10h), standbys: ceph-sunilkumar-00-bjcvqj-node2.yqxjwf
    osd: 15 osds: 15 up (since 10h), 15 in (since 15h)

  data:
    pools:   2 pools, 129 pgs
    objects: 7.84k objects, 30 GiB
    usage:   93 GiB used, 282 GiB / 375 GiB avail
    pgs:     129 active+clean

[ceph: root@ceph-sunilkumar-00-bjcvqj-node1-installer /]# ceph versions
{
    "mon": {
        "ceph version 18.2.1-159.el9cp (5290a81d189b81ab463c73601c44f77a99f4107e) reef (stable)": 3
    },
    "mgr": {
        "ceph version 18.2.1-159.el9cp (5290a81d189b81ab463c73601c44f77a99f4107e) reef (stable)": 2
    },
    "osd": {
        "ceph version 18.2.1-159.el9cp (5290a81d189b81ab463c73601c44f77a99f4107e) reef (stable)": 15
    },
    "overall": {
        "ceph version 18.2.1-159.el9cp (5290a81d189b81ab463c73601c44f77a99f4107e) reef (stable)": 20
    }
}

[ceph: root@ceph-sunilkumar-00-bjcvqj-node1-installer /]# ceph orch ls
NAME                       PORTS             RUNNING  REFRESHED  AGE  PLACEMENT
alertmanager               ?:9093,9094           1/1  4m ago     15h  count:1
ceph-exporter                                    9/9  4m ago     15h  *
crash                                            9/9  4m ago     15h  *
grafana                    ?:3000                1/1  4m ago     15h  count:1
mgr                                              2/2  4m ago     15h  label:mgr
mon                                              3/3  4m ago     15h  label:mon
node-exporter              ?:9100                9/9  4m ago     15h  *
node-proxy                                       0/0  -          15h  *
nvmeof.rbd                 ?:4420,5500,8009      0/4  4m ago     12h  ceph-sunilkumar-00-bjcvqj-node6;ceph-sunilkumar-00-bjcvqj-node7;ceph-sunilkumar-00-bjcvqj-node8;ceph-sunilkumar-00-bjcvqj-node9
osd.all-available-devices                         15  4m ago     15h  *
prometheus                 ?:9095                1/1  4m ago     15h  count:1


[ceph: root@ceph-sunilkumar-00-bjcvqj-node1-installer /]# ceph orch ps | grep nvme
nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node6.vtwjfa        ceph-sunilkumar-00-bjcvqj-node6            *:5500,4420,8009  error             5m ago  12h        -        -  <unknown>         <unknown>     <unknown>
nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node7.vglfty        ceph-sunilkumar-00-bjcvqj-node7            *:5500,4420,8009  error             5m ago  12h        -        -  <unknown>         <unknown>     <unknown>
nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node8.hnmfps        ceph-sunilkumar-00-bjcvqj-node8            *:5500,4420,8009  error             5m ago  12h        -        -  <unknown>         <unknown>     <unknown>
nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node9.guwzdw        ceph-sunilkumar-00-bjcvqj-node9            *:5500,4420,8009  error             5m ago  12h        -        -  <unknown>         <unknown>     <unknown>
[ceph: root@ceph-sunilkumar-00-bjcvqj-node1-installer /]# ceph health detail
HEALTH_WARN 4 failed cephadm daemon(s)
[WRN] CEPHADM_FAILED_DAEMON: 4 failed cephadm daemon(s)
    daemon nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node6.vtwjfa on ceph-sunilkumar-00-bjcvqj-node6 is in error state
    daemon nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node7.vglfty on ceph-sunilkumar-00-bjcvqj-node7 is in error state
    daemon nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node8.hnmfps on ceph-sunilkumar-00-bjcvqj-node8 is in error state
    daemon nvmeof.rbd.ceph-sunilkumar-00-bjcvqj-node9.guwzdw on ceph-sunilkumar-00-bjcvqj-node9 is in error state
[ceph: root@ceph-sunilkumar-00-bjcvqj-node1-installer /]# ceph crash info
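
Crash details can be listed and inspected with the generic commands below (a sketch; the crash ID is taken from the log_file path in the GW log above):

  ceph crash ls
  ceph crash info 2024-05-02T15:58:31.864166Z_3148cde5-2a00-4b9e-a9a7-f0bb9341803d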

Comment 7 Sunil Kumar Nagaraju 2024-05-09 10:29:10 UTC
Hello Everyone,

Waiting for IBM Ceph 18.2.1-168 to verify the issue; with 18.2.1-167, a staggered MGR upgrade still also upgrades the NVMeoF service daemons, as reported above.

- Thanks

Comment 8 Sunil Kumar Nagaraju 2024-05-09 13:52:20 UTC
Verified the BZ with RH 18.2.1-170.

Now, upgrading the MGR using the staggered approach does not upgrade NVMeoF. The upgrade needs to follow the order MGR --> MON --> CRASH --> OSD.

Once all Ceph-based daemons are upgraded, NVMeoF gets upgraded in the final stage of the upgrade process.

Order: mgr -> mon -> crash -> osd -> mds -> rgw -> rbd-mirror -> cephfs-mirror -> ceph-exporter -> iscsi -> nfs -> nvmeof

In my case, after requesting a staggered upgrade of ceph-exporter, it upgraded ceph-exporter first and then started upgrading the NVMeof daemons (see the sketch below).
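
A staggered ceph-exporter upgrade of that kind would look roughly like the following (a sketch; the image name is a placeholder for the 18.2.1-170 build, and passing ceph-exporter to --daemon-types is assumed to work the same way as for the other daemon types):

  ceph orch upgrade start --image <18.2.1-170-image> --daemon-types ceph-exporter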


Hence, marking this BZ as verified.

Comment 11 errata-xmlrpc 2024-06-13 14:32:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925