Description of problem:
=========
One of the mgr daemons crashed while a host OS upgrade from RHEL 8.8 to RHEL 9.2 was being performed on a Ceph 7.0 cluster. All nodes were upgraded successfully, but after the reboot `ceph -s` shows 1 mgr module crashed.

# ceph crash info 2023-08-19T04:59:53.987466Z_dda445ee-f57e-4030-932f-4e6e26956fe0
{
    "backtrace": [
        " File \"/usr/share/ceph/mgr/cephadm/module.py\", line 698, in serve\n serve.serve()",
        " File \"/usr/share/ceph/mgr/cephadm/serve.py\", line 93, in serve\n if self._apply_all_services():",
        " File \"/usr/share/ceph/mgr/cephadm/serve.py\", line 573, in _apply_all_services\n self.mgr.tuned_profile_utils._write_all_tuned_profiles()",
        " File \"/usr/share/ceph/mgr/cephadm/tuned_profiles.py\", line 46, in _write_all_tuned_profiles\n self._remove_stray_tuned_profiles(host, profiles)",
        " File \"/usr/share/ceph/mgr/cephadm/tuned_profiles.py\", line 73, in _remove_stray_tuned_profiles\n found_files = self.mgr.ssh.check_execute_command(host, cmd, log_command=self.mgr.log_refresh_metadata).split('\\n')",
        " File \"/usr/share/ceph/mgr/cephadm/ssh.py\", line 228, in check_execute_command\n return self.mgr.wait_async(self._check_execute_command(host, cmd, stdin, addr, log_command))",
        " File \"/usr/share/ceph/mgr/cephadm/module.py\", line 710, in wait_async\n return self.event_loop.get_result(coro, timeout)",
        " File \"/usr/share/ceph/mgr/cephadm/ssh.py\", line 63, in get_result\n return future.result(timeout)",
        " File \"/lib64/python3.9/concurrent/futures/_base.py\", line 448, in result\n raise TimeoutError()",
        "concurrent.futures._base.TimeoutError"
    ],
    "ceph_version": "18.1.2-1.el9cp",
    "crash_id": "2023-08-19T04:59:53.987466Z_dda445ee-f57e-4030-932f-4e6e26956fe0",
    "entity_name": "mgr.e24-h19-740xd.nciiap",
    "mgr_module": "cephadm",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "TimeoutError",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-mgr",
    "stack_sig": "e11ab31c9d0ed5cece13917ebab1f612e4ca89481fe319aa0f59d4af9065ab1e",
    "timestamp": "2023-08-19T04:59:53.987466Z",
    "utsname_hostname": "e24-h19-740xd.alias.bos.scalelab.redhat.com",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-477.21.1.el8_8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Thu Jul 20 08:38:27 EDT 2023"
}

Version-Release number of selected component (if applicable):
==========
# ceph --version
ceph version 18.1.2-1.el9cp (fda3f82ca4e81cdea0ca82061f107dbc4e84fb00) reef (rc)

# rpm -qa | grep ceph
cephadm-18.1.2-1.el9cp.noarch

How reproducible:
========
1/1

Steps to Reproduce:
===========
1. Perform a host OS upgrade from RHEL 8.8 to RHEL 9.2 with Ceph 7.0

Actual results:
=======
The upgrade completed successfully, but one mgr module crash was observed afterwards.

Expected results:
======
No crash should be reported.
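The backtrace above shows the cephadm serve loop blocking in wait_async() on future.result(timeout) while running an SSH command against a host (here, while cleaning up stray tuned profiles) and failing with concurrent.futures.TimeoutError when the host, presumably still mid-reboot during the OS upgrade, does not answer in time. The following is a minimal, self-contained sketch of that pattern, not cephadm's actual code; slow_ssh_command and the 5-second timeout are made up for illustration.

import asyncio
import concurrent.futures
import threading

# Hypothetical stand-in for cephadm's event-loop bridge: a coroutine is
# submitted to an asyncio loop running in a background thread, and the caller
# blocks on future.result(timeout). If the "remote command" does not finish
# within the timeout, concurrent.futures.TimeoutError is raised in the calling
# thread -- the same exception recorded in the crash info above.

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def slow_ssh_command() -> str:
    # Stand-in for an SSH call to a host that is still rebooting and never answers.
    await asyncio.sleep(60)
    return "ls /etc/tuned/"

future = asyncio.run_coroutine_threadsafe(slow_ssh_command(), loop)
try:
    print(future.result(timeout=5))  # analogous to get_result() in the backtrace
except concurrent.futures.TimeoutError:
    print("SSH command timed out; the exception propagates up through serve()")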
Additional info:
============
# ceph -s
  cluster:
    id:     5742cad8-37be-11ee-87b6-246e96c2b086
    health: HEALTH_WARN
            noout flag(s) set
            1 mgr modules have recently crashed

  services:
    mon: 5 daemons, quorum e24-h19-740xd,e24-h27-740xd,e24-h21-740xd,e24-h23-740xd,e24-h25-740xd (age 25m)
    mgr: e24-h19-740xd.nciiap(active, since 25m), standbys: e24-h27-740xd.qxkzyp
    mds: 1/1 daemons up, 1 standby
    osd: 41 osds: 41 up (since 25m), 41 in (since 8d)
         flags noout

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 81 pgs
    objects: 205 objects, 163 MiB
    usage:   11 GiB used, 72 TiB / 72 TiB avail
    pgs:     81 active+clean

  io:
    client: 341 B/s rd, 0 op/s rd, 0 op/s wr

-----
[ceph: root@e24-h19-740xd /]# ceph crash ls
ID                                                                ENTITY                    NEW
2023-08-19T04:59:53.987466Z_dda445ee-f57e-4030-932f-4e6e26956fe0  mgr.e24-h19-740xd.nciiap  *

--------
[ceph: root@e24-h19-740xd /]# ceph crash stat
1 crashes recorded

--------
# ceph health detail
HEALTH_WARN noout flag(s) set; 1 mgr modules have recently crashed
[WRN] OSDMAP_FLAGS: noout flag(s) set
[WRN] RECENT_MGR_MODULE_CRASH: 1 mgr modules have recently crashed
    mgr module cephadm crashed in daemon mgr.e24-h19-740xd.nciiap on host e24-h19-740xd.alias.bos.scalelab.redhat.com at 2023-08-19T04:59:53.987466Z

-----
# ceph log last cephadm
2023-08-19T19:54:51.329975+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 36 : cephadm [INF] Creating key for client.nfs.mycephnfs.2.1.e24-h27-740xd.mfelxg-rgw
2023-08-19T19:54:51.338148+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 37 : cephadm [WRN] Bind address in nfs.mycephnfs.2.1.e24-h27-740xd.mfelxg's ganesha conf is defaulting to empty
2023-08-19T19:54:51.340215+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 38 : cephadm [INF] Deploying daemon nfs.mycephnfs.2.1.e24-h27-740xd.mfelxg on e24-h27-740xd
2023-08-19T19:54:53.958261+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 40 : cephadm [INF] Creating key for client.nfs.mycephnfs.3.1.e24-h25-740xd.afegrf
2023-08-19T19:54:53.964859+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 41 : cephadm [INF] Ensuring nfs.mycephnfs.3 is in the ganesha grace table
2023-08-19T19:54:54.236015+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 42 : cephadm [INF] Rados config object exists: conf-nfs.mycephnfs
2023-08-19T19:54:54.236192+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 43 : cephadm [INF] Creating key for client.nfs.mycephnfs.3.1.e24-h25-740xd.afegrf-rgw
2023-08-19T19:54:54.244302+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 44 : cephadm [WRN] Bind address in nfs.mycephnfs.3.1.e24-h25-740xd.afegrf's ganesha conf is defaulting to empty
2023-08-19T19:54:54.246669+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 45 : cephadm [INF] Deploying daemon nfs.mycephnfs.3.1.e24-h25-740xd.afegrf on e24-h25-740xd
2023-08-19T19:55:09.427816+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 54 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h21-740xd.qsnpxj (dependencies changed)...
2023-08-19T19:55:09.472633+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 55 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h21-740xd.qsnpxj on e24-h21-740xd
2023-08-19T19:55:11.675656+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 57 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h23-740xd.ujyyxd (dependencies changed)...
2023-08-19T19:55:11.678865+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 58 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h23-740xd.ujyyxd on e24-h23-740xd
2023-08-19T19:55:13.577480+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 60 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h25-740xd.lvqthq (dependencies changed)...
2023-08-19T19:55:13.580526+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 61 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h25-740xd.lvqthq on e24-h25-740xd
2023-08-19T19:55:16.384571+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 63 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h27-740xd.fcesea (dependencies changed)...
2023-08-19T19:55:16.387773+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 64 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h27-740xd.fcesea on e24-h27-740xd
2023-08-19T20:05:47.138004+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 383 : cephadm [INF] Adjusting osd_memory_target on e24-h27-740xd to 17927M
2023-08-19T20:05:47.145235+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 384 : cephadm [INF] Adjusting osd_memory_target on e24-h23-740xd to 14399M
2023-08-19T20:05:47.691571+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 386 : cephadm [INF] Adjusting osd_memory_target on e24-h25-740xd to 16198M
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780