Bug 2232877 - One of the Ceph mgr modules crashed while performing a host OS upgrade
Summary: One of the Ceph mgr modules crashed while performing a host OS upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: Cephadm
Version: 7.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 7.0
Assignee: Adam King
QA Contact: Mohit Bisht
Docs Contact: Rivka Pollack
URL:
Whiteboard:
Depends On:
Blocks: 2237662
 
Reported: 2023-08-19 20:32 UTC by Manisha Saini
Modified: 2023-12-13 15:22 UTC
CC List: 5 users

Fixed In Version: ceph-18.2.0-1
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-12-13 15:22:07 UTC
Embargoed:




Links
- GitHub ceph/ceph pull 51963 (open): mgr/cephadm: fixups for asyncio based timeout (last updated 2023-08-30 17:49:25 UTC)
- Red Hat Issue Tracker RHCEPH-7237 (last updated 2023-08-19 20:33:07 UTC)
- Red Hat Product Errata RHBA-2023:7780 (last updated 2023-12-13 15:22:11 UTC)

Description Manisha Saini 2023-08-19 20:32:12 UTC
Description of problem:
=========

One of the mgr daemons crashed while performing a host OS upgrade from RHEL 8.8 to RHEL 9.2 on a cluster running Ceph 7.0. All nodes were upgraded successfully, but after reboot ceph -s showed one crashed mgr module.

# ceph crash info 2023-08-19T04:59:53.987466Z_dda445ee-f57e-4030-932f-4e6e26956fe0
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/cephadm/module.py\", line 698, in serve\n    serve.serve()",
        "  File \"/usr/share/ceph/mgr/cephadm/serve.py\", line 93, in serve\n    if self._apply_all_services():",
        "  File \"/usr/share/ceph/mgr/cephadm/serve.py\", line 573, in _apply_all_services\n    self.mgr.tuned_profile_utils._write_all_tuned_profiles()",
        "  File \"/usr/share/ceph/mgr/cephadm/tuned_profiles.py\", line 46, in _write_all_tuned_profiles\n    self._remove_stray_tuned_profiles(host, profiles)",
        "  File \"/usr/share/ceph/mgr/cephadm/tuned_profiles.py\", line 73, in _remove_stray_tuned_profiles\n    found_files = self.mgr.ssh.check_execute_command(host, cmd, log_command=self.mgr.log_refresh_metadata).split('\\n')",
        "  File \"/usr/share/ceph/mgr/cephadm/ssh.py\", line 228, in check_execute_command\n    return self.mgr.wait_async(self._check_execute_command(host, cmd, stdin, addr, log_command))",
        "  File \"/usr/share/ceph/mgr/cephadm/module.py\", line 710, in wait_async\n    return self.event_loop.get_result(coro, timeout)",
        "  File \"/usr/share/ceph/mgr/cephadm/ssh.py\", line 63, in get_result\n    return future.result(timeout)",
        "  File \"/lib64/python3.9/concurrent/futures/_base.py\", line 448, in result\n    raise TimeoutError()",
        "concurrent.futures._base.TimeoutError"
    ],
    "ceph_version": "18.1.2-1.el9cp",
    "crash_id": "2023-08-19T04:59:53.987466Z_dda445ee-f57e-4030-932f-4e6e26956fe0",
    "entity_name": "mgr.e24-h19-740xd.nciiap",
    "mgr_module": "cephadm",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "TimeoutError",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-mgr",
    "stack_sig": "e11ab31c9d0ed5cece13917ebab1f612e4ca89481fe319aa0f59d4af9065ab1e",
    "timestamp": "2023-08-19T04:59:53.987466Z",
    "utsname_hostname": "e24-h19-740xd.alias.bos.scalelab.redhat.com",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-477.21.1.el8_8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Thu Jul 20 08:38:27 EDT 2023"
}
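
For context, the following is a minimal, self-contained Python sketch (hypothetical names, not the actual cephadm source) of the pattern the backtrace points at: cephadm runs an asyncio event loop in a background thread, and the synchronous serve loop blocks on coroutines such as remote SSH commands through a thread-safe future. If the coroutine does not finish within the timeout, future.result(timeout) raises concurrent.futures.TimeoutError, and because nothing on this particular path catches it, the exception escapes serve() and is recorded as the mgr module crash shown above.

import asyncio
import concurrent.futures
import threading


class EventLoopThread:
    # Runs an asyncio loop in a daemon thread, similar in spirit to the
    # event loop cephadm uses for its SSH connections.
    def __init__(self) -> None:
        self.loop = asyncio.new_event_loop()
        threading.Thread(target=self.loop.run_forever, daemon=True).start()

    def get_result(self, coro, timeout: float):
        # Schedule the coroutine on the background loop and block the calling
        # thread until it finishes or the timeout expires.
        future = asyncio.run_coroutine_threadsafe(coro, self.loop)
        return future.result(timeout)  # raises concurrent.futures.TimeoutError


async def slow_remote_command() -> str:
    # Stand-in for an SSH command that stalls around the host reboot.
    await asyncio.sleep(10)
    return "ok"


def serve(event_loop: EventLoopThread) -> None:
    # No try/except around the wait: a timeout here propagates all the way out
    # of serve(), which is how the cephadm module ends up flagged as crashed.
    print(event_loop.get_result(slow_remote_command(), timeout=1))


if __name__ == "__main__":
    try:
        serve(EventLoopThread())
    except concurrent.futures.TimeoutError:
        print("remote command timed out; in the bug this went unhandled")

The upstream pull request linked above ("mgr/cephadm: fixups for asyncio based timeout") addresses this asyncio-based timeout handling.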



Version-Release number of selected component (if applicable):
==========
# ceph --version
ceph version 18.1.2-1.el9cp (fda3f82ca4e81cdea0ca82061f107dbc4e84fb00) reef (rc)

# rpm -qa | grep ceph
cephadm-18.1.2-1.el9cp.noarch

How reproducible:
========
1/1


Steps to Reproduce:
===========
1. Perform a host OS upgrade from RHEL 8.8 to RHEL 9.2 on a cluster running Ceph 7.0


Actual results:
=======
The upgrade completed successfully, but one mgr module crash was observed after the upgrade.


Expected results:
======
No crash should be reported.


Additional info:
============

# ceph -s
  cluster:
    id:     5742cad8-37be-11ee-87b6-246e96c2b086
    health: HEALTH_WARN
            noout flag(s) set
            1 mgr modules have recently crashed
 
  services:
    mon: 5 daemons, quorum e24-h19-740xd,e24-h27-740xd,e24-h21-740xd,e24-h23-740xd,e24-h25-740xd (age 25m)
    mgr: e24-h19-740xd.nciiap(active, since 25m), standbys: e24-h27-740xd.qxkzyp
    mds: 1/1 daemons up, 1 standby
    osd: 41 osds: 41 up (since 25m), 41 in (since 8d)
         flags noout
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 81 pgs
    objects: 205 objects, 163 MiB
    usage:   11 GiB used, 72 TiB / 72 TiB avail
    pgs:     81 active+clean
 
  io:
    client:   341 B/s rd, 0 op/s rd, 0 op/s wr



-----
[ceph: root@e24-h19-740xd /]# ceph crash ls
ID                                                                ENTITY                    NEW  
2023-08-19T04:59:53.987466Z_dda445ee-f57e-4030-932f-4e6e26956fe0  mgr.e24-h19-740xd.nciiap   *   

--------
[ceph: root@e24-h19-740xd /]# ceph crash stat
1 crashes recorded

--------

# ceph health detail
HEALTH_WARN noout flag(s) set; 1 mgr modules have recently crashed
[WRN] OSDMAP_FLAGS: noout flag(s) set
[WRN] RECENT_MGR_MODULE_CRASH: 1 mgr modules have recently crashed
    mgr module cephadm crashed in daemon mgr.e24-h19-740xd.nciiap on host e24-h19-740xd.alias.bos.scalelab.redhat.com at 2023-08-19T04:59:53.987466Z

-----
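
For reference, once a one-off crash like this has been triaged, the RECENT_MGR_MODULE_CRASH warning can be cleared by archiving the crash report with the standard crash commands (the report stays available via ceph crash ls, it just no longer counts as new):

# ceph crash archive 2023-08-19T04:59:53.987466Z_dda445ee-f57e-4030-932f-4e6e26956fe0
# ceph crash archive-all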


# ceph log last cephadm
2023-08-19T19:54:51.329975+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 36 : cephadm [INF] Creating key for client.nfs.mycephnfs.2.1.e24-h27-740xd.mfelxg-rgw
2023-08-19T19:54:51.338148+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 37 : cephadm [WRN] Bind address in nfs.mycephnfs.2.1.e24-h27-740xd.mfelxg's ganesha conf is defaulting to empty
2023-08-19T19:54:51.340215+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 38 : cephadm [INF] Deploying daemon nfs.mycephnfs.2.1.e24-h27-740xd.mfelxg on e24-h27-740xd
2023-08-19T19:54:53.958261+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 40 : cephadm [INF] Creating key for client.nfs.mycephnfs.3.1.e24-h25-740xd.afegrf
2023-08-19T19:54:53.964859+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 41 : cephadm [INF] Ensuring nfs.mycephnfs.3 is in the ganesha grace table
2023-08-19T19:54:54.236015+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 42 : cephadm [INF] Rados config object exists: conf-nfs.mycephnfs
2023-08-19T19:54:54.236192+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 43 : cephadm [INF] Creating key for client.nfs.mycephnfs.3.1.e24-h25-740xd.afegrf-rgw
2023-08-19T19:54:54.244302+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 44 : cephadm [WRN] Bind address in nfs.mycephnfs.3.1.e24-h25-740xd.afegrf's ganesha conf is defaulting to empty
2023-08-19T19:54:54.246669+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 45 : cephadm [INF] Deploying daemon nfs.mycephnfs.3.1.e24-h25-740xd.afegrf on e24-h25-740xd
2023-08-19T19:55:09.427816+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 54 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h21-740xd.qsnpxj (dependencies changed)...
2023-08-19T19:55:09.472633+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 55 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h21-740xd.qsnpxj on e24-h21-740xd
2023-08-19T19:55:11.675656+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 57 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h23-740xd.ujyyxd (dependencies changed)...
2023-08-19T19:55:11.678865+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 58 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h23-740xd.ujyyxd on e24-h23-740xd
2023-08-19T19:55:13.577480+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 60 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h25-740xd.lvqthq (dependencies changed)...
2023-08-19T19:55:13.580526+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 61 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h25-740xd.lvqthq on e24-h25-740xd
2023-08-19T19:55:16.384571+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 63 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h27-740xd.fcesea (dependencies changed)...
2023-08-19T19:55:16.387773+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 64 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h27-740xd.fcesea on e24-h27-740xd
2023-08-19T20:05:47.138004+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 383 : cephadm [INF] Adjusting osd_memory_target on e24-h27-740xd to 17927M
2023-08-19T20:05:47.145235+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 384 : cephadm [INF] Adjusting osd_memory_target on e24-h23-740xd to 14399M
2023-08-19T20:05:47.691571+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 386 : cephadm [INF] Adjusting osd_memory_target on e24-h25-740xd to 16198M

Comment 7 errata-xmlrpc 2023-12-13 15:22:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780

