Bug 2232877

Summary: One of the ceph mgr modules crashed while performing a host OS upgrade
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Manisha Saini <msaini>
Component: Cephadm
Assignee: Adam King <adking>
Status: CLOSED ERRATA
QA Contact: Mohit Bisht <mobisht>
Severity: high
Docs Contact: Rivka Pollack <rpollack>
Priority: unspecified
Version: 7.0
CC: adking, akraj, cephqe-warriors, kdreyer, rpollack
Target Release: 7.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ceph-18.2.0-1
Doc Type: No Doc Update
Type: Bug
Last Closed: 2023-12-13 15:22:07 UTC
Bug Blocks: 2237662

Description Manisha Saini 2023-08-19 20:32:12 UTC
Description of problem:
=========

One of the mgr daemons crashed while performing a host OS upgrade from RHEL 8.8 to RHEL 9.2 on a cluster running Ceph 7.0. All nodes were upgraded successfully; after the reboot, ceph -s reports that 1 mgr module has recently crashed.

# ceph crash info 2023-08-19T04:59:53.987466Z_dda445ee-f57e-4030-932f-4e6e26956fe0
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/cephadm/module.py\", line 698, in serve\n    serve.serve()",
        "  File \"/usr/share/ceph/mgr/cephadm/serve.py\", line 93, in serve\n    if self._apply_all_services():",
        "  File \"/usr/share/ceph/mgr/cephadm/serve.py\", line 573, in _apply_all_services\n    self.mgr.tuned_profile_utils._write_all_tuned_profiles()",
        "  File \"/usr/share/ceph/mgr/cephadm/tuned_profiles.py\", line 46, in _write_all_tuned_profiles\n    self._remove_stray_tuned_profiles(host, profiles)",
        "  File \"/usr/share/ceph/mgr/cephadm/tuned_profiles.py\", line 73, in _remove_stray_tuned_profiles\n    found_files = self.mgr.ssh.check_execute_command(host, cmd, log_command=self.mgr.log_refresh_metadata).split('\\n')",
        "  File \"/usr/share/ceph/mgr/cephadm/ssh.py\", line 228, in check_execute_command\n    return self.mgr.wait_async(self._check_execute_command(host, cmd, stdin, addr, log_command))",
        "  File \"/usr/share/ceph/mgr/cephadm/module.py\", line 710, in wait_async\n    return self.event_loop.get_result(coro, timeout)",
        "  File \"/usr/share/ceph/mgr/cephadm/ssh.py\", line 63, in get_result\n    return future.result(timeout)",
        "  File \"/lib64/python3.9/concurrent/futures/_base.py\", line 448, in result\n    raise TimeoutError()",
        "concurrent.futures._base.TimeoutError"
    ],
    "ceph_version": "18.1.2-1.el9cp",
    "crash_id": "2023-08-19T04:59:53.987466Z_dda445ee-f57e-4030-932f-4e6e26956fe0",
    "entity_name": "mgr.e24-h19-740xd.nciiap",
    "mgr_module": "cephadm",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "TimeoutError",
    "os_id": "rhel",
    "os_name": "Red Hat Enterprise Linux",
    "os_version": "9.2 (Plow)",
    "os_version_id": "9.2",
    "process_name": "ceph-mgr",
    "stack_sig": "e11ab31c9d0ed5cece13917ebab1f612e4ca89481fe319aa0f59d4af9065ab1e",
    "timestamp": "2023-08-19T04:59:53.987466Z",
    "utsname_hostname": "e24-h19-740xd.alias.bos.scalelab.redhat.com",
    "utsname_machine": "x86_64",
    "utsname_release": "4.18.0-477.21.1.el8_8.x86_64",
    "utsname_sysname": "Linux",
    "utsname_version": "#1 SMP Thu Jul 20 08:38:27 EDT 2023"
}
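
For context, the traceback above is Python's standard cross-thread future pattern: the cephadm serve thread schedules a coroutine on a separate event-loop thread and blocks on future.result(timeout), which raises concurrent.futures.TimeoutError when the remote command (here, the SSH call that lists stray tuned-profile files) does not finish in time. A minimal, hedged sketch of that failure mode (not the actual cephadm code) is:

# Minimal sketch of the failure mode seen in the backtrace above.
# A coroutine is scheduled on a separate event-loop thread and the caller
# blocks on future.result(timeout); if the remote command stalls (e.g. the
# host is rebooting), result() raises concurrent.futures.TimeoutError.
import asyncio
import concurrent.futures
import threading

loop = asyncio.new_event_loop()
threading.Thread(target=loop.run_forever, daemon=True).start()

async def slow_remote_command() -> str:
    # Stand-in for an SSH call that stalls while the host is rebooting.
    await asyncio.sleep(10)
    return "list of tuned profile files"

future = asyncio.run_coroutine_threadsafe(slow_remote_command(), loop)
try:
    print(future.result(timeout=1))  # gives up after 1 second
except concurrent.futures.TimeoutError:
    # If this exception is not caught in the module's serve thread,
    # ceph-mgr records it as a module crash.
    print("remote command timed out")

An exception like this escaping the module's serve() thread is exactly what ceph-mgr records as a module crash, which matches the RECENT_MGR_MODULE_CRASH warning shown below.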



Version-Release number of selected component (if applicable):
==========
# ceph --version
ceph version 18.1.2-1.el9cp (fda3f82ca4e81cdea0ca82061f107dbc4e84fb00) reef (rc)

# rpm -qa | grep ceph
cephadm-18.1.2-1.el9cp.noarch

How reproducible:
========
1/1


Steps to Reproduce:
===========
1. Perform a host OS upgrade from RHEL 8.8 to RHEL 9.2 on a cluster running Ceph 7.0


Actual results:
=======
The upgrade completed successfully. After the upgrade, one mgr module crash was observed.


Expected results:
======
No crash should be reported.
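
For illustration only, and not necessarily what was changed in ceph-18.2.0-1: one defensive pattern that would satisfy this expectation is to treat a per-host timeout as a recoverable failure rather than letting the TimeoutError escape the serve loop. The write_profiles_on_host callable below is hypothetical and only stands in for the per-host tuned-profile write:

# Illustrative sketch only -- not the actual change shipped in ceph-18.2.0-1.
# Treat a timed-out remote command as a per-host failure instead of letting
# the TimeoutError escape the module's serve loop and crash it.
import concurrent.futures
import logging

log = logging.getLogger("cephadm.sketch")

def write_tuned_profiles_safely(hosts, write_profiles_on_host):
    """Apply tuned profiles host by host, surviving slow or rebooting hosts.

    write_profiles_on_host is a hypothetical callable that may raise
    concurrent.futures.TimeoutError when the SSH command stalls.
    """
    failures = {}
    for host in hosts:
        try:
            write_profiles_on_host(host)
        except concurrent.futures.TimeoutError as e:
            # Record and continue instead of crashing the whole module.
            log.warning("tuned profile write timed out on %s: %s", host, e)
            failures[host] = str(e)
    return failures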


Additional info:
============

# ceph -s
  cluster:
    id:     5742cad8-37be-11ee-87b6-246e96c2b086
    health: HEALTH_WARN
            noout flag(s) set
            1 mgr modules have recently crashed
 
  services:
    mon: 5 daemons, quorum e24-h19-740xd,e24-h27-740xd,e24-h21-740xd,e24-h23-740xd,e24-h25-740xd (age 25m)
    mgr: e24-h19-740xd.nciiap(active, since 25m), standbys: e24-h27-740xd.qxkzyp
    mds: 1/1 daemons up, 1 standby
    osd: 41 osds: 41 up (since 25m), 41 in (since 8d)
         flags noout
 
  data:
    volumes: 1/1 healthy
    pools:   4 pools, 81 pgs
    objects: 205 objects, 163 MiB
    usage:   11 GiB used, 72 TiB / 72 TiB avail
    pgs:     81 active+clean
 
  io:
    client:   341 B/s rd, 0 op/s rd, 0 op/s wr



-----
[ceph: root@e24-h19-740xd /]# ceph crash ls
ID                                                                ENTITY                    NEW  
2023-08-19T04:59:53.987466Z_dda445ee-f57e-4030-932f-4e6e26956fe0  mgr.e24-h19-740xd.nciiap   *   

--------
[ceph: root@e24-h19-740xd /]# ceph crash stat
1 crashes recorded

--------

# ceph health detail
HEALTH_WARN noout flag(s) set; 1 mgr modules have recently crashed
[WRN] OSDMAP_FLAGS: noout flag(s) set
[WRN] RECENT_MGR_MODULE_CRASH: 1 mgr modules have recently crashed
    mgr module cephadm crashed in daemon mgr.e24-h19-740xd.nciiap on host e24-h19-740xd.alias.bos.scalelab.redhat.com at 2023-08-19T04:59:53.987466Z
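
Once the crash has been triaged, the RECENT_MGR_MODULE_CRASH warning can be cleared by archiving the crash entry with ceph crash archive <crash-id> (or ceph crash archive-all). A small wrapper sketch around the CLI, assuming the ceph binary is on PATH and that ceph crash ls-new --format json returns a list of entries containing a crash_id field:

# Sketch: clear the RECENT_MGR_MODULE_CRASH warning by archiving triaged
# crashes. This only wraps the "ceph crash" CLI via subprocess.
import json
import subprocess

def list_new_crashes() -> list:
    """Return crash entries that have not yet been archived."""
    out = subprocess.run(["ceph", "crash", "ls-new", "--format", "json"],
                         check=True, capture_output=True, text=True).stdout
    return json.loads(out)

def archive_crash(crash_id: str) -> None:
    """Acknowledge a single crash so it no longer raises HEALTH_WARN."""
    subprocess.run(["ceph", "crash", "archive", crash_id], check=True)

for crash in list_new_crashes():
    print("archiving", crash["crash_id"])
    archive_crash(crash["crash_id"])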

-----


# ceph log last cephadm
2023-08-19T19:54:51.329975+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 36 : cephadm [INF] Creating key for client.nfs.mycephnfs.2.1.e24-h27-740xd.mfelxg-rgw
2023-08-19T19:54:51.338148+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 37 : cephadm [WRN] Bind address in nfs.mycephnfs.2.1.e24-h27-740xd.mfelxg's ganesha conf is defaulting to empty
2023-08-19T19:54:51.340215+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 38 : cephadm [INF] Deploying daemon nfs.mycephnfs.2.1.e24-h27-740xd.mfelxg on e24-h27-740xd
2023-08-19T19:54:53.958261+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 40 : cephadm [INF] Creating key for client.nfs.mycephnfs.3.1.e24-h25-740xd.afegrf
2023-08-19T19:54:53.964859+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 41 : cephadm [INF] Ensuring nfs.mycephnfs.3 is in the ganesha grace table
2023-08-19T19:54:54.236015+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 42 : cephadm [INF] Rados config object exists: conf-nfs.mycephnfs
2023-08-19T19:54:54.236192+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 43 : cephadm [INF] Creating key for client.nfs.mycephnfs.3.1.e24-h25-740xd.afegrf-rgw
2023-08-19T19:54:54.244302+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 44 : cephadm [WRN] Bind address in nfs.mycephnfs.3.1.e24-h25-740xd.afegrf's ganesha conf is defaulting to empty
2023-08-19T19:54:54.246669+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 45 : cephadm [INF] Deploying daemon nfs.mycephnfs.3.1.e24-h25-740xd.afegrf on e24-h25-740xd
2023-08-19T19:55:09.427816+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 54 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h21-740xd.qsnpxj (dependencies changed)...
2023-08-19T19:55:09.472633+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 55 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h21-740xd.qsnpxj on e24-h21-740xd
2023-08-19T19:55:11.675656+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 57 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h23-740xd.ujyyxd (dependencies changed)...
2023-08-19T19:55:11.678865+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 58 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h23-740xd.ujyyxd on e24-h23-740xd
2023-08-19T19:55:13.577480+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 60 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h25-740xd.lvqthq (dependencies changed)...
2023-08-19T19:55:13.580526+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 61 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h25-740xd.lvqthq on e24-h25-740xd
2023-08-19T19:55:16.384571+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 63 : cephadm [INF] Reconfiguring haproxy.nfs.mycephnfs.e24-h27-740xd.fcesea (dependencies changed)...
2023-08-19T19:55:16.387773+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 64 : cephadm [INF] Reconfiguring daemon haproxy.nfs.mycephnfs.e24-h27-740xd.fcesea on e24-h27-740xd
2023-08-19T20:05:47.138004+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 383 : cephadm [INF] Adjusting osd_memory_target on e24-h27-740xd to 17927M
2023-08-19T20:05:47.145235+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 384 : cephadm [INF] Adjusting osd_memory_target on e24-h23-740xd to 14399M
2023-08-19T20:05:47.691571+0000 mgr.e24-h19-740xd.nciiap (mgr.134353) 386 : cephadm [INF] Adjusting osd_memory_target on e24-h25-740xd to 16198M

Comment 7 errata-xmlrpc 2023-12-13 15:22:07 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Ceph Storage 7.0 Bug Fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:7780