Description of problem:

After an upgrade from RHCS 3.x to 4.x, if the step to set the 'require-osd-release' flag to nautilus is missed, the environment may operate normally until a change is made. The symptoms are then catastrophic, with PGs stuck in peering, activating, or unknown states, among other issues, and are not easily traced back to this root cause.

Version-Release number of selected component (if applicable):
RHCS 4.x

How reproducible:
Untested beyond one occurrence, but expected to be 100%

Steps to Reproduce:
1. Install a cluster with RHCS 3.x
2. Upgrade the cluster manually to RHCS 4.x, but do not run the command: ceph osd require-osd-release nautilus
3. Add a new pool, or remove an OSD

Actual results:
-- Nothing, until a change is made; then chaos ensues --
* PGs stuck in peering, activating, or unknown states that never clear
* Cluster 'recovery' traffic shown in ceph -s is very low compared to the expected throughput
* Excessive Ceph process memory consumption and potential 'Out of Memory' kill events
* Ceph OSD log entries showing BADAUTHORIZER
* Ceph MGR log entries showing 'could not get service secret for service'
* Ceph cluster logs and ceph -s showing a high volume of 'slow request' entries of type delayed, queued for pg, or started

Expected results:
-- A warning in 'ceph -s' and any GUI / WebUI indicating that a mismatch is present.
-- The warning should appear once all OSDs are upgraded to the new release version, and persist until the 'require-osd-release' value is set properly.
-- Due to the severity of the symptoms, and the 'time bomb' nature of the problem (the cluster operates normally until a change is made), it may even be appropriate to put the cluster into 'HEALTH_ERR' state, not just HEALTH_WARN.

Additional info:
See KCS https://access.redhat.com/solutions/6228431. A sketch of the missed command and the related checks is shown below.
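For reference only, a minimal command sketch of the missed step and the checks around it, assuming it is run from a node with an admin keyring; the release name nautilus corresponds to the 3.x to 4.x upgrade described in this report:

# Confirm that every OSD already reports the new release before setting the flag
ceph versions

# Check the current value of the flag
ceph osd dump | grep require_osd_release

# The step that was missed: the final step of the RHCS 3.x -> 4.x upgrade
ceph osd require-osd-release nautilus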
This issue was verified while upgrading from 4.3 [ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)] to 5.2 [ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)]. While the upgrade was in progress and the OSDs had already been upgraded to pacific, the warning showed up in ceph status:

[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph versions
{
    "mon": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 12
    },
    "mds": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)": 2
    },
    "rgw-nfs": {
        "ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)": 3,
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 19
    }
}

[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph osd dump | grep require_osd_release
require_osd_release nautilus

[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph status
  cluster:
    id:     85dba4a0-4e42-403f-b180-96bc8dab5f64
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            all OSDs are running pacific or later but require_osd_release < pacific
            2 pools have too few placement groups

  services:
    mon:     3 daemons, quorum ceph-relosd-q0ejl3-node3,ceph-relosd-q0ejl3-node2,ceph-relosd-q0ejl3-node1-installer (age 10m)
    mgr:     ceph-relosd-q0ejl3-node1-installer(active, since 7m), standbys: ceph-relosd-q0ejl3-node2, ceph-relosd-q0ejl3-node3
    mds:     1/1 daemons up
    osd:     12 osds: 12 up (since 2m), 12 in (since 3d)
    rgw:     4 daemons active (2 hosts, 1 zones)
    rgw-nfs: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 153 pgs
    objects: 251 objects, 15 KiB
    usage:   498 MiB used, 239 GiB / 240 GiB avail
    pgs:     153 active+clean

  io:
    client:   153 KiB/s rd, 0 B/s wr, 152 op/s rd, 88 op/s wr

[root@ceph-relosd-q0ejl3-node1-installer cephuser]#
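While the cluster is in this mixed state, the same warning can also be surfaced explicitly; a small illustrative sketch of the checks (the exact warning text is the one captured in the ceph status output above):

# Both views list the warning while the OSDs run pacific but
# require_osd_release still reads nautilus:
#   "all OSDs are running pacific or later but require_osd_release < pacific"
ceph health detail
ceph -s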
By the time the cluster was fully upgraded to 5.2, the warning had disappeared and the require_osd_release flag had been set to pacific:

[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph osd dump | grep require_osd_release
require_osd_release pacific

[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph status
  cluster:
    id:     85dba4a0-4e42-403f-b180-96bc8dab5f64
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            1 pools have too few placement groups

  services:
    mon:     3 daemons, quorum ceph-relosd-q0ejl3-node3,ceph-relosd-q0ejl3-node2,ceph-relosd-q0ejl3-node1-installer (age 15m)
    mgr:     ceph-relosd-q0ejl3-node1-installer(active, since 13m), standbys: ceph-relosd-q0ejl3-node2, ceph-relosd-q0ejl3-node3
    mds:     1/1 daemons up
    osd:     12 osds: 12 up (since 7m), 12 in (since 3d)
    rgw:     4 daemons active (2 hosts, 1 zones)
    rgw-nfs: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 177 pgs
    objects: 251 objects, 15 KiB
    usage:   528 MiB used, 239 GiB / 240 GiB avail
    pgs:     177 active+clean

  io:
    client:   2.5 KiB/s rd, 2 op/s rd, 0 op/s wr

[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph versions
{
    "mon": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 12
    },
    "mds": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 4
    },
    "rgw-nfs": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 24
    }
}
[root@ceph-relosd-q0ejl3-node1-installer cephuser]#
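In this run the upgrade appears to have raised the flag to pacific on its own. As a sketch only, the manual equivalent and follow-up checks, should the warning persist after all daemons report pacific, would be:

# Manual equivalent of the final step performed automatically here
ceph osd require-osd-release pacific

# Confirm the flag and that the warning has cleared
ceph osd dump | grep require_osd_release
ceph health detail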
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage Security, Bug Fix, and Enhancement Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5997