Bug 1988773 - [RFE] Provide warning when the 'require-osd-release' flag does not match current release.
Summary: [RFE] Provide warning when the 'require-osd-release' flag does not match current release.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat Storage
Component: RADOS
Version: 4.2
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 5.2
Assignee: Sridhar Seshasayee
QA Contact: Tintu Mathew
Docs Contact: Akash Raj
URL:
Whiteboard:
Depends On:
Blocks: 2033078 2126050
 
Reported: 2021-08-01 01:50 UTC by Michael J. Kidd
Modified: 2024-10-01 19:07 UTC
CC List: 15 users

Fixed In Version: ceph-16.2.8-2.el8cp
Doc Type: Bug Fix
Doc Text:
.Ceph cluster issues a health warning if the `require-osd-release` flag is not set to the appropriate release after a cluster upgrade
Previously, the logic that detects a `require-osd-release` flag mismatch after an upgrade was inadvertently removed during a code refactoring effort. Because the warning was no longer raised in the `ceph -s` output after an upgrade, any change made to the cluster without first setting the flag to the appropriate release resulted in issues such as placement groups (PGs) stuck in certain states, excessive Ceph process memory consumption, and slow requests, among others. With this fix, the Ceph cluster issues a health warning if the `require-osd-release` flag is not set to the appropriate release after a cluster upgrade.
Clone Of:
: 2033078 (view as bug list)
Environment:
Last Closed: 2022-08-09 17:35:53 UTC
Embargoed:




Links
Ceph Project Bug Tracker 51984 (last updated 2021-08-02)
Github ceph/ceph pull 44090: osd/OSDMap: Add health warning if 'require-osd-release' != current release (open, last updated 2021-11-29)
Github ceph/ceph pull 44259: pacific: osd/OSDMap: Add health warning if 'require-osd-release' != current release (open, last updated 2021-12-09)
Red Hat Bugzilla 1986221 (last updated 2024-10-01)
Red Hat Bugzilla 1988755 (last updated 2024-10-01)
Red Hat Issue Tracker RHCEPH-662 (last updated 2021-08-01)
Red Hat Knowledge Base (Solution) 6228431 (last updated 2021-08-01)
Red Hat Product Errata RHSA-2022:5997 (last updated 2022-08-09)

Description Michael J. Kidd 2021-08-01 01:50:18 UTC
Description of problem:
After an upgrade from RHCS 3.x to 4.x, if the step to set the 'require-osd-release' flag to nautilus is missed, the environment may operate normally until a change is made. The resulting symptoms are not easily traced back to the root cause, and they are catastrophic, with PGs stuck in peering, activating, or unknown states, among other issues.

Version-Release number of selected component (if applicable):
RHCS 4.x

How reproducible:
Reproduced only once so far, but expected to be 100%

Steps to Reproduce:
1. Install cluster with RHCS 3.x
2. Upgrade cluster manually to RHCS 4.x
  - Do not run the command: ceph osd require-osd-release nautilus (see the sketch after these steps)
3. Add a new pool, or remove an OSD
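
For reference, a minimal sketch of the flag check and of the command that step 2 deliberately skips, run once all OSDs are on 4.x (RHCS 3.x is based on luminous; the output shown is illustrative):

    # Confirm the flag was left at the pre-upgrade release (illustrative output):
    ceph osd dump | grep require_osd_release
    #   require_osd_release luminous
    # The post-upgrade step that must not be skipped once all OSDs run 4.x:
    ceph osd require-osd-release nautilus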

Actual results:
-- Nothing... until a change is made, then:
-- Chaos ensues --
 * PGs stuck in peering, activating, or unknown states that never clear
 * Cluster 'recovery' traffic shown in ceph -s is very low compared to the expected throughput.
 * Excessive Ceph process memory consumption and potential 'Out of Memory' kill events
 * Ceph OSD Log entries showing BADAUTHORIZER
 * Ceph MGR Log entries showing 'could not get service secret for service'
 * Ceph Cluster Logs and ceph -s show a high volume of 'slow request' entries of type delayed, queued for pg, or started (a spot-check sketch follows this list)
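
The symptoms above can be spot-checked from the command line; a rough sketch, assuming a default /var/log/ceph log location on the daemon hosts (the search strings come from the list above, the paths are an assumption):

    # OSD authentication failures (assumed default log path):
    grep -l BADAUTHORIZER /var/log/ceph/ceph-osd.*.log
    # MGR service secret errors (assumed default log path):
    grep -l "could not get service secret for service" /var/log/ceph/ceph-mgr.*.log
    # Slow request entries reported in cluster status:
    ceph -s | grep -i "slow request"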

Expected results:
-- A warning in 'ceph -s' and any GUI / WebUI indicating that a mismatch is present.
-- The warning should appear once all OSDs are upgraded to the new release version, and persist until the 'require-osd-release' value is set properly (a monitoring check sketch follows this list).
-- Due to the severity of the symptoms and the 'time bomb' nature of the problem (the cluster operates normally until a change is made), maybe even put the cluster into 'HEALTH_ERR' state, not just WARN.
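
Once such a warning is implemented, external monitoring could key off it; a hedged sketch (the warning string matches the one observed during verification in comment 8, but the exact wording may vary by release):

    # Alert when the flag lags behind the release the OSDs are running:
    ceph health detail | grep "require_osd_release <" \
        && echo "ALERT: require-osd-release does not match the running release"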

Additional info:
See KCS https://access.redhat.com/solutions/6228431

Comment 8 Tintu Mathew 2022-07-05 04:24:15 UTC
This issue was verified while upgrading from 4.3 [ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)] to 5.2 [ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)].

While the upgrade was in progress and the OSDs had been upgraded to pacific, the warning message showed up in the ceph status output.


[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph versions
{
    "mon": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 12
    },
    "mds": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)": 2
    },
    "rgw-nfs": {
        "ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)": 3,
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 19
    }
}
[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph osd dump | grep require_osd_release
require_osd_release nautilus
[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph status
  cluster:
    id:     85dba4a0-4e42-403f-b180-96bc8dab5f64
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            all OSDs are running pacific or later but require_osd_release < pacific
            2 pools have too few placement groups
 
  services:
    mon:     3 daemons, quorum ceph-relosd-q0ejl3-node3,ceph-relosd-q0ejl3-node2,ceph-relosd-q0ejl3-node1-installer (age 10m)
    mgr:     ceph-relosd-q0ejl3-node1-installer(active, since 7m), standbys: ceph-relosd-q0ejl3-node2, ceph-relosd-q0ejl3-node3
    mds:     1/1 daemons up
    osd:     12 osds: 12 up (since 2m), 12 in (since 3d)
    rgw:     4 daemons active (2 hosts, 1 zones)
    rgw-nfs: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   8 pools, 153 pgs
    objects: 251 objects, 15 KiB
    usage:   498 MiB used, 239 GiB / 240 GiB avail
    pgs:     153 active+clean
 
  io:
    client:   153 KiB/s rd, 0 B/s wr, 152 op/s rd, 88 op/s wr
 
[root@ceph-relosd-q0ejl3-node1-installer cephuser]# 

By the time the cluster was fully upgraded to 5.2, the warning had disappeared and the require_osd_release flag was set to pacific.

[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph osd dump | grep require_osd_release
require_osd_release pacific
[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph status
  cluster:
    id:     85dba4a0-4e42-403f-b180-96bc8dab5f64
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            1 pools have too few placement groups
 
  services:
    mon:     3 daemons, quorum ceph-relosd-q0ejl3-node3,ceph-relosd-q0ejl3-node2,ceph-relosd-q0ejl3-node1-installer (age 15m)
    mgr:     ceph-relosd-q0ejl3-node1-installer(active, since 13m), standbys: ceph-relosd-q0ejl3-node2, ceph-relosd-q0ejl3-node3
    mds:     1/1 daemons up
    osd:     12 osds: 12 up (since 7m), 12 in (since 3d)
    rgw:     4 daemons active (2 hosts, 1 zones)
    rgw-nfs: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   8 pools, 177 pgs
    objects: 251 objects, 15 KiB
    usage:   528 MiB used, 239 GiB / 240 GiB avail
    pgs:     177 active+clean
 
  io:
    client:   2.5 KiB/s rd, 2 op/s rd, 0 op/s wr
 
[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph versions
{
    "mon": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 12
    },
    "mds": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 4
    },
    "rgw-nfs": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 24
    }
}
[root@ceph-relosd-q0ejl3-node1-installer cephuser]#

Comment 14 errata-xmlrpc 2022-08-09 17:35:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage Security, Bug Fix, and Enhancement Update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5997

