Bug 1988773
| Field | Value |
|---|---|
| Summary | [RFE] Provide warning when the 'require-osd-release' flag does not match current release. |
| Product | [Red Hat Storage] Red Hat Ceph Storage |
| Component | RADOS |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Version | 4.2 |
| Target Milestone | --- |
| Target Release | 5.2 |
| Hardware | All |
| OS | Linux |
| Whiteboard | |
| Keywords | FutureFeature |
| Reporter | Michael J. Kidd <linuxkidd> |
| Assignee | Sridhar Seshasayee <sseshasa> |
| QA Contact | Tintu Mathew <tmathew> |
| Docs Contact | Akash Raj <akraj> |
| CC | akraj, akupczyk, bhubbard, ceph-eng-bugs, mmuench, ngangadh, nojha, pasik, pdhiran, rzarzyns, skanta, sseshasa, tserlin, vereddy, vumrao |
| Fixed In Version | ceph-16.2.8-2.el8cp |
| Doc Type | Bug Fix |
| Story Points | --- |
| Clone Of | |
| Clones | 2033078 (view as bug list) |
| Environment | |
| Last Closed | 2022-08-09 17:35:53 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 2033078, 2126050 |

Doc Text:

.Ceph cluster issues a health warning if the `require-osd-release` flag is not set to the appropriate release after a cluster upgrade.
Previously, the logic that detects a `require-osd-release` flag mismatch after an upgrade was inadvertently removed during a code refactoring effort. Because the warning was no longer raised in the `ceph -s` output after an upgrade, changes made to the cluster without setting the flag to the appropriate release resulted in issues such as placement groups (PGs) stuck in certain states, excessive Ceph process memory consumption, and slow requests, among others.
With this fix, the Ceph cluster issues a health warning if the `require-osd-release` flag is not set to the appropriate release after a cluster upgrade.
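As a quick reference for the behavior described in the Doc Text, the following is a minimal post-upgrade check an operator can run. The `ceph osd require-osd-release` command is the standard upstream command for updating the flag once all OSDs run the new release; it is shown here as general guidance and is not taken from this bug report's output.

# Check which release the daemons are running and what the flag is set to:
ceph versions
ceph osd dump | grep require_osd_release

# After all OSDs run the new release, set the flag to match (standard upstream
# command, shown as a reference; adjust the release name to your target release):
ceph osd require-osd-release pacific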
Description: Michael J. Kidd, 2021-08-01 01:50:18 UTC
This issue was verified while upgrading from 4.3 [ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)] to 5.2 [ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)].
While the upgrade was in progress and the OSDs had already been upgraded to pacific, the warning message appeared in `ceph status`.
[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph versions
{
    "mon": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 12
    },
    "mds": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)": 2
    },
    "rgw-nfs": {
        "ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)": 1
    },
    "overall": {
        "ceph version 14.2.22-110.el8cp (2e0d97dbe192cca7419bbf3f8ee6b7abb42965c4) nautilus (stable)": 3,
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 19
    }
}
[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph osd dump | grep require_osd_release
require_osd_release nautilus
[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph status
  cluster:
    id:     85dba4a0-4e42-403f-b180-96bc8dab5f64
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            all OSDs are running pacific or later but require_osd_release < pacific
            2 pools have too few placement groups

  services:
    mon:     3 daemons, quorum ceph-relosd-q0ejl3-node3,ceph-relosd-q0ejl3-node2,ceph-relosd-q0ejl3-node1-installer (age 10m)
    mgr:     ceph-relosd-q0ejl3-node1-installer(active, since 7m), standbys: ceph-relosd-q0ejl3-node2, ceph-relosd-q0ejl3-node3
    mds:     1/1 daemons up
    osd:     12 osds: 12 up (since 2m), 12 in (since 3d)
    rgw:     4 daemons active (2 hosts, 1 zones)
    rgw-nfs: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 153 pgs
    objects: 251 objects, 15 KiB
    usage:   498 MiB used, 239 GiB / 240 GiB avail
    pgs:     153 active+clean

  io:
    client:  153 KiB/s rd, 0 B/s wr, 152 op/s rd, 88 op/s wr

[root@ceph-relosd-q0ejl3-node1-installer cephuser]#
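At this point in the upgrade, the mismatch shows up only as a HEALTH_WARN summary line. If more detail is wanted, `ceph health detail` (a standard Ceph command; its output was not captured in this report) lists each active health check with its full message:

ceph health detail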
By the time the cluster was fully upgraded to 5.2, the warning had disappeared and the require_osd_release flag was set to pacific.
[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph osd dump | grep require_osd_release
require_osd_release pacific
[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph status
  cluster:
    id:     85dba4a0-4e42-403f-b180-96bc8dab5f64
    health: HEALTH_WARN
            mons are allowing insecure global_id reclaim
            1 pools have too few placement groups

  services:
    mon:     3 daemons, quorum ceph-relosd-q0ejl3-node3,ceph-relosd-q0ejl3-node2,ceph-relosd-q0ejl3-node1-installer (age 15m)
    mgr:     ceph-relosd-q0ejl3-node1-installer(active, since 13m), standbys: ceph-relosd-q0ejl3-node2, ceph-relosd-q0ejl3-node3
    mds:     1/1 daemons up
    osd:     12 osds: 12 up (since 7m), 12 in (since 3d)
    rgw:     4 daemons active (2 hosts, 1 zones)
    rgw-nfs: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   8 pools, 177 pgs
    objects: 251 objects, 15 KiB
    usage:   528 MiB used, 239 GiB / 240 GiB avail
    pgs:     177 active+clean

  io:
    client:  2.5 KiB/s rd, 2 op/s rd, 0 op/s wr

[root@ceph-relosd-q0ejl3-node1-installer cephuser]# ceph versions
{
    "mon": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 3
    },
    "osd": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 12
    },
    "mds": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 1
    },
    "rgw": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 4
    },
    "rgw-nfs": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.8-65.el8cp (79f0367338897c8c6d9805eb8c9ad24af0dcd9c7) pacific (stable)": 24
    }
}
[root@ceph-relosd-q0ejl3-node1-installer cephuser]#
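In this verification run the orchestrated upgrade set the flag automatically once all daemons were on pacific. For the scenario this RFE targets, where the flag is left behind after an upgrade, a minimal manual remediation sketch follows; the `ceph osd require-osd-release` command is the standard upstream command and is included as general guidance rather than output from this report:

# Set the flag to the release the OSDs are now running, then confirm the
# health warning clears (adjust the release name to your target release):
ceph osd require-osd-release pacific
ceph osd dump | grep require_osd_release
ceph status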
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage Security, Bug Fix, and Enhancement Update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5997