Created attachment 1865217 [details]
alerts full osd

Description of problem (please be as detailed as possible and provide log snippets):
I ran an e2e system test. The cluster reaches full state [85%], then we add capacity. The status of the cephcluster is HEALTH_ERR, while "ceph status" [via tool pod] reports HEALTH_OK. After I restarted the rook-ceph-operator, the cephcluster moved to HEALTH_OK.

Version of all relevant components (if applicable):
OCP Version: 4.10.0-0.nightly-2022-02-26-230022
ODF Version: 4.10.0-169
LSO Version: local-storage-operator.4.11.0-202202260148
Ceph version:
sh-4.4$ ceph versions
{
    "mon": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 7
    },
    "mds": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 14
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Run FIO job
2. Verify the cluster reaches full state [85%]
sh-4.4$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    300 GiB   42 GiB  258 GiB   258 GiB      85.90
TOTAL  300 GiB   42 GiB  258 GiB   258 GiB      85.90

--- POOLS ---
POOL                              ID  PGS  STORED  OBJECTS  USED     %USED   MAX AVAIL
ocs-storagecluster-cephblockpool   1   32  85 GiB   22.45k  255 GiB  100.00        0 B
3. Check alerts on the UI [CephClusterNearFull, CephClusterCriticallyFull]
4. Add capacity and delete data
5. Check "ceph df":
sh-4.4$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    600 GiB  497 GiB  103 GiB   103 GiB      17.10
TOTAL  600 GiB  497 GiB  103 GiB   103 GiB      17.10

--- POOLS ---
POOL                              ID  PGS  STORED  OBJECTS  USED    %USED  MAX AVAIL
ocs-storagecluster-cephblockpool   1   32  33 GiB    8.81k  99 GiB  20.32    130 GiB
6. Wait 2 hours
7. Check the storage cluster status:
$ oc get storageclusters.ocs.openshift.io
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   46h   Ready              2022-02-28T10:33:55Z   4.10.0
8. Check ceph status via tool pod:
sh-4.4$ ceph status
  cluster:
    id:     471879f4-9b79-44f9-8c60-c69490b2276f
    health: HEALTH_OK
9. Storage cluster stuck in Error state [attached screenshot]
10. Check cephcluster status:
$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE    PHASE   MESSAGE                        HEALTH       EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          2d1h   Ready   Cluster created successfully   HEALTH_ERR
11. Restart the rook-ceph-operator pod:
$ oc delete pod rook-ceph-operator-8dfb969cb-7kgfk
pod "rook-ceph-operator-8dfb969cb-7kgfk" deleted
12. cephcluster moved to OK state:
$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE    PHASE   MESSAGE                        HEALTH      EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          2d2h   Ready   Cluster created successfully   HEALTH_OK

Actual results:
The status of the cephcluster CR and "ceph status" [via tool pod] differ.
Expected results:
The status of the cephcluster CR and "ceph status" [via tool pod] should be the same.

Additional info:
OCS MG: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2062853.tar.gz
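For anyone triaging, a minimal sketch of comparing the two status sources side by side; the openshift-storage namespace, the rook-ceph-tools deployment name, and the .status.ceph.health field path are assumptions based on a typical ODF/Rook deployment and may need adjusting:

# health as last written to the CephCluster CR by the operator
$ oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
    -o jsonpath='{.status.ceph.health}{"\n"}'
# live health reported by Ceph itself via the toolbox
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health

In the scenario above, the first command kept returning HEALTH_ERR while the second returned HEALTH_OK until the operator was restarted.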
Oded, it seems I can't access the must-gather. Was the must-gather collected before restarting the rook operator? Otherwise the outdated state of the operator and the cephcluster CR will be lost. If you don't see any errors in the operator log about failing to update the cephcluster CR status, it may help to enable debug logging for the operator: in the rook-config-overrides configmap, set ROOK_LOG_LEVEL: DEBUG. The once-per-minute status updates are logged only at debug level unless there is an error with the update. The ceph status on the CR should be at most a minute behind what the toolbox shows, so two hours was plenty of time to wait!
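For example, a rough sketch of enabling that with oc; the configmap name from the previous sentence and the openshift-storage namespace are assumptions, so adjust them to match the deployment:

# set the operator log level to DEBUG
$ oc -n openshift-storage patch configmap rook-config-overrides \
    --type merge -p '{"data":{"ROOK_LOG_LEVEL":"DEBUG"}}'
# restart the operator pod if the new level does not take effect on its own
$ oc -n openshift-storage delete pod -l app=rook-ceph-operator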
Not a blocker for 4.10, at least until there is a repro.
Actually, this was fixed last week by another 4.10 BZ; I didn't realize previously that a fix was needed.

*** This bug has been marked as a duplicate of bug 2069795 ***