Created attachment 1865217 [details]
alerts full osd

Description of problem (please be as detailed as possible and provide log snippets):
I ran an e2e system test. The cluster reaches full state [85%], then we add capacity. The status of the cephcluster is HEALTH_ERR, while "ceph status" [via tool pod] reports HEALTH_OK. After I restarted the rook-ceph-operator, the cephcluster moved to HEALTH_OK.

Version of all relevant components (if applicable):
OCP Version: 4.10.0-0.nightly-2022-02-26-230022
ODF Version: 4.10.0-169
LSO Version: local-storage-operator.4.11.0-202202260148
Ceph version:
sh-4.4$ ceph versions
{
    "mon": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 3
    },
    "mgr": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 1
    },
    "osd": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 7
    },
    "mds": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 1
    },
    "overall": {
        "ceph version 16.2.7-71.el8cp (4c975536861fc39c429045d66a6dba5a00753b9f) pacific (stable)": 14
    }
}

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Run FIO job
2. Verify the cluster reaches full state [85%]
sh-4.4$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    300 GiB   42 GiB  258 GiB   258 GiB      85.90
TOTAL  300 GiB   42 GiB  258 GiB   258 GiB      85.90

--- POOLS ---
POOL                              ID  PGS  STORED  OBJECTS  USED     %USED   MAX AVAIL
ocs-storagecluster-cephblockpool   1   32  85 GiB   22.45k  255 GiB  100.00        0 B
3. Check alerts on the UI [CephClusterNearFull, CephClusterCriticallyFull]
4. Add capacity and delete data
5. Check "ceph df":
sh-4.4$ ceph df
--- RAW STORAGE ---
CLASS     SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd    600 GiB  497 GiB  103 GiB   103 GiB      17.10
TOTAL  600 GiB  497 GiB  103 GiB   103 GiB      17.10

--- POOLS ---
POOL                              ID  PGS  STORED  OBJECTS  USED    %USED  MAX AVAIL
ocs-storagecluster-cephblockpool   1   32  33 GiB    8.81k  99 GiB  20.32    130 GiB
6. Wait 2 hours
7. Check the storage cluster status:
$ oc get storageclusters.ocs.openshift.io
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   46h   Ready              2022-02-28T10:33:55Z   4.10.0
8. Check ceph status via tool pod:
sh-4.4$ ceph status
  cluster:
    id:     471879f4-9b79-44f9-8c60-c69490b2276f
    health: HEALTH_OK
9. Storage cluster stuck in Error state [attached screenshot]
10. Check cephcluster status:
$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE    PHASE   MESSAGE                        HEALTH       EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          2d1h   Ready   Cluster created successfully   HEALTH_ERR
11. Restart the rook-ceph-operator pod:
$ oc delete pod rook-ceph-operator-8dfb969cb-7kgfk
pod "rook-ceph-operator-8dfb969cb-7kgfk" deleted
12. cephcluster moved to OK state:
$ oc get cephcluster
NAME                             DATADIRHOSTPATH   MONCOUNT   AGE    PHASE   MESSAGE                        HEALTH      EXTERNAL
ocs-storagecluster-cephcluster   /var/lib/rook     3          2d2h   Ready   Cluster created successfully   HEALTH_OK

Actual results:
The status of the cephcluster CR and "ceph status" [via tool pod] differ.
Expected results:
The status of the cephcluster CR and "ceph status" [via tool pod] should be the same.

Additional info:
OCS MG: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2062853.tar.gz
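For anyone triaging, a minimal sketch of comparing the two status sources side by side; the openshift-storage namespace, the rook-ceph-tools deployment name, and the .status.ceph.health field path are assumptions based on a typical ODF/Rook deployment and may need adjusting:

# health as last written to the CephCluster CR by the operator
$ oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
    -o jsonpath='{.status.ceph.health}{"\n"}'
# live health reported by Ceph itself via the toolbox
$ oc -n openshift-storage rsh deploy/rook-ceph-tools ceph health

In the scenario above, the first command kept returning HEALTH_ERR while the second returned HEALTH_OK until the operator was restarted.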
Oded, it seems I can't access the must-gather. Was the must-gather collected before restarting the rook operator? Otherwise the outdated state of the operator and the cephcluster CR will be lost. If you don't see any errors in the operator log about failing to update the cephcluster CR status, it may help to enable debug logging for the operator: in the rook-config-overrides configmap, set ROOK_LOG_LEVEL: DEBUG. The once-per-minute status updates are logged only at debug level unless there is an error with the update. The ceph status on the CR should be at most a minute behind what the toolbox shows, so two hours was plenty of time to wait!
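For example, a rough sketch of enabling that with oc; the configmap name from the previous sentence and the openshift-storage namespace are assumptions, so adjust them to match the deployment:

# set the operator log level to DEBUG
$ oc -n openshift-storage patch configmap rook-config-overrides \
    --type merge -p '{"data":{"ROOK_LOG_LEVEL":"DEBUG"}}'
# restart the operator pod if the new level does not take effect on its own
$ oc -n openshift-storage delete pod -l app=rook-ceph-operator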
Not a blocker for 4.10, at least until there is a repro.
Actually, this was fixed last week by another 4.10 BZ; I didn't realize previously that a fix was needed.

*** This bug has been marked as a duplicate of bug 2069795 ***