Bug 2277603 - Ceph health is going to Error state on ODF 4.15.2 on IBM Power [NEEDINFO]
Summary: Ceph health is going to Error state on ODF 4.15.2 on IBM Power
Keywords:
Status: NEW
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ceph
Version: 4.15
Hardware: ppc64le
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Radoslaw Zarzynski
QA Contact: Elad
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-04-28 14:21 UTC by Pooja Soni
Modified: 2024-05-21 08:45 UTC
CC List: 10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Embargoed:
Flags: kramdoss: needinfo? (akupczyk)



Description Pooja Soni 2024-04-28 14:21:02 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
While running the Tier1 test suite on ODF 4.15.2, Ceph health goes into an error state.
sh-5.1$ ceph health
HEALTH_ERR 1/654 objects unfound (0.153%); 7 scrub errors; Possible data damage: 1 pg recovery_unfound, 4 pgs inconsistent; Degraded data redundancy: 3/1962 objects degraded (0.153%), 1 pg degraded; 3 slow ops, oldest one blocked for 265101 sec, daemons [osd.1,osd.2] have slow ops.
sh-5.1$
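
For triage, a few read-only commands from the rook-ceph toolbox can map each HEALTH_ERR item to concrete PGs and objects. This is a minimal sketch using the standard Ceph CLI; the pool name below is the ODF default for the RBD block pool and is an assumption, not taken from this report:

sh-5.1$ ceph health detail                                             # expands each HEALTH_ERR item to the affected PG IDs
sh-5.1$ rados list-inconsistent-pg ocs-storagecluster-cephblockpool   # assumed default ODF pool name; substitute yours
sh-5.1$ rados list-inconsistent-obj <pgid> --format=json-pretty       # per-object detail for the scrub errors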

Version of all relevant components (if applicable):
ODF version - 4.15.2
OCP version - 4.15.2

Does this issue impact your ability to continue to work with the product
(please explain in detail what the user impact is)?


Is there any workaround available to the best of your knowledge?
We can skip the test case and continue executing the other test cases.

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Deploy ODF 4.15.2 and execute the Tier1 test suite (invocation sketch below).
2. During execution of the test_selinux_relabel_for_existing_pvc[5] test case, Ceph health goes into an error state.
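
For reference, a minimal sketch of how the suite is typically invoked with ocs-ci; the exact flags, config file, and cluster path here are assumptions, not taken from this report:

$ git clone https://github.com/red-hat-storage/ocs-ci && cd ocs-ci
$ pip install -r requirements.txt                       # assumed setup step
$ run-ci -m tier1 --cluster-path <cluster-dir> --ocsci-conf <conf.yaml> \
      tests/cross_functional/kcs/test_selinux_relabel_solution.py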


Actual results:


Expected results:


Additional info:

Comment 4 Santosh Pillai 2024-04-29 04:18:38 UTC
Hi, 
What is Tier1?
What operations were performed on the cluster before it went to the current state?

Comment 5 Pooja Soni 2024-04-29 06:28:44 UTC
Tier1 is the test suite executed as part of the zstream 4.15.2 release, and the test case below is part of that suite:
test_selinux_relabel_for_existing_pvc[5]

After running the test_selinux_relabel_for_existing_pvc[5] test case we see this issue. Link to the test case: https://github.com/red-hat-storage/ocs-ci/blob/6de27377af27d626991b2b0b590f534a91a81400/tests/cross_functional/kcs/test_selinux_relabel_solution.py#L227

This test case creates a PVC, attaches it to a pod, creates multiple directories with files, and applies SELinux relabeling (a standalone sketch of this workload follows below). The issue is seen on multiple clusters with ODF 4.15.2 installed.
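
Not the test code itself, but a hypothetical standalone sketch of the same workload shape. The namespace, resource names, size, and the CephFS storage class are all assumptions:

$ cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: relabel-test-pvc               # hypothetical name
  namespace: openshift-storage
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
  storageClassName: ocs-storagecluster-cephfs   # assumed ODF CephFS class
---
apiVersion: v1
kind: Pod
metadata:
  name: relabel-test-pod               # hypothetical name
  namespace: openshift-storage
spec:
  securityContext:
    seLinuxOptions:
      level: "s0:c26,c0"               # triggers SELinux relabeling of the volume on mount
  containers:
  - name: writer
    image: registry.access.redhat.com/ubi9/ubi
    command: ["/bin/sh", "-c"]
    # create many small files so the recursive relabel has real work to do
    args: ["for d in $(seq 1 100); do mkdir -p /data/d$d; for f in $(seq 1 100); do echo x > /data/d$d/f$f; done; done; sleep infinity"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: relabel-test-pvc
EOF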

Comment 6 Santosh Pillai 2024-04-29 09:30:46 UTC
(In reply to Pooja Soni from comment #5)
> Tier1 is the test suite executed as part of the zstream 4.15.2 release, and
> the test case below is part of that suite:
> test_selinux_relabel_for_existing_pvc[5]
> 
> After running the test_selinux_relabel_for_existing_pvc[5] test case we see
> this issue. Link to the test case:
> https://github.com/red-hat-storage/ocs-ci/blob/6de27377af27d626991b2b0b590f534a91a81400/tests/cross_functional/kcs/test_selinux_relabel_solution.py#L227
> 
> This test case creates a PVC, attaches it to a pod, creates multiple
> directories with files, and applies SELinux relabeling. The issue is seen on
> multiple clusters with ODF 4.15.2 installed.

Thanks for the details. 
Do you have a live cluster that I can take a look at?

Comment 7 Santosh Pillai 2024-04-29 11:33:40 UTC
Got a live cluster from Aaruni. I didn't see any issues at the Rook level, but the osd.0 pod is crashing.

Hi Radoslaw, 
Can you take a look at the OSD pod crashing?

Thanks.
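
For whoever picks this up, the usual first steps to pull the crash details (standard oc and Ceph CLI commands; the pod name and crash ID are placeholders):

$ oc -n openshift-storage logs <rook-ceph-osd-0-pod> --previous   # log of the crashed container instance
sh-5.1$ ceph crash ls                                             # from the toolbox: recent daemon crashes
sh-5.1$ ceph crash info <crash-id>                                # backtrace / assertion details for one crash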

Comment 11 Pooja Soni 2024-04-29 15:21:10 UTC
We tried again on a different setup, and it failed again with this error:

Ceph cluster health is not OK. Health: HEALTH_ERR 1/60838 objects unfound (0.002%); Reduced data availability: 37 pgs peering; Possible data damage: 1 pg recovery_unfound; Degraded data redundancy: 41007/182514 objects degraded (22.468%), 75 pgs degraded, 109 pgs undersized; 5 daemons have recently crashed; 1 slow ops, oldest one blocked for 120 sec, daemons [osd.1,osd.2] have slow ops.

Must-gather log for this setup: https://drive.google.com/file/d/1G_zg9vF8xI3c74hZBxmK4Q-Q-3VmZwtk/view?usp=sharing
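
If the unfound object cannot be recovered from any OSD, the standard Ceph escalation path looks like the sketch below. Note that mark_unfound_lost is destructive and a last resort; this is offered as a hedged reference, not a recommendation for this cluster:

sh-5.1$ ceph health detail | grep -A2 unfound        # identify the PG holding the unfound object
sh-5.1$ ceph pg <pgid> query                         # check which OSDs were probed ("might_have_unfound")
sh-5.1$ ceph pg <pgid> mark_unfound_lost revert      # destructive: roll back to a previous object version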

Comment 14 Pooja Soni 2024-05-15 10:38:31 UTC
I hit the same issue when running Tier1 on ODF 4.14.7. Ceph health went into an error state after execution of the test_selinux_relabel_for_existing_pvc[5] test case.

