Bug 2131703

Summary: Ceph is in HEALTH_WARN right after deployment with size 12
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Component: odf-managed-service
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NOTABUG
Severity: high
Priority: medium
Reporter: Filip Balák <fbalak>
Assignee: Leela Venkaiah Gangavarapu <lgangava>
QA Contact: Filip Balák <fbalak>
CC: aeyal, ebenahar, lgangava, nberry, ocs-bugs, odf-bz-bot, owasserm
Target Milestone: ---
Target Release: ---
Type: Bug
Regression: ---
Last Closed: 2023-02-06 10:10:54 UTC

Description Filip Balák 2022-10-03 11:44:22 UTC
Description of problem:
Right after deployment of the ODF Managed Service addon with size 12, Ceph is in an unhealthy state:
HEALTH_WARN 1 slow ops, oldest one blocked for 9728 sec, mon.c has slow ops

Version-Release number of selected component (if applicable):
ocs-osd-deployer.v2.0.7

How reproducible:
1/1

Steps to Reproduce:
1. Deploy a service with the dev addon:
rosa create service --type ocs-provider-dev --name fbalak-pr --machine-cidr 10.0.0.0/16 --size 12 --onboarding-validation-key <key> --subnet-ids <subnets> --region us-east-1
2. Check health status:
oc rsh -n openshift-storage $(oc get pods -n openshift-storage|grep tool|awk '{print$1}') ceph health
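
For reference, a minimal triage sketch (not part of the original report) for getting more detail on the warning from the same toolbox pod; TOOLS_POD is just a placeholder variable:

TOOLS_POD=$(oc get pods -n openshift-storage | grep tool | awk '{print $1}')
# Names the daemon(s) reporting slow ops and how long they have been blocked
oc rsh -n openshift-storage $TOOLS_POD ceph health detail
# Overall cluster state, including mon quorum and OSD counts
oc rsh -n openshift-storage $TOOLS_POD ceph status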

Actual results:
HEALTH_WARN 1 slow ops, oldest one blocked for 9728 sec, mon.c has slow ops

Expected results:
HEALTH_OK

Additional info:
The cluster was deployed with a dev addon that contains changes for epic ODFMS-55.

Comment 2 Leela Venkaiah Gangavarapu 2022-10-10 11:25:44 UTC
Hi,

- This seems to be a legitimate issue and needs changes to the resource calculations as well.
- For the time being, I'm assigning the bug to myself.

@fbalak,

- Does this affect the IO/management ops directly?

Thanks,
Leela.

Comment 3 Leela Venkaiah Gangavarapu 2022-10-27 13:41:06 UTC
- Still waiting to hear back about any repercussions caused by this bug.
- Orit is also looking into it; I will await an update.

Comment 4 Filip Balák 2022-10-27 13:54:27 UTC
No IO was tested on the cluster. This was the state right after installation, without any operations.

Comment 6 Leela Venkaiah Gangavarapu 2022-11-04 07:40:44 UTC
- Please note that the above workaround has to be applied after each upscale.

Comment 7 Leela Venkaiah Gangavarapu 2022-11-08 09:33:21 UTC
- The bug is resolved; the dependent Jira issue has been fixed on the OCM side.

Comment 16 Elad 2023-01-17 12:41:44 UTC
Moving to 4.12.z, as verification will be done against the ODF MS rollout that will be based on ODF 4.12.

Comment 17 Elad 2023-01-17 13:18:12 UTC
Moving to VERIFIED based on regression testing.
We will clone this bug for the sake of verifying the scenario as part of ODF MS testing over ODF 4.12 or with the provider-consumer layout.

Comment 19 Filip Balák 2023-02-06 10:10:54 UTC
Size 12 is not going to be supported for now. --> CLOSED NOTABUG