Bug 2251670
| Summary: | [provider] During regression testing RGW pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore on provider went to CLBO state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | suchita <sgatfane> |
| Component: | ocs-operator | Assignee: | Rohan Gupta <rohgupta> |
| Status: | CLOSED COMPLETED | QA Contact: | Amrita Mahapatra <ammahapa> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.14 | CC: | asriram, brgardne, dosypenk, ebenahar, jthottan, kramdoss, lgangava, muagarwa, nberry, nigoyal, odf-bz-bot, paarora, resoni, rohgupta |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.14.6 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | isf-provider, fusion-hci-phase-2 | | |
| Fixed In Version: | 4.14.4-5 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2256777 (view as bug list) | Environment: | |
| Last Closed: | 2024-07-22 10:55:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2256777 | | |
Description
suchita
2023-11-27 07:54:12 UTC
A manual workaround for this bug is to restart the rook-ceph-operator pod and then restart the crashing RGW pod.

This issue is occurring on a freshly installed HCI cluster. RGW cannot bind the HTTPS port (443) because it is already in use, causing the pod to go into CLBO:

    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable), process radosgw, pid 701
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 framework: beast
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 framework conf key: port, val: 80
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 framework conf key: ssl_port, val: 443
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 framework conf key: ssl_certificate, val: /etc/ceph/private/rgw-cert.pem
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 framework conf key: ssl_private_key, val: /etc/ceph/private/rgw-key.pem
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 1 radosgw_Main not setting numa affinity
    debug 2023-11-29T10:20:11.402+0000 7f2d9521a7c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
    debug 2023-11-29T10:20:11.402+0000 7f2d9521a7c0 1 D3N datacache enabled: 0
    debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0 0 framework: beast
    debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0 0 framework conf key: ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
    debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0 0 framework conf key: ssl_private_key, val: config://rgw/cert/$realm/$zone.key
    debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0 0 starting handler: beast
    debug 2023-11-29T10:20:11.518+0000 7f2d9521a7c0 -1 failed to bind address 0.0.0.0:443: Address already in use
    debug 2023-11-29T10:20:11.518+0000 7f2d9521a7c0 -1 ERROR: failed initializing frontend

I don't think this issue is related to the deployment strategy. RGW has been working fine with this port setup for a while. I suspect that this could have something to do with OpenShift changes, or with how HCI is working.

@sgatfane and @rohgupta, I'd like to ask some more questions about the environment before proceeding.

1. Is this environment in __any way__ new?
2. Has this HCI setup been tested before?
3. Is this using a new version (even a minor version) of OpenShift compared to previous tests?
4. Is this running on bare metal hardware? (Some oc output I see suggests that it is.)
5. Is this running in any kind of new environment?

To help triage, we may want to run this test in other environments to see how reproducible it is.

1. Does this reproduce with an older version of OpenShift?
2. Does this reproduce in non-HCI environments?
3. Does this reproduce in HCI environments that are virtual/cloud instead of bare metal (or vice versa)?

Also, I'd really like to be able to get access to the environment where this is failing. Can you set up a test where the cluster remains offline for me to look at live?

Also, what does this mean?
> Can this issue reproducible?
> 1/1
Does this mean the issue was only seen once? I think it is important to understand if this issue reproduces with any regularity. It sounds like a race condition, but I don't see anything that suggests a cause. If this doesn't repro, I'm not sure we have enough info to know what can/should be changed.
I understand from Rohan that this is regularly reproducible. I suspect that there could be a race condition with the startup probe. To dig deeper into that, we need info that isn't present in our must-gathers. We should gather logs for the startup probe that are only captured in kubelet logs. Getting those logs requires setting the kubelet's log level to `--v=4` (or higher). This doc explains how to use an OpenShift MachineConfig to change the kubelet log level; please change the example log level from 2 to 4 (a sketch of such a MachineConfig is included after this exchange): https://docs.openshift.com/container-platform/4.8/rest_api/editing-kubelet-log-level-verbosity.html#persistent-kubelet-log-level-configuration_editing-kubelet-log-level-verbosity

We'll also need kubelet logs from the node where the RGW is scheduled. This command will likely do so and gather logs from all worker nodes: `oc adm node-logs --role worker -u kubelet`

(In reply to Rohan Gupta from comment #3)
> This issue is occurring on a freshly installed HCI cluster

No, the cluster where QE reported this issue was not a freshly deployed cluster. The cluster had been running for a couple of weeks.

(In reply to Blaine Gardner from comment #7)
> I don't think this issue is related to the deployment strategy. RGW has been working fine with this port setup for a while. I suspect that this could have something to do with OpenShift changes, or with how HCI is working.
>
> @sgatfane and @rohgupta, I'd like to ask some more questions about the environment before proceeding.
>
> 1. Is this environment in __any way__ new?

==> What do you mean by a new environment? If you mean a freshly deployed cluster, then the answer is no. We observed this on a provider cluster that had been running for almost a week; we had just run our regression TCs from tier2 of the MCG, RGW, and NooBaa tests.

> 2. Has this HCI setup been tested before?

==> At least from the ODF QE side, we are testing the HCI setup for the first time during this release.

> 3. Is this using a new version (even a minor version) of OpenShift compared to previous tests?

==> ODF QE is testing MCG, NooBaa, and RGW on this HCI provider/client solution for the first time. In converged solutions we do this testing on all supported platforms, and the issue has not been reported there yet.

> 4. Is this running on bare metal hardware? (Some oc output I see suggests that it is.)

When I reported it, the setup was not actual bare-metal hardware; it is vSphere hardware used like bare-metal hardware.

> 5. Is this running in any kind of new environment?
>
> To help triage, we may want to run this test in other environments to see how reproducible it is.
>
> 1. Does this reproduce with an older version of OpenShift?
> 2. Does this reproduce in non-HCI environments?
> 3. Does this reproduce in HCI environments that are virtual/cloud instead of bare metal (or vice versa)?

===> Yes, the first incident was observed on a vSphere bare-metal-like cluster. QE is not able to reproduce it again because we are blocked from running our tier2 tests on the provider; we are trying to resolve the ocs-ci issue that blocks that run.

> Also, I'd really like to be able to get access to the environment where this is failing. Can you set up a test where the cluster remains offline for me to look at live?

I was able to reproduce the issue on a BM cluster. I didn't hit the issue on the first storagecluster install, but after cleaning up ODF and recreating it, I hit the issue.
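For reference, the persistent kubelet log-level change requested above is done through a MachineConfig. The following is a minimal sketch only, assuming the systemd drop-in approach described in the linked OpenShift doc; the object name, drop-in file name, and ignition version are illustrative and should be matched to the cluster:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  # Illustrative name; the "worker" role label targets the nodes where RGW runs.
  name: 99-worker-kubelet-loglevel
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0   # should match the cluster's ignition version
    systemd:
      units:
        - name: kubelet.service
          enabled: true
          dropins:
            - name: 30-logging.conf   # illustrative drop-in name from the doc's example
              contents: |
                [Service]
                # Raised from the doc's example value of 2 to 4, as requested above.
                Environment="KUBELET_LOG_LEVEL=4"
```

Once the MachineConfigPool has rolled this out, the `oc adm node-logs --role worker -u kubelet` command above should capture the more verbose startup-probe output.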
Here are the kubelet logs, @brgardne: https://drive.google.com/file/d/1Fl3fjPFn-Tym1mfad725qevkEUsx77LR/view?usp=sharing

@sgatfane I still really need more details.

I don't know what HCI means in the context you are using it. I really need a very deeply involved breakdown. It is important for us to understand how this environment is different from environments where the tests pass. That helps narrow down the possibilities for what is going wrong. If I don't have that information, I don't have any leads to help guide the search for bugs.

Please provide all of the information you can think of, even if you think the details are minor. The less information I get, the longer this is going to take to debug. This will require you to spend a lot of time writing out the information, but it is critical that you take the time to do it so that we can figure out the problem in time for release. I usually spend 45 minutes to 1 1/2 hours investigating and writing out detailed responses to BZs when they are complex.

What does HCI mean?

What is new about this test? Please explain in excruciating detail. What are the test steps? How are things configured?

How is this test different from tests that pass? List all of the differences. What steps are new? What configurations are new?

Does this work if NooBaa isn't installed? Does this test pass with ODF 4.13? Does this test pass with an older OpenShift version?

Please don't skip the answer to "Does this test pass with ODF 4.13?" If this is a regression test, then we should be testing whether the result is regressing between versions, and it will be easier to bisect the changes that happen between the versions.

Okay! I finally tracked this down!
The CephCluster has hostNetwork enabled:
    network:
      hostNetwork: true
      multiClusterService: {}
When the CephCluster has hostNetwork enabled, the CephObjectStore will too, and obviously, if the CephObjectStore is configured to use the host's network, the user should be 100% sure that the port is free on the host. Port 80 and port 443 are standard HTTP(S) ports, and so it should come as no surprise that something might be bound to those ports. In this case, it looks like haproxy is bound to the port, which is from some other OpenShift component.
For ODF, we should be **extremely** careful when deploying with hostNetwork enabled because of the risk of port conflicts like this. I would go so far as to suggest that we should avoid host networking unless there is no alternative.
Why is host networking being enabled on the CephCluster? Is this a bug or intentional?
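As context for the port conflict described above, here is a minimal sketch of the relevant object store gateway configuration, assuming Rook's CephObjectStore `spec.gateway.port`/`securePort` fields; the store name and namespace are taken from the pod name in the summary, and the instance count is illustrative:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: ocs-storagecluster-cephobjectstore
  namespace: openshift-storage
spec:
  gateway:
    port: 80          # matches "framework conf key: port, val: 80" in the RGW log
    securePort: 443   # matches "framework conf key: ssl_port, val: 443"
    instances: 1      # illustrative
```

With host networking inherited from the CephCluster, beast tries to bind these ports directly on the node (hence `failed to bind address 0.0.0.0:443` when haproxy already owns 443); on the pod network, the same ports stay private to the pod's network namespace.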
(In reply to Blaine Gardner from comment #14)
> Okay! I finally tracked this down!
>
> The CephCluster has hostNetwork enabled:
>
>     network:
>       hostNetwork: true
>       multiClusterService: {}
>
> When the CephCluster has hostNetwork enabled, the CephObjectStore will too, and obviously, if the CephObjectStore is configured to use the host's network, the user should be 100% sure that the port is free on the host. Port 80 and port 443 are standard HTTP(S) ports, and so it should come as no surprise that something might be bound to those ports. In this case, it looks like haproxy is bound to the port, which is from some other OpenShift component.
>
> For ODF, we should be **extremely** careful when deploying with hostNetwork enabled because of the risk of port conflicts like this. I would go so far as to suggest that we should avoid host networking unless there is no alternative.
>
> Why is host networking being enabled on the CephCluster? Is this a bug or intentional?

Hi Blaine, thanks for the analysis and the suggestion. In provider/client, and also in the old Managed Services, hostNetwork has to be set to true for the provider.

BTW, adding a few points for update here. IIUC from what Suchita confirmed, we are not seeing this issue on all deployments, so it could be a race condition. Also, we hit this problem after running tier2 tests on the cluster, and currently, due to some changes needed in our CI, we are unable to repeat the tier2 execution to check whether the issue reproduces again.

Port 443 is being utilized by HAProxy on the host, so the RGW pod is not able to bind to 443.

4.14.4 content was finalized and frozen already. Moving the bug to 4.14.5.

Upstream PR to allow enabling/disabling host network for RGW is merged: https://github.com/red-hat-storage/ocs-operator/pull/2323

(In reply to Blaine Gardner from comment #13)
> @sgatfane I still really need more details.
>
> I don't know what HCI means in the context you are using it. I really need a very deeply involved breakdown. It is important for us to understand how this environment is different from environments where the tests pass. That helps narrow down the possibilities for what is going wrong. If I don't have that information, I don't have any leads to help guide the search for bugs.
>
> Please provide all of the information you can think of, even if you think the details are minor. The less information I get, the longer this is going to take to debug. This will require you to spend a lot of time writing out the information, but it is critical that you take the time to do it so that we can figure out the problem in time for release. I usually spend 45 minutes to 1 1/2 hours investigating and writing out detailed responses to BZs when they are complex.
>
> What does HCI mean?
>
> What is new about this test? Please explain in excruciating detail. What are the test steps? How are things configured?
>
> How is this test different from tests that pass? List all of the differences. What steps are new? What configurations are new?
>
> Does this work if NooBaa isn't installed? Does this test pass with ODF 4.13? Does this test pass with an older OpenShift version?
>
> Please don't skip the answer to "Does this test pass with ODF 4.13?"
> If this is a regression test, then we should be testing whether the result is regressing between versions, and it will be easier to bisect the changes that happen between the versions.

Clearing the need-info, because:

1. I have answered a few of these already in comment#8.
2. The root cause of this bug has already been tracked down above.

During a new deployment (OCP 4.14.15 + ODF odf-operator.v4.14.5-rhodf), CLBO appeared again on the rgw-ocs-storagecluster-cephobjectstore-a-<suffix> pod:

    oc get pods -n openshift-storage | grep rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a
    rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-ffd87dbt59p7   2/2   Running   18 (16h ago)   17h

After one hour and 18 restarts, the CLBO state disappeared. spec.managedResources.cephObjectStores.hostNetwork was set to False initially when creating the StorageCluster. After StorageClient creation, spec.managedResources.cephObjectStores.hostNetwork got a True value.

must-gather logs:
OCP: https://drive.google.com/file/d/1OtOORTU1H9e-ed5k5S73aNhOtV3fq28G/view?usp=sharing
OCS: https://drive.google.com/file/d/1wimm42cbF_iV_qImSNvy7j6EX43If7kT/view?usp=sharing
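For reference, with the linked ocs-operator change, RGW host networking is meant to be controlled from the StorageCluster. A minimal sketch of the field called out in the previous comment, with the cluster name and namespace assumed from the pod listing above, is:

```yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster      # assumed from the pod names above
  namespace: openshift-storage
spec:
  managedResources:
    cephObjectStores:
      # Field reported in this bug: it was False at StorageCluster creation
      # and flipped to True after StorageClient creation.
      hostNetwork: false
```

When hostNetwork is false, the RGW pod gets its own pod IP and cannot collide with HAProxy's ports on the node; whether the provider flow should flip this back to true is the behavior questioned in the comments above.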