Bug 2256777 - [provider] During regression testing RGW pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore on provider went to CLBO state
Summary: [provider] During regression testing RGW pod rook-ceph-rgw-ocs-storagecluster...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: ocs-operator
Version: 4.15
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ODF 4.15.0
Assignee: Rohan Gupta
QA Contact: Amrita Mahapatra
URL:
Whiteboard: isf-provider, fusion-hci-phase-2
Depends On: 2251670
Blocks:
Reported: 2024-01-04 14:03 UTC by Rohan Gupta
Modified: 2024-03-19 15:30 UTC
CC List: 13 users

Fixed In Version: 4.15.0-112
Doc Type: No Doc Update
Doc Text:
Clone Of: 2251670
Environment:
Last Closed: 2024-03-19 15:30:36 UTC
Embargoed:


Links
Github red-hat-storage/ocs-operator pull 2371 (Merged) - Bug 2255499: [release-4.15] Added network configuration setting for CephObjectStore (last updated 2024-01-08 11:36:27 UTC)
Red Hat Product Errata RHSA-2024:1383 (last updated 2024-03-19 15:30:39 UTC)

Description Rohan Gupta 2024-01-04 14:03:37 UTC
+++ This bug was initially created as a clone of Bug #2251670 +++

Description of problem (please be as detailed as possible and provide log
snippets):

[provider] During regression testing RGW pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore on provider went to CLBO state

Version of all relevant components (if applicable):

$ oc get csv
NAME                                    DISPLAY                       VERSION               REPLACES                                  PHASE
mcg-operator.v4.14.1-6.hci              NooBaa Operator               4.14.1-6.hci          mcg-operator.v4.14.0-160.hci              Succeeded
metallb-operator.v4.14.0-202310201027   MetalLB Operator              4.14.0-202310201027                                             Succeeded
ocs-operator.v4.14.1-6.hci              OpenShift Container Storage   4.14.1-6.hci          ocs-operator.v4.14.0-160.hci              Succeeded
odf-csi-addons-operator.v4.14.1-6.hci   CSI Addons                    4.14.1-6.hci          odf-csi-addons-operator.v4.14.0-160.hci   Succeeded
odf-operator.v4.14.1-6.hci              OpenShift Data Foundation     4.14.1-6.hci          odf-operator.v4.14.0-160.hci              Succeeded


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?
Not Yet

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
1/1

Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:
Will update after QE analysis of the Tier2 test run, to identify the exact test after which the cluster reached this CLBO state.

Steps to Reproduce:
1.
2.
3.


Actual results:

$ oc get pods | grep rgw
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-d795cb7zccx6   1/2     CrashLoopBackOff   159 (50s ago)   13h

Expected results:

No pod should be in CLBO (CrashLoopBackOff) state.

Additional info:

$ oc get pods
NAME                                                              READY   STATUS             RESTARTS          AGE
csi-addons-controller-manager-d68fdf849-r2wlp                     2/2     Running            2                 31h
noobaa-core-0                                                     1/1     Running            0                 13h
noobaa-db-pg-0                                                    1/1     Running            0                 7d17h
noobaa-default-backing-store-noobaa-pod-8ef2d870                  1/1     Running            24 (35h ago)      7d17h
noobaa-endpoint-586fc89d89-96dc8                                  1/1     Running            0                 35h
noobaa-endpoint-586fc89d89-wmmmm                                  1/1     Running            0                 13h
noobaa-operator-5584d8656f-pdnzr                                  2/2     Running            0                 7d19h
ocs-metrics-exporter-69f54ff5df-jptm5                             1/1     Running            4 (31h ago)       7d19h
ocs-operator-9cd5c8d9-lt5g8                                       1/1     Running            8 (2d13h ago)     7d19h
ocs-provider-server-97f954f89-9v2js                               1/1     Running            0                 7d19h
odf-console-78f7d497c4-t4l4k                                      1/1     Running            0                 7d19h
odf-operator-controller-manager-69ddd7855-j7z9m                   2/2     Running            12 (2d8h ago)     7d19h
rook-ceph-crashcollector-00-50-56-8f-03-dd-864dc98fd5-rbktj       1/1     Running            0                 13h
rook-ceph-crashcollector-00-50-56-8f-7b-17-6df9b99b47-9qzsw       1/1     Running            0                 13h
rook-ceph-crashcollector-00-50-56-8f-e7-ef-8dbf49bf9-krskk        1/1     Running            0                 30h
rook-ceph-exporter-00-50-56-8f-03-dd-69488b5b48-5rrqn             1/1     Running            63 (8h ago)       13h
rook-ceph-exporter-00-50-56-8f-7b-17-8fddfb7db-gmbhb              1/1     Running            39 (10h ago)      13h
rook-ceph-exporter-00-50-56-8f-e7-ef-5dfb4d559d-74wp9             1/1     Running            8 (7d18h ago)     7d19h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-8767d7cc42476   2/2     Running            0                 13h
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-9c68d5d47ldws   2/2     Running            1 (13h ago)       29h
rook-ceph-mgr-a-8b79b7c7f-jhkhm                                   2/2     Running            0                 29h
rook-ceph-mon-a-666867df8b-fskvz                                  2/2     Running            2                 29h
rook-ceph-mon-b-76868bbf47-fg6xw                                  2/2     Running            0                 29h
rook-ceph-mon-c-859c6c7f56-kn4kh                                  2/2     Running            0                 29h
rook-ceph-operator-66fcc8ff5-z4748                                1/1     Running            2                 30h
rook-ceph-osd-0-d99889db-8kcnm                                    2/2     Running            2                 29h
rook-ceph-osd-1-c46966bc-7jwjw                                    2/2     Running            0                 29h
rook-ceph-osd-2-764dfdbf46-5rpwr                                  2/2     Running            0                 29h
rook-ceph-osd-3-86787dcbcd-j5kdw                                  2/2     Running            2                 29h
rook-ceph-osd-4-8dfb9cc69-kh6tn                                   2/2     Running            0                 29h
rook-ceph-osd-5-bdfd4cdfb-p5dtf                                   2/2     Running            0                 29h
rook-ceph-osd-prepare-2ff0ede5ab0f9013316b828503bad2da-gr6kj      0/1     Completed          0                 20d
rook-ceph-osd-prepare-3ddfe359d4e828b0fa59990173505ee8-58gkr      0/1     Completed          0                 7d17h
rook-ceph-osd-prepare-3f7379b6eb08d00c2fc37b868c01bb2f-w7gv9      0/1     Completed          0                 7d17h
rook-ceph-osd-prepare-f629042efc54775b9041526fdae74f6b-fvcw4      0/1     Completed          0                 20d
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-d795cb7zccx6   1/2     CrashLoopBackOff   158 (2m16s ago)   13h
rook-ceph-tools-669dd96f6f-8f6wf                                  1/1     Running            0                 7d19h


Must Gather Logs: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/provider-client/provider-client_20231113T182256/logs/must_gather/
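
The bind failure can also be read straight from the crashing pod; a sketch, assuming the default openshift-storage namespace and that the RGW container is named rgw:

$ oc -n openshift-storage describe pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-d795cb7zccx6
$ oc -n openshift-storage logs rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-d795cb7zccx6 -c rgw --previous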

--- Additional comment from RHEL Program Management on 2023-11-27 07:54:20 UTC ---

This bug previously had no release flag set. The release flag 'odf-4.15.0' has now been set to '?', so the bug is being proposed to be fixed in the ODF 4.15.0 release. Note that the 3 acks (pm_ack, devel_ack, qa_ack), if any were previously set while the release flag was missing, have now been reset, since acks are to be set against a release flag.

--- Additional comment from Rohan Gupta on 2023-11-27 17:46:40 UTC ---

A manual workaround for this bug is to restart the rook-ceph-operator pod and then restart the crashing RGW pod.
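
A sketch of that workaround using label selectors, assuming the default openshift-storage namespace and the usual Rook pod labels:

$ oc -n openshift-storage delete pod -l app=rook-ceph-operator
$ oc -n openshift-storage delete pod -l app=rook-ceph-rgw

The respective deployments recreate both pods.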

--- Additional comment from Rohan Gupta on 2023-11-29 10:35:55 UTC ---

This issue is occurring on a freshly installed HCI cluster.

--- Additional comment from Rohan Gupta on 2023-11-29 11:06:18 UTC ---

RGW cannot bind the HTTPS port (443) because it is already in use, causing the pod to go into CrashLoopBackOff:

debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0  0 deferred set uid:gid to 167:167 (ceph:ceph)
debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0  0 ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable), process radosgw, pid 701
debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0  0 framework: beast
debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0  0 framework conf key: port, val: 80
debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0  0 framework conf key: ssl_port, val: 443
debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0  0 framework conf key: ssl_certificate, val: /etc/ceph/private/rgw-cert.pem
debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0  0 framework conf key: ssl_private_key, val: /etc/ceph/private/rgw-key.pem
debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0  1 radosgw_Main not setting numa affinity
debug 2023-11-29T10:20:11.402+0000 7f2d9521a7c0  1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
debug 2023-11-29T10:20:11.402+0000 7f2d9521a7c0  1 D3N datacache enabled: 0
debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0  0 framework: beast
debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0  0 framework conf key: ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0  0 framework conf key: ssl_private_key, val: config://rgw/cert/$realm/$zone.key
debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0  0 starting handler: beast
debug 2023-11-29T10:20:11.518+0000 7f2d9521a7c0 -1 failed to bind address 0.0.0.0:443: Address already in use
debug 2023-11-29T10:20:11.518+0000 7f2d9521a7c0 -1 ERROR: failed initializing frontend
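
A quick way to confirm what is already holding the port on that host is to check the node's listening sockets; a sketch, with the node name left as a placeholder:

$ oc debug node/<node-running-the-rgw-pod> -- chroot /host ss -tlnp | grep ':443 '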

--- Additional comment from Parth Arora on 2023-11-29 11:32:49 UTC ---

This feels similar to the issue we observed with the exporter pod: either the RGW pod should be deleted completely first, or the deployment should use `RecreateDeploymentStrategyType` instead of a rolling update.
Jiffin, any idea from the RGW logs what it is trying to bind to?
CC Blaine

--- Additional comment from Jiffin on 2023-11-29 11:52:30 UTC ---

The RGW daemon binds to 443 for HTTPS since the secure port is set to 443 in the ObjectStoreSpec. I don't understand why another service would be using that port in the RGW pod. We can try using different ports and see if the issue persists; if it does, it may be a different issue.
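
For reference, the ports the operator configured can be read back from the CephObjectStore itself; a sketch, assuming the default openshift-storage namespace and object store name:

$ oc -n openshift-storage get cephobjectstore ocs-storagecluster-cephobjectstore \
    -o jsonpath='{.spec.gateway.port}{" "}{.spec.gateway.securePort}{"\n"}'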

--- Additional comment from Blaine Gardner on 2023-11-30 16:45:28 UTC ---

I don't think this issue is related to the deployment strategy. RGW has been working fine with this port setup for a while. I suspect that this could have something to do with OpenShift changes, or having to do with how HCI is working.


@sgatfane and @rohgupta I'd like to ask some more questions about the environment before proceeding.

1. Is this environment in __any way__ new?
2. Has this HCI setup been tested before?
3. Is this using a new version (even minor version) of OpenShift compared to previous tests?
4. Is this running on bare metal hardware? (some oc output I see suggests that it is)
5. Is this running in any kind of new environment?


To help triage, we may want to run this test in other environments to see how reproducible it is.

1. Does this reproduce with an older version of openshift?
2. Does this reproduce in non-HCI environments?
3. Does this reproduce in HCI environments that are virtual/cloud instead of bare metal? (or vice versa)


Also, I'd really like to be able to get access to the environment where this is failing. Can you set up a test where the cluster remains offline for me to look at live?

--- Additional comment from Blaine Gardner on 2023-11-30 16:56:07 UTC ---

Also, what does this mean?

> Is this issue reproducible?
> 1/1

Does this mean the issue was only seen once? I think it is important to understand if this issue reproduces with any regularity. It sounds like a race condition, but I don't see anything that suggests a cause. If this doesn't repro, I'm not sure we have enough info to know what can/should be changed.

--- Additional comment from Blaine Gardner on 2023-11-30 18:09:32 UTC ---

I understand from Rohan that this is regularly reproducible. I suspect that there could be a race condition with the startup probe.

To dig deeper into that, we need info that isn't present in our must-gathers. We should gather logs for the startup probe that are only captured in kubelet logs. Getting those logs requires setting the kubelet's log level to `--v=4` (or higher).

This doc tells how to use the openshift machineconfig to change the kubelet log level. Please change the example log level from 2 to 4. 
https://docs.openshift.com/container-platform/4.8/rest_api/editing-kubelet-log-level-verbosity.html#persistent-kubelet-log-level-configuration_editing-kubelet-log-level-verbosity

We'll also need kubelet logs from the node where the RGW is scheduled. Likely this command will do so, and gather logs from all worker nodes: `oc adm node-logs --role worker -u kubelet`
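
Once the log level is raised, something like the following should narrow the output down to probe events for the RGW pod (the grep patterns are only a suggestion):

$ oc adm node-logs --role worker -u kubelet | grep rook-ceph-rgw | grep -i probe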

--- Additional comment from suchita on 2023-12-01 11:30:12 UTC ---

(In reply to Rohan Gupta from comment #3)
> This issue is occurring on fresh installed HCI cluster

No, the cluster where QE reported this issue was not a freshly deployed cluster. The cluster had been running for a couple of weeks.

--- Additional comment from suchita on 2023-12-01 11:50:01 UTC ---

(In reply to Blaine Gardner from comment #7)
> I don't think this issue is related to the deployment strategy. RGW has been
> working fine with this port setup for a while. I suspect that this could
> have something to do with OpenShift changes, or having to do with how HCI is
> working.
> 
> 
> @sgatfane and @rohgupta I'd like to ask some more
> questions about the environment before proceeding.
> 
> 1. Is this environment in __any way__ new?
==> What does "new environment" mean? If you mean a freshly deployed cluster, then the answer is no. We observed this on a provider cluster that had been running for almost a week; we just ran our regression test cases from the MCG, RGW, and NooBaa tier2 tests.

> 2. Has this HCI setup been tested before?
==> At least from the ODF QE side, we are testing the HCI setup for the first time during this release.

> 3. Is this using a new version (even minor version) of OpenShfit compared to
> previous tests?
==> ODF QE is testing MCG, NooBaa, and RGW on this HCI provider/client solution for the first time. In converged solutions we run this testing on all supported platforms, and this issue has not been reported there yet.

> 4. Is this running on bare metal hardware? (some oc output I see suggests
> that it is) 
==> When I reported it, the setup was not actual bare metal hardware; it is vSphere hardware used like bare metal hardware.

> 5. Is this running in any kind of new environment?
> 
> 
> To help triage, we may want to run this test in other environments to see
> how reproducible it is.
> 
> 1. Does this reproduce with an older version of openshift? 
> 2. Does this reproduce in non-HCI environments?
> 3. Does this reproduce in HCI environments that are virtual/cloud instead of
> bare metal? (or vice versa) 
===> Yes, the first incident was observed on a vSphere bare-metal-like cluster.
QE has not been able to reproduce it again because we are blocked from running our tier2 tests on the provider, and we are trying to resolve our ocs-ci issues so we can do that.

> 
> 
> Also, I'd really like to be able to get access to the environment where this
> is failing. Can you set up a test where the cluster remains offline for me
> to look at live?

--- Additional comment from Rohan Gupta on 2023-12-01 16:03:25 UTC ---

I was able to reproduce the issue on a BM cluster.
I didn't hit the issue on the first StorageCluster install, but after cleaning up ODF and recreating it, I hit the issue.

Here are the kubelet logs @brgardne https://drive.google.com/file/d/1Fl3fjPFn-Tym1mfad725qevkEUsx77LR/view?usp=sharing

--- Additional comment from Blaine Gardner on 2023-12-01 18:27:51 UTC ---

@sgatfane I still really need more details

I don't know what HCI means in the context you are using it. I really need a very deeply involved breakdown. It is important for us to understand how this environment is different from environments where the tests pass. That helps narrow down the possibilities for what is going wrong. If I don't have that information, I don't have any leads to help guide the search for bugs. 

Please provide all of the information you can think of, even if you think the details are minor. The less information I get, the longer this is going to take to debug. This will require you to spend a lot of time writing out the information, but it is critical that you take the time to do it so that we can figure out the problem in time for release. I usually spend 45 minutes to 1 1/2 hours investigating and writing out detailed responses to BZs when they are complex.

What does HCI mean? 

What is new about this test? Please explain in excruciating detail. What are the test steps? How are things configured?

How is this test different from tests that pass? List all of them. What steps are new? What configurations are new?

Does this work if Noobaa isn't installed? Does this test pass with ODF 4.13? Does this test pass with an older OpenShift version?

Please don't skip the answer to "Does this test pass with ODF 4.13?" If this is a regression test, then we should be testing if the result is regressing between versions, and it will be easier to bisect the changes that happen between the versions.

--- Additional comment from Blaine Gardner on 2023-12-01 19:37:38 UTC ---

Okay! I finally tracked this down!

The CephCluster has hostNetwork enabled:

  network:
    hostNetwork: true
    multiClusterService: {}


When the CephCluster has hostNetwork enabled, the CephObjectStore will too, and obviously, if the CephObjectStore is configured to use the host's network, the user should be 100% sure that the port is free on the host. Port 80 and port 443 are standard HTTP(S) ports, and so it should come as no surprise that something might be bound to those ports. In this case, it looks like haproxy is bound to the port, which is from some other OpenShift component.

For ODF, we should be **extremely** careful when deploying with hostNetwork enabled because of the risk of port conflicts like this. I would go as far as to suggest that we should avoid host networking unless we can't avoid it.

Why is host networking being enabled on the CephCluster? Is this a bug or intentional?
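
For anyone checking their own cluster, the network settings can be read back from the CephCluster directly; a sketch, assuming the default openshift-storage namespace and CephCluster name:

$ oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster -o jsonpath='{.spec.network}{"\n"}'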

--- Additional comment from Neha Berry on 2023-12-04 05:07:11 UTC ---

(In reply to Blaine Gardner from comment #14)
> Okay! I finally tracked this down!
> 
> The CephCluster has hostNetwork enabled:
> 
>   network:
>     hostNetwork: true
>     multiClusterService: {}
> 
> 
> When the CephCluster has hostNetwork enabled, the CephObjectStore will too,
> and obviously, if the CephObjectStore is configured to use the host's
> network, the user should be 100% sure that the port is free on the host.
> Port 80 and port 443 are standard HTTP(S) ports, and so it should come as no
> surprise that something might be bound to those ports. In this case, it
> looks like haproxy is bound to the port, which is from some other OpenShift
> component.
> 
> For ODF, we should be **extremely** careful when deploying with hostNetwork
> enabled because of the risk of port conflicts like this. I would go as far
> as to suggest that we should avoid host networking unless we can't avoid it.
> 
> Why is host networking being enabled on the CephCluster? Is this a bug or
> intentional?

Hi Blaine, thanks for the analysis and the suggestions.

In provider/client mode, and also in the old Managed Services, hostNetwork has to be set to true for the provider.

BTW, adding a few points as updates here:

IIUC from what Suchita confirmed, we are not seeing this issue on all deployments, so it could be a race condition.

Also, we hit this problem after running tier2 tests on the cluster. Currently, due to some changes needed in our CI, we are unable to repeat the tier2 execution to check whether the issue reproduces again.
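
For context, the provider-side host networking requirement is expressed on the StorageCluster; a sketch for reading it back, assuming the default names and that the field is spec.hostNetwork:

$ oc -n openshift-storage get storagecluster ocs-storagecluster -o jsonpath='{.spec.hostNetwork}{"\n"}'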

--- Additional comment from Neha Berry on 2023-12-04 09:23:12 UTC ---

(In reply to Neha Berry from comment #15)
> IIUC from what Suchita confirmed, we are not seeing this issue on all
> deployments, so it could be a race condition.
>
> Also, we hit this problem after running tier2 tests on the cluster.
> Currently, due to some changes needed in our CI, we are unable to repeat
> the tier2 execution to check whether the issue reproduces again.

BTW, Rohan was able to reproduce it, so he can better suggest scenarios where there is a possibility of reproducing it.

--- Additional comment from Rohan Gupta on 2023-12-04 16:33:50 UTC ---

Port 443 is being used by HAProxy on the host, so the RGW pod is not able to bind to it.

--- Additional comment from krishnaram Karthick on 2024-01-02 07:50:29 UTC ---

The 4.14.4 content has already been finalized and frozen. Moving the bug to 4.14.5.

--- Additional comment from Rohan Gupta on 2024-01-03 07:52:07 UTC ---

The upstream PR to allow enabling/disabling host networking for RGW has been merged: https://github.com/red-hat-storage/ocs-operator/pull/2323
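
On a build with the fix, one way to verify the effect is to check whether the RGW pod itself still requests host networking; a sketch, assuming the default openshift-storage namespace and the usual app=rook-ceph-rgw label:

$ oc -n openshift-storage get pod -l app=rook-ceph-rgw -o jsonpath='{.items[0].spec.hostNetwork}{"\n"}'

An empty result means hostNetwork is unset (false) for the pod.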

Comment 7 errata-xmlrpc 2024-03-19 15:30:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: Red Hat OpenShift Data Foundation 4.15.0 security, enhancement, & bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:1383

