Bug 2251670
| Summary: | [provider] During regression testing RGW pod rook-ceph-rgw-ocs-storagecluster-cephobjectstore on provider went to CLBO state | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | suchita <sgatfane> |
| Component: | ocs-operator | Assignee: | Rohan Gupta <rohgupta> |
| Status: | CLOSED COMPLETED | QA Contact: | Amrita Mahapatra <ammahapa> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.14 | CC: | asriram, brgardne, dosypenk, ebenahar, jthottan, kramdoss, lgangava, muagarwa, nberry, nigoyal, odf-bz-bot, paarora, resoni, rohgupta |
| Target Milestone: | --- | | |
| Target Release: | ODF 4.14.6 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | isf-provider, fusion-hci-phase-2 | | |
| Fixed In Version: | 4.14.4-5 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2256777 (view as bug list) | Environment: | |
| Last Closed: | 2024-07-22 10:55:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2256777 | | |
Description
suchita
2023-11-27 07:54:12 UTC
A manual workaround for this bug is to restart the rook-ceph-operator pod and then restart the crashing RGW pod.

This issue is occurring on a freshly installed HCI cluster. RGW cannot bind the HTTPS port (443) because it is already in use, causing the pod to go into CLBO:

    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 deferred set uid:gid to 167:167 (ceph:ceph)
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 ceph version 17.2.6-148.el9cp (badc1d27cb07762bea48f6554ad4f92b9d3fbb6b) quincy (stable), process radosgw, pid 701
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 framework: beast
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 framework conf key: port, val: 80
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 framework conf key: ssl_port, val: 443
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 framework conf key: ssl_certificate, val: /etc/ceph/private/rgw-cert.pem
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 0 framework conf key: ssl_private_key, val: /etc/ceph/private/rgw-key.pem
    debug 2023-11-29T10:20:11.401+0000 7f2d9521a7c0 1 radosgw_Main not setting numa affinity
    debug 2023-11-29T10:20:11.402+0000 7f2d9521a7c0 1 rgw_d3n: rgw_d3n_l1_local_datacache_enabled=0
    debug 2023-11-29T10:20:11.402+0000 7f2d9521a7c0 1 D3N datacache enabled: 0
    debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0 0 framework: beast
    debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0 0 framework conf key: ssl_certificate, val: config://rgw/cert/$realm/$zone.crt
    debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0 0 framework conf key: ssl_private_key, val: config://rgw/cert/$realm/$zone.key
    debug 2023-11-29T10:20:11.516+0000 7f2d9521a7c0 0 starting handler: beast
    debug 2023-11-29T10:20:11.518+0000 7f2d9521a7c0 -1 failed to bind address 0.0.0.0:443: Address already in use
    debug 2023-11-29T10:20:11.518+0000 7f2d9521a7c0 -1 ERROR: failed initializing frontend

I don't think this issue is related to the deployment strategy. RGW has been working fine with this port setup for a while. I suspect that this could have something to do with OpenShift changes, or with how HCI is working.

@sgatfane and @rohgupta, I'd like to ask some more questions about the environment before proceeding.

1. Is this environment in __any way__ new?
2. Has this HCI setup been tested before?
3. Is this using a new version (even a minor version) of OpenShift compared to previous tests?
4. Is this running on bare metal hardware? (Some oc output I see suggests that it is.)
5. Is this running in any kind of new environment?

To help triage, we may want to run this test in other environments to see how reproducible it is.

1. Does this reproduce with an older version of OpenShift?
2. Does this reproduce in non-HCI environments?
3. Does this reproduce in HCI environments that are virtual/cloud instead of bare metal (or vice versa)?

Also, I'd really like to be able to get access to the environment where this is failing. Can you set up a test where the cluster remains offline for me to look at live?

Also, what does this mean?
> Can this issue reproducible?
> 1/1
Does this mean the issue was only seen once? I think it is important to understand if this issue reproduces with any regularity. It sounds like a race condition, but I don't see anything that suggests a cause. If this doesn't repro, I'm not sure we have enough info to know what can/should be changed.
I understand from Rohan that this is regularly reproducible. I suspect that there could be a race condition with the startup probe. To dig deeper into that, we need info that isn't present in our must-gathers. We should gather logs for the startup probe that are only captured in kubelet logs. Getting those logs requires setting the kubelet's log level to `--v=4` (or higher). This doc explains how to use an OpenShift MachineConfig to change the kubelet log level; please change the example log level from 2 to 4 (a sketch of such a MachineConfig is included after this exchange): https://docs.openshift.com/container-platform/4.8/rest_api/editing-kubelet-log-level-verbosity.html#persistent-kubelet-log-level-configuration_editing-kubelet-log-level-verbosity

We'll also need kubelet logs from the node where the RGW is scheduled. This command will likely do so and gather logs from all worker nodes: `oc adm node-logs --role worker -u kubelet`

(In reply to Rohan Gupta from comment #3)
> This issue is occurring on a freshly installed HCI cluster

No, the cluster where QE reported this issue was not a freshly deployed cluster. The cluster had been running for a couple of weeks.

(In reply to Blaine Gardner from comment #7)
> I don't think this issue is related to the deployment strategy. RGW has been working fine with this port setup for a while. I suspect that this could have something to do with OpenShift changes, or with how HCI is working.
>
> @sgatfane and @rohgupta, I'd like to ask some more questions about the environment before proceeding.
>
> 1. Is this environment in __any way__ new?

==> What do you mean by a new environment? If you mean a freshly deployed cluster, then the answer is no. We observed this on a provider cluster that had been running for almost a week; we had just run our regression TCs from tier2 of the MCG, RGW, and NooBaa tests.

> 2. Has this HCI setup been tested before?

==> At least from the ODF QE side, we are testing the HCI setup for the first time during this release.

> 3. Is this using a new version (even a minor version) of OpenShift compared to previous tests?

==> ODF QE is testing MCG, NooBaa, and RGW on this HCI provider/client solution for the first time. In converged solutions we do this testing on all supported platforms, and the issue has not been reported there yet.

> 4. Is this running on bare metal hardware? (Some oc output I see suggests that it is.)

When I reported it, the setup was not actual bare-metal hardware; it is vSphere hardware used like bare-metal hardware.

> 5. Is this running in any kind of new environment?
>
> To help triage, we may want to run this test in other environments to see how reproducible it is.
>
> 1. Does this reproduce with an older version of OpenShift?
> 2. Does this reproduce in non-HCI environments?
> 3. Does this reproduce in HCI environments that are virtual/cloud instead of bare metal (or vice versa)?

===> Yes, the first incident was observed on a vSphere bare-metal-like cluster. QE is not able to reproduce it again because we are blocked from running our tier2 tests on the provider; we are trying to resolve the ocs-ci issue that blocks that run.

> Also, I'd really like to be able to get access to the environment where this is failing. Can you set up a test where the cluster remains offline for me to look at live?

I was able to reproduce the issue on a BM cluster. I didn't hit the issue on the first storagecluster install, but after cleaning up ODF and recreating it, I hit the issue.
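For reference, the persistent kubelet log-level change requested above is done through a MachineConfig. The following is a minimal sketch only, assuming the systemd drop-in approach described in the linked OpenShift doc; the object name, drop-in file name, and ignition version are illustrative and should be matched to the cluster:

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  # Illustrative name; the "worker" role label targets the nodes where RGW runs.
  name: 99-worker-kubelet-loglevel
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  config:
    ignition:
      version: 3.2.0   # should match the cluster's ignition version
    systemd:
      units:
        - name: kubelet.service
          enabled: true
          dropins:
            - name: 30-logging.conf   # illustrative drop-in name from the doc's example
              contents: |
                [Service]
                # Raised from the doc's example value of 2 to 4, as requested above.
                Environment="KUBELET_LOG_LEVEL=4"
```

Once the MachineConfigPool has rolled this out, the `oc adm node-logs --role worker -u kubelet` command above should capture the more verbose startup-probe output.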
Here are the kubelet logs, @brgardne: https://drive.google.com/file/d/1Fl3fjPFn-Tym1mfad725qevkEUsx77LR/view?usp=sharing

@sgatfane I still really need more details.

I don't know what HCI means in the context you are using it. I really need a very deeply involved breakdown. It is important for us to understand how this environment is different from environments where the tests pass. That helps narrow down the possibilities for what is going wrong. If I don't have that information, I don't have any leads to help guide the search for bugs.

Please provide all of the information you can think of, even if you think the details are minor. The less information I get, the longer this is going to take to debug. This will require you to spend a lot of time writing out the information, but it is critical that you take the time to do it so that we can figure out the problem in time for release. I usually spend 45 minutes to 1 1/2 hours investigating and writing out detailed responses to BZs when they are complex.

What does HCI mean?

What is new about this test? Please explain in excruciating detail. What are the test steps? How are things configured?

How is this test different from tests that pass? List all of the differences. What steps are new? What configurations are new?

Does this work if NooBaa isn't installed? Does this test pass with ODF 4.13? Does this test pass with an older OpenShift version?

Please don't skip the answer to "Does this test pass with ODF 4.13?" If this is a regression test, then we should be testing whether the result is regressing between versions, and it will be easier to bisect the changes that happen between the versions.

Okay! I finally tracked this down!
The CephCluster has hostNetwork enabled:
    network:
      hostNetwork: true
      multiClusterService: {}
When the CephCluster has hostNetwork enabled, the CephObjectStore will too, and obviously, if the CephObjectStore is configured to use the host's network, the user should be 100% sure that the port is free on the host. Port 80 and port 443 are standard HTTP(S) ports, and so it should come as no surprise that something might be bound to those ports. In this case, it looks like haproxy is bound to the port, which is from some other OpenShift component.
For ODF, we should be **extremely** careful when deploying with hostNetwork enabled because of the risk of port conflicts like this. I would go so far as to suggest that we should avoid host networking unless there is no alternative.
Why is host networking being enabled on the CephCluster? Is this a bug or intentional?
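As context for the port conflict described above, here is a minimal sketch of the relevant object store gateway configuration, assuming Rook's CephObjectStore `spec.gateway.port`/`securePort` fields; the store name and namespace are taken from the pod name in the summary, and the instance count is illustrative:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectStore
metadata:
  name: ocs-storagecluster-cephobjectstore
  namespace: openshift-storage
spec:
  gateway:
    port: 80          # matches "framework conf key: port, val: 80" in the RGW log
    securePort: 443   # matches "framework conf key: ssl_port, val: 443"
    instances: 1      # illustrative
```

With host networking inherited from the CephCluster, beast tries to bind these ports directly on the node (hence `failed to bind address 0.0.0.0:443` when haproxy already owns 443); on the pod network, the same ports stay private to the pod's network namespace.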
(In reply to Blaine Gardner from comment #14)
> Okay! I finally tracked this down!
>
> The CephCluster has hostNetwork enabled:
>
>     network:
>       hostNetwork: true
>       multiClusterService: {}
>
> When the CephCluster has hostNetwork enabled, the CephObjectStore will too, and obviously, if the CephObjectStore is configured to use the host's network, the user should be 100% sure that the port is free on the host. Port 80 and port 443 are standard HTTP(S) ports, and so it should come as no surprise that something might be bound to those ports. In this case, it looks like haproxy is bound to the port, which is from some other OpenShift component.
>
> For ODF, we should be **extremely** careful when deploying with hostNetwork enabled because of the risk of port conflicts like this. I would go so far as to suggest that we should avoid host networking unless there is no alternative.
>
> Why is host networking being enabled on the CephCluster? Is this a bug or intentional?

Hi Blaine, thanks for the analysis and the suggestion. In provider/client, and also in the old Managed Services, hostNetwork has to be set to true for the provider.

BTW, adding a few points for update here. IIUC from what Suchita confirmed, we are not seeing this issue on all deployments, so it could be a race condition. Also, we hit this problem after running tier2 tests on the cluster, and currently, due to some changes needed in our CI, we are unable to repeat the tier2 execution to check whether the issue reproduces again.

Port 443 is being utilized by HAProxy on the host, so the RGW pod is not able to bind to 443.

4.14.4 content was finalized and frozen already. Moving the bug to 4.14.5.

Upstream PR to allow enabling/disabling host network for RGW is merged: https://github.com/red-hat-storage/ocs-operator/pull/2323

(In reply to Blaine Gardner from comment #13)
> @sgatfane I still really need more details.
>
> I don't know what HCI means in the context you are using it. I really need a very deeply involved breakdown. It is important for us to understand how this environment is different from environments where the tests pass. That helps narrow down the possibilities for what is going wrong. If I don't have that information, I don't have any leads to help guide the search for bugs.
>
> Please provide all of the information you can think of, even if you think the details are minor. The less information I get, the longer this is going to take to debug. This will require you to spend a lot of time writing out the information, but it is critical that you take the time to do it so that we can figure out the problem in time for release. I usually spend 45 minutes to 1 1/2 hours investigating and writing out detailed responses to BZs when they are complex.
>
> What does HCI mean?
>
> What is new about this test? Please explain in excruciating detail. What are the test steps? How are things configured?
>
> How is this test different from tests that pass? List all of the differences. What steps are new? What configurations are new?
>
> Does this work if NooBaa isn't installed? Does this test pass with ODF 4.13? Does this test pass with an older OpenShift version?
>
> Please don't skip the answer to "Does this test pass with ODF 4.13?"
> If this is a regression test, then we should be testing whether the result is regressing between versions, and it will be easier to bisect the changes that happen between the versions.

Clearing the need-info, because:

1. I have answered a few of these already in comment#8.
2. The root cause of this bug has already been tracked down above.

During a new deployment (OCP 4.14.15 + ODF odf-operator.v4.14.5-rhodf), CLBO appeared again on the rgw-ocs-storagecluster-cephobjectstore-a-<suffix> pod:

    oc get pods -n openshift-storage | grep rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a
    rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-ffd87dbt59p7   2/2   Running   18 (16h ago)   17h

After one hour and 18 restarts, the CLBO state disappeared. spec.managedResources.cephObjectStores.hostNetwork was set to False initially when creating the StorageCluster. After StorageClient creation, spec.managedResources.cephObjectStores.hostNetwork got a True value.

must-gather logs:
OCP: https://drive.google.com/file/d/1OtOORTU1H9e-ed5k5S73aNhOtV3fq28G/view?usp=sharing
OCS: https://drive.google.com/file/d/1wimm42cbF_iV_qImSNvy7j6EX43If7kT/view?usp=sharing
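For reference, with the linked ocs-operator change, RGW host networking is meant to be controlled from the StorageCluster. A minimal sketch of the field called out in the previous comment, with the cluster name and namespace assumed from the pod listing above, is:

```yaml
apiVersion: ocs.openshift.io/v1
kind: StorageCluster
metadata:
  name: ocs-storagecluster      # assumed from the pod names above
  namespace: openshift-storage
spec:
  managedResources:
    cephObjectStores:
      # Field reported in this bug: it was False at StorageCluster creation
      # and flipped to True after StorageClient creation.
      hostNetwork: false
```

When hostNetwork is false, the RGW pod gets its own pod IP and cannot collide with HAProxy's ports on the node; whether the provider flow should flip this back to true is the behavior questioned in the comments above.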