Bug 2185173
| Summary: | Multus, Network issue between the OSDs in the 'public-net' network | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Oded <oviner> |
| Component: | rook | Assignee: | Blaine Gardner <brgardne> |
| Status: | CLOSED NOTABUG | QA Contact: | Neha Berry <nberry> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.13 | CC: | ebenahar, muagarwa, ocs-bugs, odf-bz-bot, srai |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-04-11 15:13:43 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Oded
2023-04-07 10:06:08 UTC
Have talked with Oded, and he will try to reproduce the issue again. The cluster is no longer available, and vSphere network communication issues cannot be debugged with must-gathers here. Closing.

Hi Blaine,
The bug was reported with the logs that could be extracted. If something is missing, feel free to ask for it. In case must-gather doesn't contain the necessary artifacts, we should file a bug for adding them. I am re-opening the bug to make sure we are not stopping the RCA.

RCA never stopped. The underlying issue is likely to be with using promiscuous mode on vSphere, as captured in an external email thread. @ebenahar What is the appropriate component to change this to for updating the QE environment? Should this be moved to JIRA? Since this is not a rook bug, nor a bug in ODF, we are eager to get it off the product.

Hi Elad,
Enabling promiscuous mode is not considered a safe option, and since we have a requirement to enable that option to deploy Multus, I enabled promiscuous mode on DC-15 (VMFS, with a single server) ONLY. I intentionally didn't enable it on DC-CP/ECO since we have more servers there and the whole team is using those DCs. Also, if we try to create more than one cluster on the same DC with promiscuous mode enabled, we might see slow metadata IOs and OSD-down issues (might be related to infra). In 4.12, we created an additional network interface, "ens224".
Regards,
VJ

On Thu, Apr 13, 2023 at 12:19 PM Elad Ben Aharon <ebenahar> wrote:
Looks like there is a bit of a disconnect here. The vSphere environment is administered by my team, ODF QE. I understand that promiscuous mode was enabled last week on the vswitches, since it is a prerequisite for Multus to be configured and used properly. Such a configuration, with promiscuous mode enabled on the vswitches, hasn't been used so far in our vSphere environment. @Vijay Bhaskar Reddy Avuthu please keep me honest about it. This same vSphere environment is being used for many other QE activities.
We haven’t seen such flakiness and unpredictable behavior in the past, so I would say that promiscuous mode being enabled is the immediate suspect. We will check whether we see this flakiness on other, non-Multus clusters.

From: Eran Tamir1 <etamir>
Date: Thursday, 13 April 2023 at 9:32
To: Blaine Gardner <Blaine.Gardner>, Coady LaCroix <clacroix>, Daniel Horak <dahorak>, Elad Ben Aharon <ebenahar>, Subham Kumar Rai <Subham.Kumar.Rai>, Mudit Agarwal2 <Mudit.Agarwal2>, Bipin Kunal <Bipin.Kunal>, Christopher Nelson Blum <cblum>
Subject: Re: Multus testing in vSphere/vCenter

Adding @Christopher Nelson Blum, who may have the right connections to help here.

From: Blaine Gardner <Blaine.Gardner>
Sent: Wednesday, April 12, 2023 1:22 AM
To: Coady LaCroix <clacroix>; Daniel Horak <dahorak>; Elad Ben Aharon <ebenahar>; Subham Kumar Rai <Subham.Kumar.Rai>; Mudit Agarwal2 <Mudit.Agarwal2>; Bipin Kunal <Bipin.Kunal>; Eran Tamir1 <etamir>
Subject: Re: Multus testing in vSphere/vCenter

I have a bit of an update here: Oded saw that a cluster went from being ready to not ready, with some pods in CLBO, after about a day. As I have been developing the Multus validation test, I am seeing that the Multus connections can be quite flaky in the vSphere environment, and that the flakiness is not predictable. If simple HTTP curl traffic isn’t reliable, that suggests to me that a Ceph cluster won’t be either, validating what Oded saw. I believe what may be happening is what I noted in my first email: “[I have read that] there may be performance impacts of using promiscuous mode on VMware switches.” I need to be preparing for KubeCon. Can we find someone familiar with VMware and vCenter to help resolve what is happening here? No one I’ve talked to seems to know who the vCenter administrator is, if there is one. There might be software or hardware options that fix the performance loss on VMware switches when promiscuous routing is allowed.
Finding someone with deep networking and/or VMware knowledge could make this go a lot more quickly. If performance on the one network can’t be improved, I can only think of one suggestion to help resolve the issue:

• Reject promiscuous mode (and the other network security options) on the “VM_Network” network
• Create a new network that will be used for attaching additional Multus interfaces to tests that include Multus
  ◦ “VM_Promiscuous”, perhaps?
  ◦ This probably requires using a different/new network switch, but I’m really not sure
  ◦ Allow promiscuous traffic, and the other two security options as well, on the new network
• Update the openshift-install scripts to set up multiple (i.e., 2) networks on Multus test hosts
  ◦ Newer versions of openshift-install support multiple networks in some capacity, judging by the latest API I’ve seen
• Change test NADs to use the 2nd interface for Multus instead of the interface that is also used for the pod overlay network

Blaine

If the change has to happen in automation, then QE needs to open an issue in ocs-ci.
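For illustration, the last suggestion above (pointing test NADs at a dedicated second interface such as "ens224" rather than the interface carrying the pod overlay network) could look roughly like the sketch below. This is a hypothetical NetworkAttachmentDefinition, not one taken from the cluster in this bug: the name, namespace, CIDR, and choice of macvlan/whereabouts are illustrative assumptions.

```yaml
# Hypothetical NAD sketch: attach the Multus "public" network to the second
# VM interface (ens224) so storage traffic avoids the overlay interface.
# All names and the IP range below are illustrative, not from this bug report.
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: openshift-storage
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens224",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.20.0/24"
      }
    }
```

With a layout like this, only the port group backing ens224 would need promiscuous mode (for macvlan's extra MAC addresses), leaving the security policy on "VM_Network" untouched.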
Given that this is not a product bug, it should be closed.
>> In case must gather doesn't contain the necessary artifacts, we should file a bug for adding them.
vSphere network communication related logs can't be added to must-gather.
Okay. Closing, and I'll open an issue in ocs-ci that others can add more info to.