Description of problem (please be as detailed as possible and provide log snippets):
All OSDs are down due to a network issue between OSDs on the public-net network [192.168.20.0/24].

Version of all relevant components (if applicable):
OCP Version: 4.13.0-0.nightly-2023-04-01-062001
ODF Version: odf-operator.v4.13.0-121.stable
Platform: vSphere

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a cluster with Multus: https://docs.google.com/document/d/1BRk9JqjWZM2WHXt8iVDryVYIqScEyjbTm3Jiln9Ghd4/edit
2. Wait 1 day.
3. Check the OSD status (a triage sketch for scanning all OSD pods follows this report):

$ oc logs rook-ceph-osd-0-6f78674c6f-qxmc8 | grep 192.168
Defaulted container "osd" out of: osd, log-collector, blkdevmapper (init), activate (init), expand-bluefs (init), chown-container-data-dir (init)
debug 2023-04-06T21:03:23.748+0000 7f6bcc0f6640 -1 osd.0 3347 heartbeat_check: no reply from 192.168.20.23:6802 osd.2 ever on either front or back, first ping sent 2023-04-06T21:03:03.195162+0000 (oldest deadline 2023-04-06T21:03:23.195162+0000)
debug 2023-04-06T21:03:44.724+0000 7f6bcc0f6640 -1 osd.0 3347 heartbeat_check: no reply from 192.168.20.23:6802 osd.2 ever on either front or back, first ping sent 2023-04-06T21:03:24.228882+0000 (oldest deadline 2023-04-06T21:03:44.228882+0000)

$ oc get pods | grep osd
rook-ceph-osd-0-6f78674c6f-qxmc8                               2/2   Running     29 (32m ago)   3d4h
rook-ceph-osd-1-5dd88b5bcf-8dm5h                               2/2   Running     30 (32m ago)   3d4h
rook-ceph-osd-2-689cdd7988-slckd                               2/2   Running     26 (32m ago)   3d4h
rook-ceph-osd-prepare-26675259a7e22f7b8bd207b55ae5091b-jlrr5   0/1   Completed   0              3d4h
rook-ceph-osd-prepare-cf2c554cd5325882450e573af6dfb907-pjn7d   0/1   Completed   0              3d4h
rook-ceph-osd-prepare-e59da83b4d48de9041b5978067b2f707-6vglt   0/1   Completed   0              3d4h

sh-5.1$ ceph -s
  cluster:
    id:     4655ecc0-2de9-4e41-8990-330486320a0b
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 169 pgs inactive, 169 pgs down

  services:
    mon: 3 daemons, quorum a,b,c (age 3d)
    mgr: a(active, since 3d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 1 up (since 3m), 3 in (since 3d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 571 objects, 486 MiB
    usage:   1.4 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     100.000% pgs not active
             169 down

$ oc describe pod rook-ceph-osd-0-6f78674c6f-qxmc8
Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  95m (x10 over 22h)   kubelet  Startup probe failed: ceph daemon health check failed with the following output:
> no valid command found; 10 closest matches:
> 0
> 1
> 2
> abort
> assert
> bluefs debug_inject_read_zeros
> bluefs files list
> bluefs stats
> bluestore allocator dump block
> bluestore allocator fragmentation block
> admin_socket: invalid command
  Normal   Pulled     35m (x30 over 3d4h)  kubelet  Container image "quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63" already present on machine
  Normal   Created    35m (x30 over 3d4h)  kubelet  Created container osd
  Normal   Started    35m (x30 over 3d4h)  kubelet  Started container osd

Actual results:
OSD pods restart many times.
Expected results:
OSD pods remain in Running state.

Additional info:
OCP+OCS must-gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2185173.tar.gz
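For triage, a quick way to confirm that every OSD is seeing the same heartbeat failures on the public network is to scan all OSD pods at once. The snippet below is only a minimal sketch: it assumes the default ODF namespace openshift-storage, the Rook label app=rook-ceph-osd, and that the optional rook-ceph-tools deployment exists; adjust to the actual environment.

# Minimal triage sketch (assumptions: namespace openshift-storage,
# label app=rook-ceph-osd, optional rook-ceph-tools deployment).
for pod in $(oc -n openshift-storage get pods -l app=rook-ceph-osd -o name); do
  echo "--- ${pod} ---"
  # Count heartbeat failures reported by this OSD in the last hour.
  oc -n openshift-storage logs "${pod}" -c osd --since=1h \
    | grep -c 'heartbeat_check: no reply' || true
done

# If the toolbox is deployed, cross-check which OSDs Ceph itself marks down:
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd tree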
I have talked with Oded, and he will try to reproduce the issue again.
The cluster is no longer available, and vSphere network communication issues cannot be debugged from the must-gathers attached here. Closing.
Hi Blaine,

The bug was reported with the logs that could be extracted. If something is missing, feel free to ask for it. In case must-gather doesn't contain the necessary artifacts, we should file a bug for adding them. I am re-opening the bug to make sure we are not stopping the RCA.
The RCA never stopped. The underlying issue is likely to be with using promiscuous mode on vSphere, as captured in an external email thread.

@ebenahar What is the appropriate component to change this to for updating the QE environment? Should this be moved to JIRA? Since this is not a Rook bug, nor a bug in ODF, we are eager to get it off the product.

---

Hi Elad,

Enabling promiscuous mode is not considered a safe option, and since we have a requirement to enable that option to deploy Multus, I enabled promiscuous mode on DC-15 (VMFS, with a single server) ONLY. I intentionally didn't enable it on DC-CP/ECO since we have more servers and the whole team is using those DCs. Also, if we try to create more than 1 cluster on the same DC that has promiscuous mode enabled, we might see slow metadata IOs and OSD down issues (might be related to infra). In 4.12, we created an additional network interface, "ens224".

Regards,
VJ

On Thu, Apr 13, 2023 at 12:19 PM Elad Ben Aharon <ebenahar> wrote:

Looks like there is a bit of a disconnect here. The vSphere environment is administrated by my team, ODF QE. I understand that promiscuous mode was enabled last week on the vSwitches, since it is a prerequisite for Multus to be configured and used properly. Such a configuration, with promiscuous mode enabled on the vSwitches, hasn't been used so far in our vSphere environment. @Vijay Bhaskar Reddy Avuthu please keep me honest about it.

This same vSphere environment is being used for many other QE activities. We haven't seen such flakiness and unpredictable behavior in the past, so I would say that promiscuous mode being enabled is the immediate suspect. We will check if we see this flakiness on other, non-Multus clusters.

From: Eran Tamir1 <etamir>
Date: Thursday, 13 April 2023 at 9:32
To: Blaine Gardner <Blaine.Gardner>, Coady LaCroix <clacroix>, Daniel Horak <dahorak>, Elad Ben Aharon <ebenahar>, Subham Kumar Rai <Subham.Kumar.Rai>, Mudit Agarwal2 <Mudit.Agarwal2>, Bipin Kunal <Bipin.Kunal>, Christopher Nelson Blum <cblum>
Subject: Re: Multus testing in vSphere/vCenter

Adding @Christopher Nelson Blum, who may have the right connections to help here.

From: Blaine Gardner <Blaine.Gardner>
Sent: Wednesday, April 12, 2023 1:22 AM
To: Coady LaCroix <clacroix>; Daniel Horak <dahorak>; Elad Ben Aharon <ebenahar>; Subham Kumar Rai <Subham.Kumar.Rai>; Mudit Agarwal2 <Mudit.Agarwal2>; Bipin Kunal <Bipin.Kunal>; Eran Tamir1 <etamir>
Subject: Re: Multus testing in vSphere/vCenter

I have a bit of an update here: Oded saw that a cluster went from being ready to not ready, with some pods in CLBO, after about a day. As I have been developing the Multus validation test, I am seeing that the Multus connections can be quite flaky in the vSphere environment and that the flakiness is not predictable. If simple HTTP curl traffic isn't reliable, that suggests to me that a Ceph cluster won't be either, validating what Oded saw. I believe what may be happening is what I noted in my first email: "[I have read that] there may be performance impacts of using promiscuous mode on VMware switches."

I need to be preparing for KubeCon. Can we find someone familiar with VMware and vCenter to help resolve what is happening here? No one I've talked to seems to know who the vCenter administrator is, if there is one. There might be software or hardware options that fix the performance loss on VMware switches when promiscuous routing is allowed. Finding someone with deep networking and/or VMware knowledge could make this go a lot more quickly.
If performance on the one network can't be improved, I can only think of one suggestion to help resolve the issue:

• Reject promiscuous mode (and the other network security options) on the "VM_Network" network
• Create a new network that will be used for attaching additional Multus interfaces to tests that include Multus
  ◦ "VM_Promiscuous" perhaps?
  ◦ This probably requires using a different/new network switch, but I'm really not sure
  ◦ Allow promiscuous traffic, and the other 2 security options as well, on the new network
• Update the openshift-install scripts to set up multiple (i.e., 2) networks on Multus test hosts
  ◦ Newer versions of openshift-install support multiple networks in some capacity, judging by the latest API I've seen
• Change test NADs to use the 2nd interface for Multus instead of the interface that is also used for the pod overlay network (a sketch follows below)

Blaine
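To illustrate the last suggestion, here is a minimal sketch of a NetworkAttachmentDefinition that pins the Ceph public network to a dedicated second interface rather than the interface carrying the pod overlay network. The interface name ens224 (mentioned above), the macvlan and whereabouts plugins, the openshift-storage namespace, and the public-net name/range are assumptions for illustration, not the exact configuration used in this environment.

# Hypothetical NAD sketch; interface, plugins, namespace, and range are assumed.
cat <<'EOF' | oc apply -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: openshift-storage
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens224",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.20.0/24"
      }
    }
EOF

Keeping Multus traffic on its own interface would let the "VM_Network" port group keep promiscuous mode rejected, with only the new port group relaxing the security options.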
If a change has to happen in automation, then QE needs to open an issue in ocs-ci. Given that this is not a product bug, it should be closed.

>> In case must gather doesn't contain the necessary artifacts, we should file a bug for adding them.

vSphere network communication related logs can't be added to must-gather.
Okay. Closing, and I'll open an issue in ocs-ci that others can add more info to.
https://github.com/red-hat-storage/ocs-ci/issues/7467