Description of problem (please be as detailed as possible and provide log snippets):
All OSDs are down due to a network issue between OSDs on the public-net network [192.168.20.0/24].

Version of all relevant components (if applicable):
OCP Version: 4.13.0-0.nightly-2023-04-01-062001
ODF Version: odf-operator.v4.13.0-121.stable
Platform: vSphere

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Deploy a cluster with Multus: https://docs.google.com/document/d/1BRk9JqjWZM2WHXt8iVDryVYIqScEyjbTm3Jiln9Ghd4/edit
2. Wait 1 day.
3. Check the OSD status (a triage sketch for scanning all OSD pods follows this report):

$ oc logs rook-ceph-osd-0-6f78674c6f-qxmc8 | grep 192.168
Defaulted container "osd" out of: osd, log-collector, blkdevmapper (init), activate (init), expand-bluefs (init), chown-container-data-dir (init)
debug 2023-04-06T21:03:23.748+0000 7f6bcc0f6640 -1 osd.0 3347 heartbeat_check: no reply from 192.168.20.23:6802 osd.2 ever on either front or back, first ping sent 2023-04-06T21:03:03.195162+0000 (oldest deadline 2023-04-06T21:03:23.195162+0000)
debug 2023-04-06T21:03:44.724+0000 7f6bcc0f6640 -1 osd.0 3347 heartbeat_check: no reply from 192.168.20.23:6802 osd.2 ever on either front or back, first ping sent 2023-04-06T21:03:24.228882+0000 (oldest deadline 2023-04-06T21:03:44.228882+0000)

$ oc get pods | grep osd
rook-ceph-osd-0-6f78674c6f-qxmc8                               2/2   Running     29 (32m ago)   3d4h
rook-ceph-osd-1-5dd88b5bcf-8dm5h                               2/2   Running     30 (32m ago)   3d4h
rook-ceph-osd-2-689cdd7988-slckd                               2/2   Running     26 (32m ago)   3d4h
rook-ceph-osd-prepare-26675259a7e22f7b8bd207b55ae5091b-jlrr5   0/1   Completed   0              3d4h
rook-ceph-osd-prepare-cf2c554cd5325882450e573af6dfb907-pjn7d   0/1   Completed   0              3d4h
rook-ceph-osd-prepare-e59da83b4d48de9041b5978067b2f707-6vglt   0/1   Completed   0              3d4h

sh-5.1$ ceph -s
  cluster:
    id:     4655ecc0-2de9-4e41-8990-330486320a0b
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            2 osds down
            2 hosts (2 osds) down
            2 racks (2 osds) down
            Reduced data availability: 169 pgs inactive, 169 pgs down

  services:
    mon: 3 daemons, quorum a,b,c (age 3d)
    mgr: a(active, since 3d)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 1 up (since 3m), 3 in (since 3d)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 571 objects, 486 MiB
    usage:   1.4 GiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     100.000% pgs not active
             169 down

$ oc describe pod rook-ceph-osd-0-6f78674c6f-qxmc8
Events:
  Type     Reason     Age                  From     Message
  ----     ------     ----                 ----     -------
  Warning  Unhealthy  95m (x10 over 22h)   kubelet  Startup probe failed: ceph daemon health check failed with the following output:
> no valid command found; 10 closest matches:
> 0
> 1
> 2
> abort
> assert
> bluefs debug_inject_read_zeros
> bluefs files list
> bluefs stats
> bluestore allocator dump block
> bluestore allocator fragmentation block
> admin_socket: invalid command
  Normal   Pulled     35m (x30 over 3d4h)  kubelet  Container image "quay.io/rhceph-dev/rhceph@sha256:f916da02f59b8f73ad18eb65310333d1e3cbd1a54678ff50bf27ed9618719b63" already present on machine
  Normal   Created    35m (x30 over 3d4h)  kubelet  Created container osd
  Normal   Started    35m (x30 over 3d4h)  kubelet  Started container osd

Actual results:
OSD pods restart many times.
Expected results:
OSD pods remain in Running state.

Additional info:
OCP+OCS must-gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/bz-2185173.tar.gz
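For triage, a quick way to confirm that every OSD is seeing the same heartbeat failures on the public network is to scan all OSD pods at once. The snippet below is only a minimal sketch: it assumes the default ODF namespace openshift-storage, the Rook label app=rook-ceph-osd, and that the optional rook-ceph-tools deployment exists; adjust to the actual environment.

# Minimal triage sketch (assumptions: namespace openshift-storage,
# label app=rook-ceph-osd, optional rook-ceph-tools deployment).
for pod in $(oc -n openshift-storage get pods -l app=rook-ceph-osd -o name); do
  echo "--- ${pod} ---"
  # Count heartbeat failures reported by this OSD in the last hour.
  oc -n openshift-storage logs "${pod}" -c osd --since=1h \
    | grep -c 'heartbeat_check: no reply' || true
done

# If the toolbox is deployed, cross-check which OSDs Ceph itself marks down:
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd tree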
I have talked with Oded, and he will try to reproduce the issue again.
The cluster is no longer available, and vSphere network communication issues cannot be debugged from the must-gathers attached here. Closing.
Hi Blaine,

The bug was reported with the logs that could be extracted. If something is missing, feel free to ask for it. In case must-gather doesn't contain the necessary artifacts, we should file a bug for adding them. I am re-opening the bug to make sure we are not stopping the RCA.
The RCA never stopped. The underlying issue is likely to be with using promiscuous mode on vSphere, as captured in an external email thread.

@ebenahar What is the appropriate component to change this to for updating the QE environment? Should this be moved to JIRA? Since this is not a Rook bug, nor a bug in ODF, we are eager to get it off the product.

---

Hi Elad,

Enabling promiscuous mode is not considered a safe option, and since we have a requirement to enable that option to deploy Multus, I enabled promiscuous mode on DC-15 (VMFS, with a single server) ONLY. I intentionally didn't enable it on DC-CP/ECO since we have more servers and the whole team is using those DCs. Also, if we try to create more than 1 cluster on the same DC that has promiscuous mode enabled, we might see slow metadata IOs and OSD down issues (might be related to infra). In 4.12, we created an additional network interface, "ens224".

Regards,
VJ

On Thu, Apr 13, 2023 at 12:19 PM Elad Ben Aharon <ebenahar> wrote:

Looks like there is a bit of a disconnect here. The vSphere environment is administrated by my team, ODF QE. I understand that promiscuous mode was enabled last week on the vSwitches, since it is a prerequisite for Multus to be configured and used properly. Such a configuration, with promiscuous mode enabled on the vSwitches, hasn't been used so far in our vSphere environment. @Vijay Bhaskar Reddy Avuthu please keep me honest about it.

This same vSphere environment is being used for many other QE activities. We haven't seen such flakiness and unpredictable behavior in the past, so I would say that promiscuous mode being enabled is the immediate suspect. We will check if we see this flakiness on other, non-Multus clusters.

From: Eran Tamir1 <etamir>
Date: Thursday, 13 April 2023 at 9:32
To: Blaine Gardner <Blaine.Gardner>, Coady LaCroix <clacroix>, Daniel Horak <dahorak>, Elad Ben Aharon <ebenahar>, Subham Kumar Rai <Subham.Kumar.Rai>, Mudit Agarwal2 <Mudit.Agarwal2>, Bipin Kunal <Bipin.Kunal>, Christopher Nelson Blum <cblum>
Subject: Re: Multus testing in vSphere/vCenter

Adding @Christopher Nelson Blum, who may have the right connections to help here.

From: Blaine Gardner <Blaine.Gardner>
Sent: Wednesday, April 12, 2023 1:22 AM
To: Coady LaCroix <clacroix>; Daniel Horak <dahorak>; Elad Ben Aharon <ebenahar>; Subham Kumar Rai <Subham.Kumar.Rai>; Mudit Agarwal2 <Mudit.Agarwal2>; Bipin Kunal <Bipin.Kunal>; Eran Tamir1 <etamir>
Subject: Re: Multus testing in vSphere/vCenter

I have a bit of an update here: Oded saw that a cluster went from being ready to not ready, with some pods in CLBO, after about a day. As I have been developing the Multus validation test, I am seeing that the Multus connections can be quite flaky in the vSphere environment and that the flakiness is not predictable. If simple HTTP curl traffic isn't reliable, that suggests to me that a Ceph cluster won't be either, validating what Oded saw. I believe what may be happening is what I noted in my first email: "[I have read that] there may be performance impacts of using promiscuous mode on VMware switches."

I need to be preparing for KubeCon. Can we find someone familiar with VMware and vCenter to help resolve what is happening here? No one I've talked to seems to know who the vCenter administrator is, if there is one. There might be software or hardware options that fix the performance loss on VMware switches when promiscuous routing is allowed. Finding someone with deep networking and/or VMware knowledge could make this go a lot more quickly.
If performance on the one network can't be improved, I can only think of one suggestion to help resolve the issue:

• Reject promiscuous mode (and the other network security options) on the "VM_Network" network
• Create a new network that will be used for attaching additional Multus interfaces to tests that include Multus
  ◦ "VM_Promiscuous" perhaps?
  ◦ This probably requires using a different/new network switch, but I'm really not sure
  ◦ Allow promiscuous traffic, and the other 2 security options as well, on the new network
• Update the openshift-install scripts to set up multiple (i.e., 2) networks on Multus test hosts
  ◦ Newer versions of openshift-install support multiple networks in some capacity, judging by the latest API I've seen
• Change test NADs to use the 2nd interface for Multus instead of the interface that is also used for the pod overlay network (a sketch follows below)

Blaine
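To illustrate the last suggestion, here is a minimal sketch of a NetworkAttachmentDefinition that pins the Ceph public network to a dedicated second interface rather than the interface carrying the pod overlay network. The interface name ens224 (mentioned above), the macvlan and whereabouts plugins, the openshift-storage namespace, and the public-net name/range are assumptions for illustration, not the exact configuration used in this environment.

# Hypothetical NAD sketch; interface, plugins, namespace, and range are assumed.
cat <<'EOF' | oc apply -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: openshift-storage
spec:
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens224",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.20.0/24"
      }
    }
EOF

Keeping Multus traffic on its own interface would let the "VM_Network" port group keep promiscuous mode rejected, with only the new port group relaxing the security options.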
If a change has to happen in automation, then QE needs to open an issue in ocs-ci. Given that this is not a product bug, it should be closed.

>> In case must gather doesn't contain the necessary artifacts, we should file a bug for adding them.

vSphere network communication related logs can't be added to must-gather.
Okay. Closing, and I'll open an issue in ocs-ci that others can add more info to.
https://github.com/red-hat-storage/ocs-ci/issues/7467