Bug 2184332
| Summary: | [multus, vsphere] ODF4.13 deployment with multus failed [storagecluster stuck on Progressing state] | ||
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Oded <oviner> |
| Component: | rook | Assignee: | Blaine Gardner <brgardne> |
| Status: | CLOSED NOTABUG | QA Contact: | Neha Berry <nberry> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 4.13 | CC: | muagarwa, ocs-bugs, odf-bz-bot |
| Target Milestone: | --- | Keywords: | TestBlocker |
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-04-04 17:45:34 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
**Resolution:** The issue was fixed by changing the security policy of the vSphere Standard Switch (Promiscuous mode, MAC address changes, Forged transmits) from Reject to Accept.
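The same change can be scripted from an ESXi shell; a minimal sketch, assuming the standard switch backing br-ex is named vSwitch0 (the switch name and shell access are assumptions; the equivalent toggle lives in the vSphere Client under the vSwitch's Security settings):

```
# Assumption: the standard switch is vSwitch0; substitute the real name.
esxcli network vswitch standard policy security set \
  --vswitch-name=vSwitch0 \
  --allow-promiscuous=true \
  --allow-mac-change=true \
  --allow-forged-transmits=true

# Verify the resulting policy:
esxcli network vswitch standard policy security get --vswitch-name=vSwitch0
```

With the policy set to Accept, the cluster recovered: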
```
sh-5.1$ ceph status
  cluster:
    id:     4655ecc0-2de9-4e41-8990-330486320a0b
    health: HEALTH_OK

  services:
    mon: 3 daemons, quorum a,b,c (age 24h)
    mgr: a(active, since 24h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 73m), 3 in (since 24h)
    rgw: 1 daemon active (1 hosts, 1 zones)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 450 objects, 137 MiB
    usage:   683 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     169 active+clean

  io:
    client:   1.3 KiB/s rd, 1.8 KiB/s wr, 2 op/s rd, 0 op/s wr
```
```
[odedviner@fedora auth]$ oc get storagecluster
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   24h   Ready              2023-04-03T16:55:08Z   4.13.0
```
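As an additional post-fix check, one can confirm that the Ceph daemons actually received the multus attachments; a minimal sketch, assuming the default openshift-storage namespace and the standard app=rook-ceph-osd pod label:

```
# List each OSD pod with its multus networks annotation
# (namespace and label assume a default ODF/Rook deployment):
oc -n openshift-storage get pods -l app=rook-ceph-osd \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/networks}{"\n"}{end}'
```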
For more info:
https://docs.google.com/document/d/1BRk9JqjWZM2WHXt8iVDryVYIqScEyjbTm3Jiln9Ghd4/edit
Description of problem (please be detailed as possible and provide log snippets):
ODF 4.13 deployment with multus failed on the vSphere platform [storagecluster stuck in Progressing state].

Version of all relevant components (if applicable):
OCP Version: 4.13.0-0.nightly-2023-04-01-062001
ODF Version: odf-operator.v4.13.0-121.stable
Platform: vSphere

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)?

Is there any workaround available to the best of your knowledge?

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Is this issue reproducible?

Can this issue be reproduced from the UI?

If this is a regression, please provide more details to justify this:

Steps to Reproduce:

1. Install an OCP cluster without ODF [4.13.0-0.nightly-2023-04-01-062001].

2. Enable PROMISC on the br-ex network interface on all worker nodes:

```
$ oc debug node/compute-0   [repeat for compute-1 and compute-2]
sh-4.4# chroot /host
sh-5.1# bash
[root@compute-0 /]# ip link set promisc on br-ex
```

3. Verify the PROMISC flag exists on the br-ex network interface:

```
sh-4.4# ifconfig
br-ex: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST>  mtu 1500
        inet 10.1.160.157  netmask 255.255.254.0  broadcast 10.1.161.255
        ether 00:50:56:8f:dd:20  txqueuelen 1000  (Ethernet)
        RX packets 2483234  bytes 1985663520 (1.8 GiB)
        RX errors 0  dropped 3724  overruns 0  frame 0
        TX packets 1893583  bytes 1503803337 (1.4 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
```

4. Create two NetworkAttachmentDefinitions in the default namespace (public-net: 192.168.20.0/24, cluster-net: 192.168.30.0/24):

```
$ cat net-atta-defs.yaml
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: public-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "br-ex",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.20.0/24"
      }
    }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: cluster-net
  namespace: default
  labels: {}
  annotations: {}
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "br-ex",
      "ipam": {
        "type": "whereabouts",
        "range": "192.168.30.0/24"
      }
    }'
```

5. Verify the two networks exist:

```
$ oc get network-attachment-definitions.k8s.cni.cncf.io
NAME          AGE
cluster-net   43s
public-net    43s
```

6. Install the ODF operator [odf-operator.v4.13.0-121.stable].

7. Install the storage cluster via the UI with multus enabled.

8. Check the Ceph status:

```
sh-5.1$ ceph -s
  cluster:
    id:     4655ecc0-2de9-4e41-8990-330486320a0b
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 osds down
            1 host (1 osds) down
            1 rack (1 osds) down
            Reduced data availability: 61 pgs inactive, 20 pgs down, 41 pgs peering, 27 pgs stale
            2 slow ops, oldest one blocked for 146 sec, daemons [osd.0,mon.a] have slow ops.

  services:
    mon: 3 daemons, quorum a,b,c (age 15h)
    mgr: a(active, since 15h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 2 up (since 64s), 3 in (since 15h)

  data:
    volumes: 1/1 healthy
    pools:   12 pools, 61 pgs
    objects: 0 objects, 0 B
    usage:   172 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     100.000% pgs not active
             34 creating+peering
             20 stale+creating+down
             7  stale+creating+peering

  progress:
    Global Recovery Event (0s)
      [............................]
```
9. Check the Ceph OSD dump:

```
sh-5.1$ ceph osd dump
osd.0 up   in  weight 1 up_from 783 up_thru 783 down_at 782 last_clean_interval [12,782) [v2:192.168.20.21:6800/2417487063,v1:192.168.20.21:6801/2417487063] [v2:192.168.30.1:6800/2463487063,v1:192.168.30.1:6801/2463487063] exists,up 43f15076-d401-46e0-9c53-f87375e075c6
osd.1 up   in  weight 1 up_from 783 up_thru 783 down_at 778 last_clean_interval [14,782) [v2:192.168.20.22:6800/3362994509,v1:192.168.20.22:6801/3362994509] [v2:192.168.30.2:6804/3409994509,v1:192.168.30.2:6805/3409994509] exists,up f55fd804-edfd-407a-8d6e-99682e291694
osd.2 down in  weight 1 up_from 769 up_thru 778 down_at 780 last_clean_interval [14,768) [v2:192.168.20.23:6800/1352676998,v1:192.168.20.23:6801/1352676998] [v2:192.168.30.3:6804/1397676998,v1:192.168.30.3:6805/1397676998] exists 51268948-c8d1-43ec-ac54-68615f786516
```

10. Check the storagecluster status:

```
$ oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   16h   Progressing              2023-04-03T16:55:08Z   4.13.0

Status:
  Conditions:
    Last Heartbeat Time:   2023-04-03T16:55:09Z
    Last Transition Time:  2023-04-03T16:55:09Z
    Message:               Version check successful
    Reason:                VersionMatched
    Status:                False
    Type:                  VersionMismatch
    Last Heartbeat Time:   2023-04-04T09:09:37Z
    Last Transition Time:  2023-04-03T16:55:09Z
    Message:               Error while reconciling: some StorageClasses were skipped while waiting for pre-requisites to be met: [ocs-storagecluster-ceph-rbd]
```

Actual results:
Ceph status is in WARN state and the storagecluster is stuck in Progressing state.

Expected results:
Ceph status is OK and the storagecluster is in Ready state.

Additional info:
https://docs.google.com/document/d/1BRk9JqjWZM2WHXt8iVDryVYIqScEyjbTm3Jiln9Ghd4/edit#
OCP + OCS must gather: http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-2184332
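The failure pattern above (one OSD down, PGs stuck peering over the macvlan networks) is consistent with the hypervisor dropping promiscuous/forged traffic, which macvlan in bridge mode relies on for VM-to-VM delivery. The attachment networks can also be smoke-tested directly, without a full ODF install, using two throwaway pods. This is only a sketch; the pod names, image, and node names (compute-0/compute-1) are assumptions:

```
# Hypothetical smoke test: pin one pod per worker to the public-net
# attachment, then ping across nodes over the macvlan interface (net1).
for node in compute-0 compute-1; do
cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: multus-test-$node
  namespace: default
  annotations:
    k8s.v1.cni.cncf.io/networks: public-net
spec:
  nodeName: $node
  containers:
  - name: test
    image: registry.access.redhat.com/ubi9/ubi   # any image that ships ping and ip
    command: ["sleep", "3600"]
EOF
done

# Find the 192.168.20.x address assigned to net1 on one pod ...
oc exec multus-test-compute-0 -- ip -o addr show net1
# ... and ping it from the pod on the other node.
oc exec multus-test-compute-1 -- ping -c3 <net1-address-of-multus-test-compute-0>
```

If cross-node pings fail while same-node pings succeed, the problem is below OpenShift, i.e. the vSwitch security policy described in the resolution above.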