Bug 2184332 - [multus, vsphere] ODF4.13 deployment with multus failed [storagecluster stuck on Progressing state]
Summary: [multus, vsphere] ODF4.13 deployment with multus failed [storagecluster stuck on Progressing state]
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assignee: Blaine Gardner
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-04-04 09:42 UTC by Oded
Modified: 2023-08-09 17:03 UTC
CC: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-04-04 17:45:34 UTC
Embargoed:



Description Oded 2023-04-04 09:42:25 UTC
Description of problem (please be as detailed as possible and provide log snippets):
ODF 4.13 deployment with Multus fails on the vSphere platform [storagecluster stuck in the Progressing state]

Version of all relevant components (if applicable):
OCP Version: 4.13.0-0.nightly-2023-04-01-062001
ODF Version: odf-operator.v4.13.0-121.stable
Platform: vSphere


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?


Is there any workaround available to the best of your knowledge?


Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?


Can this issue be reproduced from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Install OCP cluster without ODF [4.13.0-0.nightly-2023-04-01-062001]

2. Enable promiscuous mode (PROMISC) on the br-ex network interface on all worker nodes (repeat for compute-0, compute-1, and compute-2):
$ oc debug node/compute-0
sh-4.4# chroot /host
sh-5.1# bash
[root@compute-0 /]#  ip link set promisc on br-ex
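To apply the same setting to every worker node in one pass, a non-interactive loop can be used instead (a sketch only; it assumes the worker nodes carry the standard node-role.kubernetes.io/worker label, whereas the original report used interactive debug shells):

$ for node in $(oc get nodes -l node-role.kubernetes.io/worker -o name); do
    oc debug "$node" -- chroot /host ip link set promisc on br-ex
  done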

3. Verify the PROMISC flag exists on the br-ex network interface:
sh-4.4# ifconfig
br-ex: flags=4419<UP,BROADCAST,RUNNING,PROMISC,MULTICAST>  mtu 1500
        inet 10.1.160.157  netmask 255.255.254.0  broadcast 10.1.161.255
        ether 00:50:56:8f:dd:20  txqueuelen 1000  (Ethernet)
        RX packets 2483234  bytes 1985663520 (1.8 GiB)
        RX errors 0  dropped 3724  overruns 0  frame 0
        TX packets 1893583  bytes 1503803337 (1.4 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
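
The flag can also be confirmed with ip from the same debug shell; PROMISC should appear in the angle-bracketed flag list on the first line of output:

sh-4.4# ip link show br-ex | head -1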

4. Create two NetworkAttachmentDefinitions (Multus net-attach-defs) in the default namespace:

public-net: 192.168.20.0/24
cluster-net: 192.168.30.0/24

$ cat net-atta-defs.yaml
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
 name: public-net
 namespace: default
 labels: {}
 annotations: {}
spec:
 config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "br-ex", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.20.0/24" } }'
---
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
 name: cluster-net
 namespace: default
 labels: {}
 annotations: {}
spec:
 config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "br-ex", "ipam": { "type": "whereabouts", "range": "192.168.30.0/24" } }'


5. Verify the two NetworkAttachmentDefinitions exist:
$ oc get network-attachment-definitions.k8s.cni.cncf.io
NAME          AGE
cluster-net   43s
public-net    43s

6. Install the ODF Operator [odf-operator.v4.13.0-121.stable]

7. Install the storage cluster via the UI with Multus enabled:
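Enabling Multus in the wizard should produce a StorageCluster whose network section points at the two NetworkAttachmentDefinitions. A minimal sketch of the expected spec excerpt (field names follow the Rook/OCS NetworkSpec API; the values assume the definitions created above in the default namespace):

spec:
  network:
    provider: multus
    selectors:
      public: default/public-net
      cluster: default/cluster-net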

8. Check Ceph status:
sh-5.1$ ceph -s
  cluster:
    id:     4655ecc0-2de9-4e41-8990-330486320a0b
    health: HEALTH_WARN
            1 MDSs report slow metadata IOs
            1 osds down
            1 host (1 osds) down
            1 rack (1 osds) down
            Reduced data availability: 61 pgs inactive, 20 pgs down, 41 pgs peering, 27 pgs stale
            2 slow ops, oldest one blocked for 146 sec, daemons [osd.0,mon.a] have slow ops.
 
  services:
    mon: 3 daemons, quorum a,b,c (age 15h)
    mgr: a(active, since 15h)
    mds: 1/1 daemons up, 1 standby
    osd: 3 osds: 2 up (since 64s), 3 in (since 15h)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 61 pgs
    objects: 0 objects, 0 B
    usage:   172 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     100.000% pgs not active
             34 creating+peering
             20 stale+creating+down
             7  stale+creating+peering
 
  progress:
    Global Recovery Event (0s)
      [............................] 

9. Check ceph OSD dump:
sh-5.1$ ceph osd dump
osd.0 up   in  weight 1 up_from 783 up_thru 783 down_at 782 last_clean_interval [12,782) [v2:192.168.20.21:6800/2417487063,v1:192.168.20.21:6801/2417487063] [v2:192.168.30.1:6800/2463487063,v1:192.168.30.1:6801/2463487063] exists,up 43f15076-d401-46e0-9c53-f87375e075c6
osd.1 up   in  weight 1 up_from 783 up_thru 783 down_at 778 last_clean_interval [14,782) [v2:192.168.20.22:6800/3362994509,v1:192.168.20.22:6801/3362994509] [v2:192.168.30.2:6804/3409994509,v1:192.168.30.2:6805/3409994509] exists,up f55fd804-edfd-407a-8d6e-99682e291694
osd.2 down in  weight 1 up_from 769 up_thru 778 down_at 780 last_clean_interval [14,768) [v2:192.168.20.23:6800/1352676998,v1:192.168.20.23:6801/1352676998] [v2:192.168.30.3:6804/1397676998,v1:192.168.30.3:6805/1397676998] exists 51268948-c8d1-43ec-ac54-68615f786516
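
In the dump above, the first bracketed address set of each OSD is its public endpoint (192.168.20.x, the public-net range) and the second is its cluster endpoint (192.168.30.x, the cluster-net range), so both Multus networks are attached. To map the down OSD (osd.2) to its host, the placement tree can be checked from the same toolbox shell:

sh-5.1$ ceph osd tree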

10. Check storagecluster status:
$ oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   16h   Progressing              2023-04-03T16:55:08Z   4.13.0

Status:
  Conditions:
    Last Heartbeat Time:   2023-04-03T16:55:09Z
    Last Transition Time:  2023-04-03T16:55:09Z
    Message:               Version check successful
    Reason:                VersionMatched
    Status:                False
    Type:                  VersionMismatch
    Last Heartbeat Time:   2023-04-04T09:09:37Z
    Last Transition Time:  2023-04-03T16:55:09Z
    Message:               Error while reconciling: some StorageClasses were skipped while waiting for pre-requisites to be met: [ocs-storagecluster-ceph-rbd]
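
The condition excerpt above comes from the StorageCluster resource itself; the full status can be dumped with (assuming the default install namespace openshift-storage):

$ oc get storagecluster ocs-storagecluster -n openshift-storage -o yaml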



Actual results:
Ceph status is in HEALTH_WARN and the storagecluster is stuck in the Progressing state

Expected results:
Ceph status is HEALTH_OK and the storagecluster reaches the Ready state

Additional info:

https://docs.google.com/document/d/1BRk9JqjWZM2WHXt8iVDryVYIqScEyjbTm3Jiln9Ghd4/edit#

OCP + OCS must gather:
http://rhsqe-repo.lab.eng.blr.redhat.com/OCS/ocs-qe-bugs/BZ-2184332

Comment 3 Oded 2023-04-04 17:45:34 UTC
The issue was fixed by configuring the security policy of the vSphere Standard Switch [Promiscuous mode, MAC address changes, Forged transmits] to Accept (changed from Reject).
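
For reference, the equivalent change can also be scripted at the port-group level with govc (a hedged sketch only; the fix described above was made in the vSphere Client, and the host and port-group names below are placeholders):

$ govc host.portgroup.change -host <esxi-host> \
    -allow-promiscuous=true -mac-changes=true -forged-transmits=true \
    <portgroup-name>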


sh-5.1$ ceph status
  cluster:
    id:     4655ecc0-2de9-4e41-8990-330486320a0b
    health: HEALTH_OK
 
  services:
    mon: 3 daemons, quorum a,b,c (age 24h)
    mgr: a(active, since 24h)
    mds: 1/1 daemons up, 1 hot standby
    osd: 3 osds: 3 up (since 73m), 3 in (since 24h)
    rgw: 1 daemon active (1 hosts, 1 zones)
 
  data:
    volumes: 1/1 healthy
    pools:   12 pools, 169 pgs
    objects: 450 objects, 137 MiB
    usage:   683 MiB used, 1.5 TiB / 1.5 TiB avail
    pgs:     169 active+clean
 
  io:
    client:   1.3 KiB/s rd, 1.8 KiB/s wr, 2 op/s rd, 0 op/s wr

[odedviner@fedora auth]$ oc get storagecluster
NAME                 AGE   PHASE   EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   24h   Ready              2023-04-03T16:55:08Z   4.13.0

for more info:
https://docs.google.com/document/d/1BRk9JqjWZM2WHXt8iVDryVYIqScEyjbTm3Jiln9Ghd4/edit

