Description of problem (please be as detailed as possible and provide log snippets):

With multus enabled, the storagecluster is not reaching Ready state. One storageclass is not getting created and the noobaa pods are not coming up.

Version of all relevant components (if applicable):

[root@nara1-cicd-odf-1c53-syd05-bastion-0 ~]# oc get clusterversion
NAME      VERSION                                      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-ppc64le-2023-02-17-084453   True        False         2d      Cluster version is 4.13.0-0.nightly-ppc64le-2023-02-17-084453

[root@nara1-cicd-odf-1c53-syd05-bastion-0 ~]# oc describe csv odf-operator.v4.13.0 -n openshift-storage | grep full
Labels:       full_version=4.13.0-92

Also tested on the ODF 4.13.0-95 build.

Does this issue impact your ability to continue to work with the product (please explain in detail what is the user impact)? No

Is there any workaround available to the best of your knowledge? No

Rate from 1 - 5 the complexity of the scenario you performed that caused this bug (1 - very simple, 5 - very complex)?

Can this issue be reproduced? Yes

Can this issue be reproduced from the UI? Yes

If this is a regression, please provide more details to justify this:

Steps to Reproduce:
1. Create a 4.13 OCP cluster
2. Create NADs
3. Deploy ODF with multus
4. The storagecluster is stuck in Progressing state because a storageclass is missing, as seen in the status

Actual results:
The storagecluster is stuck in Progressing state.

Expected results:
The storagecluster should reach Ready state.

Additional info:
1. Created the NAD with the below spec:

cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: openshift-storage
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "env2",
    "mode": "bridge",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.0.2/24"
    }
  }'
EOF

2. During ODF deployment, the storagesystem is created with the above NAD.

3. The storagecluster is stuck in Progressing state.

[root@nara1-cicd-odf-1c53-syd05-bastion-0 ~]# oc get storagecluster -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      cluster.ocs.openshift.io/local-devices: "true"
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2023-02-27T10:29:08Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 2
    name: ocs-storagecluster
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: odf.openshift.io/v1alpha1
      kind: StorageSystem
      name: ocs-storagecluster-storagesystem
      uid: 67dd65ba-680c-4cd0-99ca-ffc29923667a
    resourceVersion: "1118258"
    uid: 04969a31-1e0f-483a-8ebb-fe8bb7a4908c
  spec:
    arbiter: {}
    encryption:
      kms: {}
    externalStorage: {}
    flexibleScaling: true
    managedResources:
      cephBlockPools: {}
      cephCluster: {}
      cephConfig: {}
      cephDashboard: {}
      cephFilesystems: {}
      cephNonResilientPools: {}
      cephObjectStoreUsers: {}
      cephObjectStores: {}
      cephToolbox: {}
    mirroring: {}
    monDataDirHostPath: /var/lib/rook
    network:
      provider: multus
      selectors:
        cluster: default/iocs-public2
        public: default/iocs-public
    nodeTopologies: {}
    resources:
      mds:
        limits:
          cpu: "3"
          memory: 8Gi
        requests:
          cpu: "1"
          memory: 8Gi
      rgw:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "1"
          memory: 4Gi
    storageDeviceSets:
    - config: {}
      count: 3
      dataPVCTemplate:
        metadata: {}
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: "1"
          storageClassName: localblock
          volumeMode: Block
        status: {}
      name: ocs-deviceset-localblock
      placement: {}
      preparePlacement: {}
      replica: 1
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "1"
          memory: 5Gi
  status:
    conditions:
    - lastHeartbeatTime: "2023-02-28T05:12:59Z"
      lastTransitionTime: "2023-02-27T10:29:09Z"
      message: 'Error while reconciling: some StorageClasses were skipped while waiting
        for pre-requisites to be met: [ocs-storagecluster-ceph-rbd]'
      reason: ReconcileFailed
      status: "False"
      type: ReconcileComplete
    - lastHeartbeatTime: "2023-02-27T10:29:09Z"
      lastTransitionTime: "2023-02-27T10:29:09Z"
      message: Initializing StorageCluster
      reason: Init
      status: "False"
      type: Available
    - lastHeartbeatTime: "2023-02-27T10:29:09Z"
      lastTransitionTime: "2023-02-27T10:29:09Z"
      message: Initializing StorageCluster
      reason: Init
      status: "True"
      type: Progressing
    - lastHeartbeatTime: "2023-02-27T10:29:09Z"
      lastTransitionTime: "2023-02-27T10:29:09Z"
      message: Initializing StorageCluster
      reason: Init
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2023-02-27T10:29:09Z"
      lastTransitionTime: "2023-02-27T10:29:09Z"
      message: Initializing StorageCluster
      reason: Init
      status: Unknown
      type: Upgradeable
    externalStorage:
      grantedCapacity: "0"
    failureDomain: host
    failureDomainKey: kubernetes.io/hostname
    failureDomainValues:
    - syd05-worker-0.nara1-cicd-odf-1c53.redhat.com
    - syd05-worker-1.nara1-cicd-odf-1c53.redhat.com
    - syd05-worker-2.nara1-cicd-odf-1c53.redhat.com
    images:
      ceph:
        actualImage: quay.io/rhceph-dev/rhceph@sha256:c4cceafa24f984bfa8aaa8937df0c545c21f37c35cc4661db8ee4f010bddfb74
        desiredImage: quay.io/rhceph-dev/rhceph@sha256:c4cceafa24f984bfa8aaa8937df0c545c21f37c35cc4661db8ee4f010bddfb74
      noobaaCore:
        desiredImage: quay.io/rhceph-dev/odf4-mcg-core-rhel9@sha256:5dd993448516e250cf7af449e346710f2437bc11c5ccd835cbe52c1d5a175765
      noobaaDB:
        desiredImage: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:9248c4eaa8aeedacc1c06d7e3141ca1457147eef59e329273eb78e32fcd27e79
    kmsServerConnection: {}
    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - syd05-worker-0.nara1-cicd-odf-1c53.redhat.com
        - syd05-worker-1.nara1-cicd-odf-1c53.redhat.com
        - syd05-worker-2.nara1-cicd-odf-1c53.redhat.com
    phase: Progressing
    relatedObjects:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      namespace: openshift-storage
      resourceVersion: "1114281"
      uid: f04f39f2-4c44-4ca7-9900-3dd6f6ad87b1
    version: 4.13.0
kind: List
metadata:
  resourceVersion: ""
[root@nara1-cicd-odf-1c53-syd05-bastion-0 ~]#

4. The noobaa pods are not coming up.

5. rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-69f6db5lr8xr   1/2   Running   1 (15s ago)   5m21s

I will add must-gather logs once uploaded. If required, I can set up the environment.
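The missing ocs-storagecluster-ceph-rbd StorageClass is normally only created once its backing CephBlockPool is ready, so a quick place to look is the Rook CRs and the rook-ceph-operator log. A minimal sketch (resource names assume the default ODF deployment shown above):

```
oc -n openshift-storage get cephcluster,cephblockpool
oc -n openshift-storage describe cephblockpool ocs-storagecluster-cephblockpool
# look for pool creation / reconcile errors in the operator log
oc -n openshift-storage logs deploy/rook-ceph-operator --tail=100 | grep -iE 'blockpool|error'
```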
I think I see the issue in the StorageCluster configuration:

```
network:
  provider: multus
  selectors:
    cluster: default/iocs-public2
    public: default/iocs-public
```

Instead of the above, it should be:

```
network:
  provider: multus
  selectors:
    cluster: openshift-storage/iocs-public2
    public: openshift-storage/iocs-public
```

Refer to:

```
The NetworkAttachmentDefinition should be referenced along with the namespace in which it is present, like public: <namespace>/<name of NAD>. e.g., if the network attachment definitions are in the default namespace:

  public: default/rook-public-nw
  cluster: default/rook-cluster-nw
```

in this doc: https://rook.github.io/docs/rook/latest/CRDs/Cluster/ceph-cluster-crd/?h=mu#multus

Will be happy to look at the must-gather or the cluster.
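If the cluster is already deployed, the selectors could in principle be corrected in place rather than redeploying. A minimal sketch (assumes the default StorageCluster name and the NAD names from the snippet above; whether a live selector change is fully picked up by an already-running Ceph cluster is a separate question, so redeploying may still be the safer path):

```
oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge \
  -p '{"spec":{"network":{"provider":"multus","selectors":{"public":"openshift-storage/iocs-public","cluster":"openshift-storage/iocs-public2"}}}}'
```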
This was tried multiple times. I have also tried another environment with the openshift-storage namespace. I am setting up an environment to get you more information.
Below are the network config details on the nodes and the NAD spec.

[core@syd05-worker-0 ~]$ ifconfig
env2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 192.168.0.121  netmask 255.255.255.0  broadcast 192.168.0.255
        inet6 fe80::f3f4:6078:3e5c:e1bc  prefixlen 64  scopeid 0x20<link>
        ether fa:c2:a7:a8:9d:20  txqueuelen 1000  (Ethernet)
        RX packets 1552044  bytes 1724073436 (1.6 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 585362  bytes 308775824 (294.4 MiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device interrupt 19

[core@syd05-master-1 ~]$ ifconfig
env2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 192.168.0.52  netmask 255.255.255.0  broadcast 192.168.0.255
        inet6 fe80::ee44:5233:fde8:f04b  prefixlen 64  scopeid 0x20<link>
        ether fa:4c:b6:c2:c3:20  txqueuelen 1000  (Ethernet)
        RX packets 26932419  bytes 22688481785 (21.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 16474461  bytes 6858627911 (6.3 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device interrupt 19

[core@syd05-master-0 ~]$ ifconfig
env2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 192.168.0.103  netmask 255.255.255.0  broadcast 192.168.0.255
        inet6 fe80::f9f8:a3a9:e1eb:1443  prefixlen 64  scopeid 0x20<link>
        ether fa:42:05:74:4d:20  txqueuelen 1000  (Ethernet)
        RX packets 17916412  bytes 16094303599 (14.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10915520  bytes 6440703513 (5.9 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device interrupt 19

cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: openshift-storage
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "env2",
    "mode": "bridge",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.0.2/24"
    }
  }'
EOF
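Since the NAD range (192.168.0.2/24) sits on the same subnet as the nodes' env2 addresses above, it may be worth comparing what whereabouts has already allocated with what the node interface holds. A sketch, assuming the stock OpenShift whereabouts deployment (which keeps its IPPool CRs in openshift-multus); the node name is a placeholder:

```
# addresses whereabouts has handed out from each NAD range
oc get ippools.whereabouts.cni.cncf.io -n openshift-multus

# the env2 address actually configured on a node
oc debug node/<node-name> -- chroot /host ip -4 addr show env2
```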
A new cluster is set up and I am getting the same behavior.

[root@nara6-cicd-odf-31bf-syd05-bastion-0 ~]# oc get pods
NAME                                                              READY   STATUS      RESTARTS      AGE
csi-addons-controller-manager-cc4b499fd-8rmst                     2/2     Running     0             22m
csi-cephfsplugin-7bvfw                                            2/2     Running     0             8m1s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-8xnb5      1/1     Running     0             8m3s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-tbc72      1/1     Running     0             8m3s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-tzfp4      1/1     Running     0             8m3s
csi-cephfsplugin-provisioner-6cbfd7fd77-lr2br                     5/5     Running     0             9m6s
csi-cephfsplugin-provisioner-6cbfd7fd77-vg2dr                     5/5     Running     0             9m6s
csi-cephfsplugin-wjx9f                                            2/2     Running     0             7m51s
csi-cephfsplugin-z8g2l                                            2/2     Running     0             7m56s
csi-rbdplugin-8tnjv                                               3/3     Running     0             8m2s
csi-rbdplugin-hhc6k                                               3/3     Running     0             7m51s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-c597c         1/1     Running     0             8m4s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-r5c22         1/1     Running     0             8m4s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-vg9b4         1/1     Running     0             8m4s
csi-rbdplugin-provisioner-78bc9589dc-nsf9l                        6/6     Running     0             9m6s
csi-rbdplugin-provisioner-78bc9589dc-w9mgv                        6/6     Running     0             9m6s
csi-rbdplugin-rdvg8                                               3/3     Running     0             7m56s
noobaa-operator-6945c8f4f6-j8qh4                                  1/1     Running     0             22m
ocs-metrics-exporter-f478dfd65-xr799                              1/1     Running     0             22m
ocs-operator-67f55fd69b-mqxdz                                     1/1     Running     0             22m
odf-console-7b88df97b6-xql6b                                      1/1     Running     0             22m
odf-operator-controller-manager-67b6f7879d-6xq8n                  2/2     Running     0             22m
rook-ceph-crashcollector-1e13b6e9213b2457d4ee942ea6c49722-qncpl   1/1     Running     0             6m11s
rook-ceph-crashcollector-ae9b8dcbb391f72fe2d91aa2690725b8-bdkc6   1/1     Running     0             6m1s
rook-ceph-crashcollector-f10bcbd876937b397cce97feb647d3a1-znn46   1/1     Running     0             7m3s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-659f8657rss65   2/2     Running     0             6m11s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6ffbc49f72h7k   2/2     Running     0             6m1s
rook-ceph-mgr-a-7cb768f844-ts9bc                                  3/3     Running     0             7m9s
rook-ceph-mon-a-5c494dddd4-gkvds                                  2/2     Running     0             8m6s
rook-ceph-mon-b-b748db99d-x57xh                                   2/2     Running     0             7m39s
rook-ceph-mon-c-86b59b88b7-shsbq                                  2/2     Running     0             7m25s
rook-ceph-operator-5dd6df7795-jwx8f                               1/1     Running     0             9m15s
rook-ceph-osd-0-74c645bf45-rdtz8                                  2/2     Running     0             6m40s
rook-ceph-osd-1-7999d97d5b-fqkqz                                  2/2     Running     0             6m41s
rook-ceph-osd-2-5c9f768697-rft2l                                  2/2     Running     0             6m41s
rook-ceph-osd-prepare-1e0eff230cc0f76d25b67e2c29fb037c-sxq5c      0/1     Completed   0             6m58s
rook-ceph-osd-prepare-5061fe06e4a142a28ef23564c45ac3bc-h4qfl      0/1     Completed   0             6m58s
rook-ceph-osd-prepare-c6a3989e48d3a64164f109074528ec4e-8sgqz      0/1     Completed   0             6m58s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-679547498qwj   1/2     Running     1 (20s ago)   5m25s

[root@nara6-cicd-odf-31bf-syd05-bastion-0 ~]# oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   10m   Progressing              2023-03-06T14:57:10Z   4.13.0

[root@nara6-cicd-odf-31bf-syd05-bastion-0 ~]# oc describe storagecluster ocs-storagecluster
Name:         ocs-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  cluster.ocs.openshift.io/local-devices: true
              uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Kind:         StorageCluster
Metadata:
  Creation Timestamp:  2023-03-06T14:57:10Z
  Finalizers:
    storagecluster.ocs.openshift.io
  Generation:  2
  Managed Fields:
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:cluster.ocs.openshift.io/local-devices:
      f:spec:
        .:
        f:arbiter:
        f:encryption:
          .:
          f:kms:
        f:flexibleScaling:
        f:monDataDirHostPath:
        f:network:
          .:
          f:connections:
            .:
            f:encryption:
          f:provider:
          f:selectors:
            .:
            f:cluster:
            f:public:
        f:nodeTopologies:
        f:resources:
          .:
          f:mds:
            .:
            f:limits:
              .:
              f:cpu:
              f:memory:
            f:requests:
              .:
              f:cpu:
              f:memory:
          f:rgw:
            .:
            f:limits:
              .:
              f:cpu:
              f:memory:
            f:requests:
              .:
              f:cpu:
              f:memory:
    Manager:      Mozilla
    Operation:    Update
    Time:         2023-03-06T14:57:10Z
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:uninstall.ocs.openshift.io/cleanup-policy:
          f:uninstall.ocs.openshift.io/mode:
        f:finalizers:
          .:
          v:"storagecluster.ocs.openshift.io":
      f:spec:
        f:externalStorage:
        f:managedResources:
          .:
          f:cephBlockPools:
          f:cephCluster:
          f:cephConfig:
          f:cephDashboard:
          f:cephFilesystems:
          f:cephNonResilientPools:
          f:cephObjectStoreUsers:
          f:cephObjectStores:
          f:cephToolbox:
          f:mirroring:
        f:storageDeviceSets:
    Manager:      ocs-operator
    Operation:    Update
    Time:         2023-03-06T14:57:10Z
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .:
          k:{"uid":"17d05cd0-47ca-4d7d-80db-2d393a04814d"}:
    Manager:      manager
    Operation:    Update
    Time:         2023-03-06T14:57:11Z
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:externalStorage:
          .:
          f:grantedCapacity:
        f:failureDomain:
        f:failureDomainKey:
        f:failureDomainValues:
        f:images:
          .:
          f:ceph:
            .:
            f:actualImage:
            f:desiredImage:
          f:noobaaCore:
            .:
            f:desiredImage:
          f:noobaaDB:
            .:
            f:desiredImage:
        f:kmsServerConnection:
        f:nodeTopologies:
          .:
          f:labels:
            .:
            f:kubernetes.io/hostname:
        f:phase:
        f:relatedObjects:
        f:version:
    Manager:         ocs-operator
    Operation:       Update
    Subresource:     status
    Time:            2023-03-06T15:07:03Z
  Owner References:
    API Version:     odf.openshift.io/v1alpha1
    Kind:            StorageSystem
    Name:            ocs-storagecluster-storagesystem
    UID:             17d05cd0-47ca-4d7d-80db-2d393a04814d
  Resource Version:  132426
  UID:               060a8a37-8c11-427f-9ae5-1f03c1d8af48
Spec:
  Arbiter:
  Encryption:
    Kms:
  External Storage:
  Flexible Scaling:  true
  Managed Resources:
    Ceph Block Pools:
    Ceph Cluster:
    Ceph Config:
    Ceph Dashboard:
    Ceph Filesystems:
    Ceph Non Resilient Pools:
    Ceph Object Store Users:
    Ceph Object Stores:
    Ceph Toolbox:
    Mirroring:
  Mon Data Dir Host Path:  /var/lib/rook
  Network:
    Connections:
      Encryption:
    Provider:  multus
    Selectors:
      Cluster:  openshift-storage/ocs-private
      Public:   openshift-storage/ocs-public
  Node Topologies:
  Resources:
    Mds:
      Limits:
        Cpu:     3
        Memory:  8Gi
      Requests:
        Cpu:     1
        Memory:  8Gi
    Rgw:
      Limits:
        Cpu:     2
        Memory:  4Gi
      Requests:
        Cpu:     1
        Memory:  4Gi
  Storage Device Sets:
    Config:
    Count:  3
    Data PVC Template:
      Metadata:
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         1
        Storage Class Name:  localblock
        Volume Mode:         Block
      Status:
    Name:       ocs-deviceset-localblock
    Placement:
    Prepare Placement:
    Replica:    1
    Resources:
      Limits:
        Cpu:     2
        Memory:  5Gi
      Requests:
        Cpu:     1
        Memory:  5Gi
Status:
  Conditions:
    Last Heartbeat Time:   2023-03-06T14:57:12Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Version check successful
    Reason:                VersionMatched
    Status:                False
    Type:                  VersionMismatch
    Last Heartbeat Time:   2023-03-06T15:07:03Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Error while reconciling: some StorageClasses were skipped while waiting for pre-requisites to be met: [ocs-storagecluster-ceph-rbd]
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2023-03-06T14:57:12Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2023-03-06T14:57:12Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2023-03-06T14:57:12Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2023-03-06T14:57:12Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
  External Storage:
    Granted Capacity:  0
  Failure Domain:      host
  Failure Domain Key:  kubernetes.io/hostname
  Failure Domain Values:
    syd05-worker-0.nara6-cicd-odf-31bf.redhat.com
    syd05-worker-1.nara6-cicd-odf-31bf.redhat.com
    syd05-worker-2.nara6-cicd-odf-31bf.redhat.com
  Images:
    Ceph:
      Actual Image:   quay.io/rhceph-dev/rhceph@sha256:a9bffe4a4b9115cba8f1d4192245c81d797bf54bb3e5aaed8c4499fcf78b477c
      Desired Image:  quay.io/rhceph-dev/rhceph@sha256:a9bffe4a4b9115cba8f1d4192245c81d797bf54bb3e5aaed8c4499fcf78b477c
    Noobaa Core:
      Desired Image:  quay.io/rhceph-dev/odf4-mcg-core-rhel9@sha256:c1264177733a7219078bc4d1ca2acf615aafbd1648b42a9b8832210acdd16ba8
    Noobaa DB:
      Desired Image:  quay.io/rhceph-dev/rhel8-postgresql-12@sha256:9248c4eaa8aeedacc1c06d7e3141ca1457147eef59e329273eb78e32fcd27e79
  Kms Server Connection:
  Node Topologies:
    Labels:
      kubernetes.io/hostname:
        syd05-worker-0.nara6-cicd-odf-31bf.redhat.com
        syd05-worker-1.nara6-cicd-odf-31bf.redhat.com
        syd05-worker-2.nara6-cicd-odf-31bf.redhat.com
  Phase:  Progressing
  Related Objects:
    API Version:       ceph.rook.io/v1
    Kind:              CephCluster
    Name:              ocs-storagecluster-cephcluster
    Namespace:         openshift-storage
    Resource Version:  132422
    UID:               10844f71-4e11-4169-bc0b-03058457ba47
  Version:  4.13.0
Events:     <none>
[root@nara6-cicd-odf-31bf-syd05-bastion-0 ~]#
@subham I have shared the cluster information with you over chat. Let me know if you need any other details.
I see rbd init commands hang:

```
2023-03-06 15:00:28.009617 E | ceph-block-pool-controller: failed to reconcile CephBlockPool "openshift-storage/ocs-storagecluster-cephblockpool". failed to create pool "ocs-storagecluster-cephblockpool".: failed to create pool "ocs-storagecluster-cephblockpool".: failed to initialize pool "ocs-storagecluster-cephblockpool" for RBD use. : command terminated with exit code 124
```

The pool itself is created:

```
ceph osd lspools --conf=/var/lib/rook/openshift-storage/openshift-storage.config
1 .mgr
2 ocs-storagecluster-cephblockpool
```

It seems like all rbd commands hang, but I am able to run ceph commands.
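For reference, exit code 124 is what `timeout` returns when it kills a command, which matches the operator log above, so the hang can be reproduced by hand. A sketch, assuming the ceph tools pod has been enabled (it is not deployed by default in ODF):

```
# ceph commands return promptly...
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph -s

# ...while rbd commands against the pool hang until the timeout fires (exit 124)
oc -n openshift-storage exec deploy/rook-ceph-tools -- \
  timeout 30 rbd ls -p ocs-storagecluster-cephblockpool
echo "exit code: $?"
```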
On a different note, I see the rgw pod in CrashLoopBackOff due to:

```
Startup probe failed: RGW health check failed with error code: 7. the RGW likely cannot be reached by clients
```
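Error code 7 here looks like curl's "failed to connect to host", so the same check can be run by hand against the RGW service. A sketch (assumes the default object store name and the default gateway port 80):

```
# run from any pod that has curl, e.g. the ceph tools pod; curl exit code 7 = could not connect
oc -n openshift-storage exec deploy/rook-ceph-tools -- \
  curl -sS --max-time 5 http://rook-ceph-rgw-ocs-storagecluster-cephobjectstore.openshift-storage.svc:80
echo "curl exit code: $?"
```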
Also, I see osd.0 is not in the multus range:

```
osd.0 up in weight 1 up_from 856 up_thru 856 down_at 855 last_clean_interval [10,855) [v2:192.168.0.25:6800/2499787560,v1:192.168.0.25:6801/2499787560] [v2:192.168.0.25:6802/2545787560,v1:192.168.0.25:6803/2545787560] exists,up cf8a7623-42c4-418f-b30e-064016a0ed2f
osd.1 down out weight 0 up_from 839 up_thru 851 down_at 853 last_clean_interval [9,838) [v2:192.168.0.23:6800/3591904357,v1:192.168.0.23:6801/3591904357] [v2:192.168.0.23:6802/3635904357,v1:192.168.0.23:6803/3635904357] autoout,exists 16f42375-e158-4ec8-be55-b2ef98221494
osd.2 up in weight 1 up_from 856 up_thru 856 down_at 851 last_clean_interval [9,855) [v2:192.168.0.21:6800/828470086,v1:192.168.0.21:6801/828470086] [v2:192.168.0.21:6808/873470086,v1:192.168.0.21:6809/873470086] exists,up 369839bf-2768-4244-b986-42aaac4e83e0
```

```
kc get network-attachment-definitions.k8s.cni.cncf.io -oyaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2023-03-06T14:54:33Z"
    generation: 1
    name: ocs-private
    namespace: openshift-storage
    resourceVersion: "122210"
    uid: 948f9d88-c9a1-4c8c-bb3c-7c2898ff5752
  spec:
    config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "env2", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.0.2/24" } }'
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2023-03-06T14:54:18Z"
    generation: 1
    name: ocs-public
    namespace: openshift-storage
    resourceVersion: "122043"
    uid: 78769273-3255-4775-b90d-d8b68d539b00
  spec:
    config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "env2", "mode": "bridge", "ipam": { "type": "whereabouts", "range": "192.168.0.2/24" } }'
kind: List
metadata:
  resourceVersion: ""
```

Will check more on this with @brgardne.
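The OSD network bindings can be cross-checked against the NAD ranges directly; in `ceph osd dump` output the first address pair per OSD is the public network and the second pair is the cluster network. A sketch, again assuming the tools pod is available:

```
# OSD public/cluster addresses as seen by Ceph
oc -n openshift-storage exec deploy/rook-ceph-tools -- ceph osd dump | grep '^osd\.'

# whereabouts ranges configured on each NAD, for comparison
oc -n openshift-storage get network-attachment-definitions.k8s.cni.cncf.io \
  -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.config}{"\n"}{end}'
```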
I think the issue is that the NAD is overlapping with the host's env2 network. Both are 192.168.0.X/24. Please choose a different CIDR for the NAD and see if that fixes it. Try something like 192.168.167.0/24 on the NAD.

FYI, CIDRs (pronounced like the drink cider) are a strange thing to learn to read. They're quite simple but do require understanding binary notation. 1.0.0.1/24 doesn't mean that the IPs go from 1.0.0.1 to 1.0.0.24. The /24 defines the address "prefix" and corresponds to a bitmask, out of 32 bits, starting from the left. The prefix determines the number of '1' bits in the mask. The most common bitmasks you'll see are:
- /24 (mask 11111111.11111111.11111111.00000000) (24 1's) - most home router networks and virtual machine networks use this range
- /16 (mask 11111111.11111111.00000000.00000000) (16 1's) - minikube sets the cluster CIDR as 10.244.0.0/16 to allow for many pods
- /12 (mask 11111111.11110000.00000000.00000000) (12 1's) - minikube sets the service cluster CIDR as 10.96.0.0/12, which allows more services than pods

This shows the bitmask for each prefix including the binary: https://datacadamia.com/network/mask

There are a few IPv4 network spaces that are reserved for private use, which is why you commonly see 10., 172., and 192. networks. Conceptually, these are approximately: big corporate class A, business class B, and personal class C, respectively. https://www.arin.net/reference/research/statistics/address_filters/

Each 4 bits equates to a 0xF in hex, so /12 is 0xFF.0xF0.0x00.0x00 in hex (255.240.0.0 in decimal). It'll be easiest to work with CIDRs if you understand how hex compares to binary and make use of your calculator's "programmer" mode. You can also use a CIDR calculator tool like this: https://account.arin.net/public/cidrCalculator

Anything "masked" by a 1 bit is a fixed part of the address. Anything not masked (i.e., a "0") can be variable and handed out by a DHCP server to a client. This allows admins to set up different networks that don't reuse addresses. For example, when I set up NADs for multus testing in minikube, I use 192.168.20.0/24 for one NAD and 192.168.21.0/24 for another NAD so that neither network will overlap with the other. 192.168.20.0/255.255.255.0 means the ".20" section will never change, and similarly for the ".21" section. If I were to specify 192.168.20.0/24 for one NAD and 192.168.21.0/16 for the next, the ".21" section would be unmasked, and the second NAD's address range would overlap for 256 addresses in the range 192.168.20.0-.255.
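A quick way to sanity-check a candidate NAD range against the host subnet before creating the NAD is sketched below; it just uses python3's ipaddress module from a shell (strict=False normalizes a host address such as 192.168.0.2/24 to its network), and the addresses shown are examples:

```
python3 - <<'EOF'
import ipaddress

host_net = ipaddress.ip_network("192.168.0.2/24", strict=False)   # env2 subnet on the nodes
nad_net  = ipaddress.ip_network("192.168.20.0/24")                # proposed whereabouts range

print("host network:", host_net)
print("NAD network: ", nad_net)
print("overlap:     ", host_net.overlaps(nad_net))
EOF
```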
Hi, I tried again with the CIDR changes as defined below and am getting the same issue.

env2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 192.168.0.229  netmask 255.255.255.0  broadcast 192.168.0.255
        inet6 fe80::45e4:3913:845d:49ee  prefixlen 64  scopeid 0x20<link>
        ether fa:87:54:94:2f:20  txqueuelen 1000  (Ethernet)
        RX packets 4548094  bytes 4730443879 (4.4 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4141830  bytes 2707254142 (2.5 GiB)
        TX errors 0  dropped 0  overruns 0  carrier 0  collisions 0
        device interrupt 19

cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: openshift-storage
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "env2",
    "mode": "bridge",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.20.0/24"
    }
  }'
EOF

cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-private
  namespace: openshift-storage
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "env2",
    "mode": "bridge",
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.21.0/24"
    }
  }'
EOF
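To confirm the Ceph pods are actually getting addresses from the new ranges, the network-status annotation that Multus writes on each pod can be inspected. A sketch (the pod name is just an example):

```
oc -n openshift-storage get pod rook-ceph-osd-0-74c645bf45-rdtz8 \
  -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}' | python3 -m json.tool
```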
Also tried with the NAD type as bridge, as suggested by Blaine; getting the same issue.

cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: openshift-storage
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "bridge",
    "isGateway": true,
    "vlan": 2,
    "ipam": {
      "type": "whereabouts",
      "range": "192.168.20.0/24"
    }
  }'
EOF
Tried with ipvlan as suggested by Blaine. Most of the pods are not getting created due to a "failed to create pod sandbox" error.

[root@rdr-narayan13-lon06-bastion-0 ~]# oc get pods
NAME                                                           READY   STATUS              RESTARTS   AGE
csi-addons-controller-manager-6d894c9656-fv9lh                 2/2     Running             0          38m
csi-cephfsplugin-2pl77                                         2/2     Running             0          7m18s
csi-cephfsplugin-4rvdk                                         0/2     ContainerCreating   0          6m31s
csi-cephfsplugin-6wv7q                                         2/2     Running             0          7m18s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r   0/1     ContainerCreating   0          6m33s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-wqdz9   0/1     ContainerCreating   0          6m33s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-zktxn   0/1     ContainerCreating   0          6m33s
csi-cephfsplugin-provisioner-c678999f9-bmg7z                   0/5     ContainerCreating   0          7m18s
csi-cephfsplugin-provisioner-c678999f9-wrrtr                   0/5     ContainerCreating   0          7m18s
csi-rbdplugin-67qpr                                            3/3     Running             0          7m18s
csi-rbdplugin-cc5f2                                            0/3     ContainerCreating   0          6m31s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-kwpqj      0/1     ContainerCreating   0          6m34s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-p66m2      0/1     ContainerCreating   0          6m34s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-qvbbc      0/1     ContainerCreating   0          6m34s
csi-rbdplugin-hsljj                                            3/3     Running             0          7m18s
csi-rbdplugin-provisioner-798786c5fc-c49gk                     0/6     ContainerCreating   0          7m18s
csi-rbdplugin-provisioner-798786c5fc-pzgwn                     0/6     ContainerCreating   0          7m18s
noobaa-operator-595c599b64-2gnd5                               1/1     Running             0          39m
ocs-metrics-exporter-564f46b697-nl45w                          1/1     Running             0          39m
ocs-operator-76746fb89c-bwcgt                                  1/1     Running             0          39m
odf-console-6c8d464746-5t47b                                   1/1     Running             0          39m
odf-operator-controller-manager-67df444888-tjsxw               2/2     Running             0          39m
rook-ceph-mon-a-69f8598db-dm9db                                0/2     Init:0/2            0          6m29s
rook-ceph-operator-54d4c47787-qpdw2                            1/1     Running             0          7m25s

Normal   AddedInterface          20s   multus    Add eth0 [10.129.2.101/23] from ovn-kubernetes
Warning  FailedCreatePodSandBox  18s   kubelet   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r_openshift-storage_c1efa8b9-4f25-4678-a035-052b83a06b85_0(58df047a98390cd5b2edf706348fcc86d9bb990848c44e821b3b2320a6c2d1ec): error adding pod openshift-storage_csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-storage/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r/c1efa8b9-4f25-4678-a035-052b83a06b85:ocs-public]: error adding container to network "ocs-public": failed to create ipvlan: device or resource busy
Normal   AddedInterface          7s    multus    Add eth0 [10.129.2.101/23] from ovn-kubernetes
Warning  FailedCreatePodSandBox  1s    kubelet   Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r_openshift-storage_c1efa8b9-4f25-4678-a035-052b83a06b85_0(0c4d5cbcb97df3cead6a5438cbceb5c80ca5c0e20d9c85a430c636562f258d1b): error adding pod openshift-storage_csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-storage/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r/c1efa8b9-4f25-4678-a035-052b83a06b85:ocs-public]: error adding container to network "ocs-public": failed to create ipvlan: device or resource busy
Hi Subham, I created an OCP cluster and shared the environment over chat.
I validated the multus configuration using a script that @brgardne created, and it seems like the configuration is right.

```
sh test-multus.sh
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-1 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-2 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-3 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-4 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-5 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-6 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-7 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-8 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-9 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-10 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-11 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-12 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-13 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-14 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-15 created
Waiting for 15 daemonsets to have pods scheduled
 └─ waiting for daemonset 'multus-validator-1' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-2' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-3' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-4' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-5' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-6' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-7' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-8' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-9' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-10' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-11' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-12' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-13' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-14' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-15' to have pods scheduled
Waiting for 45 pods from 15 daemonsets
 └─ 45 of 45 pods have multus networks
Test successful!
Cleaning up daemonsets
daemonset.apps "multus-validator-2" deleted
daemonset.apps "multus-validator-3" deleted
daemonset.apps "multus-validator-7" deleted
daemonset.apps "multus-validator-5" deleted
daemonset.apps "multus-validator-14" deleted
daemonset.apps "multus-validator-15" deleted
daemonset.apps "multus-validator-8" deleted
daemonset.apps "multus-validator-12" deleted
daemonset.apps "multus-validator-4" deleted
daemonset.apps "multus-validator-1" deleted
daemonset.apps "multus-validator-13" deleted
daemonset.apps "multus-validator-11" deleted
daemonset.apps "multus-validator-6" deleted
daemonset.apps "multus-validator-10" deleted
daemonset.apps "multus-validator-9" deleted
............No resources found in openshift-storage namespace.
```

So, now we'll proceed with finding the actual bug.
I'm taking the rbd team's help to figure out why the rbd commands hang. I will update as soon as I have an update.
Thanks @idryomov for taking a look.

The multus IP ranges are `192.168.20.0/24` and `192.168.21.0/24` for public-net and cluster-net, and the osds are on the same ranges (see v1 and v2):

osd.0 up in weight 1 up_from 1698 up_thru 1698 down_at 1697 last_clean_interval [9,1697) [v2:192.168.20.21:6800/2815335877,v1:192.168.20.21:6801/2815335877] [v2:192.168.21.1:6804/2892335877,v1:192.168.21.1:6805/2892335877] exists,up 20a73be9-2add-4cf8-9ca9-3540007a7d8c
osd.1 up in weight 1 up_from 1698 up_thru 1698 down_at 1690 last_clean_interval [14,1697) [v2:192.168.20.22:6800/1147689911,v1:192.168.20.22:6801/1147689911] [v2:192.168.21.2:6804/1224689911,v1:192.168.21.2:6805/1224689911] exists,up 789a063f-f7d0-4ae0-9092-bf6e6760d4fd
osd.2 down in weight 1 up_from 1678 up_thru 1690 down_at 1692 last_clean_interval [15,1677) [v2:192.168.20.23:6800/1330624502,v1:192.168.20.23:6801/1330624502] [v2:192.168.21.3:6800/1406624502,v1:192.168.21.3:6801/1406624502] exists 2d6eab7a-ea58-4c59-a63b-a61094443803

Will need to check more on the networking.
We are trying to run commands like netstat, ping, and telnet to check which IPs/ports are available and whether we can communicate, but those commands are not available in the images, and in the toolbox we don't have permission to install them. Blaine will do some more testing and validation.
In the OSD logs, I am seeing that they are unable to get heartbeats from their peers. I suspected a network issue, and that is exactly what I have found using a separate test.

I created a simple nginx pod that listens on the pod network, the public multus net, and the cluster multus net. When curl-ed from OSD.0's pod on another host, nginx responds on the pod network, but it does not respond on either multus network. There is likely something blocking traffic to/from this address. When curl-ed from OSD.2's pod, on the SAME host, nginx responds on all networks.

Whomever has access to the dashboard for this cluster should first check if the virtual cluster has a security group (or whatever the IBM cloud equivalent of an AWS security group is) that would be disallowing that traffic. @ngowda

If that's not the issue, then we should inspect whether there are iptables rules set up on the host that would block the addresses. I don't suspect this of being an issue, but it's possible.

Failing that, I think we should seek help from someone with the environment that provides these virtual clusters. We may want to do that in parallel.

OSD.0
[root@rook-ceph-osd-0-5d877b47c5-tkrg9 ceph]# curl 10.131.1.73:8080
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
[root@rook-ceph-osd-0-5d877b47c5-tkrg9 ceph]# curl 192.168.20.27:8080
curl: (7) Failed to connect to 192.168.20.27 port 8080: No route to host
[root@rook-ceph-osd-0-5d877b47c5-tkrg9 ceph]# curl 192.168.21.4:8080
curl: (7) Failed to connect to 192.168.21.4 port 8080: No route to host

OSD.2
[root@rook-ceph-osd-2-5d65d455c4-mlk5x ceph]# curl 10.131.1.73:8080
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
[root@rook-ceph-osd-2-5d65d455c4-mlk5x ceph]# curl 192.168.20.27:8080
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
[root@rook-ceph-osd-2-5d65d455c4-mlk5x ceph]# curl 192.168.21.4:8080
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
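For anyone reproducing this check, a rough sketch of the kind of listener pod involved is below; the image, pod name, and NAD references are assumptions for illustration (the nginx-unprivileged image listens on 8080 by default), not the exact manifest used here:

```
cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: multus-net-test
  namespace: openshift-storage
  annotations:
    # attach the pod to both multus networks in addition to the default pod network
    k8s.v1.cni.cncf.io/networks: openshift-storage/ocs-public, openshift-storage/ocs-private
spec:
  containers:
  - name: web
    image: nginxinc/nginx-unprivileged:1.22
    ports:
    - containerPort: 8080
EOF

# print the pod-network and multus addresses to curl from OSD pods on other nodes
oc -n openshift-storage get pod multus-net-test \
  -o jsonpath='{.status.podIP}{"\n"}{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}{"\n"}'
```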
@ngowda what results have we been able to get from my suggestion yesterday (copied below)?

> Whomever has access to the dashboard for this cluster should first check if the virtual cluster has a security group (or whatever the IBM cloud equivalent of an AWS security group is) that would be disallowing that traffic.

I don't know how this environment is set up or how to access a console for it to look into it myself.
Below are the SG settings. I do not have permission to edit them.

Name             Description                                   Rules   Attached interfaces
allow_8443                                                     1       0
allow_all        Allow all ingress traffic.                    4       2
allow_http       Allow all ingress TCP traffic on port 80.     2       2
allow_https      Allow all ingress TCP traffic on port 443.    2       2
allow_outbound   Allow all egress traffic.                     2       2
allow_ssh        Allow all ingress TCP traffic on port 22.     2       2
allow_wsl                                                      2       0
ocp-cli                                                        3       0
ocp-install                                                    5       0
rpsene                                                         2       0

We also tried creating test pods on different nodes and pinging over the multus interface: the ping fails if the pods are in different subnets (192.168.20.1 and 192.168.21.2), but it works if the pods are on different nodes and in the same subnet (for example, 192.168.20.2 and 192.168.20.3).
We are still looking to find the RCA, but giving devel since this is planned to GA in 4.13.
We tried with the below NAD spec as well, specifying the gateway details, but it is not working.

cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: openshift-storage
spec:
  config: '{
    "cniVersion": "0.3.1",
    "type": "macvlan",
    "master": "env2",
    "mode": "bridge",
    "ipam": {
      "type": "host-local",
      "subnet": "10.1.1.0/24",
      "rangeStart": "10.1.1.100",
      "rangeEnd": "10.1.1.200",
      "routes": [
        { "dst": "0.0.0.0/0" }
      ],
      "gateway": "10.1.1.1"
    }
  }'
EOF
> below is the SG settings. I do not have permission to edit.
>
> Name Description Rules Attached interfaces
> allow_8443 1 0
> allow_all Allow all ingress traffic. 4 2
> allow_http Allow all ingress TCP traffic on port 80. 2 2
> allow_https Allow all ingress TCP traffic on port 443. 2 2
> allow_outbound Allow all egress traffic. 2 2
> allow_ssh Allow all ingress TCP traffic on port 22. 2 2
> allow_wsl 2 0
> ocp-cli 3 0
> ocp-install 5 0
> rpsene 2 0

In the debugging I did with Narayanaswamy, our view wasn't able to show which "Attached interfaces" were connected for the 'allow_all' rule. But given that there are only 2 connected interfaces, I find it unlikely that the security group settings are allowing traffic between nodes on the multus networks.

There may be other issues present, but this cloud/virt environment's admin will likely have to allow traffic between nodes on 192.168.20.0/24 and 192.168.21.0/24.

@clacroix can you help with this at all?
Update: The OpenShift team only tests Multus in bare metal environments. After speaking with Eran and Elad, we will likely have to limit our supported environments for Multus in ODF to bare metal only, since other environments are not tested by the OpenShift team. However, for QA testing, we are going to try to get a working environment in vSphere since we don't have immediate access to on-demand bare metal environments for our own testing.

For @ngowda this likely means that IBM cloud will not be a supported environment. I want to caveat this by clarifying that supported environment conversations are still ongoing, and this decision is not final.
Ok, thanks for the update. FYI, we also tried on PowerVM.
I think we can close this bz since the deployment has passed. @brgardne
Need @etamir (or @muagarwa ?) to specify whether ODF's Multus feature should support IBM virtual environments in 4.13. Otherwise, I assume it is a possibility for 4.14?
Re-opening, as we might decide to support Power eventually. We can target a future release in case this is not intended to be supported in 4.13.
Moving out of 4.13
This already got passed over for 4.14. Let's also make sure this gets noted as a feature request / enhancement request.
Created a Jira epic https://issues.redhat.com/browse/RHSTOR-4619 to track this feature.