Bug 2175618

Summary: [feature] support multus in IBM cloud environments
Product: [Red Hat Storage] Red Hat OpenShift Data Foundation
Reporter: narayanspg <ngowda>
Component: rook
Assignee: Blaine Gardner <brgardne>
Status: CLOSED DEFERRED
QA Contact: Coady LaCroix <clacroix>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.13
CC: assingh, brgardne, clacroix, ebenahar, etamir, idryomov, lmcfadde, muagarwa, ocs-bugs, odf-bz-bot, owasserm, sheggodu, srai, tnielsen
Target Milestone: ---
Keywords: FutureFeature, Reopened
Target Release: ---
Flags: ngowda: needinfo-
Hardware: ppc64le
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2023-06-06 16:26:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description narayanspg 2023-03-06 06:40:36 UTC
Description of problem (please be as detailed as possible and provide log
snippets):
With Multus enabled, the StorageCluster does not reach the Ready state, one StorageClass is not created, and the NooBaa pods do not come up.

Version of all relevant components (if applicable):
[root@nara1-cicd-odf-1c53-syd05-bastion-0 ~]# oc get clusterversion
NAME      VERSION                                      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.13.0-0.nightly-ppc64le-2023-02-17-084453   True        False         2d      Cluster version is 4.13.0-0.nightly-ppc64le-2023-02-17-084453


[root@nara1-cicd-odf-1c53-syd05-bastion-0 ~]# oc describe csv odf-operator.v4.13.0 -n openshift-storage | grep full
Labels:       full_version=4.13.0-92

Also tested on the ODF 4.13.0-95 build.

Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Is this issue reproducible?
Yes

Can this issue be reproduced from the UI?
Yes

If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1. Create a 4.13 OCP cluster
2. Create NADs
3. Deploy ODF with Multus
4. The StorageCluster is stuck in the Progressing state because a StorageClass is missing, as seen in its status


Actual results:
The StorageCluster is stuck in the Progressing state.

Expected results:
The StorageCluster should reach the Ready state.

Additional info:

Comment 2 narayanspg 2023-03-06 07:47:46 UTC
1. Created a NAD with the spec below.

cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: openshift-storage
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "env2",
      "mode": "bridge",
      "ipam": {
            "type": "whereabouts",
            "range": "192.168.0.2/24"
      }
  }'
EOF


2. During ODF deployment, the StorageSystem is created with the NAD spec above.

3. The StorageCluster is stuck in the Progressing state.

[root@nara1-cicd-odf-1c53-syd05-bastion-0 ~]# oc get storagecluster -o yaml
apiVersion: v1
items:
- apiVersion: ocs.openshift.io/v1
  kind: StorageCluster
  metadata:
    annotations:
      cluster.ocs.openshift.io/local-devices: "true"
      uninstall.ocs.openshift.io/cleanup-policy: delete
      uninstall.ocs.openshift.io/mode: graceful
    creationTimestamp: "2023-02-27T10:29:08Z"
    finalizers:
    - storagecluster.ocs.openshift.io
    generation: 2
    name: ocs-storagecluster
    namespace: openshift-storage
    ownerReferences:
    - apiVersion: odf.openshift.io/v1alpha1
      kind: StorageSystem
      name: ocs-storagecluster-storagesystem
      uid: 67dd65ba-680c-4cd0-99ca-ffc29923667a
    resourceVersion: "1118258"
    uid: 04969a31-1e0f-483a-8ebb-fe8bb7a4908c
  spec:
    arbiter: {}
    encryption:
      kms: {}
    externalStorage: {}
    flexibleScaling: true
    managedResources:
      cephBlockPools: {}
      cephCluster: {}
      cephConfig: {}
      cephDashboard: {}
      cephFilesystems: {}
      cephNonResilientPools: {}
      cephObjectStoreUsers: {}
      cephObjectStores: {}
      cephToolbox: {}
    mirroring: {}
    monDataDirHostPath: /var/lib/rook
    network:
      provider: multus
      selectors:
        cluster: default/iocs-public2
        public: default/iocs-public
    nodeTopologies: {}
    resources:
      mds:
        limits:
          cpu: "3"
          memory: 8Gi
        requests:
          cpu: "1"
          memory: 8Gi
      rgw:
        limits:
          cpu: "2"
          memory: 4Gi
        requests:
          cpu: "1"
          memory: 4Gi
    storageDeviceSets:
    - config: {}
      count: 3
      dataPVCTemplate:
        metadata: {}
        spec:
          accessModes:
          - ReadWriteOnce
          resources:
            requests:
              storage: "1"
          storageClassName: localblock
          volumeMode: Block
        status: {}
      name: ocs-deviceset-localblock
      placement: {}
      preparePlacement: {}
      replica: 1
      resources:
        limits:
          cpu: "2"
          memory: 5Gi
        requests:
          cpu: "1"
          memory: 5Gi
  status:
    conditions:
    - lastHeartbeatTime: "2023-02-28T05:12:59Z"
      lastTransitionTime: "2023-02-27T10:29:09Z"
      message: 'Error while reconciling: some StorageClasses were skipped while waiting
        for pre-requisites to be met: [ocs-storagecluster-ceph-rbd]'
      reason: ReconcileFailed
      status: "False"
      type: ReconcileComplete
    - lastHeartbeatTime: "2023-02-27T10:29:09Z"
      lastTransitionTime: "2023-02-27T10:29:09Z"
      message: Initializing StorageCluster
      reason: Init
      status: "False"
      type: Available
    - lastHeartbeatTime: "2023-02-27T10:29:09Z"
      lastTransitionTime: "2023-02-27T10:29:09Z"
      message: Initializing StorageCluster
      reason: Init
      status: "True"
      type: Progressing
    - lastHeartbeatTime: "2023-02-27T10:29:09Z"
      lastTransitionTime: "2023-02-27T10:29:09Z"
      message: Initializing StorageCluster
      reason: Init
      status: "False"
      type: Degraded
    - lastHeartbeatTime: "2023-02-27T10:29:09Z"
      lastTransitionTime: "2023-02-27T10:29:09Z"
      message: Initializing StorageCluster
      reason: Init
      status: Unknown
      type: Upgradeable
    externalStorage:
      grantedCapacity: "0"
    failureDomain: host
    failureDomainKey: kubernetes.io/hostname
    failureDomainValues:
    - syd05-worker-0.nara1-cicd-odf-1c53.redhat.com
    - syd05-worker-1.nara1-cicd-odf-1c53.redhat.com
    - syd05-worker-2.nara1-cicd-odf-1c53.redhat.com
    images:
      ceph:
        actualImage: quay.io/rhceph-dev/rhceph@sha256:c4cceafa24f984bfa8aaa8937df0c545c21f37c35cc4661db8ee4f010bddfb74
        desiredImage: quay.io/rhceph-dev/rhceph@sha256:c4cceafa24f984bfa8aaa8937df0c545c21f37c35cc4661db8ee4f010bddfb74
      noobaaCore:
        desiredImage: quay.io/rhceph-dev/odf4-mcg-core-rhel9@sha256:5dd993448516e250cf7af449e346710f2437bc11c5ccd835cbe52c1d5a175765
      noobaaDB:
        desiredImage: quay.io/rhceph-dev/rhel8-postgresql-12@sha256:9248c4eaa8aeedacc1c06d7e3141ca1457147eef59e329273eb78e32fcd27e79
    kmsServerConnection: {}
    nodeTopologies:
      labels:
        kubernetes.io/hostname:
        - syd05-worker-0.nara1-cicd-odf-1c53.redhat.com
        - syd05-worker-1.nara1-cicd-odf-1c53.redhat.com
        - syd05-worker-2.nara1-cicd-odf-1c53.redhat.com
    phase: Progressing
    relatedObjects:
    - apiVersion: ceph.rook.io/v1
      kind: CephCluster
      name: ocs-storagecluster-cephcluster
      namespace: openshift-storage
      resourceVersion: "1114281"
      uid: f04f39f2-4c44-4ca7-9900-3dd6f6ad87b1
    version: 4.13.0
kind: List
metadata:
  resourceVersion: ""
[root@nara1-cicd-odf-1c53-syd05-bastion-0 ~]# 

4. NooBaa pods are not coming up.
5. The RGW pod is not becoming fully ready:
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-69f6db5lr8xr   1/2     Running                1 (15s ago)   5m21s

I will add must-gather logs once uploaded.
If required, I can set up the environment.

Comment 3 Subham Rai 2023-03-06 08:51:45 UTC
I think I see the issue in the StorageCluster configuration:

```
network:
      provider: multus
      selectors:
        cluster: default/iocs-public2
        public: default/iocs-public
```
Instead of the above, it should be:

```
network:
      provider: multus
      selectors:
        cluster: openshift-storage/iocs-public2
        public: openshift-storage/iocs-public
```

Refer to this doc: https://rook.github.io/docs/rook/latest/CRDs/Cluster/ceph-cluster-crd/?h=mu#multus

```
The NetworkAttachmentDefinition should be referenced along with the namespace in which it is present like public: <namespace>/<name of NAD>. e.g., the network attachment definition are in default namespace:
public: default/rook-public-nw
cluster: default/rook-cluster-nw
```

I will be happy to look at the must-gather or the cluster.
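
For reference, a minimal sketch of correcting the selectors in place (assumes the StorageCluster is named ocs-storagecluster in the openshift-storage namespace, as shown above, and that the NADs are named ocs-public and ocs-private in openshift-storage; adjust names to match your environment):

```
oc -n openshift-storage patch storagecluster ocs-storagecluster --type merge -p \
  '{"spec":{"network":{"provider":"multus","selectors":{"public":"openshift-storage/ocs-public","cluster":"openshift-storage/ocs-private"}}}}'

# Confirm the change propagated to the CephCluster created by the operator.
oc -n openshift-storage get cephcluster ocs-storagecluster-cephcluster \
  -o jsonpath='{.spec.network.selectors}'
```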

Comment 4 narayanspg 2023-03-06 10:00:08 UTC
This was tried multiple times. I have also tried another environment with the openshift-storage namespace. I am setting up an environment to get you more information.

Comment 5 narayanspg 2023-03-06 14:39:04 UTC
Below are the network configuration details on the nodes, followed by the NAD spec.

[core@syd05-worker-0 ~]$ ifconfig
env2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 192.168.0.121  netmask 255.255.255.0  broadcast 192.168.0.255
        inet6 fe80::f3f4:6078:3e5c:e1bc  prefixlen 64  scopeid 0x20<link>
        ether fa:c2:a7:a8:9d:20  txqueuelen 1000  (Ethernet)
        RX packets 1552044  bytes 1724073436 (1.6 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 585362  bytes 308775824 (294.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 19

[core@syd05-master-1 ~]$ ifconfig
env2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 192.168.0.52  netmask 255.255.255.0  broadcast 192.168.0.255
        inet6 fe80::ee44:5233:fde8:f04b  prefixlen 64  scopeid 0x20<link>
        ether fa:4c:b6:c2:c3:20  txqueuelen 1000  (Ethernet)
        RX packets 26932419  bytes 22688481785 (21.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 16474461  bytes 6858627911 (6.3 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 19

[core@syd05-master-0 ~]$ ifconfig
env2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 192.168.0.103  netmask 255.255.255.0  broadcast 192.168.0.255
        inet6 fe80::f9f8:a3a9:e1eb:1443  prefixlen 64  scopeid 0x20<link>
        ether fa:42:05:74:4d:20  txqueuelen 1000  (Ethernet)
        RX packets 17916412  bytes 16094303599 (14.9 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 10915520  bytes 6440703513 (5.9 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 19

cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: openshift-storage
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "env2",
      "mode": "bridge",
      "ipam": {
            "type": "whereabouts",
            "range": "192.168.0.2/24"
      }
  }'
EOF

Comment 6 narayanspg 2023-03-06 15:08:28 UTC
A new cluster is set up and shows the same behavior.

[root@nara6-cicd-odf-31bf-syd05-bastion-0 ~]# oc get pods
NAME                                                              READY   STATUS      RESTARTS      AGE
csi-addons-controller-manager-cc4b499fd-8rmst                     2/2     Running     0             22m
csi-cephfsplugin-7bvfw                                            2/2     Running     0             8m1s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-8xnb5      1/1     Running     0             8m3s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-tbc72      1/1     Running     0             8m3s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-tzfp4      1/1     Running     0             8m3s
csi-cephfsplugin-provisioner-6cbfd7fd77-lr2br                     5/5     Running     0             9m6s
csi-cephfsplugin-provisioner-6cbfd7fd77-vg2dr                     5/5     Running     0             9m6s
csi-cephfsplugin-wjx9f                                            2/2     Running     0             7m51s
csi-cephfsplugin-z8g2l                                            2/2     Running     0             7m56s
csi-rbdplugin-8tnjv                                               3/3     Running     0             8m2s
csi-rbdplugin-hhc6k                                               3/3     Running     0             7m51s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-c597c         1/1     Running     0             8m4s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-r5c22         1/1     Running     0             8m4s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-vg9b4         1/1     Running     0             8m4s
csi-rbdplugin-provisioner-78bc9589dc-nsf9l                        6/6     Running     0             9m6s
csi-rbdplugin-provisioner-78bc9589dc-w9mgv                        6/6     Running     0             9m6s
csi-rbdplugin-rdvg8                                               3/3     Running     0             7m56s
noobaa-operator-6945c8f4f6-j8qh4                                  1/1     Running     0             22m
ocs-metrics-exporter-f478dfd65-xr799                              1/1     Running     0             22m
ocs-operator-67f55fd69b-mqxdz                                     1/1     Running     0             22m
odf-console-7b88df97b6-xql6b                                      1/1     Running     0             22m
odf-operator-controller-manager-67b6f7879d-6xq8n                  2/2     Running     0             22m
rook-ceph-crashcollector-1e13b6e9213b2457d4ee942ea6c49722-qncpl   1/1     Running     0             6m11s
rook-ceph-crashcollector-ae9b8dcbb391f72fe2d91aa2690725b8-bdkc6   1/1     Running     0             6m1s
rook-ceph-crashcollector-f10bcbd876937b397cce97feb647d3a1-znn46   1/1     Running     0             7m3s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-a-659f8657rss65   2/2     Running     0             6m11s
rook-ceph-mds-ocs-storagecluster-cephfilesystem-b-6ffbc49f72h7k   2/2     Running     0             6m1s
rook-ceph-mgr-a-7cb768f844-ts9bc                                  3/3     Running     0             7m9s
rook-ceph-mon-a-5c494dddd4-gkvds                                  2/2     Running     0             8m6s
rook-ceph-mon-b-b748db99d-x57xh                                   2/2     Running     0             7m39s
rook-ceph-mon-c-86b59b88b7-shsbq                                  2/2     Running     0             7m25s
rook-ceph-operator-5dd6df7795-jwx8f                               1/1     Running     0             9m15s
rook-ceph-osd-0-74c645bf45-rdtz8                                  2/2     Running     0             6m40s
rook-ceph-osd-1-7999d97d5b-fqkqz                                  2/2     Running     0             6m41s
rook-ceph-osd-2-5c9f768697-rft2l                                  2/2     Running     0             6m41s
rook-ceph-osd-prepare-1e0eff230cc0f76d25b67e2c29fb037c-sxq5c      0/1     Completed   0             6m58s
rook-ceph-osd-prepare-5061fe06e4a142a28ef23564c45ac3bc-h4qfl      0/1     Completed   0             6m58s
rook-ceph-osd-prepare-c6a3989e48d3a64164f109074528ec4e-8sgqz      0/1     Completed   0             6m58s
rook-ceph-rgw-ocs-storagecluster-cephobjectstore-a-679547498qwj   1/2     Running     1 (20s ago)   5m25s

[root@nara6-cicd-odf-31bf-syd05-bastion-0 ~]# oc get storagecluster
NAME                 AGE   PHASE         EXTERNAL   CREATED AT             VERSION
ocs-storagecluster   10m   Progressing              2023-03-06T14:57:10Z   4.13.0

[root@nara6-cicd-odf-31bf-syd05-bastion-0 ~]# oc describe storagecluster ocs-storagecluster
Name:         ocs-storagecluster
Namespace:    openshift-storage
Labels:       <none>
Annotations:  cluster.ocs.openshift.io/local-devices: true
              uninstall.ocs.openshift.io/cleanup-policy: delete
              uninstall.ocs.openshift.io/mode: graceful
API Version:  ocs.openshift.io/v1
Kind:         StorageCluster
Metadata:
  Creation Timestamp:  2023-03-06T14:57:10Z
  Finalizers:
    storagecluster.ocs.openshift.io
  Generation:  2
  Managed Fields:
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:cluster.ocs.openshift.io/local-devices:
      f:spec:
        .:
        f:arbiter:
        f:encryption:
          .:
          f:kms:
        f:flexibleScaling:
        f:monDataDirHostPath:
        f:network:
          .:
          f:connections:
            .:
            f:encryption:
          f:provider:
          f:selectors:
            .:
            f:cluster:
            f:public:
        f:nodeTopologies:
        f:resources:
          .:
          f:mds:
            .:
            f:limits:
              .:
              f:cpu:
              f:memory:
            f:requests:
              .:
              f:cpu:
              f:memory:
          f:rgw:
            .:
            f:limits:
              .:
              f:cpu:
              f:memory:
            f:requests:
              .:
              f:cpu:
              f:memory:
    Manager:      Mozilla
    Operation:    Update
    Time:         2023-03-06T14:57:10Z
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          f:uninstall.ocs.openshift.io/cleanup-policy:
          f:uninstall.ocs.openshift.io/mode:
        f:finalizers:
          .:
          v:"storagecluster.ocs.openshift.io":
      f:spec:
        f:externalStorage:
        f:managedResources:
          .:
          f:cephBlockPools:
          f:cephCluster:
          f:cephConfig:
          f:cephDashboard:
          f:cephFilesystems:
          f:cephNonResilientPools:
          f:cephObjectStoreUsers:
          f:cephObjectStores:
          f:cephToolbox:
        f:mirroring:
        f:storageDeviceSets:
    Manager:      ocs-operator
    Operation:    Update
    Time:         2023-03-06T14:57:10Z
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:ownerReferences:
          .:
          k:{"uid":"17d05cd0-47ca-4d7d-80db-2d393a04814d"}:
    Manager:      manager
    Operation:    Update
    Time:         2023-03-06T14:57:11Z
    API Version:  ocs.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:externalStorage:
          .:
          f:grantedCapacity:
        f:failureDomain:
        f:failureDomainKey:
        f:failureDomainValues:
        f:images:
          .:
          f:ceph:
            .:
            f:actualImage:
            f:desiredImage:
          f:noobaaCore:
            .:
            f:desiredImage:
          f:noobaaDB:
            .:
            f:desiredImage:
        f:kmsServerConnection:
        f:nodeTopologies:
          .:
          f:labels:
            .:
            f:kubernetes.io/hostname:
        f:phase:
        f:relatedObjects:
        f:version:
    Manager:      ocs-operator
    Operation:    Update
    Subresource:  status
    Time:         2023-03-06T15:07:03Z
  Owner References:
    API Version:     odf.openshift.io/v1alpha1
    Kind:            StorageSystem
    Name:            ocs-storagecluster-storagesystem
    UID:             17d05cd0-47ca-4d7d-80db-2d393a04814d
  Resource Version:  132426
  UID:               060a8a37-8c11-427f-9ae5-1f03c1d8af48
Spec:
  Arbiter:
  Encryption:
    Kms:
  External Storage:
  Flexible Scaling:  true
  Managed Resources:
    Ceph Block Pools:
    Ceph Cluster:
    Ceph Config:
    Ceph Dashboard:
    Ceph Filesystems:
    Ceph Non Resilient Pools:
    Ceph Object Store Users:
    Ceph Object Stores:
    Ceph Toolbox:
  Mirroring:
  Mon Data Dir Host Path:  /var/lib/rook
  Network:
    Connections:
      Encryption:
    Provider:  multus
    Selectors:
      Cluster:  openshift-storage/ocs-private
      Public:   openshift-storage/ocs-public
  Node Topologies:
  Resources:
    Mds:
      Limits:
        Cpu:     3
        Memory:  8Gi
      Requests:
        Cpu:     1
        Memory:  8Gi
    Rgw:
      Limits:
        Cpu:     2
        Memory:  4Gi
      Requests:
        Cpu:     1
        Memory:  4Gi
  Storage Device Sets:
    Config:
    Count:  3
    Data PVC Template:
      Metadata:
      Spec:
        Access Modes:
          ReadWriteOnce
        Resources:
          Requests:
            Storage:         1
        Storage Class Name:  localblock
        Volume Mode:         Block
      Status:
    Name:  ocs-deviceset-localblock
    Placement:
    Prepare Placement:
    Replica:  1
    Resources:
      Limits:
        Cpu:     2
        Memory:  5Gi
      Requests:
        Cpu:     1
        Memory:  5Gi
Status:
  Conditions:
    Last Heartbeat Time:   2023-03-06T14:57:12Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Version check successful
    Reason:                VersionMatched
    Status:                False
    Type:                  VersionMismatch
    Last Heartbeat Time:   2023-03-06T15:07:03Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Error while reconciling: some StorageClasses were skipped while waiting for pre-requisites to be met: [ocs-storagecluster-ceph-rbd]
    Reason:                ReconcileFailed
    Status:                False
    Type:                  ReconcileComplete
    Last Heartbeat Time:   2023-03-06T14:57:12Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Available
    Last Heartbeat Time:   2023-03-06T14:57:12Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                True
    Type:                  Progressing
    Last Heartbeat Time:   2023-03-06T14:57:12Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                False
    Type:                  Degraded
    Last Heartbeat Time:   2023-03-06T14:57:12Z
    Last Transition Time:  2023-03-06T14:57:12Z
    Message:               Initializing StorageCluster
    Reason:                Init
    Status:                Unknown
    Type:                  Upgradeable
  External Storage:
    Granted Capacity:  0
  Failure Domain:      host
  Failure Domain Key:  kubernetes.io/hostname
  Failure Domain Values:
    syd05-worker-0.nara6-cicd-odf-31bf.redhat.com
    syd05-worker-1.nara6-cicd-odf-31bf.redhat.com
    syd05-worker-2.nara6-cicd-odf-31bf.redhat.com
  Images:
    Ceph:
      Actual Image:   quay.io/rhceph-dev/rhceph@sha256:a9bffe4a4b9115cba8f1d4192245c81d797bf54bb3e5aaed8c4499fcf78b477c
      Desired Image:  quay.io/rhceph-dev/rhceph@sha256:a9bffe4a4b9115cba8f1d4192245c81d797bf54bb3e5aaed8c4499fcf78b477c
    Noobaa Core:
      Desired Image:  quay.io/rhceph-dev/odf4-mcg-core-rhel9@sha256:c1264177733a7219078bc4d1ca2acf615aafbd1648b42a9b8832210acdd16ba8
    Noobaa DB:
      Desired Image:  quay.io/rhceph-dev/rhel8-postgresql-12@sha256:9248c4eaa8aeedacc1c06d7e3141ca1457147eef59e329273eb78e32fcd27e79
  Kms Server Connection:
  Node Topologies:
    Labels:
      kubernetes.io/hostname:
        syd05-worker-0.nara6-cicd-odf-31bf.redhat.com
        syd05-worker-1.nara6-cicd-odf-31bf.redhat.com
        syd05-worker-2.nara6-cicd-odf-31bf.redhat.com
  Phase:  Progressing
  Related Objects:
    API Version:       ceph.rook.io/v1
    Kind:              CephCluster
    Name:              ocs-storagecluster-cephcluster
    Namespace:         openshift-storage
    Resource Version:  132422
    UID:               10844f71-4e11-4169-bc0b-03058457ba47
  Version:             4.13.0
Events:                <none>
[root@nara6-cicd-odf-31bf-syd05-bastion-0 ~]#

Comment 7 narayanspg 2023-03-06 15:10:08 UTC
@subham I have shared the cluster information with you over chat. Let me know if you need any other details.

Comment 8 Subham Rai 2023-03-07 06:42:21 UTC
I see that the rbd init commands hang:
```
2023-03-06 15:00:28.009617 E | ceph-block-pool-controller: failed to reconcile CephBlockPool "openshift-storage/ocs-storagecluster-cephblockpool". failed to create pool "ocs-storagecluster-cephblockpool".: failed to create pool "ocs-storagecluster-cephblockpool".: failed to initialize pool "ocs-storagecluster-cephblockpool" for RBD use. : command terminated with exit code 124
```
 
The pool is created:
```
ceph osd lspools --conf=/var/lib/rook/openshift-storage/openshift-storage.config
1 .mgr
2 ocs-storagecluster-cephblockpool
```

It seems like all rbd commands hang, but ceph commands can still be run.
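
For reference, a sketch of reproducing the hang from the toolbox (assumes the toolbox deployment is named rook-ceph-tools and that coreutils timeout is available in the image; exit code 124 above typically means a timeout wrapper expired):

```
# ceph commands respond normally:
oc -n openshift-storage rsh deploy/rook-ceph-tools ceph osd lspools

# rbd commands hang and only return when the timeout expires (exit code 124):
oc -n openshift-storage rsh deploy/rook-ceph-tools \
  timeout 60 rbd pool init ocs-storagecluster-cephblockpool
```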

Comment 9 Subham Rai 2023-03-07 06:44:47 UTC
On a different note, I see the RGW pod in CrashLoopBackOff due to ```Startup probe failed: RGW health check failed with error code: 7. the RGW likely cannot be reached by clients```

Comment 10 Subham Rai 2023-03-07 07:00:19 UTC
Also, I see that osd.0 is not in the Multus range:
```
osd.0 up   in  weight 1 up_from 856 up_thru 856 down_at 855 last_clean_interval [10,855) [v2:192.168.0.25:6800/2499787560,v1:192.168.0.25:6801/2499787560] [v2:192.168.0.25:6802/2545787560,v1:192.168.0.25:6803/2545787560] exists,up cf8a7623-42c4-418f-b30e-064016a0ed2f
osd.1 down out weight 0 up_from 839 up_thru 851 down_at 853 last_clean_interval [9,838) [v2:192.168.0.23:6800/3591904357,v1:192.168.0.23:6801/3591904357] [v2:192.168.0.23:6802/3635904357,v1:192.168.0.23:6803/3635904357] autoout,exists 16f42375-e158-4ec8-be55-b2ef98221494
osd.2 up   in  weight 1 up_from 856 up_thru 856 down_at 851 last_clean_interval [9,855) [v2:192.168.0.21:6800/828470086,v1:192.168.0.21:6801/828470086] [v2:192.168.0.21:6808/873470086,v1:192.168.0.21:6809/873470086] exists,up 369839bf-2768-4244-b986-42aaac4e83e0
```

```
 kc get network-attachment-definitions.k8s.cni.cncf.io -oyaml
apiVersion: v1
items:
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2023-03-06T14:54:33Z"
    generation: 1
    name: ocs-private
    namespace: openshift-storage
    resourceVersion: "122210"
    uid: 948f9d88-c9a1-4c8c-bb3c-7c2898ff5752
  spec:
    config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "env2", "mode":
      "bridge", "ipam": { "type": "whereabouts", "range": "192.168.0.2/24" } }'
- apiVersion: k8s.cni.cncf.io/v1
  kind: NetworkAttachmentDefinition
  metadata:
    creationTimestamp: "2023-03-06T14:54:18Z"
    generation: 1
    name: ocs-public
    namespace: openshift-storage
    resourceVersion: "122043"
    uid: 78769273-3255-4775-b90d-d8b68d539b00
  spec:
    config: '{ "cniVersion": "0.3.1", "type": "macvlan", "master": "env2", "mode":
      "bridge", "ipam": { "type": "whereabouts", "range": "192.168.0.2/24" } }'
kind: List
metadata:
  resourceVersion: ""

```

I will check more on this with @brgardne.

Comment 11 Blaine Gardner 2023-03-07 18:56:29 UTC
I think the issue is that the NAD is overlapping with the host's env2 network. Both are 192.168.0.X/24. Please choose a different CIDR for the NAD and see if it is fixed. Try something like 192.168.167.0/24 on the NAD.



FYI, CIDRs (pronounced like the drink cider) are a strange thing to learn to read. They're quite simple, but they do require an understanding of binary notation.

1.0.0.1/24 doesn't mean that the IPs go from 1.0.0.1 to 1.0.0.24.

The /24 defines the address "prefix" and corresponds to a bitmask, out of 32 bits, starting from the left. The prefix determines the number of '1' bits in the mask.

The most common bitmasks you'll see are: 
 - /24 (mask 11111111.11111111.11111111.00000000) (24 1's) - most home router networks and virtual machine networks use this range
 - /16 (mask 11111111.11111111.00000000.00000000) (16 1's) - minikube sets the cluster CIDR as 10.244.0.0/16 to allow for many pods
 - /12 (mask 11111111.11110000.00000000.00000000) (12 1's) - minikube sets the service cluster CIDR as 10.96.0.0/12 which allows more services than pods

This shows the bitmask for each prefix including the binary: https://datacadamia.com/network/mask

There are a few IPv4 network spaces that are reserved for private use, which is why you commonly see 10., 172., and 192. networks. Conceptually, these correspond approximately to big corporate class A, business class B, and personal class C, respectively.
https://www.arin.net/reference/research/statistics/address_filters/

Each 4 bits equates to a 0xF in hex, so /12 is 0xFF.0xF0.0x00.0x00 in hex (255.240.0.0 in decimal). It'll be easiest to work with CIDRs if you understand how hex compares to binary and make use of your calculator's "programmer" mode. You can also use a CIDR calculator tool like this: https://account.arin.net/public/cidrCalculator

Anything "masked" by a 1 bit is a fixed part of the address. Anything not masked (i.e., a "0") can be variable and handed out by a DHCP server to a client. This allows admins to set up different networks that don't reuse addresses.
For example, when I set up NADs for multus testing in minikube, I use 192.168.20.0/24 for one NAD and 192.168.21.0/24 for another NAD so that neither network will overlap with the other. 192.168.20.0/255.255.255.0 means the ".20" section will never change. Similarly for the ".21" section.

If I were to specify 192.168.20.0/24 for one NAD and 192.168.21.0/16 for the next, the ".21" section would be unmasked, and the second NAD's address range would overlap for 256 addresses in the range 192.168.20.0-.255.

Comment 12 narayanspg 2023-03-08 16:43:14 UTC
Hi,

Tried again with the CIDR changes as defined below; getting the same issue.

env2: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1450
        inet 192.168.0.229  netmask 255.255.255.0  broadcast 192.168.0.255
        inet6 fe80::45e4:3913:845d:49ee  prefixlen 64  scopeid 0x20<link>
        ether fa:87:54:94:2f:20  txqueuelen 1000  (Ethernet)
        RX packets 4548094  bytes 4730443879 (4.4 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4141830  bytes 2707254142 (2.5 GiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
        device interrupt 19


cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: openshift-storage
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "env2",
      "mode": "bridge",
      "ipam": {
            "type": "whereabouts",
            "range": "192.168.20.0/24"
      }
  }'
EOF

cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-private
  namespace: openshift-storage
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "env2",
      "mode": "bridge",
      "ipam": {
            "type": "whereabouts",
            "range": "192.168.21.0/24"
      }
  }'
EOF

Comment 13 narayanspg 2023-03-08 17:50:51 UTC
Also tried with the NAD type set to bridge, as suggested by Blaine; getting the same issue.

cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: openshift-storage
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "bridge",
      "isGateway": true,
      "vlan": 2,
      "ipam": {
            "type": "whereabouts",
            "range": "192.168.20.0/24"
      }
  }'
EOF

Comment 14 narayanspg 2023-03-09 07:19:24 UTC
Tried with ipvlan as suggested by Blaine. Most of the pods are not getting created due to a "failed to create pod sandbox" error.

[root@rdr-narayan13-lon06-bastion-0 ~]# oc get pods
NAME                                                           READY   STATUS              RESTARTS   AGE
csi-addons-controller-manager-6d894c9656-fv9lh                 2/2     Running             0          38m
csi-cephfsplugin-2pl77                                         2/2     Running             0          7m18s
csi-cephfsplugin-4rvdk                                         0/2     ContainerCreating   0          6m31s
csi-cephfsplugin-6wv7q                                         2/2     Running             0          7m18s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r   0/1     ContainerCreating   0          6m33s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-wqdz9   0/1     ContainerCreating   0          6m33s
csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-zktxn   0/1     ContainerCreating   0          6m33s
csi-cephfsplugin-provisioner-c678999f9-bmg7z                   0/5     ContainerCreating   0          7m18s
csi-cephfsplugin-provisioner-c678999f9-wrrtr                   0/5     ContainerCreating   0          7m18s
csi-rbdplugin-67qpr                                            3/3     Running             0          7m18s
csi-rbdplugin-cc5f2                                            0/3     ContainerCreating   0          6m31s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-kwpqj      0/1     ContainerCreating   0          6m34s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-p66m2      0/1     ContainerCreating   0          6m34s
csi-rbdplugin-holder-ocs-storagecluster-cephcluster-qvbbc      0/1     ContainerCreating   0          6m34s
csi-rbdplugin-hsljj                                            3/3     Running             0          7m18s
csi-rbdplugin-provisioner-798786c5fc-c49gk                     0/6     ContainerCreating   0          7m18s
csi-rbdplugin-provisioner-798786c5fc-pzgwn                     0/6     ContainerCreating   0          7m18s
noobaa-operator-595c599b64-2gnd5                               1/1     Running             0          39m
ocs-metrics-exporter-564f46b697-nl45w                          1/1     Running             0          39m
ocs-operator-76746fb89c-bwcgt                                  1/1     Running             0          39m
odf-console-6c8d464746-5t47b                                   1/1     Running             0          39m
odf-operator-controller-manager-67df444888-tjsxw               2/2     Running             0          39m
rook-ceph-mon-a-69f8598db-dm9db                                0/2     Init:0/2            0          6m29s
rook-ceph-operator-54d4c47787-qpdw2                            1/1     Running             0          7m25s


  Normal   AddedInterface          20s   multus             Add eth0 [10.129.2.101/23] from ovn-kubernetes
  Warning  FailedCreatePodSandBox  18s   kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r_openshift-storage_c1efa8b9-4f25-4678-a035-052b83a06b85_0(58df047a98390cd5b2edf706348fcc86d9bb990848c44e821b3b2320a6c2d1ec): error adding pod openshift-storage_csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-storage/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r/c1efa8b9-4f25-4678-a035-052b83a06b85:ocs-public]: error adding container to network "ocs-public": failed to create ipvlan: device or resource busy
  Normal   AddedInterface          7s    multus             Add eth0 [10.129.2.101/23] from ovn-kubernetes
  Warning  FailedCreatePodSandBox  1s    kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r_openshift-storage_c1efa8b9-4f25-4678-a035-052b83a06b85_0(0c4d5cbcb97df3cead6a5438cbceb5c80ca5c0e20d9c85a430c636562f258d1b): error adding pod openshift-storage_csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r to CNI network "multus-cni-network": plugin type="multus" name="multus-cni-network" failed (add): [openshift-storage/csi-cephfsplugin-holder-ocs-storagecluster-cephcluster-gpx4r/c1efa8b9-4f25-4678-a035-052b83a06b85:ocs-public]: error adding container to network "ocs-public": failed to create ipvlan: device or resource busy

Comment 16 narayanspg 2023-03-15 09:27:11 UTC
Hi Subham, I created an OCP cluster and shared the environment over chat.

Comment 17 Subham Rai 2023-03-15 11:36:07 UTC
I validated the Multus configuration using a script that @brgardne created, and it seems like the configuration is right.

```
sh test-multus.sh 
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-1 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-2 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-3 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-4 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-5 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-6 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-7 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-8 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-9 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-10 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-11 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-12 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-13 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-14 created
Warning: would violate PodSecurity "restricted:latest": seccompProfile (pod or container "multus-validator" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
daemonset.apps/multus-validator-15 created
Waiting for 15 daemonsets to have pods scheduled
 └─ waiting for daemonset 'multus-validator-1' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-2' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-3' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-4' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-5' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-6' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-7' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-8' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-9' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-10' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-11' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-12' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-13' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-14' to have pods scheduled
 └─ waiting for daemonset 'multus-validator-15' to have pods scheduled
Waiting for 45 pods from 15 daemonsets
 └─ 45 of 45 pods have multus networks
Test successful!
Cleaning up daemonsets
daemonset.apps "multus-validator-2" deleted
daemonset.apps "multus-validator-3" deleted
daemonset.apps "multus-validator-7" deleted
daemonset.apps "multus-validator-5" deleted
daemonset.apps "multus-validator-14" deleted
daemonset.apps "multus-validator-15" deleted
daemonset.apps "multus-validator-8" deleted
daemonset.apps "multus-validator-12" deleted
daemonset.apps "multus-validator-4" deleted
daemonset.apps "multus-validator-1" deleted
daemonset.apps "multus-validator-13" deleted
daemonset.apps "multus-validator-11" deleted
daemonset.apps "multus-validator-6" deleted
daemonset.apps "multus-validator-10" deleted
daemonset.apps "multus-validator-9" deleted
............No resources found in openshift-storage namespace.
```

So, now we'll proceed with finding the actual bug.

Comment 18 Subham Rai 2023-03-16 11:41:16 UTC
I'm taking the RBD team's help to understand why the rbd commands hang. I will update as soon as I have an update.

Comment 21 Subham Rai 2023-03-16 13:46:03 UTC
Thanks @idryomov for taking a look.

The Multus IP ranges are `192.168.20.0/24` and `192.168.21.0/24` for the public and cluster networks, and the OSDs are on those same ranges (see the v1 and v2 addresses):

osd.0 up   in  weight 1 up_from 1698 up_thru 1698 down_at 1697 last_clean_interval [9,1697) [v2:192.168.20.21:6800/2815335877,v1:192.168.20.21:6801/2815335877] [v2:192.168.21.1:6804/2892335877,v1:192.168.21.1:6805/2892335877] exists,up 20a73be9-2add-4cf8-9ca9-3540007a7d8c
osd.1 up   in  weight 1 up_from 1698 up_thru 1698 down_at 1690 last_clean_interval [14,1697) [v2:192.168.20.22:6800/1147689911,v1:192.168.20.22:6801/1147689911] [v2:192.168.21.2:6804/1224689911,v1:192.168.21.2:6805/1224689911] exists,up 789a063f-f7d0-4ae0-9092-bf6e6760d4fd
osd.2 down in  weight 1 up_from 1678 up_thru 1690 down_at 1692 last_clean_interval [15,1677) [v2:192.168.20.23:6800/1330624502,v1:192.168.20.23:6801/1330624502] [v2:192.168.21.3:6800/1406624502,v1:192.168.21.3:6801/1406624502] exists 2d6eab7a-ea58-4c59-a63b-a61094443803

We will need to check more on the networking.

Comment 22 Subham Rai 2023-03-16 15:55:28 UTC
We are trying to run commands like netstat, ping, and telnet to check whether the IPs/ports are reachable, but those commands are not available in the images, and in the toolbox we don't have permission to install them.

Blaine will do some more testing and validation.
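
For what it's worth, a minimal sketch of checking TCP reachability without netstat/telnet, using only bash built-ins inside a pod (assumes bash is present; the address and port are taken from the osd dump above):

```
# /dev/tcp/<host>/<port> is a bash redirection feature, so no extra packages are needed.
timeout 5 bash -c 'exec 3<>/dev/tcp/192.168.20.21/6800' \
  && echo "192.168.20.21:6800 reachable" \
  || echo "192.168.20.21:6800 NOT reachable"
```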

Comment 23 Blaine Gardner 2023-03-16 22:46:27 UTC
In the OSD logs, I am seeing that they are unable to get heartbeats from their peers. I suspected a network issue, and that is exactly what I have found using a separate test.

I created a simple nginx pod that listens on the pod network, public multus net, and cluster multus net. When curl-ed from OSD.0's pod on another host, nginx responds on the pod network, but it does not respond on either multus network. There is likely something blocking traffic to/from this address. When curl-ed from OSD.2's pod, on the SAME host, nginx responds on all networks. 

Whoever has access to the dashboard for this cluster should first check if the virtual cluster has a security group (or whatever the IBM cloud equivalent of an AWS security group is) that would be disallowing that traffic. @ngowda

If that's not the issue, then we should inspect whether there are iptables rules set up on the host that would block the addresses. I don't suspect this of being an issue, but it's possible.
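
A sketch of one way to inspect the host rules (assumes an oc debug pod can run on the node; the node name is from this cluster and only illustrative):

```
oc debug node/syd05-worker-0.nara6-cicd-odf-31bf.redhat.com -- chroot /host \
  sh -c 'iptables -S | grep -E "192\.168\.(20|21)\." || echo "no explicit rules for the Multus subnets"'
```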

Failing that, I think we should seek help from someone with the environment that provides these virtual clusters. We may want to do that in parallel.



OSD.0

[root@rook-ceph-osd-0-5d877b47c5-tkrg9 ceph]# curl 10.131.1.73:8080
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
[root@rook-ceph-osd-0-5d877b47c5-tkrg9 ceph]# curl 192.168.20.27:8080
curl: (7) Failed to connect to 192.168.20.27 port 8080: No route to host
[root@rook-ceph-osd-0-5d877b47c5-tkrg9 ceph]# curl 192.168.21.4:8080
curl: (7) Failed to connect to 192.168.21.4 port 8080: No route to host


OSD.2

[root@rook-ceph-osd-2-5d65d455c4-mlk5x ceph]# curl 10.131.1.73:8080
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
[root@rook-ceph-osd-2-5d65d455c4-mlk5x ceph]# curl 192.168.20.27:8080
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
[root@rook-ceph-osd-2-5d65d455c4-mlk5x ceph]# curl 192.168.21.4:8080
<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
<hr><center>nginx/1.22.1</center>
</body>
</html>
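
For reference, a minimal sketch of this kind of multi-homed test pod (the pod name and image are illustrative; assumes the ocs-public and ocs-private NADs exist in openshift-storage):

```
cat <<EOF | oc create -f -
apiVersion: v1
kind: Pod
metadata:
  name: multus-nginx-test
  namespace: openshift-storage
  annotations:
    # Attach the pod to both Multus networks in addition to the default pod network.
    k8s.v1.cni.cncf.io/networks: openshift-storage/ocs-public,openshift-storage/ocs-private
spec:
  containers:
  - name: nginx
    image: nginxinc/nginx-unprivileged:1.22   # listens on 8080
    ports:
    - containerPort: 8080
EOF

# The addresses assigned on each network show up in the pod's network-status annotation;
# curl each of them from OSD pods on the same host and on a different host.
oc -n openshift-storage get pod multus-nginx-test -o yaml | grep -A 15 network-status
```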

Comment 24 Blaine Gardner 2023-03-17 15:53:10 UTC
@ngowda what results have we been able to get from my suggestion yesterday (copied below)?

> Whomever has access to the dashboard for this cluster should first check if the virtual cluster has a security group (or whatever the IBM cloud equivalent of an AWS security group is) that would be disallowing that traffic. 

I don't know how this environment is set up or how to access a console for it to look into it myself.

Comment 25 narayanspg 2023-03-17 16:40:32 UTC
below is the SG settings. I do not have permission to edit. 
  
Name           Description                                    Rules     Attached interfaces
allow_8443		                                       1	    0
allow_all	Allow all ingress traffic.	               4	    2
allow_http	Allow all ingress TCP traffic on port 80.	2	2
allow_https	Allow all ingress TCP traffic on port 443.	2	2
allow_outbound	Allow all egress traffic.	               2	                2
allow_ssh	Allow all ingress TCP traffic on port 22.	2	2
allow_wsl		                                       2	0
ocp-cli		                                               3	0
ocp-install		                                       5	0
rpsene		                                               2	0


We also tried creating test pods on different nodes and pinging over the Multus interface, as sketched below. The ping fails if the pods are in different subnets (192.168.20.1 and 192.168.21.2), but it works if the pods are on different nodes and in the same subnet (for example, 192.168.20.2 and 192.168.20.3).
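
A sketch of that check (assumes test pods attached to the NADs whose images include ping; net1 is the name Multus usually gives the first secondary interface, and the pod names and addresses are illustrative):

```
# From a test pod on one worker, ping a test pod on another worker over the Multus interface.
oc -n openshift-storage rsh multus-test-a ping -c 3 -I net1 192.168.20.3   # same subnet across nodes: works
oc -n openshift-storage rsh multus-test-a ping -c 3 -I net1 192.168.21.2   # different subnet: fails
```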

Comment 27 Subham Rai 2023-03-20 05:18:06 UTC
We are still looking to find the RCA, but giving devel ack since this is planned to GA in 4.13.

Comment 28 narayanspg 2023-03-20 09:58:52 UTC
We also tried with the NAD spec below, specifying the gateway details, but it is not working.

cat <<EOF | oc create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: ocs-public
  namespace: openshift-storage
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "env2",
      "mode": "bridge",
      "ipam": {
            "type": "host-local",
            "subnet": "10.1.1.0/24", "rangeStart": "10.1.1.100", "rangeEnd": "10.1.1.200", "routes": [ { "dst": "0.0.0.0/0" } ], "gateway": "10.1.1.1" 
      }
  }'
EOF

Comment 29 Blaine Gardner 2023-03-20 17:17:25 UTC
> below is the SG settings. I do not have permission to edit. 
>   
> Name           Description                                    Rules     Attached interfaces
> allow_8443		                                       1	    0
> allow_all	Allow all ingress traffic.	               4	    2
> allow_http	Allow all ingress TCP traffic on port 80.	2	2
> allow_https	Allow all ingress TCP traffic on port 443.	2	2
> allow_outbound	Allow all egress traffic.	               2	                2
> allow_ssh	Allow all ingress TCP traffic on port 22.	2	2
> allow_wsl		                                       2	0
> ocp-cli		                                               3	0
> ocp-install		                                       5	0
> rpsene		                                               2	0

In the debugging I did with Narayanaswamy, our view wasn't able to show which "Attached interfaces" were connected for the 'allow_all' rule. But given that there are only 2 connected interfaces, I find it unlikely that the security group settings are allowing traffic between nodes on the multus networks. There may be other issues present, but this cloud/virt environment's admin will likely have to allow traffic between nodes on 192.168.20.0/24 and 192.168.21.0/24.

@clacroix can you help with this at all?

Comment 32 Blaine Gardner 2023-03-23 17:21:36 UTC
Update:

The OpenShift team only tests Multus in bare metal environments. After speaking with Eran and Elad, we will likely have to limit our supported environments for Multus in ODF to bare metal only, since other environments are not tested by the OpenShift team. However, for QA testing, we are going to try to get a working environment in vSphere, since we don't have immediate access to on-demand bare metal environments for our own testing.

For @ngowda this likely means that IBM cloud will not be a supported environment.

I want to caveat this by clarifying that supported environment conversations are still ongoing, and this decision is not final.

Comment 33 narayanspg 2023-03-24 06:36:40 UTC
OK, thanks for the update. FYI, we also tried on PowerVM.

Comment 39 Subham Rai 2023-04-10 18:21:25 UTC
I think we can close this bz since the deployment has passed. @brgardne

Comment 41 Blaine Gardner 2023-04-12 18:13:37 UTC
I need @etamir (or @muagarwa?) to specify whether ODF's Multus feature should support IBM virtual environments in 4.13. Otherwise, I assume it is a possibility for 4.14?

Comment 46 Elad 2023-05-09 12:54:44 UTC
Re-opening, as we might decide to support Power eventually.
We can target a future release in case it is not intended to be supported in 4.13.

Comment 47 Travis Nielsen 2023-05-09 18:45:31 UTC
Moving out of 4.13

Comment 48 Blaine Gardner 2023-06-06 15:17:42 UTC
This already got passed over for 4.14.

Let's also make sure this gets noted as a feature request / enhancement request.

Comment 51 Mudit Agarwal 2023-06-06 16:26:49 UTC
Created a Jira epic https://issues.redhat.com/browse/RHSTOR-4619 to track this feature.