Bug 1887488

Summary: OCP 4.6: Topology Manager OpenShift E2E test fails: gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface
Product: OpenShift Container Platform
Reporter: Walid A. <wabouham>
Component: Node
Assignee: Francesco Romani <fromani>
Node sub component: Topology manager
QA Contact: Walid A. <wabouham>
Status: CLOSED ERRATA
Docs Contact:
Severity: high
Priority: high
CC: aos-bugs, ddharwar, fromani, jokerman, mifiedle, rphillips, tsweeney
Version: 4.6
Target Milestone: ---
Target Release: 4.7.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-02-24 15:25:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1903467

Description Walid A. 2020-10-12 15:35:59 UTC
Description of problem:

This is on an OCP 4.6 baremetal cluster with 3 master and 2 worker nodes, installed with the IPI Ansible method (see steps below), using the OVNKubernetes network plugin and with the openshift-network-sriov-operator successfully deployed.


The Topology Manager OpenShift E2E test "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]" fails with the following error on this cluster:


Oct 10 20:20:49.195: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-topology-manager-r86df" for this suite.
Oct 10 20:20:49.208: INFO: Running AfterSuite actions on all nodes
Oct 10 20:20:49.208: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/topology_manager/resourcealign.go:183]: Unexpected error:
    <*util.ExitError | 0xc002e80060>: {
        Cmd: "oc --namespace=e2e-test-topology-manager-r86df --kubeconfig=/root/.kube/config rsh -n default -c test-0 test-pqdtp ping -c 3 10.56.217.178",
        StdErr: "ping: permission denied (are you root?)\nPING 10.56.217.178 (10.56.217.178): 56 data bytes\ncommand terminated with exit code 1",
        ExitError: {
            ProcessState: {
                pid: 14145,
                status: 256,
                rusage: {
                    Utime: {Sec: 0, Usec: 160857},
                    Stime: {Sec: 0, Usec: 37275},
                    Maxrss: 154324,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 2414,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 1061,
                    Nivcsw: 0,
                },
            },
            Stderr: nil,
        },
    }
    exit status 1
occurred

Oct 10 20:19:18.786 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Pulled image/registry.redhat.io/redhat/redhat-marketplace-index:v4.6
Oct 10 20:19:18.982 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Created
Oct 10 20:19:19.008 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Started
Oct 10 20:19:30.528 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Ready
Oct 10 20:19:30.532 W ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 reason/GracefulDelete in 0s
Oct 10 20:19:30.552 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Killing
Oct 10 20:19:30.552 W ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 reason/Deleted
Oct 10 20:19:30.558 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Killing
Oct 10 20:19:33.696 I ns/default pod/test-vcts8 node/ reason/Created
Oct 10 20:19:33.696 I ns/default pod/test-pqdtp node/ reason/Created
Oct 10 20:19:33.703 I ns/default pod/test-vcts8 node/worker000 reason/Scheduled
Oct 10 20:19:33.703 I ns/default pod/test-pqdtp node/worker000 reason/Scheduled
Oct 10 20:19:35.796 I ns/default pod/test-pqdtp reason/AddedInterface Add eth0 [10.128.2.121/23]
Oct 10 20:19:35.853 I ns/default pod/test-pqdtp reason/AddedInterface Add sriov1 [10.56.217.178/24] from default/sriov-intel
Oct 10 20:19:35.862 I ns/default pod/test-vcts8 reason/AddedInterface Add eth0 [10.128.2.120/23]
Oct 10 20:19:35.935 I ns/default pod/test-vcts8 reason/AddedInterface Add sriov1 [10.56.217.179/24] from default/sriov-intel
Oct 10 20:19:36.194 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Pulling image/busybox
Oct 10 20:19:36.214 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Pulling image/busybox
Oct 10 20:19:37.564 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Pulled image/busybox
Oct 10 20:19:37.566 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Pulled image/busybox
Oct 10 20:19:37.728 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Created
Oct 10 20:19:37.734 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Created
Oct 10 20:19:37.749 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Started
Oct 10 20:19:37.754 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Started
Oct 10 20:19:38.704 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Ready
Oct 10 20:19:38.709 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Ready
Oct 10 20:19:41.158 W ns/default pod/test-pqdtp node/worker000 reason/GracefulDelete in 30s
Oct 10 20:19:41.166 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Killing
Oct 10 20:20:11.745 E ns/default pod/test-pqdtp node/worker000 container/test-0 container exited with code 137 (Error): 
Oct 10 20:20:16.763 W ns/default pod/test-pqdtp node/worker000 reason/Deleted
Oct 10 20:20:17.167 W ns/default pod/test-vcts8 node/worker000 reason/GracefulDelete in 30s
Oct 10 20:20:17.171 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Killing
Oct 10 20:20:47.780 E ns/default pod/test-vcts8 node/worker000 container/test-0 container exited with code 137 (Error): 
Oct 10 20:20:48.784 W ns/default pod/test-vcts8 node/worker000 reason/Deleted

failed: (1m32s) 2020-10-10T20:20:49 "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"


Version-Release number of selected component (if applicable):

Client Version: 4.6.0-0.nightly-2020-10-03-051134
Server Version: 4.6.0-0.nightly-2020-10-03-051134
Kubernetes Version: v1.19.0+db1fc96

How reproducible:
Every time

Steps to Reproduce:
1. Install an OCP 4.6.0-0.nightly-2020-10-03-051134 cluster (3 master and 2 worker nodes) on baremetal with the JetSki IPI Ansible automation (see steps in https://github.com/redhat-performance/JetSki)

2. Configure CPU manager with the static policy and enable Topology manager (see docs):
https://docs.openshift.com/container-platform/4.5/scalability_and_performance/using-cpu-manager.html
https://docs.openshift.com/container-platform/4.5/scalability_and_performance/using-topology-manager.html

3. Deploy the SR-IOV network operator:
git clone https://github.com/openshift/sriov-network-operator.git
cd sriov-network-operator
make deploy-setup

4. Label the worker that has Mellanox SR-IOV cards and virtualization enabled in the BIOS:
oc label node worker000 feature.node.kubernetes.io/sriov-capable=true

5. oc create -f sriov-policy.yaml:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
 name: policy-sriov
 namespace: openshift-sriov-network-operator
spec:
 resourceName: intelsriov
 nodeSelector:
   feature.node.kubernetes.io/sriov-capable: 'true'
 priority: 51
 mtu: 1500
 numVfs: 8
 nicSelector:
   vendor: '8086'
   rootDevices:
     - '0000:86:00.1'
   pfNames:
     - ens7f1
 deviceType: netdevice

6. oc create -f sriov-ipam.yaml:
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
 name: sriov-intel
 namespace: openshift-sriov-network-operator
spec:
 ipam: |
   {
     "type": "host-local",
     "subnet": "10.56.217.0/24",
     "rangeStart": "10.56.217.171",
     "rangeEnd": "10.56.217.181",
     "routes": [{
       "dst": "0.0.0.0/0"
     }],
     "gateway": "10.56.217.1"
   }
 vlan: 0
 resourceName: intelsriov
 networkNamespace: default

7. Clone openshift origin into $GOPATH/src/github.com/openshift/ and build openshift-tests:
  cd $GOPATH/src/github.com/openshift/
  git clone https://github.com/openshift/origin.git
  cd origin
  make WHAT=cmd/openshift-tests

8. Export env vars:
  export KUBECONFIG=/root/.kube/config
  export SRIOV_NETWORK_NAMESPACE=default
  export SRIOV_NETWORK=sriov-intel
  export RESOURCE_NAME=openshift.io/intelsriov
  OPENSHIFT_TESTS=/root/go/src/github.com/openshift/origin/openshift-tests

9. Run topology manager E2E tests
  $OPENSHIFT_TESTS run openshift/conformance --dry-run | grep -E "TopologyManager" | $OPENSHIFT_TESTS run -f -
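
Steps 8 and 9 depend on several variables being set in the same shell. A minimal sketch of a pre-flight check follows; the helper name missing_vars is hypothetical and is not part of openshift-tests. It only reports which of the named variables are unset or empty and does not validate their values.

```shell
#!/bin/sh
# Hypothetical helper (not part of openshift-tests): print the name of every
# listed variable that is unset or empty; return non-zero if any is missing.
missing_vars() {
    status=0
    for name in "$@"; do
        # Indirect lookup of the variable called $name, POSIX-compatible.
        eval "value=\${$name:-}"
        if [ -z "$value" ]; then
            echo "$name"
            status=1
        fi
    done
    return "$status"
}

# Example call, using the variable names from step 8:
#   missing_vars KUBECONFIG SRIOV_NETWORK_NAMESPACE SRIOV_NETWORK \
#       RESOURCE_NAME OPENSHIFT_TESTS || echo "set the variables above first" >&2
```

The lookup uses eval rather than printenv so that shell variables that were assigned but not exported, such as OPENSHIFT_TESTS in step 8, are also seen.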

Actual results:
Test fails with a permission error when pinging between pods:
failed: (1m32s) 2020-10-10T20:20:49 "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

Expected results:
Test passes.

Additional info:
A link to more logs, oc commands, and cluster setup info will be in the next private comment.

Comment 2 Francesco Romani 2020-10-12 16:15:06 UTC
I looked at Walid's setup and the most likely culprits are:
1. an OCP change went unnoticed because the tests don't run often enough (that's a sore point and a long-overdue fix)
2. the tests definitely fail to pin the images they need, so maybe an image got updated and an unwanted change sneaked in

Item 2 needs to be fixed anyway, so I will fix it.

Comment 3 Francesco Romani 2020-10-14 16:05:26 UTC
I'm confident the issue is in the tests, which, because of an oversight, do not pin the image they need to a specific tag. A simple change in the test code should fix it.
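
To illustrate what "pinned" means here, the following is a rough sketch of a check on an image reference string. The function name is_pinned_image and the tag 1.32.0 in the usage note are hypothetical; this is not the change that went into the origin tests. It treats a reference as pinned when it carries a digest or an explicit tag other than "latest".

```shell
#!/bin/sh
# Hypothetical helper: succeed when the image reference carries a digest or
# an explicit tag other than "latest"; fail otherwise.
is_pinned_image() {
    ref=$1
    case "$ref" in
        *@sha256:*) return 0 ;;   # digest references are always pinned
    esac
    # Look only at the last path component, so that a registry port such as
    # localhost:5000 is not mistaken for a tag.
    last=${ref##*/}
    case "$last" in
        *:latest) return 1 ;;     # explicit but floating tag
        *:?*)     return 0 ;;     # explicit tag
        *)        return 1 ;;     # no tag, which means the implicit "latest"
    esac
}
```

With this sketch, a bare busybox reference (as pulled in the event log of the description) is reported as not pinned, while a reference such as busybox:1.32.0 would be.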

Comment 7 Francesco Romani 2020-11-16 08:45:10 UTC
This is actually a test issue, but "topology-manager" seems the closest match in the current sub-component list. Reassigned.

Comment 9 Walid A. 2020-12-04 18:17:26 UTC
Tested on the same OCP 4.6 baremetal cluster with the latest master branch of origin:

Tests are being skipped because I have only one of the two worker nodes enabled for topology manager; that was not a requirement before, AFAIK:

started: (0/5/11) "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

skip [github.com/openshift/origin/test/extended/topology_manager/utils.go:69]: topology manager not configured on all nodes

skipped: (16.2s) 2020-12-04T14:17:19 "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

-----

And also skipped because the NETWORK_CHECK_IMAGE env var is not defined. According to the origin docs:

The testsuite runs a basic connectivity test to ensure the NUMA-aligned devices are functional. Use this variable to set the image URL to use to check the network is working between pods which requested, and got, aligned resources. If this value is not set (default), the connectivity test will skip.

What is a good example to use?

started: (0/10/11) "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

skip [github.com/openshift/origin/test/extended/topology_manager/resourcealign.go:140]: no network check image provided (use NETWORK_CHECK_IMAGE)

skipped: (16.4s) 2020-12-04T16:42:30 "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

Comment 10 Francesco Romani 2020-12-04 18:30:44 UTC
(In reply to Walid A. from comment #9)

For the time being: quay.io/openshift-kni/cnf-tests:$OCP_VERSION (e.g. "4.7" to test OCP 4.7.z, and so forth).
This probably needs better docs somewhere.
In the coming weeks we are looking to have a suitable image available in OCP and a better default for this option.
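
As a sketch of the naming convention above, the snippet below reduces a full version string (such as the Server Version reported in the description) to its major.minor stream and builds the image reference from it. The function names ocp_stream and network_check_image are hypothetical, and the sketch assumes a tag exists for the resulting stream.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the quay.io/openshift-kni/cnf-tests:$OCP_VERSION
# convention; they do not check that the tag actually exists in the registry.

# Reduce a full version string to major.minor, e.g. 4.6.0-0.nightly-... to 4.6.
# Prints nothing if the string does not start with <digits>.<digits>.
ocp_stream() {
    printf '%s\n' "$1" | sed -n 's/^\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Print the image reference for a given version string; fail on bad input.
network_check_image() {
    stream=$(ocp_stream "$1")
    [ -n "$stream" ] || return 1
    echo "quay.io/openshift-kni/cnf-tests:$stream"
}

# Example:
#   NETWORK_CHECK_IMAGE=$(network_check_image 4.6.0-0.nightly-2020-10-03-051134)
#   export NETWORK_CHECK_IMAGE
```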

The image could be as simple as:
$ cat Dockerfile
FROM registry.access.redhat.com/ubi8/ubi-minimal:latest
RUN microdnf install -y iputils && microdnf clean all
ENTRYPOINT [ "/bin/ping" ]

but we need to have it built (and maintained) automatically, and this will take a little while.

Comment 11 Walid A. 2020-12-04 21:28:00 UTC
Verified on the same OCP 4.6 baremetal cluster with the latest master branch of origin.
Set env var:
export NETWORK_CHECK_IMAGE=quay.io/openshift-kni/cnf-tests:4.6
.
.
.

started: (0/10/11) "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

passed: (1m51s) 2020-12-04T18:39:53 "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"
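
Because a skipped test is easy to mistake for a passing one when scanning the output, a small tally of the result lines can help. The sketch below assumes the "passed:", "failed:" and "skipped:" line prefixes shown in this report; the function name summarize_results is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: read openshift-tests output on stdin and count the
# lines that start with "passed: ", "failed: " or "skipped: ".
summarize_results() {
    awk '/^(passed|failed|skipped): / { sub(":", "", $1); count[$1]++ }
         END { printf "passed=%d failed=%d skipped=%d\n",
                      count["passed"], count["failed"], count["skipped"] }'
}

# Example (log file name is illustrative):
#   summarize_results < e2e-run.log
```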

Comment 14 errata-xmlrpc 2021-02-24 15:25:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633