Description of problem:

This is on an OCP 4.6 baremetal cluster of 3 master and 2 worker nodes, installed with the IPI Ansible method (see steps below), with the OVNKubernetes network plugin and openshift-network-sriov-operator successfully deployed.

The Topology Manager OpenShift E2E test

"[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

fails with the following error:

Oct 10 20:20:49.195: INFO: Waiting up to 7m0s for all (but 100) nodes to be ready
STEP: Destroying namespace "e2e-test-topology-manager-r86df" for this suite.
Oct 10 20:20:49.208: INFO: Running AfterSuite actions on all nodes
Oct 10 20:20:49.208: INFO: Running AfterSuite actions on node 1
fail [github.com/openshift/origin/test/extended/topology_manager/resourcealign.go:183]: Unexpected error:
    <*util.ExitError | 0xc002e80060>: {
        Cmd: "oc --namespace=e2e-test-topology-manager-r86df --kubeconfig=/root/.kube/config rsh -n default -c test-0 test-pqdtp ping -c 3 10.56.217.178",
        StdErr: "ping: permission denied (are you root?)\nPING 10.56.217.178 (10.56.217.178): 56 data bytes\ncommand terminated with exit code 1",
        ExitError: {
            ProcessState: {
                pid: 14145,
                status: 256,
                rusage: {
                    Utime: {Sec: 0, Usec: 160857},
                    Stime: {Sec: 0, Usec: 37275},
                    Maxrss: 154324,
                    Ixrss: 0,
                    Idrss: 0,
                    Isrss: 0,
                    Minflt: 2414,
                    Majflt: 0,
                    Nswap: 0,
                    Inblock: 0,
                    Oublock: 0,
                    Msgsnd: 0,
                    Msgrcv: 0,
                    Nsignals: 0,
                    Nvcsw: 1061,
                    Nivcsw: 0,
                },
            },
            Stderr: nil,
        },
    }
    exit status 1
occurred

Oct 10 20:19:18.786 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Pulled image/registry.redhat.io/redhat/redhat-marketplace-index:v4.6
Oct 10 20:19:18.982 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Created
Oct 10 20:19:19.008 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Started
Oct 10 20:19:30.528 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Ready
Oct 10 20:19:30.532 W ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 reason/GracefulDelete in 0s
Oct 10 20:19:30.552 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Killing
Oct 10 20:19:30.552 W ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 reason/Deleted
Oct 10 20:19:30.558 I ns/openshift-marketplace pod/redhat-marketplace-rszns node/worker001 container/registry-server reason/Killing
Oct 10 20:19:33.696 I ns/default pod/test-vcts8 node/ reason/Created
Oct 10 20:19:33.696 I ns/default pod/test-pqdtp node/ reason/Created
Oct 10 20:19:33.703 I ns/default pod/test-vcts8 node/worker000 reason/Scheduled
Oct 10 20:19:33.703 I ns/default pod/test-pqdtp node/worker000 reason/Scheduled
Oct 10 20:19:35.796 I ns/default pod/test-pqdtp reason/AddedInterface Add eth0 [10.128.2.121/23]
Oct 10 20:19:35.853 I ns/default pod/test-pqdtp reason/AddedInterface Add sriov1 [10.56.217.178/24] from default/sriov-intel
Oct 10 20:19:35.862 I ns/default pod/test-vcts8 reason/AddedInterface Add eth0 [10.128.2.120/23]
Oct 10 20:19:35.935 I ns/default pod/test-vcts8 reason/AddedInterface Add sriov1 [10.56.217.179/24] from default/sriov-intel
Oct 10 20:19:36.194 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Pulling image/busybox
Oct 10 20:19:36.214 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Pulling image/busybox
Oct 10 20:19:37.564 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Pulled image/busybox
Oct 10 20:19:37.566 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Pulled image/busybox
Oct 10 20:19:37.728 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Created
Oct 10 20:19:37.734 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Created
Oct 10 20:19:37.749 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Started
Oct 10 20:19:37.754 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Started
Oct 10 20:19:38.704 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Ready
Oct 10 20:19:38.709 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Ready
Oct 10 20:19:41.158 W ns/default pod/test-pqdtp node/worker000 reason/GracefulDelete in 30s
Oct 10 20:19:41.166 I ns/default pod/test-pqdtp node/worker000 container/test-0 reason/Killing
Oct 10 20:20:11.745 E ns/default pod/test-pqdtp node/worker000 container/test-0 container exited with code 137 (Error):
Oct 10 20:20:16.763 W ns/default pod/test-pqdtp node/worker000 reason/Deleted
Oct 10 20:20:17.167 W ns/default pod/test-vcts8 node/worker000 reason/GracefulDelete in 30s
Oct 10 20:20:17.171 I ns/default pod/test-vcts8 node/worker000 container/test-0 reason/Killing
Oct 10 20:20:47.780 E ns/default pod/test-vcts8 node/worker000 container/test-0 container exited with code 137 (Error):
Oct 10 20:20:48.784 W ns/default pod/test-vcts8 node/worker000 reason/Deleted

failed: (1m32s) 2020-10-10T20:20:49 "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

Version-Release number of selected component (if applicable):

Client Version: 4.6.0-0.nightly-2020-10-03-051134
Server Version: 4.6.0-0.nightly-2020-10-03-051134
Kubernetes Version: v1.19.0+db1fc96

How reproducible:

Every time

Steps to Reproduce:

1. Install an OCP 4.6.0-0.nightly-2020-10-03-051134 nightly cluster (3 masters and 2 worker nodes) with JetSki IPI Ansible automation on baremetal (see steps in https://github.com/redhat-performance/JetSki).

2. Configure CPU Manager with the static policy and enable Topology Manager (see docs:
   https://docs.openshift.com/container-platform/4.5/scalability_and_performance/using-cpu-manager.html
   https://docs.openshift.com/container-platform/4.5/scalability_and_performance/using-topology-manager.html).

3. Deploy the SR-IOV network operator:

   git clone https://github.com/openshift/sriov-network-operator.git
   cd sriov-network-operator
   make deploy-setup

4. Label the worker that has Mellanox SR-IOV cards, with virtualization enabled in the BIOS:

   oc label node worker000 feature.node.kubernetes.io/sriov-capable=true

5. oc create -f sriov-policy.yaml:

   apiVersion: sriovnetwork.openshift.io/v1
   kind: SriovNetworkNodePolicy
   metadata:
     name: policy-sriov
     namespace: openshift-sriov-network-operator
   spec:
     resourceName: intelsriov
     nodeSelector:
       feature.node.kubernetes.io/sriov-capable: 'true'
     priority: 51
     mtu: 1500
     numVfs: 8
     nicSelector:
       vendor: '8086'
       rootDevices:
         - '0000:86:00.1'
       pfNames:
         - ens7f1
     deviceType: netdevice

6. oc create -f sriov-ipam.yaml:

   apiVersion: sriovnetwork.openshift.io/v1
   kind: SriovNetwork
   metadata:
     name: sriov-intel
     namespace: openshift-sriov-network-operator
   spec:
     ipam: |
       {
         "type": "host-local",
         "subnet": "10.56.217.0/24",
         "rangeStart": "10.56.217.171",
         "rangeEnd": "10.56.217.181",
         "routes": [{ "dst": "0.0.0.0/0" }],
         "gateway": "10.56.217.1"
       }
     vlan: 0
     resourceName: intelsriov
     networkNamespace: default

7. Clone openshift/origin into $GOPATH/src/github.com/openshift/ and build openshift-tests:

   git clone https://github.com/openshift/origin.git $GOPATH/src/github.com/openshift/origin
   cd $GOPATH/src/github.com/openshift/origin
   make WHAT=cmd/openshift-tests

8. Export env vars:

   export KUBECONFIG=/root/.kube/config
   export SRIOV_NETWORK_NAMESPACE=default
   export SRIOV_NETWORK=sriov-intel
   export RESOURCE_NAME=openshift.io/intelsriov
   OPENSHIFT_TESTS=/root/go/src/github.com/openshift/origin/openshift-tests

9. Run the Topology Manager E2E tests:

   $OPENSHIFT_TESTS run openshift/conformance --dry-run | grep -E "TopologyManager" | $OPENSHIFT_TESTS run -f -

Actual results:

The test fails with a permission error when trying to ping between pods:

failed: (1m32s) 2020-10-10T20:20:49 "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

Expected results:

The test passes.

Additional info:

Links to more logs, oc commands, and cluster setup info will be in the next private comment.
I looked at Walid's setup, and the most likely culprits are:

1. An OCP change went unnoticed because the tests don't run often enough (that's a sore point, and a fix is long overdue).
2. The tests fail to pin the images they need, so maybe an image got updated and an unwanted change sneaked in.

Item 2 needs to be fixed anyway, so I will fix it.
I'm confident the issue is in the tests, which, because of an oversight, do not pin the image they need to a specific tag. A simple change in the test code should fix it.
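To illustrate what "pinned" means here, a minimal shell sketch follows. The helper name is_pinned and the sample image references (other than the untagged busybox seen in the events above and the cnf-tests image mentioned later in this bug) are assumptions for illustration only; this is not the check used in origin.

```shell
#!/bin/sh
# Hypothetical helper, not part of openshift/origin.
# Returns 0 if the image reference carries an explicit tag other than
# "latest" or a digest, and 1 otherwise.
is_pinned() {
    ref=$1
    # Keep only the last path component so that a registry port
    # (host:5000/name) is not mistaken for a tag.
    name=${ref##*/}
    case $name in
        *@sha256:*) return 0 ;;   # pinned by digest
        *:latest)   return 1 ;;   # explicit, but still a floating tag
        *:*)        return 0 ;;   # pinned by tag
        *)          return 1 ;;   # no tag at all: resolves to "latest"
    esac
}

# Illustrative references; "busybox:1.32" is just an example tag.
for image in busybox busybox:latest busybox:1.32 quay.io/openshift-kni/cnf-tests:4.6; do
    if is_pinned "$image"; then
        echo "$image: pinned"
    else
        echo "$image: floating"
    fi
done
```

An untagged reference such as the plain busybox pulled in the failing run is reported as floating, which is the situation described above: whatever the registry currently serves for that name is what the test gets.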
This is actually an issue in the tests, but "topology-manager" seems the closest in the current sub-component list. Reassigned.
Tested on the same OCP 4.6 baremetal cluster with the latest master branch of origin.

Tests are being skipped because I have only one of the two worker nodes enabled for Topology Manager. That was not a requirement before, AFAIK:

started: (0/5/11) "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

skip [github.com/openshift/origin/test/extended/topology_manager/utils.go:69]: topology manager not configured on all nodes

skipped: (16.2s) 2020-12-04T14:17:19 "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

-----

The test is also skipped because the NETWORK_CHECK_IMAGE env var is not defined. According to the origin docs:

The testsuite runs a basic connectivity test to ensure the NUMA-aligned devices are functional. Use this variable to set the image URL to use to check the network is working between pods which requested, and got, aligned resources. If this value is not set (default), the connectivity test will skip.

What is a good example image to use?

started: (0/10/11) "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

skip [github.com/openshift/origin/test/extended/topology_manager/resourcealign.go:140]: no network check image provided (use NETWORK_CHECK_IMAGE)

skipped: (16.4s) 2020-12-04T16:42:30 "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"
(In reply to Walid A. from comment #9)
> Tested on same OCP 4.6 baremetal cluster from the latest master branch of
> origin:
>
> Tests are being skipped as I have one of the two worker nodes enabled for
> topology manager, that was not a requirement before AFAIK:
>
> started: (0/5/11) "[Serial][sig-node][Feature:TopologyManager] Configured
> cluster with gu workload attached to SRIOV networks should let
> resource-aligned PODs have working SRIOV network interface
> [Suite:openshift/conformance/serial]"
>
> skip
> [github.com/openshift/origin/test/extended/topology_manager/utils.go:69]:
> topology manager not configured on all nodes
>
> skipped: (16.2s) 2020-12-04T14:17:19
> "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu
> workload attached to SRIOV networks should let resource-aligned PODs have
> working SRIOV network interface [Suite:openshift/conformance/serial]"
>
> -----
>
> And also skipped because NETWORK_CHECK_IMAGE env var is not defined.
> According to origin docs:
>
> The testsuite runs a basic connectivity test to ensure the NUMA-aligned
> devices are functional. Use this variable to set the image URL to use to
> check the network is working between pods which requested, and got, aligned
> resources. If this value is not set (default), the connectivity test will
> skip.
>
> What is a good example to use ?
>
> started: (0/10/11) "[Serial][sig-node][Feature:TopologyManager] Configured
> cluster with gu workload attached to SRIOV networks should let
> resource-aligned PODs have working SRIOV network interface
> [Suite:openshift/conformance/serial]"
>
> skip
> [github.com/openshift/origin/test/extended/topology_manager/resourcealign.go:
> 140]: no network check image provided (use NETWORK_CHECK_IMAGE)
>
> skipped: (16.4s) 2020-12-04T16:42:30
> "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu
> workload attached to SRIOV networks should let resource-aligned PODs have
> working SRIOV network interface [Suite:openshift/conformance/serial]"

For the time being, use:

quay.io/openshift-kni/cnf-tests:$OCP_VERSION

(e.g. "4.7" to test OCP 4.7.z, and so forth).

This probably needs better docs somewhere. In the coming weeks we are looking to have a suitable image available in OCP and a better default for this option.

The image could be as simple as

$ cat Dockerfile
FROM registry.access.redhat.com/ubi8/ubi-minimal:latest
RUN microdnf install -y iputils && microdnf clean all
ENTRYPOINT [ "/bin/ping" ]

but we need to have it automatically built (and maintained), and this will take a little while.
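As a rough sketch of how the $OCP_VERSION part can be filled in from a full version string such as the nightly used in this bug, something like the following could work. The function name cnf_tests_image is hypothetical, and the assumption that tags are always of the form major.minor comes only from the "4.7" example above.

```shell
#!/bin/sh
# Hypothetical helper: build the cnf-tests image reference for a given
# OCP version string, assuming tags are named <major>.<minor>.
cnf_tests_image() {
    version=$1
    major=${version%%.*}   # "4.6.0-0.nightly-..." -> "4"
    rest=${version#*.}     # "4.6.0-0.nightly-..." -> "6.0-0.nightly-..."
    minor=${rest%%.*}      # "6.0-0.nightly-..."   -> "6"
    echo "quay.io/openshift-kni/cnf-tests:${major}.${minor}"
}

NETWORK_CHECK_IMAGE=$(cnf_tests_image 4.6.0-0.nightly-2020-10-03-051134)
export NETWORK_CHECK_IMAGE
echo "$NETWORK_CHECK_IMAGE"
```

For the 4.6 nightly above this prints quay.io/openshift-kni/cnf-tests:4.6; whether a tag exists for every release is not something this sketch checks.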
Verified on the same OCP 4.6 baremetal cluster with the latest master branch of origin.

Set env var:

export NETWORK_CHECK_IMAGE=quay.io/openshift-kni/cnf-tests:4.6

. . .

started: (0/10/11) "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"

passed: (1m51s) 2020-12-04T18:39:53 "[Serial][sig-node][Feature:TopologyManager] Configured cluster with gu workload attached to SRIOV networks should let resource-aligned PODs have working SRIOV network interface [Suite:openshift/conformance/serial]"
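Since an unset variable makes the connectivity test skip rather than fail, a small pre-flight check before launching the suite may help. The sketch below is an assumption-laden illustration: the function name missing_test_env is hypothetical, and the variable list is simply the set of variables exported in this bug, not an authoritative list from origin.

```shell
#!/bin/sh
# Hypothetical pre-flight check, not part of openshift-tests.
# Prints the names of variables that are unset or empty and returns
# non-zero if any are missing.
missing_test_env() {
    missing=""
    for var in KUBECONFIG SRIOV_NETWORK_NAMESPACE SRIOV_NETWORK RESOURCE_NAME NETWORK_CHECK_IMAGE; do
        # Indirect lookup of the variable named in $var.
        eval "value=\${$var:-}"
        if [ -z "$value" ]; then
            missing="$missing $var"
        fi
    done
    if [ -n "$missing" ]; then
        echo "missing:$missing"
        return 1
    fi
    return 0
}

if missing_test_env; then
    echo "environment looks complete"
fi
```

Running this before the openshift-tests invocation makes a forgotten NETWORK_CHECK_IMAGE visible up front instead of showing up later as a skipped test.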
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633