Bug 2130326

Summary: unable to run subctl benchmark latency, pods fail with ImagePullBackOff
Product: Red Hat Advanced Cluster Management for Kubernetes
Reporter: Jason Kincl <jkincl>
Component: Submariner
Assignee: Mike Kolesnik <mkolesni>
Status: CLOSED CURRENTRELEASE
QA Contact: Noam Manos <nmanos>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhacm-2.6
CC: dfarrell, ecai, maafried, mbabushk, mkolesni, nmanos, nyechiel, skitt, tpanteli
Target Milestone: ---
Keywords: Reopened
Target Release: rhacm-2.7
Flags: bot-tracker-sync: rhacm-2.7+, nyechiel: rhacm-2.7.z+
Hardware: All   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2023-01-31 21:49:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Jason Kincl 2022-09-27 19:45:33 UTC
**What happened**:

When running `subctl benchmark latency`, the tool creates an `e2e-` namespace and tries to run a pod that fails to pull its image:

```
Events:
  Type     Reason          Age                From               Message
  ----     ------          ----               ----               -------
  Normal   Scheduled       69s                default-scheduler  Successfully assigned e2e-tests-latency-k9m7w/latency-server-podb8swr to ip-10-0-54-62.us-east-2.compute.internal by ip-10-0-155-214
  Normal   AddedInterface  68s                multus             Add eth0 [10.137.4.6/23] from openshift-sdn
  Normal   Pulling         24s (x3 over 68s)  kubelet            Pulling image "registry.redhat.io/rhacm2/nettest:v0.13.0"
  Warning  Failed          24s (x3 over 68s)  kubelet            Failed to pull image "registry.redhat.io/rhacm2/nettest:v0.13.0": rpc error: code = Unknown desc = reading manifest v0.13.0 in registry.redhat.io/rhacm2/nettest: unknown: Not Found
  Warning  Failed          24s (x3 over 68s)  kubelet            Error: ErrImagePull
  Normal   BackOff         11s (x4 over 67s)  kubelet            Back-off pulling image "registry.redhat.io/rhacm2/nettest:v0.13.0"
  Warning  Failed          11s (x4 over 67s)  kubelet            Error: ImagePullBackOff
```

It appears that the image does not exist in our image repository.

**What you expected to happen**:

I expect the image to exist and to work with subctl

**How to reproduce it (as minimally and precisely as possible)**:

Install Submariner with ACM and run `subctl benchmark latency`.

**Anything else we need to know?**:

**Environment**:
- Submariner version (use `subctl version`): v0.13.1
- Kubernetes version (use `kubectl version`):
Client Version: v1.24.1
Kustomize Version: v4.5.4
Server Version: v1.24.0+3882f8f

Comment 1 Stephen Kitt 2022-09-28 08:00:02 UTC
@tpanteli since you recently looked at the image overrides in the operator, could you take care of this? The problem is that subctl wants to deploy registry.redhat.io/rhacm2/nettest:v0.13.0, but the image is really registry.redhat.io/rhacm2/nettest-rhel8:v0.13.0. With ACM, the operator gets its image overrides through SubmarinerConfig, using that to populate the Submariner CR (see https://github.com/stolostron/submariner-addon/blob/main/pkg/hub/submarineragent/manifests/operator/submariner.io-submariners-cr.yaml for the template).

I suspect we need to add a nettest entry in the template.

Comment 2 Mike Kolesnik 2022-10-23 08:15:26 UTC
We need to merge https://github.com/submariner-io/subctl/pull/316, and then we can compile `subctl` downstream with an override directive to suffix the nettest image with `-rhel8`.
This should also help with other usages of `nettest` by subctl, such as diagnose.

With this fix, it would also be possible to specify the image in the overrides on the `Subctl` CR, though that would not be mandatory.
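
A minimal sketch of how such a build-time override could work. The variable `imageNameSuffix` and the helper `imageRef` are hypothetical names chosen for illustration; they are not taken from subctl or from the linked PR. The only real mechanism assumed here is the Go linker's `-X` flag, which sets a string variable at link time, so a downstream build could append `-rhel8` without changing the source:

```go
package main

import "fmt"

// imageNameSuffix is a hypothetical package-level variable. It is empty for
// upstream builds and could be overridden at link time for downstream builds:
//
//	go build -ldflags "-X main.imageNameSuffix=-rhel8" .
var imageNameSuffix = ""

// imageRef assembles "<repository>/<name><suffix>:<version>".
func imageRef(repository, name, version string) string {
	return fmt.Sprintf("%s/%s%s:%s", repository, name, imageNameSuffix, version)
}

func main() {
	// Without the override, the reference matches the one that failed to pull.
	fmt.Println(imageRef("registry.redhat.io/rhacm2", "nettest", "v0.13.0"))

	// Simulate the downstream link-time override in-process.
	imageNameSuffix = "-rhel8"
	fmt.Println(imageRef("registry.redhat.io/rhacm2", "nettest", "v0.13.0"))
}
```

With the suffix applied in one place, every caller that builds a nettest reference (benchmark, diagnose) would pick up the downstream name.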

Comment 3 Maayan Friedman 2022-11-03 14:38:07 UTC
QE is waiting for 0.14.0 to be downstream

Comment 4 Noam Manos 2022-11-16 23:44:22 UTC
Was it backported to 0.13.1 for ACM 2.6.2?

On a test run I got:
https://qe-jenkins-csb-skynet.apps.ocp-c1.prod.psi.redhat.com/job/ACM-2.6.2-Submariner-0.13.1-AWS-OSP-OVN/Test-Report/

$ oc get all -n submariner-operator
NAME                                                                  READY   STATUS             RESTARTS   AGE
pod/130259ce215f8646cf7a92686a732803718658cb32c71c36fc23cff5aa5htgt   0/1     Completed          0          118m
pod/3995ff715a639884baf12b984ddd3e2d0b65894d48654f55438cab15a5kr6hj   0/1     Completed          0          117m
pod/query-iface-listlxnln                                             0/1     ErrImagePull       0          88m
pod/submariner-addon-675984b497-vkv4b                                 1/1     Running            0          118m
pod/submariner-gateway-4zprj                                          1/1     Running            0          114m
pod/submariner-lighthouse-agent-7ccffc979d-64vhs                      1/1     Running            0          116m
pod/submariner-lighthouse-coredns-65d9bb8488-ptpmm                    1/1     Running            0          116m
pod/submariner-lighthouse-coredns-65d9bb8488-xhbn2                    1/1     Running            0          116m
pod/submariner-networkplugin-syncer-7d49598784-cssjl                  1/1     Running            0          116m
pod/submariner-operator-7b597fd5df-mw4cr                              1/1     Running            0          117m
pod/submariner-routeagent-44kj9                                       1/1     Running            0          116m
pod/submariner-routeagent-5nz8l                                       1/1     Running            0          114m
pod/submariner-routeagent-84lbv                                       1/1     Running            0          116m
pod/submariner-routeagent-9j7gn                                       1/1     Running            0          116m
pod/submariner-routeagent-hrp27                                       1/1     Running            0          116m
pod/submariner-routeagent-jq8mk                                       1/1     Running            0          116m
pod/submariner-routeagent-jwc7p                                       1/1     Running            0          116m
pod/submariner-stable-0-13-catalog-fs8w8                              1/1     Running            0          124m
pod/validate-sniffer79jsh                                             0/1     ImagePullBackOff   0          85m


The query-iface-list and validate-sniffer pods failed with:
Failed to pull image "registry.redhat.io/rhacm2/nettest:v0.13.1": rpc error: code = Unknown desc = (Mirrors also failed: [brew.registry.redhat.io/rh-osbs/rhacm2/nettest:v0.13.1

Comment 5 Mike Kolesnik 2022-11-17 08:26:45 UTC
This was fixed for 0.14 (2.7) and backported to 0.13; it is awaiting the 0.13.2 release.
Once 0.13.2 is available, ACM 2.6 should consume it, and you can then expect it to be fixed for 2.6.

Comment 6 Noam Manos 2022-11-17 20:25:20 UTC
Mike, I'm also getting a nettest ImagePullBackOff with 0.14.0:

$ subctl benchmark latency "/mnt/skynet-data/skynet-env-1/aws-nmanos-a1/auth/kubeconfig" "/mnt/skynet-data/skynet-env-1/gcp-nmanos-c1/auth/kubeconfig" --verbose
Performing latency tests
Creating kubernetes clients
Setting new cluster ID "acm-aws-nmanos-a1", previous cluster ID was "api-aws-nmanos-a1-devcluster-openshift-com:6443"
Setting new cluster ID "acm-gcp-nmanos-c1", previous cluster ID was "api-gcp-nmanos-c1-gcp-subm-red-chesterfield-com:6443"
Creating lighthouse clients
Creating submariner clients
Creating namespace objects with basename "latency"
Generated namespace "e2e-tests-latency-vgdtx" in cluster "acm-aws-nmanos-a1" to execute the tests in
Creating namespace "e2e-tests-latency-vgdtx" in cluster "acm-gcp-nmanos-c1"
Latency test is not supported with Globalnet enabled, skipping the test...
Deleting namespace "e2e-tests-latency-vgdtx" on cluster "acm-aws-nmanos-a1"
Deleting namespace "e2e-tests-latency-vgdtx" on cluster "acm-gcp-nmanos-c1"

$ subctl benchmark throughput "/mnt/skynet-data/skynet-env-1/aws-nmanos-a1/auth/kubeconfig" "/mnt/skynet-data/skynet-env-1/gcp-nmanos-c1/auth/kubeconfig" --verbose
Performing throughput tests
Creating kubernetes clients
Setting new cluster ID "acm-aws-nmanos-a1", previous cluster ID was "api-aws-nmanos-a1-devcluster-openshift-com:6443"
Setting new cluster ID "acm-gcp-nmanos-c1", previous cluster ID was "api-gcp-nmanos-c1-gcp-subm-red-chesterfield-com:6443"
Creating lighthouse clients
Creating submariner clients
Creating namespace objects with basename "throughput"
Generated namespace "e2e-tests-throughput-vztjs" in cluster "acm-aws-nmanos-a1" to execute the tests in
Creating namespace "e2e-tests-throughput-vztjs" in cluster "acm-gcp-nmanos-c1"
Performing throughput tests from Gateway pod on cluster "acm-aws-nmanos-a1" to Gateway pod on cluster "acm-gcp-nmanos-c1"
Creating a Nettest Server Pod on "acm-gcp-nmanos-c1"
Deleting namespace "e2e-tests-throughput-vztjs" on cluster "acm-aws-nmanos-a1"
Deleting namespace "e2e-tests-throughput-vztjs" on cluster "acm-gcp-nmanos-c1"
panic: Failed to await pod ready. Pod "nettest-server-podw7nk4" is still pending: status:
{
  "phase": "Pending",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2022-11-16T19:49:48Z"
    },
    {
      "type": "Ready",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2022-11-16T19:49:48Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [nettest-server-pod]"
    },
    {
      "type": "ContainersReady",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2022-11-16T19:49:48Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [nettest-server-pod]"
    },
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2022-11-16T19:49:48Z"
    }
  ],
  "hostIP": "10.16.128.5",
  "podIP": "10.218.2.5",
  "podIPs": [
    {
      "ip": "10.218.2.5"
    }
  ],
  "startTime": "2022-11-16T19:49:48Z",
  "containerStatuses": [
    {
      "name": "nettest-server-pod",
      "state": {
        "waiting": {
          "reason": "ImagePullBackOff",
          "message": "Back-off pulling image \"registry.redhat.io/rhacm2/nettest-rhel8:v0.14.0\""
        }
      },
      "lastState": {},
      "ready": false,
      "restartCount": 0,
      "image": "registry.redhat.io/rhacm2/nettest-rhel8:v0.14.0",
      "imageID": "",
      "started": false
    }
  ],
  "qosClass": "BestEffort"
}
Unexpected error:
    <*errors.errorString | 0xc000348ba0>: {
        s: "timed out waiting for the condition",
    }
    timed out waiting for the condition
occurred

goroutine 1 [running]:
github.com/submariner-io/subctl/internal/benchmark.StartThroughputTests.func1({0xc0004bc300, 0x6c3}, {0xc000348ba0?, 0xc00006d810?, 0xc00052f150?})
	/remote-source/app/internal/benchmark/throughput.go:46 +0x54
github.com/onsi/gomega/internal.(*Assertion).match(0xc00153ab00, {0x3345c80, 0x46c9800}, 0x0, {0xc000bb0780, 0x1, 0x1})
	/remote-source/app/vendor/github.com/onsi/gomega/internal/assertion.go:105 +0x1f0
github.com/onsi/gomega/internal.(*Assertion).NotTo(0xc00153ab00, {0x3345c80, 0x46c9800}, {0xc000bb0780, 0x1, 0x1})
	/remote-source/app/vendor/github.com/onsi/gomega/internal/assertion.go:73 +0xb2
github.com/submariner-io/shipyard/test/e2e/framework.AwaitUntil({0x2eb1c64?, 0xc00060adc0?}, 0x1a?, 0x0?)
	/remote-source/app/vendor/github.com/submariner-io/shipyard/test/e2e/framework/framework.go:562 +0xd4
github.com/submariner-io/shipyard/test/e2e/framework.(*NetworkPod).AwaitReady(0xc0005269b0)
	/remote-source/app/vendor/github.com/submariner-io/shipyard/test/e2e/framework/network_pods.go:137 +0xd9
github.com/submariner-io/shipyard/test/e2e/framework.(*NetworkPod).buildThroughputServerPod(0xc0005269b0)
	/remote-source/app/vendor/github.com/submariner-io/shipyard/test/e2e/framework/network_pods.go:425 +0x4d3
github.com/submariner-io/shipyard/test/e2e/framework.(*Framework).NewNetworkPod(0xc0008dd980, 0xc000993a70)
	/remote-source/app/vendor/github.com/submariner-io/shipyard/test/e2e/framework/network_pods.go:120 +0x205
github.com/submariner-io/subctl/internal/benchmark.runThroughputTest(0xc0008dd980, {0xc000010018?, 0x2f44bb2?, 0x58?, 0xc000cb7990?}, 0x1)
	/remote-source/app/internal/benchmark/throughput.go:116 +0x159
github.com/submariner-io/subctl/internal/benchmark.StartThroughputTests(0x0, 0x1)
	/remote-source/app/internal/benchmark/throughput.go:65 +0x23d
github.com/submariner-io/subctl/cmd/subctl.runBenchmark(0x2fef830, 0xc000c384c0, 0xc00123e5c0, 0xd0?)
	/remote-source/app/cmd/subctl/benchmark.go:157 +0x32d
github.com/submariner-io/subctl/cmd/subctl.buildBenchmarkRunner.func1.1.1(0xc0012b05f0?, {0x34?, 0xc000f91440?}, {0x0?, 0x0?})
	/remote-source/app/cmd/subctl/benchmark.go:92 +0x2e
github.com/submariner-io/subctl/internal/restconfig.(*Producer).RunOnSelectedContext(0xc000ef9b38, 0xc000ef9b20, {0x3354710, 0xc00029a9d0})
	/remote-source/app/internal/restconfig/restconfig.go:283 +0x1ca
github.com/submariner-io/subctl/cmd/subctl.buildBenchmarkRunner.func1.1(0xc000c384c0, {0x2f?, 0xc000c6a900?}, {0x3354710, 0xc00029a9d0})
	/remote-source/app/cmd/subctl/benchmark.go:90 +0xf6
github.com/submariner-io/subctl/internal/restconfig.(*Producer).RunOnSelectedContext(0xc00139fcb0, 0xc00139fc88, {0x3354710, 0xc00029a9d0})
	/remote-source/app/internal/restconfig/restconfig.go:283 +0x1ca
github.com/submariner-io/subctl/cmd/subctl.buildBenchmarkRunner.func1(0x4667b60?, {0xc000725350?, 0x2, 0x3})
	/remote-source/app/cmd/subctl/benchmark.go:88 +0x1af
github.com/spf13/cobra.(*Command).execute(0x4667b60, {0xc0007252c0, 0x3, 0x3})
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:920 +0x847
github.com/spf13/cobra.(*Command).ExecuteC(0x4665600)
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:1044 +0x3bc
github.com/spf13/cobra.(*Command).Execute(...)
	/remote-source/app/vendor/github.com/spf13/cobra/command.go:968
github.com/submariner-io/subctl/cmd/subctl.Execute()
	/remote-source/app/cmd/subctl/root.go:49 +0x25
main.main()
	/remote-source/app/cmd/main.go:20 +0x17
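
For triage, the pod status dump above already names the cause: the container is waiting with reason `ImagePullBackOff`. A rough sketch of extracting that from the status JSON, assuming only the field names visible in the dump (the `podStatus` type and `imagePullFailures` helper are hypothetical and not part of subctl or shipyard):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// podStatus mirrors only the fields of the pod status dump needed here.
type podStatus struct {
	Phase             string `json:"phase"`
	ContainerStatuses []struct {
		Name  string `json:"name"`
		Image string `json:"image"`
		State struct {
			Waiting *struct {
				Reason  string `json:"reason"`
				Message string `json:"message"`
			} `json:"waiting"`
		} `json:"state"`
	} `json:"containerStatuses"`
}

// imagePullFailures returns the images of all containers that are waiting
// with reason ErrImagePull or ImagePullBackOff.
func imagePullFailures(raw []byte) ([]string, error) {
	var st podStatus
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, err
	}
	var failed []string
	for _, cs := range st.ContainerStatuses {
		w := cs.State.Waiting
		if w != nil && (w.Reason == "ErrImagePull" || w.Reason == "ImagePullBackOff") {
			failed = append(failed, cs.Image)
		}
	}
	return failed, nil
}

func main() {
	// Trimmed copy of the status from the panic message.
	raw := []byte(`{
	  "phase": "Pending",
	  "containerStatuses": [{
	    "name": "nettest-server-pod",
	    "state": {"waiting": {"reason": "ImagePullBackOff",
	      "message": "Back-off pulling image \"registry.redhat.io/rhacm2/nettest-rhel8:v0.14.0\""}},
	    "image": "registry.redhat.io/rhacm2/nettest-rhel8:v0.14.0"
	  }]
	}`)

	failed, err := imagePullFailures(raw)
	if err != nil {
		fmt.Println("cannot parse pod status:", err)
		return
	}
	for _, image := range failed {
		fmt.Println("image could not be pulled:", image)
	}
}
```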

Comment 7 Nir Yechiel 2022-11-18 13:26:36 UTC
@Mike, can you please take a look at the last comment?

Comment 8 Mike Kolesnik 2022-11-20 10:38:06 UTC
The image name is correct now: registry.redhat.io/rhacm2/nettest-rhel8:v0.14.0. Has it already been published in the registry?

I suspect it has not been published yet.

Comment 9 Noam Manos 2022-11-21 20:19:00 UTC
On https://qe-jenkins-csb-skynet.apps.ocp-c1.prod.psi.redhat.com/view/ACM%202.7/job/ACM-2.7.0-Submariner-0.14.0-AWS-GCP-Globalnet/36/Test-Report/ I got "Back-off pulling image registry.redhat.io/rhacm2/nettest-rhel8:v0.14.0" when running the subctl benchmark command.

However, pulling the nettest image directly from "registry-proxy.engineering.redhat.com/rh-osbs/rhacm2-nettest-rhel8:v0.14.0" works fine:

Events:
  Type    Reason          Age   From               Message
  ----    ------          ----  ----               -------
  Normal  Scheduled       12s   default-scheduler  Successfully assigned test-submariner/netshoot-cl-a to ip-10-16-214-254.us-west-1.compute.internal by ip-10-16-170-71
  Normal  AddedInterface  10s   multus             Add eth0 [10.216.2.7/23] from openshift-sdn
  Normal  Pulling         10s   kubelet            Pulling image "registry-proxy.engineering.redhat.com/rh-osbs/rhacm2-nettest-rhel8:v0.14.0"
  Normal  Pulled          2s    kubelet            Successfully pulled image "registry-proxy.engineering.redhat.com/rh-osbs/rhacm2-nettest-rhel8:v0.14.0" in 8.464611409s
  Normal  Created         1s    kubelet            Created container netshoot
  Normal  Started         1s    kubelet            Started container netshoot


Note that the subctl binary was pulled from:
oc  image extract "registry-proxy.engineering.redhat.com/rh-osbs/rhacm2-subctl-rhel8:v0.14.0" 

But (as reported in a Jira issue recently), the subctl there still seems unfinished, at least in its version and file names:

 1 -rw-r-----     12599528 /mnt/skynet-data/skynet-env-1/subctl-vsubctl-darwin-amd64.tar.xz
 1 -rw-r-----     12000016 /mnt/skynet-data/skynet-env-1/subctl-vsubctl-linux-amd64.tar.xz
 1 -rw-r-----     12002148 /mnt/skynet-data/skynet-env-1/subctl-vsubctl-windows-amd64.exe.tar.xz

Comment 10 Mike Kolesnik 2022-11-22 07:23:41 UTC
It seems that we need to wait for the official image then, unless you want to test this with "registry-proxy.engineering.redhat.com/rh-osbs/rhacm2-nettest-rhel8:v0.14.0", but that won't validate the default `subctl benchmark` behavior.

I'm switching back to MODIFIED as we're waiting for the image to be available.

Comment 11 Stephen Kitt 2022-12-01 11:20:48 UTC
@nmanos you won’t get registry.redhat.io/rhacm2/nettest-rhel8:v0.14.0 until 2.7 goes GA with 0.14.0. I thought you mirrored those images into a QE-specific repository; isn’t that the case? @mbabushk can you reproduce this too?

Comment 12 Maxim Babushkin 2022-12-01 12:46:05 UTC
@skitt yes, you're right. We are mirroring the images into the cluster internal registry.

@nmanos please make sure to name the nettest image "nettest-rhel8" when you import it into the cluster internal registry.
I believe that's the issue.

I verified the use of the nettest-rhel8 image for the 0.14.0 release and it works fine.
We (QE) just need to make sure we are importing the image with the right name.

Comment 13 Noam Manos 2022-12-01 14:58:19 UTC
Thanks, I imported it into the local registry, and it now works:
oc import-image -n submariner-operator nettest-rhel8:v0.14.0 --from=brew.registry.redhat.io/rh-osbs/rhacm2-nettest-rhel8:v0.14.0 --confirm

Name:			nettest-rhel8
 Namespace:		submariner-operator
 Created:		Less than a second ago
 Labels:			<none>
 Annotations:		openshift.io/image.dockerRepositoryCheck=2022-12-01T14:51:07Z
 Image Repository:	image-registry.openshift-image-registry.svc:5000/submariner-operator/nettest-rhel8
 Image Lookup:		local=false
 Unique Images:		1
 Tags:			1
 
 v0.14.0
   tagged from brew.registry.redhat.io/rh-osbs/rhacm2-nettest-rhel8:v0.14.0
 
   * brew.registry.redhat.io/rh-osbs/rhacm2-nettest-rhel8@sha256:efed4fca8735e8ad1cfc02091969876bb961c46ad6fcff1b02d4a68a4c464834
       Less than a second ago
 
 Image Name:	nettest-rhel8:v0.14.0
 Docker Image:	brew.registry.redhat.io/rh-osbs/rhacm2-nettest-rhel8@sha256:efed4fca8735e8ad1cfc02091969876bb961c46ad6fcff1b02d4a68a4c464834
 Name:		sha256:efed4fca8735e8ad1cfc02091969876bb961c46ad6fcff1b02d4a68a4c464834
 Created:	Less than a second ago
 Annotations:	image.openshift.io/dockerLayersOrder=ascending
 Image Size:	110.6MB in 2 layers
 Layers:		39.38MB	sha256:725a55c4212630f1b818ee1e82c5f7a9e4ed42456f19ea13b052da427cfdff82
 		71.25MB	sha256:f3db4525bee5d3f5de0e069d826414f256f0a3f6b2fb4c2df79c52c854bd08b5
 Image Created:	22 hours ago
 Author:		<none>
 Arch:		amd64
 Command:	/bin/bash -l
 Working Dir:	/app
 User:		<none>
 Exposes Ports:	<none>
 Docker Labels:	architecture=x86_64
 		build-date=2022-11-30T17:05:39
 		com.github.commit=f6ef77489d735185ad7d200926bc151c11e03200
 		com.github.url=https://github.com/submariner-io/shipyard.git
 		com.redhat.component=nettest-container
 		com.redhat.license_terms=https://www.redhat.com/agreements
 		description=nettest
 		distribution-scope=public
 		io.buildah.version=1.27.1
 		io.k8s.description=nettest
 		io.k8s.display-name=nettest
 		io.openshift.expose-services=
 		io.openshift.non-scalable=true
 		io.openshift.tags=submariner,nettest,rhel8
 		io.openshift.wants=
 		maintainer=['multi-cluster-networking']
 		name=rhacm2/nettest-rhel8
 		release=13
 		summary=nettest
 		url=https://access.redhat.com/containers/#/registry.access.redhat.com/rhacm2/nettest-rhel8/images/v0.14.0-13
 		vcs-ref=66421474326c7aa6138ee3087329b87e83462408
 		vcs-type=git
 		vendor=Red Hat, Inc.
 		version=v0.14.0
 Environment:	PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
 		container=oci