Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1977479

Summary:	CI failing on firing CertifiedOperatorsCatalogError due to slow livenessProbe responses
Product:	OpenShift Container Platform	Reporter:	Anik <anbhatta>
Component:	OLM	Assignee:	Anik <anbhatta>
OLM sub component:	OLM	QA Contact:	Jian Zhang <jiazha>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	urgent
Priority:	urgent	CC:	davegord, ecordell, ercohen, jchaloup, jiazha, palshure, sgreene, wking
Version:	4.8
Target Milestone:	---
Target Release:	4.8.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:	1976326	Environment:	[sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing
Last Closed:	2021-07-27 23:13:47 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1976326
Bug Blocks:

Description Anik 2021-06-29 20:57:32 UTC

+++ This bug was initially created as a clone of Bug #1976326 +++

Catalog operator containers certified-operators, redhat-marketplace and redhat-operators are failing on liveness probe (from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1408366552603955200):

```
alert CertifiedOperatorsCatalogError fired for 240 seconds with labels: {container="catalog-operator", endpoint="https-metrics", exported_namespace="openshift-marketplace", instance="10.129.0.25:8081", job="catalog-operator-metrics", name="certified-operators", namespace="openshift-operator-lifecycle-manager", pod="catalog-operator-9c74f8b8d-6x84r", service="catalog-operator-metrics", severity="warning"}
alert RedhatMarketplaceCatalogError fired for 300 seconds with labels: {container="catalog-operator", endpoint="https-metrics", exported_namespace="openshift-marketplace", instance="10.129.0.25:8081", job="catalog-operator-metrics", name="redhat-marketplace", namespace="openshift-operator-lifecycle-manager", pod="catalog-operator-9c74f8b8d-6x84r", service="catalog-operator-metrics", severity="warning"}
alert RedhatOperatorsCatalogError fired for 180 seconds with labels: {container="catalog-operator", endpoint="https-metrics", exported_namespace="openshift-marketplace", instance="10.129.0.25:8081", job="catalog-operator-metrics", name="redhat-operators", namespace="openshift-operator-lifecycle-manager", pod="catalog-operator-9c74f8b8d-6x84r", service="catalog-operator-metrics", severity="warning"}
```

Prometheus rules defined in https://github.com/operator-framework/operator-marketplace/blob/e39ff59d5abc3e27effc7b726329d06a37644f2e/manifests/12_prometheus_rule.yaml.

Checking journal logs (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1408366552603955200/artifacts/e2e-gcp-rt/gather-extra/artifacts/nodes/ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv/journal) where all three containers are running:

```
Jun 25 11:01:19.629237 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:01:19.629204    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:01:49.452546 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:01:49.452510    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:09.886205 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:09.885666    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:19.486876 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:19.486839    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:59.519726 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:59.519379    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:03:09.442769 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:03:09.442731    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:06:29.494754 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:06:29.494696    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:06:40.318189 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:06:40.318020    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:06:59.603150 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:06:59.603106    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
...
Jun 25 11:41:59.457077 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:41:59.457009    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
```

The liveness probe's configuration:
```
    livenessProbe:
      exec:
        command:
        - grpc_health_probe
        - -addr=:50051
      failureThreshold: 3
      initialDelaySeconds: 10
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
```

The probe is configured to be ran every 10s. Correlating with the "Probe failed" logs from the ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv node, the probe does not fail consistently every 10s. E.g.:
```
Jun 25 11:01:19.629237 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:01:19.629204    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:01:49.452546 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:01:49.452510    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:09.886205 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:09.885666    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:19.486876 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:19.486839    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:59.519726 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:59.519379    1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
```

So it's likely the container itself does not always respond to grpc_health_probe in time. The issue looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1815957#c6 described as:
```
The issue is that on some systems, a lack of an nsswitch config means that dns will attempt to resolve from external sources before local files (so resolving `localhost` could take a long time).
```

Affected jobs can be found in https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt. Namely:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1408366552603955200
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1408333569146753024
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1408215657794768896

Other jobs are failing with "alert CommunityOperatorsCatalogError fired for" alert at least. See https://search.ci.openshift.org/?search=alert+CommunityOperatorsCatalogError+fired+for&maxAge=24h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

--- Additional comment from Jan Chaloupka on 2021-06-25 19:27:40 UTC ---

Possibly fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1974054#c1? See https://coreos.slack.com/archives/C3VS0LV41/p1624282085278400

--- Additional comment from Evan Cordell on 2021-06-28 22:08:49 UTC ---

The images in question will start up and write out an nsswitch config to avoid the slow resolution problem, I doubt that is the issue (unless there's been some regression related to nsswitch)

There are two options that seems possible/likely:

1.grpc_health_probe is slow to respond inside the container
2. CRI-O itself is timing out on ExecSync

(1) is possible, but I'm skeptical of it just because of the long timeouts / retries used. They're really generous, and the api should be available almost immediately on container start. To investigate more, I think we should instrument the index pods with grpc metrics and see if we can get data on what the probes are seeing.

(2) is starting to seem more likely to me, especially because via CI search I can see this issue affecting SNO clusters at a much higher rate than other clusters configurations (where CRI-O would have the highest contention). Unfortunately CRI-O metrics are not enabled in openshift, so I can't see any latency/load data during cluster boot.

I think both are worth further investigation, and I'd like to enlist some node folks to help investigate in case (2) seems plausible. My read of kubelet is that we can't distinguish between the two cases based on the audit logs alone: https://github.com/kubernetes/kubernetes/blob/397ff0b3cb2d674113f3cb8195e044ebdf88d27b/pkg/kubelet/cri/remote/remote_runtime.go#L396-L399

--- Additional comment from Kevin Rizza on 2021-06-29 12:42:40 UTC ---



--- Additional comment from Evan Cordell on 2021-06-29 20:41:16 UTC ---

Updating based on further investigation:

the issue seems to be simple cpu contention in the probe, which is more likely in SNO clusters during boot. increasing the timeout should resolve the CI issues - this is not an issue for running clusters, since eventually the pods restart at a time when there is not contention and can respond to the probes.

--- Additional comment from Anik on 2021-06-29 20:43:52 UTC ---

It is highly plausible that the "crio is timing out" issue mentioned by Evan is the likely cause here. In the absence of crio metrics, observing a cluster that had the alert firing during cluster bootstrap (deduced from the restarting registry pods in the openshift-marketplace namespace before eventually running successfully): 

```
$ oc get pods -n openshift-marketplace 
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-bxv82               1/1     Running   6          30m
community-operators-rjq4j               1/1     Running   6          30m
marketplace-operator-6b559b4bd9-65pzk   1/1     Running   0          34m
redhat-marketplace-9fn6k                1/1     Running   5          30m
redhat-operators-9fztp                  1/1     Running   5          30m 

$ oc get events -n openshift-marketplace 
LAST SEEN   TYPE      REASON                         OBJECT                                      MESSAGE
31m         Normal    Scheduled                      pod/certified-operators-bxv82               Successfully assigned openshift-marketplace/certified-operators-bxv82 to ci-ln-fn2msht-f76d1-svf6b-master-0
31m         Warning   FailedScheduling               pod/certified-operators-bxv82               binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "certified-operators-bxv82": pod certified-operators-bxv82 is already assigned to node "ci-ln-fn2msht-f76d1-svf6b-master-0"
31m         Normal    AddedInterface                 pod/certified-operators-bxv82               Add eth0 [10.128.0.61/23] from openshift-sdn
21m         Normal    Pulling                        pod/certified-operators-bxv82               Pulling image "registry.redhat.io/redhat/certified-operator-index:v4.8"
30m         Normal    Pulled                         pod/certified-operators-bxv82               Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 1m28.476540981s
21m         Normal    Created                        pod/certified-operators-bxv82               Created container registry-server
21m         Normal    Started                        pod/certified-operators-bxv82               Started container registry-server
4m36s       Warning   Unhealthy                      pod/certified-operators-bxv82               Liveness probe failed: command timed out
21m         Normal    Killing                        pod/certified-operators-bxv82               Container registry-server failed liveness probe, will be restarted
21m         Warning   Unhealthy                      pod/certified-operators-bxv82               Readiness probe failed:
21m         Normal    Pulled                         pod/certified-operators-bxv82               Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 1.831610737s
21m         Normal    Pulled                         pod/certified-operators-bxv82               Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 1.712293116s
4m37s       Normal    Scheduled                      pod/certified-operators-jjvq8               Successfully assigned openshift-marketplace/certified-operators-jjvq8 to ci-ln-fn2msht-f76d1-svf6b-master-0
4m35s       Normal    AddedInterface                 pod/certified-operators-jjvq8               Add eth0 [10.128.0.111/23] from openshift-sdn
4m33s       Normal    Pulling                        pod/certified-operators-jjvq8               Pulling image "registry.redhat.io/redhat/certified-operator-index:v4.8"
4m31s       Normal    Pulled                         pod/certified-operators-jjvq8               Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 1.688453392s
4m31s       Normal    Created                        pod/certified-operators-jjvq8               Created container registry-server
4m30s       Normal    Started                        pod/certified-operators-jjvq8               Started container registry-server
4m17s       Normal    Killing                        pod/certified-operators-jjvq8               Stopping container registry-server
14m         Normal    Scheduled                      pod/certified-operators-t9t5k               Successfully assigned openshift-marketplace/certified-operators-t9t5k to ci-ln-fn2msht-f76d1-svf6b-master-0
14m         Normal    AddedInterface                 pod/certified-operators-t9t5k               Add eth0 [10.128.0.106/23] from openshift-sdn
14m         Normal    Pulling                        pod/certified-operators-t9t5k               Pulling image "registry.redhat.io/redhat/certified-operator-index:v4.8"
14m         Normal    Pulled                         pod/certified-operators-t9t5k               Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 1.846749222s
14m         Normal    Created                        pod/certified-operators-t9t5k               Created container registry-server
14m         Normal    Started                        pod/certified-operators-t9t5k               Started container registry-server
14m         Normal    Killing                        pod/certified-operators-t9t5k               Stopping container registry-server
20m         Normal    Scheduled                      pod/certified-operators-zz5w9               Successfully assigned openshift-marketplace/certified-operators-zz5w9 to ci-ln-fn2msht-f76d1-svf6b-master-0
20m         Normal    AddedInterface                 pod/certified-operators-zz5w9               Add eth0 [10.128.0.94/23] from openshift-sdn
20m         Normal    Pulling                        pod/certified-operators-zz5w9               Pulling image "registry.redhat.io/redhat/certified-operator-index:v4.8"
20m         Normal    Pulled                         pod/certified-operators-zz5w9               Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 2.015105481s
19m         Normal    Created                        pod/certified-operators-zz5w9               Created container registry-server
19m         Normal    Started                        pod/certified-operators-zz5w9               Started container registry-server
19m         Normal    Killing                        pod/certified-operators-zz5w9               Stopping container registry-server
19m         Warning   Unhealthy                      pod/certified-operators-zz5w9               Liveness probe failed: command timed out
19m         Warning   Unhealthy                      pod/certified-operators-zz5w9               Liveness probe errored: rpc error: code = NotFound desc = could not find container "906cab4d013567d4733c03f12bd99766299d959cbe3cb02e953ae2c638931864": container with ID starting with 906cab4d013567d4733c03f12bd99766299d959cbe3cb02e953ae2c638931864 not found: ID does not exist
19m         Warning   Unhealthy                      pod/certified-operators-zz5w9               Readiness probe errored: rpc error: code = NotFound desc = could not find container "906cab4d013567d4733c03f12bd99766299d959cbe3cb02e953ae2c638931864": container with ID starting with 906cab4d013567d4733c03f12bd99766299d959cbe3cb02e953ae2c638931864 not found: ID does not exist
18m         Warning   FailedKillPod                  pod/certified-operators-zz5w9               error killing pod: failed to "KillPodSandbox" for "1ace1669-e07e-4087-ba65-90ae1fc7520f" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_certified-operators-zz5w9_openshift-marketplace_1ace1669-e07e-4087-ba65-90ae1fc7520f_0(9a59f9d7e0d748984383c86e19a162b4f24230f1660c8b0969a7dbe973775c1d): Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-marketplace/certified-operators-zz5w9]: error getting pod: Get \"https://[api-int.ci-ln-fn2msht-f76d1.origin-ci-int-gce.dev.openshift.com]:6443/api/v1/namespaces/openshift-marketplace/pods/certified-operators-zz5w9?timeout=1m0s\": dial tcp 10.0.0.2:6443: i/o timeout"
31m         Warning   FailedToUpdateEndpointSlices   service/certified-operators                 Error updating Endpoint Slices for Service openshift-marketplace/certified-operators: failed to create EndpointSlice for Service openshift-marketplace/certified-operators: Internal error occurred: admission plugin "OwnerReferencesPermissionEnforcement" failed to complete validation in 13s
4m46s       Normal    Scheduled                      pod/community-operators-dw74z               Successfully assigned openshift-marketplace/community-operators-dw74z to ci-ln-fn2msht-f76d1-svf6b-master-0
4m44s       Normal    AddedInterface                 pod/community-operators-dw74z               Add eth0 [10.128.0.110/23] from openshift-sdn
4m44s       Normal    Pulling                        pod/community-operators-dw74z               Pulling image "registry.redhat.io/redhat/community-operator-index:v4.8"
4m42s       Normal    Pulled                         pod/community-operators-dw74z               Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1.767186196s
4m41s       Normal    Created                        pod/community-operators-dw74z               Created container registry-server
4m41s       Normal    Started                        pod/community-operators-dw74z               Started container registry-server
4m26s       Normal    Killing                        pod/community-operators-dw74z               Stopping container registry-server
20m         Normal    Scheduled                      pod/community-operators-fmrqw               Successfully assigned openshift-marketplace/community-operators-fmrqw to ci-ln-fn2msht-f76d1-svf6b-master-0
20m         Normal    AddedInterface                 pod/community-operators-fmrqw               Add eth0 [10.128.0.96/23] from openshift-sdn
20m         Normal    Pulling                        pod/community-operators-fmrqw               Pulling image "registry.redhat.io/redhat/community-operator-index:v4.8"
20m         Normal    Pulled                         pod/community-operators-fmrqw               Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 2.774797223s
20m         Normal    Created                        pod/community-operators-fmrqw               Created container registry-server
19m         Normal    Started                        pod/community-operators-fmrqw               Started container registry-server
19m         Warning   Unhealthy                      pod/community-operators-fmrqw               Liveness probe failed: command timed out
19m         Normal    Killing                        pod/community-operators-fmrqw               Stopping container registry-server
19m         Warning   Unhealthy                      pod/community-operators-fmrqw               Liveness probe errored: rpc error: code = NotFound desc = could not find container "c5a04306f9c1787a9434820dc1ff18d0a1351971df1e04a565a084439be7ab14": container with ID starting with c5a04306f9c1787a9434820dc1ff18d0a1351971df1e04a565a084439be7ab14 not found: ID does not exist
19m         Warning   Unhealthy                      pod/community-operators-fmrqw               Readiness probe errored: rpc error: code = NotFound desc = could not find container "c5a04306f9c1787a9434820dc1ff18d0a1351971df1e04a565a084439be7ab14": container with ID starting with c5a04306f9c1787a9434820dc1ff18d0a1351971df1e04a565a084439be7ab14 not found: ID does not exist
18m         Warning   FailedKillPod                  pod/community-operators-fmrqw               error killing pod: failed to "KillPodSandbox" for "2f0d6d3f-c3c7-49e9-bc35-d0a2eda0ffdb" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_community-operators-fmrqw_openshift-marketplace_2f0d6d3f-c3c7-49e9-bc35-d0a2eda0ffdb_0(80052adee9bfda08edb8cf6a4912bfa30b2a6bdddb260a99ed0211ee138feb69): Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-marketplace/community-operators-fmrqw]: error getting pod: Get \"https://[api-int.ci-ln-fn2msht-f76d1.origin-ci-int-gce.dev.openshift.com]:6443/api/v1/namespaces/openshift-marketplace/pods/community-operators-fmrqw?timeout=1m0s\": dial tcp 10.0.0.2:6443: i/o timeout"
14m         Normal    Scheduled                      pod/community-operators-gfwlc               Successfully assigned openshift-marketplace/community-operators-gfwlc to ci-ln-fn2msht-f76d1-svf6b-master-0
14m         Normal    AddedInterface                 pod/community-operators-gfwlc               Add eth0 [10.128.0.104/23] from openshift-sdn
14m         Normal    Pulling                        pod/community-operators-gfwlc               Pulling image "registry.redhat.io/redhat/community-operator-index:v4.8"
14m         Normal    Pulled                         pod/community-operators-gfwlc               Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1.666605741s
14m         Normal    Created                        pod/community-operators-gfwlc               Created container registry-server
14m         Normal    Started                        pod/community-operators-gfwlc               Started container registry-server
14m         Normal    Killing                        pod/community-operators-gfwlc               Stopping container registry-server
4m15s       Normal    Scheduled                      pod/community-operators-kzxq2               Successfully assigned openshift-marketplace/community-operators-kzxq2 to ci-ln-fn2msht-f76d1-svf6b-master-0
4m13s       Normal    AddedInterface                 pod/community-operators-kzxq2               Add eth0 [10.128.0.113/23] from openshift-sdn
4m12s       Normal    Pulling                        pod/community-operators-kzxq2               Pulling image "registry.redhat.io/redhat/community-operator-index:v4.8"
4m10s       Normal    Pulled                         pod/community-operators-kzxq2               Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1.680937805s
4m9s        Normal    Created                        pod/community-operators-kzxq2               Created container registry-server
4m9s        Normal    Started                        pod/community-operators-kzxq2               Started container registry-server
3m55s       Normal    Killing                        pod/community-operators-kzxq2               Stopping container registry-server
31m         Normal    Scheduled                      pod/community-operators-rjq4j               Successfully assigned openshift-marketplace/community-operators-rjq4j to ci-ln-fn2msht-f76d1-svf6b-master-0
31m         Warning   FailedScheduling               pod/community-operators-rjq4j               binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "community-operators-rjq4j": pod community-operators-rjq4j is already assigned to node "ci-ln-fn2msht-f76d1-svf6b-master-0"
31m         Normal    AddedInterface                 pod/community-operators-rjq4j               Add eth0 [10.128.0.62/23] from openshift-sdn
20m         Normal    Pulling                        pod/community-operators-rjq4j               Pulling image "registry.redhat.io/redhat/community-operator-index:v4.8"
30m         Normal    Pulled                         pod/community-operators-rjq4j               Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1m28.879243424s
20m         Normal    Created                        pod/community-operators-rjq4j               Created container registry-server
20m         Normal    Started                        pod/community-operators-rjq4j               Started container registry-server
4m36s       Warning   Unhealthy                      pod/community-operators-rjq4j               Liveness probe failed: command timed out
20m         Normal    Killing                        pod/community-operators-rjq4j               Container registry-server failed liveness probe, will be restarted
21m         Warning   Unhealthy                      pod/community-operators-rjq4j               Readiness probe failed:
21m         Normal    Pulled                         pod/community-operators-rjq4j               Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1.688388556s
20m         Normal    Pulled                         pod/community-operators-rjq4j               Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1.836622456s
31m         Warning   FailedToUpdateEndpointSlices   service/community-operators                 Error updating Endpoint Slices for Service openshift-marketplace/community-operators: failed to create EndpointSlice for Service openshift-marketplace/community-operators: Internal error occurred: admission plugin "OwnerReferencesPermissionEnforcement" failed to complete validation in 13s
35m         Warning   FailedScheduling               pod/marketplace-operator-6b559b4bd9-65pzk   0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
35m         Warning   FailedScheduling               pod/marketplace-operator-6b559b4bd9-65pzk   0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
34m         Normal    Scheduled                      pod/marketplace-operator-6b559b4bd9-65pzk   Successfully assigned openshift-marketplace/marketplace-operator-6b559b4bd9-65pzk to ci-ln-fn2msht-f76d1-svf6b-master-0
34m         Warning   FailedMount                    pod/marketplace-operator-6b559b4bd9-65pzk   MountVolume.SetUp failed for volume "marketplace-operator-metrics" : secret "marketplace-operator-metrics" not found
33m         Normal    AddedInterface                 pod/marketplace-operator-6b559b4bd9-65pzk   Add eth0 [10.128.0.32/23] from openshift-sdn
33m         Normal    Pulling                        pod/marketplace-operator-6b559b4bd9-65pzk   Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f5c31a28af169f435d45f2f9a517333f5b29e1b488acb4cdb1c6f53b68a25fb"
33m         Normal    Pulled                         pod/marketplace-operator-6b559b4bd9-65pzk   Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f5c31a28af169f435d45f2f9a517333f5b29e1b488acb4cdb1c6f53b68a25fb" in 32.330314486s
32m         Normal    Created                        pod/marketplace-operator-6b559b4bd9-65pzk   Created container marketplace-operator
32m         Normal    Started                        pod/marketplace-operator-6b559b4bd9-65pzk   Started container marketplace-operator
32m         Warning   ProbeError                     pod/marketplace-operator-6b559b4bd9-65pzk   Liveness probe error: Get "http://10.128.0.32:8080/healthz": dial tcp 10.128.0.32:8080: connect: connection refused
body:
32m         Warning   Unhealthy                      pod/marketplace-operator-6b559b4bd9-65pzk   Liveness probe failed: Get "http://10.128.0.32:8080/healthz": dial tcp 10.128.0.32:8080: connect: connection refused
32m         Warning   ProbeError                     pod/marketplace-operator-6b559b4bd9-65pzk   Readiness probe error: Get "http://10.128.0.32:8080/healthz": dial tcp 10.128.0.32:8080: connect: connection refused
body:
32m         Warning   Unhealthy                      pod/marketplace-operator-6b559b4bd9-65pzk    Readiness probe failed: Get "http://10.128.0.32:8080/healthz": dial tcp 10.128.0.32:8080: connect: connection refused
37m         Warning   FailedCreate                   replicaset/marketplace-operator-6b559b4bd9   Error creating: pods "marketplace-operator-6b559b4bd9-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride the cluster does not have any nodes
35m         Normal    SuccessfulCreate               replicaset/marketplace-operator-6b559b4bd9   Created pod: marketplace-operator-6b559b4bd9-65pzk
38m         Normal    ScalingReplicaSet              deployment/marketplace-operator              Scaled up replica set marketplace-operator-6b559b4bd9 to 1
38m         Normal    CreatedSCCRanges               namespace/openshift-marketplace              created SCC ranges
14m         Normal    Scheduled                      pod/redhat-marketplace-5t4wt                 Successfully assigned openshift-marketplace/redhat-marketplace-5t4wt to ci-ln-fn2msht-f76d1-svf6b-master-0
14m         Normal    AddedInterface                 pod/redhat-marketplace-5t4wt                 Add eth0 [10.128.0.105/23] from openshift-sdn
14m         Normal    Pulling                        pod/redhat-marketplace-5t4wt                 Pulling image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8"
14m         Normal    Pulled                         pod/redhat-marketplace-5t4wt                 Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 2.081101225s
14m         Normal    Created                        pod/redhat-marketplace-5t4wt                 Created container registry-server
14m         Normal    Started                        pod/redhat-marketplace-5t4wt                 Started container registry-server
14m         Normal    Killing                        pod/redhat-marketplace-5t4wt                 Stopping container registry-server
31m         Normal    Scheduled                      pod/redhat-marketplace-9fn6k                 Successfully assigned openshift-marketplace/redhat-marketplace-9fn6k to ci-ln-fn2msht-f76d1-svf6b-master-0
31m         Warning   FailedScheduling               pod/redhat-marketplace-9fn6k                 binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "redhat-marketplace-9fn6k": pod redhat-marketplace-9fn6k is already assigned to node "ci-ln-fn2msht-f76d1-svf6b-master-0"
31m         Normal    AddedInterface                 pod/redhat-marketplace-9fn6k                 Add eth0 [10.128.0.63/23] from openshift-sdn
20m         Normal    Pulling                        pod/redhat-marketplace-9fn6k                 Pulling image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8"
30m         Normal    Pulled                         pod/redhat-marketplace-9fn6k                 Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 1m29.016713858s
20m         Normal    Created                        pod/redhat-marketplace-9fn6k                 Created container registry-server
20m         Normal    Started                        pod/redhat-marketplace-9fn6k                 Started container registry-server
4m36s       Warning   Unhealthy                      pod/redhat-marketplace-9fn6k                 Liveness probe failed: command timed out
20m         Normal    Killing                        pod/redhat-marketplace-9fn6k                 Container registry-server failed liveness probe, will be restarted
21m         Normal    Pulled                         pod/redhat-marketplace-9fn6k                 Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 2.018343487s
20m         Normal    Pulled                         pod/redhat-marketplace-9fn6k                 Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 1.819118352s
4m56s       Normal    Scheduled                      pod/redhat-marketplace-9kx89                 Successfully assigned openshift-marketplace/redhat-marketplace-9kx89 to ci-ln-fn2msht-f76d1-svf6b-master-0
4m54s       Normal    AddedInterface                 pod/redhat-marketplace-9kx89                 Add eth0 [10.128.0.109/23] from openshift-sdn
4m54s       Normal    Pulling                        pod/redhat-marketplace-9kx89                 Pulling image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8"
4m52s       Normal    Pulled                         pod/redhat-marketplace-9kx89                 Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 1.689170741s
4m52s       Normal    Created                        pod/redhat-marketplace-9kx89                 Created container registry-server
4m51s       Normal    Started                        pod/redhat-marketplace-9kx89                 Started container registry-server
4m45s       Normal    Killing                        pod/redhat-marketplace-9kx89                 Stopping container registry-server
20m         Normal    Scheduled                      pod/redhat-marketplace-lvq9q                 Successfully assigned openshift-marketplace/redhat-marketplace-lvq9q to ci-ln-fn2msht-f76d1-svf6b-master-0
20m         Normal    AddedInterface                 pod/redhat-marketplace-lvq9q                 Add eth0 [10.128.0.97/23] from openshift-sdn
20m         Normal    Pulling                        pod/redhat-marketplace-lvq9q                 Pulling image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8"
20m         Normal    Pulled                         pod/redhat-marketplace-lvq9q                 Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 2.157198382s
19m         Normal    Created                        pod/redhat-marketplace-lvq9q                 Created container registry-server
19m         Normal    Started                        pod/redhat-marketplace-lvq9q                 Started container registry-server
19m         Normal    Killing                        pod/redhat-marketplace-lvq9q                 Stopping container registry-server
19m         Warning   Unhealthy                      pod/redhat-marketplace-lvq9q                 Liveness probe errored: rpc error: code = NotFound desc = could not find container "31cc0950ae3de4f9e5a0243582328b9b3702c2faa29b9563e09a3a122b52c045": container with ID starting with 31cc0950ae3de4f9e5a0243582328b9b3702c2faa29b9563e09a3a122b52c045 not found: ID does not exist
19m         Warning   Unhealthy                      pod/redhat-marketplace-lvq9q                 Readiness probe errored: rpc error: code = NotFound desc = could not find container "31cc0950ae3de4f9e5a0243582328b9b3702c2faa29b9563e09a3a122b52c045": container with ID starting with 31cc0950ae3de4f9e5a0243582328b9b3702c2faa29b9563e09a3a122b52c045 not found: ID does not exist
18m         Warning   FailedKillPod                  pod/redhat-marketplace-lvq9q                 error killing pod: failed to "KillPodSandbox" for "2f8a6e6c-fadb-479e-b3a4-8f14fcb34e50" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_redhat-marketplace-lvq9q_openshift-marketplace_2f8a6e6c-fadb-479e-b3a4-8f14fcb34e50_0(baa433f8efca06a00f9d2658c430338cced62efeace11d8a3a24182b38c4f031): Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-marketplace/redhat-marketplace-lvq9q]: error getting pod: Get \"https://[api-int.ci-ln-fn2msht-f76d1.origin-ci-int-gce.dev.openshift.com]:6443/api/v1/namespaces/openshift-marketplace/pods/redhat-marketplace-lvq9q?timeout=1m0s\": dial tcp 10.0.0.2:6443: i/o timeout"
31m         Warning   FailedToUpdateEndpointSlices   service/redhat-marketplace                   Error updating Endpoint Slices for Service openshift-marketplace/redhat-marketplace: failed to create EndpointSlice for Service openshift-marketplace/redhat-marketplace: Internal error occurred: admission plugin "OwnerReferencesPermissionEnforcement" failed to complete validation in 13s
31m         Normal    Scheduled                      pod/redhat-operators-9fztp                   Successfully assigned openshift-marketplace/redhat-operators-9fztp to ci-ln-fn2msht-f76d1-svf6b-master-0
31m         Warning   FailedScheduling               pod/redhat-operators-9fztp                   binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "redhat-operators-9fztp": pod redhat-operators-9fztp is already assigned to node "ci-ln-fn2msht-f76d1-svf6b-master-0"
31m         Normal    AddedInterface                 pod/redhat-operators-9fztp                   Add eth0 [10.128.0.64/23] from openshift-sdn
19m         Normal    Pulling                        pod/redhat-operators-9fztp                   Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.8"
30m         Normal    Pulled                         pod/redhat-operators-9fztp                   Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1m27.497894729s
19m         Normal    Created                        pod/redhat-operators-9fztp                   Created container registry-server
21m         Normal    Started                        pod/redhat-operators-9fztp                   Started container registry-server
4m36s       Warning   Unhealthy                      pod/redhat-operators-9fztp                   Liveness probe failed: command timed out
20m         Normal    Killing                        pod/redhat-operators-9fztp                   Container registry-server failed liveness probe, will be restarted
21m         Warning   Unhealthy                      pod/redhat-operators-9fztp                   Readiness probe failed:
21m         Normal    Pulled                         pod/redhat-operators-9fztp                   Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1.019213405s
20m         Warning   Unhealthy                      pod/redhat-operators-9fztp                   Readiness probe failed: timeout: health rpc did not complete within 1s
19m         Warning   Unhealthy                      pod/redhat-operators-9fztp                   Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 91d58e16971f2f02f99cbfe7404e8ade61b989bcc43db3bb0102637a5a655daf is running failed: container process not found
19m         Normal    Pulled                         pod/redhat-operators-9fztp                   Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1.360934434s
20m         Normal    Scheduled                      pod/redhat-operators-f4sfx                   Successfully assigned openshift-marketplace/redhat-operators-f4sfx to ci-ln-fn2msht-f76d1-svf6b-master-0
20m         Normal    AddedInterface                 pod/redhat-operators-f4sfx                   Add eth0 [10.128.0.95/23] from openshift-sdn
20m         Normal    Pulling                        pod/redhat-operators-f4sfx                   Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.8"
20m         Normal    Pulled                         pod/redhat-operators-f4sfx                   Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1.272612348s
19m         Normal    Created                        pod/redhat-operators-f4sfx                   Created container registry-server
19m         Normal    Started                        pod/redhat-operators-f4sfx                   Started container registry-server
19m         Normal    Killing                        pod/redhat-operators-f4sfx                   Stopping container registry-server
19m         Warning   Unhealthy                      pod/redhat-operators-f4sfx                   Readiness probe errored: rpc error: code = NotFound desc = could not find container "533c015cb4227303c9a78a3832e3536936d0de7ba2392787107590064b9f3080": container with ID starting with 533c015cb4227303c9a78a3832e3536936d0de7ba2392787107590064b9f3080 not found: ID does not exist
19m         Warning   Unhealthy                      pod/redhat-operators-f4sfx                   Liveness probe errored: rpc error: code = NotFound desc = could not find container "533c015cb4227303c9a78a3832e3536936d0de7ba2392787107590064b9f3080": container with ID starting with 533c015cb4227303c9a78a3832e3536936d0de7ba2392787107590064b9f3080 not found: ID does not exist
18m         Warning   FailedKillPod                  pod/redhat-operators-f4sfx                   error killing pod: failed to "KillPodSandbox" for "97726ff1-513a-4c9a-aebb-eabbb85ecf2a" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_redhat-operators-f4sfx_openshift-marketplace_97726ff1-513a-4c9a-aebb-eabbb85ecf2a_0(f689d1743667a8d627e1deb0129d0e34a3d558b329d2e309215762eb74d52ce6): Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-marketplace/redhat-operators-f4sfx]: error getting pod: Get \"https://[api-int.ci-ln-fn2msht-f76d1.origin-ci-int-gce.dev.openshift.com]:6443/api/v1/namespaces/openshift-marketplace/pods/redhat-operators-f4sfx?timeout=1m0s\": dial tcp 10.0.0.2:6443: i/o timeout"
4m18s       Normal    Scheduled                      pod/redhat-operators-gnq2h                   Successfully assigned openshift-marketplace/redhat-operators-gnq2h to ci-ln-fn2msht-f76d1-svf6b-master-0
4m16s       Normal    AddedInterface                 pod/redhat-operators-gnq2h                   Add eth0 [10.128.0.112/23] from openshift-sdn
4m15s       Normal    Pulling                        pod/redhat-operators-gnq2h                   Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.8"
4m14s       Normal    Pulled                         pod/redhat-operators-gnq2h                   Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1.009916555s
4m13s       Normal    Created                        pod/redhat-operators-gnq2h                   Created container registry-server
4m13s       Normal    Started                        pod/redhat-operators-gnq2h                   Started container registry-server
4m8s        Normal    Killing                        pod/redhat-operators-gnq2h                   Stopping container registry-server
14m         Normal    Scheduled                      pod/redhat-operators-vlcnw                   Successfully assigned openshift-marketplace/redhat-operators-vlcnw to ci-ln-fn2msht-f76d1-svf6b-master-0
14m         Normal    AddedInterface                 pod/redhat-operators-vlcnw                   Add eth0 [10.128.0.103/23] from openshift-sdn
14m         Normal    Pulling                        pod/redhat-operators-vlcnw                   Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.8"
14m         Normal    Pulled                         pod/redhat-operators-vlcnw                   Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1.364520164s
14m         Normal    Created                        pod/redhat-operators-vlcnw                   Created container registry-server
14m         Normal    Started                        pod/redhat-operators-vlcnw                   Started container registry-server
14m         Warning   Unhealthy                      pod/redhat-operators-vlcnw                   Liveness probe failed: command timed out
14m         Normal    Killing                        pod/redhat-operators-vlcnw                   Stopping container registry-server
31m         Warning   FailedToUpdateEndpointSlices   service/redhat-operators                     Error updating Endpoint Slices for Service openshift-marketplace/redhat-operators: failed to create EndpointSlice for Service openshift-marketplace/redhat-operators: Internal error occurred: admission plugin "OwnerReferencesPermissionEnforcement" failed to complete validation in 13s
```

looks like it takes a while for the registry pods to eventually start running. This was likely happening all along, but started getting highlighted in the ci when prometheus alerts for default CatalogSources in the openshift-marketplace namespace were introduced as part of https://bugzilla.redhat.com/show_bug.cgi?id=1973482. 

The fix PR linked in the bz bumps up the values of the initial delay seconds and timeout seconds of the liveness probe of the registry pods, after consultation with the node team that the default value of 1s for liveness probe timeout is likely not enough when there is a resource constraint.

--- Additional comment from Anik on 2021-06-29 20:46:55 UTC ---

> the issue seems to be simple cpu contention in the probe

that sounds reasonable too. The fix remains the same though.

Comment 4 Jian Zhang 2021-07-02 03:32:28 UTC

There is no Alert for the CaatalogSource pods probe failure in the 6 hours CI jobs. Details:
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/842/pull-ci-openshift-kubernetes-release-4.8-e2e-gcp/1410772738255097856
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/promote-release-openshift-okd-machine-os-content-e2e-aws-4.8/1410771153017901056
https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/842/pull-ci-openshift-kubernetes-release-4.8-e2e-aws-disruptive/1410741786506170368
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-single-node/1410750382417645568
...
Search link: https://search.ci.openshift.org/?search=Alerts+shouldn%27t+report+any+alerts+in+firing&maxAge=6h&context=1&type=junit&name=4%5C.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Verify it.

Comment 7 errata-xmlrpc 2021-07-27 23:13:47 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438