The certified-operators, redhat-marketplace and redhat-operators catalog pods are failing their liveness probes (from https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1408366552603955200):

```
alert CertifiedOperatorsCatalogError fired for 240 seconds with labels: {container="catalog-operator", endpoint="https-metrics", exported_namespace="openshift-marketplace", instance="10.129.0.25:8081", job="catalog-operator-metrics", name="certified-operators", namespace="openshift-operator-lifecycle-manager", pod="catalog-operator-9c74f8b8d-6x84r", service="catalog-operator-metrics", severity="warning"}
alert RedhatMarketplaceCatalogError fired for 300 seconds with labels: {container="catalog-operator", endpoint="https-metrics", exported_namespace="openshift-marketplace", instance="10.129.0.25:8081", job="catalog-operator-metrics", name="redhat-marketplace", namespace="openshift-operator-lifecycle-manager", pod="catalog-operator-9c74f8b8d-6x84r", service="catalog-operator-metrics", severity="warning"}
alert RedhatOperatorsCatalogError fired for 180 seconds with labels: {container="catalog-operator", endpoint="https-metrics", exported_namespace="openshift-marketplace", instance="10.129.0.25:8081", job="catalog-operator-metrics", name="redhat-operators", namespace="openshift-operator-lifecycle-manager", pod="catalog-operator-9c74f8b8d-6x84r", service="catalog-operator-metrics", severity="warning"}
```

The Prometheus rules are defined in https://github.com/operator-framework/operator-marketplace/blob/e39ff59d5abc3e27effc7b726329d06a37644f2e/manifests/12_prometheus_rule.yaml.

Checking the journal logs (https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1408366552603955200/artifacts/e2e-gcp-rt/gather-extra/artifacts/nodes/ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv/journal) of the node where all three containers are running:

```
Jun 25 11:01:19.629237 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:01:19.629204 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:01:49.452546 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:01:49.452510 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:09.886205 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:09.885666 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:19.486876 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:19.486839 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:59.519726 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:59.519379 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:03:09.442769 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:03:09.442731 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:06:29.494754 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:06:29.494696 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:06:40.318189 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:06:40.318020 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:06:59.603150 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:06:59.603106 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
...
Jun 25 11:41:59.457077 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:41:59.457009 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
```

The liveness probe's configuration:

```
livenessProbe:
  exec:
    command:
    - grpc_health_probe
    - -addr=:50051
  failureThreshold: 3
  initialDelaySeconds: 10
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 1
```

The probe is configured to run every 10s. Correlating with the "Probe failed" logs from the ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv node, the probe does not fail consistently every 10s.
E.g.:

```
Jun 25 11:01:19.629237 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:01:19.629204 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:01:49.452546 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:01:49.452510 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:09.886205 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:09.885666 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:19.486876 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:19.486839 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
Jun 25 11:02:59.519726 ci-op-k72hr7js-f0f18-wpzls-worker-a-qtzrv hyperkube[1572]: I0625 11:02:59.519379 1572 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-marketplace/redhat-operators-b8rjq" podUID=8921eb47-a019-40a4-b97e-52afb34611de containerName="registry-server" probeResult=failure output="command timed out"
```

The gaps between these consecutive failures are roughly 30s, 20s, 10s and 40s, so it's likely the container itself does not always respond to grpc_health_probe in time. The issue looks similar to https://bugzilla.redhat.com/show_bug.cgi?id=1815957#c6, described as:

```
The issue is that on some systems, a lack of an nsswitch config means that dns will attempt to resolve from external sources before local files (so resolving `localhost` could take a long time).
```

Affected jobs can be found in https://testgrid.k8s.io/redhat-openshift-ocp-release-4.9-informing#periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt. Namely:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1408366552603955200
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1408333569146753024
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1408215657794768896

Other jobs are failing with at least the "alert CommunityOperatorsCatalogError fired for" alert. See https://search.ci.openshift.org/?search=alert+CommunityOperatorsCatalogError+fired+for&maxAge=24h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job
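For reference, a minimal sketch of how those gaps can be pulled out of the node journal (assumes node access; `-o short-unix` is standard journalctl, but the grep filter and the kubelet unit name are illustrative):

```
# On the node: list the liveness-probe failures for the pod with unix
# timestamps, then print the gap between consecutive failures in seconds.
journalctl -u kubelet -o short-unix \
  | grep 'Probe failed.*redhat-operators-b8rjq' \
  | awk '{ if (prev) printf "%.0fs\n", $1 - prev; prev = $1 }'
# -> 30s, 20s, 10s, 40s, ... for the excerpt above
```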
Possibly fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1974054#c1? See https://coreos.slack.com/archives/C3VS0LV41/p1624282085278400
The images in question write out an nsswitch config at startup to avoid the slow-resolution problem, so I doubt that is the issue (unless there has been some regression related to nsswitch).

There are two options that seem possible/likely:

1. grpc_health_probe is slow to respond inside the container.
2. CRI-O itself is timing out on ExecSync.

(1) is possible, but I'm skeptical of it just because of the long timeouts/retries used. They're really generous, and the API should be available almost immediately on container start. To investigate further, I think we should instrument the index pods with gRPC metrics and see if we can get data on what the probes are seeing.

(2) is starting to seem more likely to me, especially because via CI search I can see this issue affecting SNO clusters at a much higher rate than other cluster configurations (where CRI-O would have the highest contention). Unfortunately CRI-O metrics are not enabled in OpenShift, so I can't see any latency/load data during cluster boot.

I think both are worth further investigation, and I'd like to enlist some node folks to help investigate in case (2) proves plausible. My read of kubelet is that we can't distinguish between the two cases based on the audit logs alone: https://github.com/kubernetes/kubernetes/blob/397ff0b3cb2d674113f3cb8195e044ebdf88d27b/pkg/kubelet/cri/remote/remote_runtime.go#L396-L399
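As a first data point for (1) vs (2), the same check the kubelet runs can be executed by hand with generous timeouts (a sketch; `-connect-timeout` and `-rpc-timeout` are upstream grpc-ecosystem/grpc-health-probe flags, and the pod name is illustrative):

```
# grpc_health_probe is already in the index image (the livenessProbe execs it).
# Timing repeated runs from outside the kubelet/CRI-O probe machinery gives a
# rough split: consistently fast responses here while the kubelet probe still
# reports "command timed out" would point at the ExecSync path (2) rather than
# the server (1).
time oc -n openshift-marketplace exec redhat-operators-b8rjq -- \
  grpc_health_probe -addr=:50051 -connect-timeout=5s -rpc-timeout=5s
```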
*** Bug 1977241 has been marked as a duplicate of this bug. ***
Updating based on further investigation: the issue seems to be simple CPU contention around the probe, which is more likely in SNO clusters during boot. Increasing the timeout should resolve the CI issues. This is not an issue for running clusters, since the pods eventually restart at a time when there is no contention and can respond to the probes.
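For what it's worth, a crude way to confirm the contention window on a live node during boot (a sketch; the node name is a placeholder):

```
# Load average far above the node's CPU count during the window when the
# probes fail is consistent with simple CPU contention.
oc debug node/<master-node> -- chroot /host uptime
```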
The "CRI-O is timing out" issue mentioned by Evan is highly plausible as the cause here. In the absence of CRI-O metrics, here is an observation of a cluster that had the alert firing during cluster bootstrap (deduced from the registry pods in the openshift-marketplace namespace restarting several times before eventually running successfully):

```
$ oc get pods -n openshift-marketplace
NAME                                    READY   STATUS    RESTARTS   AGE
certified-operators-bxv82               1/1     Running   6          30m
community-operators-rjq4j               1/1     Running   6          30m
marketplace-operator-6b559b4bd9-65pzk   1/1     Running   0          34m
redhat-marketplace-9fn6k                1/1     Running   5          30m
redhat-operators-9fztp                  1/1     Running   5          30m

$ oc get events -n openshift-marketplace
LAST SEEN TYPE REASON OBJECT MESSAGE
31m Normal Scheduled pod/certified-operators-bxv82 Successfully assigned openshift-marketplace/certified-operators-bxv82 to ci-ln-fn2msht-f76d1-svf6b-master-0
31m Warning FailedScheduling pod/certified-operators-bxv82 binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "certified-operators-bxv82": pod certified-operators-bxv82 is already assigned to node "ci-ln-fn2msht-f76d1-svf6b-master-0"
31m Normal AddedInterface pod/certified-operators-bxv82 Add eth0 [10.128.0.61/23] from openshift-sdn
21m Normal Pulling pod/certified-operators-bxv82 Pulling image "registry.redhat.io/redhat/certified-operator-index:v4.8"
30m Normal Pulled pod/certified-operators-bxv82 Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 1m28.476540981s
21m Normal Created pod/certified-operators-bxv82 Created container registry-server
21m Normal Started pod/certified-operators-bxv82 Started container registry-server
4m36s Warning Unhealthy pod/certified-operators-bxv82 Liveness probe failed: command timed out
21m Normal Killing pod/certified-operators-bxv82 Container registry-server failed liveness probe, will be restarted
21m Warning Unhealthy pod/certified-operators-bxv82 Readiness probe failed:
21m Normal Pulled pod/certified-operators-bxv82 Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 1.831610737s
21m Normal Pulled pod/certified-operators-bxv82 Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 1.712293116s
4m37s Normal Scheduled pod/certified-operators-jjvq8 Successfully assigned openshift-marketplace/certified-operators-jjvq8 to ci-ln-fn2msht-f76d1-svf6b-master-0
4m35s Normal AddedInterface pod/certified-operators-jjvq8 Add eth0 [10.128.0.111/23] from openshift-sdn
4m33s Normal Pulling pod/certified-operators-jjvq8 Pulling image "registry.redhat.io/redhat/certified-operator-index:v4.8"
4m31s Normal Pulled pod/certified-operators-jjvq8 Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 1.688453392s
4m31s Normal Created pod/certified-operators-jjvq8 Created container registry-server
4m30s Normal Started pod/certified-operators-jjvq8 Started container registry-server
4m17s Normal Killing pod/certified-operators-jjvq8 Stopping container registry-server
14m Normal Scheduled pod/certified-operators-t9t5k Successfully assigned openshift-marketplace/certified-operators-t9t5k to ci-ln-fn2msht-f76d1-svf6b-master-0
14m Normal AddedInterface pod/certified-operators-t9t5k Add eth0 [10.128.0.106/23] from openshift-sdn
14m Normal Pulling pod/certified-operators-t9t5k Pulling image "registry.redhat.io/redhat/certified-operator-index:v4.8"
14m Normal Pulled pod/certified-operators-t9t5k Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 1.846749222s
14m Normal Created pod/certified-operators-t9t5k Created container registry-server
14m Normal Started pod/certified-operators-t9t5k Started container registry-server
14m Normal Killing pod/certified-operators-t9t5k Stopping container registry-server
20m Normal Scheduled pod/certified-operators-zz5w9 Successfully assigned openshift-marketplace/certified-operators-zz5w9 to ci-ln-fn2msht-f76d1-svf6b-master-0
20m Normal AddedInterface pod/certified-operators-zz5w9 Add eth0 [10.128.0.94/23] from openshift-sdn
20m Normal Pulling pod/certified-operators-zz5w9 Pulling image "registry.redhat.io/redhat/certified-operator-index:v4.8"
20m Normal Pulled pod/certified-operators-zz5w9 Successfully pulled image "registry.redhat.io/redhat/certified-operator-index:v4.8" in 2.015105481s
19m Normal Created pod/certified-operators-zz5w9 Created container registry-server
19m Normal Started pod/certified-operators-zz5w9 Started container registry-server
19m Normal Killing pod/certified-operators-zz5w9 Stopping container registry-server
19m Warning Unhealthy pod/certified-operators-zz5w9 Liveness probe failed: command timed out
19m Warning Unhealthy pod/certified-operators-zz5w9 Liveness probe errored: rpc error: code = NotFound desc = could not find container "906cab4d013567d4733c03f12bd99766299d959cbe3cb02e953ae2c638931864": container with ID starting with 906cab4d013567d4733c03f12bd99766299d959cbe3cb02e953ae2c638931864 not found: ID does not exist
19m Warning Unhealthy pod/certified-operators-zz5w9 Readiness probe errored: rpc error: code = NotFound desc = could not find container "906cab4d013567d4733c03f12bd99766299d959cbe3cb02e953ae2c638931864": container with ID starting with 906cab4d013567d4733c03f12bd99766299d959cbe3cb02e953ae2c638931864 not found: ID does not exist
18m Warning FailedKillPod pod/certified-operators-zz5w9 error killing pod: failed to "KillPodSandbox" for "1ace1669-e07e-4087-ba65-90ae1fc7520f" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_certified-operators-zz5w9_openshift-marketplace_1ace1669-e07e-4087-ba65-90ae1fc7520f_0(9a59f9d7e0d748984383c86e19a162b4f24230f1660c8b0969a7dbe973775c1d): Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-marketplace/certified-operators-zz5w9]: error getting pod: Get \"https://[api-int.ci-ln-fn2msht-f76d1.origin-ci-int-gce.dev.openshift.com]:6443/api/v1/namespaces/openshift-marketplace/pods/certified-operators-zz5w9?timeout=1m0s\": dial tcp 10.0.0.2:6443: i/o timeout"
31m Warning FailedToUpdateEndpointSlices service/certified-operators Error updating Endpoint Slices for Service openshift-marketplace/certified-operators: failed to create EndpointSlice for Service openshift-marketplace/certified-operators: Internal error occurred: admission plugin "OwnerReferencesPermissionEnforcement" failed to complete validation in 13s
4m46s Normal Scheduled pod/community-operators-dw74z Successfully assigned openshift-marketplace/community-operators-dw74z to ci-ln-fn2msht-f76d1-svf6b-master-0
4m44s Normal AddedInterface pod/community-operators-dw74z Add eth0 [10.128.0.110/23] from openshift-sdn
4m44s Normal Pulling pod/community-operators-dw74z Pulling image "registry.redhat.io/redhat/community-operator-index:v4.8"
4m42s Normal Pulled pod/community-operators-dw74z Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1.767186196s
4m41s Normal Created pod/community-operators-dw74z Created container registry-server
4m41s Normal Started pod/community-operators-dw74z Started container registry-server
4m26s Normal Killing pod/community-operators-dw74z Stopping container registry-server
20m Normal Scheduled pod/community-operators-fmrqw Successfully assigned openshift-marketplace/community-operators-fmrqw to ci-ln-fn2msht-f76d1-svf6b-master-0
20m Normal AddedInterface pod/community-operators-fmrqw Add eth0 [10.128.0.96/23] from openshift-sdn
20m Normal Pulling pod/community-operators-fmrqw Pulling image "registry.redhat.io/redhat/community-operator-index:v4.8"
20m Normal Pulled pod/community-operators-fmrqw Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 2.774797223s
20m Normal Created pod/community-operators-fmrqw Created container registry-server
19m Normal Started pod/community-operators-fmrqw Started container registry-server
19m Warning Unhealthy pod/community-operators-fmrqw Liveness probe failed: command timed out
19m Normal Killing pod/community-operators-fmrqw Stopping container registry-server
19m Warning Unhealthy pod/community-operators-fmrqw Liveness probe errored: rpc error: code = NotFound desc = could not find container "c5a04306f9c1787a9434820dc1ff18d0a1351971df1e04a565a084439be7ab14": container with ID starting with c5a04306f9c1787a9434820dc1ff18d0a1351971df1e04a565a084439be7ab14 not found: ID does not exist
19m Warning Unhealthy pod/community-operators-fmrqw Readiness probe errored: rpc error: code = NotFound desc = could not find container "c5a04306f9c1787a9434820dc1ff18d0a1351971df1e04a565a084439be7ab14": container with ID starting with c5a04306f9c1787a9434820dc1ff18d0a1351971df1e04a565a084439be7ab14 not found: ID does not exist
18m Warning FailedKillPod pod/community-operators-fmrqw error killing pod: failed to "KillPodSandbox" for "2f0d6d3f-c3c7-49e9-bc35-d0a2eda0ffdb" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_community-operators-fmrqw_openshift-marketplace_2f0d6d3f-c3c7-49e9-bc35-d0a2eda0ffdb_0(80052adee9bfda08edb8cf6a4912bfa30b2a6bdddb260a99ed0211ee138feb69): Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-marketplace/community-operators-fmrqw]: error getting pod: Get \"https://[api-int.ci-ln-fn2msht-f76d1.origin-ci-int-gce.dev.openshift.com]:6443/api/v1/namespaces/openshift-marketplace/pods/community-operators-fmrqw?timeout=1m0s\": dial tcp 10.0.0.2:6443: i/o timeout"
14m Normal Scheduled pod/community-operators-gfwlc Successfully assigned openshift-marketplace/community-operators-gfwlc to ci-ln-fn2msht-f76d1-svf6b-master-0
14m Normal AddedInterface pod/community-operators-gfwlc Add eth0 [10.128.0.104/23] from openshift-sdn
14m Normal Pulling pod/community-operators-gfwlc Pulling image "registry.redhat.io/redhat/community-operator-index:v4.8"
14m Normal Pulled pod/community-operators-gfwlc Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1.666605741s
14m Normal Created pod/community-operators-gfwlc Created container registry-server
14m Normal Started pod/community-operators-gfwlc Started container registry-server
14m Normal Killing pod/community-operators-gfwlc Stopping container registry-server
4m15s Normal Scheduled pod/community-operators-kzxq2 Successfully assigned openshift-marketplace/community-operators-kzxq2 to ci-ln-fn2msht-f76d1-svf6b-master-0
4m13s Normal AddedInterface pod/community-operators-kzxq2 Add eth0 [10.128.0.113/23] from openshift-sdn
4m12s Normal Pulling pod/community-operators-kzxq2 Pulling image "registry.redhat.io/redhat/community-operator-index:v4.8"
4m10s Normal Pulled pod/community-operators-kzxq2 Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1.680937805s
4m9s Normal Created pod/community-operators-kzxq2 Created container registry-server
4m9s Normal Started pod/community-operators-kzxq2 Started container registry-server
3m55s Normal Killing pod/community-operators-kzxq2 Stopping container registry-server
31m Normal Scheduled pod/community-operators-rjq4j Successfully assigned openshift-marketplace/community-operators-rjq4j to ci-ln-fn2msht-f76d1-svf6b-master-0
31m Warning FailedScheduling pod/community-operators-rjq4j binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "community-operators-rjq4j": pod community-operators-rjq4j is already assigned to node "ci-ln-fn2msht-f76d1-svf6b-master-0"
31m Normal AddedInterface pod/community-operators-rjq4j Add eth0 [10.128.0.62/23] from openshift-sdn
20m Normal Pulling pod/community-operators-rjq4j Pulling image "registry.redhat.io/redhat/community-operator-index:v4.8"
30m Normal Pulled pod/community-operators-rjq4j Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1m28.879243424s
20m Normal Created pod/community-operators-rjq4j Created container registry-server
20m Normal Started pod/community-operators-rjq4j Started container registry-server
4m36s Warning Unhealthy pod/community-operators-rjq4j Liveness probe failed: command timed out
20m Normal Killing pod/community-operators-rjq4j Container registry-server failed liveness probe, will be restarted
21m Warning Unhealthy pod/community-operators-rjq4j Readiness probe failed:
21m Normal Pulled pod/community-operators-rjq4j Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1.688388556s
20m Normal Pulled pod/community-operators-rjq4j Successfully pulled image "registry.redhat.io/redhat/community-operator-index:v4.8" in 1.836622456s
31m Warning FailedToUpdateEndpointSlices service/community-operators Error updating Endpoint Slices for Service openshift-marketplace/community-operators: failed to create EndpointSlice for Service openshift-marketplace/community-operators: Internal error occurred: admission plugin "OwnerReferencesPermissionEnforcement" failed to complete validation in 13s
35m Warning FailedScheduling pod/marketplace-operator-6b559b4bd9-65pzk 0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
35m Warning FailedScheduling pod/marketplace-operator-6b559b4bd9-65pzk 0/1 nodes are available: 1 node(s) had taint {node.kubernetes.io/not-ready: }, that the pod didn't tolerate.
34m Normal Scheduled pod/marketplace-operator-6b559b4bd9-65pzk Successfully assigned openshift-marketplace/marketplace-operator-6b559b4bd9-65pzk to ci-ln-fn2msht-f76d1-svf6b-master-0
34m Warning FailedMount pod/marketplace-operator-6b559b4bd9-65pzk MountVolume.SetUp failed for volume "marketplace-operator-metrics" : secret "marketplace-operator-metrics" not found
33m Normal AddedInterface pod/marketplace-operator-6b559b4bd9-65pzk Add eth0 [10.128.0.32/23] from openshift-sdn
33m Normal Pulling pod/marketplace-operator-6b559b4bd9-65pzk Pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f5c31a28af169f435d45f2f9a517333f5b29e1b488acb4cdb1c6f53b68a25fb"
33m Normal Pulled pod/marketplace-operator-6b559b4bd9-65pzk Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0f5c31a28af169f435d45f2f9a517333f5b29e1b488acb4cdb1c6f53b68a25fb" in 32.330314486s
32m Normal Created pod/marketplace-operator-6b559b4bd9-65pzk Created container marketplace-operator
32m Normal Started pod/marketplace-operator-6b559b4bd9-65pzk Started container marketplace-operator
32m Warning ProbeError pod/marketplace-operator-6b559b4bd9-65pzk Liveness probe error: Get "http://10.128.0.32:8080/healthz": dial tcp 10.128.0.32:8080: connect: connection refused body:
32m Warning Unhealthy pod/marketplace-operator-6b559b4bd9-65pzk Liveness probe failed: Get "http://10.128.0.32:8080/healthz": dial tcp 10.128.0.32:8080: connect: connection refused
32m Warning ProbeError pod/marketplace-operator-6b559b4bd9-65pzk Readiness probe error: Get "http://10.128.0.32:8080/healthz": dial tcp 10.128.0.32:8080: connect: connection refused body:
32m Warning Unhealthy pod/marketplace-operator-6b559b4bd9-65pzk Readiness probe failed: Get "http://10.128.0.32:8080/healthz": dial tcp 10.128.0.32:8080: connect: connection refused
37m Warning FailedCreate replicaset/marketplace-operator-6b559b4bd9 Error creating: pods "marketplace-operator-6b559b4bd9-" is forbidden: autoscaling.openshift.io/ManagementCPUsOverride the cluster does not have any nodes
35m Normal SuccessfulCreate replicaset/marketplace-operator-6b559b4bd9 Created pod: marketplace-operator-6b559b4bd9-65pzk
38m Normal ScalingReplicaSet deployment/marketplace-operator Scaled up replica set marketplace-operator-6b559b4bd9 to 1
38m Normal CreatedSCCRanges namespace/openshift-marketplace created SCC ranges
14m Normal Scheduled pod/redhat-marketplace-5t4wt Successfully assigned openshift-marketplace/redhat-marketplace-5t4wt to ci-ln-fn2msht-f76d1-svf6b-master-0
14m Normal AddedInterface pod/redhat-marketplace-5t4wt Add eth0 [10.128.0.105/23] from openshift-sdn
14m Normal Pulling pod/redhat-marketplace-5t4wt Pulling image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8"
14m Normal Pulled pod/redhat-marketplace-5t4wt Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 2.081101225s
14m Normal Created pod/redhat-marketplace-5t4wt Created container registry-server
14m Normal Started pod/redhat-marketplace-5t4wt Started container registry-server
14m Normal Killing pod/redhat-marketplace-5t4wt Stopping container registry-server
31m Normal Scheduled pod/redhat-marketplace-9fn6k Successfully assigned openshift-marketplace/redhat-marketplace-9fn6k to ci-ln-fn2msht-f76d1-svf6b-master-0
31m Warning FailedScheduling pod/redhat-marketplace-9fn6k binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "redhat-marketplace-9fn6k": pod redhat-marketplace-9fn6k is already assigned to node "ci-ln-fn2msht-f76d1-svf6b-master-0"
31m Normal AddedInterface pod/redhat-marketplace-9fn6k Add eth0 [10.128.0.63/23] from openshift-sdn
20m Normal Pulling pod/redhat-marketplace-9fn6k Pulling image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8"
30m Normal Pulled pod/redhat-marketplace-9fn6k Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 1m29.016713858s
20m Normal Created pod/redhat-marketplace-9fn6k Created container registry-server
20m Normal Started pod/redhat-marketplace-9fn6k Started container registry-server
4m36s Warning Unhealthy pod/redhat-marketplace-9fn6k Liveness probe failed: command timed out
20m Normal Killing pod/redhat-marketplace-9fn6k Container registry-server failed liveness probe, will be restarted
21m Normal Pulled pod/redhat-marketplace-9fn6k Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 2.018343487s
20m Normal Pulled pod/redhat-marketplace-9fn6k Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 1.819118352s
4m56s Normal Scheduled pod/redhat-marketplace-9kx89 Successfully assigned openshift-marketplace/redhat-marketplace-9kx89 to ci-ln-fn2msht-f76d1-svf6b-master-0
4m54s Normal AddedInterface pod/redhat-marketplace-9kx89 Add eth0 [10.128.0.109/23] from openshift-sdn
4m54s Normal Pulling pod/redhat-marketplace-9kx89 Pulling image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8"
4m52s Normal Pulled pod/redhat-marketplace-9kx89 Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 1.689170741s
4m52s Normal Created pod/redhat-marketplace-9kx89 Created container registry-server
4m51s Normal Started pod/redhat-marketplace-9kx89 Started container registry-server
4m45s Normal Killing pod/redhat-marketplace-9kx89 Stopping container registry-server
20m Normal Scheduled pod/redhat-marketplace-lvq9q Successfully assigned openshift-marketplace/redhat-marketplace-lvq9q to ci-ln-fn2msht-f76d1-svf6b-master-0
20m Normal AddedInterface pod/redhat-marketplace-lvq9q Add eth0 [10.128.0.97/23] from openshift-sdn
20m Normal Pulling pod/redhat-marketplace-lvq9q Pulling image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8"
20m Normal Pulled pod/redhat-marketplace-lvq9q Successfully pulled image "registry.redhat.io/redhat/redhat-marketplace-index:v4.8" in 2.157198382s
19m Normal Created pod/redhat-marketplace-lvq9q Created container registry-server
19m Normal Started pod/redhat-marketplace-lvq9q Started container registry-server
19m Normal Killing pod/redhat-marketplace-lvq9q Stopping container registry-server
19m Warning Unhealthy pod/redhat-marketplace-lvq9q Liveness probe errored: rpc error: code = NotFound desc = could not find container "31cc0950ae3de4f9e5a0243582328b9b3702c2faa29b9563e09a3a122b52c045": container with ID starting with 31cc0950ae3de4f9e5a0243582328b9b3702c2faa29b9563e09a3a122b52c045 not found: ID does not exist
19m Warning Unhealthy pod/redhat-marketplace-lvq9q Readiness probe errored: rpc error: code = NotFound desc = could not find container "31cc0950ae3de4f9e5a0243582328b9b3702c2faa29b9563e09a3a122b52c045": container with ID starting with 31cc0950ae3de4f9e5a0243582328b9b3702c2faa29b9563e09a3a122b52c045 not found: ID does not exist
18m Warning FailedKillPod pod/redhat-marketplace-lvq9q error killing pod: failed to "KillPodSandbox" for "2f8a6e6c-fadb-479e-b3a4-8f14fcb34e50" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_redhat-marketplace-lvq9q_openshift-marketplace_2f8a6e6c-fadb-479e-b3a4-8f14fcb34e50_0(baa433f8efca06a00f9d2658c430338cced62efeace11d8a3a24182b38c4f031): Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-marketplace/redhat-marketplace-lvq9q]: error getting pod: Get \"https://[api-int.ci-ln-fn2msht-f76d1.origin-ci-int-gce.dev.openshift.com]:6443/api/v1/namespaces/openshift-marketplace/pods/redhat-marketplace-lvq9q?timeout=1m0s\": dial tcp 10.0.0.2:6443: i/o timeout"
31m Warning FailedToUpdateEndpointSlices service/redhat-marketplace Error updating Endpoint Slices for Service openshift-marketplace/redhat-marketplace: failed to create EndpointSlice for Service openshift-marketplace/redhat-marketplace: Internal error occurred: admission plugin "OwnerReferencesPermissionEnforcement" failed to complete validation in 13s
31m Normal Scheduled pod/redhat-operators-9fztp Successfully assigned openshift-marketplace/redhat-operators-9fztp to ci-ln-fn2msht-f76d1-svf6b-master-0
31m Warning FailedScheduling pod/redhat-operators-9fztp binding rejected: running Bind plugin "DefaultBinder": Operation cannot be fulfilled on pods/binding "redhat-operators-9fztp": pod redhat-operators-9fztp is already assigned to node "ci-ln-fn2msht-f76d1-svf6b-master-0"
31m Normal AddedInterface pod/redhat-operators-9fztp Add eth0 [10.128.0.64/23] from openshift-sdn
19m Normal Pulling pod/redhat-operators-9fztp Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.8"
30m Normal Pulled pod/redhat-operators-9fztp Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1m27.497894729s
19m Normal Created pod/redhat-operators-9fztp Created container registry-server
21m Normal Started pod/redhat-operators-9fztp Started container registry-server
4m36s Warning Unhealthy pod/redhat-operators-9fztp Liveness probe failed: command timed out
20m Normal Killing pod/redhat-operators-9fztp Container registry-server failed liveness probe, will be restarted
21m Warning Unhealthy pod/redhat-operators-9fztp Readiness probe failed:
21m Normal Pulled pod/redhat-operators-9fztp Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1.019213405s
20m Warning Unhealthy pod/redhat-operators-9fztp Readiness probe failed: timeout: health rpc did not complete within 1s
19m Warning Unhealthy pod/redhat-operators-9fztp Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of 91d58e16971f2f02f99cbfe7404e8ade61b989bcc43db3bb0102637a5a655daf is running failed: container process not found
19m Normal Pulled pod/redhat-operators-9fztp Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1.360934434s
20m Normal Scheduled pod/redhat-operators-f4sfx Successfully assigned openshift-marketplace/redhat-operators-f4sfx to ci-ln-fn2msht-f76d1-svf6b-master-0
20m Normal AddedInterface pod/redhat-operators-f4sfx Add eth0 [10.128.0.95/23] from openshift-sdn
20m Normal Pulling pod/redhat-operators-f4sfx Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.8"
20m Normal Pulled pod/redhat-operators-f4sfx Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1.272612348s
19m Normal Created pod/redhat-operators-f4sfx Created container registry-server
19m Normal Started pod/redhat-operators-f4sfx Started container registry-server
19m Normal Killing pod/redhat-operators-f4sfx Stopping container registry-server
19m Warning Unhealthy pod/redhat-operators-f4sfx Readiness probe errored: rpc error: code = NotFound desc = could not find container "533c015cb4227303c9a78a3832e3536936d0de7ba2392787107590064b9f3080": container with ID starting with 533c015cb4227303c9a78a3832e3536936d0de7ba2392787107590064b9f3080 not found: ID does not exist
19m Warning Unhealthy pod/redhat-operators-f4sfx Liveness probe errored: rpc error: code = NotFound desc = could not find container "533c015cb4227303c9a78a3832e3536936d0de7ba2392787107590064b9f3080": container with ID starting with 533c015cb4227303c9a78a3832e3536936d0de7ba2392787107590064b9f3080 not found: ID does not exist
18m Warning FailedKillPod pod/redhat-operators-f4sfx error killing pod: failed to "KillPodSandbox" for "97726ff1-513a-4c9a-aebb-eabbb85ecf2a" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to destroy network for pod sandbox k8s_redhat-operators-f4sfx_openshift-marketplace_97726ff1-513a-4c9a-aebb-eabbb85ecf2a_0(f689d1743667a8d627e1deb0129d0e34a3d558b329d2e309215762eb74d52ce6): Error while removing pod from CNI network \"multus-cni-network\": Multus: [openshift-marketplace/redhat-operators-f4sfx]: error getting pod: Get \"https://[api-int.ci-ln-fn2msht-f76d1.origin-ci-int-gce.dev.openshift.com]:6443/api/v1/namespaces/openshift-marketplace/pods/redhat-operators-f4sfx?timeout=1m0s\": dial tcp 10.0.0.2:6443: i/o timeout"
4m18s Normal Scheduled pod/redhat-operators-gnq2h Successfully assigned openshift-marketplace/redhat-operators-gnq2h to ci-ln-fn2msht-f76d1-svf6b-master-0
4m16s Normal AddedInterface pod/redhat-operators-gnq2h Add eth0 [10.128.0.112/23] from openshift-sdn
4m15s Normal Pulling pod/redhat-operators-gnq2h Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.8"
4m14s Normal Pulled pod/redhat-operators-gnq2h Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1.009916555s
4m13s Normal Created pod/redhat-operators-gnq2h Created container registry-server
4m13s Normal Started pod/redhat-operators-gnq2h Started container registry-server
4m8s Normal Killing pod/redhat-operators-gnq2h Stopping container registry-server
14m Normal Scheduled pod/redhat-operators-vlcnw Successfully assigned openshift-marketplace/redhat-operators-vlcnw to ci-ln-fn2msht-f76d1-svf6b-master-0
14m Normal AddedInterface pod/redhat-operators-vlcnw Add eth0 [10.128.0.103/23] from openshift-sdn
14m Normal Pulling pod/redhat-operators-vlcnw Pulling image "registry.redhat.io/redhat/redhat-operator-index:v4.8"
14m Normal Pulled pod/redhat-operators-vlcnw Successfully pulled image "registry.redhat.io/redhat/redhat-operator-index:v4.8" in 1.364520164s
14m Normal Created pod/redhat-operators-vlcnw Created container registry-server
14m Normal Started pod/redhat-operators-vlcnw Started container registry-server
14m Warning Unhealthy pod/redhat-operators-vlcnw Liveness probe failed: command timed out
14m Normal Killing pod/redhat-operators-vlcnw Stopping container registry-server
31m Warning FailedToUpdateEndpointSlices service/redhat-operators Error updating Endpoint Slices for Service openshift-marketplace/redhat-operators: failed to create EndpointSlice for Service openshift-marketplace/redhat-operators: Internal error occurred: admission plugin "OwnerReferencesPermissionEnforcement" failed to complete validation in 13s
```

It looks like it takes a while for the registry pods to eventually start running.
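To cut through the noise in output like the above, the probe-related events can be filtered directly (a sketch using standard field selectors and custom columns; nothing here is specific to this cluster):

```
# Only the Unhealthy events, ordered by time
oc get events -n openshift-marketplace \
  --field-selector reason=Unhealthy --sort-by=.lastTimestamp

# Restart counts of the registry pods, as a proxy for how long the
# contention window lasted
oc get pods -n openshift-marketplace \
  -o custom-columns=NAME:.metadata.name,RESTARTS:.status.containerStatuses[0].restartCount
```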
This was likely happening all along, but started getting highlighted in CI when Prometheus alerts for the default CatalogSources in the openshift-marketplace namespace were introduced as part of https://bugzilla.redhat.com/show_bug.cgi?id=1973482. The fix PR linked in the bz bumps up the initialDelaySeconds and timeoutSeconds values of the registry pods' liveness probe, after consultation with the node team confirmed that the default liveness probe timeout of 1s is likely not enough when the node is resource constrained.
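For illustration only, the shape of that change against the probe config quoted earlier (the exact numbers are whatever the linked PR merged; 30 and 5 below are placeholders, not the merged values):

```
livenessProbe:
  exec:
    command:
    - grpc_health_probe
    - -addr=:50051
  failureThreshold: 3
  initialDelaySeconds: 30   # bumped from 10 (placeholder value)
  periodSeconds: 10
  successThreshold: 1
  timeoutSeconds: 5         # bumped from 1 (placeholder value)
```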
> the issue seems to be simple cpu contention in the probe

That sounds reasonable too. The fix remains the same though.
It does not look like the increase of the timeout helped; I can still see errors under:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1410082424506814464
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1410072665523228672
But it is better than under 4.8, so we can probably increase the timeout a little more and cherry-pick it to 4.8.
Artyom, it looks like the PR merged around 12:00 UTC, so around the same time you saw those failures. I'm guessing it would have taken a while for the PR to show up in CI builds, so we probably need to verify that the PR was included in the builds for the failures you saw. We'll also get more information from amortized results in ci-search/CI test grids than from individual PRs.
Sorry, I missed that. Thanks for the information; I will wait for the next run.
I'm not sure if this has the PR in it. Do you have a link?

```
oc version
Server Version: 4.9.0-0.nightly-2021-06-30-123441
```

I'm running a SNO, installing ISV_Operators.

```
oc get events -n openshift-marketplace | grep probe
```

I see lots of:

```
m45s Warning Unhealthy pod/community-operators-p6gmv Liveness probe failed: command timed out
6m45s Normal Killing pod/community-operators-p6gmv Container registry-server failed liveness probe, will be restarted
6m44s Warning Unhealthy pod/community-operators-p6gmv Readiness probe failed:
6m13s Warning Unhealthy pod/community-operators-p6gmv Readiness probe failed: timeout: failed to connect service ":50051" within 1s
6m Warning Unhealthy pod/community-operators-p6gmv Readiness probe failed: command timed out
34m Warning ProbeError pod/marketplace-operator-6d76ddc58-25r7p Readiness probe error: Get "http://10.128.0.38:8080/healthz": dial tcp 10.128.0.38:8080: connect: connection refused
34m Warning Unhealthy pod/marketplace-operator-6d76ddc58-25r7p Readiness probe failed: Get "http://10.128.0.38:8080/healthz": dial tcp 10.128.0.38:8080: connect: connection refused
34m Warning ProbeError pod/marketplace-operator-6d76ddc58-25r7p Liveness probe error: Get "http://10.128.0.38:8080/healthz": dial tcp 10.128.0.38:8080: connect: connection refused
34m Warning Unhealthy pod/marketplace-operator-6d76ddc58-25r7p Liveness probe failed: Get "http://10.128.0.38:8080/healthz": dial tcp 10.128.0.38:8080: connect: connection refused
3m35s Warning Unhealthy pod/redhat-marketplace-qqwlr Liveness probe failed: command timed out
7m3s Warning Unhealthy pod/redhat-marketplace-qqwlr Readiness probe failed: timeout: failed to connect service ":50051" within 1s
6m5s Normal Killing pod/redhat-marketplace-qqwlr Container registry-server failed liveness probe, will be restarted
6m3s Warning Unhealthy pod/redhat-marketplace-qqwlr Readiness probe failed: time="2021-06-30T17:36:04Z" level=error msg="exec failed: container_linux.go:367: starting container process caused: process_linux.go:152: writing config to pipe caused: write init-p: broken pipe"
5m56s Warning Unhealthy pod/redhat-marketplace-qqwlr Readiness probe errored: rpc error: code = NotFound desc = container is not created or running: checking if PID of b1b0349b545a9bd25448740bb045e8bed48558be2b7a3376b1cbd6bb769cf3b9 is running failed: container process not found
```

```
oc get pods -n openshift-marketplace
NAME                                   READY   STATUS    RESTARTS   AGE
certified-operators-694ng              1/1     Running   2          33m
community-operators-p6gmv              1/1     Running   2          33m
marketplace-operator-6d76ddc58-25r7p   1/1     Running   0          36m
redhat-marketplace-qqwlr               1/1     Running   2          33m
redhat-operators-hsv4j                 1/1     Running   2          33m
```

Once my run finished and the cluster was quiescent, I stopped seeing probe failures running `oc get events -n openshift-marketplace -w`.
Hi Tom,

> I'm not sure if this has the PR in it. Do you have a link?

The fix PR is https://github.com/openshift/operator-framework-olm/pull/106, and it was merged in commit 121f35e9; details: https://github.com/openshift/operator-framework-olm/commits/master

So you need to check the OLM commit info. There are two ways:

1. Check the cluster payload, as follows:
```
[cloud-user@preserve-olm-env jian]$ oc adm release info registry.ci.openshift.org/ocp/release:4.9.0-0.nightly-2021-06-30-123441 --commits | grep lifecycle
  operator-lifecycle-manager   https://github.com/openshift/operator-framework-olm   2a45f52303b05d6d02766c4ee76758881fc14522
```

2. Check the OLM version:
```
[cloud-user@preserve-olm-env jian]$ oc -n openshift-operator-lifecycle-manager exec deployment/catalog-operator -- olm --version
OLM version: 0.17.0
git commit: 2a45f52303b05d6d02766c4ee76758881fc14522
```

Obviously, this commit (2a45f52303b05d6d02766c4ee76758881fc14522) predates the PR's merge commit (121f35e9), so the payload "4.9.0-0.nightly-2021-06-30-123441" doesn't have that fix.
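With a local clone, the same containment check can be done in one git command (a sketch; 121f35e9 is the merge commit mentioned above):

```
git clone https://github.com/openshift/operator-framework-olm && cd operator-framework-olm
# exits 0 only if the payload commit already contains the fix commit
git merge-base --is-ancestor 121f35e9 2a45f52303b05d6d02766c4ee76758881fc14522 \
  && echo "fix included" \
  || echo "fix not included"   # expected here, per the version check above
```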
I couldn't find the alert for the CatalogSource pods' probe failure in the CI jobs. Details:

For example:
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1410448969061896192
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-vsphere-serial/1410486425169694720
- https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-azure/1410486445654675456
- etc.

Search link: https://search.ci.openshift.org/?search=Alerts+shouldn%27t+report+any+alerts+in+firing&maxAge=6h&context=1&type=junit&name=4%5C.9&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Looks good, verifying it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759