Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1976940

Summary: GCP RT CI failing on firing KubeContainerWaiting due to liveness and readiness probes timing out
Product: OpenShift Container Platform
Reporter: Jan Chaloupka <jchaloup>
Component: Monitoring
Assignee: Arunprasad Rajkumar <arajkuma>
Status: CLOSED DUPLICATE
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: low
Version: 4.9
CC: alegrand, anpicker, aos-bugs, erooth, kakkoyun, pgough, pkrupa, spasquie
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment: [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing
Last Closed: 2021-07-20 16:51:21 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jan Chaloupka 2021-06-28 15:34:13 UTC
https://search.ci.openshift.org/?search=alert+KubeContainerWaiting+pending.*pod%3D%22thanos-querier-&maxAge=336h&context=1&type=junit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1409082443587129344/artifacts/e2e-gcp-rt/gather-extra/artifacts/nodes/ci-op-wi8zyd24-f0f18-n4njb-worker-b-grcfp/journal:
```
Jun 27 10:22:36.201420 ci-op-wi8zyd24-f0f18-n4njb-worker-b-grcfp hyperkube[1563]: I0627 10:22:36.201379    1563 prober.go:116] "Probe failed" probeType="Readiness" pod="openshift-monitoring/thanos-querier-654df9fd8c-f27c6" podUID=02594f9b-c589-4fc1-859a-948d9ee4f160 containerName="thanos-query" probeResult=failure output="command timed out"
Jun 27 10:23:05.879168 ci-op-wi8zyd24-f0f18-n4njb-worker-b-grcfp hyperkube[1563]: I0627 10:23:05.879108    1563 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-monitoring/thanos-querier-654df9fd8c-f27c6" podUID=02594f9b-c589-4fc1-859a-948d9ee4f160 containerName="thanos-query" probeResult=failure output="command timed out"
...
Jun 27 11:04:35.846393 ci-op-wi8zyd24-f0f18-n4njb-worker-b-grcfp hyperkube[1563]: I0627 11:04:35.846345    1563 prober.go:116] "Probe failed" probeType="Liveness" pod="openshift-monitoring/thanos-querier-654df9fd8c-f27c6" podUID=02594f9b-c589-4fc1-859a-948d9ee4f160 containerName="thanos-query" probeResult=failure output="command timed out"
Jun 27 11:04:35.891788 ci-op-wi8zyd24-f0f18-n4njb-worker-b-grcfp hyperkube[1563]: I0627 11:04:35.891706    1563 prober.go:116] "Probe failed" probeType="Readiness" pod="openshift-monitoring/thanos-querier-654df9fd8c-f27c6" podUID=02594f9b-c589-4fc1-859a-948d9ee4f160 containerName="thanos-query" probeResult=failure output="command timed out"
```

The probe configuration (from https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.9-e2e-gcp-rt/1409082443587129344/artifacts/e2e-gcp-rt/gather-extra/artifacts/pods.json):
```
                        "livenessProbe": {
                            "exec": {
                                "command": [
                                    "sh",
                                    "-c",
                                    "if [ -x \"$(command -v curl)\" ]; then exec curl http://localhost:9090/-/healthy; elif [ -x \"$(command -v wget)\" ]; then exec wget --quiet --tries=1 --spider http://localhost:9090/-/healthy; else exit 1; fi"
                                ]
                            },
                            "failureThreshold": 3,
                            "periodSeconds": 10,
                            "successThreshold": 1,
                            "timeoutSeconds": 1
                        },
                        "readinessProbe": {
                            "exec": {
                                "command": [
                                    "sh",
                                    "-c",
                                    "if [ -x \"$(command -v curl)\" ]; then exec curl http://localhost:9090/-/ready; elif [ -x \"$(command -v wget)\" ]; then exec wget --quiet --tries=1 --spider http://localhost:9090/-/ready; else exit 1; fi"
                                ]
                            },
                            "failureThreshold": 3,
                            "periodSeconds": 10,
                            "successThreshold": 1,
                            "timeoutSeconds": 1
                        },
```

This manifests mostly in the RT jobs only. Possibly slow DNS? Or is the kubelet's exec of the probe command inside the container just too slow?
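
For reference, an exec probe has the container runtime spawn `sh` (which then execs `curl`) inside the container on every check, and with `timeoutSeconds: 1` there is essentially no headroom for that overhead on a loaded RT node. A minimal sketch of loosening the timeout (illustrative values only, not what was actually shipped for this bug):
```
"livenessProbe": {
    "exec": {
        "command": ["sh", "-c", "exec curl http://localhost:9090/-/healthy"]
    },
    "failureThreshold": 3,
    "periodSeconds": 10,
    "successThreshold": 1,
    "timeoutSeconds": 5
},
```
Raising the timeout only buys headroom for the fork/exec cost; it does not explain why the command takes more than a second in the first place.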

Comment 1 Jan Chaloupka 2021-06-28 15:45:17 UTC
For completeness of the report:

```
flake: Unexpected alert behavior during test:

alert CommunityOperatorsCatalogError pending for 125.6010000705719 seconds with labels: {container="catalog-operator", endpoint="https-metrics", exported_namespace="openshift-marketplace", instance="10.129.0.13:8081", job="catalog-operator-metrics", name="community-operators", namespace="openshift-operator-lifecycle-manager", pod="catalog-operator-6cb6465654-vhpd8", service="catalog-operator-metrics", severity="warning"}
alert KubeAPIErrorBudgetBurn pending for 2103.601000070572 seconds with labels: {long="3d", severity="warning", short="6h"}
alert KubeContainerWaiting pending for 252.6010000705719 seconds with labels: {container="registry-server", namespace="openshift-marketplace", pod="redhat-marketplace-d5f7s", severity="warning"}
alert KubeContainerWaiting pending for 72.6010000705719 seconds with labels: {container="registry-server", namespace="openshift-marketplace", pod="community-operators-2zmw5", severity="warning"}
alert KubeContainerWaiting pending for 72.6010000705719 seconds with labels: {container="thanos-query", namespace="openshift-monitoring", pod="thanos-querier-654df9fd8c-f27c6", severity="warning"}
alert KubeDeploymentReplicasMismatch pending for 306.6010000705719 seconds with labels: {container="kube-rbac-proxy-main", deployment="thanos-querier", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning"}
alert PodDisruptionBudgetAtLimit pending for 12.6010000705719 seconds with labels: {namespace="openshift-monitoring", poddisruptionbudget="thanos-querier-pdb", severity="warning"}
alert RedhatMarketplaceCatalogError pending for 251.6010000705719 seconds with labels: {container="catalog-operator", endpoint="https-metrics", exported_namespace="openshift-marketplace", instance="10.129.0.13:8081", job="catalog-operator-metrics", name="redhat-marketplace", namespace="openshift-operator-lifecycle-manager", pod="catalog-operator-6cb6465654-vhpd8", service="catalog-operator-metrics", severity="warning"}
alert TargetDown pending for 353.6010000705719 seconds with labels: {job="thanos-querier", namespace="openshift-monitoring", service="thanos-querier", severity="warning"}
```

Comment 3 Arunprasad Rajkumar 2021-07-20 12:53:06 UTC
This seems to be a duplicate of Bug 1980888.

Comment 4 Philip Gough 2021-07-20 16:51:21 UTC
This indeed looks like a duplicate, and the probe info in the ticket is now outdated since we now use HTTP probes. There appears to have been an issue with the underlying exec. If you check https://search.ci.openshift.org/chart?search=alert+KubeContainerWaiting+fired&maxAge=24h&type=junit you will see that the failure rate in CI is now < 1%, with no Thanos-related failures that I have observed, so I am closing this as a duplicate as the issue is resolved.
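
For context, an `httpGet` probe removes the shell/curl exec path entirely because the kubelet issues the HTTP request itself. A rough sketch of such a probe (illustrative values only; the exact configuration the monitoring operator moved to is not shown in this ticket):
```
"livenessProbe": {
    "httpGet": {
        "path": "/-/healthy",
        "port": 9090,
        "scheme": "HTTP"
    },
    "failureThreshold": 3,
    "periodSeconds": 10,
    "successThreshold": 1,
    "timeoutSeconds": 1
},
```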

*** This bug has been marked as a duplicate of bug 1980888 ***