Bug 1976940
| Summary: | GCP RT CI failing on firing KubeContainerWaiting due to liveness and readiness probes timing out | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jan Chaloupka <jchaloup> |
| Component: | Monitoring | Assignee: | Arunprasad Rajkumar <arajkuma> |
| Status: | CLOSED DUPLICATE | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | low | ||
| Version: | 4.9 | CC: | alegrand, anpicker, aos-bugs, erooth, kakkoyun, pgough, pkrupa, spasquie |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | [sig-instrumentation][Late] Alerts shouldn't report any alerts in firing or pending state apart from Watchdog and AlertmanagerReceiversNotConfigured and have no gaps in Watchdog firing |
| Last Closed: | 2021-07-20 16:51:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description
Jan Chaloupka
2021-06-28 15:34:13 UTC
For completeness of the report:
```
flake: Unexpected alert behavior during test:
alert CommunityOperatorsCatalogError pending for 125.6010000705719 seconds with labels: {container="catalog-operator", endpoint="https-metrics", exported_namespace="openshift-marketplace", instance="10.129.0.13:8081", job="catalog-operator-metrics", name="community-operators", namespace="openshift-operator-lifecycle-manager", pod="catalog-operator-6cb6465654-vhpd8", service="catalog-operator-metrics", severity="warning"}
alert KubeAPIErrorBudgetBurn pending for 2103.601000070572 seconds with labels: {long="3d", severity="warning", short="6h"}
alert KubeContainerWaiting pending for 252.6010000705719 seconds with labels: {container="registry-server", namespace="openshift-marketplace", pod="redhat-marketplace-d5f7s", severity="warning"}
alert KubeContainerWaiting pending for 72.6010000705719 seconds with labels: {container="registry-server", namespace="openshift-marketplace", pod="community-operators-2zmw5", severity="warning"}
alert KubeContainerWaiting pending for 72.6010000705719 seconds with labels: {container="thanos-query", namespace="openshift-monitoring", pod="thanos-querier-654df9fd8c-f27c6", severity="warning"}
alert KubeDeploymentReplicasMismatch pending for 306.6010000705719 seconds with labels: {container="kube-rbac-proxy-main", deployment="thanos-querier", endpoint="https-main", job="kube-state-metrics", namespace="openshift-monitoring", service="kube-state-metrics", severity="warning"}
alert PodDisruptionBudgetAtLimit pending for 12.6010000705719 seconds with labels: {namespace="openshift-monitoring", poddisruptionbudget="thanos-querier-pdb", severity="warning"}
alert RedhatMarketplaceCatalogError pending for 251.6010000705719 seconds with labels: {container="catalog-operator", endpoint="https-metrics", exported_namespace="openshift-marketplace", instance="10.129.0.13:8081", job="catalog-operator-metrics", name="redhat-marketplace", namespace="openshift-operator-lifecycle-manager", pod="catalog-operator-6cb6465654-vhpd8", service="catalog-operator-metrics", severity="warning"}
alert TargetDown pending for 353.6010000705719 seconds with labels: {job="thanos-querier", namespace="openshift-monitoring", service="thanos-querier", severity="warning"}
```
This seems to be a duplicate of Bug 1980888.

This does look like a duplicate. The probe information in this ticket is now outdated, because we now use HTTP probes; there appears to have been an issue with the underlying exec probes. The CI search at https://search.ci.openshift.org/chart?search=alert+KubeContainerWaiting+fired&maxAge=24h&type=junit shows that the failure rate in CI is now below 1%, with no Thanos-related failures that I have observed, so I am closing this as a duplicate since the issue is resolved.

*** This bug has been marked as a duplicate of bug 1980888 ***