Bug 1805891 - KubePodNotReady should not be raised for must-gather pod
Summary: KubePodNotReady should not be raised for must-gather pod
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: oc
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
low
medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jan Chaloupka
QA Contact: Mike Fiedler
URL:
Whiteboard: LifecycleReset
Depends On:
Blocks: 1875551
 
Reported: 2020-02-21 17:11 UTC by Naveen Malik
Modified: 2020-10-27 15:56 UTC
CC List: 7 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1875551 (view as bug list)
Environment:
Last Closed: 2020-10-27 15:55:19 UTC
Target Upstream Version:
Embargoed:


Attachments
ALERTS (61.25 KB, image/png), 2020-05-18 15:23 UTC, Naveen Malik
must-gather pod yaml (3.93 KB, text/plain), 2020-05-20 14:06 UTC, Naveen Malik
must-gather pod describe (3.72 KB, text/plain), 2020-05-20 14:06 UTC, Naveen Malik


Links
GitHub openshift/oc pull 540 (closed): bug 1805891: must-gather: move gather init container under containers. Last updated 2020-09-14 10:10:01 UTC.
Red Hat Product Errata RHBA-2020:4196. Last updated 2020-10-27 15:56:00 UTC.

Description Naveen Malik 2020-02-21 17:11:30 UTC
Description of problem:
On a large cluster, must-gather takes a while. It should not raise the critical KubePodNotReady alert. On OSD this pages the SRE on-call and is not useful. Hit this on a 42-node cluster (3x m5.xlarge masters, 39x m5.4xlarge workers).

Version-Release number of selected component (if applicable):
4.3.0

How reproducible:
Every time.

Steps to Reproduce:
1. Log in to a large cluster where must-gather will take longer than 15 minutes
2. Run must-gather
3. Observe the KubePodNotReady alert raised at 15 minutes

Actual results:
The KubePodNotReady alert fires for the must-gather pod while it is running.

Expected results:
KubePodNotReady does not fire for the must-gather pod.

Additional info:

Comment 1 Jan Chaloupka 2020-05-18 12:21:37 UTC
Based on https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/085ed55f986bb721d10e5b83d75aa2ce6e1f0819/alerts/apps_alerts.libsonnet#L26-L48, a pod triggers KubePodNotReady only when it's in Pending or Unknown state.

Naveen, can you share output of `oc describe` and `oc get -o yaml` for the must-gather pod when KubePodNotReady is triggered?
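
For anyone reproducing this, a minimal sketch of how that output could be captured while the alert is pending; the temporary namespace name is randomized, so the prefix lookup below is an assumption:

```
# Hedged sketch: capture the must-gather pod state while KubePodNotReady is pending.
# The namespace name is random (e.g. openshift-must-gather-r2qm7), so look it up by prefix.
NS=$(oc get ns -o name | grep '^namespace/openshift-must-gather-' | head -1 | cut -d/ -f2)
oc -n "$NS" get pods -o yaml > must-gather-pod.yaml
oc -n "$NS" describe pods > must-gather-pod-describe.txt
```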

Comment 2 Naveen Malik 2020-05-18 15:22:35 UTC
I ran a must-gather on the large cluster that exhibited this alert in the past. The alert did not fire, but it is shown in a "pending" state in the Prometheus metric "ALERTS". I didn't capture oc describe or the pod yaml while this was going on, as I expected to collect them after the alert was raised. The alert goes pending very soon after the must-gather is started, so it is simply a matter of creating a scenario in which the must-gather takes more than 15m to complete in order for the alert to fire.

If the underlying expr on the alert were tweaked to exclude openshift-must-gather.* namespaces, this problem would be resolved. This, of course, could result in a valid problem not being alerted on, but it's a command an admin is running and waiting on for output.

sum by(namespace, pod) (max by(namespace, pod) (kube_pod_status_phase{job="kube-state-metrics",namespace!~"^openshift-must-gather.*",namespace=~"(openshift-.*|kube-.*|default|logging)",phase=~"Pending|Unknown"}) * on(namespace, pod) group_left(owner_kind) max by(namespace, pod, owner_kind) (kube_pod_owner{owner_kind!="Job"})) > 0

Screenshot of the alert will be attached.
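
For context, a hedged sketch of how the shipped rule and the ALERTS metric could be inspected to compare against the tweaked expression above; the object and pod names are assumptions based on the default openshift-monitoring layout:

```
# Locate the shipped KubePodNotReady rule among the CVO-managed PrometheusRules
# (the exact PrometheusRule object name varies between releases).
oc -n openshift-monitoring get prometheusrules -o yaml | grep -A 10 'alert: KubePodNotReady'

# Check whether the alert is pending or firing via the Prometheus HTTP API
# (assumes curl is available inside the prometheus container).
oc -n openshift-monitoring exec prometheus-k8s-0 -c prometheus -- \
  curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=ALERTS{alertname="KubePodNotReady"}'
```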

Comment 3 Naveen Malik 2020-05-18 15:23:27 UTC
Created attachment 1689619 [details]
ALERTS

Comment 4 Jan Chaloupka 2020-05-18 17:16:06 UTC
> The alert goes pending very soon after the must-gather is started

That's expected based on "pending: the state of an alert that has been active for less than the configured threshold duration" [1].

> and is simply a matter of creating a scenario in which it will take more than 15m for the must-gather to complete in order for the alert to fire.

Based on [2], the alert will get triggered only when the must-gather pod has been in Pending|Unknown state for at least 15 minutes.

If the must-gather pod is already running, no alert gets fired.

If the alert gets fired even when must-gather is running, the alert expressions might be wrong.

Are you sure the must-gather pod is really running and not pending? E.g. does it lack sufficient resources to run?


[1] https://pracucci.com/prometheus-understanding-the-delays-on-alerting.html

[2] https://github.com/kubernetes-monitoring/kubernetes-mixin/blob/085ed55f986bb721d10e5b83d75aa2ce6e1f0819/alerts/apps_alerts.libsonnet#L26-L48

Comment 5 Naveen Malik 2020-05-18 20:41:38 UTC
I'm sure it was doing what it was designed to do; I ended up with all the must-gather data I'd expect locally. I didn't check the status of the pod, as noted, because I was going to do that if/when the alert triggered, and it didn't. This is on a 4.3.18 cluster; you could just try running a must-gather on a cluster and check that the alert is raised as pending.

Comment 6 Jan Chaloupka 2020-05-20 08:30:04 UTC
> I'm sure it was doing what it was designed to do, I ended up with all the must-gather data I'd expect locally.

In that case the pod state had to be Unknown, if not Pending.

> This is on 4.3.18 cluster, could just try running a must-gather on a cluster and check the alert is raised as pending.

Yes please. If you get the chance, please collect `oc get -o yaml` and `oc describe` for the must-gather pod while the alert is pending, so we can see which state the pod is in and whether there are any suspicious events reported.

Comment 7 Naveen Malik 2020-05-20 14:06:07 UTC
Created attachment 1690265 [details]
must-gather pod yaml

while alert KubePodNotReady is pending for the namespace openshift-must-gather-r2qm7

Comment 8 Naveen Malik 2020-05-20 14:06:35 UTC
Created attachment 1690266 [details]
must-gather pod describe

while alert KubePodNotReady is pending for the namespace openshift-must-gather-r2qm7

Comment 9 Jan Chaloupka 2020-05-21 10:46:40 UTC
Thanks Naveen, this helped. So, in short, the gather command is run inside an init container:

```
Status:             Pending
Init Containers:
  gather:
    Container ID:  cri-o://1cbcab0dbf9fff91541368ee8b54ade7ee23d505820f48d5c6da4230fedc0658
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ed21db98f1f2f954c95167a242124c01b7db408ba8f88c227401f51097da1534
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:ed21db98f1f2f954c95167a242124c01b7db408ba8f88c227401f51097da1534
    Port:          <none>
    Host Port:     <none>
    Command:
      /usr/bin/gather
    State:          Running
      Started:      Wed, 20 May 2020 10:02:53 -0400
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /must-gather from must-gather-output (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-7czhm (ro)
```

Also, "A Pod that is initializing is in the Pending state" [1] which explains why KubePodNotReady alert is triggered.

[1] https://kubernetes.io/docs/concepts/workloads/pods/init-containers/
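
As a small illustration of that behaviour, a hedged check (where $NS is the temporary openshift-must-gather namespace) showing the phase staying Pending while the gather init container is already Running:

```
# While the gather init container runs, .status.phase is still Pending,
# which is exactly what the KubePodNotReady expression keys on.
oc -n "$NS" get pods -o jsonpath='{.items[0].status.phase}{"\n"}'
oc -n "$NS" get pods -o jsonpath='{.items[0].status.initContainerStatuses[0].state}{"\n"}'
```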

Comment 10 Jan Chaloupka 2020-05-21 10:49:30 UTC
Naveen, what exact command did you use to run the must-gather pod? Also, where did you get the must-gather pod manifest? Was it hand-crafted or officially provided?

Comment 11 Naveen Malik 2020-05-21 19:13:13 UTC
To run must-gather it was simply: oc adm must-gather
I have no idea what my default project was, but I assume 'default'.
Regarding the pod manifests, I just dumped oc get pod <name> -o yaml for whatever was spun up.

Comment 12 Jan Chaloupka 2020-05-22 12:14:12 UTC
Moving the gather command from the init containers to the containers section is not by itself sufficient to immediately fix the issue. Moving this to 4.6 so we can properly analyze the right solution.
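
As a quick hedged check of where the gather container ends up after such a change (the actual pod spec is generated by `oc adm must-gather`, so this is only a sketch):

```
# List init containers vs. regular containers of the must-gather pod; once gather
# is moved out of initContainers its name should appear in the second list.
# The first command may print nothing (or an error) when there are no init containers.
oc -n "$NS" get pods -o jsonpath='{.items[0].spec.initContainers[*].name}{"\n"}'
oc -n "$NS" get pods -o jsonpath='{.items[0].spec.containers[*].name}{"\n"}'
```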

Comment 14 Jan Chaloupka 2020-05-25 16:01:38 UTC
> On OSD this pages SRE on-call and is not useful.

Naveen, would it be sufficient for SRE to update the alert expression on their own? This affects only large clusters where the must-gather runs for more than 15 minutes. Or, would it help to educate all affected SREs and tell them the alert gets triggered when the must-gather runs for too long?

Comment 15 Naveen Malik 2020-05-27 12:16:24 UTC
(In reply to Jan Chaloupka from comment #14)
> > On OSD this pages SRE on-call and is not useful.
> 
> Naveen, would it be sufficient for SRE to update the alert expression on
> their own? This effects only large clusters where the must-gather runs for
> more than 15 minutes. Or, would it help to educate all affected SREs and
> tell them the alert gets triggered when the must-gather gets run for too
> long?

We are not adjusting OCP-shipped alerts, as that would require that the CVO no longer manage the OCP-provided PrometheusRules.

Adjusting the alert for SRE or ensuring SRE knows this can fire for must-gather isn't a problem we need to solve. This BZ is to address that the alert shouldn't fire. If it's going to be a while before this is closed out, then we can simply carry this in our SOP as a known issue and look forward to the future fix.

Comment 16 Jan Chaloupka 2020-05-29 11:21:58 UTC
> Adjusting the alert for SRE or ensuring SRE knows this can fire for must-gather isn't a problem we need to solve.  This BZ is to address that the alert shouldn't fire.
> If it's going to be a bit before this is closed out then we can simply carry this in our SOP as a known issue and look forward to the future fix.

Please proceed. Once we have the fix ready, this BZ will get properly closed. Keeping it open until then.

Comment 17 Michal Fojtik 2020-08-24 13:11:21 UTC
This bug hasn't had any activity in the last 30 days. Maybe the problem got resolved, was a duplicate of something else, or became less pressing for some reason - or maybe it's still relevant but just hasn't been looked at yet. As such, we're marking this bug as "LifecycleStale" and decreasing the severity/priority. If you have further information on the current state of the bug, please update it, otherwise this bug can be closed in about 7 days. The information can be, for example, that the problem still occurs, that you still want the feature, that more information is needed, or that the bug is (for whatever reason) no longer relevant.

Comment 19 Michal Fojtik 2020-08-24 14:08:11 UTC
The LifecycleStale keyword was removed because the bug got commented on recently.
The bug assignee was notified.

Comment 21 Jan Chaloupka 2020-08-31 22:25:36 UTC
WIP PR: https://github.com/openshift/oc/pull/540

Comment 25 Mike Fiedler 2020-09-02 21:27:48 UTC
Verified on 4.6.0-0.nightly-2020-09-02-131630

54-node cluster on GCP: successful oc adm must-gather with no alerts firing.
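
A possible verification flow along those lines, hedged since the flags and the temporary namespace name can vary:

```
# Run must-gather in the background, then watch the temporary namespace; with the
# fix the gather pod should show Running rather than sitting in Pending, so
# KubePodNotReady never reaches its 15m "for" duration.
oc adm must-gather --dest-dir=./must-gather.local &
sleep 10   # give the command a moment to create the namespace
NS=$(oc get ns -o name | grep '^namespace/openshift-must-gather-' | head -1 | cut -d/ -f2)
oc -n "$NS" get pods -w
```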

Comment 26 Jan Chaloupka 2020-09-03 08:37:46 UTC
Hi Naveen, do you need this to be fixed in 4.5 as well?

Comment 27 Naveen Malik 2020-09-03 12:45:50 UTC
Jan, OSD is on 4.4 now and we expect to upgrade to 4.5 around the beginning of October. So, focusing on where to fix this for OSD's needs, I would say 4.5+.

Comment 29 errata-xmlrpc 2020-10-27 15:55:19 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

