Bug 1744285 - Frequent failed revision-pruner pods after upgrade from 4.1.11 to 4.1.12
Summary: Frequent failed revision-pruner pods after upgrade from 4.1.11 to 4.1.12
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.1.z
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Target Release: 4.4.z
Assignee: Mike Dame
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-08-21 18:25 UTC by Chris Collins
Modified: 2023-09-07 20:26 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:13:08 UTC
Target Upstream Version:
Embargoed:


Attachments
output of `oc get pod` for the failed revision-pruner pod (5.09 KB, text/plain)
2019-08-21 18:25 UTC, Chris Collins


Links
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:13:35 UTC)

Description Chris Collins 2019-08-21 18:25:27 UTC
Created attachment 1606686
output of `oc get pod` for the failed revision-pruner pod

Description of problem:

Revision-pruner-<somestring> pods in the openshift-kube-scheduler namespace are frequently left in the Error state after upgrading from 4.1.11 to 4.1.12


Version-Release number of selected component (if applicable):

Unsure.  
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:926b87f00234f43437245a02fd5d68e0975186e77450fc94728ce1a679c8553e


How reproducible:

So far, 4 out of 6 upgrades. I have been using the github.com/openshift/osde2e framework to do the upgrades, but it's just a standard upgrade request. Anecdotally, another member of my team ran into this with a manual upgrade.

Steps to Reproduce:
1. upgrade an openshift cluster from 4.1.11 to 4.1.12
2. search for pods in the Error state in the openshift-kube-scheduler namespace (an example command is shown below)

my specific case:

1. run osde2e with UPGRADE_RELEASE_STREAM=4-stable environment variable to create a 4.1.11 cluster and upgrade it to 4.1.12
2. tests for pods not in the Running or Completed state will fail
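
For step 2, something like the following will list the unhealthy pods (example commands only, not from the original report; any filter that excludes the Running and Completed states works):

$ oc -n openshift-kube-scheduler get pods | grep -Ev 'Running|Completed'
$ oc get pods --all-namespaces --field-selector=status.phase=Failed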

Actual results:

A failed revision-pruner pod in the Error state with exit code 255 and a log message (attached)


Expected results:

Pod in Completed or Running state


Additional info:

See the attached Kubernetes pod YAML for details; the pod's log message is captured in the container status message.
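
To pull just the terminated-container details (exit code, reason, and the captured log message) without the full YAML, a jsonpath query like this works (example only; substitute the actual failed pod name):

$ oc -n openshift-kube-scheduler get pod <failed-revision-pruner-pod> -o jsonpath='{.status.containerStatuses[0].state.terminated}'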

Comment 1 Naveen Malik 2019-08-21 18:28:55 UTC
I've run into this with a 4.1.9 to 4.1.11 upgrade on a fresh cluster, installed at 4.1.9 specifically so I could do the upgrade.

Comment 2 Mike Dame 2019-08-23 12:19:35 UTC
This is weird, and unfortunately the logs in the pod state don't give any information about the error (they just dump the info at the top of cmd https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L38). However, it's a pretty simple function, so there are only four places that can return an error:

1. Validating flags (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L41)
2. Failure to read the resource dir (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L81)
3. Failure to parse a revision ID (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L98)
4. Failure to remove the resource dir (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L112)

Since this only happens during an upgrade, a shot-in-the-dark bet is that something weird happens to the resource dirs during the upgrade, so that a directory that's expected to exist isn't actually there (case 2).
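
If it is (2), checking the resource dir on the affected node would confirm it. Something like the following should show whether the expected revision directories are still there (example only, assuming debug access to the node and the default /etc/kubernetes/static-pod-resources resource dir):

$ oc debug node/<node> -- chroot /host ls -l /etc/kubernetes/static-pod-resources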

Is it possible to get the full pod logs? That would be the most helpful.

It could also be helpful to see the statuses of the current revisions; to get those you can run:
$ go get github.com/openshift/must-gather/cmd/openshift-dev-helpers
$ openshift-dev-helpers revision-status -n <namespace of failing pruner>
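
If that helper isn't available, the revision statuses can usually also be read directly from the namespace, assuming the operator stores them in revision-status-<N> configmaps (adjust the grep if the naming differs):

$ oc -n openshift-kube-scheduler get configmaps | grep revision-status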

Comment 3 Chris Collins 2019-08-27 17:33:24 UTC
Unfortunately, I don't have the full logs from the initial failures anymore, but also, I am no longer able to replicate this.  I've run this a dozen times over the last two days and several times last Friday, and haven't gotten a single failed pod.  I'm going to keep trying, but perhaps it's no longer an issue.

Comment 4 Naveen Malik 2019-09-17 13:59:17 UTC
@Mike

Full logs (very small) for a failure during a 4.1.14 to 4.1.15 upgrade:

$ oc -n openshift-kube-scheduler get pods
NAME                                                                     READY   STATUS      RESTARTS   AGE
installer-10-ip-10-0-128-168.eu-central-1.compute.internal               0/1     Completed   0          24m
installer-10-ip-10-0-129-84.eu-central-1.compute.internal                0/1     Completed   0          26m
installer-10-ip-10-0-137-61.eu-central-1.compute.internal                0/1     Completed   0          23m
openshift-kube-scheduler-ip-10-0-128-168.eu-central-1.compute.internal   1/1     Running     0          24m
openshift-kube-scheduler-ip-10-0-129-84.eu-central-1.compute.internal    1/1     Running     0          26m
openshift-kube-scheduler-ip-10-0-137-61.eu-central-1.compute.internal    1/1     Running     0          23m
revision-pruner-10-ip-10-0-128-168.eu-central-1.compute.internal         0/1     Completed   0          23m
revision-pruner-10-ip-10-0-129-84.eu-central-1.compute.internal          0/1     Completed   0          25m
revision-pruner-10-ip-10-0-137-61.eu-central-1.compute.internal          0/1     Completed   0          22m
revision-pruner-9-ip-10-0-128-168.eu-central-1.compute.internal          0/1     Error       0          4d20h
revision-pruner-9-ip-10-0-129-84.eu-central-1.compute.internal           0/1     Completed   0          4d19h
revision-pruner-9-ip-10-0-137-61.eu-central-1.compute.internal           0/1     Completed   0          4d20h


$ oc -n openshift-kube-scheduler logs revision-pruner-9-ip-10-0-128-168.eu-central-1.compute.internal
I0912 17:38:47.965799       1 cmd.go:38] &{<nil> true {false} prune true map[protected-revisions:0xc0005b4f00 resource-dir:0xc0005b4fa0 static-pod-name:0xc0005b5040 v:0xc00070ba40 max-eligible-revision:0xc0005b4e60] [0xc00070ba40 0xc0005b4e60 0xc0005b4f00 0xc0005b4fa0 0xc0005b5040] [] map[resource-dir:0xc0005b4fa0 log-dir:0xc0000e5d60 alsologtostderr:0xc0000e5c20 skip-headers:0xc0000e5f40 stderrthreshold:0xc00070b9a0 v:0xc00070ba40 vmodule:0xc00070bae0 max-eligible-revision:0xc0005b4e60 static-pod-name:0xc0005b5040 log-backtrace-at:0xc0000e5cc0 help:0xc0005b5220 protected-revisions:0xc0005b4f00 log-file:0xc0000e5e00 log-flush-frequency:0xc0000dc280 logtostderr:0xc0000e5ea0] [0xc0005b4e60 0xc0005b4f00 0xc0005b4fa0 0xc0005b5040 0xc0000e5c20 0xc0000e5cc0 0xc0000e5d60 0xc0000e5e00 0xc0000dc280 0xc0000e5ea0 0xc0000e5f40 0xc00070b9a0 0xc00070ba40 0xc00070bae0 0xc0005b5220] [0xc0000e5c20 0xc0005b5220 0xc0000e5cc0 0xc0000e5d60 0xc0000e5e00 0xc0000dc280 0xc0000e5ea0 0xc0005b4e60 0xc0005b4f00 0xc0005b4fa0 0xc0000e5f40 0xc0005b5040 0xc00070b9a0 0xc00070ba40 0xc00070bae0] map[118:0xc00070ba40 104:0xc0005b5220] [] -1 0 0xc00014c150 true <nil> []}
I0912 17:38:47.965967       1 cmd.go:39] (*prune.PruneOptions)(0xc000410f00)({
 MaxEligibleRevision: (int) 9,
 ProtectedRevisions: ([]int) (len=9 cap=9) {
  (int) 1,
  (int) 2,
  (int) 3,
  (int) 4,
  (int) 5,
  (int) 6,
  (int) 7,
  (int) 8,
  (int) 9
 },
 ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources",
 StaticPodName: (string) (len=18) "kube-scheduler-pod"
})

Comment 7 Maciej Szulik 2020-03-13 15:25:35 UTC
A lot has changed between 4.2 and now; let's move it to QA to verify in 4.4.

Comment 12 RamaKasturi 2020-03-24 06:54:24 UTC
I learned that installer pods are removed by the upgrade, and as per the above comment I did not see any crashes when upgrading from 4.3.2 -> 4.3.3 -> 4.3.5, so I am moving the bug to the verified state.

Comment 14 errata-xmlrpc 2020-05-04 11:13:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

Comment 15 RamaKasturi 2022-11-16 12:49:23 UTC
Setting the qe_test_coverage flag to '-' as I see that the issue does not happen on clusters with OCP version >= 4.8.

