Created attachment 1606686 [details]
output of `oc get pod` for the failed revision-pruner pod

Description of problem:
Revision-pruner-<somestring> pods in the openshift-kube-scheduler namespace frequently end up in the Error state after upgrading from 4.1.11 to 4.1.12.

Version-Release number of selected component (if applicable):
Unsure. image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:926b87f00234f43437245a02fd5d68e0975186e77450fc94728ce1a679c8553e

How reproducible:
So far, 4 out of 6 upgrades. I have been using the github.com/openshift/osde2e framework to do the upgrades, but it's just a standard upgrade request. Anecdotally, another member of my team ran into this with a manual upgrade.

Steps to Reproduce:
1. Upgrade an OpenShift cluster from 4.1.11 to 4.1.12.
2. Search for pods in the Error state in the openshift-kube-scheduler namespace.

My specific case:
1. Run osde2e with the UPGRADE_RELEASE_STREAM=4-stable environment variable to create a 4.1.11 cluster and upgrade it to 4.1.12.
2. The tests for pods not in the Running or Completed state will fail (see the sketch at the end of this comment for the kind of check involved).

Actual results:
The failed revision-pruner pod is in the Error state with exit code 255 and the attached log message.

Expected results:
Pod in the Completed or Running state.

Additional info:
See the attached Kubernetes pod YAML, and the pod log message in the container status message.
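For reference, a minimal client-go sketch of the kind of check that flags these pods (illustrative only, not the actual osde2e test code; the kubeconfig handling here is my assumption):

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List pods in the namespace and report any that are neither Running
	// nor Succeeded (Succeeded shows up as "Completed" in `oc get pods`).
	pods, err := client.CoreV1().Pods("openshift-kube-scheduler").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodRunning && p.Status.Phase != corev1.PodSucceeded {
			fmt.Printf("%s is in phase %s\n", p.Name, p.Status.Phase)
		}
	}
}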
I've run into this with a 4.1.9 to 4.1.11 upgrade on a fresh cluster, installed at 4.1.9 specifically so I could do the upgrade.
This is weird, and unfortunately the logs in the pod state don't give any information about the error (they just dump the info at the top of cmd, https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L38). However, it's a pretty simple function (a paraphrased sketch is at the end of this comment), so there are only 4 places that can return an error:

1. Validating flags (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L41)
2. Failure to read the resource dir (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L81)
3. Failure to parse a revision ID (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L98)
4. Failure to remove the resource dir (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L112)

Since this only happens during an upgrade, my shot-in-the-dark bet is that something odd happens to the resource dirs during an upgrade, so that a dir the pruner expects is not actually there (2).

Is it possible to get the full pod logs? That would be the most helpful. It would also help to see the statuses of the current revisions; to get those you can run:

$ go get github.com/openshift/must-gather/cmd/openshift-dev-helpers
$ openshift-dev-helpers revision-status -n <namespace of failing pruner>
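Here is the paraphrased Go sketch of the prune flow mentioned above. This is not the actual library-go code; the function name and directory-matching details are my assumptions, but the numbered error points line up with the list:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// pruneSketch mirrors the error points above. Flag validation (1) is assumed
// to have passed already, so the remaining failures are reading the resource
// dir (2), parsing a revision ID (3), and removing a revision dir (4).
func pruneSketch(resourceDir, staticPodName string, maxEligible int, protected map[int]bool) error {
	entries, err := os.ReadDir(resourceDir) // (2) resource dir unreadable or missing
	if err != nil {
		return err
	}
	prefix := staticPodName + "-"
	for _, e := range entries {
		if !strings.HasPrefix(e.Name(), prefix) {
			continue
		}
		rev, err := strconv.Atoi(strings.TrimPrefix(e.Name(), prefix)) // (3) unparsable revision ID
		if err != nil {
			return err
		}
		if rev > maxEligible || protected[rev] {
			continue
		}
		if err := os.RemoveAll(filepath.Join(resourceDir, e.Name())); err != nil { // (4) removal failed
			return err
		}
	}
	return nil
}

func main() {
	if err := pruneSketch("/etc/kubernetes/static-pod-resources", "kube-scheduler-pod", 9, map[int]bool{9: true}); err != nil {
		fmt.Println("prune failed:", err)
		os.Exit(1)
	}
}

If the resource dir itself is missing or renamed mid-upgrade, the very first read would fail with no further output, which would match logs that stop right after the option dump.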
Unfortunately, I don't have the full logs from the initial failures anymore, and I am no longer able to replicate this. I've run this a dozen times over the last two days and several times last Friday, and haven't gotten a single failed pod. I'm going to keep trying, but perhaps it's no longer an issue.
@Mike Full logs (very small) from a failure during a 4.1.14 to 4.1.15 upgrade:

$ oc -n openshift-kube-scheduler get pods
NAME                                                                     READY   STATUS      RESTARTS   AGE
installer-10-ip-10-0-128-168.eu-central-1.compute.internal               0/1     Completed   0          24m
installer-10-ip-10-0-129-84.eu-central-1.compute.internal                0/1     Completed   0          26m
installer-10-ip-10-0-137-61.eu-central-1.compute.internal                0/1     Completed   0          23m
openshift-kube-scheduler-ip-10-0-128-168.eu-central-1.compute.internal   1/1     Running     0          24m
openshift-kube-scheduler-ip-10-0-129-84.eu-central-1.compute.internal    1/1     Running     0          26m
openshift-kube-scheduler-ip-10-0-137-61.eu-central-1.compute.internal    1/1     Running     0          23m
revision-pruner-10-ip-10-0-128-168.eu-central-1.compute.internal         0/1     Completed   0          23m
revision-pruner-10-ip-10-0-129-84.eu-central-1.compute.internal          0/1     Completed   0          25m
revision-pruner-10-ip-10-0-137-61.eu-central-1.compute.internal          0/1     Completed   0          22m
revision-pruner-9-ip-10-0-128-168.eu-central-1.compute.internal          0/1     Error       0          4d20h
revision-pruner-9-ip-10-0-129-84.eu-central-1.compute.internal           0/1     Completed   0          4d19h
revision-pruner-9-ip-10-0-137-61.eu-central-1.compute.internal           0/1     Completed   0          4d20h

$ oc -n openshift-kube-scheduler logs revision-pruner-9-ip-10-0-128-168.eu-central-1.compute.internal
I0912 17:38:47.965799       1 cmd.go:38] &{<nil> true {false} prune true map[protected-revisions:0xc0005b4f00 resource-dir:0xc0005b4fa0 static-pod-name:0xc0005b5040 v:0xc00070ba40 max-eligible-revision:0xc0005b4e60] [0xc00070ba40 0xc0005b4e60 0xc0005b4f00 0xc0005b4fa0 0xc0005b5040] [] map[resource-dir:0xc0005b4fa0 log-dir:0xc0000e5d60 alsologtostderr:0xc0000e5c20 skip-headers:0xc0000e5f40 stderrthreshold:0xc00070b9a0 v:0xc00070ba40 vmodule:0xc00070bae0 max-eligible-revision:0xc0005b4e60 static-pod-name:0xc0005b5040 log-backtrace-at:0xc0000e5cc0 help:0xc0005b5220 protected-revisions:0xc0005b4f00 log-file:0xc0000e5e00 log-flush-frequency:0xc0000dc280 logtostderr:0xc0000e5ea0] [0xc0005b4e60 0xc0005b4f00 0xc0005b4fa0 0xc0005b5040 0xc0000e5c20 0xc0000e5cc0 0xc0000e5d60 0xc0000e5e00 0xc0000dc280 0xc0000e5ea0 0xc0000e5f40 0xc00070b9a0 0xc00070ba40 0xc00070bae0 0xc0005b5220] [0xc0000e5c20 0xc0005b5220 0xc0000e5cc0 0xc0000e5d60 0xc0000e5e00 0xc0000dc280 0xc0000e5ea0 0xc0005b4e60 0xc0005b4f00 0xc0005b4fa0 0xc0000e5f40 0xc0005b5040 0xc00070b9a0 0xc00070ba40 0xc00070bae0] map[118:0xc00070ba40 104:0xc0005b5220] [] -1 0 0xc00014c150 true <nil> []}
I0912 17:38:47.965967       1 cmd.go:39] (*prune.PruneOptions)(0xc000410f00)({
 MaxEligibleRevision: (int) 9,
 ProtectedRevisions: ([]int) (len=9 cap=9) {
  (int) 1,
  (int) 2,
  (int) 3,
  (int) 4,
  (int) 5,
  (int) 6,
  (int) 7,
  (int) 8,
  (int) 9
 },
 ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources",
 StaticPodName: (string) (len=18) "kube-scheduler-pod"
})
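Reading that dump: MaxEligibleRevision is 9 and ProtectedRevisions covers 1 through 9, so if the flow works the way the earlier sketch assumes, no revision directory would actually be eligible for removal, and the failure would have to come from reading the resource dir or parsing a revision ID rather than from deleting anything. A tiny illustration of that eligibility check (my paraphrase, not the library-go code):

package main

import "fmt"

func main() {
	// Values taken from the PruneOptions dump above.
	maxEligible := 9
	protected := map[int]bool{1: true, 2: true, 3: true, 4: true, 5: true, 6: true, 7: true, 8: true, 9: true}

	// A revision is pruned only if it is <= MaxEligibleRevision and not protected.
	for rev := 1; rev <= 10; rev++ {
		eligible := rev <= maxEligible && !protected[rev]
		fmt.Printf("revision %d: eligible for pruning = %v\n", rev, eligible)
	}
	// With these options every revision is either protected or above the
	// eligibility cap, so nothing should be removed at all.
}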
A lot has changed between 4.2 and now; let's move this to QA to verify in 4.4.
I learned that installer pods are removed during an upgrade, and as per the above comment I did not see any crashes upgrading from 4.3.2 -> 4.3.3 -> 4.3.5, so I am moving the bug to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
Setting the qe_test_coverage flag to '-' as I see that the issue does not happen in clusters with OCP version >= 4.8.