Created attachment 1606686 [details]
output of `oc get pod` for the failed revision-pruner pod

Description of problem:
Revision-pruner-<somestring> pods in the openshift-kube-scheduler namespace frequently end up in the Error state after upgrading from 4.1.11 to 4.1.12.

Version-Release number of selected component (if applicable):
Unsure. image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:926b87f00234f43437245a02fd5d68e0975186e77450fc94728ce1a679c8553e

How reproducible:
So far, 4 out of 6 upgrades. I have been using the github.com/openshift/osde2e framework to do the upgrades, but it's just a standard upgrade request. Anecdotally, another member of my team ran into this with a manual upgrade.

Steps to Reproduce:
1. Upgrade an OpenShift cluster from 4.1.11 to 4.1.12.
2. Search for pods in the Error state in the openshift-kube-scheduler namespace.

My specific case:
1. Run osde2e with the UPGRADE_RELEASE_STREAM=4-stable environment variable to create a 4.1.11 cluster and upgrade it to 4.1.12.
2. The tests for pods not in the Running or Completed state will fail (see the sketch at the end of this comment for the kind of check involved).

Actual results:
The failed revision-pruner pod is in the Error state with exit code 255 and the attached log message.

Expected results:
Pod in the Completed or Running state.

Additional info:
See the attached Kubernetes pod YAML, and the pod log message in the container status message.
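For reference, a minimal client-go sketch of the kind of check that flags these pods (illustrative only, not the actual osde2e test code; the kubeconfig handling here is my assumption):

package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig location.
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	// List pods in the namespace and report any that are neither Running
	// nor Succeeded (Succeeded shows up as "Completed" in `oc get pods`).
	pods, err := client.CoreV1().Pods("openshift-kube-scheduler").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		if p.Status.Phase != corev1.PodRunning && p.Status.Phase != corev1.PodSucceeded {
			fmt.Printf("%s is in phase %s\n", p.Name, p.Status.Phase)
		}
	}
}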
I've run into this with a 4.1.9 to 4.1.11 upgrade on a fresh cluster, installed at 4.1.9 specifically so I could do the upgrade.
This is weird, and unfortunately the logs in the pod state don't give any information about the error (they just dump the info at the top of cmd, https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L38). However, it's a pretty simple function (a paraphrased sketch is at the end of this comment), so there are only 4 places that can return an error:

1. Validating flags (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L41)
2. Failure to read the resource dir (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L81)
3. Failure to parse a revision ID (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L98)
4. Failure to remove the resource dir (https://github.com/openshift/library-go/blob/master/pkg/operator/staticpod/prune/cmd.go#L112)

Since this only happens during an upgrade, my shot-in-the-dark bet is that something odd happens to the resource dirs during an upgrade, so that a dir the pruner expects is not actually there (2).

Is it possible to get the full pod logs? That would be the most helpful. It would also help to see the statuses of the current revisions; to get those you can run:

$ go get github.com/openshift/must-gather/cmd/openshift-dev-helpers
$ openshift-dev-helpers revision-status -n <namespace of failing pruner>
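Here is the paraphrased Go sketch of the prune flow mentioned above. This is not the actual library-go code; the function name and directory-matching details are my assumptions, but the numbered error points line up with the list:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// pruneSketch mirrors the error points above. Flag validation (1) is assumed
// to have passed already, so the remaining failures are reading the resource
// dir (2), parsing a revision ID (3), and removing a revision dir (4).
func pruneSketch(resourceDir, staticPodName string, maxEligible int, protected map[int]bool) error {
	entries, err := os.ReadDir(resourceDir) // (2) resource dir unreadable or missing
	if err != nil {
		return err
	}
	prefix := staticPodName + "-"
	for _, e := range entries {
		if !strings.HasPrefix(e.Name(), prefix) {
			continue
		}
		rev, err := strconv.Atoi(strings.TrimPrefix(e.Name(), prefix)) // (3) unparsable revision ID
		if err != nil {
			return err
		}
		if rev > maxEligible || protected[rev] {
			continue
		}
		if err := os.RemoveAll(filepath.Join(resourceDir, e.Name())); err != nil { // (4) removal failed
			return err
		}
	}
	return nil
}

func main() {
	if err := pruneSketch("/etc/kubernetes/static-pod-resources", "kube-scheduler-pod", 9, map[int]bool{9: true}); err != nil {
		fmt.Println("prune failed:", err)
		os.Exit(1)
	}
}

If the resource dir itself is missing or renamed mid-upgrade, the very first read would fail with no further output, which would match logs that stop right after the option dump.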
Unfortunately, I don't have the full logs from the initial failures anymore, and I am no longer able to replicate this. I've run this a dozen times over the last two days and several times last Friday, and haven't gotten a single failed pod. I'm going to keep trying, but perhaps it's no longer an issue.
@Mike Full logs (very small) from a failure during a 4.1.14 to 4.1.15 upgrade:

$ oc -n openshift-kube-scheduler get pods
NAME                                                                     READY   STATUS      RESTARTS   AGE
installer-10-ip-10-0-128-168.eu-central-1.compute.internal               0/1     Completed   0          24m
installer-10-ip-10-0-129-84.eu-central-1.compute.internal                0/1     Completed   0          26m
installer-10-ip-10-0-137-61.eu-central-1.compute.internal                0/1     Completed   0          23m
openshift-kube-scheduler-ip-10-0-128-168.eu-central-1.compute.internal   1/1     Running     0          24m
openshift-kube-scheduler-ip-10-0-129-84.eu-central-1.compute.internal    1/1     Running     0          26m
openshift-kube-scheduler-ip-10-0-137-61.eu-central-1.compute.internal    1/1     Running     0          23m
revision-pruner-10-ip-10-0-128-168.eu-central-1.compute.internal         0/1     Completed   0          23m
revision-pruner-10-ip-10-0-129-84.eu-central-1.compute.internal          0/1     Completed   0          25m
revision-pruner-10-ip-10-0-137-61.eu-central-1.compute.internal          0/1     Completed   0          22m
revision-pruner-9-ip-10-0-128-168.eu-central-1.compute.internal          0/1     Error       0          4d20h
revision-pruner-9-ip-10-0-129-84.eu-central-1.compute.internal           0/1     Completed   0          4d19h
revision-pruner-9-ip-10-0-137-61.eu-central-1.compute.internal           0/1     Completed   0          4d20h

$ oc -n openshift-kube-scheduler logs revision-pruner-9-ip-10-0-128-168.eu-central-1.compute.internal
I0912 17:38:47.965799       1 cmd.go:38] &{<nil> true {false} prune true map[protected-revisions:0xc0005b4f00 resource-dir:0xc0005b4fa0 static-pod-name:0xc0005b5040 v:0xc00070ba40 max-eligible-revision:0xc0005b4e60] [0xc00070ba40 0xc0005b4e60 0xc0005b4f00 0xc0005b4fa0 0xc0005b5040] [] map[resource-dir:0xc0005b4fa0 log-dir:0xc0000e5d60 alsologtostderr:0xc0000e5c20 skip-headers:0xc0000e5f40 stderrthreshold:0xc00070b9a0 v:0xc00070ba40 vmodule:0xc00070bae0 max-eligible-revision:0xc0005b4e60 static-pod-name:0xc0005b5040 log-backtrace-at:0xc0000e5cc0 help:0xc0005b5220 protected-revisions:0xc0005b4f00 log-file:0xc0000e5e00 log-flush-frequency:0xc0000dc280 logtostderr:0xc0000e5ea0] [0xc0005b4e60 0xc0005b4f00 0xc0005b4fa0 0xc0005b5040 0xc0000e5c20 0xc0000e5cc0 0xc0000e5d60 0xc0000e5e00 0xc0000dc280 0xc0000e5ea0 0xc0000e5f40 0xc00070b9a0 0xc00070ba40 0xc00070bae0 0xc0005b5220] [0xc0000e5c20 0xc0005b5220 0xc0000e5cc0 0xc0000e5d60 0xc0000e5e00 0xc0000dc280 0xc0000e5ea0 0xc0005b4e60 0xc0005b4f00 0xc0005b4fa0 0xc0000e5f40 0xc0005b5040 0xc00070b9a0 0xc00070ba40 0xc00070bae0] map[118:0xc00070ba40 104:0xc0005b5220] [] -1 0 0xc00014c150 true <nil> []}
I0912 17:38:47.965967       1 cmd.go:39] (*prune.PruneOptions)(0xc000410f00)({
 MaxEligibleRevision: (int) 9,
 ProtectedRevisions: ([]int) (len=9 cap=9) {
  (int) 1,
  (int) 2,
  (int) 3,
  (int) 4,
  (int) 5,
  (int) 6,
  (int) 7,
  (int) 8,
  (int) 9
 },
 ResourceDir: (string) (len=36) "/etc/kubernetes/static-pod-resources",
 StaticPodName: (string) (len=18) "kube-scheduler-pod"
})
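Reading that dump: MaxEligibleRevision is 9 and ProtectedRevisions covers 1 through 9, so if the flow works the way the earlier sketch assumes, no revision directory would actually be eligible for removal, and the failure would have to come from reading the resource dir or parsing a revision ID rather than from deleting anything. A tiny illustration of that eligibility check (my paraphrase, not the library-go code):

package main

import "fmt"

func main() {
	// Values taken from the PruneOptions dump above.
	maxEligible := 9
	protected := map[int]bool{1: true, 2: true, 3: true, 4: true, 5: true, 6: true, 7: true, 8: true, 9: true}

	// A revision is pruned only if it is <= MaxEligibleRevision and not protected.
	for rev := 1; rev <= 10; rev++ {
		eligible := rev <= maxEligible && !protected[rev]
		fmt.Printf("revision %d: eligible for pruning = %v\n", rev, eligible)
	}
	// With these options every revision is either protected or above the
	// eligibility cap, so nothing should be removed at all.
}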
A lot has changed between 4.2 and now; let's move this to QA to verify in 4.4.
I learned that installer pods are removed during an upgrade, and as per the above comment I did not see any crashes upgrading from 4.3.2 -> 4.3.3 -> 4.3.5, so I am moving the bug to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581
Setting the qe_test_coverage flag to '-' as I see that the issue does not happen in clusters with OCP version >= 4.8.