Description of problem:

Occasionally we are seeing e2e-aws-upgrade tests flake and return a fairly useless error message:

fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58]: Expected <bool>: false to be true

Independently of debugging the actual root cause, it would be beneficial to expose better context in this error so triagers can tell whether the same component is triggering it or whether they are hitting the message for different reasons. IMO this bug can be marked as fixed once the message is improved; the next time it comes up with more context, we can debug further.

How reproducible:

8 times in the last 12 hours

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1434/build-log.txt
14 in the last 24 hours per [1]. Click on the dots for job-detail pages, from which you can drill down into gathered artifacts. [1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=-e2e-aws-upgrade&search=k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58
Source is [1]. I guess some job pod is not running, but it would be nice to know which one. [1]: https://github.com/openshift/origin/blob/d4de9989b20b594b4bdf208e4b4de4652cf0fa80/vendor/k8s.io/kubernetes/test/e2e/upgrades/apps/job.go#L58
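For context, the failure shape comes from a bare Gomega BeTrue() assertion, which carries no information about what was being checked. A minimal standalone reproduction (this is an illustration, not the vendored test itself; `running` stands in for the framework's job-pod check):

```go
package job_test

import (
	"testing"

	. "github.com/onsi/gomega"
)

// Running `go test` on this deliberately fails with the same message seen in CI:
//   Expected
//       <bool>: false
//   to be true
// because a bare BeTrue() says nothing about which pods were not running.
func TestBareBeTrue(t *testing.T) {
	g := NewWithT(t)
	running := false // stands in for "all of the job's pods are running"
	g.Expect(running).To(BeTrue())
}
```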
Test seems to be new in [1]. Still dunno what it's about ;). [1]: https://github.com/kubernetes/kubernetes/pull/41271
I dunno that anyone is watching this component, but that looks like a k8s test which makes it the master team's domain. Moving to that component.
Took a stab at this in [1], but didn't bother to figure out how to test or even install Go 1.12 to see if it compiled ;). If someone who actually contributes upstream more often wants to close my PR and submit something else instead, that's fine with me :). [1]: https://github.com/kubernetes/kubernetes/pull/77716
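The gist of the change is just annotating that assertion so a failure names the job and the pods that were not running. A rough sketch of the idea (the podState type and the pod details here are hypothetical, not copied from the PR):

```go
package job_test

import (
	"fmt"
	"testing"

	. "github.com/onsi/gomega"
)

// podState is a stand-in for whatever per-pod detail the e2e framework can report.
type podState struct {
	Name  string
	Phase string
}

// TestAnnotatedBeTrue shows the shape of the improvement: the assertion carries
// the job name and observed pod states, so a failure is actionable for triagers.
func TestAnnotatedBeTrue(t *testing.T) {
	g := NewWithT(t)

	jobName := "foo"                                            // hypothetical job
	states := []podState{{Name: "foo-abc12", Phase: "Pending"}} // hypothetical pod detail
	running := false                                            // stands in for the framework's job-pod check

	g.Expect(running).To(BeTrue(),
		fmt.Sprintf("pods of job %q are not all running: %+v", jobName, states))
}
```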
See https://github.com/openshift/origin/pull/22809 which David opened to debug these issues. Still, Trevor's PR makes sense in the long run, so I left some comments and submitted fixes in https://github.com/wking/kubernetes/pull/1 to help get it merged. I'm going to close this issue in favour of bug 1706082, which is chasing the exact same problem. *** This bug has been marked as a duplicate of bug 1706082 ***
Since bug 1706082 is about *fixing* this issue, and my Kubernetes PR was just about better logging, is it worth backporting or otherwise cherry-picking that adjustment into openshift/origin? Or should we just wait until we bump to Kubernetes... v1.15? Looks like that's through a few alphas already, so maybe my PR is on track to land in v1.16 ;). So we'd probably need some sort of backport to get it into origin in the near future (and the patch is minor enough that it's probably not worth the trouble ;), unless there is an easier way I'm not aware of?
After not seeing it for a while, we saw this again yesterday (although now the line number has changed): fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:63]: Expected <bool>: false to be true while testing upgrades from 4.1.0 to the 4.1.2 candidate 4.1.0-0.nightly-2019-06-12-205455 [1] and from 4.1.1 to that nightly [2,3]. It's the only e2e failure we've seen in those upgrade tests, and overall we have 14 passes and 4 failures as a result. I think we should cherry-pick [5] back into release-4.1 to get better logging when this test fails, to make it easier to decide if this is something we need to fix or not. Thoughts? [1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2522 [2]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2530 [3]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2531 [4]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2537 [5]: https://github.com/kubernetes/kubernetes/pull/77716
At minimum open a PR and retest it a couple of times. Setting to high severity because we have no idea why these jobs disappear.
Backported the upstream patch to release-4.1: https://github.com/openshift/origin/pull/23161
Part of master-branch handling for this: https://github.com/openshift/machine-config-operator/pull/855
This is now this PR: https://github.com/openshift/machine-config-operator/pull/856
Using PR#856 as a basis for verification:

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-07-14-190751   True        False         103s    Cluster version is 4.1.0-0.nightly-2019-07-14-190751

#### check MCP for `spec.Configuration` and `status.Configuration`

$ oc get machineconfigpool/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2019-07-18T18:45:36Z"
  generation: 2
  name: worker
  resourceVersion: "14139"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  uid: 3af38c75-a98c-11e9-a580-02ce5d86741a
spec:
  configuration:
    name: rendered-worker-7805b68e9625f85f38ad07827c4a1600
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  maxUnavailable: null
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2019-07-18T18:46:04Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2019-07-18T18:46:08Z"
    message: ""
    reason: ""
    status: "False"
    type: NodeDegraded
  - lastTransitionTime: "2019-07-18T18:46:08Z"
    message: ""
    reason: ""
    status: "False"
    type: Degraded
  - lastTransitionTime: "2019-07-18T18:52:51Z"
    message: ""
    reason: All nodes are updated with rendered-worker-7805b68e9625f85f38ad07827c4a1600
    status: "True"
    type: Updated
  - lastTransitionTime: "2019-07-18T18:52:51Z"
    message: ""
    reason: ""
    status: "False"
    type: Updating
  configuration:
    name: rendered-worker-7805b68e9625f85f38ad07827c4a1600
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  degradedMachineCount: 0
  machineCount: 3
  observedGeneration: 2
  readyMachineCount: 3
  unavailableMachineCount: 0
  updatedMachineCount: 3

### Check that machine counts are accurate as upgrade happens

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-07-14-190751   True        True          25m     Working towards 4.1.0-0.nightly-2019-07-18-023612: 89% complete

$ oc get node
NAME                                         STATUS                     ROLES    AGE   VERSION
ip-10-0-132-146.us-west-2.compute.internal   Ready                      worker   39m   v1.13.4+c9e4f28ff
ip-10-0-137-94.us-west-2.compute.internal    Ready                      master   46m   v1.13.4+c9e4f28ff
ip-10-0-154-224.us-west-2.compute.internal   Ready                      master   46m   v1.13.4+c9e4f28ff
ip-10-0-156-154.us-west-2.compute.internal   Ready,SchedulingDisabled   worker   39m   v1.13.4+c9e4f28ff
ip-10-0-168-63.us-west-2.compute.internal    Ready                      worker   38m   v1.13.4+c9e4f28ff
ip-10-0-175-81.us-west-2.compute.internal    Ready,SchedulingDisabled   master   46m   v1.13.4+c9e4f28ff

$ oc describe machineconfigpool/master
Name:         master
Namespace:
Labels:       operator.machineconfiguration.openshift.io/required-for-upgrade=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-07-18T18:45:36Z
  Generation:          3
  Resource Version:    32062
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
  UID:                 3af23395-a98c-11e9-a580-02ce5d86741a
Spec:
  Configuration:
    Name:  rendered-master-e1a48829ad119e49efe18f2c08a95ba7
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-master
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-3af23395-a98c-11e9-a580-02ce5d86741a-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  master
  Max Unavailable:  <nil>
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/master:
  Paused:  false
Status:
  Conditions:
    Last Transition Time:  2019-07-18T18:46:04Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-07-18T18:46:08Z
    Message:
    Reason:
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2019-07-18T18:46:08Z
    Message:
    Reason:
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-07-18T19:28:31Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-07-18T19:28:31Z
    Message:
    Reason:                All nodes are updating to rendered-master-e1a48829ad119e49efe18f2c08a95ba7
    Status:                True
    Type:                  Updating
  Configuration:
    Name:  rendered-master-10713c0a9c1eabc9d0937f26b7e1acad
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-master
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-master-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-3af23395-a98c-11e9-a580-02ce5d86741a-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-master-ssh
  Degraded Machine Count:     0
  Machine Count:              3
  Observed Generation:        3
  Ready Machine Count:        0
  Unavailable Machine Count:  1
  Updated Machine Count:      0
Events:                       <none>

$ oc describe machineconfigpool/worker
Name:         worker
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2019-07-18T18:45:36Z
  Generation:          3
  Resource Version:    32009
  Self Link:           /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  UID:                 3af38c75-a98c-11e9-a580-02ce5d86741a
Spec:
  Configuration:
    Name:  rendered-worker-c6993aa563ef60c470e4d0d28ee2f321
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  worker
  Max Unavailable:  <nil>
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/worker:
  Paused:  false
Status:
  Conditions:
    Last Transition Time:  2019-07-18T18:46:04Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2019-07-18T18:46:08Z
    Message:
    Reason:
    Status:                False
    Type:                  NodeDegraded
    Last Transition Time:  2019-07-18T18:46:08Z
    Message:
    Reason:
    Status:                False
    Type:                  Degraded
    Last Transition Time:  2019-07-18T19:28:31Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2019-07-18T19:28:31Z
    Message:
    Reason:                All nodes are updating to rendered-worker-c6993aa563ef60c470e4d0d28ee2f321
    Status:                True
    Type:                  Updating
  Configuration:
    Name:  rendered-worker-7805b68e9625f85f38ad07827c4a1600
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Degraded Machine Count:     0
  Machine Count:              3
  Observed Generation:        3
  Ready Machine Count:        0
  Unavailable Machine Count:  1
  Updated Machine Count:      0
Events:                       <none>

### confirm upgrade was successful

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-07-18-023612   True        False         13s     Cluster version is 4.1.0-0.nightly-2019-07-18-023612

$ oc get machineconfigpool
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED
master   rendered-master-e1a48829ad119e49efe18f2c08a95ba7   True      False      False
worker   rendered-worker-c6993aa563ef60c470e4d0d28ee2f321   True      False      False

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-132-146.us-west-2.compute.internal   Ready    worker   50m   v1.13.4+c9e4f28ff
ip-10-0-137-94.us-west-2.compute.internal    Ready    master   57m   v1.13.4+c9e4f28ff
ip-10-0-154-224.us-west-2.compute.internal   Ready    master   58m   v1.13.4+c9e4f28ff
ip-10-0-156-154.us-west-2.compute.internal   Ready    worker   50m   v1.13.4+c9e4f28ff
ip-10-0-168-63.us-west-2.compute.internal    Ready    worker   50m   v1.13.4+c9e4f28ff
ip-10-0-175-81.us-west-2.compute.internal    Ready    master   58m   v1.13.4+c9e4f28ff
```

Additionally, there is no sign of the original error message from comment #1 in the last 14 days of CI tests - https://ci-search-ci-search-next.svc.ci.openshift.org/?search=fail+%5Bk8s.io%2Fkubernetes%2Ftest%2Fe2e%2Fupgrades%2Fapps%2Fjob.go%3A63%5D%3A+Expected&maxAge=336h&context=2&type=all
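For anyone re-running this verification, the spec/status comparison above can also be scripted rather than eyeballed. A rough sketch using client-go's dynamic client (the kubeconfig path, the Get-with-context signature of a recent client-go, and the convergence criterion are my assumptions, not anything taken from the PRs):

```go
package main

import (
	"context"
	"fmt"
	"log"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig at the default location (~/.kube/config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	// MachineConfigPool is cluster-scoped, so no namespace is needed.
	gvr := schema.GroupVersionResource{
		Group:    "machineconfiguration.openshift.io",
		Version:  "v1",
		Resource: "machineconfigpools",
	}

	for _, pool := range []string{"master", "worker"} {
		obj, err := client.Resource(gvr).Get(context.TODO(), pool, metav1.GetOptions{})
		if err != nil {
			log.Fatal(err)
		}
		specName, _, _ := unstructured.NestedString(obj.Object, "spec", "configuration", "name")
		statusName, _, _ := unstructured.NestedString(obj.Object, "status", "configuration", "name")
		total, _, _ := unstructured.NestedInt64(obj.Object, "status", "machineCount")
		updated, _, _ := unstructured.NestedInt64(obj.Object, "status", "updatedMachineCount")
		fmt.Printf("%s: spec=%s status=%s updated=%d/%d converged=%v\n",
			pool, specName, statusName, updated, total,
			specName == statusName && updated == total)
	}
}
```

Once a pool finishes rolling out, spec.configuration.name and status.configuration.name report the same rendered config and updatedMachineCount matches machineCount, which is what the Updated condition in the output above reflects.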
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:1809