Bug 1708454
| Summary: | Machine-config operator reports upgrade success before nodes complete | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | lserven |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED ERRATA | QA Contact: | Micah Abbott <miabbott> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | aos-bugs, ccoleman, jokerman, maszulik, mmccomas, sponnaga, wking |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.1.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | buildcop 4.1.7 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-07-25 05:32:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
14 in the last 24 hours per [1]. Click on the dots for job-detail pages, from which you can drill down into the gathered artifacts.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=-e2e-aws-upgrade&search=k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58

Source is [1]. I guess some job pod is not running, but it would be nice to know which.

[1]: https://github.com/openshift/origin/blob/d4de9989b20b594b4bdf208e4b4de4652cf0fa80/vendor/k8s.io/kubernetes/test/e2e/upgrades/apps/job.go#L58

The test seems to be new in [1]. Still dunno what it's about ;).

[1]: https://github.com/kubernetes/kubernetes/pull/41271

I dunno that anyone is watching this component, but that looks like a k8s test, which makes it the master team's domain. Moving to that component.

Took a stab at this in [1], but didn't bother to figure out how to test it, or even to install Go 1.12 to see if it compiled ;). If someone who actually contributes upstream more often wants to close my PR and submit something else instead, that's fine with me :).

[1]: https://github.com/kubernetes/kubernetes/pull/77716

See https://github.com/openshift/origin/pull/22809, which David opened to debug these issues. Still, Trevor's PR makes sense in the long run, so I left some comments and submitted the fixes in https://github.com/wking/kubernetes/pull/1 to help get it merged. I'm going to close this issue in favour of bug 1706082, which is chasing the exact same problem.

*** This bug has been marked as a duplicate of bug 1706082 ***

Since bug 1706082 is about *fixing* this issue, and my Kubernetes PR was just about better logging, is it worth backporting or otherwise cherry-picking that adjustment into openshift/origin? Or just wait until we bump to Kubernetes... v1.15? Looks like that's already through a few alphas, so maybe my PR is on track to land in v1.16 ;). So we'd probably need some sort of backport to get it into origin in the near future (though the patch is insignificant enough that it's probably not worth the trouble ;), unless there is an easier way I'm not aware of?

After not seeing it for a while, we saw this again yesterday (although now the line number has changed):
fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:63]: Expected
<bool>: false
to be true
while testing upgrades from 4.1.0 to the 4.1.2 candidate 4.1.0-0.nightly-2019-06-12-205455 [1] and from 4.1.1 to that nightly [2,3]. It's the only e2e failure we've seen in those upgrade tests; overall we have 14 passes and 4 failures. I think we should cherry-pick [5] back into release-4.1 to get better logging when this test fails (see the sketch after the links below), to make it easier to decide whether this is something we need to fix. Thoughts?
[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2522
[2]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2530
[3]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2531
[4]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2537
[5]: https://github.com/kubernetes/kubernetes/pull/77716
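For illustration only, and not the actual content of [5]: a hypothetical helper of the kind that "better logging" implies here, summarizing the pods that belong to the Job under test so a failure can report pod names and phases instead of a bare `Expected <bool>: false to be true`. The helper name `describeJobPods` is made up; it assumes a recent context-aware client-go and relies on the standard `job-name` label that the Job controller puts on its pods.

```go
// Hypothetical sketch, not the code from the upstream PR.
package e2edebug

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// describeJobPods lists the pods labeled with the Job's name (the Job
// controller sets the "job-name" label) and returns one line per pod with
// its phase and reason, which is far more useful in a failure message than
// a bare boolean assertion.
func describeJobPods(ctx context.Context, c kubernetes.Interface, ns, jobName string) (string, error) {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "job-name=" + jobName,
	})
	if err != nil {
		return "", err
	}
	if len(pods.Items) == 0 {
		return fmt.Sprintf("no pods found for job %s/%s", ns, jobName), nil
	}
	lines := make([]string, 0, len(pods.Items))
	for _, p := range pods.Items {
		lines = append(lines, fmt.Sprintf("pod %s: phase=%s reason=%q", p.Name, p.Status.Phase, p.Status.Reason))
	}
	return strings.Join(lines, "\n"), nil
}
```

The test could then fail with this summary attached (e.g. via the e2e framework's `Failf`) rather than only asserting a boolean.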
At minimum, open a PR and retest it a couple of times.

Setting to high severity because we have no idea why these jobs disappear.

Backported the upstream patch to release-4.1: https://github.com/openshift/origin/pull/23161

Part of the master-branch handling for this: https://github.com/openshift/machine-config-operator/pull/855

This is now this PR: https://github.com/openshift/machine-config-operator/pull/856

Using PR#856 as a basis for verification.
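For context before the transcript, a minimal sketch of the condition being verified, my reading of the fix rather than the actual code from PR#855/PR#856: a pool should only count as upgraded once its `status.configuration` has caught up with `spec.configuration` and every machine is updated, ready, and available. Field names follow the mcfgv1 `MachineConfigPool` type as reflected in the YAML below; the import path and the helper `poolFullyUpgraded` are assumptions for illustration.

```go
// A reading of the fix, not the actual MCO code.
package main

import (
	"fmt"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// poolFullyUpgraded reports whether the pool's current (status) configuration
// matches the desired (spec) configuration and all machines have rolled.
func poolFullyUpgraded(pool *mcfgv1.MachineConfigPool) bool {
	desired := pool.Spec.Configuration.Name
	current := pool.Status.Configuration.Name
	if desired == "" || current != desired {
		// Nothing rendered yet, or nodes are still converging to the new
		// rendered config; keep reporting progress, not success.
		return false
	}
	return pool.Status.UpdatedMachineCount == pool.Status.MachineCount &&
		pool.Status.ReadyMachineCount == pool.Status.MachineCount &&
		pool.Status.UnavailableMachineCount == 0 &&
		pool.Status.DegradedMachineCount == 0
}

func main() {
	// Mirrors the steady-state worker pool in the transcript below: 3/3
	// machines updated and ready on the same rendered config in spec and status.
	pool := &mcfgv1.MachineConfigPool{}
	pool.Spec.Configuration.Name = "rendered-worker-7805b68e9625f85f38ad07827c4a1600"
	pool.Status.Configuration.Name = "rendered-worker-7805b68e9625f85f38ad07827c4a1600"
	pool.Status.MachineCount = 3
	pool.Status.UpdatedMachineCount = 3
	pool.Status.ReadyMachineCount = 3
	fmt.Println(poolFullyUpgraded(pool)) // true
}
```

Anything short of that, such as the mid-upgrade state later in the transcript where `Updated Machine Count: 0`, should keep the operator reporting progress rather than success.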
```
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.0-0.nightly-2019-07-14-190751 True False 103s Cluster version is 4.1.0-0.nightly-2019-07-14-190751
### check MCP for `spec.Configuration` and `status.Configuration`
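# (Hypothetical convenience, not part of the original verification run.)
# The two fields compared below can also be pulled out directly; once the
# pool has settled they should name the same rendered config:
$ oc get machineconfigpool/worker -o jsonpath='{.spec.configuration.name}{"\n"}{.status.configuration.name}{"\n"}'
# (output omitted; see the full YAML below)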
$ oc get machineconfigpool/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
creationTimestamp: "2019-07-18T18:45:36Z"
generation: 2
name: worker
resourceVersion: "14139"
selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
uid: 3af38c75-a98c-11e9-a580-02ce5d86741a
spec:
configuration:
name: rendered-worker-7805b68e9625f85f38ad07827c4a1600
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
machineConfigSelector:
matchLabels:
machineconfiguration.openshift.io/role: worker
maxUnavailable: null
nodeSelector:
matchLabels:
node-role.kubernetes.io/worker: ""
paused: false
status:
conditions:
- lastTransitionTime: "2019-07-18T18:46:04Z"
message: ""
reason: ""
status: "False"
type: RenderDegraded
- lastTransitionTime: "2019-07-18T18:46:08Z"
message: ""
reason: ""
status: "False"
type: NodeDegraded
- lastTransitionTime: "2019-07-18T18:46:08Z"
message: ""
reason: ""
status: "False"
type: Degraded
- lastTransitionTime: "2019-07-18T18:52:51Z"
message: ""
reason: All nodes are updated with rendered-worker-7805b68e9625f85f38ad07827c4a1600
status: "True"
type: Updated
- lastTransitionTime: "2019-07-18T18:52:51Z"
message: ""
reason: ""
status: "False"
type: Updating
configuration:
name: rendered-worker-7805b68e9625f85f38ad07827c4a1600
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
degradedMachineCount: 0
machineCount: 3
observedGeneration: 2
readyMachineCount: 3
unavailableMachineCount: 0
updatedMachineCount: 3
### Check that machine counts are accurate as upgrade happens
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.0-0.nightly-2019-07-14-190751 True True 25m Working towards 4.1.0-0.nightly-2019-07-18-023612: 89% complete
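# (Hypothetical convenience, not part of the original verification run.)
# The raw counts can be polled while nodes roll, e.g.:
$ oc get machineconfigpool worker -o jsonpath='{.status.machineCount} {.status.updatedMachineCount} {.status.readyMachineCount} {.status.unavailableMachineCount}{"\n"}'
# (output omitted; see the describe output below)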
$ oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-132-146.us-west-2.compute.internal Ready worker 39m v1.13.4+c9e4f28ff
ip-10-0-137-94.us-west-2.compute.internal Ready master 46m v1.13.4+c9e4f28ff
ip-10-0-154-224.us-west-2.compute.internal Ready master 46m v1.13.4+c9e4f28ff
ip-10-0-156-154.us-west-2.compute.internal Ready,SchedulingDisabled worker 39m v1.13.4+c9e4f28ff
ip-10-0-168-63.us-west-2.compute.internal Ready worker 38m v1.13.4+c9e4f28ff
ip-10-0-175-81.us-west-2.compute.internal Ready,SchedulingDisabled master 46m v1.13.4+c9e4f28ff
$ oc describe machineconfigpool/master
Name: master
Namespace:
Labels: operator.machineconfiguration.openshift.io/required-for-upgrade=
Annotations: <none>
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfigPool
Metadata:
Creation Timestamp: 2019-07-18T18:45:36Z
Generation: 3
Resource Version: 32062
Self Link: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
UID: 3af23395-a98c-11e9-a580-02ce5d86741a
Spec:
Configuration:
Name: rendered-master-e1a48829ad119e49efe18f2c08a95ba7
Source:
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 00-master
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-master-container-runtime
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-master-kubelet
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-master-3af23395-a98c-11e9-a580-02ce5d86741a-registries
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-master-ssh
Machine Config Selector:
Match Labels:
machineconfiguration.openshift.io/role: master
Max Unavailable: <nil>
Node Selector:
Match Labels:
node-role.kubernetes.io/master:
Paused: false
Status:
Conditions:
Last Transition Time: 2019-07-18T18:46:04Z
Message:
Reason:
Status: False
Type: RenderDegraded
Last Transition Time: 2019-07-18T18:46:08Z
Message:
Reason:
Status: False
Type: NodeDegraded
Last Transition Time: 2019-07-18T18:46:08Z
Message:
Reason:
Status: False
Type: Degraded
Last Transition Time: 2019-07-18T19:28:31Z
Message:
Reason:
Status: False
Type: Updated
Last Transition Time: 2019-07-18T19:28:31Z
Message:
Reason: All nodes are updating to rendered-master-e1a48829ad119e49efe18f2c08a95ba7
Status: True
Type: Updating
Configuration:
Name: rendered-master-10713c0a9c1eabc9d0937f26b7e1acad
Source:
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 00-master
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-master-container-runtime
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-master-kubelet
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-master-3af23395-a98c-11e9-a580-02ce5d86741a-registries
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-master-ssh
Degraded Machine Count: 0
Machine Count: 3
Observed Generation: 3
Ready Machine Count: 0
Unavailable Machine Count: 1
Updated Machine Count: 0
Events: <none>
$ oc describe machineconfigpool/worker
Name: worker
Namespace:
Labels: <none>
Annotations: <none>
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfigPool
Metadata:
Creation Timestamp: 2019-07-18T18:45:36Z
Generation: 3
Resource Version: 32009
Self Link: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
UID: 3af38c75-a98c-11e9-a580-02ce5d86741a
Spec:
Configuration:
Name: rendered-worker-c6993aa563ef60c470e4d0d28ee2f321
Source:
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 00-worker
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-container-runtime
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-kubelet
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-ssh
Machine Config Selector:
Match Labels:
machineconfiguration.openshift.io/role: worker
Max Unavailable: <nil>
Node Selector:
Match Labels:
node-role.kubernetes.io/worker:
Paused: false
Status:
Conditions:
Last Transition Time: 2019-07-18T18:46:04Z
Message:
Reason:
Status: False
Type: RenderDegraded
Last Transition Time: 2019-07-18T18:46:08Z
Message:
Reason:
Status: False
Type: NodeDegraded
Last Transition Time: 2019-07-18T18:46:08Z
Message:
Reason:
Status: False
Type: Degraded
Last Transition Time: 2019-07-18T19:28:31Z
Message:
Reason:
Status: False
Type: Updated
Last Transition Time: 2019-07-18T19:28:31Z
Message:
Reason: All nodes are updating to rendered-worker-c6993aa563ef60c470e4d0d28ee2f321
Status: True
Type: Updating
Configuration:
Name: rendered-worker-7805b68e9625f85f38ad07827c4a1600
Source:
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 00-worker
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-container-runtime
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-kubelet
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-ssh
Degraded Machine Count: 0
Machine Count: 3
Observed Generation: 3
Ready Machine Count: 0
Unavailable Machine Count: 1
Updated Machine Count: 0
Events: <none>
### confirm upgrade was successful
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.0-0.nightly-2019-07-18-023612 True False 13s Cluster version is 4.1.0-0.nightly-2019-07-18-023612
$ oc get machineconfigpool
NAME CONFIG UPDATED UPDATING DEGRADED
master rendered-master-e1a48829ad119e49efe18f2c08a95ba7 True False False
worker rendered-worker-c6993aa563ef60c470e4d0d28ee2f321 True False False
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-132-146.us-west-2.compute.internal Ready worker 50m v1.13.4+c9e4f28ff
ip-10-0-137-94.us-west-2.compute.internal Ready master 57m v1.13.4+c9e4f28ff
ip-10-0-154-224.us-west-2.compute.internal Ready master 58m v1.13.4+c9e4f28ff
ip-10-0-156-154.us-west-2.compute.internal Ready worker 50m v1.13.4+c9e4f28ff
ip-10-0-168-63.us-west-2.compute.internal Ready worker 50m v1.13.4+c9e4f28ff
ip-10-0-175-81.us-west-2.compute.internal Ready master 58m v1.13.4+c9e4f28ff
```
Additionally, there is no sign of the original error message from comment #1 in the last 14 days of CI tests: https://ci-search-ci-search-next.svc.ci.openshift.org/?search=fail+%5Bk8s.io%2Fkubernetes%2Ftest%2Fe2e%2Fupgrades%2Fapps%2Fjob.go%3A63%5D%3A+Expected&maxAge=336h&context=2&type=all
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1809
Description of problem:

Occasionally we are seeing e2e-aws-upgrade tests flake and return a very unhelpful error message:

fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58]: Expected
<bool>: false
to be true

Independently of debugging the actual root cause, it would be beneficial to expose better context in this error, so triagers can tell whether the same component is triggering it or whether they are hitting the message for different reasons. IMO this bug can be marked as fixed once the message is improved; next time it comes up with more context we can debug further.

How reproducible:

8 times in the last 12 hours

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1434/build-log.txt