Bug 1708454
| Summary: | Machine-config operator reports upgrade success before nodes complete | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | lserven |
| Component: | Machine Config Operator | Assignee: | Antonio Murdaca <amurdaca> |
| Status: | CLOSED ERRATA | QA Contact: | Micah Abbott <miabbott> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.1.0 | CC: | aos-bugs, ccoleman, jokerman, maszulik, mmccomas, sponnaga, wking |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | 4.1.z | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | buildcop 4.1.7 | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-07-25 05:32:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
14 in the last 24 hours per [1]. Click on the dots for job-detail pages, from which you can drill down into the gathered artifacts.

[1]: https://ci-search-ci-search-next.svc.ci.openshift.org/chart?name=-e2e-aws-upgrade&search=k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58

Source is [1]. I guess some job pod is not running, but it would be nice to know which.

[1]: https://github.com/openshift/origin/blob/d4de9989b20b594b4bdf208e4b4de4652cf0fa80/vendor/k8s.io/kubernetes/test/e2e/upgrades/apps/job.go#L58

The test seems to be new in [1]. Still dunno what it's about ;).

[1]: https://github.com/kubernetes/kubernetes/pull/41271

I dunno that anyone is watching this component, but that looks like a k8s test, which makes it the master team's domain. Moving to that component.

Took a stab at this in [1], but didn't bother to figure out how to test it, or even to install Go 1.12 to see if it compiled ;). If someone who actually contributes upstream more often wants to close my PR and submit something else instead, that's fine with me :).

[1]: https://github.com/kubernetes/kubernetes/pull/77716

See https://github.com/openshift/origin/pull/22809, which David opened to debug these issues. Still, Trevor's PR makes sense in the long run, so I left some comments and submitted the fixes in https://github.com/wking/kubernetes/pull/1 to help get it merged. I'm going to close this issue in favour of bug 1706082, which is chasing the exact same problem.

*** This bug has been marked as a duplicate of bug 1706082 ***

Since bug 1706082 is about *fixing* this issue, and my Kubernetes PR was just about better logging, is it worth backporting or otherwise cherry-picking that adjustment into openshift/origin? Or just wait until we bump to Kubernetes... v1.15? Looks like that's already through a few alphas, so maybe my PR is on track to land in v1.16 ;). So we'd probably need some sort of backport to get it into origin in the near future (though the patch is insignificant enough that it's probably not worth the trouble ;), unless there is an easier way I'm not aware of?

After not seeing it for a while, we saw this again yesterday (although now the line number has changed):
fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:63]: Expected
<bool>: false
to be true
while testing upgrades from 4.1.0 to the 4.1.2 candidate 4.1.0-0.nightly-2019-06-12-205455 [1] and from 4.1.1 to that nightly [2,3]. It's the only e2e failure we've seen in those upgrade tests; overall we have 14 passes and 4 failures. I think we should cherry-pick [5] back into release-4.1 to get better logging when this test fails (see the sketch after the links below), to make it easier to decide whether this is something we need to fix. Thoughts?
[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2522
[2]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2530
[3]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2531
[4]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2537
[5]: https://github.com/kubernetes/kubernetes/pull/77716
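For illustration only, and not the actual content of [5]: a hypothetical helper of the kind that "better logging" implies here, summarizing the pods that belong to the Job under test so a failure can report pod names and phases instead of a bare `Expected <bool>: false to be true`. The helper name `describeJobPods` is made up; it assumes a recent context-aware client-go and relies on the standard `job-name` label that the Job controller puts on its pods.

```go
// Hypothetical sketch, not the code from the upstream PR.
package e2edebug

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// describeJobPods lists the pods labeled with the Job's name (the Job
// controller sets the "job-name" label) and returns one line per pod with
// its phase and reason, which is far more useful in a failure message than
// a bare boolean assertion.
func describeJobPods(ctx context.Context, c kubernetes.Interface, ns, jobName string) (string, error) {
	pods, err := c.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{
		LabelSelector: "job-name=" + jobName,
	})
	if err != nil {
		return "", err
	}
	if len(pods.Items) == 0 {
		return fmt.Sprintf("no pods found for job %s/%s", ns, jobName), nil
	}
	lines := make([]string, 0, len(pods.Items))
	for _, p := range pods.Items {
		lines = append(lines, fmt.Sprintf("pod %s: phase=%s reason=%q", p.Name, p.Status.Phase, p.Status.Reason))
	}
	return strings.Join(lines, "\n"), nil
}
```

The test could then fail with this summary attached (e.g. via the e2e framework's `Failf`) rather than only asserting a boolean.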
At minimum, open a PR and retest it a couple of times.

Setting to high severity because we have no idea why these jobs disappear.

Backported the upstream patch to release-4.1: https://github.com/openshift/origin/pull/23161

Part of the master-branch handling for this: https://github.com/openshift/machine-config-operator/pull/855

This is now this PR: https://github.com/openshift/machine-config-operator/pull/856

Using PR#856 as a basis for verification.
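For context before the transcript, a minimal sketch of the condition being verified, my reading of the fix rather than the actual code from PR#855/PR#856: a pool should only count as upgraded once its `status.configuration` has caught up with `spec.configuration` and every machine is updated, ready, and available. Field names follow the mcfgv1 `MachineConfigPool` type as reflected in the YAML below; the import path and the helper `poolFullyUpgraded` are assumptions for illustration.

```go
// A reading of the fix, not the actual MCO code.
package main

import (
	"fmt"

	mcfgv1 "github.com/openshift/machine-config-operator/pkg/apis/machineconfiguration.openshift.io/v1"
)

// poolFullyUpgraded reports whether the pool's current (status) configuration
// matches the desired (spec) configuration and all machines have rolled.
func poolFullyUpgraded(pool *mcfgv1.MachineConfigPool) bool {
	desired := pool.Spec.Configuration.Name
	current := pool.Status.Configuration.Name
	if desired == "" || current != desired {
		// Nothing rendered yet, or nodes are still converging to the new
		// rendered config; keep reporting progress, not success.
		return false
	}
	return pool.Status.UpdatedMachineCount == pool.Status.MachineCount &&
		pool.Status.ReadyMachineCount == pool.Status.MachineCount &&
		pool.Status.UnavailableMachineCount == 0 &&
		pool.Status.DegradedMachineCount == 0
}

func main() {
	// Mirrors the steady-state worker pool in the transcript below: 3/3
	// machines updated and ready on the same rendered config in spec and status.
	pool := &mcfgv1.MachineConfigPool{}
	pool.Spec.Configuration.Name = "rendered-worker-7805b68e9625f85f38ad07827c4a1600"
	pool.Status.Configuration.Name = "rendered-worker-7805b68e9625f85f38ad07827c4a1600"
	pool.Status.MachineCount = 3
	pool.Status.UpdatedMachineCount = 3
	pool.Status.ReadyMachineCount = 3
	fmt.Println(poolFullyUpgraded(pool)) // true
}
```

Anything short of that, such as the mid-upgrade state later in the transcript where `Updated Machine Count: 0`, should keep the operator reporting progress rather than success.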
```
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.0-0.nightly-2019-07-14-190751 True False 103s Cluster version is 4.1.0-0.nightly-2019-07-14-190751
### check MCP for `spec.Configuration` and `status.Configuration`
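# (Hypothetical convenience, not part of the original verification run.)
# The two fields compared below can also be pulled out directly; once the
# pool has settled they should name the same rendered config:
$ oc get machineconfigpool/worker -o jsonpath='{.spec.configuration.name}{"\n"}{.status.configuration.name}{"\n"}'
# (output omitted; see the full YAML below)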
$ oc get machineconfigpool/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
creationTimestamp: "2019-07-18T18:45:36Z"
generation: 2
name: worker
resourceVersion: "14139"
selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
uid: 3af38c75-a98c-11e9-a580-02ce5d86741a
spec:
configuration:
name: rendered-worker-7805b68e9625f85f38ad07827c4a1600
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
machineConfigSelector:
matchLabels:
machineconfiguration.openshift.io/role: worker
maxUnavailable: null
nodeSelector:
matchLabels:
node-role.kubernetes.io/worker: ""
paused: false
status:
conditions:
- lastTransitionTime: "2019-07-18T18:46:04Z"
message: ""
reason: ""
status: "False"
type: RenderDegraded
- lastTransitionTime: "2019-07-18T18:46:08Z"
message: ""
reason: ""
status: "False"
type: NodeDegraded
- lastTransitionTime: "2019-07-18T18:46:08Z"
message: ""
reason: ""
status: "False"
type: Degraded
- lastTransitionTime: "2019-07-18T18:52:51Z"
message: ""
reason: All nodes are updated with rendered-worker-7805b68e9625f85f38ad07827c4a1600
status: "True"
type: Updated
- lastTransitionTime: "2019-07-18T18:52:51Z"
message: ""
reason: ""
status: "False"
type: Updating
configuration:
name: rendered-worker-7805b68e9625f85f38ad07827c4a1600
source:
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 00-worker
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-container-runtime
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 01-worker-kubelet
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
- apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
name: 99-worker-ssh
degradedMachineCount: 0
machineCount: 3
observedGeneration: 2
readyMachineCount: 3
unavailableMachineCount: 0
updatedMachineCount: 3
### Check that machine counts are accurate as upgrade happens
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.0-0.nightly-2019-07-14-190751 True True 25m Working towards 4.1.0-0.nightly-2019-07-18-023612: 89% complete
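# (Hypothetical convenience, not part of the original verification run.)
# The raw counts can be polled while nodes roll, e.g.:
$ oc get machineconfigpool worker -o jsonpath='{.status.machineCount} {.status.updatedMachineCount} {.status.readyMachineCount} {.status.unavailableMachineCount}{"\n"}'
# (output omitted; see the describe output below)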
$ oc get node
NAME STATUS ROLES AGE VERSION
ip-10-0-132-146.us-west-2.compute.internal Ready worker 39m v1.13.4+c9e4f28ff
ip-10-0-137-94.us-west-2.compute.internal Ready master 46m v1.13.4+c9e4f28ff
ip-10-0-154-224.us-west-2.compute.internal Ready master 46m v1.13.4+c9e4f28ff
ip-10-0-156-154.us-west-2.compute.internal Ready,SchedulingDisabled worker 39m v1.13.4+c9e4f28ff
ip-10-0-168-63.us-west-2.compute.internal Ready worker 38m v1.13.4+c9e4f28ff
ip-10-0-175-81.us-west-2.compute.internal Ready,SchedulingDisabled master 46m v1.13.4+c9e4f28ff
$ oc describe machineconfigpool/master
Name: master
Namespace:
Labels: operator.machineconfiguration.openshift.io/required-for-upgrade=
Annotations: <none>
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfigPool
Metadata:
Creation Timestamp: 2019-07-18T18:45:36Z
Generation: 3
Resource Version: 32062
Self Link: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/master
UID: 3af23395-a98c-11e9-a580-02ce5d86741a
Spec:
Configuration:
Name: rendered-master-e1a48829ad119e49efe18f2c08a95ba7
Source:
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 00-master
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-master-container-runtime
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-master-kubelet
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-master-3af23395-a98c-11e9-a580-02ce5d86741a-registries
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-master-ssh
Machine Config Selector:
Match Labels:
machineconfiguration.openshift.io/role: master
Max Unavailable: <nil>
Node Selector:
Match Labels:
node-role.kubernetes.io/master:
Paused: false
Status:
Conditions:
Last Transition Time: 2019-07-18T18:46:04Z
Message:
Reason:
Status: False
Type: RenderDegraded
Last Transition Time: 2019-07-18T18:46:08Z
Message:
Reason:
Status: False
Type: NodeDegraded
Last Transition Time: 2019-07-18T18:46:08Z
Message:
Reason:
Status: False
Type: Degraded
Last Transition Time: 2019-07-18T19:28:31Z
Message:
Reason:
Status: False
Type: Updated
Last Transition Time: 2019-07-18T19:28:31Z
Message:
Reason: All nodes are updating to rendered-master-e1a48829ad119e49efe18f2c08a95ba7
Status: True
Type: Updating
Configuration:
Name: rendered-master-10713c0a9c1eabc9d0937f26b7e1acad
Source:
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 00-master
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-master-container-runtime
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-master-kubelet
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-master-3af23395-a98c-11e9-a580-02ce5d86741a-registries
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-master-ssh
Degraded Machine Count: 0
Machine Count: 3
Observed Generation: 3
Ready Machine Count: 0
Unavailable Machine Count: 1
Updated Machine Count: 0
Events: <none>
$ oc describe machineconfigpool/worker
Name: worker
Namespace:
Labels: <none>
Annotations: <none>
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfigPool
Metadata:
Creation Timestamp: 2019-07-18T18:45:36Z
Generation: 3
Resource Version: 32009
Self Link: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
UID: 3af38c75-a98c-11e9-a580-02ce5d86741a
Spec:
Configuration:
Name: rendered-worker-c6993aa563ef60c470e4d0d28ee2f321
Source:
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 00-worker
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-container-runtime
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-kubelet
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-ssh
Machine Config Selector:
Match Labels:
machineconfiguration.openshift.io/role: worker
Max Unavailable: <nil>
Node Selector:
Match Labels:
node-role.kubernetes.io/worker:
Paused: false
Status:
Conditions:
Last Transition Time: 2019-07-18T18:46:04Z
Message:
Reason:
Status: False
Type: RenderDegraded
Last Transition Time: 2019-07-18T18:46:08Z
Message:
Reason:
Status: False
Type: NodeDegraded
Last Transition Time: 2019-07-18T18:46:08Z
Message:
Reason:
Status: False
Type: Degraded
Last Transition Time: 2019-07-18T19:28:31Z
Message:
Reason:
Status: False
Type: Updated
Last Transition Time: 2019-07-18T19:28:31Z
Message:
Reason: All nodes are updating to rendered-worker-c6993aa563ef60c470e4d0d28ee2f321
Status: True
Type: Updating
Configuration:
Name: rendered-worker-7805b68e9625f85f38ad07827c4a1600
Source:
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 00-worker
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-container-runtime
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-kubelet
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-3af38c75-a98c-11e9-a580-02ce5d86741a-registries
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-ssh
Degraded Machine Count: 0
Machine Count: 3
Observed Generation: 3
Ready Machine Count: 0
Unavailable Machine Count: 1
Updated Machine Count: 0
Events: <none>
### confirm upgrade was successful
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.1.0-0.nightly-2019-07-18-023612 True False 13s Cluster version is 4.1.0-0.nightly-2019-07-18-023612
$ oc get machineconfigpool
NAME CONFIG UPDATED UPDATING DEGRADED
master rendered-master-e1a48829ad119e49efe18f2c08a95ba7 True False False
worker rendered-worker-c6993aa563ef60c470e4d0d28ee2f321 True False False
$ oc get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-132-146.us-west-2.compute.internal Ready worker 50m v1.13.4+c9e4f28ff
ip-10-0-137-94.us-west-2.compute.internal Ready master 57m v1.13.4+c9e4f28ff
ip-10-0-154-224.us-west-2.compute.internal Ready master 58m v1.13.4+c9e4f28ff
ip-10-0-156-154.us-west-2.compute.internal Ready worker 50m v1.13.4+c9e4f28ff
ip-10-0-168-63.us-west-2.compute.internal Ready worker 50m v1.13.4+c9e4f28ff
ip-10-0-175-81.us-west-2.compute.internal Ready master 58m v1.13.4+c9e4f28ff
```
Additionally, there is no sign of the original error message from comment #1 in the last 14 days of CI tests: https://ci-search-ci-search-next.svc.ci.openshift.org/?search=fail+%5Bk8s.io%2Fkubernetes%2Ftest%2Fe2e%2Fupgrades%2Fapps%2Fjob.go%3A63%5D%3A+Expected&maxAge=336h&context=2&type=all
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:1809
Description of problem:

Occasionally we are seeing e2e-aws-upgrade tests flake and return a very unhelpful error message:

fail [k8s.io/kubernetes/test/e2e/upgrades/apps/job.go:58]: Expected
<bool>: false
to be true

Independently of debugging the actual root cause, it would be beneficial to expose better context in this error, so triagers can tell whether the same component is triggering it or whether they are hitting the message for different reasons. IMO this bug can be marked as fixed once the message is improved; next time it comes up with more context we can debug further.

How reproducible:

8 times in the last 12 hours

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/1434/build-log.txt