Bug 1751471 - [osp] Update a worker machine, machine-controller output "master inplace update failed"
Summary: [osp] Update a worker machine, machine-controller output "master inplace upda...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Pierre Prinetti
QA Contact: David Sanz
URL:
Whiteboard: osp
Depends On:
Blocks:
 
Reported: 2019-09-12 05:20 UTC by sunzhaohua
Modified: 2020-05-04 11:14 UTC (History)
CC: 10 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The machine update logic could not identify worker machines. Consequence: Updating a machine always failed, because the master update path is not implemented, even when the updated machine was a worker. Fix: Added proper worker recognition to the update logic. Result: A worker machine can now be updated with a new spec.
Clone Of:
Environment:
Last Closed: 2020-05-04 11:13:32 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Github openshift cluster-api-provider-openstack pull 77 None closed Bug 1751471: Apply update to workers selectively 2020-05-04 02:50:03 UTC
Github openshift cluster-api-provider-openstack pull 82 None closed Bug 1751471: Machine update: create before delete 2020-05-04 02:50:03 UTC
Red Hat Product Errata RHBA-2020:0581 None None None 2020-05-04 11:14:08 UTC

Description sunzhaohua 2019-09-12 05:20:06 UTC
Description of problem:
When updating a worker machine, the machine-controller logs "master inplace update failed: not support master in place update now".
Checking the machine event, the message reads "Updated machine zhsun-8p6x5-worker-aaa%!(EXTRA string=Update)"; the format parameters and arguments appear to be mismatched.

Version-Release number of selected component (if applicable):
$ ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.0-0.nightly-2019-09-11-202233   True        False         64m     Cluster version is 4.2.0-0.nightly-2019-09-11-202233

How reproducible:
Always

Steps to Reproduce:
1. Update a machine's Flavor to "ci.m1.xlarge-invalid"
2. Check machine-controller logs
3. Check machine event

Actual results:

$ ./oc logs -f machine-api-controllers-56756968cf-tjf9r -c machine-controller
I0912 02:07:15.491373       1 controller.go:238] Reconciling machine "zhsun-8p6x5-worker-aaa" triggers idempotent update
I0912 02:15:19.342174       1 controller.go:129] Reconciling Machine "zhsun-8p6x5-worker-aaa"
I0912 02:15:19.342347       1 controller.go:298] Machine "zhsun-8p6x5-worker-aaa" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0912 02:15:19.896148       1 controller.go:238] Reconciling machine "zhsun-8p6x5-worker-aaa" triggers idempotent update
E0912 02:15:19.896390       1 actuator.go:368] master inplace update failed: not support master in place update now

$ ./oc describe machine zhsun-8p6x5-worker-aaa
Events:
  Type    Reason   Age               From                  Message
  ----    ------   ----              ----                  -------
  Normal  Created  22m               openstack_controller  Created Machine zhsun-8p6x5-worker-aaa
  Normal  Updated  8s (x2 over 10m)  openstack_controller  Updated machine zhsun-8p6x5-worker-aaa%!(EXTRA string=Update)


Expected results:
Output correct information.

Additional info:

Comment 2 Mike Fedosin 2019-09-16 12:11:05 UTC
We decided to move it to 4.3, since we can't fix the bug properly without the machine-healthcheck-controller enabled.
Related to: https://bugzilla.redhat.com/show_bug.cgi?id=1746369

Comment 4 Milind Yadav 2020-02-03 10:12:23 UTC
Description:
When trying to update a worker machine with an invalid flavor, the machine gets deleted.


Version :
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-02-03-005212   True        False         3h48m   Cluster version is 4.4.0-0.nightly-2020-02-03-005212



How reproducible:
Always

Steps to Reproduce:
1. Update a machine's Flavor to "ci.m1.xlarge-invalid"
2. Check machine-controller logs
3. Check machine event

  Phase:   Failed
Events:
  Type    Reason   Age    From                  Message
  ----    ------   ----   ----                  -------
  Normal  Created  39m    openstack_controller  Created Machine miyadav-4pzrg-worker-4d254
  Normal  Deleted  5m28s  openstack_controller  Deleted machine miyadav-4pzrg-worker-4d254

The machine gets deleted from the OpenStack cloud (also verified from the UI).

Expected :

Only an update event should be triggered, not a delete.

Comment 5 Pierre Prinetti 2020-02-04 09:40:05 UTC
I am not sure I understand what the expected behaviour is.

Are you referring to the error message inconsistency? (i.e. "it is an update event, the error message should not mention 'Delete'")


Or do you think that the actual behaviour is wrong (i.e. "the machine should not be deleted if the new flavor is invalid")?

Comment 6 Milind Yadav 2020-02-05 05:10:14 UTC
Both issues need to be resolved: 'describe machine' should report that the update failed, and the machine should not be deleted.

Comment 7 Pierre Prinetti 2020-02-07 08:39:59 UTC
I have addressed "the machine should not be deleted if the new flavor is invalid".

Please report the logging issue as a separate bug, so that we can analyse and prioritise separately. Thanks!
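The linked PR 82 is titled "Machine update: create before delete". A hedged sketch of that ordering (function names here are illustrative stand-ins, not the actual actuator code): creating the replacement server first means an invalid flavor fails at create time, and the existing machine is left untouched.

```go
package main

import (
	"errors"
	"fmt"
)

// createServer and deleteServer stand in for the OpenStack provider calls;
// both names and the flavor check are illustrative assumptions.
func createServer(flavor string) error {
	if flavor == "ci.m1.xlarge-invalid" {
		return errors.New("flavor not found")
	}
	return nil
}

func deleteServer(name string) error { return nil }

// updateMachine sketches create-before-delete: the replacement is created
// first, so a bad spec aborts the update before the old server is removed.
func updateMachine(name, newFlavor string) error {
	if err := createServer(newFlavor); err != nil {
		return fmt.Errorf("update aborted, existing machine kept: %w", err)
	}
	return deleteServer(name)
}

func main() {
	fmt.Println(updateMachine("worker-0", "ci.m1.xlarge-invalid"))
}
```

With delete-before-create ordering, the same invalid flavor would leave the machine deleted with nothing to replace it, which is the behaviour reported in comment 4.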

Comment 9 Jianwei Hou 2020-02-10 07:13:51 UTC
Tested on 4.4.0-0.nightly-2020-02-10-013941

Steps:
1. Update a worker machine's providerSpec to an invalid flavor, for example "ci.m1.xlarge.invalid"
2. Monitor the machine-controller. The update failed, but the machine went into the 'Failed' phase.
3. Correct the invalid flavor; the machine won't come back, because a machine in the 'Failed' phase won't reconcile.

According to https://github.com/openshift/enhancements/blob/master/enhancements/machine-api/machine-instance-lifecycle.md#failed, the machine is not expected to be 'Failed' in this scenario. 

machine-controller log after step 2
```
I0210 06:49:16.361944       1 controller.go:284] Reconciling machine "miyadav-1002-vwppm-worker-b6kwn" triggers idempotent update
I0210 06:57:35.763393       1 controller.go:164] Reconciling Machine "miyadav-1002-vwppm-worker-9pt2f"
I0210 06:57:35.763592       1 controller.go:376] Machine "miyadav-1002-vwppm-worker-9pt2f" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0210 06:57:35.776324       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:57:36.334016       1 controller.go:284] Reconciling machine "miyadav-1002-vwppm-worker-9pt2f" triggers idempotent update
I0210 06:57:36.334230       1 actuator.go:373] re-creating machine miyadav-1002-vwppm-worker-9pt2f for update.
I0210 06:57:36.346211       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:57:36.399681       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:57:36.769712       1 actuator.go:146] Skipped creating a VM that already exists.
I0210 06:57:36.777235       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:57:36.840589       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
E0210 06:58:00.224892       1 actuator.go:382] delete machine miyadav-1002-vwppm-worker-9pt2f for update failed: unable to update machine status: Operation cannot be fulfilled on machines.machine.openshift.io "miyadav-1002-vwppm-worker-9pt2f": the object has been modified; please apply your changes to the latest version and try again
E0210 06:58:00.225227       1 controller.go:286] Error updating machine "openshift-machine-api/miyadav-1002-vwppm-worker-9pt2f": Cannot delete machine miyadav-1002-vwppm-worker-9pt2f: unable to update machine status: Operation cannot be fulfilled on machines.machine.openshift.io "miyadav-1002-vwppm-worker-9pt2f": the object has been modified; please apply your changes to the latest version and try again
I0210 06:58:01.225786       1 controller.go:164] Reconciling Machine "miyadav-1002-vwppm-worker-9pt2f"
I0210 06:58:01.226038       1 controller.go:376] Machine "miyadav-1002-vwppm-worker-9pt2f" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0210 06:58:01.240726       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:58:06.590669       1 controller.go:284] Reconciling machine "miyadav-1002-vwppm-worker-9pt2f" triggers idempotent update
I0210 06:58:06.591051       1 actuator.go:373] re-creating machine miyadav-1002-vwppm-worker-9pt2f for update.
I0210 06:58:06.605211       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:58:06.651868       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:58:11.898942       1 actuator.go:146] Skipped creating a VM that already exists.
I0210 06:58:11.906541       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:58:11.958761       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:58:12.974534       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:58:23.293494       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:58:23.518163       1 actuator.go:398] Successfully updated machine miyadav-1002-vwppm-worker-9pt2f
I0210 06:58:23.518285       1 controller.go:164] Reconciling Machine "miyadav-1002-vwppm-worker-9pt2f"
I0210 06:58:23.518300       1 controller.go:376] Machine "miyadav-1002-vwppm-worker-9pt2f" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0210 06:58:23.528024       1 machineservice.go:229] Cloud provider CA cert not provided, using system trust bundle
I0210 06:58:23.770735       1 controller.go:428] Machine "miyadav-1002-vwppm-worker-9pt2f" going into phase "Failed"
I0210 06:58:23.783438       1 controller.go:164] Reconciling Machine "miyadav-1002-vwppm-worker-9pt2f"
I0210 06:58:23.783462       1 controller.go:376] Machine "miyadav-1002-vwppm-worker-9pt2f" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
W0210 06:58:23.783473       1 controller.go:273] Machine "miyadav-1002-vwppm-worker-9pt2f" has gone "Failed" phase. It won't reconcile
```

Comment 10 Pierre Prinetti 2020-03-30 09:49:00 UTC
Milind, Jianwei,
This BZ originally reported two problems:

- When updating a worker machine, the machine-controller logs "master inplace update failed: not support master in place update now".
- The machine event message is "Updated machine zhsun-8p6x5-worker-aaa%!(EXTRA string=Update)"; the format parameters and arguments appear to be mismatched.

Both issues were addressed in the linked Github pull requests.
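PR 77 ("Apply update to workers selectively") adds worker recognition before the update branch. A minimal sketch of that idea, not the actual PR code: the label key `machine.openshift.io/cluster-api-machine-role` is the one OpenShift sets on Machine objects, but the helper functions and the simplified flow are illustrative assumptions.

```go
package main

import "fmt"

// machineRoleLabel is the label OpenShift sets on Machine objects to
// record their role (worker, master, ...).
const machineRoleLabel = "machine.openshift.io/cluster-api-machine-role"

// isWorker reports whether a machine's labels identify it as a worker.
func isWorker(labels map[string]string) bool {
	return labels[machineRoleLabel] == "worker"
}

// update sketches the fixed branch: only the unimplemented master path
// returns the "in place update" error; workers proceed with the update.
func update(labels map[string]string) error {
	if !isWorker(labels) {
		return fmt.Errorf("master inplace update failed: not support master in place update now")
	}
	// ... apply the new spec to the worker machine ...
	return nil
}

func main() {
	fmt.Println(update(map[string]string{machineRoleLabel: "worker"}))
	fmt.Println(update(map[string]string{machineRoleLabel: "master"}))
}
```

Before the fix, every machine fell through to the master branch, which is why the error appeared even for workers.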

Please report any outstanding issues as separate BZs, so that we can properly track them. I understand that from a user perspective they are closely linked; however, in the code they have to be addressed separately.

Thank you!

Comment 12 Jianwei Hou 2020-04-03 03:25:40 UTC
Thanks Pierre, I think the PRs address this bug; I'll move it to verified.

The issue where the machine becomes 'Failed' will be tracked in bug 1820421.

Comment 14 errata-xmlrpc 2020-05-04 11:13:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

