2036594 – [MAPO] Machine goes to failed state due to a momentary error of the cluster etcd

Bug 2036594 - [MAPO] Machine goes to failed state due to a momentary error of the cluster etcd

Summary: [MAPO] Machine goes to failed state due to a momentary error of the cluster etcd

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.11.0
Assignee:	Matthew Booth
QA Contact:	rlobillo
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2022-01-03 10:26 UTC by rlobillo
Modified:	2022-08-10 10:41 UTC (History)
CC List:	7 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-10 10:41:05 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-api-provider-openstack pull 26	0	None	open	Refactor MAPO to use a reconcile function for update and create	2022-03-11 17:38:31 UTC
Red Hat Product Errata	RHSA-2022:5069	0	None	None	None	2022-08-10 10:41:27 UTC

Description rlobillo 2022-01-03 10:26:58 UTC

Description of problem: While deploying a cluster with featureGate enabling the TechPreviewNoUpgrade featureSet, it is observed that some machine resources are going to Failed status and they are never reconcilied.


$ oc get machines -n openshift-machine-api
NAME                          PHASE     TYPE   REGION   ZONE   AGE
ostest-c5hkx-master-0         Running                          71m
ostest-c5hkx-master-1         Running                          71m
ostest-c5hkx-master-2         Running                          71m
ostest-c5hkx-worker-0-d2lh5   Failed                           48m
ostest-c5hkx-worker-0-qpmmz   Running                          62m
ostest-c5hkx-worker-0-sghfp   Running                          48m

On the machine-controller container, it is observed below lines regarding the failed worker:

$ oc logs -n openshift-machine-api machine-api-controllers-7d79b76cd7-6glz9 -c machine-controller  | grep ostest-c5hkx-worker-0-d2lh5
I0103 09:29:54.778487       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
I0103 09:30:58.152298       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
I0103 09:31:08.920209       1 controller.go:358] ostest-c5hkx-worker-0-d2lh5: setting phase to Provisioning and requeuing
I0103 09:31:08.920229       1 controller.go:483] ostest-c5hkx-worker-0-d2lh5: going into phase "Provisioning"
I0103 09:31:40.672498       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
I0103 09:31:41.696294       1 controller.go:365] ostest-c5hkx-worker-0-d2lh5: reconciling machine triggers idempotent create
E0103 09:32:23.356103       1 actuator.go:406] Machine error ostest-c5hkx-worker-0-d2lh5: error setting provider ID: etcdserver: request timed out
W0103 09:32:23.356155       1 controller.go:367] ostest-c5hkx-worker-0-d2lh5: failed to create machine: error setting provider ID: etcdserver: request timed out
E0103 09:32:23.356231       1 controller.go:304] controller/machine_controller "msg"="Reconciler error" "error"="error setting provider ID: etcdserver: request timed out" "name"="ostest-c5hkx-worker-0-d2lh5" "namespace"="openshift-machine-api" 
I0103 09:32:28.851987       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
I0103 09:32:29.974108       1 controller.go:483] ostest-c5hkx-worker-0-d2lh5: going into phase "Failed"
I0103 09:32:32.510839       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
W0103 09:32:32.510851       1 controller.go:276] ostest-c5hkx-worker-0-d2lh5: machine has gone "Failed" phase. It won't reconcile
I0103 09:37:13.835043       1 controller.go:175] ostest-c5hkx-worker-0-d2lh5: reconciling Machine
W0103 09:37:13.835101       1 controller.go:276] ostest-c5hkx-worker-0-d2lh5: machine has gone "Failed" phase. It won't reconcile



Version-Release number of selected component (if applicable):
4.10.0-0.nightly-2021-12-21-130047
RHOS-16.1-RHEL-8-20210903.n.0


How reproducible: Sometimes


Steps to Reproduce:
1. Install OCP4.10 latest nightly setting the featureGate:

kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  creationTimestamp: "2022-01-03T09:06:36Z"
  generation: 1
  name: cluster
  resourceVersion: "1360"
  uid: cc6eed62-b26f-4ed0-91fd-34309dfe6e49
spec:
  featureSet: TechPreviewNoUpgrade


Actual results:
Installation fails and/or not all the required workers are deployed.

Expected results:
Despite the possible glitch on the etcd server, MAPO should try to set the provider ID more than once or it should try to reconcile.

Comment 1 egarcia 2022-01-05 16:29:42 UTC

Lets defer judgement on this until https://github.com/openshift/machine-api-operator/pull/975 and https://github.com/openshift/release/pull/24933 merge. These patches fully enable the MAPO TP feature flag, and allow you to test the most recent changes to MAPO, which are very significant.

Comment 2 Matthew Booth 2022-01-11 11:06:06 UTC

@jspeed would this be an MAO bug?

Comment 3 Joel Speed 2022-01-11 18:02:51 UTC

Do we know if this error is coming from MAPO or CAPO?

Looks like this could be MAPO: https://github.com/openshift/machine-api-provider-openstack/blob/2401f74dba8112e0dd4a4a1bb77c26887d5da3f4/pkg/machine/actuator.go#L194-L204

Has anyone tried to reproduce this as it seems like this could use some debug logging?

What looks to me like has happened is that the MAPO code created the machine, attempted to set the provider ID, got an error back and then attempted to delete the instance

Meanwhile, I think the update to set the provider ID was actually successful, etcd timed out because not all members acked the update, but it did eventually get stored.

So next reconcile, the Machine controller sees that the instance has a provider ID, but it was deleted on OpenStack, so it rightfully fails the machine

I would say the issue here is, why are we deleting the instance in OpenStack if we can't set the provider ID?

Comment 4 Matthew Booth 2022-01-13 10:48:23 UTC

(In reply to Joel Speed from comment #3)
> Do we know if this error is coming from MAPO or CAPO?
> 
> Looks like this could be MAPO:
> https://github.com/openshift/machine-api-provider-openstack/blob/
> 2401f74dba8112e0dd4a4a1bb77c26887d5da3f4/pkg/machine/actuator.go#L194-L204
> 
> Has anyone tried to reproduce this as it seems like this could use some
> debug logging?
> 
> What looks to me like has happened is that the MAPO code created the
> machine, attempted to set the provider ID, got an error back and then
> attempted to delete the instance
> 
> Meanwhile, I think the update to set the provider ID was actually
> successful, etcd timed out because not all members acked the update, but it
> did eventually get stored.
> 
> So next reconcile, the Machine controller sees that the instance has a
> provider ID, but it was deleted on OpenStack, so it rightfully fails the
> machine
> 
> I would say the issue here is, why are we deleting the instance in OpenStack
> if we can't set the provider ID?

The reason is in the comment above: we're treating ProviderID as the canonical source of truth for the creation of the machine. Therefore if we created the machine but weren't able to record that we need to delete the machine, because next time round, with no ProviderID, we're going to create another one and leak a VM.

Incidentally, note that I considered the possibility that we might race with another process that shouldn't be running, but it didn't occur to me that etcd would return an error for a transaction that later succeeded. I'm... not embarrassed about that. How on earth are you supposed to work with that deterministically?

I think this effectively forces us to fall back to matching on name rather than UUID, which comes with its own problems. I can't see that etcd can ever be canonical for anything with these semantics.

Comment 8 Michał Dulko 2022-03-11 07:50:19 UTC

I am bumping the priority because I just seen it in the CI, meaning that it might be more common than we think.

Comment 9 Matthew Booth 2022-03-11 17:38:54 UTC

This should be one of the things fixed by https://github.com/openshift/machine-api-provider-openstack/pull/26

Comment 14 rlobillo 2022-03-28 10:51:45 UTC

Verified on 4.11.0-0.nightly-2022-03-23-132952 on top of RHOS-16.2-RHEL-8-20220311.n.1 with featureGate enabling MAPO:

      apiVersion: config.openshift.io/v1
      kind: FeatureGate
      metadata:
        annotations:
          include.release.openshift.io/self-managed-high-availability: "true"
          include.release.openshift.io/single-node-developer: "true"
          release.openshift.io/create-only: "true"
        name: cluster
      spec:
        customNoUpgrade:
          enabled:
          - MachineAPIProviderOpenStack
        featureSet: CustomNoUpgrade

We run 9 OCP installations (3 installation types * 3 networkTypes) and the workers were successfully deployed in all the cases.

Comment 17 errata-xmlrpc 2022-08-10 10:41:05 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

Note You need to log in before you can comment on or make changes to this bug.