Bug 2115090 - [IBMCloud][OVN] Install fails, worker machines stuck in Provisioned "has no node yet"
Summary: [IBMCloud][OVN] Install fails, worker machines stuck in Provisioned "has no node yet"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: high
Target Milestone: ---
Target Release: 4.12.0
Assignee: Christopher J Schaefer
QA Contact: sunzhaohua
Docs Contact: Jeana Routh
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-08-03 19:48 UTC by Ross Brattain
Modified: 2022-12-15 20:23 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, short DHCP lease times, in combination with `NetworkManager` not being run as a daemon or in continuous mode, caused machines to become stuck during their ignition configuration and never become nodes in the cluster. With this release, extra checks are added so that if a machine becomes stuck in this state it is deleted and recreated automatically. Machines that are affected by this network condition can become nodes after a reboot from the Machine API controller. (link:https://bugzilla.redhat.com/show_bug.cgi?id=2115090[*BZ#2115090*])
Clone Of:
Environment:
Last Closed: 2022-11-10 14:12:13 UTC
Target Upstream Version:
Embargoed:



Description Ross Brattain 2022-08-03 19:48:35 UTC
Description of problem:

IPI IBMCloud OVN

platform:
  ibmcloud:
    region: eu-gb

Version-Release number of selected component (if applicable):

4.11.0-0.nightly-2022-08-03-021959

How reproducible:

Twice

Steps to Reproduce:
1. Deploy with 3 master, 3 workers


Actual results:

o411i34-r8j67-worker-2-7plcw   Provisioned   bx2-4x16   eu-gb    eu-gb-2   5h10m
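
(For reference, output like the above is presumably from listing Machines in the machine-api namespace, roughly:)

$ oc get machines -n openshift-machine-api
# the stuck worker stays in the "Provisioned" phase and never gets a node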


I0803 18:49:46.243465       1 machine_scope.go:156] "o411i34-r8j67-worker-3-t4kkt": patching machine
I0803 18:49:46.243503       1 logr.go:252] events "msg"="Normal"  "message"="Updated Machine o411i34-r8j67-worker-3-t4kkt" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"o411i34-r8j67-worker-3-t4kkt","uid":"0babc345-3e0c-4169-93d9-87286df752b2","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"12034"} "reason"="Update"
I0803 18:49:46.298370       1 controller.go:333] o411i34-r8j67-worker-3-t4kkt: has no node yet, requeuing
I0803 18:50:01.204636       1 controller.go:179] o411i34-r8j67-worker-2-7plcw: reconciling Machine
I0803 18:50:01.204678       1 actuator.go:122] o411i34-r8j67-worker-2-7plcw: Checking if machine exists
I0803 18:50:03.199834       1 controller.go:305] o411i34-r8j67-worker-2-7plcw: reconciling machine triggers idempotent update
I0803 18:50:03.199874       1 actuator.go:98] o411i34-r8j67-worker-2-7plcw: Updating machine
I0803 18:50:05.332116       1 reconciler.go:203] o411i34-r8j67-worker-2-7plcw: provider id already set in the machine Spec with value:ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///o411i34-r8j67/0797_e6ec2ccd-cfc2-491e-adbb-f785d78defcb
I0803 18:50:05.332209       1 machine_scope.go:138] o411i34-r8j67-worker-2-7plcw: status unchanged
I0803 18:50:05.332256       1 machine_scope.go:156] "o411i34-r8j67-worker-2-7plcw": patching machine
I0803 18:50:05.333929       1 logr.go:252] events "msg"="Normal"  "message"="Updated Machine o411i34-r8j67-worker-2-7plcw" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"o411i34-r8j67-worker-2-7plcw","uid":"4a918312-baa1-4b28-a9b8-56c40cbfd2dc","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"13043"} "reason"="Update"


openshift-machine-api/pods/machine-api-controllers-664c4dfb97-gnztv/nodelink-controller/nodelink-controller/logs/current.log

122:2022-08-03T14:41:54.715789624Z I0803 14:41:54.715772       1 nodelink_controller.go:70] Adding providerID "ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///o411i34-r8j67/0797_e6ec2ccd-cfc2-491e-adbb-f785d78defcb" for machine "o411i34-r8j67-worker-2-7plcw" to indexer

238:2022-08-03T14:44:05.990334767Z I0803 14:44:05.990306       1 nodelink_controller.go:403] Matching node not found for machine "o411i34-r8j67-worker-2-7plcw" with internal IP "10.242.65.5"
239:2022-08-03T14:44:05.990396634Z I0803 14:44:05.990372       1 nodelink_controller.go:327] No-op: Node for machine "o411i34-r8j67-worker-2-7plcw" not found


Expected results:

All workers deploy correctly

Additional info:


Scaled machinesets up to 2 machines each and the new workers deployed correctly.


apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
metadata:
  name: o411i34
platform:
  ibmcloud:
    region: eu-gb
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
publish: External
credentialsMode: Manual

Comment 4 Ross Brattain 2022-08-04 12:54:56 UTC
Finally success on region "jp-tok" with payload 4.11.0-0.nightly-2022-08-04-081314

Comment 5 Ross Brattain 2022-08-04 15:18:29 UTC
Success on "eu-gb" with 4.11.0-0.nightly-2022-08-04-081314

Comment 6 Ross Brattain 2022-08-04 15:34:41 UTC
The issue seems to be the machines getting stuck.  I was able to scale up to 2 machines per MachineSet, then scale back down, and the stuck Provisioning machines were removed.
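
A rough sketch of that scale workaround (the MachineSet names here are illustrative; take the real ones from the first command):

$ oc get machinesets -n openshift-machine-api
$ oc scale machineset o411i34-r8j67-worker-2 -n openshift-machine-api --replicas=2
# wait for the new machine to become a node, then scale back down;
# in this case the scale-down happened to remove the stuck machine
$ oc scale machineset o411i34-r8j67-worker-2 -n openshift-machine-api --replicas=1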

Does the machine-api ever give up and retry a node stuck in Provisioning?

Comment 7 Joel Speed 2022-08-10 10:33:55 UTC
> Does the machine-api ever give up and retry a node stuck in Provisioning?

No, it doesn't give up; it should keep retrying forever.

The logs for jp-tok show that the instances were all provisioned, that is, the cloud provider returned that it would create the instance and gave us an IP address for that instance. Looking at the Machine Config Server logs, I can't see any record of the worker-1 or worker-2 instances having attempted to fetch their config from the ignition server, which would indicate a very early boot failure. To go any further into why those broke we would need the serial console logs. Did you by any chance manage to grab those, or are they available from the IBM Cloud console?
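
(For reference, one way to look for those ignition fetches; the pod name is illustrative, and whether the client address appears depends on the Machine Config Server's log output:)

$ oc get pods -n openshift-machine-config-operator | grep machine-config-server
$ oc logs -n openshift-machine-config-operator machine-config-server-<id> | grep 10.242.65.5
# no hit for the machine's internal IP suggests it never reached the ignition server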

Looks like there are similar symptoms in the first must-gather too, except it's worker-2 and worker-3.

Comment 8 Ross Brattain 2022-08-10 16:23:24 UTC
I did not grab serial console logs this time, but I do have serial access so I'll attempt to reproduce.

I guess what I mean is giving up on trying to scale a MachineSet? A workaround was to scale the MachineSet up and then back down, so maybe that could be a fallback mode: attempt to provision a new node, then scale down either way.

Comment 13 Joel Speed 2022-08-11 10:29:32 UTC
When you restarted the node, that somehow fixed the networking and it successfully ignited. However, because it had been more than 2 hours since the Machine was created by the Machine API, the CSR approver thinks this is too late and rejects the CSR. This is preventing the kubelet from creating its node and joining the cluster.
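
(A quick way to confirm that, and to approve the request manually if that is appropriate; the CSR name is illustrative:)

$ oc get csr
# a Pending or Denied CSR for the affected node points at the approver
$ oc adm certificate approve csr-xxxxx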

Can you try creating a new node and see if the symptoms persist on a new node? Ideally we need to see the first boot, when it fails, to understand what is happening there.

Comment 14 Christopher J Schaefer 2022-08-16 20:57:04 UTC
If you have further details on a recreate, I'd be interested in seeing them.

I haven't been able to recreate this issue using 4.12 and OVN so far on IBM Cloud.

However, the reported failure looks similar to my initial OVN testing back on 4.11, and I am unsure what changed such that OVN simply started deploying without issue on 4.12, at least so far for me.

If you have any other details I'd be interested in checking them out as well, given we hope to have IBM Cloud IPI working on OVN and SDN for 4.12.

Comment 15 Joel Speed 2022-10-03 15:35:31 UTC
Are we still seeing this issue, or has it been resolved? It would be good to get a must-gather from a cluster, or steps to reproduce, if this is still happening.

Comment 16 Ross Brattain 2022-10-03 20:01:05 UTC
The last IBMCloud QE CI failure was https://issues.redhat.com/browse/OCPBUGS-1327. Is this the same?

Comment 17 Christopher J Schaefer 2022-10-03 20:48:48 UTC
My guess is that yes, it would be the same issue.

Do we have access to those console logs from the CI failure to confirm that? Although I know that is extremely difficult to do.

The bug reported in https://issues.redhat.com/browse/OCPBUGS-1327 affects both OVN and SDN, on both 4.11 and 4.12.

Comment 18 Joel Speed 2022-10-27 12:53:54 UTC
Just wanted to poke on this bug, are we still seeing this? Is there any progress we can make?

Comment 19 Joel Speed 2022-11-10 14:12:13 UTC
This bug has been superseded by https://issues.redhat.com/browse/OCPBUGS-3289

Comment 20 Christopher J Schaefer 2022-11-23 18:50:01 UTC
The bug was mitigated in 4.12+ per
https://issues.redhat.com/browse/OCPBUGS-1327

We do not expect to backport the fix to 4.11, since IBM Cloud was a Tech Preview on 4.11.

The mitigation included improvements to the IBM Cloud Machine-API Provider (https://github.com/openshift/machine-api-provider-ibmcloud).
https://github.com/openshift/machine-api-provider-ibmcloud/commit/bc112f3bf46b17c9f9745fb60c87571c8cdaf8e9

The core issue appears to be related to timing, and the fact that RHEL 8 (used as the base for RHCOS in 4.12 and the previous affected releases) does not currently run NetworkManager as a daemon, so DHCP leases do not get renewed on IBM Cloud for the Machines, which have a relatively short lease period (5 minutes). This resulted in Machines basically becoming inaccessible/stuck: the DHCP lease was not renewed and the Machine was useless from that point on (RHCOS failed to run Ignition, so the OS was not properly configured before the DHCP lease expired).
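
(For reference, a way to confirm the short lease on an affected instance, assuming console or SSH access; the interface name ens3 is just an example:)

$ nmcli -f DHCP4 device show ens3
# the 5-minute lease shows up as a dhcp_lease_time option of 300
$ journalctl -u NetworkManager | grep -i lease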

To resolve the issue, until RHCOS is based on RHEL 9, which runs NetworkManager as a daemon (or at least in a continuous mode, not just a one-time configuration) to refresh the DHCP lease, the IBM Cloud MAPI code was updated to identify this situation and attempt to mitigate it.

Doc Notes for the bug mitigation (which has only been resolved on 4.12+):

The IBM Cloud MAPI Provider will attempt to find Machines that have been provisioned but whose Operating System appears to be stuck because Ignition has not yet run. If a Machine is identified in this state, and has been in it for 15 minutes since the IBM Cloud Virtual Server Instance was created, IBM Cloud MAPI attempts to delete and replace the Machine. The expectation is that network connectivity stabilizes within around 5-10 minutes, which can be longer than the DHCP lease allows. Replacing the Machine after 15 minutes appears to provide sufficient time for a second attempt at creating the Machine, allowing Ignition to complete successfully.

If network connectivity issues continue after this time frame and the second attempt at creating the Machine fails, IBM Cloud MAPI will not perform another attempt, and the issue must be resolved manually (such as by deleting the Machine via an "oc delete machine" command), which, assuming the network is stable and healthy, appears to resolve the issue.
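
(In other words, the manual recovery looks roughly like the following; the Machine name is illustrative:)

$ oc get machines -n openshift-machine-api
# find the Machine still stuck in Provisioned with no node
$ oc delete machine o411i34-r8j67-worker-2-7plcw -n openshift-machine-api
# the owning MachineSet creates a replacement, which should ignite once the network is stable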

Comment 21 Michael McCune 2022-11-30 14:16:16 UTC
Thanks Christopher, I updated the doc text

