Description of problem:
IPI IBMCloud OVN platform:
  ibmcloud:
    region: eu-gb

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-08-03-021959

How reproducible:
Twice

Steps to Reproduce:
1. Deploy with 3 master, 3 workers

Actual results:
o411i34-r8j67-worker-2-7plcw   Provisioned   bx2-4x16   eu-gb   eu-gb-2   5h10m

I0803 18:49:46.243465 1 machine_scope.go:156] "o411i34-r8j67-worker-3-t4kkt": patching machine
I0803 18:49:46.243503 1 logr.go:252] events "msg"="Normal" "message"="Updated Machine o411i34-r8j67-worker-3-t4kkt" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"o411i34-r8j67-worker-3-t4kkt","uid":"0babc345-3e0c-4169-93d9-87286df752b2","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"12034"} "reason"="Update"
I0803 18:49:46.298370 1 controller.go:333] o411i34-r8j67-worker-3-t4kkt: has no node yet, requeuing
I0803 18:50:01.204636 1 controller.go:179] o411i34-r8j67-worker-2-7plcw: reconciling Machine
I0803 18:50:01.204678 1 actuator.go:122] o411i34-r8j67-worker-2-7plcw: Checking if machine exists
I0803 18:50:03.199834 1 controller.go:305] o411i34-r8j67-worker-2-7plcw: reconciling machine triggers idempotent update
I0803 18:50:03.199874 1 actuator.go:98] o411i34-r8j67-worker-2-7plcw: Updating machine
I0803 18:50:05.332116 1 reconciler.go:203] o411i34-r8j67-worker-2-7plcw: provider id already set in the machine Spec with value:ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///o411i34-r8j67/0797_e6ec2ccd-cfc2-491e-adbb-f785d78defcb
I0803 18:50:05.332209 1 machine_scope.go:138] o411i34-r8j67-worker-2-7plcw: status unchanged
I0803 18:50:05.332256 1 machine_scope.go:156] "o411i34-r8j67-worker-2-7plcw": patching machine
I0803 18:50:05.333929 1 logr.go:252] events "msg"="Normal" "message"="Updated Machine o411i34-r8j67-worker-2-7plcw" "object"={"kind":"Machine","namespace":"openshift-machine-api","name":"o411i34-r8j67-worker-2-7plcw","uid":"4a918312-baa1-4b28-a9b8-56c40cbfd2dc","apiVersion":"machine.openshift.io/v1beta1","resourceVersion":"13043"} "reason"="Update"

openshift-machine-api/pods/machine-api-controllers-664c4dfb97-gnztv/nodelink-controller/nodelink-controller/logs/current.log
122:2022-08-03T14:41:54.715789624Z I0803 14:41:54.715772 1 nodelink_controller.go:70] Adding providerID "ibm://fdc2e14cf8bc4d53a67f972dc2e2c861///o411i34-r8j67/0797_e6ec2ccd-cfc2-491e-adbb-f785d78defcb" for machine "o411i34-r8j67-worker-2-7plcw" to indexer
238:2022-08-03T14:44:05.990334767Z I0803 14:44:05.990306 1 nodelink_controller.go:403] Matching node not found for machine "o411i34-r8j67-worker-2-7plcw" with internal IP "10.242.65.5"
239:2022-08-03T14:44:05.990396634Z I0803 14:44:05.990372 1 nodelink_controller.go:327] No-op: Node for machine "o411i34-r8j67-worker-2-7plcw" not found

Expected results:
All workers deploy correctly

Additional info:
Scaled machinesets up to 2 machines each and the new workers deployed correctly.

apiVersion: v1
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform: {}
  replicas: 3
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform: {}
  replicas: 3
metadata:
  name: o411i34
platform:
  ibmcloud:
    region: eu-gb
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  serviceNetwork:
  - 172.30.0.0/16
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
publish: External
credentialsMode: Manual
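For anyone triaging a similar run, a couple of standard commands to spot the affected workers; a sketch, assuming the usual openshift-machine-api namespace:

  # Machines stuck in the "Provisioned" phase never get a NODE entry
  oc get machines -n openshift-machine-api -o wide
  # The corresponding worker Nodes never register
  oc get nodes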
Finally success on region "jp-tok" with payload 4.11.0-0.nightly-2022-08-04-081314
Success on "eu-gb" with 4.11.0-0.nightly-2022-08-04-081314
The issue seems to be the nodes getting stuck. I was able to scale up to 2 machines per MachineSet and then scale back down, and the stuck Provisioning machines were removed (roughly the commands below). Does the machine-api ever give up and retry a node stuck in Provisioning?
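For the record, the workaround was just standard MachineSet scaling; a minimal sketch, using one of this cluster's MachineSet names as an example:

  # Scale the affected MachineSet up so a fresh Machine gets created
  oc scale machineset o411i34-r8j67-worker-2 -n openshift-machine-api --replicas=2
  # Once the new worker joins, scale back down; in this case the Machine stuck in Provisioning was the one removed
  oc scale machineset o411i34-r8j67-worker-2 -n openshift-machine-api --replicas=1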
> Does the machine-api ever give up and retry a node stuck in Provisioning?

It doesn't give up, no; it should keep retrying forever.

The logs for jp-tok show that the instances were all provisioned, that is, the cloud provider returned that it would create the instance and gave us an IP address for that instance. Looking at the Machine Config Server logs, I can't see any record of the worker-1 or worker-2 instances having attempted to fetch their config from the ignition server, which would indicate a very early boot failure. To go any further into why those broke we would need the serial console logs. Did you by any chance manage to grab those, or are they available from the IBM Cloud console?

It looks like there are similar symptoms in the first must-gather too, except there it's worker-2 and worker-3.
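For reference, this is roughly how the Machine Config Server logs can be checked for ignition fetches; a sketch, assuming the usual k8s-app=machine-config-server label on the MCS pods:

  # Each worker should show an ignition config fetch during its first boot
  oc -n openshift-machine-config-operator logs -l k8s-app=machine-config-server --tail=-1 | grep -i worker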
I did not grab serial console logs this time, but I do have serial access, so I'll attempt to reproduce. I guess what I mean is giving up on trying to scale a MachineSet? A workaround was to scale up the MachineSet and then scale back down, so maybe that could be a fallback mode: attempt to provision a new node, then scale back down either way.
When you restarted the node, that somehow fixed the networking and it successfully ignited. However, because more than 2 hours had passed since the Machine was created by the Machine API, the CSR approver considers this too late and rejects the CSR. This is preventing the kubelet from registering its Node and joining the cluster. Can you try creating a new node and see if the symptoms persist on it? We ideally need to see the first boot when it fails to understand what is happening there.
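If it helps, the CSR state can be inspected with the standard commands below; the CSR name is just a placeholder, and manual approval is only relevant while a CSR is still pending:

  # Kubelet CSRs and their state (Pending/Approved/Denied)
  oc get csr
  # Manually approve a specific pending CSR
  oc adm certificate approve csr-xxxxx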
If you have further details on a recreate, I'd be interested in seeing them. I haven't been able to recreate this issue on IBM Cloud using 4.12 and OVN so far. However, the reported failure looks similar to my initial OVN testing back on 4.11, and I am unsure what changed such that OVN simply started deploying without issue on 4.12, at least so far for me anyway. If you have any other details I'd be interested in checking them out as well, given we hope to have IBM Cloud IPI working on OVN and SDN for 4.12.
Are we still seeing this issue, or has it been resolved? It would be good to get a must-gather from a cluster, or steps to reproduce, if this is still happening.
The last IBMCloud QE CI failure was https://issues.redhat.com/browse/OCPBUGS-1327 Is this the same issue?
My guess is that yes, it would be the same issue. Do we have access to the console logs from that CI failure to confirm it? Although I know that is extremely difficult to do. The bug reported in https://issues.redhat.com/browse/OCPBUGS-1327 affects both OVN and SDN, and both 4.11 and 4.12.
Just wanted to poke on this bug: are we still seeing this? Is there any progress we can make?
This bug has been superseded by https://issues.redhat.com/browse/OCPBUGS-3289
The bug was mitigated in 4.12+ per https://issues.redhat.com/browse/OCPBUGS-1327 We do not expect to backport the fix to 4.11, since IBM Cloud was a Tech Preview in 4.11. The mitigation included improvements to the IBM Cloud Machine-API Provider (https://github.com/openshift/machine-api-provider-ibmcloud), notably https://github.com/openshift/machine-api-provider-ibmcloud/commit/bc112f3bf46b17c9f9745fb60c87571c8cdaf8e9

The core issue appears to be related to timing and the fact that RHEL 8 (used as the base for RHCOS in 4.12 and the previously affected releases) does not run NetworkManager as a daemon, so DHCP leases do not get renewed for the Machines on IBM Cloud, which uses a relatively short lease period (5 minutes). This resulted in Machines becoming inaccessible/stuck: the DHCP lease was not renewed and the Machine was useless from that point on (RHCOS failed to run Ignition, so the OS was not properly configured before the DHCP lease expired). To resolve the issue, until RHCOS is based on RHEL 9, which runs NetworkManager as a daemon (or at least in a continuous mode, not just a one-time configuration) to refresh the DHCP lease, the IBM Cloud MAPI code was updated to identify this situation and attempt to mitigate it.

Doc Notes for the bug mitigation (which has only been resolved on 4.12+):
The IBM Cloud MAPI Provider will attempt to find Machines that have provisioned but whose Operating System appears to be stuck, as Ignition has not yet run. If a Machine is identified in this state, and has been in it for 15 minutes since the IBM Cloud Virtual Server Instance was created, IBM Cloud MAPI attempts to delete and replace the Machine. The expectation is that network connectivity stabilizes in around 5-10 minutes, which can be longer than the DHCP lease allows. Replacing the Machine after 15 minutes appears to provide sufficient time for a second attempt at creating the Machine, allowing Ignition to complete successfully. If network connectivity issues continue beyond this time frame and the second attempt at creating the Machine also fails, IBM Cloud MAPI will not perform another attempt, and the issue must be resolved manually (such as deleting the Machine via an "oc delete machine" command, sketched below), which, assuming the network is stable and healthy, appears to resolve the issue.
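For completeness, the manual resolution mentioned above is just the standard Machine deletion; a sketch, using the Machine name from this report as an example (the owning MachineSet provisions a replacement):

  # Find the Machine stuck in the Provisioned phase with no NODE
  oc get machines -n openshift-machine-api -o wide
  # Delete it; the MachineSet creates a replacement Machine
  oc delete machine o411i34-r8j67-worker-2-7plcw -n openshift-machine-api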
Thanks Christopher, I updated the doc text.