Bug 1909682 - When scaling down the status of the node is stuck on deleting
Summary: When scaling down the status of the node is stuck on deleting
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.7
Hardware: Unspecified
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.z
Assignee: Derek Higgins
QA Contact: Polina Rabinovich
URL:
Whiteboard:
Depends On: 1901040
Blocks: 1872659 1868104
 
Reported: 2020-12-21 12:02 UTC by Derek Higgins
Modified: 2021-02-04 12:28 UTC (History)
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, master nodes were losing their link-local address on the provisioning interface, preventing provisioning from working. A workaround has been added to toggle addr_gen_mode to prevent this from occurring.
Clone Of: 1901040
Environment:
Last Closed: 2021-01-18 18:00:14 UTC
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Github openshift ironic-static-ip-manager pull 13 0 None closed Bug 1909682: Force addr_gen_mode to "0" on PROVISIONING_INTERFACE 2021-02-16 17:19:15 UTC
Red Hat Product Errata RHSA-2021:0037 0 None None None 2021-01-18 18:00:38 UTC

Description Derek Higgins 2020-12-21 12:02:06 UTC
+++ This bug was initially created as a clone of Bug #1901040 +++

Version:
Cluster version is 4.7.0-0.nightly-2020-11-23-195308
-----------

Setup:
Provisioning_net_IPv6, Baremetal_net_IPv4, disconnected install
-----------

Platform:

libvirt
IPI (automated install with `openshift-baremetal-install`) 
-----------

When trying to remove a node, its status is stuck in the Deleting state after oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=<N-1>, although it should have been removed from the nodes and machines lists ($ oc get nodes; $ oc get machine -o wide -n openshift-machine-api).

The Error Count field is present but empty (I attached bmh-description).

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-23-195308   True        False         68m     Cluster version is 4.7.0-0.nightly-2020-11-23-195308

After scaling down:
[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      15m
baremetal                                  4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
cloud-credential                           4.7.0-0.nightly-2020-11-23-195308   True        False         False      145m
cluster-autoscaler                         4.7.0-0.nightly-2020-11-23-195308   True        False         False      120m
config-operator                            4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
console                                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      81m
csi-snapshot-controller                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      112m
dns                                        4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
etcd                                       4.7.0-0.nightly-2020-11-23-195308   True        False         False      120m
image-registry                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      70m
ingress                                    4.7.0-0.nightly-2020-11-23-195308   True        False         True       86m
insights                                   4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
kube-apiserver                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      119m
kube-controller-manager                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      120m
kube-scheduler                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      118m
kube-storage-version-migrator              4.7.0-0.nightly-2020-11-23-195308   True        False         False      18m
machine-api                                4.7.0-0.nightly-2020-11-23-195308   True        False         False      118m
machine-approver                           4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
machine-config                             4.7.0-0.nightly-2020-11-23-195308   False       False         True       5m39s
marketplace                                4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
monitoring                                 4.7.0-0.nightly-2020-11-23-195308   False       True          True       7m50s
network                                    4.7.0-0.nightly-2020-11-23-195308   True        True          True       122m
node-tuning                                4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
openshift-apiserver                        4.7.0-0.nightly-2020-11-23-195308   True        False         False      113m
openshift-controller-manager               4.7.0-0.nightly-2020-11-23-195308   True        False         False      113m
openshift-samples                          4.7.0-0.nightly-2020-11-23-195308   True        False         False      69m
operator-lifecycle-manager                 4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2020-11-23-195308   True        False         False      99m
service-ca                                 4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
storage                                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m


Steps:
-----------

[kni@provisionhost-0-0 ~]$ oc annotate machine ocp-edge-cluster-0-n2gmf-worker-0-pjmfw machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
machine.machine.openshift.io/ocp-edge-cluster-0-n2gmf-worker-0-pjmfw annotated

[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-n2gmf-worker-0 --replicas=1
machineset.machine.openshift.io/ocp-edge-cluster-0-n2gmf-worker-0 scaled

[kni@provisionhost-0-0 ~]$ oc get nodes


[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api 
  
-------------

Must-gather:
-------------------------------------
I tried to run must-gather but got this error, so I attached ironic-conductor.log instead:

error: gather did not start for pod must-gather-xvzzz: unable to pull image: ImagePullBackOff: Back-off pulling image "registry.redhat.io/openshift4/ose-must-gather:latest"
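
For a disconnected install, must-gather generally has to be pointed at a mirrored image rather than registry.redhat.io; a sketch, where the mirror registry host and image path are placeholders:

$ oc adm must-gather --image=<mirror-registry>/openshift4/ose-must-gather:latest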

Expected results:
-------------------------------------
The deleted node is not listed in the nodes and machines lists.

Actual results:
-------------------------------------
The status of the node is stuck in the Deleting state.


How to reproduce:
-------------------------------------
1. Match the BMH name <name-CONSUMER> to its machine (CONSUMER column); see the one-liner after these steps:
   $ oc get bmh -n openshift-machine-api

2. Annotate the machine for deletion:
   $ oc annotate machine <name-CONSUMER> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api

3. Scale down the replica count:
   $ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=<N-1>

4. Ensure the deleted node is no longer listed:
   $ oc get nodes
   $ oc get machine -o wide -n openshift-machine-api
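
For step 1, the BMH-to-machine mapping can also be pulled directly from the BareMetalHost spec; a minimal sketch, assuming the consumerRef field of the metal3 BareMetalHost CRD:

$ oc get bmh -n openshift-machine-api -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.consumerRef.name}{"\n"}{end}'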

--- Additional comment from Polina Rabinovich on 2020-11-24 10:58:56 GMT ---

This is the output; it wasn't included in the bug description:
[kni@provisionhost-0-0 ~]$ oc get nodes 
NAME         STATUS                     ROLES    AGE    VERSION
master-0-0   Ready                      master   106m   v1.19.2+b005cfc
master-0-1   Ready                      master   106m   v1.19.2+b005cfc
master-0-2   Ready                      master   106m   v1.19.2+b005cfc
worker-0-0   Ready,SchedulingDisabled   worker   69m    v1.19.2+b005cfc
worker-0-1   Ready                      worker   67m    v1.19.2+b005cfc

[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api 
NAME                                      PHASE      TYPE   REGION   ZONE   AGE    NODE         PROVIDERID                                                    STATE
ocp-edge-cluster-0-n2gmf-master-0         Running                           129m   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0   
ocp-edge-cluster-0-n2gmf-master-1         Running                           129m   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1   
ocp-edge-cluster-0-n2gmf-master-2         Running                           129m   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2   
ocp-edge-cluster-0-n2gmf-worker-0-pjmfw   Deleting                          103m   worker-0-0   baremetalhost:///openshift-machine-api/openshift-worker-0-0   
ocp-edge-cluster-0-n2gmf-worker-0-zxqp7   Running                           103m   worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1

--- Additional comment from Dmitry Tantsur on 2020-11-24 17:24:15 GMT ---

I have a feeling that the node got stuck while cleaning. We'll need logs from the dnsmasq container in the metal3 pod, ironic-api and ideally also take a look at the node's virtual console to see what is actually happening there. Could you please provide these?
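
For reference, these logs can be pulled straight from the metal3 pod; a sketch, assuming a single metal3 pod and that the container names match what the pod actually runs:

$ POD=$(oc -n openshift-machine-api get pods -o name | grep metal3 | head -n1)
$ oc -n openshift-machine-api logs "$POD" -c ironic-api > ironic-api.log
$ oc -n openshift-machine-api logs "$POD" -c ironic-conductor > ironic-conductor.log
$ oc -n openshift-machine-api logs "$POD" -c ironic-dnsmasq > dnsmasq.log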

--- Additional comment from Polina Rabinovich on 2020-11-25 10:04:10 GMT ---

log from ironic-api container in the metal3 pod

--- Additional comment from Polina Rabinovich on 2020-11-25 10:05:33 GMT ---

log from dnsmasq container in the metal3 pod

--- Additional comment from Polina Rabinovich on 2020-11-25 10:08:22 GMT ---

(In reply to Dmitry Tantsur from comment #3)
> I have a feeling that the node got stuck while cleaning. We'll need logs
> from the dnsmasq container in the metal3 pod, ironic-api and ideally also
> take a look at the node's virtual console to see what is actually happening
> there. Could you please provide these?

Sure, I attached logs from the ironic-api & dnsmasq containers in the metal3 pod.

--- Additional comment from Bob Fournier on 2020-12-01 19:52:44 GMT ---

Yes, it looks like cleaning has failed:
ironic-api.log:2020-11-25 09:53:18.492 41 DEBUG ironic.api.expose [req-64e1da59-0ec3-4696-9e1f-9b6d6b9b8c5f ironic-user - - - -] Client-side error: The requested action "deleted" can not be performed on node "fa8d76db-75f8-49ac-bf17-8d311e437d96" while it is in state "clean failed". format_exception /usr/lib/python3.6/site-packages/ironic/api/expose.py:184
ironic-api.log:2020-11-25 09:53:18.592 41 DEBUG ironic.api.expose [req-b3ecf171-aa25-4123-a175-1288b84729b0 ironic-user - - - -] Client-side error: The requested action "deleted" can not be performed on node "fa8d76db-75f8-49ac-bf17-8d311e437d96" while it is in state "clean failed". format_exception /usr/lib/python3.6/site-packages/ironic/api/expose.py:184

Why it failed isn't clear yet. I only see node fa8d76db-75f8-49ac-bf17-8d311e437d96 referenced in ironic-api.log, not in ironic-conductor.log.

Note that we don't have the console output which may provide more info.

--- Additional comment from Bob Fournier on 2020-12-01 21:25:06 GMT ---

It looks like the last timestamp in the ironic-conductor.log is 2020-11-24 10:44:54.236, which is prior to the error in ironic-api.log.  Do you have an updated ironic-api.log?

--- Additional comment from Bob Fournier on 2020-12-01 22:29:39 GMT ---

Prior comment should have read "Do you have an updated ironic-conductor.log?"

--- Additional comment from Polina Rabinovich on 2020-12-02 12:54:18 GMT ---

(In reply to Bob Fournier from comment #9)
> Prior comment should have read "Do you have an updated ironic-conductor.log?"

I attached new logs (dnsmasq.log, ironic-conductor.log and ironic-api.log).

--- Additional comment from Bob Fournier on 2020-12-02 16:37:14 GMT ---

Thanks Polina. Looking at your system, I was able to confirm that cleaning times out after 30 minutes on worker-0-0:

2020-12-02 12:37:32.191 1 INFO ironic.conductor.utils [req-44060f5c-633c-4960-a2ea-d9a052606394 ironic-user - - - -] Successfully set node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 power state to power on by rebooting.^[[00m
2020-12-02 12:37:32.210 1 INFO ironic.conductor.task_manager [req-44060f5c-633c-4960-a2ea-d9a052606394 ironic-user - - - -] Node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 moved to provision state "clean wait" from state "cleaning"; target provision state is "available"^[[00m

2020-12-02 13:07:37.248 1 ERROR ironic.conductor.task_manager [req-778612d0-7559-4030-890d-493d2469c3c3 - - - - -] Node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 moved to provision state "clean failed" from state "clean wait"; target provision state is "available"^[[00m
2020-12-02 13:07:37.300 1 DEBUG ironic.conductor.utils [-] Node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 should be fast-track-able, but the agent doesn't seem to be running. Last heartbeat: 2020-12-02T11:58:00.940759 is_fast_track /usr/lib/python3.6/site-packages/ironic/conductor/utils.py:1052^[[00m
2020-12-02 13:07:37.303 1 ERROR ironic.conductor.utils [-] Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}.^[[00m

It's not clear why cleaning failed; I'll view the worker's console when we reproduce this again.
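
Since this is a libvirt setup, the worker console can be watched from the hypervisor; a sketch, where the domain name is whatever virsh reports for worker-0-0:

$ sudo virsh list --all
$ sudo virsh console <worker-0-0-domain>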

--- Additional comment from Bob Fournier on 2020-12-08 13:46:18 GMT ---

Attached the worker-0-0 console after scaling down. It looks like PXE fails for both IPv4 and IPv6 attempts using the MAC on the provisioning-0 network set for the worker-0-0 VM.

   <interface type='network'>
      <mac address='52:54:00:e3:68:36'/>
      <source network='provisioning-0' portid='aab70ff5-3026-4050-bab7-c74754ab6ed9' bridge='provisioning-0'/>
      <target dev='vnet2'/>
      <model type='virtio'/>
      <boot order='1'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </interface>

I don't see a request in the dnsmasq container from 52:54:00:e3:68:36 when this PXE failure occurs. I do see DHCPv6 Solicits logged from when the node was originally brought up, just not after the node was rebooted for cleaning; here are the earlier requests:
dnsmasq-dhcp: 13409945 DHCPSOLICIT(enp4s0) 00:03:00:01:52:54:00:e3:68:36 
dnsmasq-dhcp: 13409945 DHCPREPLY(enp4s0) fd00:1101::eaf0:317c:1489:1577 00:03:00:01:52:54:00:e3:68:36 

In the dnsmasq container I did see an entry for shared/html/pxelinux.cfg/52-54-00-e3-68-36, but as no requests were received by dnsmasq it's not used.

What's interesting in the console is that the IPv4 request appears to get a DHCP address, since it fails with PXE-E18 while attempting to get the bootfile. However, I don't see this request logged in the dnsmasq log.

--- Additional comment from Bob Fournier on 2020-12-08 14:25:30 GMT ---

Also in the ironic-conductor log we see:
- when ironic starts:
2020-12-08 10:45:40.318 1 WARNING ironic.drivers.modules.network.flat [req-542eee7f-6e40-468a-ba1c-86adaed6bd12 - - - - -] Please specify a valid UUID or name for [neutron]/cleaning_network configuration option so that this interface is able to perform cleaning. Otherwise, cleaning operations will fail to start.^[[00m
2020-12-08 10:45:40.319 1 INFO ironic.common.driver_factory [req-542eee7f-6e40-468a-ba1c-86adaed6bd12 - - - - -] Loaded the following network interfaces: ['flat', 'noop']^[[00m

- when the node that fails is put into cleaning (note that this msg is logged for all nodes):
2020-12-08 12:56:34.944 1 WARNING ironic.common.pxe_utils [req-a301b4c4-6915-437a-bb04-c38819be6d83 ironic-user - - - -] IPv6 is enabled and the DHCP driver appears set to a plugin aside from "neutron". Node 7a2282cf-2a54-4939-94ce-0110bfd23057 may not receive proper DHCPv6 provided boot parameters.

--- Additional comment from Bob Fournier on 2020-12-08 18:57:08 GMT ---

Note that we see the DHCP requests from worker-0-0 in kni-provisioner, just not in the dnsmasq container on master-0-x.

[kni@provisionhost-0-0 /]$ sudo tcpdump -i eth0 -n ether host 52:54:00:e3:68:36
18:52:45.846674 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:e3:68:36, length 347
18:53:01.846893 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:e3:68:36, length 347
18:53:33.854631 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit
18:53:38.847745 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit
18:53:47.847882 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit
18:54:04.848217 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit
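
To capture the entire DHCPv6 exchange rather than filtering on a single MAC, the capture can be written to a pcap on the DHCPv6 client/server ports for offline analysis; a sketch:

[kni@provisionhost-0-0 /]$ sudo tcpdump -i eth0 -n -w dhcpv6.pcap 'ip6 and (udp port 546 or udp port 547)'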

--- Additional comment from Bob Fournier on 2020-12-09 02:07:06 GMT ---

Actually, we are getting the DHCP Solicits in the dnsmasq log for worker-0-0 (MAC 52:54:00:e3:68:36). It was misleading in that dnsmasq logs only the client ID (00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90 in this case), so this is the full Solicit/Advertise for worker-0-0:

dnsmasq-dhcp: 8302732 DHCPSOLICIT(enp4s0) 00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90
dnsmasq-dhcp: 8302732 DHCPADVERTISE(enp4s0) fd00:1101::25ce:3597:2113:afd1 00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90
dnsmasq-dhcp: 8302732 requested options: 59:bootfile-url, 60:bootfile-param, 23:dns-server,
dnsmasq-dhcp: 8302732 requested options: 16:vendor-class
dnsmasq-dhcp: 8302732 tags: pxe6, dhcpv6, enp4s0
dnsmasq-dhcp: 8302732 sent size: 18 option:  1 client-id  00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb...
dnsmasq-dhcp: 8302732 sent size: 14 option:  2 server-id  00:01:00:01:27:62:15:d5:52:54:00:3e:12:b3
dnsmasq-dhcp: 8302732 sent size: 40 option:  3 ia-na  IAID=4216520443 T1=1800 T2=3150
dnsmasq-dhcp: 8302732 nest size: 24 option:  5 iaaddr  fd00:1101::25ce:3597:2113:afd1 PL=3600 VL=3600
dnsmasq-dhcp: 8302732 sent size:  9 option: 13 status  0 success
dnsmasq-dhcp: 8302732 sent size:  1 option:  7 preference  0
dnsmasq-dhcp: 8302732 sent size: 33 option: 59 bootfile-url  tftp://[fd00:1101::3]/snponly.efi

This results in the PXE process logging "PXE-E16: No valid offer received" to the console (see the attached worker-0-0 console log for additional error messages).
It's not clear why the Advertise is being rejected. It would help to pull tcpdump down to master-0-1 to capture the entire exchange; I've tried to get it via "toolbox" but the pull failed.

I'm attaching a tcpdump that matches this dnsmasq log; it was captured on provisionhost-0-0 (because tcpdump is not available on master-0-1).

I'm also attaching the latest dnsmasq-ew.log with the worker-0-0 Solicits at the end. There are a couple of other interesting things in this log:
- when dnsmasq is started, it listens on the link-local address:
dnsmasq-dhcp: DHCP, sockets bound exclusively to interface enp4s0
dnsmasq: listening on enp4s0(#2): fe80::5054:ff:fe3e:12b3%enp4s0

- Sometime later a static IPv6 address is assigned to enp4s0 and dnsmasq listens on that IP. However, the multicast "Address already in use" error is unexpected.
dnsmasq: stopped listening on enp4s0(#2): fe80::5054:ff:fe3e:12b3%enp4s0
dnsmasq: stopped listening on enp4s0(#2): fd00:1101::3
dnsmasq: interface enp4s0 failed to join DHCPv6 multicast group: Address already in use
dnsmasq: listening on enp4s0(#2): fd00:1101::3

I'm not sure if there is a problem with the "Address already in use", as there are many transactions after this. One possibly unrelated thing I noticed is that after listening on the static IPv6 address, dnsmasq no longer logs the "client MAC address" of received packets, even though the rest of the packet is the same, i.e.:
dnsmasq-dhcp: 13248436 client MAC address: 52:54:00:e3:68:36 <== this is not logged after interface changes
dnsmasq-dhcp: 13248436 DHCPSOLICIT(enp4s0) 00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90

--- Additional comment from Derek Higgins on 2020-12-09 15:41:17 GMT ---

It looks like this is either a dnsmasq bug or a problem with the way we are setting up DHCPv6.

I've reproduced this locally and can successfully hit the dnsmasq server every few minutes with a VM trying to PXE over IPv6; it continues to work right up until we see this in the dnsmasq log:

2020-12-09T15:24:11.636214728Z dnsmasq: stopped listening on enp1s0(#2): fe80::258:63ff:fe8b:308e%enp1s0
2020-12-09T15:24:11.636214728Z dnsmasq: stopped listening on enp1s0(#2): fd00:1101::3
2020-12-09T15:24:12.049867645Z dnsmasq: interface enp1s0 failed to join DHCPv6 multicast group: Address already in use
2020-12-09T15:24:13.105359874Z dnsmasq: listening on enp1s0(#2): fd00:1101::3

PXE then stops working; scaling down also works if you do it before this happens.

Around the same time, on the master in question, I see that the provisioning interface loses its link-local address:
[root@master-0 core]# date ; ip -o addr | grep enp1s0
Wed Dec  9 15:23:36 UTC 2020                                                                                                                                                                                       
2: enp1s0    inet6 fd00:1101::3/64 scope global dynamic \       valid_lft 6sec preferred_lft 6sec                                                                                                                  
2: enp1s0    inet6 fe80::258:63ff:fe8b:308e/64 scope link noprefixroute \       valid_lft forever preferred_lft forever                                                                                            

[root@master-0 core]# date ; ip -o addr | grep enp1s0
Wed Dec  9 15:24:40 UTC 2020                                                                                                                                                                                      
2: enp1s0    inet6 fd00:1101::3/64 scope global dynamic \       valid_lft 7sec preferred_lft 7sec                                                                                                                 
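
A simple loop makes it easier to catch the moment the link-local address disappears; a sketch:

[root@master-0 core]# watch -n 1 'date; ip -o -6 addr show dev enp1s0'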


Also, NetworkManager seems to time out some kind of DHCPv6 grace period:
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6273] device (enp1s0): DHCPv6: grace period expired
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6274] device (enp1s0): state change: activated -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <warn>  [1607527451.6296] device (enp1s0): Activation: failed for connection 'Wired Connection'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6308] device (enp1s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Dec 09 15:24:11 master-0.ostest.test.metalkube.org dbus-daemon[1209]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by>
Dec 09 15:24:11 master-0.ostest.test.metalkube.org ovs-vswitchd[1366]: ovs|00504|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 09 15:24:11 master-0.ostest.test.metalkube.org systemd[1]: Starting Network Manager Script Dispatcher Service...
Dec 09 15:24:11 master-0.ostest.test.metalkube.org dbus-daemon[1209]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org systemd[1]: Started Network Manager Script Dispatcher Service.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6524] dhcp4 (enp1s0): canceled DHCP transaction
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6524] dhcp4 (enp1s0): state changed timeout -> done
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6527] dhcp6 (enp1s0): canceled DHCP transaction
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6528] dhcp6 (enp1s0): state changed fail -> done
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + [[ Wired Connection == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + echo 'Refusing to modify default connection.'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: Refusing to modify default connection.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + exit 0
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + '[' -z ']'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + echo 'Not a DHCP6 address. Ignoring.'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: Not a DHCP6 address. Ignoring.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + exit 0
Dec 09 15:24:11 master-0.ostest.test.metalkube.org root[140000]: NM mdns-hostname triggered by down.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: <13>Dec  9 15:24:11 root: NM mdns-hostname triggered by down.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org root[140003]: Hostname changed: master-0.ostest.test.metalkube.org
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: <13>Dec  9 15:24:11 root: Hostname changed: master-0.ostest.test.metalkube.org
Dec 09 15:24:11 master-0.ostest.test.metalkube.org systemd[1]: Starting Generate console-login-helper-messages issue snippet...
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0515] device (enp1s0): Activation: starting connection 'enp1s0' (4ff259a7-73c8-468c-a3b4-25c2d363ae23)
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0517] device (enp1s0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0523] device (enp1s0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0527] device (enp1s0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0533] device (enp1s0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0562] device (enp1s0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0566] device (enp1s0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0590] device (enp1s0): Activation: successful, device activated.

--- Additional comment from Derek Higgins on 2020-12-09 17:17:55 GMT ---

Looks like this is linked to the loss of the link-local address. When we trigger regeneration of the link-local address on the master node:

echo 0 > /proc/sys/net/ipv6/conf/enp1s0/addr_gen_mode                                                                                                                                       
echo 1 > /proc/sys/net/ipv6/conf/enp1s0/addr_gen_mode                                                                                                                                       


[root@master-0 core]# ip -o addr | grep enp1s0                                                                                                                                                                    
2: enp1s0    inet6 fd00:1101::3/64 scope global dynamic \       valid_lft 10sec preferred_lft 10sec
2: enp1s0    inet6 fe80::258:63ff:fe8b:308e/64 scope link \       valid_lft forever preferred_lft forever             

PXE is working again, and this appeared in the dnsmasq logs:
2020-12-09T15:57:00.081839842Z dnsmasq: listening on enp1s0(#2): fe80::258:63ff:fe8b:308e%enp1s0
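
The same regeneration can be triggered through sysctl instead of writing to /proc directly; an equivalent sketch:

[root@master-0 core]# sysctl -w net.ipv6.conf.enp1s0.addr_gen_mode=0
[root@master-0 core]# sysctl -w net.ipv6.conf.enp1s0.addr_gen_mode=1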

--- Additional comment from Bob Fournier on 2020-12-14 15:25:49 GMT ---

Including Tony to give visibility to the networking team. See comment 22. The issue is that the IPv6 link-local address is being removed after some time, and dnsmasq stops responding to router solicitations, resulting in PXE failures after the link-local address is removed.

--- Additional comment from Derek Higgins on 2020-12-15 14:20:35 GMT ---

Setting addr_gen_mode to EUI64 works as a workaround. Adding

echo 0 > /proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode

to the refresh-static-ip script ensures the link-local address is maintained.

I doubt this is the correct way to fix the problem; the proper fix will likely involve either a change in NetworkManager or in the way we configure it.
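
If the setting needed to persist outside the script, a sysctl.d drop-in would be the usual mechanism; a sketch, where the file name is arbitrary and the interface must match PROVISIONING_INTERFACE:

# /etc/sysctl.d/90-provisioning-addr-gen-mode.conf
net.ipv6.conf.enp1s0.addr_gen_mode = 0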

--- Additional comment from Derek Higgins on 2020-12-15 16:48:59 GMT ---

Adding a PR to work around the problem. I'm still investigating the root cause and will open a separate bug to address it once more info is gathered.

--- Additional comment from Bob Fournier on 2020-12-16 11:44:13 GMT ---

Adding the blocker flag; the problem is cropping up in multiple reports.

Here is RHCOS bug tracking the issue - https://bugzilla.redhat.com/show_bug.cgi?id=1908302

--- Additional comment from Polina Rabinovich on 2020-12-21 09:38:05 GMT ---

[kni@provisionhost-0-0 ~]$ oc get bmh -n openshift-machine-api
NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                          HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-mhfwx-master-0         redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/3ac0499c-4968-4720-8dd2-534c07299b37                      true     
openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-mhfwx-master-1         redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/63e83258-6db9-40ca-8dbe-29a352aafc45                      true     
openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-mhfwx-master-2         redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/74bff71a-9559-420c-9946-2cf0c374e614                      true     
openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-mhfwx-worker-0-ccvrs   redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/bc55c605-2380-4d07-a59a-7326aba8faa4   unknown            true     
openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-mhfwx-worker-0-nftch   redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/d976f337-2ffe-4061-a8c2-4c802e2db726   unknown            true     

[kni@provisionhost-0-0 ~]$ oc annotate machine ocp-edge-cluster-0-mhfwx-worker-0-ccvrs machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
machine.machine.openshift.io/ocp-edge-cluster-0-mhfwx-worker-0-ccvrs annotated

[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-mhfwx-worker-0 --replicas=1
machineset.machine.openshift.io/ocp-edge-cluster-0-mhfwx-worker-0 scaled

[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api 
NAME                                      PHASE     TYPE   REGION   ZONE   AGE   NODE                                              PROVIDERID                                                                                         STATE
ocp-edge-cluster-0-mhfwx-master-0         Running                          88m   master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-master-0-0/85754a6f-1acf-4667-b3f5-73664c483b4e   
ocp-edge-cluster-0-mhfwx-master-1         Running                          88m   master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-master-0-1/ad3a2d95-dfb9-486e-b810-9a4ed4aad3ea   
ocp-edge-cluster-0-mhfwx-master-2         Running                          88m   master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-master-0-2/f907e8ba-7b0d-4dde-975c-792d3b1d38fd   
ocp-edge-cluster-0-mhfwx-worker-0-nftch   Running                          61m   worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-worker-0-1/e0e9fba8-4f84-4702-95b2-b0d298b4c048   

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME                                              STATUS   ROLES    AGE   VERSION
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   66m   v1.20.0+87544c5
master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   67m   v1.20.0+87544c5
master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   65m   v1.20.0+87544c5
worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    worker   41m   v1.20.0+87544c5
[kni@provisionhost-0-0 ~]$ oc get bmh -n openshift-machine-api

Comment 2 Polina Rabinovich 2020-12-22 13:17:47 UTC
Verified in version:
[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.9     True        False         96m     Cluster version is 4.6.9
-------------------------------------------------------------------------------
[kni@provisionhost-0-0 ~]$ oc get bmh -n openshift-machine-api
NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                    HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-lppt2-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/e717c94d-bcc9-40a1-b1a3-59cafe3bfc7b                      true     
openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-lppt2-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/550623fd-22d2-49ab-b9d5-6501d31ec179                      true     
openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-lppt2-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/7fb7f5aa-da0f-4943-800f-667789074ff8                      true     
openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-lppt2-worker-0-8f9jf   redfish://192.168.123.1:8000/redfish/v1/Systems/4e44929a-1aec-4f54-930b-6dfc01024a6d   unknown            true     
openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-lppt2-worker-0-sp7xn   redfish://192.168.123.1:8000/redfish/v1/Systems/035d09e4-0f42-474b-81f4-2de68b309135   unknown            true     
openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-lppt2-worker-0-h65nq   redfish://192.168.123.1:8000/redfish/v1/Systems/d96892c1-fdb7-4790-b6c1-00537b5a2cb4   unknown            true     

[kni@provisionhost-0-0 ~]$ oc annotate machine <name-CONSUMER> machine.openshift.io/cluster-api- 
-bash: name-CONSUMER: No such file or directory
[kni@provisionhost-0-0 ~]$ oc annotate machine ocp-edge-cluster-0-lppt2-worker-0-8f9jf machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
machine.machine.openshift.io/ocp-edge-cluster-0-lppt2-worker-0-8f9jf annotated

[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-lppt2-worker-0 --replicas=2
machineset.machine.openshift.io/ocp-edge-cluster-0-lppt2-worker-0 scaled
-----------------------------------------------------------------------------------------------------------------------
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS                     ROLES    AGE    VERSION
master-0-0   Ready                      master   139m   v1.19.0+7070803
master-0-1   Ready                      master   139m   v1.19.0+7070803
master-0-2   Ready                      master   139m   v1.19.0+7070803
worker-0-0   Ready,SchedulingDisabled   worker   106m   v1.19.0+7070803
worker-0-1   Ready                      worker   104m   v1.19.0+7070803
worker-0-2   Ready                      worker   105m   v1.19.0+7070803
[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api
NAME                                      PHASE      TYPE   REGION   ZONE   AGE    NODE         PROVIDERID                                                    STATE
ocp-edge-cluster-0-lppt2-master-0         Running                           156m   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0   
ocp-edge-cluster-0-lppt2-master-1         Running                           156m   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1   
ocp-edge-cluster-0-lppt2-master-2         Running                           156m   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2   
ocp-edge-cluster-0-lppt2-worker-0-8f9jf   Deleting                          133m   worker-0-0   baremetalhost:///openshift-machine-api/openshift-worker-0-0   
ocp-edge-cluster-0-lppt2-worker-0-h65nq   Running                           133m   worker-0-2   baremetalhost:///openshift-machine-api/openshift-worker-0-2   
ocp-edge-cluster-0-lppt2-worker-0-sp7xn   Running                           133m   worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1   
--------------------------------------------------------------------------------------------------------------
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   141m   v1.19.0+7070803
master-0-1   Ready    master   141m   v1.19.0+7070803
master-0-2   Ready    master   141m   v1.19.0+7070803
worker-0-1   Ready    worker   106m   v1.19.0+7070803
worker-0-2   Ready    worker   107m   v1.19.0+7070803

[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api
NAME                                      PHASE     TYPE   REGION   ZONE   AGE    NODE         PROVIDERID                                                    STATE
ocp-edge-cluster-0-lppt2-master-0         Running                          160m   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0   
ocp-edge-cluster-0-lppt2-master-1         Running                          160m   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1   
ocp-edge-cluster-0-lppt2-master-2         Running                          160m   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2   
ocp-edge-cluster-0-lppt2-worker-0-h65nq   Running                          138m   worker-0-2   baremetalhost:///openshift-machine-api/openshift-worker-0-2   
ocp-edge-cluster-0-lppt2-worker-0-sp7xn   Running                          138m   worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1

Comment 5 errata-xmlrpc 2021-01-18 18:00:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.12 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0037

