Bug 1901040 - When scaling down the status of the node is stuck on deleting
Summary: When scaling down the status of the node is stuck on deleting
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Bare Metal Hardware Provisioning
Version: 4.7
Hardware: Unspecified
OS: Linux
Severity: high
Priority: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Bob Fournier
QA Contact: Polina Rabinovich
URL:
Whiteboard:
Duplicates: 1853267 1905826
Depends On:
Blocks: 1868104 1872659 1909682
 
Reported: 2020-11-24 10:51 UTC by Polina Rabinovich
Modified: 2021-02-24 15:36 UTC
11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: When using IPv6 on an interface, after a certain amount of time NetworkManager will remove the link local IPv6 address. Consequence: PXE boot failures will occur for nodes after the IPv6 link local address is removed. Fix: A workaround is to toggle the interface IPv6 addr_gen_mode which will cause the link local address to be added back. Result: Can successfully PXE boot nodes using IPv6.
Clone Of:
: 1909682
Environment:
Last Closed: 2021-02-24 15:35:48 UTC
Target Upstream Version:
Embargoed:


Attachments
ironic-conductor.log (4.66 MB, text/plain) - 2020-11-24 10:51 UTC, Polina Rabinovich
get-bmh.yaml (39.75 KB, text/plain) - 2020-11-24 10:52 UTC, Polina Rabinovich
ironic-api.log (6.13 MB, text/plain) - 2020-11-25 10:04 UTC, Polina Rabinovich
dnsmasq.log (104.10 KB, text/plain) - 2020-11-25 10:05 UTC, Polina Rabinovich
new-dnsmasq.log (60.68 KB, text/plain) - 2020-12-02 12:48 UTC, Polina Rabinovich
new-ironic-conductor.log (3.43 MB, text/plain) - 2020-12-02 12:50 UTC, Polina Rabinovich
new-ironic-api.log (4.31 MB, text/plain) - 2020-12-02 12:51 UTC, Polina Rabinovich
worker-0-0 console showing PXE failure after scale down (34.19 KB, image/png) - 2020-12-08 13:32 UTC, Bob Fournier
tcpdump of DHCPv6 capture on provisionhost-0-0 (796 bytes, application/vnd.tcpdump.pcap) - 2020-12-09 02:09 UTC, Bob Fournier
latest dnsmasq log showing Solicits from worker-0-0 at end (449.61 KB, text/plain) - 2020-12-09 02:10 UTC, Bob Fournier


Links
System ID Private Priority Status Summary Last Updated
Github metal3-io static-ip-manager-image pull 7 0 None closed Force addr_gen_mode to "0" on PROVISIONING_INTERFACE 2021-02-21 21:25:37 UTC
Github openshift ironic-static-ip-manager pull 12 0 None closed Bug 1901040: Force addr_gen_mode to "0" on PROVISIONING_INTERFACE 2021-02-21 21:25:37 UTC
Red Hat Product Errata RHSA-2020:5633 0 None None None 2021-02-24 15:36:17 UTC

Description Polina Rabinovich 2020-11-24 10:51:53 UTC
Created attachment 1732944 [details]
ironic-conductor.log

Version:
Cluster version is 4.7.0-0.nightly-2020-11-23-195308
-----------

Setup:
Provisioning_net_IPv6, Baremetal_net_IPv4, disconnected install
-----------

Platform:

libvirt
IPI (automated install with `openshift-baremetal-install`) 
-----------

When trying to remove a node, its status gets stuck in the Deleting state after running oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=<N-1>, although it should have been removed from the nodes and machines lists ($ oc get nodes, $ oc get machine -o wide -n openshift-machine-api).

The Error Count field is present but empty (I attached the bmh description).

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-23-195308   True        False         68m     Cluster version is 4.7.0-0.nightly-2020-11-23-195308

After scaling down:
[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      15m
baremetal                                  4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
cloud-credential                           4.7.0-0.nightly-2020-11-23-195308   True        False         False      145m
cluster-autoscaler                         4.7.0-0.nightly-2020-11-23-195308   True        False         False      120m
config-operator                            4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
console                                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      81m
csi-snapshot-controller                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      112m
dns                                        4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
etcd                                       4.7.0-0.nightly-2020-11-23-195308   True        False         False      120m
image-registry                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      70m
ingress                                    4.7.0-0.nightly-2020-11-23-195308   True        False         True       86m
insights                                   4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
kube-apiserver                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      119m
kube-controller-manager                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      120m
kube-scheduler                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      118m
kube-storage-version-migrator              4.7.0-0.nightly-2020-11-23-195308   True        False         False      18m
machine-api                                4.7.0-0.nightly-2020-11-23-195308   True        False         False      118m
machine-approver                           4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
machine-config                             4.7.0-0.nightly-2020-11-23-195308   False       False         True       5m39s
marketplace                                4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
monitoring                                 4.7.0-0.nightly-2020-11-23-195308   False       True          True       7m50s
network                                    4.7.0-0.nightly-2020-11-23-195308   True        True          True       122m
node-tuning                                4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
openshift-apiserver                        4.7.0-0.nightly-2020-11-23-195308   True        False         False      113m
openshift-controller-manager               4.7.0-0.nightly-2020-11-23-195308   True        False         False      113m
openshift-samples                          4.7.0-0.nightly-2020-11-23-195308   True        False         False      69m
operator-lifecycle-manager                 4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2020-11-23-195308   True        False         False      99m
service-ca                                 4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
storage                                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m


Steps:
-----------

[kni@provisionhost-0-0 ~]$ oc annotate machine ocp-edge-cluster-0-n2gmf-worker-0-pjmfw machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
machine.machine.openshift.io/ocp-edge-cluster-0-n2gmf-worker-0-pjmfw annotated

[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-n2gmf-worker-0 --replicas=1
machineset.machine.openshift.io/ocp-edge-cluster-0-n2gmf-worker-0 scaled

[kni@provisionhost-0-0 ~]$ oc get nodes


[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api 
  
-------------

Must-gather:
-------------------------------------
I tried to run must-gather but got the error below, so I attached ironic-conductor.log instead of the must-gather output:

error: gather did not start for pod must-gather-xvzzz: unable to pull image: ImagePullBackOff: Back-off pulling image "registry.redhat.io/openshift4/ose-must-gather:latest"

Expected results:
-------------------------------------
Deleted node not listed in nodes and machines lists

Actual results:
-------------------------------------
Status of the node is stuck on deleting state


How to reproduce:
-------------------------------------
1. Match the bmh name <name-CONSUMER> to its machine (CONSUMER column):
   $ oc get bmh -n openshift-machine-api

2. Annotate the machine for deletion:
   $ oc annotate machine <name-CONSUMER> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api

3. Scale down the replica count (see the consolidated sketch after these steps):
   $ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=<N-1>

4. Ensure the deleted node is no longer listed:
   $ oc get nodes
   $ oc get machine -o wide -n openshift-machine-api
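
For convenience, the same flow as a single hedged sketch (the namespace, machineset, and BMH names below are placeholders for this environment):

# Hypothetical end-to-end sketch of the reproduction steps above.
NAMESPACE=openshift-machine-api
MACHINESET=ocp-edge-cluster-0-worker-0
BMH=openshift-worker-0-0

# 1. Find the machine that consumes the BareMetalHost (the CONSUMER column).
MACHINE=$(oc get bmh "$BMH" -n "$NAMESPACE" -o jsonpath='{.spec.consumerRef.name}')

# 2. Mark that machine for removal when the machineset scales down.
oc annotate machine "$MACHINE" machine.openshift.io/cluster-api-delete-machine=yes -n "$NAMESPACE"

# 3. Scale the machineset down by one replica.
REPLICAS=$(oc get machineset "$MACHINESET" -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')
oc scale machineset "$MACHINESET" -n "$NAMESPACE" --replicas=$((REPLICAS - 1))

# 4. The deleted node and machine should disappear from these listings.
oc get nodes
oc get machine -o wide -n "$NAMESPACE"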

Comment 1 Polina Rabinovich 2020-11-24 10:52:45 UTC
Created attachment 1732945 [details]
get-bmh.yaml

Comment 2 Polina Rabinovich 2020-11-24 10:58:56 UTC
This is the output; it wasn't added in the bug description.
[kni@provisionhost-0-0 ~]$ oc get nodes 
NAME         STATUS                     ROLES    AGE    VERSION
master-0-0   Ready                      master   106m   v1.19.2+b005cfc
master-0-1   Ready                      master   106m   v1.19.2+b005cfc
master-0-2   Ready                      master   106m   v1.19.2+b005cfc
worker-0-0   Ready,SchedulingDisabled   worker   69m    v1.19.2+b005cfc
worker-0-1   Ready                      worker   67m    v1.19.2+b005cfc

[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api 
NAME                                      PHASE      TYPE   REGION   ZONE   AGE    NODE         PROVIDERID                                                    STATE
ocp-edge-cluster-0-n2gmf-master-0         Running                           129m   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0   
ocp-edge-cluster-0-n2gmf-master-1         Running                           129m   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1   
ocp-edge-cluster-0-n2gmf-master-2         Running                           129m   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2   
ocp-edge-cluster-0-n2gmf-worker-0-pjmfw   Deleting                          103m   worker-0-0   baremetalhost:///openshift-machine-api/openshift-worker-0-0   
ocp-edge-cluster-0-n2gmf-worker-0-zxqp7   Running                           103m   worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1

Comment 3 Dmitry Tantsur 2020-11-24 17:24:15 UTC
I have a feeling that the node got stuck while cleaning. We'll need logs from the dnsmasq container in the metal3 pod, ironic-api and ideally also take a look at the node's virtual console to see what is actually happening there. Could you please provide these?

Comment 4 Polina Rabinovich 2020-11-25 10:04:10 UTC
Created attachment 1733303 [details]
ironic-api.log

log from ironic-api container in the metal3 pod

Comment 5 Polina Rabinovich 2020-11-25 10:05:33 UTC
Created attachment 1733304 [details]
dnsmasq.log

log from dnsmasq container in the metal3 pod

Comment 6 Polina Rabinovich 2020-11-25 10:08:22 UTC
(In reply to Dmitry Tantsur from comment #3)
> I have a feeling that the node got stuck while cleaning. We'll need logs
> from the dnsmasq container in the metal3 pod, ironic-api and ideally also
> take a look at the node's virtual console to see what is actually happening
> there. Could you please provide these?

Sure, I attached logs from the ironic-api & dnsmasq containers in the metal3 pod.

Comment 7 Bob Fournier 2020-12-01 19:52:44 UTC
Yes, it looks like cleaning has failed:
ironic-api.log:2020-11-25 09:53:18.492 41 DEBUG ironic.api.expose [req-64e1da59-0ec3-4696-9e1f-9b6d6b9b8c5f ironic-user - - - -] Client-side error: The requested action "deleted" can not be performed on node "fa8d76db-75f8-49ac-bf17-8d311e437d96" while it is in state "clean failed". format_exception /usr/lib/python3.6/site-packages/ironic/api
/expose.py:184
ironic-api.log:2020-11-25 09:53:18.592 41 DEBUG ironic.api.expose [req-b3ecf171-aa25-4123-a175-1288b84729b0 ironic-user - - - -] Client-side error: The requested action "deleted" can not be performed on node "fa8d76db-75f8-49ac-bf17-8d311e437d96" while it is in state "clean failed". format_exception /usr/lib/python3.6/site-packages/ironic/api
/expose.py:184

It's not yet clear why. I only see node fa8d76db-75f8-49ac-bf17-8d311e437d96 referenced in ironic-api.log, not in ironic-conductor.log.

Note that we don't have the console output, which may provide more info.

Comment 8 Bob Fournier 2020-12-01 21:25:06 UTC
It looks like the last timestamp in the ironic-conductor.log is 2020-11-24 10:44:54.236, which is prior to the error in ironic-api.log.  Do you have an updated ironic-api.log?

Comment 9 Bob Fournier 2020-12-01 22:29:39 UTC
Prior comment should have read "Do you have an updated ironic-conductor.log?"

Comment 10 Polina Rabinovich 2020-12-02 12:48:57 UTC
Created attachment 1735610 [details]
new-dnsmasq.log

Comment 11 Polina Rabinovich 2020-12-02 12:50:16 UTC
Created attachment 1735612 [details]
new-ironic-conductor.log

Comment 12 Polina Rabinovich 2020-12-02 12:51:21 UTC
Created attachment 1735613 [details]
new-ironic-api.log

Comment 13 Polina Rabinovich 2020-12-02 12:54:18 UTC
(In reply to Bob Fournier from comment #9)
> Prior comment should have read "Do you have an updated ironic-conductor.log?"

I attached new logs (dnsmasq.log, ironic-conductor.log and ironic-api.log).

Comment 14 Bob Fournier 2020-12-02 16:37:14 UTC
Thanks Polina. Looking at your system, I was able to confirm that cleaning times out after 30 minutes on worker-0-0:

2020-12-02 12:37:32.191 1 INFO ironic.conductor.utils [req-44060f5c-633c-4960-a2ea-d9a052606394 ironic-user - - - -] Successfully set node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 power state to power on by rebooting.^[[00m
2020-12-02 12:37:32.210 1 INFO ironic.conductor.task_manager [req-44060f5c-633c-4960-a2ea-d9a052606394 ironic-user - - - -] Node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 moved to provision state "clean wait" from state "cleaning"; target provision state is "available"^[[00m

2020-12-02 13:07:37.248 1 ERROR ironic.conductor.task_manager [req-778612d0-7559-4030-890d-493d2469c3c3 - - - - -] Node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 moved to provision state "clean failed" from state "clean wait"; target provision state is "available"^[[00m
2020-12-02 13:07:37.300 1 DEBUG ironic.conductor.utils [-] Node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 should be fast-track-able, but the agent doesn't seem to be running. Last heartbeat: 2020-12-02T11:58:00.940759 is_fast_track /usr/lib/python3.6/site-packages/ironic/conductor/utils.py:1052^[[00m
2020-12-02 13:07:37.303 1 ERROR ironic.conductor.utils [-] Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}.^[[00m

It's not clear why cleaning failed; I'll view the worker's console when we reproduce this again.

Comment 15 Bob Fournier 2020-12-08 13:32:16 UTC
Created attachment 1737618 [details]
worker-0-0 console showing PXE failure after scale down

Comment 16 Bob Fournier 2020-12-08 13:46:18 UTC
Attached the worker-0-0 console after scaling down; it looks like PXE fails for both the IPv4 and IPv6 attempts using the MAC for provisioning-0 set for the worker-0-0 VM.

   <interface type='network'>
      <mac address='52:54:00:e3:68:36'/>
      <source network='provisioning-0' portid='aab70ff5-3026-4050-bab7-c74754ab6ed9' bridge='provisioning-0'/>
      <target dev='vnet2'/>
      <model type='virtio'/>
      <boot order='1'/>
      <alias name='net0'/>
      <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
    </interface>

I don't see a request from 52:54:00:e3:68:36 in the dnsmasq container when this PXE failure occurs. I do see DHCPv6 Solicits logged when the node was originally brought up, just not after the node was rebooted for cleaning; here are the earlier requests:
dnsmasq-dhcp: 13409945 DHCPSOLICIT(enp4s0) 00:03:00:01:52:54:00:e3:68:36 
dnsmasq-dhcp: 13409945 DHCPREPLY(enp4s0) fd00:1101::eaf0:317c:1489:1577 00:03:00:01:52:54:00:e3:68:36 

In the dnsmasq container I did see an entry for shared/html/pxelinux.cfg/52-54-00-e3-68-36, but since no requests were received by dnsmasq it is not used.

What's interesting in the console is that the IPv4 request appears to get a DHCP address, as it fails with PXE-E18 while attempting to get the bootfile. However, I don't see this request logged in the dnsmasq log.

Comment 17 Bob Fournier 2020-12-08 14:25:30 UTC
Also in the ironic-conductor log we see:
- when ironic starts:
2020-12-08 10:45:40.318 1 WARNING ironic.drivers.modules.network.flat [req-542eee7f-6e40-468a-ba1c-86adaed6bd12 - - - - -] Please specify a valid UUID or name for [neutron]/cleaning_network configuration option so that this interface is able to perform cleaning. Otherwise, cleaning operations will fail to start.^[[00m
2020-12-08 10:45:40.319 1 INFO ironic.common.driver_factory [req-542eee7f-6e40-468a-ba1c-86adaed6bd12 - - - - -] Loaded the following network interfaces: ['flat', 'noop']^[[00m

- when the node that fails is put into cleaning (note that this msg is logged for all nodes):
2020-12-08 12:56:34.944 1 WARNING ironic.common.pxe_utils [req-a301b4c4-6915-437a-bb04-c38819be6d83 ironic-user - - - -] IPv6 is enabled and the DHCP driver appears set to a plugin aside from "neutron". Node 7a2282cf-2a54-4939-94ce-0110bfd23057 may not receive proper DHCPv6 provided boot parameters.

Comment 18 Bob Fournier 2020-12-08 18:57:08 UTC
Note that we see the DHCP requests from worker-0-0 in kni-provisioner, just not in the dnsmasq container on master-0-x.

[kni@provisionhost-0-0 /]$ sudo tcpdump -i eth0 -n ether host 52:54:00:e3:68:36
18:52:45.846674 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:e3:68:36, length 347
18:53:01.846893 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:e3:68:36, length 347
18:53:33.854631 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit
18:53:38.847745 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit
18:53:47.847882 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit
18:54:04.848217 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit

Comment 19 Bob Fournier 2020-12-09 02:07:06 UTC
Actually, we are getting the DHCP Solicits in the dnsmasq log for worker-0-0 (MAC 52:54:00:e3:68:36); it was misleading in that dnsmasq logs only the client ID (00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90 in this case). This is the full Solicit/Advertise for worker-0-0:

dnsmasq-dhcp: 8302732 DHCPSOLICIT(enp4s0) 00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90
dnsmasq-dhcp: 8302732 DHCPADVERTISE(enp4s0) fd00:1101::25ce:3597:2113:afd1 00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90
dnsmasq-dhcp: 8302732 requested options: 59:bootfile-url, 60:bootfile-param, 23:dns-server,
dnsmasq-dhcp: 8302732 requested options: 16:vendor-class
dnsmasq-dhcp: 8302732 tags: pxe6, dhcpv6, enp4s0
dnsmasq-dhcp: 8302732 sent size: 18 option:  1 client-id  00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb...
dnsmasq-dhcp: 8302732 sent size: 14 option:  2 server-id  00:01:00:01:27:62:15:d5:52:54:00:3e:12:b3
dnsmasq-dhcp: 8302732 sent size: 40 option:  3 ia-na  IAID=4216520443 T1=1800 T2=3150
dnsmasq-dhcp: 8302732 nest size: 24 option:  5 iaaddr  fd00:1101::25ce:3597:2113:afd1 PL=3600 VL=3600
dnsmasq-dhcp: 8302732 sent size:  9 option: 13 status  0 success
dnsmasq-dhcp: 8302732 sent size:  1 option:  7 preference  0
dnsmasq-dhcp: 8302732 sent size: 33 option: 59 bootfile-url  tftp://[fd00:1101::3]/snponly.efi
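
Side note on the client IDs: the earlier Solicit in comment 16 used a DUID-LL client ID (type 00:03), which embeds the MAC in its last six octets, whereas this one is a DUID-UUID (type 00:04) and carries no MAC at all. A purely illustrative bash snippet for mapping a DUID-LL client ID back to a MAC:

# Illustrative only: applies to DUID-LL (00:03) client IDs, not DUID-UUID (00:04).
duid="00:03:00:01:52:54:00:e3:68:36"
# Skip the 2-octet DUID type and 2-octet hardware type; the rest is the MAC.
[ "${duid:0:5}" = "00:03" ] && echo "${duid:12}"    # prints 52:54:00:e3:68:36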

This results in the PXE process logging "PXE-E16: No valid offer received" to the console (see the attached worker-0-0 console log for the additional error message).
It's not clear why the Advertise is being rejected. It would help to pull tcpdump down to master-0-1 to capture the entire exchange; I've tried to get it via "toolbox" but the pull failed.

I'm attaching a tcpdump that matches this dnsmasq log, this was captured on provisionhost-0-0 (because of lack of tcpdump on master-0-1).

I'm also attaching the latest dnsmasq-ew.log with the worker-0-0 Solicits at the end. There are a couple of other interesting things in this log:
- when dnsmasq is started its listening using the link local address:
dnsmasq-dhcp: DHCP, sockets bound exclusively to interface enp4s0
dnsmasq: listening on enp4s0(#2): fe80::5054:ff:fe3e:12b3%enp4s0

- Some time later a static IPv6 address is assigned to enp4s0 and dnsmasq listens on that IP. However, the multicast "Address already in use" error is unexpected.
dnsmasq: stopped listening on enp4s0(#2): fe80::5054:ff:fe3e:12b3%enp4s0
dnsmasq: stopped listening on enp4s0(#2): fd00:1101::3
dnsmasq: interface enp4s0 failed to join DHCPv6 multicast group: Address already in use
dnsmasq: listening on enp4s0(#2): fd00:1101::3

I'm not sure if there is a problem with the "Address already in use" error, as there are many transactions after this. One unrelated(?) thing I noticed is that after listening on the static IPv6 address, dnsmasq no longer logs the "client MAC address" of received packets even though the rest of the packet is the same, i.e.:
dnsmasq-dhcp: 13248436 client MAC address: 52:54:00:e3:68:36 <== this is not logged after interface changes
dnsmasq-dhcp: 13248436 DHCPSOLICIT(enp4s0) 00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90

Comment 20 Bob Fournier 2020-12-09 02:09:48 UTC
Created attachment 1737790 [details]
tcpdump of DHCPv6 capture on provisionhost-0-0

Comment 21 Bob Fournier 2020-12-09 02:10:55 UTC
Created attachment 1737791 [details]
latest dnsmasq log showing Solicits from worker-0-0 at end

Comment 22 Derek Higgins 2020-12-09 15:41:17 UTC
It looks like this is either a dnsmasq bug or a problem with the way we are setting up DHCPv6.

I've reproduced this locally and can successfully hit the dnsmasq server every few minutes with a VM trying to PXE boot over IPv6; it continues to work right up until we see this in the dnsmasq log:

2020-12-09T15:24:11.636214728Z dnsmasq: stopped listening on enp1s0(#2): fe80::258:63ff:fe8b:308e%enp1s0
2020-12-09T15:24:11.636214728Z dnsmasq: stopped listening on enp1s0(#2): fd00:1101::3
2020-12-09T15:24:12.049867645Z dnsmasq: interface enp1s0 failed to join DHCPv6 multicast group: Address already in use
2020-12-09T15:24:13.105359874Z dnsmasq: listening on enp1s0(#2): fd00:1101::3

PXE then stops working. Scaling down also works if you do it before this happens.

Around the same time, on the master in question, I see that the provisioning interface loses its link-local address:
[root@master-0 core]# date ; ip -o addr | grep enp1s0
Wed Dec  9 15:23:36 UTC 2020                                                                                                                                                                                       
2: enp1s0    inet6 fd00:1101::3/64 scope global dynamic \       valid_lft 6sec preferred_lft 6sec                                                                                                                  
2: enp1s0    inet6 fe80::258:63ff:fe8b:308e/64 scope link noprefixroute \       valid_lft forever preferred_lft forever                                                                                            

[root@master-0 core]# date ; ip -o addr | grep enp1s0
Wed Dec  9 15:24:40 UTC 2020                                                                                                                                                                                      
2: enp1s0    inet6 fd00:1101::3/64 scope global dynamic \       valid_lft 7sec preferred_lft 7sec                                                                                                                 
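
A simple, illustrative way to catch the moment the address disappears:

# Illustrative polling loop; logs when the fe80:: address vanishes from enp1s0.
while sleep 10; do
    ll=$(ip -6 -o addr show dev enp1s0 scope link)
    [ -z "$ll" ] && echo "$(date): link-local address gone from enp1s0"
done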


Also, NetworkManager seems to time out some kind of DHCPv6 grace period:
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6273] device (enp1s0): DHCPv6: grace period expired
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6274] device (enp1s0): state change: activated -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <warn>  [1607527451.6296] device (enp1s0): Activation: failed for connection 'Wired Connection'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6308] device (enp1s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Dec 09 15:24:11 master-0.ostest.test.metalkube.org dbus-daemon[1209]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by>
Dec 09 15:24:11 master-0.ostest.test.metalkube.org ovs-vswitchd[1366]: ovs|00504|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 09 15:24:11 master-0.ostest.test.metalkube.org systemd[1]: Starting Network Manager Script Dispatcher Service...
Dec 09 15:24:11 master-0.ostest.test.metalkube.org dbus-daemon[1209]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org systemd[1]: Started Network Manager Script Dispatcher Service.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6524] dhcp4 (enp1s0): canceled DHCP transaction
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6524] dhcp4 (enp1s0): state changed timeout -> done
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6527] dhcp6 (enp1s0): canceled DHCP transaction
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6528] dhcp6 (enp1s0): state changed fail -> done
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + [[ Wired Connection == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + echo 'Refusing to modify default connection.'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: Refusing to modify default connection.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + exit 0
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + '[' -z ']'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + echo 'Not a DHCP6 address. Ignoring.'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: Not a DHCP6 address. Ignoring.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + exit 0
Dec 09 15:24:11 master-0.ostest.test.metalkube.org root[140000]: NM mdns-hostname triggered by down.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: <13>Dec  9 15:24:11 root: NM mdns-hostname triggered by down.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org root[140003]: Hostname changed: master-0.ostest.test.metalkube.org
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: <13>Dec  9 15:24:11 root: Hostname changed: master-0.ostest.test.metalkube.org
Dec 09 15:24:11 master-0.ostest.test.metalkube.org systemd[1]: Starting Generate console-login-helper-messages issue snippet...
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0515] device (enp1s0): Activation: starting connection 'enp1s0' (4ff259a7-73c8-468c-a3b4-25c2d363ae23)
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0517] device (enp1s0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0523] device (enp1s0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0527] device (enp1s0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0533] device (enp1s0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0562] device (enp1s0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0566] device (enp1s0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0590] device (enp1s0): Activation: successful, device activated.

Comment 23 Derek Higgins 2020-12-09 17:17:55 UTC
Looks like this is linked to the loss of the link-local address. When we trigger regeneration of the link-local address on the master node:

echo 0 > /proc/sys/net/ipv6/conf/enp1s0/addr_gen_mode
echo 1 > /proc/sys/net/ipv6/conf/enp1s0/addr_gen_mode


[root@master-0 core]# ip -o addr | grep enp1s0                                                                                                                                                                    
2: enp1s0    inet6 fd00:1101::3/64 scope global dynamic \       valid_lft 10sec preferred_lft 10sec
2: enp1s0    inet6 fe80::258:63ff:fe8b:308e/64 scope link \       valid_lft forever preferred_lft forever             

PXE is working again; also, this appeared in the dnsmasq logs:
2020-12-09T15:57:00.081839842Z dnsmasq: listening on enp1s0(#2): fe80::258:63ff:fe8b:308e%enp1s0

Comment 24 Bob Fournier 2020-12-14 02:47:46 UTC
*** Bug 1905826 has been marked as a duplicate of this bug. ***

Comment 25 Bob Fournier 2020-12-14 15:25:49 UTC
Including Tony to give visibility to the networking team. See comment 22. The issue we are hitting is that the IPv6 link-local address is being removed after some time and dnsmasq stops responding to DHCPv6 Solicits, resulting in PXE failures after the link-local address is removed.

Comment 26 Derek Higgins 2020-12-15 14:20:35 UTC
Setting addr_gen_mode to EUI64 works as a workaround. Adding

echo 0 > /proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode

to the refresh-static-ip script ensures the link-local address is maintained.

I doubt this is the correct way to fix the problem, which will likely require either a fix in NetworkManager or a change to the way we configure it.
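
For context, a minimal sketch of what such a guard might look like (hypothetical; the actual change lives in the ironic-static-ip-manager refresh-static-ip script, see the linked PRs):

# Hypothetical watchdog, not the actual refresh-static-ip contents.
# Assumes PROVISIONING_INTERFACE is already set, e.g. enp1s0.
while sleep 30; do
    if ! ip -6 -o addr show dev "$PROVISIONING_INTERFACE" scope link | grep -q fe80; then
        # Toggling addr_gen_mode makes the kernel regenerate the link-local
        # address; the merged workaround forces it to 0 (EUI64).
        echo 1 > "/proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode"
        echo 0 > "/proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode"
    fi
done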

Comment 27 Derek Higgins 2020-12-15 16:48:59 UTC
Adding a PR to work around the problem. I'm still investigating the root cause and will open a separate bug to address it once the information is gathered.

Comment 28 Bob Fournier 2020-12-15 20:04:40 UTC
*** Bug 1853267 has been marked as a duplicate of this bug. ***

Comment 29 Bob Fournier 2020-12-16 11:44:13 UTC
Adding the blocker flag; the problem is cropping up in multiple reports.

Here is RHCOS bug tracking the issue - https://bugzilla.redhat.com/show_bug.cgi?id=1908302

Comment 32 Polina Rabinovich 2020-12-21 09:38:05 UTC
[kni@provisionhost-0-0 ~]$ oc get bmh -n openshift-machine-api
NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                          HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-mhfwx-master-0         redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/3ac0499c-4968-4720-8dd2-534c07299b37                      true     
openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-mhfwx-master-1         redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/63e83258-6db9-40ca-8dbe-29a352aafc45                      true     
openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-mhfwx-master-2         redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/74bff71a-9559-420c-9946-2cf0c374e614                      true     
openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-mhfwx-worker-0-ccvrs   redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/bc55c605-2380-4d07-a59a-7326aba8faa4   unknown            true     
openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-mhfwx-worker-0-nftch   redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/d976f337-2ffe-4061-a8c2-4c802e2db726   unknown            true     

[kni@provisionhost-0-0 ~]$ oc annotate machine ocp-edge-cluster-0-mhfwx-worker-0-ccvrs machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
machine.machine.openshift.io/ocp-edge-cluster-0-mhfwx-worker-0-ccvrs annotated

[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-mhfwx-worker-0 --replicas=1
machineset.machine.openshift.io/ocp-edge-cluster-0-mhfwx-worker-0 scaled

[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api 
NAME                                      PHASE     TYPE   REGION   ZONE   AGE   NODE                                              PROVIDERID                                                                                         STATE
ocp-edge-cluster-0-mhfwx-master-0         Running                          88m   master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-master-0-0/85754a6f-1acf-4667-b3f5-73664c483b4e   
ocp-edge-cluster-0-mhfwx-master-1         Running                          88m   master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-master-0-1/ad3a2d95-dfb9-486e-b810-9a4ed4aad3ea   
ocp-edge-cluster-0-mhfwx-master-2         Running                          88m   master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-master-0-2/f907e8ba-7b0d-4dde-975c-792d3b1d38fd   
ocp-edge-cluster-0-mhfwx-worker-0-nftch   Running                          61m   worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-worker-0-1/e0e9fba8-4f84-4702-95b2-b0d298b4c048   

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME                                              STATUS   ROLES    AGE   VERSION
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   66m   v1.20.0+87544c5
master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   67m   v1.20.0+87544c5
master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   65m   v1.20.0+87544c5
worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    worker   41m   v1.20.0+87544c5
[kni@provisionhost-0-0 ~]$ oc get bmh -n openshift-machine-api

Comment 33 Gurenko Alex 2021-01-20 14:25:31 UTC
I would like to reopen this BZ as I'm seeing this behavior on our bare metal setup.

version   4.7.0-0.nightly-2021-01-12-150634

Adding a new bmh fails with the same symptoms: it fails to boot over PXE and the master node is missing its link-local address.

After running [core@openshift-master-1 ~]$ sudo sysctl net.ipv6.conf.eno1np0.addr_gen_mode=1 I was able to add the bmh and the host got introspected:

[kni@ocp-edge18 bmh-node-deploy]$ oc get bmh
NAME                 STATUS   PROVISIONING STATUS      CONSUMER                            BMC                                                          HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0   OK       externally provisioned   ocp-edge2-spctf-master-0            redfish://10.46.61.34/redfish/v1/Systems/System.Embedded.1                      true
openshift-master-1   OK       externally provisioned   ocp-edge2-spctf-master-1            redfish://10.46.61.35/redfish/v1/Systems/System.Embedded.1                      true
openshift-master-2   OK       externally provisioned   ocp-edge2-spctf-master-2            redfish://10.46.61.36/redfish/v1/Systems/System.Embedded.1                      true
openshift-worker-0   OK       provisioned              ocp-edge2-spctf-worker-0-5wt9r      redfish://10.46.61.37/redfish/v1/Systems/System.Embedded.1   unknown            true
openshift-worker-1   OK       ready                                                        redfish://10.46.61.38/redfish/v1/Systems/System.Embedded.1   unknown            false
openshift-worker-2   OK       provisioned              ocp-edge2-spctf-worker-lb-0-9vrqr   redfish://10.46.61.70/redfish/v1/Systems/System.Embedded.1   unknown            true
openshift-worker-3   OK       provisioned              ocp-edge2-spctf-worker-lb-0-48pqs   redfish://10.46.61.71/redfish/v1/Systems/System.Embedded.1   unknown            true

It seems the issue presents itself some time after deployment, so testing right away after the deployment completes does not reveal it.
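
For reference, a quick (hypothetical) way to check whether any master has lost the link-local address on its provisioning interface, assuming oc debug access and the eno1np0 interface name used above:

# Hypothetical check; the interface name is environment-specific.
for n in $(oc get nodes -l node-role.kubernetes.io/master -o name); do
    echo "== $n"
    oc debug "$n" -- chroot /host ip -6 -o addr show dev eno1np0 scope link
done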

Comment 34 Gurenko Alex 2021-01-20 14:27:58 UTC
OSTREE_VERSION='47.83.202101120742-0'

Comment 35 Derek Higgins 2021-01-20 15:01:31 UTC
(In reply to Gurenko Alex from comment #33)
> I would like to reopen this BZ as I'm seeing this behavior on our bare metal
> setup.
> 
> version   4.7.0-0.nightly-2021-01-12-150634
> 
> Adding new bmh is failing with same symptoms failing to boot over PXE and
> master node missing link local address.
> 
> After running [core@openshift-master-1 ~]$ sudo sysctl
> net.ipv6.conf.eno1np0.addr_gen_mode=1 I was able to add bmh and host got
> introspected:

There was another corner case where the same problem would occur if the
metal3 pod moved from one master to another
https://bugzilla.redhat.com/show_bug.cgi?id=1911664

The fix for this merged last week, 
https://github.com/openshift/ironic-static-ip-manager/pull/14

Can you retry with this fix as you may be hitting the same corner case.

Comment 36 Gurenko Alex 2021-01-20 15:27:23 UTC
(In reply to Derek Higgins from comment #35)
> (In reply to Gurenko Alex from comment #33)
> > I would like to reopen this BZ as I'm seeing this behavior on our bare metal
> > setup.
> > 
> > version   4.7.0-0.nightly-2021-01-12-150634
> > 
> > Adding new bmh is failing with same symptoms failing to boot over PXE and
> > master node missing link local address.
> > 
> > After running [core@openshift-master-1 ~]$ sudo sysctl
> > net.ipv6.conf.eno1np0.addr_gen_mode=1 I was able to add bmh and host got
> > introspected:
> 
> There was another corner case where the same problem would occur if the
> metal3 pod moved from one master to another
> https://bugzilla.redhat.com/show_bug.cgi?id=1911664
> 
> The fix for this merged last week, 
> https://github.com/openshift/ironic-static-ip-manager/pull/14
> 
> Can you retry with this fix as you may be hitting the same corner case.

Based on what I see, the creation time of the metal3 pod matches the cluster creation (deployment) time and it has 0 (zero) restarts; do you still think this might be it?

Comment 37 Derek Higgins 2021-01-20 15:43:22 UTC
(In reply to Gurenko Alex from comment #36)
> Based on what I see, creation time of the metal3 pod matches cluster
> creation (deployment) time and has 0 (zero) restarts, do you still think
> this might be it?

Sounds like it's a different problem, but the workaround of toggling addr_gen_mode any time the link-local address is lost may work as a solution in your case. If so, that should work OK until we find the correct fix and can remove the workaround.

Comment 38 Gurenko Alex 2021-01-21 11:36:13 UTC
(In reply to Derek Higgins from comment #37)
> (In reply to Gurenko Alex from comment #36)
> > Based on what I see, creation time of the metal3 pod matches cluster
> > creation (deployment) time and has 0 (zero) restarts, do you still think
> > this might be it?
> 
> Sounds like its a different problem but the workaround of toggling
> addr_gen_mode any
> time the link local address was lost may work as a solution in your case. If
> so that
> should work ok until we find the correct fix and can remove the workaround.

Yes, the workaround works for now, but I would ask to keep this BZ open and block it with https://bugzilla.redhat.com/show_bug.cgi?id=1908302 so we have an up-to-date status.

Comment 39 Derek Higgins 2021-01-21 11:48:07 UTC
(In reply to Gurenko Alex from comment #38)
> (In reply to Derek Higgins from comment #37)
> > (In reply to Gurenko Alex from comment #36)
> > > Based on what I see, creation time of the metal3 pod matches cluster
> > > creation (deployment) time and has 0 (zero) restarts, do you still think
> > > this might be it?
> > 
> > Sounds like its a different problem but the workaround of toggling
> > addr_gen_mode any
> > time the link local address was lost may work as a solution in your case. If
> > so that
> > should work ok until we find the correct fix and can remove the workaround.
> 
> Yes, the workaround works for now, but I would ask to keep this BZ open and
> block it with https://bugzilla.redhat.com/show_bug.cgi?id=1908302 so we have
> an up to date status.

Just to be sure: when you say the workaround works for now, are you talking about the workaround we've merged into the container? If that's the case then I think this can be closed, as it's no longer blocking 4.7.

On the other hand, if you still have to manually set addr_gen_mode then we should continue to find a solution for 4.7.

Comment 40 Gurenko Alex 2021-01-21 12:35:52 UTC
(In reply to Derek Higgins from comment #39)
> (In reply to Gurenko Alex from comment #38)
> > (In reply to Derek Higgins from comment #37)
> > > (In reply to Gurenko Alex from comment #36)
> > > > Based on what I see, creation time of the metal3 pod matches cluster
> > > > creation (deployment) time and has 0 (zero) restarts, do you still think
> > > > this might be it?
> > > 
> > > Sounds like its a different problem but the workaround of toggling
> > > addr_gen_mode any
> > > time the link local address was lost may work as a solution in your case. If
> > > so that
> > > should work ok until we find the correct fix and can remove the workaround.
> > 
> > Yes, the workaround works for now, but I would ask to keep this BZ open and
> > block it with https://bugzilla.redhat.com/show_bug.cgi?id=1908302 so we have
> > an up to date status.
> 
> Just to be sure, when you say the workaround works for now, are you talking
> about
> the workaround we've merged into the container? If thats the case then I
> think this
> can be closed, as its no longer blocking 4.7 
> 
> On the other hand if you still have to manually set addr_gen_mode then we
> should
> continue to find a solution for 4.7

Good clarification. I meant that I had to do it manually, but as of 4.7.0-0.nightly-2021-01-12-150634 (I guess it's coincidentally the last version without the patch) I do not see the workaround available inside the container. I'll be redeploying the cluster sometime next week to a newer version to confirm the WA on BM.

Comment 41 Tomas Sedovic 2021-01-27 16:50:18 UTC
Re-setting back to VERIFIED. If I understand it correctly, this was not a regression. Rather, the images used did not have the fix yet.

If you reproduce this issue even with the fix present please feel free to open a new BZ or reopen this one.

Comment 43 errata-xmlrpc 2021-02-24 15:35:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

