+++ This bug was initially created as a clone of Bug #1901040 +++

Version:
Cluster version is 4.7.0-0.nightly-2020-11-23-195308
-----------
Setup: Provisioning_net_IPv6, Baremetal_net_IPv4, disconnected install
-----------
Platform: libvirt IPI (automated install with `openshift-baremetal-install`)
-----------
When trying to remove a node, the status of the node is stuck in the Deleting state after running

  oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=<N-1>

although it should have been removed from the node and machine lists ($ oc get nodes, $ oc get machine -o wide -n openshift-machine-api). The Error Count field is present but empty (I attached bmh-description).

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2020-11-23-195308   True        False         68m     Cluster version is 4.7.0-0.nightly-2020-11-23-195308

After scaling down:

[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      15m
baremetal                                  4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
cloud-credential                           4.7.0-0.nightly-2020-11-23-195308   True        False         False      145m
cluster-autoscaler                         4.7.0-0.nightly-2020-11-23-195308   True        False         False      120m
config-operator                            4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
console                                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      81m
csi-snapshot-controller                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      112m
dns                                        4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
etcd                                       4.7.0-0.nightly-2020-11-23-195308   True        False         False      120m
image-registry                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      70m
ingress                                    4.7.0-0.nightly-2020-11-23-195308   True        False         True       86m
insights                                   4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
kube-apiserver                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      119m
kube-controller-manager                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      120m
kube-scheduler                             4.7.0-0.nightly-2020-11-23-195308   True        False         False      118m
kube-storage-version-migrator              4.7.0-0.nightly-2020-11-23-195308   True        False         False      18m
machine-api                                4.7.0-0.nightly-2020-11-23-195308   True        False         False      118m
machine-approver                           4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
machine-config                             4.7.0-0.nightly-2020-11-23-195308   False       False         True       5m39s
marketplace                                4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
monitoring                                 4.7.0-0.nightly-2020-11-23-195308   False       True          True       7m50s
network                                    4.7.0-0.nightly-2020-11-23-195308   True        True          True       122m
node-tuning                                4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
openshift-apiserver                        4.7.0-0.nightly-2020-11-23-195308   True        False         False      113m
openshift-controller-manager               4.7.0-0.nightly-2020-11-23-195308   True        False         False      113m
openshift-samples                          4.7.0-0.nightly-2020-11-23-195308   True        False         False      69m
operator-lifecycle-manager                 4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2020-11-23-195308   True        False         False      121m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2020-11-23-195308   True        False         False      99m
service-ca                                 4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m
storage                                    4.7.0-0.nightly-2020-11-23-195308   True        False         False      122m

Steps:
-----------
[kni@provisionhost-0-0 ~]$ oc annotate machine ocp-edge-cluster-0-n2gmf-worker-0-pjmfw machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
machine.machine.openshift.io/ocp-edge-cluster-0-n2gmf-worker-0-pjmfw annotated
[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-n2gmf-worker-0 --replicas=1
machineset.machine.openshift.io/ocp-edge-cluster-0-n2gmf-worker-0 scaled
[kni@provisionhost-0-0 ~]$ oc get nodes
[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api

Must-gather:
-------------------------------------
I tried to run must-gather but got this error, so I attached ironic-conductor.log instead of a must-gather:

error: gather did not start for pod must-gather-xvzzz: unable to pull image: ImagePullBackOff: Back-off pulling image "registry.redhat.io/openshift4/ose-must-gather:latest"

Expected results:
-------------------------------------
Deleted node is not listed in the node and machine lists.

Actual results:
-------------------------------------
Status of the node is stuck in the Deleting state.

How to reproduce (a consolidated sketch of these steps follows below):
-------------------------------------
1. Match the bmh name <name-CONSUMER> to the machine (CONSUMER column):
   $ oc get bmh -n openshift-machine-api
2. $ oc annotate machine <name-CONSUMER> machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
3. Scale down the replica count:
   $ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-worker-0 --replicas=<N-1>
4. Ensure the deleted node is no longer listed:
   $ oc get nodes
   $ oc get machine -o wide -n openshift-machine-api
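[Editor's note] A minimal scripted sketch of the reproduction steps above. The machineset and BareMetalHost names are placeholders taken from this report and are assumptions; adjust them for the cluster under test.

# Sketch: delete one specific worker by scaling its machineset down by one.
NAMESPACE=openshift-machine-api
MACHINESET=ocp-edge-cluster-0-worker-0    # assumed machineset name
BMH=openshift-worker-0-0                  # assumed BareMetalHost to remove

# Find the machine consuming this host (the CONSUMER column of `oc get bmh`).
MACHINE=$(oc get bmh "$BMH" -n "$NAMESPACE" -o jsonpath='{.spec.consumerRef.name}')

# Mark that machine for deletion, then reduce the machineset replica count by one.
oc annotate machine "$MACHINE" machine.openshift.io/cluster-api-delete-machine=yes -n "$NAMESPACE"
CURRENT=$(oc get machineset "$MACHINESET" -n "$NAMESPACE" -o jsonpath='{.spec.replicas}')
oc scale machineset "$MACHINESET" -n "$NAMESPACE" --replicas=$((CURRENT - 1))

# The machine and its node should eventually disappear from these listings.
oc get machine -o wide -n "$NAMESPACE"
oc get nodes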
--- Additional comment from Polina Rabinovich on 2020-11-24 10:52:45 GMT ---

--- Additional comment from Polina Rabinovich on 2020-11-24 10:58:56 GMT ---

This is the output; it wasn't added in the bug description:

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS                     ROLES    AGE    VERSION
master-0-0   Ready                      master   106m   v1.19.2+b005cfc
master-0-1   Ready                      master   106m   v1.19.2+b005cfc
master-0-2   Ready                      master   106m   v1.19.2+b005cfc
worker-0-0   Ready,SchedulingDisabled   worker   69m    v1.19.2+b005cfc
worker-0-1   Ready                      worker   67m    v1.19.2+b005cfc

[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api
NAME                                      PHASE      TYPE   REGION   ZONE   AGE    NODE         PROVIDERID                                                     STATE
ocp-edge-cluster-0-n2gmf-master-0         Running                           129m   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0
ocp-edge-cluster-0-n2gmf-master-1         Running                           129m   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1
ocp-edge-cluster-0-n2gmf-master-2         Running                           129m   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2
ocp-edge-cluster-0-n2gmf-worker-0-pjmfw   Deleting                          103m   worker-0-0   baremetalhost:///openshift-machine-api/openshift-worker-0-0
ocp-edge-cluster-0-n2gmf-worker-0-zxqp7   Running                           103m   worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1

--- Additional comment from Dmitry Tantsur on 2020-11-24 17:24:15 GMT ---

I have a feeling that the node got stuck while cleaning. We'll need logs from the dnsmasq container in the metal3 pod, ironic-api, and ideally also a look at the node's virtual console to see what is actually happening there. Could you please provide these?

--- Additional comment from Polina Rabinovich on 2020-11-25 10:04:10 GMT ---

log from ironic-api container in the metal3 pod

--- Additional comment from Polina Rabinovich on 2020-11-25 10:05:33 GMT ---

log from dnsmasq container in the metal3 pod

--- Additional comment from Polina Rabinovich on 2020-11-25 10:08:22 GMT ---

(In reply to Dmitry Tantsur from comment #3)
> I have a feeling that the node got stuck while cleaning. We'll need logs
> from the dnsmasq container in the metal3 pod, ironic-api and ideally also
> take a look at the node's virtual console to see what is actually happening
> there. Could you please provide these?

Sure, I attached logs from the ironic-api & dnsmasq containers in the metal3 pod.

--- Additional comment from Bob Fournier on 2020-12-01 19:52:44 GMT ---

Yes, it looks like cleaning has failed:

ironic-api.log:2020-11-25 09:53:18.492 41 DEBUG ironic.api.expose [req-64e1da59-0ec3-4696-9e1f-9b6d6b9b8c5f ironic-user - - - -] Client-side error: The requested action "deleted" can not be performed on node "fa8d76db-75f8-49ac-bf17-8d311e437d96" while it is in state "clean failed". format_exception /usr/lib/python3.6/site-packages/ironic/api/expose.py:184
ironic-api.log:2020-11-25 09:53:18.592 41 DEBUG ironic.api.expose [req-b3ecf171-aa25-4123-a175-1288b84729b0 ironic-user - - - -] Client-side error: The requested action "deleted" can not be performed on node "fa8d76db-75f8-49ac-bf17-8d311e437d96" while it is in state "clean failed". format_exception /usr/lib/python3.6/site-packages/ironic/api/expose.py:184

Why isn't clear yet. I only see node fa8d76db-75f8-49ac-bf17-8d311e437d96 referenced in ironic-api.log, not in ironic-conductor.log. Note that we don't have the console output, which may provide more info.

--- Additional comment from Bob Fournier on 2020-12-01 21:25:06 GMT ---

It looks like the last timestamp in the ironic-conductor.log is 2020-11-24 10:44:54.236, which is prior to the error in ironic-api.log. Do you have an updated ironic-api.log?

--- Additional comment from Bob Fournier on 2020-12-01 22:29:39 GMT ---

Prior comment should have read "Do you have an updated ironic-conductor.log?"

--- Additional comment from Polina Rabinovich on 2020-12-02 12:48:57 GMT ---

--- Additional comment from Polina Rabinovich on 2020-12-02 12:50:16 GMT ---

--- Additional comment from Polina Rabinovich on 2020-12-02 12:51:21 GMT ---

--- Additional comment from Polina Rabinovich on 2020-12-02 12:54:18 GMT ---

(In reply to Bob Fournier from comment #9)
> Prior comment should have read "Do you have an updated ironic-conductor.log?"

I attached new logs (dnsmasq.log, ironic-conductor.log and ironic-api.log).

--- Additional comment from Bob Fournier on 2020-12-02 16:37:14 GMT ---

Thanks Polina, looking at your system I was able to confirm cleaning times out after 30 minutes on worker-0-0:

2020-12-02 12:37:32.191 1 INFO ironic.conductor.utils [req-44060f5c-633c-4960-a2ea-d9a052606394 ironic-user - - - -] Successfully set node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 power state to power on by rebooting.
2020-12-02 12:37:32.210 1 INFO ironic.conductor.task_manager [req-44060f5c-633c-4960-a2ea-d9a052606394 ironic-user - - - -] Node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 moved to provision state "clean wait" from state "cleaning"; target provision state is "available"
2020-12-02 13:07:37.248 1 ERROR ironic.conductor.task_manager [req-778612d0-7559-4030-890d-493d2469c3c3 - - - - -] Node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 moved to provision state "clean failed" from state "clean wait"; target provision state is "available"
2020-12-02 13:07:37.300 1 DEBUG ironic.conductor.utils [-] Node 2e0b6e9d-bbc9-44b6-872d-79d359aacf38 should be fast-track-able, but the agent doesn't seem to be running. Last heartbeat: 2020-12-02T11:58:00.940759 is_fast_track /usr/lib/python3.6/site-packages/ironic/conductor/utils.py:1052
2020-12-02 13:07:37.303 1 ERROR ironic.conductor.utils [-] Timeout reached while cleaning the node. Please check if the ramdisk responsible for the cleaning is running on the node. Failed on step {}.

It's not clear why cleaning failed; I'll view the worker's console when we reproduce this again.
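[Editor's note] Since must-gather could not pull its image in this disconnected setup, the container logs Dmitry asked for can be pulled directly from the metal3 pod with oc. This is a minimal sketch, not the official collection procedure; the namespace is openshift-machine-api, and container names vary by release, so the script lists them from the pod spec instead of hard-coding them.

# Sketch: dump every container log from the metal3 pod (dnsmasq, ironic-api,
# ironic-conductor, etc.) to local files.
NAMESPACE=openshift-machine-api
POD=$(oc get pods -n "$NAMESPACE" -o name | grep metal3 | head -n1)   # assumes the pod name contains "metal3"

# Show the container names, then save each container's log.
oc get "$POD" -n "$NAMESPACE" -o jsonpath='{.spec.containers[*].name}'; echo
for c in $(oc get "$POD" -n "$NAMESPACE" -o jsonpath='{.spec.containers[*].name}'); do
  oc logs "$POD" -n "$NAMESPACE" -c "$c" > "${c}.log"
done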
--- Additional comment from Bob Fournier on 2020-12-08 13:32:16 GMT ---

--- Additional comment from Bob Fournier on 2020-12-08 13:46:18 GMT ---

Attached the worker-0-0 console after scaling down; it looks like PXE fails for both the IPv4 and IPv6 attempts using the MAC on provisioning-0 set for the worker-0-0 VM.

<interface type='network'>
  <mac address='52:54:00:e3:68:36'/>
  <source network='provisioning-0' portid='aab70ff5-3026-4050-bab7-c74754ab6ed9' bridge='provisioning-0'/>
  <target dev='vnet2'/>
  <model type='virtio'/>
  <boot order='1'/>
  <alias name='net0'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</interface>

I don't see a request in the dnsmasq container for 52:54:00:e3:68:36 when this PXE failure occurs. I do see DHCPv6 solicits logged when the node was originally brought up, just not after the node was rebooted for cleaning; here are the earlier requests:

dnsmasq-dhcp: 13409945 DHCPSOLICIT(enp4s0) 00:03:00:01:52:54:00:e3:68:36
dnsmasq-dhcp: 13409945 DHCPREPLY(enp4s0) fd00:1101::eaf0:317c:1489:1577 00:03:00:01:52:54:00:e3:68:36

In the dnsmasq container I did see an entry for shared/html/pxelinux.cfg/52-54-00-e3-68-36, but as no requests were received by dnsmasq it's not used.

What's interesting in the console is that it looks like the IPv4 request gets a DHCP address, as it fails with PXE-E18 while attempting to get the bootfile. However, I don't see this request logged in the dnsmasq log.

--- Additional comment from Bob Fournier on 2020-12-08 14:25:30 GMT ---

Also in the ironic-conductor log we see:

- when ironic starts:

2020-12-08 10:45:40.318 1 WARNING ironic.drivers.modules.network.flat [req-542eee7f-6e40-468a-ba1c-86adaed6bd12 - - - - -] Please specify a valid UUID or name for [neutron]/cleaning_network configuration option so that this interface is able to perform cleaning. Otherwise, cleaning operations will fail to start.
2020-12-08 10:45:40.319 1 INFO ironic.common.driver_factory [req-542eee7f-6e40-468a-ba1c-86adaed6bd12 - - - - -] Loaded the following network interfaces: ['flat', 'noop']

- when the node that fails is put into cleaning (note that this msg is logged for all nodes):

2020-12-08 12:56:34.944 1 WARNING ironic.common.pxe_utils [req-a301b4c4-6915-437a-bb04-c38819be6d83 ironic-user - - - -] IPv6 is enabled and the DHCP driver appears set to a plugin aside from "neutron". Node 7a2282cf-2a54-4939-94ce-0110bfd23057 may not receive proper DHCPv6 provided boot parameters.

--- Additional comment from Bob Fournier on 2020-12-08 18:57:08 GMT ---

Note that we see the DHCP requests from worker-0-0 in kni-provisioner, just not in the dnsmasq container on master-0-x.
[kni@provisionhost-0-0 /]$ sudo tcpdump -i eth0 -n ether host 52:54:00:e3:68:36
18:52:45.846674 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:e3:68:36, length 347
18:53:01.846893 IP 0.0.0.0.bootpc > 255.255.255.255.bootps: BOOTP/DHCP, Request from 52:54:00:e3:68:36, length 347
18:53:33.854631 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit
18:53:38.847745 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit
18:53:47.847882 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit
18:54:04.848217 IP6 fe80::5054:ff:fee3:6836.dhcpv6-client > ff02::1:2.dhcpv6-server: dhcp6 solicit

--- Additional comment from Bob Fournier on 2020-12-09 02:07:06 GMT ---

Actually we are getting the DHCP Solicits in the dnsmasq log for worker-0-0 (MAC 52:54:00:e3:68:36). It was misleading in that dnsmasq logs the client ID only (00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90 in this case), so this is the full Solicit/Advertise for worker-0-0:

dnsmasq-dhcp: 8302732 DHCPSOLICIT(enp4s0) 00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90
dnsmasq-dhcp: 8302732 DHCPADVERTISE(enp4s0) fd00:1101::25ce:3597:2113:afd1 00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90
dnsmasq-dhcp: 8302732 requested options: 59:bootfile-url, 60:bootfile-param, 23:dns-server,
dnsmasq-dhcp: 8302732 requested options: 16:vendor-class
dnsmasq-dhcp: 8302732 tags: pxe6, dhcpv6, enp4s0
dnsmasq-dhcp: 8302732 sent size: 18 option: 1 client-id 00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb...
dnsmasq-dhcp: 8302732 sent size: 14 option: 2 server-id 00:01:00:01:27:62:15:d5:52:54:00:3e:12:b3
dnsmasq-dhcp: 8302732 sent size: 40 option: 3 ia-na IAID=4216520443 T1=1800 T2=3150
dnsmasq-dhcp: 8302732 nest size: 24 option: 5 iaaddr fd00:1101::25ce:3597:2113:afd1 PL=3600 VL=3600
dnsmasq-dhcp: 8302732 sent size: 9 option: 13 status 0 success
dnsmasq-dhcp: 8302732 sent size: 1 option: 7 preference 0
dnsmasq-dhcp: 8302732 sent size: 33 option: 59 bootfile-url tftp://[fd00:1101::3]/snponly.efi

This results in the PXE process logging "PXE-E16: No valid offer received" to the console (see the attached worker-0-0 console log for an additional error message). It's not clear why the Advertise is being rejected. It would help to pull tcpdump down to master-0-1 to capture the entire transfer; I've tried to get it via "toolbox" but the pull failed. I'm attaching a tcpdump that matches this dnsmasq log, captured on provisionhost-0-0 (because of the lack of tcpdump on master-0-1). I'm also attaching the latest dnsmasq-ew.log with the worker-0-0 Solicits at the end.

There are a couple of other interesting things in this log:

- when dnsmasq is started it is listening on the link-local address:

dnsmasq-dhcp: DHCP, sockets bound exclusively to interface enp4s0
dnsmasq: listening on enp4s0(#2): fe80::5054:ff:fe3e:12b3%enp4s0

- sometime later a static IPv6 address is assigned to enp4s0 and dnsmasq listens on that IP. However, the multicast "Address already in use" is unexpected:

dnsmasq: stopped listening on enp4s0(#2): fe80::5054:ff:fe3e:12b3%enp4s0
dnsmasq: stopped listening on enp4s0(#2): fd00:1101::3
dnsmasq: interface enp4s0 failed to join DHCPv6 multicast group: Address already in use
dnsmasq: listening on enp4s0(#2): fd00:1101::3

I'm not sure if there is a problem with the "Address already in use" as there are many transactions after this. One unrelated(?) thing I noticed is that after listening on the static IPv6 address dnsmasq no longer logs the "client MAC address" of received packets, even though the rest of the packet is the same, i.e.:

dnsmasq-dhcp: 13248436 client MAC address: 52:54:00:e3:68:36   <== this is not logged after the interface changes
dnsmasq-dhcp: 13248436 DHCPSOLICIT(enp4s0) 00:04:c7:08:e7:ff:3d:cb:e8:47:98:a0:30:bb:db:79:7f:90
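[Editor's note] To confirm on the master which addresses the DHCPv6 server is actually bound to at a given moment, the listening sockets can be inspected from the host. A minimal sketch, assuming master-0-1 hosts the metal3 pod, that the provisioning interface on the masters is enp4s0 (as in the logs above), and that the oc debug tools image is reachable in this disconnected environment (toolbox was not):

# Which sockets listen on the DHCPv6 server port (547/udp) on the master?
oc debug node/master-0-1 -- chroot /host ss -ulnp 'sport = :547'

# And which IPv6 addresses (global and link-local) the provisioning interface currently has:
oc debug node/master-0-1 -- chroot /host ip -6 addr show dev enp4s0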
--- Additional comment from Bob Fournier on 2020-12-09 02:09:48 GMT ---

--- Additional comment from Bob Fournier on 2020-12-09 02:10:55 GMT ---

--- Additional comment from Derek Higgins on 2020-12-09 15:41:17 GMT ---

It looks like this is either a dnsmasq bug or a problem with the way we are setting up DHCPv6.

I've reproduced this locally and can successfully hit the dnsmasq server every few minutes with a VM trying to PXE over IPv6. It continues to work right up until we see this in the dnsmasq log:

2020-12-09T15:24:11.636214728Z dnsmasq: stopped listening on enp1s0(#2): fe80::258:63ff:fe8b:308e%enp1s0
2020-12-09T15:24:11.636214728Z dnsmasq: stopped listening on enp1s0(#2): fd00:1101::3
2020-12-09T15:24:12.049867645Z dnsmasq: interface enp1s0 failed to join DHCPv6 multicast group: Address already in use
2020-12-09T15:24:13.105359874Z dnsmasq: listening on enp1s0(#2): fd00:1101::3

PXE then stops working (scaling down also works if you do it before this happens). Around the same time on the master in question I see that the provisioning interface loses its link-local address:

[root@master-0 core]# date ; ip -o addr | grep enp1s0
Wed Dec 9 15:23:36 UTC 2020
2: enp1s0    inet6 fd00:1101::3/64 scope global dynamic \       valid_lft 6sec preferred_lft 6sec
2: enp1s0    inet6 fe80::258:63ff:fe8b:308e/64 scope link noprefixroute \       valid_lft forever preferred_lft forever
[root@master-0 core]# date ; ip -o addr | grep enp1s0
Wed Dec 9 15:24:40 UTC 2020
2: enp1s0    inet6 fd00:1101::3/64 scope global dynamic \       valid_lft 7sec preferred_lft 7sec

Also, NetworkManager seems to time out some kind of DHCPv6 grace period:

Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6273] device (enp1s0): DHCPv6: grace period expired
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6274] device (enp1s0): state change: activated -> failed (reason 'ip-config-unavailable', sys-iface-state: 'managed')
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <warn>  [1607527451.6296] device (enp1s0): Activation: failed for connection 'Wired Connection'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6308] device (enp1s0): state change: failed -> disconnected (reason 'none', sys-iface-state: 'managed')
Dec 09 15:24:11 master-0.ostest.test.metalkube.org dbus-daemon[1209]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher' unit='dbus-org.freedesktop.nm-dispatcher.service' requested by>
Dec 09 15:24:11 master-0.ostest.test.metalkube.org ovs-vswitchd[1366]: ovs|00504|bridge|ERR|interface br-ex: ignoring mac in Interface record (use Bridge record to set local port's mac)
Dec 09 15:24:11 master-0.ostest.test.metalkube.org systemd[1]: Starting Network Manager Script Dispatcher Service...
Dec 09 15:24:11 master-0.ostest.test.metalkube.org dbus-daemon[1209]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org systemd[1]: Started Network Manager Script Dispatcher Service.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6524] dhcp4 (enp1s0): canceled DHCP transaction
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6524] dhcp4 (enp1s0): state changed timeout -> done
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6527] dhcp6 (enp1s0): canceled DHCP transaction
Dec 09 15:24:11 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527451.6528] dhcp6 (enp1s0): state changed fail -> done
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + [[ OVNKubernetes == \O\V\N\K\u\b\e\r\n\e\t\e\s ]]
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + [[ Wired Connection == \W\i\r\e\d\ \C\o\n\n\e\c\t\i\o\n ]]
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + echo 'Refusing to modify default connection.'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: Refusing to modify default connection.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + exit 0
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + '[' -z ']'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + echo 'Not a DHCP6 address. Ignoring.'
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: Not a DHCP6 address. Ignoring.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: + exit 0
Dec 09 15:24:11 master-0.ostest.test.metalkube.org root[140000]: NM mdns-hostname triggered by down.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: <13>Dec 9 15:24:11 root: NM mdns-hostname triggered by down.
Dec 09 15:24:11 master-0.ostest.test.metalkube.org root[140003]: Hostname changed: master-0.ostest.test.metalkube.org
Dec 09 15:24:11 master-0.ostest.test.metalkube.org nm-dispatcher[139988]: <13>Dec 9 15:24:11 root: Hostname changed: master-0.ostest.test.metalkube.org
Dec 09 15:24:11 master-0.ostest.test.metalkube.org systemd[1]: Starting Generate console-login-helper-messages issue snippet...
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0515] device (enp1s0): Activation: starting connection 'enp1s0' (4ff259a7-73c8-468c-a3b4-25c2d363ae23)
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0517] device (enp1s0): state change: disconnected -> prepare (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0523] device (enp1s0): state change: prepare -> config (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0527] device (enp1s0): state change: config -> ip-config (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0533] device (enp1s0): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0562] device (enp1s0): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0566] device (enp1s0): state change: secondaries -> activated (reason 'none', sys-iface-state: 'external')
Dec 09 15:24:12 master-0.ostest.test.metalkube.org NetworkManager[1382]: <info>  [1607527452.0590] device (enp1s0): Activation: successful, device activated.

--- Additional comment from Derek Higgins on 2020-12-09 17:17:55 GMT ---

Looks like this is linked to the loss of the link-local address. When we trigger regeneration of the link-local address on the master node:

echo 0 > /proc/sys/net/ipv6/conf/enp1s0/addr_gen_mode
echo 1 > /proc/sys/net/ipv6/conf/enp1s0/addr_gen_mode

[root@master-0 core]# ip -o addr | grep enp1s0
2: enp1s0    inet6 fd00:1101::3/64 scope global dynamic \       valid_lft 10sec preferred_lft 10sec
2: enp1s0    inet6 fe80::258:63ff:fe8b:308e/64 scope link \       valid_lft forever preferred_lft forever

PXE is working again, and this also appeared in the dnsmasq logs:

2020-12-09T15:57:00.081839842Z dnsmasq: listening on enp1s0(#2): fe80::258:63ff:fe8b:308e%enp1s0

--- Additional comment from Bob Fournier on 2020-12-14 02:47:46 GMT ---

--- Additional comment from Bob Fournier on 2020-12-14 15:25:49 GMT ---

Including Tony to give visibility to the networking team. See comment 22. The issue we are getting is that the IPv6 link-local address is being removed after some time and dnsmasq stops responding to router solicitations, resulting in PXE failures after the removal of the link-local address.

--- Additional comment from Derek Higgins on 2020-12-15 14:20:35 GMT ---

Setting addr_gen_mode to EUI64 works as a workaround; adding

echo 0 > /proc/sys/net/ipv6/conf/$PROVISIONING_INTERFACE/addr_gen_mode

into the refresh-static-ip script ensures the link-local address is maintained. I doubt this is the correct way to fix the problem, which will likely involve either a fix in NetworkManager or in the way we configure it.

--- Additional comment from Derek Higgins on 2020-12-15 16:48:59 GMT ---

Adding a PR to work around the problem. I'm still investigating the root cause and will open a separate bug to address this when info is gathered.

--- Additional comment from Bob Fournier on 2020-12-15 20:04:40 GMT ---

--- Additional comment from Bob Fournier on 2020-12-16 11:44:13 GMT ---

Adding blocker flag, problem is cropping up in multiple reports. Here is the RHCOS bug tracking the issue: https://bugzilla.redhat.com/show_bug.cgi?id=1908302
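[Editor's note] A minimal sketch of the manual workaround Derek describes above, to be run on the master hosting the metal3 pod until the real fix (tracked in the RHCOS bug below and the refresh-static-ip change) lands. The interface name is an assumption (enp1s0 in Derek's local reproducer, enp4s0 on the QE cluster); the loop simply re-triggers link-local address generation whenever the fe80:: address disappears.

# Watchdog sketch: regenerate the link-local address on the provisioning
# interface if it goes missing. Interface name is an assumption.
IFACE=${PROVISIONING_INTERFACE:-enp1s0}

while true; do
  if ! ip -6 -o addr show dev "$IFACE" scope link | grep -q 'fe80::'; then
    echo "$(date): link-local address missing on $IFACE, regenerating"
    # Toggling addr_gen_mode (0 = EUI64, 1 = none) forces the kernel to
    # recreate the link-local address, as shown in Derek's comment above.
    echo 0 > /proc/sys/net/ipv6/conf/"$IFACE"/addr_gen_mode
    echo 1 > /proc/sys/net/ipv6/conf/"$IFACE"/addr_gen_mode
  fi
  sleep 30
done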
--- Additional comment from Polina Rabinovich on 2020-12-21 09:38:05 GMT ---

[kni@provisionhost-0-0 ~]$ oc get bmh -n openshift-machine-api
NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                           HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-mhfwx-master-0         redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/3ac0499c-4968-4720-8dd2-534c07299b37                      true
openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-mhfwx-master-1         redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/63e83258-6db9-40ca-8dbe-29a352aafc45                      true
openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-mhfwx-master-2         redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/74bff71a-9559-420c-9946-2cf0c374e614                      true
openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-mhfwx-worker-0-ccvrs   redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/bc55c605-2380-4d07-a59a-7326aba8faa4     unknown            true
openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-mhfwx-worker-0-nftch   redfish://[fd2e:6f44:5dd8::1]:8000/redfish/v1/Systems/d976f337-2ffe-4061-a8c2-4c802e2db726     unknown            true

[kni@provisionhost-0-0 ~]$ oc annotate machine ocp-edge-cluster-0-mhfwx-worker-0-ccvrs machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
machine.machine.openshift.io/ocp-edge-cluster-0-mhfwx-worker-0-ccvrs annotated

[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-mhfwx-worker-0 --replicas=1
machineset.machine.openshift.io/ocp-edge-cluster-0-mhfwx-worker-0 scaled

[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api
NAME                                      PHASE     TYPE   REGION   ZONE   AGE   NODE                                              PROVIDERID                                                                                         STATE
ocp-edge-cluster-0-mhfwx-master-0         Running                          88m   master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-master-0-0/85754a6f-1acf-4667-b3f5-73664c483b4e
ocp-edge-cluster-0-mhfwx-master-1         Running                          88m   master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-master-0-1/ad3a2d95-dfb9-486e-b810-9a4ed4aad3ea
ocp-edge-cluster-0-mhfwx-master-2         Running                          88m   master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-master-0-2/f907e8ba-7b0d-4dde-975c-792d3b1d38fd
ocp-edge-cluster-0-mhfwx-worker-0-nftch   Running                          61m   worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/openshift-worker-0-1/e0e9fba8-4f84-4702-95b2-b0d298b4c048

[kni@provisionhost-0-0 ~]$ oc get nodes
NAME                                              STATUS   ROLES    AGE   VERSION
master-0-0.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   66m   v1.20.0+87544c5
master-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   67m   v1.20.0+87544c5
master-0-2.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    master   65m   v1.20.0+87544c5
worker-0-1.ocp-edge-cluster-0.qe.lab.redhat.com   Ready    worker   41m   v1.20.0+87544c5

[kni@provisionhost-0-0 ~]$ oc get bmh -n openshift-machine-api
Verified in version:

[kni@provisionhost-0-0 ~]$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.9     True        False         96m     Cluster version is 4.6.9
-------------------------------------------------------------------------------
[kni@provisionhost-0-0 ~]$ oc get bmh -n openshift-machine-api
NAME                   STATUS   PROVISIONING STATUS      CONSUMER                                  BMC                                                                                      HARDWARE PROFILE   ONLINE   ERROR
openshift-master-0-0   OK       externally provisioned   ocp-edge-cluster-0-lppt2-master-0         redfish://192.168.123.1:8000/redfish/v1/Systems/e717c94d-bcc9-40a1-b1a3-59cafe3bfc7b                        true
openshift-master-0-1   OK       externally provisioned   ocp-edge-cluster-0-lppt2-master-1         redfish://192.168.123.1:8000/redfish/v1/Systems/550623fd-22d2-49ab-b9d5-6501d31ec179                        true
openshift-master-0-2   OK       externally provisioned   ocp-edge-cluster-0-lppt2-master-2         redfish://192.168.123.1:8000/redfish/v1/Systems/7fb7f5aa-da0f-4943-800f-667789074ff8                        true
openshift-worker-0-0   OK       provisioned              ocp-edge-cluster-0-lppt2-worker-0-8f9jf   redfish://192.168.123.1:8000/redfish/v1/Systems/4e44929a-1aec-4f54-930b-6dfc01024a6d   unknown            true
openshift-worker-0-1   OK       provisioned              ocp-edge-cluster-0-lppt2-worker-0-sp7xn   redfish://192.168.123.1:8000/redfish/v1/Systems/035d09e4-0f42-474b-81f4-2de68b309135   unknown            true
openshift-worker-0-2   OK       provisioned              ocp-edge-cluster-0-lppt2-worker-0-h65nq   redfish://192.168.123.1:8000/redfish/v1/Systems/d96892c1-fdb7-4790-b6c1-00537b5a2cb4   unknown            true

[kni@provisionhost-0-0 ~]$ oc annotate machine <name-CONSUMER> machine.openshift.io/cluster-api-
-bash: name-CONSUMER: No such file or directory
[kni@provisionhost-0-0 ~]$ oc annotate machine ocp-edge-cluster-0-lppt2-worker-0-8f9jf machine.openshift.io/cluster-api-delete-machine=yes -n openshift-machine-api
machine.machine.openshift.io/ocp-edge-cluster-0-lppt2-worker-0-8f9jf annotated
[kni@provisionhost-0-0 ~]$ oc scale machineset -n openshift-machine-api ocp-edge-cluster-0-lppt2-worker-0 --replicas=2
machineset.machine.openshift.io/ocp-edge-cluster-0-lppt2-worker-0 scaled
-----------------------------------------------------------------------------------------------------------------------
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS                     ROLES    AGE    VERSION
master-0-0   Ready                      master   139m   v1.19.0+7070803
master-0-1   Ready                      master   139m   v1.19.0+7070803
master-0-2   Ready                      master   139m   v1.19.0+7070803
worker-0-0   Ready,SchedulingDisabled   worker   106m   v1.19.0+7070803
worker-0-1   Ready                      worker   104m   v1.19.0+7070803
worker-0-2   Ready                      worker   105m   v1.19.0+7070803

[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api
NAME                                      PHASE      TYPE   REGION   ZONE   AGE    NODE         PROVIDERID                                                     STATE
ocp-edge-cluster-0-lppt2-master-0         Running                           156m   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0
ocp-edge-cluster-0-lppt2-master-1         Running                           156m   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1
ocp-edge-cluster-0-lppt2-master-2         Running                           156m   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2
ocp-edge-cluster-0-lppt2-worker-0-8f9jf   Deleting                          133m   worker-0-0   baremetalhost:///openshift-machine-api/openshift-worker-0-0
ocp-edge-cluster-0-lppt2-worker-0-h65nq   Running                           133m   worker-0-2   baremetalhost:///openshift-machine-api/openshift-worker-0-2
ocp-edge-cluster-0-lppt2-worker-0-sp7xn   Running                           133m   worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1
--------------------------------------------------------------------------------------------------------------
[kni@provisionhost-0-0 ~]$ oc get nodes
NAME         STATUS   ROLES    AGE    VERSION
master-0-0   Ready    master   141m   v1.19.0+7070803
master-0-1   Ready    master   141m   v1.19.0+7070803
master-0-2   Ready    master   141m   v1.19.0+7070803
worker-0-1   Ready    worker   106m   v1.19.0+7070803
worker-0-2   Ready    worker   107m   v1.19.0+7070803

[kni@provisionhost-0-0 ~]$ oc get machine -o wide -n openshift-machine-api
NAME                                      PHASE     TYPE   REGION   ZONE   AGE    NODE         PROVIDERID                                                     STATE
ocp-edge-cluster-0-lppt2-master-0         Running                          160m   master-0-0   baremetalhost:///openshift-machine-api/openshift-master-0-0
ocp-edge-cluster-0-lppt2-master-1         Running                          160m   master-0-1   baremetalhost:///openshift-machine-api/openshift-master-0-1
ocp-edge-cluster-0-lppt2-master-2         Running                          160m   master-0-2   baremetalhost:///openshift-machine-api/openshift-master-0-2
ocp-edge-cluster-0-lppt2-worker-0-h65nq   Running                          138m   worker-0-2   baremetalhost:///openshift-machine-api/openshift-worker-0-2
ocp-edge-cluster-0-lppt2-worker-0-sp7xn   Running                          138m   worker-0-1   baremetalhost:///openshift-machine-api/openshift-worker-0-1
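[Editor's note] The final check in this verification can be made less manual by polling until the machine marked for deletion is actually gone. A minimal sketch, using the machine name from this verification run as an assumed placeholder:

# Sketch: wait for the annotated machine to disappear, then list the remaining nodes.
MACHINE=ocp-edge-cluster-0-lppt2-worker-0-8f9jf   # machine from this run; adjust as needed
NAMESPACE=openshift-machine-api

until ! oc get machine "$MACHINE" -n "$NAMESPACE" >/dev/null 2>&1; do
  echo "$(date): $MACHINE still present, phase: $(oc get machine "$MACHINE" -n "$NAMESPACE" -o jsonpath='{.status.phase}')"
  sleep 30
done
echo "$MACHINE removed; current nodes:"
oc get nodes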
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.12 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:0037