In CI, we're sometimes seeing that /etc/coredns/Corefile gets rendered without forwarders:

```
[root@master-1 ~]# cat /etc/coredns/Corefile
. {
    errors
    health :18080
    mdns ostest.test.metalkube.org 0 ostest
    forward .
    cache 30
    reload
    hosts {
        fd2e:6f44:5dd8:c956::5 api-int.ostest.test.metalkube.org
        fd2e:6f44:5dd8:c956::5 api.ostest.test.metalkube.org
        fallthrough
    }
    template IN A ostest.test.metalkube.org {
        match .*.apps.ostest.test.metalkube.org
        answer "{{ .Name }} 60 in a fd2e:6f44:5dd8:c956::4"
        fallthrough
    }
}
```

And CoreDNS refuses to start:

```
[root@master-1 ~]# sudo crictl logs ca13312378976
plugin/forward: /etc/coredns/Corefile:5 - Error during parsing: Wrong argument count or unexpected line ending after '.'
```

When this happens, the node becomes NotReady. This appears to be the source of many of the 4.5 CI failures for e2e-metal-ipi.
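To illustrate the failure mode, here is a toy sketch (not the actual baremetal-runtimecfg code, and the nameserver address is hypothetical) of how rendering the `forward` plugin line from the node's /etc/resolv.conf can produce the broken Corefile above: if the nameserver list comes back empty, a naive template emits a bare `forward .` line, which CoreDNS rejects with exactly the "Wrong argument count or unexpected line ending after '.'" error.

```python
def render_forward_line(resolv_conf_text: str) -> str:
    """Build a CoreDNS forward line from resolv.conf contents (toy version)."""
    nameservers = [
        line.split()[1]
        for line in resolv_conf_text.splitlines()
        if line.startswith("nameserver") and len(line.split()) > 1
    ]
    # Bug reproduction: no guard for the empty-nameserver case.
    return "forward . " + " ".join(nameservers)

# Normal case: a valid forward line with an upstream resolver.
print(render_forward_line("nameserver 192.168.111.1\n"))
# Empty resolv.conf (the race seen in CI): an invalid bare "forward ." results.
print(render_forward_line(""))
```

A real fix would either skip the `forward` stanza entirely when no nameservers are available or refuse to render the Corefile until resolv.conf is populated.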
https://github.com/openshift/baremetal-runtimecfg/pull/63 did not fix this; it only changed the symptoms. We still need a real solution.
With help from the cri-o folks, what appears to be happening is that when the pod starts, it gets configured with an empty /etc/resolv.conf:

```
[root@master-1 ~]# crictl exec b45947d4928e1 cat /etc/resolv.conf
[root@master-1 ~]#
```

Removing the pod (crictl rmp -f ...) fixes the problem. Restarting just the one container isn't enough; the entire pod has to be removed so that it re-reads the node's DNS configuration when it starts again. There may be a race here related to networking.
Can you get the systemd journal from that host, specifically around when NetworkManager gets DHCP (or is this static IPs?) and when crio starts? I suspect this is something like a bare metal host with multiple interfaces, where we do DHCP on all plugged interfaces and sometimes get a response from one that, e.g., is a provisioning network without upstream nameservers set. From NM's perspective I believe `network-online.target` means "we have a default route". Basically, I think the `After=network-online.target` ordering for crio.service isn't working the way we want. That said, the Kube DNS stack also needs to gracefully respond to changes in /etc/resolv.conf on the host.
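As an illustrative sketch of the ordering being discussed (not a proposed fix from this bug), a systemd drop-in could make crio wait on NetworkManager-wait-online.service, which blocks until NM considers the network up, rather than relying only on `network-online.target`:

```ini
# Hypothetical drop-in: /etc/systemd/system/crio.service.d/10-wait-online.conf
# Assumes NetworkManager is the network manager in use (true on RHCOS).
[Unit]
Wants=NetworkManager-wait-online.service
After=NetworkManager-wait-online.service
```

Note this only papers over the startup ordering; it would not help if resolv.conf changes after crio has already started, which is why the comment above argues the DNS stack should also react to resolv.conf changes at runtime.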
Created attachment 1693045 [details] journal from a master
Looking at those logs, it seems like Ignition is running twice:

```
$ grep -e 'Stage: files' -e Reboot master-1.log
May 28 12:57:43 localhost ignition[1359]: INFO : Stage: files
May 28 12:57:44 localhost ignition[1359]: INFO : Stage: files
-- Reboot --
May 28 13:00:11 localhost ignition[1340]: INFO : Stage: files
May 28 13:00:12 localhost ignition[1340]: INFO : Stage: files
May 28 13:04:37 master-1.ostest.test.metalkube.org systemd[1]: Starting Reboot...
-- Reboot --
```

Moreover, it's not clear why the first reboot happens. We're digging into that now. This may or may not be related to the /etc/resolv.conf issue, but an unexplained reboot is such a fundamental failure that we should get to the bottom of it first.
Current diagnosis is that something in the IPI stack (probably Ironic) is forcibly powering off machines during the first boot. Our stack is currently not very robust around this. We need to change the MCO's firstboot handling to be more idempotent: https://github.com/openshift/machine-config-operator/pull/1762
I cloned the MCO patch to https://bugzilla.redhat.com/show_bug.cgi?id=1841255 Let's keep this bug for whatever in baremetal IPI is forcibly rebooting the node.
Dmitry has a patch up to prevent the forced rebooting: https://review.opendev.org/#/c/731575/

I'll upload the entire serial console of the control plane, but what stood out to me is that IPA reported an error at the very end. It looks like efibootmgr is missing from the IPA image? However, the host did reboot into RHCOS.

```
2020-05-28 13:43:11.525 1242 ERROR root [-] Command execution error: [Errno 2] No such file or directory: 'efibootmgr': 'efibootmgr': FileNotFoundError: [Errno 2] No such file or directory: 'efibootmgr': 'efibootmgr'
2020-05-28 13:43:11.525 1242 ERROR root Traceback (most recent call last):
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 252, in execute_command
2020-05-28 13:43:11.525 1242 ERROR root     result = ext.execute(command_part, **kwargs)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 205, in execute
2020-05-28 13:43:11.525 1242 ERROR root     return cmd(**kwargs)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 321, in wrapper
2020-05-28 13:43:11.525 1242 ERROR root     result = func(self, **command_params)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py", line 520, in install_bootloader
2020-05-28 13:43:11.525 1242 ERROR root     utils.execute('efibootmgr', '--version')
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/utils.py", line 74, in execute
2020-05-28 13:43:11.525 1242 ERROR root     return ironic_utils.execute(*cmd, **kwargs)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_lib/utils.py", line 99, in execute
2020-05-28 13:43:11.525 1242 ERROR root     result = processutils.execute(*cmd, **kwargs)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py", line 391, in execute
2020-05-28 13:43:11.525 1242 ERROR root     env=env_variables)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
2020-05-28 13:43:11.525 1242 ERROR root     restore_signals, start_new_session)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
2020-05-28 13:43:11.525 1242 ERROR root     raise child_exception_type(errno_num, err_msg, err_filename)
2020-05-28 13:43:11.525 1242 ERROR root FileNotFoundError: [Errno 2] No such file or directory: 'efibootmgr': 'efibootmgr'
2020-05-28 13:43:11.525 1242 ERROR root
2020-05-28 13:43:11.568 1242 INFO root [-] Command image.install_bootloader completed: Command name: image.install_bootloader, params: {'root_uuid': '0x00000000', 'efi_system_part_uuid': None, 'prep_boot_part_uuid': None}, status: FAILED, result: None.
fd00:1101::2 - - [28/May/2020 13:43:11] "POST /v1/commands?wait=true HTTP/1.1" 200 402
```
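For context, the probe that fails in the traceback is IPA running `efibootmgr --version` to decide whether it can configure UEFI boot entries; when the binary is absent from the ramdisk, `subprocess` raises the `FileNotFoundError` seen above. A minimal sketch (not IPA's actual code) of the same probe with a `shutil.which()` guard, which would turn the crash into a clear diagnostic:

```python
import shutil
import subprocess

def efibootmgr_available() -> bool:
    """Return True if the efibootmgr binary is on PATH in this image."""
    return shutil.which("efibootmgr") is not None

if efibootmgr_available():
    # Same probe IPA performs before configuring UEFI boot entries.
    subprocess.run(["efibootmgr", "--version"], check=True)
else:
    print("efibootmgr not present in this image; skipping UEFI bootloader setup")
```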
Derek identified the root cause of the nodes being forcibly powered off. From https://storyboard.openstack.org/#!/story/2007735:

> set_boot_device in the sushy-tools libvirt driver has the power_cycle decorator; this decorator causes
> the libvirt domain to be powered off so that the XML can be redefined, and then powered on again.
>
> This was done intentionally to work around a libvirt bug:
> https://review.opendev.org/#/c/696620/3
>
> The problem is that during deploy with the IPA agent, we set the boot device and then try to do a soft
> agent power off. The set boot device triggers a power cycle (which it shouldn't), so the soft agent power off
> (in reboot_and_finish_deploy) can't succeed, as the agent is no longer running. The node boots into its deployed
> image, but some time later Ironic power cycles it, when the soft power off eventually times out.
>
> The end result is an image that gets deployed and booted, and then rebooted a few minutes later.
>
> sushy-tools needs to be fixed to avoid the power cycle, by using another API call that doesn't require the
> node to be off (perhaps updateDeviceFlags?) instead of defineXML.

dev-scripts has moved back to ipmi until sushy-tools is fixed.
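The sequence above can be sketched as a toy simulation (not sushy-tools code; the class and event names are invented for illustration): a power cycle piggybacked on set_boot_device kills the in-memory IPA agent, so the subsequent soft power-off can never be acknowledged and Ironic eventually forces a reboot of the already-deployed image.

```python
class FakeBMC:
    """Toy model of the virtual BMC behavior described in the story."""

    def __init__(self, power_cycle_on_set_boot: bool):
        self.power_cycle_on_set_boot = power_cycle_on_set_boot
        self.agent_running = True
        self.events = []

    def set_boot_device(self, dev: str) -> None:
        self.events.append(f"set_boot_device:{dev}")
        if self.power_cycle_on_set_boot:
            # defineXML needs the domain off, so the decorator power-cycles it,
            # which also kills the IPA agent ramdisk running in memory.
            self.events.append("power_cycle")
            self.agent_running = False

    def soft_power_off(self) -> None:
        if self.agent_running:
            self.events.append("soft_power_off:ok")
        else:
            # Agent gone: the request times out and Ironic later forces a reboot.
            self.events.append("soft_power_off:timeout")
            self.events.append("forced_reboot")

# Buggy behavior: the deployed node gets rebooted minutes after first boot.
buggy = FakeBMC(power_cycle_on_set_boot=True)
buggy.set_boot_device("disk")
buggy.soft_power_off()
print(buggy.events)

# Fixed behavior (e.g. redefining boot order without powering off the domain).
fixed = FakeBMC(power_cycle_on_set_boot=False)
fixed.set_boot_device("disk")
fixed.soft_power_off()
print(fixed.events)
```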
*** Bug 1840301 has been marked as a duplicate of this bug. ***
Derek, the patch to sushy-tools merged. I guess you need an upstream release at this point?
Does baremetal IPI also support being simulated on OpenStack, and if so does the provisioning system have the same bug? See https://bugzilla.redhat.com/show_bug.cgi?id=1842906#c30
(In reply to Julia Kreger from comment #15)
> Derek, the patch to sushy-tools merged. I guess you need an upstream release
> at this point?

Release proposed here: https://review.opendev.org/#/c/734837/

(In reply to Colin Walters from comment #16)
> Does baremetal IPI also support being simulated on OpenStack, and if so does
> the provisioning system have the same bug?
>
> See https://bugzilla.redhat.com/show_bug.cgi?id=1842906#c30

Will reply over there.
The updated package was released upstream about 30 hours ago, which means people should now be pulling it in on new test environment setups. I believe this resolves the issue; since it is limited to testing, no backports should be necessary.
Checked /etc/coredns/Corefile on all masters: forwarders are set. We have done many deployments since the fix, and no problems have been spotted. If the issue returns, please reopen the bug.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409