1840222 – sushy-tools for baremetal IPI development scripts is forcibly rebooting nodes during firstboot

Bug 1840222 - sushy-tools for baremetal IPI development scripts is forcibly rebooting nodes during firstboot

Summary: sushy-tools for baremetal IPI development scripts is forcibly rebooting nodes...

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Bare Metal Hardware Provisioning
Sub Component:
Version:	4.5
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Julia Kreger
QA Contact:	Lubov
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1840301 (view as bug list)
Depends On:
Blocks:	1840301 1841255
TreeView+	depends on / blocked

Reported:	2020-05-26 15:05 UTC by Stephen Benjamin
Modified:	2020-12-15 17:04 UTC (History)
CC List:	11 users (show)
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Clones:	1841255 (view as bug list)
Environment:
Last Closed:	2020-07-13 17:41:38 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
journal from a master (5.28 MB, text/plain) 2020-05-28 13:11 UTC, Stephen Benjamin	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift baremetal-runtimecfg pull 63	None	closed	Bug 1840222: Fail if no upstream DNS servers are found	2020-12-14 15:16:32 UTC
OpenStack gerrit	733152	None	MERGED	Read current domain state from XML on filesystem	2020-12-14 15:16:32 UTC
Red Hat Product Errata	RHBA-2020:2409	None	None	None	2020-07-13 17:41:57 UTC

Description Stephen Benjamin 2020-05-26 15:05:53 UTC

In CI, we're seeing sometimes that /etc/coredns/Coredns gets rendered without forwarders:

[root@master-1 ~]# cat /etc/coredns/Corefile      
. {
    errors
    health :18080
    mdns ostest.test.metalkube.org 0 ostest
    forward .
    cache 30
    reload
    hosts {
        fd2e:6f44:5dd8:c956::5 api-int.ostest.test.metalkube.org
        fd2e:6f44:5dd8:c956::5 api.ostest.test.metalkube.org
        fallthrough
    }
    template IN A ostest.test.metalkube.org {
        match .*.apps.ostest.test.metalkube.org
        answer "{{ .Name }} 60 in a fd2e:6f44:5dd8:c956::4"
        fallthrough
    }
}

And Coredns refuses to start:

[root@master-1 ~]# sudo crictl logs ca13312378976
plugin/forward: /etc/coredns/Corefile:5 - Error during parsing: Wrong argument count or unexpected line ending after '.'


When this happens, the node becomes NotReady. This seems to be the source of many of the 4.5 failures in CI for e2e-metal-ipi.

Comment 3 Ben Nemec 2020-05-27 17:32:04 UTC

https://github.com/openshift/baremetal-runtimecfg/pull/63 did not fix this, it only changed the symptoms. Still need a solution to this.

Comment 4 Stephen Benjamin 2020-05-27 21:06:07 UTC

With help from cri-o folks, what appears to be happening is when the pod starts, it gets configured with an empty /etc/resolv.conf:

[root@master-1 ~]# crictl exec b45947d4928e1 cat /etc/resolv.conf
[root@master-1 ~]# 

Removing the pod (crictl rmp -f ...) fixes the problem. Restarting just the one container isn't enough, the entire pod has to be removed, so it re-reads the node's DNS configuration when it starts again.

There may be some race here related to networking.

Comment 5 Colin Walters 2020-05-28 12:51:19 UTC

Can you get the systemd journal from that host, specifically around when NetworkManager gets DHCP (or is this static IPs?) and the crio start?

I suspect this is something like a bare metal host with multiple interfaces, and we do DHCP on all plugged interfaces, and sometimes get a response from one that e.g. is a provisioning network that might not have upstream nameservers set.

From NM's perspective I believe `network-online.target` is "we have a default route".

Basically I think this is something in the After=network-online.target for crio.service isn't working as we want.

But that said I think the Kube DNS stack needs to gracefully respond to changes in /etc/resolv.conf on the host.

Comment 6 Stephen Benjamin 2020-05-28 13:11:59 UTC

Created attachment 1693045 [details]
journal from a master

Comment 7 Jonathan Lebon 2020-05-28 14:55:29 UTC

Looking at those logs, it seems like Ignition is running twice:

$ grep -e 'Stage: files' -e Reboot master-1.log
May 28 12:57:43 localhost ignition[1359]: INFO     : Stage: files
May 28 12:57:44 localhost ignition[1359]: INFO     : Stage: files
-- Reboot --
May 28 13:00:11 localhost ignition[1340]: INFO     : Stage: files
May 28 13:00:12 localhost ignition[1340]: INFO     : Stage: files
May 28 13:04:37 master-1.ostest.test.metalkube.org systemd[1]: Starting Reboot...
-- Reboot --

Moreover, it's not clear why the first reboot happens. We're digging into that now.

This may or may not be related to the `/etc/resolv.conf` issue. Though the unexplained reboot is such a fundamental thing to go wrong, we should get to the bottom of it first before.

Comment 9 Colin Walters 2020-05-28 17:28:30 UTC

Current diagnosis is that something in the IPI stack (probably Ironic) is forcibly powering off machines during the first boot.

Our stack is currently not very robust around this.  We need to change the MCO's firstboot handling to be more idempotent:
https://github.com/openshift/machine-config-operator/pull/1762

Comment 10 Colin Walters 2020-05-28 17:51:09 UTC

I cloned the MCO patch to https://bugzilla.redhat.com/show_bug.cgi?id=1841255

Let's keep this bug for whatever in baremetal IPI is forcibly rebooting the node.

Comment 11 Stephen Benjamin 2020-05-28 17:58:11 UTC

Dmitry has a patch up to prevent the forced rebooting: https://review.opendev.org/#/c/731575/


I'll upload the entire serial console of the control plane, but what stood out to me is IPA at the very end reported an error. It looks like efibootmgr is missing from the IPA image? However, the host did reboot into RHCOS.


```
2020-05-28 13:43:11.525 1242 ERROR root [-] Command execution error: [Errno 2] No such file or directory: 'efibootmgr': 'efibootmgr': FileNotFoundError: [Errno 2] No such file or directory: 'efibootmgr': 'efibootmgr'
2020-05-28 13:43:11.525 1242 ERROR root Traceback (most recent call last):
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 252, in execute_command
2020-05-28 13:43:11.525 1242 ERROR root     result = ext.execute(command_part, **kwargs)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 205, in execute
2020-05-28 13:43:11.525 1242 ERROR root     return cmd(**kwargs)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/base.py", line 321, in wrapper
2020-05-28 13:43:11.525 1242 ERROR root     result = func(self, **command_params)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/extensions/image.py", line 520, in install_bootloader
2020-05-28 13:43:11.525 1242 ERROR root     utils.execute('efibootmgr', '--version')
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_python_agent/utils.py", line 74, in execute
2020-05-28 13:43:11.525 1242 ERROR root     return ironic_utils.execute(*cmd, **kwargs)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/ironic_lib/utils.py", line 99, in execute
2020-05-28 13:43:11.525 1242 ERROR root     result = processutils.execute(*cmd, **kwargs)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib/python3.6/site-packages/oslo_concurrency/processutils.py", line 391, in execute
2020-05-28 13:43:11.525 1242 ERROR root     env=env_variables)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib64/python3.6/subprocess.py", line 729, in __init__
2020-05-28 13:43:11.525 1242 ERROR root     restore_signals, start_new_session)
2020-05-28 13:43:11.525 1242 ERROR root   File "/usr/lib64/python3.6/subprocess.py", line 1364, in _execute_child
2020-05-28 13:43:11.525 1242 ERROR root     raise child_exception_type(errno_num, err_msg, err_filename)
2020-05-28 13:43:11.525 1242 ERROR root FileNotFoundError: [Errno 2] No such file or directory: 'efibootmgr': 'efibootmgr'
2020-05-28 13:43:11.525 1242 ERROR root 
2020-05-28 13:43:11.568 1242 INFO root [-] Command image.install_bootloader completed: Command name: image.install_bootloader, params: {'root_uuid': '0x00000000', 'efi_system_part_uuid': None, 'prep_boot_part_uuid': None}, status: FAILED, result: None.
fd00:1101::2 - - [28/May/2020 13:43:11] "POST /v1/commands?wait=true HTTP/1.1" 200 402
```

Comment 13 Stephen Benjamin 2020-05-29 12:39:58 UTC

Derek identified the root cause of the nodes being forcibly powered off. From https://storyboard.openstack.org/#!/story/2007735:


> set_boot_device in the sushy-tools libvirt driver has the power_cycle decorator, this decorator causes
> the libvirt domain to be powered off so that the xml can be redefined and then powered on again.
>
> this was done intentionally to work around a libvirt bug
> https://review.opendev.org/#/c/696620/3
>
> the problem is that during deploy with the IPA agent, we set the boot device then try to do a soft
> agent power off

> The set boot device triggers a power cycle (the power cycle shouldn't be doing) then the soft agent power off
> (in reboot_and_finish_deploy) can't succeed as the agent is no longer running the node boots into its deployed
> image but some time later ironic power cycles it, when the soft power off eventually times out.
>
> The end result is image that gets deployed and booted and then rebooted a few minutes later
>
> sushy-tools needs to be fixed to avoid the power cycle, by using another API call that doesn't require the
> node to be off (perhaps updateDeviceFlags?) instead of defineXML


dev-scripts has moved back to ipmi until sushy-tools is fixed.

Comment 14 Beth White 2020-06-02 16:12:26 UTC

*** Bug 1840301 has been marked as a duplicate of this bug. ***

Comment 15 Julia Kreger 2020-06-08 16:59:11 UTC

Derek, the patch to sushy-tools merged. I guess you need an upstream release at this point?

Comment 16 Colin Walters 2020-06-10 12:35:25 UTC

Does baremetal IPI also support being simulated on OpenStack, and if so does the provisioning system have the same bug?

See https://bugzilla.redhat.com/show_bug.cgi?id=1842906#c30

Comment 17 Derek Higgins 2020-06-10 13:31:02 UTC

(In reply to Julia Kreger from comment #15)
> Derek, the patch to sushy-tools merged. I guess you need an upstream release
> at this point?

release proposed here
https://review.opendev.org/#/c/734837/


(In reply to Colin Walters from comment #16)
> Does baremetal IPI also support being simulated on OpenStack, and if so does
> the provisioning system have the same bug?
> 
> See https://bugzilla.redhat.com/show_bug.cgi?id=1842906#c30

will reply over there

Comment 18 Julia Kreger 2020-06-16 15:35:29 UTC

The updated package was released upstream about 30 hours ago, which means people should now be pulling it in on new test environment setups. I believe this means the issue is resolved as it is limited to testing and no backports should be necessary.

Comment 20 Lubov 2020-06-25 08:20:49 UTC

Checked /etc/coredns/Corefile on all masters: forwarders are set
We made a lot of deployed since the fix, no problem spotted
If it returns, please, open the bug again

Comment 22 errata-xmlrpc 2020-07-13 17:41:38 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

Comment 23 Bob Fournier 2020-12-15 17:04:52 UTC

*** Bug 1840301 has been marked as a duplicate of this bug. ***

Note You need to log in before you can comment on or make changes to this bug.