Bug 1998662 - Network Manager Wait Online unit fails on compute node boots for OSP16.1
Summary: Network Manager Wait Online unit fails on compute node boots for OSP16.1
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: os-net-config
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z9
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Vijayalakshmi Candappa
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-08-27 21:43 UTC by Mark Jones
Modified: 2024-12-20 20:51 UTC (History)

Fixed In Version: os-net-config-11.3.2-1.20211214113529.el8ost
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-22 11:51:32 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
OpenStack gerrit | 819757 | 0 | None | MERGED | Add ifcfg-* scripts on boot for Mellanox NIC interface | 2021-12-07 05:11:18 UTC
Red Hat Issue Tracker | NFV-2302 | 0 | None | None | None | 2021-10-12 12:15:59 UTC
Red Hat Issue Tracker | OSP-7691 | 0 | None | None | None | 2021-11-15 12:55:55 UTC

Comment 3 Thomas Haller 2021-09-01 21:51:51 UTC
The way to debug most NetworkManager issues (this one included) is by enabling `level=TRACE` logging and checking the logfile.

Please see also https://cgit.freedesktop.org/NetworkManager/NetworkManager/tree/contrib/fedora/rpm/NetworkManager.conf#n28 with hints for logging. Note the comment about rate-limiting of journal and how to avoid that.


I didn't see a debug log attached to the case. Is it possible to provide debug logs of a boot that show the issue?
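
As a rough sketch of the logging hint above (not an official procedure): the `[logging]` keys below are the ones shown in the linked NetworkManager.conf example, while the function name, the drop-in file name `95-trace-logging.conf`, and the optional directory argument are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: enable TRACE logging through a conf.d drop-in.
# write_trace_logging_conf and 95-trace-logging.conf are hypothetical names.
# $1 = conf.d directory (defaults to /etc/NetworkManager/conf.d).
write_trace_logging_conf() {
    conf_dir="${1:-/etc/NetworkManager/conf.d}"
    cat > "$conf_dir/95-trace-logging.conf" <<'EOF'
[logging]
level=TRACE
domains=ALL
EOF
}

# Example (as root): write_trace_logging_conf
# The drop-in is read when NetworkManager starts, so for a boot-time issue
# it should be in place before the boot you want to capture.
# journald may rate-limit the verbose output; see RateLimitIntervalSec and
# RateLimitBurst in `man journald.conf` for how to turn that off.
```

Remove the drop-in again once the logs are collected, because TRACE output is very verbose.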

Comment 8 Thomas Haller 2021-09-02 18:05:27 UTC
(In reply to Thomas Haller from comment #6)
> Is the file the same as
> https://access.redhat.com/support/cases/#/case/03007888/discussion?attachmentId=a092K00002kzaGUQAY 


That file is larger (containing earlier boots). I assume the relevant boot starts at:

  [1630598199.3124] NetworkManager (version 1.22.8-7.el8_2) is starting... (for the first time) 

If we grep the log:

  zcat NetworkManager-computedpdk2.log.gz | sed '/1630598199.3124.*is starting/,/1630598258.1479.*is starting/ !d' | grep 'is starting\|<\(error\|warn\)>\|startup complete\|device.*ens3f[10].*state change:\|SIGTERM'

The "startup complete" messages are relevant, because NetworkManager-wait-online.service waits until the point when NetworkManager logs "manager: startup complete"

We see that the service is stopped with SIGTERM at 1630598258.0703. Usually, the user shouldn't stop NetworkManager, dunno why/who is doing that. In any case, anything after that point is irrelevant for this issue.

Before that, we see lines like

   manager: startup complete is waiting for device 'ens3f0' (activation-1)

meaning, NetworkManager is waiting for ens3f0 interface to activate (or fail for good).
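
To list just the device names from such lines, something like the following might help. It is a minimal sketch: the helper name `nm_waiting_devices` is made up here, and it assumes the message format matches the line quoted above.

```shell
#!/bin/sh
# Sketch only: read NetworkManager log lines on stdin and print each device
# that "startup complete" was reported to be waiting for, one per line.
# nm_waiting_devices is a hypothetical helper name.
nm_waiting_devices() {
    sed -n "s/.*startup complete is waiting for device '\([^']*\)'.*/\1/p" | sort -u
}

# Example:
#   zcat NetworkManager-computedpdk2.log.gz | nm_waiting_devices
```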

We also see that DHCPv4 times out at that interface.



As to why DHCPv4 fails on ens3f0 and ens3f1, that is unclear. The most likely issue is a configuration error. Is the network OK? Is the DHCP server running? If there is no DHCP server available, don't configure "ipv4.method=auto" (on profile 'Wired connection 1' (68cde1fc-2237-32c4-805d-eb8db9050a3b)). If you don't intend this profile to activate during boot, don't configure "connection.autoconnect=yes".

In this case, "Wired connection 1" was actually automatically generated by NetworkManager. That is, this was not configured by the user. You might wish to disable this automatism by configuring "[main].no-auto-default=*" in NetworkManager.conf (see `man NetworkManager.conf`). You could also install the NetworkManager-config-server RPM; that package will do that for you. Alternatively, you may simply delete the unwanted profile with `nmcli connection delete "Wired connection 1"`. NetworkManager would remember that and not create it on the next boot. Alternatively, configuring a suitable profile for the interface will also prevent the automatic creation of these profiles.


You can see the profiles with `nmcli connection`. If there are any unwanted/wrongly configured ones, fix that configuration. In particular, if the profile is set to "connection.autoconnect=yes", then ensure that it can properly activate. If there are these autogenerated "Wired connection 1" profiles, adjust them to your liking (or prevent their creation as explained).



Does this help? Anything unclear?

Comment 10 Thomas Haller 2021-09-09 17:55:50 UTC
sorry for the late reply.


----


So there are several options.


1) A sensible one is to mark the devices as "unmanaged". That is, so that `nmcli device` shows them in state "unmanaged". There are several ways to do that, ...

1a) the best way in this case might be a file

  /etc/NetworkManager/conf.d/90-osp-unmanage-devices.conf

with

  [device-90-osp-unmanage-devices]
  match-device=interface-name:ens*
  managed=0

See `man NetworkManager.conf`


1b) an alternative is udev rules that set `ENV{NM_UNMANAGED}="1"`. See `/usr/lib/udev/rules.d/85-nm-unmanaged.rules` for examples.


Note that you can still do `nmcli device set $IFNAME managed yes|no` to override that, but this only works at runtime and gets forgotten after reboot. So you want 1a) or 1b).


1c) NM_CONTROLLED=no in a suitable ifcfg file works too. The disadvantage compared to 1a) and 1b) is that then `nmcli device set $IFNAME managed yes` won't work, the choice cannot be overruled at runtime (short of removing the ifcfg file).



2) Another alternative is to not let NetworkManager generate these unsuitable profiles. There are several ways to do that...

2a) configure a file /etc/NetworkManager/conf.d/90-osp-no-auto-default.conf that sets

  [main]
  no-auto-default=interface-name:ens*

  or even

  [main]
  no-auto-default=*

(see `man NetworkManager.conf`)

2b) install `NetworkManager-config-server.rpm`. This provides you a file `/usr/lib/NetworkManager/conf.d/00-server.conf` (similar to 2a)

2c) if your system is already up, and the profiles were generated, then you can just delete them (`nmcli connection delete "Wired connection 1"`). That choice is remembered after reboot (in file /var/lib/NetworkManager/no-auto-default.state).

2d) just configure *any* suitable profile for the device. That prevents NM from autogenerating the "Wired connection 1" and autoactivating an unsuitable profile.

2e) the "Wired connection 1" profile is generated in /run (non-persisted storage). If you modify it (`nmcli connection modify "Wired connection 1" autoconnect no`), then it gets persisted to /etc and the effect is like 2d).
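
For illustration, 2a) could be deployed along these lines. This is only a sketch: the function name and its two optional arguments are assumptions, while the file name and the `[main]` / `no-auto-default` setting are the ones given in 2a).

```shell
#!/bin/sh
# Sketch only: write the 2a) drop-in.
# write_no_auto_default_conf is a hypothetical helper name.
# $1 = conf.d directory (defaults to /etc/NetworkManager/conf.d)
# $2 = device spec, e.g. 'interface-name:ens*' (defaults to '*')
write_no_auto_default_conf() {
    conf_dir="${1:-/etc/NetworkManager/conf.d}"
    spec="${2:-*}"
    printf '[main]\nno-auto-default=%s\n' "$spec" \
        > "$conf_dir/90-osp-no-auto-default.conf"
}

# Example (as root), limited to the ens* interfaces:
#   write_no_auto_default_conf /etc/NetworkManager/conf.d 'interface-name:ens*'
```

Quote the device spec when passing it, otherwise the shell may expand the `*` before the function sees it.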


----


> Would creating ifcfg-ens3f0 / ifcfg-ens3f1 configuration scripts with NM_CONTROLLED=no be the right way to exclude these interfaces such that the startup would complete sooner?

Yes, that's 1c).


----

If you can configure the system before NetworkManager starts, then 1a) works well. If NM is already running, then merely deploying a config file is a bit tricky, because you'd need to reload it -- but don't restart NetworkManager (restarting NetworkManager is not a good idea). So, I assume you are in a position to do this before NM starts the first time, or you are fine with deploying the file and letting it take effect on the next reboot. If not, deploy the files, call `systemctl reload NetworkManager` AND `nmcli device set $IFNAME managed no`...
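
Put together, the steps for 1a) on a running system might look roughly like this. It is a sketch, not a tested deployment script: the two function names and the optional directory argument are made up for illustration, and the file content and the two commands are taken from the text above.

```shell
#!/bin/sh
# Sketch only: deploy the 1a) drop-in and apply it without restarting NM.
# write_unmanage_conf and apply_unmanaged_now are hypothetical helper names.

# Write the drop-in from 1a).
# $1 = conf.d directory (defaults to /etc/NetworkManager/conf.d).
write_unmanage_conf() {
    conf_dir="${1:-/etc/NetworkManager/conf.d}"
    cat > "$conf_dir/90-osp-unmanage-devices.conf" <<'EOF'
[device-90-osp-unmanage-devices]
match-device=interface-name:ens*
managed=0
EOF
}

# Reload the configuration and mark the given interfaces unmanaged at
# runtime. $@ = interface names. NetworkManager is reloaded, not restarted.
apply_unmanaged_now() {
    systemctl reload NetworkManager || return 1
    for ifname in "$@"; do
        nmcli device set "$ifname" managed no || return 1
    done
}

# Example (as root):
#   write_unmanage_conf && apply_unmanaged_now ens3f0 ens3f1
```

If the file can be put in place before NetworkManager starts for the first time, only the first function is needed.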

----

Sorry for the long text. But apparently the shorter form was insufficient :)

TL;DR: I'd suggest 1a).

Comment 17 Thomas Haller 2021-10-12 11:40:04 UTC
In my mind, this is a configuration error. Can we close this?

Do you think there is anything left for NetworkManager to do?

Comment 18 Saravanan KR 2021-10-12 12:14:58 UTC
Moving this to OSP's component os-net-config

Comment 24 OSP Team 2022-03-25 10:34:45 UTC
According to our records, this should be resolved by os-net-config-11.3.2-1.20211214143349.f49ab16.el8ost.  This build is available now.

Comment 26 Dan Sneddon 2022-09-22 11:51:32 UTC
I am closing this bug as I believe all current versions of os-net-config have the fix included. If this is affecting a current release of OpenStack Platform then the RPM version of os-net-config included in that version should be bumped to include a fixed version, or a newer RPM can be used as a workaround/hotfix. Please reopen this bug if that is not possible or does not reflect the current state.

