Bug 1899187
| Summary: | [OpenStack] node-valid-hostname.service fails during the first boot, leading to a 5 minute provisioning delay | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Robert Heinzmann <rheinzma> |
| Component: | Machine Config Operator | Assignee: | Mike Fedosin <mfedosin> |
| Status: | CLOSED ERRATA | QA Contact: | Michael Nguyen <mnguyen> |
| Severity: | low | Priority: | low |
| Version: | 4.6 | Target Release: | 4.8.0 |
| CC: | jerzhang, kgarriso, m.andre, mbooth, mfedosin, pprinett | Keywords: | Triaged |
| Hardware: | Unspecified | OS: | Unspecified |
| Doc Type: | Bug Fix | Type: | Bug |
| Last Closed: | 2021-07-27 22:34:24 UTC | | |

Doc Text:
- Cause: a race condition between setting the hostname and bringing up networking.
- Consequence: if the network started before the hostname was set, the node could not join the cluster and waited 5 minutes before trying again.
- Fix: explicitly set the hostname before the network comes up.
- Result: the node joins the cluster on the first attempt.
Description

Robert Heinzmann, 2020-11-18 17:00:13 UTC

Created attachment 1730612 [details]: Console Log Initial Boot (boot 0)
Created attachment 1730613 [details]: Journal Log Initial Boot (boot -1)
Created attachment 1730614 [details]: Journal Log Initial Boot (boot 0)
The reason for the timeout appears to be a problem on OpenStack with OVN: OVN does not hand out the "host-name" DHCP option [1], which can also be seen in the logs:
OSP16.1 with OVN (no dhcp option for hostname):
~~~
sh-4.4# journalctl -b -1 | grep dhcp4 | grep -e "option host_name" -e "option ip_address"
Nov 23 18:08:22 localhost NetworkManager[745]: <info> [1606154902.7909] dhcp4 (ens3): option ip_address => '192.168.150.167'
Nov 23 18:08:35 localhost NetworkManager[1651]: <info> [1606154915.1373] dhcp4 (ens3): option ip_address => '192.168.150.167'
sh-4.4# journalctl -b -1 | awk '{print $4}' | uniq -c
1 at
1776 localhost
362 ocp-99l7h-worker-0-g7hxm
~~~
QuickLab 4.5 (possibly not OVN):
~~~
sh-4.4# journalctl -b -1 | grep dhcp4 | grep -e "option host_name" -e "option ip_address"
Nov 22 06:50:00 worker-0.sharedocp4upi45.xxx.xxxx.xxx.xxx NetworkManager[1635]: <info> [1606027800.0301] dhcp4 (ens3): option host_name => 'host-10-0-91-158'
Nov 22 06:50:00 worker-0.sharedocp4upi45.xxx.xxxx.xxx.xxx NetworkManager[1635]: <info> [1606027800.0301] dhcp4 (ens3): option ip_address => '10.0.91.158'
sh-4.4# journalctl -b -1 | awk '{print $4}' | uniq -c
1 at
607 localhost
1253 host-10-0-91-158.openstacklocal ## <--- DHCP Host Name
777 worker-0.sharedocp4upi45.xxx.xxx.xxx.xxx
~~~
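For reference, the two manual checks above can be folded into one helper. This is an illustrative sketch, not part of the original report: the function name is hypothetical, and `journal.txt` stands in for a saved `journalctl -b -1` dump.

```shell
#!/bin/sh
# Sketch: given a saved journal dump (journalctl -b -1 > journal.txt),
# report whether the DHCP server supplied the host_name option, and
# count how often each hostname appears in field 4 of the log.
check_dhcp_hostname() {
    if grep -q 'option host_name' "$1"; then
        echo "host_name option present"
    else
        echo "host_name option missing"
    fi
    # Field 4 of a journal line is the hostname in effect at that moment;
    # skip "-- Logs begin at ..." marker lines, then aggregate counts.
    awk '!/^--/ { print $4 }' "$1" | sort | uniq -c
}
```

On an OVN-backed node the first line reads "host_name option missing" and the counts show a long run of `localhost` entries before the real node name appears.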
I then created the following MachineConfig, which uses nmcli to set the hostname after "afterburn-hostname.service" has run:
~~~
[stack@osp16 ocp-test1]$ cat 99-worker-fix-hostname.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-fix-hostname
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    systemd:
      units:
      - contents: |
          [Unit]
          Description=Set Hostname via /etc/hostname
          Before=node-valid-hostname.service
          After=afterburn-hostname.service

          [Service]
          ExecStart=/bin/sh -c "nmcli general hostname $(cat /etc/hostname)"
          Type=oneshot

          [Install]
          WantedBy=network-online.target
        name: afterburn-hostname-set.service
        enabled: true
  fips: false
  kernelArguments: null
  osImageURL: ""
~~~
With this in place the timeout no longer occurs, and boot time drops from 13 minutes to 7, avoiding installer timeouts.
---
[1] https://docs.openstack.org/neutron/latest/ovn/dhcp_opts.html
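To put a number on the delay, one can compare the timestamp of the first DHCP lease with the first journal line logged under a real (non-localhost) hostname. A rough sketch, not part of the report; it assumes GNU date and a saved journal dump, and the function name is hypothetical.

```shell
#!/bin/sh
# Sketch: seconds between the first DHCP lease and the first journal
# line carrying a real hostname. Assumes GNU date and a saved dump
# (journalctl -b -1 > journal.txt); syslog-style timestamps in fields 1-3.
delay_seconds() {
    first=$(awk '/option ip_address/ { print $1, $2, $3; exit }' "$1")
    named=$(awk '!/^--/ && $4 != "localhost" { print $1, $2, $3; exit }' "$1")
    echo $(( $(date -d "$named" +%s) - $(date -d "$first" +%s) ))
}
```

Run against the logs above, a result around 300 seconds would match the 5 minute retry delay described in this bug.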
Verified on 4.8.0-0.nightly-2021-04-26-151924. Scaled up on IPI OpenStack.

~~~
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-26-151924   True        False         109m    Cluster version is 4.8.0-0.nightly-2021-04-26-151924

$ oc get nodes
NAME                                  STATUS   ROLES    AGE    VERSION
mnguyen0427-10-j4gjr-master-0         Ready    master   124m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-1         Ready    master   123m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-2         Ready    master   116m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-26s28   Ready    worker   106m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-cwfjh   Ready    worker   105m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-d8kxh   Ready    worker   93m    v1.21.0-rc.0+6143dea

$ oc debug node/mnguyen0427-10-j4gjr-worker-0-26s28
Starting pod/mnguyen0427-10-j4gjr-worker-0-26s28-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -u afterburn-hostname
-- Logs begin at Tue 2021-04-27 14:45:24 UTC, end at Tue 2021-04-27 16:44:48 UTC. --
Apr 27 14:50:47 localhost systemd[1]: Starting Afterburn Hostname...
Apr 27 14:50:47 localhost afterburn[1684]: Apr 27 14:50:47.365 INFO Fetching http://169.254.169.254/latest/meta-data/hostname: Attempt >
Apr 27 14:50:49 localhost afterburn[1684]: Apr 27 14:50:49.026 INFO Fetch successful
Apr 27 14:50:49 localhost systemd[1]: afterburn-hostname.service: Succeeded.
Apr 27 14:50:49 localhost systemd[1]: Started Afterburn Hostname.
Apr 27 14:50:49 localhost systemd[1]: afterburn-hostname.service: Consumed 23ms CPU time
-- Reboot --
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 systemd[1]: Starting Afterburn Hostname...
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 afterburn[1268]: Apr 27 14:56:49.064 INFO Fetching http://169.254.169.254/latest/me>
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 afterburn[1268]: Apr 27 14:56:49.781 INFO Fetch successful
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 systemd[1]: afterburn-hostname.service: Succeeded.
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 systemd[1]: Started Afterburn Hostname.
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 systemd[1]: afterburn-hostname.service: Consumed 19ms CPU time

sh-4.4# systemctl cat afterburn-hostname
# /etc/systemd/system/afterburn-hostname.service
[Unit]
Description=Afterburn Hostname
# Block services relying on Networking being up.
Before=network-online.target
# Wait for NetworkManager to report its online
After=NetworkManager-wait-online.service
# Run before hostname checks
Before=node-valid-hostname.service

[Service]
ExecStart=/usr/bin/afterburn --provider openstack-metadata --hostname=/etc/hostname
Type=oneshot

[Install]
WantedBy=network-online.target
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...

$ oc -n openshift-machine-api get machinesets
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
mnguyen0427-10-j4gjr-worker-0   3         3         3       3           129m

$ oc -n openshift-machine-api scale --replicas=4 machinesets/mnguyen0427-10-j4gjr-worker-0
machineset.machine.openshift.io/mnguyen0427-10-j4gjr-worker-0 scaled

$ oc -n openshift-machine-api get machinesets
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
mnguyen0427-10-j4gjr-worker-0   4         4         3       3           129m

$ oc -n openshift-machine-api get machines
NAME                                  PHASE          TYPE        REGION      ZONE   AGE
mnguyen0427-10-j4gjr-master-0         Running        m1.xlarge   regionOne   nova   131m
mnguyen0427-10-j4gjr-master-1         Running        m1.xlarge   regionOne   nova   131m
mnguyen0427-10-j4gjr-master-2         Running        m1.xlarge   regionOne   nova   131m
mnguyen0427-10-j4gjr-worker-0-26s28   Running        m1.large    regionOne   nova   129m
mnguyen0427-10-j4gjr-worker-0-9nntv   Provisioning                                  2m13s
mnguyen0427-10-j4gjr-worker-0-cwfjh   Running        m1.large    regionOne   nova   129m
mnguyen0427-10-j4gjr-worker-0-d8kxh   Running        m1.large    regionOne   nova   129m

$ oc -n openshift-machine-api get machines
NAME                                  PHASE          TYPE        REGION      ZONE   AGE
mnguyen0427-10-j4gjr-master-0         Running        m1.xlarge   regionOne   nova   135m
mnguyen0427-10-j4gjr-master-1         Running        m1.xlarge   regionOne   nova   135m
mnguyen0427-10-j4gjr-master-2         Running        m1.xlarge   regionOne   nova   135m
mnguyen0427-10-j4gjr-worker-0-26s28   Running        m1.large    regionOne   nova   133m
mnguyen0427-10-j4gjr-worker-0-9nntv   Provisioned    m1.large    regionOne   nova   6m4s
mnguyen0427-10-j4gjr-worker-0-cwfjh   Running        m1.large    regionOne   nova   133m
mnguyen0427-10-j4gjr-worker-0-d8kxh   Running        m1.large    regionOne   nova   133m

$ oc -n openshift-machine-api get machines
NAME                                  PHASE          TYPE        REGION      ZONE   AGE
mnguyen0427-10-j4gjr-master-0         Running        m1.xlarge   regionOne   nova   140m
mnguyen0427-10-j4gjr-master-1         Running        m1.xlarge   regionOne   nova   140m
mnguyen0427-10-j4gjr-master-2         Running        m1.xlarge   regionOne   nova   140m
mnguyen0427-10-j4gjr-worker-0-26s28   Running        m1.large    regionOne   nova   138m
mnguyen0427-10-j4gjr-worker-0-9nntv   Running        m1.large    regionOne   nova   11m
mnguyen0427-10-j4gjr-worker-0-cwfjh   Running        m1.large    regionOne   nova   138m
mnguyen0427-10-j4gjr-worker-0-d8kxh   Running        m1.large    regionOne   nova   138m

$ oc get nodes
NAME                                  STATUS     ROLES    AGE    VERSION
mnguyen0427-10-j4gjr-master-0         Ready      master   140m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-1         Ready      master   139m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-2         Ready      master   132m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-26s28   Ready      worker   122m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-9nntv   NotReady   worker   47s    v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-cwfjh   Ready      worker   121m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-d8kxh   Ready      worker   109m   v1.21.0-rc.0+6143dea

$ oc get nodes
NAME                                  STATUS   ROLES    AGE    VERSION
mnguyen0427-10-j4gjr-master-0         Ready    master   141m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-1         Ready    master   139m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-2         Ready    master   132m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-26s28   Ready    worker   123m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-9nntv   Ready    worker   79s    v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-cwfjh   Ready    worker   122m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-d8kxh   Ready    worker   109m   v1.21.0-rc.0+6143dea
~~~

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.