Bug 1899187
Summary: [OpenStack] node-valid-hostname.service fails during the first boot, leading to a 5 minute provisioning delay

Product: OpenShift Container Platform
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: low
Priority: low
Target Release: 4.8.0
Reporter: Robert Heinzmann <rheinzma>
Assignee: Mike Fedosin <mfedosin>
QA Contact: Michael Nguyen <mnguyen>
CC: jerzhang, kgarriso, m.andre, mbooth, mfedosin, pprinett
Keywords: Triaged
Type: Bug
Last Closed: 2021-07-27 22:34:24 UTC
Doc Type: Bug Fix
Doc Text:
    Cause: A race condition between setting the hostname and bringing up networking.
    Consequence: If the network came up before the hostname was set, the node could not join the cluster and waited 5 minutes before retrying.
    Fix: Explicitly set the hostname before the network comes up.
    Result: The node joins the cluster on the first attempt.
Description

Robert Heinzmann 2020-11-18 17:00:13 UTC

Created attachment 1730612 [details]: Console Log Initial Boot (boot 0)
Created attachment 1730613 [details]: Journal Log Initial Boot (boot -1)
Created attachment 1730614 [details]: Journal Log Initial Boot (boot 0)
The reason for the timeout appears to be a problem on OpenStack with OVN: OVN does not hand out the "host-name" DHCP option [1], which can also be seen in the logs.

OSP 16.1 with OVN (no DHCP option for the hostname):

~~~
sh-4.4# journalctl -b -1 | grep dhcp4 | grep -e "option host_name" -e "option ip_address"
Nov 23 18:08:22 localhost NetworkManager[745]: <info>  [1606154902.7909] dhcp4 (ens3): option ip_address => '192.168.150.167'
Nov 23 18:08:35 localhost NetworkManager[1651]: <info>  [1606154915.1373] dhcp4 (ens3): option ip_address => '192.168.150.167'
sh-4.4# journalctl -b -1 | awk '{print $4}' | uniq -c
      1 at
   1776 localhost
    362 ocp-99l7h-worker-0-g7hxm
~~~

QuickLab 4.5 (possibly not OVN):

~~~
sh-4.4# journalctl -b -1 | grep dhcp4 | grep -e "option host_name" -e "option ip_address"
Nov 22 06:50:00 worker-0.sharedocp4upi45.xxx.xxxx.xxx.xxx NetworkManager[1635]: <info>  [1606027800.0301] dhcp4 (ens3): option host_name => 'host-10-0-91-158'
Nov 22 06:50:00 worker-0.sharedocp4upi45.xxx.xxxx.xxx.xxx NetworkManager[1635]: <info>  [1606027800.0301] dhcp4 (ens3): option ip_address => '10.0.91.158'
sh-4.4# journalctl -b -1 | awk '{print $4}' | uniq -c
      1 at
    607 localhost
   1253 host-10-0-91-158.openstacklocal   ## <--- DHCP Host Name
    777 worker-0.sharedocp4upi45.xxx.xxx.xxx.xxx
~~~

I then created the following MachineConfig, which uses nmcli to set the hostname after "afterburn-hostname.service" has run:

~~~
[stack@osp16 ocp-test1]$ cat 99-worker-fix-hostname.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-fix-hostname
spec:
  config:
    ignition:
      config: {}
      security:
        tls: {}
      timeouts: {}
      version: 2.2.0
    networkd: {}
    passwd: {}
    systemd:
      units:
      - contents: |
          [Unit]
          Description=Set Hostname via /etc/hostname
          Before=node-valid-hostname.service
          After=afterburn-hostname.service

          [Service]
          ExecStart=/bin/sh -c "nmcli general hostname $(cat /etc/hostname)"
          Type=oneshot

          [Install]
          WantedBy=network-online.target
        name: afterburn-hostname-set.service
        enabled: true
  fips: false
  kernelArguments: null
  osImageURL: ""
~~~

With this in place there is no more timeout, and boot time drops from 13 minutes to 7, avoiding installer timeouts.

[1] https://docs.openstack.org/neutron/latest/ovn/dhcp_opts.html

Verified on 4.8.0-0.nightly-2021-04-26-151924. Scaled up on IPI OpenStack.

~~~
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-26-151924   True        False         109m    Cluster version is 4.8.0-0.nightly-2021-04-26-151924
$ oc get nodes
NAME                                  STATUS   ROLES    AGE    VERSION
mnguyen0427-10-j4gjr-master-0         Ready    master   124m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-1         Ready    master   123m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-2         Ready    master   116m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-26s28   Ready    worker   106m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-cwfjh   Ready    worker   105m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-d8kxh   Ready    worker   93m    v1.21.0-rc.0+6143dea
$ oc debug node/mnguyen0427-10-j4gjr-worker-0-26s28
Starting pod/mnguyen0427-10-j4gjr-worker-0-26s28-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# journalctl -u afterburn-hostname
-- Logs begin at Tue 2021-04-27 14:45:24 UTC, end at Tue 2021-04-27 16:44:48 UTC. --
Apr 27 14:50:47 localhost systemd[1]: Starting Afterburn Hostname...
Apr 27 14:50:47 localhost afterburn[1684]: Apr 27 14:50:47.365 INFO Fetching http://169.254.169.254/latest/meta-data/hostname: Attempt >
Apr 27 14:50:49 localhost afterburn[1684]: Apr 27 14:50:49.026 INFO Fetch successful
Apr 27 14:50:49 localhost systemd[1]: afterburn-hostname.service: Succeeded.
Apr 27 14:50:49 localhost systemd[1]: Started Afterburn Hostname.
~~~
~~~
Apr 27 14:50:49 localhost systemd[1]: afterburn-hostname.service: Consumed 23ms CPU time
-- Reboot --
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 systemd[1]: Starting Afterburn Hostname...
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 afterburn[1268]: Apr 27 14:56:49.064 INFO Fetching http://169.254.169.254/latest/me>
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 afterburn[1268]: Apr 27 14:56:49.781 INFO Fetch successful
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 systemd[1]: afterburn-hostname.service: Succeeded.
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 systemd[1]: Started Afterburn Hostname.
Apr 27 14:56:49 mnguyen0427-10-j4gjr-worker-0-26s28 systemd[1]: afterburn-hostname.service: Consumed 19ms CPU time
sh-4.4# systemctl cat afterburn-hostname
# /etc/systemd/system/afterburn-hostname.service
[Unit]
Description=Afterburn Hostname
# Block services relying on Networking being up.
Before=network-online.target
# Wait for NetworkManager to report its online
After=NetworkManager-wait-online.service
# Run before hostname checks
Before=node-valid-hostname.service

[Service]
ExecStart=/usr/bin/afterburn --provider openstack-metadata --hostname=/etc/hostname
Type=oneshot

[Install]
WantedBy=network-online.target
sh-4.4# exit
exit
sh-4.2# exit
exit
Removing debug pod ...
$ oc -n openshift-machine-api get machinesets
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
mnguyen0427-10-j4gjr-worker-0   3         3         3       3           129m
$ oc -n openshift-machine-api scale --replicas=4 machinesets/mnguyen0427-10-j4gjr-worker-0
machineset.machine.openshift.io/mnguyen0427-10-j4gjr-worker-0 scaled
$ oc -n openshift-machine-api get machinesets
NAME                            DESIRED   CURRENT   READY   AVAILABLE   AGE
mnguyen0427-10-j4gjr-worker-0   4         4         3       3           129m
$ oc -n openshift-machine-api get machines
NAME                                  PHASE          TYPE        REGION      ZONE   AGE
mnguyen0427-10-j4gjr-master-0         Running        m1.xlarge   regionOne   nova   131m
mnguyen0427-10-j4gjr-master-1         Running        m1.xlarge   regionOne   nova   131m
mnguyen0427-10-j4gjr-master-2         Running        m1.xlarge   regionOne   nova   131m
mnguyen0427-10-j4gjr-worker-0-26s28   Running        m1.large    regionOne   nova   129m
mnguyen0427-10-j4gjr-worker-0-9nntv   Provisioning                                  2m13s
mnguyen0427-10-j4gjr-worker-0-cwfjh   Running        m1.large    regionOne   nova   129m
mnguyen0427-10-j4gjr-worker-0-d8kxh   Running        m1.large    regionOne   nova   129m
$ oc -n openshift-machine-api get machines
NAME                                  PHASE         TYPE        REGION      ZONE   AGE
mnguyen0427-10-j4gjr-master-0         Running       m1.xlarge   regionOne   nova   135m
mnguyen0427-10-j4gjr-master-1         Running       m1.xlarge   regionOne   nova   135m
mnguyen0427-10-j4gjr-master-2         Running       m1.xlarge   regionOne   nova   135m
mnguyen0427-10-j4gjr-worker-0-26s28   Running       m1.large    regionOne   nova   133m
mnguyen0427-10-j4gjr-worker-0-9nntv   Provisioned   m1.large    regionOne   nova   6m4s
mnguyen0427-10-j4gjr-worker-0-cwfjh   Running       m1.large    regionOne   nova   133m
mnguyen0427-10-j4gjr-worker-0-d8kxh   Running       m1.large    regionOne   nova   133m
$ oc -n openshift-machine-api get machines
NAME                                  PHASE     TYPE        REGION      ZONE   AGE
mnguyen0427-10-j4gjr-master-0         Running   m1.xlarge   regionOne   nova   140m
mnguyen0427-10-j4gjr-master-1         Running   m1.xlarge   regionOne   nova   140m
mnguyen0427-10-j4gjr-master-2         Running   m1.xlarge   regionOne   nova   140m
mnguyen0427-10-j4gjr-worker-0-26s28   Running   m1.large    regionOne   nova   138m
mnguyen0427-10-j4gjr-worker-0-9nntv   Running   m1.large    regionOne   nova   11m
mnguyen0427-10-j4gjr-worker-0-cwfjh   Running   m1.large    regionOne   nova   138m
mnguyen0427-10-j4gjr-worker-0-d8kxh   Running   m1.large    regionOne   nova   138m
$ oc get nodes
NAME                                  STATUS     ROLES    AGE    VERSION
mnguyen0427-10-j4gjr-master-0         Ready      master   140m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-1         Ready      master   139m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-2         Ready      master   132m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-26s28   Ready      worker   122m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-9nntv   NotReady   worker   47s    v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-cwfjh   Ready      worker   121m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-d8kxh   Ready      worker   109m   v1.21.0-rc.0+6143dea
$ oc get nodes
NAME                                  STATUS   ROLES    AGE    VERSION
mnguyen0427-10-j4gjr-master-0         Ready    master   141m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-1         Ready    master   139m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-master-2         Ready    master   132m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-26s28   Ready    worker   123m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-9nntv   Ready    worker   79s    v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-cwfjh   Ready    worker   122m   v1.21.0-rc.0+6143dea
mnguyen0427-10-j4gjr-worker-0-d8kxh   Ready    worker   109m   v1.21.0-rc.0+6143dea
~~~

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.
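As an aside, the NetworkManager journal check used in the diagnosis above (`journalctl ... | grep dhcp4`) can be scripted. A minimal sketch, with hypothetical sample lines shaped like the excerpts in this report (an empty result reproduces the OVN symptom, where no host-name option is offered):

```python
import re

# Hypothetical journal excerpt, shaped like the NetworkManager lines in this report.
journal = """\
Nov 23 18:08:22 localhost NetworkManager[745]: <info>  [1606154902.7909] dhcp4 (ens3): option ip_address => '192.168.150.167'
Nov 22 06:50:00 worker-0 NetworkManager[1635]: <info>  [1606027800.0301] dhcp4 (ens3): option host_name => 'host-10-0-91-158'
"""

def dhcp_hostnames(lines):
    """Collect host_name values offered via DHCP; empty on OVN clouds hit by this bug."""
    pat = re.compile(r"dhcp4 \([^)]+\): option host_name => '([^']+)'")
    return [m.group(1) for m in (pat.search(line) for line in lines) if m]

print(dhcp_hostnames(journal.splitlines()))  # ['host-10-0-91-158']
```

On an affected OSP 16.1/OVN node, the same scan over the real journal would return an empty list.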
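The fix hinges on systemd ordering: afterburn-hostname must run after NetworkManager-wait-online but before node-valid-hostname. As a toy illustration (not systemd itself), the Before=/After= edges quoted in this report can be checked with a topological sort:

```python
from graphlib import TopologicalSorter

# Map each unit to the units that must run before it, per the Before=/After=
# directives quoted in this report. afterburn-hostname-set.service is the
# workaround unit from the MachineConfig above.
runs_after = {
    "NetworkManager-wait-online.service": set(),
    "afterburn-hostname.service": {"NetworkManager-wait-online.service"},
    "afterburn-hostname-set.service": {"afterburn-hostname.service"},
    "node-valid-hostname.service": {"afterburn-hostname.service",
                                    "afterburn-hostname-set.service"},
    "network-online.target": {"afterburn-hostname.service"},
}

order = list(TopologicalSorter(runs_after).static_order())
# The hostname is set before node-valid-hostname.service checks it.
assert order.index("afterburn-hostname.service") < order.index("node-valid-hostname.service")
print(order)
```

If the graph contained no edge forcing the hostname to be set before node-valid-hostname ran, either ordering would be valid, which is exactly the race this bug describes.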