Bug 1888761

Summary: sshd service fails to start the first time because of missing host keys
Product: Red Hat Enterprise Linux 8 Reporter: Qin Yuan <qiyuan>
Component: cloud-initAssignee: Eduardo Otubo <eterrell>
Status: CLOSED CURRENTRELEASE QA Contact: xiachen
Severity: low Docs Contact:
Priority: low    
Version: 8.3CC: eterrell, huzhao, jgreguske, mrezanin, obudai, ribarry, xiachen, xiliang, yacao, yuxisun
Target Milestone: rcKeywords: TestOnly, Triaged
Target Release: 8.0Flags: pm-rhel: mirror+
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-06-23 09:23:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1957532, 1970909    
Bug Blocks:    
Attachments:
Description Flags
cloud-init logs none

Description Qin Yuan 2020-10-15 17:26:05 UTC
Created attachment 1721882 [details]
cloud-init logs

Description of problem:
When start a VM with cloud-init, sshd service fails to start the first time because of missing host keys:

[root@test1 ~]# journalctl -u sshd
Oct 15 19:23:48 test1 systemd[1]: Starting OpenSSH server daemon...
Oct 15 19:23:48 test1 sshd[1043]: Unable to load host key: /etc/ssh/ssh_host_rsa_key
Oct 15 19:23:48 test1 sshd[1043]: Unable to load host key: /etc/ssh/ssh_host_ecdsa_key
Oct 15 19:23:48 test1 sshd[1043]: Unable to load host key: /etc/ssh/ssh_host_ed25519_key
Oct 15 19:23:48 test1 sshd[1043]: sshd: no hostkeys available -- exiting.
Oct 15 19:23:48 test1 systemd[1]: sshd.service: Main process exited, code=exited, status=1/FAILURE
Oct 15 19:23:48 test1 systemd[1]: sshd.service: Failed with result 'exit-code'.
Oct 15 19:23:48 test1 systemd[1]: Failed to start OpenSSH server daemon.
Oct 15 19:24:32 test1 systemd[1]: sshd.service: Service RestartSec=42s expired, scheduling restart.
Oct 15 19:24:32 test1 systemd[1]: sshd.service: Scheduled restart job, restart counter is at 1.
Oct 15 19:24:32 test1 systemd[1]: Stopped OpenSSH server daemon.
Oct 15 19:24:32 test1 systemd[1]: Starting OpenSSH server daemon...
Oct 15 19:24:33 test1 sshd[4766]: Server listening on 0.0.0.0 port 22.
Oct 15 19:24:33 test1 sshd[4766]: Server listening on :: port 22.
Oct 15 19:24:33 test1 systemd[1]: Started OpenSSH server daemon.
Oct 15 19:25:09 test1 sshd[4769]: Accepted password for root from 10.72.12.131 port 34770 ssh2
Oct 15 19:25:09 test1 sshd[4769]: pam_unix(sshd:session): session opened for user root by (uid=0)

In /var/log/messages:

Oct 15 19:23:38 test1 systemd[1]: Starting Initial cloud-init job (pre-networking)...
Oct 15 19:23:38 test1 systemd[1]: Starting OpenSSH ecdsa Server Key Generation...
Oct 15 19:23:38 test1 systemd[1]: Starting OpenSSH rsa Server Key Generation...
Oct 15 19:23:38 test1 systemd[1]: Starting OpenSSH ed25519 Server Key Generation...
Oct 15 19:23:38 test1 systemd[1]: sshd-keygen: Succeeded.
Oct 15 19:23:38 test1 systemd[1]: Started OpenSSH ecdsa Server Key Generation.
Oct 15 19:23:38 test1 systemd[1]: sshd-keygen: Succeeded.
Oct 15 19:23:38 test1 systemd[1]: Started OpenSSH ed25519 Server Key Generation.
Oct 15 19:23:41 test1 systemd[1]: sshd-keygen: Succeeded.
Oct 15 19:23:41 test1 systemd[1]: Started OpenSSH rsa Server Key Generation.
Oct 15 19:23:41 test1 systemd[1]: Reached target sshd-keygen.target.
Oct 15 19:23:44 test1 systemd[1]: Started Initial cloud-init job (pre-networking).
Oct 15 19:23:47 test1 systemd[1]: Starting Initial cloud-init job (metadata service crawler)...
Oct 15 19:23:48 test1 systemd[1]: Started Initial cloud-init job (metadata service crawler).
Oct 15 19:23:48 test1 systemd[1]: Starting OpenSSH server daemon...
Oct 15 19:23:48 test1 sshd[1043]: Unable to load host key: /etc/ssh/ssh_host_rsa_key
Oct 15 19:23:48 test1 sshd[1043]: Unable to load host key: /etc/ssh/ssh_host_ecdsa_key
Oct 15 19:23:48 test1 sshd[1043]: Unable to load host key: /etc/ssh/ssh_host_ed25519_key
Oct 15 19:23:48 test1 sshd[1043]: sshd: no hostkeys available -- exiting.
Oct 15 19:23:48 test1 systemd[1]: sshd.service: Main process exited, code=exited, status=1/FAILURE
Oct 15 19:23:48 test1 systemd[1]: sshd.service: Failed with result 'exit-code'.
Oct 15 19:23:48 test1 systemd[1]: Failed to start OpenSSH server daemon.

In /var/log/cloud-init.log:

2020-10-15 16:23:43,838 - stages.py[DEBUG]: Running module ssh (<module 'cloudinit.config.cc_ssh' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_ssh.py'>) with frequency once-per-instance
2020-10-15 16:23:43,840 - helpers.py[DEBUG]: Running config-ssh using lock (<FileLock using file '/var/lib/cloud/instances/6c7709d2-be62-445b-b678-a022ef6c0aa9/sem/config_ssh'>)
2020-10-15 16:23:43,846 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_ecdsa_key
2020-10-15 16:23:43,846 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_ecdsa_key.pub
2020-10-15 16:23:43,847 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_ed25519_key
2020-10-15 16:23:43,847 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_ed25519_key.pub
2020-10-15 16:23:43,847 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_rsa_key
2020-10-15 16:23:43,847 - util.py[DEBUG]: Attempting to remove /etc/ssh/ssh_host_rsa_key.pub

In /etc/cloud/cloud.cfg:

ssh_deletekeys:   1
ssh_genkeytypes:  ~


As you can see, OpenSSH Server Key Generation was started almost at the same time with cloud-init job, but the keys generated by it were removed later when running cloud-init ssh module because ssh_deletekeys is 1. So when starting OpenSSH server daemon, it failed because of no hostkeys available at that time.

It's not reasonable for cloud-init to remove the hostkeys generated by sshd, sshd hostkeys generation should be executed after cloud-init is finished.


Version-Release number of selected component (if applicable):
kernel-4.18.0-233.el8.x86_64 (rhel 8.3)
cloud-init-19.4-8.el8.noarch
RHV-4.4.3

How reproducible:
100%

Steps to Reproduce:
1. Create a VM from template(latest-rhel-guest-image-8.3-infra) and enable cloud-init on RHV.
2. Run the VM.
3. Check sshd service info by running `journalctl -u sshd` after the VM is up.

Actual results:
1. sshd service failed to start the first time, and it got restarted 42s later.

Expected results:
1. sshd service should be started successfully the first time.

Additional info:
This also happens on rhel 8.2 VM.

Comment 1 Huijuan Zhao 2020-10-19 03:27:42 UTC
Can reproduce on OpenStack both in cloud-init-19.4-11.el8.noarch and cloud-init-20.3-1.el8.1.eterrell202010141324.noarch.
Give qa_ack+

Comment 2 Eduardo Otubo 2020-10-19 10:19:16 UTC
ssh_deletekeys does not remove newly generated keys, it deletes keys that are already available on the instance when it is deployed - this should be the default behavior in order to avoid having the same key across multiple instances. The problem here looks like to be a small race between sshd and cloud-init. Cloud-init needs to consume ssh keys at a point in time that they're not yet generated by sshd. Looks like an upstream bug. I'm setting the priority of this BZ to low as it doesn't prevent the startup and/or configuration. But I'll definitely look into this in the future.

Thanks for the complete report!

Comment 6 xiachen 2021-06-23 09:23:28 UTC
Close this bug based on all test PASS.

Test details in below bugs.
BZ #1957532 for RHEL8.5, cloud-init-21.1-2.el8
BZ #1963981 for RHEL8.4.0.z, cloud-init-20.3-10.el8_4.4
BZ #1963982 for RHEL8.2.0.z, cloud-init-18.5-12.el8_2.8
BZ #1970909 for RHEL9.0, cloud-init-21.1-2.el9