Bug 1327332
Summary: sshd fails to come up correctly sometimes (~50%) if both sshd.socket and sshd.service are enabled

| Field | Value |
|---|---|
| Product | Red Hat Enterprise Linux 8 |
| Reporter | Fabian Deutsch <fdeutsch> |
| Component | systemd |
| Assignee | systemd-maint |
| Status | CLOSED WONTFIX |
| QA Contact | Frantisek Sumsal <fsumsal> |
| Severity | high |
| Docs Contact | |
| Priority | unspecified |
| Version | 8.1 |
| CC | aoconnor, cshao, dtardon, jjelen, lnykryn, plautrba, redhat-bugzilla, robert.scheck, sbonazzo, systemd-maint-list, systemd-maint, talayan, weiwang, ycui |
| Target Milestone | alpha |
| Target Release | 8.1 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | |
| Doc Type | Bug Fix |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2020-11-01 03:02:56 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1323975 |
| Attachments | |
Description
Fabian Deutsch
2016-04-14 20:06:37 UTC
Logs and status from the sshd and sshd-keygen services would be useful. I just tried your use case in my virtual machine with exactly the same version, and it worked for me. After the first reboot, I got new keys generated. Do you have the default configuration? Can you attach boot logs from the first reboot, which didn't generate your keys?

sshd.service and sshd.socket don't have logs (journalctl -u …). I can attach sshd-keygen.service logs, but there was nothing obvious.

Did you try removing the host keys, and then reboot?

I must say, on a clean 7.2 install I can not reproduce this issue anymore. Okay, it's a different issue with a different component. Sorry for the noise.

So, I've found the root cause of this bug. Reproducer:

1. Install RHEL 7.2
2. systemctl enable sshd.service sshd.socket
3. Reboot

Repeat step 3 a couple of times; sometimes sshd.service will be up, sometimes not. In general sshd.socket will not be active. This effectively means that sshd is sometimes not available, because neither sshd.service nor sshd.socket is up. This problem can appear when sshd.socket is enabled by a preset and the user decides to enable ssh in a kickstart during installation.

(In reply to Fabian Deutsch from comment #5)
> 2. systemctl enable sshd.service sshd.socket

This step is not necessary; or rather, it should not be allowed at all. Use either sshd.service or sshd.socket, but not both together. Both sshd.service and sshd.socket bind to the same port. While sshd.service runs a standalone server, sshd.socket runs an sshd for every connection. So there's probably some kind of race condition or something.

Yep. First: currently this is possible, as described in comment 5. The questions, in my opinion, are:

1. Should there be a way in systemd to express this conflict?
2. Should openssh decide on one of the two solutions?

We should add some more visible warning, which would make clear that you have two units simultaneously enabled during the boot.
Unfortunately we can't add this to the enable command, since the install logic only parses the [Install] section.

Lukas, should we move the bug to systemd, if you plan to do something with that? Or are you tracking this somewhere else? Or can I close it as a misconfiguration? There is certainly nothing we can do about that in openssh.

Why does ssh fail in half of the cases? Does socket-activated ssh have some different configuration than the regular one?

IMHO the problem on the openssh side is that openssh provides two ways to "enable ssh": (a) by enabling sshd.socket, (b) by enabling sshd.service. If both of them get enabled, this bug appears. So openssh could decide to go with one method. For systemd: systemd could probably raise a warning if there is some logical error (i.e. here sshd.socket conflicts with sshd.service, as defined in the units).

If both (socket and service) are enabled, my guess is that sometimes sshd.socket starts first and then sshd.service fails to bind port 22, and sometimes it is the other way round (50% chance).
The conflict is defined in the units as expected, but it is not taken into account.
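For context, the conflict mentioned above is expressed with a `Conflicts=` directive in the socket unit. The following is an illustrative sketch of the relevant stanzas, paraphrased from memory rather than a verbatim copy of the RHEL unit file; `Accept=yes` corresponds to the per-connection sshd behavior described earlier in the thread:

```ini
# sshd.socket (illustrative excerpt, not the verbatim RHEL unit)
[Unit]
Description=OpenSSH Server Socket
Conflicts=sshd.service

[Socket]
ListenStream=22
Accept=yes
```

Because of `Conflicts=sshd.service`, queuing a start job for sshd.socket also queues a stop job for sshd.service, which is exactly what the debug logs later in this report show.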
> So openssh could decide to go with one method.
We don't enable sshd.socket by default. A valid configuration would also be to disable sshd.service, AFAIK.
> The conflict is defined in the units as expected, but it is not taken into
> the account.
Only one of those things should be started; otherwise it is a bug. I know that in half of the cases the socket is started and in the other half the service is started, but I was asking why it does not work in one of those cases.
I am not sure that sshd.service is failing. Its status is inactive (dead). With systemd debug logs we can see that systemd is not even trying to start the service. It just deletes both jobs in the boot sequence, if I am right:

```
# journalctl -b | grep sshd
Apr 27 14:22:20 frontend.local systemd[1]: Looking at job sshd.service/start conflicted_by=no
Apr 27 14:22:20 frontend.local systemd[1]: Looking at job sshd.service/stop conflicted_by=yes
Apr 27 14:22:20 frontend.local systemd[1]: Fixing conflicting jobs sshd.service/start,sshd.service/stop by deleting job sshd.service/start
Apr 27 14:22:20 frontend.local systemd[1]: Looking at job sshd.socket/stop conflicted_by=no
Apr 27 14:22:20 frontend.local systemd[1]: Looking at job sshd.socket/start conflicted_by=no
Apr 27 14:22:20 frontend.local systemd[1]: Fixing conflicting jobs sshd.socket/stop,sshd.socket/start by deleting job sshd.socket/stop
Apr 27 14:22:20 frontend.local systemd[1]: Looking at job sshd-keygen.service/start conflicted_by=no
Apr 27 14:22:20 frontend.local systemd[1]: Looking at job sshd-keygen.service/stop conflicted_by=no
Apr 27 14:22:20 frontend.local systemd[1]: Fixing conflicting jobs sshd-keygen.service/start,sshd-keygen.service/stop by deleting job sshd-keygen.service/stop
Apr 27 14:22:20 frontend.local systemd[1]: Deleting job sshd.service/stop as dependency of job sshd-keygen.service/stop
Apr 27 14:22:20 frontend.local systemd[1]: Deleting job sshd.socket/start as dependency of job sshd.service/stop
Apr 27 14:22:20 frontend.local systemd[1]: Installed new job sshd-keygen.service/start as 266
Apr 27 14:22:21 frontend.local systemd[1]: sshd-keygen.service/stop would change existing job.
Apr 27 14:22:21 frontend.local systemd[1]: Deleting sshd-keygen.service/stop to minimize impact.
Apr 27 14:22:21 frontend.local systemd[1]: Deleting job sshd.service/stop as dependency of job sshd-keygen.service/stop
Apr 27 14:22:21 frontend.local systemd[1]: Deleting job sshd.socket/start as dependency of job sshd.service/stop
Apr 27 14:22:21 frontend.local systemd[1]: sshd-keygen.service/stop would change existing job.
Apr 27 14:22:21 frontend.local systemd[1]: Deleting sshd-keygen.service/stop to minimize impact.
Apr 27 14:22:21 frontend.local systemd[1]: Deleting job sshd.service/stop as dependency of job sshd-keygen.service/stop
Apr 27 14:22:21 frontend.local systemd[1]: Deleting job sshd.socket/start as dependency of job sshd.service/stop
Apr 27 14:22:23 frontend.local systemd[1]: ConditionFileNotEmpty=|!/etc/ssh/ssh_host_rsa_key failed for sshd-keygen.service.
Apr 27 14:22:23 frontend.local systemd[1]: ConditionFileNotEmpty=|!/etc/ssh/ssh_host_ecdsa_key failed for sshd-keygen.service.
Apr 27 14:22:23 frontend.local systemd[1]: ConditionFileNotEmpty=|!/etc/ssh/ssh_host_ed25519_key failed for sshd-keygen.service.
Apr 27 14:22:23 frontend.local systemd[1]: Starting of sshd-keygen.service requested but condition failed. Not starting unit.
Apr 27 14:22:23 frontend.local systemd[1]: Job sshd-keygen.service/start finished, result=done
```

Lukas, correct me if I am wrong or if I am misinterpreting the logs.

Hmm. If both jobs were deleted, then it looks like a bug in our dependency engine.

This missed 7.2 -> 7.6; can we have a target milestone for this? 7.6.z? 7.7?

The difference is in the order of examination of units for conflicts.
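To spot these decisions on a live system, one can boot with `systemd.log_level=debug` on the kernel command line and filter the journal for the conflict-resolution messages. Below is a minimal sketch that runs the filter over a saved excerpt (a few lines abbreviated from the logs in this report); on a real system the input would come from `journalctl -b` instead of the here-document:

```shell
# Filter systemd's job-conflict decisions out of a saved debug journal.
# On a live system: journalctl -b | grep -E 'Fixing conflicting jobs|Deleting job'
cat > /tmp/journal-excerpt.txt <<'EOF'
Apr 27 14:22:20 frontend.local systemd[1]: Looking at job sshd.service/start conflicted_by=no
Apr 27 14:22:20 frontend.local systemd[1]: Fixing conflicting jobs sshd.service/start,sshd.service/stop by deleting job sshd.service/start
Apr 27 14:22:20 frontend.local systemd[1]: Deleting job sshd.socket/start as dependency of job sshd.service/stop
Apr 27 14:22:20 frontend.local systemd[1]: Installed new job sshd-keygen.service/start as 266
EOF
# Prints only the two lines where jobs are dropped from the transaction.
grep -E 'Fixing conflicting jobs|Deleting job' /tmp/journal-excerpt.txt
```

The "Fixing conflicting jobs ... by deleting job ..." lines are the key indicator that a queued start job was silently discarded.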
If sshd.service comes first, both sshd.service and sshd.socket end up inactive:

```
Feb 14 07:17:21 localhost.localdomain systemd[1]: Looking at job sshd.service/start conflicted_by=no
Feb 14 07:17:21 localhost.localdomain systemd[1]: Looking at job sshd.service/stop conflicted_by=yes
Feb 14 07:17:21 localhost.localdomain systemd[1]: Fixing conflicting jobs sshd.service/start,sshd.service/stop by deleting job sshd.service/start
Feb 14 07:17:21 localhost.localdomain systemd[1]: Looking at job sshd-keygen.service/start conflicted_by=no
Feb 14 07:17:21 localhost.localdomain systemd[1]: Looking at job sshd-keygen.service/stop conflicted_by=no
Feb 14 07:17:21 localhost.localdomain systemd[1]: Fixing conflicting jobs sshd-keygen.service/start,sshd-keygen.service/stop by deleting job sshd-keygen.service/stop
Feb 14 07:17:21 localhost.localdomain systemd[1]: Deleting job sshd.socket/stop as dependency of job sshd-keygen.service/stop
Feb 14 07:17:21 localhost.localdomain systemd[1]: Deleting job sshd.service/stop as dependency of job sshd-keygen.service/stop
Feb 14 07:17:21 localhost.localdomain systemd[1]: Deleting job sshd.socket/start as dependency of job sshd.service/stop
```

OTOH, if sshd.socket comes first, sshd.service ends up running:

```
Feb 14 07:04:07 localhost.localdomain systemd[1]: Looking at job sshd.socket/stop conflicted_by=no
Feb 14 07:04:07 localhost.localdomain systemd[1]: Looking at job sshd.socket/start conflicted_by=no
Feb 14 07:04:07 localhost.localdomain systemd[1]: Fixing conflicting jobs sshd.socket/stop,sshd.socket/start by deleting job sshd.socket/stop
Feb 14 07:04:07 localhost.localdomain systemd[1]: Looking at job sshd-keygen.service/start conflicted_by=no
Feb 14 07:04:07 localhost.localdomain systemd[1]: Looking at job sshd-keygen.service/stop conflicted_by=no
Feb 14 07:04:07 localhost.localdomain systemd[1]: Fixing conflicting jobs sshd-keygen.service/start,sshd-keygen.service/stop by deleting job sshd-keygen.service/stop
Feb 14 07:04:07 localhost.localdomain systemd[1]: Deleting job sshd.service/stop as dependency of job sshd-keygen.service/stop
Feb 14 07:04:07 localhost.localdomain systemd[1]: Deleting job sshd.socket/start as dependency of job sshd.service/stop
```

(In reply to Sandro Bonazzola from comment #17)
> This missed 7.2 -> 7.6; can we have a target milestone for this? 7.6.z? 7.7?

Given that there is an easy and obvious workaround (only enable one of the units), I wouldn't assume this is very high on the priority list.

missed 7.7, retrying with 8.1 for RHV 4.4

Not tracking this for RHV 4.4 anymore

(In reply to Sandro Bonazzola from comment #20)
> missed 7.7, retrying with 8.1 for RHV 4.4

Do you still see this on 8.1? Because I do not: sshd.service is always started and there's no conflict message in the debug log...

I saw a very similar report in recent Fedora in bug #1851478. I do not see the error message in the logs anymore, but I believe the issue is still present.

(In reply to David Tardon from comment #22)
> (In reply to Sandro Bonazzola from comment #20)
> > missed 7.7, retrying with 8.1 for RHV 4.4
>
> Do you still see this on 8.1? Because I do not: sshd.service is always
> started and there's no conflict message in debug log...

Redirecting question to RHV-H QE. Chen?

(In reply to Sandro Bonazzola from comment #24)
> (In reply to David Tardon from comment #22)
> > (In reply to Sandro Bonazzola from comment #20)
> > > missed 7.7, retrying with 8.1 for RHV 4.4
> > >
> > Do you still see this on 8.1? Because I do not: sshd.service is always
> > started and there's no conflict message in debug log...
>
> Redirecting question to RHV-H QE. Chen?

Can't reproduce this issue with RHVH 4.4.
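The order dependence in the two log excerpts above can be replayed with a small toy model. This is a deliberately simplified reconstruction, not systemd's actual transaction code: the `conflicted_by` marking and the "deleting X as dependency of Y" cascade edges are taken directly from the log lines in this report, and the resolution rule (a stop job pulled in via Conflicts= wins over a start job; otherwise the start job is kept) is inferred from them.

```python
# Toy model of the order-dependent job-conflict resolution seen in the logs.
# NOT systemd's real algorithm; the data below is inferred from this report.

# Jobs queued at boot: every unit has both a start job (from being enabled)
# and a stop job (pulled in indirectly via sshd.socket's Conflicts=).
ALL_JOBS = {
    "sshd.service/start", "sshd.service/stop",
    "sshd.socket/start", "sshd.socket/stop",
    "sshd-keygen.service/start", "sshd-keygen.service/stop",
}

# The logs show sshd.service/stop with conflicted_by=yes (it was pulled in
# by sshd.socket/start via Conflicts=sshd.service).
CONFLICTED_BY = {"sshd.service/stop"}

# "Deleting job X as dependency of job Y" lines from the logs:
CASCADE = {
    "sshd-keygen.service/stop": ["sshd.service/stop"],
    "sshd.service/stop": ["sshd.socket/start"],
}

def resolve(order):
    """Fix start/stop conflicts unit by unit, in the given examination order."""
    jobs = set(ALL_JOBS)

    def delete(job):
        if job in jobs:
            jobs.remove(job)
            for dep in CASCADE.get(job, ()):  # deletion cascades to dependencies
                delete(dep)

    for unit in order:
        start, stop = unit + "/start", unit + "/stop"
        if start in jobs and stop in jobs:
            if stop in CONFLICTED_BY:
                delete(start)  # a stop job pulled in via Conflicts= wins
            else:
                delete(stop)   # otherwise the start job is kept
    return {j for j in jobs if j.endswith("/start")}

# sshd.service examined first: both sshd units lose their start jobs.
print(resolve(["sshd.service", "sshd.socket", "sshd-keygen.service"]))
# → {'sshd-keygen.service/start'}

# sshd.socket examined first: sshd.service/start survives, the daemon runs.
print(resolve(["sshd.socket", "sshd-keygen.service", "sshd.service"]))
# → {'sshd-keygen.service/start', 'sshd.service/start'}
```

Under this model, the examination order alone decides whether sshd.service starts, which matches the observed roughly 50% failure rate.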
```
# imgbase w
You are on rhvh-4.4.0.21-0.20200518.0+1
[root@dell-per730-35 ~]# systemctl status sshd
● sshd.service - OpenSSH server daemon
   Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-07-30 08:06:36 UTC; 4 days ago
     Docs: man:sshd(8)
           man:sshd_config(5)
 Main PID: 1566 (sshd)
    Tasks: 1 (limit: 203983)
   Memory: 6.0M
   CGroup: /system.slice/sshd.service
           └─1566 /usr/sbin/sshd -D -oCiphers=aes256-gcm,chacha20-poly1305,aes256-ctr,aes256-cbc,aes128-gcm,aes128-ctr,aes128-cbc -oMACs=hmac-sha2-256-etm,hmac-sh>

Jul 30 08:06:36 localhost.localdomain systemd[1]: Starting OpenSSH server daemon...
Jul 30 08:06:36 localhost.localdomain sshd[1566]: Server listening on 0.0.0.0 port 22.
Jul 30 08:06:36 localhost.localdomain sshd[1566]: Server listening on :: port 22.
Jul 30 08:06:36 localhost.localdomain systemd[1]: Started OpenSSH server daemon.
Aug 03 09:17:49 dell-per730-35.lab.eng.pek2.redhat.com sshd[14835]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=10.72.13.126 user=root
Aug 03 09:17:49 dell-per730-35.lab.eng.pek2.redhat.com sshd[14835]: pam_succeed_if(sshd:auth): requirement "uid >= 1000" not met by user "root"
Aug 03 09:17:50 dell-per730-35.lab.eng.pek2.redhat.com sshd[14835]: Failed password for root from 10.72.13.126 port 42340 ssh2
Aug 03 09:18:01 dell-per730-35.lab.eng.pek2.redhat.com sshd[14835]: Accepted password for root from 10.72.13.126 port 42340 ssh2
Aug 03 09:18:01 dell-per730-35.lab.eng.pek2.redhat.com sshd[14835]: pam_unix(sshd:session): session opened for user root by (uid=0)

# rm -f /etc/ssh/*host*
# reboot
```

Check sshd status again.

```
# systemctl status sshd
● sshd.service - OpenSSH server daemon
   Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-08-03 09:23:29 UTC; 37s ago
     Docs: man:sshd(8)
           man:sshd_config(5)
 Main PID: 1563 (sshd)
    Tasks: 1 (limit: 203983)
   Memory: 8.5M
   CGroup: /system.slice/sshd.service
           └─1563 /usr/sbin/sshd -D -oCiphers=aes256-gcm,chacha20-poly1305,aes256-ctr,aes256-cbc,aes128-gcm,aes128-ctr,aes128-cbc -oMACs=hmac-sha2-256-etm,hmac-sh>

Aug 03 09:23:29 localhost.localdomain systemd[1]: Starting OpenSSH server daemon...
Aug 03 09:23:29 localhost.localdomain sshd[1563]: Server listening on 0.0.0.0 port 22.
Aug 03 09:23:29 localhost.localdomain sshd[1563]: Server listening on :: port 22.
Aug 03 09:23:29 localhost.localdomain systemd[1]: Started OpenSSH server daemon.
Aug 03 09:23:58 dell-per730-35.lab.eng.pek2.redhat.com sshd[2569]: Accepted password for root from 10.72.13.126 port 42578 ssh2
Aug 03 09:23:58 dell-per730-35.lab.eng.pek2.redhat.com sshd[2569]: pam_unix(sshd:session): session opened for user root by (uid=0)
```

I do not see any reference to the sshd.socket in your logs.

(In reply to Jakub Jelen from comment #26)
> I do not see any reference to the sshd.socket in your logs.

Update:

```
# imgbase w
You are on rhvh-4.4.0.21-0.20200518.0+1
# systemctl status sshd
● sshd.service - OpenSSH server daemon
   Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-08-03 09:23:29 UTC; 37min ago
     Docs: man:sshd(8)
           man:sshd_config(5)
 Main PID: 1563 (sshd)
    Tasks: 1 (limit: 203983)
   Memory: 9.9M
   CGroup: /system.slice/sshd.service
           └─1563 /usr/sbin/sshd -D -oCiphers=aes256-gcm,chacha20-poly1305,aes256-ctr,aes256-cbc,aes128-gcm,aes128-ctr,aes128-cbc -oMACs=hmac-sha2-256-etm,hmac-sh>

Aug 03 09:23:29 localhost.localdomain sshd[1563]: Server listening on 0.0.0.0 port 22.
Aug 03 09:23:29 localhost.localdomain sshd[1563]: Server listening on :: port 22.
Aug 03 09:23:29 localhost.localdomain systemd[1]: Started OpenSSH server daemon.
Aug 03 09:23:58 dell-per730-35.lab.eng.pek2.redhat.com sshd[2569]: Accepted password for root from 10.72.13.126 port 42578 ssh2
Aug 03 09:23:58 dell-per730-35.lab.eng.pek2.redhat.com sshd[2569]: pam_unix(sshd:session): session opened for user root by (uid=0)
Aug 03 09:56:18 dell-per730-35.lab.eng.pek2.redhat.com sshd[2783]: pam_unix(sshd:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=10.72.13.126 user=root
Aug 03 09:56:18 dell-per730-35.lab.eng.pek2.redhat.com sshd[2783]: pam_succeed_if(sshd:auth): requirement "uid >= 1000" not met by user "root"
Aug 03 09:56:20 dell-per730-35.lab.eng.pek2.redhat.com sshd[2783]: Failed password for root from 10.72.13.126 port 42962 ssh2
Aug 03 09:56:22 dell-per730-35.lab.eng.pek2.redhat.com sshd[2783]: Accepted password for root from 10.72.13.126 port 42962 ssh2
Aug 03 09:56:22 dell-per730-35.lab.eng.pek2.redhat.com sshd[2783]: pam_unix(sshd:session): session opened for user root by (uid=0)

# systemctl status sshd.socket
● sshd.socket - OpenSSH Server Socket
   Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:sshd(8)
           man:sshd_config(5)
   Listen: [::]:22 (Stream)
 Accepted: 0; Connected: 0;

# rm -f /etc/ssh/*host*
# reboot

# systemctl status sshd.socket
● sshd.socket - OpenSSH Server Socket
   Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor preset: disabled)
   Active: inactive (dead)
     Docs: man:sshd(8)
           man:sshd_config(5)
   Listen: [::]:22 (Stream)
 Accepted: 0; Connected: 0;

# systemctl status sshd
● sshd.service - OpenSSH server daemon
   Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled; vendor preset: enabled)
   Active: active (running) since Mon 2020-08-03 10:06:15 UTC; 1min 1s ago
     Docs: man:sshd(8)
           man:sshd_config(5)
 Main PID: 1577 (sshd)
    Tasks: 1 (limit: 203983)
   Memory: 8.9M
   CGroup: /system.slice/sshd.service
           └─1577 /usr/sbin/sshd -D -oCiphers=aes256-gcm,chacha20-poly1305,aes256-ctr,aes256-cbc,aes128-gcm,aes128-ctr,aes128-cbc -oMACs=hmac-sha2-256-etm,hmac-sh>

Aug 03 10:06:15 localhost.localdomain systemd[1]: Starting OpenSSH server daemon...
Aug 03 10:06:15 localhost.localdomain sshd[1577]: Server listening on 0.0.0.0 port 22.
Aug 03 10:06:15 localhost.localdomain sshd[1577]: Server listening on :: port 22.
Aug 03 10:06:15 localhost.localdomain systemd[1]: Started OpenSSH server daemon.
Aug 03 10:07:08 dell-per730-35.lab.eng.pek2.redhat.com sshd[2593]: Accepted password for root from 10.72.13.126 port 43092 ssh2
Aug 03 10:07:08 dell-per730-35.lab.eng.pek2.redhat.com sshd[2593]: pam_unix(sshd:session): session opened for user root by (uid=0)
```

(In reply to cshao from comment #27)
> (In reply to Jakub Jelen from comment #26)
> > I do not see any reference to the sshd.socket in your logs.
>
> [...]
> # systemctl status sshd.socket
> ● sshd.socket - OpenSSH Server Socket
> Loaded: loaded (/usr/lib/systemd/system/sshd.socket; disabled; vendor
> preset: disabled)
> Active: inactive (dead)

This issue is present only if *both* the socket and the service are enabled. The above log shows that the socket is disabled, which is indeed a correct configuration, but it cannot demonstrate the bug we are talking about.

Thanks. I can reproduce this issue with RHVH 4.4.

Test version: redhat-virtualization-host-4.4.1-20200722.0.el8_2

Test steps:
1) systemctl enable sshd.service sshd.socket
2) reboot (login to the system after it is up)
3) systemctl status sshd.service sshd.socket
4) reboot several times.

Test result: Both sshd.service and sshd.socket are in "inactive (dead)" state and one can't connect through ssh. See attachment "rhvh.png".

Created attachment 1710250 [details]
rhvh
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release. Therefore, it is being closed. If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.