Bug 2044342 - ssh client stays running even after the remote process has terminated
Summary: ssh client stays running even after the remote process has terminated
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 9
Classification: Red Hat
Component: openssh
Version: 9.0
Hardware: Unspecified
OS: Unspecified
medium
medium
Target Milestone: rc
: ---
Assignee: Dmitry Belyavskiy
QA Contact: BaseOS QE Security Team
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2022-01-24 12:28 UTC by Lukáš Doktor
Modified: 2023-07-24 07:28 UTC (History)
2 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-07-24 07:28:28 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
Script that is running fio job with timeout via ssh (514 bytes, application/x-shellscript)
2022-01-24 12:28 UTC, Lukáš Doktor
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker CRYPTO-9792 0 None None None 2023-03-13 15:59:07 UTC
Red Hat Issue Tracker RHELPLAN-109386 0 None None None 2022-01-24 12:35:02 UTC

Description Lukáš Doktor 2022-01-24 12:28:18 UTC
Created attachment 1853060 [details]
Script that is running fio job with timeout via ssh

Description of problem:
When debugging the issue from https://bugzilla.redhat.com/show_bug.cgi?id=2024544#c11 I stumbled across another issue. I was running a loop of fio processes executed via ssh under a timeout. Once in a while this hangs for infinity, the fio processes finished on the remote machine (VM).


Version-Release number of selected component (if applicable):
* RHEL-9.0.0-20211026.10
* openssh-clients-8.6p1-7.el9.1.x86_64


How reproducible:
Once in a while (often it fails in 2-5 attempts, sometimes it survives 50 iterations)


Steps to Reproduce:
1. Use steps from https://bugzilla.redhat.com/show_bug.cgi?id=2042559 to setup host and guest
2. Remove the extra network settings (following lines) to get properly working machine:

        <host csum="off" gso="off" tso4="off" tso6="off" ecn="off" ufo="off" mrg_rxbuf="off"/>
        <guest csum="off" tso4="off" tso6="off" ecn="off" ufo="off"/>

3. Run the "check.sh" script in a loop: I=0; RET=0; while [ $RET -eq 0 ]; do echo $I; bash check.sh ; RET=$?; I=`expr $I + 1`; done


Actual results:
Eventually the ssh on host hangs, not reacting to ctrl+c. Inside guest the fio process is gone.


Expected results:
ssh should finish as soon as the fio processes finishes.


Additional info:
* I tried reproducing this with "ssh -T" as well as without it, both hanged.
* Killing the "timeout" process on host won't help, one has to kill the "bash" process that executed the "ssh timeout". After killing the "bash" process I got additional output:

debug2: channel 0: read<=0 rfd 4 len -1
debug2: channel 0: read failed
debug2: chan_shutdown_read: channel 0: (i0 o0 sock -1 wfd 4 efd 6 [write])
debug2: channel 0: input open -> drain
debug3: send packet: type 1
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i1/0 o0/0 e[write]/0 fd -1/5/6 sock -1 cc -1)

Killed by signal 1.

Comment 1 Lukáš Doktor 2022-01-24 12:29:41 UTC
Additional details collected by Stefan:

The ssh client process stays running even after the remove fio process
has terminated. The remote sshd process is still present.

The openssh-8.6p1-7.el9.1.x86_64 client blocks trying to read from a
pty:

(gdb) bt
#0  0x00007feb53484992 in read () from /lib64/libc.so.6
#1  0x000055e4b31bac46 in read (__nbytes=16384, __buf=0x7fffa3ee8b40, __fd=<optimized out>) at /usr/include/bits/unistd.h:47
#2  channel_handle_rfd.constprop.0.isra.0 (ssh=0x55e4b4a344b0, c=0x55e4b4a4a040, readset=<optimized out>, writeset=<optimized out>)
    at /usr/src/debug/openssh-8.6p1-7.el9.1.x86_64/channels.c:1946
#3  0x000055e4b3176819 in channel_post_open (ssh=0x55e4b4a344b0, c=0x55e4b4a4a040, readset=0x55e4b4a609b0, writeset=0x55e4b4a60a70)
    at /usr/src/debug/openssh-8.6p1-7.el9.1.x86_64/channels.c:2175
#4  0x000055e4b317eb0a in channel_handler (ssh=0x55e4b4a344b0, table=1, readset=0x55e4b4a609b0, writeset=0x55e4b4a60a70, unpause_secs=0x0)
    at /usr/src/debug/openssh-8.6p1-7.el9.1.x86_64/channels.c:2416
#5  0x000055e4b3158ff6 in channel_after_select (writeset=<optimized out>, readset=<optimized out>, ssh=0x55e4b4a344b0) at /usr/src/debug/openssh-8.6p1-7.el9.1.x86_64/channels.c:2500
#6  client_loop (ssh=<optimized out>, have_pty=<optimized out>, escape_char_arg=<optimized out>, ssh2_chan_id=<optimized out>)
    at /usr/src/debug/openssh-8.6p1-7.el9.1.x86_64/clientloop.c:1358
#7  0x000055e4b31480b9 in main (ac=<optimized out>, av=<optimized out>) at /usr/src/debug/openssh-8.6p1-7.el9.1.x86_64/ssh.c:1714
(gdb) p *c
$2 = {type = 4, self = 0, remote_id = 0, have_remote_id = 1, istate = 0, ostate = 0, flags = 0, rfd = 4, wfd = 5, efd = 6, sock = -1, ctl_chan = -1, isatty = 0, client_tty = 0, 
  force_drain = 0, notbefore = 0, delayed = 0, restore_block = 0, input = 0x55e4b4a4acc0, output = 0x55e4b4a49900, extended = 0x55e4b4a36b90, path = 0x0, listening_port = 0, 
  listening_addr = 0x0, host_port = 0, remote_name = 0x55e4b4a61580 "client-session", remote_window = 2097152, remote_maxpacket = 32768, local_window = 2097006, local_window_max = 2097152, 
  local_consumed = 146, local_maxpacket = 32768, extended_usage = 2, single_connection = 0, ctype = 0x55e4b31c19ac "session", open_confirm = 0x55e4b314adb0 <ssh_session2_setup>, 
  open_confirm_ctx = 0x0, detach_user = 0x55e4b3152260 <client_channel_closed>, detach_close = 0, status_confirms = {tqh_first = 0x0, tqh_last = 0x55e4b4a4a118}, input_filter = 0x0, 
  output_filter = 0x0, filter_ctx = 0x0, filter_cleanup = 0x0, datagram = 0, connect_ctx = {host = 0x0, port = 0, ai = 0x0, aitop = 0x0}, mux_rcb = 0x0, mux_ctx = 0x0, mux_pause = 0, 
  mux_downstream_id = 0}

The file descriptor is in blocking mode:
# ls -alF /proc/26338/fd
total 0
dr-x------. 2 root root  0 Jan 24 06:40 ./
dr-xr-xr-x. 9 root root  0 Jan 24 05:36 ../
lrwx------. 1 root root 64 Jan 24 06:40 0 -> /dev/pts/2
lrwx------. 1 root root 64 Jan 24 06:57 1 -> /dev/null
lrwx------. 1 root root 64 Jan 24 06:57 2 -> /dev/pts/2
lrwx------. 1 root root 64 Jan 24 06:57 3 -> 'socket:[155926]'
lrwx------. 1 root root 64 Jan 24 06:57 4 -> /dev/pts/2
lrwx------. 1 root root 64 Jan 24 06:57 5 -> /dev/pts/2
lrwx------. 1 root root 64 Jan 24 06:57 6 -> /dev/pts/2
# cat /proc/26338/fdinfo/4
pos:	0
flags:	02100002
mnt_id:	27
ino:	5

As a result it seems that the ssh client does not process messages from
the remote sshd. It does not notice that the remote command has
terminated.

Comment 4 Lukáš Doktor 2022-01-26 09:23:13 UTC
Tried with the RHEL-9.0.0-20220125.3 with openssh-8.7p1-6.el9.x86_64 and failed on the first attempt:

debug2: channel_input_open_confirmation: channel 0: callback start
debug2: fd 3 setting TCP_NODELAY
debug3: set_sock_tos: set socket 3 IP_TOS 0x20
debug2: client_session2_setup: id 0
debug1: Sending command: fio /fio.job
debug2: channel 0: request exec confirm 1
debug3: send packet: type 98
debug2: channel_input_open_confirmation: channel 0: callback done
debug2: channel 0: open confirm rwindow 0 rmax 32768
debug2: channel 0: rcvd adjust 2097152
debug3: receive packet: type 99
debug2: channel_input_status_confirm: type 99 id 0
debug2: exec request accepted on channel 0
job-/dev/vdb: (g=0): rw=read, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=8
...
fio-3.27
Starting 10 processes



^C^C^C^C^C^C



-----------------------------------------------------------------
After killing the bash process it produced additional output:

debug2: channel 0: read failed
debug2: chan_shutdown_read: channel 0: (i0 o0 sock -1 wfd 4 efd 6 [write])
debug2: channel 0: input open -> drain
debug3: send packet: type 1
debug1: channel 0: free: client-session, nchannels 1
debug3: channel 0: status: The following connections are open:
  #0 client-session (t4 r0 i1/0 o0/0 e[write]/0 fd -1/5/6 sock -1 cc -1)

Killed by signal 1.

[1]+  Terminated              bash check.sh

Comment 5 Stefan Hajnoczi 2022-01-26 10:22:40 UTC
I don't know the ssh code but here are my thoughts when I collected the information from gdb.

The ssh client code is that the decision to put fds into non-blocking mode is a little complicated:
https://github.com/openssh/openssh-portable/blob/master/channels.c#L327

I wonder if the bash "timeout 350 ssh -T -vvv 192.168.122.211 fio /fio.job" command line takes a code path that wasn't really intended?

I didn't investigate why channel_handle_rfd() does a blocking read expecting more data than currently available on the fd. If the fd is blocking it probably should have figured out how much data to read?
https://github.com/openssh/openssh-portable/blob/master/channels.c#L1917

Comment 15 RHEL Program Management 2023-07-24 07:28:28 UTC
After evaluating this issue, there are no plans to address it further or fix it in an upcoming release.  Therefore, it is being closed.  If plans change such that this issue will be fixed in an upcoming release, then the bug can be reopened.


Note You need to log in before you can comment on or make changes to this bug.