Bug 598631 - stty incorrectly reports cdtrdsr
Summary: stty incorrectly reports cdtrdsr
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: coreutils
Version: 6.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Ondrej Vasik
QA Contact: qe-baseos-daemons
URL:
Whiteboard:
Depends On: 613756
Blocks: 601025 603908 613768
Reported: 2010-06-01 18:29 UTC by David Howells
Modified: 2011-05-19 13:50 UTC (History)

Fixed In Version: coreutils-8.4-10.el6
Doc Type: Bug Fix
Doc Text:
Previously, the hardware control flow, DTRDSR, was implemented via TC{SG}ETX. This was changed to TC{SG}ET ioctl, which caused the CDTRDSR support in stty to fail. This was fixed to allow stty to correctly handle CDTRDSR control flow.
Clone Of:
: 613756 613768 (view as bug list)
Environment:
Last Closed: 2011-05-19 13:50:37 UTC


Attachments
hw info (5.45 KB, application/x-gzip)
2010-06-13 03:22 UTC, Amos Kong
plymouth:debug output from my x86_64 installation (110.93 KB, text/plain)
2010-06-15 15:25 UTC, David Howells
dmesg output after echo t > /proc/sysrq-trigger while cat /usr/share/pango*/HELLO.txt > /dev/ttyS0 is blocked (179.55 KB, text/plain)
2010-07-12 18:23 UTC, Ray Strode [halfline]


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2011:0646 normal SHIPPED_LIVE coreutils bug fix update 2011-05-18 18:11:00 UTC

Description David Howells 2010-06-01 18:29:49 UTC
Description of problem:

None of shutdown, reboot, halt and Ctrl-Alt-Delete actually achieve anything useful.

If I do any of them, I see a broadcast message:

[root@andromeda linux-2.6.32.i686]# shutdown now

Broadcast message from root@andromeda.procyon.org.uk
        (/dev/pts/0) at 19:24 ...

The system is going down for maintenance NOW!

The terminals on /dev/tty# and /dev/ttyS# are killed, but that's all.  Nothing else happens.  My ssh login is still there, and I can still do things.

Looking in ps, I see:

 1591 ?        Ss     0:00 /usr/sbin/atd
 1710 ?        Ss     0:00 /bin/sh -e /dev/fd/10
 1714 ?        S      0:00  \_ initctl emit splash-request IMMEDIATE=1 MODE=shutdown MESSAGE=Shutting down...
 1715 ?        Ss     0:00 /bin/sh -e /dev/fd/11
 1716 ?        S      0:00  \_ /sbin/plymouthd --mode=shutdown
 1717 ?        S      0:00      \_ /sbin/plymouthd --mode=shutdown


Version-Release number of selected component (if applicable):

upstart-0.6.5-5.el6.i686

How reproducible:

100%

Steps to Reproduce:
1. Do any of shutdown, reboot, halt or C-A-D.
  
Expected results:

The machine should shut down.

Additional info:

May need to be running an X session.

Comment 2 Petr Lautrbach 2010-06-02 14:46:08 UTC
I'm not able to reproduce it in kvm with upstart-0.6.5-5.el6.x86_64, initscripts-9.03.8-2.el6.x86_64, plymouth-0.8.3-3.el6.x86_64. 

What versions of initscripts and plymouth do you have?

> [root@andromeda linux-2.6.32.i686]# shutdown now

Just note that shutdown without -[rhHP] only changes the runlevel to 1; see "shutdown --help".

"/sbin/plymouthd --mode=shutdown" is not run when runlevel changes to 1. Is it captured after "shutdown now"?

> 1710 ?        Ss     0:00 /bin/sh -e /dev/fd/10
> 1714 ?        S      0:00  \_ initctl emit splash-request IMMEDIATE=1 MODE=shutdown MESSAGE=Shutting down...
> 1715 ?        Ss     0:00 /bin/sh -e /dev/fd/11
> 1716 ?        S      0:00  \_ /sbin/plymouthd --mode=shutdown
> 1717 ?        S      0:00      \_ /sbin/plymouthd --mode=shutdown

It seems that "/sbin/plymouthd --mode=shutdown" hangs for some reason, so the upstart jobs splash-manager and plymouth-shutdown freeze and the shutdown process doesn't continue.

Comment 3 David Howells 2010-06-02 15:07:28 UTC
upstart-0.6.5-5.el6.i686
initscripts-9.03.8-2.el6.i686
plymouth-0.8.3-3.el6.i686


> Is it captured after "shutdown now"?

Captured after halt.

Comment 4 RHEL Product and Program Management 2010-06-02 18:05:50 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 5 Ray Strode [halfline] 2010-06-11 20:26:59 UTC
are you able to gdb attach to it and see what it's doing?

Comment 6 Amos Kong 2010-06-12 10:55:25 UTC
Hi,

We also hit the same problem on some AMD machines. I'll provide the hardware info later.
Adjusting the bug 'Priority' to high.

Comment 7 Amos Kong 2010-06-13 03:22:55 UTC
Created attachment 423572
hw info

cpuinfo
meminfo
lspci
dmidecode


DellT605

Comment 8 David Howells 2010-06-14 13:00:06 UTC
(In reply to comment #5)
> are you able to gdb attach to it and see what it's doing?    

Attach gdb to what? init? I don't think that's possible.

Comment 9 Ray Strode [halfline] 2010-06-14 14:08:20 UTC
I meant plymouthd.

Can you reboot with plymouth:debug on the kernel command line and get a shot of the plymouth debug spew at shutdown?

Comment 10 Marian Csontos 2010-06-15 07:10:21 UTC
This issue is pretty much reproducible on beaker systems. If you need a system for debugging ping me and I will arrange one broken machine for you...

Or you can start with cloning beaker job from Bug 603908.

Comment 11 David Howells 2010-06-15 15:25:17 UTC
Created attachment 424195
plymouth:debug output from my x86_64 installation

Here's the plymouth:debug log from my x86_64 RHEL-6 installation on my test box.  Plymouth didn't print anything more once the:

    Red Hat Enterprise Linux release 6.0 Alpha (Santiago)
    Kernel 2.6.32-33.el6.x86_64 on an x86_64

lines had appeared, not even when I typed reboot.

Comment 12 Chris Wright 2010-06-18 23:02:35 UTC
I have the same problem with plymouth-0.8.3-5.el6.x86_64

It simply hangs here:

# ps -ef | grep plymouth
root     25120 25119  0 14:55 ?        00:00:00 /sbin/plymouthd --mode=shutdown
root     25121 25120  0 14:55 ?        00:00:00 /sbin/plymouthd --mode=shutdown
# strace -p 25121
Process 25121 attached - interrupt to quit
open("/dev/ttyS0", O_RDWR|O_APPEND

killing that child plymouthd allows reboot to complete.

Comment 13 Bill Nottingham 2010-06-21 19:11:16 UTC
Chris - do you have a serial console?

Comment 14 Ray Strode [halfline] 2010-07-01 02:53:08 UTC
So the fact that there are two plymouthd processes tells us a lot.  This means the hang is happening before plymouthd has finished daemonizing.

That means it's happening here:

static bool
redirect_standard_io_to_device (const char *device)
{
  int fd;
  char *file;

  ply_trace ("redirecting stdio to %s", device);

  if (strncmp (device, "/dev/", strlen ("/dev/")) == 0)
    file = strdup (device);
  else
    asprintf (&file, "/dev/%s", device);

  fd = open (file, O_RDWR | O_APPEND);

  free (file);
}

Presumably, everyone affected has console=ttyS0 or similar on their kernel command lines.  Now the question is, why would trying to open the terminal device hang?

Not sure, off hand.  There have been a few serial console "hanging" kernel bugs in RHEL6.

See bug 590851, bug 568418, and bug 579003 for instance.

I believe they've all been fixed by the mentioned 2.6.32-33.el6.x86_64 kernel though (I think those issues were sorted out by 2.6.32-31.el6).  And they don't exactly fit the symptoms anyway.

One thing that's clearly wrong with the current setup is that plymouth shutdown is started before agetty is stopped.  We probably don't want them using the same tty at the same time.

We should probably change /etc/init/serial.conf from

stop on runlevel [016]

to

stop on starting rc RUNLEVEL=[!5]

and change  /etc/init/plymouth-shutdown.conf from

start on (splash-request IMMEDIATE=1) or (splash-request and stopped prefdm)

to

start on (splash-request IMMEDIATE=1) or (splash-request and stopped prefdm and stopped serial)

It's not clear why making those changes would fix a hang on open() though.

Comment 15 Ray Strode [halfline] 2010-07-01 03:01:22 UTC
that

 stop on starting rc RUNLEVEL=[!5]

mentioned above was a cut and paste error, it should be

 stop on starting rc RUNLEVEL=[016]

and thinking about it more we'll need to make sure we don't wait for the "stopped serial" event if serial was never started.  That means things are a little more complicated.

we'll need to flip the sense of IMMEDIATE and split it into two variables (AFTER_PREFDM and AFTER_SERIAL)

and so the line will be the totally gruesome:

start on (splash-request AFTER_PREFDM=0 AFTER_SERIAL=0) or (splash-request AFTER_PREFDM=1 AFTER_SERIAL=0 and stopped prefdm) or (splash-request AFTER_PREFDM=0 AFTER_SERIAL=1 and stopped serial) or (splash-request AFTER_PREFDM=1 AFTER_SERIAL=1 and stopped prefdm and stopped serial)

and change /etc/init/splash-manager.conf

to call initctl status serial and set AFTER_PREFDM and AFTER_SERIAL appropriately.

This is getting pretty unwieldy, so we may want to rethink how we do this.

Comment 16 Ray Strode [halfline] 2010-07-01 21:41:21 UTC
fwiw, I reserved a beaker system today and provisioned snapshot 6 on it, then upgraded the kernel and plymouth and tried unsuccessfully to reproduce.

I'll look into the "clone" feature mentioned above soon and try to reproduce that way.

Comment 17 Bill Nottingham 2010-07-06 16:40:51 UTC
I've tried this in a local KVM guest with a serial console, and have been unable to reproduce it there.

Comment 18 Ray Strode [halfline] 2010-07-07 17:07:18 UTC
I cloned the recipe mentioned in comment 10, but remote serial console would fail with:

Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload
Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload
Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload
Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload
Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload

so I wasn't able to test.  Upon sshing I got messages like this in dmesg:

kernel: K8 ECC error.
kernel: Northbridge Error, node 2

So I think a machine with failing RAM was provisioned.  I've initiated a new clone that's installing now and it seems to have a functioning serial console.

Comment 19 Ray Strode [halfline] 2010-07-07 17:21:20 UTC
The second clone installed fine, but I can't reproduce the issue.  If I type reboot, it reboots.

Marian, can I take you up on your offer in comment 10 to get me a reproduction environment?

Comment 20 Marian Csontos 2010-07-08 18:51:41 UTC
Hi Ray, it looks like a hardware-related issue.

I scheduled a few reproducers for a version which is known to fail and for the latest STABLE release. These are still Queued as the machines required are in use.
I will notify you of the results.

They are J:667{4,5,6} in beaker.

IIUC it is related to stdin/stdout redirection.
I have seen a few programs get stuck depending on whether stdin was or wasn't a tty/pipe/...
Is the issue known to appear on rhts systems as well?

Comment 22 Ray Strode [halfline] 2010-07-08 20:24:42 UTC
Given this is a hang in open(), i'm going to move this to kernel so it brings more people into the loop.

I'm still going to continue to help investigate, though.

Comment 24 Ray Strode [halfline] 2010-07-09 22:30:42 UTC
Thanks Marian.

I've looked into this a bit today, but still need to investigate more.

It looks like the CLOCAL flag on the serial console tty is somehow getting cleared.

If I run

stty clocal --file=/dev/ttyS0

then shutdown works fine.
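
For anyone who wants to do the same thing from a program rather than from the shell, the effect of the stty call above can be approximated with tcgetattr()/tcsetattr(). This is only a rough sketch, not stty's actual code, and the helper name set_clocal is made up for illustration.

```c
#include <termios.h>

/* Hypothetical helper (not taken from stty): set CLOCAL on an
 * already-open tty and leave every other control flag untouched.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
int set_clocal(int fd)
{
    struct termios tio;

    if (tcgetattr(fd, &tio) < 0)
        return -1;

    tio.c_cflag |= CLOCAL;   /* ignore modem control lines */

    return tcsetattr(fd, TCSANOW, &tio) < 0 ? -1 : 0;
}
```

The descriptor passed in would need to come from an open() with O_NONBLOCK, since a plain blocking open() on a tty without CLOCAL is exactly the call that hangs in this report.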

I've also added a

stty -a --file=/dev/ttyS0

call to /etc/rc.local and indeed clocal is properly set at that point (right before agetty is run). 

Adding -L to /etc/init/serial.conf also makes reboot work properly.  This means that CLOCAL is getting cleared some time after rc.local runs and some time before agetty calls termio_init().

It could be upstart, plymouth, the kernel, or something else clearing CLOCAL. 

I've got to run now, but I'll investigate more on monday.

Comment 25 Ray Strode [halfline] 2010-07-10 03:42:30 UTC
So it looks like this is agetty itself dropping CLOCAL.

In termio_init, it does:

tp->c_cflag = CS8 | HUPCL | CREAD;

which wipes out all existing control flags from the tty (including CLOCAL).  The only reason it doesn't hang itself when trying to open the tty is that it uses O_NONBLOCK.  I think we need to do one of:

1) Fix the assignment above to:

 tp->c_cflag |= CS8 | HUPCL | CREAD;

2) Fix the assignment above to:

 tp->c_cflag = CS8 | HUPCL | CREAD | (tp->c_cflag & CLOCAL);

3) change initscripts to pass -L to agetty when the tty is initialized with CLOCAL by the kernel

4) change plymouth to open its tty with O_NONBLOCK

I'm not sure what the "right" answer is out of these 4 (or if some 5th possibility is the right answer), but since 2 of the 4 involve changing agetty, i'm moving this to util-linux-ng
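
To make the difference between options 1 and 2 concrete, here is a small sketch of the flag arithmetic in option 2, pulled out into a standalone function. The function name is hypothetical and this is not agetty's code; it only shows that CLOCAL survives the reset while any other previously set flag is dropped.

```c
#include <termios.h>

/* Hypothetical helper mirroring option 2: reset the control flags to
 * the CS8 | HUPCL | CREAD defaults quoted above, but carry over CLOCAL
 * if the tty already had it set. */
tcflag_t reset_cflag_keep_clocal(tcflag_t old_cflag)
{
    return CS8 | HUPCL | CREAD | (old_cflag & CLOCAL);
}
```

Option 1 (the |= form) would instead keep every flag that was already set on the tty, not just CLOCAL, so the two options are not equivalent when other control flags are in play.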

Comment 26 Ray Strode [halfline] 2010-07-12 15:22:34 UTC
one thing i've just noticed is while agetty opens with O_NONBLOCK it eventually does

fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) & ~O_NONBLOCK);

and also

if (-L not specified) {
        fcntl(1, F_SETFL, fcntl(1, F_GETFL, 0) & ~O_NONBLOCK);
}

so after setting up the terminal attributes it resets to blocking i/o (unless -L is specified, then it keeps writes non-blocking)
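
Put together, the pattern described here (open with O_NONBLOCK so the open() cannot wait for carrier, then drop O_NONBLOCK again) could look roughly like the sketch below. The helper name is an assumption for illustration, and this is neither agetty's nor plymouth's actual code.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open a tty without waiting for carrier, then
 * switch the descriptor back to blocking I/O.
 * Returns the file descriptor, or -1 on error. */
int open_tty_no_carrier_wait(const char *path)
{
    int fd, flags;

    /* O_NONBLOCK keeps open() from blocking when CLOCAL is clear. */
    fd = open(path, O_RDWR | O_NOCTTY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    /* Restore blocking mode for subsequent reads and writes. */
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

Note that this only addresses the open() call itself; once the descriptor is back in blocking mode, later I/O on it behaves like any other blocking descriptor.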

Comment 27 Ray Strode [halfline] 2010-07-12 17:33:41 UTC
So, I've been reading up a bit more about serial protocol, and I think the CLOCAL "fix" may just be a workaround.  I'd appreciate if someone who knows more about this stuff would chime in, though.  Here's an explanation on how I think it works:

As I understand it, CLOCAL means "don't block waiting for carrier detect and dataset ready signals".  These are two different bits that deal with hardware flow control. I think they mean something like this:

- When carrier detect is asserted it means there is a connection between the remote serial console and the running machine.
- When dataset ready is asserted it means the serial console is reporting that it's ready to receive data from the running machine.

stty -a --file=/dev/ttyS0

on an affected machine outputs:

speed 115200 baud; rows 0; columns 0; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
werase = ^W; lnext = ^V; flush = ^O; min = 1; time = 0;
-parenb -parodd cs8 hupcl -cstopb cread -clocal -crtscts cdtrdsr
-ignbrk brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl -ixon -ixoff
-iuclc -ixany imaxbel -iutf8
opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
echoctl echoke

The "cdtrdsr" there suggests the tty is configured for hardware flow control.  Despite this, I'm guessing we're not getting a "dataset ready" signal from the remote serial console which is what's causing the open() call to block.  When CLOCAL is set, the kernel doesn't bother waiting for the "dataset ready" signal, so no blocking will happen.  Note, I don't believe the remote end will assert "dataset ready" until the kernel asserts "data terminal ready" (a different control line) in response to an open() call from the application.  So the problem could be the kernel isn't ever asserting "data terminal ready" and so the remote end isn't ever asserting "dataset ready" in response. Or it could be something else.
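
One way to check the guess above would be to read the modem status lines directly with the TIOCMGET ioctl and see whether DTR is asserted without DSR coming back. The sketch below assumes Linux's <sys/ioctl.h> definitions of TIOCMGET and the TIOCM_* bits; both helper names are made up for illustration.

```c
#include <sys/ioctl.h>

/* Hypothetical helper: fetch the modem status bits (TIOCM_*) of a tty.
 * Returns 0 on success, -1 on error. */
int read_modem_lines(int fd, int *bits)
{
    return ioctl(fd, TIOCMGET, bits) < 0 ? -1 : 0;
}

/* Hypothetical helper: returns 1 when our side asserts DTR but the
 * remote end is not answering with DSR, 0 otherwise. */
int dsr_unanswered(int bits)
{
    return (bits & TIOCM_DTR) && !(bits & TIOCM_DSR);
}
```

The carrier detect line can be tested the same way with TIOCM_CAR. As with the other examples, the tty would have to be opened with O_NONBLOCK first so the open() itself does not hang.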

Anyway, this stuff is a bit outside my knowledge domain, so I'd appreciate feedback from someone more knowledgeable here.

Also, I just noticed Karel committed a patch that does suggestion 2) from comment 25 in util-linux-ng-2.17.2-5.el6.  I'll test that and report its results.

Comment 28 Ray Strode [halfline] 2010-07-12 17:52:52 UTC
Just to follow up, I can confirm util-linux-ng-2.17.2-5.el6 makes the problem go away.

Comment 29 Aristeu Rozanski 2010-07-12 18:20:11 UTC
Something is wrongly setting DTR/DSR flow control. If the serial cable doesn't
have DTR/DSR pins, it'd block the transmission. But if used as a serial console,
it shouldn't block forever.
I'd recommend having this bug used to track what is enabling DTR/DSR by mistake
and have another BZ to track the kernel side. Please add me and mchehab to the
Cc list of the new bug.

Comment 30 Ray Strode [halfline] 2010-07-12 18:23:29 UTC
Created attachment 431237
dmesg output after echo t > /proc/sysrq-trigger while cat /usr/share/pango*/HELLO.txt > /dev/ttyS0 is blocked

I've talked to aris about this issue, and he believes this is a kernel bug.  He's asked for the above output.

Comment 31 Ray Strode [halfline] 2010-07-12 18:29:26 UTC
ah missed comment 29.  I'm going to clone this report and move the clone to kernel.  I think it may be the kernel initially setting dtr/dsr though, so this may end up a kernel bug as well.

Comment 32 Ray Strode [halfline] 2010-07-12 18:36:30 UTC
As requested in comment 29, I've filed bug 613756.

Comment 34 Ray Strode [halfline] 2010-07-12 19:08:35 UTC
(In reply to comment #29)

> I'd recommend having this bug used to track what is enabling DTR/DSR by mistake
> and have another BZ to track the kernel side.
Turns out it is the kernel enabling DTR/DSR on this machine.  I added an stty -a --file=/dev/ttyS0 call at the top of /init in the initrd and it's set from the start of boot:

Write protecting the kernel read-only data: 7280k
mknod: `/dev/ttyS0': File exists
speed 115200 baud; rows 0; columns 0; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = dracut: dracut-004-20.1.el6
<undef>;
eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
werase = ^W; lnext = ^dracut: rd_NO_LUKS: removing cryptoluks activation
V; flush = ^O; min = 1; time = 0;
-parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts cdtrdsr
-ignbrk -brkint -ignpar -parmrk -inpck -istrdevice-mapper: uevent: version 1.0.3
ip -inlcr -igncrdevice-mapper: ioctl: 4.17.0-ioctl (2010-03-05) initialised: dm-devel@redhat.com
 icrnl ixon -ixoff
-iuclc -ixany -imaxbel -iutf8
opost -olcuc -ocrnl onlcr -onocr -onlret -ofiudev: starting version 147

Moving back to kernel and back to ASSIGNED

Comment 35 David Howells 2010-07-12 19:21:04 UTC
The update to util-linux-ng-2.17.2-5.el6 fixed the problem for me.

Comment 47 Fedora Update System 2010-07-27 02:27:44 UTC
util-linux-ng-2.17.2-6.fc13 has been pushed to the Fedora 13 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 54 Misha H. Ali 2011-05-10 04:10:52 UTC
    Technical note added. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    New Contents:
Previously, the shut down, reboot, halt and Ctrl-Alt-Delete commands only killed terminals on /dev/tty# and /dev/ttyS# while the user's SSH login and ability to carry out operations remained unaffected. The hardware control flow, DTRDSR, was previously implemented via TC{SG}ETX and was subsequently changed to TC{SG}ET ioctl, which caused the CDTRDSR support in stty to fail. This was fixed to allow stty to correctly handle CDTRDSR control flow and shut down, reboot, halt and Ctrl-Alt-Delete operations execute as expected.

Comment 56 Misha H. Ali 2011-05-10 05:00:05 UTC
    Technical note updated. If any revisions are required, please edit the "Technical Notes" field
    accordingly. All revisions will be proofread by the Engineering Content Services team.
    
    Diffed Contents:
@@ -1 +1 @@
-Previously, the shut down, reboot, halt and Ctrl-Alt-Delete commands only killed terminals on /dev/tty# and /dev/ttyS# while the user's SSH login and ability to carry out operations remained unaffected. The hardware control flow, DTRDSR, was previously implemented via TC{SG}ETX and was subsequently changed to TC{SG}ET ioctl, which caused the CDTRDSR support in stty to fail. This was fixed to allow stty to correctly handle CDTRDSR control flow and shut down, reboot, halt and Ctrl-Alt-Delete operations execute as expected.
+Previously, the hardware control flow, DTRDSR, was implemented via TC{SG}ETX. This was changed to TC{SG}ET ioctl, which caused the CDTRDSR support in stty to fail. This was fixed to allow stty to correctly handle CDTRDSR control flow.

Comment 57 errata-xmlrpc 2011-05-19 13:50:37 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0646.html

