Bug 613768 - util-linux-ng has workaround patch for bug 598631 which we should consider reverting
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: util-linux-ng
Version: 6.1
Hardware: All Linux
Priority: low  Severity: low
Target Milestone: rc
Target Release: ---
Assigned To: Karel Zak
QA Contact: qe-baseos-daemons
Depends On: 598631
Blocks: 601025 603908
Reported: 2010-07-12 15:16 EDT by Ray Strode [halfline]
Modified: 2010-08-23 08:08 EDT (History)

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 598631
Environment:
Last Closed: 2010-08-23 08:08:07 EDT


Description Ray Strode [halfline] 2010-07-12 15:16:56 EDT
As part of troubleshooting bug 598631 we got a workaround into util-linux-ng that makes the symptoms of that bug go away.  We should make sure we revert that patch once the root cause is found, since the patch could have more far-reaching implications than are immediately apparent.

Relevant context below:

+++ This bug was initially created as a clone of Bug #598631 +++

Description of problem:

None of shutdown, reboot, halt and Ctrl-Alt-Delete actually achieve anything useful.

If I do any of them, I see a broadcast message:

[root@andromeda linux-2.6.32.i686]# shutdown now

Broadcast message from root@andromeda.procyon.org.uk
        (/dev/pts/0) at 19:24 ...

The system is going down for maintenance NOW!

The terminals on /dev/tty# and /dev/ttyS# are killed, but that's all.  Nothing else happens.  My ssh login is still there, and I can still do things.

Looking in ps, I see:

 1591 ?        Ss     0:00 /usr/sbin/atd
 1710 ?        Ss     0:00 /bin/sh -e /dev/fd/10
 1714 ?        S      0:00  \_ initctl emit splash-request IMMEDIATE=1 MODE=shutdown MESSAGE=Shutting down...
 1715 ?        Ss     0:00 /bin/sh -e /dev/fd/11
 1716 ?        S      0:00  \_ /sbin/plymouthd --mode=shutdown
 1717 ?        S      0:00      \_ /sbin/plymouthd --mode=shutdown


Version-Release number of selected component (if applicable):

upstart-0.6.5-5.el6.i686

How reproducible:

100%

Steps to Reproduce:
1. Do any of shutdown, reboot, halt or C-A-D.
  
Expected results:

The machine should shut down.

Additional info:

May need to be running an X session.


--- Additional comment from chrisw@redhat.com on 2010-06-18 19:02:35 EDT ---

I have the same problem with plymouth-0.8.3-5.el6.x86_64

It simply hangs here:

# ps -ef | grep plymouth
root     25120 25119  0 14:55 ?        00:00:00 /sbin/plymouthd --mode=shutdown
root     25121 25120  0 14:55 ?        00:00:00 /sbin/plymouthd --mode=shutdown
# strace -p 25121
Process 25121 attached - interrupt to quit
open("/dev/ttyS0", O_RDWR|O_APPEND

killing that child plymouthd allows reboot to complete.
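
The blocked open() in the strace output matches how a blocking open() on a terminal behaves when CLOCAL is clear: it waits for the modem carrier. A minimal sketch of the non-blocking alternative follows. The helper name `open_tty_nonblocking` is hypothetical and this is not plymouth's actual code:

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: open a tty without waiting for carrier detect.
 * O_NONBLOCK makes open() return immediately even when CLOCAL is clear;
 * O_NOCTTY keeps the tty from becoming our controlling terminal. */
int open_tty_nonblocking(const char *path)
{
    int fd = open(path, O_RDWR | O_NOCTTY | O_NONBLOCK);
    if (fd < 0)
        return -1;

    /* Switch back to blocking I/O so later read()/write() calls behave
     * as they would after a plain open(). */
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0 || fcntl(fd, F_SETFL, flags & ~O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would use `open_tty_nonblocking("/dev/ttyS0")` in place of a plain open(). This only addresses the open() step; later writes could still stall if flow control is holding the line.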


--- Additional comment from rstrode@redhat.com on 2010-07-09 18:30:42 EDT ---

Thanks Marian.

I've looked into this a bit today, but still need to investigate more.

It looks like the CLOCAL flag on the serial console tty is somehow getting cleared.

If I run

stty clocal --file=/dev/ttyS0

then shutdown works fine.

I've also added a

stty -a --file=/dev/ttyS0

call to /etc/rc.local and indeed clocal is properly set at that point (right before agetty is run). 

Adding -L to /etc/init/serial.conf also makes reboot work properly.  This means that CLOCAL is getting cleared some time after rc.local runs and some time before agetty calls termio_init().
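
For anyone who wants to check or set the flag from a program instead of stty, here is a rough sketch using the standard termios calls. The function names are made up for illustration:

```c
#include <termios.h>

/* Returns 1 if CLOCAL is set on fd, 0 if it is clear,
 * and -1 if the attributes cannot be read (e.g. fd is not a tty). */
int tty_has_clocal(int fd)
{
    struct termios tio;
    if (tcgetattr(fd, &tio) < 0)
        return -1;
    return (tio.c_cflag & CLOCAL) ? 1 : 0;
}

/* Roughly what `stty clocal` does: read the attributes,
 * set the one bit, and write them back immediately. */
int tty_set_clocal(int fd)
{
    struct termios tio;
    if (tcgetattr(fd, &tio) < 0)
        return -1;
    tio.c_cflag |= CLOCAL;
    return tcsetattr(fd, TCSANOW, &tio);
}
```

The descriptor passed in would need to come from an open() with O_NONBLOCK, otherwise the open itself could hang on an affected machine before either function gets a chance to run.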

It could be upstart, plymouth, the kernel, or something else clearing CLOCAL.

I've got to run now, but I'll investigate more on monday.

--- Additional comment from rstrode@redhat.com on 2010-07-09 23:42:30 EDT ---

So it looks like this is agetty itself dropping CLOCAL.

In termio_init, it does:

tp->c_cflag = CS8 | HUPCL | CREAD;

which wipes out all existing control flags from the tty (including CLOCAL).  The only reason it doesn't hang itself when trying to open the tty is that it uses O_NONBLOCK.  I think we need to do one of:

1) Fix the assignment above to:

 tp->c_cflag |= CS8 | HUPCL | CREAD;

2) Fix the assignment above to:

 tp->c_cflag = CS8 | HUPCL | CREAD | (tp->c_cflag & CLOCAL);

3) change initscripts to pass -L to agetty when the tty is initialized with CLOCAL by the kernel

4) change plymouth to open its tty with O_NONBLOCK

I'm not sure what the "right" answer is out of these 4 (or if some 5th possibility is the right answer), but since 2 of the 4 involve changing agetty, I'm moving this to util-linux-ng.
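
To make the difference between the current assignment and options 1) and 2) concrete, here is a small sketch that expresses each as a pure function of the existing c_cflag value. The function names are hypothetical; only the flag expressions come from the comment above:

```c
#include <termios.h>

/* Current termio_init() behaviour: discard every existing control flag. */
tcflag_t cflag_reset(tcflag_t current)
{
    (void)current;
    return CS8 | HUPCL | CREAD;
}

/* Option 1: keep every existing flag and add the defaults on top. */
tcflag_t cflag_or(tcflag_t current)
{
    return current | CS8 | HUPCL | CREAD;
}

/* Option 2: reset the flags, but carry CLOCAL over if it was already set. */
tcflag_t cflag_keep_clocal(tcflag_t current)
{
    return CS8 | HUPCL | CREAD | (current & CLOCAL);
}
```

Option 1 keeps everything the tty already had, including unrelated bits such as parity settings. Option 2 keeps only CLOCAL and never sets it when it was clear to begin with.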

--- Additional comment from rstrode@redhat.com on 2010-07-12 13:33:41 EDT ---

So, I've been reading up a bit more about serial protocol, and I think the CLOCAL "fix" may just be a workaround.  I'd appreciate if someone who knows more about this stuff would chime in, though.  Here's an explanation on how I think it works:

As I understand it, CLOCAL means "don't block waiting for carrier detect and dataset ready signals".  These are two different bits that deal with hardware flow control. I think they mean something like this:

- When carrier detect is asserted it means there is a connection between the remote serial console and the running machine.
- When dataset ready is asserted it means the serial console is reporting that it's ready to receive data from the running machine.

stty -a --file=/dev/ttyS0

on an affected machine outputs:

speed 115200 baud; rows 0; columns 0; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
werase = ^W; lnext = ^V; flush = ^O; min = 1; time = 0;
-parenb -parodd cs8 hupcl -cstopb cread -clocal -crtscts cdtrdsr
-ignbrk brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl -ixon -ixoff
-iuclc -ixany imaxbel -iutf8
opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
echoctl echoke

The "cdtrdsr" there suggests the tty is configured for hardware flow control.  Despite this, I'm guessing we're not getting a "dataset ready" signal from the remote serial console which is what's causing the open() call to block.  When CLOCAL is set, the kernel doesn't bother waiting for the "dataset ready" signal, so no blocking will happen.  Note, I don't believe the remote end will assert "dataset ready" until the kernel asserts "data terminal ready" (a different control line) in response to an open() call from the application.  So the problem could be the kernel isn't ever asserting "data terminal ready" and so the remote end isn't ever asserting "dataset ready" in response. Or it could be something else.
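
One way to test the guess about the control lines would be to read them directly. The sketch below assumes Linux, where the TIOCMGET ioctl reports the modem-control bits; the struct and function names are hypothetical:

```c
#include <sys/ioctl.h>

/* State of the three control lines discussed above (1 = asserted). */
struct modem_lines {
    int dtr;  /* data terminal ready, driven by this machine */
    int dsr;  /* dataset ready, driven by the remote end */
    int dcd;  /* carrier detect */
};

/* Decode the bitmask returned by TIOCMGET. */
struct modem_lines decode_modem_lines(int status)
{
    struct modem_lines m;
    m.dtr = (status & TIOCM_DTR) ? 1 : 0;
    m.dsr = (status & TIOCM_DSR) ? 1 : 0;
    m.dcd = (status & TIOCM_CAR) ? 1 : 0;
    return m;
}

/* Query the lines on an open serial descriptor.
 * Returns -1 if the descriptor does not support the ioctl. */
int read_modem_lines(int fd, int *status)
{
    return ioctl(fd, TIOCMGET, status);
}
```

If DTR reads as 0 after an open, that would point at the kernel side; if DTR is 1 but DSR stays 0, the remote end is not answering. The descriptor would again need to be opened with O_NONBLOCK to avoid the hang being investigated.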

Anyway, this stuff is a bit outside my knowledge domain, so I'd appreciate feedback from someone more knowledgeable here.

Also, I just noticed Karel committed a patch that does suggestion 2) from comment 25 in util-linux-ng-2.17.2-5.el6.  I'll test that and report its results.

--- Additional comment from rstrode@redhat.com on 2010-07-12 13:52:52 EDT ---

Just to follow up, I can confirm util-linux-ng-2.17.2-5.el6 makes the problem go away.

--- Additional comment from rstrode@redhat.com on 2010-07-12 14:23:29 EDT ---

Created an attachment (id=431237)
dmesg output after echo t > /proc/sysrq-trigger while cat /usr/share/pango*/HELLO.txt > /dev/ttyS0 is blocked

I've talked to aris about this issue, and he believes this is a kernel bug.  He's asked for the above output.

--- Additional comment from kzak@redhat.com on 2010-07-12 14:50:04 EDT ---

Note for QA & PM: it seems that the change in agetty is a workaround for kernel bug #613756. I'm going to keep the patch in util-linux-ng until we find a better solution.
Comment 1 Suzanne Yeghiayan 2010-07-13 23:32:21 EDT
This proposed exception is missing an ack and remains unapproved and unresolved.
Snapshot 8 is the last snapshot for approved exceptions and commits and brew builds are due by Jul 14 noon Eastern.
If you plan to fix this in Snapshot 8, please request your acks of the appropriate RHEL 6.0 leads.
If this issue is not approved and resolved by Jul 14, it will be denied for RHEL 6.0.
Comment 2 Ray Strode [halfline] 2010-07-13 23:47:07 EDT
We should drop the patch if the kernel bug gets fixed.  If the kernel bug doesn't get fixed, or gets demoted, we should keep this patch.  Moving to blocker? so this doesn't get autoclosed.
Comment 3 Ray Strode [halfline] 2010-07-16 11:28:19 EDT
I talked to aris about this today and it sounds like we will probably need to revert this patch:

<halfline> aris: the work around we put in util-linux-ng is to set CLOCAL on the tty
<halfline> i have a suspicion that unconditionally setting CLOCAL is wrong
<halfline> and will break certain setups, however
...
<halfline> aris: can you make a call on the validity of the CLOCAL work around?
<aris> halfline, I'm pretty sure it'll break serial console
<aris> but I don't know for sure until I look at it
<notting> solution: apply etherkiller to serial ports on affected machines
<halfline> by "serial console" you mean a normal serial console over a physical serial cable, versus these remote serial console things?
<aris> normal serial console
Comment 4 Karel Zak 2010-07-16 11:45:42 EDT
(In reply to comment #3)
> I talked to aris about this today and it sounds like we will probably need to
> revert this patch:
> 
> <halfline> aris: the work around we put in util-linux-ng is to set CLOCAL on
> the tty

 That's not true. We don't set the flag; we keep the flag if it has already been set by the kernel. So:

  tp->c_cflag = CS8 | HUPCL | CREAD | (tp->c_cflag & CLOCAL);

means that userspace follows the kernel's opinion about the tty.
Comment 5 Ray Strode [halfline] 2010-07-16 11:46:21 EDT
There may be another workaround which is acceptable:

[11:26:14] <aris> halfline, there's something that can be done: IIRC, the setting for DTR/DSR flow on the serial console is 'd'. unless it's set on /proc/cmdline, DTR/DSR flow control can be disabled by default                                                                                     
[11:26:20] <aris> yes, it's ugly                                                
[11:30:39] <halfline> aris: so you're saying the fix may be to disable dtr/dsr flow control in all cases unless the user has console=/dev/ttyS0,9600d or whatever?
...
[11:34:19] <aris> adding something on util-linux or whatever as workaround
[11:34:21] <halfline> oh you're suggesting we do this from user-space?
[11:34:23] <aris> is not pretty
[11:34:26] <aris> yes                                     
[11:34:29] <aris> instead of CLOCAL                                       
[11:34:38] <aris> basically if you use serial console with flow control
[11:34:45] <aris> and you need the hardware flow control
[11:34:48] <aris> CLOCAL will disable it
[11:35:25] <halfline> this dtr/dsr issue is going to cause other problems with serial devices too though yea?
[11:35:30] <halfline> not just serial console?
[11:35:36] <aris> not really                                                
[11:35:52] <aris> because anything you need to use, you'll need to setup the serial the way you want            
[11:36:03] <halfline> it won't break serial printers that use rts/cts?             
[11:36:05] <halfline> oh i see
[11:36:12] <aris> for some reason, the implementation of serial port on those sun boxes won't have DTR always on or something
[11:36:35] <aris> that's why it didn't cause problems on other machines
[11:36:43] <aris> that's my guess from what I've seen so far
[11:37:05] <halfline> if you enter the escape sequence in the console client it has an option for toggling flow control
[11:37:09] <halfline> but i think it's software flow control
[11:38:34] <halfline> aris: so if we get kzak to add the ugly hack to util-linux-ng then we can demote the kernel bug from blocker status?
[11:39:20] <aris> yes
[11:40:25] <halfline> okay
[11:42:53] <halfline> that might be the best solution given mchehab is on vacation and you're overbooked
Comment 6 Aristeu Rozanski 2010-07-16 11:54:04 EDT
ok, after comment #4 I'm convinced the workaround is good.
Comment 8 Karel Zak 2010-07-19 09:58:49 EDT
(In reply to comment #6)
> ok, after comment #4 I'm convinced the workaround is good.    

Should the workaround be applied to util-linux-ng upstream?

BTW, is it a workaround or a bugfix? In other words, is it correct that
we remove (reset) all tty flags in agetty?
Comment 9 Karel Zak 2010-08-23 08:08:07 EDT
Closing; I don't think we have to remove this workaround from the package. I have added the patch to the upstream code too.

Note that the upstream version also supports a new option, "-c":

  -c    Don’t reset terminal cflags (control modes). See termios(3) for more details.
