613768 – util-linux-ng has workaround patch for bug 598631 which we should consider reverting

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	util-linux-ng
Sub Component:
Version:	6.1
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	rc
Target Release:	---
Assignee:	Karel Zak
QA Contact:	qe-baseos-daemons
Docs Contact:
URL:
Whiteboard:
Depends On:	598631
Blocks:	601025 603908

Reported:	2010-07-12 19:16 UTC by Ray Strode [halfline]
Modified:	2010-08-23 12:08 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:	598631
Environment:
Last Closed:	2010-08-23 12:08:07 UTC
Target Upstream Version:
Embargoed:


Description Ray Strode [halfline] 2010-07-12 19:16:56 UTC

As part of troubleshooting bug 598631 we got a workaround into util-linux-ng that makes the symptoms of that bug go away.  We should make sure we revert that patch once the root cause is found, since the patch could potentially have more far reaching implications than immediately apparent.

Relevant context below:

+++ This bug was initially created as a clone of Bug #598631 +++

Description of problem:

None of shutdown, reboot, halt and Ctrl-Alt-Delete actually achieve anything useful.

If I do any of them, I see a broadcast message:

[root@andromeda linux-2.6.32.i686]# shutdown now

Broadcast message from root.org.uk
        (/dev/pts/0) at 19:24 ...

The system is going down for maintenance NOW!

The terminals on /dev/tty# and /dev/ttyS# are killed, but that's all.  Nothing else happens.  My ssh login is still there, and I can still do things.

Looking in ps, I see:

 1591 ?        Ss     0:00 /usr/sbin/atd
 1710 ?        Ss     0:00 /bin/sh -e /dev/fd/10
 1714 ?        S      0:00  \_ initctl emit splash-request IMMEDIATE=1 MODE=shutdown MESSAGE=Shutting down...
 1715 ?        Ss     0:00 /bin/sh -e /dev/fd/11
 1716 ?        S      0:00  \_ /sbin/plymouthd --mode=shutdown
 1717 ?        S      0:00      \_ /sbin/plymouthd --mode=shutdown


Version-Release number of selected component (if applicable):

upstart-0.6.5-5.el6.i686

How reproducible:

100%

Steps to Reproduce:
1. Do any of shutdown, reboot, halt or C-A-D.
  
Expected results:

The machine should shut down.

Additional info:

May need to be running an X session.


--- Additional comment from chrisw on 2010-06-18 19:02:35 EDT ---

I have the same problem with plymouth-0.8.3-5.el6.x86_64

It simply hangs here:

# ps -ef | grep plymouth
root     25120 25119  0 14:55 ?        00:00:00 /sbin/plymouthd --mode=shutdown
root     25121 25120  0 14:55 ?        00:00:00 /sbin/plymouthd --mode=shutdown
# strace -p 25121
Process 25121 attached - interrupt to quit
open("/dev/ttyS0", O_RDWR|O_APPEND

killing that child plymouthd allows reboot to complete.


--- Additional comment from rstrode on 2010-07-09 18:30:42 EDT ---

Thanks Marian.

I've looked into this a bit today, but still need to investigate more.

It looks like the CLOCAL flag on the serial console tty is somehow getting cleared.

If I run

stty clocal --file=/dev/ttyS0

then shutdown works fine.

I've also added a

stty -a --file=/dev/ttyS0

call to /etc/rc.local and indeed clocal is properly set at that point (right before agetty is run). 

Adding -L to /etc/init/serial.conf also makes reboot work properly.  This means that CLOCAL is getting cleared some time after rc.local runs and some time before agetty calls termio_init().

It could be upstart, plymouth, the kernel, or something else clearing CLOCAL.
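
For reference, a minimal C sketch of what the `stty clocal --file=/dev/ttyS0` and `stty -a` checks above amount to at the termios level. The function names `set_clocal` and `has_clocal` are hypothetical and do not come from any of the packages discussed here.

```c
#include <termios.h>

/* Set CLOCAL on an already-open tty, leaving every other setting alone.
 * Returns 0 on success, -1 on error (errno is set by the termios call). */
int set_clocal(int fd)
{
    struct termios tio;

    if (tcgetattr(fd, &tio) < 0)
        return -1;

    tio.c_cflag |= CLOCAL;              /* ignore modem control lines */
    return tcsetattr(fd, TCSANOW, &tio);
}

/* Report whether CLOCAL is currently set: 1 = set, 0 = clear, -1 = error. */
int has_clocal(int fd)
{
    struct termios tio;

    if (tcgetattr(fd, &tio) < 0)
        return -1;

    return (tio.c_cflag & CLOCAL) ? 1 : 0;
}
```

A caller would need to obtain `fd` with `open(..., O_NONBLOCK)` first, since a blocking open() of the device is exactly what hangs while CLOCAL is clear.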

I've got to run now, but I'll investigate more on Monday.

--- Additional comment from rstrode on 2010-07-09 23:42:30 EDT ---

So it looks like this is agetty itself dropping CLOCAL.

In termio_init, it does:

tp->c_cflag = CS8 | HUPCL | CREAD;

which wipes out all existing control flags from the tty (including CLOCAL).  The only reason it doesn't hang itself when trying to open the tty is that it uses O_NONBLOCK.  I think we need to do one of:

1) Fix the assignment above to:

 tp->c_cflag |= CS8 | HUPCL | CREAD;

2) Fix the assignment above to:

 tp->c_cflag = CS8 | HUPCL | CREAD | (tp->c_cflag & CLOCAL);

3) Change initscripts to pass -L to agetty when the tty is initialized with CLOCAL by the kernel

4) Change plymouth to open its tty with O_NONBLOCK

I'm not sure what the "right" answer is out of these 4 (or if some 5th possibility is the right answer), but since 2 of the 4 involve changing agetty, I'm moving this to util-linux-ng.
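
To make the difference between the current assignment and options 1) and 2) concrete, here is a small sketch that expresses each as a pure function on the old c_cflag value. The function names are hypothetical; only the three expressions themselves come from the comment above.

```c
#include <termios.h>

/* Current agetty behaviour quoted above: every existing control flag
 * (including CLOCAL) is discarded. */
tcflag_t cflag_reset(tcflag_t old)
{
    (void)old;
    return CS8 | HUPCL | CREAD;
}

/* Option 1: OR the defaults in, so every flag that was already set
 * survives, not just CLOCAL. */
tcflag_t cflag_or_in(tcflag_t old)
{
    return old | CS8 | HUPCL | CREAD;
}

/* Option 2: reset to the defaults, but carry CLOCAL over if (and only
 * if) it was already set. */
tcflag_t cflag_keep_clocal(tcflag_t old)
{
    return CS8 | HUPCL | CREAD | (old & CLOCAL);
}
```

Note that option 1 also preserves unrelated bits such as PARENB, while option 2 preserves nothing except CLOCAL and never sets CLOCAL on its own.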

--- Additional comment from rstrode on 2010-07-12 13:33:41 EDT ---

So, I've been reading up a bit more about serial protocol, and I think the CLOCAL "fix" may just be a workaround.  I'd appreciate if someone who knows more about this stuff would chime in, though.  Here's an explanation on how I think it works:

As I understand it, CLOCAL means "don't block waiting for carrier detect and dataset ready signals".  These are two different bits that deal with hardware flow control. I think they mean something like this:

- When carrier detect is asserted it means there is a connection between the remote serial console and the running machine.
- When dataset ready is asserted it means the serial console is reporting that it's ready to receive data from the running machine.

stty -a --file=/dev/ttyS0

on an affected machine outputs:

speed 115200 baud; rows 0; columns 0; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
werase = ^W; lnext = ^V; flush = ^O; min = 1; time = 0;
-parenb -parodd cs8 hupcl -cstopb cread -clocal -crtscts cdtrdsr
-ignbrk brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl -ixon -ixoff
-iuclc -ixany imaxbel -iutf8
opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
echoctl echoke

The "cdtrdsr" there suggests the tty is configured for hardware flow control.  Despite this, I'm guessing we're not getting a "dataset ready" signal from the remote serial console which is what's causing the open() call to block.  When CLOCAL is set, the kernel doesn't bother waiting for the "dataset ready" signal, so no blocking will happen.  Note, I don't believe the remote end will assert "dataset ready" until the kernel asserts "data terminal ready" (a different control line) in response to an open() call from the application.  So the problem could be the kernel isn't ever asserting "data terminal ready" and so the remote end isn't ever asserting "dataset ready" in response. Or it could be something else.
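
One way to check the guess about the control lines would be to open the port without blocking and read the modem-line state directly. The following is only a sketch, assuming Linux's TIOCMGET ioctl; the helper names `open_tty_nowait` and `read_modem_lines` are hypothetical.

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <termios.h>
#include <unistd.h>

/* Open a tty without waiting on the modem control lines. O_NONBLOCK makes
 * open() return immediately even when CLOCAL is clear; O_NOCTTY keeps the
 * device from becoming the controlling terminal.
 * Returns a file descriptor, or -1 on error. */
int open_tty_nowait(const char *path)
{
    return open(path, O_RDWR | O_NOCTTY | O_NONBLOCK);
}

/* Read the modem-control line state via TIOCMGET (Linux).
 * On success stores 1 (asserted) or 0 in *dtr, *dsr and *dcd and
 * returns 0; returns -1 on error and leaves the outputs untouched. */
int read_modem_lines(int fd, int *dtr, int *dsr, int *dcd)
{
    int status = 0;

    if (ioctl(fd, TIOCMGET, &status) < 0)
        return -1;

    *dtr = (status & TIOCM_DTR) != 0;   /* data terminal ready (our side) */
    *dsr = (status & TIOCM_DSR) != 0;   /* dataset ready (remote side)    */
    *dcd = (status & TIOCM_CAR) != 0;   /* carrier detect                 */
    return 0;
}
```

If DTR reads as asserted while DSR and carrier detect stay low, that would be consistent with the remote end never answering; if DTR itself is low, the problem would be on the local side.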

Anyway, this stuff is a bit outside my knowledge domain, so I'd appreciate feedback from someone more knowledgeable here.

Also, I just noticed Karel committed a patch that does suggestion 2) from comment 25 in util-linux-ng-2.17.2-5.el6.  I'll test that and report its results.

--- Additional comment from rstrode on 2010-07-12 13:52:52 EDT ---

Just to follow up, I can confirm util-linux-ng-2.17.2-5.el6 makes the problem go away.

--- Additional comment from rstrode on 2010-07-12 14:23:29 EDT ---

Created an attachment (id=431237)
dmesg output after echo t > /proc/sysrq-trigger while cat /usr/share/pango*/HELLO.txt > /dev/ttyS0 is blocked

I've talked to aris about this issue, and he believes this is a kernel bug.  He's asked for the above output.

--- Additional comment from kzak on 2010-07-12 14:50:04 EDT ---

Note for QA & PM: it seems that the change in agetty is a workaround for kernel bug #613756. I'm going to keep the patch in util-linux-ng until we find a better solution.

Comment 1 Suzanne Logcher 2010-07-14 03:32:21 UTC

This proposed exception is missing an ack and remains unapproved and unresolved.
Snapshot 8 is the last snapshot for approved exceptions and commits and brew builds are due by Jul 14 noon Eastern.
If you plan to fix this in Snapshot 8, please request your acks of the appropriate RHEL 6.0 leads.
If this issue is not approved and resolved by Jul 14, it will be denied for RHEL 6.0.

Comment 2 Ray Strode [halfline] 2010-07-14 03:47:07 UTC

We should drop the patch if the kernel bug gets fixed.  If the kernel bug doesn't get fixed, or gets demoted, we should keep this patch.  Moving to blocker? so this doesn't get autoclosed.

Comment 3 Ray Strode [halfline] 2010-07-16 15:28:19 UTC

I talked to aris about this today and it sounds like we will probably need to revert this patch:

<halfline> aris: the work around we put in util-linux-ng is to set CLOCAL on the tty
<halfline> i have a suspicion that unconditionally setting CLOCAL is wrong
<halfline> and will break certain setups, however
...
<halfline> aris: can you make a call on the validity of the CLOCAL work around?
<aris> halfline, I'm pretty sure it'll break serial console
<aris> but I don't know for sure until I look at it
<notting> solution: apply etherkiller to serial ports on affected machines
<halfline> by "serial console" you mean a normal serial console over a physical serial cable, versus these remote serial console things?
<aris> normal serial console

Comment 4 Karel Zak 2010-07-16 15:45:42 UTC

(In reply to comment #3)
> I talked to aris about this today and it sounds like we will probably need to
> revert this patch:
> 
> <halfline> aris: the work around we put in util-linux-ng is to set CLOCAL on
> the tty

That's not true. We don't set the flag -- we keep the flag if it is already set by the kernel. So:

  tp->c_cflag = CS8 | HUPCL | CREAD | (tp->c_cflag & CLOCAL);

This means that userspace follows the kernel's opinion about the tty.

Comment 5 Ray Strode [halfline] 2010-07-16 15:46:21 UTC

There may be another workaround which is acceptable:

[11:26:14] <aris> halfline, there's something that can be done: IIRC, the setting for DTR/DSR flow on the serial console is 'd'. unless it's set on /proc/cmdline, DTR/DSR flow control can be disabled by default
[11:26:20] <aris> yes, it's ugly
[11:30:39] <halfline> aris: so you're saying the fix may be to disable dtr/dsr flow control in all cases unless the user has console=/dev/ttyS0,9600d or whatever?
...
[11:34:19] <aris> adding something on util-linux or whatever as workaround
[11:34:21] <halfline> oh you're suggesting we do this from user-space?
[11:34:23] <aris> is not pretty
[11:34:26] <aris> yes                                     
[11:34:29] <aris> instead of CLOCAL                                       
[11:34:38] <aris> basically if you use serial console with flow control
[11:34:45] <aris> and you need the hardware flow control
[11:34:48] <aris> CLOCAL will disable it
[11:35:25] <halfline> this dtr/dsr issue is going to cause other problems with serial devices too though yea?
[11:35:30] <halfline> not just serial console?
[11:35:36] <aris> not really                                                
[11:35:52] <aris> because anything you need to use, you'll need to setup the serial the way you want            
[11:36:03] <halfline> it won't break serial printers that use rts/cts?             
[11:36:05] <halfline> oh i see
[11:36:12] <aris> for some reason, the implementation of serial port on those sun boxes won't have DTR always on or something
[11:36:35] <aris> that's why it didn't cause problems on other machines
[11:36:43] <aris> that's my guess from what I've seen so far
[11:37:05] <halfline> if you enter the escape sequence in the console client it has an option for toggling flow control
[11:37:09] <halfline> but i think it's software flow control
[11:38:34] <halfline> aris: so if we get kzak to add the ugly hack to util-linux-ng then we can demote the kernel bug from blocker status?
[11:39:20] <aris> yes
[11:40:25] <halfline> okay
[11:42:53] <halfline> that might be the best solution given mchehab is on vacation and you're overbooked

Comment 6 Aristeu Rozanski 2010-07-16 15:54:04 UTC

ok, after comment #4 I'm convinced the workaround is good.

Comment 8 Karel Zak 2010-07-19 13:58:49 UTC

(In reply to comment #6)
> ok, after comment #4 I'm convinced the workaround is good.    

Should the workaround be applied to util-linux-ng upstream?

BTW, is it a workaround or a bugfix? -- in other words, is it correct that
we remove (reset) all tty flags in agetty?

Comment 9 Karel Zak 2010-08-23 12:08:07 UTC

Closing. I don't think that we have to remove this workaround from the package. I have added the patch to the upstream code too.

Note that the upstream version also supports a new option "-c":

  -c    Don’t reset terminal cflags (control modes). See termios(3) for more details.
