Bug 598631
| Summary: | stty incorrectly reports cdtrdsr | |||
|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | David Howells <dhowells> | |
| Component: | coreutils | Assignee: | Ondrej Vasik <ovasik> | |
| Status: | CLOSED ERRATA | QA Contact: | qe-baseos-daemons | |
| Severity: | high | Docs Contact: | ||
| Priority: | high | |||
| Version: | 6.0 | CC: | akong, alitke, arozansk, azelinka, chrisw, jasowang, jiabwang, kzak, llim, mchehab, mcsontos, mhusnain, ndai, notting, ovasik, phan, rstrode, rvokal, xtian | |
| Target Milestone: | rc | |||
| Target Release: | --- | |||
| Hardware: | All | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | coreutils-8.4-10.el6 | Doc Type: | Bug Fix | |
| Doc Text: |
Previously, hardware flow control (DTR/DSR) was implemented via the TC{SG}ETX ioctls. This was changed to the TC{SG}ET ioctls, which caused the CDTRDSR support in stty to fail. This was fixed so that stty correctly handles CDTRDSR flow control.
|
Story Points: | --- | |
| Clone Of: | ||||
| : | 613756 613768 (view as bug list) | Environment: | ||
| Last Closed: | 2011-05-19 13:50:37 UTC | Type: | --- | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | 613756 | |||
| Bug Blocks: | 601025, 603908, 613768 | |||
| Attachments: | ||||
I'm not able to reproduce it in KVM with upstart-0.6.5-5.el6.x86_64, initscripts-9.03.8-2.el6.x86_64, plymouth-0.8.3-3.el6.x86_64. What versions of initscripts and plymouth do you have?

> [root@andromeda linux-2.6.32.i686]# shutdown now

Just note that shutdown without -[rhHP] only changes the runlevel to 1; see "shutdown --help". "/sbin/plymouthd --mode=shutdown" is not run when the runlevel changes to 1. Is it captured after "shutdown now"?

> 1710 ? Ss 0:00 /bin/sh -e /dev/fd/10
> 1714 ? S 0:00 \_ initctl emit splash-request IMMEDIATE=1 MODE=shutdown MESSAGE=Shutting down...
> 1715 ? Ss 0:00 /bin/sh -e /dev/fd/11
> 1716 ? S 0:00 \_ /sbin/plymouthd --mode=shutdown
> 1717 ? S 0:00 \_ /sbin/plymouthd --mode=shutdown

It seems that "/sbin/plymouthd --mode=shutdown" hangs for some reason, so the upstart jobs splash-manager and plymouth-shutdown freeze and the process doesn't continue.

upstart-0.6.5-5.el6.i686
initscripts-9.03.8-2.el6.i686
plymouth-0.8.3-3.el6.i686
> Is it captured after "shutdown now"?
Captured after halt.
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux major release. This request is not yet committed for inclusion.

Are you able to attach gdb to it and see what it's doing?

Hi,
We also hit the same problem on some AMD machines. I'll provide the hardware info later. Adjusting bug 'Priority' to high.

Created attachment 423572 [details]
hw info
cpuinfo
meminfo
lspci
dmidecode
DellT605
(In reply to comment #5)
> are you able to gdb attach to it and see what it's doing?
Attach gdb to what? init? I don't think that's possible.

I meant plymouthd. Can you reboot with plymouth:debug on the kernel command line and get a shot of the plymouth debug spew at shutdown?

This issue is pretty much reproducible on beaker systems. If you need a system for debugging, ping me and I will arrange one broken machine for you. Or you can start by cloning the beaker job from Bug 603908.

Created attachment 424195 [details]
plymouth:debug output from my x86_64 installation
Here's the plymouth:debug log from my x86_64 RHEL-6 installation on my test box. Plymouth didn't print anything more once the:
Red Hat Enterprise Linux release 6.0 Alpha (Santiago)
Kernel 2.6.32-33.el6.x86_64 on an x86_64
lines had appeared, not even when I typed reboot.
I have the same problem with plymouth-0.8.3-5.el6.x86_64
It simply hangs here:
# ps -ef | grep plymouth
root 25120 25119 0 14:55 ? 00:00:00 /sbin/plymouthd --mode=shutdown
root 25121 25120 0 14:55 ? 00:00:00 /sbin/plymouthd --mode=shutdown
# strace -p 25121
Process 25121 attached - interrupt to quit
open("/dev/ttyS0", O_RDWR|O_APPEND
killing that child plymouthd allows reboot to complete.
Chris - do you have a serial console?

So the fact that there are two plymouthd processes tells us a lot. This means the hang is happening before plymouthd has finished daemonizing.
That means it's happening here:
static bool
redirect_standard_io_to_device (const char *device)
{
  int fd;
  char *file;

  ply_trace ("redirecting stdio to %s", device);

  if (strncmp (device, "/dev/", strlen ("/dev/")) == 0)
    file = strdup (device);
  else
    asprintf (&file, "/dev/%s", device);

  /* this is the open() call that the strace above shows blocking */
  fd = open (file, O_RDWR | O_APPEND);
  free (file);
  /* ... remainder of the function not quoted here ... */
}
Presumably, everyone affected has console=ttyS0 or similar on their kernel command lines. Now the question is, why would trying to open the terminal device hang?
Not sure, off hand. There have been a few serial console "hanging" kernel bugs in RHEL6.
See bug 590851, bug 568418, and bug 579003 for instance.
I believe they've all been fixed by the mentioned 2.6.32-33.el6.x86_64 kernel though (I think those issues were sorted out by 2.6.32-31.el6). And they don't exactly fit the symptoms anyway.
One thing that's clearly wrong with the current setup is that plymouth shutdown is started before agetty is stopped. We probably don't want them using the same tty at the same time.
We should probably change /etc/init/serial.conf from
stop on runlevel [016]
to
stop on starting rc RUNLEVEL=[!5]
and change /etc/init/plymouth-shutdown.conf from
start on (splash-request IMMEDIATE=1) or (splash-request and stopped prefdm)
to
start on (splash-request IMMEDIATE=1) or (splash-request and stopped prefdm and stopped serial)
It's not clear why making those changes would fix a hang on open() though.
That "stop on starting rc RUNLEVEL=[!5]" mentioned above was a cut-and-paste error; it should be
stop on starting rc RUNLEVEL=[016]
Thinking about it more, we'll need to make sure we don't wait for the "stopped serial" event if serial was never started. That means things are a little more complicated. We'll need to flip the sense of IMMEDIATE and split it into two variables (AFTER_PREFDM and AFTER_SERIAL), so the line will be the totally gruesome:
start on (splash-request AFTER_PREFDM=0 AFTER_SERIAL=0) or (splash-request AFTER_PREFDM=1 AFTER_SERIAL=0 and stopped prefdm) or (splash-request AFTER_PREFDM=0 AFTER_SERIAL=1 and stopped serial) or (splash-request AFTER_PREFDM=1 AFTER_SERIAL=1 and stopped prefdm and stopped serial)
and change /etc/init/splash-manager.conf to call initctl status serial and set AFTER_PREFDM and AFTER_SERIAL appropriately. This is getting pretty unwieldy, so we may want to rethink how we do this.

FWIW, I reserved a beaker system today and provisioned snapshot 6 on it, then upgraded the kernel and plymouth and tried unsuccessfully to reproduce. I'll look into the "clone" feature mentioned above soon and try to reproduce that way.

I've tried this in a local KVM guest with a serial console, and have been unable to reproduce it there.

I cloned the recipe mentioned in comment 10, but the remote serial console would fail with:
Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload
Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload
Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload
Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload
Error: Unable to establish IPMI v2 / RMCP+ session
Error: No response activating SOL payload
so I wasn't able to test. Upon sshing in I got messages like this in dmesg:
kernel: K8 ECC error.
kernel: Northbridge Error, node 2
So I think a machine with failing RAM was provisioned. I've initiated a new clone that's installing now and it seems to have a functioning serial console.

The second clone installed fine, but I can't reproduce the issue. If I type reboot, it reboots. Marian, can I take you up on your offer in comment 10 to get me a reproduction environment?

Hi Ray, it looks like a hardware-related issue.
I scheduled a few reproducers for the version which is known to fail and for the latest STABLE release. These are still Queued as the machines required are in use.
I will notify you of the results.
They are J:667{4,5,6} in beaker.
IIUC it is related to stdin/stdout redirection.
I have seen a few programs get stuck when stdin was/wasn't a tty/pipe/...
Is the issue known to appear on rhts systems as well?
Given this is a hang in open(), I'm going to move this to kernel so it brings more people into the loop. I'm still going to continue to help investigate, though.

Thanks Marian. I've looked into this a bit today, but still need to investigate more. It looks like the CLOCAL flag on the serial console tty is somehow getting cleared. If I run
stty clocal --file=/dev/ttyS0
then shutdown works fine. I've also added a
stty -a --file=/dev/ttyS0
call to /etc/rc.local and indeed clocal is properly set at that point (right before agetty is run). Adding -L to /etc/init/serial.conf also makes reboot work properly. This means that CLOCAL is getting cleared some time after rc.local runs and some time before agetty calls termio_init(). It could be upstart, plymouth, the kernel, or something else clearing CLOCAL. I've got to run now, but I'll investigate more on Monday.

So it looks like this is agetty itself dropping CLOCAL. In termio_init, it does:
tp->c_cflag = CS8 | HUPCL | CREAD;
which wipes out all existing control flags from the tty (including CLOCAL). The only reason it doesn't hang itself when trying to open the tty is that it uses O_NONBLOCK. I think we need to do one of:
1) Fix the assignment above to: tp->c_cflag |= CS8 | HUPCL | CREAD;
2) Fix the assignment above to: tp->c_cflag = CS8 | HUPCL | CREAD | (tp->c_cflag & CLOCAL);
3) Change initscripts to pass -L to agetty when the tty is initialized with CLOCAL by the kernel
4) Change plymouth to open its tty with O_NONBLOCK
I'm not sure what the "right" answer is out of these 4 (or if some 5th possibility is the right answer), but since 2 of the 4 involve changing agetty, I'm moving this to util-linux-ng.

One thing I've just noticed is that while agetty opens with O_NONBLOCK, it eventually does
fcntl(0, F_SETFL, fcntl(0, F_GETFL, 0) & ~O_NONBLOCK);
and also
if (-L not specified) {
fcntl(1, F_SETFL, fcntl(1, F_GETFL, 0) & ~O_NONBLOCK);
}
so after setting up the terminal attributes it resets to blocking i/o (unless -L is specified, then it keeps writes non-blocking)
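The open-then-unblock pattern described above for agetty, which is also what suggestion 4 would mean for plymouth, could look roughly like the following C sketch. The function names are hypothetical and this is not the agetty or plymouth source. It only illustrates opening a device with O_NONBLOCK so that open() returns without waiting on the modem lines, then clearing O_NONBLOCK for later I/O. O_NOCTTY is an addition of this sketch, not something the thread mentions.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Clear O_NONBLOCK on fd and leave the other status flags alone.
 * Returns 0 on success, -1 on error. */
int set_blocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags & ~O_NONBLOCK);
}

/* Hypothetical helper: open a device without letting open() block,
 * then switch the descriptor back to blocking I/O.
 * Returns the descriptor, or -1 on error. */
int open_without_carrier_wait(const char *path)
{
    int fd;

    assert(path != NULL);

    /* O_NOCTTY (an assumption of this sketch) keeps the tty from
     * becoming the controlling terminal of the caller. */
    fd = open(path, O_RDWR | O_NOCTTY | O_NONBLOCK);
    if (fd == -1)
        return -1;

    if (set_blocking(fd) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A caller would pass the console device, for example open_without_carrier_wait("/dev/ttyS0"). Note that later blocking writes could still stall if the tty has flow control enabled and the remote end never signals ready, which is presumably why agetty keeps writes non-blocking when -L is given.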
So, I've been reading up a bit more about the serial protocol, and I think the CLOCAL "fix" may just be a workaround. I'd appreciate it if someone who knows more about this stuff would chime in, though. Here's an explanation of how I think it works:

As I understand it, CLOCAL means "don't block waiting for carrier detect and dataset ready signals". These are two different bits that deal with hardware flow control. I think they mean something like this:
- When carrier detect is asserted it means there is a connection between the remote serial console and the running machine.
- When dataset ready is asserted it means the serial console is reporting that it's ready to receive data from the running machine.

stty -a --file=/dev/ttyS0 on an affected machine outputs:
speed 115200 baud; rows 0; columns 0; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>; eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W; lnext = ^V; flush = ^O; min = 1; time = 0;
-parenb -parodd cs8 hupcl -cstopb cread -clocal -crtscts cdtrdsr
-ignbrk brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl -ixon -ixoff -iuclc -ixany imaxbel -iutf8
opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt echoctl echoke

The "cdtrdsr" there suggests the tty is configured for hardware flow control. Despite this, I'm guessing we're not getting a "dataset ready" signal from the remote serial console, which is what's causing the open() call to block. When CLOCAL is set, the kernel doesn't bother waiting for the "dataset ready" signal, so no blocking will happen. Note, I don't believe the remote end will assert "dataset ready" until the kernel asserts "data terminal ready" (a different control line) in response to an open() call from the application.
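One way to check the guess about which control lines are actually asserted would be to read them with the TIOCMGET ioctl. The following C sketch assumes Linux; the struct and function names are hypothetical, and it only reports the DTR, DSR and carrier-detect bits rather than saying anything about why open() blocks.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Hypothetical container for the three lines discussed in this thread. */
struct modem_lines {
    int dtr; /* data terminal ready, driven by the local side */
    int dsr; /* dataset ready, driven by the remote side */
    int dcd; /* carrier detect, driven by the remote side */
};

/* Decode the bit mask returned by TIOCMGET. */
struct modem_lines decode_modem_lines(int bits)
{
    struct modem_lines m;

    m.dtr = (bits & TIOCM_DTR) != 0;
    m.dsr = (bits & TIOCM_DSR) != 0;
    m.dcd = (bits & TIOCM_CAR) != 0;
    return m;
}

/* Read the current modem-control lines of an already open tty.
 * Returns 0 on success, -1 if the ioctl fails. */
int read_modem_lines(int fd, struct modem_lines *out)
{
    int bits = 0;

    assert(out != NULL);

    if (ioctl(fd, TIOCMGET, &bits) == -1)
        return -1;
    *out = decode_modem_lines(bits);
    return 0;
}
```

The descriptor would have to come from an open() that does not itself block, for example one using O_NONBLOCK.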
So the problem could be that the kernel isn't ever asserting "data terminal ready" and so the remote end isn't ever asserting "dataset ready" in response. Or it could be something else. Anyway, this stuff is a bit outside my knowledge domain, so I'd appreciate feedback from someone more knowledgeable here.

Also, I just noticed Karel committed a patch that does suggestion 2) from comment 25 in util-linux-ng-2.17.2-5.el6. I'll test that and report its results.

Just to follow up, I can confirm util-linux-ng-2.17.2-5.el6 makes the problem go away.

Something is wrongly setting DTR/DSR flow control. If the serial cable doesn't have DTR/DSR pins, it'd block the transmission. But if used as a serial console, it shouldn't block forever. I'd recommend having this bug used to track what is enabling DTR/DSR by mistake and have another BZ to track the kernel side. Please add me and mchehab to the Cc list of the new bug.

Created attachment 431237 [details]
dmesg output after echo t > /proc/sysrq-trigger while cat /usr/share/pango*/HELLO.txt > /dev/ttyS0 is blocked
I've talked to aris about this issue, and he believes this is a kernel bug. He's asked for the above output.
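For reference, suggestion 2 from the agetty analysis (reset the control flags but carry CLOCAL over) can be sketched in C as follows. This is an illustration under stated assumptions, not the actual util-linux-ng patch: the function names are hypothetical, and saving and restoring the line speed is an addition of this sketch, since on Linux the baud rate is also stored in c_cflag.

```c
#include <assert.h>
#include <termios.h>

/* Build the new control flags as in the quoted termio_init()
 * assignment, but keep whatever CLOCAL setting was already there. */
tcflag_t reset_cflag_keep_clocal(tcflag_t old_cflag)
{
    return CS8 | HUPCL | CREAD | (old_cflag & CLOCAL);
}

/* Hypothetical helper: apply the reset to a live tty.
 * Returns 0 on success, -1 on error. */
int reinit_tty_keep_clocal(int fd)
{
    struct termios tio;
    tcflag_t old_cflag;
    speed_t ispeed, ospeed;

    if (tcgetattr(fd, &tio) == -1)
        return -1;

    old_cflag = tio.c_cflag;
    ispeed = cfgetispeed(&tio);
    ospeed = cfgetospeed(&tio);

    tio.c_cflag = reset_cflag_keep_clocal(old_cflag);

    /* Assumption of this sketch: put the speed back, because
     * overwriting c_cflag would otherwise drop it on Linux. */
    cfsetispeed(&tio, ispeed);
    cfsetospeed(&tio, ospeed);

    /* CLOCAL must come out exactly as it went in. */
    assert((tio.c_cflag & CLOCAL) == (old_cflag & CLOCAL));

    return tcsetattr(fd, TCSANOW, &tio);
}
```

Compared with suggestion 1 (using |= on the existing flags), this still clears every other control flag, so only CLOCAL survives the reinitialization.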
Ah, I missed comment 29. I'm going to clone this report and move the clone to kernel. I think it may be the kernel initially setting DTR/DSR though, so this may end up a kernel bug as well.

As requested in comment 29, I've filed bug 613756.

(In reply to comment #29)
> I'd recommend having this bug used to track what is enabling DTR/DSR by mistake
> and have another BZ to track the kernel side.
Turns out it is the kernel enabling DTR/DSR on this machine. I added an stty -a --file=/dev/ttyS0 call at the top of /init in the initrd and it's set from the start of boot (the stty output is interleaved with boot messages on the console):
Write protecting the kernel read-only data: 7280k
mknod: `/dev/ttyS0': File exists
speed 115200 baud; rows 0; columns 0; line = 0; intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = dracut: dracut-004-20.1.el6 <undef>; eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R; werase = ^W; lnext = ^dracut: rd_NO_LUKS: removing cryptoluks activation V; flush = ^O; min = 1; time = 0; -parenb -parodd cs8 hupcl -cstopb cread clocal -crtscts cdtrdsr -ignbrk -brkint -ignpar -parmrk -inpck -istrdevice-mapper: uevent: version 1.0.3 ip -inlcr -igncrdevice-mapper: ioctl: 4.17.0-ioctl (2010-03-05) initialised: dm-devel icrnl ixon -ixoff -iuclc -ixany -imaxbel -iutf8 opost -olcuc -ocrnl onlcr -onocr -onlret -ofiudev: starting version 147

Moving back to kernel and back to ASSIGNED.

The update to util-linux-ng-2.17.2-5.el6 fixed the problem for me.

util-linux-ng-2.17.2-6.fc13 has been pushed to the Fedora 13 stable repository. If problems still persist, please make note of it in this bug report.
Technical note added. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
New Contents:
Previously, the shut down, reboot, halt and Ctrl-Alt-Delete commands only killed terminals on /dev/tty# and /dev/ttyS# while the user's SSH login and ability to carry out operations remained unaffected. The hardware control flow, DTRDSR, was previously implemented via TC{SG}ETX and was subsequently changed to TC{SG}ET ioctl, which caused the CDTRDSR support in stty to fail. This was fixed to allow stty to correctly handle CDTRDSR control flow and shut down, reboot, halt and Ctrl-Alt-Delete operations execute as expected.
Technical note updated. If any revisions are required, please edit the "Technical Notes" field
accordingly. All revisions will be proofread by the Engineering Content Services team.
Diffed Contents:
@@ -1 +1 @@
-Previously, the shut down, reboot, halt and Ctrl-Alt-Delete commands only killed terminals on /dev/tty# and /dev/ttyS# while the user's SSH login and ability to carry out operations remained unaffected. The hardware control flow, DTRDSR, was previously implemented via TC{SG}ETX and was subsequently changed to TC{SG}ET ioctl, which caused the CDTRDSR support in stty to fail. This was fixed to allow stty to correctly handle CDTRDSR control flow and shut down, reboot, halt and Ctrl-Alt-Delete operations execute as expected.
+Previously, the hardware control flow, DTRDSR, was implemented via TC{SG}ETX. This was changed to TC{SG}ET ioctl, which caused the CDTRDSR support in stty to fail. This was fixed to allow stty to correctly handle CDTRDSR control flow.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHBA-2011-0646.html
Description of problem:
None of shutdown, reboot, halt and Ctrl-Alt-Delete actually achieve anything useful. If I do any of them, I see a broadcast message:

[root@andromeda linux-2.6.32.i686]# shutdown now
Broadcast message from root.org.uk (/dev/pts/0) at 19:24 ...
The system is going down for maintenance NOW!

The terminals on /dev/tty# and /dev/ttyS# are killed, but that's all. Nothing else happens. My ssh login is still there, and I can still do things. Looking in ps, I see:

1591 ? Ss 0:00 /usr/sbin/atd
1710 ? Ss 0:00 /bin/sh -e /dev/fd/10
1714 ? S 0:00 \_ initctl emit splash-request IMMEDIATE=1 MODE=shutdown MESSAGE=Shutting down...
1715 ? Ss 0:00 /bin/sh -e /dev/fd/11
1716 ? S 0:00 \_ /sbin/plymouthd --mode=shutdown
1717 ? S 0:00 \_ /sbin/plymouthd --mode=shutdown

Version-Release number of selected component (if applicable):
upstart-0.6.5-5.el6.i686

How reproducible:
100%

Steps to Reproduce:
1. Do any of shutdown, reboot, halt or C-A-D.

Expected results:
The machine should shut down.

Additional info:
May need to be running an X session.