Bug 227808 - [RHEL-4] Reboot failure on AMD64 when no keyboard attached
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.4
Hardware: x86_64 Linux
Priority: medium
Severity: high
Assigned To: Aristeu Rozanski
QA Contact: Brian Brock
Reported: 2007-02-08 06:43 EST by Leonard den Ottolander (Locomo)
Modified: 2012-06-20 12:03 EDT (History)
Doc Type: Bug Fix
Last Closed: 2012-06-20 12:03:00 EDT

Attachments
messages during reboot (no snips in between) (2.74 KB, text/plain)
2007-02-08 06:43 EST, Leonard den Ottolander (Locomo)
dmesg first machine (12.92 KB, text/plain)
2007-02-08 06:45 EST, Leonard den Ottolander (Locomo)
hwconf first machine (6.43 KB, text/plain)
2007-02-08 06:46 EST, Leonard den Ottolander (Locomo)
lspci first machine (1.60 KB, text/plain)
2007-02-08 06:46 EST, Leonard den Ottolander (Locomo)
dmesg second machine (13.50 KB, text/plain)
2007-02-08 06:49 EST, Leonard den Ottolander (Locomo)
hwconf second machine (6.32 KB, text/plain)
2007-02-08 06:49 EST, Leonard den Ottolander (Locomo)
lspci second machine (1.60 KB, text/plain)
2007-02-08 06:50 EST, Leonard den Ottolander (Locomo)
reboot without keyboard patch for x86_64 (651 bytes, patch)
2007-03-12 14:08 EDT, Hiroto Shibuya

Description Leonard den Ottolander (Locomo) 2007-02-08 06:43:05 EST
Problem has existed for at least 15 months (November 2005), but was worked
around by equipping the server with a keyboard emulator that caused the reboot to
occur normally. However, recently the keyboard emulator was removed from the
server, causing the problem to resurface.

When attempting a reboot a normal shutdown occurs, but the kernel is unable to
reset itself. Problem only occurs when no keyboard is attached. (Can't provide
you with console output as the machine is 800km away and the issue does not
occur when a keyboard or (remote) KVM is attached.)

Problem observed on two Athlon 64 machines, somewhat different motherboards, but
both have a VT8237 chipset.

According to our provider (Hetzner AG.) this is a known issue for at least
Debian kernels up to 2.6.13:

(translated from German) "As I already told you by phone, the server was not
"switched off". Your CentOS kernel may have the same bug that we have observed
with Debian kernels 2.6.x up to 2.6.13, where the kernel cannot reset the
server after all services have been stopped successfully."

Their suggestion to attach a keyboard emulator indeed fixed the issue, so I
suppose their analysis is correct. They also provided me with these references:

http://forums.gentoo.org/viewtopic-t-354847.html
http://groups.google.com/group/linux.kernel/browse_thread/thread/d2ccf3ce893561ea/126caff9fb31ceac
http://forum.hetzner.de/wbb2/thread.php?threadid=5237&sid=&hilight=&hilightuser=1742
(subscription required)

I'll attach dmesg-es, lspci and /etc/sysconfig/hwconf for both machines, and
messages for the machine that failed to reboot today.
Comment 1 Leonard den Ottolander (Locomo) 2007-02-08 06:43:05 EST
Created attachment 147643
messages during reboot (no snips in between)
Comment 2 Leonard den Ottolander (Locomo) 2007-02-08 06:45:35 EST
Created attachment 147644
dmesg first machine

dmesg for the machine that failed to reboot today after the keyboard emulator was
removed. The reboot failure has also been observed on the second machine, but not
in the 15 months since a keyboard emulator was attached (and, thankfully, never
removed again).
Comment 3 Leonard den Ottolander (Locomo) 2007-02-08 06:46:07 EST
Created attachment 147645
hwconf first machine
Comment 4 Leonard den Ottolander (Locomo) 2007-02-08 06:46:46 EST
Created attachment 147646
lspci first machine
Comment 5 Leonard den Ottolander (Locomo) 2007-02-08 06:49:11 EST
Created attachment 147647
dmesg second machine
Comment 6 Leonard den Ottolander (Locomo) 2007-02-08 06:49:45 EST
Created attachment 147649
hwconf second machine
Comment 7 Leonard den Ottolander (Locomo) 2007-02-08 06:50:18 EST
Created attachment 147650
lspci second machine
Comment 8 Aristeu Rozanski 2007-03-09 12:00:31 EST
Leonard, can you please try the kernel on
http://people.redhat.com/arozansk/reboot/ and see if it solves your problem?
Thanks,
Comment 9 Hiroto Shibuya 2007-03-11 09:16:09 EDT
I wonder if the patch I posted on the following bug is applicable here.

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=231768
Comment 10 Hiroto Shibuya 2007-03-11 20:17:13 EDT
BTW, I tested
http://people.redhat.com/arozansk/reboot/kernel-2.6.9-48.EL.227808.i686.rpm
on the EPIA board I reported in Bug 231768 and it did not reboot without 
keyboard.

Comment 11 Leonard den Ottolander (Locomo) 2007-03-12 05:01:07 EDT
Aristeu,

This problem occurs on a production system that is located 800km from here.
Although I am usually quite willing to test these kinds of things I am not in
this particular case, as you can imagine. It would mean I have to hire hands
from my ISP and schedule downtime for this production server.

I hope you could test it on one of your test machines with similar
specifications, see if you can reproduce the hang with an old kernel, and see if
the new kernel fixes it.
Comment 12 Leonard den Ottolander (Locomo) 2007-03-12 05:11:36 EDT
I suppose the issue Hiroto is having might well be the same as mine, as this is
the same chipset (VT8237), and according to my hosting provider this issue is
solved in kernel-2.6.13, which is consistent with what he reports.

However, the patch provided in attachment 149791 is for i386, and mine are x86_64
systems. (Maybe the subject should be changed to "Reboot failure on VT8237 when
no keyboard attached".)
Comment 13 Hiroto Shibuya 2007-03-12 14:08:33 EDT
Created attachment 149849
reboot without keyboard patch for x86_64

I found the exact same code in a different place for x86.
This is an untested patch for x86_64 that applies the same diff
as I did for i386, although in a different place.
Comment 14 Leonard den Ottolander (Locomo) 2007-06-14 05:19:55 EDT
Any progress on this issue? As indicated earlier I am unable/unwilling to test
this on the live boxes I am experiencing the issue on, but that should not stop
you from finding an appropriate system/tester. The problem should be easily
reproducible.
Comment 15 Aristeu Rozanski 2007-07-23 11:56:21 EDT
Leonard, I still haven't found any system around with this chipset. As for the
patches posted here, none of them are upstream, so there must be a different fix
for this issue in newer kernels.
So, I'm afraid there hasn't been much progress on this issue yet.
Comment 16 Leonard den Ottolander 2008-02-13 04:52:20 EST
Sadly the issue still exists in 2.6.9-67.0.4.EL.

Although this is a long shot, perhaps it is possible for you to contact my
provider Hetzner (http://hetzner.de) and see if they are willing to help out by
temporarily providing you a system with this chip set.

Otherwise, someone in one of the other Red Hat offices might have such a system available.
Comment 17 Leonard den Ottolander 2010-03-20 10:06:15 EDT
This thing keeps biting me in the rear. Had the network card burn out a little while ago so a LARA (remote java app KVM) was attached. Of course the technician forgot to put the keyboard emulator back after removing the LARA.

Just rebooted after the last kernel upgrade and, of course, the machine didn't come up. So, as an FYI, the issue still exists in kernel-2.6.9-89.0.20.EL and presumably in kernel-2.6.9-89.0.23.EL.

I suppose I'll just have to live with this because I don't see this issue being fixed before EOL of either RHEL4 or my box :s .
Comment 18 Leonard den Ottolander 2010-03-20 12:02:47 EDT
The proposed patch only fixed the issue in mach_reboot.c which might be the issue with bug 231768 but maybe not with this bug. The issue with i8042_command() negating a register value and writing this invalid value back in i8042_controller_init() just before the reboot is not addressed by the previous patch.
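
To illustrate the suspected mechanism, here is a minimal user-space sketch (not kernel code). The function names and the sample register values are hypothetical and only serve the illustration; the 0x20 status bit corresponds to I8042_STR_AUXDATA in the driver. It assumes, as described above, that a byte read back from the controller is inverted whenever the status register flags it as AUX data.

```c
#include <stdint.h>

/* Status register bit marking a byte as AUX data (I8042_STR_AUXDATA). */
#define STR_AUXDATA 0x20

/* Old behaviour: invert the byte when the status flags it as AUX data.
 * If the byte was really a control register value, the inverted value
 * is what would later be written back to the controller. */
static uint8_t read_param_old(uint8_t status, uint8_t data)
{
    if (status & STR_AUXDATA)
        return (uint8_t)~data;
    return data;
}

/* Behaviour without the negation: the byte is returned unmodified,
 * whatever the status register says about its source. */
static uint8_t read_param_plain(uint8_t status, uint8_t data)
{
    (void)status; /* data source no longer affects the result */
    return data;
}
```

With a made-up control register value of 0x47 and a status of 0x21 (output buffer full plus AUXDATA), read_param_old() yields 0xb8 while read_param_plain() yields 0x47, so only the old path would hand a different value back to the caller.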

Regarding the i8042, in http://groups.google.com/group/linux.kernel/browse_thread/thread/d2ccf3ce893561ea/0a28e581d1bd271d comment 26, Vojtech Pavlik concludes:

"I suppose we can get rid of the checking of data source and negation and
be done with it."

This is what actually has happened in the main tree. Patching the relevant section vs. the current tree (where i8042_command was renamed to __i8042_command) leads to:

--- i8042.c.000	2004-10-18 23:53:12.000000000 +0200
+++ i8042.c	2010-03-20 16:23:54.000000000 +0100
@@ -198,10 +198,12 @@ static int i8042_command(unsigned char *
 	if (!retval)
 		for (i = 0; i < ((command >> 8) & 0xf); i++) {
 			if ((retval = i8042_wait_read())) break;
-			if (i8042_read_status() & I8042_STR_AUXDATA)
-				param[i] = ~i8042_read_data();
-			else
-				param[i] = i8042_read_data();
+			if (command == I8042_CMD_AUX_LOOP &&
+			    !(i8042_read_status() & I8042_STR_AUXDATA)) {
+				dbg("     -- i8042 (auxerr)");
+			}
+
+			param[i] = i8042_read_data();
 			dbg("%02x <- i8042 (return)", param[i]);
 		}
 

As you can see, an extra test for command == I8042_CMD_AUX_LOOP was added as well, but the main change is that param[i] is no longer negated, regardless of whether I8042_STR_AUXDATA is set. The crude version of this patch (without the extra test in HEAD) would read:

--- i8042.c.000	2004-10-18 23:53:12.000000000 +0200
+++ i8042.c	2010-03-20 16:30:02.000000000 +0100
@@ -198,10 +198,7 @@ static int i8042_command(unsigned char *
 	if (!retval)
 		for (i = 0; i < ((command >> 8) & 0xf); i++) {
 			if ((retval = i8042_wait_read())) break;
-			if (i8042_read_status() & I8042_STR_AUXDATA)
-				param[i] = ~i8042_read_data();
-			else
-				param[i] = i8042_read_data();
+			param[i] = i8042_read_data();
 			dbg("%02x <- i8042 (return)", param[i]);
 		}
 

Haven't tested this (not willing to take my production server down for this) but from what I've read I'm quite sure this is the solution for my problem. But like I said, probably beating a dead horse here, just gotta make sure those technicians don't remove that keyboard (emulator).
Comment 19 Jiri Pallich 2012-06-20 12:03:00 EDT
Thank you for submitting this issue for consideration in Red Hat Enterprise Linux. The release for which you requested us to review is now End of Life. 
Please see https://access.redhat.com/support/policy/updates/errata/

If you would like Red Hat to re-consider your feature request for an active release, please re-open the request via appropriate support channels and provide additional supporting details about the importance of this issue.
