458368 – [5.0] kdump hangs up by Sysrq+C trigger

Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Version:	5.0
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Neil Horman
QA Contact:	Martin Jenner
Duplicates (2):	455638 459147

Reported:	2008-08-07 19:44 UTC by Issue Tracker
Modified:	2018-10-20 03:14 UTC
CC List:	7 users
Doc Type:	Bug Fix
Last Closed:	2009-01-20 20:02:38 UTC

Attachments
console log of failure (4.63 KB, text/plain) 2008-08-07 19:46 UTC, Guy Streeter

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:0225	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.3 kernel security and bug fix update	2009-01-20 16:06:24 UTC

Description Issue Tracker 2008-08-07 19:44:48 UTC

Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2008-08-07 19:44:49 UTC

Customer Contact Name:
Yasuhiro Watanabe

Description of Problem:

kdump hangs when triggered by SysRq+C, about once in 10 attempts.
After this occurs, kdump cannot collect a vmcore with the NMI button either.

Here is the log.
---
RHEL5 login: SysRq : Trigger a crashdump
BUG: warning at arch/i386/kernel/crash.c:144/nmi_shootdown_cpus() (Not tainted)
---

This problem occurs on RHEL5.0, but not on RHEL5.1 or RHEL5.2.
Since I could not reproduce the problem on either RHEL5.1 or RHEL5.2,
I checked whether the source code that seemed related (arch/i386/kernel/crash.c) had changed between those versions, but found nothing.

Q: Could you tell me whether this problem was fixed in RHEL5.1 or RHEL5.2?
If it has not been fixed yet, could you tell me, if you know,
why it cannot be reproduced on either RHEL5.1 or RHEL5.2?

Drivers or hardware or architecture dependency:
I confirmed the problem on x86, x86_64.

How reproducible:
Sometimes.

Steps to Reproduce:
1. Set up kdump.
2. Execute the sysctl command to enable the sysrq parameter as follows:
# sysctl -w kernel.sysrq=1
3. Trigger SysRq+C.

Actual Results:
kdump hangs up.

Expected Results:
kdump does not hang and works properly.

Summary of actions taken to resolve issue:
None.

Location of diagnostic data:
None.

Hardware configuration:
Model: Fujitsu PRIMERGY TX 150 S4
CPU Info: Intel(R) Pentium(R) 4 CPU 3.40GHz
Memory Info: Memory: 1GB

Business Impact:
When kdump hangs, we cannot investigate a customer's problem, so we may lose the confidence of customers. If this problem also occurs in RHEL5.1 or RHEL5.2,
it will affect the business of Fujitsu's PRIMERGY.
Therefore, I would like to know whether the problem has been fixed or not.

Additional Info:
I attached sosreport.
This event sent from IssueTracker by streeter [SEG - Kernel]
issue 185973

Comment 2 Issue Tracker 2008-08-07 19:44:51 UTC

Their x86_64 machine doesn't have AMD Opteron processors or HyperTransport,
so the following fix is not related.
http://post-office.corp.redhat.com/archives/rhkernel-list/2007-December/msg01004.html


lspci from RHEL5x86-64.20080618104214.tar.bz2:

----------------------------------------------------
00:00.0 Host bridge: Intel Corporation E7230/3000/3010 Memory Controller Hub (rev 81)
00:01.0 PCI bridge: Intel Corporation E7230/3000/3010 PCI Express Root Port (rev 81)
00:1c.0 PCI bridge: Intel Corporation 82801G (ICH7 Family) PCI Express Port 1 (rev 01)
00:1c.4 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 5 (rev 01)
00:1c.5 PCI bridge: Intel Corporation 82801GR/GH/GHM (ICH7 Family) PCI Express Port 6 (rev 01)
00:1d.0 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #1 (rev 01)
00:1d.1 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #2 (rev 01)
00:1d.2 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #3 (rev 01)
00:1d.3 USB Controller: Intel Corporation 82801G (ICH7 Family) USB UHCI #4 (rev 01)
00:1d.7 USB Controller: Intel Corporation 82801G (ICH7 Family) USB2 EHCI Controller (rev 01)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev e1)
00:1f.0 ISA bridge: Intel Corporation 82801GB/GR (ICH7 Family) LPC Interface Bridge (rev 01)
00:1f.1 IDE interface: Intel Corporation 82801G (ICH7 Family) IDE Controller (rev 01)
00:1f.3 SMBus: Intel Corporation 82801G (ICH7 Family) SMBus Controller (rev 01)
01:00.0 PCI bridge: Intel Corporation 6702PXH PCI Express-to-PCI Bridge A (rev 09)
02:05.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev c1)
0a:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 21)
0e:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)




Comment 3 Issue Tracker 2008-08-07 19:44:53 UTC

Internal Status set to: 'Waiting on SEG'
State the problem

## Provide clear and concise problem description as it is understood at the time of escalation

After a system crash is forced with sysrq+c, the process stops with the
following message. As a result, the vmcore was not collected.

>  RHEL5 login: SysRq : Trigger a crashdump
>  BUG: warning at arch/i386/kernel/crash.c:144/nmi_shootdown_cpus() (Not tainted)

Please check the console logs in the file "cosole.logs.tar.bz2".

This problem doesn't occur on EL5.1 or EL5.2, so this is just a question:
what was fixed after EL5.0?

## State specific action requested of SEG

I found similar problems, but I cannot find a bugzilla that describes
exactly the same problem. Most of them concern x86_64, but Fujitsu
saw this problem on x86 as well.

sosreports for x86 and x86_64 are attached.


Issue escalated to Support Engineering Group by: mmatsuya.



Comment 4 Issue Tracker 2008-08-07 19:44:54 UTC

Hi, Matsuya-san

> Action: I checked the log file. It seems that vmcore was created
> correctly and the system hang didn't occur. Right? May I understand that
> your problem was gone?

No.
The problem is not gone.

Let me explain the log (RHEL5.0log.txt).
I ran the test three times. The attached log includes the results of all
three tests.

The system hung in the first and second tests.
In the third test, the system did not hang and vmcore was created
correctly, as you pointed out.
It seems the problem occurs intermittently.

--- 1st ---
[Mon Jul 07 15:28:18 2008] SysRq : Trigger a crashdump
[Mon Jul 07 15:28:19 2008] BUG: warning at arch/i386/kernel/crash.c:144/nmi_shootdown_cpus() (Not tainted)

--- 2nd ---
[Mon Jul 07 15:30:44 2008] build.fujitsu.com login: SysRq : Trigger a crashdump
[Mon Jul 07 15:31:16 2008] BUG: warning at arch/i386/kernel/crash.c:144/nmi_shootdown_cpus() (Not tainted)

--- 3rd ---
[Mon Jul 07 15:34:12 2008] The dumpfile is saved to /mnt///127.0.0.1-2008-07-07-15:40:57/vmcore-incomplete.
[Mon Jul 07 15:34:12 2008] makedumpfile Completed.
[Mon Jul 07 15:34:14 2008] Restarting system

Best regards,
Yasuhiro Watanabe.

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech


Comment 5 Issue Tracker 2008-08-07 19:44:57 UTC

Hi, Matsuya-San

I attached a sosreport (RHEL5.0x86).

> OK, warning messages were output and an incomplete vmcore was created.
> But you didn't see the system hang that you first reported, as below.
> Right?
>> kdump hangs up by Sysrq+C trigger once in about 10 times.

No. 

Let me explain about cosole.logs.tar.bz2 I sent earlier.

There are two logs included in cosole.logs.tar.bz2.
Please see the following two lines:

<RHEL5-x86.log>
RHEL5 login: SysRq : Trigger a crashdump
BUG: warning at arch/i386/kernel/crash.c:144/nmi_shootdown_cpus() (Not tainted)

<RHEL-x86-64.log>
SysRq : Trigger a crashdump
BUG: warning at arch/x86_64/kernel/crash.c:147/nmi_shootdown_cpus() (Not tainted)

When I typed SysRq+C to run kdump, the system output the above messages
and hung, and no vmcore was created. This problem occurs about once in 10
attempts.

I'm sorry my previous explanation was not clear. I should have written
"system hangs up by..." rather than "kdump hangs up by...",
since the system seems to hang before the 2nd kernel starts.

Best regards,
Yasuhiro Watanabe.



Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech


Comment 6 Issue Tracker 2008-08-07 19:45:02 UTC

Hi Fabio,

It seems that this is not a misconfiguration. The problem occurs even
with the normal kdump settings.

Fujitsu saw this issue on both x86 and x86_64 boxes.

During kdump triggered by sysrq-c, the system stops and never responds to
anything. This occurs intermittently; Fujitsu reported that it can occur
about once per 10 attempts. The hang always occurs just after the following
messages are output.

  RHEL5 login: SysRq : Trigger a crashdump
  BUG: warning at arch/i386/kernel/crash.c:144/nmi_shootdown_cpus() (Not tainted)

"RHEL 5.0log.txt" includes the console log when they tested.
sosreport "build.fujitsu.com.20080715103223.tar.bz2" was attached.


Internal Status set to 'Waiting on SEG'


Comment 7 Guy Streeter 2008-08-07 19:46:42 UTC

Created attachment 313742 [details]
console log of failure

Comment 8 Neil Horman 2008-08-07 19:59:31 UTC

The warning they are seeing concerns the use of a delay function within the context of an interrupt handler.  The delay is used when halting other processors in an SMP system when their response to a halt is not immediate.  It's harmless, and given that the system is compromised and about to boot into a kdump kernel to capture a vmcore, there's nothing to fix there.  Unless the vmcore wasn't properly captured, this is NOTABUG.
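
As a rough illustration of the mechanism described in this comment, here is a userspace C model, not kernel code. It assumes a millisecond delay helper that raises a warning whenever it is called in interrupt context, and a wait loop that only calls that helper while other CPUs have not yet acknowledged the halt. All names (`sim_in_irq`, `sim_mdelay`, `sim_shootdown_wait`) and the one-CPU-per-millisecond behaviour are hypothetical stand-ins chosen for the sketch.

```c
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel APIs. */
static bool sim_in_irq;        /* pretend "running in interrupt context" flag */
static int  sim_warnings;      /* number of warnings raised by the delay helper */
static int  sim_pending_cpus;  /* CPUs that have not yet acknowledged the halt */

/* Delay helper that complains when used from interrupt context. */
static void sim_mdelay(unsigned int msecs)
{
    (void)msecs;               /* no real delay in this model */
    if (sim_in_irq)
        sim_warnings++;        /* stands in for the "BUG: warning at ..." line */
}

/*
 * Wait up to max_msecs for the other CPUs to stop.
 * Model assumption: exactly one pending CPU acknowledges per millisecond.
 * Returns the number of milliseconds spent waiting.
 */
static unsigned int sim_shootdown_wait(unsigned int max_msecs)
{
    unsigned int waited = 0;

    while (sim_pending_cpus > 0 && waited < max_msecs) {
        sim_mdelay(1);
        sim_pending_cpus--;
        waited++;
    }
    return waited;
}
```

In this model, with `sim_in_irq` set, a call with no pending CPUs returns 0 and raises no warning, while a call with three pending CPUs waits three ticks and raises three warnings.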

Comment 10 Neil Horman 2008-08-08 19:16:41 UTC

OK, so let's talk about that then: what exactly were they unable to capture?  Did the system hang so that they got nothing?  Or did they get an incomplete vmcore, or just a corrupted vmcore?

Comment 11 Neil Horman 2008-08-15 17:23:33 UTC

*** Bug 459147 has been marked as a duplicate of this bug. ***

Comment 12 Guy Streeter 2008-08-15 17:53:12 UTC

Neil,
 The problem here is that the system doesn't get as far as running the kdump kernel. The boot kernel hangs during crash.

In bug 459147, I proposed two solutions.

Can the Hitachi confidential group be added to the bz, or can this bz be opened up?

Comment 13 Neil Horman 2008-08-15 19:41:05 UTC

I know that the linux-2.6-debug-sleep-in-irq.patch is missing from Fedora.  I spoke with davej about it and the only reason it's missing is that there wasn't much interest in taking the patch upstream, so it was dropped from Fedora as well.  I think (despite the fact that it's not upstream) it's the right thing to do.

So I think what we really need to do here is:

1) Get the sleep-in-irq patch upstream (I'll be doing that shortly)
2) Understand the root cause of the problem.  As Guy mentioned, he believes that printk is having a hard time handling recursion during a shutdown operation.  I'm not 100% convinced of that, but it may well be a possibility in some cases.  I'll look further into that with him.  Perhaps the best thing to do is simply add a condition to the WARN_ON call to not print if we are going to kexec to a new kernel (since we know the mdelay is safe in that case).
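
The idea at the end of item 2, suppressing the warning when the system is already on its way to kexec, could be sketched as a small predicate. This is a userspace illustration only: `sim_should_warn`, `sim_delay_warnings`, and the `crash_in_progress` flag are hypothetical names, and how a real kernel would detect the crash path is not shown in this report.

```c
#include <stdbool.h>

/*
 * Hypothetical predicate: warn about a delay in interrupt context,
 * unless a crash/kexec is already under way (the delay is assumed
 * to be safe in that case, as the comment above argues).
 */
static bool sim_should_warn(bool in_irq, bool crash_in_progress)
{
    return in_irq && !crash_in_progress;
}

/* Count how many of n delay calls would warn under the given conditions. */
static int sim_delay_warnings(int n, bool in_irq, bool crash_in_progress)
{
    int warnings = 0;

    for (int i = 0; i < n; i++) {
        if (sim_should_warn(in_irq, crash_in_progress))
            warnings++;
    }
    return warnings;
}
```

The design point is that the check itself stays in place for normal operation; only the crash path is exempted.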

Comment 14 Neil Horman 2008-08-15 19:48:11 UTC

Guy, regarding your suggestion to apply commit 32a76006683f7b28ae3cc491da37716e002f198e to RHEL5, can the customer confirm that a RHEL5 kernel with this patch solves the problem, or have they not tested that yet?

Comment 15 Guy Streeter 2008-08-15 20:04:26 UTC

Neil,
 As far as I know, nobody has tried the upstream patch.

I think the nmi_shootdown_cpus() case is a "we know we're calling mdelay() in interrupt context" case. Did you note my suggestion of changing its mdelay(1) to udelay(1000) to bypass the warning? If there is not time to get the printk simultaneous-use (not really "recursion") fix in, making the change in nmi_shootdown_cpus could help with the immediate problem.
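
To make the mdelay(1) versus udelay(1000) suggestion concrete, here is a small userspace model. It assumes, as this comment implies, that only the millisecond helper carries the interrupt-context check, so the microsecond helper covers the same time without triggering it. The struct and function names (`sim_delay_stats`, `sim_mdelay`, `sim_udelay`) are hypothetical and do not reproduce the kernel's implementations.

```c
#include <stdbool.h>

/* Bookkeeping for the model: time "spent" and warnings raised. */
struct sim_delay_stats {
    unsigned long total_usecs;
    int warnings;
};

/* Microsecond delay: no interrupt-context check in this model. */
static void sim_udelay(struct sim_delay_stats *s, unsigned long usecs)
{
    s->total_usecs += usecs;
}

/* Millisecond delay: warns when called in interrupt context. */
static void sim_mdelay(struct sim_delay_stats *s, unsigned long msecs,
                       bool in_irq)
{
    if (in_irq)
        s->warnings++;
    s->total_usecs += msecs * 1000UL;
}
```

Calling `sim_mdelay(&a, 1, true)` and `sim_udelay(&b, 1000)` accounts for the same 1000 microseconds, but only the first records a warning.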

Comment 17 Neil Horman 2008-08-18 01:04:31 UTC

I did see your suggestion regarding the switch to udelay, but I'm not interested in doing that.  Yeah, it makes the problem go away, but there is no reason we shouldn't be able to issue a WARN_ON in this context, so we're not really fixing the problem.  The printk patch does look interesting however.  If you would please, build a kernel with that patch included and have the customer test it (or let me know and I'll be happy to throw it together).  Once we get that tested we can move forward with that patch (assuming it fixes the problem).  Thanks!

Comment 19 Guy Streeter 2008-08-18 14:54:56 UTC

I don't see anything confidential in this BZ. Can we open it up so others with the same problem can see this discussion?

Comment 20 Neil Horman 2008-08-18 17:02:37 UTC

Thank you for looking into the backport, Guy.  I don't see anything confidential, and yes, I would prefer to have this bug public, as I have a duplicate that I would like to close against it.  If you can ask the reporter whether it's OK, let me know and I'll open it up.

Comment 24 Neil Horman 2008-08-21 11:05:25 UTC

Ok, thank you. I'll backport and post this patch as soon as the build system is back up.

Comment 32 Don Zickus 2008-09-10 20:14:41 UTC

in kernel-2.6.18-110.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 37 Don Zickus 2008-12-05 16:58:10 UTC

*** Bug 455638 has been marked as a duplicate of this bug. ***

Comment 45 errata-xmlrpc 2009-01-20 20:02:38 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html
