248666 – Serious problems during the diskdump, can cause the machine to hang and not reboot.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 4
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	4.5
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Takao Indoh
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	351901 391511 430698 457499 461297

Reported:	2007-07-18 02:05 UTC by Norm Murray
Modified:	2018-10-20 00:48 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-05-18 19:25:27 UTC
Target Upstream Version:
Embargoed:

Attachments
Reboot if the diskdump fails horribly. (4.18 KB, patch) 2007-07-18 03:33 UTC, Wade Mealing
NEC-supplied patch (1.58 KB, patch) 2008-07-08 17:28 UTC, Guy Streeter
NEC-supplied patch (3.18 KB, text/x-patch) 2008-07-08 17:29 UTC, Guy Streeter
NEC-supplied patch (1.56 KB, patch) 2008-07-08 17:29 UTC, Guy Streeter
halt_on_err patch (3.27 KB, patch) 2008-12-11 20:46 UTC, Takao Indoh

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:1024	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update	2009-05-18 14:57:26 UTC

Description Wade Mealing 2007-07-18 02:05:28 UTC

Description of problem:

When a system is running diskdump and encounters a serious error (broken disk),
diskdump neither falls through to netdump nor reboots the system.

Version-Release number of selected component (if applicable):

kernel 2.6.9-55.16.EL

How reproducible:

Every time.

Steps to Reproduce:
1. service diskdump initialformat; service diskdump start
2. dd if=/dev/zero of=/dev/sdb1 bs=1024 count=10  # I know this is bad; I'm simulating a hardware failure
3. echo 5 > /proc/sys/kernel/panic
4. echo c > /proc/sysrq-trigger


Actual results:

The system panics but does not fall through; it simply hangs.

Expected results:

The system panics, then reboots.

Additional info:

Attached is a modified patch from NEC that allows the diskdump module to take a
reboot option if the dump fails.

Comment 1 Wade Mealing 2007-07-18 03:33:01 UTC

Created attachment 159493 [details]
Reboot if the diskdump fails horribly.

Comment 2 Red Hat Bugzilla 2007-10-19 04:06:39 UTC

User ntachino's account has been closed

Comment 3 RHEL Program Management 2008-01-16 03:27:45 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 Takao Indoh 2008-03-11 19:07:24 UTC

Hello Norm,

> When a system is running diskdump and encounters a serious error (broken disk),
> diskdump neither falls through to netdump nor reboots the system.

I tested diskdump and confirmed that diskdump fell through to netdump.
The log is as follows.

CPU frozen: #1
CPU#0 is executing diskdump.
start dumping to sda1
check dump partition...
1/262042    52408 ETA |          <3>disk_dump: bad signature in block 0
<3>disk_dump: check partition failed.
<3>disk_dump: No more dump device found
<6>disk_dump: diskdump failed, fall back to trying netdump
CPU#0 is executing netdump.
< netdump activated - performing handshake with the server. >
NETDUMP START!
< handshake completed - listening for dump requests. >
(snipped)

I think netdump was not enabled when your customers found this problem. Could
you confirm that?

Thanks,
Takao Indoh

Comment 13 Issue Tracker 2008-07-08 17:26:03 UTC

Dear SEG,

NEC is asking us to consider the three patches for this issue, since they
cannot accept a system that fails to reboot:
---
The system will reboot successfully if netdump completes successfully
after diskdump failed.  However, if netdump also fails, the kernel will
return to the original caller and will result in the same reboot failure
as when diskdump failed.

It is very unlikely that every customer would set up a netdump server
just for this purpose, so I believe netdump could not be used as a
workaround. 

Please consider our original patches.
---

Posted here on 07-13-2007:
- linux-2.6.9-panic-freeze-on-dump-err-fix.patch
- linux-2.6.9-diskdump-reboot-on-err.patch
- diskdumputils-reboot-on-err.patch

Could you please discuss the fixes with engineering?

Best Regards,
M Oshiro

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by streeter 
 issue 97210

Comment 14 Guy Streeter 2008-07-08 17:28:56 UTC

Created attachment 311298 [details]
NEC-supplied patch

Comment 15 Guy Streeter 2008-07-08 17:29:21 UTC

Created attachment 311299 [details]
NEC-supplied patch

Comment 16 Guy Streeter 2008-07-08 17:29:51 UTC

Created attachment 311300 [details]
NEC-supplied patch

Comment 18 RHEL Program Management 2008-09-03 12:54:35 UTC

Updating PM score.

Comment 19 Takao Indoh 2008-11-14 20:53:24 UTC

Hello,

Could you tell me how "reboot_on_err" works?
I read the following explanation, but I think we can do the same thing by using fallback_on_err and /proc/sys/kernel/panic.

+reboot_on_err:	Specify whether the system restarts after diskdump failure or
+		not.  This feature is only for RHEL4.  The default value is 1,
+		which means that the system restarts after diskdump detects
+		some errors to abort.  If the value is 0, the system halts
+		after diskdump aborts.

Thanks,
Takao Indoh

Comment 20 Masaki Tachibana 2008-12-04 08:56:12 UTC

Hi,

Below is my understanding of fallback_on_err.

If all the following conditions are met, the system is rebooted.
- Only the reboot_on_err patch is not applied.
- kernel.panic in /etc/sysctl.conf is set (for example, 10 seconds).
- fallback_on_err is 1.
- Both diskdump and netdump fail (or diskdump fails and netdump is not enabled).

As a result, many users who don't use a serial console cannot see information
about the cause of the panic.

--------

reboot_on_err enables users to control the behavior, so they can see
the information. If reboot_on_err is 1 (default), the system will behave
exactly the same as before. If reboot_on_err is 0, the system halts
while showing the back trace of the panic on the screen.

--------

The typical behavior when reboot_on_err is 1 (default):
kernel.panic  fallback_on_err  diskdump  netdump  : result : panic information
------------------------------------------------------------------------------
     10              1         success      -     : reboot
     10              1         failure   failure  : reboot : goes off.
      0              1         success      -     : halt
      0              1         failure   failure  : halt   : displayed.


The typical behavior if reboot_on_err is 0:
kernel.panic  fallback_on_err  reboot_on_err  diskdump  netdump  : result : panic information
---------------------------------------------------------------------------------------------
     10              1               0        success      -     : reboot
     10              1               0        failure   failure  : halt   : displayed.
      0              1               0        success      -     : halt
      0              1               0        failure   failure  : halt   : displayed.
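
The two tables in comment 20 can be read as a small decision function. The
sketch below is only an illustrative model of those rows, not code from any of
the attached patches: the names result_after_dump, RESULT_HALT and
RESULT_REBOOT are hypothetical, fallback_on_err is assumed to be 1 as in every
row, and "all_dumps_failed" stands for the rows where both diskdump and netdump
fail.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the tables in comment 20; not kernel code. */
enum dump_result { RESULT_HALT, RESULT_REBOOT };

/*
 * panic_timeout    : value of kernel.panic (0 means "do not reboot on panic")
 * reboot_on_err    : the proposed parameter (true is the default)
 * all_dumps_failed : true when diskdump and netdump both failed
 * fallback_on_err is assumed to be 1, as in every row of the tables.
 */
enum dump_result result_after_dump(int panic_timeout,
                                   bool reboot_on_err,
                                   bool all_dumps_failed)
{
    assert(panic_timeout >= 0);

    if (panic_timeout == 0)
        return RESULT_HALT;    /* no panic timeout: the system always halts */
    if (all_dumps_failed && !reboot_on_err)
        return RESULT_HALT;    /* stay halted so the back trace remains visible */
    return RESULT_REBOOT;      /* timeout set, and either the dump succeeded
                                  or reboot_on_err is left at its default */
}
```

Read this way, reboot_on_err = 0 changes only one row: kernel.panic is set and
every dump attempt has failed.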

Comment 21 Takao Indoh 2008-12-11 20:44:09 UTC

Hi Tachibana-san,

Thanks for your explanation. I understand the purpose of this patch.
I made another patch based on yours. This patch changes only the diskdump components (drivers/block/diskdump.c, kernel/dump.c), so I think it will be more acceptable to other Red Hat engineers. Could you check the attached patch and let me know whether it works for you?

BTW, in this patch I changed the name of the new parameter from "reboot_on_err" to "halt_on_err" because I feel "halt_on_err" is more intuitive for this purpose. But if you prefer "reboot_on_err", please let me know and I'll change the name ;-)

Thanks,
Takao Indoh

Comment 22 Takao Indoh 2008-12-11 20:46:13 UTC

Created attachment 326668 [details]
halt_on_err patch

Comment 23 Masaki Tachibana 2008-12-15 10:08:38 UTC

Hi Indoh-san,

I understand that your patch makes the system halt just after netdump.

Do you mean to unify linux-2.6.9-panic-freeze-on-dump-err-fix.patch and
linux-2.6.9-diskdump-reboot-on-err.patch into halt_on_err.patch?
If so, it cannot avoid the reboot failure when halt_on_err = 0.

Or only converting linux-2.6.9-diskdump-reboot-on-err.patch to 
halt_on_err.patch?
If so, linux-2.6.9-panic-freeze-on-dump-err-fix.patch can avoid reboot 
failure.
However, halt_on_err.patch does not let the system halt
while showing the back trace of the panic on the screen.
I think halting on error without showing the back trace
is not useful for this patch.

If you'd like to change only the dump code, it may be better for the
processing after the netdump call to handle both rebooting and halting.
(If there is no other way, I will give up on showing the back trace.)

Comment 24 Takao Indoh 2008-12-16 23:45:18 UTC

Hi Tachibana-san,

>Or only converting linux-2.6.9-diskdump-reboot-on-err.patch to 
>halt_on_err.patch?
>If so, linux-2.6.9-panic-freeze-on-dump-err-fix.patch can avoid reboot 
>failure.

I just converted linux-2.6.9-diskdump-reboot-on-err.patch to halt_on_err.patch. Now I think your patch is best, so I'll go ahead with your patch as-is. Thanks.

Comment 25 Takao Indoh 2008-12-22 16:36:24 UTC

Hi Tachibana-san,

I think there are two problems.
- System hangs up after diskdump fails.
    linux-2.6.9-panic-freeze-on-dump-err-fix.patch fixes this.
- Panic information should be displayed if diskdump fails.
    linux-2.6.9-diskdump-reboot-on-err.patch fixes this.

These problems are separable, so I opened a new bug, bz477635, for the latter problem.

Comment 26 Masaki Tachibana 2008-12-24 01:41:52 UTC

Hi Indoh-san,

Thank you for your work.

Thanks,
Masaki Tachibana

Comment 27 Takao Indoh 2009-01-05 22:18:58 UTC

Hi Tachibana-san,

I posted the patch, and a Red Hat engineer pointed out that the following part
of the patch does not seem to be needed.

@@ -497,6 +497,11 @@ static asmlinkage void netpoll_netdump(s
 		kfree(req);
 		req = NULL;
 	}
+	/*
+	 * The meaning of netdump_mode changes here.
+	 * Netdump is in progress. --> Netdump has been executed.
+	 */
+	netdump_mode = 1;
 	sprintf(tmp, "NETDUMP end.\n");
 	reply.code = REPLY_END_NETDUMP;
 	reply.nr = 0;


I checked the source code of netdump and netdump-server, and I think he is
right. First of all, netdump_mode is set to 1 at the top of
netpoll_start_netdump(), and it is reset to 0 only when netdump
receives the COMM_EXIT command.

    case COMM_EXIT:
            Dprintk("got EXIT command.\n");
            netdump_mode = 0;
            netpoll_set_trap(0);
            break;

But it seems that netdump-server never sends the COMM_EXIT command, so netdump_mode
is always 1 after netdump runs. Therefore, I think this part is not needed.
If you know of a reason this part is needed, please let me know. If you agree
with this, I'll remove this part from the patch.

Thanks,
Takao Indoh
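
The reasoning in comment 27 can be illustrated with a small stand-alone model.
This is a sketch under stated assumptions, not netdump source: only the names
netdump_mode, netpoll_start_netdump() and COMM_EXIT come from the thread, while
model_netdump_mode, MODEL_COMM_OTHER and MODEL_COMM_EXIT are hypothetical
stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical stand-ins for commands received from a netdump server.
 * MODEL_COMM_EXIT models COMM_EXIT; MODEL_COMM_OTHER is invented and
 * stands for any other command.
 */
enum model_command { MODEL_COMM_OTHER, MODEL_COMM_EXIT };

/* Returns the final value of the modelled netdump_mode flag. */
int model_netdump_mode(const enum model_command *cmds, size_t count)
{
    int netdump_mode;

    assert(cmds != NULL || count == 0);

    netdump_mode = 1;    /* set at the top of netpoll_start_netdump() */
    for (size_t i = 0; i < count; i++) {
        if (cmds[i] == MODEL_COMM_EXIT)
            netdump_mode = 0;    /* the only reset path described in the comment */
    }
    return netdump_mode;
}
```

In this model, if the command stream never contains the exit command, the flag
is still 1 when the session ends, which is why an additional
"netdump_mode = 1;" before "NETDUMP end." would change nothing.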

Comment 28 Masaki Tachibana 2009-01-06 10:59:23 UTC

Hi Indoh-san,

Thank you for your work.
I agree.

Thanks,
Masaki Tachibana

Comment 29 Vivek Goyal 2009-01-14 14:21:29 UTC

Committed in 78.28.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 34 errata-xmlrpc 2009-05-18 19:25:27 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html
