Description of problem:
When a system is diskdumping and it encounters a serious error (broken disk), the diskdump neither falls through to netdump nor reboots.

Version-Release number of selected component (if applicable):
kernel 2.6.9-55.16.EL

How reproducible:
Every time.

Steps to Reproduce:
1. service diskdump initialformat; service diskdump start
2. dd if=/dev/zero of=/dev/sdb1 bs=1024 count=10; # i know this is bad, i'm simulating hardware failure
3. echo 5 > /proc/sys/kernel/panic
4. echo c > /proc/sysrq-trigger

Actual results:
System panics, doesn't fall through, and simply hangs.

Expected results:
System panics, then reboots.

Additional info:
Attached is a modified patch from NEC that allows the diskdump module to take a reboot option if the dump fails.
Created attachment 159493 [details] Reboot if the diskdump fails horribly.
User ntachino's account has been closed
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
Hello Norm,

> When a system is diskdumping, and it encounters a serious error (broken disk)
> the diskdump will not fall through to netdump or not reboot.

I tested diskdump and confirmed that diskdump fell through to netdump. The log is as follows.

CPU frozen: #1
CPU#0 is executing diskdump.
start dumping to sda1
check dump partition... 1/262042 52408 ETA |
<3>disk_dump: bad signature in block 0
<3>disk_dump: check partition failed.
<3>disk_dump: No more dump device found
<6>disk_dump: diskdump failed, fall back to trying netdump
CPU#0 is executing netdump.
< netdump activated - performing handshake with the server. >
NETDUMP START!
< handshake completed - listening for dump requests. >
(snipped)

I think netdump was not enabled when your customers found this problem. Could you confirm that?

Thanks,
Takao Indoh
Dear SEG,

NEC is asking us to consider the three patches for this issue, since they cannot accept a system that fails to reboot:

---
The system will reboot successfully if netdump completes successfully after diskdump failed. However, if netdump also fails, the kernel will return to the original caller and will result in the same reboot failure as when diskdump failed. It is very unlikely that every customer would set up a netdump server just for this purpose, so I believe netdump could not be used as a workaround. Please consider our original patches.
---

Posted here on 07-13-2007:
- linux-2.6.9-panic-freeze-on-dump-err-fix.patch
- linux-2.6.9-diskdump-reboot-on-err.patch
- diskdumputils-reboot-on-err.patch

Could you please discuss the fixes with engineering?

Best Regards,
M Oshiro

Internal Status set to 'Waiting on SEG'

This event sent from IssueTracker by streeter
issue 97210
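The fall-through order described in NEC's note can be sketched as a small user-space model. This is only an illustration under stated assumptions: `try_dumps`, the `dump_outcome` labels, and the three boolean inputs are hypothetical names invented for the sketch, not kernel symbols, and the real dump path is considerably more involved.

```c
#include <stdbool.h>

/* Hypothetical outcome labels, used only in this sketch. */
enum dump_outcome {
    DUMPED_BY_DISKDUMP,
    DUMPED_BY_NETDUMP,
    RETURNED_TO_CALLER  /* no dump taken; per the report, the reboot then fails */
};

/*
 * Model of the fall-through order: diskdump first, then netdump if it is
 * enabled, otherwise control returns to the original caller.
 * The flags are inputs to the model, not real kernel variables.
 */
enum dump_outcome try_dumps(bool diskdump_ok, bool netdump_enabled, bool netdump_ok)
{
    if (diskdump_ok)
        return DUMPED_BY_DISKDUMP;
    if (netdump_enabled && netdump_ok)
        return DUMPED_BY_NETDUMP;
    return RETURNED_TO_CALLER;
}
```

In this model, a site without a netdump server always ends in `RETURNED_TO_CALLER` once diskdump fails, which is the case NEC argues netdump cannot work around.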
Created attachment 311298 [details] NEC-supplied patch
Created attachment 311299 [details] NEC-supplied patch
Created attachment 311300 [details] NEC-supplied patch
Updating PM score.
Hello,

Could you tell me how "reboot_on_err" works? I read the following explanation, but I think we can do the same thing by using fallback_on_err and /proc/sys/kernel/panic.

+reboot_on_err: Specify whether the system restarts after diskdump failure or
+               not. This feature is only for RHEL4. The default value is 1,
+               which means that the system restarts after diskdump detects
+               some errors to abort. If the value is 0, the system halts
+               after diskdump aborts.

Thanks,
Takao Indoh
Hi,

Below is my understanding of fallback_on_err. If all of the following conditions are met, the system is rebooted:
- Only the reboot_on_err patch is not applied.
- kernel.panic in /etc/sysctl.conf is set (for example, 10 seconds).
- fallback_on_err is 1.
- Both diskdump and netdump fail (or diskdump fails and netdump is not enabled).

In that case, the many users who do not use a serial console cannot see information about the cause of the panic.

--------
reboot_on_err enables users to control the behavior, so they can see the information. If reboot_on_err is 1 (default), the system behaves exactly the same as before. If reboot_on_err is 0, the system halts while showing the back trace of the panic on the screen.
--------

The typical behavior when reboot_on_err is 1 (default):

Panic  fallback_on_err  diskdump  netdump  : result : panic information
-------------------------------------------------------------------
10     1                success   -        : reboot
10     1                failure   failure  : reboot : goes off.
0      1                success   -        : halt
0      1                failure   failure  : halt   : displayed.

The typical behavior if reboot_on_err is 0:

Panic  fallback_on_err  reboot_on_err  diskdump  netdump  : result : panic information
-------------------------------------------------------------------
10     1                0              success   -        : reboot
10     1                0              failure   failure  : halt   : displayed.
0      1                0              success   -        : halt
0      1                0              failure   failure  : halt   : displayed.
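The two tables above can be condensed into one decision function. The following is a minimal sketch that assumes fallback_on_err is 1, as in every row of the tables; `after_dump`, `final_action`, and the parameter names are hypothetical and do not appear in the patches.

```c
#include <stdbool.h>

/* Hypothetical result labels for this sketch. */
enum final_action { ACTION_REBOOT, ACTION_HALT };

/*
 * panic_timeout: value of kernel.panic (0 means no automatic reboot).
 * reboot_on_err: the proposed parameter (1 = previous behavior).
 * dump_failed:   true when both diskdump and netdump failed.
 * Assumes fallback_on_err == 1, as in all table rows.
 */
enum final_action after_dump(int panic_timeout, bool reboot_on_err, bool dump_failed)
{
    if (panic_timeout == 0)
        return ACTION_HALT;      /* rows with Panic = 0 always halt */
    if (dump_failed && !reboot_on_err)
        return ACTION_HALT;      /* halt so the panic information stays on screen */
    return ACTION_REBOOT;
}
```

The only row whose result differs between the two tables is Panic = 10 with both dumps failing: it reboots when reboot_on_err is 1 and halts when it is 0.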
Hi Tachibana-san,

Thanks for your explanation. I understand the purpose of this patch.

I made another patch based on yours. This patch changes only diskdump components (drivers/block/diskdump.c, kernel/dump.c), so I think it will be more acceptable to other Red Hat engineers. Could you check the attached patch and let me know whether it works for you?

BTW, in this patch I changed the name of the new parameter from "reboot_on_err" to "halt_on_err" because I feel "halt_on_err" is more intuitive for this purpose. But if you prefer "reboot_on_err", please let me know. I'll change the name ;-)

Thanks,
Takao Indoh
Created attachment 326668 [details] halt_on_err patch
Hi Indoh-san,

I understand that your patch makes the system halt just after netdump.

Do you mean to unify linux-2.6.9-panic-freeze-on-dump-err-fix.patch and linux-2.6.9-diskdump-reboot-on-err.patch into halt_on_err.patch? If so, it cannot avoid the reboot failure when halt_on_err = 0.

Or did you only convert linux-2.6.9-diskdump-reboot-on-err.patch to halt_on_err.patch? If so, linux-2.6.9-panic-freeze-on-dump-err-fix.patch can avoid the reboot failure. However, halt_on_err.patch does not make the system halt while showing the back trace of the panic on the screen. I think halting on error without showing the back trace is not useful for this patch.

If you'd like to change only the dump code, it may be better for the processing after the netdump call to handle both rebooting and halting. (If there is no other way, I will give up on showing the back trace.)
Hi Tachibana-san,

>Or only converting linux-2.6.9-diskdump-reboot-on-err.patch to
>halt_on_err.patch?
>If so, linux-2.6.9-panic-freeze-on-dump-err-fix.patch can avoid reboot
>failure.

I just converted linux-2.6.9-diskdump-reboot-on-err.patch to halt_on_err.patch. Now I think your patch is best, so I'll go ahead with your patch as-is.

Thanks.
Hi Tachibana-san,

I think there are two problems.
- The system hangs after diskdump fails. linux-2.6.9-panic-freeze-on-dump-err-fix.patch fixes this.
- Panic information should be displayed if diskdump fails. linux-2.6.9-diskdump-reboot-on-err.patch fixes this.

These problems are separable, so I have opened bz477635 for the latter problem.
Hi Indoh-san, Thank you for your work. Thanks, Masaki Tachibana
Hi Tachibana-san,

I posted the patch, and a Red Hat engineer pointed out that the following part of the patch does not seem to be needed.

@@ -497,6 +497,11 @@ static asmlinkage void netpoll_netdump(s
 		kfree(req);
 		req = NULL;
 	}
+	/*
+	 * The meaning of netdump_mode changes here.
+	 * Netdump is in progress. --> Netdump has been executed.
+	 */
+	netdump_mode = 1;
 	sprintf(tmp, "NETDUMP end.\n");
 	reply.code = REPLY_END_NETDUMP;
 	reply.nr = 0;

I checked the source code of netdump and netdump-server, and I think he is right. First of all, netdump_mode is set to 1 at the top of netpoll_start_netdump(), and it is set back to 0 only when netdump receives the COMM_EXIT command.

	case COMM_EXIT:
		Dprintk("got EXIT command.\n");
		netdump_mode = 0;
		netpoll_set_trap(0);
		break;

But it seems that netdump-server never sends the COMM_EXIT command, so netdump_mode is always 1 after netdump has run. Therefore, I think this part is not needed.

If you know a reason this part is needed, please let me know. If you agree, I'll remove this part from the patch.

Thanks,
Takao Indoh
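The reasoning above about netdump_mode can be illustrated with a small user-space simulation. This is a sketch under assumptions, not the netdump code: every `sim_`-prefixed name and the `SIM_COMM_*` values are hypothetical stand-ins, and only the two state transitions quoted in the comment (set to 1 at start, cleared on the exit command) are modeled.

```c
/* Hypothetical command set; only the exit command affects the mode. */
enum sim_command { SIM_COMM_OTHER, SIM_COMM_EXIT };

static int sim_netdump_mode;

/* Stand-in for the assignment at the top of netpoll_start_netdump(). */
static void sim_start_netdump(void)
{
    sim_netdump_mode = 1;
}

/* Stand-in for the command switch: only the exit command clears the mode. */
static void sim_handle_command(enum sim_command cmd)
{
    if (cmd == SIM_COMM_EXIT)
        sim_netdump_mode = 0;
}

/* Run one simulated netdump session and report the final mode. */
int sim_run_netdump(const enum sim_command *cmds, int count)
{
    int i;

    sim_start_netdump();
    for (i = 0; i < count; i++)
        sim_handle_command(cmds[i]);
    return sim_netdump_mode;
}
```

If the server never sends the exit command, the mode is still 1 when the session ends, so an extra `netdump_mode = 1` before the end-of-dump reply would not change anything in this model.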
Hi Indoh-san, Thank you for your work. I agree. Thanks, Masaki Tachibana
Committed in 78.28.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html