Escalated to Bugzilla from IssueTracker
(1) Category: Defect Report
(2) Abstract: x86_64 race condition at shutdown/panic
(3) Symptom: If a panic occurs during the shutdown process, the system may hang without printing any oops messages.
(4) Environment: OS: RHEL4/RHEL5 (x86_64). It probably does not depend on the hardware.
(5) Recreation Steps: Repeat shutdown. The hang may occur about once in several thousand attempts.
(6) Investigation: If cpu_online_map is cleared and the CPU is interrupted immediately afterwards, __smp_call_function() reads the CPU count as num_online_cpus()-1, which will be -1. As a result, __smp_call_function() loops forever.
(7) Related Documentation/Related Bugzilla #: This problem has already been reported upstream. http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9964cf7d776600724ef5f1b33303ceadc588b8ba
(8) Attachments: N/A
(9) Business Impacts: We found this during our testing. There is no business case yet, but it can occur at any customer site.
(10) Requests: Please merge the upstream patch into the RHEL kernel.
This event sent from IssueTracker by streeter [SEG - Kernel] issue 181497
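For readers following the investigation in item (6), the failure can be modelled in user space. The sketch below is not kernel code: cpus_to_wait_for() and wait_completes() are hypothetical names, and it assumes that __smp_call_function() spins until a counter of responding CPUs equals num_online_cpus()-1, which the report implies but does not quote. The real loop has no bound; max_spins is added only so that the model terminates.

```c
#include <stdbool.h>

/* Hypothetical user-space model; none of these names are kernel APIs. */

/* Number of other CPUs the caller expects to answer: num_online_cpus() - 1. */
static int cpus_to_wait_for(int online_cpus)
{
	return online_cpus - 1;
}

/*
 * Model of a wait loop that spins until `started` equals `expected`.
 * Assumption: each responding CPU increments the counter once, so it
 * climbs from 0 to `responders` and then stops changing.
 * Returns true if the wait would finish, false if it would spin forever
 * (approximated here by exhausting `max_spins`).
 */
static bool wait_completes(int expected, int responders, int max_spins)
{
	int started = 0;

	for (int spin = 0; spin < max_spins; spin++) {
		if (started == expected)
			return true;
		if (started < responders)
			started++;	/* one more CPU acknowledged the request */
	}
	return false;
}
```

With four CPUs online and three responders the wait finishes. If the online map is already empty when the count is taken, the expected value is -1, and a counter that starts at 0 and only increases can never match it.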
memo: generating test kernel on xen guest olive_PV_RHEL_46_x86_64_01_ITIT133323 . This event sent from IssueTracker by streeter [SEG - Kernel] issue 181497
Hi Sonoda-san,
A test kernel has been built with the patches. Note that only sanity checking has been completed (boot, reboot, and shutdown). Would you like to test this kernel?
1. Please confirm that the normal configuration does show the race condition, and collect a sysreport.
2. Then test the patched kernel.
3. Could you send us the sysreport when everything is done?
This event sent from IssueTracker by streeter [SEG - Kernel] issue 181497
memo: cherry picked from:
9964cf7d776600724ef5f1b33303ceadc588b8ba
d89559589a588d1a654329d8cd9a3ad33aaad9be

### sanity check
### test kernel install
[root@gssem64t x86_64]# rpm --oldpackage -ivh kernel-smp-2.6.9-67.EL_IT181497.x86_64.rpm
Preparing... ########################################### [100%]
1:kernel-smp ########################################### [100%]
### reboot with new kernel
[root@gssem64t ~]# uname -a
Linux gssem64t 2.6.9-67.EL_IT181497smp #1 SMP Mon May 26 18:31:01 JST 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@gssem64t ~]# dmesg > dmesg01.txt
### reboot
[root@gssem64t ~]# dmesg > dmesg02.txt
### shutdown
### boot the 2.6.9-67 kernel
[root@gssem64t ~]# dmesg > dmesg03.txt
### check dmesg and messages each time to be sure. Nothing much differs.
This event sent from IssueTracker by streeter [SEG - Kernel] issue 181497
We're going to test the patched kernel and will update the status soon. I am changing the status to avoid autoclose. Status set to: Waiting on Client This event sent from IssueTracker by streeter [SEG - Kernel] issue 181497
Hi SEG, I have yet to hear back on this, but the vendor is currently testing the test kernel. Could you open a BZ for this so I can enroll it in the partner tracker bug? Please flip this to WoSupport once that is done so I can continue working on this with them. Thanks! Issue escalated to Support Engineering Group by: tumeya. Internal Status set to 'Waiting on SEG' This event sent from IssueTracker by streeter [SEG - Kernel] issue 181497
Streeter, I escalated in order to have you open up BZ to have this listed on GSS 4.8 tracker bug. Please do so although the investigation is still half way through. This event sent from IssueTracker by streeter [SEG - Kernel] issue 181497
Updating PM score.
File uploaded: sysreport-root.RHEL4u5-x64-494.tar.bz2 This event sent from IssueTracker by streeter issue 181497 it_file 152836
Info; Uploaded sysreport-root.RHEL4u5-x64-494.tar.bz2. miki Internal Status set to 'Waiting on Support' Status set to: Waiting on Tech This event sent from IssueTracker by streeter issue 181497
File uploaded: linux-2.6.9-x86_64_shutdown_panic.patch This event sent from IssueTracker by streeter issue 181497 it_file 152877
File uploaded: linux-2.6.9-x86_64_shutdown_panic_2.patch This event sent from IssueTracker by streeter issue 181497 it_file 152878
Hi Streeter, I've got the sysreport and the test kernel results from the vendor, and they look good.

# Provide time and date of the problem
N/A. This occurred at the vendor's site.
# Indicate the platform(s) (architectures) the problem is being reported against.
Typically on SMP x86_64.
# Provide clear and concise problem description as it is understood at the time of escalation
* Observed behavior
A race condition at shutdown/panic results in a hang. The chance is reportedly several times out of 1000.
* Desired behavior
The machine reboots properly.
# State specific action requested of SEG
I've provided a test kernel with patches cherry-picked from the following git commits:
9964cf7d776600724ef5f1b33303ceadc588b8ba
d89559589a588d1a654329d8cd9a3ad33aaad9be
They've tested it and confirmed that it works. I'll attach the patches. Let me know if I'm missing anything.
# State whether or not a defect in the product is suspected
* Provide Bugzilla if one already exists
N/A
# If there is a proposed patch, make sure it is in unified diff format (diff -pruN)

--- linux-2.6.9/arch/x86_64/kernel/smp.c.org 2008-05-23 16:32:00.000000000 +0900
+++ linux-2.6.9/arch/x86_64/kernel/smp.c 2008-05-23 16:39:34.000000000 +0900
@@ -399,39 +399,31 @@
 	return 0;
 }
 
-void smp_stop_cpu(void)
+static void stop_this_cpu(void *dummy)
 {
+	local_irq_disable();
 	/*
 	 * Remove this CPU:
 	 */
 	cpu_clear(smp_processor_id(), cpu_online_map);
-	local_irq_disable();
 	disable_local_APIC();
-	local_irq_enable();
-}
-
-static void smp_really_stop_cpu(void *dummy)
-{
-	smp_stop_cpu();
 	for (;;)
 		asm("hlt");
 }
 
 void smp_send_stop(void)
 {
-	int nolock = 0;
+	int nolock;
+	unsigned long flags;
+
 	/* Don't deadlock on the call lock in panic */
-	if (!spin_trylock(&call_lock)) {
-		udelay(100);
-		/* ignore locking because we have paniced anyways */
-		nolock = 1;
-	}
-	__smp_call_function(smp_really_stop_cpu, NULL, 1, 0);
+	nolock = !spin_trylock(&call_lock);
+	local_irq_save(flags);
+	__smp_call_function(stop_this_cpu, NULL, 0, 0);
 	if (!nolock)
 		spin_unlock(&call_lock);
-	local_irq_disable();
 	disable_local_APIC();
-	local_irq_enable();
+	local_irq_restore(flags);
 }
 
 /*
--- linux-2.6.9/include/asm-x86_64/smp.h.org 2008-05-23 16:39:53.000000000 +0900
+++ linux-2.6.9/include/asm-x86_64/smp.h 2008-05-23 16:40:20.000000000 +0900
@@ -46,7 +46,6 @@
 extern void smp_invalidate_rcv(void); /* Process an NMI */
 extern void (*mtrr_hook) (void);
 extern void zap_low_mappings(void);
-void smp_stop_cpu(void);
 extern cpumask_t cpu_sibling_map[NR_CPUS];
 extern cpumask_t cpu_core_map[NR_CPUS];
 extern u8 phys_proc_id[NR_CPUS];
--- linux-2.6.9/arch/x86_64/kernel/reboot.c.org 2008-05-26 17:26:51.000000000 +0900
+++ linux-2.6.9/arch/x86_64/kernel/reboot.c 2008-05-26 17:54:34.000000000 +0900
@@ -96,51 +96,55 @@
 		[target] "b" (WARMBOOT_TRAMP));
 }
 
-#ifdef CONFIG_SMP
-static void smp_halt(void)
+static inline void kb_wait(void)
+{
+	int i;
+
+	for (i=0; i<0x10000; i++)
+		if ((inb_p(0x64) & 0x02) == 0)
+			break;
+}
+
+void machine_shutdown(void)
 {
-	int cpuid = hard_smp_processor_id();
-	static int first_entry = 1;
+	/* Stop the cpus and apics */
+#ifdef CONFIG_SMP
+	int reboot_cpu_id;
+
+	/* The boot cpu is always logical cpu 0 */
+	reboot_cpu_id = 0;
 
-	if (first_entry) {
-		first_entry = 0;
-		/* If nobody's alive, just return to machine_restart */
-		if (num_online_cpus() == 1)
-			return;
-		smp_call_function((void *)machine_restart, NULL, 1, 0);
-	}
-
-	smp_stop_cpu();
-
-	/* AP calling this. Just halt */
-	if (cpuid != boot_cpu_id) {
-		for (;;)
-			asm("hlt");
+	/* Make certain the cpu I'm about to reboot on is online */
+	if (!cpu_isset(reboot_cpu_id, cpu_online_map)) {
+		reboot_cpu_id = smp_processor_id();
 	}
 
-	/* Wait for all other CPUs to have run smp_stop_cpu */
-	while (!cpus_empty(cpu_online_map))
-		rep_nop();
-}
+	/* Make certain I only run on the appropriate processor */
+	set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+
+	/* O.K Now that I'm on the appropriate processor,
+	 * stop all of the others.
+	 */
+	smp_send_stop();
 #endif
 
-static inline void kb_wait(void)
-{
-	int i;
+	local_irq_disable();
 
-	for (i=0; i<0x10000; i++)
-		if ((inb_p(0x64) & 0x02) == 0)
-			break;
+#ifndef CONFIG_SMP
+	disable_local_APIC();
+#endif
+
+	disable_IO_APIC();
+
+	local_irq_enable();
 }
 
 void machine_restart(char * __unused)
 {
 	int i;
 
-#ifdef CONFIG_SMP
 	if (!crashdump_mode())
-		smp_halt();
-#endif
+		machine_shutdown();
 
 	local_irq_disable();

Internal Status set to 'Waiting on SEG'
This event sent from IssueTracker by streeter issue 181497
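The locking change in smp_send_stop() can be illustrated outside the kernel. The sketch below is a user-space model built on C11 atomic_flag, not the kernel spinlock API; try_lock(), unlock(), send_stop_model() and stop_requests are hypothetical names introduced only for illustration. It shows just the pattern from the patch: try the lock once, never block on it, and release it only if it was actually taken.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for call_lock; not the kernel spinlock type. */
static atomic_flag call_lock = ATOMIC_FLAG_INIT;

/* Hypothetical counter standing in for the stop request that gets sent. */
static int stop_requests;

/* Returns true if the lock was free and is now held by the caller. */
static bool try_lock(atomic_flag *lock)
{
	return !atomic_flag_test_and_set(lock);
}

static void unlock(atomic_flag *lock)
{
	atomic_flag_clear(lock);
}

/*
 * Model of the patched flow: a single lock attempt, the work is done
 * either way, and the lock is dropped only when this caller took it.
 * Returns true if the lock was held during the work.
 */
static bool send_stop_model(void)
{
	/* Never wait: the holder may be a CPU that will not run again. */
	bool nolock = !try_lock(&call_lock);

	stop_requests++;

	if (!nolock)
		unlock(&call_lock);
	return !nolock;
}
```

The point of the design is that a panicking caller still makes progress when the lock is stuck, while the normal path keeps its mutual exclusion.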
attached are the patches for 67.EL. This event sent from IssueTracker by streeter issue 181497
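As a reading aid for the machine_shutdown() hunk in the patch posted in the previous comment, the choice of reboot CPU can be written as a small pure function. This is a user-space sketch under assumptions: the online map is modelled as a plain bitmask, and cpu_is_online() and pick_reboot_cpu() are hypothetical helpers, not kernel functions.

```c
#include <stdbool.h>

/* Hypothetical model: bit N of `online_mask` set means CPU N is online. */
static bool cpu_is_online(unsigned int cpu, unsigned long online_mask)
{
	return (online_mask >> cpu) & 1UL;
}

/*
 * Mirrors the selection in the patch: prefer logical CPU 0, which the
 * patch comment treats as the boot CPU, and fall back to the CPU that
 * is currently running if CPU 0 is no longer online.
 */
static unsigned int pick_reboot_cpu(unsigned long online_mask,
				    unsigned int current_cpu)
{
	unsigned int reboot_cpu = 0;

	if (!cpu_is_online(reboot_cpu, online_mask))
		reboot_cpu = current_cpu;
	return reboot_cpu;
}
```

In the patch the caller then pins itself to the chosen CPU with set_cpus_allowed() before calling smp_send_stop(); that step has no equivalent in this model.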
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux maintenance release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux Update release for currently deployed products. This request is not yet committed for inclusion in an Update release.
I have prepared test kernel packages for i686 and x86_64. Could anybody test them? They are available at: http://people.redhat.com/ivecera/rhel-4-ivtest/
I believe the test kernel on BZ457409 contains the same patch as on IT181497. It's been tested by the vendor twice btw. They'll come back and test this again on beta phase. Thanks! This event sent from IssueTracker by tumeya issue 181497
Created attachment 320319 [details] Final patch sent to review
Committed in 78.28.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/
*** Bug 479194 has been marked as a duplicate of this bug. ***
Any updates here? Has this issue been resolved in the RHEL 4.8 Beta or in a later kernel?
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2009-1024.html