Bug 457409 - [RHEL4.6] x86_64 race condition at shutdown/panic
Summary: [RHEL4.6] x86_64 race condition at shutdown/panic
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.6
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Ivan Vecera
QA Contact: Martin Jenner
URL:
Whiteboard:
Duplicates: 479194
Depends On:
Blocks: 391511 456483 461297
Reported: 2008-07-31 14:35 UTC by Issue Tracker
Modified: 2018-10-20 02:49 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-05-18 19:30:59 UTC
Target Upstream Version:
Embargoed:


Attachments
Final patch sent to review (8.46 KB, patch)
2008-10-14 16:09 UTC, Ivan Vecera
no flags


Links
System: Red Hat Product Errata
ID: RHSA-2009:1024
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update
Last Updated: 2009-05-18 14:57:26 UTC

Description Issue Tracker 2008-07-31 14:35:08 UTC
Escalated to Bugzilla from IssueTracker

Comment 1 Issue Tracker 2008-07-31 14:35:10 UTC
(1) Category
  Defect Report

(2) Abstract
  x86_64 race condition at shutdown/panic

(3) Symptom
  If a panic occurs during the shutdown process, the system may hang
  without printing any oops messages.

(4) Environment
  OS: RHEL4/RHEL5(x86_64)
  It probably does not depend on the hardware.

(5) Recreation Steps
  Shut down repeatedly. The hang may occur about once in several thousand
  shutdowns.

(6) Investigation
  If cpu_online_map has been cleared and the CPU is then interrupted
  immediately, __smp_call_function() computes its CPU count as
  num_online_cpus()-1, which evaluates to -1.
  As a result, __smp_call_function() loops forever.

(7) Related Documentation/Related Bugzilla #
  This problem has already been reported upstream.
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9964cf7d776600724ef5f1b33303ceadc588b8ba

(8) Attachments
  N/A

(9) Business Impacts
  We found this during our testing.
  There is no business case yet,
  but this can occur at any customer site.

(10) Requests
  Please merge the upstream patch into the RHEL kernel.

This event sent from IssueTracker by streeter  [SEG - Kernel]
 issue 181497

Comment 3 Issue Tracker 2008-07-31 14:35:13 UTC
memo: generating test kernel on Xen guest
olive_PV_RHEL_46_x86_64_01_ITIT133323.



Comment 4 Issue Tracker 2008-07-31 14:35:14 UTC
Hi Sonoda-san,
The test kernel has been built with the patches. Note that only sanity
checking (boot, reboot, and shutdown) is complete. Would you like to test
this kernel?


1. Please confirm that the normal configuration does exhibit the race
condition, and collect a sysreport.
2. Then test the kernel.
3. Could you hand in the sysreport when everything is done?




Comment 5 Issue Tracker 2008-07-31 14:35:15 UTC
memo: 

cherry picked from: 
9964cf7d776600724ef5f1b33303ceadc588b8ba
d89559589a588d1a654329d8cd9a3ad33aaad9be


### sanity check
### test kernel install
[root@gssem64t x86_64]# rpm --oldpackage -ivh kernel-smp-2.6.9-67.EL_IT181497.x86_64.rpm
Preparing...                ########################################### [100%]
   1:kernel-smp             ########################################### [100%]


### reboot with new kernel
[root@gssem64t ~]# uname -a
Linux gssem64t 2.6.9-67.EL_IT181497smp #1 SMP Mon May 26 18:31:01 JST 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@gssem64t ~]# dmesg > dmesg01.txt

### reboot
[root@gssem64t ~]# dmesg > dmesg02.txt
### shutdown
### boot the 2.6.9-67 kernel
[root@gssem64t ~]# dmesg > dmesg03.txt

### check dmesg and messages each time to be sure. Nothing much differs. 




Comment 6 Issue Tracker 2008-07-31 14:35:16 UTC
We're going to test the patched kernel and will update the status soon.
I am changing the status to avoid auto-close.


Status set to: Waiting on Client


Comment 7 Issue Tracker 2008-07-31 14:35:17 UTC
Hi SEG, this has not been formally reported up yet, but the vendor is
currently testing the test kernel. Could you open a BZ for this so that I
can enroll it in the partner tracker bug? Please flip this to WoSupport once
that is done so I can continue working on this with them. Thanks!


Issue escalated to Support Engineering Group by: tumeya.
Internal Status set to 'Waiting on SEG'


Comment 8 Issue Tracker 2008-07-31 14:35:19 UTC
Streeter, I escalated this so that you can open a BZ and have it listed
on the GSS 4.8 tracker bug. Please do so even though the investigation is
only halfway through.



Comment 9 RHEL Program Management 2008-09-03 13:07:07 UTC
Updating PM score.

Comment 10 Issue Tracker 2008-09-09 16:16:51 UTC
File uploaded: sysreport-root.RHEL4u5-x64-494.tar.bz2

This event sent from IssueTracker by streeter 
 issue 181497
it_file 152836

Comment 11 Issue Tracker 2008-09-09 16:16:52 UTC
Info;
Uploaded sysreport-root.RHEL4u5-x64-494.tar.bz2.

miki

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech


Comment 12 Issue Tracker 2008-09-09 16:16:55 UTC
File uploaded: linux-2.6.9-x86_64_shutdown_panic.patch

This event sent from IssueTracker by streeter 
 issue 181497
it_file 152877

Comment 13 Issue Tracker 2008-09-09 16:16:56 UTC
File uploaded: linux-2.6.9-x86_64_shutdown_panic_2.patch

This event sent from IssueTracker by streeter 
 issue 181497
it_file 152878

Comment 14 Issue Tracker 2008-09-09 16:16:58 UTC
Hi Streeter, I've received the sysreport and the test kernel results from
the vendor, and they look good.

# Provide time and date of the problem
N/A. This occurred at the vendor's site.


# Indicate the platform(s) (architectures) the problem is being reported
against.
Typically on SMP x86_64. 


# Provide clear and concise problem description as it is understood at the
time of escalation
    * Observed behavior
A race condition at shutdown/panic results in a hang. The chance is
reportedly several times out of 1000.


    * Desired behavior 
Machine will reboot properly. 


# State specific action requested of SEG
I've provided a test kernel with patches cherry-picked from the following
git commits:
9964cf7d776600724ef5f1b33303ceadc588b8ba
d89559589a588d1a654329d8cd9a3ad33aaad9be
They've tested it and confirmed that it works. I'll attach the patches.
Let me know if I'm missing anything. 


# State whether or not a defect in the product is suspected

    * Provide Bugzilla if one already exists 
N/A


# If there is a proposed patch, make sure it is in unified diff format
(diff -pruN) 
--- linux-2.6.9/arch/x86_64/kernel/smp.c.org	2008-05-23 16:32:00.000000000 +0900
+++ linux-2.6.9/arch/x86_64/kernel/smp.c	2008-05-23 16:39:34.000000000 +0900
@@ -399,39 +399,31 @@
 	return 0;
 }
 
-void smp_stop_cpu(void)
+static void stop_this_cpu(void *dummy)
 {
+        local_irq_disable();
 	/*
 	 * Remove this CPU:
 	 */
 	cpu_clear(smp_processor_id(), cpu_online_map);
-	local_irq_disable();
 	disable_local_APIC();
-	local_irq_enable(); 
-}
-
-static void smp_really_stop_cpu(void *dummy)
-{
-	smp_stop_cpu(); 
 	for (;;) 
 		asm("hlt"); 
 } 
 
 void smp_send_stop(void)
 {
-	int nolock = 0;
+        int nolock;
+        unsigned long flags;
+
 	/* Don't deadlock on the call lock in panic */
-	if (!spin_trylock(&call_lock)) {
-		udelay(100);
-		/* ignore locking because we have paniced anyways */
-		nolock = 1;
-	}
-	__smp_call_function(smp_really_stop_cpu, NULL, 1, 0);
+        nolock = !spin_trylock(&call_lock);
+        local_irq_save(flags);
+        __smp_call_function(stop_this_cpu, NULL, 0, 0);
 	if (!nolock)
 		spin_unlock(&call_lock);
-	local_irq_disable();
 	disable_local_APIC();
-	local_irq_enable(); 
+        local_irq_restore(flags);
 }
 
 /*
--- linux-2.6.9/include/asm-x86_64/smp.h.org	2008-05-23 16:39:53.000000000 +0900
+++ linux-2.6.9/include/asm-x86_64/smp.h	2008-05-23 16:40:20.000000000 +0900
@@ -46,7 +46,6 @@
 extern void smp_invalidate_rcv(void);		/* Process an NMI */
 extern void (*mtrr_hook) (void);
 extern void zap_low_mappings(void);
-void smp_stop_cpu(void);
 extern cpumask_t cpu_sibling_map[NR_CPUS];
 extern cpumask_t cpu_core_map[NR_CPUS];
 extern u8 phys_proc_id[NR_CPUS];



--- linux-2.6.9/arch/x86_64/kernel/reboot.c.org	2008-05-26 17:26:51.000000000 +0900
+++ linux-2.6.9/arch/x86_64/kernel/reboot.c	2008-05-26 17:54:34.000000000 +0900
@@ -96,51 +96,55 @@
 		      [target] "b" (WARMBOOT_TRAMP));
 }
 
-#ifdef CONFIG_SMP
-static void smp_halt(void)
+static inline void kb_wait(void)
+{
+        int i;
+
+        for (i=0; i<0x10000; i++)
+                if ((inb_p(0x64) & 0x02) == 0)
+                        break;
+}
+
+void machine_shutdown(void)
 {
-	int cpuid = hard_smp_processor_id(); 
-	static int first_entry = 1;
+        /* Stop the cpus and apics */
+#ifdef CONFIG_SMP
+        int reboot_cpu_id;
+
+        /* The boot cpu is always logical cpu 0 */
+        reboot_cpu_id = 0;
 
-	if (first_entry) { 
-		first_entry = 0;
-		/* If nobody's alive, just return to machine_restart */
-		if (num_online_cpus() == 1)
-			return;
-		smp_call_function((void *)machine_restart, NULL, 1, 0);
-	} 
-			
-	smp_stop_cpu(); 
-
-	/* AP calling this. Just halt */
-	if (cpuid != boot_cpu_id) { 
-		for (;;) 
-			asm("hlt");
+        /* Make certain the cpu I'm about to reboot on is online */
+        if (!cpu_isset(reboot_cpu_id, cpu_online_map)) {
+                reboot_cpu_id = smp_processor_id();
 	}
 
-	/* Wait for all other CPUs to have run smp_stop_cpu */
-	while (!cpus_empty(cpu_online_map))
-		rep_nop(); 
-}
+        /* Make certain I only run on the appropriate processor */
+        set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+
+        /* O.K Now that I'm on the appropriate processor,
+         * stop all of the others.
+         */
+        smp_send_stop();
 #endif
 
-static inline void kb_wait(void)
-{
-	int i;
+        local_irq_disable();
 
-	for (i=0; i<0x10000; i++)
-		if ((inb_p(0x64) & 0x02) == 0)
-			break;
+#ifndef CONFIG_SMP
+        disable_local_APIC();
+#endif
+ 
+        disable_IO_APIC();
+
+        local_irq_enable();
 }
 
 void machine_restart(char * __unused)
 {
 	int i;
 
-#ifdef CONFIG_SMP
 	if (!crashdump_mode())
-		smp_halt();
-#endif
+		machine_shutdown();
 
 	local_irq_disable();


Internal Status set to 'Waiting on SEG'


Comment 15 Issue Tracker 2008-09-09 16:17:00 UTC
Attached are the patches for 67.EL.



Comment 17 RHEL Program Management 2008-09-22 18:23:32 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 18 Ivan Vecera 2008-09-29 15:58:49 UTC
I have prepared test kernel packages for i686 and x86_64. Could anybody test them?
They are available at:
http://people.redhat.com/ivecera/rhel-4-ivtest/

Comment 19 Issue Tracker 2008-10-02 08:21:46 UTC
I believe the test kernel on BZ457409 contains the same patch as the one on
IT181497. It has been tested by the vendor twice, by the way. They'll come
back and test this again in the beta phase. Thanks!




This event sent from IssueTracker by tumeya 
 issue 181497

Comment 20 Ivan Vecera 2008-10-14 16:09:35 UTC
Created attachment 320319 [details]
Final patch sent to review

Comment 21 Vivek Goyal 2009-01-14 14:23:18 UTC
Committed in 78.28.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/

Comment 23 Prarit Bhargava 2009-02-17 13:55:04 UTC
*** Bug 479194 has been marked as a duplicate of this bug. ***

Comment 26 Chris Ward 2009-05-05 13:57:28 UTC
Any updates here? Has this issue been resolved in the RHEL 4.8 Beta or a later kernel?

Comment 28 errata-xmlrpc 2009-05-18 19:30:59 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html

