Bug 457409 - [RHEL4.6] x86_64 race condition at shutdown/panic
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.6
Hardware: All Linux
Priority: medium Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Ivan Vecera
QA Contact: Martin Jenner
Duplicates: 479194
Depends On:
Blocks: 391511 456483 461297
 
Reported: 2008-07-31 10:35 EDT by Issue Tracker
Modified: 2018-10-19 22:49 EDT (History)
Doc Type: Bug Fix
Last Closed: 2009-05-18 15:30:59 EDT

Attachments
Final patch sent to review (8.46 KB, patch)
2008-10-14 12:09 EDT, Ivan Vecera

External Trackers
Tracker: Red Hat Product Errata
ID: RHSA-2009:1024
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: Red Hat Enterprise Linux 4.8 kernel security and bug fix update
Last Updated: 2009-05-18 10:57:26 EDT

Description Issue Tracker 2008-07-31 10:35:08 EDT
Escalated to Bugzilla from IssueTracker
Comment 1 Issue Tracker 2008-07-31 10:35:10 EDT
(1) Category
  Defect Report

(2) Abstract
  x86_64 race condition at shutdown/panic

(3) Symptom
  If a panic occurs during the shutdown process, the system may hang
  without printing any oops message.

(4) Environment
  OS: RHEL4/RHEL5(x86_64)
  The problem probably does not depend on the hardware.

(5) Recreation Steps
  Repeat shutdown. The hang may occur about once in several thousand attempts.

(6) Investigation
  If cpu_online_map is cleared and the CPU is interrupted immediately
  afterwards, __smp_call_function() computes the number of CPUs to wait for
  as num_online_cpus()-1, which evaluates to -1.
  As a result, __smp_call_function() loops forever.

(7) Related Documentation/Related Bugzilla #
  This problem has already been reported upstream.
  http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=9964cf7d776600724ef5f1b33303ceadc588b8ba

(8) Attachments
  N/A

(9) Business Impacts
  We found this during our own testing.
  There is no business case yet,
  but the problem can occur at any customer site.

(10) Requests
  Please merge the upstream patch into RHEL kernel.
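
The reasoning in item (6) can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the kernel code: model_call_function, online_cpus and max_spins are hypothetical names, and the model assumes, as the report implies, that the caller spins until the number of acknowledging CPUs equals num_online_cpus()-1. The spin bound exists only so that the model terminates.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * User-space model of the wait loop described in item (6).
 * Hypothetical names throughout; this is not the RHEL kernel source.
 *
 * online_cpus: value num_online_cpus() would return at entry
 * max_spins:   bound on the busy-wait so the model always terminates
 *
 * Returns true if the wait completes, false if the spin bound is hit,
 * which stands in for the hang.
 */
bool model_call_function(int online_cpus, long max_spins)
{
    int cpus = online_cpus - 1;  /* other CPUs expected to acknowledge */
    int started = 0;             /* CPUs that have acknowledged so far */

    assert(max_spins >= 0);

    if (cpus == 0)
        return true;             /* nobody else to wait for */

    for (long spin = 0; spin < max_spins; spin++) {
        if (started == cpus)
            return true;         /* every expected CPU has answered */
        if (started < cpus)
            started++;           /* one more remote CPU acknowledges */
    }

    /* With cpus == -1, started (>= 0) can never match it. */
    return false;
}
```

In this model, an online count of 1 or more lets the wait finish, while an online count of 0 (so that the expected count is -1) never satisfies the exit condition, whatever spin bound is chosen.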

This event sent from IssueTracker by streeter  [SEG - Kernel]
 issue 181497
Comment 3 Issue Tracker 2008-07-31 10:35:13 EDT
Memo: generating the test kernel on Xen guest
olive_PV_RHEL_46_x86_64_01_ITIT133323.


Comment 4 Issue Tracker 2008-07-31 10:35:14 EDT
Hi Sonoda-san,
A test kernel has been built with the patches. Note that only sanity checking
(boot, reboot, and shutdown) has been completed so far. Would you like to test
this kernel?


1. Please confirm that the race condition occurs with the normal
configuration, and collect a sysreport.
2. Then test the patched kernel.
3. Please send us the sysreport when everything is done.



Comment 5 Issue Tracker 2008-07-31 10:35:15 EDT
memo: 

cherry picked from: 
9964cf7d776600724ef5f1b33303ceadc588b8ba
d89559589a588d1a654329d8cd9a3ad33aaad9be


### sanity check
### test kernel install
[root@gssem64t x86_64]# rpm --oldpackage -ivh kernel-smp-2.6.9-67.EL_IT181497.x86_64.rpm
Preparing...                ########################################### [100%]
   1:kernel-smp             ########################################### [100%]


### reboot with new kernel
[root@gssem64t ~]# uname -a
Linux gssem64t 2.6.9-67.EL_IT181497smp #1 SMP Mon May 26 18:31:01 JST 2008 x86_64 x86_64 x86_64 GNU/Linux
[root@gssem64t ~]# dmesg > dmesg01.txt

### reboot
[root@gssem64t ~]# dmesg > dmesg02.txt
### shutdown
### boot the 2.6.9-67 kernel
[root@gssem64t ~]# dmesg > dmesg03.txt

### check dmesg and messages each time to be sure. Nothing much differs. 



Comment 6 Issue Tracker 2008-07-31 10:35:16 EDT
We are going to test the patched kernel and will update the status soon.
I am changing the status to avoid auto-close.


Status set to: Waiting on Client

Comment 7 Issue Tracker 2008-07-31 10:35:17 EDT
Hi SEG, there is nothing to report yet, but the vendor is currently testing
the test kernel. Could you open a BZ for this so that I can add it to the
partner tracker bug? Please flip this to WoSupport once that is done so I can
continue working on this with them. Thanks!


Issue escalated to Support Engineering Group by: tumeya.
Internal Status set to 'Waiting on SEG'

Comment 8 Issue Tracker 2008-07-31 10:35:19 EDT
Streeter, I escalated this so that you can open a BZ and have it listed
on the GSS 4.8 tracker bug. Please do so even though the investigation is
only halfway through.


Comment 9 RHEL Product and Program Management 2008-09-03 09:07:07 EDT
Updating PM score.
Comment 10 Issue Tracker 2008-09-09 12:16:51 EDT
File uploaded: sysreport-root.RHEL4u5-x64-494.tar.bz2

This event sent from IssueTracker by streeter 
 issue 181497
it_file 152836
Comment 11 Issue Tracker 2008-09-09 12:16:52 EDT
Info;
Uploaded sysreport-root.RHEL4u5-x64-494.tar.bz2.

miki

Internal Status set to 'Waiting on Support'
Status set to: Waiting on Tech

Comment 12 Issue Tracker 2008-09-09 12:16:55 EDT
File uploaded: linux-2.6.9-x86_64_shutdown_panic.patch

it_file 152877
Comment 13 Issue Tracker 2008-09-09 12:16:56 EDT
File uploaded: linux-2.6.9-x86_64_shutdown_panic_2.patch

it_file 152878
Comment 14 Issue Tracker 2008-09-09 12:16:58 EDT
Hi Streeter, I have received the sysreport and the test kernel results from
the vendor, and the results look good.

# Provide time and date of the problem
N/A. The problem occurred at the vendor's site.


# Indicate the platform(s) (architectures) the problem is being reported against.
Typically on SMP x86_64.


# Provide clear and concise problem description as it is understood at the time of escalation
    * Observed behavior
A race condition at shutdown/panic results in a hang. It reportedly occurs
several times out of 1000.


    * Desired behavior
The machine reboots properly.


# State specific action requested of SEG
I have provided a test kernel containing the patches cherry-picked from the
following git commits:
9964cf7d776600724ef5f1b33303ceadc588b8ba
d89559589a588d1a654329d8cd9a3ad33aaad9be
The vendor has tested it and confirmed that it works. I will attach the patches.
Let me know if I am missing anything.


# State whether or not a defect in the product is suspected

    * Provide Bugzilla if one already exists 
N/A


# If there is a proposed patch, make sure it is in unified diff format (diff -pruN)
--- linux-2.6.9/arch/x86_64/kernel/smp.c.org	2008-05-23 16:32:00.000000000 +0900
+++ linux-2.6.9/arch/x86_64/kernel/smp.c	2008-05-23 16:39:34.000000000 +0900
@@ -399,39 +399,31 @@
 	return 0;
 }
 
-void smp_stop_cpu(void)
+static void stop_this_cpu(void *dummy)
 {
+        local_irq_disable();
 	/*
 	 * Remove this CPU:
 	 */
 	cpu_clear(smp_processor_id(), cpu_online_map);
-	local_irq_disable();
 	disable_local_APIC();
-	local_irq_enable(); 
-}
-
-static void smp_really_stop_cpu(void *dummy)
-{
-	smp_stop_cpu(); 
 	for (;;) 
 		asm("hlt"); 
 } 
 
 void smp_send_stop(void)
 {
-	int nolock = 0;
+        int nolock;
+        unsigned long flags;
+
 	/* Don't deadlock on the call lock in panic */
-	if (!spin_trylock(&call_lock)) {
-		udelay(100);
-		/* ignore locking because we have paniced anyways */
-		nolock = 1;
-	}
-	__smp_call_function(smp_really_stop_cpu, NULL, 1, 0);
+        nolock = !spin_trylock(&call_lock);
+        local_irq_save(flags);
+        __smp_call_function(stop_this_cpu, NULL, 0, 0);
 	if (!nolock)
 		spin_unlock(&call_lock);
-	local_irq_disable();
 	disable_local_APIC();
-	local_irq_enable(); 
+        local_irq_restore(flags);
 }
 
 /*
--- linux-2.6.9/include/asm-x86_64/smp.h.org	2008-05-23 16:39:53.000000000 +0900
+++ linux-2.6.9/include/asm-x86_64/smp.h	2008-05-23 16:40:20.000000000 +0900
@@ -46,7 +46,6 @@
 extern void smp_invalidate_rcv(void);		/* Process an NMI */
 extern void (*mtrr_hook) (void);
 extern void zap_low_mappings(void);
-void smp_stop_cpu(void);
 extern cpumask_t cpu_sibling_map[NR_CPUS];
 extern cpumask_t cpu_core_map[NR_CPUS];
 extern u8 phys_proc_id[NR_CPUS];



--- linux-2.6.9/arch/x86_64/kernel/reboot.c.org	2008-05-26 17:26:51.000000000 +0900
+++ linux-2.6.9/arch/x86_64/kernel/reboot.c	2008-05-26 17:54:34.000000000 +0900
@@ -96,51 +96,55 @@
 		      [target] "b" (WARMBOOT_TRAMP));
 }
 
-#ifdef CONFIG_SMP
-static void smp_halt(void)
+static inline void kb_wait(void)
+{
+        int i;
+
+        for (i=0; i<0x10000; i++)
+                if ((inb_p(0x64) & 0x02) == 0)
+                        break;
+}
+
+void machine_shutdown(void)
 {
-	int cpuid = hard_smp_processor_id(); 
-	static int first_entry = 1;
+        /* Stop the cpus and apics */
+#ifdef CONFIG_SMP
+        int reboot_cpu_id;
+
+        /* The boot cpu is always logical cpu 0 */
+        reboot_cpu_id = 0;
 
-	if (first_entry) { 
-		first_entry = 0;
-		/* If nobody's alive, just return to machine_restart */
-		if (num_online_cpus() == 1)
-			return;
-		smp_call_function((void *)machine_restart, NULL, 1, 0);
-	} 
-			
-	smp_stop_cpu(); 
-
-	/* AP calling this. Just halt */
-	if (cpuid != boot_cpu_id) { 
-		for (;;) 
-			asm("hlt");
+        /* Make certain the cpu I'm about to reboot on is online */
+        if (!cpu_isset(reboot_cpu_id, cpu_online_map)) {
+                reboot_cpu_id = smp_processor_id();
 	}
 
-	/* Wait for all other CPUs to have run smp_stop_cpu */
-	while (!cpus_empty(cpu_online_map))
-		rep_nop(); 
-}
+        /* Make certain I only run on the appropriate processor */
+        set_cpus_allowed(current, cpumask_of_cpu(reboot_cpu_id));
+
+        /* O.K Now that I'm on the appropriate processor,
+         * stop all of the others.
+         */
+        smp_send_stop();
 #endif
 
-static inline void kb_wait(void)
-{
-	int i;
+        local_irq_disable();
 
-	for (i=0; i<0x10000; i++)
-		if ((inb_p(0x64) & 0x02) == 0)
-			break;
+#ifndef CONFIG_SMP
+        disable_local_APIC();
+#endif
+ 
+        disable_IO_APIC();
+
+        local_irq_enable();
 }
 
 void machine_restart(char * __unused)
 {
 	int i;
 
-#ifdef CONFIG_SMP
 	if (!crashdump_mode())
-		smp_halt();
-#endif
+		machine_shutdown();
 
 	local_irq_disable();
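

One change in the smp.c hunk above is the order of operations when a CPU stops itself: the removed smp_stop_cpu() cleared the CPU's bit in cpu_online_map before calling local_irq_disable(), while the added stop_this_cpu() disables interrupts first. The following user-space sketch models only that relative order. It is not kernel code: has_race_window and the step enum are hypothetical names, and the model ignores disable_local_APIC() and the later local_irq_enable() in the old code.

```c
#include <assert.h>
#include <stdbool.h>

/* The two operations whose relative order the model looks at. */
enum step { IRQ_DISABLE, CLEAR_ONLINE_BIT };

/*
 * Returns true if, at some point in the sequence, this CPU's online bit
 * is already cleared while interrupts are still enabled, i.e. a window
 * exists in which an interrupt could observe the cleared map.
 */
bool has_race_window(const enum step *steps, int nsteps)
{
    bool irqs_enabled = true;   /* interrupts start out enabled */
    bool bit_cleared = false;   /* CPU starts out marked online */

    assert(nsteps >= 0);

    for (int i = 0; i < nsteps; i++) {
        if (steps[i] == IRQ_DISABLE)
            irqs_enabled = false;
        else if (steps[i] == CLEAR_ONLINE_BIT)
            bit_cleared = true;

        /* An interrupt may arrive after any step while enabled. */
        if (bit_cleared && irqs_enabled)
            return true;
    }
    return false;
}
```

Passing the old order (CLEAR_ONLINE_BIT, then IRQ_DISABLE) reports a window, and passing the patched order (IRQ_DISABLE, then CLEAR_ONLINE_BIT) reports none. This matches the investigation's description of the CPU being interrupted immediately after cpu_online_map is cleared.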


Internal Status set to 'Waiting on SEG'

Comment 15 Issue Tracker 2008-09-09 12:17:00 EDT
Attached are the patches for 67.EL.


Comment 17 RHEL Product and Program Management 2008-09-22 14:23:32 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 18 Ivan Vecera 2008-09-29 11:58:49 EDT
I have prepared test kernel packages for i686 and x86_64. Could anybody test them?
They are available at:
http://people.redhat.com/ivecera/rhel-4-ivtest/
Comment 19 Issue Tracker 2008-10-02 04:21:46 EDT
I believe the test kernel on BZ457409 contains the same patch as the one on
IT181497, which the vendor has already tested twice. They will come back and
test this again during the beta phase. Thanks!




This event sent from IssueTracker by tumeya 
 issue 181497
Comment 20 Ivan Vecera 2008-10-14 12:09:35 EDT
Created attachment 320319 [details]
Final patch sent to review
Comment 21 Vivek Goyal 2009-01-14 09:23:18 EST
Committed in 78.28.EL. RPMs are available at http://people.redhat.com/vgoyal/rhel4/
Comment 23 Prarit Bhargava 2009-02-17 08:55:04 EST
*** Bug 479194 has been marked as a duplicate of this bug. ***
Comment 26 Chris Ward 2009-05-05 09:57:28 EDT
Any updates here? Has this issue been resolved in the RHEL 4.8 Beta or in a later kernel?
Comment 28 errata-xmlrpc 2009-05-18 15:30:59 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html
