Bug 246586 - Ooops in queue_work on boot.
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel-xen
Version: 4.5
Hardware: i686 Linux
Priority: low
Severity: low
Assigned To: Rik van Riel
QA Contact: Martin Jenner
Reported: 2007-07-03 04:47 EDT by Ian Campbell
Modified: 2007-11-16 20:14 EST
Fixed In Version: RHBA-2007-0791
Doc Type: Bug Fix
Last Closed: 2007-11-15 11:29:48 EST
External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2007:0791
Priority: normal
Status: SHIPPED_LIVE
Summary: Updated kernel packages available for Red Hat Enterprise Linux 4 Update 6
Last Updated: 2007-11-14 13:25:55 EST

Description Ian Campbell 2007-07-03 04:47:34 EDT
If a timer interrupt is received between time_init() and init_workqueues(), and
HYPERVISOR_shared_info->wc_version has ticked over so that timer_interrupt()
calls clock_was_set(), the kernel can oops: clock_was_set() attempts to defer
work to keventd_wq, which has not yet been initialised.

Oops:
        Checking if this processor honours the WP bit even in supervisor mode... Ok.
        printing eip:
        c012a724
        02669000 -> *pde = 00000000:00000000
        Oops: 0000 [#1]
        SMP
        Modules linked in:
        CPU:    0
        EIP:    0061:[<c012a724>]    Not tainted VLI
        EFLAGS: 00010046   (2.6.9-55.ELxenU)
        EIP is at queue_work+0x21/0x53
        eax: 00001004   ebx: 00000000   ecx: 00000000   edx: c029b600
        esi: 00000000   edi: 00000000   ebp: c0338f6c   esp: c0338f58
        ds: 007b   es: 007b   ss: 0068
        Process swapper (pid: 0, threadinfo=c0338000 task=c0297a40)
        Stack: 006fae09 00000000 c012dff0 00000000 006fae09 c0338f6c c0338f6c 006fae09
               00000000 00000000 00000000 c010dbd0 00cf1f60 00000000 00000000 00000000
               c10000c0 c1000080 00000000 00000000 45cf494b 0000c74f 00000000 00000000
        Call Trace:
         [<c012dff0>] clock_was_set+0x2d/0x180
         [<c010dbd0>] timer_interrupt+0x26e/0x3f5
         [<c01094aa>] handle_IRQ_event+0x44/0x85
         [<c0109a38>] do_IRQ+0x122/0x1b5
         =======================
         [<c01f7150>] evtchn_do_upcall+0x84/0xb8
         [<c0107538>] hypervisor_callback+0x2c/0x34
         [<c01020e6>] calibrate_delay+0xe3/0x1a8
         [<c02f36f3>] start_kernel+0x14f/0x1b6
        Code: 89 fa 5b 5e 5f e9 73 ea 13 00 56 31 f6 53 89 c3 b8 00 f0 ff ff 21 e0 8b 48 10 f0 0f ba 2a 00 19 c0 85 c0 75 33 8d 83 04 10 00 00 <39> 83 04 10 00 00 8d 42 04 0f 44 ce 39 42 04 74 08 0f 0b 68 00
         <0>Kernel panic - not syncing: Fatal exception in interrupt

The oops is rare in practice, but it can be triggered fairly reliably by adding
a very long delay (tens of seconds) after or during calibrate_delay(), which
widens the window of opportunity.
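The ordering problem described above can be sketched as a small userspace model. This is not kernel code: the step names, model_queue_work() and model_boot_survives() are hypothetical stand-ins invented for illustration, and the model reports a failure where the real queue_work() would fault on the uninitialised keventd_wq.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical boot steps, in the order implied by the report. */
enum boot_step {
    STEP_TIME_INIT,        /* timer interrupts can arrive from here on */
    STEP_CALIBRATE_DELAY,  /* the window a long delay would widen */
    STEP_INIT_WORKQUEUES,  /* keventd_wq becomes valid here */
    STEP_DONE
};

struct model_workqueue { int pending; };

static struct model_workqueue wq_storage;
static struct model_workqueue *keventd_wq_model; /* NULL until "init_workqueues" */

/* Stand-in for queue_work(): returns false where the kernel would oops. */
static bool model_queue_work(void)
{
    if (keventd_wq_model == NULL)
        return false;
    keventd_wq_model->pending++;
    return true;
}

/*
 * Run the modelled boot sequence and deliver one wallclock-changing timer
 * tick during step `tick_after`. Returns true if the boot survives.
 */
bool model_boot_survives(enum boot_step tick_after)
{
    bool timer_live = false;
    int step;

    assert((int)tick_after >= 0 && (int)tick_after <= (int)STEP_DONE);
    keventd_wq_model = NULL;
    wq_storage.pending = 0;

    for (step = STEP_TIME_INIT; step < STEP_DONE; step++) {
        if (step == STEP_TIME_INIT)
            timer_live = true;
        if (step == STEP_INIT_WORKQUEUES)
            keventd_wq_model = &wq_storage;
        /* Unguarded path: the tick always tries to defer work. */
        if (timer_live && step == (int)tick_after && !model_queue_work())
            return false;
    }
    return true;
}
```

In this model a tick during STEP_TIME_INIT or STEP_CALIBRATE_DELAY fails, while a tick once STEP_INIT_WORKQUEUES has run succeeds. A longer calibrate_delay() step simply makes the failing case more likely to be hit.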

The fix is in:
http://xenbits.xensource.com/staging/linux-2.6.18-xen.hg?rev/cb040341e05a
http://xenbits.xensource.com/kernels/rhel4x.hg?rev/4c6e7201cfb7

The indirection via the workqueue is not strictly necessary in 2.6.9, since
clock_was_set() already defers itself to a workqueue when called from interrupt
context. It is there because we wanted the fix to apply to a wide variety of
kernels.

Thanks,
Ian Campbell, XenSource.
Comment 1 Rik van Riel 2007-07-03 10:30:38 EDT
In RHEL4 things look like this:

void clock_was_set(void)
{
        struct k_itimer *timr;
        struct timespec new_wall_to;
        LIST_HEAD(cws_list);
        unsigned long seq;
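
        /* In interrupt context: defer the real work to keventd. */
        if (unlikely(in_interrupt())) {
                schedule_work(&clock_was_set_work);
                return;
        }

        /* ... remainder of function omitted ... */
}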

This seems to indicate that your patch will result in the kernel doing exactly 
what it is doing today, but maybe a little more efficiently.  Your patch also 
makes a possible (though highly unlikely in RHEL4) future change to 
clock_was_set() safe.

However, I do not understand how your patch fixes the bug.  If schedule_work 
is called before keventd_wq is initialized, surely it does not matter whether 
schedule_work is called from clock_was_set or directly from timer_interrupt?

What exactly is going on here?
Comment 2 RHEL Product and Program Management 2007-07-03 10:35:39 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 3 Ian Campbell 2007-07-03 10:45:39 EDT
The important bit of the patch is the addition of the "if (keventd_up())" which
prevents clock_was_set() (or in this case schedule_work() directly) from being
called before keventd_wq is set.

The workqueue added by the patch is indeed unnecessary for the 2.6.9 kernel since
clock_was_set() does the same thing itself. It's just there so the patch is
useful on a variety of kernels (e.g. in 2.6.21 clock_was_set() cannot be called
from interrupt context and does not defer itself). If you wanted you could
simplify to:
   if (keventd_up()) clock_was_set();
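
The effect of that guard can be sketched in userspace as follows. This is only a model under stated assumptions: every model_* name is hypothetical, model_keventd_up() imitates keventd_up() by checking whether the workqueue pointer has been set, and a tick that arrives too early is simply skipped. Whether the real patch picks up a skipped wallclock change later is not stated in this report.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_wq { int pending; };

static struct model_wq model_wq_storage;
static struct model_wq *model_keventd_wq; /* NULL until workqueues are up */

/* Return the model to its pre-init_workqueues state. */
void model_reset(void)
{
    model_keventd_wq = NULL;
    model_wq_storage.pending = 0;
}

/* Stand-in for init_workqueues(): makes the shared workqueue usable. */
void model_init_workqueues(void)
{
    model_keventd_wq = &model_wq_storage;
}

/* Imitates keventd_up(): true once the workqueue pointer is set. */
static bool model_keventd_up(void)
{
    return model_keventd_wq != NULL;
}

/* Stand-in for clock_was_set() called from interrupt context: defers work. */
static void model_clock_was_set(void)
{
    assert(model_keventd_wq != NULL); /* the guard below must ensure this */
    model_keventd_wq->pending++;
}

/*
 * Stand-in for the wallclock-changed path of timer_interrupt().
 * Returns true if the notification was issued, false if it was skipped.
 */
bool model_timer_tick_wallclock_changed(void)
{
    if (model_keventd_up()) {
        model_clock_was_set();
        return true;
    }
    return false;
}

/* Number of deferred work items queued so far. */
int model_pending_work(void)
{
    return model_wq_storage.pending;
}
```

Calling model_timer_tick_wallclock_changed() before model_init_workqueues() returns false and queues nothing; calling it afterwards queues one item.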
Comment 4 Rik van Riel 2007-07-03 10:57:40 EDT
Good point, doh!

Thanks for this patch Ian, I'll try to get it folded into the RHEL 4.6 tree 
ASAP.
Comment 5 Rik van Riel 2007-07-03 11:49:36 EDT
I have posted the patch on our internal kernel mailing list.
Comment 6 Red Hat Bugzilla 2007-07-24 20:53:32 EDT
change QA contact
Comment 7 Jason Baron 2007-07-25 10:32:20 EDT
Committed in stream U6, build 55.22. A test kernel with this patch is available
from http://people.redhat.com/~jbaron/rhel4/
Comment 11 errata-xmlrpc 2007-11-15 11:29:48 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0791.html
