Created attachment 317507 [details]
xml script used to duplicate the issue

Description of problem:
When running a Dom0, Paravirt, Fullvirt automated test, the system hits an Oops.

Version-Release number of selected component (if applicable):
Seems to have started with 2.6.18-109.el5

How reproducible:
Random

Steps to Reproduce:
1. Use the attachment to submit a job to RHTS

Actual results:
BUG: unable to handle kernel paging request at virtual address c0180c40
printing eip:
c040a1e6
1c3be000 -> *pde = 00000001:0d44f001
1284f000 -> *pme = 00000000:3d272067
01272000 -> *pte = 00000000:00000000
Oops: 0000 [#1]
SMP
last sysfs file: /devices/pci0000:3f/0000:3f:00.0/irq
Modules linked in: xt_physdev nfs lockd fscache nfs_acl netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand dm_multipath scsi_dh video backlight sbs i2c_ec i2c_core button battery asus_acpi ac parport_pc lp parport floppy sr_mod cdrom sg snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event tg3 snd_seq libphy snd_seq_device pcspkr snd_pcm_oss snd_mixer_oss serio_raw snd_pcm snd_timer snd_page_alloc snd_hwdep snd soundcore serial_core dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU: 1
EIP: 0061:[<c040a1e6>] Not tainted VLI
EFLAGS: 00010002 (2.6.18-116.el5xen #1)
EIP is at range_straddles_page_boundary+0x2c/0xd9
eax: c0169000 ebx: 000be210 ecx: 00003000 edx: 000be210
esi: 00000000 edi: 00003000 ebp: d3c91b18 esp: e03c4c44
ds: 007b es: 007b ss: 0069
Process virtinstall.exp (pid: 9560, ti=e03c4000 task=c0e53550 task.ti=e03c4000)
Stack: c046c47a 00040810 00000000 00000000 d3c91b18 c04ece67 00000002 c07f0448
       00000000 00000000 00000000 00000000 00000001 be210000 00000000 ffffffff
       ffffffff d3c91b00 00000002 c11f5ab4 c11f4084 c040a5aa 00000002 00000002
Call Trace:
 [<c046c47a>] kmem_cache_alloc+0x54/0x5e
 [<c04ece67>] swiotlb_map_sg+0x100/0x22f
 [<c040a5aa>] dma_map_sg+0x7d/0x1a9
 [<ee10a615>] ata_qc_issue+0x29a/0x490 [libata]
 [<ee0944d8>] scsi_done+0x0/0x16 [scsi_mod]
 [<ee10e900>] ata_scsi_translate+0x107/0x12c [libata]
 [<ee0944d8>] scsi_done+0x0/0x16 [scsi_mod]
 [<ee110f5b>] ata_scsi_queuecmd+0x18f/0x1ac [libata]
 [<ee10e612>] ata_scsi_rw_xlat+0x0/0x1c1 [libata]
 [<ee094a7a>] scsi_dispatch_cmd+0x213/0x28c [scsi_mod]
 [<ee0991ca>] scsi_request_fn+0x24b/0x305 [scsi_mod]
 [<c04d94c3>] __generic_unplug_device+0x1d/0x1f
 [<c04da222>] generic_unplug_device+0x1f/0x31
 [<ee0ee854>] dm_table_unplug_all+0x22/0x2e [dm_mod]
 [<ee0ecc79>] dm_unplug_all+0x17/0x21 [dm_mod]
 [<c04db68a>] blk_backing_dev_unplug+0x56/0x5d
 [<c0450b9b>] sync_page+0x0/0x3b
 [<c04712a4>] block_sync_page+0x31/0x32
 [<c0450bce>] sync_page+0x33/0x3b
 [<c060f0f2>] __wait_on_bit_lock+0x2a/0x52
 [<c0450b0e>] __lock_page+0x52/0x59
 [<c0430f14>] wake_bit_function+0x0/0x3c
 [<c045125f>] do_generic_mapping_read+0x1ff/0x3d8
 [<c0451c9b>] __generic_file_aio_read+0x166/0x198
 [<c04508da>] file_read_actor+0x0/0xd5
 [<c0451d08>] generic_file_aio_read+0x3b/0x42
 [<c046fd53>] do_sync_read+0xb6/0xf1
 [<c0430ee7>] autoremove_wake_function+0x0/0x2d
 [<c046fc9d>] do_sync_read+0x0/0xf1
 [<c047065c>] vfs_read+0x9f/0x141
 [<c04789f7>] kernel_read+0x32/0x43
 [<c0478acf>] prepare_binprm+0xc7/0xcc
 [<c047a56d>] do_execve+0xc3/0x1b2
 [<c040337d>] sys_execve+0x2a/0x4a
 [<c0405413>] syscall_call+0x7/0xb
 =======================
Code: 57 56 89 d6 53 89 c3 25 ff 0f 00 00 83 ec 04 8d 3c 08 81 ff 00 10 00 00 0f 86 b2 00 00 00 89 da a1 20 61 77 c0 0f ac f2 0c 89 d3 <0f> a3 10 19 c0 85 c0 0f 85 98 00 00 00 a0 e2 f6 6e c0 88 44 24
EIP: [<c040a1e6>] range_straddles_page_boundary+0x2c/0xd9 SS:ESP 0069:e03c4c44
<0>Kernel panic - not syncing: Fatal exception
BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)
 [<c041142f>] smp_call_function+0x59/0xfe
 [<c04114e7>] smp_send_stop+0x13/0x1e
 [<c04207fb>] panic+0x4c/0x171
 [<c04060a5>] die+0x262/0x296
 [<c0611969>] do_page_fault+0xa85/0xbf9
 [<c0610ee4>] do_page_fault+0x0/0xbf9
 [<c0405597>] error_code+0x2b/0x30
 [<c040a1e6>] range_straddles_page_boundary+0x2c/0xd9
 [<c046c47a>] kmem_cache_alloc+0x54/0x5e
 [<c04ece67>] swiotlb_map_sg+0x100/0x22f
 [<c040a5aa>] dma_map_sg+0x7d/0x1a9
 [<ee10a615>] ata_qc_issue+0x29a/0x490 [libata]
 [<ee0944d8>] scsi_done+0x0/0x16 [scsi_mod]
 [<ee10e900>] ata_scsi_translate+0x107/0x12c [libata]
 [<ee0944d8>] scsi_done+0x0/0x16 [scsi_mod]
 [<ee110f5b>] ata_scsi_queuecmd+0x18f/0x1ac [libata]
 [<ee10e612>] ata_scsi_rw_xlat+0x0/0x1c1 [libata]
 [<ee094a7a>] scsi_dispatch_cmd+0x213/0x28c [scsi_mod]
 [<ee0991ca>] scsi_request_fn+0x24b/0x305 [scsi_mod]
 [<c04d94c3>] __generic_unplug_device+0x1d/0x1f
 [<c04da222>] generic_unplug_device+0x1f/0x31
 [<ee0ee854>] dm_table_unplug_all+0x22/0x2e [dm_mod]
 [<ee0ecc79>] dm_unplug_all+0x17/0x21 [dm_mod]
 [<c04db68a>] blk_backing_dev_unplug+0x56/0x5d
 [<c0450b9b>] sync_page+0x0/0x3b
 [<c04712a4>] block_sync_page+0x31/0x32
 [<c0450bce>] sync_page+0x33/0x3b
 [<c060f0f2>] __wait_on_bit_lock+0x2a/0x52
 [<c0450b0e>] __lock_page+0x52/0x59
 [<c0430f14>] wake_bit_function+0x0/0x3c
 [<c045125f>] do_generic_mapping_read+0x1ff/0x3d8
 [<c0451c9b>] __generic_file_aio_read+0x166/0x198
 [<c04508da>] file_read_actor+0x0/0xd5
 [<c0451d08>] generic_file_aio_read+0x3b/0x42
 [<c046fd53>] do_sync_read+0xb6/0xf1
 [<c0430ee7>] autoremove_wake_function+0x0/0x2d
 [<c046fc9d>] do_sync_read+0x0/0xf1
 [<c047065c>] vfs_read+0x9f/0x141
 [<c04789f7>] kernel_read+0x32/0x43
 [<c0478acf>] prepare_binprm+0xc7/0xcc
 [<c047a56d>] do_execve+0xc3/0x1b2
 [<c040337d>] sys_execve+0x2a/0x4a
 [<c0405413>] syscall_call+0x7/0xb
 =======================

Additional info:
submit_job.py -S rhts.redhat.com -j xw4800-virt-failure.xml
Another crash with a different trace on a different system:

> > BUG: unable to handle kernel paging request at virtual address c0197b38
> > printing eip:
> > c040a1e6
> > 2cee5000 -> *pde = 00000003:cde4e001
> > 2d04e000 -> *pme = 00000000:3dc16067
> > 01c16000 -> *pte = 00000000:00000000
> > Oops: 0000 [#1]
> > SMP
> > last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
> > Modules linked in: loop xt_physdev nfs lockd fscache nfs_acl netloop netbk blktap blkbk ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge autofs4 hidp rfcomm l2cap bluetooth sunrpc ipv6 xfrm_nalgo crypto_api dm_mirror dm_log dm_multipath scsi_dh dm_mod video backlight sbs i2c_ec button battery asus_acpi ac lp floppy sg pcspkr e1000e serio_raw e1000 i2c_i801 parport_pc i2c_core parport ide_cd cdrom serial_core ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
> > CPU: 0
> > EIP: 0061:[<c040a1e6>] Not tainted VLI
> > EFLAGS: 00010006 (2.6.18-116.el5dz_boot6xen #1)
> > EIP is at range_straddles_page_boundary+0x2c/0xd9
> > eax: c0166000 ebx: 0018d9c0 ecx: 00002000 edx: 0018d9c0
> > esi: 00000001 edi: 00002000 ebp: e4ef72a8 esp: c0723e30
> > ds: 007b es: 007b ss: 0069
> > Process swapper (pid: 0, ti=c0723000 task=c06752c0 task.ti=c06ee000)
> > Stack: 00000002 0026d7c0 00000000 00000002 e4ef72a8 c04ebf47 00000027 c0c13448
> > 00000000 00000000 00000000 00000000 00000017 8d9c0000 00000001 ffffffff
> > ffffffff e4ef7080 00000001 00040027 00050000 c040a5aa 00000001 00000027
> > Call Trace:
> > [<c04ebf47>] swiotlb_map_sg+0x100/0x22f
> > [<c040a5aa>] dma_map_sg+0x7d/0x1a9
> > [<ee0c94df>] megasas_make_sgl64+0x7d/0xb5 [megaraid_sas]
> > [<ee0c9d50>] megasas_queue_command+0x23f/0x405 [megaraid_sas]
> > [<ee0834d8>] scsi_done+0x0/0x16 [scsi_mod]
> > [<ee083a7a>] scsi_dispatch_cmd+0x213/0x28c [scsi_mod]
> > [<ee088254>] scsi_request_fn+0x24b/0x305 [scsi_mod]
> > [<c04d88be>] blk_run_queue+0x37/0x63
> > [<ee0873e0>] scsi_next_command+0x25/0x2f [scsi_mod]
> > [<ee0874f7>] scsi_end_request+0xa1/0xab [scsi_mod]
> > [<ee087641>] scsi_io_completion+0x140/0x2ea [scsi_mod]
> > [<c0419591>] __wake_up+0x2a/0x3d
> > [<ee061704>] sd_rw_intr+0x271/0x2b6 [sd_mod]
> > [<c0427cf8>] del_timer+0x41/0x47
> > [<c060f1b8>] _spin_lock_irqsave+0x8/0x28
> > [<ee0833b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
> > [<c04d9183>] blk_done_softirq+0x55/0x60
> > [<c0424563>] __do_softirq+0x8b/0x11c
> > [<c0406e4d>] do_softirq+0x56/0xae
> > [<c04464a8>] __do_IRQ+0x0/0xd6
> > [<c0406f5a>] do_IRQ+0xb5/0xc3
> > [<c054eb85>] evtchn_do_upcall+0xfa/0x191
> > [<c04055d9>] hypervisor_callback+0x3d/0x48
> > [<c0408654>] raw_safe_halt+0x8c/0xaf
> > [<c040321a>] xen_idle+0x22/0x2e
> > [<c0403339>] cpu_idle+0x91/0xab
> > [<c06f39f5>] start_kernel+0x37a/0x381
> > =======================
> > Code: 57 56 89 d6 53 89 c3 25 ff 0f 00 00 83 ec 04 8d 3c 08 81 ff 00 10 00 00 0f 86 b2 00 00 00 89 da a1 20 41 77 c0 0f ac f2 0c 89 d3 <0f> a3 10 19 c0 85 c0 0f 85 98 00 00 00 a0 e2 d6 6e c0 88 44 24
> > EIP: [<c040a1e6>] range_straddles_page_boundary+0x2c/0xd9 SS:ESP 0069:c0723e30
> > <0>Kernel panic - not syncing: Fatal exception in interrupt
> > BUG: warning at arch/i386/kernel/smp-xen.c:529/smp_call_function() (Not tainted)
> > [<c041099b>] smp_call_function+0x59/0xfe
> > [<c0410a53>] smp_send_stop+0x13/0x1e
> > [<c041f877>] panic+0x4c/0x171
> > [<c04060a5>] die+0x262/0x296
> > [<c0610a79>] do_page_fault+0xa85/0xbf9
> > [<c060fff4>] do_page_fault+0x0/0xbf9
> > [<c0405597>] error_code+0x2b/0x30
> > [<c040a1e6>] range_straddles_page_boundary+0x2c/0xd9
> > [<c04ebf47>] swiotlb_map_sg+0x100/0x22f
> > [<c040a5aa>] dma_map_sg+0x7d/0x1a9
> > [<ee0c94df>] megasas_make_sgl64+0x7d/0xb5 [megaraid_sas]
> > [<ee0c9d50>] megasas_queue_command+0x23f/0x405 [megaraid_sas]
> > [<ee0834d8>] scsi_done+0x0/0x16 [scsi_mod]
> > [<ee083a7a>] scsi_dispatch_cmd+0x213/0x28c [scsi_mod]
> > [<ee088254>] scsi_request_fn+0x24b/0x305 [scsi_mod]
> > [<c04d88be>] blk_run_queue+0x37/0x63
> > [<ee0873e0>] scsi_next_command+0x25/0x2f [scsi_mod]
> > [<ee0874f7>] scsi_end_request+0xa1/0xab [scsi_mod]
> > [<ee087641>] scsi_io_completion+0x140/0x2ea [scsi_mod]
> > [<c0419591>] __wake_up+0x2a/0x3d
> > [<ee061704>] sd_rw_intr+0x271/0x2b6 [sd_mod]
> > [<c0427cf8>] del_timer+0x41/0x47
> > [<c060f1b8>] _spin_lock_irqsave+0x8/0x28
> > [<ee0833b9>] scsi_finish_command+0x73/0x77 [scsi_mod]
> > [<c04d9183>] blk_done_softirq+0x55/0x60
> > [<c0424563>] __do_softirq+0x8b/0x11c
> > [<c0406e4d>] do_softirq+0x56/0xae
> > [<c04464a8>] __do_IRQ+0x0/0xd6
> > [<c0406f5a>] do_IRQ+0xb5/0xc3
> > [<c054eb85>] evtchn_do_upcall+0xfa/0x191
> > [<c04055d9>] hypervisor_callback+0x3d/0x48
> > [<c0408654>] raw_safe_halt+0x8c/0xaf
> > [<c040321a>] xen_idle+0x22/0x2e
> > [<c0403339>] cpu_idle+0x91/0xab
> > [<c06f39f5>] start_kernel+0x37a/0x381
> > =======================
> > (XEN) Domain 0 crashed: rebooting machine in 5 seconds.
So, since this doesn't seem to be reproducible on demand, I'm going to have a look at the code. I'll start with the traces above:

They all end up in range_straddles_page_boundary+0x2c, which is in arch/i386/kernel/pci-dma-xen.c. In there, range_straddles_page_boundary is trying to look through the contiguous_bitmap to find if this pfn was allocated contiguously (via xen_create_contiguous_region). However, it's here that it takes a page fault, most likely when accessing the contiguous_bitmap variable. That variable is allocated very early on in boot, in arch/i386/mm/init-xen.c:mem_init():

    contiguous_bitmap = alloc_bootmem_low_pages(
        (max_low_pfn + 2*BITS_PER_LONG) >> 3);

Given that, there are 2 possibilities that come to mind:

1) We never use contiguous_bitmap before this point, and somehow we relinquished the memory that it was using. Unlikely.

2) We allocated it, and have used it before, but somebody else stomped on the memory. This seems likely, but one interesting thing is that the addresses we always fault on *look* reasonable; that is, they are above c0000000, and look like reasonable addresses. However, it may not be random corruption, but just another kind of corruption.

We still need to look at it more to find out what is going on.

Chris Lalancette
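For a sense of scale, the allocation expression quoted above can be evaluated in user space. This is only a sketch: the function names are made up for illustration, BITS_PER_LONG is assumed to be 32 (i386), and the max_low_pfn value in the usage note is just an illustrative input.

```c
#include <stdint.h>

/* Assumption: i386, where unsigned long is 32 bits wide. */
#define SKETCH_BITS_PER_LONG 32u

/* Hypothetical helper: bytes requested for the bitmap, mirroring
 * (max_low_pfn + 2*BITS_PER_LONG) >> 3 from mem_init(). */
static uint32_t contiguous_bitmap_bytes(uint32_t max_low_pfn)
{
    return (max_low_pfn + 2u * SKETCH_BITS_PER_LONG) >> 3;
}

/* Hypothetical helper: number of pfns (one bit each) that many bytes
 * can describe. A pfn at or beyond this value indexes past the
 * requested size. */
static uint32_t contiguous_bitmap_bits(uint32_t max_low_pfn)
{
    return contiguous_bitmap_bytes(max_low_pfn) * 8u;
}
```

With an illustrative max_low_pfn of 0x2D7FE, contiguous_bitmap_bytes() yields 0x5B07 bytes, which describes 0x2D838 pfns. Any pfn above that range is outside what the expression asked for.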
OK. I still haven't figured this one out, but I have made progress (with a lot of help from bburns and jburke). So far, I've figured out why we are crashing; I'm just not 100% sure what to do about it at the moment.

So to start with, jburke graciously collected me a core from one of the crashing machines. Looking through that core, the reason we crashed is that we took a page fault at address c01b98c0 in kernel space. Now, the backtrace looks like this:

#0 [c124db04] die at c040607b
#1 [c124db30] do_page_fault at c0611c2c
#2 [c124dba8] error_code (via page_fault) at c0405595
    EAX: c0166000 EBX: 0029c60c ECX: 00002000 EDX: 0029c60c EBP: e5d39cf8
    DS: 007b ESI: 00000002 ES: 007b EDI: 00002000
    CS: 0061 EIP: c040a1e6 ERR: ffffffff EFLAGS: 00010006
#3 [c124dbdc] range_straddles_page_boundary at c040a1e6
#4 [c124dbf4] swiotlb_map_sg at c04ece72
#5 [c124dc34] dma_map_sg at c040a5a5
#6 [c124dc58] megaraid_mbox_mksgl at ee0bc1b2
#7 [c124dc80] megaraid_queue_command at ee0bc6cd
#8 [c124dcdc] scsi_dispatch_cmd at ee0dba77
#9 [c124dcf0] scsi_request_fn at ee0e024f
#10 [c124dd08] __generic_unplug_device at c04d94cc
#11 [c124dd10] __make_request at c04db5ee
#12 [c124dd50] generic_make_request at c04d8642
(snipped for brevity)

You can see that we were in range_straddles_page_boundary, and in particular, we were at "if (test_bit(pfn, contiguous_bitmap))". Now, the reason we took a page fault is that contiguous_bitmap starts at c0166000, and looking at %edx we see that the pfn is 0029c60c. test_bit() is basically a "bt" asm instruction, so if you do the math that bt does, you end up with:

    0x0029c60c / 0x20 = 0x14E30
    0x14E30 * 0x4 = 0x538C0
    0xc0166000 + 0x538C0 = 0xC01B98C0, the address of the page fault.

So, the first thing to recognize is that contiguous_bitmap is only allocated up to max_low_pfn, which in this case only goes up to 0x2D7FE. Obviously 0x29c60c is way above this. However, there is another thing to realize. Here's a little more code from range_straddles_page_boundary:

    if (offset + size <= PAGE_SIZE)
        return 0;
    if (test_bit(pfn, contiguous_bitmap))
        return 0;

From that, it's clear that if you come in here with a pfn > max_low_pfn, it's perfectly fine as long as you don't ask for > 1 page worth of data (because you'll be guarded by the offset + size <= PAGE_SIZE test). Looking at the disassembly again, it's clear that we did come in here asking for too much data; namely, we asked for 0x2000 bytes of data. This is also why it is difficult to reproduce the problem; you have to get a pfn > max_low_pfn, *and* the request has to ask for > 0x1000 bytes of data, which I guess doesn't happen very often.

Now, there are two ways to fix this, as far as I can tell. Which one is correct depends upon one piece of knowledge I don't have.

1. We can increase the contiguous_bitmap to cover all of memory (that is, when we allocate it in init-xen.c, make sure to use max_pfn instead of max_low_pfn). What I'm not sure about is whether this is even logical; does it make sense for xen_create_contiguous_region to allocate and mark memory in higher regions?

2. We can add additional checks in range_straddles_page_boundary(). In particular, if we know that this request is larger than a page, and it's not physically contiguous, and it's above max_low_pfn, it can't possibly be in the contiguous_bitmap, so we are going to have to split the request.

I'm leaning towards 2 as the correct answer, but I'm not entirely sure yet.

Chris Lalancette
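The "bt" arithmetic and the idea behind option 2 can be sketched in user space as follows. This is an illustration under stated assumptions, not the kernel code: every name ending in _sketch, the bt_access_addr helper, the exact bound used for the pfn check, and the final "return 1" are assumptions, and the physical-contiguity test mentioned in option 2 is left out.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_SKETCH 0x1000u

/* Address read when testing bit `pfn` of a bitmap at `bitmap_base`
 * using 32-bit words: base + (pfn / 32) * 4. This mirrors the
 * arithmetic worked through by hand in the comment above. */
static uint32_t bt_access_addr(uint32_t bitmap_base, uint32_t pfn)
{
    return bitmap_base + (pfn / 32u) * 4u;
}

/* Plain C stand-in for the kernel's test_bit(). */
static int test_bit_sketch(uint32_t nr, const uint32_t *bitmap)
{
    return (int)((bitmap[nr / 32u] >> (nr % 32u)) & 1u);
}

/* Hypothetical guarded variant in the spirit of option 2.
 * Returns 0 if the range can be mapped as is, 1 if it must be split. */
static int range_straddles_sketch(uint32_t pfn, uint32_t offset,
                                  uint32_t size, const uint32_t *bitmap,
                                  uint32_t max_low_pfn)
{
    if (offset + size <= PAGE_SIZE_SKETCH)
        return 0;   /* fits in one page, bitmap is never consulted */
    if (pfn >= max_low_pfn)
        return 1;   /* assumed bound: outside the bitmap, do not index it */
    if (test_bit_sketch(pfn, bitmap))
        return 0;   /* marked contiguous */
    return 1;       /* assumption: otherwise the request is split */
}
```

Feeding the register values from the three crashes into bt_access_addr() reproduces the three faulting addresses (c0180c40, c0197b38 and c01b98c0). With the extra bound check, a 0x2000-byte request at pfn 0x29c60c returns 1 without touching the bitmap at all.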
Created attachment 321243 [details]
Patch to completely remove contiguous_bitmap

After I posted the patch upstream to fix this bug, upstream realized that there was no reason to keep the contiguous_bitmap around at all (all of the cases it was handling were the same ones check_pages_physically_contiguous was handling). Therefore, they completely removed it. This patch is a backport of that removal to RHEL-5.

Chris Lalancette
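To illustrate why a direct contiguity check can stand in for the bitmap, here is a user-space model of that kind of check. It is a sketch only: the function name, the p2m array (a stand-in for the guest's pfn to machine-frame mapping) and the example table are all made up for illustration and are not taken from the patch.

```c
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ul << SKETCH_PAGE_SHIFT)

/* Made-up pfn -> mfn table: pfns 0-1 and 2-3 are each backed by
 * adjacent machine frames, but there is a gap between pfn 1 and 2. */
static const unsigned long example_p2m[4] = { 100, 101, 500, 501 };

/* Hypothetical model: returns 1 if every page touched by
 * [offset, offset + length) starting at `pfn` is backed by
 * consecutive machine frames, 0 otherwise. */
static int pages_physically_contiguous_sketch(const unsigned long *p2m,
                                              unsigned long pfn,
                                              unsigned long offset,
                                              size_t length)
{
    unsigned long next_mfn = p2m[pfn];
    unsigned long nr_pages =
        (offset + length + SKETCH_PAGE_SIZE - 1) >> SKETCH_PAGE_SHIFT;
    unsigned long i;

    for (i = 1; i < nr_pages; i++) {
        if (p2m[++pfn] != ++next_mfn)
            return 0;   /* machine frames are not adjacent */
    }
    return 1;
}
```

In this model a two-page request starting at pfn 0 is contiguous, while one starting at pfn 1 is not, because it crosses the gap in the example table. No separate bitmap is needed to reach either answer.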
Created attachment 322272 [details]
Add the check_physically_contiguous call to ia64

This is a patch to add the check_pages_physically_contiguous call on ia64, just like we have for x86. This is necessary to prevent some swiotlb exhaustion on platforms that use the swiotlb extensively. With this in place, it is safe to remove the contiguous bitmap from ia64, i386, and x86_64.
Created attachment 322273 [details]
Updated patch to remove the contiguous_bitmap from Xen completely

An updated version of the patch to completely remove the contiguous_bitmap. Pretty much the same as the last, but re-diffed after the previous "add check_pages_physically_contiguous" patch to ia64.
in kernel-2.6.18-123.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5
*** Bug 454369 has been marked as a duplicate of this bug. ***
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-0225.html