Bug 156396 - System crash when dump or tar 64k blocksize to tape from raid
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: x86_64 Linux
Priority: medium Severity: high
Assigned To: Tom Coughlan
QA Contact: Brian Brock
Depends On:
Blocks: 168424
 
Reported: 2005-04-29 14:43 EDT by Jeff Johnson
Modified: 2007-11-30 17:07 EST (History)

See Also:
Fixed In Version: RHSA-2006-0144
Doc Type: Bug Fix
Last Closed: 2006-03-15 10:57:05 EST


Attachments
gzipped text file of system boot, version information, system config. (7.69 KB, application/octet-stream)
2005-05-02 14:11 EDT, Jeff Johnson
Panic message from kernel (3.02 KB, text/plain)
2005-06-02 03:00 EDT, Sebastien BLAISOT
patch to free write buffer after write completes (1.10 KB, patch)
2005-10-12 13:20 EDT, Tom Coughlan

Description Jeff Johnson 2005-04-29 14:43:58 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

Description of problem:
When running a dump or tar operation with a 64k blocksize from an external SCSI-attached RAID device to a SCSI-attached tape, I get a system crash. When running the exact same process on the UP variant kernel (2.4.21-27.0.2.EL), the operation runs flawlessly.

In order to eliminate possibilities of bugs in specific device drivers I have substituted a number of components in order to utilize different driver modules in the test.

SCSI Cards used: LSI 22320R dual channel U320 SCSI (mptbase/mptscsih), LSI 20320R single channel U320 SCSI (mptbase/mptscsih), Adaptec 39320A-R (aic79xx) 

Tape devices used: Quantum SuperDLT 600 (st), HP LTO (st)

RAID device used: Infortrend A16U (sd_mod)

SCSI configuration: raid and tape sharing bus, raid and tape on separate buses (dual channel scsi card)

Boot options: I have tried defaults as well as "noapic"

Commands used: `dump -0 -b 64 -f /dev/st0 /raid`, `tar -b 128 -cvf /dev/st0 /raid`, `restore -C -b 64 -f /dev/st0`

In all cases the 2.4.21-27.0.2.ELsmp kernel crashes. With the 2.4.21-27.0.2.EL (UP) kernel, the same tests that cause the crashes run without error, and I am able to perform restore/compares without any errors.



Version-Release number of selected component (if applicable):
kernel-2.4.21-27.0.2.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. Install RHEL3-UD4 with 2.4.21-27.0.2.ELsmp and 2.4.21-27.0.2.EL
2. Attach SCSI tape, SCSI raid (ext3fs)
3. Make ext3 fs on raid and mount at /raid
4. Boot SMP kernel
5. `mt -f /dev/st0 status`
6. `dump -0 -b 64 -f /dev/st0 /raid`
7. Observe crash

  

Additional info:

attachment of log files, rpm listings, sysinfo and crash screen soon to come
Comment 1 Ernie Petrides 2005-04-29 15:40:11 EDT
Please attach console output with oops/panic info.
Comment 2 Jeff Johnson 2005-05-02 13:59:51 EDT
dump -0 -b 64 -f /dev/st0 /raid
  DUMP: Date of this level 0 dump: Mon May  2 13:13:02 2005
  DUMP: Dumping /dev/sda1 (/raid) to /dev/st0
  DUMP: Added inode 8 to exclude list (journal inode)
  DUMP: Added inode 7 to exclude list (resize inode)
  DUMP: Label: none
  DUMP: mapping (Pass I) [regular files]
  DUMP: Kernel BUG at pci_dma:43
invalid operand: 0000
CPU 0
Pid: 2202, comm: dump Not tainted
RIP: 0010:[<ffffffff80116b2a>]{pci_map_sg+106}
RSP: 0018:0000010033901c68  EFLAGS: 00010086
RAX: 00000000801fadc0 RBX: 000001003f5ba080 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000010031361000 RDI: 0000010004b30000
RBP: 0000000000000002 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000009
R13: 0000000000000001 R14: 000001003f5ba040 R15: 0000010004b30000
FS:  0000002a969644c0(0000) GS:ffffffff805e1540(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000005a43e0 CR3: 0000000000101000 CR4: 00000000000006e0              
                                                                
Call Trace: [<ffffffff80116ba3>]{pci_map_sg+227}
[<ffffffffa002b41c>]{:aic79xx:ahd_linux_run_device_queue+796}
       [<ffffffffa0026ffb>]{:aic79xx:ahd_linux_queue+731}
       [<ffffffffa00008a0>]{:scsi_mod:scsi_dispatch_cmd+640}
       [<ffffffffa0009bbf>]{:scsi_mod:scsi_request_fn+1039}
       [<ffffffffa0008cff>]{:scsi_mod:__scsi_insert_special+127}
       [<ffffffffa0008d6f>]{:scsi_mod:scsi_insert_special_req+31}
       [<ffffffffa0000bae>]{:scsi_mod:scsi_do_req_Rsmp_6b1beddb+350}
       [<ffffffffa0096310>]{:st:st_sleep_done+0}
[<ffffffffa0096516>]{:st:st_do_scsi+310}
       [<ffffffffa0098089>]{:st:st_write+2121} [<ffffffff8015f452>]{sys_write+178}
       [<ffffffff801102a7>]{system_call+119}
Process dump (pid: 2202, stackpage=10033901000)
Stack: 0000010033901c68 0000000000000018 ffffffff80116ba3 0000000000000246
       000001003f0bf800 0000010004a61480 000001003f5ba040 000001003ffe0000
       000001003f75ae80 0000010004a5a85a ffffffffa002b41c 0000000000000000
       000001003f0bf800 000001003ffe0000 0000000000000000 0000010004a8c800
       0000000000000000 000001003f0bf930 ffffffffa0026ffb 000001003f82ba60
       000001003f0bf800 000001003f0bf800 ffffffffa00008a0 0000000000000282
       0000010033901d78 0000000000000293 000001003f82ba60 000001003f82ba60
       000001003f0bf800 000001003f790400 0000010004a8c800 000001003f790430
       ffffffffa0009bbf 000001000000e588 0000000000000000 000001003f82ba00
       000001003f790430 000001003f82ba00 0000010033901f08 0000010033901f08
Call Trace: [<ffffffff80116ba3>]{pci_map_sg+227}
[<ffffffffa002b41c>]{:aic79xx:ahd_linux_run_device_queue+796}
       [<ffffffffa0026ffb>]{:aic79xx:ahd_linux_queue+731}
       [<ffffffffa00008a0>]{:scsi_mod:scsi_dispatch_cmd+640}
       [<ffffffffa0009bbf>]{:scsi_mod:scsi_request_fn+1039}
       [<ffffffffa0008cff>]{:scsi_mod:__scsi_insert_special+127}
       [<ffffffffa0008d6f>]{:scsi_mod:scsi_insert_special_req+31}
       [<ffffffffa0000bae>]{:scsi_mod:scsi_do_req_Rsmp_6b1beddb+350}
       [<ffffffffa0096310>]{:st:st_sleep_done+0}
[<ffffffffa0096516>]{:st:st_do_scsi+310}
       [<ffffffffa0098089>]{:st:st_write+2121} [<ffffffff8015f452>]{sys_write+178}
       [<ffffffff801102a7>]{system_call+119}
                                                                                
Code: 0f 0b 20 34 2d 80 ff ff ff ff 2b 00 eb 5d 48 8b 4b 08 48 85
                                                                                
Kernel panic: Fatal exception
mapping (Pass IINMI Watchdog detected LOCKUP on CPU0, eip ffffffffa003d3f5,
registers:
CPU 0
Pid: 2202, comm: dump Not tainted
RIP: 0010:[<ffffffffa003d3f5>]{:aic79xx:.text.lock.aic79xx_core+55}
RSP: 0018:ffffffff805e63a8  EFLAGS: 00000086
RAX: 0000010004a7d000 RBX: 000001003ffe0000 RCX: ffffffffa003a4a0
RDX: 000001003ffe0000 RSI: 000001003ffe3180 RDI: 000001003ffe0000
RBP: 000001003ffe0000 R08: 0000000000000003 R09: 0000000000000000
R10: 0000000000000008 R11: 0000000000000010 R12: ffffffff805e63b0
R13: 0000000000000000 R14: 0000000000000000 R15: ffffffff804a1d00
FS:  0000002a969644c0(0000) GS:ffffffff805e1540(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000005a43e0 CR3: 0000000000101000 CR4: 00000000000006e0
                                                                                
Call Trace:  <EOE> [<ffffffff801302ec>]{timer_bh+684}
       [<ffffffff80110807>]{common_interrupt+95} [<ffffffff8012afbf>]{bh_action+79}
       [<ffffffff8012ae6b>]{tasklet_hi_action+139}
[<ffffffff8012ab2e>]{do_softirq+174}
       [<ffffffff80113463>]{do_IRQ+339} [<ffffffff80110807>]{common_interrupt+95}
        <EOI> [<ffffffff801f93bd>]{__make_request+1277}
[<ffffffff801f9347>]{__make_request+1159}
       [<ffffffff801f951b>]{generic_make_request+331}
[<ffffffff801f9591>]{submit_bh_rsector+97}
       [<ffffffff8016089e>]{write_locked_buffers+62}
[<ffffffff80160a24>]{write_some_buffers+372}
       [<ffffffff8012499c>]{call_console_drivers+268}
[<ffffffff80124cd5>]{printk+485}
       [<ffffffff80160a57>]{write_unlocked_buffers+23}
[<ffffffff80160b6e>]{sync_buffers+30}
       [<ffffffff80160cda>]{fsync_dev+10} [<ffffffff80160e1b>]{sys_sync+11}
       [<ffffffff80124148>]{panic+296} [<ffffffff801113c0>]{show_trace+640}
       [<ffffffff801114fd>]{show_stack+205} [<ffffffff80111640>]{show_registers+304}
       [<ffffffff801117ce>]{die+238} [<ffffffff8011192d>]{do_trap+301}
       [<ffffffff80111c76>]{do_invalid_op+166} [<ffffffff80116b2a>]{pci_map_sg+106}
       [<ffffffff8013e53f>]{do_no_page+95} [<ffffffff80110b06>]{error_exit+0}
       [<ffffffffa002b41c>]{:aic79xx:ahd_linux_run_device_queue+796}
       [<ffffffffa0026ffb>]{:aic79xx:ahd_linux_queue+731}
       [<ffffffffa00008a0>]{:scsi_mod:scsi_dispatch_cmd+640}
       [<ffffffffa0009bbf>]{:scsi_mod:scsi_request_fn+1039}
       [<ffffffffa0008cff>]{:scsi_mod:__scsi_insert_special+127}
       [<ffffffffa0008d6f>]{:scsi_mod:scsi_insert_special_req+31}
       [<ffffffffa0000bae>]{:scsi_mod:scsi_do_req_Rsmp_6b1beddb+350}
       [<ffffffffa0096310>]{:st:st_sleep_done+0}
[<ffffffffa0096516>]{:st:st_do_scsi+310}
       [<ffffffffa0098089>]{:st:st_write+2121} [<ffffffff8015f452>]{sys_write+178}
       [<ffffffff801102a7>]{system_call+119}
Process dump (pid: 2202, stackpage=10033901000)
Stack: ffffffff805e63a8 0000000000000018 0000000000000000 0000000000100000
       0000000000000000 0000000000000000 ffffffff803ed6e0 0000000000000001
       0000000000000000 0000000000000000 0000000000000000 0000010004b7a400
       0000000000000042 0000010004b83280 ffffff0000000000 000000fffffff000
       0000000000000000 0000010004b7ba80 0000000000000000 0000000000000000
       0000000100000001 0000000000000001 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000bbeb80 0000000000000001
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
       0000000000000000 0000000000000000 0000000000000000 0000000000000000
Call Trace:  <EOE> [<ffffffff801302ec>]{timer_bh+684}
       [<ffffffff80110807>]{common_interrupt+95} [<ffffffff8012afbf>]{bh_action+79}
       [<ffffffff8012ae6b>]{tasklet_hi_action+139}
[<ffffffff8012ab2e>]{do_softirq+174}
       [<ffffffff80113463>]{do_IRQ+339} [<ffffffff80110807>]{common_interrupt+95}
        <EOI> [<ffffffff801f93bd>]{__make_request+1277}
[<ffffffff801f9347>]{__make_request+1159}
       [<ffffffff801f951b>]{generic_make_request+331}
[<ffffffff801f9591>]{submit_bh_rsector+97}
       [<ffffffff8016089e>]{write_locked_buffers+62}
[<ffffffff80160a24>]{write_some_buffers+372}
       [<ffffffff8012499c>]{call_console_drivers+268}
[<ffffffff80124cd5>]{printk+485}
       [<ffffffff80160a57>]{write_unlocked_buffers+23}
[<ffffffff80160b6e>]{sync_buffers+30}
       [<ffffffff80160cda>]{fsync_dev+10} [<ffffffff80160e1b>]{sys_sync+11}
       [<ffffffff80124148>]{panic+296} [<ffffffff801113c0>]{show_trace+640}
       [<ffffffff801114fd>]{show_stack+205} [<ffffffff80111640>]{show_registers+304}
       [<ffffffff801117ce>]{die+238} [<ffffffff8011192d>]{do_trap+301}
       [<ffffffff80111c76>]{do_invalid_op+166} [<ffffffff80116b2a>]{pci_map_sg+106}
       [<ffffffff8013e53f>]{do_no_page+95} [<ffffffff80110b06>]{error_exit+0}
       [<ffffffff80116b2a>]{pci_map_sg+106} [<ffffffff80116ba3>]{pci_map_sg+227}
                                                                               
       [<ffffffffa002b41c>]{:aic79xx:ahd_linux_run_device_queue+796}
       [<ffffffffa0026ffb>]{:aic79xx:ahd_linux_queue+731}
       [<ffffffffa00008a0>]{:scsi_mod:scsi_dispatch_cmd+640}
       [<ffffffffa0009bbf>]{:scsi_mod:scsi_request_fn+1039}
       [<ffffffffa0008cff>]{:scsi_mod:__scsi_insert_special+127}
       [<ffffffffa0008d6f>]{:scsi_mod:scsi_insert_special_req+31}
       [<ffffffffa0000bae>]{:scsi_mod:scsi_do_req_Rsmp_6b1beddb+350}
       [<ffffffffa0096310>]{:st:st_sleep_done+0}
[<ffffffffa0096516>]{:st:st_do_scsi+310}
       [<ffffffffa0098089>]{:st:st_write+2121} [<ffffffff8015f452>]{sys_write+178}
       [<ffffffff801102a7>]{system_call+119}
                                                                                
Code: f3 90 7e f5 e9 f4 d0 ff ff 90 90 41 56 31 f6 41 55 41 54 49
                                                                                
console shuts up ...

Comment 3 Jeff Johnson 2005-05-02 14:11:44 EDT
Created attachment 113931 [details]
gzipped text file of system boot, version information, system config.
Comment 4 Jeff Johnson 2005-05-02 20:38:03 EDT
* New information *

With the exact same hardware and software configuration I added the RHEL4/x86_64
kernel (2.6.9-5.ELsmp) and I am able to run successful dumps, restores and
post-restore file compares. No errors at all. 

It would appear that the cause of the crashes during dump operations are due to
something in 2.4.21-27.0.2.ELsmp that has been fixed in 2.6.9-5.ELsmp.

Information from kernel performing flawless backups and restores:
uname -a: Linux localhost.localdomain 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:29:47 EST
2005 x86_64 x86_64 x86_64 GNU/Linux
/proc/cmdline: ro root=LABEL=/ console=tty0 console=ttyS0,9600n8
lsmod:
Module                  Size  Used by
md5                     5697  1
ipv6                  280993  8
usbserial              33329  0
lp                     15089  0
button                  9057  0
autofs4                23241  0
e1000                  92749  0
floppy                 65809  0
sg                     42489  0
parport_pc             29313  0
parport                43981  2 lp,parport_pc
ohci_hcd               24273  0
st                     41957  0
ext3                  137297  4
jbd                    68849  1 ext3
aic79xx               178237  2
sd_mod                 19393  4
scsi_mod              140177  4 sg,st,aic79xx,sd_mod
Comment 5 Jeff Johnson 2005-05-04 13:48:06 EDT
Is running a Redhat released 2.6 variant kernel under RHEL3 a supported action?
Comment 6 Ernie Petrides 2005-05-04 15:43:09 EDT
Jeff, answer is "no".
Comment 7 Jeff Johnson 2005-05-04 17:13:05 EDT
Ernie,

Okay, then given that the bug exists in 2.4.21-27.0.2.ELsmp, how do
I obtain a solution to the bug that is also "supported"?
Comment 8 Tom Coughlan 2005-05-04 17:58:49 EDT
We are going to need to fix the 2.4 kernel in RHEL 3. It is important for you to
file a problem report through the Red Hat support organization, to ensure that
this issue is prioritized correctly and gets attention from the correct
maintainers.  Bugzilla is an informal, less reliable, way to access support.

The root of the problem is "Kernel BUG at pci_dma:43". I will look in to this.
Comment 11 Jeff Johnson 2005-05-06 16:32:06 EDT
After a very long phone call with RH support they informed me that since my RHEL
license is "Educational" I am not entitled to submit a bug through RH Support.
They redirected me to this site and asked that I make it clear that this is the
avenue I have to take to submit this problem to Redhat for a fix.

Please escalate this within the Redhat organization in any way that you are able.

Thank You
Comment 12 Tom Coughlan 2005-05-06 17:19:48 EDT
Okay, sorry to send you off on an unproductive route. 

The crash is in the following code:

./arch/x86_64/kernel/pci-dma.c

int pci_map_sg(struct pci_dev *hwdev, struct scatterlist *sg,
			     int nents, int direction)
{
	int i;

	BUG_ON(direction == PCI_DMA_NONE);
 
 	/*
 	 * temporary 2.4 hack
 	 */
 	for (i = 0; i < nents; i++ ) {
		struct scatterlist *s = &sg[i];
		int flush = (i == nents-1);
		void *addr = s->address; 
		if (addr) 
			BUG_ON(s->page || s->offset); /* <=== line 43: the BUG_ON that fired */
		else if (s->page)
			addr = page_address(s->page) + s->offset; 
		else
			BUG(); 
		s->dma_address = __pci_map_single(hwdev, addr, s->length,
						  direction, flush); 
		if (unlikely(s->dma_address == bad_dma_address))
			goto error; 
 	}

------------------

Looks as though a driver is passing an illegal scatter-gather list. We will
investigate.
Comment 13 Sebastien BLAISOT 2005-06-02 02:59:20 EDT
Hi,

I'm having panic problems here too with an x86_64 machine with an Adaptec
aic79xx SCSI adapter, when trying to back up to tape. Maybe related.

I'm attaching my panic information.
Comment 14 Sebastien BLAISOT 2005-06-02 03:00:27 EDT
Created attachment 115075 [details]
Panic message from kernel
Comment 15 Sebastien BLAISOT 2005-06-02 03:02:33 EDT
In fact, the bug first appeared for me with kernel 2.4.21-27.0.4. It happens
almost every time I try to make a tape backup with kernels 2.4.21-27.0.4 through
2.4.21-32.0.1. It has happened only once with kernel 2.4.21-27.0.2. I am not sure
why.

If this can help...
Comment 17 Tom Coughlan 2005-10-12 13:20:34 EDT
Created attachment 119845 [details]
patch to free write buffer after write completes

This bug appears to be a duplicate of BZ 162212. Please try the attached patch.
Let us know the result.
Comment 18 Josef Pfeiffer 2005-10-12 14:10:58 EDT
I have already verified a fix through the ET system.
Comment 19 Jeff Johnson 2005-10-12 14:19:08 EDT
Josef.. Can you elaborate on the fix you verified? Was it a patch or a compiled
kernel rpm? Who/where was the source of the fix?

Thanks
Comment 20 Josef Pfeiffer 2005-10-12 14:21:37 EDT
I worked with Chris Vanhoof who supplied a test rpm and I am currently waiting
for an officially supported rpm that we can supply to Bank of America.  You can
get more info in the ET #75570.

Thanks
Comment 21 Ernie Petrides 2005-10-13 17:25:13 EDT
Jeff, please try the patch in comment #17 to verify whether we've
fixed the problem.

Reverting to NEEDINFO.
Comment 22 Jeff Johnson 2005-10-13 21:03:22 EDT
I will test to find out if it fixes the problem. Management is refusing anything
except a fix embedded in an officially supported RHEL kernel RPM update.
Comment 23 Josef Pfeiffer 2005-10-14 11:26:39 EDT
I will post a link with an officially supported RHEL kernel RPM update once I
receive one from Chris Vanhoof.  We are expecting one soon.
Comment 24 Matthew Micene 2005-10-17 10:08:59 EDT
We have been seeing a very similar kernel oops using Veritas NetBackup 5.1 over
a QLA2312 to a fiber attached SDLT220 library.  I would be very interested in
seeing a test kernel released in a beta channel perhaps.
Comment 27 Tom Coughlan 2005-10-17 10:47:29 EDT
Matthew,

Please try the test kernel located here:

http://people.redhat.com/dledford/st_tape_test/x86_64/

Please note that this is strictly a test kernel. Not supported. If the testing
goes well, the fix will be in U7. If it is committed and you need it sooner,
contact the Red Hat support organization.
Comment 31 Ernie Petrides 2005-11-03 14:52:14 EST
Hi, Jeff.  Have you had a chance to try the patch in comment #17 yet?
Comment 33 Ernie Petrides 2005-11-03 15:32:25 EST
A fix for this problem was committed to the RHEL3 U7 patch pool
on 19-Oct-2005 (in kernel version 2.4.21-37.6.EL).

Propagating acks from bug 162212.


*** This bug has been marked as a duplicate of 162212 ***
Comment 34 Red Hat Bugzilla 2006-03-15 10:57:05 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2006-0144.html
