Bug 69994 - kernel assertion failure on rm leads to system lockup
Status: CLOSED WORKSFORME
Product: Red Hat Linux
Classification: Retired
Component: kernel
Version: 7.3
Hardware/OS: i386 Linux
Priority: medium   Severity: high
Assigned To: Arjan van de Ven
QA Contact: Brian Brock
Reported: 2002-07-28 02:57 EDT by Need Real Name
Modified: 2007-04-18 12:44 EDT

Doc Type: Bug Fix
Last Closed: 2002-07-29 02:29:16 EDT

Description Need Real Name 2002-07-28 02:57:27 EDT
Description of Problem:

While logged in over ssh as root, I ran:
rm -f "/backup"/*
where /backup contained a few tar files. The rm command immediately segfaulted.

Immediately after the segfault, I got this message from syslog:
Message from syslogd@octopus at Sun Jul 28 17:38:39 2002 ...
octopus kernel: Assertion failure in journal_unmap_buffer() at transaction.c:1849: "transaction == journal->j_running_transaction"

The terminal then stopped responding (I had to kill the ssh connection).

Now the box responds to ping and accepts TCP connections to ssh and httpd, but
the applications listening on those ports do not respond.

Not sure at this stage whether the file system has been damaged. Will drive out
to the hosting facility tomorrow to check (then reinstall on a different drive).


Version-Release number of selected component (if applicable):

kernel-2.4.18-5
Other packages are probably up to date; the box was recently installed and synced with updates.redhat.com.


How Reproducible:

Have not tried to reproduce since I am no longer able to log in over ssh.


Additional Information:

/ is an ext3 file system.
18 GB Seagate IDE drive.
Comment 1 Need Real Name 2002-07-28 03:02:42 EDT
Not sure if this is important, but one of the tar files in /backup was approximately 3.1 GB in size.
Comment 2 Arjan van de Ven 2002-07-28 05:07:36 EDT
Is this an SMP machine?
Have you tried the errata kernel that fixes some ext3 issues?
Comment 3 Need Real Name 2002-07-28 05:14:15 EDT
Single processor (1 GHz Pentium III).
Running the 2.4.18-5 kernel.
Comment 4 Need Real Name 2002-07-28 09:43:15 EDT
Is there any debug info I should get off the system before I reset it later today? Anything that might be of use to you (assuming it's still accepting keyboard input)? If so, please explain how; I haven't dealt with post-crash data before.
Comment 5 Stephen Tweedie 2002-07-28 16:55:38 EDT
At the very least, please include the complete "oops" trace from the logs. You can attach it to the bug report.
Comment 6 Need Real Name 2002-07-29 02:01:10 EDT
OK, got the drive back (swapped it for another). Syslog shows:

Jul 28 17:38:39 octopus kernel: Assertion failure in journal_unmap_buffer() at transaction.c:1849: "transaction == journal->j_running_transaction"
Jul 28 17:38:39 octopus kernel: ------------[ cut here ]------------
Jul 28 17:38:39 octopus kernel: kernel BUG at transaction.c:1849!
Jul 28 17:38:39 octopus kernel: invalid operand: 0000
Jul 28 17:38:39 octopus kernel: st aic7xxx scsi_mod sis900 autofs ide-cd cdrom usb-ohci usbcore ext3 jbd
Jul 28 17:38:39 octopus kernel: CPU:    0
Jul 28 17:38:39 octopus kernel: EIP:    0010:[<df80eacd>]    Not tainted
Jul 28 17:38:39 octopus kernel: EFLAGS: 00010282
Jul 28 17:38:39 octopus kernel: 
Jul 28 17:38:39 octopus kernel: EIP is at journal_unmap_buffer [jbd] 0x119 (2.4.18-5)
Jul 28 17:38:39 octopus kernel: eax: 00000022   ebx: 48435441   ecx: 00000001   edx: 00001ed7
Jul 28 17:38:39 octopus kernel: esi: d8c09750   edi: cd34b320   ebp: 00000001   esp: c8b0de58
Jul 28 17:38:39 octopus kernel: ds: 0018   es: 0018   ss: 0018
Jul 28 17:38:39 octopus kernel: Process rm (pid: 27760, stackpage=c8b0d000)
Jul 28 17:38:40 octopus kernel: Stack: df815810 00000739 dc66f3a0 00000003 d7753d00 00000000 cd34b320 00000000
Jul 28 17:38:40 octopus kernel:        cd34b320 00001000 df80ec1b dee6c000 cd34b320 00000001 cd34b320 c106e3c8
Jul 28 17:38:40 octopus kernel:        00000000 c106e3c8 cd39b304 df81e3bc dee6c000 c106e3c8 00000000 c0126a0d
Jul 28 17:38:40 octopus kernel: Call Trace: [<df815810>] .rodata.str1.1 [jbd] 0x10 
Jul 28 17:38:40 octopus kernel: [<df80ec1b>] journal_flushpage_R8438cea2 [jbd] 0xa3 
Jul 28 17:38:40 octopus kernel: [<df81e3bc>] ext3_flushpage [ext3] 0x20 
Jul 28 17:38:40 octopus kernel: [<c0126a0d>] do_flushpage [kernel] 0x19 
Jul 28 17:38:40 octopus kernel: [<c0126a4e>] truncate_complete_page [kernel] 0x2e 
Jul 28 17:38:40 octopus kernel: [<c0126c12>] truncate_list_pages [kernel] 0x192 
Jul 28 17:38:40 octopus kernel: [<c0126c97>] truncate_inode_pages [kernel] 0x3b 
Jul 28 17:38:40 octopus kernel: [<df8297a0>] ext3_sops [ext3] 0x0 
Jul 28 17:38:40 octopus kernel: [<c0148772>] iput [kernel] 0xb2 
Jul 28 17:38:40 octopus kernel: [<c0146d72>] d_delete [kernel] 0x4e 
Jul 28 17:38:40 octopus kernel: [<c01401ca>] vfs_unlink [kernel] 0x162 
Jul 28 17:38:40 octopus kernel: [<c0140285>] sys_unlink [kernel] 0x85 
Jul 28 17:38:40 octopus kernel: [<c01131d0>] do_page_fault [kernel] 0x0 
Jul 28 17:38:40 octopus kernel: [<c01085f7>] system_call [kernel] 0x33 
Jul 28 17:38:40 octopus kernel: 
Jul 28 17:38:40 octopus kernel: 
Jul 28 17:38:40 octopus kernel: Code: 0f 0b 59 5d 53 56 e8 88 fe ff ff 89 c5 58 5a 8b 47 18 a9 02

Before shutting down, I tried to do an ls /, but it hung.
Comment 7 Need Real Name 2002-07-29 02:29:12 EDT
*sigh*. While configuring the new install, it locked up again. This time there was no obvious segfault or syslog output, but otherwise the same symptoms are present (TCP and ICMP up, but daemons not responding).

There may be a hardware fault, possibly the motherboard; we'll have it replaced tomorrow.
Comment 8 Stephen Tweedie 2002-07-29 10:54:49 EDT
OK, if you make any progress identifying a hardware or software fault here,
please just add it to this bug report and we can reopen it.
Comment 9 Need Real Name 2002-07-29 19:31:03 EDT
From the most recent crash:

Jul 29 18:28:36 octopus kernel: Unable to handle kernel NULL pointer dereference at virtual address 0000006d
Jul 29 18:28:36 octopus kernel:  printing eip:
Jul 29 18:28:36 octopus kernel: c0133fa4
Jul 29 18:28:36 octopus kernel: *pde = 00000000
Jul 29 18:28:36 octopus kernel: Oops: 0000
Jul 29 18:28:36 octopus kernel: aic7xxx scsi_mod autofs sis900 ide-cd cdrom usbcore ext3 jbd
Jul 29 18:28:36 octopus kernel: CPU:    0
Jul 29 18:28:36 octopus kernel: EIP:    0010:[<c0133fa4>]    Not tainted
Jul 29 18:28:36 octopus kernel: EFLAGS: 00010202
Jul 29 18:28:36 octopus kernel: 
Jul 29 18:28:36 octopus kernel: EIP is at pte_chain_alloc [kernel] 0x1c (2.4.18-5)
Jul 29 18:28:36 octopus kernel: eax: 00000001   ebx: c02cdd04   ecx: d4df51f0   edx: 0000006d
Jul 29 18:28:36 octopus kernel: esi: d4df51f0   edi: c012865c   ebp: 0807c040   esp: d4a61e94
Jul 29 18:28:36 octopus kernel: ds: 0018   es: 0018   ss: 0018
Jul 29 18:28:36 octopus kernel: Process sshd (pid: 12139, stackpage=d4a61000)
Jul 29 18:28:36 octopus kernel: Stack: c1610250 c0133c2f c02cdd04 c1610250 00000000 c0124ee0 d4a5f080 0807c040
Jul 29 18:28:36 octopus kernel:        dd37f250 0807c040 c0124fde dd37f250 d4eb7320 0807c040 00000000 d4df51f0
Jul 29 18:28:36 octopus kernel:        d4a5f080 d4ebe964 d4a61f58 00004000 00000040 d4a61f0c c103400c c02cdd44
Jul 29 18:28:36 octopus kernel: Call Trace: [<c0133c2f>] page_add_rmap [kernel] 0x37
Jul 29 18:28:36 octopus kernel: [<c0124ee0>] do_no_page [kernel] 0x160 
Jul 29 18:28:36 octopus kernel: [<c0124fde>] handle_mm_fault [kernel] 0xde 
Jul 29 18:28:36 octopus kernel: [<c01132f3>] do_page_fault [kernel] 0x123 
Jul 29 18:28:36 octopus kernel: [<c01c739a>] sock_read [kernel] 0x86 
Jul 29 18:28:36 octopus kernel: [<c01131d0>] do_page_fault [kernel] 0x0 
Jul 29 18:28:36 octopus kernel: [<c01086e8>] error_code [kernel] 0x34 
Jul 29 18:28:36 octopus kernel: 
Jul 29 18:28:36 octopus kernel: 
Jul 29 18:28:36 octopus kernel: Code: 8b 02 89 83 bc 00 00 00 c7 02 00 00 00 00 89 d0 5b c3 89 f6


Testing the RAM now; bzip2 was throwing internal error 1007, which is most often caused by faulty RAM.
Comment 10 Need Real Name 2002-07-29 21:07:42 EDT
memtest86 showed several errors on the single 512 MB RAM stick. Once it's replaced, we'll burn the box in for about a week and see how it goes. Thanks for looking at this.