Bug 133905 - kernel crash, fatal exception, accessing /proc, EXT3-fs error
Summary: kernel crash, fatal exception, accessing /proc, EXT3-fs error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Ernie Petrides
QA Contact:
URL:
Whiteboard:
Duplicates: 110890
Depends On:
Blocks:
 
Reported: 2004-09-28 13:34 UTC by Tapio Vaattanen
Modified: 2007-11-30 22:07 UTC (History)
11 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-05-18 13:28:11 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2005:294 0 normal SHIPPED_LIVE Moderate: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 5 2005-05-18 04:00:00 UTC

Description Tapio Vaattanen 2004-09-28 13:34:57 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; rv:1.7.3)
Gecko/20040913 Firefox/0.10

Description of problem:
Until now, our customers have taken full backups with tar without
excluding /proc. It seems that this has caused various kinds of
problems: kernel crashes are one of them, EXT3 errors another.

Below is an example of one kernel crash:

--clip---

EIP is at prune_dcache [kernel] 0*3x (2.4.21-9 ELSmp/i686)
eax: c03a8658 ebx: f1543118 ecx: df2d3b80 edx: 00000001 esi: f1543100
edi: c9abe780 ebp: 0000327a esp: c37d5f88 ds: 0068 es: 0068 ss: 0068
Process kswapd (pid: 7, stackpage=c37d5000)
Stack: df2d3b88 f1543180 c03a3d00 000001f5 00000040 000001d0 c0179ee8
00003f45 00000040 c015388a 00000006 000001d0 00000014 00000235 0000000
00003e8d ffffffff 00000000 c0153a38 000001d0 00000001 000001d0
00000068 c01539d0
Call Trace: [<c0179ee8>] shrink_dcache_memory [kernel] 0x68 (0xc37d5fa0)
 
[<c015388a>] do_try_to_free_pages_kswapd [kernel] 0x13a (0xc37d5fac)
[<c0153a38>] kswapd [kernel] 0x68 (0xc37d5fd0)
[<c01539d0>] kswapd [kernel] 0x0 (0xc37d5ff0)
[<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc37d5ff0)
 
Code 89 02 89 5b 04 89 1b 8b 46 54 a9 08 00 00 00 00 0 85 4b 01 00
 
Kernel panic: Fatal exception

--clip--

We are also constantly seeing EXT3 errors like the ones below:

---clip---

EXT3-fs error (device cciss0(104,2)): ext3_readdir: bad entry in
directory #368740: rec_len is smaller than minimal - offset=0,
inode=0, rec_len=0, name_len=0
EXT3-fs error (device cciss0(104,2)): ext3_readdir: bad entry in
directory #368740: rec_len is smaller than minimal - offset=0,
inode=0, rec_len=0, name_len=0
EXT3-fs error (device cciss0(104,2)): ext3_readdir: bad entry in
directory #368740: rec_len is smaller than minimal - offset=0,
inode=0, rec_len=0, name_len=0

---clip---

I reproduced the above with one RHES3.0 test server with IDE drives:

---clip---
EXT3-fs error (device ide0(3,3)): ext3_readdir: bad entry in directory
#8208583: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0
EXT3-fs error (device ide0(3,3)): ext3_readdir: bad entry in directory
#8208583: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0
EXT3-fs error (device ide0(3,3)): ext3_readdir: bad entry in directory
#8208583: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0
EXT3-fs error (device ide0(3,3)): ext3_readdir: bad entry in directory
#8208583: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0
---clip--

And with RHES3.0 running on top of VMware ESX 2.1.2:

---clip---

EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory
#16010: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory
#16010: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory
#16010: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory
#16010: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory
#16010: rec_len is smaller than minimal - offset=0, inode=0,
rec_len=0, name_len=0

---clip---


Version-Release number of selected component (if applicable):
kernel-2.4.21-20.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. On one virtual console something like "while true ; do tar cvf
/tmp/proc.tar ; done"
2. On other virtual console "while true; do ls -lR / ; done"
3. Depending on how long step 1 runs, the result is either a kernel
crash or other problems such as EXT3 errors.
    

Actual Results:  Either the kernel crashes or we start getting EXT3
errors. In the end, the kernel always crashes.

Expected Results:  Nothing abnormal. Neither previous versions of Red
Hat nor the current FC2 crash or produce any error messages when the
steps above are run.

Additional info:

This has been a problem since the first versions of RHES3.0. It can be
reproduced with the latest kernel versions. Of course the backup scripts
should include "--exclude ./proc", but unfortunately this wasn't the case.
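
As an illustration of the "--exclude ./proc" remark, here is a minimal
sketch of a backup invocation that skips the top-level /proc. It assumes
GNU tar; backup_tree and its two arguments are hypothetical names used
only for this example, not part of any existing backup script.

```shell
#!/bin/sh
# Sketch only: archive a directory tree while skipping its ./proc.
# backup_tree is a hypothetical helper name, not an existing tool.
backup_tree() {
    src=$1      # root of the tree to back up, e.g. /
    archive=$2  # output tar file; keep it outside $src
    # -C changes into $src first, so member names start with "./"
    # and the pattern ./proc lines up with the top-level directory.
    tar -C "$src" --exclude=./proc -cf "$archive" .
}
```

For example, "backup_tree / /mnt/backup/full.tar" would archive the root
tree without descending into /proc, assuming /mnt/backup is a separate
location that is not meant to be archived itself.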

Comment 1 Tapio Vaattanen 2004-09-28 13:39:56 UTC
Step one of Steps to Reproduce should be:

1. On one virtual console something like "while true ; do tar cvf
/tmp/proc.tar /proc; done"

The /proc was forgotten from the while loop.



Comment 2 Debby Townsend 2004-11-18 21:10:02 UTC
What hardware was this problem seen on? (lspci and lsmod would be
helpful.)

Comment 3 Tapio Vaattanen 2004-11-22 11:58:32 UTC
HP ProLiant ML350, VMware 3.11 running RHES3.0 in a virtual machine, HP
Deskpro. Every piece of hardware on which I tested the loop showed
similar behaviour, with no exceptions. This really isn't hardware
related, since the loop example above crashes all the systems I've
tested it on, including VMware virtual machines.

Output of lspci on ML350:

[root@linux root]# lspci 
00:00.0 Host bridge: ServerWorks CMIC-LE Host Bridge (GC-LE chipset)
(rev 33)
00:00.1 Host bridge: ServerWorks CMIC-LE Host Bridge (GC-LE chipset)
00:00.2 Host bridge: ServerWorks CMIC-LE Host Bridge (GC-LE chipset)
00:02.0 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m
(rev 01)
00:02.1 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m
(rev 01)
00:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
00:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5702X
Gigabit Ethernet (rev 02)
00:05.0 System peripheral: Compaq Computer Corporation Advanced System
Management Controller
00:0f.0 ISA bridge: ServerWorks CSB5 South Bridge (rev 93)
00:0f.1 IDE interface: ServerWorks CSB5 IDE Controller (rev 93)
00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 05)
00:0f.3 Host bridge: ServerWorks CSB5 LPC bridge
00:11.0 Host bridge: ServerWorks CIOB-X2 PCI-X I/O Bridge (rev 05)
00:11.2 Host bridge: ServerWorks CIOB-X2 PCI-X I/O Bridge (rev 05)
02:02.0 RAID bus controller: Compaq Computer Corporation Smart Array
64xx (rev 01)

And lsmod on ML350:

[root@linux root]# lsmod 
Module                  Size  Used by    Not tainted
parport_pc             18852   1  (autoclean)
lp                      9124   0  (autoclean)
parport                38816   1  (autoclean) [parport_pc lp]
autofs                 13620   0  (autoclean) (unused)
8021q                  17320   0  (autoclean) (unused)
tg3                    58312   1 
floppy                 57488   0  (autoclean)
sg                     37228   0  (autoclean)
microcode               6848   0  (autoclean)
st                     31428   0 
keybdev                 2976   0  (unused)
mousedev                5624   0  (unused)
hid                    22276   0  (unused)
input                   6144   0  [keybdev mousedev hid]
usb-ohci               23176   0  (unused)
usbcore                80928   1  [hid usb-ohci]
ext3                   89960   3 
jbd                    55060   3  [ext3]
cciss                  64032   8 
aic7xxx               162064   0 
sd_mod                 13360   0  (unused)
scsi_mod              112680   4  [sg st cciss aic7xxx sd_mod]




Comment 4 Debby Townsend 2004-12-16 23:55:52 UTC
On my machine (Shuttle; SiS 651 with IDE disks) the problem was in the
DMA code. The DMA interface has memory-mapped, read-volatile registers:
reading the memory location causes the register to shift to the next
batch of data (see ide_end_drive_cmd()). The tar of /proc/kcore was
stealing data from the IDE driver.

This is a specific example of a class of problem whereby reading /proc
files can have unwelcome side effects. With some hardware, the
/proc/bus files could have similar problems.

It can legitimately be argued that this is not a bug. Only the
superuser can read the relevant files, and the files do reside in a
file system which should be treated with caution.

However, these files are not "special files" to utilities 
like 'find'. Except for their location under /proc there is no reason 
to think that reading these files could cause side-effects. And to 
the average system administrator from a UNIX background, the 
characteristics of the /proc file-system may not immediately spring 
to mind when doing, for example, a spontaneous backup or a search. 

There are a number of different remedies for this specific situation:
kcore can be made modular with only a little tweaking, or it could skip
uncacheable MTRRs by default. But these do not address the larger
issue, and since they change the functionality of a long-established
file, they could cause problems elsewhere.

At the very least, I think a warning in the proc(5) man page is in 
order. 
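
Since nothing but their location marks these files as unusual, a script
that walks the file system could check the file-system type before
reading. The following is a rough sketch, assuming GNU coreutils stat
(its -f and -c %T options); fs_type and is_procfs are hypothetical
helper names introduced here for illustration.

```shell
#!/bin/sh
# Sketch only. fs_type is a hypothetical helper: print the type that
# GNU stat reports for the file system containing the given path.
fs_type() {
    stat -f -c %T "$1"
}

# is_procfs (also hypothetical) succeeds when the path lives on a
# proc file system, so a caller can skip it before reading anything.
is_procfs() {
    [ "$(fs_type "$1")" = "proc" ]
}
```

A backup or search script could then run something like
'is_procfs "$dir" || tar cf /tmp/out.tar "$dir"'. For whole-tree walks,
find's -xdev and GNU tar's --one-file-system options serve a similar
purpose by staying on the file system where the walk started.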


Comment 6 Ernie Petrides 2005-01-31 22:13:40 UTC
It turns out that there was a kernel bug in the handling for /proc/kcore
that under certain conditions was causing random memory corruption.  A fix
for this problem was committed to the RHEL3 U5 patch pool on 28-Jan-2005
(in kernel version 2.4.21-27.10.EL).


Comment 9 Ernie Petrides 2005-02-03 00:47:03 UTC
Hi, Debby.  In response to comment #4, the /proc/kcore driver already has
logic to avoid access to mapped regions with the VM_IOREMAP flag set.  Do
you know of problematic regions that don't use VM_IOREMAP but should?

Comment 10 Tim Powers 2005-05-18 13:28:11 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html


Comment 11 Ernie Petrides 2005-10-06 01:30:51 UTC
*** Bug 110890 has been marked as a duplicate of this bug. ***

