Bug 133905
| Summary: | kernel crash, fatal exception, accessing /proc, EXT3-fs error | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 3 | Reporter: | Tapio Vaattanen <tapio.vaattanen> |
| Component: | kernel | Assignee: | Ernie Petrides <petrides> |
| Status: | CLOSED ERRATA | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 3.0 | CC: | anderson, aviro, dhoward, jhedstro, lwoodman, nixuser, peterm, petrides, riel, sct, tao |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | i686 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2005-05-18 13:28:11 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Step one of Steps to Reproduce should be:

1. On one virtual console something like "while true ; do tar cvf /tmp/proc.tar /proc; done"

The /proc was forgotten from the while loop.

What hardware was this problem seen on? (lspci and lsmod would be helpful.)

HP ProLiant ML350, VMware 3.11 running RHES3.0 on a virtual machine, HP Deskpro. All the hardware on which I tested the loop behaved similarly, without exception. This really isn't hardware related, since the loop example above crashes every system I've tested it on, including VMware virtual machines.

Output of lspci on ML350:

    [root@linux root]# lspci
    00:00.0 Host bridge: ServerWorks CMIC-LE Host Bridge (GC-LE chipset) (rev 33)
    00:00.1 Host bridge: ServerWorks CMIC-LE Host Bridge (GC-LE chipset)
    00:00.2 Host bridge: ServerWorks CMIC-LE Host Bridge (GC-LE chipset)
    00:02.0 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
    00:02.1 SCSI storage controller: Adaptec AHA-3960D / AIC-7899A U160/m (rev 01)
    00:03.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
    00:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5702X Gigabit Ethernet (rev 02)
    00:05.0 System peripheral: Compaq Computer Corporation Advanced System Management Controller
    00:0f.0 ISA bridge: ServerWorks CSB5 South Bridge (rev 93)
    00:0f.1 IDE interface: ServerWorks CSB5 IDE Controller (rev 93)
    00:0f.2 USB Controller: ServerWorks OSB4/CSB5 OHCI USB Controller (rev 05)
    00:0f.3 Host bridge: ServerWorks CSB5 LPC bridge
    00:11.0 Host bridge: ServerWorks CIOB-X2 PCI-X I/O Bridge (rev 05)
    00:11.2 Host bridge: ServerWorks CIOB-X2 PCI-X I/O Bridge (rev 05)
    02:02.0 RAID bus controller: Compaq Computer Corporation Smart Array 64xx (rev 01)

And lsmod on ML350:

    [root@linux root]# lsmod
    Module                  Size  Used by    Not tainted
    parport_pc             18852   1  (autoclean)
    lp                      9124   0  (autoclean)
    parport                38816   1  (autoclean) [parport_pc lp]
    autofs                 13620   0  (autoclean) (unused)
    8021q                  17320   0  (autoclean) (unused)
    tg3                    58312   1
    floppy                 57488   0  (autoclean)
    sg                     37228   0  (autoclean)
    microcode               6848   0  (autoclean)
    st                     31428   0
    keybdev                 2976   0  (unused)
    mousedev                5624   0  (unused)
    hid                    22276   0  (unused)
    input                   6144   0  [keybdev mousedev hid]
    usb-ohci               23176   0  (unused)
    usbcore                80928   1  [hid usb-ohci]
    ext3                   89960   3
    jbd                    55060   3  [ext3]
    cciss                  64032   8
    aic7xxx               162064   0
    sd_mod                 13360   0  (unused)
    scsi_mod              112680   4  [sg st cciss aic7xxx sd_mod]

On my machine (Shuttle; SiS 651 with IDE disks) the problem was in the DMA code. The DMA interface has memory-mapped, read-volatile registers, whereby reading the memory location causes the register to shift to the next batch of data (see ide_end_drive_cmd()). The tar of /proc/kcore was stealing data from the IDE driver.

This is a specific example of a class of problem whereby reading /proc files can have unwelcome side effects. With some hardware, the /proc/bus files could have similar problems.

It can legitimately be argued that this is not a bug. Only the superuser can read the relevant files, and the files do reside in a file system which should be treated with caution. However, these files are not "special files" to utilities like 'find'. Except for their location under /proc, there is no reason to think that reading these files could cause side effects. And to the average system administrator from a UNIX background, the characteristics of the /proc file system may not immediately spring to mind when doing, for example, a spontaneous backup or a search.

There are a number of different remedies for this specific situation: kcore can be made modular with only a little tweaking, or it could skip uncacheable MTRRs by default. But these do not address the larger issue, and since they change the functionality of a long-established file, they could cause problems elsewhere. At the very least, I think a warning in the proc(5) man page is in order.

It turns out that there was a kernel bug in the handling for /proc/kcore that under certain conditions was causing random memory corruption.
A fix for this problem was committed to the RHEL3 U5 patch pool on 28-Jan-2005 (in kernel version 2.4.21-27.10.EL).

Hi, Debby. In response to comment #4, the /proc/kcore driver already has logic to avoid access to mapped regions with the VM_IOREMAP flag set. Do you know of problematic regions that don't use VM_IOREMAP but should?

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html

*** Bug 110890 has been marked as a duplicate of this bug. ***
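
The comments above point out that nothing marks files under /proc as special to utilities like find or tar. As a rough illustration only, a script could check the filesystem type of a path before reading it. This is a minimal sketch, assuming a GNU coreutils `stat` that supports `-f -c %T`; the function name `is_procfs` is hypothetical and is not part of any tool named in this report.

```shell
# Hypothetical helper (not from the bug report): succeeds when the
# given path lives on a procfs mount, fails otherwise.
is_procfs() {
    # GNU stat: -f reports on the filesystem holding the path,
    # and the %T format prints the filesystem type as a name.
    [ "$(stat -f -c %T "$1" 2>/dev/null)" = "proc" ]
}
```

A wrapper around a backup or search could then skip such paths, for example with `is_procfs "$dir" || tar cvf /tmp/backup.tar "$dir"`, where the archive name is only a placeholder.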
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; rv:1.7.3) Gecko/20040913 Firefox/0.10

Description of problem:
Until now, our customers have taken full backups with tar without excluding /proc. It seems that this has caused various kinds of problems: kernel crashes are one, EXT3 errors another. Below is an example of one kernel crash:

---clip---

    EIP is at prune_dcache [kernel] 0*3x (2.4.21-9 ELSmp/i686)
    eax: c03a8658 ebx f1543118 ecx: df2d3b80 edx:00000001
    esi: f1543100 edi: c9abe780 ebp: 0000327a esp:c37d5f88
    ds:0068 es: 0068 ss:0068
    Process kswapd (pid:7, stackpage=c37d5000)
    Stack: df2d3b88 f1543180 c03a3d00 000001f5 00000040 000001d0 c0179ee8 00003f45
           00000040 c015388a 00000006 000001d0 00000014 00000235 0000000 00003e8d
           ffffffff 00000000 c0153a38 000001d0 00000001 000001d0 00000068 c01539d0
    Call Trace:
    [<c0179ee8>] shrink_dcache_memory [kernel] 0x68 (0xc37d5fa0)
    [<c015388a>] do_try_to_free_pages_kswapd [kernel] 0x13a (0xc37d5fac)
    [<c0153a38>] kswapd [kernel] 0x68 (0xc37d5fd0)
    [<c01539d0>] kswapd [kernel] 0x0 (0xc37d5ff0)
    [<c010958d>] kernel_thread_helper [kernel] 0x5 (0xc37d5ff0)
    Code 89 02 89 5b 04 89 1b 8b 46 54 a9 08 00 00 00 00 0 85 4b 01 00
    Kernel panic: Fatal exception

---clip---

We are also constantly seeing EXT3 errors like the following:

---clip---

    EXT3-fs error (device cciss0(104,2)): ext3_readdir: bad entry in directory #368740: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
    EXT3-fs error (device cciss0(104,2)): ext3_readdir: bad entry in directory #368740: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
    EXT3-fs error (device cciss0(104,2)): ext3_readdir: bad entry in directory #368740: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0

---clip---

I reproduced the above on one RHES3.0 test server with IDE drives:

---clip---

    EXT3-fs error (device ide0(3,3)): ext3_readdir: bad entry in directory #8208583: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
    EXT3-fs error (device ide0(3,3)): ext3_readdir: bad entry in directory #8208583: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
    EXT3-fs error (device ide0(3,3)): ext3_readdir: bad entry in directory #8208583: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
    EXT3-fs error (device ide0(3,3)): ext3_readdir: bad entry in directory #8208583: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0

---clip---

And with RHES3.0 running on top of VMware ESX 2.1.2:

---clip---

    EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #16010: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
    EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #16010: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
    EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #16010: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
    EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #16010: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
    EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #16010: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0

---clip---

Version-Release number of selected component (if applicable):
kernel-2.4.21-20.ELsmp

How reproducible:
Always

Steps to Reproduce:
1. On one virtual console something like "while true ; do tar cvf /tmp/proc.tar ; done"
2. On other virtual console "while true; do ls -lR / ; done"
3. Depending on how long step 1 runs, we get either a kernel crash or other problems such as EXT3 errors.

Actual Results:
Either the kernel crashes or we start to get EXT3 errors. In the end, the kernel always crashes.

Expected Results:
Nothing abnormal.
Neither previous versions of Red Hat nor the current FC2 crash or produce any error messages when the steps above are reproduced.

Additional info:
This has been a problem since the first versions of RHES3.0 and can be reproduced with the latest kernel versions. Of course the backup scripts should include "--exclude ./proc", but unfortunately this wasn't the case.
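
The workaround mentioned above, excluding /proc from the tar backup, might look roughly like the following. This is a sketch under stated assumptions, not the reporter's actual backup script: it assumes GNU tar, and the function name `backup_without_proc` and its two arguments (source root and archive path) are hypothetical.

```shell
# Hypothetical wrapper: archive the tree under SRC into ARCHIVE,
# leaving out the top-level proc directory.
backup_without_proc() {
    src=$1
    archive=$2
    # -C changes into SRC so member names are relative ("./etc/...");
    # --exclude=./proc then drops ./proc and everything beneath it.
    tar -C "$src" --exclude=./proc -cf "$archive" .
}
```

For a full backup this would be called as something like `backup_without_proc / /mnt/backup/full.tar`, with the archive stored outside the tree being archived or on a path that is itself excluded; the path shown is only a placeholder. GNU tar also offers `--one-file-system`, which keeps tar on the starting filesystem and therefore skips mounted pseudo-filesystems such as /proc.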