Bug 145563
Summary: tar crashes DELL server every 4th day.
Product: Red Hat Enterprise Linux 3
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: medium
Reporter: Dante Alzamora <dac>
Assignee: Dave Anderson <anderson>
CC: jburke, peterm, petrides, riel
Doc Type: Bug Fix
Last Closed: 2005-05-18 13:29:09 UTC

Description
Dante Alzamora
2005-01-19 19:22:40 UTC
Created attachment 109983 [details]
The last 5 top pages before the crash
Note that top shows tar starting and then moving to the bottom.
The machine shows a load because of 3 monitoring processes that capture the state
of the machine. Otherwise the system should be idle.
The crash occurred right after tar started sending data to tape.
Created attachment 109984 [details]
vmstat -m 1 5 (right before the crash)
Created attachment 109985 [details]
slabtop --sort=c -o
I have other slabtop outputs with various switches and even repeated ones
previous to this one in case you want to see them.
Hello, Dante. Please attach the console oops output and/or panic message, which you might need to capture with a serial console.

By the way, if the oops is at __audit_get_target()+0x1f6, then the problem has already been fixed in the latest security errata (2.4.21-27.0.2.EL), which was released last night. If not, then we'll need the oops output to investigate further.

Thanks in advance. -ernie

Ernie,

I am new to kernel debugging, so please bear with me. I do not get a panic or oops message on the screen or in any log file (including dmesg). How can I plan for a crash, or at least capture the oops message?

Do I just need to attach a dumb terminal (or another computer via null modem) to the serial port to capture the message? Do I need to reboot the machine with this terminal attached so that it is treated as the console? Do I need to turn on kernel variables (via /proc or by recompiling the kernel)?

We upgraded the system to 2.4.21-27.0.2 last night. Hopefully I had the __audit_get_target problem.

By the way, here's some info I left out:
hardware: Dell PowerEdge 2600
SCSI Tape controller: Adaptec Controller Dual Channel

Again, let me know how I can plan for a crash.

Thanks,
Dante

Bad news. The system crashed again. It stayed up for only 1 day. It is now running 2.4.21-27.0.2.ELsmp.

Something weird happened: the tar backup normally takes 1 hour and 5 minutes. Last night the backup started at 9:00 PM and the system crashed at 01:41:30 AM. The strange thing is that the backup never finished (it logs its start and end to a file; the start was logged but not the end). You can actually see tar still running in the top output I captured; it had accumulated about 3 hours of CPU time.

Again, there are no oops or panic messages anywhere.

Thanks,

Created attachment 110049 [details]
last 5 tops before crash 2005-01-21 @ 01:41:30 AM
Created attachment 110050 [details]
vmstat before the crash 2005-01-21 @ 01:41:30 AM
Dante,

Without the oops trace we cannot really determine what has happened. However, we have been debugging an issue that is "bumped into" by running tar on relatively small memory systems. To rule that out, or hopefully prove that it's the same issue, please install and test the appropriate kernel from this location:

http://people.redhat.com/~petrides/.kcore/kernel-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-smp-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
http://people.redhat.com/~petrides/.kcore/kernel-hugemem-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm

The kernel source package contains this file:

./Documentation/serial-console.txt

And there's an even more comprehensive web page here:

http://www.faqs.org/docs/Linux-HOWTO/Remote-Serial-Console-HOWTO.html

The amount of information is somewhat overwhelming, but for all practical purposes the easiest approach is to attach a null-modem cable to either of your serial ports (/dev/ttyS0 or /dev/ttyS1), and add a console specifier to the appropriate "kernel" line in your /boot/grub/grub.conf. For example, here's a change from an original kernel line:

    kernel /vmlinuz-2.4.9-21 ro root=/dev/hda6

to add console output from /dev/ttyS0, like this:

    kernel /vmlinuz-2.4.9-21 ro root=/dev/hda6 console=tty0 \
    console=ttyS0,9600n8

Your line will obviously be different, but the point is to add "console=tty0 console=ttyS0,9600n8" onto the end of the line.

The null-modem cable can be plugged into a dumb terminal, but to save the full output, it makes more sense to plug it into a serial port on another system, and then run "minicom" on that system to capture the output. Configure "minicom" to have a large scroll buffer.

Alternatively you can configure a netdump-server machine, preferably on the same subnet as the panicking machine. To do so, simply do a "service netdump-server start" on the selected Red Hat server, presuming that the netdump-server user package has been pre-installed.
Also, you must create a password for user "netdump" on the server. There should be enough space in the /var/crash partition to hold a file the size of the panicking client's memory.

Back on the client (panicking machine), edit /etc/sysconfig/netdump, and simply enter the IP address of the netdump-server as the "NETDUMPADDR" configuration item. Then, do this just one time:

    $ service netdump propagate

This will register the client with the netdump-server; it will ask for the password you created on the netdump-server machine. Then, upon each boot of the client, enter:

    $ service netdump start

Then, upon the next panic, a vmcore will be created in a subdirectory in /var/crash.

Note that to make the netdump client or netdump-server services start automatically with every boot, you can do the following:

On the server:

    # chkconfig --add netdump-server
    # chkconfig netdump-server on

On the client:

    # chkconfig --add netdump
    # chkconfig netdump on

This presumes that the netdump package has been installed on the client, and that the netdump-server package has been installed on the server.

However, please first test the appropriate kernel listed in Comment #9.

Also, if you can do so, can you give us the actual "tar" command that you use (if it's just one)? What we believe is happening is that /proc/kcore is being accessed.

I installed your kernel as soon as I got it, and was able to reboot the server on Friday at 12:14 PM. So the server has been running for 3 days. It has actually done 2 tars since then (Friday and Monday).
I noticed that the rate at which it writes to tape has changed:

Total bytes written: 48797921280 (45GB, 11MB/s)
Total bytes written: 48823685120 (45GB, 12MB/s)
Total bytes written: 48810997760 (45GB, 12MB/s)
Total bytes written: 48858214400 (45GB, 12MB/s)
Total bytes written: 48818739200 (45GB, 11MB/s)
Total bytes written: 48845414400 (45GB, 12MB/s)
Total bytes written: 49110046720 (46GB, 12MB/s)
Total bytes written: 49041520640 (46GB, 11MB/s)
Total bytes written: 49042524160 (46GB, 11MB/s)
Total bytes written: 49250979840 (46GB, 11MB/s)
Total bytes written: 49456035840 (46GB, 12MB/s)
Total bytes written: 49564026880 (46GB, 11MB/s)
----After Ernie's kernel---
Total bytes written: 49479208960 (46GB, 9.0MB/s) (Friday)
Total bytes written: 49387980800 (46GB, 8.6MB/s) (Monday)

The actual command is the following:

    /bin/tar --totals --preserve --same-owner \
        --exclude-from /usr/local/etc/skip.files --atime-preserve \
        -cf /dev/st0 etc/passwd etc/shadow etc/group . \
        >> /root/tape_data.err 2>&1

    cat /usr/local/etc/skip.files
    proc
    lost+found
    */lost+found
    mnt
    ./dev/log
    ./dev/gpmctl
    ./tmp/.font-unix
    ./tmp/.gdm_socket
    ./tmp/.X11-unix
    ./tmp/.ICE-unix
    ./tmp/.iroha_unix
    ./tmp/.fam_socket
    ./tmp/ssh-XXQGtxJg
    ./tmp/orbit-root
    ./tmp/orbit-mbm
    ./tmp/orbit-dlewis
    ./tmp/orbit-mth
    ./tmp
    ./jb
    jb

Since you are suggesting that we are backing up /proc, I will verify that. We may need to change the skip file to ignore "./proc" and not simply "proc".

Thanks!

I wouldn't think that "proc" alone would be any different from the "mnt" or "lost+found" entries in your exclude file. I suppose you could just do a "tar -tvf" of your storage device and see if anything from /proc is on there.

In any case, it's not necessarily the tar process that has to access /proc/kcore; another process can do it, and only when tar runs would it bump into the corruption.

But looking back, you seem to be describing a "hard hang" as opposed to a "crash".
What would be most helpful, then, would be the capture of "Alt-sysrq-p", "Alt-sysrq-w" and "Alt-sysrq-t" when the system hangs, presuming that the system can still respond to console keyboard interrupts in the hung state. These send diagnostic output to the console.

If you've never done so, you need to use "Control-Alt-F2" (or F3, F4, F5 or F6) to get to a virtual terminal. Then, while holding the Alt and Sysrq keys down, enter "p", then "w", then "t". ("Control-Alt-F7" gets you back to your X window.)

There's probably a good chance that keyboard interrupts won't be accepted, but if it happens again, that's about all we have to go on. You might want to leave the console showing a virtual terminal via Control-Alt-F2 before leaving the machine.

> The system does not log any error (even with *.* in /etc/syslog.conf).
> The machine hangs. No replies to ping requests, no sysRq or <ctrl>-
> <alt><del> response

Dante,

I'm sorry, I missed the above. So it looks like everything I suggested in comment #11 and #13 won't help much if the system locks up as you've described. Unfortunately, it also means that we still have absolutely nothing useful to work with. So for now, let's see what happens with the test kernel; I'm guessing it won't help much.

Here's an update:

Our backup script does not back up /proc. I read the entire directory listing from a previous tape to make sure, and it is not there. I did not really expect it to be the cause of the crash, simply because we had an instance where the system crashed right after the backup started (logged in the attachments).

The system has not crashed yet. It is now making it through its 4th day, but really it has done only 3 successful tars without crashing.

Interestingly enough, the write rate is still lower than before:
[Total bytes written: 49403852800 (46GB, 8.1MB/s)]
Not that I care. It's just a trivial note.

I'm going to set up console logging on another server in the meantime (to learn how to do it before the next crash).
I do not want to reboot this server and reset the day count for the test of Ernie's kernel. What I did do on this system is add an entry in syslog.conf to send log messages to a remote syslog server (in case the local disks are locked, hopefully TCP/IP will still get the message to the other server).

Thanks,
Dante

The wait is over... The system crashed this morning (6:49:17 AM). tar did not kill it this time. I actually do not know what did. My top (-q) process did not pick it up. The backup finished OK last night. Here are the rate logs (again):

Total bytes written: 48797921280 (45GB, 11MB/s)
Total bytes written: 48823685120 (45GB, 12MB/s)
Total bytes written: 48810997760 (45GB, 12MB/s)
Total bytes written: 48858214400 (45GB, 12MB/s)
Total bytes written: 48818739200 (45GB, 11MB/s)
Total bytes written: 48845414400 (45GB, 12MB/s)
Total bytes written: 49110046720 (46GB, 12MB/s)
Total bytes written: 49041520640 (46GB, 11MB/s)
Total bytes written: 49042524160 (46GB, 11MB/s)
Total bytes written: 49250979840 (46GB, 11MB/s)
Total bytes written: 49456035840 (46GB, 12MB/s)
Total bytes written: 49564026880 (46GB, 11MB/s)
... with Ernie's kernel ...
Total bytes written: 49479208960 (46GB, 9.0MB/s) (Fr)
Total bytes written: 49387980800 (46GB, 8.6MB/s) (Mo)
Total bytes written: 49403852800 (46GB, 8.1MB/s) (Tu)
Total bytes written: 49407744000 (46GB, 8.3MB/s) (We)

We cannot afford to spend more time on this, because one of our production systems is also crashing with this kernel (even with Ernie's kernel). That system has 4 GB and has crashed a couple of times during tar, but lately it has also been crashing during the day. The last two changes on it were sharing its disks via NFS and adding a samba mount to a SCO server.

Both of these systems had netdump on, the magic key enabled, and *.* in syslog.conf pointing to a remote system. No message was captured at the time of the crash.

We are going back to December's kernel, 2.4.21-20.0.1.ELsmp. We used to run 2.4.21-20.ELsmp with no crashes for months.
We started having problems with the 2.4.21-27 family as soon as it came out. One thing worth noting is that we only had problems on our SAN-attached servers (5 of them). All other servers (4), which run local SCSI RAID controllers, have not had any problems.

Our conclusion here is that even though Ernie's kernel may have solved the tar problem (I guess we'll never know), there is still another problem which is harder to identify. Worse yet, it is now affecting a production system and does not seem to leave any trace to debug.

Thanks for all your help and I hope you find this bug. We may try again in the future (after 21-27) or may be forced to try a different Linux distribution.

Dante

For whatever it's worth, here's a new update. After we went back to the older kernel, 2.4.21-20.0.1.ELsmp, the crashing stopped. None of the servers has crashed since.

Something interesting: the rate at which we write to tape changed again under the old kernel:

-----Kernel: 2.4.21-27.X
Total bytes written: 48797921280 (45GB, 11MB/s)
Total bytes written: 48823685120 (45GB, 12MB/s)
Total bytes written: 48810997760 (45GB, 12MB/s)
Total bytes written: 48858214400 (45GB, 12MB/s)
Total bytes written: 48818739200 (45GB, 11MB/s)
Total bytes written: 48845414400 (45GB, 12MB/s)
Total bytes written: 49110046720 (46GB, 12MB/s)
Total bytes written: 49041520640 (46GB, 11MB/s)
Total bytes written: 49042524160 (46GB, 11MB/s)
Total bytes written: 49250979840 (46GB, 11MB/s)
Total bytes written: 49456035840 (46GB, 12MB/s)
Total bytes written: 49564026880 (46GB, 11MB/s)
---- Ernie's Kernel
Total bytes written: 49479208960 (46GB, 9.0MB/s)
Total bytes written: 49387980800 (46GB, 8.6MB/s)
Total bytes written: 49403852800 (46GB, 8.1MB/s)
Total bytes written: 49407744000 (46GB, 8.3MB/s)
---- 2.4.21-20.0.1
Total bytes written: 49374279680 (46GB, 10MB/s)
Total bytes written: 49394493440 (46GB, 10MB/s)
Total bytes written: 49205708800 (46GB, 10MB/s)
Total bytes written: 49293547520 (46GB, 10MB/s)
Total bytes written: 49321799680 (46GB, 10MB/s)
Total bytes written: 49403801600 (46GB, 10MB/s)

I wish we could stay with the old release, but it had another problem (kswapd) that was fixed in 21-27. When will the next kernel update be available? We are (and have been for a long time) waiting to migrate a SCO server to Linux, but we need to make sure it is stable.

Thanks,
Dante

Created attachment 110875 [details]
Oops trace
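The per-kernel tape rates quoted in the comment above are easier to compare as averages. The following is only a sketch: it assumes the log is laid out as in that comment (a "----" line naming the kernel, followed by one tar --totals line per run), and the function name summarize_rates is made up for illustration.

```shell
#!/bin/sh
# Average the MB/s figure printed by "tar --totals", grouped by kernel.
# Assumed input: a line starting with "----" opens a group; each following
# "Total bytes written: N (XGB, YMB/s)" line is one backup run in that group.
summarize_rates() {
    awk '
        /^----/ { label = $0; sub(/^-+ */, "", label); next }
        /Total bytes written:/ {
            rate = $NF                    # last field, e.g. "8.6MB/s)"
            sub(/MB\/s\)$/, "", rate)     # keep only the number
            if (!(label in n)) order[++k] = label
            sum[label] += rate
            n[label]++
        }
        END {
            for (i = 1; i <= k; i++)
                printf "%s: %.1f MB/s over %d runs\n", \
                    order[i], sum[order[i]] / n[order[i]], n[order[i]]
        }
    '
}

# Figures copied from the comment above.
summarize_rates <<'EOF'
-----Kernel: 2.4.21-27.X
Total bytes written: 48797921280 (45GB, 11MB/s)
Total bytes written: 48823685120 (45GB, 12MB/s)
Total bytes written: 48810997760 (45GB, 12MB/s)
Total bytes written: 48858214400 (45GB, 12MB/s)
Total bytes written: 48818739200 (45GB, 11MB/s)
Total bytes written: 48845414400 (45GB, 12MB/s)
Total bytes written: 49110046720 (46GB, 12MB/s)
Total bytes written: 49041520640 (46GB, 11MB/s)
Total bytes written: 49042524160 (46GB, 11MB/s)
Total bytes written: 49250979840 (46GB, 11MB/s)
Total bytes written: 49456035840 (46GB, 12MB/s)
Total bytes written: 49564026880 (46GB, 11MB/s)
---- Ernie's Kernel
Total bytes written: 49479208960 (46GB, 9.0MB/s)
Total bytes written: 49387980800 (46GB, 8.6MB/s)
Total bytes written: 49403852800 (46GB, 8.1MB/s)
Total bytes written: 49407744000 (46GB, 8.3MB/s)
---- 2.4.21-20.0.1
Total bytes written: 49374279680 (46GB, 10MB/s)
Total bytes written: 49394493440 (46GB, 10MB/s)
Total bytes written: 49205708800 (46GB, 10MB/s)
Total bytes written: 49293547520 (46GB, 10MB/s)
Total bytes written: 49321799680 (46GB, 10MB/s)
Total bytes written: 49403801600 (46GB, 10MB/s)
EOF
```

Because tar rounds the reported rate, the averages are only as precise as the figures in the log; they are meant for a rough comparison between kernels, not as a benchmark.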
Hi there,

I have exactly the same problem as Dante Alzamora. My server is a dual Xeon 2.8GHz with 2 GB RAM. The crashes started after migrating to kernel 2.4.21-27.0.1.ELsmp, and they also happen with kernel 2.4.21-27.0.2.ELsmp.

The crash happens almost every night, generally between 1am and 2am, which corresponds to the end of the backup (cron, tar pcf /dev/st0 /). Running the backup manually does not provoke a crash, and disabling the crontab backup stops the crashes.

There was no information in the logs, and the only way to get the oops trace printed was to log in on console 1 and disable the powersave and blank features (setterm -powersave off -blank 0). The oops trace is attachment id=110875.

Thanks in advance for finding the bug.

Best Regards,
Christophe

"tar pcf /dev/st0 /" will often cause a subsequent oops after /proc/kcore is accessed by tar. The failure in prune_icache() is a typical failure mode. The kernel in Comment #9 addresses the issue; alternatively you can use the --exclude or --exclude-from options to prevent /proc/kcore access.

Good news! :o)

As you may already know, Ernie's kernel did solve our tar crashing problem. After getting a new patch for PowerPath (EMCpower.LINUX-4.3.1-036) we decided to give it another try, and the machine has not crashed since then.
Here's the i/o rate:

-----Kernel: 2.4.21-27.X
Total bytes written: 48797921280 (45GB, 11MB/s)
Total bytes written: 48823685120 (45GB, 12MB/s)
Total bytes written: 48810997760 (45GB, 12MB/s)
Total bytes written: 48858214400 (45GB, 12MB/s)
Total bytes written: 48818739200 (45GB, 11MB/s)
Total bytes written: 48845414400 (45GB, 12MB/s)
Total bytes written: 49110046720 (46GB, 12MB/s)
Total bytes written: 49041520640 (46GB, 11MB/s)
Total bytes written: 49042524160 (46GB, 11MB/s)
Total bytes written: 49250979840 (46GB, 11MB/s)
Total bytes written: 49456035840 (46GB, 12MB/s)
Total bytes written: 49564026880 (46GB, 11MB/s)
---- Ernie's Kernel (PP 4.30)
Total bytes written: 49479208960 (46GB, 9.0MB/s)
Total bytes written: 49387980800 (46GB, 8.6MB/s)
Total bytes written: 49403852800 (46GB, 8.1MB/s)
Total bytes written: 49407744000 (46GB, 8.3MB/s)
---- 2.4.21-20.0.1 (PP 4.30)
Total bytes written: 49374279680 (46GB, 10MB/s)
Total bytes written: 49394493440 (46GB, 10MB/s)
Total bytes written: 49205708800 (46GB, 10MB/s)
Total bytes written: 49293547520 (46GB, 10MB/s)
Total bytes written: 49321799680 (46GB, 10MB/s)
Total bytes written: 49403801600 (46GB, 10MB/s)
Total bytes written: 49417533440 (46GB, 10MB/s)
Total bytes written: 49502699520 (46GB, 10MB/s)
Total bytes written: 49480007680 (46GB, 10MB/s)
Total bytes written: 49667973120 (46GB, 10MB/s)
Total bytes written: 49646264320 (46GB, 10MB/s)
Total bytes written: 49675458560 (46GB, 10MB/s)
Total bytes written: 49756508160 (46GB, 10MB/s)
Total bytes written: 49762908160 (46GB, 10MB/s)
Total bytes written: 49775933440 (46GB, 10MB/s)
Total bytes written: 49796474880 (46GB, 10MB/s)
Total bytes written: 49899253760 (46GB, 10MB/s)
Total bytes written: 50176737280 (47GB, 10MB/s)
Total bytes written: 50196899840 (47GB, 10MB/s)
---- Ernie's Kernel with EMC powerpath PP431 (EMCpower.LINUX-4.3.1-036)
Total bytes written: 50232012800 (47GB, 12MB/s)
Total bytes written: 50244126720 (47GB, 12MB/s)
Total bytes written: 50257305600 (47GB, 12MB/s)
Total bytes written: 50322063360 (47GB, 12MB/s)
Total bytes written: 49007349760 (46GB, 12MB/s)
Total bytes written: 49036001280 (46GB, 12MB/s)
Total bytes written: 49058037760 (46GB, 12MB/s)
Total bytes written: 49097533440 (46GB, 12MB/s) (Fri Mar 4)

*** When will this fix be part of a regular kernel distribution? ***

You may close this ticket!

Thanks - Dante

The /proc/kcore fix will be included in the upcoming RHEL3-U5 kernel update.

Thanks for the update, Dante. The /proc/kcore fix was committed to the RHEL3 U5 patch pool on 28-Jan-2005 (in kernel version 2.4.21-27.10.EL). I'm putting this BZ into MODIFIED state.

When will RHEL3-U5 be available? When are updates released in general? Do you follow a schedule, or do you release them at various times? I'd like to have this info so I can plan for testing and migration.

Thanks - Dante

Dante,

RHEL3 U5 is scheduled for release towards the beginning of May, but the U5 external beta period is likely to start towards the end of next week. In general, we try to release updates every 4 months.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html
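
For systems that cannot move to the fixed kernel right away, the --exclude workaround mentioned earlier in this report can be tried out safely before touching the real backup job. The sketch below is an illustration only, assuming GNU tar: it uses a scratch directory and a file archive in place of / and /dev/st0, and the directory layout and file contents are made up.

```shell
#!/bin/sh
# Sketch: archive a tree while keeping ./proc (and so proc/kcore) out of it.
# A scratch tree stands in for "/" and a plain file stands in for /dev/st0.
work=$(mktemp -d)
mkdir -p "$work/root/proc" "$work/root/etc"
echo "stand-in for the kernel memory image" > "$work/root/proc/kcore"
echo "root:x:0:0" > "$work/root/etc/passwd"

# -C changes into the tree first, so member names start with "./" and the
# pattern "./proc" matches the directory itself; tar skips it and never
# opens anything underneath it.
tar -C "$work/root" --exclude=./proc -cf "$work/backup.tar" .

# List the archive and report whether anything from proc was stored.
if tar -tf "$work/backup.tar" | grep -q 'proc'; then
    echo "proc is in the archive"
else
    echo "proc was excluded"
fi
```

The same listing check ("tar -tf" piped through grep) can be run against an existing tape to see whether an earlier backup picked up anything from /proc. The scratch directory is left in place so the archive can be inspected afterwards.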