Bug 145563

Summary:

tar crashes DELL server every 4th day.

Product:

Red Hat Enterprise Linux 3

Reporter:

Dante Alzamora <dac>

Component:

kernel

Assignee:

Dave Anderson <anderson>

Status:

CLOSED ERRATA

Severity:

high

Priority:

medium

Version:

3.0

CC:

jburke, peterm, petrides, riel

Hardware:

i686

OS:

Linux

Doc Type:

Bug Fix

Last Closed:

2005-05-18 13:29:09 UTC

Attachments:

Description	Flags
The last 5 top pages before the crash	none
vmstat -m 1 5 (right before the crash)	none
slabtop --sort=c -o	none
last 5 tops before crash 2005-01-21 @ 01:41:30 AM	none
vmstat before the crash 2005-01-21 @ 01:41:30 AM	none
Oops trace	none

Description Dante Alzamora 2005-01-19 19:22:40 UTC

Description of problem:
When the backup cron job starts (11:00PM) with no users or other load 
on the system, it crashes. It does not happen every night. Sometimes 
it lasts for 6 days.
The system does not log any error (even with *.* in /etc/syslog.conf).
The machine hangs. No replies to ping requests, no sysRq or <ctrl>-
<alt><del> response. 
This behaviour started after the kernel upgrade to kernel-smp-2.4.21-
27.0.1.
This system has 1GB of memory. We have other systems (with 4GB mem) 
running the same kernel that do not experience this problem (with the 
exception of one time that it did crash one system).




Version-Release number of selected component (if applicable):
kernel-smp-2.4.21-27.0.1.EL

How reproducible:
Sometimes

Steps to Reproduce:
1. We were able to reproduce it one time by simply starting the tar 
job inside cron. Some other times it worked fine.

Actual Results:  The machine hangs with no error messages anywhere 
(console or log files). 

Additional info:

I compiled a series of top (with the q switch) to capture the 
offending process. It captured the last 5 tops before the crash.
Also, I captured a series of slabtop output and vmstat.
I'll be glad to add more routines to capture the state of the machine 
to help debug this problem.

Comment 1 Dante Alzamora 2005-01-19 19:28:40 UTC

Created attachment 109983 [details]
The last 5 top pages before the crash

Note that top shows tar starting and then moving to the bottom.
The machine shows a load because of the 3 monitoring processes capturing the state
of the machine. Otherwise the system should be idle.
The crash occurred right after tar started sending data to tape.

Comment 2 Dante Alzamora 2005-01-19 19:31:06 UTC

Created attachment 109984 [details]
 vmstat -m 1 5 (right before the crash)

Comment 3 Dante Alzamora 2005-01-19 19:41:01 UTC

Created attachment 109985 [details]
slabtop  --sort=c  -o 

I have other slabtop outputs with various switches and even repeated ones
previous to this one in case you want to see them.

Comment 4 Ernie Petrides 2005-01-20 00:00:01 UTC

Hello, Dante.  Please attach the console oops output and/or panic message,
which you might need to capture with a serial console.  By the way, if the
oops is at __audit_get_target()+0x1f6, then the problem has already been
fixed in the latest security errata (2.4.21-27.0.2.EL), which was released
last night.  If not, then we'll need the oops output to investigate further.

Thanks in advance.  -ernie

Comment 5 Dante Alzamora 2005-01-20 18:44:21 UTC

Ernie,
I am new to kernel debugging problems. So bear with me please.
I do not get a panic or oops message on the screen or in any log files 
(including dmesg).

How can I plan for a crash, or at least capture the oops message?
Do I just need to attach a dumb terminal (or another computer via null modem)
to the serial port to capture the message?
Do I need to reboot the machine with this terminal attached so it is treated as 
the console?
Do I need to turn on kernel variables (via proc or by re-compiling the kernel)?

We upgraded the system to 2.4.21-27.0.2 last night. Hopefully I had the 
__audit_get_target problem.

By the way here's some info I left out
hardware: Dell PowerEdge 2600
SCSI Tape controller: Adaptec Controller Dual Channel

Again, let me know how I can plan for a crash.

Thanks,

Dante

Comment 6 Dante Alzamora 2005-01-21 14:30:54 UTC

Bad news. The system crashed again. It stayed up for 1 day only.
Now it is using 2.4.21-27.0.2.ELsmp.
Something weird happened: The tar backup normally takes 1 hour and 5 minutes.
Last night the backup started at 9:00 PM and the system crashed @ 01:41:30 AM.
The strange thing is that the backup never finished (it logs the start and end to 
a file - we logged the beginning but not the end). And you could actually see 
tar running in the top I captured. It had accumulated about 3 hrs of CPU 
utilization.
Again, there are no oops or panic messages anywhere.
Thanks,

Comment 7 Dante Alzamora 2005-01-21 14:37:19 UTC

Created attachment 110049 [details]
last 5 tops before crash 2005-01-21 @ 01:41:30 AM

Comment 8 Dante Alzamora 2005-01-21 14:39:02 UTC

Created attachment 110050 [details]
vmstat before the crash 2005-01-21 @ 01:41:30 AM

Comment 9 Dave Anderson 2005-01-21 15:14:10 UTC

Dante,

Without the oops trace we cannot really determine what has happened.

However, we have been debugging an issue that is "bumped into" by
running tar on relatively small memory systems. To rule that out,
or hopefully prove that it's the same issue, please install and test
the appropriate kernel from this location:
 
http://people.redhat.com/~petrides/.kcore/kernel-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
    
http://people.redhat.com/~petrides/.kcore/kernel-smp-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm
    
http://people.redhat.com/~petrides/.kcore/kernel-hugemem-2.4.21-27.0.2.EL.ernie.kcore.1.i686.rpm

Comment 11 Dave Anderson 2005-01-25 16:07:26 UTC

The kernel source package contains this file:

 ./Documentation/serial-console.txt

And there's an even more comprehensive web page here:

http://www.faqs.org/docs/Linux-HOWTO/Remote-Serial-Console-HOWTO.html

The amount of information is somewhat overwhelming, but for all
practical purposes the easiest manner is to attach a null-modem cable
to either of your serial ports (/dev/ttyS0 or /dev/ttyS1), and add a
console specifier to the appropriate "kernel" line in your
/boot/grub/grub.conf.  For example, here's a change from an original
kernel line:

  kernel /vmlinuz-2.4.9-21 ro root=/dev/hda6

to add console output from /dev/ttyS0, like this:

  kernel /vmlinuz-2.4.9-21 ro root=/dev/hda6 console=tty0 \
console=ttyS0,9600n8

Your line will obviously be different, but the point is to add
the "console=tty0 console=ttyS0,9600n8" onto the end of the line.

The null-modem cable can be plugged into a dumb terminal, but to
save the full output, it makes more sense to plug it into a serial
port on another system, and then run "minicom" on that system to
capture the output.  Configure "minicom" to have a large scroll
buffer.
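
The grub.conf change described above can be sketched as a small shell
helper. This is only an illustration: add_serial_console is a
hypothetical name, not a Red Hat tool, and it assumes the serial port
is /dev/ttyS0 at 9600n8 as in the example kernel line. It prints the
modified file to stdout so the result can be reviewed before anything
in /boot is replaced.

```shell
# Sketch only: append serial-console arguments to each "kernel" line of a
# grub.conf-style file, skipping lines that already mention console=ttyS0.
# add_serial_console is a hypothetical helper name (an assumption).
add_serial_console() {
    conf="$1"
    # Only touch lines that start with "kernel"; leave everything else alone.
    sed '/^[[:space:]]*kernel[[:space:]]/{/console=ttyS0/!s/[[:space:]]*$/ console=tty0 console=ttyS0,9600n8/;}' "$conf"
}
```

For example, "add_serial_console /boot/grub/grub.conf > /tmp/grub.conf.new"
leaves the original untouched, and a diff of the two files shows
exactly which kernel lines changed.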

Alternatively you can configure a netdump-server machine, preferably
on the same subnet as the panicking machine.  To do so, simply
do a "service netdump-server start" on the selected Red Hat server,
presuming that the netdump-server user package has been pre-installed.
Also, you must create a password for user "netdump" on the server.
There should be enough space in the /var/crash partition to be able
to hold a file as large as the memory of the panicking client.

Back on the client (panicking machine), edit /etc/sysconfig/netdump,
and simply enter the IP address of the netdump-server as the
"NETDUMPADDR" configuration item.  Then, do this just one time:

# service netdump propagate

This will register the client with the netdump-server; it will ask
for the password you created on the netdump-server machine.

Then, upon each boot of the client, enter:

# service netdump start

Then, upon the next panic, a vmcore will be created in a subdirectory
in /var/crash.

Note that to make the netdump client or netdump-server services
happen automatically with every boot, you can do the following:

On the server:

# chkconfig --add netdump-server
# chkconfig netdump-server on

On the client:

# chkconfig --add netdump
# chkconfig netdump on

This presumes that the netdump package has been installed on the
client, and that the netdump-server package has been installed on
the server.
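
The /var/crash sizing requirement mentioned above can be checked with
a rough sketch like the following. The helper name has_room_for_vmcore
is hypothetical and is not part of the netdump packages, and the df
and /proc/meminfo field positions in the comments are assumptions
about typical output. A real vmcore may need somewhat more room than
the client's memory size alone.

```shell
# Sketch only: succeed when the space available for vmcores (in kB) is at
# least the client's memory size (in kB). Hypothetical helper, not a
# netdump tool.
has_room_for_vmcore() {
    avail_kb="$1"
    client_mem_kb="$2"
    [ "$avail_kb" -ge "$client_mem_kb" ]
}

# Assumed ways to gather the two inputs:
#   on the server:  df -Pk /var/crash | awk 'NR==2 {print $4}'
#   on the client:  awk '/^MemTotal:/ {print $2}' /proc/meminfo
```

Feeding it the server's available kB and the client's MemTotal gives a
quick yes/no before relying on netdump to capture the next panic.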

However, please first test the appropriate kernel listed in
Comment #9.  Also, if you can do so, can you give us
the actual "tar" command that you use (if it's just one)?
What we believe is happening is that /proc/kcore is being
accessed.

Comment 12 Dante Alzamora 2005-01-25 16:38:15 UTC

I installed your kernel as soon as I got it and was able to reboot the 
server on Friday @12:14PM. So the server has been running for 3 days.
It has actually done 2 tars since then (Friday and Monday).
I noticed that the rate that it writes to tape has changed:
Total bytes written: 48797921280 (45GB, 11MB/s)
Total bytes written: 48823685120 (45GB, 12MB/s)
Total bytes written: 48810997760 (45GB, 12MB/s)
Total bytes written: 48858214400 (45GB, 12MB/s)
Total bytes written: 48818739200 (45GB, 11MB/s)
Total bytes written: 48845414400 (45GB, 12MB/s)
Total bytes written: 49110046720 (46GB, 12MB/s)
Total bytes written: 49041520640 (46GB, 11MB/s)
Total bytes written: 49042524160 (46GB, 11MB/s)
Total bytes written: 49250979840 (46GB, 11MB/s)
Total bytes written: 49456035840 (46GB, 12MB/s)
Total bytes written: 49564026880 (46GB, 11MB/s)
----After Ernie's kernel---
Total bytes written: 49479208960 (46GB, 9.0MB/s)  (Friday)
Total bytes written: 49387980800 (46GB, 8.6MB/s)  (Monday)
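
To put the rate change in terms of run time, the totals above can be
converted with a short sketch. It assumes the "MB/s" in the tar totals
means 1048576 bytes per second; the GB figures in the log are
consistent with binary units, but that reading is an assumption, and
backup_minutes is a hypothetical helper name.

```shell
# Sketch only: estimate backup duration in whole minutes from the byte
# count and the rate reported by "tar --totals".
# Assumption: 1 MB/s in the log equals 1048576 bytes per second.
backup_minutes() {
    awk -v bytes="$1" -v rate="$2" \
        'BEGIN { printf "%.0f\n", bytes / (rate * 1048576) / 60 }'
}

backup_minutes 48797921280 12    # prints 65
backup_minutes 49387980800 8.6   # prints 91
```

Under that assumption the drop from 12MB/s to 8.6MB/s corresponds to
roughly 65 minutes growing to roughly 91 minutes for the same backup.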

The actual command is the following

/bin/tar --totals --preserve --same-owner \
    --exclude-from /usr/local/etc/skip.files --atime-preserve -cf /dev/st0 \
    etc/passwd etc/shadow etc/group .   >> /root/tape_data.err 2>&1

cat /usr/local/etc/skip.files
proc
lost+found
*/lost+found
mnt
./dev/log
./dev/gpmctl
./tmp/.font-unix
./tmp/.gdm_socket
./tmp/.X11-unix
./tmp/.ICE-unix
./tmp/.iroha_unix
./tmp/.fam_socket
./tmp/ssh-XXQGtxJg
./tmp/orbit-root
./tmp/orbit-mbm
./tmp/orbit-dlewis
./tmp/orbit-mth
./tmp
./jb
jb

Since you are suggesting that we are backing up /proc, I will verify 
that. We may need to change the skip-files to ignore "./proc" and not 
simply "proc".

Thanks!

Comment 13 Dave Anderson 2005-01-25 18:22:24 UTC

I wouldn't think that "proc" alone would be any different from
the "mnt" or "lost+found" entries in your exclude file.  I suppose
you could just do a "tar -tvf" of your storage device and see if
anything from /proc is on there.
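
A minimal sketch of that check, assuming the archive was written
relative to / so that any /proc members would be listed as "proc/..."
or "./proc/...". The helper name archive_has_proc is hypothetical.

```shell
# Sketch only: succeed if the archive listing contains a top-level proc
# member. archive_has_proc is a hypothetical helper name (an assumption).
archive_has_proc() {
    tar -tf "$1" | grep -Eq '^(\./)?proc(/|$)'
}
```

For a tape, rewind first (for example "mt -f /dev/st0 rewind") and then
run "archive_has_proc /dev/st0 && echo 'proc entries found'".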

In any case, it's not necessarily the tar process that has to do
the access of /proc/kcore; another process can do it, and
only when the tar runs would it bump into the corruption.

But looking back, you seem to be describing a "hard hang" as
opposed to a "crash".  What would be most helpful, then, would
be the capture of "Alt-sysrq-p", "Alt-sysrq-w" and "Alt-sysrq-t"
when the system hangs.  That is presuming that the system can
respond to console keyboard interrupts when it gets into the hung
state.  That will send diagnostic output to the console.  If you've
never done so, you need to use "Control-Alt-F2" (or F3, F4, F5 or F6)
to get to a virtual terminal.  Then while holding the Alt and Sysrq
keys down, enter "p", then "w", then "t".  ("Control-Alt-F7" gets you
back to your X window.)  There's probably a good chance that keyboard
interrupts won't be accepted, but if it happens again, that's about
all we have to go on.  You might want to leave the console showing
a virtual terminal via Control-Alt-F2 before leaving the machine.

Comment 14 Dave Anderson 2005-01-25 21:28:03 UTC

> The system does not log any error (even with *.* in /etc/syslog.conf).
> The machine hangs. No replies to ping requests, no sysRq or <ctrl>-
> <alt><del> response

Dante,

I'm sorry -- I missed the above.  So it looks like everything I
suggested in comment #11 and #13 won't help much if the system
locks up as you've described.

Unfortunately, it also means that we still have absolutely nothing
useful to work with.  So for now, let's see what happens with the
test kernel; I'm guessing it won't help much.

Comment 15 Dante Alzamora 2005-01-26 08:31:47 UTC

Here's an update:
Our backup script does not back up /proc. I read the entire directory 
from a previous tape to make sure and it is not there.
I did not really expect it to cause the crash simply because we had 
an instance where it crashed right after backup started (logged in 
the attachments).

The system has not crashed yet. It has now made it through its 4th 
day. But really it has done 3 successful tars without crashing.
Interestingly enough, the rate to disk is still lower than before:
[Total bytes written: 49403852800 (46GB, 8.1MB/s)]
Not that I care. It's just a trivial note.

I'm going to set up the console logging on another server in the 
meantime (to learn how to do it before the next crash). I do not want to 
reboot this server and reset the day count with Ernie's test kernel.
What I did do on this system is add an entry in syslog.conf to send 
log messages to a remote syslog server (in case the local disks are 
locked, hopefully TCP/IP will send the message to the other server).

Thanks,  Dante

Comment 16 Dante Alzamora 2005-01-27 18:38:49 UTC

The wait is over...
The system crashed this morning (6:49:17AM). tar did not kill it this 
time. I actually do not know what did. My top (-q) process did not 
pick it up.
The backup finished OK last night. Here are the rate logs (again):
Total bytes written: 48797921280 (45GB, 11MB/s)
Total bytes written: 48823685120 (45GB, 12MB/s)
Total bytes written: 48810997760 (45GB, 12MB/s)
Total bytes written: 48858214400 (45GB, 12MB/s)
Total bytes written: 48818739200 (45GB, 11MB/s)
Total bytes written: 48845414400 (45GB, 12MB/s)
Total bytes written: 49110046720 (46GB, 12MB/s)
Total bytes written: 49041520640 (46GB, 11MB/s)
Total bytes written: 49042524160 (46GB, 11MB/s)
Total bytes written: 49250979840 (46GB, 11MB/s)
Total bytes written: 49456035840 (46GB, 12MB/s)
Total bytes written: 49564026880 (46GB, 11MB/s)
   ... with Ernie's kernel ...
Total bytes written: 49479208960 (46GB, 9.0MB/s) (Fr)
Total bytes written: 49387980800 (46GB, 8.6MB/s) (Mo)
Total bytes written: 49403852800 (46GB, 8.1MB/s) (Tu)
Total bytes written: 49407744000 (46GB, 8.3MB/s) (We)

We cannot afford to spend more time on this, because one of our 
production systems is also crashing with this kernel (even with 
Ernie's kernel). That system has 4 GB and has crashed a couple of 
times during tar. But lately it has been crashing also during the 
day. The last two changes on it were the sharing of its disks via NFS 
and a samba mount to a SCO server. 

Both these systems had netdump on, the magic key enabled and *.* in 
syslog.conf to a remote system. No message was captured at the time 
of the crash.

We are going back to December's kernel, 2.4.21-20.0.1.ELsmp. We used 
to run 2.4.21-20.ELsmp with no crashes for months. We started having 
problems with the 2.4.21-27 family (as soon as it came out).

One thing worth noting is that we only had problems on our SAN 
attached servers (5 of them). All other servers (4) running local 
SCSI raid controllers have not had any problems. 

Our conclusion here is that even though Ernie's kernel may have 
solved the tar problem (I guess we'll never know), there is still 
another problem which is harder to identify. Worse yet, it is now 
affecting a production system and does not seem to leave any trace to 
debug. 

Thanks for all your help and I hope you find this bug. We may try 
again in the future (after 21-27) or may be forced to try a different 
Linux distribution.

Dante

Comment 17 Dante Alzamora 2005-02-04 16:24:46 UTC

For whatever it's worth, here's a new update.
After we went back to the older kernel 2.4.21-20.0.1.ELsmp, the 
crashing has stopped. None of the servers have crashed since.
Interestingly, the rate at which we write to tape changed again under 
the old kernel:
-----Kernel: 2.4.21-27.X
Total bytes written: 48797921280 (45GB, 11MB/s)
Total bytes written: 48823685120 (45GB, 12MB/s)
Total bytes written: 48810997760 (45GB, 12MB/s)
Total bytes written: 48858214400 (45GB, 12MB/s)
Total bytes written: 48818739200 (45GB, 11MB/s)
Total bytes written: 48845414400 (45GB, 12MB/s)
Total bytes written: 49110046720 (46GB, 12MB/s)
Total bytes written: 49041520640 (46GB, 11MB/s)
Total bytes written: 49042524160 (46GB, 11MB/s)
Total bytes written: 49250979840 (46GB, 11MB/s)
Total bytes written: 49456035840 (46GB, 12MB/s)
Total bytes written: 49564026880 (46GB, 11MB/s)
---- Ernie's Kernel
Total bytes written: 49479208960 (46GB, 9.0MB/s)
Total bytes written: 49387980800 (46GB, 8.6MB/s)
Total bytes written: 49403852800 (46GB, 8.1MB/s)
Total bytes written: 49407744000 (46GB, 8.3MB/s)
---- 2.4.21-20.0.1
Total bytes written: 49374279680 (46GB, 10MB/s)
Total bytes written: 49394493440 (46GB, 10MB/s)
Total bytes written: 49205708800 (46GB, 10MB/s)
Total bytes written: 49293547520 (46GB, 10MB/s)
Total bytes written: 49321799680 (46GB, 10MB/s)
Total bytes written: 49403801600 (46GB, 10MB/s)

I wish we could stay with the old release but there was another 
problem (kswapd) with that release that got fixed with 21-27.

When will the next kernel update be available?
We are (and have been for a long time) waiting to migrate a SCO 
server to Linux but we need to make sure it is stable.

Thanks, Dante

Comment 18 Christophe Lambert 2005-02-09 15:49:23 UTC

Created attachment 110875 [details]
Oops trace

Comment 19 Christophe Lambert 2005-02-09 15:56:21 UTC

Hi there,

I have exactly the same problem as Dante Alzamora.
My server is a dual Xeon 2.8GHz with 2 GB RAM.
The crashes started after migrating to kernel 2.4.21-27.0.1.ELsmp, but
it also crashed with kernel 2.4.21-27.0.2.ELsmp.

The crash happens almost every night, generally between 1am and 2am,
which corresponds to the end of the backup (cron, tar pcf /dev/st0 /).
Doing the backup manually does not provoke a crash. Disabling the
crontab backup stops the crashes.

There was no information in the logs, and the only way to print the Oops
trace was to log in on console 1 and disable the powersave and blank
features (setterm -powersave off -blank 0).

The Oops trace is attachment id=110875.

Thanks in advance for finding the bug.

Best Regards,
Christophe

Comment 20 Dave Anderson 2005-02-09 16:39:12 UTC


"tar pcf /dev/st0 /" will often cause a subsequent oops after
/proc/kcore is accessed by tar.  The failure in prune_icache()
is a typical failure mode.

The kernel in Comment #9 addresses the issue; alternatively you can
use --exclude or --exclude-from options to prevent /proc/kcore access.
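
As a hedged illustration of the --exclude approach, the sketch below
archives a tree relative to its root and leaves out ./proc, so
/proc/kcore is never read. backup_without_proc is a hypothetical
helper name, and the exact pattern matching may differ between tar
versions, so the resulting listing should be checked on the system in
question.

```shell
# Sketch only: archive ROOT into DEST while excluding ./proc.
# backup_without_proc is a hypothetical helper name (an assumption).
backup_without_proc() {
    dest="$1"
    root="$2"
    # -C switches to ROOT first, so members are stored as ./etc/..., ./proc/...
    tar -cpf "$dest" --exclude=./proc -C "$root" .
}
```

For the command quoted above this would be invoked as
"backup_without_proc /dev/st0 /".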

Comment 21 Dante Alzamora 2005-03-07 20:07:57 UTC

Good news! :o)
As you may already know, Ernie's kernel did solve our tar crashing 
problem.
After getting a new patch for PowerPath (EMCpower.LINUX-4.3.1-036) we 
decided to give it another try. And the machine has not crashed since 
then.

Here's the i/o rate:

-----Kernel: 2.4.21-27.X
Total bytes written: 48797921280 (45GB, 11MB/s)
Total bytes written: 48823685120 (45GB, 12MB/s)
Total bytes written: 48810997760 (45GB, 12MB/s)
Total bytes written: 48858214400 (45GB, 12MB/s)
Total bytes written: 48818739200 (45GB, 11MB/s)
Total bytes written: 48845414400 (45GB, 12MB/s)
Total bytes written: 49110046720 (46GB, 12MB/s)
Total bytes written: 49041520640 (46GB, 11MB/s)
Total bytes written: 49042524160 (46GB, 11MB/s)
Total bytes written: 49250979840 (46GB, 11MB/s)
Total bytes written: 49456035840 (46GB, 12MB/s)
Total bytes written: 49564026880 (46GB, 11MB/s)
---- Ernie's Kernel (PP 4.30)
Total bytes written: 49479208960 (46GB, 9.0MB/s)
Total bytes written: 49387980800 (46GB, 8.6MB/s)
Total bytes written: 49403852800 (46GB, 8.1MB/s)
Total bytes written: 49407744000 (46GB, 8.3MB/s)
---- 2.4.21-20.0.1  (PP 4.30)
Total bytes written: 49374279680 (46GB, 10MB/s)
Total bytes written: 49394493440 (46GB, 10MB/s)
Total bytes written: 49205708800 (46GB, 10MB/s)
Total bytes written: 49293547520 (46GB, 10MB/s)
Total bytes written: 49321799680 (46GB, 10MB/s)
Total bytes written: 49403801600 (46GB, 10MB/s)
Total bytes written: 49417533440 (46GB, 10MB/s)
Total bytes written: 49502699520 (46GB, 10MB/s)
Total bytes written: 49480007680 (46GB, 10MB/s)
Total bytes written: 49667973120 (46GB, 10MB/s)
Total bytes written: 49646264320 (46GB, 10MB/s)
Total bytes written: 49675458560 (46GB, 10MB/s)
Total bytes written: 49756508160 (46GB, 10MB/s)
Total bytes written: 49762908160 (46GB, 10MB/s)
Total bytes written: 49775933440 (46GB, 10MB/s)
Total bytes written: 49796474880 (46GB, 10MB/s)
Total bytes written: 49899253760 (46GB, 10MB/s)
Total bytes written: 50176737280 (47GB, 10MB/s)
Total bytes written: 50196899840 (47GB, 10MB/s)
---- Ernie's Kernel with EMC powerpath PP431(EMCpower.LINUX-4.3.1-036)
Total bytes written: 50232012800 (47GB, 12MB/s)
Total bytes written: 50244126720 (47GB, 12MB/s)
Total bytes written: 50257305600 (47GB, 12MB/s)
Total bytes written: 50322063360 (47GB, 12MB/s)
Total bytes written: 49007349760 (46GB, 12MB/s)
Total bytes written: 49036001280 (46GB, 12MB/s)
Total bytes written: 49058037760 (46GB, 12MB/s)
Total bytes written: 49097533440 (46GB, 12MB/s)  (Fri Mar  4)

*** When will this fix be part of a regular kernel distribution? ***

You may close this ticket!

Thanks - Dante

Comment 22 Dave Anderson 2005-03-07 20:26:43 UTC

The /proc/kcore fix will be included in the upcoming RHEL3-U5 kernel
update.

Comment 23 Ernie Petrides 2005-03-08 01:43:13 UTC

Thanks for the update, Dante.  The /proc/kcore fix was committed to the
RHEL3 U5 patch pool on 28-Jan-2005 (in kernel version 2.4.21-27.10.EL).

I'm putting this BZ into MODIFIED state.

Comment 24 Dante Alzamora 2005-03-08 21:16:28 UTC

When will RHEL3-U5 be available? 
When are the Updates released in general? 
Do you follow a schedule or do you release them at various times?
I'd like to have this info so I can plan for testing and migration.

Thanks - Dante

Comment 25 Ernie Petrides 2005-03-08 21:47:44 UTC

Dante, RHEL3 U5 is scheduled for release towards the beginning of May,
but the U5 external beta period is likely to start towards the end of
next week.  In general, we try to release updates every 4 months.

Comment 26 Tim Powers 2005-05-18 13:29:09 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-294.html