Bug 210419 - kdump won't terminate copying a vmcore even if cp fails.
Status: CLOSED NOTABUG
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kexec-tools
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assignee: Neil Horman
Reported: 2006-10-11 22:21 UTC by Akira Imamura
Modified: 2007-11-30 22:07 UTC

Doc Type: Bug Fix
Last Closed: 2006-11-04 15:14:40 UTC

Attachments
kdump won't terminate copying a vmcore. (1.70 KB, text/plain)
2006-10-12 17:28 UTC, Akira Imamura

Description Akira Imamura 2006-10-11 22:21:37 UTC
Description of problem:
kdump won't terminate copying a vmcore even if space to save the core
is lacking. Consequently, kdump cannot carry out the next process, and
won't be able to reboot.

Version-Release number of selected component (if applicable):
kexec-tools-1.101-92.el5

How reproducible:
Always

Steps to Reproduce:
1. In /etc/kdump.conf, configure a dump target whose free space is less than
the RAM size.
2. Run "service kdump restart".
3. Make the system panic.
  
Actual results:
kdump won't terminate copying a vmcore

Expected results:
kdump terminates copying a vmcore and reboots the system.

Additional info:

Comment 1 Neil Horman 2006-10-12 11:50:35 UTC
As I look at the init script, I may understand what you are seeing.  If the core
file copy fails, we leave the file named vmcore-incomplete and continue with the
rootfs mount to try to safely copy the file to vmcore, which probably doesn't use
the vmcore-incomplete semantics.  I'm hesitant to just reboot on error, in
the event that the initramfs dump target doesn't match the rootfs /var/crash
target, but I'll add the -incomplete semantics to the kdump initscript to make
this more consistent and clear.
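
The -incomplete naming mentioned above can be sketched as follows.  This is a
minimal Python illustration, not the actual kdump initscript (which is a shell
script).  The function name save_core and its arguments are hypothetical, and
the rename-on-success step is an assumption about what the "-incomplete
semantics" amount to.

```python
import os
import shutil


def save_core(src_path, target_dir):
    """Copy a core file under a temporary name; rename it only on success.

    Hypothetical helper, not part of kexec-tools.  Returns the name the
    dump ends up under: "vmcore" on success, "vmcore-incomplete" otherwise.
    """
    incomplete = os.path.join(target_dir, "vmcore-incomplete")
    final = os.path.join(target_dir, "vmcore")
    try:
        shutil.copyfile(src_path, incomplete)
    except OSError:
        # Copy failed (for example with ENOSPC): anything already written
        # keeps the -incomplete name, so it is not mistaken for a full dump.
        return incomplete
    os.rename(incomplete, final)
    return final
```

With this scheme, a file named vmcore is always a complete copy, and a file
named vmcore-incomplete signals that the copy stopped early.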


Comment 2 Neil Horman 2006-10-12 11:54:39 UTC
Actually, scratch my last comment; I don't understand this.  92.el5 has the
-incomplete semantics in the initscript, so please provide me with your
kdump.conf and a console capture of the dump session so I can figure out what
you're seeing.  Thanks!

Comment 3 Akira Imamura 2006-10-12 17:28:20 UTC
Created attachment 138346
kdump won't terminate copying a vmcore.

Comment 4 Akira Imamura 2006-10-12 17:33:04 UTC
I specified only the following in /etc/kdump.conf, in order to make kdump copy
a vmcore to a remote host.
-----
net dhcp78-28.lab.boston.redhat.com:/var/crash
-----

As for the console capture of the dump session, please see the attachment.
FYI, when this happens, all I can do is reset the machine.

Regards,
Akira

Comment 5 Neil Horman 2006-10-12 18:02:50 UTC
Please don't set bugs assigned to me to NEEDINFO on me; I don't see them in my
normal list.  Just set them back to ASSIGNED, please.

Comment 6 Neil Horman 2006-10-12 18:07:08 UTC
It looks like it's hanging while trying to mount the server.  What does the
exports file on your NFS server look like?  Do you see a vmcore-incomplete file
created on your NFS server?  How big is it?

Comment 7 Akira Imamura 2006-10-12 18:46:05 UTC
> What does your exports file on your nfs server look like?
It looks as below.
/var/crash dhcp78-238(rw,no_root_squash)

> Do you see a vmcore-incomplete file created on your nfs server?
Yes, I do. However, I see it on
dhcp78-28.lab.boston.redhat.com:/var/crash/var/crash
That path looks ugly. This could be related to BZ#210056 that you
have modified. If so, the status of BZ#210056 should be changed.

> how big is it?
The RAM size is 16 GB, but the created vmcore-incomplete file size
is 3 GB. Is this related to this bug? (I don't think so.)
Anyway, I will provide as much information as I can in order
to resolve this problem.

Thanks,
Akira


Comment 8 Neil Horman 2006-10-12 19:12:40 UTC
>Yes, I do. However, I see it on
>dhcp78-28.lab.boston.redhat.com:/var/crash/var/crash
>That path looks ugly. This could be related to BZ#210056 that you
The path looks ugly because that's how you've configured it.  We had an email
thread on the kexec-kboot internal list over the past week about how to implement
path additions, and in bz 210056 you'll see that the conclusion we came to was
to add a path directive in kdump.conf that lets you specify the path, and
defaults to /var/crash if unset.  You indicated above that you have only one
active directive in your kdump.conf, which mounts
dhcp78-28.lab.boston.redhat.com:/var/crash/ on /mnt.
Since the default path saves to /mnt/$SAVE_PATH, you get /var/crash/var/crash
on the server.  Set the following in kdump.conf to avoid this:
path /
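
To see why the dump lands in /var/crash/var/crash, the path composition can be
sketched in Python.  The function name server_side_dump_dir is hypothetical.
The only facts taken from this comment are that the export is mounted on /mnt,
that the dump is saved under /mnt/$SAVE_PATH, and that the path defaults to
/var/crash when unset.

```python
import posixpath


def server_side_dump_dir(net_target, save_path="/var/crash"):
    """Return where a dump ends up on the NFS server (illustrative only).

    net_target is the value of the "net" directive, "host:/export".
    save_path is the value of the "path" directive (default /var/crash).
    """
    host, export = net_target.split(":", 1)
    # The export is mounted on /mnt and the dump goes to /mnt/<save_path>,
    # so on the server the directory is <export>/<save_path>.
    return host + ":" + posixpath.normpath(export + "/" + save_path)


target = "dhcp78-28.lab.boston.redhat.com:/var/crash"
print(server_side_dump_dir(target))               # path unset
print(server_side_dump_dir(target, save_path="/"))  # path /
```

The first call yields the doubled /var/crash/var/crash directory, and the
second, with "path /", yields the export directory itself.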


>The RAM size is 16 GB, but the created vmcore-incomplete file size
>is 3 GB. Is this related to this bug?
It sounds to me like it's the definition of the bug.  I assume that the target
filesystem only has 3GB of space on it?  If so, NFS should be reporting ENOSPC
back to the user app (in this case the copy operation), which should then fail.
As it is, though, it would appear that NFS isn't reporting an error back
to the cp process (I can manually do a busybox copy to fill up a
filesystem locally and have it error out).  I'll try to reproduce this here.  In
the meantime, let me know how much free space you have on your system, and see
if dumping to a too-small local filesystem results in the same failure for you.
Thanks!
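
The expected behavior (a write fails with ENOSPC, and the copy stops and
reports failure) can be sketched as below.  This is an illustration in Python,
not the busybox cp implementation; copy_stream and LimitedSink are hypothetical
names, and LimitedSink merely stands in for a target that runs out of space.

```python
import errno
import io
import os


class LimitedSink:
    """Hypothetical stand-in for a target filesystem with limited space."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def write(self, data):
        if self.used + len(data) > self.capacity:
            # A real write() on a full device fails the same way.
            raise OSError(errno.ENOSPC, os.strerror(errno.ENOSPC))
        self.used += len(data)
        return len(data)


def copy_stream(src, dst, chunk_size=1 << 20):
    """Copy src to dst in chunks; return (bytes_written, error_or_None)."""
    written = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return written, None
        try:
            dst.write(chunk)
        except OSError as exc:
            if exc.errno == errno.ENOSPC:
                # Stop at the first out-of-space error instead of retrying.
                return written, "ENOSPC"
            raise
        written += len(chunk)


source = io.BytesIO(b"x" * 10)
print(copy_stream(source, LimitedSink(capacity=4), chunk_size=2))
```

The point of the sketch is that the copy can only terminate on a full target
if the error actually reaches the process doing the writes.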

Comment 9 Akira Imamura 2006-10-12 20:03:04 UTC
Thanks for letting me know how to avoid that. I found that kdump works
as expected with /etc/kdump.conf configured as below.
-----
net dhcp78-28.lab.boston.redhat.com:/var/crash/
path /
-----

I misunderstood the specification again. Sorry about that.

Here are the answers to your questions.
My NFS server has 3 GB of free space to save a vmcore. That's why
kdump created a 3 GB vmcore-incomplete. Dumping to a too-small local
filesystem results in the same failure for me. NFS doesn't appear to
report an error back to the cp process. So, it's possible that
there's something wrong with NFS, isn't it?

Regards,
Akira


Comment 10 Neil Horman 2006-10-13 14:30:07 UTC
So, I just tried this on my system, and it worked exactly as I expected.  I
created an NFS mount that had only 512MB of space available to it, configured
kdump to dump to that mount, and crashed the kernel.  The initramfs attempted to
save the core to the NFS mount and received an ENOSPC error, which caused the cp
operation to fail.  The system then mounted the rootfs, started the initscripts,
saved the core locally, and rebooted.

So, my only thought then is that your NFS server isn't responding properly to an
out-of-space condition.  What are you using as your NFS server?  Can you capture
a tcpdump of all the communication with the NFS server during the dump?

It's also possible that you're not using cp to copy the core in the initramfs and
that the tool in use isn't handling the copy operation properly.  Is it possible
(despite your config in comment #9) that you have makedumpfile specified as your
core_collector?

Comment 11 Akira Imamura 2006-10-24 21:15:11 UTC
Just to make sure: how long did it take to receive the error? Was it a short time?

Comment 12 Neil Horman 2006-10-25 13:05:15 UTC
It was less than 30 seconds.
Still waiting on information requested in comment #10


Comment 13 Akira Imamura 2006-11-03 20:13:46 UTC
I tested once again. As you said, the system rebooted automatically
just after it received an ENOSPC error. However, it took a very long time to
receive the error. It was much longer than 30 seconds. Although I didn't time
it, it seemed to be longer than 10 minutes. Anyway, it worked as expected.
BTW, the mechanism to avoid losing dump data doesn't work when an NFS
path or SSH path is configured. I expected kdump to present a console so that
the user can save a vmcore completely when copying the vmcore fails due to an
error. Actually, it didn't present a console, and just rebooted. As a result,
the dump data was lost.

Thanks,
Akira


Comment 14 Neil Horman 2006-11-03 20:39:01 UTC
I don't know what to tell you.  You've just confirmed with your testing (I
assume to a local fs) that you got an ENOSPC error just as I did, only it took
you much longer because the amount of available space on your target was bigger
than mine.  When I did my testing, I specifically used an NFS mount, so I know
that works for me, contrary to your statement, as does ssh/scp dumping, which I
just tested as well.  I assume you are claiming these facilities don't work
properly because you expected the console to be presented, which is incorrect.
The only way an interactive console is going to be presented to you is if you
configure it as such, using the "default" directive in kdump.conf.  You can set
its value to shell or to reboot, the effects of which are self-explanatory.  In
the event that it is left unset, the default action will be to mount the root
filesystem, switchroot to it and run /sbin/init, which should save the core file
locally in /var/crash and then reboot.  That's probably what you are seeing and
labeling as a failure.  Set the default directive in kdump.conf to shell and it
should work as you expect.
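
The three outcomes described in this comment can be summarized in a small
Python sketch.  It is not how kexec-tools parses kdump.conf: the function name
failure_action, the simplified parsing, the label used for the unset case, and
the host nfs.example.com are all hypothetical.  Only the "default" directive
and its values shell and reboot come from the comment.

```python
def failure_action(kdump_conf_text):
    """Return what happens after the initramfs copy fails (illustrative).

    "shell"  -> an interactive console is presented
    "reboot" -> the system reboots
    unset    -> mount the root filesystem, switchroot, run /sbin/init,
                save the core locally in /var/crash, then reboot
    """
    action = "mount_rootfs_and_run_init"  # hypothetical label for "unset"
    for line in kdump_conf_text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, tokenize
        if len(fields) == 2 and fields[0] == "default":
            if fields[1] in ("shell", "reboot"):
                action = fields[1]
    return action


# Example configuration; the host name is a placeholder.
EXAMPLE_CONF = """\
net nfs.example.com:/var/crash
path /
default shell
"""
print(failure_action(EXAMPLE_CONF))
```

With the "default shell" line removed, the same function returns the unset
case, which matches the reboot-without-console behavior reported in
comment 13.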

Comment 15 Akira Imamura 2006-11-03 23:38:10 UTC
As you said, it worked as expected. I don't yet know what determines how long
it takes to receive an ENOSPC error. Is approximately 30 seconds normal?

Thanks,
Akira

Comment 16 Neil Horman 2006-11-04 15:14:40 UTC
No.  As I explained before, the time it takes to get an ENOSPC error depends
on the amount of space on your target and the throughput of the transfer.  You
get an ENOSPC error as soon as you attempt to issue a write command with no
space left on the device, so you have to take the time to fill the device
first.  That dictates the latency to the error.
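
As a rough worked example of that relationship, the time to the error is the
free space divided by the throughput.  The sketch below is in Python; the
function name seconds_until_enospc is hypothetical, and the 20 MiB/s
throughput is an assumed figure for illustration, not a measurement from
this bug.

```python
def seconds_until_enospc(free_bytes, throughput_bytes_per_s):
    """Estimate how long a copy runs before the target fills up."""
    if throughput_bytes_per_s <= 0:
        raise ValueError("throughput must be positive")
    return free_bytes / throughput_bytes_per_s


MIB = 1024 * 1024
GIB = 1024 * MIB

# Assumed throughput of 20 MiB/s (hypothetical, not measured in this bug).
ASSUMED_RATE = 20 * MIB

print(seconds_until_enospc(512 * MIB, ASSUMED_RATE))  # 512 MiB of free space
print(seconds_until_enospc(3 * GIB, ASSUMED_RATE))    # 3 GiB of free space
```

At that assumed rate the two targets fill in roughly 26 and 154 seconds.  A
slower link or a larger target stretches the delay proportionally.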

