Description of problem:
kdump won't terminate copying a vmcore even if there is not enough space to save the core. Consequently, kdump cannot carry out the next step and is unable to reboot.
Version-Release number of selected component (if applicable):
kexec-tools-1.101-92.el5
How reproducible:
Always
Steps to Reproduce:
1. Configure the path for saving a core in /etc/kdump.conf so that the space available there is smaller than the RAM size.
2. Run "service kdump restart".
3. Make the system panic.
Actual results:
kdump won't terminate copying a vmcore.
Expected results:
kdump terminates copying a vmcore and reboots the system.
Additional info:
As I look at the init script, I may understand what you are seeing. If the core file copy fails, we leave the file named vmcore-incomplete and continue with the rootfs mount to try to safely copy the file to vmcore, which probably doesn't use the vmcore-incomplete semantics. I'm hesitant to just reboot on error, in the event that the initramfs dump target doesn't match the rootfs /var/crash target, but I'll add the -incomplete semantics to the kdump initscript to make this more consistent and clear.
Actually, scratch my last comment, I don't understand this. 92.el5 has the -incomplete semantics in the initscript, so please provide me with your kdump.conf and a console capture of the dump session so I can figure out what you're seeing. Thanks!
Created attachment 138346 [details] kdump won't terminate copying a vmcore.
In /etc/kdump.conf I specified only the following, in order to make kdump copy a vmcore to a remote host:
-----
net dhcp78-28.lab.boston.redhat.com:/var/crash
-----
As for the console capture of the dump session, please see the attachment. FYI, when this happens, all I can do is reset the machine.
Regards, Akira
Please don't set bugs assigned to me to NEEDINFO on me; I don't see them in my normal list. Just set them back to assigned, please.
Looks like it's hanging trying to mount the server. What does the exports file on your NFS server look like? Do you see a vmcore-incomplete file created on your NFS server? How big is it?
> What does your exports file on your nfs server look like?
It looks as below.
/var/crash dhcp78-238(rw,no_root_squash)
> Do you see a vmcore-incomplete file created on your nfs server?
Yes, I do. However, I see it on
dhcp78-28.lab.boston.redhat.com:/var/crash/var/crash
That path looks ugly. This could be related to BZ#210056 that you have modified. If so, the status of BZ#210056 should be changed.
> how big is it?
The RAM size is 16 GB, but the created vmcore-incomplete file size is 3 GB. Is this related to this bug? (I don't think so.) Anyway, I will provide as much information as I can in order to resolve this problem.
Thanks, Akira
> Yes, I do. However, I see it on
> dhcp78-28.lab.boston.redhat.com:/var/crash/var/crash
> That path looks ugly. This could be related to BZ#210056 that you
The path looks ugly because that's how you've configured it. We had an email thread on the kexec-kboot internal list over the past week about how to implement path additions, and in bz 210056 you'll see that the conclusion we came to was to add a path directive in kdump.conf that lets you specify the path and defaults to /var/crash if unset. You indicated above that you have only one active directive in your kdump.conf, which mounts dhcp78-28.lab.boston.redhat.com:/var/crash/ on /mnt. Since the core is then saved to /mnt/$SAVE_PATH and the path defaults to /var/crash, you get /var/crash/var/crash on the server. Set the following in kdump.conf to avoid this:
path /
> The RAM size is 16 GB, but the created vmcore-incomplete file size
> is 3 GB. Is this related to this bug?
It sounds to me like it's the definition of the bug. I assume that the target filesystem has only 3GB of space on it? If so, NFS should be reporting ENOSPC back to the user app (in this case the copy operation), which should then fail. As it is, though, it would appear that NFS isn't reporting an error back to the cp process (I can manually do a busybox copy to fill up a filesystem locally and have it error out). I'll try to reproduce this here. In the meantime, let me know how much free space you have on your system, and see if dumping to a too-small local filesystem results in the same failure for you. Thanks!
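For illustration only, the doubled path can be reproduced with a small Python sketch. It assumes the behaviour described in this comment (the export is mounted on /mnt and the core is written under /mnt/$SAVE_PATH, with the path defaulting to /var/crash). The function name is hypothetical and this is not the actual initramfs code.

```python
import posixpath


def location_on_server(export_dir, path_directive="/var/crash"):
    """Return where the core ends up on the NFS server.

    export_dir is the directory named in the "net" directive;
    path_directive is the value of "path" (assumed default: /var/crash).
    Hypothetical helper for illustration, not kdump code.
    """
    # The save path is appended to the mount point, so on the server
    # it is appended to the exported directory.
    joined = export_dir.rstrip("/") + "/" + path_directive.lstrip("/")
    return posixpath.normpath(joined)


# No "path" directive: the default is appended to the export.
print(location_on_server("/var/crash"))
# With "path /": the core lands directly in the export.
print(location_on_server("/var/crash/", "/"))
```

Under these assumptions the first call yields /var/crash/var/crash and the second yields /var/crash.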
Thanks for letting me know how to avoid that. I found that kdump does work as expected with /etc/kdump.conf configured as below.
-----
net dhcp78-28.lab.boston.redhat.com:/var/crash/
path /
-----
I misunderstood the specification again. Sorry about that. Here are the answers to your questions. My NFS server has 3 GB of free space to save a vmcore. That's why kdump created a 3 GB vmcore-incomplete. Dumping to a too-small local filesystem results in the same failure for me. NFS doesn't appear to report an error back to the cp process. So, it's possible that there's something wrong with NFS, isn't it?
Regards, Akira
So, I just tried this on my system, and it worked exactly as I expected. I created an NFS mount that had only 512MB of space available to it, configured kdump to dump to that mount and crashed the kernel. The initramfs attempted to save the core to the NFS mount and received an ENOSPC error, which caused the cp operation to fail. The system then mounted the rootfs, started the initscripts, saved the core locally, and rebooted. So, my only thought then is that your NFS server isn't responding properly to an out of space condition. What are you using as your NFS server? Can you capture a tcpdump of all the communication with the NFS server during the dump? It's also possible that you're not using cp to copy the core in the initramfs, and that whatever you are using isn't handling the copy operation properly. Is it possible that (despite your config in comment #9) you have makedumpfile specified as your core_collector?
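The -incomplete behaviour discussed in this thread can be sketched in Python as follows. This is an illustration under assumptions, not the initscript itself: the function name is hypothetical, and the rename to vmcore on success is inferred from the file names mentioned above.

```python
import errno
import os


def save_core(chunks, target_dir):
    """Write the core as vmcore-incomplete and rename it on success.

    Returns True if the whole core was saved, False if the target ran
    out of space (ENOSPC). Hypothetical helper for illustration.
    """
    incomplete = os.path.join(target_dir, "vmcore-incomplete")
    final = os.path.join(target_dir, "vmcore")
    try:
        with open(incomplete, "wb") as out:
            for chunk in chunks:
                out.write(chunk)
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            # Leave the partial file under its -incomplete name so the
            # caller can fall back to another action.
            return False
        raise
    os.rename(incomplete, final)
    return True
```

A caller would treat a False return as the signal to fall back, for example to saving the core locally after the rootfs is mounted.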
Just to make sure: how long did it take to receive the error? Was it a short time?
It was less than 30 seconds. Still waiting on the information requested in comment #10.
I tested once again. As you said, the system rebooted automatically just after it received an ENOSPC error. However, it took a very long time to receive the error, much longer than 30 seconds. Although I didn't time it, it seemed to be longer than 10 minutes. Anyway, it worked as expected. BTW, the mechanism to avoid losing dump data doesn't work when an NFS path or SSH path is configured. I expected kdump to present a console so that the user can save the vmcore completely when copying it fails due to an error. Instead, it didn't present a console and just rebooted. As a result, the dump data was lost. Thanks, Akira
I don't know what to tell you. You've just confirmed with your testing (I assume to a local fs) that you got an ENOSPC error just as I did, only it took you much longer because the amount of available space on your target was bigger than mine. When I did my testing, I specifically used an NFS mount, so I know that works for me, contrary to your statement, as does ssh/scp dumping, which I just tested as well. I assume you are claiming these facilities don't work properly because you expected the console to be presented, which is incorrect. The only way an interactive console is going to be presented to you is if you configure it as such, using the "default" directive in kdump.conf. You can set its value to shell or to reboot, the effects of which are self-explanatory. In the event that it is left unset, the default action will be to mount the root filesystem, switchroot to it and run /sbin/init, which should save the core file locally in /var/crash and then reboot. That's probably what you are seeing and labeling as a failure. Set the default directive in kdump.conf to shell and it should work as you expect.
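The three outcomes described above can be summarised in a short Python sketch. The function names are hypothetical and the parsing is simplified; treating the first "default" line as authoritative, stripping "#" comments, and sending unrecognised values to the unset behaviour are assumptions made for illustration.

```python
def read_default_directive(conf_text):
    """Return the value of the "default" directive, or None if unset."""
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "default":
            return fields[1]
    return None


def action_after_failed_dump(default_value):
    """Describe what happens when the dump in the initramfs fails."""
    if default_value == "shell":
        return "present an interactive shell"
    if default_value == "reboot":
        return "reboot immediately"
    # Unset (or, by assumption here, anything else).
    return ("mount the root filesystem, switchroot, run /sbin/init, "
            "save the core in /var/crash, then reboot")


conf = "net dhcp78-28.lab.boston.redhat.com:/var/crash/\npath /\ndefault shell\n"
print(action_after_failed_dump(read_default_directive(conf)))
```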
As you said, it worked as expected. I still don't know what determines how long it takes to receive an ENOSPC error. Is approximately 30 seconds normal? Thanks, Akira
No. As I explained before, the time it takes to get an ENOSPC error depends on the amount of space on your target and the throughput of the transfer. You get an ENOSPC error as soon as you attempt to issue a write command with no space left on the device, so you have to take the time to fill the device first. That dictates the latency to the error.
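As a rough back-of-the-envelope sketch in Python: the latency is approximately the free space divided by the transfer throughput. The 20 MiB/s figure below is a made-up example value, not a measurement from this bug, and the function name is hypothetical.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def seconds_until_enospc(free_bytes, throughput_bytes_per_sec):
    """Estimate how long a copy runs before the target fills up."""
    if throughput_bytes_per_sec <= 0:
        raise ValueError("throughput must be positive")
    return free_bytes / throughput_bytes_per_sec


# Hypothetical throughput of 20 MiB/s, for illustration only.
assumed_throughput = 20 * MIB
print(seconds_until_enospc(512 * MIB, assumed_throughput))  # small target
print(seconds_until_enospc(3 * GIB, assumed_throughput))    # larger target
```

At the same throughput, a target with six times the free space takes six times as long to report the error, and a slower link stretches both figures proportionally.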