Bug 1490946
Summary: XFS (dm-0): Corruption of in-memory data detected. Shutting down filesystem
Product: Red Hat Enterprise Linux 7
Component: kernel-rt
Sub component: XFS
Version: 7.5
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Reporter: Luiz Capitulino <lcapitulino>
Assignee: fs-maint
QA Contact: Filesystem QE <fs-qe>
CC: bfoster, bhu, esandeen, lgoncalv, zlang
Type: Bug
Last Closed: 2017-10-31 14:27:35 UTC
Bug Blocks: 1175461
Description
Luiz Capitulino, 2017-09-12 14:39:20 UTC
This may be related to the following issue: Bug 1448770 - INFO: task xfsaild/dm-2:1175 blocked for more than 600 seconds

---

> I was able to repair the FS by using xfs_repair.
Any chance you saved the xfs_repair output?
I suppose you had to use -L because log replay failed for you, so there would be a lot of spurious errors in there, but seeing what repair found might offer a clue about the corruption that broke log replay.

---

Yes, I used -L. I didn't save the xfs_repair output :( What I kept around are the files in lost+found. They all seem unimportant, though: most are source files I probably had open in my editor when the crash happened, and I don't even remember having changed them.

Btw, maybe xfs_repair should automatically write a log and ask the person desperately trying to fix the FS to mount the FS and save it :)

Btw, I see that the BZ got assigned to fs-maint. For issues reproduced in RT, we first try to determine if it's an RT-specific issue and only contact people outside the RT realm if we're somewhat sure the non-RT kernel is affected. So, feel free to ignore this for now (although debugging tips are very, very welcome).

---

(In reply to Luiz Capitulino from comment #0)
> Description of problem:
>
> While executing a series of 0-loss test-cases in a row for a few days with
> KVM-RT, my host rebooted automatically and was unable to mount the root
> file-system. I was able to retrieve dmesg from dracut shell:

What do you mean by your host rebooted automatically? Did an error occur or was the reboot intentional? If the latter, does that occur frequently during this test?

The XFS error reports are during the subsequent log recovery. By itself this doesn't really indicate whether the corruption had occurred sometime earlier or is due to a logging/recovery issue.

---

(In reply to Brian Foster from comment #7)
> (In reply to Luiz Capitulino from comment #0)
> > Description of problem:
> >
> > While executing a series of 0-loss test-cases in a row for a few days with
> > KVM-RT, my host rebooted automatically and was unable to mount the root
> > file-system. I was able to retrieve dmesg from dracut shell:
>
> What do you mean by your host rebooted automatically?
> Did an error occur or was the reboot intentional? If the latter, does that
> occur frequently during this test?

What happened was: I ran the script that executes the test-cases on a Friday evening. Came back to check the results Sunday morning and all I had was a dracut shell prompt. dmesg had the traces in the description. That's all. After the FS was repaired, I looked for errors in /var/log/messages, vmcores, etc. Found nothing.

This happened only once. I have since run the exact same sequence of tests several times and nothing happened.

---

(In reply to Luiz Capitulino from comment #8)
...
> What happened was: I ran the script that executes the test-cases on a Friday
> evening. Came back to check the results Sunday morning and all I had was a
> dracut shell prompt. dmesg had the traces in the description. That's all.
> After the FS was repaired, I looked for errors in /var/log/messages, vmcores
> etc. Found nothing.
...

Ok, so it sounds like there are two potential issues here. One is an unexpected, unknown system reset and the second is the log recovery failure. I can't really speak to why the system might have reset, but there is a possibility that the XFS corruption was due to a logging problem. If that is the case, it may never reproduce unless the system restarts at the appropriate point in time to trigger the problem on recovery. Alternatively, this may not be the case at all and the issue could just be very difficult to reproduce.

---

Let's keep this BZ open for a while as I'll be doing lots of this testing in the coming weeks. Also, is there anything I should do if it happens again (besides collecting xfs_repair's output)?

---

(In reply to Luiz Capitulino from comment #10)
> Let's keep this BZ open for a while as I'll be doing lots of this testing in
> the coming weeks. Also, is there anything I should do if it happens again
> (besides collecting xfs_repair's output)?
A metadump of the broken fs before it is repaired might be useful (i.e., 'xfs_metadump -go <dev> <output>'). Otherwise, if this problem doesn't seem to recur in the absence of system resets, it might be interesting to try to explicitly incorporate hard resets (i.e., echo b > /proc/sysrq-trigger) into the test.

---

(In reply to Luiz Capitulino from comment #10)
> Let's keep this BZ open for a while as I'll be doing lots of this testing in
> the coming weeks. Also, is there anything I should do if it happens again
> (besides collecting xfs_repair's output)?

Is this testing still in progress? Any luck reproducing?

---

No, it never happened again and we moved our testbed to newer kernels. So, I'm closing as WORKSFORME.
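The reporter notes in the thread that xfs_repair keeps no log of its own, so the repair transcript was lost. A common workaround is to tee the output to a file. Below is a minimal sketch; the function name, device path, and log location are hypothetical, and it deliberately refuses to run against anything that is not a block device so it can be dry-run safely:

```shell
# repair_with_log: run xfs_repair -L on a device, preserving all output.
# $1 = block device to repair, $2 = file to write the transcript to.
repair_with_log() {
    dev=$1
    log=$2
    if [ ! -b "$dev" ]; then
        # Not a block device: refuse rather than touch something else.
        echo "skipping: $dev is not a block device"
        return 0
    fi
    # -L zeroes the log; only appropriate when log replay itself fails.
    # 2>&1 folds stderr in so repair's error reports land in the log too.
    xfs_repair -L "$dev" 2>&1 | tee "$log"
}
```

For example, `repair_with_log /dev/dm-0 /root/xfs_repair.log` from the dracut rescue shell (after taking a metadump, and with the filesystem unmounted) keeps what repair found available for later analysis.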
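The two data-capture and reset-injection steps suggested in the thread (metadump before repair, and `echo b > /proc/sysrq-trigger` to inject hard resets) can be sketched as shell functions. Function names and guard behavior are illustrative additions, not part of the original commands:

```shell
# capture_metadump: image the metadata of the broken fs *before* repairing.
# Flags per the suggestion in the thread: -g prints progress, -o leaves
# filenames unobfuscated. $1 = block device, $2 = output image file.
capture_metadump() {
    dev=$1
    out=$2
    if [ ! -b "$dev" ]; then
        echo "skipping: $dev is not a block device"
        return 0
    fi
    xfs_metadump -go "$dev" "$out"
}

# hard_reset: immediate reboot via magic sysrq, with no sync or unmount,
# to exercise XFS log recovery the way an unexpected reset would.
# Destructive, so it demands an explicit confirmation argument.
hard_reset() {
    if [ "$1" != "--yes-really" ]; then
        echo "refusing: pass --yes-really to hard-reset this machine"
        return 0
    fi
    echo b > /proc/sysrq-trigger
}
```

A reset-injection loop for the KVM-RT test script could then call `hard_reset --yes-really` at randomized points between test-cases, checking on each boot whether log recovery succeeded.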