Bug 1490946

Summary: XFS (dm-0): Corruption of in-memory data detected. Shutting down filesystem
Product: Red Hat Enterprise Linux 7
Component: kernel-rt
kernel-rt sub component: XFS
Version: 7.5
Reporter: Luiz Capitulino <lcapitulino>
Assignee: fs-maint
QA Contact: Filesystem QE <fs-qe>
CC: bfoster, bhu, esandeen, lgoncalv, zlang
Status: CLOSED INSUFFICIENT_DATA
Severity: unspecified
Priority: unspecified
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-10-31 14:27:35 UTC
Bug Blocks: 1175461

Description Luiz Capitulino 2017-09-12 14:39:20 UTC
Description of problem:

While executing a series of 0-loss test-cases in a row for a few days with KVM-RT, my host rebooted automatically and was unable to mount the root file-system. I was able to retrieve dmesg from dracut shell:

[   14.889992] XFS (dm-0): Internal error XFS_WANT_CORRUPTED_RETURN at line 366 of file fs/xfs/libxfs/xfs_alloc.c.  Caller xfs_alloc_ag_vextent_size+0x5f8/0x790 [xfs]
[   14.906312] CPU: 12 PID: 690 Comm: mount Not tainted 3.10.0-693.rt56.617.el7.x86_64 #1
[   14.915150] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.2.5 09/08/2016
[   14.923495]  ffff880469610000 00000000ee02e1aa ffff8808687af930 ffffffff816aa004
[   14.931793]  ffff8808687af948 ffffffffc039a5bb ffffffffc03549d8 ffff8808687af9a0
[   14.940087]  ffffffffc03525f4 0000000600000006 ffff88086c073000 00000000c0357121
[   14.948383] Call Trace:
[   14.951119]  [<ffffffff816aa004>] dump_stack+0x19/0x1b
[   14.956887]  [<ffffffffc039a5bb>] xfs_error_report+0x3b/0x40 [xfs]
[   14.963803]  [<ffffffffc03549d8>] ? xfs_alloc_ag_vextent_size+0x5f8/0x790 [xfs]
[   14.971979]  [<ffffffffc03525f4>] xfs_alloc_fixup_trees+0x2c4/0x370 [xfs]
[   14.979574]  [<ffffffffc03549d8>] xfs_alloc_ag_vextent_size+0x5f8/0x790 [xfs]
[   14.987557]  [<ffffffffc0355885>] xfs_alloc_ag_vextent+0xe5/0x150 [xfs]
[   14.994956]  [<ffffffffc0355f3f>] xfs_alloc_fix_freelist+0x1cf/0x410 [xfs]
[   15.002648]  [<ffffffffc035699a>] xfs_free_extent+0x9a/0x130 [xfs]
[   15.009586]  [<ffffffffc03cb606>] xfs_trans_free_extent+0x26/0x60 [xfs]
[   15.017005]  [<ffffffffc03c2cce>] xlog_recover_process_efi+0x17e/0x1c0 [xfs]
[   15.033478]  [<ffffffffc03c9091>] xlog_recover_finish+0x21/0xb0 [xfs]
[   15.040702]  [<ffffffffc03bb4e4>] xfs_log_mount_finish+0x34/0x50 [xfs]
[   15.048023]  [<ffffffffc03b12fb>] xfs_mountfs+0x5fb/0x910 [xfs]
[   15.054661]  [<ffffffffc039e820>] ? xfs_filestream_get_parent+0x80/0x80 [xfs]
[   15.062659]  [<ffffffffc03b432b>] xfs_fs_fill_super+0x3fb/0x510 [xfs]
[   15.069851]  [<ffffffff811ff340>] mount_bdev+0x1b0/0x1f0
[   15.084479]  [<ffffffffc03b2bc5>] xfs_fs_mount+0x15/0x20 [xfs]
[   15.090983]  [<ffffffff811ffbe9>] mount_fs+0x39/0x1b0
[   15.096617]  [<ffffffff8121d807>] vfs_kern_mount+0x67/0x120
[   15.102838]  [<ffffffff8121ffbe>] do_mount+0x24e/0xb00
[   15.108574]  [<ffffffff811a3bab>] ? strndup_user+0x4b/0xa0
[   15.114698]  [<ffffffff81220bf6>] SyS_mount+0x96/0xf0
[   15.120339]  [<ffffffff816ba314>] tracesys+0xdd/0xe2
[   15.125922] XFS (dm-0): Internal error xfs_trans_cancel at line 984 of file fs/xfs/xfs_trans.c.  Caller xlog_recover_process_efi+0x18e/0x1c0 [xfs]
[   15.140579] CPU: 12 PID: 690 Comm: mount Not tainted 3.10.0-693.rt56.617.el7.x86_64 #1
[   15.149415] Hardware name: Dell Inc. PowerEdge R430/0CN7X8, BIOS 2.2.5 09/08/2016
[   15.157766]  ffff88046af78000 00000000ee02e1aa ffff8808687afbd0 ffffffff816aa004
[   15.166059]  ffff8808687afbe8 ffffffffc039a5bb ffffffffc03c2cde ffff8808687afc10
[   15.174352]  ffffffffc03b80ad ffff880868f04000 0000000000000000 ffff88046a5b8000
[   15.182647] Call Trace:
[   15.185379]  [<ffffffff816aa004>] dump_stack+0x19/0x1b
[   15.191145]  [<ffffffffc039a5bb>] xfs_error_report+0x3b/0x40 [xfs]
[   15.198080]  [<ffffffffc03c2cde>] ? xlog_recover_process_efi+0x18e/0x1c0 [xfs]
[   15.206175]  [<ffffffffc03b80ad>] xfs_trans_cancel+0xbd/0xe0 [xfs]
[   15.213109]  [<ffffffffc03c2cde>] xlog_recover_process_efi+0x18e/0x1c0 [xfs]
[   15.221008]  [<ffffffffc03c42d4>] xlog_recover_process_efis.isra.30+0x84/0xf0 [xfs]
[   15.229587]  [<ffffffffc03c9091>] xlog_recover_finish+0x21/0xb0 [xfs]
[   15.236809]  [<ffffffffc03bb4e4>] xfs_log_mount_finish+0x34/0x50 [xfs]
[   15.244128]  [<ffffffffc03b12fb>] xfs_mountfs+0x5fb/0x910 [xfs]
[   15.250765]  [<ffffffffc039e820>] ? xfs_filestream_get_parent+0x80/0x80 [xfs]
[   15.258762]  [<ffffffffc03b432b>] xfs_fs_fill_super+0x3fb/0x510 [xfs]
[   15.265954]  [<ffffffff811ff340>] mount_bdev+0x1b0/0x1f0
[   15.271911]  [<ffffffffc03b3f30>] ? xfs_test_remount_options.isra.11+0x70/0x70 [xfs]
[   15.280582]  [<ffffffffc03b2bc5>] xfs_fs_mount+0x15/0x20 [xfs]
[   15.287093]  [<ffffffff811ffbe9>] mount_fs+0x39/0x1b0
[   15.292734]  [<ffffffff8121d807>] vfs_kern_mount+0x67/0x120
[   15.298954]  [<ffffffff8121ffbe>] do_mount+0x24e/0xb00
[   15.304688]  [<ffffffff811a3bab>] ? strndup_user+0x4b/0xa0
[   15.310811]  [<ffffffff81220bf6>] SyS_mount+0x96/0xf0
[   15.316449]  [<ffffffff816ba314>] tracesys+0xdd/0xe2
[   15.321992] XFS (dm-0): xfs_do_force_shutdown(0x8) called from line 985 of file fs/xfs/xfs_trans.c.  Return address = 0xffffffffc03b80c6
[   15.335674] XFS (dm-0): Corruption of in-memory data detected.  Shutting down filesystem
[   15.344704] XFS (dm-0): Please umount the filesystem and rectify the problem(s)
[   15.352868] XFS (dm-0): Failed to recover EFIs
[   15.357826] XFS (dm-0): log mount finish failed

I was able to repair the FS using xfs_repair. I've run the same series of tests a few times now without triggering this issue again.


Version-Release number of selected component (if applicable): kernel-3.10.0-693.rt56.617.el7.x86_64


How reproducible:

The environment and test cases are very complex. We're using realtime KVM on a testbed with two machines connected back-to-back. The host machine (which triggered the FS corruption) is running DPDK threads and a VM doing packet forwarding (all forwarding threads have fifo:1 priority). The test cases exercise packet forwarding under different scenarios: idle, host I/O load, and guest I/O load. The complete set of tests may take up to 2 days, and I can't tell which test case caused the issue.
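
For context, a rough sketch of how a forwarding thread ends up at fifo:1 in a setup like this (the thread ID below is a placeholder; the actual test scripts may configure this differently):

  # Pin an already-running forwarding thread to SCHED_FIFO priority 1
  # (12345 is a placeholder thread ID)
  chrt -f -p 1 12345
  # Confirm the scheduling policy and priority that were applied
  chrt -p 12345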

Comment 2 Luiz Capitulino 2017-09-12 14:42:01 UTC
This may be related to the following issue:

Bug 1448770 - INFO: task xfsaild/dm-2:1175 blocked for more than 600 seconds

Comment 3 Eric Sandeen 2017-09-12 14:42:17 UTC
> I was able to repair the FS by using xfs_repair.

Any chance you saved the xfs_repair output?

Comment 4 Eric Sandeen 2017-09-12 14:44:06 UTC
I suppose you had to use -L because log replay failed for you, so there would be a lot of spurious errors in there, but seeing what repair found might offer a clue about the corruption that broke log replay.

Comment 5 Luiz Capitulino 2017-09-12 15:02:24 UTC
Yes, I used -L. I didn't save the xfs_repair output :( What I kept around are the files in lost+found. They all seem unimportant, though: most are source files I probably had open in my editor when the crash happened, and I don't even remember having changed them.

Btw, maybe xfs_repair should automatically write a log and ask the person desperately trying to fix the FS to mount the FS and save it :)
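
In the meantime, a minimal sketch of capturing the repair output by hand next time (device and log paths are placeholders; the log has to go somewhere that survives the reboot, e.g. a USB stick):

  # -L zeroes the log and can discard recent metadata updates; use it only if replay keeps failing
  xfs_repair -L /dev/mapper/rhel-root 2>&1 | tee /mnt/usb/xfs_repair.log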

Comment 6 Luiz Capitulino 2017-09-12 15:09:15 UTC
Btw, I see that the BZ got assigned to fs-maint. For issues reproduced in RT, we first try to determine if it's an RT-specific issue and only contact people outside the RT realm if we're somewhat sure the non-RT kernel is affected.

So, feel free to ignore this for now (although debugging tips are very, very welcome).

Comment 7 Brian Foster 2017-09-14 13:57:05 UTC
(In reply to Luiz Capitulino from comment #0)
> Description of problem:
> 
> While executing a series of 0-loss test-cases in a row for a few days with
> KVM-RT, my host rebooted automatically and was unable to mount the root
> file-system. I was able to retrieve dmesg from dracut shell:
> 

What do you mean by your host rebooted automatically? Did an error occur or was the reboot intentional? If the latter, does that occur frequently during this test?

The XFS error reports are during the subsequent log recovery. By itself this doesn't really indicate whether the corruption had occurred sometime earlier or is due to a logging/recovery issue.
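
One way to narrow that down, assuming logs from before the reset survived, is to check the previous boot for earlier XFS complaints (the exact commands depend on whether the journal is persistent):

  # rsyslog files usually survive the reboot
  grep -i xfs /var/log/messages*
  # kernel messages from the previous boot, if the journal is persistent
  journalctl -k -b -1 | grep -i xfs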

Comment 8 Luiz Capitulino 2017-09-14 14:03:04 UTC
(In reply to Brian Foster from comment #7)
> (In reply to Luiz Capitulino from comment #0)
> > Description of problem:
> > 
> > While executing a series of 0-loss test-cases in a row for a few days with
> > KVM-RT, my host rebooted automatically and was unable to mount the root
> > file-system. I was able to retrieve dmesg from dracut shell:
> > 
> 
> What do you mean by your host rebooted automatically? Did an error occur or
> was the reboot intentional? If the latter, does that occur frequently during
> this test?

What happened was: I ran the script that executes the test-cases on a Friday evening. Came back to check the results Sunday morning and all I had was a dracut shell prompt. dmesg had the traces in the description. That's all. After the FS was repaired, I looked for errors in /var/log/messages, vmcores etc. Found nothing.

This happened only once. I have since run exactly the same sequence of tests several times and nothing happened.

Comment 9 Brian Foster 2017-09-14 14:14:29 UTC
(In reply to Luiz Capitulino from comment #8)
...
> What happened was: I ran the script that executes the test-cases on a Friday
> evening. Came back to check the results Sunday morning and all I had was a
> dracut shell prompt. dmesg had the traces in the description. That's all.
> After the FS was repaired, I looked for errors in /var/log/messages, vmcores
> etc. Found nothing.
> 
...

Ok, so it sounds like there are two potential issues here. One is an unexpected, unknown system reset and the second is the log recovery failure.

I can't really speak to why the system might have reset, but there is a possibility that the XFS corruption was due to a logging problem. If that is the case, it may never reproduce unless the system restarts at the appropriate point in time to trigger the problem on recovery. Alternatively, this may not be the case at all and the issue could just be very difficult to reproduce.

Comment 10 Luiz Capitulino 2017-09-14 14:18:54 UTC
Let's keep this BZ open for a while as I'll be doing lots of this testing in the coming weeks. Also, is there anything I should do if it happens again (besides collecting xfs_repair's output)?

Comment 11 Brian Foster 2017-09-14 14:28:34 UTC
(In reply to Luiz Capitulino from comment #10)
> Let's keep this BZ open for a while as I'll be doing lots of this testing in
> the coming weeks. Also, is there anything I should do if it happens again
> (besides collecting xfs_repair's output)?

A metadump of the broken fs before it is repaired might be useful (i.e., 'xfs_metadump -go <dev> <output>'). Otherwise, if this problem doesn't seem to recur in the absence of system resets, it might be interesting to explicitly incorporate hard resets (i.e., echo b > /proc/sysrq-trigger) into the test.
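
A rough capture sequence along those lines (device and output paths are placeholders):

  # Before running xfs_repair: save a metadata-only image of the broken fs
  # (-g prints progress, -o skips obfuscation of names, as suggested above)
  xfs_metadump -go /dev/mapper/rhel-root /mnt/usb/dm-0.metadump
  # To fold hard resets into the test: immediate reboot, no sync, no unmount
  echo b > /proc/sysrq-trigger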

Comment 12 Brian Foster 2017-10-31 14:09:02 UTC
(In reply to Luiz Capitulino from comment #10)
> Let's keep this BZ open for a while as I'll be doing lots of this testing in
> the coming weeks. Also, is there anything I should do if it happens again
> (besides collecting xfs_repair's output)?

Is this testing still in progress? Any luck reproducing?

Comment 13 Luiz Capitulino 2017-10-31 14:27:35 UTC
No, it never happened again, and we've moved our testbed to newer kernels. So, I'm closing as WORKSFORME.