Bug 1317395 - xfsdump failing with kernel issues inside vm [NEEDINFO]
Status: CLOSED INSUFFICIENT_DATA
Product: Fedora
Classification: Fedora
Component: kernel
Version: 23
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Depends On:
Blocks:
 
Reported: 2016-03-14 04:27 EDT by Michael Walton
Modified: 2016-10-26 12:44 EDT (History)
9 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-26 12:44:32 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
labbott: needinfo? (mike.walton33)


Attachments: None
Description Michael Walton 2016-03-14 04:27:11 EDT
Description of problem:

xfsdump is failing with a stack trace when run inside a VM

Version-Release number of selected component (if applicable):

This is occurring in kernels 4.4.3 and 4.4.2, and probably others before.
It does not occur when booted with the rescue kernel (4.2.8), either
because the bug is a regression or because the rescue kernel isn't doing as much.

How reproducible:

I am running xfsdump on the root file system inside qemu/kvm machines
and the problem occurs approximately 80% of the time. It occurs
less frequently when vm.swappiness is set to 0 or 1, but it still occurs.

Steps to Reproduce:
1. Run xfsdump on a typical linux vm (with virtio drivers)
  (e.g. xfsdump -v trace -f /mnt/mybackup /)
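The reproduction step above can be wrapped in a small loop. This is a hypothetical sketch, not the reporter's script: the run count is an assumption (with the bug hitting ~80% of the time, a handful of tries should suffice), and it refuses to do anything when xfsdump is not installed.

```shell
#!/bin/sh
# Hypothetical reproduction loop for the xfsdump crash (run count is an
# assumption; the backup path matches the example in the report).
RUNS=${RUNS:-7}
if command -v xfsdump >/dev/null 2>&1; then
    i=1
    while [ "$i" -le "$RUNS" ]; do
        xfsdump -v trace -f /mnt/mybackup."$i" / || break
        # The crash leaves "kernel BUG at include/linux/mm.h:342!" in the log:
        dmesg | grep -q 'kernel BUG at include/linux/mm.h' && break
        i=$((i + 1))
    done
else
    echo "xfsdump not installed; nothing to do"
fi
```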

Actual results:

We end up with only a partial backup
and the following stack trace in the log.

Mar 11 22:29:11 ourdeploy kernel: XFS (vdb1): Mounting V5 Filesystem
Mar 11 22:29:11 ourdeploy kernel: XFS (vdb1): Starting recovery (logdev: internal)
Mar 11 22:29:11 ourdeploy kernel: XFS (vdb1): Ending recovery (logdev: internal)
Mar 11 22:29:17 ourdeploy audit[1]: SERVICE_STOP pid=1 uid=0 auid=4294967295 ses=4294967295 msg='unit=rolekit comm="systemd" exe="/usr/lib/systemd/systemd" hostname=? addr=? terminal=? res=success'
Mar 11 22:37:14 ourdeploy kernel: page:ffffea00000267c0 count:0 mapcount:-127 mapping:          (null) index:0x0
Mar 11 22:37:14 ourdeploy kernel: flags: 0x1ffff800000000()
Mar 11 22:37:14 ourdeploy kernel: page dumped because: VM_BUG_ON_PAGE(atomic_read(&page->_count) == 0)
Mar 11 22:37:14 ourdeploy kernel: ------------[ cut here ]------------
Mar 11 22:37:14 ourdeploy kernel: kernel BUG at include/linux/mm.h:342!
Mar 11 22:37:14 ourdeploy kernel: invalid opcode: 0000 [#1] SMP
Mar 11 22:37:14 ourdeploy kernel: Modules linked in: nls_utf8 cifs dns_resolver fscache nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_tftp ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_broute bridge stp llc ebtable_filter ebtable_nat ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_raw ip6table_mangle ip6table_security ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_raw iptable_mangle iptable_security snd_hda_codec_generic snd_hda_intel snd_hda_codec ppdev snd_hda_core iosf_mbi crct10dif_pclmul snd_hwdep crc32_pclmul snd_pcm joydev snd_timer virtio_balloon parport_pc snd parport acpi_cpufreq i2c_piix4 soundcore tpm_tis tpm nfsd nfs_acl lockd grace auth_rpcgss sunrpc xfs libcrc32c virtio_console
Mar 11 22:37:14 ourdeploy kernel:  virtio_net virtio_blk qxl drm_kms_helper ttm drm crc32c_intel serio_raw virtio_pci ata_generic virtio_ring virtio pata_acpi
Mar 11 22:37:14 ourdeploy kernel: CPU: 0 PID: 1458 Comm: xfsdump Not tainted 4.4.4-301.fc23.x86_64 #1
Mar 11 22:37:14 ourdeploy kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
Mar 11 22:37:14 ourdeploy kernel: task: ffff88003e1d9e00 ti: ffff880028ccc000 task.ti: ffff880028ccc000
Mar 11 22:37:14 ourdeploy kernel: RIP: 0010:[<ffffffff811b5008>]  [<ffffffff811b5008>] __free_pages+0x38/0x40
Mar 11 22:37:14 ourdeploy kernel: RSP: 0018:ffff880028ccfa48  EFLAGS: 00010246
Mar 11 22:37:14 ourdeploy kernel: RAX: 0000000000000044 RBX: ffff88003f547900 RCX: 0000000000000006
Mar 11 22:37:14 ourdeploy kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880042c0dff0
Mar 11 22:37:14 ourdeploy kernel: RBP: ffff880028ccfa48 R08: 0000000000000000 R09: 0000000000000254
Mar 11 22:37:14 ourdeploy kernel: R10: 0000000000000001 R11: 0000000000000254 R12: 0000000000000001
Mar 11 22:37:14 ourdeploy kernel: R13: ffffffffa0180d4d R14: ffff880028ccfb38 R15: 0000000000000001
Mar 11 22:37:14 ourdeploy kernel: FS:  00007f57b020a780(0000) GS:ffff880042c00000(0000) knlGS:0000000000000000
Mar 11 22:37:14 ourdeploy kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 11 22:37:14 ourdeploy kernel: CR2: 00007f21861ec024 CR3: 0000000030025000 CR4: 00000000000406f0
Mar 11 22:37:14 ourdeploy kernel: Stack:
Mar 11 22:37:14 ourdeploy kernel:  ffff880028ccfa70 ffffffffa017fed3 ffff880036d073c0 0000000000000000
Mar 11 22:37:14 ourdeploy kernel:  0000000000010015 ffff880028ccfab0 ffffffffa0180d4d ffff88003f547900
Mar 11 22:37:14 ourdeploy kernel:  0000000000010015 0000000000000001 0000000000010014 ffff880036d073c0
Mar 11 22:37:14 ourdeploy kernel: Call Trace:
Mar 11 22:37:14 ourdeploy kernel:  [<ffffffffa017fed3>] xfs_buf_free+0x73/0x130 [xfs]
Mar 11 22:37:14 ourdeploy kernel:  [<ffffffffa0180d4d>] xfs_buf_get_map+0x22d/0x280 [xfs]
Mar 11 22:37:14 ourdeploy kernel:  [<ffffffffa018187d>] xfs_buf_read_map+0x2d/0x180 [xfs]
Mar 11 22:37:14 ourdeploy kernel:  [<ffffffffa0181a22>] xfs_buf_readahead_map+0x52/0x70 [xfs]
Mar 11 22:37:14 ourdeploy kernel:  [<ffffffffa0158bc1>] xfs_btree_reada_bufs+0x61/0x80 [xfs]
Mar 11 22:37:14 ourdeploy kernel:  [<ffffffffa01962d8>] xfs_bulkstat_ichunk_ra.isra.4+0xe8/0x140 [xfs]
Mar 11 22:37:14 ourdeploy kernel:  [<ffffffffa016e8b3>] ? xfs_inobt_get_rec+0x33/0xc0 [xfs]
Mar 11 22:37:14 ourdeploy kernel:  [<ffffffffa0196973>] xfs_bulkstat+0x303/0x670 [xfs]
Mar 11 22:37:14 ourdeploy kernel:  [<ffffffff81244c3d>] ? __dentry_kill+0x13d/0x1b0
Mar 11 22:37:14 ourdeploy kernel:  [<ffffffffa0196640>] ? xfs_bulkstat_one_int+0x310/0x310 [xfs]


Expected results:

We would expect xfsdump to complete without kernel problems.


Additional info:
Comment 1 Laura Abbott 2016-03-14 16:29:05 EDT
There was a known compatibility issue with xfsprogs; see http://jwboyer.livejournal.com/52090.html . Please update to the latest xfsprogs (it should be 4.3.0) and see if that fixes the issue.
Comment 2 Michael Walton 2016-03-14 19:34:48 EDT
(In reply to Laura Abbott from comment #1)
> There was a known issue compatibility issue with xfsprogs, see 
> http://jwboyer.livejournal.com/52090.html . Please update to the latest
> xfsprogs (should be 4.3.0) and see if that fixes the issue.

I've got a fully up to date system with xfsprogs 4.3.0-1 installed.
I've already done an xfs_repair using an up to date xfsprogs on the
gparted cd. For anyone interested in this problem I imagine that
bumping up vm.swappiness to a high level will help to expose it.
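The swappiness suggestion above could be captured as a sysctl fragment; the exact value (90) is an assumption, since the comment only says "a high level":

```
# /etc/sysctl.d/-style fragment to make the failure easier to expose.
# The value 90 is hypothetical; anything well above the default should do.
vm.swappiness = 90
```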
Comment 3 Michael Walton 2016-03-17 21:14:37 EDT
I'm still encountering this issue with kernels 4.4.4 and 4.4.5
Comment 4 Michael Walton 2016-03-18 01:18:30 EDT
I was able to do an xfsdump 6 times in a row on the same machine after downgrading to kernel 4.2.3, so AFAICS this is most definitely a regression.
Comment 5 Michael Walton 2016-03-19 22:25:00 EDT
I have tried out a bunch of different kernels now. To make a long story short, I installed 4.3.6 and encountered no bug, and when I installed kernel 4.4,
I did encounter the bug.
I installed kernel 4.3.6 from
 https://kojipkgs.fedoraproject.org/packages/kernel/4.3.6/201.fc22/
and kernel 4.4 from
https://kojipkgs.fedoraproject.org/packages/kernel/4.4.0/1.fc24/

(Neither one was specifically an f23 kernel but I don't think that matters).

Therefore, in my humble opinion, the problem was introduced in kernel 4.4.
Comment 6 Michael Walton 2016-03-23 00:14:55 EDT
Hi again,
I used git bisect on the development kernel (or tried to anyway). Supposing that I haven't screwed up, the commit where things break is:
f77cf4e4cc9d40310a7224a1a67c733aeec78836
This isn't an xfs commit, so either that commit has a bug, or it exposes
a bug in xfs, or at the very least it subtly breaks
xfs somehow (if I haven't screwed up my bisection, of course). Thanks,
I'm hoping that you guys can look into this now.
Comment 7 Michael Walton 2016-03-23 00:28:24 EDT
Sorry, posted too soon. I was doing 7 xfsdumps in a row as a test, but apparently that's not good enough. I don't think I've bisected this right. Back to the drawing board.
Comment 8 Michael Walton 2016-03-23 22:05:06 EDT
Well, I kept at it. Here's where things stand now. Two repeat bisection attempts have landed me at

d0164adc89f6bb374d304ffcc375c6d2652fe67d

I really can produce the bug here and I really cannot produce it at the previous commit (11 xfsdumps with swappiness at 0 and no kernel errors).
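The bisection described in comments 6-8 could be driven by a script like the following sketch. The kernel tree path is an assumption; with KSRC unset it only prints the commands (a dry run), and the good/bad endpoints come from the kernels tested in comment 5.

```shell
#!/bin/sh
# Hypothetical driver for the bisect described above. KSRC (path to a
# kernel git tree) is an assumption; leave it unset to just print commands.
set -u
KSRC=${KSRC:-}
run() {
    echo "+ $*"
    if [ -n "$KSRC" ]; then (cd "$KSRC" && "$@"); fi
}
run git bisect start
run git bisect bad v4.4       # xfsdump crashes on 4.4 (comment 5)
run git bisect good v4.3.6    # xfsdump survives on 4.3.6 (comment 5)
# At each step: build and boot the candidate kernel in the VM, run xfsdump
# ~10 times with vm.swappiness=0 (comment 8), then mark the result:
#   run git bisect good    # no "kernel BUG" in the logs
#   run git bisect bad     # VM_BUG_ON_PAGE trips
```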
Comment 9 Laura Abbott 2016-09-23 15:25:48 EDT
*********** MASS BUG UPDATE **************
 
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 23 kernel bugs.
 
Fedora 23 has now been rebased to 4.7.4-100.fc23. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 24 or 25, and are still experiencing this issue, please change the version to Fedora 24 or 25.
 
If you experience different issues, please open a new bug report for those.
Comment 10 Laura Abbott 2016-10-26 12:44:32 EDT
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 4 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
