Bug 1315895
Summary:           Metadata corruption detected at xfs_agf_read_verify
Product:           Red Hat Enterprise Linux 7
Component:         kernel-aarch64
Sub component:     XFS
Version:           7.3
Hardware:          aarch64
OS:                Unspecified
Status:            CLOSED ERRATA
Keywords:          Reopened
Severity:          unspecified
Priority:          unspecified
Reporter:          Richard W.M. Jones <rjones>
Assignee:          Richard W.M. Jones <rjones>
QA Contact:        Erico Nunes <ernunes>
CC:                dchinner, eguan, ernunes, esandeen, jbastian, kchamart, mlangsdo, pbrobinson, zlang
Target Milestone:  rc
Fixed In Version:  kernel-aarch64-4.5.0-0.38.el7
Doc Type:          Bug Fix
Type:              Bug
Last Closed:       2016-11-03 22:36:10 UTC
Bug Blocks:        910269
Description
Richard W.M. Jones, 2016-03-08 22:37:59 UTC
Created attachment 1134330 [details]
guest boot log
The console log from the guest booting, showing apparently massive
disk corruption.
Created attachment 1134331 [details]
virt-builder script used to create the guest
virt-builder script used to create & resize the guest. Note
this runs using a captive host (RHELSA) kernel. Also note that
it runs xfs_growfs as part of the process.
Created attachment 1134332 [details]
qemu log
qemu log showing the qemu command which libvirt runs. Note that
the format is correct (disk format is qcow2, format=qcow2 option is
correctly used).
Also note the host kernel -- used to grow the filesystem -- is newer than the guest kernel: 4.5.0 > 4.2.3.

Was it during mount?

Please try w/
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/xfs?id=8e0bd4925bf693520295de403483efad4dc5cc16

    xfs: fix endianness error when checking log block crc on big endian platforms

    Since the checksum function and the field are both __le32, don't
    perform endian conversion when comparing the two. This fixes mount
    failures on ppc64.

    Signed-off-by: Darrick J. Wong <darrick.wong>
    Reviewed-by: Brian Foster <bfoster>
    Signed-off-by: Dave Chinner <david>

-Eric

Disk image which is supposedly faulty is here:
http://oirase.annexia.org/tmp/bz1315895/

(In reply to Eric Sandeen from comment #6)
> Please try w/
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/fs/xfs?id=8e0bd4925bf693520295de403483efad4dc5cc16

Both guest and host are little endian, so this shouldn't be required?

What is the xfs_repair output from the corrupt filesystem?

Also, can we please take triage of this problem (it's from a bleeding-edge upstream kernel) to the upstream mailing lists - RH bugzilla is not the place to triage problems with upstream kernels.

-Dave.

(In reply to Dave Chinner from comment #9)
> What is the xfs_repair from the corrupt filesystem?

Firstly, I'm able to reproduce the problem now on x86_64. I took the filesystem (see comment 7) and downloaded it to a machine with kernel 4.3.3-301.fc23.x86_64.
When mounting the filesystem I get the same kind of errors:

# mount /dev/sda3 /tmp/mnt
[ 273.845130] SGI XFS with ACLs, security attributes, no debug enabled
[ 273.851789] XFS (sda3): Mounting V5 Filesystem
[ 273.857753] XFS (sda3): Starting recovery (logdev: internal)
[ 273.936094] XFS (sda3): Metadata corruption detected at xfs_agf_read_verify+0x70/0x120 [xfs], block 0x9d2001
[ 273.937303] XFS (sda3): Unmount and run xfs_repair
[ 273.937894] XFS (sda3): First 64 bytes of corrupted metadata buffer:
[ 273.938685] ffff88001c90da00: 58 41 47 46 00 00 00 01 00 00 00 04 00 04 e9 00  XAGF............
[ 273.939759] ffff88001c90da10: 00 00 00 01 00 00 00 02 00 00 00 00 00 00 00 01  ................
[ 273.940834] ffff88001c90da20: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 76  ...............v
[ 273.941907] ffff88001c90da30: 00 00 00 00 00 04 e8 fb 00 04 e8 fb 00 00 00 00  ................
[ 273.942987] XFS (sda3): metadata I/O error: block 0x9d2001 ("xfs_trans_read_buf_map") error 117 numblks 1
mount: mount /dev/sda3 on /tmp/mnt failed: Structure needs cleaning

After doing the mount, I also did xfs_repair:

# xfs_repair /dev/sda3
Phase 1 - find and verify superblock...
        - reporting progress in intervals of 15 minutes
Phase 2 - using internal log
        - zero log...
        - scan filesystem freespace and inode maps...
Metadata corruption detected at block 0xc46801/0x200
Metadata corruption detected at block 0x9d2001/0x200
Metadata corruption detected at block 0x13a4001/0x200
Metadata corruption detected at block 0xebb001/0x200
Metadata corruption detected at block 0x1618801/0x200
Metadata corruption detected at block 0x188d001/0x200
Metadata corruption detected at block 0x112f801/0x200
Metadata corruption detected at block 0x1b01801/0x200
Metadata corruption detected at block 0x1fea801/0x200
Metadata corruption detected at block 0x1d76001/0x200
Metadata corruption detected at block 0x225f001/0x200
Metadata corruption detected at block 0x24d3801/0x200
Metadata corruption detected at block 0x2748001/0x200
Metadata corruption detected at block 0x2c31001/0x200
Metadata corruption detected at block 0x29bc801/0x200
Metadata corruption detected at block 0x2ea5801/0x200
Metadata corruption detected at block 0x311a001/0x200
Metadata corruption detected at block 0x338e801/0x200
Metadata corruption detected at block 0x3603001/0x200
Metadata corruption detected at block 0x3877801/0x200
fllast 118 in agf 4 too large (max = 118)
fllast 118 in agf 5 too large (max = 118)
fllast 118 in agf 9 too large (max = 118)
fllast 118 in agf 10 too large (max = 118)
fllast 118 in agf 12 too large (max = 118)
fllast 118 in agf 13 too large (max = 118)
fllast 118 in agf 8 too large (max = 118)
fllast 118 in agf 6 too large (max = 118)
fllast 118 in agf 18 too large (max = 118)
fllast 118 in agf 14 too large (max = 118)
fllast 118 in agf 7 too large (max = 118)
fllast 118 in agf 11 too large (max = 118)
fllast 118 in agf 19 too large (max = 118)
fllast 118 in agf 17 too large (max = 118)
fllast 118 in agf 16 too large (max = 118)
fllast 118 in agf 20 too large (max = 118)
fllast 118 in agf 21 too large (max = 118)
fllast 118 in agf 22 too large (max = 118)
fllast 118 in agf 15 too large (max = 118)
fllast 118 in agf 23 too large (max = 118)
sb_icount 25088, counted 25216
sb_ifree 238, counted 270
sb_fdblocks 7407927, counted 7407561
        - 10:12:25: scanning filesystem freespace - 24 of 24 allocation groups done
        - found root inode chunk
Phase 3 - for each AG...
        - scan and clear agi unlinked lists...
        - 10:12:25: scanning agi unlinked lists - 24 of 24 allocation groups done
        - process known inodes and perform inode discovery...
        - agno = 15
        - agno = 0
        - agno = 16
        - agno = 17
        - agno = 1
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 2
        - agno = 21
        - agno = 22
        - agno = 23
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - 10:12:25: process known inodes and inode discovery - 25216 of 25088 inodes done
        - process newly discovered inodes...
        - 10:12:25: process newly discovered inodes - 24 of 24 allocation groups done
Phase 4 - check for duplicate blocks...
        - setting up duplicate extent list...
        - 10:12:25: setting up duplicate extent list - 24 of 24 allocation groups done
        - check for inodes claiming duplicate blocks...
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - agno = 8
        - agno = 9
        - agno = 10
        - agno = 11
        - agno = 12
        - agno = 13
        - agno = 14
        - agno = 15
        - agno = 16
        - agno = 17
        - agno = 18
        - agno = 19
        - agno = 20
        - agno = 21
        - agno = 22
        - agno = 23
        - 10:12:25: check for inodes claiming duplicate blocks - 25216 of 25088 inodes done
Phase 5 - rebuild AG headers and trees...
        - 10:12:25: rebuild AG headers and trees - 24 of 24 allocation groups done
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
        - traversal finished ...
        - moving disconnected inodes to lost+found ...
Phase 7 - verify and correct link counts...
done

--------------

I suspect the problem could be:
- page size (aarch64 has a 64K page size; armv7 and x86_64 have a 4K page size)
- the 4.5 kernel's xfs_growfs somehow being incompatible with the 4.3 kernel's XFS

> Also, can we please take triage of this problem (it's from a bleeding-edge
> upstream kernel) to the upstream mailing lists - RH bugzilla is not the
> place to triage problems with upstream kernels.

Sure.

OK, I see what's odd here. The filesystem (see comment 7) shows NO errors when opened on the aarch64 machine. It only shows gross corruption errors when opened on x86_64 or armv7hl.

# mount /dev/sda3 /sysroot/
[ 12.536330] SGI XFS with ACLs, security attributes, no debug enabled
[ 12.543086] XFS (sda3): Mounting V5 Filesystem
[ 12.743466] XFS (sda3): Starting recovery (logdev: internal)
[ 13.003872] XFS (sda3): Ending recovery (logdev: internal)
# uname -a
Linux (none) 4.5.0-0.rc7.31.el7.aarch64 #1 SMP Tue Mar 8 13:10:54 EST 2016 aarch64 aarch64 aarch64 GNU/Linux

The differences are:
- kernel version (4.5.0 on aarch64, 4.3.0 on the others)
- page size (64K on aarch64, 4K on the others)

Were you going to take this to the upstream list?

-Eric

This is NOT a reproducer, just a data point. Create a filesystem on aarch64 and copy it over to x86_64:

aarch64$ truncate -s 1G start-on-aarch64.raw
aarch64$ mkfs.xfs start-on-aarch64.raw
aarch64$ file start-on-aarch64.raw
start-on-aarch64.raw: SGI XFS filesystem data (blksz 4096, inosz 256, v2 dirs)
aarch64$ scp start-on-aarch64.raw x86_64:/var/tmp

This is added as /dev/sda to an x86_64 virtual machine:

x86_64$ mount /dev/sda /sysroot/
[ 12.663853] SGI XFS with ACLs, security attributes, no debug enabled
[ 12.671057] XFS (sda): Mounting V4 Filesystem
[ 12.676669] XFS (sda): Ending clean mount

This is interesting - but it looks like it's a different failure. Create an XFS filesystem on x86_64, copy it over to aarch64, grow it, copy it back to x86_64, and see if we can still mount it.
x86_64$ truncate -s 100M start-on-x86_64.raw
x86_64$ mkfs.xfs start-on-x86_64.raw
x86_64$ scp start-on-x86_64.raw aarch64:/var/tmp/
aarch64$ truncate -s 1G ss

I attached the filesystem to an aarch64 virtual machine, so that I could mount it and grow it:

aarch64$ mount /dev/sda /sysroot/
[ 13.451623] SGI XFS with ACLs, security attributes, no debug enabled
[ 13.458059] XFS (sda): Mounting V5 Filesystem
[ 13.465118] XFS (sda): Ending clean mount
aarch64$ xfs_growfs /sysroot
meta-data=/dev/sda               isize=512    agcount=4, agsize=6400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1
data     =                       bsize=4096   blocks=25600, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=855, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 25600 to 262144
aarch64$ scp start-on-x86_64.raw x86_64:/var/tmp/

Back on x86_64 I try to mount it inside an x86_64 VM:

x86_64$ mount /dev/sda /sysroot
[ 6.216284] SGI XFS with ACLs, security attributes, no debug enabled
[ 6.224916] XFS (sda): Mounting V5 Filesystem
[ 6.229273] XFS (sda): Starting recovery (logdev: internal)
[ 6.231202] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[ 6.231803] IP: [<ffffffff8178128c>] _raw_spin_lock+0xc/0x30
[ 6.232164] PGD 0
[ 6.232164] Oops: 0002 [#1] SMP
[ 6.232164] Modules linked in: xfs snd_pcsp snd_pcm snd_timer iosf_mbi snd i2c_piix4 joydev ata_generic soundcore pata_acpi serio_raw libcrc32c crc8 crc_itu_t crc_ccitt virtio_pci virtio_mmio virtio_input virtio_balloon virtio_scsi sym53c8xx scsi_transport_spi megaraid_sas megaraid_mbox megaraid_mm megaraid ideapad_laptop rfkill sparse_keymap video virtio_net virtio_gpu ttm drm_kms_helper drm virtio_console virtio_rng virtio_blk virtio_ring virtio crc32 crct10dif_pclmul crc32c_intel crc32_pclmul
[ 6.232164] CPU: 0 PID: 432 Comm: mount Not tainted 4.3.3-301.fc23.x86_64 #1
[ 6.232164] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[ 6.232164] task: ffff88001d310000 ti: ffff88001c994000 task.ti: ffff88001c994000
[ 6.232164] RIP: 0010:[<ffffffff8178128c>] [<ffffffff8178128c>] _raw_spin_lock+0xc/0x30
[ 6.232164] RSP: 0018:ffff88001c997a80 EFLAGS: 00010246
[ 6.232164] RAX: 0000000000000000 RBX: ffff88001c924840 RCX: 0000000000000000
[ 6.232164] RDX: 0000000000000001 RSI: 0000000000000004 RDI: 0000000000000098
[ 6.232164] RBP: ffff88001c997ab8 R08: 0000000000000001 R09: ffff88001d7c8000
[ 6.232164] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[ 6.232164] R13: 0000000000000001 R14: 0000000000032001 R15: 0000000000000000
[ 6.232164] FS: 00007f9ee918c840(0000) GS:ffff88001ee00000(0000) knlGS:0000000000000000
[ 6.232164] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.232164] CR2: 0000000000000098 CR3: 000000001ca07000 CR4: 00000000003406f0
[ 6.232164] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6.232164] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6.232164] Stack:
[ 6.232164]  ffffffffa0249285 0000000000000000 ffff88001c924840 0000000000000001
[ 6.232164]  0000000000000001 ffff88001c997ba0 0000000000000001 ffff88001c997af8
[ 6.232164]  ffffffffa02494fa 0000000000000000 0000000000000000 0000000000000001
[ 6.232164] Call Trace:
[ 6.232164]  [<ffffffffa0249285>] ? _xfs_buf_find+0x95/0x2e0 [xfs]
[ 6.232164]  [<ffffffffa02494fa>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
[ 6.232164]  [<ffffffffa0249fac>] xfs_buf_read_map+0x2c/0x130 [xfs]
[ 6.232164]  [<ffffffffa027541d>] xfs_trans_read_buf_map+0xdd/0x2a0 [xfs]
[ 6.232164]  [<ffffffffa020d6b9>] xfs_read_agf+0x99/0x100 [xfs]
[ 6.232164]  [<ffffffffa020d769>] xfs_alloc_read_agf+0x49/0x110 [xfs]
[ 6.232164]  [<ffffffffa020d859>] xfs_alloc_pagf_init+0x29/0x60 [xfs]
[ 6.232164]  [<ffffffffa023fd89>] xfs_initialize_perag_data+0x99/0x110 [xfs]
[ 6.232164]  [<ffffffffa026012e>] xfs_mountfs+0x5de/0x7f0 [xfs]
[ 6.232164]  [<ffffffffa0260ddb>] ? xfs_mru_cache_create+0x12b/0x180 [xfs]
[ 6.232164]  [<ffffffffa0262a02>] xfs_fs_fill_super+0x342/0x4c0 [xfs]
[ 6.232164]  [<ffffffff81227016>] mount_bdev+0x1a6/0x1e0
[ 6.232164]  [<ffffffffa02626c0>] ? xfs_parseargs+0xab0/0xab0 [xfs]
[ 6.232164]  [<ffffffffa02610a5>] xfs_fs_mount+0x15/0x20 [xfs]
[ 6.232164]  [<ffffffff812279e8>] mount_fs+0x38/0x160
[ 6.232164]  [<ffffffff811c9e65>] ? __alloc_percpu+0x15/0x20
[ 6.232164]  [<ffffffff81242a47>] vfs_kern_mount+0x67/0x100
[ 6.232164]  [<ffffffff81244e8f>] do_mount+0x23f/0xdb0
[ 6.232164]  [<ffffffff812258da>] ? __fput+0x17a/0x1e0
[ 6.232164]  [<ffffffff812074b8>] ? __kmalloc_track_caller+0x1a8/0x250
[ 6.232164]  [<ffffffff811c4642>] ? memdup_user+0x42/0x70
[ 6.232164]  [<ffffffff81245d3f>] SyS_mount+0x9f/0x100
[ 6.232164]  [<ffffffff817815ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 6.232164] Code: 00 00 3e 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 29 7e 96 ff 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <3e> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 00 7e 96 ff 5d
[ 6.232164] RIP [<ffffffff8178128c>] _raw_spin_lock+0xc/0x30
[ 6.232164] RSP <ffff88001c997a80>
[ 6.232164] CR2: 0000000000000098
[ 6.232164] ---[ end trace d7d1b497cefa1929 ]---

The start-on-x86_64.raw file from the previous comment is available here:
http://oirase.annexia.org/tmp/bz1315895/

Another test; this time the filesystem is not empty but is filled with some dummy files. The error this time looks a lot more like the original bug report.

(1) On x86_64, create an XFS image:

x86_64$ virt-make-fs --size=200M --type=xfs ~/d/libguestfs/libguestfs-1.33.14.tar.gz with-content.img
$ file with-content.img
with-content.img: SGI XFS filesystem data (blksz 4096, inosz 512, v2 dirs)

(2) Copy the file over to aarch64, and grow it (in a VM):

aarch64$ truncate -s 1G with-content.img
aarch64$ virt-rescue -a with-content.img
><rescue> mount /dev/sda /sysroot
[ 6.769632] SGI XFS with ACLs, security attributes, no debug enabled
[ 6.776039] XFS (sda): Mounting V5 Filesystem
[ 6.938980] XFS (sda): Ending clean mount
><rescue> df -h /sysroot/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda        197M  118M   80M  60% /sysroot
><rescue> xfs_growfs /sysroot/
meta-data=/dev/sda               isize=512    agcount=4, agsize=12800 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1
data     =                       bsize=4096   blocks=51200, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
log      =internal               bsize=4096   blocks=855, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0
data blocks changed from 51200 to 262144
><rescue> df -h /sysroot/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda       1021M  118M  903M  12% /sysroot
><rescue> sync

(3) The resized image is copied back to x86_64, and I try to mount it (in a VM):

x86_64$ virt-rescue --ro -a with-content.img
><rescue> mount /dev/sda /sysroot
[ 7.832342] SGI XFS with ACLs, security attributes, no debug enabled
[ 7.838656] XFS (sda): Mounting V5 Filesystem
[ 7.874606] XFS (sda): Starting recovery (logdev: internal)
[ 7.875811] XFS (sda): Metadata corruption detected at xfs_agf_read_verify+0x70/0x120 [xfs], block 0x64001
[ 7.876519] XFS (sda): Unmount and run xfs_repair
[ 7.876866] XFS (sda): First 64 bytes of corrupted metadata buffer:
[ 7.877329] ffff88001c953200: 58 41 47 46 00 00 00 01 00 00 00 04 00 00 32 00  XAGF..........2.
[ 7.877957] ffff88001c953210: 00 00 00 01 00 00 00 02 00 00 00 00 00 00 00 01  ................
[ 7.878659] ffff88001c953220: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 76  ...............v
[ 7.879318] ffff88001c953230: 00 00 00 00 00 00 31 fb 00 00 31 fb 00 00 00 00  ......1...1.....
[ 7.879944] XFS (sda): metadata I/O error: block 0x64001 ("xfs_trans_read_buf_map") error 117 numblks 1
mount: mount /dev/sda on /sysroot failed: Structure needs cleaning

(4) I have uploaded with-content.img to http://oirase.annexia.org/tmp/bz1315895/

Creating the initial image on aarch64, growing it on aarch64, and then trying to mount it on x86_64 fails.
The final failure is:

# mount /dev/sda /sysroot/
[ 6.218805] SGI XFS with ACLs, security attributes, no debug enabled
[ 6.229121] XFS (sda): Mounting V4 Filesystem
[ 6.335365] XFS (sda): Starting recovery (logdev: internal)
[ 6.337132] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098
[ 6.338105] IP: [<ffffffff8178128c>] _raw_spin_lock+0xc/0x30
[ 6.338105] PGD 0
[ 6.338105] Oops: 0002 [#1] SMP
[ 6.338105] Modules linked in: xfs snd_pcsp snd_pcm snd_timer iosf_mbi snd joydev serio_raw i2c_piix4 ata_generic soundcore pata_acpi libcrc32c crc8 crc_itu_t crc_ccitt virtio_pci virtio_mmio virtio_input virtio_balloon virtio_scsi sym53c8xx scsi_transport_spi megaraid_sas megaraid_mbox megaraid_mm megaraid ideapad_laptop rfkill sparse_keymap video virtio_net virtio_gpu ttm drm_kms_helper drm virtio_console virtio_rng virtio_blk virtio_ring virtio crc32 crct10dif_pclmul crc32c_intel crc32_pclmul
[ 6.338105] CPU: 0 PID: 430 Comm: mount Not tainted 4.3.3-301.fc23.x86_64 #1
[ 6.338105] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
[ 6.338105] task: ffff88001d328000 ti: ffff88001c9c0000 task.ti: ffff88001c9c0000
[ 6.338105] RIP: 0010:[<ffffffff8178128c>] [<ffffffff8178128c>] _raw_spin_lock+0xc/0x30
[ 6.338105] RSP: 0018:ffff88001c9c3a80 EFLAGS: 00010246
[ 6.338105] RAX: 0000000000000000 RBX: ffff88001c934000 RCX: 0000000000000000
[ 6.338105] RDX: 0000000000000001 RSI: 0000000000000004 RDI: 0000000000000098
[ 6.338105] RBP: ffff88001c9c3ab8 R08: 0000000000000001 R09: ffff88001d7ba000
[ 6.338105] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000001
[ 6.338105] R13: 0000000000000001 R14: 0000000000064001 R15: 0000000000000000
[ 6.338105] FS: 00007f453a20e840(0000) GS:ffff88001ee00000(0000) knlGS:0000000000000000
[ 6.338105] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6.338105] CR2: 0000000000000098 CR3: 000000001c999000 CR4: 00000000003406f0
[ 6.338105] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 6.338105] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 6.338105] Stack:
[ 6.338105]  ffffffffa024f285 0000000000000000 ffff88001c934000 0000000000000001
[ 6.338105]  0000000000000001 ffff88001c9c3ba0 0000000000000001 ffff88001c9c3af8
[ 6.338105]  ffffffffa024f4fa 0000000000000000 0000000000000000 0000000000000001
[ 6.338105] Call Trace:
[ 6.338105]  [<ffffffffa024f285>] ? _xfs_buf_find+0x95/0x2e0 [xfs]
[ 6.338105]  [<ffffffffa024f4fa>] xfs_buf_get_map+0x2a/0x1b0 [xfs]
[ 6.338105]  [<ffffffffa024ffac>] xfs_buf_read_map+0x2c/0x130 [xfs]
[ 6.338105]  [<ffffffffa027b41d>] xfs_trans_read_buf_map+0xdd/0x2a0 [xfs]
[ 6.338105]  [<ffffffffa02136b9>] xfs_read_agf+0x99/0x100 [xfs]
[ 6.338105]  [<ffffffffa0213769>] xfs_alloc_read_agf+0x49/0x110 [xfs]
[ 6.338105]  [<ffffffffa0213859>] xfs_alloc_pagf_init+0x29/0x60 [xfs]
[ 6.338105]  [<ffffffffa0245d89>] xfs_initialize_perag_data+0x99/0x110 [xfs]
[ 6.338105]  [<ffffffffa026612e>] xfs_mountfs+0x5de/0x7f0 [xfs]
[ 6.338105]  [<ffffffffa0266ddb>] ? xfs_mru_cache_create+0x12b/0x180 [xfs]
[ 6.338105]  [<ffffffffa0268a02>] xfs_fs_fill_super+0x342/0x4c0 [xfs]
[ 6.338105]  [<ffffffff81227016>] mount_bdev+0x1a6/0x1e0
[ 6.338105]  [<ffffffffa02686c0>] ? xfs_parseargs+0xab0/0xab0 [xfs]
[ 6.338105]  [<ffffffffa02670a5>] xfs_fs_mount+0x15/0x20 [xfs]
[ 6.338105]  [<ffffffff812279e8>] mount_fs+0x38/0x160
[ 6.338105]  [<ffffffff811c9e65>] ? __alloc_percpu+0x15/0x20
[ 6.338105]  [<ffffffff81242a47>] vfs_kern_mount+0x67/0x100
[ 6.338105]  [<ffffffff81244e8f>] do_mount+0x23f/0xdb0
[ 6.338105]  [<ffffffff812258da>] ? __fput+0x17a/0x1e0
[ 6.338105]  [<ffffffff812074b8>] ? __kmalloc_track_caller+0x1a8/0x250
[ 6.338105]  [<ffffffff811c4642>] ? memdup_user+0x42/0x70
[ 6.338105]  [<ffffffff81245d3f>] SyS_mount+0x9f/0x100
[ 6.338105]  [<ffffffff817815ee>] entry_SYSCALL_64_fastpath+0x12/0x71
[ 6.338105] Code: 00 00 3e 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 29 7e 96 ff 5d c3 0f 1f 80 00 00 00 00 0f 1f 44 00 00 31 c0 ba 01 00 00 00 <3e> 0f b1 17 85 c0 75 01 c3 55 89 c6 48 89 e5 e8 00 7e 96 ff 5d
[ 6.338105] RIP [<ffffffff8178128c>] _raw_spin_lock+0xc/0x30
[ 6.338105] RSP <ffff88001c9c3a80>
[ 6.338105] CR2: 0000000000000098
[ 6.338105] ---[ end trace a0441461f622d167 ]---

Image from comment 17 (with-content-on-aarch64.img) is now available on
http://oirase.annexia.org/tmp/bz1315895/

This patch on the list:

    [PATCH 1/6] xfs: reinitialise per-AG structures if geometry changes during recovery

probably fixes some of these cases.

(In reply to Richard W.M. Jones from comment #11)
> The differences are:
>
> - kernel version (4.5.0 on aarch64, 4.3.0 on the others)

This.

    96f859d libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct

Please take it to the upstream list.

-Dave.

Now two people have hit the same issue on x86_64:
https://www.redhat.com/archives/libguestfs/2016-March/msg00113.html
https://rwmj.wordpress.com/2015/11/04/virt-builder-fedora-23-image/#comment-15668

Upstream thread started:
http://oss.sgi.com/pipermail/xfs/2016-March/047684.html

This is an upstream kernel, so closing out this RHEL7 bug; it needs to be (and I think it already is) fixed upstream.

Reopening. I'm still very confused about what causes this bug, but I do know it still affects my RHELSA 7.2 machine with kernel 4.5.0-0.32.el7.aarch64, so it's not just some upstream thing we can forget about.

Should the following commit be backported to RHELSA? To RHEL also?

    96f859d libxfs: pack the agfl header structure so XFS_AGFL_SIZE is correct

Can you please test 4.5.0-0.32.el7.aarch64 with 96f859d backported, and see if it resolves your problem?

Turns out that kernel 4.5.0-0.33.el7.aarch64 already contains the 96f859d patch.
Sorry - I should have tested that, since the 0.33 package was available in brew before I posted comment 24. This kernel also demonstrates the problem. (I was going to post a reproducer here, but this kernel has another problem where it crashes in libguestfs, which I'll have to file another bug about first.)

Can you explain exactly what I need to try? I am very unclear / confused about what precisely the bug is (and the commit message is not very helpful unless you deeply understand XFS internal structures).

I finally made a reproducer for RHELSA 7.2. This is a very slippery bug that only happens in certain circumstances; even very similar sets of commands to these don't trigger it. Note these commands are run on a RHELSA 7.2 host with 4.5.0-0.33.el7.aarch64.

$ virt-builder fedora-23 --size 30G --arch armv7l --format qcow2 --selinux-relabel
$ virt-get-kernel -a fedora-23.qcow2
download: /boot/vmlinuz-4.2.3-300.fc23.armv7hl+lpae -> ./vmlinuz-4.2.3-300.fc23.armv7hl+lpae
download: /boot/initramfs-4.2.3-300.fc23.armv7hl+lpae.img -> ./initramfs-4.2.3-300.fc23.armv7hl+lpae.img
$ /usr/libexec/qemu-kvm -M virt,accel=kvm -cpu host,aarch64=off -smp 4 -m 4096 \
    -kernel vmlinuz-4.2.3-300.fc23.armv7hl+lpae \
    -initrd initramfs-4.2.3-300.fc23.armv7hl+lpae.img \
    -append "ro root=/dev/vda3" \
    -drive file=fedora-23.qcow2,format=qcow2,if=none,id=hda \
    -device virtio-blk-device,drive=hda -serial stdio

The bug happens in the guest during SELinux relabelling. If you remove the --selinux-relabel flag then the bug does not reproduce immediately; however, if you log in to the guest and do something like `find /' then the bug is triggered after a while.

Eric Sandeen pointed me to this commit:
https://git.kernel.org/cgit/linux/kernel/git/dgc/linux-xfs.git/commit/?h=xfs-fixes-for-4.6-rc&id=ad747e3b299671e1a53db74963cc6c5f6cdb9f6d

I added this patch on top of 4.5.0-0.33.el7.aarch64 and it fixes the problem for me.

I would test this, but I'm blocked on bug 1337705.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2145.html

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.