Red Hat Bugzilla – Bug 726814
btrfs filesystem corruption/crash
Last modified: 2015-05-17 21:40:00 EDT
My backups of /home stopped working a while ago. When run, they would crash the machine with:
[624609.318018] kernel BUG at fs/btrfs/extent-tree.c:1460!
[624609.318018] invalid opcode: 0000 [#1] SMP
[624609.318018] CPU 0
[624609.318018] Modules linked in: netconsole configfs tcp_lp tun fuse ppdev parport_pc lp parport acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 iptable_nat nf_defrag_ipv6 nf_nat ip6table_filter ip6_tables ipt_LOG arc4 iwl3945 iwl_legacy snd_hda_codec_idt mac80211 snd_hda_intel snd_hda_codec snd_hwdep snd_seq dell_wmi sparse_keymap irda snd_seq_device option btusb usb_wwan snd_pcm bluetooth usbserial dell_laptop dcdbas microcode joydev i2c_i801 iTCO_wdt iTCO_vendor_support snd_timer snd crc_ccitt tg3 cfg80211 soundcore rfkill snd_page_alloc virtio_net kvm_intel kvm ipv6 btrfs zlib_deflate libcrc32c xts gf128mul dm_crypt firewire_ohci firewire_core crc_itu_t yenta_socket wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: st]
[624609.318018] Pid: 890, comm: btrfs-delayed-m Not tainted 3.0.0-rc7+ #1 Dell Inc. Latitude D820 /0GF470
[624609.318018] RIP: 0010:[<ffffffffa011da0c>] [<ffffffffa011da0c>] lookup_inline_extent_backref+0x17f/0x34b [btrfs]
[624609.318018] RSP: 0018:ffff8800c65dda60 EFLAGS: 00010246
[624609.318018] RAX: 2a64000000000000 RBX: ffff8800356c43f0 RCX: 0000000000000a30
[624609.318018] RDX: 0000000000000996 RSI: ffff880081e8c000 RDI: ffff880081e8c000
[624609.318018] RBP: ffff8800c65ddb00 R08: ffff8800c65dda10 R09: 0000000000000332
[624609.318018] R10: 0000000000000a0b R11: 0000000000000008 R12: 0000000000000a13
[624609.318018] R13: ffff88008e287300 R14: 00000000000000b8 R15: 0000000000000035
[624609.318018] FS: 0000000000000000(0000) GS:ffff8800cf400000(0000) knlGS:0000000000000000
[624609.318018] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[624609.318018] CR2: 0000000004216280 CR3: 0000000054979000 CR4: 00000000000006f0
[624609.318018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[624609.318018] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[624609.318018] Process btrfs-delayed-m (pid: 890, threadinfo ffff8800c65dc000, task ffff8800c65e0000)
[624609.318018] 00000000000009fb 0000000000000996 ffff8800c65ddb58 ffffffffa012c12b
[624609.318018] 0000000000000a30 000000089a20c000 ffff8800b3117800 0000000d00000f66
[624609.318018] ffff8800c65ddb00 ffff8800c24762d0 ffff8800b3117800 0000000000000000
[624609.318018] Call Trace:
[624609.318018] [<ffffffffa012c12b>] ? btrfs_mark_buffer_dirty+0x8c/0xcf [btrfs]
[624609.318018] [<ffffffffa011e355>] insert_inline_extent_backref+0x5c/0xd1 [btrfs]
[624609.318018] [<ffffffff81112527>] ? kmem_cache_alloc+0x44/0x10b
[624609.318018] [<ffffffffa011e462>] __btrfs_inc_extent_ref+0x98/0x1aa [btrfs]
[624609.318018] [<ffffffff811105b0>] ? virt_to_head_page+0xe/0x31
[624609.318018] [<ffffffffa0162828>] ? btrfs_delayed_ref_lock+0x3f/0x9d [btrfs]
[624609.318018] [<ffffffffa01219e2>] run_clustered_refs+0x5d8/0x672 [btrfs]
[624609.318018] [<ffffffffa0121b4d>] btrfs_run_delayed_refs+0xd1/0x193 [btrfs]
[624609.318018] [<ffffffffa012fdd1>] __btrfs_end_transaction+0x8e/0x1f0 [btrfs]
[624609.318018] [<ffffffffa012ff4b>] btrfs_end_transaction_dmeta+0x18/0x1a [btrfs]
[624609.318018] [<ffffffffa016a3c1>] btrfs_async_run_delayed_node_done+0x106/0x158 [btrfs]
[624609.318018] [<ffffffffa0153d6c>] worker_loop+0x145/0x451 [btrfs]
[624609.318018] [<ffffffffa0153c27>] ? btrfs_queue_worker+0x24f/0x24f [btrfs]
[624609.318018] [<ffffffff8106e60f>] kthread+0x84/0x8c
[624609.318018] [<ffffffff8148d724>] kernel_thread_helper+0x4/0x10
[624609.318018] [<ffffffff8106e58b>] ? kthread_worker_fn+0x148/0x148
[624609.318018] [<ffffffff8148d720>] ? gs_change+0x13/0x13
[624609.318018] Code: ff ff ff 48 8b 95 68 ff ff ff 4c 01 f9 a8 02 4c 8d 62 7d 48 89 4d 80 74 0e 4c 8d a2 8f 00 00 00 49 39 cc 76 08 0f 0b a8 01 75 02 <0f> 0b 4c 3b 65 80 72 22 41 bd fe ff ff ff 0f 86 18 01 00 00 be
[624609.318018] RIP [<ffffffffa011da0c>] lookup_inline_extent_backref+0x17f/0x34b [btrfs]
[624609.318018] RSP <ffff8800c65dda60>
[624609.771952] ---[ end trace 9f653694f5e2b357 ]---
If I boot into single-user mode and run btrfsck on the volume, it exits with "Aborted".
I can also make it crash with a btrfs filesystem defragment.
Happy to try patches or more debugging or a more robust fsck. ;)
Created attachment 517263 [details]
You can apply this on top of the patch I've already given you. This will dump the leaf so I can see where the corruption is. Make sure to run dmesg -n8 with netconsole so that it actually sends all the kernel messages across the wire.
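For anyone reproducing this setup, here is a minimal sketch of the netconsole side. It assumes the `netconsole=` module parameter format described in the kernel's netconsole documentation. Every port, IP address, and interface name is a placeholder rather than a value from this report, and `netconsole_param` is a hypothetical helper. The script only prints the commands instead of running them.

```shell
#!/bin/sh
# Build the netconsole= module parameter from its parts.
# Assumed format: src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac
# (an empty target MAC means the kernel falls back to broadcast).
netconsole_param() {
    # $1 source port, $2 source IP, $3 network device,
    # $4 target port, $5 target IP, $6 target MAC (may be empty)
    printf '%s@%s/%s,%s@%s/%s\n' "$1" "$2" "$3" "$4" "$5" "$6"
}

# Placeholder addresses only; substitute the real sender and receiver.
PARAM=$(netconsole_param 6665 192.0.2.10 eth0 6666 192.0.2.20 "")

# Print the commands rather than executing them, so this is safe to
# run unprivileged. Run the printed commands as root on the test box.
echo "modprobe netconsole netconsole=$PARAM"
echo "dmesg -n 8"   # raise the console log level so every message is sent
```

On the receiving host, a UDP listener on the target port (for example a netcat variant in listen mode) collects the messages. The exact netcat flags differ between variants, so check the local man page.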
OK. I applied that patch and this one, and rebooted into the new kernel.
I then did a:
btrfs filesystem defragment /home/kevin/Mail/lists/fedora-extras-commits/
and got the first oops in the attached file. This is one of the dirs my backups blow up on.
Then, I did a 'sync' and got the second one.
Happy to test further things.
Created attachment 517306 [details]
netconsole dmesg output from 2 oopses.
Created attachment 519705 [details]
So I don't have a broken fs to test this on, but it passes just fine on a clean one, so it seems like all my checking is right. That being said, I probably screwed up the writing somewhere, so this may blow up; if it does, just load it into gdb and do a bt so I can see where it segfaulted. Hopefully this won't make your fs worse than it already is, but I make no promises :). Please run with -d first; this does a dry run that just spits out all the errors without actually fixing anything. Attach the output to this bz so I can verify it's going to fix things correctly. You'll want to do something like
./repair -d /dev/whatever > out.txt 2>&1
and attach out.txt. You'll need to apply this patch onto btrfs-progs-unstable from upstream, here is the git tree
Good luck :).
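The dry run and the gdb fallback described above can be wrapped in one small helper. This is only a sketch: `run_logged` is a hypothetical name, the 139 exit status assumes a shell such as bash or dash that reports 128 plus the signal number, and the printed gdb command uses the standard `-batch`, `-ex`, and `--args` options.

```shell
#!/bin/sh
# Run a command with stdout and stderr captured in one log file, and
# point out a segfault so the run can be repeated under gdb.
run_logged() {
    log=$1
    shift
    "$@" > "$log" 2>&1
    status=$?
    # Assumption: the shell reports death by signal as 128 + signal
    # number, so SIGSEGV (11) shows up as 139.
    if [ "$status" -eq 139 ]; then
        echo "segfault: rerun as: gdb -batch -ex run -ex bt --args $*"
    fi
    return "$status"
}

# Example, using the placeholder device path from the comment above:
# run_logged out.txt ./repair -d /dev/whatever
```

If the helper reports a segfault, the printed gdb command reruns the program non-interactively and prints the backtrace, which can then be attached alongside out.txt.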
Created attachment 519720 [details]
repair -d output
Here's the output from the -d.
Doesn't tell me much, but perhaps it will tell you something. ;)
Created attachment 519722 [details]
Just being extra paranoid, but can you apply this on top of what you have and try again and attach the output? I just want to make absolute sure it's going to do the right thing when you do the real run.
Created attachment 519724 [details]
repair2 -d output
next repair -d run
Created attachment 519725 [details]
Oops, I forgot that you can have valid items with 0 size. Just apply this over the top and do the dry run again.
Created attachment 519729 [details]
This will work eventually, I promise :).
Created attachment 519908 [details]
Patch to check all fs roots
Same old song and dance. This one prints out what it's doing so you can feel like it's doing something :).
This patch doesn't seem to apply. Says already applied or reversed?
Created attachment 519935 [details]
Full new repair patch
OK, just unapply everything I've sent you and apply this new one; it should contain everything up to this point. Sorry about that.
Created attachment 519982 [details]
output of last repair run
Here's the last repair run output.
Created attachment 520098 [details]
OK, so my repair program still isn't finding errors. This will check all the data extents; hopefully it finds something. If not, it looks like this may just be a normal bug and not corruption, which would be great and crappy all at the same time.
ok, same output:
Checking extent root
Finding fs roots
Checking fs roots
Checking root 5
Checking root 5 refs
OK, it's image creation time. I'm pretty sure it's built in Fedora, but if it's not, just run
in your btrfs-progs-unstable tree. Just run
btrfs-image -c 9 -t <number of threads> /dev/whatever
and then put the image somewhere I can suck it down. Maybe now that Chris is back from vacation I can get him to run his fsck against it and see if his picks anything up.
Let me know if I can provide anything further.
Any news? Did the image help any?
I'd love to be able to get my old data off the drive... ;)
Sorry, I've been messing with the repair tool with another user whose fs is even more screwed than yours. The good news is my repair tool now finds a problem with your corrupt image; the bad news is the other user ran my tool without the -d option and it made things worse. So I'm going to rig up a tool that will just pull all of your data off the disk, since neither of your actual fs roots is corrupted, and we'll leave the repair tool to Chris. I should have this all rigged up in a day.
Ok clone this tree
and run make (make sure you have zlib-devel installed) and then run
./restore /your/device /some/dir
this will dump everything from your device into that directory. It will skip any snapshots, but will work right with subvolumes. Let me know if something goes wrong.
Sadly, it cranked along for about 1.75GB worth, then:
# ./restore /dev/mapper/vg_ohm-lv_home /tmp/ohm/
Short write: 0
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.
(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)
More information and reason for this action is here:
this has been open for a really long time, with no further apparent progress.
Kevin, is this something you still run into?
I'm no longer running btrfs here, so no idea. :(
I guess you can close it out...
unfortunate, but I can't say I blame you.