726814 – btrfs filesystem corruption/crash

Status:	CLOSED CANTFIX
Product:	Fedora
Classification:	Fedora
Component:	kernel
Version:	19
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Assignee:	Zach Brown
QA Contact:	Fedora Extras Quality Assurance
Blocks:	689509

Reported:	2011-07-29 20:50 UTC by Kevin Fenzi
Modified:	2015-05-18 01:40 UTC
CC List:	8 users
Last Closed:	2013-05-15 19:27:30 UTC

Attachments
More debugging (543 bytes, patch) 2011-08-08 16:29 UTC, Josef Bacik
netconnect dmesg output from 2 oopses. (18.08 KB, text/plain) 2011-08-08 20:45 UTC, Kevin Fenzi
repair program (21.12 KB, patch) 2011-08-24 20:07 UTC, Josef Bacik
repair -d output (4.84 KB, text/plain) 2011-08-24 21:05 UTC, Kevin Fenzi
Incremental patch (288 bytes, patch) 2011-08-24 21:17 UTC, Josef Bacik
repair2 -d output (46 bytes, text/plain) 2011-08-24 21:26 UTC, Kevin Fenzi
Another incremental (2.34 KB, patch) 2011-08-24 21:40 UTC, Josef Bacik
Another incremental (1.01 KB, patch) 2011-08-24 21:57 UTC, Josef Bacik
Patch to check all fs roots (3.51 KB, patch) 2011-08-25 15:29 UTC, Josef Bacik
Full new repair patch (23.28 KB, patch) 2011-08-25 17:08 UTC, Josef Bacik
output of last repair run (46 bytes, text/plain) 2011-08-25 21:12 UTC, Kevin Fenzi
An incremental (2.78 KB, patch) 2011-08-26 14:18 UTC, Josef Bacik

Description Kevin Fenzi 2011-07-29 20:50:14 UTC

My backups of /home stopped working a while ago. When run, they would crash the machine with: 

[624609.318018] kernel BUG at fs/btrfs/extent-tree.c:1460!
[624609.318018] invalid opcode: 0000 [#1] SMP 
[624609.318018] CPU 0 
[624609.318018] Modules linked in: netconsole configfs tcp_lp tun fuse ppdev parport_pc lp parport acpi_cpufreq freq_table mperf ip6t_REJECT nf_conntrack_ipv6 iptable_nat nf_defrag_ipv6 nf_nat ip6table_filter ip6_tables ipt_LOG arc4 iwl3945 iwl_legacy snd_hda_codec_idt mac80211 snd_hda_intel snd_hda_codec snd_hwdep snd_seq dell_wmi sparse_keymap irda snd_seq_device option btusb usb_wwan snd_pcm bluetooth usbserial dell_laptop dcdbas microcode joydev i2c_i801 iTCO_wdt iTCO_vendor_support snd_timer snd crc_ccitt tg3 cfg80211 soundcore rfkill snd_page_alloc virtio_net kvm_intel kvm ipv6 btrfs zlib_deflate libcrc32c xts gf128mul dm_crypt firewire_ohci firewire_core crc_itu_t yenta_socket wmi i915 drm_kms_helper drm i2c_algo_bit i2c_core video [last unloaded: st]
[624609.318018] 
[624609.318018] Pid: 890, comm: btrfs-delayed-m Not tainted 3.0.0-rc7+ #1 Dell Inc. Latitude D820                   /0GF470
[624609.318018] RIP: 0010:[<ffffffffa011da0c>]  [<ffffffffa011da0c>] lookup_inline_extent_backref+0x17f/0x34b [btrfs]
[624609.318018] RSP: 0018:ffff8800c65dda60  EFLAGS: 00010246
[624609.318018] RAX: 2a64000000000000 RBX: ffff8800356c43f0 RCX: 0000000000000a30
[624609.318018] RDX: 0000000000000996 RSI: ffff880081e8c000 RDI: ffff880081e8c000
[624609.318018] RBP: ffff8800c65ddb00 R08: ffff8800c65dda10 R09: 0000000000000332
[624609.318018] R10: 0000000000000a0b R11: 0000000000000008 R12: 0000000000000a13
[624609.318018] R13: ffff88008e287300 R14: 00000000000000b8 R15: 0000000000000035
[624609.318018] FS:  0000000000000000(0000) GS:ffff8800cf400000(0000) knlGS:0000000000000000
[624609.318018] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[624609.318018] CR2: 0000000004216280 CR3: 0000000054979000 CR4: 00000000000006f0
[624609.318018] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[624609.318018] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[624609.318018] Process btrfs-delayed-m (pid: 890, threadinfo ffff8800c65dc000, task ffff8800c65e0000)
[624609.318018] Stack:
[624609.318018]  00000000000009fb 0000000000000996 ffff8800c65ddb58 ffffffffa012c12b
[624609.318018]  0000000000000a30 000000089a20c000 ffff8800b3117800 0000000d00000f66
[624609.318018]  ffff8800c65ddb00 ffff8800c24762d0 ffff8800b3117800 0000000000000000
[624609.318018] Call Trace:
[624609.318018]  [<ffffffffa012c12b>] ? btrfs_mark_buffer_dirty+0x8c/0xcf [btrfs]
[624609.318018]  [<ffffffffa011e355>] insert_inline_extent_backref+0x5c/0xd1 [btrfs]
[624609.318018]  [<ffffffff81112527>] ? kmem_cache_alloc+0x44/0x10b
[624609.318018]  [<ffffffffa011e462>] __btrfs_inc_extent_ref+0x98/0x1aa [btrfs]
[624609.318018]  [<ffffffff811105b0>] ? virt_to_head_page+0xe/0x31
[624609.318018]  [<ffffffffa0162828>] ? btrfs_delayed_ref_lock+0x3f/0x9d [btrfs]
[624609.318018]  [<ffffffffa01219e2>] run_clustered_refs+0x5d8/0x672 [btrfs]
[624609.318018]  [<ffffffffa0121b4d>] btrfs_run_delayed_refs+0xd1/0x193 [btrfs]
[624609.318018]  [<ffffffffa012fdd1>] __btrfs_end_transaction+0x8e/0x1f0 [btrfs]
[624609.318018]  [<ffffffffa012ff4b>] btrfs_end_transaction_dmeta+0x18/0x1a [btrfs]
[624609.318018]  [<ffffffffa016a3c1>] btrfs_async_run_delayed_node_done+0x106/0x158 [btrfs]
[624609.318018]  [<ffffffffa0153d6c>] worker_loop+0x145/0x451 [btrfs]
[624609.318018]  [<ffffffffa0153c27>] ? btrfs_queue_worker+0x24f/0x24f [btrfs]
[624609.318018]  [<ffffffff8106e60f>] kthread+0x84/0x8c
[624609.318018]  [<ffffffff8148d724>] kernel_thread_helper+0x4/0x10
[624609.318018]  [<ffffffff8106e58b>] ? kthread_worker_fn+0x148/0x148
[624609.318018]  [<ffffffff8148d720>] ? gs_change+0x13/0x13
[624609.318018] Code: ff ff ff 48 8b 95 68 ff ff ff 4c 01 f9 a8 02 4c 8d 62 7d 48 89 4d 80 74 0e 4c 8d a2 8f 00 00 00 49 39 cc 76 08 0f 0b a8 01 75 02 <0f> 0b 4c 3b 65 80 72 22 41 bd fe ff ff ff 0f 86 18 01 00 00 be 
[624609.318018] RIP  [<ffffffffa011da0c>] lookup_inline_extent_backref+0x17f/0x34b [btrfs]
[624609.318018]  RSP <ffff8800c65dda60>
[624609.771952] ---[ end trace 9f653694f5e2b357 ]---

If I boot into single-user mode and run btrfsck on the volume, it fails with "Aborted".
I can also trigger the crash by running btrfs filesystem defragment.

Happy to try patches or more debugging or a more robust fsck. ;)

Comment 1 Josef Bacik 2011-08-08 16:29:53 UTC

Created attachment 517263
More debugging

You can apply this on top of the patch I've already given you.  This will dump the leaf so I can see where the corruption is.  Make sure to run dmesg -n8 with netconsole so that it actually sends all the kernel messages across the wire.
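
For readers reproducing this setup, here is a minimal sketch of the netconsole side. It assumes the parameter format from the kernel's netconsole documentation, [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac], with its default ports 6665 and 6666. The addresses and the interface name are placeholders, not values from this bug, and netconsole_param is a helper name made up for the example.

```shell
#!/bin/sh
# Build the netconsole module parameter string.
# Format: [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# All addresses and the interface name used here are placeholders.
netconsole_param() {
    src_ip=$1 dev=$2 tgt_ip=$3 tgt_mac=$4
    printf '%s@%s/%s,%s@%s/%s\n' 6665 "$src_ip" "$dev" 6666 "$tgt_ip" "$tgt_mac"
}

# An empty target MAC leaves the field blank, which the kernel treats as broadcast.
param=$(netconsole_param 192.0.2.10 eth0 192.0.2.20 "")

# Print the commands instead of running them; both need root.
# dmesg -n 8 raises the console log level so every message reaches netconsole.
echo "dmesg -n 8"
echo "modprobe netconsole netconsole=$param"
```

The sketch only prints the two commands so they can be reviewed before being run as root on the crashing machine. A UDP listener on the target host is still needed to collect the messages.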

Comment 2 Kevin Fenzi 2011-08-08 20:44:20 UTC

OK. I applied that patch and this one, and rebooted into the new kernel.

I then did a: 
btrfs filesystem defragment /home/kevin/Mail/lists/fedora-extras-commits/
and got the first oops in the attached file. This is one of the dirs my backups blow up on. 

Then, I did a 'sync' and got the second one. 

Happy to test further things.

Comment 3 Kevin Fenzi 2011-08-08 20:45:08 UTC

Created attachment 517306
netconnect dmesg output from 2 oopses.

Comment 4 Josef Bacik 2011-08-24 20:07:38 UTC

Created attachment 519705
repair program

So I don't have a broken fs to test this on, but it passes just fine on a clean one, so it seems like all my checking is right.  That being said, I probably screwed up the writing somewhere, so this may blow up. If it does, just load it into gdb and do a bt so I can see where it segfaulted.  Hopefully this won't make your fs worse than it already is, but I make no promises :).  Please run with -d first; this does a dry run that just prints all the errors without actually fixing anything.  Attach the output to this bz so I can verify it's going to fix things correctly. You'll want to do something like

./repair -d /dev/whatever > out.txt 2>&1

and attach out.txt.  You'll need to apply this patch onto btrfs-progs-unstable from upstream; here is the git tree:

git://git.kernel.org/pub/scm/linux/kernel/git/mason/btrfs-progs-unstable.git

Good luck :).
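
The redirection in the dry-run command is worth a small illustration, since the order of > out.txt and 2>&1 decides whether error messages end up in the file. The sketch below uses a stand-in function, fake_repair, instead of the real repair tool so it can run anywhere. Both capture_all and fake_repair are names invented for this example.

```shell
#!/bin/sh
# Run a command and save stdout and stderr together in one file, using the
# same redirection as "./repair -d /dev/whatever > out.txt 2>&1".
# The file redirect must come first; "2>&1 > file" would leave stderr on the terminal.
capture_all() {
    out=$1; shift
    "$@" > "$out" 2>&1
}

# Stand-in for the repair tool: one line on stdout, one on stderr.
fake_repair() {
    echo "Checking extent root"
    echo "error: example message" >&2
}

log=$(mktemp)
capture_all "$log" fake_repair
cat "$log"
rm -f "$log"
```

If the tool segfaults, starting it as gdb --args ./repair -d /dev/whatever, typing run, and then typing bt after the crash should give the backtrace requested above.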

Comment 5 Kevin Fenzi 2011-08-24 21:05:50 UTC

Created attachment 519720
repair -d output

Here's the output from the -d. 

Doesn't tell me much, but perhaps it will tell you something. ;)

Comment 6 Josef Bacik 2011-08-24 21:17:35 UTC

Created attachment 519722
Incremental patch

Just being extra paranoid, but can you apply this on top of what you have, try again, and attach the output?  I just want to make absolutely sure it's going to do the right thing when you do the real run.

Comment 7 Kevin Fenzi 2011-08-24 21:26:44 UTC

Created attachment 519724
repair2 -d output

next repair -d run

Comment 8 Josef Bacik 2011-08-24 21:40:21 UTC

Created attachment 519725
Another incremental

Oops, I forgot that items with 0 size can be valid.  Just apply this over the top and do the dry run again.

Comment 9 Josef Bacik 2011-08-24 21:57:32 UTC

Created attachment 519729
Another incremental

This will work eventually, I promise :).

Comment 10 Josef Bacik 2011-08-25 15:29:21 UTC

Created attachment 519908
Patch to check all fs roots

Same old song and dance.  This one prints out what it's doing so you can feel like it's doing something :).

Comment 11 Kevin Fenzi 2011-08-25 16:43:45 UTC

This patch doesn't seem to apply. Says already applied or reversed?

Comment 12 Josef Bacik 2011-08-25 17:08:30 UTC

Created attachment 519935
Full new repair patch

OK, just unapply everything I've sent you and apply this new one; it should contain everything up to this point.  Sorry about that.

Comment 13 Kevin Fenzi 2011-08-25 21:12:49 UTC

Created attachment 519982
output of last repair run

Here's the last repair run output.

Comment 14 Josef Bacik 2011-08-26 14:18:09 UTC

Created attachment 520098
An incremental

OK, so my repair program still isn't finding errors.  This will check all the data extents; hopefully it finds something.  If not, it looks like this may just be a normal bug and not corruption, which would be great and crappy all at the same time.

Comment 15 Kevin Fenzi 2011-08-26 15:17:11 UTC

ok, same output: 

Checking extent root
Finding fs roots
Checking fs roots
Checking root 5
Checking root 5 refs

Comment 16 Josef Bacik 2011-08-29 14:27:54 UTC

OK, it's image creation time.  I'm pretty sure it's built in Fedora, but if it's not, just run

make btrfs-image

in your btrfs-progs-unstable tree.  Just run

btrfs-image -c 9 -t <number of threads> /dev/whatever <output file>

and then put the image somewhere I can suck it down.  Maybe now that Chris is back from vacation I can get him to run his fsck against it and see if his picks anything up.
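
As a hedged sketch of the image step: btrfs-image takes a source device and a target file, so the command needs an output path. The path used below is an assumption for illustration, and the thread count is simply taken from the number of online CPUs. image_cmd is a helper name made up for this example. It prints the command line rather than running it.

```shell
#!/bin/sh
# Print a btrfs-image command line for a metadata dump.
# -c 9 is the compression level and -t the thread count, as in the comment;
# the output path is an assumption added for this sketch.
image_cmd() {
    dev=$1 out=$2 threads=$3
    printf 'btrfs-image -c 9 -t %s %s %s\n' "$threads" "$dev" "$out"
}

# getconf is POSIX; fall back to 1 thread if the CPU count is unavailable.
threads=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
image_cmd /dev/whatever /tmp/home.img "$threads"
```

Printing first makes it easy to check the device name before reading a possibly damaged filesystem.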

Comment 17 Kevin Fenzi 2011-08-29 23:15:50 UTC

http://www.scrye.com/~kevin/fedora/corrupt-home-20110829

Let me know if I can provide anything further.

Comment 18 Kevin Fenzi 2011-10-01 16:18:43 UTC

Any news? Did the image help any? 
I'd love to be able to get my old data off the drive... ;)

Comment 19 Josef Bacik 2011-10-03 14:46:06 UTC

Sorry, I've been working on the repair tool with another user whose fs is even more screwed than yours.  The good news is my repair tool now finds a problem with your corrupt image; the bad news is the other user ran my tool without the -d option and it made things worse.  So I'm going to rig up a tool that will just pull all of your data off the disk, since neither of your actual fs roots are corrupted, and we'll leave the repair tool to Chris.  Should have this all rigged up in a day.

Comment 20 Josef Bacik 2011-10-04 19:34:01 UTC

OK, clone this tree

git://github.com/josefbacik/btrfs-progs.git

and run make (make sure you have zlib-devel installed) and then run

./restore /your/device /some/dir

This will dump everything from your device into that directory.  It will skip any snapshots, but will work right with subvolumes.  Let me know if something goes wrong.
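
Because restore copies everything into the destination directory, it may be worth checking how much room that directory's filesystem has before starting. This is a minimal sketch under that assumption. avail_kib is a helper name invented here, and /tmp stands in for whatever destination is actually used.

```shell
#!/bin/sh
# Report available space, in KiB, on the filesystem holding a directory.
# df -P selects the portable one-line-per-filesystem format; -k uses 1024-byte units.
avail_kib() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

dest=/tmp   # hypothetical restore destination
echo "available at $dest: $(avail_kib "$dest") KiB"
```

Comparing that figure with the amount of data on the source device is a rough check only, since compression on the btrfs side can hide the real size.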

Comment 21 Kevin Fenzi 2011-10-04 20:21:37 UTC

Sadly, it cranked along for about 1.75GB worth, then: 

# ./restore /dev/mapper/vg_ohm-lv_home /tmp/ohm/
Short write: 0
#

Comment 22 Fedora End Of Life 2013-04-03 18:05:29 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could also affect bugs from pre-Fedora 19
development cycles. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 23 Dave Jones 2013-05-15 18:40:36 UTC

This has been open for a really long time, with no further apparent progress.

Kevin, is this something you still run into?

Comment 24 Kevin Fenzi 2013-05-15 18:46:30 UTC

I'm no longer running btrfs here, so no idea. :( 

I guess you can close it out...

Comment 25 Dave Jones 2013-05-15 19:27:30 UTC

unfortunate, but I can't say I blame you.