Bug 1641116

Summary:	xfs_repair crashes at phase6.c:1410: longform_dir2_rebuild: Assertion `done' failed.
Product:	[Fedora] Fedora	Reporter:	Tomasz Torcz <tomek>
Component:	xfsprogs	Assignee:	Eric Sandeen <esandeen>
Status:	CLOSED UPSTREAM	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	rawhide	CC:	esandeen
Last Closed:	2019-01-09 17:22:28 UTC	Type:	Bug

Description Tomasz Torcz 2018-10-19 16:36:37 UTC

Description of problem:
The HDD developed a few bad sectors. Running xfs_repair (on both the original disk and a dd_rescue'd copy) crashes:

…
Phase 5 - rebuild AG headers and trees...
        - reset superblock...
Phase 6 - check inode connectivity...
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem ...
rebuilding directory inode 134218179
bad hash table for directory inode 134644569 (no data entry): rebuilding
rebuilding directory inode 134644569
rebuilding directory inode 135789656
xfs_repair: phase6.c:1410: longform_dir2_rebuild: Assertion `done' failed.
Aborted (core dumped)


Backtrace:
(gdb) bt
#0  0x00007ffff7dac5cf in raise () from /lib64/libc.so.6
#1  0x00007ffff7d96895 in abort () from /lib64/libc.so.6
#2  0x00007ffff7d96769 in __assert_fail_base.cold.0 () from /lib64/libc.so.6
#3  0x00007ffff7da49f6 in __assert_fail () from /lib64/libc.so.6
#4  0x00005555555797ef in longform_dir2_rebuild (hashtab=<optimized out>, ino_offset=24, irec=<optimized out>, ip=0x5555556f5270, ino=135789656, mp=<optimized out>)      
    at phase6.c:1410
#5  longform_dir2_entry_check (hashtab=<optimized out>, ino_offset=24, irec=<optimized out>, need_dot=0x7fffffffd608, num_illegal=0x7fffffffd610, ip=0x5555556f5270,      
    ino=135789656, mp=<optimized out>) at phase6.c:2481
#6  process_dir_inode (mp=<optimized out>, agno=agno@entry=1, irec=irec@entry=0x7fffd818a310, ino_offset=ino_offset@entry=24) at phase6.c:2983                            
#7  0x0000555555579ab2 in traverse_function (wq=0x7fffffffdb00, agno=1, arg=0x55555567f050) at phase6.c:3254                                                              
#8  0x000055555557dff5 in prefetch_ag_range (work=0x7fffffffdb00, start_ag=<optimized out>, end_ag=4, dirs_only=true, func=0x555555579a10 <traverse_function>)            
    at prefetch.c:964
#9  0x000055555557fa25 in do_inode_prefetch (mp=0x7fffffffdf70, stride=0, func=0x555555579a10 <traverse_function>, check_cache=<optimized out>, dirs_only=true)           
    at prefetch.c:1027
#10 0x000055555557acd4 in traverse_ags (mp=0x7fffffffdf70) at phase6.c:3372
#11 phase6 (mp=0x7fffffffdf70) at phase6.c:3372
#12 0x000055555555ac2e in main (argc=<optimized out>, argv=<optimized out>) at xfs_repair.c:949


Version-Release number of selected component (if applicable):
xfsprogs-4.18.0-1.fc30.x86_64

I'm happy to run any further diagnostics, but I cannot share the disk image.

Comment 1 Eric Sandeen 2018-10-19 17:14:51 UTC

Any chance you can share an xfs_metadump image, which obfuscates nearly all metadata and zeros out unused portions of sectors?

Comment 2 Eric Sandeen 2018-10-19 17:15:53 UTC

(and contains no data blocks at all)

Comment 3 Tomasz Torcz 2018-10-19 18:01:52 UTC

Created attachment 1495716
vdb5.xfs_metadump.xz

% xfs_metadump -g /dev/vdb5 vdb5.xfs_metadump                                                                                                               
Copied 90112 of 328832 inodes (1 of 4 AGs)                 Metadata corruption detected at 0x55a4cd81925e, xfs_inode block 0x4b68340/0x8000                               
Copied 144832 of 328832 inodes (1 of 4 AGs)                Unknown directory buffer type!                                                                                 
Zeroing clean log


Uncompresses to 265M.

Comment 4 Eric Sandeen 2018-10-22 15:01:05 UTC

Ok, got it.  The filesystem looks heavily damaged FWIW.  I do see the 
xfs_repair: phase6.c:1362: longform_dir2_rebuild: Assertion `done' failed.

error though.  I'll look into why it failed, but how bad was the disk you rescued from, just out of curiosity?

Comment 5 Eric Sandeen 2018-10-23 04:12:42 UTC

Sent this patch to the list; forgot to cc: you, sorry. I'll bounce it to you.

====

xfs_repair: continue after xfs_bunmapi deadlock avoidance

After commit:

15a8bcc xfs: fix multi-AG deadlock in xfs_bunmapi

xfs_bunmapi can legitimately return before all work is done.
Sadly nobody told xfs_repair, so it fires an assert:

 phase6.c:1410: longform_dir2_rebuild: Assertion `done' failed. 

Fix this by calling back in until all work is done, as we do
in the kernel.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1641116
Reported-by: Tomasz Torcz <tomek>
Signed-off-by: Eric Sandeen <sandeen>
---

diff --git a/repair/phase6.c b/repair/phase6.c
index e017326..b87c751 100644
--- a/repair/phase6.c
+++ b/repair/phase6.c
@@ -1317,7 +1317,7 @@ longform_dir2_rebuild(
 	xfs_fileoff_t		lastblock;
 	xfs_inode_t		pip;
 	dir_hash_ent_t		*p;
-	int			done;
+	int			done = 0;
 
 	/*
 	 * trash directory completely and rebuild from scratch using the
@@ -1352,12 +1352,25 @@ longform_dir2_rebuild(
 			error);
 
 	/* free all data, leaf, node and freespace blocks */
-	error = -libxfs_bunmapi(tp, ip, 0, lastblock, XFS_BMAPI_METADATA, 0,
-				&done);
-	if (error) {
-		do_warn(_("xfs_bunmapi failed -- error - %d\n"), error);
-		goto out_bmap_cancel;
-	}
+	while (!done) {
+	       error = -libxfs_bunmapi(tp, ip, 0, lastblock, XFS_BMAPI_METADATA,
+			               0, &done);
+	       if (error) {
+		       do_warn(_("xfs_bunmapi failed -- error - %d\n"), error);
+		       goto out_bmap_cancel;
+	       }
+	       error = xfs_defer_finish(&tp);
+	       if (error) {
+		       do_warn(_("defer_finish failed -- error - %d\n"), error);
+		       goto out_bmap_cancel;
+	       }
+	       /*
+		* Close out trans and start the next one in the chain.
+		*/
+	       error = xfs_trans_roll_inode(&tp, ip);
+	       if (error)
+			goto out_bmap_cancel;
+        }
 
 	ASSERT(done);
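
The "call back in until all work is done" pattern from the patch description can be illustrated in isolation. The following is only a minimal sketch, not xfsprogs code: mock_file, mock_bunmapi(), MOCK_MAX_BLOCKS_PER_CALL and unmap_all() are hypothetical names invented for this illustration, and the per-call limit merely stands in for the early return that xfs_bunmapi can make after the deadlock-avoidance commit.

```c
#include <assert.h>

/* Hypothetical stand-in for an inode's block map; not a libxfs type. */
struct mock_file {
	unsigned int	mapped_blocks;	/* blocks still mapped */
};

/* Assumed per-call limit, loosely modelling the early return. */
#define MOCK_MAX_BLOCKS_PER_CALL	4

/*
 * Hypothetical analogue of xfs_bunmapi(): unmaps at most
 * MOCK_MAX_BLOCKS_PER_CALL blocks per call and sets *done to 1 only
 * once nothing is left to unmap. Returns 0 on success.
 */
static int mock_bunmapi(struct mock_file *f, int *done)
{
	unsigned int n = f->mapped_blocks;

	if (n > MOCK_MAX_BLOCKS_PER_CALL)
		n = MOCK_MAX_BLOCKS_PER_CALL;
	f->mapped_blocks -= n;
	*done = (f->mapped_blocks == 0);
	return 0;
}

/*
 * Keep calling until the callee reports completion.
 * Returns the number of calls made, or -1 on error.
 */
static int unmap_all(struct mock_file *f)
{
	int done = 0;	/* must start at 0 so the loop body runs */
	int calls = 0;

	while (!done) {
		if (mock_bunmapi(f, &done))
			return -1;
		calls++;
		/* The patch finishes deferred ops and rolls the transaction here. */
	}
	assert(done);	/* now holds by construction of the loop */
	return calls;
}
```

With 10 mapped blocks and the assumed limit of 4 per call, unmap_all() makes three calls before done is set. A single call followed by the assertion, which is the shape of the code before the patch, would trip the assert after the first call.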

Comment 6 Tomasz Torcz 2018-10-23 16:28:18 UTC

I've tested your V2 patch with success. xfs_repair was able to fix enough problems to mount the partition. I can now proceed with copying the home directory from it. Thank you, Eric!

The disk itself was a few-years-old HDD, used in a laptop. dd_rescue displayed 26 read errors (IIRC) during the copying. The XFS partition served as rootfs (Fedora 28) on this laptop, and before I got it, the laptop was powered on a couple of times. Each boot ended in the initrd not being able to mount rootfs, and then the laptop was forced off.
Before I discovered the read errors, I had tried a couple of unsuccessful xfs_repair runs. So nothing special, just a worn-out HDD.

Comment 7 Eric Sandeen 2019-01-09 17:22:28 UTC

Ok, I'm going to close this as an upstream bug; I think it's solved your problem, and Fedora will inherit it with normal updates.  I don't think there's any reason to push this patch to the Fedora packages ahead of upstream.  If you disagree, let me know & reopen.

Thanks,
-eric