I shut down the system with a USB drive attached; XFS recovered, but I thought I'd run xfs_repair to check the disk. I found that each time xfs_repair runs, it lists multiple different errors, such as:

block (7,1302-1303) multiply claimed by bno space tree, state - 2
clearing needsrepair flag and regenerating metadata
block (1,5468-5469) multiply claimed by bno space tree, state - 2
block (0,12921436-12921436) multiply claimed by bno space tree, state - 2
block (3,2173-2174) multiply claimed by bno space tree, state - 2
block (6,3451-3452) multiply claimed by bno space tree, state - 2
block (4,527038-527039) multiply claimed by bno space tree, state - 2
block (5,2686-2687) multiply claimed by bno space tree, state - 2
block (7,1302-1303) multiply claimed by cnt space tree, state - 2
Metadata CRC error detected at 0x5640aac626ad, xfs_cntbt block 0x100184740/0x1000
btree block 2/198890 is suspect, error -74
bad magic # 0 in btcnt block 2/198890

Each time I run it I get different errors. When I check with smartmontools it says no errors, drive is OK. I even ran the long self-test on 2 different drives and they both reported no errors, yet xfs_repair has issues every time it is run. I've run Seagate's Linux tools, and they also report the drive is OK.

This is also happening on a drive I purchased just last week. I suppose it could be defective, but having this problem on 4 different USB drives at the same time seems a bit too much of a coincidence.

Reproducible: Always

Steps to Reproduce: See above.
Actual Results: See above.
Expected Results: See above.

I'm at a bit of a loss as to whether this is a filesystem bug or whether xfs_repair is broken. Shouldn't 1 or 2 passes of xfs_repair fix the issues if they actually exist? It seems odd that it just finds another problem every time it is run.
I am aware of the XFS bug: https://www.phoronix.com/news/Linux-6.3.5-Released
I am running 6.3.5, but for a time was running 6.3.3 and 6.3.4. I didn't encounter any errors in the journal at that time, so I thought I wasn't affected by the bug. Is it ill-advised to run xfs_repair unless you're in a last-resort situation, i.e. is it a tool that can cause damage? Thanks!
I'm doing some more testing, moving the USB drives to another system using another cable, and am now getting different results. It will take some time for me to do the different tests, so I would like to keep this open just in case I can't figure it out. In the meantime, if you have any tips or ideas of other things to try, please advise. Lowering priority and severity.
OK, I purchased a new USB 3.2 card and new cables, and the problem still occurs on my workstation. If I plug the drive into the USB port on my laptop, it works fine. On the workstation there are no errors in the logs about the USB drive, yet for some reason xfs_repair is detecting issues which do not exist. Any ideas for figuring out what is going on?
For starters, this has nothing to do with the kernel bug reported by Phoronix.

Please provide an xfs_metadump of the filesystem in question [1] (this will let us recreate the problem from a filesystem image), and let us know which version of xfs_repair you are using (xfs_repair -V).

If the metadump image is too big to attach to the bug, feel free to reach out to me via email.

[1]
# umount /dev/whatever
# xfs_metadump /dev/whatever filename.meta
# bzip2 filename.meta
Thanks Eric, I have found some additional information which may help. Please review and then let me know what additional information I should provide to assist.

I am using xfs_repair version 6.1.0.

First of all, to try to resolve the issue I purchased a new USB card: an Inateck PCIe to USB 3.2 Gen 2 card with 20 Gbps bandwidth, 3 USB Type-A and 2 USB Type-C ports (RedComets U21). I then purchased all new USB cables. The problem still occurs.

I found, however, that if I unplug the drive from the workstation and run xfs_repair on my laptop, it runs clean and finds no errors. To me, that seems to imply that XFS itself is running fine on the workstation, but that xfs_repair is reporting false positives for some reason.

It's not causing a problem for me, in that these USB drives are used for backup purposes and I can just restore the drive by running another rsync. The concerning thing is that, at least on some systems, it appears that if you run xfs_repair against USB drives you'll lose data, because xfs_repair will be moving files unnecessarily to lost+found. I suppose the data is still actually there in lost+found, but IMO it would be a PITA to get the files renamed, etc.

Below is a sample of the errors I receive on my workstation, followed by the clean run on my laptop. I've deleted quite a bit of the error output; I just wanted to give you an idea of the type of issues being reported.
>>>>>> HERE ARE THE ERRORS SHOWN ON THE WORKSTATION

xfs_repair -n /dev/sdh

xfs_repair reported a lot of issues. I deleted quite a bit; here is a small sample:

Phase 1 - find and verify superblock…
Phase 2 - using internal log
        - zero log…
        - scan filesystem freespace and inode maps…
        - found root inode chunk
Phase 3 - for each AG…
        - scan (but don’t clear) agi unlinked lists…
        - process known inodes and perform inode discovery…
        - agno = 0
        - agno = 1
inode identifier 2147862912 mismatch on inode 2147869056
would have cleared inode 2147869056
inode identifier 2147862913 mismatch on inode 2147869057
would have cleared inode 2147869057
        - agno = 2
inode identifier 4295556032 mismatch on inode 4295565184
would have cleared inode 4295565184
inode identifier 4295556033 mismatch on inode 4295565185
would have cleared inode 4295565185
        - agno = 3
inode identifier 6453851908 mismatch on inode 6453931524
would have cleared inode 6453931524
inode identifier 6453851909 mismatch on inode 6453931525
would have cleared inode 6453931525
        - agno = 4
        - agno = 5
        - agno = 6
inode identifier 12885557312 mismatch on inode 12885563840
would have cleared inode 12885563840
        - agno = 7
        - process newly discovered inodes…
Phase 4 - check for duplicate blocks…
        - setting up duplicate extent list…
        - check for inodes claiming duplicate blocks…
        - agno = 0
        - agno = 4
        - agno = 1
        - agno = 7
        - agno = 6
        - agno = 3
        - agno = 2
        - agno = 5
entry “background-43.jpg” at block 1 offset 96 in directory inode 2147622656 references free inode 2147869056
        would clear inode number in entry at offset 96…
entry “04_last_night.opus” at block 0 offset 200 in directory inode 2147864530 references free inode 2147869057
        would clear inode number in entry at offset 200…
inode identifier 2147862912 mismatch on inode 2147869056
would have cleared inode 2147869056
inode identifier 2147862913 mismatch on inode 2147869057
No modify flag set, skipping phase 5
Phase 6 - check inode connectivity…
        - traversing filesystem …
Metadata CRC error detected at 0x55f6d4375b70, xfs_dir3_block block 0x347e8/0x1000
expected owner inode 152592, got 146784, directory block 215016
would rebuild directory inode 152592
would create missing “.” entry in dir ino 152592
entry “background-43.jpg” in directory inode 2147622656 points to free inode 2147869056, would junk entry
bad hash table for directory inode 2147622656 (no data entry): would rebuild
would rebuild directory inode 2147622656
Metadata CRC error detected at 0x55f6d4377460, xfs_dir3_leaf1 block 0x284019d48/0x1000
leaf block 8388608 for directory inode 10737498380 bad CRC
would rebuild directory inode 10737498380
        - traversal finished …
        - moving disconnected inodes to lost+found …
disconnected inode 751929, would move to lost+found
disconnected inode 751930, would move to lost+found
disconnected inode 751931, would move to lost+found
disconnected inode 751932, would move to lost+found
Phase 7 - verify link counts…
would have reset inode 4304567810 nlinks from 6708 to 6706
No modify flag set, skipping filesystem flush and exiting.
>>>>>> HERE IS THE CLEAN RUN ON MY LAPTOP

And here is the same drive a few moments later with xfs_repair on my laptop:

xfs_repair /dev/sdb
Phase 1 - find and verify superblock…
Phase 2 - using internal log
        - zero log…
        - scan filesystem freespace and inode maps…
        - found root inode chunk
Phase 3 - for each AG…
        - scan and clear agi unlinked lists…
        - process known inodes and perform inode discovery…
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
        - process newly discovered inodes…
Phase 4 - check for duplicate blocks…
        - setting up duplicate extent list…
        - check for inodes claiming duplicate blocks…
        - agno = 0
        - agno = 1
        - agno = 2
        - agno = 3
        - agno = 4
        - agno = 5
        - agno = 6
        - agno = 7
Phase 5 - rebuild AG headers and trees…
        - reset superblock…
Phase 6 - check inode connectivity…
        - resetting contents of realtime bitmap and summary inodes
        - traversing filesystem …
        - traversal finished …
        - moving disconnected inodes to lost+found …
Phase 7 - verify and correct link counts…
done
Please provide the metadump as requested; then we can quickly differentiate between "this set of on-disk metadata is in fact repairable in one pass by xfs_repair" and "something about the metadata on *this hardware* does not stay 'repaired' after an xfs_repair run."

I do tend to suspect hardware errors as a possible culprit. For example:

> inode identifier 2147862912 mismatch on inode 2147869056

Those 2 numbers in binary are:

10000000000001011100100110000000
10000000000001011110000110000000
                  ^

> inode identifier 2147862913 mismatch on inode 2147869057

10000000000001011100100110000001
10000000000001011110000110000001
                  ^

Those look suspiciously like bit-flips.

> inode identifier 4295556033 mismatch on inode 4295565185

100000000000010001111101111000001
100000000000010010001111110000001

That one is more than one bit off, though, so not sure.

This is almost never the culprit, but is there any chance you have memory errors? Is the laptop flaky in any other way? Can you run a memory tester, just for fun?
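The comparison above can be automated. A small POSIX-shell sketch (the inode numbers are taken from the repair output earlier in this thread; count_flips is a hypothetical helper name) that XORs each mismatched pair and counts how many bits differ:

```shell
# XOR each "inode identifier X mismatch on inode Y" pair and count the
# differing bits; a count of 1-2 hints at memory or bus corruption.
count_flips() {
    x=$(( $1 ^ $2 ))
    n=0
    while [ "$x" -ne 0 ]; do
        n=$(( n + (x & 1) ))
        x=$(( x >> 1 ))
    done
    echo "$1 vs $2: $n bit(s) differ"
}

count_flips 2147862912 2147869056   # first pair from the repair output
count_flips 2147862913 2147869057
count_flips 4295556033 4295565185   # the "more than one bit off" pair
```

(Note the first two pairs each differ in two adjacent bit positions, as the aligned binary strings above show.)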
Hi Eric, thanks for the quick reply. The metadump file is too large for attachment, so I'll send a link to your email. If you have any issues downloading, please let me know.

Also, just to make sure you understand: the error is happening on my workstation. The same drive, when attached to my laptop, finishes xfs_repair with no errors. That is why I'm thinking it is a false positive. My workstation doesn't have any errors in the journal, nor warnings regarding memory. I'll look into running a memory tester.

Here is the info from neofetch, FYI (logo art omitted):

OS: Fedora release 38 (Thirty Eight) x86_64
Kernel: 6.3.8-200.fc38.x86_64
Uptime: 15 hours, 7 mins
Packages: 4935 (rpm), 1 (flatpak)
Shell: bash 5.2.15
Resolution: 1920x1080
DE: Plasma
WM: kwin
Theme: Breeze [GTK2], Adwaita [GTK3]
Icons: breeze [GTK2], Adwaita [GTK3]
Terminal: konsole
CPU: AMD FX-8350 (8) @ 3.740GHz
GPU: AMD ATI Radeon HD 7850 / R7 265 / R9 270 1024SP
Memory: 10599MiB / 31975MiB
Ah ok so I had the laptop & workstation backwards - in any case, it does seem to point to hardware. The metadump you provided (thanks) contains no metadata inconsistencies, xfs_repair runs clean. Did you gather it from the laptop or from the workstation?
(In reply to Eric Sandeen from comment #7)
> Ah ok so I had the laptop & workstation backwards - in any case, it does
> seem to point to hardware.
>
> The metadump you provided (thanks) contains no metadata inconsistencies,
> xfs_repair runs clean. Did you gather it from the laptop or from the
> workstation?

Hey Eric, again thanks for the quick response. I gathered it from the system that is getting the errors, the workstation. So to rehash:

1. xfs_repair gets a lot of errors when running against USB drives on my workstation.
2. xfs_metadump creates a file with no errors on the workstation.
3. When running xfs_repair against the drive on the laptop, it finds no errors, which I suppose makes sense, since the file created by xfs_metadump on the workstation has no errors.

The following may or may not mean anything, because I know nothing about XFS; I'm just hacking around with bard.google.com helping me, but I thought I'd share. I was curious, so I ran:

xfs_db -f sea8000_2.meta

and received this reply:

xfs_db: sea8000_2.meta is not a valid XFS filesystem (unexpected SB magic number 0x5846534d)
Use -F to force a read attempt.

Then I used -F:

xfs_db: sea8000_2.meta is not a valid XFS filesystem (unexpected SB magic number 0x5846534d)
xfs_db: V1 inodes unsupported. Please try an older xfsprogs.

I then found this: you can easily check what on-disk format you are using by running xfs_info /mount/point. It will say crc=0 if you are using v4 and crc=1 if you are using v5.
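A side note on that xfs_db error: the "unexpected SB magic number 0x5846534d" is not itself a sign of corruption. Decoded as ASCII, those four bytes spell "XFSM", the metadump file's own magic number (a real XFS superblock starts with "XFSB", 0x58465342), which is why xfs_db refuses to treat the .meta file as a filesystem. A quick shell sketch to decode it:

```shell
# Print the four bytes of the magic 0x5846534d as ASCII characters.
m=0x5846534d
for shift in 24 16 8 0; do
    byte=$(( (m >> shift) & 0xff ))
    printf '%b' "$(printf '\\0%o' "$byte")"   # emit the byte as a character
done
echo   # the loop prints: XFSM
```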
So then I ran xfs_info:

xfs_info /dev/sdh
meta-data=/dev/sdh               isize=512    agcount=8, agsize=268435455 blks
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1    bigtime=0 inobtcount=0 nrext64=0
data     =                       bsize=4096   blocks=1953506645, imaxpct=5
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=521728, version=2
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

My understanding is that crc=1 means I'm using the V5 format, so I shouldn't be getting this message.
P.S. Another data point that may or may not be helpful: these drives do not have a partition table. I created the filesystem with mkfs.xfs /dev/xxx.
The xfs metadump is not a filesystem image; it is more like a file that contains a filesystem image. You can use xfs_mdrestore to turn it back into a proper image that xfs_db can look at.

Whether or not you have a partition table doesn't really matter at all here.

At this point I think you have a hardware issue somewhere. If you have a disposable drive that you can overwrite, perhaps you can try some block device integrity checking or read/write IO to see what you get. Or go back to an older kernel, just to see if there is some regression in a hardware driver. But I'm just throwing darts now... this does not look like an xfsprogs bug to me, so I'm not sure how much further help I can provide.
Thanks for the reply, Eric. I'm trying to understand what type of hardware issue it could be. When I remove the drive from my workstation and plug it into the laptop, it works fine, so to me that would eliminate the drive itself as having a hardware issue. I purchased a new USB card for my computer and the issue still occurs; wouldn't that eliminate the USB port as a cause?

When I run xfs_metadump, it creates the image file cleanly with no errors. Since no errors are found on the laptop when I run xfs_repair, and the xfs_metadump I create on the workstation has no errors, I would conclude that XFS itself is not having any issues on the workstation. Is that not the correct assumption? The only issue is xfs_repair running on this particular machine, apparently detecting false positives for whatever reason. Why would it be a hardware issue if XFS itself and xfs_metadump are working properly? The thing that is failing is xfs_repair.
Are the workstation & the laptop the same architecture? Do the kernel version or xfsprogs version differ? It might be interesting to boot the laptop's kernel version on the workstation, and install the same xfsprogs, if that is at all possible. If it's the same arch, same kernel, and same xfsprogs, but it finds errors on one machine and not the other, I'm kind of out of ideas aside from a hardware difference (or problem).

Is there /any/ chance that when you're running repair on the workstation, the device has been magically mounted somewhere else by $SOMETHING (systemd-fu or gnome-fu), and xfs_repair is trying to repair a live, mounted device?
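One way to rule that out is to check /proc/mounts for the device right before invoking repair. A sketch, assuming the device is /dev/sdh as earlier in this thread (is_mounted is a hypothetical helper, not part of xfsprogs):

```shell
# Return success if DEVICE appears as a mount source in MOUNTS_FILE
# (defaults to /proc/mounts). Guards against udisks/systemd automounts.
is_mounted() {
    grep -q "^$1 " "${2:-/proc/mounts}"
}

if is_mounted /dev/sdh; then
    echo "/dev/sdh is mounted somewhere -- do not run xfs_repair"
else
    echo "/dev/sdh not mounted; safe to run: xfs_repair -n /dev/sdh"
fi
```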
Hi Eric, thanks again for trying to help. Much appreciated. Yes, it's definitely weird: the workstation and laptop are the same architecture, using the same kernel version and the same version of xfsprogs. I don't believe the drive is mounted anywhere else.

I just purchased yet another USB drive and am getting the same result, so that is two new drives experiencing the same issue. I agree that there is something on my workstation causing the issue with xfs_repair, but I don't have a clue as to what it could be. Since running xfs_repair on the drive from the laptop comes back clean, I'm assuming that the data on the drive is good. If you can think of some additional way to trace xfs_repair to find out what could be happening, let me know and I'll try it. My workstation was built in 2015, so maybe it's some BIOS quirk that xfs_repair doesn't like for some reason.
The only other thing I can think of is to test your memory, I'm afraid. Or, if you have space, dd the drive (without mounting it on either machine) to a file image on both systems, and compare the results. If they differ, the problem lies well outside xfs_repair.
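That dd comparison could look like the sketch below: hash the raw device on each machine and compare the digests; identical hashes mean both machines read identical bytes, which would point the finger away from the on-disk data. The device names (/dev/sdh on the workstation, /dev/sdb on the laptop) are taken from earlier in this thread; the sketch is demonstrated on a scratch file so it runs without a spare drive.

```shell
# On real hardware you would use if=/dev/sdh (unmounted!) instead of the
# scratch file, run once per machine, and compare the two digest lines.
scratch=$(mktemp)
dd if=/dev/zero of="$scratch" bs=1024 count=64 2>/dev/null

h1=$(dd if="$scratch" bs=1024 2>/dev/null | sha256sum | cut -d' ' -f1)
h2=$(dd if="$scratch" bs=1024 2>/dev/null | sha256sum | cut -d' ' -f1)

if [ "$h1" = "$h2" ]; then
    echo "reads match"
else
    echo "reads differ: suspect cable/controller/memory"
fi
rm -f "$scratch"
```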