Bug 2232497 - btrfs error object already exists failed to recover log tree
Summary: btrfs error object already exists failed to recover log tree
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 38
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: fedora-kernel-btrfs
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-08-17 06:30 UTC by cornel panceac
Modified: 2023-08-23 17:33 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments
picture of the error screen (2.29 MB, image/jpeg)
2023-08-17 06:30 UTC, cornel panceac
screenshot of bad colours in gnome terminal (164.68 KB, image/png)
2023-08-18 06:22 UTC, cornel panceac
pic1 (5.63 MB, image/jpeg)
2023-08-18 06:32 UTC, cornel panceac
pic2 (3.64 MB, image/jpeg)
2023-08-18 06:33 UTC, cornel panceac
pic3 (4.25 MB, image/jpeg)
2023-08-18 06:34 UTC, cornel panceac

Description cornel panceac 2023-08-17 06:30:09 UTC
Created attachment 1983737
picture of the error screen

Description of problem:
After a power failure I get this error:
BTRFS error: Device nvme0n1p6: state ....error=-n17 Object already exists (Failed to recover log tree)

How can I fix the file system without losing data?



Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
See attached picture.

Comment 1 cornel panceac 2023-08-17 06:40:24 UTC
$ sudo btrfsck /dev/nvme0n1p6
Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p6
UUID: 8476540f-ac0e-41a1-9ef9-7f833de63382
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
[4/7] checking fs roots
root 257 inode 1686811 errors 200, dir isize wrong
root 257 inode 2722042 errors 1, no inode item
	unresolved ref dir 1686811 index 418215 namelen 15 name imjournal.state filetype 1 errors 5, no dir item, no inode ref
root 257 inode 2722043 errors 1, no inode item
	unresolved ref dir 1686811 index 418217 namelen 15 name imjournal.state filetype 1 errors 5, no dir item, no inode ref
ERROR: errors found in fs roots
found 984442568704 bytes used, error(s) found
total csum bytes: 826128480
total tree bytes: 3355852800
total fs tree bytes: 2182397952
total extent tree bytes: 187236352
btree space waste bytes: 631685011
file data blocks allocated: 4400592211968
 referenced 1015906455552
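
The "checking fs roots" errors above can be grouped by error type to make them easier to read. The following is a minimal Python sketch, not part of btrfs-progs; the regular expression is an assumption based only on the line layout seen in this report, and other versions of the tool may print differently.

```python
import re
from collections import defaultdict

# Sample lines copied from the read-only check output in this report.
SAMPLE = """\
root 257 inode 1686811 errors 200, dir isize wrong
root 257 inode 2722042 errors 1, no inode item
root 257 inode 2722043 errors 1, no inode item
"""

# Assumed layout: root <id> inode <number> errors <code>, <description>
LINE_RE = re.compile(r"^root (\d+) inode (\d+) errors (\w+), (.+)$")


def summarize(text):
    """Group the reported inode numbers by their error description."""
    by_error = defaultdict(list)
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            _root, inode, _code, description = match.groups()
            by_error[description].append(int(inode))
    return dict(by_error)


summary = summarize(SAMPLE)
for description, inodes in summary.items():
    print(f"{description}: {inodes}")
```

Here both "no inode item" entries are referenced from directory inode 1686811 (the "unresolved ref dir" lines), which is also the inode flagged with "dir isize wrong".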

Comment 2 cornel panceac 2023-08-17 06:49:44 UTC
$ sudo btrfsck --repair /dev/nvme0n1p6
enabling repair mode
WARNING:

	Do not use --repair unless you are advised to do so by a developer
	or an experienced user, and then only after having accepted that no
	fsck can successfully repair all types of filesystem corruption. E.g.
	some software or hardware bugs can fatally damage a volume.
	The operation will start in 10 seconds.
	Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p6
UUID: 8476540f-ac0e-41a1-9ef9-7f833de63382
repair mode will force to clear out log tree, are you sure? [y/N]: n


***

Please advise as to what is the next step.

Comment 3 cornel panceac 2023-08-17 07:10:50 UTC
Because I cannot use my computer or access my user data, I've increased the severity of this ticket.

Comment 4 Neal Gompa 2023-08-17 08:03:35 UTC
Switching to the right component...

Comment 5 Neal Gompa 2023-08-17 08:05:08 UTC
Actually switch it to the kernel btrfs, since this is a kernel-space thing.

Comment 6 Josef Bacik 2023-08-17 15:06:50 UTC
To get your machine back, answer yes when it asks whether it's OK to clear the log; at most you'll lose the last 30 seconds' worth of changes to the disk.

What kernel was this on?  We had a bug in this area, and the fix was sent back to stable; it should have made it to all the relevant Fedora kernels a while ago.
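
The 30-second figure matches the default btrfs commit interval (the `commit=` mount option, 30 seconds unless overridden). Reading it that way is an assumption, since the comment does not name the source of the number. A minimal Python sketch that extracts the interval from a comma-separated mount-options string, such as the options field of a line in /proc/mounts; the option strings below are hypothetical examples, not taken from this system:

```python
DEFAULT_COMMIT_SECONDS = 30  # btrfs default when no commit= option is given


def commit_interval(mount_options):
    """Return the commit interval in seconds from a comma-separated option string."""
    for option in mount_options.split(","):
        key, _, value = option.partition("=")
        if key == "commit" and value.isdigit():
            return int(value)
    return DEFAULT_COMMIT_SECONDS


# Hypothetical option strings, for illustration only.
print(commit_interval("rw,relatime,ssd,space_cache=v2,subvol=/home"))
print(commit_interval("rw,noatime,commit=120"))
```

A larger `commit=` value would widen the window of changes that exist only in the log tree, so it is worth checking before relying on the 30-second estimate.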

Comment 7 cornel panceac 2023-08-17 17:00:37 UTC
OK, thank you. Here are the results so far (after this post I'll reboot and check that it is really back to life, then I'll report some more):

$ sudo btrfsck --repair /dev/nvme0n1p6
enabling repair mode
WARNING:

	Do not use --repair unless you are advised to do so by a developer
	or an experienced user, and then only after having accepted that no
	fsck can successfully repair all types of filesystem corruption. E.g.
	some software or hardware bugs can fatally damage a volume.
	The operation will start in 10 seconds.
	Use Ctrl-C to stop it.
10 9 8 7 6 5 4 3 2 1
Starting repair.
Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p6
UUID: 8476540f-ac0e-41a1-9ef9-7f833de63382
repair mode will force to clear out log tree, are you sure? [y/N]: y
[1/7] checking root items
Fixed 0 roots.
[2/7] checking extents
super bytes used 984442552320 mismatches actual used 984442568704
No device size related problem found
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
Deleting bad dir index [1686811,96,418215] root 257
Deleting bad dir index [1686811,96,418217] root 257
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 1968885121024 bytes used, no error found
total csum bytes: 1652256960
total tree bytes: 6711689216
total fs tree bytes: 4364795904
total extent tree bytes: 374456320
btree space waste bytes: 1263353830
file data blocks allocated: 8801184423936
 referenced 2031812911104

Then I checked the file system once more:

$ sudo btrfsck /dev/nvme0n1p6
Opening filesystem to check...
Checking filesystem on /dev/nvme0n1p6
UUID: 8476540f-ac0e-41a1-9ef9-7f833de63382
[1/7] checking root items
[2/7] checking extents
[3/7] checking free space cache
cache and super generation don't match, space cache will be invalidated
[4/7] checking fs roots
[5/7] checking only csums items (without verifying data)
[6/7] checking root refs
[7/7] checking quota groups skipped (not enabled on this FS)
found 984442568704 bytes used, no error found
total csum bytes: 826128480
total tree bytes: 3355852800
total fs tree bytes: 2182397952
total extent tree bytes: 187219968
btree space waste bytes: 631683789
file data blocks allocated: 4400592211968
 referenced 1015906455552

All of this was done from the F38 Workstation live CD (updated on the 1st of August, as linked in the #fedora IRC channel).
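
The counters printed at the end of the --repair run are close to twice those of the read-only checks (for example 1968885121024 versus 984442568704 bytes used). The thread does not explain this. One plausible reading, which is an assumption and not confirmed here, is that the repair run added up its counters over two passes; the later read-only check reports the original figures again. A small Python check of the ratios, using only numbers quoted in this report:

```python
# Counters from the read-only check in Comment 1 (before the repair).
read_only = {
    "bytes used": 984442568704,
    "total csum bytes": 826128480,
    "total tree bytes": 3355852800,
    "total fs tree bytes": 2182397952,
    "total extent tree bytes": 187236352,
    "btree space waste bytes": 631685011,
    "file data blocks allocated": 4400592211968,
    "referenced": 1015906455552,
}

# Counters printed at the end of the --repair run in Comment 7.
repair = {
    "bytes used": 1968885121024,
    "total csum bytes": 1652256960,
    "total tree bytes": 6711689216,
    "total fs tree bytes": 4364795904,
    "total extent tree bytes": 374456320,
    "btree space waste bytes": 1263353830,
    "file data blocks allocated": 8801184423936,
    "referenced": 2031812911104,
}

# Ratio of each repair-run counter to its read-only counterpart.
ratios = {name: repair[name] / read_only[name] for name in read_only}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.6f}")
```

Every ratio comes out at 2 or just below it, which suggests the larger figures reflect how the run counted rather than a change in the amount of data on disk.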

Comment 8 cornel panceac 2023-08-17 17:04:40 UTC
ok, computer is back. 
THANK YOU!
here's the kernel:

$ uname -a
Linux fedora 6.4.7-200.fc38.x86_64 #1 SMP PREEMPT_DYNAMIC Thu Jul 27 20:01:18 UTC 2023 x86_64 GNU/Linux

Probably nothing was lost, since it was early in the morning and I had done no real work by then.

Comment 9 cornel panceac 2023-08-17 17:22:00 UTC
Hmmm, maybe I need some time to adapt to the new reality, but is it possible that more than the last 30 seconds was lost? My local git repos seem to be in a rather old state. Tomorrow (Bucharest time) I'll compare them to upstream and let you know whether my current perception is correct.

Comment 10 cornel panceac 2023-08-18 06:22:12 UTC
Created attachment 1983930
screenshot of bad colours in gnome terminal

Comment 11 cornel panceac 2023-08-18 06:24:07 UTC
Sorry about this, it seems that attaching a file discards the current comment :/
Here's the discarded comment:

"
Before anything else, I made a mistake: seeing that I was behind on updates, I ran a dnf upgrade immediately after recovering the filesystem. I'll attach the updated package list and one screenshot.

Then, the git repo looks good (just too many untracked files made me see red :)). However, a number of applications have display problems: for example, the font color is wrong in gnome terminal, and the application bar colors are wrong in the screenshot application, gnome terminal, firefox and chrome.

Because I did that upgrade, I cannot tell whether this is caused by the upgrade, by the btrfs issue, or by something else.
After sending this update I'll create a new user and check whether everything is OK there.
"

I'll come back with some more picture(s), and with a status report for the new user.

Comment 12 cornel panceac 2023-08-18 06:32:34 UTC
Created attachment 1983932
pic1

Comment 13 cornel panceac 2023-08-18 06:33:05 UTC
Created attachment 1983933
pic2

Comment 14 cornel panceac 2023-08-18 06:34:52 UTC
Created attachment 1983934
pic3

These three pictures show the problematic behaviour for the user that was logged in when btrfs was affected by the power failure, and the correct behaviour for the user which was created after btrfs was fixed.

Comment 15 cornel panceac 2023-08-19 04:12:03 UTC
How can I check whether I've really lost only the last 30 seconds of changes and not the last three days?
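
One rough way to approach this question, not something proposed in this thread, is to list the most recently modified files in a directory tree and see whether files changed shortly before the power failure are still present with their expected timestamps. A minimal Python sketch; the function name and the choice of the current directory as the starting point are illustrative assumptions:

```python
import os
import time


def newest_files(root, limit=10):
    """Return up to `limit` (mtime, path) pairs under `root`, newest first."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                found.append((os.lstat(path).st_mtime, path))
            except OSError:
                continue  # skip files that vanish or cannot be inspected
    found.sort(reverse=True)
    return found[:limit]


if __name__ == "__main__":
    # Walk the current directory; change "." to the tree you want to inspect.
    for mtime, path in newest_files("."):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(mtime))
        print(stamp, path)
```

Files written after the recovery (for example by a later dnf upgrade) will appear first, so the entries of interest are those dated just before the failure. This only shows which modifications survived; it cannot list changes that were lost.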

I believe the problem here is bigger than 'some user has lost his files'.
For one thing, this may happen to anyone, and if a large enough number of users hit this problem, the pressure on the Fedora project may be much greater.
Also, can the messages from the btrfs tools be more user-friendly and less scary?
For example, is there any documentation where a user can find out what it means for the log tree to be 'cleared out'?
If a user decides to wait for a '(btrfs?) developer or an experienced (btrfs?) user' to provide feedback but that feedback never comes, what are the user's alternatives?

I understand that although Red Hat gave up on btrfs as a technology preview, btrfs has certain qualities that convinced the Fedora project to use it as the default filesystem.
And it certainly served me well until I ran into the reported problem.
It would be useful if this were accompanied not only by better tools but also by better (or more visible) documentation.

Another example: when installing Fedora, provide a short inline summary of why to choose btrfs and why to choose some other filesystem. Also provide some external links, usable mostly when installing from a live CD.

Besides reporting this problem and offering my two cents, what else can I do to improve this situation?

Comment 16 Josef Bacik 2023-08-23 17:33:20 UTC
The messages are a bit unfriendly, I will send patches to make the tooling less scary.

Additionally fsck with --check should indeed allow for the log to be cleared without asking first.  I will update this as well.

Unfortunately, you got hit by a bug in the logging code. The bug was short-lived upstream, but it was still there if you didn't upgrade your kernel after an update, which is a common occurrence.

As for the rest of your symptoms, those are unlikely to be related to the file system; they are just an unhappy coincidence.  The fsck did the correct thing in fixing your file system: it simply updated the incorrect directory index entries, which don't affect actual files and are simply a readdir optimization.  The tree log only holds what happened in up to the last 30 seconds, so that's all you would have truly lost.

I agree, the recovery tools for btrfs are relatively scary; the hope is that they only have to be brought out in extreme cases.  We will put some effort into documenting this and making it less terrifying when they do have to be used.

