Bug 2312886 - BTRFS critical, corrupt node, unaligned pointer, should be aligned to 4096
Summary: BTRFS critical, corrupt node, unaligned pointer, should be aligned to 4096
Keywords:
Status: CLOSED UPSTREAM
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 40
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: fedora-kernel-btrfs
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-09-17 17:42 UTC by Chris Murphy
Modified: 2025-03-11 16:40 UTC
CC List: 19 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2025-03-11 16:40:37 UTC
Type: ---
Embargoed:


Attachments
kernel messages 1 (71.43 KB, text/plain), 2024-09-17 17:44 UTC, Chris Murphy
kernel messages 2 (508.23 KB, text/plain), 2024-09-17 18:03 UTC, Chris Murphy
kernel.log 6.10.5 buildvm-x86-10 (1.16 MB, text/plain), 2024-09-24 19:44 UTC, Chris Murphy
kernel.log 6.10.5 buildvm-x86-27 (631.61 KB, text/plain), 2024-09-26 00:14 UTC, Chris Murphy
systeminfo (30.85 KB, text/plain), 2024-09-29 22:20 UTC, Chris Murphy

Description Chris Murphy 2024-09-17 17:42:15 UTC
1. Please describe the problem:

Koji builders run in VMs, and it's the VM experiencing the problem: Btrfs write time tree block corruption is detected and the fs goes read-only.

2. What is the Version-Release number of the kernel:
6.10.6 and higher; 6.10.10 currently in testing.

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?:

Unknown; it could have been present since 6.10.0. The builders were on 6.9.7 (no problem), then on 6.10.6 (the problem is sporadic and not reproducible).


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

No, but it isn't transient. It's happening only on Koji builders (which run in VMs).


5. Does this problem occur with the latest Rawhide kernel?

Unknown.


6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No.



Reproducible: Sporadic

Comment 1 Chris Murphy 2024-09-17 17:44:45 UTC
Created attachment 2047279 [details]
kernel messages 1

kernel 6.10.9

Comment 2 Chris Murphy 2024-09-17 18:03:55 UTC
Created attachment 2047281 [details]
kernel messages 2

I can't tell if this is related. It happened about 10 days earlier.

The kernel is 6.10.6; we get a call trace implicating Btrfs, but there are no write time tree checker messages.

Comment 4 Kevin Fenzi 2024-09-23 17:09:32 UTC
Saw something on an aarch64 builder:

Sep 21 14:06:14 buildvm-a64-16.iad2.fedoraproject.org kernel: BTRFS critical (device vda3): corrupt leaf: root=256 block=553541238784 slot=9 ino=1551466593, name hash mismatch with key, have 0x00000000cb63b37b expect 0x00000000fec4d99b
Sep 21 14:06:14 buildvm-a64-16.iad2.fedoraproject.org kernel: BTRFS info (device vda3): leaf 553541238784 gen 623185 total ptrs 91 free space 5959 owner 256
Sep 21 14:06:14 buildvm-a64-16.iad2.fedoraproject.org kernel: BTRFS error (device vda3): block=553541238784 write time tree block corruption detected
Sep 21 14:06:14 buildvm-a64-16.iad2.fedoraproject.org kernel: BTRFS: error (device vda3) in btrfs_commit_transaction:2505: errno=-5 IO failure (Error while writing out transaction)
Sep 21 14:06:14 buildvm-a64-16.iad2.fedoraproject.org kernel: BTRFS info (device vda3 state E): forced readonly
Sep 21 14:06:14 buildvm-a64-16.iad2.fedoraproject.org kernel: BTRFS warning (device vda3 state E): Skipping commit of aborted transaction.
Sep 21 14:06:14 buildvm-a64-16.iad2.fedoraproject.org kernel: BTRFS error (device vda3 state EA): Transaction aborted (error -5)
Sep 21 14:06:14 buildvm-a64-16.iad2.fedoraproject.org kernel: BTRFS: error (device vda3 state EA) in cleanup_transaction:1999: errno=-5 IO failure
Sep 21 14:06:14 buildvm-a64-16.iad2.fedoraproject.org kernel: BTRFS: error (device vda3 state EA) in btrfs_sync_log:3174: errno=-5 IO failure

That's with 6.10.5-200.fc40.aarch64

And then another: 

6.10.5-200.fc40.x86_64

Sep 23 05:17:20 buildvm-x86-10.iad2.fedoraproject.org kernel: BTRFS warning (device vda3): csum hole found for disk bytenr range [4880289792, 4880293888)
Sep 23 05:17:20 buildvm-x86-10.iad2.fedoraproject.org kernel: BTRFS warning (device vda3): csum hole found for disk bytenr range [4880293888, 4880297984)
Sep 23 05:17:20 buildvm-x86-10.iad2.fedoraproject.org kernel: BTRFS warning (device vda3): csum failed root 256 ino 281 off 66670592 csum 0xb85d0050 expected csum 0x00000000 mirror 1
Sep 23 05:17:20 buildvm-x86-10.iad2.fedoraproject.org kernel: BTRFS error (device vda3): bdev /dev/vda3 errs: wr 0, rd 0, flush 0, corrupt 1, gen 0
...
Sep 23 09:38:53 buildvm-x86-10.iad2.fedoraproject.org kernel: BTRFS info (device loop0p4): last unmount of filesystem 2cf58b61-2878-4bc4-b0b0-64e75d467fdf
Sep 23 09:39:30 buildvm-x86-10.iad2.fedoraproject.org kernel: BTRFS info (device loop2p4): last unmount of filesystem 57ad289b-2ac5-48ea-86bf-841e20af8720
Sep 23 10:24:52 buildvm-x86-10.iad2.fedoraproject.org kernel: BTRFS critical (device vda3): corrupt leaf: root=7 block=1374024122368 slot=150, unexpected item end, have 4706614063 expect 14527
Sep 23 10:24:52 buildvm-x86-10.iad2.fedoraproject.org kernel: BTRFS info (device vda3): leaf 1374024122368 gen 699790 total ptrs 346 free space 2909 owner 7

That VM is on the same virthost as the original ones.

Those 3 (buildvm-x86-01/02/03) have been running 6.11.0 and have not had any problems since then.

I don't understand why this would start happening on the two above; they hadn't been rebooted in a while and were running 6.10.5, which seemed fine.
Perhaps something in userspace has changed? Or perhaps somehow it is a problem on that virthost, but then I don't see why the aarch64 VM on another virthost would have any problems. ;(

Comment 5 Qu Wenruo 2024-09-23 23:00:02 UTC
The one at Sep 21 14:06:14 looks like random corruption; maybe it's related to this one?
https://bugzilla.redhat.com/show_bug.cgi?id=2312886

And since it's write-time corruption, I guess the fsck will not report anything.

For the one at Sep 23 05:17:20, first the csum tree is corrupted (csum holes), then a leaf is corrupted.
(There are no further lines to tell whether it's write time corruption or persistent corruption on disk.)

Neither of them is a simple corruption (e.g. a bitflip); rather, a range of metadata is corrupted with what looks like garbage.


You mentioned that those reports are from VMs on different hosts; are there any more reports that might help find some common factors?

Comment 6 Qu Wenruo 2024-09-23 23:10:25 UTC
BTW, I found out that for "Sep 17 00:55:43 buildvm-x86-02.iad2.fedoraproject.org", the corruption in the tree node is also happening at slot 121,
the same node slot as the one reported at https://bugzilla.redhat.com/show_bug.cgi?id=2312886.

And node slot 121 is exactly at offset 4096, so the corruption all happens at the beginning of the 2nd page of a metadata block.
(By default the nodesize is 16K, so 4 4K pages are involved.)

Are all these aarch64 systems using 4K page size?
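
As a rough sanity check (not from the report itself), a back-of-the-envelope sketch of why slot 121 crosses the 4K page boundary; the struct sizes here are assumptions for illustration:

```
# Sketch: byte range of node slot 121 inside a metadata block, 4K pages assumed.
HEADER_SIZE = 101    # assumed sizeof(struct btrfs_header)
KEY_PTR_SIZE = 33    # assumed sizeof(struct btrfs_key_ptr): 17-byte key + 8-byte blockptr + 8-byte generation
PAGE_SIZE = 4096

slot = 121
start = HEADER_SIZE + slot * KEY_PTR_SIZE   # 4094
end = start + KEY_PTR_SIZE                  # 4127 (exclusive)
print(start, end, start // PAGE_SIZE, (end - 1) // PAGE_SIZE)
# -> 4094 4127 0 1: the key pointer for slot 121 straddles the boundary between
#    the first and second 4K page, and its blockptr field (start + 17 = 4111)
#    falls entirely in the second page.
```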

Comment 7 Chris Murphy 2024-09-23 23:22:32 UTC
>Are all these aarch64 systems using 4K page size?

Yes

>the same node slot as the one reported at https://bugzilla.redhat.com/show_bug.cgi?id=2312886.

This is bug 2312886, maybe you mean bug 2314331?

We are definitely seeing way more than the usual amount of read and write time tree checker warnings, and I'm not finding a pattern so far, except that they aren't single bit flips. It may be workload specific.

Comment 8 Qu Wenruo 2024-09-23 23:36:43 UTC
> This is bug 2312886, maybe you mean bug 2314331?

My bad, yes I mean 2314331.

> We are definitely seeing way more than the usual amount of read and write time tree checker warnings, and I'm not finding a pattern so far, except that they aren't single bit flips. It may be workload specific.

Mind extracting all tree-checker errors found so far? (They need to include the read/write time corruption line.)

So far I'm mostly seeing write time errors, but more dmesgs won't hurt.
And if possible, full dmesg please.

Comment 9 Kevin Fenzi 2024-09-23 23:53:09 UTC
Just to provide more data:

All the buildvm-x86 ones are x86_64 VMs, and so far all on the same virthost.

The buildvm-a64 one is an aarch64 VM on another (aarch64) virthost.

Before seeing the aarch64 one I was starting to think there was some odd hardware issue on the x86_64 vmhost.

Comment 10 Qu Wenruo 2024-09-24 00:07:52 UTC
I'm going through the v6.10~v6.11 PRs.

And it turns out there is a PR that introduced a change to how we handle the page allocation for extent buffers (tree blocks).
That change introduced a bug, but it was fixed in v6.10-rc3, so it should not be the cause.


For both systems, do you still have the fs intact? Mind running "btrfs check --check-data-csum" on both filesystems?


And does Fedora provide 16K/64K page size kernels? If so, mind trying such a larger page size kernel (with the same v6.10.x)?
Since node slot 121 crosses a page boundary, I'm wondering if a larger page size would eliminate the cross-page handling and thus help us pin down the bug.
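
For what it's worth, a quick way to confirm the page size a given builder is actually running with (a convenience snippet, not something from this report):

```
# Print the runtime page size; 4096 on these x86_64 VMs, while 4K/16K/64K are all possible on aarch64.
import os
print(os.sysconf("SC_PAGE_SIZE"))
```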

Comment 11 Kevin Fenzi 2024-09-24 00:32:23 UTC
Oops. I just did a reinstall of those... I should have kept the disk, but I didn't think to. ;( 
I'll save the next one that happens.

No larger page size kernels that I am aware of. ;( RHEL has them, but fedora hasn't bothered...

Comment 12 Chris Murphy 2024-09-24 01:09:44 UTC
Thanks Qu! dmesg is a problem: once the fs goes read-only, journald can't write, and the dmesg buffer fills up with complaints from the system that it can't write. We do have central logging with rsyslog; the only problem is that it also contains user space messages, but we'll try to filter out the bulk of those. It shouldn't take long for the new VM to run into an issue if history is a predictor.

Comment 13 Qu Wenruo 2024-09-24 01:23:24 UTC
Just in case it's some weird bug in v6.10.x (or even older), a VM with v6.11 as a control would also be appreciated.

Comment 14 Kevin Fenzi 2024-09-24 02:37:32 UTC
Yeah, I moved the 3 compose VMs to 6.11 last week and so far no issues.

buildvm-x86-02.iad2.fedoraproject.org | CHANGED | rc=0 >>
6.11.0-63.fc41.x86_64
 02:36:10 up 5 days,  3:38,  1 user,  load average: 0.00, 0.00, 0.17
buildvm-x86-03.iad2.fedoraproject.org | CHANGED | rc=0 >>
6.11.0-63.fc41.x86_64
 02:36:10 up 5 days,  3:38,  1 user,  load average: 0.00, 0.00, 0.06
buildvm-x86-01.iad2.fedoraproject.org | CHANGED | rc=0 >>
6.11.0-63.fc41.x86_64
 02:36:10 up 5 days,  3:39,  1 user,  load average: 0.00, 0.00, 0.17

Comment 15 Chris Murphy 2024-09-24 19:44:51 UTC
Created attachment 2048547 [details]
kernel.log 6.10.5 buildvm-x86-10

Complete kernel.log for the 2nd excerpt in comment 4. We don't have this file system anymore, but maybe the fuller dmesg helps.

Note, you can ignore all loop devices (various file systems including btrfs). The workload generates many mount and umount events using loop devices. The only device of interest for the bug is vda3, which is the root file system.

kernel 6.10.5-200.fc40.x86_64, buildvm-x86-10

--
Write time tree checker catches these items:

Sep 23 10:24:52 buildvm-x86-10 kernel: #011item 150 key (18446744073709551606 128 4880269312) itemoff -318752581 itemsize 730399348
Sep 23 10:24:52 buildvm-x86-10 kernel: #011item 151 key (12258805333187274739 82 4880318196) itemoff 14499 itemsize 24
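
For what it's worth, those keys decode as csum tree items, consistent with the root=7 leaf in comment 4; a small sketch of the decoding (the constants are assumptions from the btrfs on-disk format, not taken from this log):

```
# Decode the objectid of item 150 and sanity-check its itemoff/itemsize against a 16K leaf.
EXTENT_CSUM_OBJECTID = (1 << 64) - 10    # assumed value of BTRFS_EXTENT_CSUM_OBJECTID (-10 as u64)
NODESIZE = 16 * 1024                     # default btrfs nodesize

print(18446744073709551606 == EXTENT_CSUM_OBJECTID)            # True: item 150 is a checksum item (key type 128)
print(0 <= -318752581 < NODESIZE, 0 <= 730399348 < NODESIZE)   # False False: neither value fits in a 16K leaf,
                                                               # which is why the write time tree checker aborts.
```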

Comment 16 Qu Wenruo 2024-09-25 00:19:22 UTC
The new example shows a wider range of corruption in the csum tree: item 150 has its item offset and item size corrupted.
That is at eb offset 3868.

And the key of item 151 is also corrupted; that is at eb offset 3876.

The range is not at a page boundary, so I guess the possibility of btrfs cross-page handling causing the problem is much lower.
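
The eb offsets quoted above line up with the usual leaf item layout; a quick sketch (the 101-byte header and 25-byte item structs are assumptions, not taken from the log):

```
# Where the item metadata for slots 150 and 151 sits inside the leaf.
HEADER_SIZE = 101   # assumed sizeof(struct btrfs_header)
ITEM_SIZE = 25      # assumed sizeof(struct btrfs_item): 17-byte key + 4-byte offset + 4-byte size

item_150 = HEADER_SIZE + 150 * ITEM_SIZE   # 3851
offsize_150 = item_150 + 17                # 3868: item 150's offset/size fields
key_151 = HEADER_SIZE + 151 * ITEM_SIZE    # 3876: item 151's key
print(offsize_150, key_151)                # -> 3868 3876, matching the eb offsets above
```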

Comment 17 Chris Murphy 2024-09-26 00:14:00 UTC
Created attachment 2048731 [details]
kernel.log 6.10.5 buildvm-x86-27

New event. Uptime 33 days. kernel 6.10.5

buildvm-x86-27 is located on a different but similar virthost from the others.

I've also tacked on a btrfs check at the end of this log (no errors).

Comment 18 Qu Wenruo 2024-09-26 01:01:51 UTC
The log from buildvm-x86-27 shows the same memory range corruption.
Thankfully the fs flips RO to prevent further corruption, so the metadata is fine.

Furthermore, before the corruption, I have seen a lot of segfaults:

```
Aug 23 04:49:49 buildvm-x86-27 kernel: python[281272]: segfault at 10 ip 00007f4fbe56b9f5 sp 00007fff1e4d9d40 error 4 in libpython3.13d.so.1.0[16b9f5,7f4fbe400000+4b0000] likely on CPU 2 (core 0, socket 2)
Aug 23 04:49:49 buildvm-x86-27 kernel: Code: da fd ff b8 01 00 00 00 c9 c3 55 48 89 e5 48 8b 05 f0 4d 5c 00 48 8b 00 5d c3 55 48 89 e5 48 89 7d f8 48 89 75 f0 48 8b 45 f8 <48> 8b 40 10 48 8b 88 a0 38 00 00 48 8b 45 f8 48 8b 40 10 48 8b 55
```

```
Aug 25 16:57:08 buildvm-x86-27 kernel: absl_poison_tes[2846866]: segfault at 7ff7388a8800 ip 00005615a3d8459a sp 00007ffd01359c10 error 4 in absl_poison_test[359a,5615a3d81000+4000] likely on CPU 2 (core 0, socket 2)
Aug 25 16:57:08 buildvm-x86-27 kernel: Code: b8 ff ff ff ff f0 0f c1 01 e9 a7 fc ff ff b8 ff ff ff ff f0 0f c1 01 e9 dc f9 ff ff 48 8b 85 38 ff ff ff 48 8d 3d e6 5a 00 00 <8b> 30 e8 9f df ff ff e9 38 f8 ff ff 49 8b 07 48 8d 0d 90 ec ff ff
```

```
Aug 27 23:32:08 buildvm-x86-27 kernel: hydra_pmi_proxy[2369591]: segfault at 0 ip 00007fb5e736dbd6 sp 00007ffe1354f6e8 error 4 in libc.so.6[16dbd6,7fb5e7228000+175000] likely on CPU 1 (core 0, socket 1)
Aug 27 23:32:08 buildvm-x86-27 kernel: Code: 00 00 00 00 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 f3 0f 1e fa 89 f8 09 f0 c1 e0 14 3d 00 00 00 f8 0f 87 2a 03 00 00 <62> e1 fe 28 6f 0f 62 b2 75 20 26 d1 62 f3 75 22 3f 0e 00 c5 fb 93
```

```
Sep 25 14:55:26 buildvm-x86-27 kernel: null-9.exe[2025206]: segfault at 0 ip 0000000000401139 sp 00007ffcb9cd5190 error 4 in null-9.exe[1139,400000+2000] likely on CPU 0 (core 0, socket 0)
Sep 25 14:55:26 buildvm-x86-27 kernel: Code: 90 c3 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 f3 0f 1e fa eb 8a 48 83 ec 08 be 00 00 00 00 bf 40 40 40 00 e8 f7 fe ff ff <8b> 04 25 00 00 00 00 48 83 c4 08 c3 00 00 00 f3 0f 1e fa 48 83 ec
```

```
Sep 25 14:55:42 buildvm-x86-27 kernel: pr59667.exe[2028995]: segfault at 0 ip 000000000040127c sp 00007ffc8efc0190 error 6 in pr59667.exe[127c,400000+2000] likely on CPU 2 (core 0, socket 2)
Sep 25 14:55:42 buildvm-x86-27 kernel: Code: 83 7d d8 00 74 0b 48 89 c2 83 e2 03 48 85 d2 74 0d 48 89 c6 bf 40 41 40 00 e8 f0 fd ff ff 48 8b 45 d8 f3 0f 10 05 a4 13 00 00 <f3> 0f 11 00 b8 00 00 00 00 48 83 c4 28 5b 41 5c 41 5d 5d c3 f3 0f
```

Not sure what's going wrong, especially since a lot of the crashed programs seem to be Win32 programs.
Maybe it's some corrupted data from filesystems (not only btrfs).

Do those crashed programs also show up on other, non-crashing kernels (e.g. v6.11)?

Comment 19 Kevin Fenzi 2024-09-26 01:28:50 UTC
Those are all from builds that happened on the builder, crashing for whatever reason... I don't think they mean anything for the host system.

For example "absl_poison_test" is a chromium test

Clearly not great that those failed, but scratch builds could contain anything. 

I don't think those are related to the actual problem with the fs.

Comment 20 Chris Murphy 2024-09-26 05:33:59 UTC
The workload seems relevant, since some of the builders hit the problem much more frequently than others. But in any case it's a recent problem, yet the workload is the same (including various failed builds).

In bug 2312886 (this one), the problem appeared in the 6.10 series. In bug 2314331 (a user report) it began in the 6.8 series. But it's unclear to me if they're related.

Comment 21 Qu Wenruo 2024-09-26 06:54:54 UTC
Mind deploying several builders with debug v6.10.x kernels?

The user report (bug 2314331) got random crashes unrelated to btrfs after migrating to a debug kernel.
Now I'm not sure if it's some weird MM regression causing the problem.
It's just that btrfs is much better at detecting MM problems due to its higher memory usage for metadata.

Comment 22 Kevin Fenzi 2024-09-26 23:18:01 UTC
Well, we have 140 builders. This seems to be happening to some random one every few days. If I moved a few to debug kernels, they likely wouldn't hit it for quite some time. ;( 

If we moved all of them to debug kernels, maintainers would revolt as things would be a lot slower (I assume).

Still no signs of it on the 3 I moved to 6.11.x...

Comment 23 Chris Murphy 2024-09-29 22:20:01 UTC
Created attachment 2049456 [details]
systeminfo

System and btrfs info for one of the VMs.

The workload is Koji, which uses systemd-nspawn to run its many varied tasks. By the looks of it, it's not that memory- or IO-bound; it's more CPU-bound than anything else, but I'm not sure to what degree this varies as the tasks progress. The artifacts are stored locally, then copied via NFS to network storage.

Comment 24 Chris Murphy 2024-09-30 17:51:44 UTC
See also bug 2315716

Comment 25 Chris Murphy 2025-03-11 16:40:37 UTC
We haven't seen this in some time, so it must have been fixed upstream.

