Bug 2256861 - btrfs forced read-only out of space with kernel backtrace
Summary: btrfs forced read-only out of space with kernel backtrace
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 39
Hardware: Unspecified
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: fedora-kernel-btrfs
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2024-01-04 21:40 UTC by Dusty Mabe
Modified: 2024-02-22 02:23 UTC
CC List: 24 users

Fixed In Version: kernel-6.7.5-200.fc39 kernel-6.7.5-100.fc38
Clone Of:
Environment:
Last Closed: 2024-02-22 02:18:52 UTC
Type: ---
Embargoed:


Attachments
dmesg (66.10 KB, text/plain), 2024-01-04 21:43 UTC, Dusty Mabe
drgn script to dump unused and discard bgs (902 bytes, text/plain), 2024-01-05 20:17 UTC, Boris Burkov
updated drgn script v2 (1.89 KB, text/plain), 2024-01-08 20:50 UTC, Boris Burkov
drgnv2 output (7.10 KB, text/plain), 2024-01-09 03:23 UTC, Dusty Mabe
drgn-out-2.txt (25.51 KB, text/plain), 2024-01-10 22:13 UTC, Dusty Mabe

Description Dusty Mabe 2024-01-04 21:40:30 UTC
1. Please describe the problem:

Hitting a kernel backtrace on a btrfs filesystem when I wasn't expecting to run out of space (I should have had plenty of space left):


```
[88650.100940] RIP: 0010:btrfs_start_dirty_block_groups+0x565/0x5a0
[88650.101719] Code: 4c 89 f7 e8 5d b7 93 00 48 8b 34 24 48 8b 7c 24 08 e8 7f 09 f5 ff e9 02 fd ff ff 44 89 e6 48 c7 c7 30 a9 93 99 e8 2b 4f a5 ff <0f> 0b e9 25 fe ff ff be 01 00 00 00 e8 0a 39 15 00 e9 36 ff ff ff
[88650.103728] RSP: 0018:ffffae77c7a3bc58 EFLAGS: 00010286
[88650.104443] RAX: 0000000000000000 RBX: ffff9d05c05d3d00 RCX: 0000000000000000
[88650.105280] RDX: ffff9d08e46ae680 RSI: ffff9d08e46a1580 RDI: ffff9d08e46a1580
[88650.106127] RBP: ffff9d05c0d42400 R08: 0000000000000000 R09: ffffae77c7a3bae0
[88650.106968] R10: 0000000000000003 R11: ffffffff9a346508 R12: 00000000ffffffe4
[88650.107814] R13: ffff9d05c82bb770 R14: ffff9d06dca10930 R15: ffff9d05c0d42400
[88650.108665] FS:  00007f2a47dfe6c0(0000) GS:ffff9d08e4680000(0000) knlGS:0000000000000000
[88650.109581] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[88650.110295] CR2: 00007f2a47df2f78 CR3: 00000001bc4a2000 CR4: 0000000000350ee0
[88650.111137] Call Trace:
[88650.111626]  <TASK>
[88650.112065]  ? btrfs_start_dirty_block_groups+0x565/0x5a0
[88650.112786]  ? __warn+0x81/0x130
[88650.113346]  ? btrfs_start_dirty_block_groups+0x565/0x5a0
[88650.114049]  ? report_bug+0x171/0x1a0
[88650.114670]  ? prb_read_valid+0x1b/0x30
[88650.115269]  ? srso_return_thunk+0x5/0x10
[88650.115891]  ? handle_bug+0x3c/0x80
[88650.116546]  ? exc_invalid_op+0x17/0x70
[88650.117287]  ? asm_exc_invalid_op+0x1a/0x20
[88650.118137]  ? btrfs_start_dirty_block_groups+0x565/0x5a0
[88650.119030]  ? btrfs_start_dirty_block_groups+0x565/0x5a0
[88650.119943]  ? srso_return_thunk+0x5/0x10
[88650.120703]  btrfs_commit_transaction+0xb5/0xf50
[88650.121667]  btrfs_sync_file+0x4b3/0x580
[88650.122402]  __x64_sys_fsync+0x3b/0x70
[88650.123149]  do_syscall_64+0x60/0x90
[88650.123876]  ? srso_return_thunk+0x5/0x10
[88650.124605]  ? do_sys_openat2+0x97/0xe0
[88650.125319]  ? srso_return_thunk+0x5/0x10
[88650.126053]  ? srso_return_thunk+0x5/0x10
[88650.126679]  ? srso_return_thunk+0x5/0x10
[88650.127208]  ? syscall_exit_to_user_mode+0x2b/0x40
[88650.127824]  ? srso_return_thunk+0x5/0x10
[88650.128343]  ? do_syscall_64+0x6c/0x90
[88650.128859]  ? srso_return_thunk+0x5/0x10
[88650.129375]  ? do_syscall_64+0x6c/0x90
[88650.129884]  ? do_syscall_64+0x6c/0x90
[88650.130380]  ? do_syscall_64+0x6c/0x90
[88650.130882]  ? srso_return_thunk+0x5/0x10
[88650.131389]  ? do_syscall_64+0x6c/0x90
[88650.131880]  ? srso_return_thunk+0x5/0x10
[88650.132373]  ? do_syscall_64+0x6c/0x90
[88650.132859]  ? do_syscall_64+0x6c/0x90
[88650.133337]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[88650.133921] RIP: 0033:0x7f2a5dacf27c
[88650.134405] Code: 0f 05 48 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d fc e8 a0 f7 f7 ff 8b 7d fc 89 c2 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2c 89 d7 89 45 fc e8 02 f8 f7 ff 8b 45 fc c9
[88650.136196] RSP: 002b:00007f2a47dfc400 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[88650.136961] RAX: ffffffffffffffda RBX: 0000564edd14eb80 RCX: 00007f2a5dacf27c
[88650.137703] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000015
[88650.138450] RBP: 00007f2a47dfc410 R08: 0000000000000000 R09: 0000000000000001
[88650.139170] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f2a47dfc548
[88650.139907] R13: 00007f2a3806c1cb R14: 00007f2a38000d80 R15: 00007f2a47dfc470
[88650.140647]  </TASK>
[88650.140981] ---[ end trace 0000000000000000 ]---
[88650.141525] BTRFS info (device vda5: state EA): dumping space info:
[88650.142173] BTRFS info (device vda5: state EA): space_info DATA has 31601569792 free, is not full
[88650.143035] BTRFS info (device vda5: state EA): space_info total=41234202624, used=9632632832, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
[88650.144466] BTRFS info (device vda5: state EA): space_info METADATA has -7602176 free, is full
[88650.145300] BTRFS info (device vda5: state EA): space_info total=268435456, used=225853440, pinned=41402368, reserved=1114112, may_use=7602176, readonly=65536 zone_unusable=0
[88650.146843] BTRFS info (device vda5: state EA): space_info SYSTEM has 8372224 free, is not full
[88650.147715] BTRFS info (device vda5: state EA): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
[88650.149103] BTRFS info (device vda5: state EA): global_block_rsv: size 5767168 reserved 5767168
[88650.149978] BTRFS info (device vda5: state EA): trans_block_rsv: size 0 reserved 0
[88650.150773] BTRFS info (device vda5: state EA): chunk_block_rsv: size 0 reserved 0
[88650.151562] BTRFS info (device vda5: state EA): delayed_block_rsv: size 1310720 reserved 1310720
[88650.152451] BTRFS info (device vda5: state EA): delayed_refs_rsv: size 524288 reserved 524288
[88650.153309] BTRFS: error (device vda5: state EA) in btrfs_start_dirty_block_groups:3365: errno=-28 No space left
```


`df` reports I had space left:

```
[vagrant@vanilla-f39 osbuild]$ df -kh /
Filesystem      Size  Used Avail Use% Mounted on
/dev/vda5        39G  9.4G   30G  25% /
```

but I think I was actually running out of metadata space:

```
[vagrant@vanilla-f39 osbuild]$ sudo btrfs fi df /
Data, single: total=38.40GiB, used=8.97GiB
System, DUP: total=8.00MiB, used=16.00KiB
Metadata, DUP: total=256.00MiB, used=215.39MiB
GlobalReserve, single: total=5.50MiB, used=0.00B
[vagrant@vanilla-f39 osbuild]$ 
[vagrant@vanilla-f39 osbuild]$ sudo btrfs fi usage /
Overall:
    Device size:                  38.92GiB
    Device allocated:             38.92GiB
    Device unallocated:            1.00MiB
    Device missing:                  0.00B
    Device slack:                    0.00B
    Used:                          9.39GiB
    Free (estimated):             29.43GiB      (min: 29.43GiB)
    Free (statfs, df):            29.43GiB
    Data ratio:                       1.00
    Metadata ratio:                   2.00
    Global reserve:                5.50MiB      (used: 0.00B)
    Multiple profiles:                  no

Data,single: Size:38.40GiB, Used:8.97GiB (23.36%)
   /dev/vda5      38.40GiB

Metadata,DUP: Size:256.00MiB, Used:215.39MiB (84.14%)
   /dev/vda5     512.00MiB

System,DUP: Size:8.00MiB, Used:16.00KiB (0.20%)
   /dev/vda5      16.00MiB

Unallocated:
   /dev/vda5       1.00MiB
```
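
A minimal way to check for this condition, assuming the filesystem is mounted at /: the device is fully allocated even though the data chunks are mostly empty, so there is no room left to allocate a new metadata chunk.

```
# "Device unallocated" near zero combined with mostly-empty DATA chunks
# matches the situation above: no room left to allocate a metadata chunk.
sudo btrfs filesystem usage / | grep -E 'Device (allocated|unallocated):'
sudo btrfs filesystem df /
```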


When this happened I was running `osbuild`, which may or may not be relevant here:


```
[vagrant@vanilla-f39 osbuild]$ sudo python3 -m osbuild --libdir .  --store /var/osbuild/store/ --output-directory /var/osbuild/out/ --checkpoint=* --cache-max-size 20GiB --export qemu test/data/manifests/fedora-coreos-container.json
Pipeline oci-archive: d3feae20028e7ab8a73585709f34a72ed12cb7beebf77c49965914b770fbc3d2                                    
Build                                                        
  root: <host>                                               
  runner: org.osbuild.fedora38 (org.osbuild.fedora38)                                                                     
org.osbuild.copy: d3feae20028e7ab8a73585709f34a72ed12cb7beebf77c49965914b770fbc3d2 {                                      
  "paths": [                                                 
    {                                                        
      "from": "input://inlinefile/sha256:e6347908fb0621e0e37f0ef185d79b238de294d83b937969129979c67e7d20f2",                                                                                                                                          
      "to": "tree:///fcos.ociarchive"                        
    }                                                        
  ]                                                          
}                                                            
Failed to open file "/sys/fs/selinux/checkreqprot": Read-only file system                                                 
copying '/run/osbuild/inputs/inlinefile/sha256:e6347908fb0621e0e37f0ef185d79b238de294d83b937969129979c67e7d20f2' -> '/run/osbuild/tree/fcos.ociarchive'

⏱  Duration: 1s                                              
Pipeline tree: 9d862d300f0b2716ee0bb624afd8da67d944d38809aacea79b195ceb8ec8320f                                           
Build                                                        
  root: dc6e3a66fef3ebe7c815eb24d348215b9e5e2ed0cd808c15ebbe85fc73181a86                                                  
  runner: org.osbuild.fedora39 (org.osbuild.fedora38)                                                                     
  source-epoch: Mon Aug  1 23:42:11 2022 [1659397331]                                                                     
org.osbuild.ostree.deploy.container: 65a831a708f864d6693f7cc627f8b95c6acf997811c8248f62e341dd0938009a {                                                                                                                                              
  "osname": "fedora-coreos",                                 
  "target_imgref": "ostree-remote-registry:fedora:quay.io/fedora/fedora-coreos:stable",                                   
  "mounts": [                                                
    "/boot",                                                 
    "/boot/efi"                                              
  ],                                                         
  "kernel_opts": [                                           
    "rw",                                                    
    "$ignition_firstboot"                                    
  ]                                                          
}                                                            
input/images (org.osbuild.containers): target /var/osbuild/store/tmp/buildroot-tmp-rg130kxf/inputs/images                                                                                                                                            
Failed to open file "/sys/fs/selinux/checkreqprot": Read-only file system                                                 
ostree container image deploy --imgref=ostree-unverified-image:oci-archive:/tmp/tmpnya0xstb/image --stateroot=fedora-coreos --target-imgref=ostree-remote-registry:fedora:quay.io/fedora/fedora-coreos:stable --karg=rw --karg=$ignition_firstboot --
sysroot=/run/osbuild/tree                                    
error: Performing deployment: Importing: Unencapsulating base: fsync objdirs: fsync: No space left on device                                                                                                                                         
Traceback (most recent call last):                           
  File "/run/osbuild/bin/org.osbuild.ostree.deploy.container", line 140, in <module>                                      
    r = main(stage_args["tree"],                             
        ^^^^^^^^^^^^^^^^^^^^^^^^                             
  File "/run/osbuild/bin/org.osbuild.ostree.deploy.container", line 135, in main                                          
    ostree_container_deploy(tree, inputs, osname, target_imgref, kopts)                                                   
  File "/run/osbuild/bin/org.osbuild.ostree.deploy.container", line 109, in ostree_container_deploy                                                                                                                                                  
    ostree.cli("container", "image", "deploy",               
  File "/run/osbuild/lib/osbuild/util/ostree.py", line 202, in cli                                                        
    return subprocess.run(["ostree"] + args,                 
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                 
  File "/usr/lib64/python3.12/subprocess.py", line 571, in run                                                            
    raise CalledProcessError(retcode, process.args,                                                                       
subprocess.CalledProcessError: Command '['ostree', 'container', 'image', 'deploy', '--imgref=ostree-unverified-image:oci-archive:/tmp/tmpnya0xstb/image', '--stateroot=fedora-coreos', '--target-imgref=ostree-remote-registry:fedora:quay.io/fedora/
fedora-coreos:stable', '--karg=rw', '--karg=$ignition_firstboot', '--sysroot=/run/osbuild/tree']' returned non-zero exit status 1.
Traceback (most recent call last):                           
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/main_cli.py", line 171, in osbuild_cli                                                                                                                                                 
    r = manifest.build(                                      
        ^^^^^^^^^^^^^^^                                      
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/pipeline.py", line 468, in build                            
    res = pl.run(store, monitor, libdir, stage_timeout)                                                                   
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^                                                                   
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/pipeline.py", line 372, in run                              
    results = self.build_stages(store,                       
              ^^^^^^^^^^^^^^^^^^^^^^^^                       
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/pipeline.py", line 345, in build_stages                                                                                                                                                
    r = stage.run(tree,                                      
        ^^^^^^^^^^^^^^^                                      
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/pipeline.py", line 146, in run                              
    with contextlib.ExitStack() as cm:                       
  File "/usr/lib64/python3.12/contextlib.py", line 610, in __exit__                                                       
    raise exc_details[1]                                     
  File "/usr/lib64/python3.12/contextlib.py", line 595, in __exit__                                                       
    if cb(*exc_details):                                     
       ^^^^^^^^^^^^^^^^                                      
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/buildroot.py", line 166, in __exit__                                                                                                                                                   
    self._exitstack.close()                                  
  File "/usr/lib64/python3.12/contextlib.py", line 618, in close                                                          
    self.__exit__(None, None, None)                          
  File "/usr/lib64/python3.12/contextlib.py", line 610, in __exit__                                                       
    raise exc_details[1]                                     
  File "/usr/lib64/python3.12/contextlib.py", line 595, in __exit__                                                       
    if cb(*exc_details):                                     
       ^^^^^^^^^^^^^^^^                                      
  File "/usr/lib64/python3.12/tempfile.py", line 946, in __exit__                                                         
    self.cleanup()                                           
  File "/usr/lib64/python3.12/tempfile.py", line 950, in cleanup                                                          
    self._rmtree(self.name, ignore_errors=self._ignore_cleanup_errors)                                                    
  File "/usr/lib64/python3.12/tempfile.py", line 930, in _rmtree                                                          
    _shutil.rmtree(name, onexc=onexc)
  File "/usr/lib64/python3.12/shutil.py", line 773, in rmtree                                                             
    _rmtree_safe_fd(fd, path, onexc)                         
  File "/usr/lib64/python3.12/shutil.py", line 685, in _rmtree_safe_fd                                                    
    onexc(os.rmdir, fullname, err)                           
  File "/usr/lib64/python3.12/shutil.py", line 683, in _rmtree_safe_fd                                                    
    os.rmdir(entry.name, dir_fd=topfd)                       
OSError: [Errno 30] Read-only file system: 'usr'             
Exception ignored in: <generator object FsCache.stage at 0x7f25ca0e3440>                                                  
Traceback (most recent call last):                           
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/util/fscache.py", line 862, in stage                                                                                                                                                   
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/util/fscache.py", line 797, in _rm_r_object                                                                                                                                            
OSError: [Errno 30] Read-only file system: '/var/osbuild/store/stage/uuid-1b4fe14c213c422e96b7108833c5c3ac/object.info'                                                                                                                              
Exception ignored in: <generator object FsCache.stage at 0x7f25ca0e2540>                                                  
Traceback (most recent call last):                           
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/util/fscache.py", line 862, in stage                                                                                                                                                   
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/util/fscache.py", line 797, in _rm_r_object                                                                                                                                            
OSError: [Errno 30] Read-only file system: '/var/osbuild/store/stage/uuid-e2f89ac5f79e4130a6ed7d26545205d3/object.info'                                                                                                                              
[vagrant@vanilla-f39 osbuild]$                               
[vagrant@vanilla-f39 osbuild]$ sudo python3 -m osbuild --libdir .  --store /var/osbuild/store/ --output-directory /var/osbuild/out/ --checkpoint=* --cache-max-size 20GiB --export qemu test/data/manifests/fedora-coreos-container.json
Traceback (most recent call last):                           
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/util/fscache.py", line 479, in _atomic_file                                                                                                                                            
    with self._atomic_open(                                  
  File "/usr/lib64/python3.12/contextlib.py", line 137, in __enter__                                                      
    return next(self.gen)                                    
           ^^^^^^^^^^^^^^                                    
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/util/fscache.py", line 378, in _atomic_open                                                                                                                                            
    fd = os.open(path, flags, 0o644)                         
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^                         
OSError: [Errno 30] Read-only file system: '/var/osbuild/store/objects/uuid-014c9623317f48b19b15e6fb3da1328f'                                                                                                                                        

During handling of the above exception, another exception occurred:                                                       

Traceback (most recent call last):                           
  File "<frozen runpy>", line 198, in _run_module_as_main                                                                 
  File "<frozen runpy>", line 88, in _run_code                                                                            
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/__main__.py", line 12, in <module>                                                                                                                                                     
    r = main()                                               
        ^^^^^^                                               
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/main_cli.py", line 161, in osbuild_cli                                                                                                                                                 
    with ObjectStore(args.store) as object_store:                                                                         
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/objectstore.py", line 453, in __enter__                                                                                                                                                
    self._stack.enter_context(self.cache)                                                                                 
  File "/usr/lib64/python3.12/contextlib.py", line 526, in enter_context                                                  
    result = _enter(cm)                                      
             ^^^^^^^^^^                                      
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/util/fscache.py", line 696, in __enter__                                                                                                                                               
    self._create_scaffolding()                               
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/util/fscache.py", line 623, in _create_scaffolding                                                                                                                                     
    with self._atomic_file(self._filename_cache_tag, self._dirname_objects, ignore_exist=True) as f:                                                                                                                                                 
  File "/usr/lib64/python3.12/contextlib.py", line 137, in __enter__                                                      
    return next(self.gen)                                    
           ^^^^^^^^^^^^^^                                    
  File "/var/b/shared/code/github.com/osbuild/osbuild/osbuild/util/fscache.py", line 523, in _atomic_file                                                                                                                                            
    os.unlink(self._path(rpath_tmp))                         
OSError: [Errno 30] Read-only file system: '/var/osbuild/store/objects/uuid-014c9623317f48b19b15e6fb3da1328f'
```




2. What is the Version-Release number of the kernel:

[vagrant@vanilla-f39 osbuild]$ uname -r
6.6.8-200.fc39.x86_64
[vagrant@vanilla-f39 osbuild]$ rpm -q btrfs-progs 
btrfs-progs-6.6.2-1.fc39.x86_64


Note that this is the Fedora Cloud Vagrant image, so it was probably created with the btrfs-progs version that shipped on Fedora 39 release day.

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

Not sure. I don't think I've seen this before, and I've been building with OSBuild for a while, but that was on F38.


4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:


Not sure. I'll let you know if I'm able to reproduce it reliably. This is the first time I hit it.


5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

Not sure.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.


Reproducible: Always

Comment 1 Dusty Mabe 2024-01-04 21:43:18 UTC
Created attachment 2007271 [details]
dmesg

Comment 2 Josef Bacik 2024-01-04 21:46:57 UTC
Well that's odd, seems like we're not reclaiming those empty data block groups to allow for metadata allocations to use them.  We'll dig into this and see what's going on.

Comment 3 Boris Burkov 2024-01-04 22:56:55 UTC
To debug the question of block group reclaim Josef mentioned above, can you run and share the output from:
cat /sys/fs/btrfs/$(findmnt -n -o UUID /)/allocation/data/bg_reclaim_threshold

That will tell us whether automatic reclaim is enabled and, if so, how the threshold compares to your (rather empty) data block group allocation.

Another interesting thing would be to get at the block groups on disk and see how the extents are distributed with a bit more detail. That's a little more complicated, but can be done. Options for that are:

easiest:
btrfs inspect-internal dump-tree $(findmnt -n -o SOURCE /) | grep -A1 'BLOCK_GROUP_ITEM' | grep -B1 'flags DATA'

easy, more detail possible but maybe too big of a file:
btrfs inspect-internal dump-tree $(findmnt -n -o SOURCE /)

tricky to run, but more detail still:
use
https://github.com/boryas/scripts/blob/main/frag/bg-dump.btrd
https://github.com/boryas/scripts/tree/main/rust/btrfs-frag-view

instructions for that:
install btrd (https://github.com/danobi/btrd)
modify my bg-dump.btrd to start with: filesystem "/";
# dump bg data with btrd
btrd bg-dump.btrd > bg-dump.txt
# cd to btrfs-frag-view
# build the rust program and run it
cargo run bg-dump.txt -r

Comment 4 Boris Burkov 2024-01-04 23:03:50 UTC
Also, since this is a vm that seems to be for some building purpose (sorry if I misinterpreted that from the context), I am assuming you aren't desperately trying to recover it. If you are, I can try to help with that as well. The basic pattern is to re-mount the fs rw, and try to run a 'btrfs balance start -dusage=X /' for increasing X, probably starting with 0 or 30 in your case.
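
A minimal sketch of that recovery pattern, assuming the filesystem is mounted at / and the remount succeeds; the usage steps and the follow-up check are illustrative:

```
# Remount read-write after the forced read-only, then relocate data block
# groups that are at most X% full, starting gently and increasing X.
sudo mount -o remount,rw /
for usage in 0 10 30; do
    sudo btrfs balance start -dusage="$usage" /
    # See whether enough unallocated space has been freed for metadata.
    sudo btrfs filesystem usage / | grep 'Device unallocated:'
done
```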

Comment 5 Dusty Mabe 2024-01-05 00:44:09 UTC
(In reply to Boris Burkov from comment #3)
> To debug the question of block group reclaim Josef mentioned above, can you
> run and share the output from:
> cat /sys/fs/btrfs/$(findmnt -n -o UUID /)/allocation/data/bg_reclaim_threshold

```
[vagrant@vanilla-f39 ~]$ cat /sys/fs/btrfs/$(findmnt -n -o UUID /)/allocation/data/bg_reclaim_threshold
0
```

> 
> That will tell us whether automatic reclaim is enabled and, if so, how the
> threshold compares to your (rather empty) data block group allocation.
> 
> Another interesting thing would be to get at the block groups on disk and
> see how the extents are distributed with a bit more detail. That's a little
> more complicated, but can be done. Options for that are:
> 
> easiest:
> btrfs inspect-internal dump-tree $(findmnt -n -o SOURCE /) | grep -A1
> 'BLOCK_GROUP_ITEM' | grep -B1 'flags DATA'

```
[vagrant@vanilla-f39 ~]$ sudo btrfs inspect-internal dump-tree /dev/vda5 | grep -A1 'BLOCK_GROUP_ITEM' | grep -B1 'flags DATA'
        item 1 key (13631488 BLOCK_GROUP_ITEM 8388608) itemoff 16206 itemsize 24
                block group used 8388608 chunk_objectid 256 flags DATA|single
--
        item 12 key (298844160 BLOCK_GROUP_ITEM 1073741824) itemoff 15843 itemsize 24
                block group used 1069670400 chunk_objectid 256 flags DATA|single
--
        item 47 key (1372585984 BLOCK_GROUP_ITEM 1073741824) itemoff 11854 itemsize 24
                block group used 1062731776 chunk_objectid 256 flags DATA|single
--
        item 86 key (2446327808 BLOCK_GROUP_ITEM 1073741824) itemoff 9526 itemsize 24
                block group used 1073741824 chunk_objectid 256 flags DATA|single
--
        item 0 key (3520069632 BLOCK_GROUP_ITEM 1073741824) itemoff 16259 itemsize 24
                block group used 1066274816 chunk_objectid 256 flags DATA|single
--
        item 35 key (4593811456 BLOCK_GROUP_ITEM 1073741824) itemoff 10750 itemsize 24
                block group used 801816576 chunk_objectid 256 flags DATA|single
--
        item 85 key (5667553280 BLOCK_GROUP_ITEM 1073741824) itemoff 11754 itemsize 24
                block group used 1073590272 chunk_objectid 256 flags DATA|single
--
        item 104 key (6741295104 BLOCK_GROUP_ITEM 1073741824) itemoff 10631 itemsize 24
                block group used 1073590272 chunk_objectid 256 flags DATA|single
--
        item 127 key (7815036928 BLOCK_GROUP_ITEM 1073741824) itemoff 8397 itemsize 24
                block group used 1067851776 chunk_objectid 256 flags DATA|single
--
        item 155 key (8888778752 BLOCK_GROUP_ITEM 1073741824) itemoff 6710 itemsize 24
                block group used 352387072 chunk_objectid 256 flags DATA|single
--
        item 72 key (9962520576 BLOCK_GROUP_ITEM 1073741824) itemoff 12385 itemsize 24
                block group used 982589440 chunk_objectid 256 flags DATA|single
--
        item 96 key (11036262400 BLOCK_GROUP_ITEM 1073741824) itemoff 11084 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 97 key (12110004224 BLOCK_GROUP_ITEM 1073741824) itemoff 11060 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 98 key (13183746048 BLOCK_GROUP_ITEM 1073741824) itemoff 11036 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 99 key (14257487872 BLOCK_GROUP_ITEM 1073741824) itemoff 11012 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 100 key (15331229696 BLOCK_GROUP_ITEM 1073741824) itemoff 10988 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 101 key (16404971520 BLOCK_GROUP_ITEM 1073741824) itemoff 10964 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 102 key (17478713344 BLOCK_GROUP_ITEM 1073741824) itemoff 10940 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 103 key (18552455168 BLOCK_GROUP_ITEM 1073741824) itemoff 10916 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 104 key (19626196992 BLOCK_GROUP_ITEM 1073741824) itemoff 10892 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 105 key (20699938816 BLOCK_GROUP_ITEM 1073741824) itemoff 10868 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 106 key (21773680640 BLOCK_GROUP_ITEM 1073741824) itemoff 10844 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 107 key (22847422464 BLOCK_GROUP_ITEM 1073741824) itemoff 10820 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 108 key (23921164288 BLOCK_GROUP_ITEM 1073741824) itemoff 10796 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 109 key (24994906112 BLOCK_GROUP_ITEM 1073741824) itemoff 10772 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 110 key (26068647936 BLOCK_GROUP_ITEM 1073741824) itemoff 10748 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 111 key (27142389760 BLOCK_GROUP_ITEM 1073741824) itemoff 10724 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 112 key (28216131584 BLOCK_GROUP_ITEM 1073741824) itemoff 10700 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 113 key (29289873408 BLOCK_GROUP_ITEM 1073741824) itemoff 10676 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 114 key (30363615232 BLOCK_GROUP_ITEM 1073741824) itemoff 10652 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 115 key (31437357056 BLOCK_GROUP_ITEM 1073741824) itemoff 10628 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 116 key (32511098880 BLOCK_GROUP_ITEM 1073741824) itemoff 10604 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 117 key (33584840704 BLOCK_GROUP_ITEM 1073741824) itemoff 10580 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 118 key (34658582528 BLOCK_GROUP_ITEM 1073741824) itemoff 10556 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 119 key (35732324352 BLOCK_GROUP_ITEM 1073741824) itemoff 10532 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 120 key (36806066176 BLOCK_GROUP_ITEM 1073741824) itemoff 10508 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 121 key (37879808000 BLOCK_GROUP_ITEM 1073741824) itemoff 10484 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 122 key (38953549824 BLOCK_GROUP_ITEM 1073741824) itemoff 10460 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 123 key (40027291648 BLOCK_GROUP_ITEM 1073741824) itemoff 10436 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 124 key (41101033472 BLOCK_GROUP_ITEM 411041792) itemoff 10412 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single
        item 125 key (41512075264 BLOCK_GROUP_ITEM 12582912) itemoff 10388 itemsize 24
                block group used 0 chunk_objectid 256 flags DATA|single

```

> 
> easy, more detail possible but maybe too big of a file:
> btrfs inspect-internal dump-tree $(findmnt -n -o SOURCE /)

I can try to run this if more detail is needed.

Comment 6 Dusty Mabe 2024-01-05 00:45:14 UTC
(In reply to Boris Burkov from comment #4)
> Also, since this is a vm that seems to be for some building purpose (sorry
> if I misinterpreted that from the context), I am assuming you aren't
> desperately trying to recover it. If you are, I can try to help with that as
> well. The basic pattern is to re-mount the fs rw, and try to run a 'btrfs
> balance start -dusage=X /' for increasing X, probably starting with 0 or 30
> in your case.

I'm leaving this VM in the current state for the time being so we can debug it live. I don't know if the r/o filesystem will eventually cause enough problems that it won't keep running but I'll leave it up until it does.

Comment 7 Boris Burkov 2024-01-05 19:41:13 UTC
Some observations:
1. You do not have automatic reclaim enabled. Echoing 30 into that file (see the sketch after this list) will help with this sort of thing in general. We have a planned follow-up to make that the default (while hopefully communicating the change well!).
2. All the extra block groups have used==0, so they should be getting cleaned up by the empty block group cleanup.
3. You have async discard on (good), which does complicate empty block group cleanup, since block groups are discarded before they are cleaned up.
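
A minimal sketch of that tuning step, assuming the filesystem is mounted at / and using the sysfs path from comment 3; 30 is the threshold value suggested above:

```
# Enable automatic reclaim of data block groups that drop below 30% usage.
echo 30 | sudo tee "/sys/fs/btrfs/$(findmnt -n -o UUID /)/allocation/data/bg_reclaim_threshold"
```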

So I think that either async discard is somehow stuck or failing to finish these block groups all the way to unused cleanup, or something is going wrong with the cleaner thread.

I am going to write some drgn scripts we can run to look at some of the relevant structures, but while I do that, it would also be helpful to see if there are any threads stuck out of async discard or the cleaner. Do you have a good script handy to dump tasks in D state or should I dig one up for that as well?

Either way, I'll have some more data collection ready soon. Thanks for the report and quick follow up, btw!

Comment 8 Boris Burkov 2024-01-05 20:17:07 UTC
Created attachment 2007407 [details]
drgn script to dump unused and discard bgs

Comment 9 Boris Burkov 2024-01-05 20:20:34 UTC
I've uploaded a drgn script that should dump unused block groups and the three async discard lists.

you can run it by giving it executable permissions and letting the shebang do the work. You will need to install drgn which should be fine with 'dnf install drgn'. It automatically looks in the usual places for debug symbols if the kernel is built with them, but also has options to point it at debug symbol files if necessary. Some more instructions about debug symbols are in the drgn docs here: https://drgn.readthedocs.io/en/latest/getting_debugging_symbols.html

The output of that should point us to why the block groups aren't getting cleaned up, hopefully.

Comment 10 Dusty Mabe 2024-01-05 20:35:04 UTC
In order to install anything I would need to remount my filesystem back read/write, which could affect the ability to debug this further. Is that something you'd want me to do?

Comment 11 Boris Burkov 2024-01-05 20:56:33 UTC
Can you install it into a tmpfs?

Comment 12 Boris Burkov 2024-01-05 21:37:44 UTC
/sys/fs/btrfs/<uuid>/discard/discardable_extents would be interesting as well.

Comment 13 Omar Sandoval 2024-01-05 21:42:51 UTC
You might be able to install drgn in a venv in `/dev/shm`. Something like:

cd /dev/shm
python3 -m venv tmpvenv
source tmpvenv/bin/activate
pip install drgn

The next thing you'll need to do is get kernel debugging symbols. Again, we can probably cheese these onto /dev/shm. If you happen to have elfutils-debuginfod-client and perf installed, you could do:

export DEBUGINFOD_CACHE_PATH=/dev/shm/tmpvenv/debuginfod
drgn --no-default-symbols $(DEBUGINFOD_VERBOSE=1 debuginfod-find debuginfo $(perf buildid-list -k))

If you don't have those installed, let me know and I can probably hack something together for you to do instead.

Comment 14 Dusty Mabe 2024-01-06 18:50:56 UTC
(In reply to Boris Burkov from comment #12)
> /sys/fs/btrfs/<uuid>/discard/discardable_extents would be interesting as
> well.

```
[root@vanilla-f39 ~]# cat /sys/fs/btrfs/$(findmnt -n -o UUID /)/discard/discardable_extents
45
```

Comment 15 Dusty Mabe 2024-01-07 20:15:42 UTC
(In reply to Boris Burkov from comment #9)
> I've uploaded a drgn script that should dump unused block groups and the
> three async discard lists.
> 
> you can run it by giving it executable permissions and letting the shebang
> do the work. You will need to install drgn which should be fine with 'dnf
> install drgn'. It automatically looks in the usual places for debug symbols
> if the kernel is built with them, but also has options to point it at debug
> symbol files if necessary. Some more instructions about debug symbols are in
> the drgn docs here:
> https://drgn.readthedocs.io/en/latest/getting_debugging_symbols.html
> 
> The output of that should point us to why the block groups aren't getting
> cleaned up, hopefully.


OK, I don't know if it's set up right (I had to do some stuff in a chroot). Does this output look right?


```
[root@vanilla-f39 ~]# chroot /dev/shm/root/
bash-5.2# ./drgn.sh                
UNUSED BG LIST


DISCARD QUEUES
DISCARD QUEUE 0
DISCARD QUEUE 1
start (u64)13631488 flags (u64)1 used (u64)8388608
start (u64)2446327808 flags (u64)1 used (u64)107374182
DISCARD QUEUE 2
start (u64)4593811456 flags (u64)1 used (u64)801816576
```

Comment 16 Dusty Mabe 2024-01-08 17:25:14 UTC
FTR, since I'm trying to continue with my work I've kept using osbuild, and I was able to reproduce this on a second VM:


```
[328562.181275] ------------[ cut here ]------------
[328562.182602] BTRFS: Transaction aborted (error -28)
[328562.184427] WARNING: CPU: 0 PID: 16977 at fs/btrfs/block-group.c:3351 btrfs_start_dirty_block_groups+0x569/0x5b0
[328562.186677] Modules linked in: xfs vfat fat intel_rapl_msr intel_rapl_common kvm_amd ccp kvm virtio_net net_failover failover irqbypass cirrus virtio_balloon i2c_piix4 joydev fuse loop zram crct10dif_pclmul crc32_pclmul crc32c_intel polyval_
[328562.190238] CPU: 0 PID: 16977 Comm: tokio-runtime-w Not tainted 6.5.6-300.fc39.x86_64 #1
[328562.191327] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-1.fc39 04/01/2014
[328562.192312] RIP: 0010:btrfs_start_dirty_block_groups+0x569/0x5b0
[328562.193079] Code: 4c 89 f7 e8 79 74 92 00 48 8b 34 24 48 8b 7c 24 08 e8 4b f6 f4 ff e9 fe fc ff ff 44 89 e6 48 c7 c7 28 5c 92 9b e8 a7 4b a6 ff <0f> 0b e9 10 fe ff ff be 01 00 00 00 e8 e6 3a 15 00 e9 36 ff ff ff
[328562.195120] RSP: 0018:ffffa69dc070fd10 EFLAGS: 00010286
[328562.195865] RAX: 0000000000000000 RBX: ffff974280437f70 RCX: 0000000000000000
[328562.196732] RDX: ffff9745a462e580 RSI: ffff9745a4621540 RDI: ffff9745a4621540
[328562.197603] RBP: ffff974286b30800 R08: 0000000000000000 R09: ffffa69dc070fba0
[328562.198453] R10: 0000000000000003 R11: ffffffff9c345d28 R12: 00000000ffffffe4
[328562.199326] R13: ffff974288f975b0 R14: ffff9744b4ea5130 R15: ffff974286b30800
[328562.200181] FS:  00007f866abb66c0(0000) GS:ffff9745a4600000(0000) knlGS:0000000000000000
[328562.201100] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[328562.201857] CR2: 00007f865420a007 CR3: 000000020c836000 CR4: 0000000000350ef0
[328562.202711] Call Trace:
[328562.203198]  <TASK>
[328562.203658]  ? btrfs_start_dirty_block_groups+0x569/0x5b0
[328562.204372]  ? __warn+0x81/0x130
[328562.205117]  ? btrfs_start_dirty_block_groups+0x569/0x5b0
[328562.205833]  ? report_bug+0x171/0x1a0
[328562.206524]  ? prb_read_valid+0x1b/0x30
[328562.207273]  ? srso_return_thunk+0x5/0x10
[328562.207889]  ? handle_bug+0x3c/0x80
[328562.208454]  ? exc_invalid_op+0x17/0x70
[328562.209025]  ? asm_exc_invalid_op+0x1a/0x20
[328562.209649]  ? btrfs_start_dirty_block_groups+0x569/0x5b0
[328562.210325]  ? srso_return_thunk+0x5/0x10
[328562.210899]  btrfs_commit_transaction+0xb5/0xf00
[328562.211549]  btrfs_sync_file+0x4b3/0x580
[328562.212123]  __x64_sys_fsync+0x3b/0x70
[328562.212703]  do_syscall_64+0x60/0x90
[328562.213234]  ? srso_return_thunk+0x5/0x10
[328562.213839]  ? syscall_exit_to_user_mode+0x2b/0x40
[328562.214471]  ? srso_return_thunk+0x5/0x10
[328562.215030]  ? do_syscall_64+0x6c/0x90
[328562.215551]  ? srso_return_thunk+0x5/0x10
[328562.216096]  ? do_syscall_64+0x6c/0x90
[328562.216629]  ? do_syscall_64+0x6c/0x90
[328562.217140]  ? do_syscall_64+0x6c/0x90
[328562.217673]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[328562.218280] RIP: 0033:0x7f866c2ab27c
[328562.219062] Code: 0f 05 48 3d 00 f0 ff ff 77 3c c3 0f 1f 00 55 48 89 e5 48 83 ec 10 89 7d fc e8 a0 f7 f7 ff 8b 7d fc 89 c2 b8 4a 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2c 89 d7 89 45 fc e8 02 f8 f7 ff 8b 45 fc c9
[328562.220928] RSP: 002b:00007f866abb4400 EFLAGS: 00000293 ORIG_RAX: 000000000000004a
[328562.221755] RAX: ffffffffffffffda RBX: 0000555c4d8d7b80 RCX: 00007f866c2ab27c
[328562.222525] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000015
[328562.223299] RBP: 00007f866abb4410 R08: 0000000000000000 R09: 0000000000000001
[328562.224069] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f866abb4548
[328562.224849] R13: 00007f86580945ab R14: 00007f8658000cf0 R15: 00007f866abb4470
[328562.225632]  </TASK>
[328562.226012] ---[ end trace 0000000000000000 ]---
[328562.226608] BTRFS info (device vda5: state A): dumping space info:
[328562.227436] BTRFS info (device vda5: state A): space_info DATA has 33510932480 free, is not full
[328562.228433] BTRFS info (device vda5: state A): space_info total=41234202624, used=7723266048, pinned=0, reserved=0, may_use=4096, readonly=0 zone_unusable=0
[328562.230059] BTRFS info (device vda5: state A): space_info METADATA has -6291456 free, is full
[328562.231117] BTRFS info (device vda5: state A): space_info total=268435456, used=219840512, pinned=47267840, reserved=1261568, may_use=6291456, readonly=65536 zone_unusable=0
[328562.233046] BTRFS info (device vda5: state A): space_info SYSTEM has 8372224 free, is not full
[328562.234117] BTRFS info (device vda5: state A): space_info total=8388608, used=16384, pinned=0, reserved=0, may_use=0, readonly=0 zone_unusable=0
[328562.235953] BTRFS info (device vda5: state A): global_block_rsv: size 5767168 reserved 5767168
[328562.237016] BTRFS info (device vda5: state A): trans_block_rsv: size 0 reserved 0
[328562.237871] BTRFS info (device vda5: state A): chunk_block_rsv: size 0 reserved 0
[328562.238668] BTRFS info (device vda5: state A): delayed_block_rsv: size 0 reserved 0
[328562.239475] BTRFS info (device vda5: state A): delayed_refs_rsv: size 524288 reserved 524288
[328562.240343] BTRFS: error (device vda5: state A) in btrfs_start_dirty_block_groups:3351: errno=-28 No space left
[328562.241343] BTRFS info (device vda5: state EA): forced readonly
[328562.242155] BTRFS warning (device vda5: state EA): Skipping commit of aborted transaction.
[328562.243967] BTRFS: error (device vda5: state EA) in cleanup_transaction:1997: errno=-28 No space left
```

Comment 17 Dusty Mabe 2024-01-08 17:26:58 UTC
(In reply to Dusty Mabe from comment #16)
> FTR, since I'm trying to continue with my work I've kept using osbuild, and I
> was able to reproduce this on a second VM:


It's worth noting here that in order to try to avoid the problem I did not update the kernel and so this trace is from 

```
[vagrant@vanilla-f38 osbuild]$ uname -r
6.5.6-300.fc39.x86_64
```

Comment 18 Boris Burkov 2024-01-08 20:50:11 UTC
Created attachment 2007843 [details]
updated drgn script v2

Comment 19 Boris Burkov 2024-01-08 20:58:36 UTC
Since the drgn script was able to hit some of the discard lists and find some of the non-empty block groups, it appears that your drgn install was successful, and async discard is functioning at some basic level (not the bug we hit from making it default + remounting into the option)

I updated the script to hunt for more details, so if you could run that on the first host again, that would be great. Also, could you share the output of:
cat /sys/block/vda5/queue/discard_max_bytes

If you think you are able to reliably reproduce this by hammering that same workload, then we should also talk about the most valuable tracepoints or printks we can add to see how we are losing track of the empty block groups.

Thanks!

Comment 20 Dusty Mabe 2024-01-09 03:23:05 UTC
Created attachment 2007863 [details]
drgnv2 output

Comment 21 Dusty Mabe 2024-01-09 03:29:29 UTC
(In reply to Boris Burkov from comment #19)
> Since the drgn script was able to hit some of the discard lists and find
> some of the non-empty block groups, it appears that your drgn install was
> successful, and async discard is functioning at some basic level (not the
> bug we hit from making it default + remounting into the option)
> 
> I updated the script to hunt for more details, so if you could run that on
> the first host again, that would be great. Also, could you share the output
> of:
> cat /sys/block/vda5/queue/discard_max_bytes

Just uploaded an attachment with the output of the drgnv2 script.

Also `/sys/block/vda5` doesn't exist. `/sys/block/vda` does? Did you mean a different path? 


> 
> If you think you are able to reliably reproduce this by hammering that same
> workload, then we should also talk about the most valuable tracepoints or
> printks we can add to see how we are losing track of the empty block groups.

I have reproduced it twice, but I'm not really comfortable saying it's reliably
reproducible and I don't really have a lot of extra time to try unfortunately.

Comment 22 Boris Burkov 2024-01-09 21:52:45 UTC
hmm, weird about vda5, I thought that was the name of the device in your dmesg. /sys/block/<dev>/queue/discard_max_bytes, where <dev> is the device the btrfs filesystem is on.

At any rate, it appears pretty clearly that something went wrong in async discard. Either the freed extents did not get queued for discard, or they did but somehow got removed without getting fully trimmed. I am trying to spot how this could happen or think of some ways to harden it, but at this point I don't think there is much more we can glean from the state of your system. If you have the time to run the drgn v2 script one more time, with the modification of changing:

print(f"untrimmed extent! {free_extent.offset} {free_extent.bytes}")
to
print(f"untrimmed extent! {free_extent.offset} {free_extent.bytes} {free_extent.trim_state}")

I do think that would still be interesting.

Comment 23 Boris Burkov 2024-01-09 22:47:19 UTC
In fact if you do get a chance to rerun the drgn script, just replace that print with
print(f"untrimmed extent {free_extent}")
I think the full state of the messed up free extent could be a clue.

Comment 24 Boris Burkov 2024-01-09 23:32:46 UTC
One more question: Is there any chance your vm is low on memory at some point during these issues? I have identified one or two paths where a kmalloc failing might cause us to potentially miss out on some discard, but it's hard to be confident that is actually causing your problem.

Comment 25 Dusty Mabe 2024-01-10 22:09:08 UTC
(In reply to Boris Burkov from comment #22)
> hmm, weird about vda5, I thought that was the name of the device in your
> dmesg. /sys/block/<dev>/queue/discard_max_bytes, where <dev> is the device the
> btrfs filesystem is on.

Maybe this is what you're looking for?

```
[root@vanilla-f39 ~]# ls /sys/block/
loop0  vda  zram0
[root@vanilla-f39 ~]# ls /sys/block/vda/queue/
add_random           discard_max_hw_bytes  io_poll               max_hw_sectors_kb       minimum_io_size  physical_block_size  stable_writes         write_zeroes_max_bytes
chunk_sectors        discard_zeroes_data   io_poll_delay         max_integrity_segments  nomerges         read_ahead_kb        virt_boundary_mask    zone_append_max_bytes
dax                  dma_alignment         iostats               max_sectors_kb          nr_requests      rotational           wbt_lat_usec          zoned
discard_granularity  fua                   logical_block_size    max_segments            nr_zones         rq_affinity          write_cache           zone_write_granularity
discard_max_bytes    hw_sector_size        max_discard_segments  max_segment_size        optimal_io_size  scheduler            write_same_max_bytes
[root@vanilla-f39 ~]# 
[root@vanilla-f39 ~]# cat /sys/block/vda/queue/discard_max_bytes 
2147483136
[root@vanilla-f39 ~]# cat /sys/block/vda/queue/max_discard_segments 
1
```

Comment 26 Dusty Mabe 2024-01-10 22:11:15 UTC
(In reply to Boris Burkov from comment #23)
> In fact if you do get a chance to rerun the drgn script, just replace that
> print with
> print(f"untrimmed extent {free_extent}")
> I think the full state of the messed up free extent could be a clue.

Will do a new upload.

Comment 27 Dusty Mabe 2024-01-10 22:13:10 UTC
Created attachment 2008159 [details]
drgn-out-2.txt

Comment 28 Dusty Mabe 2024-01-10 22:14:06 UTC
(In reply to Boris Burkov from comment #24)
> One more question: Is there any chance your vm is low on memory at some
> point during these issues? I have identified one or two paths where a
> kmalloc failing might cause us to potentially miss out on some discard, but
> it's hard to be confident that is actually causing your problem.

The machine has 16G of memory. It's only really doing osbuild related things and only one at a time. I can't imagine that it got low on memory, but I'm not able to say for sure.

Comment 29 Dusty Mabe 2024-01-16 14:48:57 UTC
Any other info you want from this VM? I'll probably take it down soon to free up memory/disk on my system.

I'll also chime in here if I reproduce the issue again (beyond the two times I have already).

Comment 30 Chris Murphy 2024-01-31 18:28:47 UTC
>chris@fovo:~$ uname -r
>6.7.2-200.fc39.x86_64
>chris@fovo:~$ cat /sys/fs/btrfs/8ca10d3c-b425-485b-bd0f-97802a9c7fc4/bg_reclaim_threshold 
>75

OK so it seems like it's enabled by default starting with kernel 6.7 now? I'm wondering if this problem is reproducible with a 6.7 series kernel, keeping everything else (layout and workload) the same?

Comment 31 Chris Murphy 2024-01-31 18:31:16 UTC
OH sorry, this is less about lack of bg reclaim and more about discards not happening, so the underlying storage really does fill up.

Comment 32 Pim Vermeer 2024-02-09 17:37:43 UTC
I ran into this issue after updating to kernel 6.7.3.

There is severe overallocation of btrfs metadata space. My btrfs metadata space allocation went up by tens of GBs in a matter of minutes. A full balance made it worse. Booting the previous kernel (6.6.14) and running a metadata balance fixed it.
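
A minimal sketch of that metadata balance, assuming the filesystem is mounted at /; the usage filter value is illustrative:

```
# Compact over-allocated metadata block groups only; data is left untouched.
sudo btrfs balance start -musage=50 /
```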

This is probably caused by metadata space being reserved before transactions; these reservations are not released when they go unused.

I found the issue in a regression tracker:

https://linux-regtracking.leemhuis.info/regzbot/regression/lore/9cdbf0ca9cdda1b4c84e15e548af7d7f9f926382.camel@intelfx.name

Comment 33 Pim Vermeer 2024-02-09 20:09:50 UTC
I did a test on my other system, which has a lot of free space, so here's some extra info.

I balanced this filesystem with a full balance, so a lot of data got moved around.

Before balance:

```
[mediac@mediacenter ~]$ sudo btrfs fi usage /
Overall:
    Device size:		 231.30GiB
    Device allocated:		  82.07GiB
    Device unallocated:		 149.23GiB
    Device missing:		     0.00B
    Device slack:		     0.00B
    Used:			  61.29GiB
    Free (estimated):		 166.17GiB	(min: 91.56GiB)
    Free (statfs, df):		 166.17GiB
    Data ratio:			      1.00
    Metadata ratio:		      2.00
    Global reserve:		 162.64MiB	(used: 0.00B)
    Multiple profiles:		        no

Data,single: Size:74.01GiB, Used:57.07GiB (77.11%)
   /dev/nvme0n1p3	  74.01GiB

Metadata,DUP: Size:4.00GiB, Used:2.11GiB (52.78%)
   /dev/nvme0n1p3	   8.00GiB

System,DUP: Size:32.00MiB, Used:16.00KiB (0.05%)
   /dev/nvme0n1p3	  64.00MiB

Unallocated:
   /dev/nvme0n1p3	 149.23GiB

```

After balance:

```
[mediac@mediacenter ~]$ sudo btrfs fi usage /
Overall:
    Device size:		 231.30GiB
    Device allocated:		 228.29GiB
    Device unallocated:		   3.01GiB
    Device missing:		     0.00B
    Device slack:		     0.00B
    Used:			  60.84GiB
    Free (estimated):		   4.33GiB	(min: 2.83GiB)
    Free (statfs, df):		   4.33GiB
    Data ratio:			      1.00
    Metadata ratio:		      2.00
    Global reserve:		 157.67MiB	(used: 0.00B)
    Multiple profiles:		        no

Data,single: Size:58.00GiB, Used:56.68GiB (97.72%)
   /dev/disk/by-uuid/c91a38c9-5ce7-4249-a7f9-49ba01ed935c	  58.00GiB

Metadata,DUP: Size:85.11GiB, Used:2.08GiB (2.44%)
   /dev/disk/by-uuid/c91a38c9-5ce7-4249-a7f9-49ba01ed935c	 170.23GiB

System,DUP: Size:32.00MiB, Used:48.00KiB (0.15%)
   /dev/disk/by-uuid/c91a38c9-5ce7-4249-a7f9-49ba01ed935c	  64.00MiB

Unallocated:
   /dev/disk/by-uuid/c91a38c9-5ce7-4249-a7f9-49ba01ed935c	   3.01GiB
```

Comment 34 Chris Murphy 2024-02-13 19:10:35 UTC
Re: comments 32 and 33, please see

https://lore.kernel.org/linux-btrfs/CAL3q7H7W+QQE8bc9VqgXZe071NxvESBtq+D0S3+XUZdmQ5NKnA@mail.gmail.com/T/#t

It should be fixed soon.

Comment 35 Dan Čermák 2024-02-15 07:31:56 UTC
Fixes have been posted: https://lore.kernel.org/linux-btrfs/cover.1707900530.git.dsterba@suse.com/T/#u

Can you please cherry-pick & send them to F39 & F38? I got bitten by this kernel bug two days ago and it filled up my / completely.

Comment 36 Fedora Update System 2024-02-17 19:36:28 UTC
FEDORA-2024-88847bc77a (kernel-6.7.5-200.fc39) has been submitted as an update to Fedora 39.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-88847bc77a

Comment 37 Fedora Update System 2024-02-17 19:36:30 UTC
FEDORA-2024-987089eca2 (kernel-6.7.5-100.fc38) has been submitted as an update to Fedora 38.
https://bodhi.fedoraproject.org/updates/FEDORA-2024-987089eca2

Comment 38 Pim Vermeer 2024-02-17 19:59:03 UTC
Thanks!

Comment 39 Fedora Update System 2024-02-18 01:49:18 UTC
FEDORA-2024-88847bc77a has been pushed to the Fedora 39 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-88847bc77a`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-88847bc77a

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 40 Fedora Update System 2024-02-18 02:32:55 UTC
FEDORA-2024-987089eca2 has been pushed to the Fedora 38 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2024-987089eca2`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2024-987089eca2

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 41 Fedora Update System 2024-02-22 02:18:52 UTC
FEDORA-2024-88847bc77a (kernel-6.7.5-200.fc39) has been pushed to the Fedora 39 stable repository.
If the problem still persists, please make note of it in this bug report.

Comment 42 Fedora Update System 2024-02-22 02:23:03 UTC
FEDORA-2024-987089eca2 (kernel-6.7.5-100.fc38) has been pushed to the Fedora 38 stable repository.
If the problem still persists, please make note of it in this bug report.

