Bug 2275252 - kernel crashes: BUG: kernel NULL pointer dereference, address: 0000000000000600 in zswap_shrinker_count
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: rawhide
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks: TRACKER-bugs-affecting-libguestfs
Reported: 2024-04-16 09:21 UTC by Mark W
Modified: 2024-08-05 07:51 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type: Bug
Embargoed:


Attachments
virt-resize executed with LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 log (147.56 KB, text/plain)
2024-04-16 09:21 UTC, Mark W

Description Mark W 2024-04-16 09:21:04 UTC
Created attachment 2027179
virt-resize executed with LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 log

Description of problem:
When resizing a `qcow2` volume, `virt-resize` fails.

Version-Release number of selected component (if applicable):
guestfish 1.52.0

How reproducible:
Consistently fails on any qcow2 image file.

Steps to Reproduce:
1. Download the Arch Linux basic image from
>https://geo.mirror.pkgbuild.com/images/v20240415.229275/Arch-Linux-x86_64-basic.qcow2
and save it as beforeresize.qcow2
2. Execute: 
>qemu-img create -f qcow2 -o preallocation=metadata,compression_type=zstd newdisk.qcow2 100G
3. Execute: 
>virt-resize --expand /dev/sda3 beforeresize.qcow2 newdisk.qcow2
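
For convenience, the three steps above could be scripted roughly as follows. This is only a sketch: using `curl` for the download is an assumption (the report does not say which tool was used), and `repro_commands` is a hypothetical helper name. By default the script only prints the commands; pass `--run` to execute them.

```python
import shlex
import subprocess
import sys

# URL taken verbatim from step 1 of the report.
IMAGE_URL = (
    "https://geo.mirror.pkgbuild.com/images/"
    "v20240415.229275/Arch-Linux-x86_64-basic.qcow2"
)


def repro_commands(src="beforeresize.qcow2", dst="newdisk.qcow2", size="100G"):
    """Return the reproduction steps as argument lists (hypothetical helper)."""
    return [
        # Step 1: download the image (curl is an assumed choice of tool).
        ["curl", "-L", "-o", src, IMAGE_URL],
        # Step 2: create the larger target disk.
        ["qemu-img", "create", "-f", "qcow2",
         "-o", "preallocation=metadata,compression_type=zstd", dst, size],
        # Step 3: copy and expand; this is the step that fails in the report.
        ["virt-resize", "--expand", "/dev/sda3", src, dst],
    ]


def main(argv):
    run = "--run" in argv
    for cmd in repro_commands():
        print(shlex.join(cmd))
        if run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    main(sys.argv[1:])
```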

Actual results:
virt-resize: error: libguestfs error: appliance closed the connection 
unexpectedly.
This usually means the libguestfs appliance crashed.
...


Expected results:
newdisk.qcow2 should be a copy of beforeresize.qcow2, but with /dev/sda3 increased to 100 GB.

Additional info:
Happens with other unrelated images too. 
Never used to happen. This was introduced in the past week or two by an Arch Linux pacman -Syu upgrade.

OS: Arch Linux x86_64 
Kernel: 6.8.5-arch1-1 
CPU: Intel i5-6600K

Comment 1 Richard W.M. Jones 2024-04-16 09:38:23 UTC
[   18.533479] BUG: kernel NULL pointer dereference, address: 0000000000000600
[   18.534302] #PF: supervisor read access in kernel mode
[   18.534862] #PF: error_code(0x0000) - not-present page
[   18.535429] PGD 0 P4D 0 
[   18.535715] Oops: 0000 [#1] PREEMPT SMP PTI
[   18.536175] CPU: 0 PID: 43 Comm: kswapd0 Not tainted 6.8.5-arch1-1 #1 5f12b795066ab8d27a5fe9971245067df4fb99ed
[   18.537241] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   18.538220] RIP: 0010:memcg_page_state+0x9/0x30
[   18.538721] Code: c3 cc cc cc cc eb f9 e9 05 b8 ff ff 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48
[   18.540694] RSP: 0018:ffff9c4840163af0 EFLAGS: 00010246
[   18.541256] RAX: 00000000fffff33f RBX: ffff9c4840163bc0 RCX: 0000000000000002
[   18.542026] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000
[   18.542791] RBP: 0000000000000000 R08: ffff8d64c314e000 R09: 0000000000000000
[   18.543554] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8d650ffdb780
[   18.544317] R13: ffff8d64c1c28400 R14: 0000000000000000 R15: ffff8d64c31e1d80
[   18.545083] FS:  0000000000000000(0000) GS:ffff8d650de00000(0000) knlGS:0000000000000000
[   18.545948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.546570] CR2: 0000000000000600 CR3: 00000000040f6003 CR4: 0000000000370ef0
[   18.547338] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   18.548102] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   18.548873] Call Trace:
[   18.549147]  <TASK>
[   18.549389]  ? __die+0x23/0x70
[   18.549732]  ? page_fault_oops+0x171/0x4e0
[   18.550178]  ? free_unref_page_list+0x2f4/0x400
[   18.550677]  ? exc_page_fault+0x7f/0x180
[   18.551106]  ? asm_exc_page_fault+0x26/0x30
[   18.551564]  ? memcg_page_state+0x9/0x30
[   18.552001]  zswap_shrinker_count+0xb4/0x120
[   18.552470]  do_shrink_slab+0x37/0x360
[   18.552885]  shrink_slab+0xc7/0x3c0
[   18.553274]  ? try_to_shrink_lruvec+0x1bf/0x290
[   18.553772]  shrink_one+0x123/0x1b0
[   18.554163]  shrink_node+0xa7f/0xbc0
[   18.554557]  ? psi_group_change+0x213/0x3c0
[   18.555014]  balance_pgdat+0x523/0x960
[   18.555442]  ? psi_task_switch+0xd6/0x230
[   18.555886]  ? __switch_to_asm+0x3e/0x70
[   18.556320]  ? finish_task_switch.isra.0+0x94/0x2f0
[   18.556852]  kswapd+0x20d/0x400
[   18.557203]  ? __pfx_autoremove_wake_function+0x10/0x10
[   18.557774]  ? __pfx_kswapd+0x10/0x10
[   18.558174]  kthread+0xe5/0x120
[   18.558525]  ? __pfx_kthread+0x10/0x10
[   18.558943]  ret_from_fork+0x31/0x50
[   18.559337]  ? __pfx_kthread+0x10/0x10
[   18.559752]  ret_from_fork_asm+0x1b/0x30
[   18.560183]  </TASK>
[   18.560429] Modules linked in: vfat fat dm_mod btrfs blake2b_generic xor raid6_pq virtio_snd snd_pcm snd_timer snd soundcore libcrc32c crc8 crc7 crc4 crc_itu_t virtiofs fuse ext4 mbcache jbd2 virtio_vdpa vdpa virtio_mmio virtio_mem virtio_input virtio_dma_buf virtio_balloon virtio_vfio_pci virtio_pci virtio_pci_modern_dev virtio_pci_legacy_dev vfio_pci_core irqbypass vfio iommufd virtio_scsi virtio_rpmsg_bus rpmsg_ns rpmsg_core nd_virtio virtio_net net_failover failover virtio_iommu virtio_crypto crypto_engine virtio_console virtio_rng virtio_bt bluetooth rfkill crc16 ecdh_generic virtio_blk ata_piix trusted asn1_encoder tee crc32c_generic crc32_generic crct10dif_pclmul crc32c_intel crc32_pclmul
[   18.566908] CR2: 0000000000000600
[   18.567254] ---[ end trace 0000000000000000 ]---

This is a recent kernel bug; we saw it on Arch too:

https://github.com/libguestfs/libguestfs/issues/139#issuecomment-2056607791

It's a kernel bug, and we have no idea yet what causes it.
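
As a reading aid for the oops above: the faulting instruction is the byte marked `<48>` in the `Code:` line. The sketch below decodes it by hand. It is a minimal decoder that handles only this one encoding (`REX.W 8B /r` with a 32-bit displacement and no SIB byte), and `decode_faulting_mov` is a hypothetical helper, not part of any tool. The result, a load from `[rdi+0x600]` with RDI = 0, matches CR2 = 0x600 and is consistent with memcg_page_state() being called with a NULL first argument.

```python
# Excerpt of the "Code:" line from the oops, starting at the marked byte.
CODE = "<48> 8b 87 00 06 00 00 48 63 f6"

# Register numbering for 64-bit operands when no REX.R/REX.B bits are set.
REG64 = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]


def decode_faulting_mov(code):
    """Decode 'mov r64, [r64 + disp32]' starting at the <..> marker."""
    tokens = code.split()
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    raw = bytes(int(t.strip("<>"), 16) for t in tokens[start:])
    rex, opcode, modrm = raw[0], raw[1], raw[2]
    if rex != 0x48 or opcode != 0x8B:
        raise ValueError("only REX.W + 8B (mov r64, r/m64) is handled")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b10 or rm == 0b100:
        raise ValueError("only the disp32, no-SIB form is handled")
    disp = int.from_bytes(raw[3:7], "little", signed=True)
    return REG64[reg], REG64[rm], disp


dst, base, disp = decode_faulting_mov(CODE)
rdi = 0x0  # RDI value from the register dump
fault_addr = rdi + disp
print(f"mov {dst}, [{base}+{disp:#x}] -> fault at {fault_addr:#x}")
# prints: mov rax, [rdi+0x600] -> fault at 0x600
```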

Comment 2 Mark W 2024-04-16 10:49:12 UTC
Thanks for moving on this so quickly, Richard. I had wondered whether to report it directly as a kernel bug, but decided to report it here in the first instance.

Comment 3 Christian Heusel 2024-04-16 12:21:24 UTC
I have reported my findings in https://lore.kernel.org/all/3iccc6vjl5gminut3lvpl4va2lbnsgku5ei2d7ylftoofy3n2v@gcfdvtsq6dx2/

Comment 4 Yuxuan Shui 2024-05-14 05:45:10 UTC
Hi, I recently updated from 6.8.5 to 6.8.9 and started to have problems with zswap. I noticed "mm: zswap: fix shrinker NULL crash with cgroup_disable=memory" is the only change to zswap between these two versions. So although I have no idea if this is related, I decided to report it here.

Here are some stack traces:

A hang inside zswap:

INFO: task [redacted]:2870 blocked for more than 122 seconds.
      Tainted: P           O       6.8.9-zen1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:[redacted] state:D stack:0     pid:2870  tgid:2870  ppid:2788   flags:0x00004002
Call Trace:
 <TASK>
 __schedule+0x5fe/0xaf0
 schedule+0x6e/0xc0
 schedule_preempt_disabled+0x15/0x30
 __mutex_lock+0x28c/0x6a0
 __zswap_load+0x5d/0x1f0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? xas_store+0x3f1/0x5c0
 zswap_load+0xbd/0x270
 ? srso_alias_return_thunk+0x5/0xfbef5
 swap_read_folio+0x75/0x6e0
 ? workingset_refault+0x26e/0x4d0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __read_swap_cache_async+0x1fe/0x2b0
 swapin_readahead+0x437/0x450
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __filemap_get_folio+0x3b/0x320
 do_swap_page+0x1a6/0xaa0
 ? __pte_offset_map+0x1d/0xf0
 handle_mm_fault+0x7f4/0xc60
 do_user_addr_fault+0x46a/0x690
 exc_page_fault+0x62/0x150
 asm_exc_page_fault+0x26/0x30
RIP: 0033:0x78c1145a5f51
RSP: 002b:00007ffc0fd057b0 EFLAGS: 00010206
RAX: 0000000001e28840 RBX: 0000000000000020 RCX: 000078c1146ecb30
RDX: 000078c1146ecb40 RSI: 000000000211b890 RDI: 000078c1146ecac0
RBP: 000078c1146ecac0 R08: 0000000000000005 R09: 0000000000000004
R10: 000078c1146ecac0 R11: 0000000001ed0db0 R12: 0000000000000002
R13: 0000000000000014 R14: 0000000000000039 R15: 0000000000000000
 </TASK>

And a kernel BUG:

kernel BUG at mm/zswap.c:1395!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 283 Comm: kswapd0 Kdump: loaded Tainted: P        W  O       6.8.9-zen1
Hardware name: [redacted]
RIP: 0010:__zswap_load+0x1dc/0x1f0
Code: 04 25 28 00 00 00 48 3b 44 24 48 75 14 48 83 c4 50 5b 41 5c 41 5d 41 5e 41 5f 5d e9 a9 39 9c 00 cc e8 68 cc 6e 00 90 0f 0b 90 <0f> 0b 90 0f 0b cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 66 0f
RSP: 0018:ffff91ab80b83910 EFLAGS: 00010282
RAX: 00000000ffffffea RBX: ffff8c3f02a42c30 RCX: 00000000ffffffea
RDX: 000000000000000a RSI: ffff91ab81343000 RDI: ffff8c4a2da2a4c0
RBP: ffff91ab80b83938 R08: 0000000000000005 R09: ffff91ab8210d510
R10: ffff91ab8210d4c0 R11: ffff91ab82106020 R12: ffffdcc9fb95ffc0
R13: ffff91ab80b83918 R14: ffff8c3b8008c1c0 R15: ffff8c4a2da3eb58
FS:  0000000000000000(0000) GS:ffff8c4a2da00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffb956d008 CR3: 0000000bffe46000 CR4: 0000000000f50ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __die_body+0x68/0xb0
 ? die+0xa4/0xd0
 ? do_trap+0xa5/0x180
 ? __zswap_load+0x1dc/0x1f0
 ? __zswap_load+0x1dc/0x1f0
 ? handle_invalid_op+0x65/0x80
 ? __zswap_load+0x1dc/0x1f0
 ? exc_invalid_op+0x39/0x50
 ? asm_exc_invalid_op+0x1a/0x20
 ? __zswap_load+0x1dc/0x1f0
 shrink_memcg_cb+0x25c/0x530
 ? sysvec_call_function_single+0xe/0x80
 ? zswap_shrinker_count+0x170/0x170
 __list_lru_walk_one+0x110/0x220
 ? zswap_shrinker_count+0x170/0x170
 list_lru_walk_one+0x5e/0x80
 zswap_shrinker_scan+0xc4/0x140
 do_shrink_slab+0x160/0x330
 shrink_slab+0x354/0x4d0
 shrink_one+0xbe/0x1f0
 shrink_node+0xcab/0xea0
 kswapd+0x95d/0xf70
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __schedule+0x606/0xaf0
 ? shrink_all_memory+0x170/0x170
 kthread+0xe8/0x110
 ? kthread_blkcg+0x40/0x40
 ret_from_fork+0x37/0x50
 ? kthread_blkcg+0x40/0x40
 ret_from_fork_asm+0x11/0x20
 </TASK>
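
One detail in the register dump above may be worth noting: RAX and RCX both hold 0x00000000ffffffea, which as a signed 32-bit integer is -22, i.e. -EINVAL. The sketch below only shows the arithmetic. Whether RAX still held the return value of the call that triggered the BUG is an assumption, not something this trace proves.

```python
import errno


def as_signed32(value):
    """Interpret the low 32 bits of a register value as a signed integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


rax = 0x00000000FFFFFFEA  # RAX from the register dump
code = as_signed32(rax)
name = errno.errorcode.get(-code, "unknown")
print(code, name)
```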
