2275252 – kernel crashes: BUG: kernel NULL pointer dereference, address: 0000000000000600 in zswap_shrinker_count

Keywords:
Status:	NEW
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	rawhide
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	TRACKER-bugs-affecting-libguestfs

Reported:	2024-04-16 09:21 UTC by Mark W
Modified:	2024-08-05 07:51 UTC
CC List:	21 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Type:	Bug
Embargoed:
Dependent Products:

Attachments
virt-resize executed with LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 log (147.56 KB, text/plain) 2024-04-16 09:21 UTC, Mark W

Description Mark W 2024-04-16 09:21:04 UTC

Created attachment 2027179
virt-resize executed with LIBGUESTFS_DEBUG=1 LIBGUESTFS_TRACE=1 log

Description of problem:
When resizing a `qcow2` volume, `virt-resize` fails.

Version-Release number of selected component (if applicable):
guestfish 1.52.0

How reproducible:
Consistently fails on any qcow2 image file.

Steps to Reproduce:
1. Download the Arch Linux basic image from
>https://geo.mirror.pkgbuild.com/images/v20240415.229275/Arch-Linux-x86_64-basic.qcow2
and save it as beforeresize.qcow2
2. Execute: 
>qemu-img create -f qcow2 -o preallocation=metadata,compression_type=zstd newdisk.qcow2 100G
3. Execute: 
>virt-resize --expand /dev/sda3 beforeresize.qcow2 newdisk.qcow2

Actual results:
virt-resize: error: libguestfs error: appliance closed the connection 
unexpectedly.
This usually means the libguestfs appliance crashed.
...


Expected results:
newdisk.qcow2 should be a copy of beforeresize.qcow2, but with /dev/sda3 increased to 100 GB.

Additional info:
This also happens with other, unrelated images.
It did not happen before; it appeared in the past week or two after an Arch Linux `pacman -Syu` upgrade.

OS: Arch Linux x86_64 
Kernel: 6.8.5-arch1-1 
CPU: Intel i5-6600K

Comment 1 Richard W.M. Jones 2024-04-16 09:38:23 UTC

[   18.533479] BUG: kernel NULL pointer dereference, address: 0000000000000600
[   18.534302] #PF: supervisor read access in kernel mode
[   18.534862] #PF: error_code(0x0000) - not-present page
[   18.535429] PGD 0 P4D 0 
[   18.535715] Oops: 0000 [#1] PREEMPT SMP PTI
[   18.536175] CPU: 0 PID: 43 Comm: kswapd0 Not tainted 6.8.5-arch1-1 #1 5f12b795066ab8d27a5fe9971245067df4fb99ed
[   18.537241] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.3-1-1 04/01/2014
[   18.538220] RIP: 0010:memcg_page_state+0x9/0x30
[   18.538721] Code: c3 cc cc cc cc eb f9 e9 05 b8 ff ff 0f 1f 44 00 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2 48 8b 04 f0 48 85 c0 48 0f 48
[   18.540694] RSP: 0018:ffff9c4840163af0 EFLAGS: 00010246
[   18.541256] RAX: 00000000fffff33f RBX: ffff9c4840163bc0 RCX: 0000000000000002
[   18.542026] RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000
[   18.542791] RBP: 0000000000000000 R08: ffff8d64c314e000 R09: 0000000000000000
[   18.543554] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8d650ffdb780
[   18.544317] R13: ffff8d64c1c28400 R14: 0000000000000000 R15: ffff8d64c31e1d80
[   18.545083] FS:  0000000000000000(0000) GS:ffff8d650de00000(0000) knlGS:0000000000000000
[   18.545948] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   18.546570] CR2: 0000000000000600 CR3: 00000000040f6003 CR4: 0000000000370ef0
[   18.547338] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   18.548102] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   18.548873] Call Trace:
[   18.549147]  <TASK>
[   18.549389]  ? __die+0x23/0x70
[   18.549732]  ? page_fault_oops+0x171/0x4e0
[   18.550178]  ? free_unref_page_list+0x2f4/0x400
[   18.550677]  ? exc_page_fault+0x7f/0x180
[   18.551106]  ? asm_exc_page_fault+0x26/0x30
[   18.551564]  ? memcg_page_state+0x9/0x30
[   18.552001]  zswap_shrinker_count+0xb4/0x120
[   18.552470]  do_shrink_slab+0x37/0x360
[   18.552885]  shrink_slab+0xc7/0x3c0
[   18.553274]  ? try_to_shrink_lruvec+0x1bf/0x290
[   18.553772]  shrink_one+0x123/0x1b0
[   18.554163]  shrink_node+0xa7f/0xbc0
[   18.554557]  ? psi_group_change+0x213/0x3c0
[   18.555014]  balance_pgdat+0x523/0x960
[   18.555442]  ? psi_task_switch+0xd6/0x230
[   18.555886]  ? __switch_to_asm+0x3e/0x70
[   18.556320]  ? finish_task_switch.isra.0+0x94/0x2f0
[   18.556852]  kswapd+0x20d/0x400
[   18.557203]  ? __pfx_autoremove_wake_function+0x10/0x10
[   18.557774]  ? __pfx_kswapd+0x10/0x10
[   18.558174]  kthread+0xe5/0x120
[   18.558525]  ? __pfx_kthread+0x10/0x10
[   18.558943]  ret_from_fork+0x31/0x50
[   18.559337]  ? __pfx_kthread+0x10/0x10
[   18.559752]  ret_from_fork_asm+0x1b/0x30
[   18.560183]  </TASK>
[   18.560429] Modules linked in: vfat fat dm_mod btrfs blake2b_generic xor raid6_pq virtio_snd snd_pcm snd_timer snd soundcore libcrc32c crc8 crc7 crc4 crc_itu_t virtiofs fuse ext4 mbcache jbd2 virtio_vdpa vdpa virtio_mmio virtio_mem virtio_input virtio_dma_buf virtio_balloon virtio_vfio_pci virtio_pci virtio_pci_modern_dev virtio_pci_legacy_dev vfio_pci_core irqbypass vfio iommufd virtio_scsi virtio_rpmsg_bus rpmsg_ns rpmsg_core nd_virtio virtio_net net_failover failover virtio_iommu virtio_crypto crypto_engine virtio_console virtio_rng virtio_bt bluetooth rfkill crc16 ecdh_generic virtio_blk ata_piix trusted asn1_encoder tee crc32c_generic crc32_generic crct10dif_pclmul crc32c_intel crc32_pclmul
[   18.566908] CR2: 0000000000000600
[   18.567254] ---[ end trace 0000000000000000 ]---

This is a recent kernel bug; we saw it on Arch too:

https://github.com/libguestfs/libguestfs/issues/139#issuecomment-2056607791

It's a kernel bug; we have no idea yet what causes it.
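
The oops above can be cross-checked by hand. In the Code: line, the byte marked `<48>` is where RIP points, so the bytes from there on are the faulting instruction. The Python sketch below decodes just that one instruction form (a 64-bit MOV from base register plus 32-bit displacement) and compares the resulting effective address with CR2. The helper names are hypothetical and the decoder is deliberately minimal; it is not a general disassembler. If the result matches, it suggests that the first argument of memcg_page_state(), which the x86-64 calling convention passes in RDI, was NULL and that 0x600 is a field offset from that pointer. That reading is an interpretation of the register dump, not something the report states.

```python
import re

# Lines copied from the oops in comment 1 (timestamps dropped, Code: line shortened).
OOPS = """\
RIP: 0010:memcg_page_state+0x9/0x30
Code: 66 0f 1f 00 0f 1f 44 00 00 <48> 8b 87 00 06 00 00 48 63 f6 31 d2
RDX: 0000000000000001 RSI: 0000000000000033 RDI: 0000000000000000
CR2: 0000000000000600 CR3: 00000000040f6003 CR4: 0000000000370ef0
"""


def registers(text):
    """Return every 'NAME: 16-hex-digit' pair in the oops as {name: int}."""
    pairs = re.findall(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b", text)
    return {name: int(value, 16) for name, value in pairs}


def faulting_bytes(text):
    """Return the instruction bytes starting at the byte marked <xx> (the RIP)."""
    code = re.search(r"^Code: (.*)$", text, re.M).group(1).split()
    start = next(i for i, b in enumerate(code) if b.startswith("<"))
    return bytes(int(b.strip("<>"), 16) for b in code[start:])


def decode_mov_disp32(insn):
    """Decode only 'REX.W 8B /r' with mod=10 (base + disp32) and no SIB byte."""
    rex, opcode, modrm = insn[0], insn[1], insn[2]
    assert rex == 0x48 and opcode == 0x8B, "not a 64-bit MOV r64, r/m64"
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 0b10 and rm != 0b100, "only base+disp32 without SIB is handled"
    names = ["RAX", "RCX", "RDX", "RBX", "RSP", "RBP", "RSI", "RDI"]
    disp = int.from_bytes(insn[3:7], "little", signed=True)
    return names[reg], names[rm], disp


regs = registers(OOPS)
dest, base, disp = decode_mov_disp32(faulting_bytes(OOPS))
address = regs[base] + disp

# AT&T syntax, as objdump would print it.
print(f"mov {disp:#x}(%{base.lower()}), %{dest.lower()}")
print(f"{base} = {regs[base]:#x}, effective address = {address:#x}, "
      f"CR2 = {regs['CR2']:#x}")
```

For the values in this report, the decoded load is from RDI + 0x600 with RDI equal to zero, which matches the CR2 value and the address in the bug title.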

Comment 2 Mark W 2024-04-16 10:49:12 UTC

Thanks for moving on this so quickly, Richard. I had wondered whether to report it directly as a kernel bug, but decided reporting it here in the first instance.

Comment 3 Christian Heusel 2024-04-16 12:21:24 UTC

I have reported my findings in https://lore.kernel.org/all/3iccc6vjl5gminut3lvpl4va2lbnsgku5ei2d7ylftoofy3n2v@gcfdvtsq6dx2/

Comment 4 Yuxuan Shui 2024-05-14 05:45:10 UTC

Hi, I recently updated from 6.8.5 to 6.8.9 and started to have problems with zswap. I noticed "mm: zswap: fix shrinker NULL crash with cgroup_disable=memory" is the only change to zswap between these two versions. So although I have no idea if this is related, I decided to report it here.

Here are some stack traces:

A hang inside zswap:

INFO: task [redacted]:2870 blocked for more than 122 seconds.
      Tainted: P           O       6.8.9-zen1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:[redacted] state:D stack:0     pid:2870  tgid:2870  ppid:2788   flags:0x00004002
Call Trace:
 <TASK>
 __schedule+0x5fe/0xaf0
 schedule+0x6e/0xc0
 schedule_preempt_disabled+0x15/0x30
 __mutex_lock+0x28c/0x6a0
 __zswap_load+0x5d/0x1f0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? xas_store+0x3f1/0x5c0
 zswap_load+0xbd/0x270
 ? srso_alias_return_thunk+0x5/0xfbef5
 swap_read_folio+0x75/0x6e0
 ? workingset_refault+0x26e/0x4d0
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __read_swap_cache_async+0x1fe/0x2b0
 swapin_readahead+0x437/0x450
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __filemap_get_folio+0x3b/0x320
 do_swap_page+0x1a6/0xaa0
 ? __pte_offset_map+0x1d/0xf0
 handle_mm_fault+0x7f4/0xc60
 do_user_addr_fault+0x46a/0x690
 exc_page_fault+0x62/0x150
 asm_exc_page_fault+0x26/0x30
RIP: 0033:0x78c1145a5f51
RSP: 002b:00007ffc0fd057b0 EFLAGS: 00010206
RAX: 0000000001e28840 RBX: 0000000000000020 RCX: 000078c1146ecb30
RDX: 000078c1146ecb40 RSI: 000000000211b890 RDI: 000078c1146ecac0
RBP: 000078c1146ecac0 R08: 0000000000000005 R09: 0000000000000004
R10: 000078c1146ecac0 R11: 0000000001ed0db0 R12: 0000000000000002
R13: 0000000000000014 R14: 0000000000000039 R15: 0000000000000000
 </TASK>

And a kernel BUG:

kernel BUG at mm/zswap.c:1395!
invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
CPU: 4 PID: 283 Comm: kswapd0 Kdump: loaded Tainted: P        W  O       6.8.9-zen1
Hardware name: [redacted]
RIP: 0010:__zswap_load+0x1dc/0x1f0
Code: 04 25 28 00 00 00 48 3b 44 24 48 75 14 48 83 c4 50 5b 41 5c 41 5d 41 5e 41 5f 5d e9 a9 39 9c 00 cc e8 68 cc 6e 00 90 0f 0b 90 <0f> 0b 90 0f 0b cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 66 0f
RSP: 0018:ffff91ab80b83910 EFLAGS: 00010282
RAX: 00000000ffffffea RBX: ffff8c3f02a42c30 RCX: 00000000ffffffea
RDX: 000000000000000a RSI: ffff91ab81343000 RDI: ffff8c4a2da2a4c0
RBP: ffff91ab80b83938 R08: 0000000000000005 R09: ffff91ab8210d510
R10: ffff91ab8210d4c0 R11: ffff91ab82106020 R12: ffffdcc9fb95ffc0
R13: ffff91ab80b83918 R14: ffff8c3b8008c1c0 R15: ffff8c4a2da3eb58
FS:  0000000000000000(0000) GS:ffff8c4a2da00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fffb956d008 CR3: 0000000bffe46000 CR4: 0000000000f50ef0
PKRU: 55555554
Call Trace:
 <TASK>
 ? __die_body+0x68/0xb0
 ? die+0xa4/0xd0
 ? do_trap+0xa5/0x180
 ? __zswap_load+0x1dc/0x1f0
 ? __zswap_load+0x1dc/0x1f0
 ? handle_invalid_op+0x65/0x80
 ? __zswap_load+0x1dc/0x1f0
 ? exc_invalid_op+0x39/0x50
 ? asm_exc_invalid_op+0x1a/0x20
 ? __zswap_load+0x1dc/0x1f0
 shrink_memcg_cb+0x25c/0x530
 ? sysvec_call_function_single+0xe/0x80
 ? zswap_shrinker_count+0x170/0x170
 __list_lru_walk_one+0x110/0x220
 ? zswap_shrinker_count+0x170/0x170
 list_lru_walk_one+0x5e/0x80
 zswap_shrinker_scan+0xc4/0x140
 do_shrink_slab+0x160/0x330
 shrink_slab+0x354/0x4d0
 shrink_one+0xbe/0x1f0
 shrink_node+0xcab/0xea0
 kswapd+0x95d/0xf70
 ? srso_alias_return_thunk+0x5/0xfbef5
 ? __schedule+0x606/0xaf0
 ? shrink_all_memory+0x170/0x170
 kthread+0xe8/0x110
 ? kthread_blkcg+0x40/0x40
 ret_from_fork+0x37/0x50
 ? kthread_blkcg+0x40/0x40
 ret_from_fork_asm+0x11/0x20
 </TASK>
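
One detail in the register dump of the kernel BUG above may help when comparing it with other reports: RAX holds 0x00000000ffffffea. Read as a signed 32-bit integer, that is a small negative number of the kind kernel functions return as error codes. The Python sketch below does the conversion and looks the value up in the standard errno table. Whether this value is the return code that tripped the BUG at mm/zswap.c:1395 is an assumption; the report does not confirm it. The helper name as_signed32 is hypothetical.

```python
import errno

# RAX from the "kernel BUG at mm/zswap.c:1395!" register dump in this comment.
rax = 0x00000000FFFFFFEA


def as_signed32(value):
    """Interpret the low 32 bits of value as a two's-complement integer."""
    low = value & 0xFFFFFFFF
    return low - (1 << 32) if low & 0x80000000 else low


ret = as_signed32(rax)
# errno.errorcode maps positive errno numbers to their symbolic names.
name = errno.errorcode.get(-ret, "unknown")
print(f"RAX low 32 bits = {ret} ({'-' + name if ret < 0 else name})")
```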
