Description of problem: once iSCSI and multipath is configured, running following command makes kworkers spin on 100% CPU. Seems that there is one kworker per iSCSI portal, in my case that is 4. Version-Release number of selected component (if applicable): Known to be broken: 4.14.18-300.fc27.x86_64 Known to work fine: 4.13.9-300.fc27.x86_64 How reproducible: 100% of the time Steps to Reproduce: 1. initiate iSCSI sessions to multiple portals 2. let multipathd to create multipath devices 3. run `iscsiadm -m node --logoutall=all` Actual results: 100% CPU time on 4 kworkers (1 per iSCSI session/portal) Sometimes linked list error is reported to dmesg: Feb 16 10:37:11 localhost.localdomain kernel: alua: device handler registered Feb 16 10:37:11 localhost.localdomain kernel: emc: device handler registered Feb 16 10:37:11 localhost.localdomain kernel: rdac: device handler registered Feb 16 10:37:11 localhost.localdomain kernel: device-mapper: multipath service-time: version 0.3.0 loaded Feb 16 10:38:38 localhost.localdomain kernel: list_del corruption, ffff88c1cd6bb810->next is LIST_POISON1 (dead000000000100) Feb 16 10:38:38 localhost.localdomain kernel: ------------[ cut here ]------------ Feb 16 10:38:38 localhost.localdomain kernel: kernel BUG at lib/list_debug.c:47! Feb 16 10:38:38 localhost.localdomain kernel: invalid opcode: 0000 [#1] SMP PTI Feb 16 10:38:38 localhost.localdomain kernel: Modules linked in: dm_service_time dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua binfmt_misc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6tabl Feb 16 10:38:38 localhost.localdomain kernel: pata_acpi Feb 16 10:38:38 localhost.localdomain kernel: CPU: 2 PID: 5 Comm: kworker/u24:0 Not tainted 4.14.18-300.fc27.x86_64 #1 Feb 16 10:38:38 localhost.localdomain kernel: Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015 Feb 16 10:38:38 localhost.localdomain kernel: Workqueue: scsi_wq_5 __iscsi_unbind_session [scsi_transport_iscsi] Feb 16 10:38:38 localhost.localdomain kernel: task: ffff88bdede83e80 task.stack: ffffb15043158000 Feb 16 10:38:38 localhost.localdomain kernel: RIP: 0010:__list_del_entry_valid+0x4e/0x90 Feb 16 10:38:38 localhost.localdomain kernel: RSP: 0018:ffffb1504315bd88 EFLAGS: 00010082 Feb 16 10:38:38 localhost.localdomain kernel: RAX: 000000000000004e RBX: ffff88c1cd6bbf38 RCX: 0000000000000000 Feb 16 10:38:38 localhost.localdomain kernel: RDX: 0000000000000000 RSI: ffff88bdefc96a38 RDI: ffff88bdefc96a38 Feb 16 10:38:38 localhost.localdomain kernel: RBP: 0000000000000246 R08: 00000000000007be R09: 0000000000aaaaaa Feb 16 10:38:38 localhost.localdomain kernel: R10: ffffb1504315bd58 R11: 00000000ffffffff R12: ffff88c1ebb659c0 Feb 16 10:38:38 localhost.localdomain kernel: R13: ffff88bdec827010 R14: ffff88c1cd6bb800 R15: ffff88c1cd6bb800 Feb 16 10:38:38 localhost.localdomain kernel: FS: 0000000000000000(0000) GS:ffff88bdefc80000(0000) knlGS:0000000000000000 Feb 16 10:38:38 localhost.localdomain kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 Feb 16 10:38:38 localhost.localdomain kernel: CR2: 0000563d0c1ed280 CR3: 000000057120a001 CR4: 00000000001606e0 Feb 16 10:38:38 localhost.localdomain kernel: Call Trace: Feb 16 10:38:38 localhost.localdomain kernel: scsi_device_dev_release_usercontext+0x52/0x250 Feb 16 10:38:38 localhost.localdomain kernel: ? __schedule+0x10f/0x880 Feb 16 10:38:38 localhost.localdomain kernel: execute_in_process_context+0x21/0x60 Feb 16 10:38:38 localhost.localdomain kernel: device_release+0x30/0x80 Feb 16 10:38:38 localhost.localdomain kernel: kobject_put+0x80/0x1a0 Feb 16 10:38:38 localhost.localdomain kernel: scsi_remove_target+0x16d/0x1b0 Feb 16 10:38:38 localhost.localdomain kernel: __iscsi_unbind_session+0xad/0x150 [scsi_transport_iscsi] Feb 16 10:38:38 localhost.localdomain kernel: process_one_work+0x184/0x3a0 Feb 16 10:38:38 localhost.localdomain kernel: worker_thread+0x2e/0x380 Feb 16 10:38:38 localhost.localdomain kernel: ? process_one_work+0x3a0/0x3a0 Feb 16 10:38:38 localhost.localdomain kernel: kthread+0x11a/0x130 Feb 16 10:38:38 localhost.localdomain kernel: ? kthread_park+0x60/0x60 Feb 16 10:38:38 localhost.localdomain kernel: ret_from_fork+0x35/0x40 Feb 16 10:38:38 localhost.localdomain kernel: Code: 74 2b 48 8b 32 48 39 fe 75 34 48 8b 50 08 48 39 f2 75 3f b8 01 00 00 00 c3 48 89 fe 48 89 c2 48 c7 c7 90 c6 0c bf e8 0d 94 cc ff <0f> 0b 48 89 fe 48 c7 c7 c8 c6 0c bf e8 fc 93 cc ff 0f 0b 48 89 Feb 16 10:38:38 localhost.localdomain kernel: RIP: __list_del_entry_valid+0x4e/0x90 RSP: ffffb1504315bd88 Feb 16 10:38:38 localhost.localdomain kernel: ---[ end trace 20bb15aa6c0307f2 ]--- Feb 16 10:38:38 localhost.localdomain kernel: device-mapper: multipath: Failing path 65:0. Feb 16 10:38:39 localhost.localdomain kernel: print_req_error: I/O error, dev dm-2, sector 2097024 Feb 16 10:38:39 localhost.localdomain kernel: print_req_error: I/O error, dev dm-2, sector 2097024 Feb 16 10:38:39 localhost.localdomain kernel: Buffer I/O error on dev dm-2, logical block 262128, async page read Feb 16 10:38:39 localhost.localdomain kernel: device-mapper: multipath: Failing path 65:64. Feb 16 10:38:40 localhost.localdomain kernel: device-mapper: multipath: Failing path 65:112. Feb 16 10:38:41 localhost.localdomain kernel: device-mapper: multipath: Failing path 65:208. Feb 16 10:38:42 localhost.localdomain kernel: device-mapper: multipath: Failing path 8:112. Feb 16 10:38:43 localhost.localdomain kernel: device-mapper: multipath: Failing path 8:208. Feb 16 10:38:43 localhost.localdomain kernel: device-mapper: multipath: Failing path 8:208. Feb 16 10:38:43 localhost.localdomain kernel: list_del corruption, ffff88c1ebc26010->next is LIST_POISON1 (dead000000000100) Someimes it is __slab_free error: [ 70.433229] scsi 3:0:0:0: alua: Detached [ 70.433233] ------------[ cut here ]------------ [ 70.433234] kernel BUG at mm/slub.c:295! [ 70.433253] invalid opcode: 0000 [#1] SMP PTI [ 70.433265] Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ip_set nfnetlink ebtable_nat ebtable_broute bridge stp llc ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_raw iptable_security dm_service_time ebtable_filter ebtables ip6table_filter ip6_tables vmw_vsock_vmci_transport vsock sunrpc sb_edac crct10dif_pclmul crc32_pclmul vmw_balloon ghash_clmulni_intel intel_rapl_perf joydev i2c_piix4 vmw_vmci shpchp dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi xfs libcrc32c vmwgfx drm_kms_helper ttm drm crc32c_intel ata_generic serio_raw vmxnet3 vmw_pvscsi pata_acpi [ 70.433441] CPU: 10 PID: 776 Comm: multipathd Not tainted 4.14.18-300.fc27.x86_64 #1 [ 70.433458] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 09/21/2015 [ 70.433483] task: ffff99ab2b5ebe80 task.stack: ffffb07f83a64000 [ 70.433500] RIP: 0010:__slab_free+0x18b/0x300 [ 70.433511] RSP: 0018:ffffb07f83a67cb0 EFLAGS: 00010246 [ 70.433525] RAX: ffff99ab2b11ff00 RBX: ffff99ab2b11ff00 RCX: 00000001802a0022 [ 70.433542] RDX: ffff99ab2b11ff00 RSI: fffff31490ac47c0 RDI: ffff99ab2f803680 [ 70.433558] RBP: ffffb07f83a67d50 R08: 0000000000000001 R09: ffffffff815cccbe [ 70.433574] R10: fffff31490ac47c0 R11: 0000000000000000 R12: ffff99ab2f803680 [ 70.433591] R13: fffff31490ac47c0 R14: ffff99ab2b11ff00 R15: ffff99ab2d742fb0 [ 70.433607] FS: 00007f6d65a7a700(0000) GS:ffff99af3fd00000(0000) knlGS:0000000000000000 [ 70.433626] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 70.433640] CR2: 00007f4a983c67f8 CR3: 000000082a726001 CR4: 00000000001606e0 [ 70.433688] Call Trace: [ 70.433701] ? __wake_up_common_lock+0x89/0xc0 [ 70.433715] ? __slab_free+0x15b/0x300 [ 70.433726] ? scsi_dh_release_device+0x1e/0x60 [ 70.433738] scsi_dh_release_device+0x1e/0x60 [ 70.433750] scsi_device_dev_release_usercontext+0x2a/0x250 [ 70.434206] execute_in_process_context+0x21/0x60 [ 70.434643] device_release+0x30/0x80 [ 70.435098] kobject_put+0x80/0x1a0 [ 70.435525] scsi_disk_put+0x2b/0x40 [ 70.435942] __blkdev_put+0x197/0x1f0 [ 70.436355] blkdev_close+0x21/0x30 [ 70.436756] __fput+0xd0/0x1e0 [ 70.437152] task_work_run+0x84/0xa0 [ 70.437539] exit_to_usermode_loop+0x9e/0xa0 [ 70.437917] do_syscall_64+0x173/0x180 [ 70.438133] device-mapper: multipath: Failing path 65:16. [ 70.438140] device-mapper: multipath: Failing path 65:0. [ 70.438985] entry_SYSCALL_64_after_hwframe+0x21/0x86 [ 70.439325] RIP: 0033:0x7f6d64a3a70c [ 70.439648] RSP: 002b:00007f6d65a78880 EFLAGS: 00000293 ORIG_RAX: 0000000000000003 [ 70.439974] RAX: 0000000000000000 RBX: 00007f6d4c0306f0 RCX: 00007f6d64a3a70c [ 70.440293] RDX: 0000000000000000 RSI: 0000000000000001 RDI: 000000000000000a [ 70.440605] RBP: 00007f6d4c030a68 R08: 0000000000000054 R09: 0000000000000080 [ 70.440907] R10: 00000000ffffffdf R11: 0000000000000293 R12: 000055867dd47410 [ 70.441199] R13: 0000000000000000 R14: 0000000000000001 R15: 00007f6d4c03be00 [ 70.441486] Code: fe ff ff 44 0f b6 bd 7f ff ff ff 80 7d ab 00 79 05 45 84 ff 74 7b 48 83 c4 78 5b 41 5c 41 5d 41 5e 41 5f 5d 49 8d 65 f0 41 5d c3 <0f> 0b 4c 89 d0 4c 89 55 88 45 89 fa 48 85 c0 44 0f b6 bd 7f ff [ 70.442406] RIP: __slab_free+0x18b/0x300 RSP: ffffb07f83a67cb0 [ 70.442735] ---[ end trace c79c468d1b715f7e ]--- [ 70.540044] device-mapper: multipath: Failing path 8:32. Expected results: Additional info:
I am trying to run bisect, but being completely new to Fedora/RH ecosystem I am a bit lost. I am trying to use https://pagure.io/fedbisect, but it fails to build: kernel-4.14.0-0.rc8.git2.1.fc28 build (rawhide, /rpms/kernel:d8bca9aaecd50e9254af30aac99bd764595039d2) d8bca9aaecd50e9254af30aac99bd764595039d2 cp: cannot stat '/root/multipath/pkg-git/config-*': No such file or directory cp: cannot stat '/root/multipath/pkg-git/Makefile.config': No such file or directory now calling building for kernel-4.14.0-0.rc8.git2.1.fc28. This may take a while... sources file doesn't exist. Source files download skipped. warning: line 751: Possible unexpanded macro in: %package %{pae} warning: line 751: Possible unexpanded macro in: Provides: kernel-x86_64 = 9.9.9-0+%{p .... error: Failed build dependencies: perl is needed by kernel-9.9.9-0.x86_64 Could not execute local: rpmbuild --define '_sourcedir /root/multipath/build' --define '_specdir /root/multipath/build' --define '_builddir /root/multipath/build' --define '_srcrpmdir /root/multipath/build' --define '_rpmdir /root/multipath/build' --define 'dist .fc27' -- define 'fedora 27' --eval '%undefine rhel' --define 'fc27 1' -ba /root/multipath/build/kernel.spec | tee .build-9.9.9-0.log well that didn't work... Shouldn't it be calling `mock` to build stuff?
Fedbisect is really out of date and basically abandoned. You are better off just bisecting with the kernel tree manually.
Laura Abbott, would it work if I try to build vanilla kernel without patches? I'll give it a go.
Vanilla kernel usually works just fine on most hardware if you use the same configuration file.
Git bisect points to this commit: fbce4d97fd4333bcffd00a73b9d98412be630332 is the first bad commit commit fbce4d97fd4333bcffd00a73b9d98412be630332 Author: Hannes Reinecke <hare> Date: Wed Oct 4 10:28:56 2017 +0200 scsi: fixup kernel warning during rmmod() Calling rmmod() on a FC driver will results in warnings like WARNING: CPU: 60 PID: 14640 at fs/sysfs/group.c:237 device_del+0x54/0x240() sysfs group ffffffff81eff140 not found for kobject '3:0:0:3' The problem here is that during scsi_remove_target() we will iterate over all devices, but fail to remove any of those as the call to scsi_device_get() fails the check to module_is_live(). Hence the devices will not be removed at this point, but all intermediate structures like fc rport etc. will be. Later on during scsi_forget_host() the devices are removed for real, but the device parent is already removed and causes this warning. Signed-off-by: Hannes Reinecke <hare> Reviewed-by: Johannes Thumshirn <jthumshirn> Reviewed-by: Kyle Fortin <kyle.fortin> Tested-by: Don Brace <don.brace> Signed-off-by: Martin K. Petersen <martin.petersen> git bisect log: git bisect start # bad: [39dae59d66acd86d1de24294bd2f343fd5e7a625] Linux 4.14-rc8 git bisect bad 39dae59d66acd86d1de24294bd2f343fd5e7a625 # good: [e87c13993f16549e77abce9744af844c55154349] Linux 4.13.16 git bisect good e87c13993f16549e77abce9744af844c55154349 # good: [569dbb88e80deb68974ef6fdd6a13edb9d686261] Linux 4.13 git bisect good 569dbb88e80deb68974ef6fdd6a13edb9d686261 # good: [a0725ab0c7536076d5477264420ef420ebb64501] Merge branch 'for-4.14/block' of git://git.kernel.dk/linux-block git bisect good a0725ab0c7536076d5477264420ef420ebb64501 # good: [ae46654bcff303b33facbbd04a3ad9c21d303f9b] Merge tag 'armsoc-drivers' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc git bisect good ae46654bcff303b33facbbd04a3ad9c21d303f9b # good: [6ff15f8db7eaf29ef5ead6afbec9b25485fe8703] x86/fpu: Change 'size_total' parameter to unsigned and standardize the size checks in copy_xstate_to_*() git bisect good 6ff15f8db7eaf29ef5ead6afbec9b25485fe8703 # good: [238604ca0b708319e089e22545bcda39afb5faa8] drm/exynos: Clear drvdata after component unbind git bisect good 238604ca0b708319e089e22545bcda39afb5faa8 # bad: [b5ac3beb5a9f0ef0ea64cd85faf94c0dc4de0e42] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net git bisect bad b5ac3beb5a9f0ef0ea64cd85faf94c0dc4de0e42 # good: [3728e6a255b50382591ee374c70e6f5276a47d0a] Merge tag 'media/v4.14-2' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media git bisect good 3728e6a255b50382591ee374c70e6f5276a47d0a # bad: [e5f468b3f23313994c5e6c356135f9b0d76bcb94] Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input git bisect bad e5f468b3f23313994c5e6c356135f9b0d76bcb94 # bad: [eb62722a06c4e6715c71e5c8fcf14aed857c0964] Merge branch 'i2c/for-current' of git://git.kernel.org/pub/scm/linux/kernel/git/wsa/linux git bisect bad eb62722a06c4e6715c71e5c8fcf14aed857c0964 # bad: [96f893abc87bd29339c973edf6068a064cb8a756] Merge branch 'parisc-4.14-3' of git://git.kernel.org/pub/scm/linux/kernel/git/deller/parisc-linux git bisect bad 96f893abc87bd29339c973edf6068a064cb8a756 # bad: [020b3023762fdf6cc816ed154e3b1f8eafaf0836] Merge branch 'for-linus' of git://git.kernel.dk/linux-block git bisect bad 020b3023762fdf6cc816ed154e3b1f8eafaf0836 # bad: [503f7e297d76bf109533eef3287ad821f05e47a4] Merge tag 'trace-v4.14-rc3' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace git bisect bad 503f7e297d76bf109533eef3287ad821f05e47a4 # bad: [a9e170e28636fd577249f39029d59e4e960a42b8] scsi: qla2xxx: Fix uninitialized work element git bisect bad a9e170e28636fd577249f39029d59e4e960a42b8 # bad: [d1b3f51ee1eab3a6db1b09a60e61280c48eb0b01] scsi: libfc: fix a deadlock in fc_rport_work git bisect bad d1b3f51ee1eab3a6db1b09a60e61280c48eb0b01 # bad: [fbce4d97fd4333bcffd00a73b9d98412be630332] scsi: fixup kernel warning during rmmod() git bisect bad fbce4d97fd4333bcffd00a73b9d98412be630332 # first bad commit: [fbce4d97fd4333bcffd00a73b9d98412be630332] scsi: fixup kernel warning during rmmod()
great! Your best bet is to send an e-mail to the authors of the patch + LKML + scsi mailing list. Feel free to cc me as well.
Posted https://lkml.org/lkml/2018/2/18/89
cherry-picling commit 81b6c9998979 ('scsi: core: check for device state in __scsi_remove_target()') on top of 4.4.20 fixes the problem for me
edit: on top of 4.14.20 of course
Thanks for bisecting and reporting. Fedora is getting upgraded to the 4.15 series (which should have the fix) so I'm going to close this bug as fixed in the current release.