Bug 1559692

Summary:	Hang or NULL pointer dereference if reading sysfs during VDO start/stop.
Product:	Red Hat Enterprise Linux 7	Reporter:	Sweet Tea Dorminy <sweettea>
Component:	kmod-kvdo	Assignee:	Thomas Jaskiewicz <tjaskiew>
Status:	CLOSED ERRATA	QA Contact:	Jakub Krysl <jkrysl>
Severity:	unspecified	Docs Contact:
Priority:	high
Version:	7.5	CC:	awalsh, bgurney, jkrysl, rhandlin, tjaskiew
Target Milestone:	rc	Keywords:	ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:	6.1.1.60	Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:
Clones:	1567744		Environment:
Last Closed:	2018-10-30 09:39:22 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1567744

Description Sweet Tea Dorminy 2018-03-23 03:21:31 UTC

Description of problem:
If sysfs entries are read while a VDO is starting up or shutting down, a hang or a NULL pointer dereference may occur. At that point, the parts of the VDO that the sysfs handler needs may not exist yet or may already have been freed.

Version-Release number of selected component (if applicable):
6.1.0.155

How reproducible:
1 in 10

Steps to Reproduce:
1. In a shell, run 'while true; do cat /sys/kvdo/vdo0/statistics/data_blocks_used || true; done'
2. Make a VDO but don't start it.
3. Start and stop the VDO in a loop.
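
The steps above can be sketched as a small script. This is only an illustration under stated assumptions: the function names read_stats and start_stop_cycles and the VDO_CMD/VDO_NAME variables are hypothetical helpers, the VDO is assumed to be already created and activated, and the sysfs path and the vdo start/stop subcommands are the ones used elsewhere in this report.

```shell
#!/bin/sh
# Sketch of the reproducer. Helper names and variables are hypothetical;
# the vdo subcommands and the sysfs path are taken from this report.
VDO_NAME="${VDO_NAME:-vdo0}"
VDO_CMD="${VDO_CMD:-vdo}"   # set to "echo" for a dry run without a VDO
STAT="/sys/kvdo/${VDO_NAME}/statistics/data_blocks_used"

# Read the statistics file N times, ignoring failures while the
# device (and therefore the sysfs entry) does not exist.
read_stats() {
    reads="$1"
    r=0
    while [ "$r" -lt "$reads" ]; do
        cat "$STAT" 2>/dev/null || true
        r=$((r + 1))
    done
}

# Start and stop the volume N times.
start_stop_cycles() {
    cycles="$1"
    c=0
    while [ "$c" -lt "$cycles" ]; do
        $VDO_CMD start --name "$VDO_NAME"
        $VDO_CMD stop --name "$VDO_NAME"
        c=$((c + 1))
    done
}
```

One possible invocation is 'read_stats 100000 & start_stop_cycles 50; wait'. On an affected build, both the background reader and the stop command may block indefinitely, so run it on a disposable test machine.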

Actual results:
Eventually, 120-second hung task warnings appear for both a dmsetup command and a cat. Alternatively, a NULL pointer dereference may occur.

Expected results:
No hung tasks or crashes.

Additional info:

Comment 2 Jakub Krysl 2018-04-13 14:24:40 UTC

Reproduced, acking...

1) # vdo create --name vdo0 --device /dev/sdb --activate disabled
2) # vdo activate --name vdo0
3) # while true; do vdo start --name vdo0 --verbose; vdo stop --name vdo0 --verbose; done;
4) (in a separate terminal, after a few cycles of step 3) # while true; do cat /sys/kvdo/vdo0/statistics/data_blocks_used || true; done;
5) the terminals get stuck
6) sudo shutdown -r now
7) this appears in console:

[  OK  ] Stopped Availability of block devices.
[     *] (2 of 2) A stop job is running for ... user root (1min 25s / 1min 30s)
[  963.792399] INFO: task dmsetup:7961 blocked for more than 120 seconds.
[  963.827624] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  963.865243] Call Trace:
[  963.876890]  [<ffffffff89b12f49>] schedule+0x29/0x70
[  963.900072]  [<ffffffff896a124d>] __kernfs_remove+0x17d/0x260
[  963.926933]  [<ffffffff894bbe20>] ? wake_up_atomic_t+0x30/0x30
[  963.954539]  [<ffffffff896a21b1>] kernfs_remove+0x21/0x30
[  963.980037]  [<ffffffff896a46c0>] sysfs_remove_dir+0x50/0x80
[  964.006754]  [<ffffffff8974ccb8>] kobject_del+0x18/0x50
[  964.031334]  [<ffffffff8974cd4e>] kobject_release+0x5e/0x1b0
[  964.057751]  [<ffffffff8974cc08>] kobject_put+0x28/0x60
[  964.082184]  [<ffffffffc087d663>] freeKernelLayer+0x223/0x2f0 [kvdo]
[  964.112320]  [<ffffffffc086e8ad>] vdoDtr+0xfd/0x1b0 [kvdo]
[  964.138357]  [<ffffffff89548250>] ? dyntick_save_progress_counter+0x30/0x30
[  964.171241]  [<ffffffffc0140763>] dm_table_destroy+0x73/0x120 [dm_mod]
[  964.171252]  [<ffffffffc013c726>] __dm_destroy+0x136/0x230 [dm_mod]
[  964.171269]  [<ffffffffc013ec23>] dm_destroy+0x13/0x20 [dm_mod]
[  964.171281]  [<ffffffffc0144c5e>] dev_remove+0x11e/0x1a0 [dm_mod]
[  964.171292]  [<ffffffffc0145b02>] ctl_ioctl+0x212/0x4e0 [dm_mod]
[  964.171308]  [<ffffffffc0144b40>] ? dev_suspend+0x260/0x260 [dm_mod]
[  964.171319]  [<ffffffffc0145dde>] dm_ctl_ioctl+0xe/0x20 [dm_mod]
[  964.171325]  [<ffffffff8962fb90>] do_vfs_ioctl+0x350/0x560
[  964.171329]  [<ffffffff896d82bf>] ? file_has_perm+0x9f/0xb0
[  964.171333]  [<ffffffff8962fe41>] SyS_ioctl+0xa1/0xc0
[  964.171340]  [<ffffffff89b1f7d5>] system_call_fastpath+0x1c/0x21
[  964.171343] INFO: task cat:7963 blocked for more than 120 seconds.
[  964.171343] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  964.171510] Call Trace:
[  964.171515]  [<ffffffff89751384>] ? __radix_tree_lookup+0x84/0xf0
[  964.171521]  [<ffffffff89b12f49>] schedule+0x29/0x70
[  964.171524]  [<ffffffff89b108b9>] schedule_timeout+0x239/0x2c0
[  964.171530]  [<ffffffff895962de>] ? filemap_fault+0x17e/0x490
[  964.171535]  [<ffffffff894f8c5f>] ? __getnstimeofday64+0x3f/0xd0
[  964.171538]  [<ffffffff894f8cfe>] ? getnstimeofday64+0xe/0x30
[  964.171542]  [<ffffffff89b132fd>] wait_for_completion+0xfd/0x140
[  964.171548]  [<ffffffff894cee80>] ? wake_up_state+0x20/0x20
[  964.171570]  [<ffffffffc0873040>] ? finishVDOAction+0x20/0x20 [kvdo]
[  964.171587]  [<ffffffffc0872fb2>] performKVDOOperation+0xb2/0xe0 [kvdo]
[  964.171603]  [<ffffffffc0873040>] ? finishVDOAction+0x20/0x20 [kvdo]
[  964.171616]  [<ffffffffc0873040>] ? finishVDOAction+0x20/0x20 [kvdo]
[  964.171632]  [<ffffffffc08737a7>] getKVDOStatistics+0x57/0x80 [kvdo]
[  964.171648]  [<ffffffffc0877c06>] poolStatsDataBlocksUsedShow+0x36/0x70 [kvdo]
[  964.171663]  [<ffffffffc0873c41>] poolStatsAttrShow+0x21/0x30 [kvdo]
[  964.171667]  [<ffffffff896a3e8f>] sysfs_kf_seq_show+0xcf/0x1f0
[  964.171671]  [<ffffffff896a25d6>] kernfs_seq_show+0x26/0x30
[  964.171675]  [<ffffffff89641410>] seq_read+0x110/0x3f0
[  964.171679]  [<ffffffff896a2e35>] kernfs_fop_read+0xf5/0x160
[  964.171683]  [<ffffffff8961ab3f>] vfs_read+0x9f/0x170
[  964.171686]  [<ffffffff8961ba0f>] SyS_read+0x7f/0xf0
[  964.171692]  [<ffffffff89b1f7d5>] system_call_fastpath+0x1c/0x21
[   ***] (1 of 2) A stop job is running for ...1 of user root (2min 28s / 3min)
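
To pick the hung-task reports out of a console log like the one above, a small filter can help. This is a minimal sketch: the function name hung_tasks is hypothetical, and it assumes the kernel message keeps the "INFO: task NAME:PID blocked for more than N seconds." wording shown in this log. It matches the message anywhere in a line, so it also works when a systemd status line precedes the kernel message on the same line.

```shell
#!/bin/sh
# Print "NAME PID SECONDS" for every hung-task report read from stdin.
# hung_tasks is a hypothetical helper name; the message format is
# assumed to match the kernel log lines shown in this report.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than \([0-9][0-9]*\) seconds.*/\1 \2 \3/p'
}
```

For example, 'dmesg | hung_tasks' lists each blocked task once per warning, which makes it easy to check whether both the dmsetup and the cat side of the hang were reported.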

Comment 3 Thomas Jaskiewicz 2018-04-13 18:24:12 UTC

*** Bug 1567215 has been marked as a duplicate of this bug. ***

Comment 6 Jakub Krysl 2018-07-03 08:01:22 UTC

Tested on:
RHEL-7.6-20180626.0
kernel-3.10.0-915.el7
kmod-vdo-6.1.1.99-1.el7
vdo-6.1.1.99-2.el7

I was not able to reproduce this anymore; the stop/start cycle keeps going.
Regression testing did not find any issues.

Comment 8 errata-xmlrpc 2018-10-30 09:39:22 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3094