Bug 1533260 - General Protection Fault unlocking UDS callback mutex
Summary: General Protection Fault unlocking UDS callback mutex
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kmod-kvdo
Version: 7.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Thomas Jaskiewicz
QA Contact: Jakub Krysl
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-01-10 21:25 UTC by Thomas Jaskiewicz
Modified: 2021-09-03 12:04 UTC (History)
6 users

Fixed In Version: 6.1.0.126
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-10 16:26:33 UTC
Target Upstream Version:
Embargoed:


Attachments: none


Links
System                              ID              Private  Priority  Status  Summary  Last Updated
Red Hat Knowledge Base (Solution)   5982481         0        None      None    None     2021-04-21 12:46:48 UTC
Red Hat Product Errata              RHEA-2018:0900  0        None      None    None     2018-04-10 16:26:53 UTC

Description Thomas Jaskiewicz 2018-01-10 21:25:43 UTC
Description of problem:

In our nightly testing, we have seen an occasional general protection fault related to the UDS callback code path.  Examining the source code, we have found a code path that sleeps while holding a spinlock.
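
For context, a minimal sketch of the kind of defect described above (the structure and function names are hypothetical, not the actual UDS source): a spinlock disables preemption on the holding CPU, so any call that may sleep while the lock is held, such as mutex_lock() or a GFP_KERNEL allocation, is illegal and can corrupt scheduler and lock state, eventually surfacing as a fault like the one in the trace below.

/* Hypothetical illustration only -- not the actual UDS code. */
#include <linux/mutex.h>
#include <linux/spinlock.h>

struct callback_queue {
        spinlock_t lock;                /* protects queue state; non-sleeping context */
        struct mutex callback_mutex;    /* serializes callback delivery; may sleep */
};

static void broken_enter_callback_stage(struct callback_queue *q)
{
        spin_lock(&q->lock);
        /* ... non-sleeping queue bookkeeping is fine here ... */
        mutex_lock(&q->callback_mutex);   /* BUG: mutex_lock() may sleep, but
                                           * preemption is disabled while
                                           * q->lock is held */
        /* ... */
        mutex_unlock(&q->callback_mutex);
        spin_unlock(&q->lock);
}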

Version-Release number of selected component (if applicable):


How reproducible:

Hard to reproduce.  We exercise this code path 10,000 times in nightly testing and see the failure only once (or maybe twice).

Steps to Reproduce:
1.
2.
3.

Actual results:
[ 5336.182295] general protection fault: 0000 [#1] SMP  
[ 5336.188746] Modules linked in: kvdo(OE) uds(OE) nfsv3 nfs_acl nfs lockd grace fscache dm_mirror dm_region_hash dm_log sb_edac dm_mod edac_core intel_rapl ios$
[ 5336.238873] CPU: 0 PID: 17257 Comm: kvdo305:callbac Tainted: P        W  OE  ------------   3.10.0-693.2.2.el7.pbit3.x86_64 #1 
[ 5336.238873] task: ffff880002408000 ti: ffff88004f504000 task.ti: ffff88004f504000 
[ 5336.395388] RIP: e030:[<ffffffff810c0567>]  [<ffffffff810c0567>] wake_q_add+0x17/0x50 
[ 5336.395388] RSP: e02b:ffff88004f507dc0  EFLAGS: 00010246 
[ 5336.395388] RAX: 0000000000000000 RBX: ffff880047ca3c18 RCX: e5894855000053cf 
[ 5336.395388] RDX: 0000000000000001 RSI: e58948550000441f RDI: ffff88004f507dd0 
[ 5336.395388] RBP: ffff88004f507dc0 R08: 0000000000000000 R09: ffff880060ed1fa0 
[ 5336.395388] R10: 0000000000000000 R11: 0000000000000400 R12: ffff880047ca3c14 
[ 5336.395388] R13: ffff88004f507dd0 R14: 0000000000000000 R15: 0000000000000000 
[ 5336.395388] FS:  0000000000000000(0000) GS:ffff880063800000(0000) knlGS:ffff880063800000 
[ 5336.395388] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b 
[ 5336.395388] CR2: 000000000135b4b0 CR3: 0000000050918000 CR4: 0000000000000660 
[ 5336.395388] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[ 5336.395388] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 
[ 5336.395388] Stack: 
[ 5336.395388]  ffff88004f507e00 ffffffff816a834e 0000000000000001 ffff88004f507dd0 
[ 5336.395388]  00000000263658f1 ffff880022dde900 ffff880047ca3c10 0000000000000001 
[ 5336.395388]  ffff88004f507e10 ffffffff816a77fb ffff88004f507e30 ffffffffa023285c 
[ 5336.395388] Call Trace: 
[ 5336.395388]  [<ffffffff816a834e>] __mutex_unlock_slowpath+0x5e/0x90 
[ 5336.395388]  [<ffffffff816a77fb>] mutex_unlock+0x1b/0x20 
[ 5336.395388]  [<ffffffffa023285c>] enterCallbackStage+0x6c/0xc0 [uds] 
[ 5336.395388]  [<ffffffffa0221f80>] handleCallbacks+0xa0/0xc0 [uds] 
[ 5336.395388]  [<ffffffffa0232d86>] requestQueueWorker+0x156/0x1a0 [uds] 
[ 5336.395388]  [<ffffffffa0240090>] ? releaseSemaphore+0x70/0x70 [uds] 
[ 5336.395388]  [<ffffffffa0240114>] threadStarter+0x84/0xa0 [uds] 
[ 5336.395388]  [<ffffffff810b0a8f>] kthread+0xcf/0xe0 
[ 5336.395388]  [<ffffffff810b09c0>] ? insert_kthread_work+0x40/0x40 
[ 5336.395388]  [<ffffffff816b5018>] ret_from_fork+0x58/0x90 
[ 5336.395388]  [<ffffffff810b09c0>] ? insert_kthread_work+0x40/0x40 
[ 5336.395388] Code: f0 41 ff ff eb e6 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8d 8e b0 0f 00 00 31 c0 ba 01 00 00 00 48 89 e5 <f0> 48 0f$
[ 5336.395388] RIP  [<ffffffff810c0567>] wake_q_add+0x17/0x50 
[ 5336.395388]  RSP <ffff88004f507dc0> 
[ 5337.186198] ---[ end trace 54f65fb33337141c ]--- 
[ 5337.203780] Kernel panic - not syncing: Fatal exception 

Expected results:
no general protection fault

Additional info:

Comment 2 Jakub Krysl 2018-01-18 14:14:09 UTC
Tom,
is there an easy script that could run this 10,000 times in a reasonable time? If not, I will just do SanityOnly.
Waiting with ack for your response.

Comment 3 Thomas Jaskiewicz 2018-01-18 18:14:08 UTC
No. We have a test that does 1024 start/stop cycles of a single VDO device.  Its purpose is to ensure that VDO instance numbers behave properly, but it was the first test to see the general protection fault (after 500 cycles).

We did a code inspection based upon the stack trace of the fault, and found where the code was holding a spinlock while calling a method that would sleep.  This is a definite problem, and we have a fix for that code path.
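
The usual shape of such a fix, sketched below with the same hypothetical names as in the description (this is not the actual UDS change), is to do only non-sleeping work while the spinlock is held and to take the sleeping lock only after the spinlock has been released:

/* Hypothetical illustration of the general fix pattern -- not the actual change. */
static void fixed_enter_callback_stage(struct callback_queue *q)
{
        spin_lock(&q->lock);
        /* ... only non-sleeping bookkeeping while the spinlock is held ... */
        spin_unlock(&q->lock);

        mutex_lock(&q->callback_mutex);   /* OK: no spinlock held, sleeping is allowed */
        /* ... deliver callbacks ... */
        mutex_unlock(&q->callback_mutex);
}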

Since we put the fix in, we have found the failure very hard to reproduce.  We have had one failure that looks the same, but we do not have enough data to tell.

The fix we have is definitely needed, and SanityOnly is the best thing to do.

Comment 5 Jakub Krysl 2018-02-15 10:08:37 UTC
Could not find any regression, setting to verified.

Comment 8 errata-xmlrpc 2018-04-10 16:26:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:0900

