Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1579155

Summary: Nvidia driver crashes with RT Kernel 3.10 on starting X.
Product: Red Hat Enterprise MRG    Reporter: gsamaiya
Component: realtime-kernel    Assignee: Red Hat Real Time Maintenance <rt-maint>
Status: CLOSED NOTABUG QA Contact: Jiri Kastner <jkastner>
Severity: high Docs Contact:
Priority: high    
Version: 2.5CC: bhu, crwood, gsamaiya, jkachuck, lgoncalv, williams
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-07-24 13:33:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1575505, 1619417, 1639487    
Bug Blocks: 1507957, 1547686    
Attachments:
Description Flags
UNTESTED patch to change nv_spinlock_t none

Description gsamaiya 2018-05-17 06:00:13 UTC
Description of problem: Nvidia driver crashes with RT Kernel 3.10 on starting X.


Version-Release number of selected component (if applicable):
RT Kernel 3.10

How reproducible:
Install the r375 Nvidia driver (or a later one, once bug 1575505 is fixed in the kernel).
Start the X server; the Nvidia driver crash reproduces.
Call Trace:
BUG: scheduling while atomic: swapper/0/0/0x00010002
Hardware name: Hewlett-Packard HP Z840 Workstation/2129, BIOS M60 v02.41 01/25/2018
 ffffffff81951ae0 ffff880244603b78 ffffffff8162dbe6 ffff880244603b88
 ffffffff810a350b ffff880244603c18 ffffffff8162f495 0000000000000000
 ffffffff81930000 ffffffff81930010 ffffffff81930000 ffffffff81933fd8
Call Trace:
 <IRQ>  [<ffffffff8162dbe6>] dump_stack+0x19/0x1b
 [<ffffffff810a350b>] __schedule_bug+0x4b/0x60
 [<ffffffff8162f495>] __schedule+0x7e5/0x800
 [<ffffffff8162f7a4>] schedule+0x34/0xa0
 [<ffffffff81630b2d>] rt_spin_lock_slowlock+0x13d/0x340
 [<ffffffff81631646>] rt_spin_lock+0x26/0x30
 [<ffffffffa023d7ce>] os_acquire_spinlock+0x1e/0x30 [nvidia]
 [<ffffffffa07683be>] _nv019576rm+0x35e/0x5f0 [nvidia]
 [<ffffffffa07686d3>] ? _nv019568rm+0x73/0x700 [nvidia]
 [<ffffffffa07cdce0>] ? _nv017491rm+0x40/0x310 [nvidia]
 [<ffffffffa07caa0b>] ? _nv017511rm+0xbb/0xe0 [nvidia]
 [<ffffffffa07d2f2c>] ? rm_isr+0x7c/0x130 [nvidia]
 [<ffffffffa0231a5f>] ? nvidia_isr+0x7f/0x100 [nvidia]

Steps to Reproduce: As mentioned above.

Actual results:
System doesn't respond and may hang.


Expected results:
System shouldn't hang.


Additional info:
With kernel 4.9, this crash is no longer seen, even with the same driver.
From our analysis, it appears that the NV driver is in ISR context and the kernel is trying to schedule the ISR thread.

We would like to better understand what changed between the 3.10 and 4.9 RT kernel patches that could fix this issue.

Comment 1 Joseph Kachuck 2018-05-17 17:45:51 UTC
Hello Nvidia,
Please provide the following information:
Which RHEL 3.10 kernel are you seeing the issue in? Please confirm that kernel-rt-3.10.0-693.31.1.rt56.620.el6rt shows this issue.
Which RHEL 4.9 kernel are you not seeing the issue in?

Please let me know if you would be able to provide the entire core dump. If it is too large to attach, please upload to:
https://access.redhat.com/solutions/2112

Please also attach a sosreport from the system directly after seeing this issue.

Thank You
Joe Kachuck

Comment 2 Clark Williams 2018-05-17 18:43:07 UTC
This is a typical problem on the RT kernel. The RT patchset (PREEMPT_RT) converts a regular spinlock_t into an rtmutex, which is a sleeping lock. If the lock in question doesn't have a long scope (i.e. it is acquired and released without a long time interval between them), then the quick fix is to convert the lock to a raw_spinlock_t, which is the same lock type on both stock Linux and RT Linux. If the code section being protected by the lock may be held for a long time, then we'll need to look at other mutual exclusion mechanisms, since the whole point of the RT kernel is to remain preemptible.
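To illustrate the quick fix described above, here is a minimal, hypothetical kernel-style sketch (not the actual Nvidia code; the handler and lock names are invented): a lock taken from a hard-IRQ handler is declared raw_spinlock_t, so it stays a true busy-wait lock even on PREEMPT_RT and never triggers "scheduling while atomic".

```c
/* Hypothetical sketch: a lock shared with an interrupt handler must not
 * become a sleeping lock on PREEMPT_RT, so declare it raw_spinlock_t
 * instead of spinlock_t. */
#include <linux/interrupt.h>
#include <linux/spinlock.h>

static raw_spinlock_t demo_lock;   /* remains a spinning lock on RT */

static irqreturn_t demo_isr(int irq, void *dev_id)
{
    unsigned long flags;

    /* raw_spin_lock_irqsave() never sleeps, so it is safe here even
     * though we are in hard-IRQ (atomic) context. */
    raw_spin_lock_irqsave(&demo_lock, flags);
    /* ... touch short-lived shared state ... */
    raw_spin_unlock_irqrestore(&demo_lock, flags);

    return IRQ_HANDLED;
}
```

The trade-off, as noted above, is that a raw spinlock disables preemption for its entire critical section, so this is only appropriate when the lock is held briefly.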

Comment 3 gsamaiya 2018-05-18 06:47:36 UTC
(In reply to Joseph Kachuck from comment #1)
> Hello Nvidia,
> Please provide the following information:
> Which RHEL 3.10 kernel are you seeing the issue in? Please confirm that
> kernel-rt-3.10.0-693.31.1.rt56.620.el6rt shows this issue.
> Which RHEL 4.9 kernel are you not seeing the issue in?
> 
> Please let me know if you would be able to provide the entire core dump. If
> it is too large to attach, please upload to:
> https://access.redhat.com/solutions/2112
> 
> Please also attach a sosreport from the system directly after seeing this
> issue.
> 
> Thank You
> Joe Kachuck

Failing Kernel Versions:
Customer reported on 3.10.0-693.17.1.rt56b.604 and we locally reproduced on 3.10.0-514.rt56.228.el6rt.x86_64.

Passing Kernel Version: 4.9.84-rt62.

Comment 5 Clark Williams 2018-05-23 23:01:20 UTC
Would you attach the driver source to this bz, or at least point to where the abstractions for things like os_acquire_spinlock() and family are defined?

I suspect that it's using spinlock_t and for RT we will probably want it to use raw_spinlock_t. That depends a lot on the scope of the regions being protected by the locks.

Comment 7 Clark Williams 2018-05-30 16:15:04 UTC
Created attachment 1445953 [details]
UNTESTED patch to change nv_spinlock_t

Note: this is an untested patch, intended to illustrate what may be needed to get the Nvidia drivers working on an RT kernel.
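Since the attachment itself is not shown in this dump, here is a hedged sketch of the shape such a patch might take (the NV_SPIN_* macro names are invented for illustration; only nv_spinlock_t comes from the attachment title): the driver's lock abstraction selects raw spinlocks when the kernel is built with PREEMPT_RT, and plain spinlocks otherwise.

```c
/* Hypothetical sketch of the patch's approach (the real attachment may
 * differ): pick the lock type behind the driver's abstraction at build
 * time, so IRQ-context paths never hit a sleeping lock on RT. */
#include <linux/spinlock.h>

#if defined(CONFIG_PREEMPT_RT_FULL)
typedef raw_spinlock_t nv_spinlock_t;
#define NV_SPIN_LOCK_INIT(lock)          raw_spin_lock_init(lock)
#define NV_SPIN_LOCK_IRQ(lock, flags)    raw_spin_lock_irqsave((lock), (flags))
#define NV_SPIN_UNLOCK_IRQ(lock, flags)  raw_spin_unlock_irqrestore((lock), (flags))
#else
typedef spinlock_t nv_spinlock_t;
#define NV_SPIN_LOCK_INIT(lock)          spin_lock_init(lock)
#define NV_SPIN_LOCK_IRQ(lock, flags)    spin_lock_irqsave((lock), (flags))
#define NV_SPIN_UNLOCK_IRQ(lock, flags)  spin_unlock_irqrestore((lock), (flags))
#endif
```

Routing every lock operation through one abstraction like this keeps the non-RT build unchanged while making the RT build safe in atomic context.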

Comment 8 gsamaiya 2018-06-12 17:42:57 UTC
Thanks for the patch.
We will try to make a similar change in our driver and test it.

But from the test results that we have so far, the 4.9.84 kernel seems to be passing for us.
Is the spinlock implementation the same between the 4.9.84 and 3.10 kernels, and are we just getting lucky not to hit the issue with 4.9.84?

Comment 9 Clark Williams 2018-06-12 18:05:04 UTC
I suspect it's luck :)

The RT patch is the same wrt locking, so it's bound to be code paths that have changed. It's entirely possible that 4.9 RT dodges this, but it might come back in 4.16+. The best thing would be to change locks that might be taken in atomic context (e.g. nv_lock_t) to raw spinlocks, at least in the PREEMPT_RT case.

Comment 10 gsamaiya 2018-07-24 10:34:26 UTC
Thanks for all the help.
We changed our locks to raw_spinlock_t and things are pretty good.
We can close this bug.

Comment 11 Beth Uptagrafft 2018-07-24 13:33:58 UTC
(In reply to gsamaiya from comment #10)
> Thanks for all the help.
> We changed our locks to raw_spinlock_t and things are pretty good.
> We can close this bug.

Glad to hear we were able to help and that things are working better. Will close this bug as you suggest.