Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1579155

Summary: Nvidia driver crashes with RT Kernel 3.10 on starting X.
Product: Red Hat Enterprise MRG    Reporter: gsamaiya
Component: realtime-kernel    Assignee: Red Hat Real Time Maintenance <rt-maint>
Status: CLOSED NOTABUG QA Contact: Jiri Kastner <jkastner>
Severity: high Docs Contact:
Priority: high    
Version: 2.5CC: bhu, crwood, gsamaiya, jkachuck, lgoncalv, williams
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-07-24 13:33:58 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1575505, 1619417, 1639487    
Bug Blocks: 1507957, 1547686    
Attachments:
Description Flags
UNTESTED patch to change nv_spinlock_t none

Description gsamaiya 2018-05-17 06:00:13 UTC
Description of problem: Nvidia driver crashes with RT Kernel 3.10 on starting X.


Version-Release number of selected component (if applicable):
RT Kernel 3.10

How reproducible:
Install the r375 Nvidia driver (or a later one, once bug 1575505 is fixed in the kernel).
Start the X server; the Nvidia driver crash reproduces.
Call Trace:
BUG: scheduling while atomic: swapper/0/0/0x00010002
Hardware name: Hewlett-Packard HP Z840 Workstation/2129, BIOS M60 v02.41 01/25/2018
 ffffffff81951ae0 ffff880244603b78 ffffffff8162dbe6 ffff880244603b88
 ffffffff810a350b ffff880244603c18 ffffffff8162f495 0000000000000000
 ffffffff81930000 ffffffff81930010 ffffffff81930000 ffffffff81933fd8
Call Trace:
 <IRQ>  [<ffffffff8162dbe6>] dump_stack+0x19/0x1b
 [<ffffffff810a350b>] __schedule_bug+0x4b/0x60
 [<ffffffff8162f495>] __schedule+0x7e5/0x800
 [<ffffffff8162f7a4>] schedule+0x34/0xa0
 [<ffffffff81630b2d>] rt_spin_lock_slowlock+0x13d/0x340
 [<ffffffff81631646>] rt_spin_lock+0x26/0x30
 [<ffffffffa023d7ce>] os_acquire_spinlock+0x1e/0x30 [nvidia]
 [<ffffffffa07683be>] _nv019576rm+0x35e/0x5f0 [nvidia]
 [<ffffffffa07686d3>] ? _nv019568rm+0x73/0x700 [nvidia]
 [<ffffffffa07cdce0>] ? _nv017491rm+0x40/0x310 [nvidia]
 [<ffffffffa07caa0b>] ? _nv017511rm+0xbb/0xe0 [nvidia]
 [<ffffffffa07d2f2c>] ? rm_isr+0x7c/0x130 [nvidia]
 [<ffffffffa0231a5f>] ? nvidia_isr+0x7f/0x100 [nvidia]

Steps to Reproduce: As mentioned above.

Actual results:
System doesn't respond and may hang.


Expected results:
System shouldn't hang.


Additional info:
With kernel 4.9, this crash is no longer seen, even with the same driver.
From our analysis, it appears that the NV driver is in ISR context and the kernel is trying to schedule the ISR thread.

We would like to better understand what changed between the 3.10 and 4.9 RT kernel patches that could fix this issue.

Comment 1 Joseph Kachuck 2018-05-17 17:45:51 UTC
Hello Nvidia,
Please provide the following information:
Which RHEL 3.10 kernel are you seeing the issue in? Please confirm that kernel-rt-3.10.0-693.31.1.rt56.620.el6rt shows this issue.
Which RHEL 4.9 kernel are you not seeing the issue in?

Please let me know if you would be able to provide the entire core dump. If it is too large to attach, please upload to:
https://access.redhat.com/solutions/2112

Please also attach a sosreport from the system directly after seeing this issue.

Thank You
Joe Kachuck

Comment 2 Clark Williams 2018-05-17 18:43:07 UTC
This is a typical problem on the RT kernel. The RT patchset (PREEMPT_RT) converts a regular spinlock_t into an rtmutex, which is a sleeping lock. If the lock in question doesn't have a long scope (i.e. it is acquired and released without a long time interval between them), then the quick fix is to convert the lock to a raw_spinlock_t, which is the same lock type on both stock Linux and RT Linux. If the code section being protected by the lock may be held for a long time, then we'll need to look at other mutual exclusion mechanisms, since the whole point of the RT kernel is to remain preemptible.
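To illustrate the quick fix described above, here is a minimal, hypothetical kernel-style sketch (not the actual Nvidia code; the handler and lock names are invented): a lock taken from a hard-IRQ handler is declared raw_spinlock_t, so it stays a true busy-wait lock even on PREEMPT_RT and never triggers "scheduling while atomic".

```c
/* Hypothetical sketch: a lock shared with an interrupt handler must not
 * become a sleeping lock on PREEMPT_RT, so declare it raw_spinlock_t
 * instead of spinlock_t. */
#include <linux/interrupt.h>
#include <linux/spinlock.h>

static raw_spinlock_t demo_lock;   /* remains a spinning lock on RT */

static irqreturn_t demo_isr(int irq, void *dev_id)
{
    unsigned long flags;

    /* raw_spin_lock_irqsave() never sleeps, so it is safe here even
     * though we are in hard-IRQ (atomic) context. */
    raw_spin_lock_irqsave(&demo_lock, flags);
    /* ... touch short-lived shared state ... */
    raw_spin_unlock_irqrestore(&demo_lock, flags);

    return IRQ_HANDLED;
}
```

The trade-off, as noted above, is that a raw spinlock disables preemption for its entire critical section, so this is only appropriate when the lock is held briefly.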

Comment 3 gsamaiya 2018-05-18 06:47:36 UTC
(In reply to Joseph Kachuck from comment #1)
> Hello Nvidia,
> Please provide the following information:
> Which RHEL 3.10 kernel are you seeing the issue in? Please confirm that
> kernel-rt-3.10.0-693.31.1.rt56.620.el6rt shows this issue.
> Which RHEL 4.9 kernel are you not seeing the issue in?
> 
> Please let me know if you would be able to provide the entire core dump. If
> it is too large to attach, please upload to:
> https://access.redhat.com/solutions/2112
> 
> Please also attach a sosreport from the system directly after seeing this
> issue.
> 
> Thank You
> Joe Kachuck

Failing Kernel Versions:
Customer reported on 3.10.0-693.17.1.rt56b.604 and we locally reproduced on 3.10.0-514.rt56.228.el6rt.x86_64.

Passing Kernel Version: 4.9.84-rt62.

Comment 5 Clark Williams 2018-05-23 23:01:20 UTC
Would you attach the driver source to this bz, or at least point to where the abstractions for things like os_acquire_spinlock() and family are defined?

I suspect that it's using spinlock_t and for RT we will probably want it to use raw_spinlock_t. That depends a lot on the scope of the regions being protected by the locks.

Comment 7 Clark Williams 2018-05-30 16:15:04 UTC
Created attachment 1445953 [details]
UNTESTED patch to change nv_spinlock_t

Note: this is an untested patch, intended to illustrate what may be needed to get the Nvidia drivers working on an RT kernel.
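Since the attachment itself is not shown in this dump, here is a hedged sketch of the shape such a patch might take (the NV_SPIN_* macro names are invented for illustration; only nv_spinlock_t comes from the attachment title): the driver's lock abstraction selects raw spinlocks when the kernel is built with PREEMPT_RT, and plain spinlocks otherwise.

```c
/* Hypothetical sketch of the patch's approach (the real attachment may
 * differ): pick the lock type behind the driver's abstraction at build
 * time, so IRQ-context paths never hit a sleeping lock on RT. */
#include <linux/spinlock.h>

#if defined(CONFIG_PREEMPT_RT_FULL)
typedef raw_spinlock_t nv_spinlock_t;
#define NV_SPIN_LOCK_INIT(lock)          raw_spin_lock_init(lock)
#define NV_SPIN_LOCK_IRQ(lock, flags)    raw_spin_lock_irqsave((lock), (flags))
#define NV_SPIN_UNLOCK_IRQ(lock, flags)  raw_spin_unlock_irqrestore((lock), (flags))
#else
typedef spinlock_t nv_spinlock_t;
#define NV_SPIN_LOCK_INIT(lock)          spin_lock_init(lock)
#define NV_SPIN_LOCK_IRQ(lock, flags)    spin_lock_irqsave((lock), (flags))
#define NV_SPIN_UNLOCK_IRQ(lock, flags)  spin_unlock_irqrestore((lock), (flags))
#endif
```

Routing every lock operation through one abstraction like this keeps the non-RT build unchanged while making the RT build safe in atomic context.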

Comment 8 gsamaiya 2018-06-12 17:42:57 UTC
Thanks for the patch.
We will try to make a similar change in our driver and test it.

But from the test results that we have so far, the 4.9.84 kernel seems to be passing for us.
Is the spinlock implementation the same between the 4.9.84 and 3.10 kernels, and are we just getting lucky not to hit the issue with 4.9.84?

Comment 9 Clark Williams 2018-06-12 18:05:04 UTC
I suspect it's luck :)

The RT patch is the same wrt locking, so it's bound to be code paths that have changed. It's entirely possible that 4.9 RT dodges this, but it might come back in 4.16+. The best thing would be to change locks that might be taken in atomic context (e.g. nv_lock_t) to raw spinlocks, at least in the PREEMPT_RT case.

Comment 10 gsamaiya 2018-07-24 10:34:26 UTC
Thanks for all the help.
We changed our locks to raw_spinlock_t and things are pretty good.
We can close this bug.

Comment 11 Beth Uptagrafft 2018-07-24 13:33:58 UTC
(In reply to gsamaiya from comment #10)
> Thanks for all the help.
> We changed our locks to raw_spinlock_t and things are pretty good.
> We can close this bug.

Glad to hear we were able to help and that things are working better. Will close this bug as you suggest.