Bug 1579155
| Summary: | Nvidia driver crashes with RT Kernel 3.10 on starting X. | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise MRG | Reporter: | gsamaiya | ||||
| Component: | realtime-kernel | Assignee: | Red Hat Real Time Maintenance <rt-maint> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | Jiri Kastner <jkastner> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 2.5 | CC: | bhu, crwood, gsamaiya, jkachuck, lgoncalv, williams | ||||
| Target Milestone: | --- | ||||||
| Target Release: | --- | ||||||
| Hardware: | x86_64 | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-07-24 13:33:58 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Bug Depends On: | 1575505, 1619417, 1639487 | ||||||
| Bug Blocks: | 1507957, 1547686 | ||||||
| Attachments: |
Description
gsamaiya
2018-05-17 06:00:13 UTC
Hello Nvidia,
Please provide the following information:
Which RHEL 3.10 kernel are you seeing the issue in? Please confirm that kernel-rt-3.10.0-693.31.1.rt56.620.el6rt is seeing this issue.
Which RHEL 4.9 kernel are you not seeing the issue in?

Please let me know if you would be able to provide the entire core dump. If it is too large to attach, please upload it to:
https://access.redhat.com/solutions/2112

Please also attach a sosreport from the system directly after seeing this issue.

Thank You
Joe Kachuck

This is a typical problem on the RT kernel. The RT patchset (PREEMPT_RT) converts a regular spinlock_t into an rt_mutex-based lock, which is a sleeping lock and therefore must not be taken in atomic context. If the lock in question doesn't have a long scope (i.e. it is acquired and released without a long interval in between), then the quick fix is to convert the lock to a raw_spinlock_t, which is the same lock type on both stock Linux and RT Linux. If the lock may be held for a long time, then we'll need to look at other mutual exclusion mechanisms, since the whole point of the RT kernel is to remain preemptible.

(In reply to Joseph Kachuck from comment #1)
> Hello Nvidia,
> Please provide the following information:
> Which RHEL 3.10 kernel are you seeing the issue in? Please confirm that
> kernel-rt-3.10.0-693.31.1.rt56.620.el6rt is seeing this issue.
> Which RHEL 4.9 kernel are you not seeing the issue in?
>
> Please let me know if you would be able to provide the entire core dump. If
> it is too large to attach, please upload it to:
> https://access.redhat.com/solutions/2112
>
> Please also attach a sosreport from the system directly after seeing this
> issue.
>
> Thank You
> Joe Kachuck

Failing Kernel Versions: Customer reported on 3.10.0-693.17.1.rt56b.604 and we locally reproduced on 3.10.0-514.rt56.228.el6rt.x86_64.
Passing Kernel Version: 4.9.84-rt62.

Would you attach the driver source to this bz, or at least point to where the abstractions for things like os_acquire_spinlock() and family are defined?
I suspect that it's using spinlock_t, and for RT we will probably want it to use raw_spinlock_t. That depends a lot on the scope of the regions being protected by the locks.

Created attachment 1445953 [details]
UNTESTED patch to change nv_spinlock_t
Note that this is an untested patch, meant to illustrate what may be needed to get the Nvidia drivers working on an RT kernel.
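
For readers without access to the attachment, the kind of change being discussed can be sketched roughly as follows. This is not the contents of attachment 1445953 and not Nvidia's driver code: the `NV_SPIN_*` macro names and `nv_spinlock_selfcheck()` are hypothetical, and the sketch assumes the RT kernels in question expose `CONFIG_PREEMPT_RT_FULL`. Only `nv_spinlock_t`, `spinlock_t` and `raw_spinlock_t` come from the thread. The user-space branch contains stand-ins so the fragment can be compiled and exercised outside the kernel; they model only "busy-wait, never sleep".

```c
/* Sketch only: a lock wrapper that stays a true spinning lock on RT. */
#ifdef __KERNEL__
#include <linux/spinlock.h>
#else
/* User-space stand-ins (assumption, for illustration and self-checking). */
#include <assert.h>
#include <stdatomic.h>
typedef struct { atomic_flag held; } raw_spinlock_t;
static inline void raw_spin_lock_init(raw_spinlock_t *l)
{
    atomic_flag_clear(&l->held);
}
static inline void raw_spin_lock(raw_spinlock_t *l)
{
    while (atomic_flag_test_and_set(&l->held)) {
        /* spin until the holder releases the lock */
    }
}
static inline int raw_spin_trylock(raw_spinlock_t *l)
{
    return !atomic_flag_test_and_set(&l->held); /* nonzero on success */
}
static inline void raw_spin_unlock(raw_spinlock_t *l)
{
    atomic_flag_clear(&l->held);
}
#define CONFIG_PREEMPT_RT_FULL 1 /* pretend we are building for RT */
#endif

#ifdef CONFIG_PREEMPT_RT_FULL
/* On RT, spinlock_t may sleep, so use the raw variant instead. */
typedef raw_spinlock_t nv_spinlock_t;
#define NV_SPIN_LOCK_INIT(l) raw_spin_lock_init(l)
#define NV_SPIN_LOCK(l)      raw_spin_lock(l)
#define NV_SPIN_TRYLOCK(l)   raw_spin_trylock(l)
#define NV_SPIN_UNLOCK(l)    raw_spin_unlock(l)
#else
/* On a stock kernel the ordinary spinlock is already non-sleeping. */
typedef spinlock_t nv_spinlock_t;
#define NV_SPIN_LOCK_INIT(l) spin_lock_init(l)
#define NV_SPIN_LOCK(l)      spin_lock(l)
#define NV_SPIN_TRYLOCK(l)   spin_trylock(l)
#define NV_SPIN_UNLOCK(l)    spin_unlock(l)
#endif

#ifndef __KERNEL__
/* Hypothetical self-check: a held lock cannot be taken a second time. */
int nv_spinlock_selfcheck(void)
{
    nv_spinlock_t lock;

    NV_SPIN_LOCK_INIT(&lock);
    NV_SPIN_LOCK(&lock);
    assert(!NV_SPIN_TRYLOCK(&lock)); /* already held */
    NV_SPIN_UNLOCK(&lock);
    assert(NV_SPIN_TRYLOCK(&lock));  /* free again */
    NV_SPIN_UNLOCK(&lock);
    return 1;
}
#endif
```

Because a raw spinlock disables preemption while it is held, this kind of substitution is only appropriate for short critical sections, which is the caveat raised earlier in the thread.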

Thanks for the patch. We will try to make a similar change in our driver and test it. But from the test results that we have so far, the 4.9.84 kernel seems to be passing for us. Is the spinlock implementation the same between the 4.9.84 and 3.10 kernels, and are we just getting lucky in not hitting the issue with the 4.9.84 kernel?

I suspect it's luck :) The RT patch is the same wrt locking, so it's bound to be code paths that have changed. It's entirely possible that 4.9 RT dodges this, but it might come back in 4.16+. The best thing would be to change locks that might be taken in atomic context (e.g. nv_lock_t) to raw spinlocks, at least in the PREEMPT_RT case.

Thanks for all the help.
We changed our locks to raw_spinlock_t and things are pretty good.
We can close this bug.

(In reply to gsamaiya from comment #10)
> Thanks for all the help.
> We changed our locks to raw_spinlock_t and things are pretty good.
> We can close this bug.

Glad to hear we were able to help and that things are working better. Will close this bug as you suggest.