Bug 2218022 - Frequent Graphics System Crashes with AMD 7900 + OpenCL application E@H
Summary: Frequent Graphics System Crashes with AMD 7900 + OpenCL application E@H
Keywords:
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: rocm-opencl
Version: 38
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Jeremy Newton
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2023-06-27 21:07 UTC by Paul DeStefano
Modified: 2023-07-22 09:00 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:
Type: ---
Embargoed:


Attachments
Journal logs of a representative but elaborated failure (1.74 MB, text/plain)
2023-06-27 21:09 UTC, Paul DeStefano

Description Paul DeStefano 2023-06-27 21:07:54 UTC
With a 7900 GPU installed and Einstein@Home running, the system is unstable to the point of being unusable.  It runs for as little as 15 minutes or as long as several hours; a whole day is possible, but rare.  I will include logs.  In short: AMDGPU reports an error, then some type of hardware reset is attempted, the resets fail, and all video on the system hangs and then crashes, taking down the desktop.  The whole system doesn't crash or reset, though; it stays running and on the network.

Reproducible: Always

Steps to Reproduce:
1. Install BOINC & Einstein@Home
2. Install 7900
3. Crashes randomly, but often, under load.

All *I* have to do is swap out my 6800XT for my 7900XTX.
Actual Results:  
Faults reported in/by AMDGPU and OpenCL application.

Expected Results:  
This is a well-known load that hasn't changed in over a decade, although it is not a benchmark.  The software should work as it has in the past, and as it does on older hardware.

I have finally confirmed that the same GPU works with the same OpenCL application on Ubuntu LTS with the official AMDGPU installer.  It's not clear how much of the OSS stack is used in that setup, but clearly the official ROCm pkgs are being installed.  AMDGPU-PRO may or may not be used; can't tell.  This may be an AMDGPU issue or ROCm issue.

I realize Fedora isn't officially supported by AMD, but past experience shows that Fedora is usually quite stable with the OSS stack for AMD GPUs.

Comment 1 Paul DeStefano 2023-06-27 21:09:26 UTC
Created attachment 1972901
Journal logs of a representative but elaborated failure
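
For readers triaging a journal export like the one attached, the following is a minimal sketch of a line filter, assuming the export is plain text with one message per line. The keyword list and the function name `filter_lines` are hypothetical and were not taken from the attached log; adjust them to whatever the log actually contains.

```python
import re
import sys

# Assumed keywords; these are illustrative and not taken from the attached log.
KEYWORDS = ("amdgpu",)


def filter_lines(lines, keywords=KEYWORDS):
    """Return the lines that mention any keyword, case-insensitively."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


if __name__ == "__main__":
    # Read journal text from stdin and print only the matching lines.
    for matched in filter_lines(sys.stdin):
        print(matched)
```

Saved under a name of your choosing (for example `filter_journal.py`), it could be fed kernel messages from the current boot with `journalctl -k | python3 filter_journal.py`, or the attached text file on stdin.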

Comment 2 Jeremy Newton 2023-07-07 20:04:55 UTC
Sorry, I'm a bit busy. Can you try installing the ROCm 5.6 packages from rawhide and see if that works?
I don't mind back-porting ROCm 5.6 to Fedora 38 if it resolves this issue easily.

Comment 3 Paul DeStefano 2023-07-08 01:01:18 UTC
Yes!  I will do that; more soon...

Comment 4 Paul DeStefano 2023-07-08 21:52:50 UTC
Thanks for your help, Jeremy.

Well, I tried, but the rawhide ROCm pkgs depend on a newer glibc, so dnf pulled it in from rawhide, too.  Consequently, the whole device was unavailable in clinfo; it didn't even show up as a platform.  glxinfo still saw it.  I reverted with distrosync, and things are back to normal.  (Although I now have a slightly newer glibc than I did before; I think 2.37 must have dropped for f38 very recently, because I did an update just two days ago.)

I was told by another Fedora maintainer *never* to pull from another release version in this way, and I can see why doing so for libs, especially libc, is a bad idea.  But, I couldn't think of another way to do what you asked.

I think a testing build would be better.  I assume from the fact that you didn't do this already that it is not as easy as it should be?

The other option is that I commit to rawhide.  I keep meaning to do that, but, as a test, I've been running rawhide in a VM for many years, and it periodically experiences fatal consequences after weekly updates.  This makes me concerned.

In any case, let me know what you think.

Comment 5 Jeremy Newton 2023-07-08 21:59:24 UTC
Ah sorry, I should have thought of that.

Please use my copr:
https://copr.fedorainfracloud.org/coprs/mystro256/rocm-hip/

I use it for testing the packages before I put them into rawhide, as I don't actually run Fedora rawhide on my local machine.

The repo is usually in flux, so your mileage may vary.

Comment 6 Paul DeStefano 2023-07-10 08:14:09 UTC
I installed your ROCm packages.  Thank you for making this copr.  They seem to work very well on my 6800XT.

I haven't had time to swap in the 7900 yet, but I'll try to do that this week or weekend.  Fingers crossed.

Comment 7 Jeremy Newton 2023-07-10 17:32:50 UTC
No problem! I use it for my own testing; I might advertise it more for people looking to use the latest ROCm.
I'm trying to avoid back-porting ROCm too much unless there's something broken, e.g. this ticket.

Comment 8 Paul DeStefano 2023-07-22 09:00:19 UTC
Just an update: I'm still working on testing it.  A new problem has cropped up, so testing is delayed.  I'll get to it ASAP.  I'm on your upstream repo, though, so I'm prepared.

I did want to say that rc1 & rc2 have been unstable.  The OpenCL problem causes a video crash, but doesn't really hurt the system.  When rc1/rc2 have crashed, it has caused FS errors that require fsck, which isn't the worst thing, but is concerning, takes a bit of effort to fix, and could potentially cause a serious problem.  These crashes also produce no information; the journal is damaged and nothing useful survives. FYI.

