Bug 2227061

Summary: uint64_t-variant patch interferes with rocfft build
Product: Fedora
Component: rocclr
Version: 39
Status: NEW
Reporter: Tim Flink <tflink>
Assignee: Jeremy Newton <alexjnewt>
CC: alexjnewt, trix
Severity: unspecified
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Type: Bug

Description Tim Flink 2023-07-27 16:00:50 UTC
There is a patch in the rocclr package which, according to its comments, was added to help with building Blender. It turns out that the patch interferes with the current packaging effort for rocfft: the rocfft kernel cache build doesn't work. The upstream issue is https://github.com/ROCmSoftwarePlatform/rocFFT/issues/422

The patch in question is https://src.fedoraproject.org/rpms/rocclr/blob/rawhide/f/0001-add-uint64_t-variant-for-__ffsll.patch

Reproducing is simple - build the upstream project on a Fedora system using Fedora packaged dependencies and the build process will fail. If rocclr is rebuilt without the uint64_t-variant patch, the rocfft build finishes (eventually - the kernel cache process takes a long time - at least an hour on my system).

Disabling the kernel cache isn't a realistic solution, because kernels that aren't cached will be built at runtime instead. On my system (Ryzen 7 5700X, 64GB memory), the kernel cache build takes over an hour. While building the kernels at runtime wouldn't take that long, it still seems like an unreasonable demand of users.

Additionally, I have been trying to triage an issue with rocfft, built without the cached kernels, that was exclusive to Fedora. Simple code from the documentation (https://rocfft.readthedocs.io/en/rocm-5.6.0/#example) would throw errors and 100% of the test suite would fail with rocfft built against rocclr as packaged in Fedora. It isn't filed anywhere because I was still triaging the issue.

When I built rocfft against the de-patched rocclr, the example code no longer errored out, and it produced the same results as on Debian sid (built with Debian packaged dependencies) and on RHEL 9.2 (built with AMD supplied dependencies). Additionally, the test suite passes when rocfft is built against rocclr without the patch.

I can provide details on the runtime errors I was seeing if desired.

Comment 1 Fedora Release Engineering 2023-08-16 08:06:01 UTC
This bug appears to have been reported against 'rawhide' during the Fedora Linux 39 development cycle.
Changing version to 39.

Comment 2 Jeremy Newton 2023-08-16 19:42:22 UTC
I believe Tom Rix landed this patch and I pulled it in.
I can revert it, but I would need to look into this a bit more to understand the issue.

Sorry, I've been sick, so I'm still un-burying myself from unanswered emails.

Comment 3 Tom Rix 2023-08-16 21:30:16 UTC
This change was to fix a build error with blender.
Reverting it will likely break blender again.
IIRC - the default handler was a template for any int type, while the rocm handler handled just 2 types.
A better solution would be for the rocm handler to be more like the template handler and handle any int type.
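
For illustration only, here is a minimal sketch of what "handle any int type" could look like. It is plain host C++, not the actual HIP or rocclr code, and the name find_first_set is hypothetical. It assumes the usual ffs convention: return the 1-based position of the lowest set bit, or 0 if no bit is set.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical stand-in for a "find first set" helper (not the real __ffsll).
// A single template accepts every integer type, so uint64_t, unsigned long
// and unsigned long long all resolve to it without extra overloads.
template <typename T>
int find_first_set(T value) {
    static_assert(std::is_integral<T>::value, "integer types only");
    // Work on the unsigned representation so right shifts are well defined.
    using U = typename std::make_unsigned<T>::type;
    U bits = static_cast<U>(value);
    if (bits == 0) {
        return 0;  // ffs convention: 0 means no bit is set
    }
    int position = 1;  // positions are 1-based
    while ((bits & U{1}) == 0) {
        bits = static_cast<U>(bits >> 1);
        ++position;
    }
    return position;
}
```

With a template like this, a call such as find_first_set(std::uint64_t{8}) compiles the same way regardless of whether uint64_t is a typedef for unsigned long or unsigned long long on the platform, which is the property the template handler was described as having.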