Bug 2187824 - Folding@Home client crashes after Fedora 38 upgrade
Summary: Folding@Home client crashes after Fedora 38 upgrade
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: llvm
Version: 38
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Tulio Magno Quites Machado Filho
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2023-04-18 19:32 UTC by Brad Jackson
Modified: 2023-04-27 12:26 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-04-27 12:26:05 UTC
Type: ---
Embargoed:



Description Brad Jackson 2023-04-18 19:32:02 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/113.0
Build Identifier: 

Working properly on Fedora 37, but after upgrade to 38, running the fahclient init script causes:

CommandLine Error: Option 'use-dbg-addr' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
Aborted (core dumped)

This appears to be a bug in LLVM 16.0.0 that's fixed in 16.0.1. It's unknown what other applications are also broken by this bug.
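
One way to narrow down where a duplicate registration could come from is to list which shared libraries carry the option name at all. The following is a minimal sketch, assuming the option name is stored as a literal string inside the library file; `libs_with_option` is a hypothetical helper written for this example, not part of any package:

```shell
#!/bin/sh
# List files in a directory that contain the given option name as a
# literal byte string. Two hits among libraries loaded by one process
# would be consistent with the "registered more than once" abort.
# Hypothetical helper, for illustration only.
libs_with_option() {
    dir=$1
    opt=$2
    # -l: print only the names of matching files
    # -F: treat the pattern as a fixed string, not a regex
    # --: protect against option names that start with a dash
    grep -lF -- "$opt" "$dir"/*.so* 2>/dev/null | sort
}

# Example (run on the affected machine):
#   libs_with_option /usr/lib64 use-dbg-addr
```

A match only shows that a library contains the string; it does not prove that the library registers the option at load time.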

Reproducible: Always

Steps to Reproduce:
1. Install Fedora 38
2. Install Folding@Home client RPM from https://foldingathome.org/start-folding/?lng=en-US
3. Start the fahclient service with `service fahclient start`, or by running `/usr/bin/FAHClient /etc/fahclient/config.xml --run-as fahclient --pid-file=/var/run/fahclient.pid`

Actual Results:  
CommandLine Error: Option 'use-dbg-addr' registered more than once!
LLVM ERROR: inconsistency in registered CommandLine options
Aborted (core dumped)

strace excerpt:
openat(AT_FDCWD, "/lib64/libSPIRV-Tools-opt.so", O_RDONLY|O_CLOEXEC) = 4
read(4, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832
newfstatat(4, "", {st_mode=S_IFREG|0755, st_size=2110640, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 2068536, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 4, 0) = 0x7fbe6c406000
mmap(0x7fbe6c45b000, 1400832, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x55000) = 0x7fbe6c45b000
mmap(0x7fbe6c5b1000, 270336, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x1ab000) = 0x7fbe6c5b1000
mmap(0x7fbe6c5f3000, 53248, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 4, 0x1ec000) = 0x7fbe6c5f3000
close(4)                                = 0
mprotect(0x7fbe6c842000, 53248, PROT_READ) = 0
mprotect(0x7fbe6c9c4000, 241664, PROT_READ) = 0
mprotect(0x7fbe6c5f3000, 49152, PROT_READ) = 0
mprotect(0x7fbe6c8a7000, 16384, PROT_READ) = 0
mprotect(0x7fbe6cd6a000, 8192, PROT_READ) = 0
mprotect(0x7fbe6cd7a000, 4096, PROT_READ) = 0
mprotect(0x7fbe6cd8f000, 4096, PROT_READ) = 0
mprotect(0x7fbe6cec4000, 4096, PROT_READ) = 0
mprotect(0x7fbe5c65d000, 700416, PROT_READ) = 0
mprotect(0x7fbe6ceaa000, 4096, PROT_READ) = 0
mprotect(0x7fbe635bc000, 7102464, PROT_READ) = 0
mprotect(0x7fbe6cdb9000, 8192, PROT_READ) = 0
mprotect(0x7fbe6cdd1000, 4096, PROT_READ) = 0
mprotect(0x7fbe67cce000, 3153920, PROT_READ) = 0
mprotect(0x7fbe6cdee000, 4096, PROT_READ) = 0
mprotect(0x7fbe6cca8000, 188416, PROT_READ) = 0
futex(0x7fbe6c8506fc, FUTEX_WAKE_PRIVATE, 2147483647) = 0
futex(0x7fbe6c850708, FUTEX_WAKE_PRIVATE, 2147483647) = 0
brk(0x241a000)                          = 0x241a000
brk(0x243b000)                          = 0x243b000
lseek(2, 0, SEEK_CUR)                   = -1 ESPIPE (Illegal seek)
newfstatat(2, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x2), ...}, AT_EMPTY_PATH) = 0
brk(0x245e000)                          = 0x245e000
write(2, ": CommandLine Error: Option '", 29: CommandLine Error: Option ') = 29
write(2, "use-dbg-addr", 12use-dbg-addr)            = 12
write(2, "' registered more than once!\n", 29' registered more than once!
) = 29
write(2, "LLVM ERROR: inconsistency in reg"..., 60LLVM ERROR: inconsistency in registered CommandLine options
) = 60
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
gettid()                                = 2120428
getpid()                                = 2120428
tgkill(2120428, 2120428, SIGABRT)       = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=2120428, si_uid=0} ---
+++ killed by SIGABRT (core dumped) +++
Aborted (core dumped)
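
The excerpt above appears to mix the program's stderr into the trace, since both were written to the terminal. A sketch of keeping them apart with strace's `-o` option and then listing the shared objects the process opened; `opened_libs` and the file name `fah.strace` are hypothetical names chosen for this example:

```shell
#!/bin/sh
# Print the unique shared-object paths successfully opened in a strace log.
# Matches lines such as:
#   openat(AT_FDCWD, "/lib64/libSPIRV-Tools-opt.so", O_RDONLY|O_CLOEXEC) = 4
opened_libs() {
    # 1. keep openat() calls
    # 2. drop calls that failed (return value -1)
    # 3. extract quoted paths ending in .so or .so.<digits>
    # 4. strip the quotes and de-duplicate
    grep 'openat(' "$1" \
        | grep -v ' = -1 ' \
        | grep -oE '"[^"]+\.so(\.[0-9.]+)?"' \
        | tr -d '"' \
        | sort -u
}

# Example (run on the affected machine; -f follows child processes,
# -o writes the trace to a file instead of the terminal):
#   strace -f -o fah.strace /usr/bin/FAHClient /etc/fahclient/config.xml \
#       --run-as fahclient --pid-file=/var/run/fahclient.pid
#   opened_libs fah.strace
```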

Expected Results:  
The init script should start normally

I was able to work around this, but the root cause and the minimal set of required steps are not clear. What I did:

1. Built LLVM 16.0.1 from the Rawhide srpm. Four minor tests failed on my system, so I had to use --nocheck to bypass them
2. Installed newly built LLVM 16.0.1 rpms
3. Built spirv-llvm-translator from Rawhide srpm, editing SOURCES/0001-Fix-standalone-builds-with-LLVM_LINK_LLVM_DYLIB-ON.patch as described at https://www.linuxquestions.org/questions/slackware-14/llvm-16-and-static-libraries-4175723703/#post6422334
4. Installed newly built spirv-llvm-translator rpm
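
The steps above might look roughly like this on the command line. This is a sketch under assumptions, not the reporter's actual commands: `run`, `unpack_srpm`, `build_spec` and the `DRY_RUN` variable are hypothetical helpers, and the default `~/rpmbuild` tree is assumed. Editing the patch file in SOURCES remains a manual step between unpacking and building.

```shell
#!/bin/sh
# Sketch of a rebuild-from-srpm workflow.
# With DRY_RUN=1 the commands are printed instead of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Unpack a source RPM; by default its contents land in
# ~/rpmbuild/SOURCES and ~/rpmbuild/SPECS.
unpack_srpm() {
    run rpm -i "$1"
}

# Build binary and source packages from a spec file. Pass "nocheck" as
# the second argument to skip the %check stage (the package test suite).
build_spec() {
    spec=$1
    if [ "${2:-}" = nocheck ]; then
        run rpmbuild -ba --nocheck "$spec"
    else
        run rpmbuild -ba "$spec"
    fi
}

# Example (srpm paths are placeholders you supply yourself):
#   unpack_srpm "$LLVM_SRPM"
#   build_spec ~/rpmbuild/SPECS/llvm.spec nocheck
#   unpack_srpm "$TRANSLATOR_SRPM"
#   # ...edit the patch under ~/rpmbuild/SOURCES by hand...
#   build_spec ~/rpmbuild/SPECS/spirv-llvm-translator.spec
```

The resulting packages are written under `~/rpmbuild/RPMS/` and can then be installed with dnf.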

Comment 1 Tulio Magno Quites Machado Filho 2023-04-19 13:43:04 UTC
@bjackson0971 , if this has been fixed with LLVM 16.0.1, could you check if the following update fixes the issue for you, please?
https://bodhi.fedoraproject.org/updates/FEDORA-2023-36b95f852a

Comment 2 Brad Jackson 2023-04-19 14:33:48 UTC
(In reply to Tulio Magno Quites Machado Filho from comment #1)
> @bjackson0971 , if this has been fixed with LLVM 16.0.1, could you
> check if the following update fixes the issue for you, please?
> https://bodhi.fedoraproject.org/updates/FEDORA-2023-36b95f852a

The crash still happens with this llvm and llvm-libs 16.0.1-1.fc38 build and the stock spirv-llvm-translator package. I have to also install my patched spirv-llvm-translator srpm build to fix the crash.

I also tried downgrading llvm and llvm-libs to 16.0.0-2.fc38 and kept my patched spirv-llvm-translator, and that also fixes it. It appears the problem is actually in the translator package.

Comment 3 Tulio Magno Quites Machado Filho 2023-04-19 14:56:18 UTC
@bjackson0971 , Could you confirm which version of spirv-llvm-translator is installed when the issue happens, please?

Comment 4 Brad Jackson 2023-04-19 15:05:32 UTC
(In reply to Tulio Magno Quites Machado Filho from comment #3)
> @bjackson0971 , Could you confirm which version of
> spirv-llvm-translator is installed when the issue happens, please?

Version spirv-llvm-translator-16.0.0-1.fc38.x86_64 is the only stock version available for Fedora 38 and the crash happens with it installed. My patched srpm build is the only fix I've found.

Comment 5 Fedora Update System 2023-04-19 16:13:03 UTC
FEDORA-2023-36b95f852a has been submitted as an update to Fedora 38. https://bodhi.fedoraproject.org/updates/FEDORA-2023-36b95f852a

Comment 6 František Zatloukal 2023-04-19 16:14:47 UTC
I've updated the spirv-llvm-translator downstream patch, which should, together with llvm 16.0.1, address the issue.

Comment 7 Brad Jackson 2023-04-19 16:37:15 UTC
Confirmed that spirv-llvm-translator-16.0.0-2.fc38 fixes the crash with both llvm-16.0.0-2.fc38 and llvm-16.0.1-1.fc38.
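
To check whether a machine already has a translator build at or above the one confirmed here, something like the following could be used. It is a sketch: `vr_at_least` is a hypothetical helper, and it relies on GNU `sort -V`, which approximates RPM's version comparison but is not identical to it.

```shell
#!/bin/sh
# Succeed if version-release $1 is at least $2, according to GNU sort -V.
# Hypothetical helper; not a substitute for RPM's own comparison rules.
vr_at_least() {
    have=$1
    want=$2
    # The smaller of the two sorts first; if that is $want (or both are
    # equal), then $have is new enough.
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}

# Example (run on the affected machine):
#   installed=$(rpm -q --qf '%{VERSION}-%{RELEASE}' spirv-llvm-translator)
#   vr_at_least "$installed" 16.0.0-2.fc38 && echo "translator has the fix"
```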

Comment 8 Fedora Update System 2023-04-20 04:29:28 UTC
FEDORA-2023-36b95f852a has been pushed to the Fedora 38 testing repository.
Soon you'll be able to install the update with the following command:
`sudo dnf upgrade --enablerepo=updates-testing --refresh --advisory=FEDORA-2023-36b95f852a`
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2023-36b95f852a

See also https://fedoraproject.org/wiki/QA:Updates_Testing for more information on how to test updates.

Comment 9 Dr. David Alan Gilbert 2023-04-23 15:03:04 UTC
This seems to fix my pyopencl program ( https://github.com/ali1234/vhs-teletext ) - but the performance is AWFUL - at least 3/4 times slower than f37.

Comment 10 Tulio Magno Quites Machado Filho 2023-04-24 12:42:11 UTC
(In reply to Dr. David Alan Gilbert from comment #9)
> This seems to fix my pyopencl program (
> https://github.com/ali1234/vhs-teletext ) - but the performance is AWFUL -
> at least 3/4 times slower than f37.

David, could you report this performance issue in a new bug, please?
We'll need some details, e.g.:
1. How can I reproduce this slowdown? i.e. which steps do I have to execute. A small reproducer is ideal.
2. If you can profile the code before and after, even better.
3. Details about the execution, e.g. processor, OS used before and after.

Comment 11 Dr. David Alan Gilbert 2023-04-24 13:03:25 UTC
(In reply to Tulio Magno Quites Machado Filho from comment #10)
> (In reply to Dr. David Alan Gilbert from comment #9)
> > This seems to fix my pyopencl program (
> > https://github.com/ali1234/vhs-teletext ) - but the performance is AWFUL -
> > at least 3/4 times slower than f37.
> 
> David, could you report this performance issue in a new bug, please?

Sure, will do - what component would you like it against?

> We'll need some details, e.g.:
> 1. How can I reproduce this slowdown? i.e. which steps do I have to execute.
> A small reproducer is ideal.

It's tricky, since I've only got the one OpenCL application I've been using
regularly and have perf numbers for; it is open but you need a datafile to process with it.

> 2. If you can profile the code before and after, even better.

There's very little host CPU usage (before or after), so I assume it's one of:
   a) The SPIR code generated (except that I tried forcing the old code in and that's still slow
    as far as I can tell)
   b) The translation of the SPIR to the native Radeon
   c) Something else in the environment (but I have tried downgrading the kernel to f37)

Tips on profiling of the GPU behaviour are welcome.

> 3. Details about the execution, e.g. processor, OS used before and after.

Sure.

Dave

Comment 12 Tulio Magno Quites Machado Filho 2023-04-24 13:29:55 UTC
(In reply to Dr. David Alan Gilbert from comment #11)
> Sure, will do - what component would you like it against?

LLVM is fine. We can change that later as we get more details.

> It's tricky, since I've only got the one OpenCL application I've been using
> regularly and have perf numbers for; it is open but you need a datafile to
> process with it.

No problem. All we need is to reproduce and debug the issue.

Comment 13 Fedora Update System 2023-04-25 01:53:19 UTC
FEDORA-2023-36b95f852a has been pushed to the Fedora 38 stable repository.
If the problem still persists, please make note of it in this bug report.

Comment 14 Dr. David Alan Gilbert 2023-04-27 11:57:37 UTC
(In reply to Tulio Magno Quites Machado Filho from comment #12)
> (In reply to Dr. David Alan Gilbert from comment #11)
> > Sure, will do - what component would you like it against?
> 
> LLVM is fine. We can change that later as we get more details.
> 
> > It's tricky, since I've only got the one OpenCL application I've been using
> > regularly and have perf numbers for; it is open but you need a datafile to
> > process with it.
> 
> No problem. All we need is to reproduce and debug the issue.

This was a red herring, so it's actually all fine - sorry for the noise.
(The speed is data dependent, and I'd previously seen ranges of 700-1200 lps
on this test; the recovered data on the day I upgraded to f38 triggered a
case of ~200 lps, and I'd never seen anything that bad before.
Bad luck that it happened the same day.)

Comment 15 Tulio Magno Quites Machado Filho 2023-04-27 12:26:05 UTC
(In reply to Dr. David Alan Gilbert from comment #14)
> This was a red herring, so it's actually all fine - sorry for the noise.
> (The speed is data dependent, and I'd previously seen ranges of 700-1200 lps
> on this test; the recovered data on the day I upgraded to f38 triggered a
> case of ~200 lps, and I'd never seen anything that bad before.
> Bad luck that it happened the same day.)

Great!
Thanks for the update!

