Description of problem:
Starting GNOME on an ARM64 machine without hardware GL support results in a crash involving large pointers in llvmpipe JIT'ed code. Running the Mesa unit tests results in failures too. In both cases, running the programs under valgrind makes the problem disappear.
../../../../bin/test-driver: line 107: 7204 Segmentation fault (core dumped) "$@" > $log_file 2>&1
../../../../bin/test-driver: line 107: 7212 Segmentation fault (core dumped) "$@" > $log_file 2>&1
Testsuite summary for Mesa 17.0.0
# TOTAL: 5
# PASS: 3
# SKIP: 0
# XFAIL: 0
# FAIL: 2
# XPASS: 0
# ERROR: 0
Please report to https://bugs.freedesktop.org/enter_bug.cgi?product=Mesa
Version-Release number of selected component (if applicable):
How reproducible: 100% of the time starting gnome-shell (via vncserver) or running 'make check' in the fedora mesa repo.
Steps to Reproduce:
Running `src/gallium/drivers/llvmpipe/lp_test_arit -v`
rsqrt.v4(0): ref = inf, out = inf, precision = 24.000000 bits, PASS
rsqrt.v4(1): ref = 1, out = 1, precision = 24.000000 bits, PASS
rsqrt.v4(1.00000001e-07): ref = 3162.27783, out = 3162.27783, precision = 24.000000 bits, PASS
rsqrt.v4(4): ref = 0.5, out = 0.5, precision = 24.000000 bits, PASS
rsqrt.v4(100000): ref = 0.00316227786, out = 0.00316227786, precision = 24.000000 bits, PASS
rsqrt.v4(1.00000004e+35): ref = 3.16227777e-18, out = 3.16227777e-18, precision = 24.000000 bits, PASS
rsqrt.v4(5.8799997e-39): ref = 1.30410138e+19, out = 1.30410138e+19, precision = 24.000000 bits, PASS
rsqrt.v4(inf): ref = 0, out = 0, precision = 24.000000 bits, PASS
Segmentation fault (core dumped)
The backtrace looks like:
#0 0x0000ffff9b2400d0 in ?? ()
#1 0x0000000000000001 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
It seems that this is a call to the function returned from
gallivm_jit_function(struct gallivm_state *gallivm,
(gdb) disassemble 0x0000ffff9b2400c0,0x0000ffff9b2400e0
Dump of assembler code from 0xffff9b2400c0 to 0xffff9b2400e0:
0x0000ffff9b2400c0: scvtf s1, w12
0x0000ffff9b2400c4: movk x11, #0x9b25, lsl #16
0x0000ffff9b2400c8: movk x9, #0x1c
0x0000ffff9b2400cc: movk x10, #0x20
=> 0x0000ffff9b2400d0: ldr s7, [x13]
0x0000ffff9b2400d4: fmadd s2, s1, s2, s5
0x0000ffff9b2400d8: movk x11, #0x24
0x0000ffff9b2400dc: ldr s5, [x9]
(gdb) info registers
x0 0x51250a0 85086368
x1 0x51250c0 85086400
x2 0xffff9b240000 281473284571136
x3 0x4531e0 4534752
x4 0x515b870 85309552
x5 0x0 0
x6 0xffffffffff 1099511627775
x7 0x514dcb0 85253296
x8 0x80000000 2147483648
x9 0xffff9b25001c 281473284636700
x10 0xffff9b250020 281473284636704
x11 0xffff9b25000c 281473284636684
x12 0x80000000 2147483648
x13 0xffffffff9b250014 -1692073964
x14 0x0 0
x15 0x2 2
x16 0xffff9b2001c8 281473284309448
x17 0xffff988f57f8 281473241274360
x18 0x0 0
x19 0x51250c0 85086400
x20 0x1 1
x21 0x512ddc0 85122496
x22 0x7f800000 2139095040
x23 0x51250a0 85086368
x24 0x1 1
x25 0x0 0
x26 0x1 1
x27 0x0 0
x28 0x4065e0 4220384
x29 0xfffff4eefc50 281474791046224
x30 0x406f1c 4222748
sp 0xfffff4eefc50 0xfffff4eefc50
pc 0xffff9b2400d0 0xffff9b2400d0
cpsr 0x60000000 [ EL=0 C Z ]
fpsr 0x0 0
fpcr 0x0 0
(more to come)
OK, found a fun pile of unit test failures in LLVM too (starting with the fact that ARM64 can't DC against a page without write permissions). The bottom line is that there is an llvmpipe code generation problem. It's trying to load the constant 2.44331568e-05 into s7 (in this example), and it's loading the address of that constant 16 bits at a time into x13 with movk instructions, but it fails to load the top 16 bits, leaving whatever stale value happens to be in that part of the register.
Interestingly, it seems that it has code which is trying to clear the top 16 bits as well, but the target register (xzr in this case!) seems to be incorrect.
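The failure mode described above can be modelled in a few lines of C. This is only a sketch, not the LLVM or Mesa code: the helper names (`movk`, `movz`, `load_address_broken`, `load_address_sound`) are hypothetical, the order of the field writes is an assumption, and the intended address 0x0000ffff9b250014 is inferred from the neighbouring registers x9 to x11 in the dump rather than stated anywhere in this report.

```c
#include <assert.h>
#include <stdint.h>

/* Model of AArch64 MOVK: replace one 16-bit field, keep every other bit. */
static uint64_t movk(uint64_t reg, uint16_t imm, unsigned shift)
{
    assert(shift < 64 && shift % 16 == 0);
    uint64_t mask = (uint64_t)0xffff << shift;
    return (reg & ~mask) | ((uint64_t)imm << shift);
}

/* Model of AArch64 MOVZ: write one 16-bit field and zero all the others. */
static uint64_t movz(uint16_t imm, unsigned shift)
{
    assert(shift < 64 && shift % 16 == 0);
    return (uint64_t)imm << shift;
}

/* Broken sequence: only MOVKs, so bits 48..63 keep the stale contents. */
static uint64_t load_address_broken(uint64_t stale)
{
    uint64_t x13 = stale;            /* whatever was left in the register */
    x13 = movk(x13, 0x0014, 0);
    x13 = movk(x13, 0x9b25, 16);
    x13 = movk(x13, 0xffff, 32);
    return x13;                      /* top 16 bits were never written */
}

/* Sound sequence: start with MOVZ so unwritten fields are zero. */
static uint64_t load_address_sound(void)
{
    uint64_t x13 = movz(0x0014, 0);
    x13 = movk(x13, 0x9b25, 16);
    x13 = movk(x13, 0xffff, 32);
    return x13;
}
```

If the register is assumed to hold all ones beforehand, `load_address_broken(UINT64_MAX)` reproduces the x13 value from the register dump (0xffffffff9b250014), while `load_address_sound()` yields an address in the same range as x9 to x11. The broken sequence only gives the right answer when the stale top bits happen to be zero.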
The Mesa code in question is in lp_build_sin_or_cos() and looks like:
LLVMValueRef coscof_p0 = lp_build_const_vec(gallivm, bld->type, 2.443315711809948E-005);
Setting GALLIVM_DEBUG="nopt" makes the problem go away!
Possible upstream fix here:
Changing component to llvm.
Created attachment 1261680 [details]
Fix aarch64 relocation
To be clear the patch is against LLVM 3.9.1 in F26 & rawhide.
mesa-17.0.1-2.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-701f4d0d08
mesa-17.0.1-2.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-701f4d0d08
For reference there was a similar problem seen with "root" package building that was fixed by this too. Reference: https://pagure.io/releng/issue/6653
I updated my hikey to the latest rawhide last night and VNC/gnome-shell/firefox were working. I will run a clean F26 install in the next couple days on seattle/juno.
Proposed as a Freeze Exception for 26-alpha by Fedora user pbrobinson using the blocker tracking app because:
This causes the GNOME desktop to crash on aarch64 when using the llvmpipe driver, which is used for a number of use cases on aarch64.
mesa-17.0.2-1.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-741d36d0b1
Would this affect llvmpipe usage on x86_64 as well?
(In reply to Adam Williamson from comment #13)
> Would this affect llvmpipe usage on x86_64 as well?
No, it was an LLVM issue that was explicitly in the aarch64 code paths.
Discussed during the 2017-03-27 blocker review meeting: 
The decision was made to classify this bug as an AcceptedFreezeException, as it would be nice to have this fixed in the Alpha release.
mesa-17.0.1-2.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.
mesa-17.0.2-1.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.
Not directly a 48-bit VA problem, but definitely aggravated by it.