Bug 1429050

Summary: Gnome core dumps on startup with llvmpipe crash due to llvm failing to clear high 16bits of correct register.
Product: [Fedora] Fedora
Reporter: Jeremy Linton <jeremy.linton>
Component: llvm
Assignee: Adam Jackson <ajax>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Priority: unspecified
Version: rawhide
CC: ajax, awilliam, bos, davejohansen, dmalcolm, esteban.xandri, fschwarz, gmarr, ignatenko, jistone, nalimilan, pbrobinson, petersen, scottt.tw, tstellar
Hardware: aarch64
OS: Unspecified
Whiteboard: AcceptedFreezeException
Clone Of:
: 1461815 1461818
Last Closed: 2017-03-29 05:04:52 UTC
Type: Bug
Bug Blocks: 245418, 1349185, 1394837
Attachments:
Description Flags
Fix aarch64 relocation none

Description Jeremy Linton 2017-03-03 23:02:39 UTC
Description of problem:

Starting GNOME on an ARM64 machine without hardware GL support results in a crash with large pointers in llvmpipe JIT'ed code. Running the mesa unit tests results in failures too. In both cases, running the programs under valgrind makes the problem disappear.

../../../../bin/test-driver: line 107:  7204 Segmentation fault      (core dumped) "$@" > $log_file 2>&1
FAIL: lp_test_format
../../../../bin/test-driver: line 107:  7212 Segmentation fault      (core dumped) "$@" > $log_file 2>&1
FAIL: lp_test_arit
PASS: lp_test_blend
PASS: lp_test_conv
PASS: lp_test_printf
============================================================================
Testsuite summary for Mesa 17.0.0
============================================================================
# TOTAL: 5
# PASS:  3
# SKIP:  0
# XFAIL: 0
# FAIL:  2
# XPASS: 0
# ERROR: 0
============================================================================
See src/gallium/drivers/llvmpipe/test-suite.log
Please report to https://bugs.freedesktop.org/enter_bug.cgi?product=Mesa
============================================================================





Version-Release number of selected component (if applicable):
mesa-dri-drivers-17.0.0-1.fc26.aarch64

How reproducible: 100% of the time starting gnome-shell (via vncserver) or running 'make check' in the fedora mesa repo.


Additional info:
Running `src/gallium/drivers/llvmpipe/lp_test_arit -v`

rsqrt.v4(0): ref = inf, out = inf, precision = 24.000000 bits, PASS
rsqrt.v4(1): ref = 1, out = 1, precision = 24.000000 bits, PASS
rsqrt.v4(1.00000001e-07): ref = 3162.27783, out = 3162.27783, precision = 24.000000 bits, PASS
rsqrt.v4(4): ref = 0.5, out = 0.5, precision = 24.000000 bits, PASS
rsqrt.v4(100000): ref = 0.00316227786, out = 0.00316227786, precision = 24.000000 bits, PASS
rsqrt.v4(1.00000004e+35): ref = 3.16227777e-18, out = 3.16227777e-18, precision = 24.000000 bits, PASS
rsqrt.v4(5.8799997e-39): ref = 1.30410138e+19, out = 1.30410138e+19, precision = 24.000000 bits, PASS
rsqrt.v4(inf): ref = 0, out = 0, precision = 24.000000 bits, PASS
Segmentation fault (core dumped)

The backtrace looks like:

(gdb) bt
#0  0x0000ffff9b2400d0 in ?? ()
#1  0x0000000000000001 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

It seems that this is a call to the function returned from 
gallivm_jit_function(struct gallivm_state *gallivm,
                     LLVMValueRef func)

(gdb) disassemble 0x0000ffff9b2400c0,0x0000ffff9b2400e0
Dump of assembler code from 0xffff9b2400c0 to 0xffff9b2400e0:
   0x0000ffff9b2400c0:  scvtf   s1, w12
   0x0000ffff9b2400c4:  movk    x11, #0x9b25, lsl #16
   0x0000ffff9b2400c8:  movk    x9, #0x1c
   0x0000ffff9b2400cc:  movk    x10, #0x20
=> 0x0000ffff9b2400d0:  ldr     s7, [x13]
   0x0000ffff9b2400d4:  fmadd   s2, s1, s2, s5
   0x0000ffff9b2400d8:  movk    x11, #0x24
   0x0000ffff9b2400dc:  ldr     s5, [x9]
(gdb) info registers
x0             0x51250a0        85086368
x1             0x51250c0        85086400
x2             0xffff9b240000   281473284571136
x3             0x4531e0 4534752
x4             0x515b870        85309552
x5             0x0   0
x6             0xffffffffff     1099511627775
x7             0x514dcb0        85253296
x8             0x80000000       2147483648
x9             0xffff9b25001c   281473284636700
x10            0xffff9b250020   281473284636704
x11            0xffff9b25000c   281473284636684
x12            0x80000000       2147483648
x13            0xffffffff9b250014       -1692073964
x14            0x0   0
x15            0x2   2
x16            0xffff9b2001c8   281473284309448
x17            0xffff988f57f8   281473241274360
x18            0x0   0
x19            0x51250c0        85086400
x20            0x1   1
x21            0x512ddc0        85122496
x22            0x7f800000       2139095040
x23            0x51250a0        85086368
x24            0x1   1
x25            0x0   0
x26            0x1   1
x27            0x0   0
x28            0x4065e0 4220384
x29            0xfffff4eefc50   281474791046224
x30            0x406f1c 4222748
sp             0xfffff4eefc50   0xfffff4eefc50
pc             0xffff9b2400d0   0xffff9b2400d0
cpsr           0x60000000       [ EL=0 C Z ]
fpsr           0x0   0
fpcr           0x0   0

(more to come)

Comment 1 Jeremy Linton 2017-03-06 23:09:49 UTC
OK, I found a fun pile of unit test failures in llvm too (starting with the fact that ARM64 can't DC against a page without write permissions). The bottom line is that there is a LLVMPipe code generation problem. It's trying to load the constant 2.44331568e-05 into s7 (in this example), and it's loading the address of that constant 16 bits at a time into x13 with movk's, but it fails to load the top 16 bits, leaving whatever happened to be in that register stale.

Interestingly, it seems that it has code which is trying to clear the top 16 bits as well, but the target register (xzr in this case!) seems to be incorrect.
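
To illustrate the failure mode described above, here is a minimal Python sketch of MOVZ/MOVK semantics. It is a model under stated assumptions, not the actual JIT output: the four-instruction sequence, the helper names, and the stale starting value of x13 are all hypothetical. Only the target address and the resulting bad pointer are taken from the register dump in the description.

```python
MASK64 = (1 << 64) - 1

def movz(regs, rd, imm16, shift):
    """MOVZ: write imm16 at the given shift and zero every other bit.
    A write to xzr (the zero register) is discarded."""
    if rd != "xzr":
        regs[rd] = (imm16 << shift) & MASK64

def movk(regs, rd, imm16, shift):
    """MOVK: replace one 16-bit field and keep the other 48 bits."""
    if rd != "xzr":
        keep = regs[rd] & ~(0xFFFF << shift) & MASK64
        regs[rd] = keep | (imm16 << shift)

def load_address(regs, rd, addr, first_rd=None):
    """Build a 64-bit address 16 bits at a time.
    first_rd lets us model the first instruction hitting the wrong register."""
    movz(regs, rd if first_rd is None else first_rd, (addr >> 48) & 0xFFFF, 48)
    for shift in (32, 16, 0):
        movk(regs, rd, (addr >> shift) & 0xFFFF, shift)
    return regs[rd]

ADDR = 0x0000FFFF9B250014    # intended address, by analogy with x9..x11 in the dump
STALE = 0xFFFFFFFFFFFFFFFF   # assumption: whatever was left in x13 beforehand

good = load_address({"x13": STALE}, "x13", ADDR)
bad = load_address({"x13": STALE}, "x13", ADDR, first_rd="xzr")

print(hex(good))  # 0xffff9b250014
print(hex(bad))   # 0xffffffff9b250014, the x13 value in the register dump
```

With the clearing instruction redirected to xzr, the three MOVKs still fill in the low 48 bits correctly, so the pointer looks almost right; only the stale top 16 bits make the load fault.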


The code in question is in lp_build_sin_or_cos() and looks like:
LLVMValueRef coscof_p0 = lp_build_const_vec(gallivm, bld->type, 2.443315711809948E-005);


Setting GALLIVM_DEBUG="nopt" makes the problem go away!

Comment 2 Jeremy Linton 2017-03-08 15:30:43 UTC
Possible upstream fix here:

https://reviews.llvm.org/D27609

Comment 3 Jeremy Linton 2017-03-09 17:46:24 UTC
Changing component to llvm.

Comment 4 Jeremy Linton 2017-03-09 19:22:56 UTC
Created attachment 1261680 [details]
Fix aarch64 relocation
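
For readers wondering how a relocation could change a destination register, the following Python sketch shows one plausible mechanism. It is an assumption for illustration and is not taken from the attachment: it supposes the runtime linker patches the top-16-bit immediate of a MOVZ by OR-ing the shifted address into the instruction word without masking. The function names are hypothetical; the instruction layout (Rd in bits 4:0, imm16 in bits 20:5, shift selector in bits 22:21) is the standard AArch64 MOVZ encoding.

```python
MOVZ_X = 0xD2800000        # 64-bit MOVZ with hw=0, imm16=0, Rd=0
IMM16_FIELD = 0x001FFFE0   # bits 20:5 hold the 16-bit immediate

def encode_movz(rd, imm16, hw):
    """Assemble MOVZ Xrd, #imm16, LSL #(16*hw)."""
    return MOVZ_X | (hw << 21) | (imm16 << 5) | rd

def decode(insn):
    """Pull the fields of interest back out of the instruction word."""
    return {"rd": insn & 0x1F, "imm16": (insn >> 5) & 0xFFFF, "hw": (insn >> 21) & 0x3}

def relocate_unmasked(insn, value):
    """Hypothetical buggy patching: address bits 47:43 land in the Rd field."""
    insn &= ~IMM16_FIELD & 0xFFFFFFFF
    return insn | (value >> 43)

def relocate_masked(insn, value):
    """Same patching, but only address bits 63:48 reach the instruction."""
    insn &= ~IMM16_FIELD & 0xFFFFFFFF
    return insn | ((value >> 43) & IMM16_FIELD)

ADDR = 0x0000FFFF9B250014             # 48-bit address with bits 47:43 all set
insn = encode_movz(rd=13, imm16=0, hw=3)

buggy = decode(relocate_unmasked(insn, ADDR))
fixed = decode(relocate_masked(insn, ADDR))

print(buggy)  # {'rd': 31, 'imm16': 0, 'hw': 3}
print(fixed)  # {'rd': 13, 'imm16': 0, 'hw': 3}
```

Register number 31 in this encoding is xzr, so under this assumption an address with bits 47:43 all set turns the instruction that should clear x13 into a write that is discarded. Addresses that stay below 2^43 would never trigger it.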

Comment 5 Jeremy Linton 2017-03-10 15:55:31 UTC
To be clear, the patch is against LLVM 3.9.1 in F26 and rawhide.

Comment 6 Fedora Update System 2017-03-15 20:27:07 UTC
mesa-17.0.1-2.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-701f4d0d08

Comment 7 Fedora Update System 2017-03-16 00:51:25 UTC
mesa-17.0.1-2.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-701f4d0d08

Comment 8 Peter Robinson 2017-03-17 14:58:01 UTC
For reference there was a similar problem seen with "root" package building that was fixed by this too. Reference: https://pagure.io/releng/issue/6653

Comment 9 Jeremy Linton 2017-03-21 15:58:12 UTC
I updated my hikey to the latest rawhide last night and VNC/gnome-shell/firefox were working. I will run a clean F26 install in the next couple days on seattle/juno.

Comment 10 Fedora Blocker Bugs Application 2017-03-23 13:13:04 UTC
Proposed as a Freeze Exception for 26-alpha by Fedora user pbrobinson using the blocker tracking app because:

 This causes the GNOME desktop to crash on aarch64 when using the llvmpipe driver, which is used for a number of use cases on aarch64.

Comment 11 Fedora Update System 2017-03-23 16:39:28 UTC
mesa-17.0.2-1.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-741d36d0b1

Comment 12 Dennis Gilmore 2017-03-23 18:14:33 UTC
+1 FE

Comment 13 Adam Williamson 2017-03-24 21:24:01 UTC
Would this affect llvmpipe usage on x86_64 as well?

Comment 14 Peter Robinson 2017-03-25 09:09:23 UTC
(In reply to Adam Williamson from comment #13)
> Would this affect llvmpipe usage on x86_64 as well?

No, it was an llvm issue confined to aarch64-specific code paths.

Comment 15 Geoffrey Marr 2017-03-27 17:12:28 UTC
Discussed during the 2017-03-27 blocker review meeting: [1]

The decision was made to classify this bug as an AcceptedFreezeException, as it would be nice to have this fixed in the Alpha release.

[1] https://meetbot.fedoraproject.org/fedora-blocker-review/2017-03-27/f26-blocker-review.2017-03-27-16.01.txt

Comment 16 Fedora Update System 2017-03-29 05:04:52 UTC
mesa-17.0.1-2.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

Comment 17 Fedora Update System 2017-04-01 17:18:40 UTC
mesa-17.0.2-1.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

Comment 18 Jeremy Linton 2017-05-19 15:16:45 UTC
Not directly a 48-bit VA problem, but definitely aggravated by it.