Created attachment 1844506 [details] slpvec.C The following testcase is miscompiled on s390x with g++ -fPIC -fvisibility-inlines-hidden -ffunction-sections -fdata-sections -O2 -fPIC -fno-exceptions -fno-rtti -std=c++14 -mlong-double-128 -march=z13 -mtune=z14 both with the RHEL gcc 8.x and with upstream 8.5.0. When miscompiled, it prints something like __insertion_sort 0x3ffd74fd310 0x3ffd74fd348 0xdeadbeefcafebabe 0xdeadbeefcafebabe __insertion_sort 0x3ffd74fd348 0x3ffd74fd348 0x10006b8 0xdeadbeefcafebabe rather than __insertion_sort 0x3ffd74fd310 0x3ffd74fd348 0x10006b8 0xdeadbeefcafebabe __insertion_sort 0x3ffd74fd348 0x3ffd74fd348 0x10006b8 0xdeadbeefcafebabe The interesting part is below, .cfi_* directives removed for brevity. On entry, this function has 3 pointers in %r2, %r3 and %r4 registers, and %r5 is pointer to the 16-byte function_ref<decltype(foo)> - object with trivially copyable class containing 2 8-byte members. _ZSt24__merge_sort_with_bufferIPPvS1_N4llvm12function_refIFbS0_S0_EEEEvT_S6_T0_T1_: stmg %r6,%r15,48(%r15) lgr %r14,%r15 lay %r15,-248(%r15) aghi %r14,-32 std %f8,0(%r14) std %f12,8(%r14) std %f14,16(%r14) std %f9,24(%r14) sgrk %r11,%r3,%r2 lgr %r1,%r4 srag %r13,%r11,3 agr %r1,%r11 lmg %r8,%r9,0(%r5) stmg %r8,%r9,160(%r15) ! The above stores the whole 16-byte function_ref correctly to %r15+160 cgijle %r11,48,.L13 vlvgp %v0,%r8,%r9 ldgr %f9,%r1 ldgr %f12,%r4 la %r1,200(%r15) lgr %r10,%r3 stg %r11,176(%r15) ldgr %f8,%r2 lgr %r6,%r9 vlgvg %r7,%v0,1 stmg %r8,%r9,184(%r15) ! So does the above lgr %r8,%r1 .L14: la %r11,56(%r2) lgr %r4,%r8 lgr %r3,%r11 stmg %r6,%r7,200(%r15) ! But this one actually stores both 8-byte words the same to %r15+160, and %r15+200 is passed as %r4 to the function brasl %r14,_ZSt16__insertion_sortIPPvN4llvm12function_refIFbS0_S0_EEEEvT_S6_T0_@PLT In *.postreload, we have still correct: (insn 16 12 166 2 (set (reg/v:TI 16 %f0 [orig:69 __comp ] [69]) (reg:TI 8 %r8)) 1268 {movti} (nil)) ... (insn 137 136 140 3 (set (reg/v:TI 6 %r6 [orig:69 __comp ] [69]) (reg/v:TI 16 %f0 [orig:69 __comp ] [69])) 1268 {movti} (nil)) The code spills it to 128-bit %f0 register and loads it back from it. Next, split2 pass splits the latter (but not the former) into: (insn 167 136 168 3 (set (reg:DI 6 %r6 [ __comp ]) (reg:DI 16 %f0)) 1269 {*movdi_64} (nil)) (insn 168 167 140 3 (set (reg:DI 7 %r7 [orig:69 __comp+8 ] [69]) (unspec:DI [ (reg:V2DI 16 %f0) (const_int 1 [0x1]) ] UNSPEC_VEC_EXTRACT)) 402 {*vec_extractv2di} (nil)) and finally cprop_hardreg seeing (insn 187 188 186 3 (set (reg/v:TI 16 %f0 [orig:69 __comp ] [69]) (reg:TI 8 %r8)) 1268 {movti} (nil)) changes insn 167 to: (insn 167 136 168 3 (set (reg:DI 6 %r6 [ __comp ]) (reg:DI 9 %r9 [16])) 1269 {*movdi_64} (nil)) I'm not sure if this is a bug in the ; Split a VR -> GPR TImode move into 2 vector load GR from VR element. ; For the higher order bits we do simply a DImode move while the ; second part is done via vec extract. Both will end up as vlgvg. (define_split [(set (match_operand:TI 0 "register_operand" "") (match_operand:TI 1 "register_operand" ""))] "TARGET_VX && reload_completed && GENERAL_REG_P (operands[0]) && VECTOR_REG_P (operands[1])" [(set (match_dup 2) (match_dup 4)) (set (match_dup 3) (unspec:DI [(match_dup 5) (const_int 1)] UNSPEC_VEC_EXTRACT))] { operands[2] = operand_subword (operands[0], 0, 0, TImode); operands[3] = operand_subword (operands[0], 1, 0, TImode); operands[4] = gen_rtx_REG (DImode, REGNO (operands[1])); operands[5] = gen_rtx_REG (V2DImode, REGNO (operands[1])); }) splitter, in cprop_hardreg or the s390x representation of those TImodes in floating point registers. In GCC 9 it got "fixed" with https://gcc.gnu.org/r9-3763-gef976be1a23a517 but that just means it went latent. And I can't reproduce it even with upstream GCC 9 branch with r9-3763 reverted - some RA decisions changed. But that doesn't mean the problem isn't latent even on the trunk, certainly the above splitter is still there.
Andreas, do you think you could have a look, at least to point us in the right direction. Thanks.
I' wondering if it is correct that our back-end accepts TI->DI mode changes in TARGET_CAN_CHANGE_MODE_CLASS. Just doing that mode change while leaving the register number as is would result in dealing with the high word part of the TImode. So it probably would be wrong for all big endian targets to return true in that case. When cprop records value copies it rejects such mode switches as part of the copy on big endian targets. When it replaces the source of an assignment it consults the TARGET_CAN_CHANGE_MODE_CLASS. Rejecting the TI->DI change here should fix the problem. I don't know how much fallout this will produce. I'll run a regression test with that.
Just trying to understand what s390x is doing __attribute__((noipa)) __int128_t foo (__int128_t a) { asm volatile ("" : "+f" (a)); asm volatile ("" : "+r" (a)); return a; } __attribute__((noipa)) long long bar (__int128_t a) { long long b; asm volatile ("" : "=f" (b) : "0" (a)); asm volatile ("" : "+r" (b)); return b; } int main () { __int128_t a = ((__int128_t) 0x0102030405060708ULL << 64) | 0x090a0b0c0d0e0f00ULL; foo (a); bar (a); return 0; } it seems that when a TImode value appears in a floating point register, it contains V2DImode { high-64-bits, low-64-bits }, but when accessed using ldgr it reads the first half (i.e. high part), rather than what the compiler would expect (low part of the register). So yes, dealing with it in s390_can_change_mode_class does look like a way to go, after all, it already has some /* Values residing in VRs are little-endian style. All modes are placed left-aligned in an VR. This means that we cannot allow switching between modes with differing sizes. Also if the vector facility is available we still place TFmode values in VR register pairs, since the only instructions we have operating on TFmodes only deal with register pairs. Therefore we have to allow DFmode subregs of TFmodes to enable the TFmode splitters. */ code.
Working around this problem by disabling TI->DI mode changes in can_change_mode didn't work out. The hook then would also disable all DImode subregs of TImode regs. But this is needed when splitting TImode operations. What about doing the following in crop instead. Boostraps find on s390x without causing testsuite regressions: diff --git a/gcc/regcprop.c b/gcc/regcprop.c index 18132425ab2..402478b8292 100644 --- a/gcc/regcprop.c +++ b/gcc/regcprop.c @@ -402,7 +402,10 @@ maybe_mode_change (machine_mode orig_mode, machine_mode copy_mode, if (orig_mode == new_mode) return gen_raw_REG (new_mode, regno); - else if (mode_change_ok (orig_mode, new_mode, regno)) + else if (mode_change_ok (orig_mode, new_mode, regno) + && (hard_regno_nregs (copy_regno, copy_mode) == + hard_regno_nregs (regno, copy_mode) + || !maybe_ne (subreg_lowpart_offset (new_mode, copy_mode), 0U))) { int copy_nregs = hard_regno_nregs (copy_regno, copy_mode); int use_nregs = hard_regno_nregs (copy_regno, new_mode);
Here are my thoughts about the problem. Regprop was designed long time ago on assumption that narrow mode refers for low part of the value which is wrong for s390x vregs (at least when we are looking at the splitter). We could modify regprop to work w/o the assumption. But it is too risky change here as it might need many modifications and we can skip to change some regcprop code working on the assumption. More doable solution is to check the assumption in the place of actual transformation and reject it when the assumption does not hold. That is what Andreas patch is trying to do indirectly but actually prohibiting this optimization for all big endian targets. The right solution here would be introducing an undocumented hook (something narrow_mode_refers_low_part_p(regno) with default value true) and use it in the same place proposed by Andreas. The hook should be temporary probably as the right solution in the long run should be a change of regcprop to work w/o the assumption.
Thanks Vlad. Andreas, will you be able to update the patch to use a hook? I suppose it could go upstream first.
(In reply to Marek Polacek from comment #6) > Thanks Vlad. Andreas, will you be able to update the patch to use a hook? > I suppose it could go upstream first. Ok. https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588407.html
(In reply to Andreas Krebbel from comment #7) > (In reply to Marek Polacek from comment #6) > > Thanks Vlad. Andreas, will you be able to update the patch to use a hook? > > I suppose it could go upstream first. > > Ok. > https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588407.html Awesome, thanks.
In today's meeting we concluded that we'll backport this version https://gcc.gnu.org/pipermail/gcc-patches/2022-January/588407.html into RHEL 8 GCC.
That's the patch I've posted for upstream 2022-01-21: https://gcc.gnu.org/pipermail/gcc-patches/2022-January/589036.html
This is the patch I ended up committing after testing on s390x. It fixes a typo: hook_bool_unit_true -> hook_bool_uint_true, and since that hook doesn't exist yet in GCC 8, I've added it, which looked obvious enough. --- a/gcc/config/s390/s390.c +++ b/gcc/config/s390/s390.c @@ -10488,6 +10488,18 @@ s390_hard_regno_mode_ok (unsigned int regno, machine_mode mode) return false; } +/* Implement TARGET_NARROW_MODE_REFERS_LOW_PART_P. */ + +static bool +s390_narrow_mode_refers_low_part_p (unsigned int regno) +{ + if (reg_classes_intersect_p (VEC_REGS, REGNO_REG_CLASS (regno))) + return false; + + return true; +} + + /* Implement TARGET_MODES_TIEABLE_P. */ static bool @@ -16956,6 +16968,9 @@ s390_case_values_threshold (void) #undef TARGET_CASE_VALUES_THRESHOLD #define TARGET_CASE_VALUES_THRESHOLD s390_case_values_threshold +#undef TARGET_NARROW_MODE_REFERS_LOW_PART_P +#define TARGET_NARROW_MODE_REFERS_LOW_PART_P s390_narrow_mode_refers_low_part_p + struct gcc_target targetm = TARGET_INITIALIZER; #include "gt-s390.h" --- a/gcc/regcprop.c +++ b/gcc/regcprop.c @@ -426,7 +426,8 @@ maybe_mode_change (machine_mode orig_mode, machine_mode copy_mode, if (orig_mode == new_mode) return gen_raw_REG (new_mode, regno); - else if (mode_change_ok (orig_mode, new_mode, regno)) + else if (mode_change_ok (orig_mode, new_mode, regno) + && targetm.narrow_mode_refers_low_part_p (regno)) { int copy_nregs = hard_regno_nregs (copy_regno, copy_mode); int use_nregs = hard_regno_nregs (copy_regno, new_mode); --- a/gcc/target.def +++ b/gcc/target.def @@ -5446,6 +5446,16 @@ value that the middle-end intended.", bool, (machine_mode from, machine_mode to, reg_class_t rclass), hook_bool_mode_mode_reg_class_t_true) +/* This hook is used to work around a problem in regcprop. Hardcoded +assumptions currently prevent it from working correctly for targets +where the low part of a multi-word register doesn't align to accessing +the register with a narrower mode. */ +DEFHOOK_UNDOC +(narrow_mode_refers_low_part_p, +"", +bool, (unsigned int regno), +hook_bool_uint_true) + /* Change pseudo allocno class calculated by IRA. */ DEFHOOK (ira_change_pseudo_allocno_class, --- a/gcc/hooks.h +++ b/gcc/hooks.h @@ -86,6 +86,7 @@ extern void hook_void_tree (tree); extern void hook_void_tree_treeptr (tree, tree *); extern void hook_void_int_int (int, int); extern void hook_void_gcc_optionsp (struct gcc_options *); +extern bool hook_bool_uint_true (unsigned int); extern bool hook_bool_uint_uintp_false (unsigned int, unsigned int *); extern int hook_int_uint_mode_1 (unsigned int, machine_mode); --- a/gcc/hooks.c +++ b/gcc/hooks.c @@ -498,6 +498,14 @@ hook_void_gcc_optionsp (struct gcc_optio { } +/* Generic hook that takes an unsigned int and returns true. */ + +bool +hook_bool_uint_true (unsigned int) +{ + return true; +} + /* Generic hook that takes an unsigned int, an unsigned int pointer and returns false. */
Sadly the patch in Comment 11 does not seem to work. Andreas, do you have any ideas why? The new hook s390_narrow_mode_refers_low_part_p *is* called, so I don't know what's going on.
Sorry, my bad. The hook is invoked with the wrong reg actually. Please change it like so: diff --git a/gcc/regcprop.c b/gcc/regcprop.c index f457108b028..5585b759fb9 100644 --- a/gcc/regcprop.c +++ b/gcc/regcprop.c @@ -403,7 +403,7 @@ maybe_mode_change (machine_mode orig_mode, machine_mode copy_mode, if (orig_mode == new_mode) return gen_raw_REG (new_mode, regno); else if (mode_change_ok (orig_mode, new_mode, regno) - && targetm.narrow_mode_refers_low_part_p (regno)) + && targetm.narrow_mode_refers_low_part_p (copy_regno)) { int copy_nregs = hard_regno_nregs (copy_regno, copy_mode); int use_nregs = hard_regno_nregs (copy_regno, new_mode); I've verified that with that additional change the output for Jakubs testcase changes as expected: < lgr %r6,%r9 --- > lgdr %r6,%f0 r9 contained the low part of the r8/r9 register pair while the new instruction loads the high part into r6 as expected. Sorry for not testing this more carefully and making you do an extra round.
(In reply to Andreas Krebbel from comment #14) > Sorry for not testing this more carefully and making you do an extra round. Thanks! and no worries. It works now: # g++ -fPIC -fvisibility-inlines-hidden -ffunction-sections -fdata-sections -O2 -fPIC -fno-exceptions -fno-rtti -std=c++14 -mlong-double-128 -march=z13 -mtune=z14 slpvec.C # ./a.out __insertion_sort 0x3ffc76fd590 0x3ffc76fd5c8 0x10006b8 0xdeadbeefcafebabe __insertion_sort 0x3ffc76fd5c8 0x3ffc76fd5c8 0x10006b8 0xdeadbeefcafebabe
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (gcc bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:2072