Bug 1454543 - SIGILL in ceill() when called by liberasurecode_get_aligned_data_size()
Summary: SIGILL in ceill() when called by liberasurecode_get_aligned_data_size()
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: liberasurecode
Version: 25
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Pete Zaitcev
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-05-23 02:45 UTC by Pete Zaitcev
Modified: 2017-07-28 17:19 UTC (History)

Fixed In Version: liberasurecode-1.4.0-3.fc27 liberasurecode-1.5.0-1.fc26
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-07-28 17:19:48 UTC
Type: Bug
Embargoed:


Attachments
crash reproducer (739 bytes, text/x-csrc)
2017-05-23 02:58 UTC, Pete Zaitcev
same code but no crash (311 bytes, text/x-csrc)
2017-05-23 02:59 UTC, Pete Zaitcev


Links
System            ID      Private  Priority  Status  Summary                                          Last Updated
OpenStack gerrit  467418  0        None      MERGED  Stop using ceill() to compute padded data size   2020-07-03 14:40:23 UTC
OpenStack gerrit  467761  0        None      MERGED  Allow to disable optimizations for portability   2020-07-03 14:40:24 UTC

Description Pete Zaitcev 2017-05-23 02:45:32 UTC
Description of problem:

Program received signal SIGILL, Illegal instruction.
0x00007ffff78c44bd in liberasurecode_get_aligned_data_size (
    desc=<optimized out>, data_len=1) at erasurecode.c:1212
1212        ret = (int) ceill( (double)

Version-Release number of selected component (if applicable):

glibc-2.24-4.fc25.x86_64
liberasurecode-1.4.0-1.fc25.x86_64

How reproducible:

Synchronous 100%

Steps to Reproduce:
1. Build and run the attached test

Actual results:

SIGILL

Expected results:

Working

Additional info:

This was discovered because it happens on download from an EC container.

[root@rhev-a24c-01 misc]# more /etc/swift/swift.conf 
[swift-hash]
swift_hash_path_suffix = 583a********4258

[swift-constraints]
max_meta_count = 90
max_meta_overall_size = 4096

[storage-policy:0]
name = policy0
policy_type = replication
default = yes

[storage-policy:1]
name = ec0603
policy_type = erasure_coding
ec_type = liberasurecode_rs_vand
ec_num_data_fragments = 6
ec_num_parity_fragments = 3
ec_object_segment_size = 1048576
[root@rhev-a24c-01 misc]# 

This call to ceill() appears to be the only use of libm in all of
liberasurecode. Do we even need it there?

N.B. Invoking the exact same code directly works perfectly.
So, the problem is not ceill() as such. See other attachment.
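
On the question of whether libm is needed here at all: rounding a length
up to a multiple can be done in integer arithmetic. The following is only
a sketch, not the upstream function; the name aligned_data_size and the
parameter multiple (standing in for whatever alignment the backend
requires) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round data_len up to the next multiple of
 * `multiple` without any floating-point or libm calls. */
uint64_t aligned_data_size(uint64_t data_len, uint64_t multiple)
{
    assert(multiple > 0);

    /* Count whole blocks, then add one more for a partial block.
     * This avoids the overflow that (data_len + multiple - 1) can hit. */
    uint64_t blocks = data_len / multiple;
    if (data_len % multiple != 0)
        blocks++;

    return blocks * multiple;
}
```

With an arbitrary multiple of 48, a data_len of 1 rounds up to 48 and a
data_len of 49 rounds up to 96. No conversion to double takes place, so
the compiler has no floating-point instruction to emit for this step.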

Comment 1 Pete Zaitcev 2017-05-23 02:58:39 UTC
Created attachment 1281307
crash reproducer

(only on a certain host)

Comment 2 Pete Zaitcev 2017-05-23 02:59:18 UTC
Created attachment 1281308
same code but no crash

Comment 3 Pete Zaitcev 2017-05-23 03:02:04 UTC
The crash occurs on certain systems only (possibly on AMD CPUs; Intel
works okay).

Comment 4 Pete Zaitcev 2017-05-23 03:32:15 UTC
The problem narrows down to the build liberasurecode-1.4.0-1.fc25.x86_64
using AVX instructions, because ceill() is inlined. The crash happens
here:

   0x00007ffff78c44bb <+59>:    js     0x7ffff78c44e8 <liberasurecode_get_aligned_data_size+104>
=> 0x00007ffff78c44bd <+61>:    vxorpd %xmm0,%xmm0,%xmm0
   0x00007ffff78c44c1 <+65>:    vcvtsi2sd %rbp,%xmm0,%xmm0

The same code built without special optimization flags ends up calling
the version of ceill() in glibc, which works on all systems. The CPU on
the affected host (listed below) does not report avx among its flags.

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 8
model name      : Six-Core AMD Opteron(tm) Processor 8431
stepping        : 0
microcode       : 0x10000da
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 6
core id         : 0
cpu cores       : 6
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid eagerfpu pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg amd_e400
bogomips        : 4787.78
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
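
To check a flags line like the one above for a given feature, a
whole-token match is needed, since a plain substring search for "sse4"
would wrongly hit "sse4a". A minimal sketch, assuming the flags text has
already been read into a string; has_cpu_flag is a hypothetical helper,
not part of any library mentioned in this bug.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return true if `name` appears in `flags` as a
 * complete whitespace-delimited token. */
bool has_cpu_flag(const char *flags, const char *name)
{
    assert(flags != NULL && name != NULL);

    size_t len = strlen(name);
    if (len == 0)
        return false;

    const char *p = flags;
    while ((p = strstr(p, name)) != NULL) {
        /* The match must start at the beginning or after whitespace... */
        bool starts = (p == flags) || p[-1] == ' ' || p[-1] == '\t';
        /* ...and end at the string end or before whitespace. */
        bool ends = p[len] == '\0' || p[len] == ' ' ||
                    p[len] == '\t' || p[len] == '\n';
        if (starts && ends)
            return true;
        p++; /* keep scanning past this partial match */
    }
    return false;
}
```

Run against the flags listed above, a lookup for "sse4a" would succeed
and a lookup for "avx" would fail, which is consistent with the SIGILL
on the vxorpd instruction.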

Comment 5 Pete Zaitcev 2017-05-23 04:48:54 UTC
Actual build flags from Koji:

libtool: compile:  gcc -DHAVE_CONFIG_H -I. -I../include -I../include/erasurecode -I../include/xor_codes -I../include/rs_vand -I../include/isa_l -I../include/shss -Werror -O2 -g -Werror -D_GNU_SOURCE=1 -Wall -pedantic -std=c99 -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -mmmx -DINTEL_MMX -msse -DINTEL_SSE -msse2 -DINTEL_SSE2 -msse3 -DINTEL_SSE3 -mssse3 -DINTEL_SSSE3 -msse4.1 -DINTEL_SSE41 -msse4.2 -DINTEL_SSE42 -mavx -DINTEL_AVX -DARCH_64 -c erasurecode.c  -fPIC -DPIC -o .libs/liberasurecode_la-erasurecode.o

The liberasurecode does not do anything custom with CFLAGS, just
inherits everything from RPM.

Comment 6 Andy Lutomirski 2017-05-23 05:10:06 UTC
gcc -E output might give a hint as to whether the problem is in a header or in the compiler itself.

Comment 7 Florian Weimer 2017-05-23 05:20:05 UTC
(In reply to Pete Zaitcev from comment #5)
> The liberasurecode does not do anything custom with CFLAGS, just
> inherits everything from RPM.

Looks like the sources perform CPU feature detection at build time:

# Detect the SIMD features supported by both the compiler and the CPU
SIMD_FLAGS=""
cat "$srcdir/get_flags_from_cpuid.c" \
    | sed "s/FLAGSFROMAUTOCONF/${SUPPORTED_FLAGS}/" \
    | $CC -x c -g - -o get_flags_from_cpuid
if [[ -e ./get_flags_from_cpuid ]]; then
  chmod 755 get_flags_from_cpuid; ./get_flags_from_cpuid; rm ./get_flags_from_cpuid
  if [[ -e compiler_flags ]]; then
    SIMD_FLAGS=`cat compiler_flags`
    rm -f compiler_flags
  else
    AC_MSG_WARN([Could not run the CPUID detection program])
  fi
else
  AC_MSG_WARN([Could not compile the CPUID detection program])
fi

And from: get_flags_from_cpuid.c:

  for (comp_flag = strtok(SUPPORTED_COMP_FLAGS, " "); comp_flag != NULL; comp_flag = strtok(NULL, " ")) {
    if (strncmp(comp_flag, "-m", 2) != 0) {
      fprintf(stderr, "Invalid comp_flag: %s\n", comp_flag);
      exit(2);
    }
    if (strcmp(comp_flag, "-mmmx\0") == 0) {
      supp_comp_flgs |= (1 << EDX_MMX_BIT);
    }

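That's not going to work and will lead to the observed problems: the
flags reflect the CPU of the build host, not the CPU of the machine
where the package ends up running.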
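
For comparison, one common alternative to probing the CPU at build time
is to pick an implementation at run time. The sketch below is not what
the upstream patch does and is not liberasurecode code; xor_generic,
xor_wide, cpu_has_avx and select_xor are hypothetical names. It assumes
GCC or Clang on x86 for the __builtin_cpu_supports() check and reports
no AVX elsewhere. xor_wide is plain C standing in for a routine that a
real library would compile separately with -mavx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef void (*xor_fn)(uint8_t *dst, const uint8_t *src, size_t n);

/* Baseline path: one byte at a time, no ISA extensions required. */
void xor_generic(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];
}

/* Stand-in for an optimized path (plain C, 8 bytes per step). */
void xor_wide(uint8_t *dst, const uint8_t *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        uint64_t a, b;
        memcpy(&a, dst + i, 8);
        memcpy(&b, src + i, 8);
        a ^= b;
        memcpy(dst + i, &a, 8);
    }
    for (; i < n; i++) /* leftover tail bytes */
        dst[i] ^= src[i];
}

/* Ask the CPU the code is running on, not the one it was built on. */
int cpu_has_avx(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0; /* assumption: treat unknown platforms as lacking AVX */
#endif
}

/* Choose the routine from a capability value supplied by the caller. */
xor_fn select_xor(int have_avx)
{
    return have_avx ? xor_wide : xor_generic;
}
```

A caller would do something like select_xor(cpu_has_avx()) once at
startup and keep the returned pointer. The point of the design is that
the rest of the library stays compiled for the baseline instruction set,
so a binary built on an AVX-capable builder still runs on a host like
the Opteron in comment 4.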

Comment 8 Pete Zaitcev 2017-05-25 18:44:19 UTC
Florian, thanks a lot! I ended up with an upstream patch that makes the
code in configure.ac that you identified optional.

Comment 9 Fedora Update System 2017-07-20 01:56:49 UTC
liberasurecode-1.5.0-1.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-b14bb46f9a

Comment 10 Fedora Update System 2017-07-21 01:22:29 UTC
liberasurecode-1.5.0-1.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-b14bb46f9a

Comment 11 Fedora Update System 2017-07-28 17:19:48 UTC
liberasurecode-1.5.0-1.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.

