Description of problem:

Program received signal SIGILL, Illegal instruction.
0x00007ffff78c44bd in liberasurecode_get_aligned_data_size (
    desc=<optimized out>, data_len=1) at erasurecode.c:1212
1212        ret = (int) ceill( (double)

Version-Release number of selected component (if applicable):
glibc-2.24-4.fc25.x86_64
liberasurecode-1.4.0-1.fc25.x86_64

How reproducible:
Synchronous 100%

Steps to Reproduce:
1. Build and run the attached test

Actual results:
SIGILL

Expected results:
Working

Additional info:
This was discovered because it happens on download from an EC container.

[root@rhev-a24c-01 misc]# more /etc/swift/swift.conf
[swift-hash]
swift_hash_path_suffix = 583a********4258

[swift-constraints]
max_meta_count = 90
max_meta_overall_size = 4096

[storage-policy:0]
name = policy0
policy_type = replication
default = yes

[storage-policy:1]
name = ec0603
policy_type = erasure_coding
ec_type = liberasurecode_rs_vand
ec_num_data_fragments = 6
ec_num_parity_fragments = 3
ec_object_segment_size = 1048576
[root@rhev-a24c-01 misc]#

This use of ceill() is the only time libm is used in the whole liberasurecode, it appears. Do we even need it there?

N.B. Invoking the exact same code directly works perfectly. So, the problem is not ceill() as such. See other attachment.
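On the question of whether libm is needed there at all: if the computation is a round-up of a length to a multiple of some alignment, it can be done in integer arithmetic alone. The expression on line 1212 is truncated in the backtrace, so this is only a sketch under that assumption; the function name round_up_to_multiple and its parameters are hypothetical and are not taken from liberasurecode.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of liberasurecode): round data_len up to
 * the next multiple of alignment using integer arithmetic only, so no
 * libm call and no floating-point instructions are involved.
 * Assumes alignment > 0 and that data_len + alignment - 1 does not overflow.
 */
uint64_t round_up_to_multiple(uint64_t data_len, uint64_t alignment)
{
    assert(alignment > 0);
    /* Integer equivalent of ceil(data_len / alignment). */
    uint64_t blocks = (data_len + alignment - 1) / alignment;
    return blocks * alignment;
}
```

With data_len=1, as in the backtrace, and an illustrative alignment of 48, this returns 48 without touching the FPU or SIMD units.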
Created attachment 1281307 crash reproducer (only on a certain host)
Created attachment 1281308 same code but no crash
The crash occurs on certain systems only (possibly on AMD CPUs; Intel works okay).
The problem narrows down to the build liberasurecode-1.4.0-1.fc25.x86_64 using AVX instructions, because ceill() is inlined. The crash happens here:

   0x00007ffff78c44bb <+59>:    js     0x7ffff78c44e8 <liberasurecode_get_aligned_data_size+104>
=> 0x00007ffff78c44bd <+61>:    vxorpd %xmm0,%xmm0,%xmm0
   0x00007ffff78c44c1 <+65>:    vcvtsi2sd %rbp,%xmm0,%xmm0

The same code built without special optimization flags ends up calling the version of ceill() in glibc, which works on all systems.

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 16
model           : 8
model name      : Six-Core AMD Opteron(tm) Processor 8431
stepping        : 0
microcode       : 0x10000da
cpu MHz         : 800.000
cache size      : 512 KB
physical id     : 0
siblings        : 6
core id         : 0
cpu cores       : 6
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 5
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid eagerfpu pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter
bugs            : tlb_mmatch fxsave_leak sysret_ss_attrs null_seg amd_e400
bogomips        : 4787.78
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate
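The flags line above lists sse and sse2 but no avx, which is consistent with the SIGILL on vxorpd. The same thing can be checked from a program at run time. The following is a minimal sketch, assuming GCC or Clang on x86 and using the compiler builtins __builtin_cpu_init() and __builtin_cpu_supports(); the function name cpu_has_avx is hypothetical.

```c
/*
 * Hypothetical helper: report whether the CPU we are running on
 * advertises AVX. Returns 1 if it does, 0 otherwise (including on
 * compilers or architectures where the builtins are unavailable).
 */
int cpu_has_avx(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    /* Safe to call more than once; populates the compiler's CPU model data. */
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0;
#endif
}
```

On the Opteron 8431 shown above this would be expected to return 0, since avx is absent from its flags.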
Actual build flags from Koji:

libtool: compile: gcc -DHAVE_CONFIG_H -I. -I../include -I../include/erasurecode -I../include/xor_codes -I../include/rs_vand -I../include/isa_l -I../include/shss -Werror -O2 -g -Werror -D_GNU_SOURCE=1 -Wall -pedantic -std=c99 -O2 -g -pipe -Wall -Werror=format-security -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector-strong --param=ssp-buffer-size=4 -grecord-gcc-switches -specs=/usr/lib/rpm/redhat/redhat-hardened-cc1 -m64 -mtune=generic -mmmx -DINTEL_MMX -msse -DINTEL_SSE -msse2 -DINTEL_SSE2 -msse3 -DINTEL_SSE3 -mssse3 -DINTEL_SSSE3 -msse4.1 -DINTEL_SSE41 -msse4.2 -DINTEL_SSE42 -mavx -DINTEL_AVX -DARCH_64 -c erasurecode.c -fPIC -DPIC -o .libs/liberasurecode_la-erasurecode.o

The liberasurecode does not do anything custom with CFLAGS, just inherits everything from RPM.
gcc -E output might give a hint as to whether the problem is in a header or in the compiler itself.
(In reply to Pete Zaitcev from comment #5)
> The liberasurecode does not do anything custom with CFLAGS, just
> inherits everything from RPM.

Looks like the sources perform CPU feature detection at build time:

    # Detect the SIMD features supported by both the compiler and the CPU
    SIMD_FLAGS=""
    cat "$srcdir/get_flags_from_cpuid.c" \
        | sed "s/FLAGSFROMAUTOCONF/${SUPPORTED_FLAGS}/" \
        | $CC -x c -g - -o get_flags_from_cpuid
    if [[ -e ./get_flags_from_cpuid ]]; then
        chmod 755 get_flags_from_cpuid; ./get_flags_from_cpuid; rm ./get_flags_from_cpuid
        if [[ -e compiler_flags ]]; then
            SIMD_FLAGS=`cat compiler_flags`
            rm -f compiler_flags
        else
            AC_MSG_WARN([Could not run the CPUID detection program])
        fi
    else
        AC_MSG_WARN([Could not compile the CPUID detection program])
    fi

And from get_flags_from_cpuid.c:

    for (comp_flag = strtok(SUPPORTED_COMP_FLAGS, " "); comp_flag != NULL;
         comp_flag = strtok(NULL, " ")) {
        if (strncmp(comp_flag, "-m", 2) != 0) {
            fprintf(stderr, "Invalid comp_flag: %s\n", comp_flag);
            exit(2);
        }
        if (strcmp(comp_flag, "-mmmx\0") == 0) {
            supp_comp_flgs |= (1 << EDX_MMX_BIT);
        }

That's not going to work and will lead to the observed problems: the SIMD flags are derived from the CPU of the build host, so a package built on an AVX-capable builder contains AVX instructions and fails with SIGILL on CPUs without AVX.
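For contrast with build-time detection, a common alternative is to compile the baseline code with generic flags and select an AVX variant at run time. This is not what liberasurecode does and not the upstream fix; it is only a sketch of the general technique, assuming GCC or Clang on x86-64. All names here (sum_baseline, sum_avx, select_sum) are hypothetical.

```c
#include <stddef.h>

/* Baseline variant: built with the default (generic) flags of this file. */
static double sum_baseline(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

#if defined(__GNUC__) && defined(__x86_64__)
/* AVX variant: only this function is allowed to use AVX instructions. */
__attribute__((target("avx")))
static double sum_avx(const double *v, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
#endif

typedef double (*sum_fn)(const double *, size_t);

/* Choose an implementation based on the CPU the program runs on,
 * not on the CPU of the machine that built it. */
sum_fn select_sum(void)
{
#if defined(__GNUC__) && defined(__x86_64__)
    __builtin_cpu_init();
    if (__builtin_cpu_supports("avx"))
        return sum_avx;
#endif
    return sum_baseline;
}
```

A caller would obtain the function pointer once, for example sum_fn f = select_sum();, and then call f(values, count). On a CPU without AVX the baseline path is taken and no AVX instruction is executed.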
Florian, thanks a lot! I ended up with an upstream patch that makes the configure.ac code you identified optional.
liberasurecode-1.5.0-1.fc26 has been submitted as an update to Fedora 26. https://bodhi.fedoraproject.org/updates/FEDORA-2017-b14bb46f9a
liberasurecode-1.5.0-1.fc26 has been pushed to the Fedora 26 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2017-b14bb46f9a
liberasurecode-1.5.0-1.fc26 has been pushed to the Fedora 26 stable repository. If problems still persist, please make note of it in this bug report.