Bug 1355879

Summary: nunc-stans: ns-slapd crashes during startup with SIGILL on AMD Opteron 280
Product: Red Hat Enterprise Linux 7 Reporter: Viktor Ashirov <vashirov>
Component: 389-ds-base Assignee: wibrown <wibrown>
Status: CLOSED ERRATA QA Contact: Viktor Ashirov <vashirov>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 7.3 CC: nhosoi, nkinder, rmeggins, wibrown
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: 389-ds-base-1.3.5.10-4.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-03 20:44:11 UTC Type: Bug

Description Viktor Ashirov 2016-07-12 19:29:32 UTC
Description of problem:
With nunc-stans enabled, ns-slapd crashes at startup on a machine with an AMD Opteron 280 CPU:
==28269== Process terminating with default action of signal 4 (SIGILL): dumping core
==28269==  Illegal opcode at address 0x516F04A
==28269==    at 0x516F04A: abstraction_dcas (abstraction_dcas.c:179)
==28269==    by 0x516EF56: freelist_push (freelist_pop_push.c:84)
==28269==    by 0x516EC7C: freelist_new_elements (freelist_new.c:67)
==28269==    by 0x516ED40: freelist_new (freelist_new.c:31)
==28269==    by 0x516E48C: stack_new (stack_new.c:21)
==28269==    by 0x516CEB6: ns_thrpool_new (ns_thrpool.c:900)
==28269==    by 0x128EA9: slapd_daemon (daemon.c:1222)
==28269==    by 0x119ACB: main (main.c:1117)

nunc-stans uses liblfds, which has this in its source:
src/abstraction/abstraction_dcas.c:183

   179	    __asm__ __volatile__
   180	    (
   181	      "xchg %%rsi, %%rbx;"  // swap RBI and RBX 
   182	      "lock;"               // make cmpxchg16b atomic
   183	      "cmpxchg16b %0;"      // cmpxchg16b sets ZF on success
   184	      "setz       %3;"      // if ZF set, set cas_result to 1
   185	      "xchg %%rbx, %%rsi;"  // re-swap RBI and RBX
   186	
   187	      // output
   188	      : "+m" (*(volatile atom_t (*)[2]) destination), "+a" (*compare), "+d" (*(compare+1)), "=q" (cas_result)
   189	
   190	      // input
   191	      : "S" (*exchange), "c" (*(exchange+1))
   192	
   193	      // clobbered
   194	      : "cc", "memory"
   195	    );
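
For readers less familiar with the instruction, here is a rough sketch of what the assembly above computes, written as plain C. It is a single-threaded reference model only and is NOT atomic, so it is not a replacement for the real code. The names `word_t` and `dcas_model` are made up for this illustration (`word_t` stands in for `atom_t`).

```c
/* Hypothetical stand-in for liblfds' atom_t (one 64-bit machine word). */
typedef unsigned long long word_t;

/*
 * Reference model of a double-word compare-and-swap.
 * WARNING: not atomic; it only illustrates the semantics that
 * "lock; cmpxchg16b" provides in a single indivisible step.
 *
 * If the two words at destination equal the two words at compare,
 * store the two exchange words and return 1 (ZF set in the asm).
 * Otherwise copy the current destination into compare and return 0,
 * mirroring how cmpxchg16b loads RDX:RAX on failure.
 */
unsigned char dcas_model(volatile word_t *destination,
                         const word_t *exchange,
                         word_t *compare)
{
    if (destination[0] == compare[0] && destination[1] == compare[1]) {
        destination[0] = exchange[0];
        destination[1] = exchange[1];
        return 1;
    }

    compare[0] = destination[0];
    compare[1] = destination[1];
    return 0;
}
```

On a CPU without CMPXCHG16B there is no single instruction that does both the comparison and the store indivisibly, which is why the real code faults instead of degrading gracefully.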

cmpxchg16b is not supported by some AMD processors
https://en.wikipedia.org/wiki/X86-64:
Early AMD64 processors (typically on Socket 939 and 940) lacked the CMPXCHG16B instruction, which is an extension of the CMPXCHG8B instruction present on most post-80486 processors. Similar to CMPXCHG8B, CMPXCHG16B allows for atomic operations on octa-words (128-bit values). This is useful for parallel algorithms that use compare and swap on data larger than the size of a pointer, common in lock-free and wait-free algorithms. Without CMPXCHG16B one must use workarounds, such as a critical section or alternative lock-free approaches.
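
A minimal sketch of how a program could test for the instruction at runtime before executing it, assuming GCC or Clang (it relies on their `<cpuid.h>` helper `__get_cpuid`). CPUID leaf 1 reports CMPXCHG16B support in ECX bit 13. The function and macro names here are made up for the illustration and are not taken from nunc-stans or liblfds.

```c
#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang helper header providing __get_cpuid() */
#endif

/* CPUID leaf 1, ECX bit 13: CMPXCHG16B available. */
#define CPUID_LEAF1_ECX_CX16 (1u << 13)

/* Pure helper: does this leaf-1 ECX value advertise CMPXCHG16B? */
int ecx_reports_cx16(unsigned int ecx)
{
    return (ecx & CPUID_LEAF1_ECX_CX16) != 0;
}

/* Returns 1 if the running CPU advertises CMPXCHG16B, 0 otherwise.
 * On non-x86 builds this sketch simply returns 0. */
int cpu_has_cmpxchg16b(void)
{
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    /* __get_cpuid returns 0 if the requested leaf is not supported. */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;

    return ecx_reports_cx16(ecx);
#else
    return 0;
#endif
}
```

A caller could use the result to pick a lock-based fallback, or to refuse to enable the lock-free path, rather than hitting SIGILL. On the Opteron 280 described here the check would be expected to return 0, since `cx16` is absent from the flags listed under "Additional info".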

Version-Release number of selected component (if applicable):
389-ds-base-1.3.5.10-3.el7.x86_64

How reproducible:
always

Steps to Reproduce:
0. Use a machine with an AMD Opteron 280 CPU
1. Enable nunc-stans
2. Start ns-slapd

Actual results:
Server crashes with SIGILL

Expected results:
Server should start up successfully.

Additional info:
processor	: 0
vendor_id	: AuthenticAMD
cpu family	: 15
model		: 33
model name	: Dual Core AMD Opteron(tm) Processor 280
stepping	: 2
microcode	: 0x4d
cpu MHz		: 2405.487
cache size	: 1024 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 1
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt lm 3dnowext 3dnow art rep_good nopl extd_apicid pni lahf_lm cmp_legacy
bogomips	: 4810.97
TLB size	: 1024 4K pages
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

Comment 2 Noriko Hosoi 2016-07-13 20:52:16 UTC
Upstream ticket:
https://fedorahosted.org/389/ticket/48925

Comment 3 wibrown@redhat.com 2016-07-14 21:40:06 UTC
We have decided not to support nunc-stans on this specific hardware. It's old, and all modern hardware supports the cx16 flag.

We will now detect and raise a warning with dsktune for this.
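
For illustration only, a check along these lines can be sketched in C by looking for the `cx16` token on the "flags" line of /proc/cpuinfo. This is an assumption about the general approach, not the actual dsktune code; the function names are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated token
 * after the ':' of a /proc/cpuinfo-style line, else 0.
 * Whole-token matching matters: "cx8" must not satisfy "cx16". */
int cpu_flags_line_has(const char *line, const char *flag)
{
    const char *p = strchr(line, ':');
    size_t flen = strlen(flag);

    if (p == NULL || flen == 0)
        return 0;
    p++; /* step past ':' */

    while (*p != '\0') {
        size_t tlen;

        p += strspn(p, " \t\n");      /* skip separators */
        tlen = strcspn(p, " \t\n");   /* length of the next token */
        if (tlen == flen && strncmp(p, flag, flen) == 0)
            return 1;
        p += tlen;
    }
    return 0;
}

/* Scan a cpuinfo-style file for `flag`.
 * Returns 1 if found, 0 if not found, -1 if the file cannot be opened.
 * Assumes each "flags" line fits in the 4096-byte buffer. */
int cpuinfo_has_flag(const char *path, const char *flag)
{
    char line[4096];
    int found = 0;
    FILE *fp = fopen(path, "r");

    if (fp == NULL)
        return -1;

    while (fgets(line, sizeof line, fp) != NULL) {
        if (strncmp(line, "flags", 5) == 0 &&
            cpu_flags_line_has(line, flag)) {
            found = 1;
            break;
        }
    }
    fclose(fp);
    return found;
}
```

Calling `cpuinfo_has_flag("/proc/cpuinfo", "cx16")` on the Opteron 280 from the description would be expected to return 0, because its flags line lists `cx8` but not `cx16`.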

Comment 5 Viktor Ashirov 2016-07-18 11:52:11 UTC
Build tested:
389-ds-base-1.3.5.10-5.el7.x86_64

During the initial setup by setup-ds.pl, dsktune is executed and the following message is shown to the user:
[root@mgmt8 ~]# setup-ds.pl 

==============================================================================
This program will set up the 389 Directory Server.

It is recommended that you have "root" privilege to set up the software.
Tips for using this  program:
  - Press "Enter" to choose the default and go to the next screen
  - Type "Control-B" or the word "back" then "Enter" to go back to the previous screen
  - Type "Control-C" to cancel the setup program

Would you like to continue with set up? [yes]: 

==============================================================================
Your system has been scanned for potential problems, missing patches,
etc.  The following output is a report of the items found that need to
be addressed before running this software in a production
environment.

389 Directory Server system tuning analysis version 14-JULY-2016.

NOTICE : System is x86_64-unknown-linux3.10.0-464.el7.x86_64 (4 processors).

ERROR: This system does not support CMPXCHG16B instruction (cpuflag cx16).
       nsslapd-enable-nunc-stans must be set to "off" on this system. 
       In a future release of Directory Server this platform will NOT be supported.

NOTICE : The net.ipv4.tcp_keepalive_time is set to 7200000 milliseconds
(120 minutes).  This may cause temporary server congestion from lost
client connections.

WARNING: There are only 1024 file descriptors (soft limit) available, which
limit the number of simultaneous connections.  

ERROR  : The above errors MUST be corrected before proceeding.

Would you like to continue? [no]: 


Marking as VERIFIED.

Comment 7 errata-xmlrpc 2016-11-03 20:44:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2016-2594.html