Bug 1732548

Summary: glibc: Segmentation fault in libc-2.28.so when running IMB-MPI1
Product: Red Hat Enterprise Linux 8 Reporter: Adrian Suhov <v-adsuho>
Component: glibc    Assignee: Florian Weimer <fweimer>
Status: CLOSED INSUFFICIENT_DATA QA Contact: qe-baseos-tools-bugs
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 8.0    CC: ashankar, codonell, dj, fweimer, jopoulso, juhlee, mnewsome, pfrankli, vkuznets
Target Milestone: rc   
Target Release: 8.0   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-08-01 14:43:59 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
dmesg output for the bad VM

Description Adrian Suhov 2019-07-23 16:04:56 UTC
Created attachment 1592917 [details]
dmesg output for the bad VM

Description of problem:
This issue occurred when trying to test RDMA/InfiniBand on Azure. The VM size used was HC44rs. Intel MPI version 2018.3.222 was used for testing.

Version-Release number of selected component (if applicable):
2.28

How reproducible:
100%

Steps to Reproduce:
1. Set up 2 VMs on Azure, HC44rs VM size. Make sure they're on the same subnet so they can see each other over the network.
2. Install MLNX OFED driver 4.5-1.0.1.0. Disable firewall and SELinux. 
3. Change LOAD_EIPOIB to 'yes' in /etc/infiniband/openib.conf
4. Download and install Intel MPI.
5. Reboot both VMs
6. After reboot, if everything is OK on the setup side, the command that triggers the failure is the one below (a rough shell sketch of steps 2-5 follows the command):
/opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/bin/mpirun -hosts 10.0.0.4,10.0.0.5 -ppn 2 -n 2 -env I_MPI_FABRICS ofa -env I_MPI_DAPL_PROVIDER=ofa-v2-ib0 -env SECS_PER_SAMPLE=600 /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/bin/IMB-MPI1 pingpong
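
For reference, a rough shell sketch of steps 2-5, run on both VMs. The installer script names, the --force flag, and the Intel MPI silent-install invocation are assumptions based on the description above, not verified commands:

# Step 2: install MLNX OFED 4.5-1.0.1.0 (installer name/flags assumed), disable firewall and SELinux
sudo ./mlnxofedinstall --force
sudo systemctl disable --now firewalld
sudo setenforce 0    # permissive for the current boot; edit /etc/selinux/config to keep it off after reboot

# Step 3: enable eIPoIB in the OFED configuration
sudo sed -i 's/^LOAD_EIPOIB=.*/LOAD_EIPOIB=yes/' /etc/infiniband/openib.conf

# Steps 4-5: install Intel MPI 2018.3.222 (silent-install invocation assumed), then reboot
sudo ./install.sh --silent silent.cfg
sudo reboot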

Actual results:
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 15272 RUNNING AT 10.0.0.5
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
   Intel(R) MPI Library troubleshooting guide:
      https://software.intel.com/node/561764
===================================================================================

When looking inside dmesg, we can see segmentation fault errors:

[   42.725166] IMB-MPI1[2756]: segfault at 0 ip 00007f4281332b06 sp 00007ffd1b340190 error 4 in libc-2.28.so[7f42812a
[   42.725189] IMB-MPI1[2757]: segfault at 0 ip 00007f2d7822cb06 sp 00007ffc36ce0310 error 4 in libc-2.28.so[7f2d781a
[   42.737587] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89
[   42.774843] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89
[   43.905730] EXT4-fs (sdb1): mounted filesystem with ordered data mode. Opts: (null)
[   43.971992] Adding 2097148k swap on /mnt/resource/swapfile.  Priority:-2 extents:6 across:2260988k FS
[   53.259670] IMB-MPI1[2956]: segfault at 0 ip 00007f0e121dbb06 sp 00007ffff37e0950 error 4 in libc-2.28.so[7f0e1215
[   53.259691] IMB-MPI1[2957]: segfault at 0 ip 00007fced43a5b06 sp 00007ffc87ba6420 error 4 in libc-2.28.so[7fced431
[   53.272286] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89
[   53.286612] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89
[   59.719505] hv_balloon: Max. dynamic memory size: 360448 MB
[   63.795583] IMB-MPI1[3110]: segfault at 0 ip 00007f40fb537b06 sp 00007ffc366982a0 error 4 in libc-2.28.so[7f40fb4a
[   63.795606] IMB-MPI1[3111]: segfault at 0 ip 00007f661e633b06 sp 00007ffef41b7a10 error 4
[   63.808239] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89
[   63.841781]  in libc-2.28.so[7f661e5a9000+1ba000]
[   63.847642] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89
[   74.380840] IMB-MPI1[3269]: segfault at 0 ip 00007f4f57891b06 sp 00007ffff0a1aa80 error 4 in libc-2.28.so[7f4f5780
[   74.380861] IMB-MPI1[3270]: segfault at 0 ip 00007fbc8c8b9b06 sp 00007ffd8ae67180 error 4
[   74.392970] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89
[   74.392970]  in libc-2.28.so[7fbc8c82f000+1ba000]
[   74.429882] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89
[   84.916997] IMB-MPI1[3426]: segfault at 0 ip 00007f019d1e5b06 sp 00007fff6bd686a0 error 4 in libc-2.28.so[7f019d15
[   84.917024] IMB-MPI1[3427]: segfault at 0 ip 00007fa83abeab06 sp 00007ffcd2f418f0 error 4
[   84.929223] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89
[   84.962649]  in libc-2.28.so[7fa83ab60000+1ba000]
[   84.968561] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89
[   95.453505] IMB-MPI1[3499]: segfault at 0 ip 00007f3999beab06 sp 00007fff51e97690 error 4 in libc-2.28.so[7f3999b6
[   95.453527] IMB-MPI1[3500]: segfault at 0 ip 00007f42ac183b06 sp 00007ffd53f02410 error 4
[   95.465435] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89
[   95.465436]  in libc-2.28.so[7f42ac0f9000+1ba000]
[   95.503767] Code: 00 00 00 90 f3 0f 1e fa 48 8d 15 9d 5f 33 00 e9 00 00 00 00 f3 0f 1e fa 41 54 49 89 f4 55 48 89

Expected results:
The benchmark should return exit code 0 and dmesg should be clean.

Comment 2 Vitaly Kuznetsov 2019-07-24 15:32:47 UTC
Hi Adrian,

do I understand correctly that to enable RDMA on Azure one has to at least:

1) Install the out-of-tree 'LIS' drivers containing the 'vmbus_rdma' driver

2) Install the out-of-tree Mellanox 'OFED' drivers?

or am I missing some recent changes? If I'm not, this will likely remain completely unsupported by Red Hat.
We can, however, look at the glibc issue if it is somehow reproducible without these external components.

Comment 3 Ju Lee 2019-07-24 20:37:35 UTC
Hi Vitaly,

Our test automation covers many different distros in Azure, and here is a quick summary. RHEL and CentOS need both the OFED driver installation and the LIS installation, plus the required packages and waagent. You will find detailed information on GitHub: https://github.com/LIS/LISAv2/blob/master/Testscripts/Linux/SetupRDMA.sh#L65.

You can find the test command in this script line, https://github.com/LIS/LISAv2/blob/master/Testscripts/Linux/TestRDMA_MultiVM.sh#L710 

We have tested many different configurations, and Intel MPI (2018.3.222) failed on RHEL 8.0 with kernel 4.18.0-80.4.2. LIS is not involved in this case. However, IBM Platform MPI, Open MPI, and MVAPICH MPI passed in the same RHEL 8.0 VM.

Comment 4 Florian Weimer 2019-07-25 08:15:51 UTC
I'm sorry, but Red Hat does not support the Mellanox OFED drivers.  You will have to reproduce this issue with the supported openmpi-based stack, with an untainted kernel, before we can debug it.
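
A minimal sketch of what a run on the supported stack could look like, assuming the openmpi and mpitests-openmpi packages and the mpi/openmpi-x86_64 environment module are available on RHEL 8; the exact package, module, and benchmark binary names are assumptions:

# Sketch only: package/module/binary names are assumptions, adjust to what RHEL 8 actually ships
sudo dnf install openmpi mpitests-openmpi
module load mpi/openmpi-x86_64
mpirun -np 2 -host 10.0.0.4,10.0.0.5 mpitests-IMB-MPI1 PingPong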

Feel free to post a backtrace with debugging symbols from the crash, and I can see if it is potentially glibc-related.  In most cases when we see crashes in libc.so.6, it is the result of applications passing NULL pointers to functions such as strlen, so the crash is due to an application bug.
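
One possible way to capture such a backtrace with glibc debugging symbols (a sketch only; it assumes the core dump lands in the working directory, which depends on the system's kernel.core_pattern / abrt / systemd-coredump configuration):

# Sketch: collect a backtrace with glibc debug symbols (core file location is an assumption)
sudo dnf debuginfo-install glibc
ulimit -c unlimited
# ... re-run the failing mpirun command, then open the core produced by IMB-MPI1:
gdb /opt/intel/compilers_and_libraries_2018.3.222/linux/mpi/intel64/bin/IMB-MPI1 ./core
# at the (gdb) prompt, run: bt full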

Comment 5 Florian Weimer 2019-08-01 14:43:59 UTC
Without a backtrace or coredump, we are not able to assist with diagnosing this issue.  Sorry.