Bug 1314459

Summary: lstopo, openmpi and hwloc-info segfault in hwloc_obj_cmp() on VM
Product: Red Hat Enterprise Linux 7 Reporter: Orion Poplawski <orion>
Component: hwloc Assignee: Don Zickus <dzickus>
Status: CLOSED ERRATA QA Contact: Mike Gahagan <mgahagan>
Severity: high Docs Contact:
Priority: unspecified    
Version: 7.2 CC: d.andric, dbasant, jshortt
Target Milestone: rc Keywords: Patch
Target Release: 7.3   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-11-04 08:12:10 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1274397    
Attachments:
Description                               Flags
dmidecode                                 none
/proc/cpuinfo                             none
Minimal patch to fix hwloc 1.7 segfault   none

Description Orion Poplawski 2016-03-03 16:36:58 UTC
Created attachment 1132862 [details]
dmidecode

Description of problem:


Running lstopo, or code compiled with openmpi 1.10.1, on a VM results in a segmentation fault.


Version-Release number of selected component (if applicable):
hwloc-1.7-5.el7.x86_64

How reproducible:
Every time

Steps to Reproduce:
1.
2.
3.

Core was generated by `lstopo'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007fdd44e71076 in __strcmp_sse42 () from /lib64/libc.so.6
(gdb) bt
#0  0x00007fdd44e71076 in __strcmp_sse42 () from /lib64/libc.so.6
#1  0x00007fdd4661bccb in hwloc_obj_cmp (obj1=obj1@entry=0x20ae940, 
    obj2=obj2@entry=0x20ae7d0) at topology.c:569
#2  0x00007fdd4661be87 in hwloc___insert_object_by_cpuset (
    report_error=0x7fdd4661aef0 <hwloc_report_os_error>, obj=0x20ae940, 
    cur=<optimized out>, topology=0x20a9e40) at topology.c:669
#3  hwloc__insert_object_by_cpuset (topology=topology@entry=0x20a9e40, 
    obj=obj@entry=0x20ae940, 
    report_error=report_error@entry=0x7fdd4661aef0 <hwloc_report_os_error>)
    at topology.c:843
#4  0x00007fdd4661c30c in hwloc_insert_object_by_cpuset (
    topology=topology@entry=0x20a9e40, obj=obj@entry=0x20ae940)
    at topology.c:855
#5  0x00007fdd4663596e in summarize (topology=0x20a9e40, infos=0x20ae250, 
    nbprocs=4, fulldiscovery=0) at topology-x86.c:559
#6  0x00007fdd46636828 in hwloc_look_x86 (topology=topology@entry=0x20a9e40, 
    nbprocs=nbprocs@entry=4, fulldiscovery=fulldiscovery@entry=0)
    at topology-x86.c:835
#7  0x00007fdd466368a3 in hwloc_x86_discover (backend=<optimized out>)
    at topology-x86.c:876
#8  0x00007fdd4661e8bb in hwloc_discover (topology=0x20a9e40)
    at topology.c:2157
#9  hwloc_topology_load (topology=topology@entry=0x20a9e40) at topology.c:2648
#10 0x000000000040334a in main (argc=<optimized out>, argv=<optimized out>)
    at lstopo.c:559
(gdb) up
#1  0x00007fdd4661bccb in hwloc_obj_cmp (obj1=obj1@entry=0x20ae940, 
    obj2=obj2@entry=0x20ae7d0) at topology.c:569
569	          int res = strcmp(obj1->name, obj2->name);
(gdb) print obj1->name
$1 = 0x0
(gdb) print obj2->name
$2 = 0x0

Comment 1 Orion Poplawski 2016-03-03 16:38:56 UTC
Created attachment 1132864 [details]
/proc/cpuinfo

Comment 2 Orion Poplawski 2016-03-03 16:47:15 UTC
I can't reproduce on a "stock" VM, so this is probably triggered by this particular VM having a different cpu configuration:

>   <cpu mode='custom' match='exact'>
>     <model fallback='allow'>Nehalem</model>
>     <vendor>Intel</vendor>
>     <feature policy='require' name='tm2'/>
>     <feature policy='require' name='est'/>
>     <feature policy='require' name='monitor'/>
>     <feature policy='require' name='ds'/>
>     <feature policy='require' name='ss'/>
>     <feature policy='require' name='vme'/>
>     <feature policy='require' name='dtes64'/>
>     <feature policy='require' name='rdtscp'/>
>     <feature policy='require' name='ht'/>
>     <feature policy='require' name='dca'/>
>     <feature policy='require' name='pbe'/>
>     <feature policy='require' name='tm'/>
>     <feature policy='require' name='pdcm'/>
>     <feature policy='require' name='vmx'/>
>     <feature policy='require' name='ds_cpl'/>
>     <feature policy='require' name='xtpr'/>
>     <feature policy='require' name='acpi'/>
>     <feature policy='require' name='invtsc'/>
>   </cpu>

Comment 4 Orion Poplawski 2016-03-03 17:04:35 UTC
Downgrading to 1.7-3.el7 fixes the issue.

Comment 5 Divya 2016-03-08 10:35:42 UTC
hwloc-info crashes when run on a VM. It looks like the name field of both hwloc_obj structures is NULL, so strcmp() dereferences a NULL pointer and crashes:


Core was generated by `hwloc-info'.
Program terminated with signal 11, Segmentation fault.
#0  __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
164		movdqu	(%rdi), %xmm1
(gdb) bt
#0  __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
#1  0x00007f6d46e03ccb in hwloc_obj_cmp (obj1=obj1@entry=0x2266ce0, obj2=obj2@entry=0x2266b40) at topology.c:569
#2  0x00007f6d46e03e87 in hwloc___insert_object_by_cpuset (report_error=0x7f6d46e02ef0 <hwloc_report_os_error>, obj=0x2266ce0, 
    cur=<optimized out>, topology=0x2263a40) at topology.c:669
#3  hwloc__insert_object_by_cpuset (topology=topology@entry=0x2263a40, obj=obj@entry=0x2266ce0, 
    report_error=report_error@entry=0x7f6d46e02ef0 <hwloc_report_os_error>) at topology.c:843
#4  0x00007f6d46e0430c in hwloc_insert_object_by_cpuset (topology=topology@entry=0x2263a40, obj=obj@entry=0x2266ce0)
    at topology.c:855
#5  0x00007f6d46e1d96e in summarize (topology=topology@entry=0x2263a40, infos=infos@entry=0x22667e0, nbprocs=nbprocs@entry=2, 
    fulldiscovery=fulldiscovery@entry=0) at topology-x86.c:559
#6  0x00007f6d46e1e818 in hwloc_look_x86 (topology=topology@entry=0x2263a40, nbprocs=nbprocs@entry=2, 
    fulldiscovery=fulldiscovery@entry=0) at topology-x86.c:835
#7  0x00007f6d46e1e893 in hwloc_x86_discover (backend=<optimized out>) at topology-x86.c:876
#8  0x00007f6d46e068ab in hwloc_discover (topology=0x2263a40) at topology.c:2157
#9  hwloc_topology_load (topology=topology@entry=0x2263a40) at topology.c:2648
#10 0x0000000000401859 in main (argc=<optimized out>, argv=<optimized out>) at hwloc-info.c:384
(gdb) f 0 
#0  __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
164		movdqu	(%rdi), %xmm1
(gdb) info registers rdi
rdi            0x0	0

(gdb) f 1
#1  0x00007f6d46e03ccb in hwloc_obj_cmp (obj1=obj1@entry=0x2266ce0, obj2=obj2@entry=0x2266b40) at topology.c:569
569	          int res = strcmp(obj1->name, obj2->name);
(gdb) p obj1->name
$2 = 0x0
(gdb) p obj2->name
$3 = 0x0

(gdb) p *obj1
$4 = {type = HWLOC_OBJ_MISC, os_index = 0, name = 0x0, memory = {total_memory = 0, local_memory = 0, page_types_len = 0, 
    page_types = 0x0}, attr = 0x2266de0, depth = 0, logical_index = 0, os_level = 1, next_cousin = 0x0, prev_cousin = 0x0, 
  parent = 0x0, sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 0, children = 0x0, first_child = 0x0, 
  last_child = 0x0, userdata = 0x0, cpuset = 0x2266c70, complete_cpuset = 0x0, online_cpuset = 0x0, allowed_cpuset = 0x0, 
  nodeset = 0x0, complete_nodeset = 0x0, allowed_nodeset = 0x0, distances = 0x0, distances_count = 0, infos = 0x0, 
  infos_count = 0, symmetric_subtree = 0}
(gdb) p *obj2
$5 = {type = HWLOC_OBJ_MISC, os_index = 0, name = 0x0, memory = {total_memory = 0, local_memory = 0, page_types_len = 0, 
    page_types = 0x0}, attr = 0x2266c40, depth = 0, logical_index = 0, os_level = 2, next_cousin = 0x0, prev_cousin = 0x0, 
  parent = 0x0, sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 0, children = 0x0, first_child = 0x22649d0, 
  last_child = 0x0, userdata = 0x0, cpuset = 0x2266ad0, complete_cpuset = 0x0, online_cpuset = 0x0, allowed_cpuset = 0x0, 
  nodeset = 0x0, complete_nodeset = 0x0, allowed_nodeset = 0x0, distances = 0x0, distances_count = 0, infos = 0x0, 
  infos_count = 0, symmetric_subtree = 0}
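
The failure mode above (two HWLOC_OBJ_MISC objects with name = 0x0 reaching strcmp()) can be illustrated with a small, self-contained sketch. This is not the upstream hwloc fix and not the patch attached to this bug; the struct, the enum and both function names are hypothetical stand-ins invented for the example. It also assumes that two unnamed objects of the same type should compare as equal, which is a design choice rather than something this report confirms.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the two hwloc_obj fields relevant to the crash. */
enum sketch_type { SKETCH_OBJ_MISC, SKETCH_OBJ_OTHER };

struct sketch_obj {
    enum sketch_type type;
    const char *name;   /* may be NULL, as seen in the core dump */
};

/*
 * strcmp() wrapper that tolerates NULL arguments.
 * Assumption: two NULL names are equal, and a NULL name sorts
 * before any non-NULL name.
 */
static int name_cmp_nullsafe(const char *a, const char *b)
{
    if (a == NULL || b == NULL)
        return (a != NULL) - (b != NULL);
    return strcmp(a, b);
}

/* Returns 1 when both objects have the same type and equal names, else 0. */
static int sketch_same_identity(const struct sketch_obj *a,
                                const struct sketch_obj *b)
{
    assert(a != NULL && b != NULL);
    if (a->type != b->type)
        return 0;
    return name_cmp_nullsafe(a->name, b->name) == 0;
}
```

With this guard, comparing two objects shaped like obj1 and obj2 in the dump (same type, both names NULL) returns a result instead of passing NULL to strcmp(), which is undefined behavior in C.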

Comment 8 Don Zickus 2016-05-26 16:36:52 UTC
Hi,

Please test the following rpms to see if your issue has been resolved.  This is a rebase, made because of the large number of requests for other features.  If this rebase causes issues with your usage, please let us know so we can further evaluate how we want to distribute features requested by other customers.  The API should be backwards compatible if you have applications linking to the current hwloc-libs.

http://people.redhat.com/dzickus/rhel7/.hwloc_8d5e1809e13/

Cheers,
Don

Comment 9 Orion Poplawski 2016-06-27 15:16:37 UTC
Sorry for the delay.  Looks good to me.

Comment 11 Dimitry Andric 2016-08-16 11:23:50 UTC
Created attachment 1191220 [details]
Minimal patch to fix hwloc 1.7 segfault

FWIW, I have been using this minimized patch to fix the segfaults in hwloc.  This was easier for me to deploy, and it has minimal impact, as far as I could determine.

Comment 12 Mike Gahagan 2016-09-14 18:11:46 UTC
Confirmed hwloc-1.11.2-1.el7 has fixed this issue.

Comment 14 errata-xmlrpc 2016-11-04 08:12:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2535.html