| Summary: | lstopo, openmpi and hwloc-info segfault in hwloc_obj_cmp() on VM | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Orion Poplawski <orion> |
| Component: | hwloc | Assignee: | Don Zickus <dzickus> |
| Status: | CLOSED ERRATA | QA Contact: | Mike Gahagan <mgahagan> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.2 | CC: | d.andric, dbasant, jshortt |
| Target Milestone: | rc | Keywords: | Patch |
| Target Release: | 7.3 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-11-04 08:12:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Bug Depends On: | | | |
| Bug Blocks: | 1274397 | | |
| Attachments: | | | |

Created attachment 1132864 [details]
/proc/cpuinfo
I can't reproduce on a "stock" VM, so this is probably triggered by this particular VM having a different cpu configuration:
> <cpu mode='custom' match='exact'>
> <model fallback='allow'>Nehalem</model>
> <vendor>Intel</vendor>
> <feature policy='require' name='tm2'/>
> <feature policy='require' name='est'/>
> <feature policy='require' name='monitor'/>
> <feature policy='require' name='ds'/>
> <feature policy='require' name='ss'/>
> <feature policy='require' name='vme'/>
> <feature policy='require' name='dtes64'/>
> <feature policy='require' name='rdtscp'/>
> <feature policy='require' name='ht'/>
> <feature policy='require' name='dca'/>
> <feature policy='require' name='pbe'/>
> <feature policy='require' name='tm'/>
> <feature policy='require' name='pdcm'/>
> <feature policy='require' name='vmx'/>
> <feature policy='require' name='ds_cpl'/>
> <feature policy='require' name='xtpr'/>
> <feature policy='require' name='acpi'/>
> <feature policy='require' name='invtsc'/>
> </cpu>
Downgrading to 1.7-3.el7 fixes the issue.

hwloc-info crashes when run on the VM. It looks like the name field of the hwloc_obj is NULL, which causes the crash:
Core was generated by `hwloc-info'.
Program terminated with signal 11, Segmentation fault.
#0 __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
164 movdqu (%rdi), %xmm1
(gdb) bt
#0 __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
#1 0x00007f6d46e03ccb in hwloc_obj_cmp (obj1=obj1@entry=0x2266ce0, obj2=obj2@entry=0x2266b40) at topology.c:569
#2 0x00007f6d46e03e87 in hwloc___insert_object_by_cpuset (report_error=0x7f6d46e02ef0 <hwloc_report_os_error>, obj=0x2266ce0,
cur=<optimized out>, topology=0x2263a40) at topology.c:669
#3 hwloc__insert_object_by_cpuset (topology=topology@entry=0x2263a40, obj=obj@entry=0x2266ce0,
report_error=report_error@entry=0x7f6d46e02ef0 <hwloc_report_os_error>) at topology.c:843
#4 0x00007f6d46e0430c in hwloc_insert_object_by_cpuset (topology=topology@entry=0x2263a40, obj=obj@entry=0x2266ce0)
at topology.c:855
#5 0x00007f6d46e1d96e in summarize (topology=topology@entry=0x2263a40, infos=infos@entry=0x22667e0, nbprocs=nbprocs@entry=2,
fulldiscovery=fulldiscovery@entry=0) at topology-x86.c:559
#6 0x00007f6d46e1e818 in hwloc_look_x86 (topology=topology@entry=0x2263a40, nbprocs=nbprocs@entry=2,
fulldiscovery=fulldiscovery@entry=0) at topology-x86.c:835
#7 0x00007f6d46e1e893 in hwloc_x86_discover (backend=<optimized out>) at topology-x86.c:876
#8 0x00007f6d46e068ab in hwloc_discover (topology=0x2263a40) at topology.c:2157
#9 hwloc_topology_load (topology=topology@entry=0x2263a40) at topology.c:2648
#10 0x0000000000401859 in main (argc=<optimized out>, argv=<optimized out>) at hwloc-info.c:384
(gdb) f 0
#0 __strcmp_sse42 () at ../sysdeps/x86_64/multiarch/strcmp-sse42.S:164
164 movdqu (%rdi), %xmm1
(gdb) info registers rdi
rdi 0x0 0
(gdb) f 1
#1 0x00007f6d46e03ccb in hwloc_obj_cmp (obj1=obj1@entry=0x2266ce0, obj2=obj2@entry=0x2266b40) at topology.c:569
569 int res = strcmp(obj1->name, obj2->name);
(gdb) p obj1->name
$2 = 0x0
(gdb) p obj2->name
$3 = 0x0
(gdb) p *obj1
$4 = {type = HWLOC_OBJ_MISC, os_index = 0, name = 0x0, memory = {total_memory = 0, local_memory = 0, page_types_len = 0,
page_types = 0x0}, attr = 0x2266de0, depth = 0, logical_index = 0, os_level = 1, next_cousin = 0x0, prev_cousin = 0x0,
parent = 0x0, sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 0, children = 0x0, first_child = 0x0,
last_child = 0x0, userdata = 0x0, cpuset = 0x2266c70, complete_cpuset = 0x0, online_cpuset = 0x0, allowed_cpuset = 0x0,
nodeset = 0x0, complete_nodeset = 0x0, allowed_nodeset = 0x0, distances = 0x0, distances_count = 0, infos = 0x0,
infos_count = 0, symmetric_subtree = 0}
(gdb) p *obj2
$5 = {type = HWLOC_OBJ_MISC, os_index = 0, name = 0x0, memory = {total_memory = 0, local_memory = 0, page_types_len = 0,
page_types = 0x0}, attr = 0x2266c40, depth = 0, logical_index = 0, os_level = 2, next_cousin = 0x0, prev_cousin = 0x0,
parent = 0x0, sibling_rank = 0, next_sibling = 0x0, prev_sibling = 0x0, arity = 0, children = 0x0, first_child = 0x22649d0,
last_child = 0x0, userdata = 0x0, cpuset = 0x2266ad0, complete_cpuset = 0x0, online_cpuset = 0x0, allowed_cpuset = 0x0,
nodeset = 0x0, complete_nodeset = 0x0, allowed_nodeset = 0x0, distances = 0x0, distances_count = 0, infos = 0x0,
infos_count = 0, symmetric_subtree = 0}
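
The gdb session above shows strcmp() being called with both obj1->name and obj2->name set to NULL, which is undefined behaviour and faults inside __strcmp_sse42. As an illustration only, here is a minimal sketch of a NULL-safe name comparison. It is not the upstream hwloc change and not the patch attached to this bug; struct demo_obj and demo_name_cmp() are hypothetical names made up for this sketch, and the ordering chosen for NULL names is an assumption.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the single hwloc_obj field involved in the crash. */
struct demo_obj {
    const char *name; /* may be NULL, as seen in the core dump */
};

/*
 * NULL-safe name comparison (illustrative assumption, not hwloc code).
 * Two NULL names compare equal; a NULL name sorts before a non-NULL one.
 * strcmp() is only reached when both pointers are valid.
 */
static int demo_name_cmp(const struct demo_obj *a, const struct demo_obj *b)
{
    if (a->name == NULL || b->name == NULL)
        return (a->name != NULL) - (b->name != NULL);
    return strcmp(a->name, b->name);
}
```

With this guard, comparing two objects whose names are both NULL returns 0 instead of dereferencing a null pointer.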
Hi,

Please test the following rpms to see if your issue has been resolved. This is a rebase due to the large number of requests for other features. If this rebase causes issues with your usage, please let us know so we can further evaluate how we want to distribute features requested by other customers. The API should be backwards compatible if you have applications linking to the current hwloc-libs.

http://people.redhat.com/dzickus/rhel7/.hwloc_8d5e1809e13/

Cheers,
Don

Sorry for the delay. Looks good to me.

Created attachment 1191220 [details]
Minimal patch to fix hwloc 1.7 segfault
FWIW, I have been using this minimized patch to fix the segfaults in hwloc. It was easier for me to deploy, and it has minimal impact, as far as I could determine.
Confirmed hwloc-1.11.2-1.el7 has fixed this issue.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2535.html

Created attachment 1132862 [details]
dmidecode

Description of problem:
Running lstopo or openmpi 1.10.1 compiled code on a VM results in a segmentation fault.

Version-Release number of selected component (if applicable):
hwloc-1.7-5.el7.x86_64

How reproducible:
Every time

Steps to Reproduce:
1.
2.
3.

Core was generated by `lstopo'.
Program terminated with signal 11, Segmentation fault.
#0 0x00007fdd44e71076 in __strcmp_sse42 () from /lib64/libc.so.6
(gdb) bt
#0 0x00007fdd44e71076 in __strcmp_sse42 () from /lib64/libc.so.6
#1 0x00007fdd4661bccb in hwloc_obj_cmp (obj1=obj1@entry=0x20ae940, obj2=obj2@entry=0x20ae7d0) at topology.c:569
#2 0x00007fdd4661be87 in hwloc___insert_object_by_cpuset (report_error=0x7fdd4661aef0 <hwloc_report_os_error>, obj=0x20ae940,
cur=<optimized out>, topology=0x20a9e40) at topology.c:669
#3 hwloc__insert_object_by_cpuset (topology=topology@entry=0x20a9e40, obj=obj@entry=0x20ae940,
report_error=report_error@entry=0x7fdd4661aef0 <hwloc_report_os_error>) at topology.c:843
#4 0x00007fdd4661c30c in hwloc_insert_object_by_cpuset (topology=topology@entry=0x20a9e40, obj=obj@entry=0x20ae940)
at topology.c:855
#5 0x00007fdd4663596e in summarize (topology=0x20a9e40, infos=0x20ae250, nbprocs=4, fulldiscovery=0) at topology-x86.c:559
#6 0x00007fdd46636828 in hwloc_look_x86 (topology=topology@entry=0x20a9e40, nbprocs=nbprocs@entry=4,
fulldiscovery=fulldiscovery@entry=0) at topology-x86.c:835
#7 0x00007fdd466368a3 in hwloc_x86_discover (backend=<optimized out>) at topology-x86.c:876
#8 0x00007fdd4661e8bb in hwloc_discover (topology=0x20a9e40) at topology.c:2157
#9 hwloc_topology_load (topology=topology@entry=0x20a9e40) at topology.c:2648
#10 0x000000000040334a in main (argc=<optimized out>, argv=<optimized out>) at lstopo.c:559
(gdb) up
#1 0x00007fdd4661bccb in hwloc_obj_cmp (obj1=obj1@entry=0x20ae940, obj2=obj2@entry=0x20ae7d0) at topology.c:569
569 int res = strcmp(obj1->name, obj2->name);
(gdb) print obj1->name
$1 = 0x0
(gdb) print obj2->name
$2 = 0x0