Bug 886862 - netcf uses libnl addr and link caches which are not thread-safe, leading to libvirtd crashes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: netcf
Version: 6.4
Hardware: x86_64
OS: Unspecified
unspecified
urgent
Target Milestone: rc
: ---
Assignee: Laine Stump
QA Contact: Virtualization Bugs
URL:
Whiteboard: abrt_hash:63589980492b58b5b3206b1722c...
Depends On: 886454
Blocks: 895654
 
Reported: 2012-12-13 12:09 UTC by Daniel Berrangé
Modified: 2013-02-21 11:04 UTC (History)

Fixed In Version: netcf-0.1.9-3.el6
Doc Type: Bug Fix
Doc Text:
Cause: libvirt uses the netcf library to configure and retrieve the status of host network devices. netcf uses the libnl library to communicate with the kernel and learn the current state of network devices. Some functions in the libnl library are not thread-safe.
Consequence: When stress testing the libvirt API under extremely heavy load (with multiple tests running simultaneously), it was possible to confuse libnl so that it caused a segfault, crashing the application using libvirt.
Fix: The non-thread-safe libnl functions called by netcf were not actually needed for proper operation, so the calls to them were removed from netcf.
Result: libvirt no longer experiences crashes in libnl (via netcf) under heavy load.
Clone Of: 886454
Environment:
Last Closed: 2013-02-21 11:04:39 UTC
Target Upstream Version:


Attachments
run nc.c in a loop (292.62 KB, text/plain)
2012-12-24 09:58 UTC, yanbing du
thread apply all bt (9.25 KB, text/plain)
2012-12-25 06:10 UTC, yanbing du
run nc.c in a loop (34.97 KB, text/plain)
2013-01-09 04:11 UTC, yanbing du


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2013:0494 normal SHIPPED_LIVE netcf bug fix update 2013-02-20 21:06:06 UTC

Description Daniel Berrangé 2012-12-13 12:09:09 UTC
Cloning to RHEL because the same problem exists with use of libnl.so

+++ This bug was initially created as a clone of Bug #886454 +++

Description of problem:
I was running the libguestfs test suite, and it died here:

/home/rjones/d/libguestfs/run --test ./test-max-disks.pl
libvir: XML-RPC error : Cannot recv data: Connection reset by peer
could not connect to libvirt (URI = NULL): Cannot recv data: Connection reset by peer [code=38 domain=7] at /home/rjones/d/libguestfs/tests/disks/test-max-disks.pl line 46.
max_disks is 255
/home/rjones/d/libguestfs/run: command failed with exit code 104
FAIL: test-max-disks.pl

Version-Release number of selected component:
libvirt-daemon-0.10.2.1-3.fc18

Additional info:
backtrace_rating: 4
cmdline:        /usr/sbin/libvirtd --timeout=30
crash_function: nl_object_put
executable:     /usr/sbin/libvirtd
kernel:         3.6.9-4.fc18.x86_64
remote_result:  NOTFOUND
uid:            1000

Truncated backtrace:
Thread no. 1 (10 frames)
 #4 nl_object_put at object.c:197
 #5 nl_object_free at object.c:158
 #6 nl_cache_remove at cache.c:484
 #7 nl_cache_clear at cache.c:347
 #8 nl_cache_free at cache.c:364
 #9 netlink_close at dutil_linux.c:864
 #10 drv_close at drv_redhat.c:384
 #11 ncf_close at netcf.c:101
 #12 interfaceCloseInterface at interface/interface_backend_netcf.c:170
 #13 virConnectDispose at datatypes.c:134

--- Additional comment from Richard W.M. Jones on 2012-12-12 10:29:35 GMT ---

Created attachment 662242 [details]
File: backtrace

--- Additional comment from Richard W.M. Jones on 2012-12-12 10:29:37 GMT ---

Created attachment 662243 [details]
File: cgroup

--- Additional comment from Richard W.M. Jones on 2012-12-12 10:29:39 GMT ---

Created attachment 662244 [details]
File: core_backtrace

--- Additional comment from Richard W.M. Jones on 2012-12-12 10:29:41 GMT ---

Created attachment 662245 [details]
File: dso_list

--- Additional comment from Richard W.M. Jones on 2012-12-12 10:29:43 GMT ---

Created attachment 662246 [details]
File: environ

--- Additional comment from Richard W.M. Jones on 2012-12-12 10:29:45 GMT ---

Created attachment 662247 [details]
File: limits

--- Additional comment from Richard W.M. Jones on 2012-12-12 10:30:01 GMT ---

Created attachment 662248 [details]
File: maps

--- Additional comment from Richard W.M. Jones on 2012-12-12 10:30:03 GMT ---

Created attachment 662249 [details]
File: open_fds

--- Additional comment from Richard W.M. Jones on 2012-12-12 10:30:05 GMT ---

Created attachment 662251 [details]
File: proc_pid_status

--- Additional comment from Richard W.M. Jones on 2012-12-12 10:30:08 GMT ---

Created attachment 662252 [details]
File: var_log_messages

--- Additional comment from Daniel Berrange on 2012-12-12 10:34:26 GMT ---

Superficially the stack trace points towards libnl as being at fault. Can you tell me what 'libnl3' and 'netcf' library versions are installed?

--- Additional comment from Daniel Berrange on 2012-12-13 11:30:33 GMT ---

The libnl3 code that's causing the crash is this

        if (obj->ce_refcnt < 0)
                BUG();

so libnl's ref counting seems to have a bug somewhere :-(
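For readers unfamiliar with that kind of check, here is a minimal toy model of it. This is not libnl's code: `toy_obj`, `toy_obj_put` and `demo_double_put` are hypothetical names made up for illustration, and the sketch assumes the underflow comes from an object being released more often than it was referenced.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a reference-counted object (hypothetical, not libnl). */
struct toy_obj {
    int refcnt;   /* number of outstanding references */
    int released; /* set once the last reference is dropped */
};

/*
 * Drop one reference.
 * Returns 0 on success, -1 if the count underflowed, which is the
 * condition the quoted snippet reports through BUG().
 */
int toy_obj_put(struct toy_obj *obj)
{
    assert(obj != NULL);

    obj->refcnt--;
    if (obj->refcnt < 0)
        return -1;            /* more puts than references taken */
    if (obj->refcnt == 0)
        obj->released = 1;    /* a real implementation would free here */
    return 0;
}

/* One reference, two puts: the second put trips the underflow check. */
int demo_double_put(void)
{
    struct toy_obj obj = { .refcnt = 1, .released = 0 };

    if (toy_obj_put(&obj) != 0)
        return 1;             /* unexpected: the first put must succeed */
    return toy_obj_put(&obj); /* second put underflows */
}
```

Calling `demo_double_put()` returns -1, the toy equivalent of hitting `BUG()`: two owners each believed they held the only reference.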

--- Additional comment from Daniel Berrange on 2012-12-13 12:03:14 GMT ---

It appears that ncf_init/ncf_close are not thread-safe :-(


$ cat nc.c
#include <netcf.h>
#include <pthread.h>
#include <stdlib.h>

/* Each worker repeatedly opens and closes its own netcf handle. */
void *worker(void *data)
{
  (void)data;

  for (;;) {
    struct netcf *netcf;

    if (ncf_init(&netcf, NULL) != 0)
      abort();

    ncf_close(netcf);
  }

  return NULL; /* not reached */
}

int main(void)
{
  enum { NTHREADS = 20 };
  pthread_t threads[NTHREADS];
  size_t i;

  for (i = 0; i < NTHREADS; i++) {
    if (pthread_create(&threads[i], NULL, worker, NULL) != 0)
      abort();
  }

  /* The workers never return, so these joins block until the process dies. */
  for (i = 0; i < NTHREADS; i++) {
    pthread_join(threads[i], NULL);
  }

  return 0;
}

$ ./nc 
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Segmentation fault
[berrange@mustard ~]$ ./nc 
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted
[berrange@mustard ~]$ ./nc 

BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted
[berrange@mustard ~]$ 
[berrange@mustard ~]$ ^C
[berrange@mustard ~]$ ./nc 
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted


There appear to be two problems here. The first appears to be libxml-related: the RNG schema warnings. The second is libnl3-related, and can be isolated with the following test:

#include <netlink/netlink.h>
#include <netlink/route/addr.h>
#include <netlink/route/link.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Each worker repeatedly builds and tears down its own socket and caches. */
void *worker(void *data)
{
  (void)data;

  for (;;) {
    struct nl_sock    *nl_sock;
    struct nl_cache   *link_cache;
    struct nl_cache   *addr_cache;

    if (!(nl_sock = nl_socket_alloc())) {
      perror("nl_socket_alloc");
      abort();
    }

    if (nl_connect(nl_sock, NETLINK_ROUTE) < 0) {
      perror("nl_connect");
      abort();
    }

    if (rtnl_link_alloc_cache(nl_sock, AF_UNSPEC, &link_cache) < 0) {
      perror("rtnl_link_alloc_cache");
      abort();
    }
    /* Registers link_cache in state shared by every thread. */
    nl_cache_mngt_provide(link_cache);

    if (rtnl_addr_alloc_cache(nl_sock, &addr_cache) < 0) {
      perror("rtnl_addr_alloc_cache");
      abort();
    }
    /* Same for addr_cache. */
    nl_cache_mngt_provide(addr_cache);

    nl_cache_free(addr_cache);
    nl_cache_free(link_cache);
    nl_close(nl_sock);
    nl_socket_free(nl_sock);
  }

  return NULL; /* not reached */
}

int main(void)
{
  enum { NTHREADS = 20 };
  pthread_t threads[NTHREADS];
  size_t i;

  for (i = 0; i < NTHREADS; i++) {
    if (pthread_create(&threads[i], NULL, worker, NULL) != 0)
      abort();
  }

  /* The workers never return, so these joins block until the process dies. */
  for (i = 0; i < NTHREADS; i++) {
    pthread_join(threads[i], NULL);
  }

  return 0;
}


This test program will crash. If you comment out the two nl_cache_mngt_provide calls then the crashes go away.

Looking at the libnl3 code this is not surprising

/**
 * Provide a cache for global use
 * @arg cache           cache to provide
 *
 * Offers the specified cache to be used by other modules.
 * Only one cache per type may be shared at a time,
 * a previsouly provided caches will be overwritten.
 */
void nl_cache_mngt_provide(struct nl_cache *cache)
{
        struct nl_cache_ops *ops;

        ops = cache_ops_lookup_for_obj(cache->c_ops->co_obj_ops);
        if (!ops)
                BUG();
        else
                ops->co_major_cache = cache;
}


Note the comment that only a single cache per type can be shared at a time. This is a process-wide global cache, held in a static global variable:

  static struct nl_cache_ops *cache_ops;

This is really awful design from libnl3. The caches really need to be scoped to the nl_sock.

It is not sufficient for netcf to simply do a one-time init of the caches itself, because other parts of libvirt also use libnl, so netcf can't assume it is the only owner of the caches.

AFAICT, the only viable option is to *not* register the caches at all. I'm not sure what that will do to performance though
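To make the "scoped to the nl_sock" idea concrete, here is a rough sketch using toy types only. `toy_sock`, `toy_cache` and the functions below are hypothetical and do not exist in libnl or netcf; the point is just that when each handle owns its caches and nothing is published globally, closing one handle cannot invalidate another handle's data.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical toy types; they model ownership only, not netlink. */
struct toy_cache {
    int nitems;
};

struct toy_sock {
    struct toy_cache *link_cache; /* owned by this handle only */
};

/* Open a handle with its own private cache. Returns NULL on allocation failure. */
struct toy_sock *toy_sock_open(int nitems)
{
    struct toy_sock *sk;

    assert(nitems >= 0);

    sk = calloc(1, sizeof(*sk));
    if (!sk)
        return NULL;

    sk->link_cache = calloc(1, sizeof(*sk->link_cache));
    if (!sk->link_cache) {
        free(sk);
        return NULL;
    }
    sk->link_cache->nitems = nitems;
    return sk;
}

/* Close a handle; only its own cache is released. */
void toy_sock_close(struct toy_sock *sk)
{
    if (!sk)
        return;
    free(sk->link_cache);
    free(sk);
}

/* Closing one handle must leave the other handle's cache untouched. */
int demo_independent_handles(void)
{
    struct toy_sock *a = toy_sock_open(1);
    struct toy_sock *b = toy_sock_open(2);
    int remaining;

    if (!a || !b) {
        toy_sock_close(a);
        toy_sock_close(b);
        return -1;
    }

    toy_sock_close(a);
    remaining = b->link_cache->nitems; /* still valid: b owns it */
    toy_sock_close(b);
    return remaining;
}
```

`demo_independent_handles()` returns 2, the item count of the second handle, read after the first handle has been closed. Since there is no shared slot, no locking is needed; whether real code pays a performance cost for skipping the shared cache is the open question raised above.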

Comment 2 Thomas Graf 2012-12-13 17:21:26 UTC
I fixed this in the git tree (released as 3.2.16) by protecting the cache operations with a mutex. Therefore cache provisioning is now thread safe. I can no longer reproduce the SIGSEGV with the latest git tree.

When this functionality was initially implemented the use case in mind was one similar to iproute2 where there would never be multiple threads.

A backport to libnl1.1 should not be hard. I would be happy to maintain a stable branch if really required.
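As an illustration of the general approach (not the actual libnl patch), a toy version of a mutex-protected shared slot might look like the following. All names here are hypothetical, and the sketch assumes two things that this report does not confirm about the real fix: that access to the slot is serialized by one lock, and that a cache is withdrawn (`toy_unprovide`) before it is freed.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define TOY_MAX_THREADS 32

/* Hypothetical toy cache; it models ownership only, not netlink data. */
struct toy_cache {
    int unused;
};

/* Process-wide slot, analogous in spirit to a globally provided cache. */
static struct toy_cache *shared_slot;
static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;

/* Publish a cache; the lock serializes concurrent writers. */
void toy_provide(struct toy_cache *cache)
{
    pthread_mutex_lock(&slot_lock);
    shared_slot = cache;
    pthread_mutex_unlock(&slot_lock);
}

/* Withdraw a cache, but only if it is still the published one. */
void toy_unprovide(struct toy_cache *cache)
{
    pthread_mutex_lock(&slot_lock);
    if (shared_slot == cache)
        shared_slot = NULL;
    pthread_mutex_unlock(&slot_lock);
}

/* Allocate, publish, withdraw and free a cache, `iterations` times. */
static void *toy_worker(void *arg)
{
    int iterations;

    assert(arg != NULL);
    iterations = *(const int *)arg;

    for (int i = 0; i < iterations; i++) {
        struct toy_cache *cache = calloc(1, sizeof(*cache));
        if (!cache)
            abort();
        toy_provide(cache);
        toy_unprovide(cache); /* withdraw before the memory goes away */
        free(cache);
    }
    return NULL;
}

/*
 * Run the workers and report whether the slot ended up empty.
 * Returns 0 if no stale pointer is left behind, 1 if one is,
 * and -1 on a setup error.
 */
int run_locked_workers(int nthreads, int iterations)
{
    pthread_t threads[TOY_MAX_THREADS];
    int started = 0;
    int failed = 0;

    if (nthreads < 1 || nthreads > TOY_MAX_THREADS || iterations < 0)
        return -1;

    for (; started < nthreads; started++) {
        if (pthread_create(&threads[started], NULL, toy_worker, &iterations) != 0) {
            failed = 1;
            break;
        }
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    if (failed)
        return -1;
    return shared_slot == NULL ? 0 : 1;
}
```

With, say, `run_locked_workers(8, 1000)`, the slot is empty once all threads have joined, because the last thread to publish a cache also withdraws it before freeing it. The lock alone does not give each thread its own cache: a thread can still find another thread's cache in the slot, which is the design limitation discussed in the description.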

Comment 3 Richard W.M. Jones 2012-12-13 17:44:33 UTC
See also:
https://bugzilla.redhat.com/show_bug.cgi?id=886454#c15

(In reply to comment #2)
> I fixed this in the git tree (released as 3.2.16) by protecting the cache
> operations with a mutex. Therefore cache provisioning is now thread safe. I
> can no longer reproduce the SIGSEGV with the latest git tree.

Can you push this into Fedora 18 updates-testing, so that it
gets more testing with libguestfs?

Comment 4 Thomas Graf 2012-12-17 08:31:37 UTC
CCing Dan Williams as he is maintaining the libnl packages.

@Dan, could you push 3.2.16 into F18?

Comment 6 yanbing du 2012-12-20 09:26:15 UTC
Hi Daniel,
  I'm using the C file you gave to test on RHEL6.4, and with some probability it dumps core.

$ cat nc.c
#include <netcf.h>
#include <pthread.h>
#include <stdlib.h>

void *worker(void *data)
{
  for (;;) {
    struct netcf *netcf;

    if (ncf_init(&netcf, NULL) != 0)
      abort();

    ncf_close(netcf);
  }
}

int main (int argc, char **argv)
{
  int nthreads = 20;
  pthread_t threads[nthreads];
  size_t i;

  for (i = 0  ; i < nthreads ; i++) {
    pthread_create(&threads[i], NULL,
		   worker, NULL);
  }

  for (i = 0  ; i < nthreads ; i++) {
    pthread_join(threads[i], NULL);
  }
  
  return 0;
}
$ ./nc 
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Aborted (core dumped)

Do you think it's enough to reproduce this bug?

BTW, compiling the second C file fails:
$ gcc -o crash_test -lpthread -lnl crash_test.c 
crash_test.c: In function ‘worker’:
crash_test.c:14: warning: assignment makes pointer from integer without a cast
crash_test.c:19: warning: passing argument 1 of ‘nl_connect’ from incompatible pointer type
/usr/include/netlink/netlink.h:40: note: expected ‘struct nl_handle *’ but argument is of type ‘struct nl_sock *’
crash_test.c:24: warning: passing argument 1 of ‘rtnl_link_alloc_cache’ from incompatible pointer type
/usr/include/netlink/route/link.h:66: note: expected ‘struct nl_handle *’ but argument is of type ‘struct nl_sock *’
crash_test.c:24: error: too many arguments to function ‘rtnl_link_alloc_cache’
crash_test.c:30: warning: passing argument 1 of ‘rtnl_addr_alloc_cache’ from incompatible pointer type
/usr/include/netlink/route/addr.h:31: note: expected ‘struct nl_handle *’ but argument is of type ‘struct nl_sock *’
crash_test.c:30: error: too many arguments to function ‘rtnl_addr_alloc_cache’
crash_test.c:39: warning: passing argument 1 of ‘nl_close’ from incompatible pointer type
/usr/include/netlink/netlink.h:41: note: expected ‘struct nl_handle *’ but argument is of type ‘struct nl_sock *’

And on RHEL6, only libnl-1 is used, not libnl-3, so do you know when the fix will be backported to RHEL6?

Thanks.

Comment 7 Thomas Graf 2012-12-20 09:35:13 UTC
(In reply to comment #6)
> And on RHEL6, only libnl-1 is used, not libnl-3, so do you know when the fix
> will be backported to RHEL6?

I'm creating a stable branch for libnl1 and will do a new release today that can be used for RHEL6.

Comment 8 Daniel Berrangé 2012-12-20 10:35:30 UTC
Yes, the first program is enough to demonstrate the crash. The second program was merely to isolate the problem further

Comment 9 Thomas Graf 2012-12-20 13:07:45 UTC
A stable branch is being maintained here now:

https://github.com/tgraf/libnl-1.1-stable

I have backported all relevant fixes

Comment 13 yanbing du 2012-12-24 09:58:20 UTC
Created attachment 668424 [details]
run nc.c in a loop

Install the new libnl from https://github.com/tgraf/libnl-1.1-stable
and update netcf to netcf-0.1.9-3.el6.x86_64.

# virsh iface-list --all
Name                 State      MAC Address
--------------------------------------------
eth0                 active     00:23:ae:8f:f2:b3
lo                   active     00:00:00:00:00:00

# virsh iface-dumpxml eth0
<interface type='ethernet' name='eth0'>
  <mac address='00:23:ae:8f:f2:b3'/>
  <protocol family='ipv4'>
    <ip address='10.66.82.251' prefix='23'/>
  </protocol>
  <protocol family='ipv6'>
    <ip address='fe80::223:aeff:fe8f:f2b3' prefix='64'/>
  </protocol>
</interface>

But when running the nc program in a loop, in very few cases (1/100) it produces the following error:
#for i in {1..770}; do ./nc & (sleep 1; kill -9 `pidof ./nc`); sleep 1; done &> nc.log

......
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
error : not a socket
error : not a socket
error : not a socket
/usr/share/netcf/xml/interface.rng:1: parser error : /usr/share/netcf/xml/interface.rng:1: parser error : Document is empty

^
/usr/share/netcf/xml/interface.rng:1: parser error : Start tag expected, '<' not found

^
error : not a socket
/usr/share/netcf/xml/interface.rng:1: parser error : Document is empty

^
/usr/share/netcf/xml/interface.rng:1: parser error : Start tag expected, '<' not found

^
Segmentation fault (core dumped)
......

In /var/log/messages, there are a few log entries like the following:
......
Dec 24 16:45:00 intel-q9400-4-2 kernel: nc[16206]: segfault at 378d ip 000000339a87b6ec sp 00007f2361e3fcc8 error 4 in libc-2.12.so[339a800000+18a000]
......
Dec 24 17:22:14 intel-q9400-4-2 kernel: nc[31134]: segfault at 378d ip 000000339a87b6ec sp 00007fed5ebfccc8 error 4 in libc-2.12.so[339a800000+18a000]
Dec 24 17:22:14 intel-q9400-4-2 kernel: nc[31133]: segfault at 378d ip 000000339a87b6ec sp 00007fed5f5fdcc8 error 4 in libc-2.12.so[339a800000+18a000]
......

Comment 14 Richard W.M. Jones 2012-12-24 10:05:41 UTC
(In reply to comment #13)
> Segmentation fault (core dumped)

Can you find the core dump and get a stack trace from it?

Comment 15 yanbing du 2012-12-25 06:07:36 UTC
(gdb) run
The program being debugged has been started already.
Start it from the beginning? (y or n) y
Starting program: /home/ydu/nc 
[Thread debugging using libthread_db enabled]
[New Thread 0x7ffff7fe7700 (LWP 8250)]
[New Thread 0x7ffff75e6700 (LWP 8251)]
[New Thread 0x7ffff6be5700 (LWP 8252)]
error : not a socket
error : not a socket
/usr/share/netcf/xml/interface.rng:1: [New Thread 0x7ffff61e4700 (LWP 8253)]
parser error : Document is empty

^
/usr/share/netcf/xml/interface.rng:1: /usr/share/netcf/xml/interface.rng:1: [New Thread 0x7ffff57e3700 (LWP 8254)]
parser error : Start tag expected, '<' not found

^
parser error : Document is empty

^
/usr/share/netcf/xml/interface.rng:1: parser error : Start tag expected, '<' not found

^
*** glibc detected *** /home/ydu/nc: free(): invalid pointer: 0x00007ffff7fe6cc4 ***
*** glibc detected *** /home/ydu/nc: free(): invalid pointer: 0x00007ffff75e5cc4[New Thread 0x7ffff4de2700 (LWP 8255)]
error : Bad file descriptor
error : Bad file descriptor
/usr/share/netcf/xml/interface.rng:1: parser error : Document is empty

^
/usr/share/netcf/xml/interface.rng:1: parser error : Start tag expected, '<' not found

^
/usr/share/netcf/xml/interface.rng:1: *** glibc detected *** /home/ydu/ncparser error : Document is empty

^
[New Thread 0x7fffdffff700 (LWP 8256)]
error : not a socket
/usr/share/netcf/xml/interface.rng:1: parser error : Start tag expected, '<' not found

^
*** glibc detected *** /home/ydu/nc: free(): invalid pointer: 0x00007ffff61e3cc4 ***
/usr/share/netcf/xml/interface.rng:1: parser error : Document is empty

^
/usr/share/netcf/xml/interface.rng:1: parser error : Start tag expected, '<' not found

^
*** glibc detected *** /home/ydu/nc: free(): invalid pointer: 0x00007ffff57e2cc4 ***
[New Thread 0x7fffdf5fe700 (LWP 8257)]
error : not a socket
/usr/share/netcf/xml/interface.rng:1: parser error : Document is empty

^
error : /usr/share/netcf/xml/interface.rng:1: not a socket
parser ======= Backtrace: =========
error : Start tag expected, '<' not found

^
/lib64/libc.so.6/usr/share/netcf/xml/interface.rng:1: parser error : Document is empty

^
/usr/share/netcf/xml/interface.rng:1: parser error : Start tag expected, '<' not found

^
*** glibc detected *** /home/ydu/nc: free(): invalid pointer: 0x00007fffdfffecc4 ***
======= Backtrace: =========
/usr/lib64/libxml2.so.2(xmlNanoFTPFreeCtxt+0x3c)[0x33a409b00c]
/lib64/libc.so.6*** glibc detected *** /home/ydu/nc: free(): invalid pointer: 0x00007ffff4de1cc4 ***
/usr/lib64/libxml2.so.2(xmlNanoFTPFreeCtxt+0x3c)[0x======= Backtrace: =========
/usr/lib64/libxml2.so.2(xmlNanoFTPClose+0x56)[0x33a409b1d6]
/usr/lib64/libxml2.so.2(xmlNanoFTPClose+0x56)[0x33a409b1d6]
/usr/lib64/libxml2.so.2(xmlFreeParserInputBuffer+0x3b)[0x33a405e55b]
/usr/lib64/libxml2.so.2(xmlFreeInputStream+0x66)[0x33a4033c66]
/usr/lib64/libxml2.so.2(xmlFreeParserCtxt+0x20)[0x33a4033cb0]
/usr/lib64/libxml2.so.2[0x33a404df21]
/usr/lib64/libxml2.so.2(xmlRelaxNGParse+0x31)[0x33a40ef651]
/usr/lib64/libnetcf.so.1[0x3e4fa07c4d]
/usr/lib64/libnetcf.so.1(ncf_init+0xb6)[0x3e4fa049a6]
/home/ydu/nc[0x400761]
/lib64/libpthread.so.0[0x339ac07851]
/lib64/libc.so.6(clone+0x6d)[0x339a8e890d]
======= Memory map: ========
00400000-00401000 r-xp 00000000 08:01 393567                             /home/ydu/nc
00600000-00601000 rw-p 00000000 08:01 393567                             /home/ydu/nc
00601000-00622000 rw-p 00000000 00:00 0                                  [heap]
339a000000-339a020000 r-xp 00000000 08:01 931139                         /lib64/ld-2.12.so
339a21f000-339a220000 r--p 0001f000 08:01 931139                         /lib64/ld-2.12.so
339a220000-339a221000 rw-p 00020000 08:01 931139                         /lib64/ld-2.12.so
339a221000-339a222000 rw-p 00000000 00:00 0 
339a400000-339a402000 r-xp 00000000 08:01 931149                         /lib64/libdl-2.12.so
339a402000-339a602000 ---p 00002000 08:01 931149                         /lib64/libdl-2.12.so
339a602000-339a603000 r--p 00002000 08:01 931149                         /lib64/libdl-2.12.so
339a603000-339a604000 rw-p 00003000 08:01 931149                         /lib64/libdl-2.12.so
339a800000-339a98a000 r-xp 00000000 08:01 931141                         /lib64/libc-2.12.so
339a98a000-339ab89000 ---p 0018a000 08:01 931141                         /lib64/libc-2.12.so
339ab89000-339ab8d000 r--p 00189000 08:01 931141                         /lib64/libc-2.12.so
339ab8d000-339ab8e000 rw-p 0018d000 08:01 931141                         /lib64/libc-2.12.so
339ab8e000-339ab93000 rw-p 00000000 00:00 0 
339ac00000-339ac17000 r-xp 00000000 08:01 931147                         /lib64/libpthread-2.12.so
339ac17000-339ae17000 ---p 00017000 08:01 931147                         /lib64/libpthread-2.12.so
339ae17000-339ae18000 r--p 00017000 08:01 931147                         /lib64/libpthread-2.12.so
339ae18000-339ae19000 rw-p 00018000 08:01 931147                         /lib64/libpthread-2.12.so
339ae19000-339ae1d000 rw-p 00000000 00:00 0 
339b000000-339b083000 r-xp 00000000 08:01 931143                         /lib64/libm-2.12.so
339b083000-339b282000 ---p 00083000 08:01 931143                         /lib64/libm-2.12.so
339b282000-339b283000 r--p 00082000 08:01 931143                         /lib64/libm-2.12.so
339b283000-339b284000 rw-p 00083000 08:01 931143                         /lib64/libm-2.12.so
339bc00000-339bc15000 r-xp 00000000 08:01 931144                         /lib64/libz.so.1.2.3
339bc15000-339be14000 ---p 00015000 08:01 931144                         /lib64/libz.so.1.2.3
339be14000-339be15000 r--p 00014000 08:01 931144                         /lib64/libz.so.1.2.3
339be15000-339be16000 rw-p 00015000 08:01 931144                         /lib64/libz.so.1.2.3
339c000000-339c01d000 r-xp 00000000 08:01 931157                         /lib64/libselinux.so.1
339c01d000-339c21c000 ---p 0001d000 08:01 931157                         /lib64/libselinux.so.1
339c21c000-339c21d000 r--p 0001c000 08:01 931157                         /lib64/libselinux.so.1
339c21d000-339c21e000 rw-p 0001d000 08:01 931157                         /lib64/libselinux.so.1
339c21e000-339c21f000 rw-p 00000000 00:00 0 
339dc00000-339dc41000 r-xp 00000000 08:01 659576                         /usr/lib64/libaugeas.so.0.14.0
339dc41000-339de41000 ---p 00041000 08:01 659576                         /usr/lib64/libaugeas.so.0.14.0
339de41000-339de43000 rw-p 00041000 08:01 659576                         /usr/lib64/libaugeas.so.0.14.0
339e000000-339e011000 r-xp 00000000 08:01 661903                         /usr/lib64/libfa.so.1.3.4
339e011000-339e210000 ---p 00011000 08:01 661903                         /usr/lib64/libfa.so.1.3.4
339e210000-339e211000 rw-p 00010000 08:01 661903                         /usr/lib64/libfa.so.1.3.4
33a4000000-33a4148000 r-xp 00000000 08:01 662483                         /usr/lib64/libxml2.so.2.7.6
33a4148000-33a4347000 ---p 00148000 08:01 662483                         /usr/lib64/libxml2.so.2.7.6
33a4347000-33a4351000 rw-p 00147000 08:01 662483                         /usr/lib64/libxml2.so.2.7.6
33a4351000-33a4352000 rw-p 00000000 00:00 0 
33a6c00000-33a6c16000 r-xp 00000000 08:01 931146                         /lib64/libgcc_s-4.4.7-20120601.so.1
33a6c16000-33a6e15000 ---p 00016000 08:01 931146                         /lib64/libgcc_s-4.4.7-20120601.so.1
33a6e15000-33a6e16000 rw-p 00015000 08:01 931146                         /lib64/libgcc_s-4.4.7-20120601.so.1
33ab400000-33ab403000 r-xp 00000000 08:01 931171                         /lib64/libgpg-error.so.0.5.0
33ab403000-33ab602000 ---p 00003000 08:01 931171                         /lib64/libgpg-error.so.0.5.0
33ab602000-33ab603000 r--p 00002000 08:01 931171                         /lib64/libgpg-error.so.0.5.0
33ab603000-33ab604000 rw-p 00003000 08:01 931171                         /lib64/libgpg-error.so.0.5.0
33ac400000-33ac472000 r-xp 00000000 08:01 931172                         /lib64/libgcrypt.so.11.5.3
33ac472000-33ac671000 ---p 00072000 08:01 931172                         /lib64/libgcrypt.so.11.5.3
33ac671000-33ac672000 r--p 00071000 08:01 931172                         /lib64/libgcrypt.so.11.5.3
33ac672000-33ac675000 rw-p 00072000 08:01 931172                         /lib64/libgcrypt.so.11.5.3
33d8600000-33d8613000 r-xp 00000000 08:01 677463                         /usr/lib64/libexslt.so.0.8.15
33d8613000-33d8813000 ---p 00013000 08:01 677463                         /usr/lib64/libexslt.so.0.8.15
33d8813000-33d8814000 rw-p 00013000 08:01 677463                         /usr/lib64/libexslt.so.0.8.15
33d9200000-33d923b000 r-xp 00000000 08:01 660217                         /usr/lib64/libxslt.so.1.1.26
33d923b000-33d943b000 ---p 0003b000 08:01 660217                         /usr/lib64/libxslt.so.1.1.26
33d943b000-33d943d000 rw-p 0003b000 08:01 660217                         /usr/lib64/libxslt.so.1.1.26
3e4f600000-3e4f64c000 r-xp 00000000 08:01 930343                         /lib64/libnl.so.1.1.1
3e4f64c000-3e4f84c000 ---p 0004c000 08:01 930343                         /lib64/libnl.so.1.1.1
3e4f84c000-3e4f851000 rw-p 0004c000 08:01 930343                         /lib64/libnl.so.1.1.1
3e4fa00000-3e4fa0e000 r-xp 00000000 08:01 676243                         /usr/lib64/libnetcf.so.1.4.0
3e4fa0e000-3e4fc0d000 ---p 0000e000 08:01 676243                         /usr/lib64/libnetcf.so.1.4.0
3e4fc0d000-3e4fc0e000 rw-p 0000d000 08:01 676243                         /usr/lib64/libnetcf.so.1.4.0/usr/lib64/libxml2.so.2(xmlFreeInputStream+0x66)[0x33a4033c66]
/usr/lib64/libxml2.so.2(xmlNanoFTPClose+0x56)[0x33a409b1d6]
/lib64/libc.so.6[0x339a8760e6]
/usr/lib64/libxml2.so.2(xmlFreeParserCtxt+0x20)[0x33a4033cb0]
/usr/lib64/libxml2.so.2[0x33a404df21]
/usr/lib64/libxml2.so.2(xmlRelaxNGParse+0x31)
Program received signal SIGABRT, Aborted.
[Switching to Thread 0x7ffff57e3700 (LWP 8254)]
0x000000339a8328a5 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install libgcc-4.4.7-3.el6.x86_64
(gdb) thread apply all bt

Comment 16 yanbing du 2012-12-25 06:10:27 UTC
Created attachment 668718 [details]
thread apply all bt

Comment 17 Richard W.M. Jones 2012-12-25 08:03:25 UTC
I would say this looks like a different bug in libxml2's
nano-FTP library.

Since this is RHEL 6, it may also be a bug that has since
been fixed upstream, although I seem to recall that danpb
saw something similar in Fedora.

Comment 18 yanbing du 2013-01-04 07:07:40 UTC
Hi Daniel Berrange,
  Regarding the problem mentioned in comment 15 and comment 17, what should I do with this bug? Should I close this one as VERIFIED and open a new one against another component?
  Please help, thanks!

Comment 19 Laine Stump 2013-01-08 16:14:01 UTC
(In reply to comment #13)
> Install the new libnl from https://github.com/tgraf/libnl-1.1-stable
> and update netcf to netcf-0.1.9-3.el6.x86_64.

You *should not* install a new libnl from some external source to test this bz. That version of libnl will not be included in RHEL6.4 (because libnl isn't in the ACL). We were able to get netcf added to the ACL, and it has been patched to avoid the offending code in libnl. So the proper test is to install the new official netcf build *only*, then run the nc program in a loop.

(In reply to comment #18)
> As the problem mentioned in comment 15 & comment 17, what should i do with
> this bug, should i close this one as VERIFIED and open a new one against
> another component? 

Yes, that is a different bug, and looks to be a problem in libxml2, as Rich suggested. You should file a different BZ and put the component as libxml2 (for now at least).

In the meantime, if you no longer encounter the old crash when you run nc in a loop with the updated netcf installed *but without the unofficial libnl update*, then  you should mark this bug as VERIFIED.

Comment 20 yanbing du 2013-01-09 04:09:44 UTC
(In reply to comment #19)
> (In reply to comment #13)
> > Install the new libnl from https://github.com/tgraf/libnl-1.1-stable
> > and update netcf to netcf-0.1.9-3.el6.x86_64.
> 
> You *should not* install a new libnl from some external source to test this
> bz. That version of libnl will not be included in RHEL6.4 (because libnl
> isn't in the ACL). We were able to get netcf added to the ACL, and it has
> been patched to avoid the offending code in libnl. So the proper test is to
> install the new official netcf build *only*, then run the nc program in a
> loop.
> 

I see, thanks for the detailed explanation.
I just installed the latest netcf package (netcf-0.1.9-3.el6.x86_64) and ran the nc program in a loop:
 
#for i in {1..770}; do ./nc & (sleep 1; kill -9 `pidof ./nc`); sleep 1; done &> nc.log

After the loop finished, there were 3 'core dumped' results.

# tail -20 /var/log/messages
Jan  9 10:57:03 intel-8400-8-2 abrtd: Directory 'ccpp-2013-01-09-10:57:02-26195' creation detected
Jan  9 10:57:03 intel-8400-8-2 abrt[26218]: Saved core dump of pid 26195 (/root/nc) to /var/spool/abrt/ccpp-2013-01-09-10:57:02-26195 (213925888 bytes)
Jan  9 10:57:03 intel-8400-8-2 abrtd: Executable '/root/nc' doesn't belong to any package
Jan  9 10:57:03 intel-8400-8-2 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-01-09-10:57:02-26195' exited with 1
Jan  9 10:57:03 intel-8400-8-2 abrtd: Corrupted or bad directory /var/spool/abrt/ccpp-2013-01-09-10:57:02-26195, deleting
Jan  9 11:08:37 intel-8400-8-2 abrtd: Directory 'ccpp-2013-01-09-11:08:37-2293' creation detected
Jan  9 11:08:37 intel-8400-8-2 abrt[2316]: Saved core dump of pid 2293 (/root/nc) to /var/spool/abrt/ccpp-2013-01-09-11:08:37-2293 (214310912 bytes)
Jan  9 11:08:37 intel-8400-8-2 abrtd: Executable '/root/nc' doesn't belong to any package
Jan  9 11:08:37 intel-8400-8-2 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-01-09-11:08:37-2293' exited with 1
Jan  9 11:08:37 intel-8400-8-2 abrtd: Corrupted or bad directory /var/spool/abrt/ccpp-2013-01-09-11:08:37-2293, deleting
Jan  9 11:13:13 intel-8400-8-2 dhclient[1251]: DHCPREQUEST on switch to 10.66.78.111 port 67 (xid=0x12d18fac)
Jan  9 11:13:13 intel-8400-8-2 dhclient[1251]: DHCPACK from 10.66.78.111 (xid=0x12d18fac)
Jan  9 11:13:14 intel-8400-8-2 dhclient[1251]: bound to 10.66.83.216 -- renewal in 6471 seconds.
Jan  9 11:21:07 intel-8400-8-2 abrt[11594]: Saved core dump of pid 11571 (/root/nc) to /var/spool/abrt/ccpp-2013-01-09-11:21:06-11571 (214011904 bytes)
Jan  9 11:21:07 intel-8400-8-2 abrtd: Directory 'ccpp-2013-01-09-11:21:06-11571' creation detected
Jan  9 11:21:07 intel-8400-8-2 abrtd: Executable '/root/nc' doesn't belong to any package
Jan  9 11:21:07 intel-8400-8-2 abrtd: 'post-create' on '/var/spool/abrt/ccpp-2013-01-09-11:21:06-11571' exited with 1
Jan  9 11:21:07 intel-8400-8-2 abrtd: Corrupted or bad directory /var/spool/abrt/ccpp-2013-01-09-11:21:06-11571, deleting
Jan  9 11:23:38 intel-8400-8-2 dnsmasq-dhcp[18747]: DHCPREQUEST(virbr0) 192.168.122.93 52:54:00:99:72:50 
Jan  9 11:23:38 intel-8400-8-2 dnsmasq-dhcp[18747]: DHCPACK(virbr0) 192.168.122.93 52:54:00:99:72:50 



> (In reply to comment #18)
> > As the problem mentioned in comment 15 & comment 17, what should i do with
> > this bug, should i close this one as VERIFIED and open a new one against
> > another component? 
> 
> Yes, that is a different bug, and looks to be a problem in libxml2, as Rich
> suggested. You should file a different BZ and put the component as libxml2
> (for now at least).
> 
> In the meantime, if you no longer encounter the old crash when you run nc in
> a loop with the updated netcf installed *but without the unofficial libnl
> update*, then  you should mark this bug as VERIFIED.

Yes, the old crash no longer occurs, but there were still 3 'core dumped' results. Do you think it's OK to mark this bug as VERIFIED?

Comment 21 yanbing du 2013-01-09 04:11:21 UTC
Created attachment 675236 [details]
run nc.c in a loop

Comment 22 Laine Stump 2013-01-15 17:18:50 UTC
Yes. Mark this bug as VERIFIED and open a new bug against libxml2 for the new crashes. If you can collect the coredump, run gdb on it, and attach the output of "thread apply all bt" to that new bug, I'm sure it would be very helpful.

Comment 23 yanbing du 2013-01-16 03:07:58 UTC
(In reply to comment #22)
> Yes. Mark this bug as VERIFIED and open a new bug against libxml2 for the
> new crashes. If you can collect the coredump, run gdb on it, and attach the
> output of "thread apply all bt" to that new bug, I'm sure it would be very
> helpful.

Thanks.
Moving this one to VERIFIED, and I will file a new bug against libxml2.

Comment 24 yanbing du 2013-01-16 06:33:59 UTC
(In reply to comment #23)
> (In reply to comment #22)
> > Yes. Mark this bug as VERIFIED and open a new bug against libxml2 for the
> > new crashes. If you can collect the coredump, run gdb on it, and attach the
> > output of "thread apply all bt" to that new bug, I'm sure it would be very
> > helpful.
> 
> Thanks.
> Move this one to VERIFIED, and i will file a new bug against libxml2.

I can NOT reproduce the crash anymore, so I will not file a new bug until I encounter it again.

Comment 26 errata-xmlrpc 2013-02-21 11:04:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-0494.html

