Bug 886454 - libnl addr and link caches are not thread-safe, leading to libvirtd crashes
Status: CLOSED CURRENTRELEASE
Product: Fedora
Classification: Fedora
Component: netcf
Version: 18
Hardware: x86_64 Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: ---
Assigned To: Laine Stump
QA Contact: Fedora Extras Quality Assurance
abrt_hash:63589980492b58b5b3206b1722c...
Duplicates: 875741
Depends On:
Blocks: 886862
Reported: 2012-12-12 05:29 EST by Richard W.M. Jones
Modified: 2013-01-19 22:35 EST

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 886862 (view as bug list)
Environment:
Last Closed: 2013-01-19 22:19:47 EST
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:


Attachments
backtrace (40.84 KB, text/plain), 2012-12-12 05:29 EST, Richard W.M. Jones
cgroup (129 bytes, text/plain), 2012-12-12 05:29 EST, Richard W.M. Jones
core_backtrace (3.15 KB, text/plain), 2012-12-12 05:29 EST, Richard W.M. Jones
dso_list (7.24 KB, text/plain), 2012-12-12 05:29 EST, Richard W.M. Jones
environ (186 bytes, text/plain), 2012-12-12 05:29 EST, Richard W.M. Jones
limits (1.29 KB, text/plain), 2012-12-12 05:29 EST, Richard W.M. Jones
maps (1018.64 KB, text/plain), 2012-12-12 05:30 EST, Richard W.M. Jones
open_fds (731 bytes, text/plain), 2012-12-12 05:30 EST, Richard W.M. Jones
proc_pid_status (928 bytes, text/plain), 2012-12-12 05:30 EST, Richard W.M. Jones
var_log_messages (232 bytes, text/plain), 2012-12-12 05:30 EST, Richard W.M. Jones

Description Richard W.M. Jones 2012-12-12 05:29:30 EST
Description of problem:
I was running the libguestfs test suite, and it died here:

/home/rjones/d/libguestfs/run --test ./test-max-disks.pl
libvir: XML-RPC error : Cannot recv data: Connection reset by peer
could not connect to libvirt (URI = NULL): Cannot recv data: Connection reset by peer [code=38 domain=7] at /home/rjones/d/libguestfs/tests/disks/test-max-disks.pl line 46.
max_disks is 255
/home/rjones/d/libguestfs/run: command failed with exit code 104
FAIL: test-max-disks.pl

Version-Release number of selected component:
libvirt-daemon-0.10.2.1-3.fc18

Additional info:
backtrace_rating: 4
cmdline:        /usr/sbin/libvirtd --timeout=30
crash_function: nl_object_put
executable:     /usr/sbin/libvirtd
kernel:         3.6.9-4.fc18.x86_64
remote_result:  NOTFOUND
uid:            1000

Truncated backtrace:
Thread no. 1 (10 frames)
 #4 nl_object_put at object.c:197
 #5 nl_object_free at object.c:158
 #6 nl_cache_remove at cache.c:484
 #7 nl_cache_clear at cache.c:347
 #8 nl_cache_free at cache.c:364
 #9 netlink_close at dutil_linux.c:864
 #10 drv_close at drv_redhat.c:384
 #11 ncf_close at netcf.c:101
 #12 interfaceCloseInterface at interface/interface_backend_netcf.c:170
 #13 virConnectDispose at datatypes.c:134
Comment 1 Richard W.M. Jones 2012-12-12 05:29:35 EST
Created attachment 662242
File: backtrace
Comment 2 Richard W.M. Jones 2012-12-12 05:29:37 EST
Created attachment 662243
File: cgroup
Comment 3 Richard W.M. Jones 2012-12-12 05:29:39 EST
Created attachment 662244
File: core_backtrace
Comment 4 Richard W.M. Jones 2012-12-12 05:29:41 EST
Created attachment 662245
File: dso_list
Comment 5 Richard W.M. Jones 2012-12-12 05:29:43 EST
Created attachment 662246
File: environ
Comment 6 Richard W.M. Jones 2012-12-12 05:29:45 EST
Created attachment 662247
File: limits
Comment 7 Richard W.M. Jones 2012-12-12 05:30:01 EST
Created attachment 662248
File: maps
Comment 8 Richard W.M. Jones 2012-12-12 05:30:03 EST
Created attachment 662249
File: open_fds
Comment 9 Richard W.M. Jones 2012-12-12 05:30:05 EST
Created attachment 662251
File: proc_pid_status
Comment 10 Richard W.M. Jones 2012-12-12 05:30:08 EST
Created attachment 662252
File: var_log_messages
Comment 11 Daniel Berrange 2012-12-12 05:34:26 EST
Superficially, the stack trace points towards libnl as being at fault. Can you tell me which 'libnl3' and 'netcf' library versions are installed?
Comment 12 Daniel Berrange 2012-12-13 06:30:33 EST
The libnl3 code that's causing the crash is this:

        if (obj->ce_refcnt < 0)
                BUG();

so libnl's ref counting seems to have a bug somewhere :-(
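
To make the failure mode concrete, here is a minimal sketch of how a reference count can end up negative. It is a toy model, not libnl code: `toy_obj`, `toy_obj_put` and `demo_double_put` are hypothetical names, and the assumption is simply that two owners each release a reference that only one of them actually held.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object; only the counter mirrors the idea of ce_refcnt. */
struct toy_obj {
    int refcnt;
};

static struct toy_obj *toy_obj_alloc(void)
{
    struct toy_obj *obj = calloc(1, sizeof(*obj));

    if (obj)
        obj->refcnt = 1;    /* the creator holds the first reference */
    return obj;
}

/* Drop one reference and return the new count. Unlike the BUG() check
 * quoted above, this toy never aborts and never frees, so the caller
 * can inspect the counter safely. */
static int toy_obj_put(struct toy_obj *obj)
{
    return --obj->refcnt;
}

/* Two owners both believe they hold the only reference and release it. */
int demo_double_put(void)
{
    struct toy_obj *obj = toy_obj_alloc();
    int after;

    assert(obj != NULL);
    toy_obj_put(obj);           /* owner A: 1 -> 0 */
    after = toy_obj_put(obj);   /* owner B: 0 -> -1, the state BUG() traps */
    free(obj);
    return after;
}
```

In this model the second release is the one that would trip a `refcnt < 0` check; whether the real crash comes from a double release or from unsynchronized updates is not established at this point in the report.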
Comment 13 Daniel Berrange 2012-12-13 07:03:14 EST
It appears that ncf_init/ncf_close are not thread-safe :-(


$ cat nc.c
#include <netcf.h>
#include <pthread.h>
#include <stdlib.h>

void *worker(void *data)
{
  for (;;) {
    struct netcf *netcf;

    if (ncf_init(&netcf, NULL) != 0)
      abort();

    ncf_close(netcf);
  }
}

int main (int argc, char **argv)
{
  int nthreads = 20;
  pthread_t threads[nthreads];
  size_t i;

  for (i = 0  ; i < nthreads ; i++) {
    pthread_create(&threads[i], NULL,
		   worker, NULL);
  }

  for (i = 0  ; i < nthreads ; i++) {
    pthread_join(threads[i], NULL);
  }
  
  return 0;
}

$ ./nc 
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Segmentation fault
[berrange@mustard ~]$ ./nc 
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted
[berrange@mustard ~]$ ./nc 

BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted
[berrange@mustard ~]$ 
[berrange@mustard ~]$ ^C
[berrange@mustard ~]$ ./nc 
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted


There appear to be two problems here. The first appears to be libxml-related: the RNG schema warnings. The second is libnl3-related, and can be isolated with the following test:

#include <netlink/netlink.h>
#include <netlink/route/addr.h>
#include <netlink/route/link.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Each thread repeatedly sets up and tears down its own socket and caches. */
void *worker(void *data)
{
  (void)data;

  for (;;) {
    struct nl_sock     *nl_sock;
    struct nl_cache   *link_cache;
    struct nl_cache   *addr_cache;

    if (!(nl_sock = nl_socket_alloc())) {
      perror("nl_socket_alloc");
      abort();
    }

    if (nl_connect(nl_sock, NETLINK_ROUTE) < 0) {
      perror("nl_connect");
      abort();
    }

    if (rtnl_link_alloc_cache(nl_sock, AF_UNSPEC, &link_cache) < 0) {
      perror("rtnl_link_alloc_cache");
      abort();
    }
    /* Registers the cache in the process-wide slot; needed to reproduce. */
    nl_cache_mngt_provide(link_cache);

    if (rtnl_addr_alloc_cache(nl_sock, &addr_cache) < 0) {
      perror("rtnl_addr_alloc_cache");
      abort();
    }
    /* Same for the address cache. */
    nl_cache_mngt_provide(addr_cache);

    nl_cache_free(addr_cache);
    nl_cache_free(link_cache);
    nl_close(nl_sock);
    nl_socket_free(nl_sock);
  }
}

int main (int argc, char **argv)
{
  int nthreads = 20;
  pthread_t threads[nthreads];
  int i;

  (void)argc;
  (void)argv;

  /* Run 20 workers concurrently so they race on the shared registration. */
  for (i = 0  ; i < nthreads ; i++) {
    pthread_create(&threads[i], NULL, worker, NULL);
  }

  /* The workers never return; the program runs until it crashes. */
  for (i = 0  ; i < nthreads ; i++) {
    pthread_join(threads[i], NULL);
  }

  return 0;
}


This test program will crash. If you comment out the two nl_cache_mngt_provide calls then the crashes go away.

Looking at the libnl3 code, this is not surprising:

/**
 * Provide a cache for global use
 * @arg cache           cache to provide
 *
 * Offers the specified cache to be used by other modules.
 * Only one cache per type may be shared at a time,
 * a previsouly provided caches will be overwritten.
 */
void nl_cache_mngt_provide(struct nl_cache *cache)
{
        struct nl_cache_ops *ops;

        ops = cache_ops_lookup_for_obj(cache->c_ops->co_obj_ops);
        if (!ops)
                BUG();
        else
                ops->co_major_cache = cache;
}


Note the comment that only a single cache can be used at a time. This is a process-wide global cache, held in a static global variable:

  static struct nl_cache_ops *cache_ops;

This is really awful design from libnl3. The caches really need to be scoped to the nl_sock.

It is not sufficient for netcf to simply do a one-time init of the caches itself, because other parts of libvirt also use libnl, so netcf can't assume it is the only owner of the caches.

AFAICT, the only viable option is to *not* register the caches at all. I'm not sure what that will do to performance, though.
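
The hazard with a single process-wide slot can be sketched without libnl at all. The following is a toy model under stated assumptions: `toy_cache`, `toy_provide`, `toy_cache_release` and `major_cache` are hypothetical stand-ins, and the model assumes that nothing clears the slot when the cache it points to is released. It is not a description of libnl3's exact internals.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cache; 'alive' stands in for "not yet freed". */
struct toy_cache {
    int owner_id;
    int alive;
};

/* One slot for the whole process, like a global "major cache" pointer. */
static struct toy_cache *major_cache;

static void toy_provide(struct toy_cache *c)
{
    major_cache = c;    /* silently replaces whatever was there before */
}

static void toy_cache_release(struct toy_cache *c)
{
    c->alive = 0;       /* assumption: release does not touch the slot */
}

/* Returns 1 if the global slot ends up pointing at a released cache. */
int demo_stale_global_slot(void)
{
    struct toy_cache a = { 1, 1 };
    struct toy_cache b = { 2, 1 };
    int stale;

    toy_provide(&a);        /* connection A registers its cache */
    toy_provide(&b);        /* connection B overwrites the registration */
    toy_cache_release(&b);  /* B closes first */

    /* A is still open, yet a global lookup now yields B's dead cache. */
    stale = (major_cache == &b && !major_cache->alive);
    major_cache = NULL;
    return stale;
}
```

The sequence here is single-threaded on purpose; with 20 threads interleaving provide and free calls, as in the test program, the same kind of stale pointer can appear at any moment.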
Comment 14 Daniel Berrange 2012-12-13 08:56:18 EST
*** Bug 875741 has been marked as a duplicate of this bug. ***
Comment 15 Richard W.M. Jones 2012-12-13 09:23:50 EST
How about putting a mutex around the guts of
nl_cache_mngt_provide?  Gnulib's glthread module would
be ideal for this.
Comment 16 Daniel Berrange 2012-12-13 09:28:49 EST
That alone would not be sufficient - you'd need to mutex protect every other method that uses the global cache. Also I think you'd need to actually protect the users of the cache directly, since they can do  "cache lookup....some work...cache insert" and you'd not want the cache instance changing during this time.
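
As a rough illustration of that point, the sketch below holds one lock across the whole "lookup, work, insert" sequence rather than around each step. Everything in it is hypothetical (`toy_lock`, `toy_cached_value`, `toy_worker`, `demo_locked_sequence`); a plain counter stands in for the shared cache, and POSIX threads are assumed.

```c
#include <pthread.h>
#include <stddef.h>

#define TOY_THREADS 4
#define TOY_ITERS   10000

static pthread_mutex_t toy_lock = PTHREAD_MUTEX_INITIALIZER;
static long toy_cached_value;   /* stands in for a shared cache entry */

static void *toy_worker(void *arg)
{
    int i;

    (void)arg;
    for (i = 0; i < TOY_ITERS; i++) {
        long v;

        /* The lock covers the full sequence, so no other thread can
         * change the entry between the lookup and the insert. */
        pthread_mutex_lock(&toy_lock);
        v = toy_cached_value;       /* cache lookup */
        v = v + 1;                  /* some work */
        toy_cached_value = v;       /* cache insert */
        pthread_mutex_unlock(&toy_lock);
    }
    return NULL;
}

/* Returns the final value, or -1 if a thread could not be started. */
long demo_locked_sequence(void)
{
    pthread_t threads[TOY_THREADS];
    size_t started = 0;
    size_t i;

    toy_cached_value = 0;
    while (started < TOY_THREADS &&
           pthread_create(&threads[started], NULL, toy_worker, NULL) == 0)
        started++;

    for (i = 0; i < started; i++)
        pthread_join(threads[i], NULL);

    return started == TOY_THREADS ? toy_cached_value : -1;
}
```

Locking only inside the lookup and insert helpers would leave a window between them in which another thread could update the entry, which is the situation described above.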
Comment 17 Thomas Graf 2012-12-13 12:53:36 EST
Making accesses to a single cache safe from multiple threads will become incredibly complex and is likely not to perform at all.

What seems to make the most sense is to not expose caches registered from one cache to other caches. It shouldn't be hard to convert it to a per-thread list of caches.
Comment 18 Daniel Berrange 2012-12-13 13:00:42 EST
I'm not sure that per-thread caches would match the way libvirt uses netcf, and thus libnl. Each nl_sock instance netcf creates is associated with a single virConnectPtr instance. Libvirt serializes access to each virConnectPtr instance, but they can be used by multiple threads. So we really need the caches associated with the nl_sock instances, not threads.
Comment 19 Thomas Graf 2012-12-13 13:09:14 EST
I see, the only way to achieve that seems to be by adding a new function that changes the behaviour of libnl to create per socket namespaces for registration. It wouldn't be libnl's default behaviour but libvirt could enable it.
Comment 20 Laine Stump 2012-12-13 13:51:41 EST
Is there really existing application code that depends on different nl_socks sharing the same cache? Since the nl_sock is passed to rtnl_*_alloc_cache(), it seems perfectly suited to just give each nl_sock its own caches and be done with it.
Comment 21 Thomas Graf 2012-12-17 03:35:50 EST
It's not that simple, unfortunately. It is very common to use multiple sockets, for example when a protocol is based on generic netlink and needs to resolve interface indices to interface names and thus maintains a link cache that will be used from the generic netlink socket context. Or when handling multicast notifications, which are received on a separate async socket.

Why are you providing, or rather sharing, the caches anyway? Where do you access them from again? Couldn't you handle the distribution of caches yourself?
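
One way to read "handle the distribution of caches yourself" is for the application to keep each cache inside its own per-connection context and pass it explicitly to whatever needs it, instead of registering it globally. The sketch below is a toy model of that idea only: `toy_conn`, `toy_cache`, `toy_conn_link_name` and `demo_private_caches` are hypothetical names, and no libnl calls are used.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical link cache: a tiny ifindex -> name table. */
struct toy_cache {
    int ifindex[4];
    const char *ifname[4];
    size_t n;
};

/* Hypothetical per-connection context that owns its cache outright. */
struct toy_conn {
    struct toy_cache link_cache;
};

static const char *toy_link_name(const struct toy_cache *cache, int ifindex)
{
    size_t i;

    for (i = 0; i < cache->n; i++)
        if (cache->ifindex[i] == ifindex)
            return cache->ifname[i];
    return NULL;
}

/* Lookups always go through the connection that owns the cache. */
const char *toy_conn_link_name(const struct toy_conn *conn, int ifindex)
{
    return toy_link_name(&conn->link_cache, ifindex);
}

/* Returns 1 if each connection sees only its own cache contents. */
int demo_private_caches(void)
{
    struct toy_conn a = { { { 1, 2 }, { "lo", "eth0" }, 2 } };
    struct toy_conn b = { { { 1 }, { "lo" }, 1 } };
    const char *in_a = toy_conn_link_name(&a, 2);
    const char *in_b = toy_conn_link_name(&b, 2);

    return in_a != NULL && strcmp(in_a, "eth0") == 0 && in_b == NULL;
}
```

With this layout, closing one connection cannot invalidate a cache that another connection is using, at the cost of each connection filling its own cache. Whether that cost is acceptable is the performance question raised in comment 13.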
Comment 22 Fedora Update System 2012-12-21 13:59:36 EST
netcf-0.2.3-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/netcf-0.2.3-1.fc18
Comment 23 Fedora Update System 2012-12-21 14:05:52 EST
netcf-0.2.3-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/netcf-0.2.3-1.fc17
Comment 24 Fedora Update System 2012-12-22 16:13:40 EST
Package netcf-0.2.3-1.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing netcf-0.2.3-1.fc18'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-20855/netcf-0.2.3-1.fc18
then log in and leave karma (feedback).
Comment 25 Fedora Update System 2013-01-19 22:19:53 EST
netcf-0.2.3-1.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.
Comment 26 Fedora Update System 2013-01-19 22:35:59 EST
netcf-0.2.3-1.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.
