886454 – libnl addr and link caches are not thread-safe, leading to libvirtd crashes

Bug 886454 - libnl addr and link caches are not thread-safe, leading to libvirtd crashes

Summary: libnl addr and link caches are not thread-safe, leading to libvirtd crashes

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	netcf
Sub Component:
Version:	18
Hardware:	x86_64
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Laine Stump
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	abrt_hash:63589980492b58b5b3206b1722c...
Duplicates (1):	875741 (view as bug list)
Depends On:
Blocks:	886862
TreeView+	depends on / blocked

Reported:	2012-12-12 10:29 UTC by Richard W.M. Jones
Modified:	2013-01-20 03:35 UTC (History)
CC List:	10 users (show)
Fixed In Version:
Clone Of:
Clones:	886862 (view as bug list)
Environment:
Last Closed:	2013-01-20 03:19:47 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
File: backtrace (40.84 KB, text/plain) 2012-12-12 10:29 UTC, Richard W.M. Jones	no flags	Details
File: cgroup (129 bytes, text/plain) 2012-12-12 10:29 UTC, Richard W.M. Jones	no flags	Details
File: core_backtrace (3.15 KB, text/plain) 2012-12-12 10:29 UTC, Richard W.M. Jones	no flags	Details
File: dso_list (7.24 KB, text/plain) 2012-12-12 10:29 UTC, Richard W.M. Jones	no flags	Details
File: environ (186 bytes, text/plain) 2012-12-12 10:29 UTC, Richard W.M. Jones	no flags	Details
File: limits (1.29 KB, text/plain) 2012-12-12 10:29 UTC, Richard W.M. Jones	no flags	Details
File: maps (1018.64 KB, text/plain) 2012-12-12 10:30 UTC, Richard W.M. Jones	no flags	Details
File: open_fds (731 bytes, text/plain) 2012-12-12 10:30 UTC, Richard W.M. Jones	no flags	Details
File: proc_pid_status (928 bytes, text/plain) 2012-12-12 10:30 UTC, Richard W.M. Jones	no flags	Details
File: var_log_messages (232 bytes, text/plain) 2012-12-12 10:30 UTC, Richard W.M. Jones	no flags	Details
View All

Description Richard W.M. Jones 2012-12-12 10:29:30 UTC

Description of problem:
I was running the libguestfs test suite, and it died here:

/home/rjones/d/libguestfs/run --test ./test-max-disks.pl
libvir: XML-RPC error : Cannot recv data: Connection reset by peer
could not connect to libvirt (URI = NULL): Cannot recv data: Connection reset by peer [code=38 domain=7] at /home/rjones/d/libguestfs/tests/disks/test-max-disks.pl line 46.
max_disks is 255
/home/rjones/d/libguestfs/run: command failed with exit code 104
FAIL: test-max-disks.pl

Version-Release number of selected component:
libvirt-daemon-0.10.2.1-3.fc18

Additional info:
backtrace_rating: 4
cmdline:        /usr/sbin/libvirtd --timeout=30
crash_function: nl_object_put
executable:     /usr/sbin/libvirtd
kernel:         3.6.9-4.fc18.x86_64
remote_result:  NOTFOUND
uid:            1000

Truncated backtrace:
Thread no. 1 (10 frames)
 #4 nl_object_put at object.c:197
 #5 nl_object_free at object.c:158
 #6 nl_cache_remove at cache.c:484
 #7 nl_cache_clear at cache.c:347
 #8 nl_cache_free at cache.c:364
 #9 netlink_close at dutil_linux.c:864
 #10 drv_close at drv_redhat.c:384
 #11 ncf_close at netcf.c:101
 #12 interfaceCloseInterface at interface/interface_backend_netcf.c:170
 #13 virConnectDispose at datatypes.c:134

Comment 1 Richard W.M. Jones 2012-12-12 10:29:35 UTC

Created attachment 662242 [details]
File: backtrace

Comment 2 Richard W.M. Jones 2012-12-12 10:29:37 UTC

Created attachment 662243 [details]
File: cgroup

Comment 3 Richard W.M. Jones 2012-12-12 10:29:39 UTC

Created attachment 662244 [details]
File: core_backtrace

Comment 4 Richard W.M. Jones 2012-12-12 10:29:41 UTC

Created attachment 662245 [details]
File: dso_list

Comment 5 Richard W.M. Jones 2012-12-12 10:29:43 UTC

Created attachment 662246 [details]
File: environ

Comment 6 Richard W.M. Jones 2012-12-12 10:29:45 UTC

Created attachment 662247 [details]
File: limits

Comment 7 Richard W.M. Jones 2012-12-12 10:30:01 UTC

Created attachment 662248 [details]
File: maps

Comment 8 Richard W.M. Jones 2012-12-12 10:30:03 UTC

Created attachment 662249 [details]
File: open_fds

Comment 9 Richard W.M. Jones 2012-12-12 10:30:05 UTC

Created attachment 662251 [details]
File: proc_pid_status

Comment 10 Richard W.M. Jones 2012-12-12 10:30:08 UTC

Created attachment 662252 [details]
File: var_log_messages

Comment 11 Daniel Berrangé 2012-12-12 10:34:26 UTC

Superficially the stack trace points towards libnl as being at fault. Can you tell me what 'libnl3' and 'netcf' library versions are installed

Comment 12 Daniel Berrangé 2012-12-13 11:30:33 UTC

The libnl3 code that's causing the crash is this

        if (obj->ce_refcnt < 0)
                BUG();

so libnl's ref counting seems to have a bug somewhere :-(

Comment 13 Daniel Berrangé 2012-12-13 12:03:14 UTC

It appears that netcf_init/netcf_close are not thread-safe :-(


$ cat nc.c
#include <netcf.h>
#include <pthread.h>
#include <stdlib.h>

void *worker(void *data)
{
  for (;;) {
    struct netcf *netcf;

    if (ncf_init(&netcf, NULL) != 0)
      abort();

    ncf_close(netcf);
  }
}

int main (int argc, char **argv)
{
  int nthreads = 20;
  pthread_t threads[nthreads];
  size_t i;

  for (i = 0  ; i < nthreads ; i++) {
    pthread_create(&threads[i], NULL,
		   worker, NULL);
  }

  for (i = 0  ; i < nthreads ; i++) {
    pthread_join(threads[i], NULL);
  }
  
  return 0;
}

$ ./nc 
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Segmentation fault
[berrange@mustard ~]$ ./nc 
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted
[berrange@mustard ~]$ ./nc 

BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted
[berrange@mustard ~]$ 
[berrange@mustard ~]$ ^C
[berrange@mustard ~]$ ./nc 
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted


There appear to be two problems here. One appears to be libxml related - the RNG schema warnings. The second is libnl3 related. The second problem can be isolated with the following test

#include <netlink/netlink.h>
#include <pthread.h>
#include <stdlib.h>
#include <netlink/route/addr.h>
#include <netlink/route/link.h>

void *worker(void *data)
{
  for (;;) {
    struct nl_sock     *nl_sock;
    struct nl_cache   *link_cache;
    struct nl_cache   *addr_cache;

    if (!(nl_sock = nl_socket_alloc())) {
      perror("nl_sock_alloc");
      abort();
    }

    if (nl_connect(nl_sock, NETLINK_ROUTE) < 0) {
      perror("nl_connect");
      abort();
    }
    
    if (rtnl_link_alloc_cache(nl_sock, AF_UNSPEC, &link_cache) < 0) {
      perror("nl_link_alloc_cache");
      abort();
    }
    nl_cache_mngt_provide(link_cache);

    if (rtnl_addr_alloc_cache(nl_sock, &addr_cache) < 0) {
      perror("nl_addr_alloc_cache");
      abort();
    }
    nl_cache_mngt_provide(addr_cache);


    nl_cache_free(addr_cache);
    nl_cache_free(link_cache);
    nl_close(nl_sock);
    nl_socket_free(nl_sock);
  }
}

int main (int argc, char **argv)
{
  int nthreads = 20;
  pthread_t threads[nthreads];
  size_t i;

  for (i = 0  ; i < nthreads ; i++) {
    pthread_create(&threads[i], NULL,
		   worker, NULL);
  }

  for (i = 0  ; i < nthreads ; i++) {
    pthread_join(threads[i], NULL);
  }
  
  return 0;
}


This test program will crash. If you comment out the two nl_cache_mngt_provide calls then the crashes go away.

Looking at the libnl3 code this is not surprising

/**
 * Provide a cache for global use
 * @arg cache           cache to provide
 *
 * Offers the specified cache to be used by other modules.
 * Only one cache per type may be shared at a time,
 * a previsouly provided caches will be overwritten.
 */
void nl_cache_mngt_provide(struct nl_cache *cache)
{
        struct nl_cache_ops *ops;

        ops = cache_ops_lookup_for_obj(cache->c_ops->co_obj_ops);
        if (!ops)
                BUG();
        else
                ops->co_major_cache = cache;
}


Note the comment that only a single cache can be used at a time - this is a process wide global cache, held in a static global variable

  static struct nl_cache_ops *cache_ops;

This is really awful design from libnl3. The caches really need to be scoped to the nl_sock.

It is not sufficient for netcf to simply do a one-time init of the caches itself, because other parts of libvirt also use libnl, so netcf can't assume it is the only owner of the caches.

AFAICT, the only viable option is to *not* register the caches at all. I'm not sure what that will do to performance though

Comment 14 Daniel Berrangé 2012-12-13 13:56:18 UTC

*** Bug 875741 has been marked as a duplicate of this bug. ***

Comment 15 Richard W.M. Jones 2012-12-13 14:23:50 UTC

How about putting a mutex around the guts of
nl_cache_mngt_provide?  Gnulib's glthread module would
be ideal for this.

Comment 16 Daniel Berrangé 2012-12-13 14:28:49 UTC

That alone would not be sufficient - you'd need to mutex protect every other method that uses the global cache. Also I think you'd need to actually protect the users of the cache directly, since they can do  "cache lookup....some work...cache insert" and you'd not want the cache instance changing during this time.

Comment 17 Thomas Graf 2012-12-13 17:53:36 UTC

Making accesses to a single cache safe from multiple threads will become incredible complex and is likely to not perform at all.

What seems to make the most sense is to not expose caches registered from one cache to others caches. It shouldn't be hard hard to convert it to a per thread list of caches.

Comment 18 Daniel Berrangé 2012-12-13 18:00:42 UTC

I'm not sure that per-thread caches would match the way libvirt uses netcf, and thus libnl. Each nl_sock instance netcf creates is associated with a single virConnectPtr instance. Libvirt serializes access to each virConnectPtr instance, but they can be used by multiple threads. So we really need the caches associated with the nl_sock instances, not threads.

Comment 19 Thomas Graf 2012-12-13 18:09:14 UTC

I see, the only way to achieve that seems to be by adding a new function that changes the behaviour of libnl to create per socket namespaces for registration. It wouldn't be libnl's default behaviour but libvirt could enable it.

Comment 20 Laine Stump 2012-12-13 18:51:41 UTC

Is there really existing application code that depends on different nl_socks sharing the same cache? Since the nl_sock is sent to _rtnl_*_alloc_cache(), it seems perfectly suited to just give each nl_sock their own caches and be done with it.

Comment 21 Thomas Graf 2012-12-17 08:35:50 UTC

It's not that simple unfortunately. It is very common to use multiple sockets, for example when a protocol is based on generic netlink and needs to resolve interfaces indices to interface names and thus maintains a link cache that will be used from the generic netlink socket context. Or when handling multicast notifications which are received on a separate async socket.

Why are you provisioning respectively sharing the caches anyway? Where do you access them from again? Couldn't you handle the distribution of caches yourself?

Comment 22 Fedora Update System 2012-12-21 18:59:36 UTC

netcf-0.2.3-1.fc18 has been submitted as an update for Fedora 18.
https://admin.fedoraproject.org/updates/netcf-0.2.3-1.fc18

Comment 23 Fedora Update System 2012-12-21 19:05:52 UTC

netcf-0.2.3-1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/netcf-0.2.3-1.fc17

Comment 24 Fedora Update System 2012-12-22 21:13:40 UTC

Package netcf-0.2.3-1.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing netcf-0.2.3-1.fc18'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-20855/netcf-0.2.3-1.fc18
then log in and leave karma (feedback).

Comment 25 Fedora Update System 2013-01-20 03:19:53 UTC

netcf-0.2.3-1.fc18 has been pushed to the Fedora 18 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 26 Fedora Update System 2013-01-20 03:35:59 UTC

netcf-0.2.3-1.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.