Description of problem:
I was running the libguestfs test suite, and it died here:

/home/rjones/d/libguestfs/run --test ./test-max-disks.pl
libvir: XML-RPC error : Cannot recv data: Connection reset by peer
could not connect to libvirt (URI = NULL): Cannot recv data: Connection reset by peer [code=38 domain=7] at /home/rjones/d/libguestfs/tests/disks/test-max-disks.pl line 46.
max_disks is 255
/home/rjones/d/libguestfs/run: command failed with exit code 104
FAIL: test-max-disks.pl

Version-Release number of selected component:
libvirt-daemon-0.10.2.1-3.fc18

Additional info:
backtrace_rating: 4
cmdline: /usr/sbin/libvirtd --timeout=30
crash_function: nl_object_put
executable: /usr/sbin/libvirtd
kernel: 3.6.9-4.fc18.x86_64
remote_result: NOTFOUND
uid: 1000

Truncated backtrace:
Thread no. 1 (10 frames)
 #4 nl_object_put at object.c:197
 #5 nl_object_free at object.c:158
 #6 nl_cache_remove at cache.c:484
 #7 nl_cache_clear at cache.c:347
 #8 nl_cache_free at cache.c:364
 #9 netlink_close at dutil_linux.c:864
 #10 drv_close at drv_redhat.c:384
 #11 ncf_close at netcf.c:101
 #12 interfaceCloseInterface at interface/interface_backend_netcf.c:170
 #13 virConnectDispose at datatypes.c:134
Created attachment 662242 [details] File: backtrace
Created attachment 662243 [details] File: cgroup
Created attachment 662244 [details] File: core_backtrace
Created attachment 662245 [details] File: dso_list
Created attachment 662246 [details] File: environ
Created attachment 662247 [details] File: limits
Created attachment 662248 [details] File: maps
Created attachment 662249 [details] File: open_fds
Created attachment 662251 [details] File: proc_pid_status
Created attachment 662252 [details] File: var_log_messages
Superficially the stack trace points towards libnl as being at fault. Can you tell me which 'libnl3' and 'netcf' library versions are installed?
The libnl3 code that's causing the crash is this:

if (obj->ce_refcnt < 0)
    BUG();

so libnl's ref counting seems to have a bug somewhere :-(
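To make the failing check concrete, here is a minimal single-threaded model of how one unbalanced "put" drives a reference count below zero. The names model_obj and model_obj_put are made up for illustration and this is not libnl's code; it only mirrors the quoted condition. Whether the real crash comes from an extra put or from unsynchronized updates across threads is not established at this point.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reference-counted object (not a libnl type). */
struct model_obj {
    int refcnt;    /* number of outstanding references */
    int released;  /* set to 1 where a real implementation would free */
};

/*
 * Drop one reference.
 * Returns 0 on a balanced put, -1 when the count goes negative,
 * which is the condition the quoted check reports via BUG().
 */
int model_obj_put(struct model_obj *obj)
{
    assert(obj != NULL);

    obj->refcnt--;
    if (obj->refcnt < 0)
        return -1;          /* one put too many */
    if (obj->refcnt == 0)
        obj->released = 1;  /* last reference gone */
    return 0;
}
```

With an object that starts at refcnt 1, the first put releases it and returns 0, and a second put returns -1.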
It appears that ncf_init/ncf_close are not thread-safe :-(

$ cat nc.c
#include <netcf.h>
#include <pthread.h>
#include <stdlib.h>

void *worker(void *data)
{
    for (;;) {
        struct netcf *netcf;
        if (ncf_init(&netcf, NULL) != 0)
            abort();
        ncf_close(netcf);
    }
}

int main (int argc, char **argv)
{
    int nthreads = 20;
    pthread_t threads[nthreads];
    size_t i;

    for (i = 0 ; i < nthreads ; i++) {
        pthread_create(&threads[i], NULL, worker, NULL);
    }
    for (i = 0 ; i < nthreads ; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}

$ ./nc
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Segmentation fault
[berrange@mustard ~]$ ./nc
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted
[berrange@mustard ~]$ ./nc
BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted
[berrange@mustard ~]$
[berrange@mustard ~]$ ^C
[berrange@mustard ~]$ ./nc
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library failed to register 'http://www.w3.org/2001/XMLSchema-datatypes'
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://www.w3.org/2001/XMLSchema-datatypes' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
Relax-NG types library 'http://relaxng.org/ns/structure/1.0' already registered
BUG: object.c:197
nc: object.c:197: nl_object_put: Assertion `0' failed.
Aborted

There appear to be two problems here. One appears to be libxml related - the RNG schema warnings. The second is libnl3 related. The second problem can be isolated with the following test:

#include <netlink/netlink.h>
#include <pthread.h>
#include <stdio.h>      /* for perror() */
#include <stdlib.h>
#include <netlink/route/addr.h>
#include <netlink/route/link.h>

void *worker(void *data)
{
    for (;;) {
        struct nl_sock *nl_sock;
        struct nl_cache *link_cache;
        struct nl_cache *addr_cache;

        if (!(nl_sock = nl_socket_alloc())) {
            perror("nl_sock_alloc");
            abort();
        }
        if (nl_connect(nl_sock, NETLINK_ROUTE) < 0) {
            perror("nl_connect");
            abort();
        }

        if (rtnl_link_alloc_cache(nl_sock, AF_UNSPEC, &link_cache) < 0) {
            perror("nl_link_alloc_cache");
            abort();
        }
        /* Publishes link_cache in the process-wide registry */
        nl_cache_mngt_provide(link_cache);

        if (rtnl_addr_alloc_cache(nl_sock, &addr_cache) < 0) {
            perror("nl_addr_alloc_cache");
            abort();
        }
        /* Publishes addr_cache in the process-wide registry */
        nl_cache_mngt_provide(addr_cache);

        nl_cache_free(addr_cache);
        nl_cache_free(link_cache);
        nl_close(nl_sock);
        nl_socket_free(nl_sock);
    }
}

int main (int argc, char **argv)
{
    int nthreads = 20;
    pthread_t threads[nthreads];
    size_t i;

    for (i = 0 ; i < nthreads ; i++) {
        pthread_create(&threads[i], NULL, worker, NULL);
    }
    for (i = 0 ; i < nthreads ; i++) {
        pthread_join(threads[i], NULL);
    }
    return 0;
}

This test program will crash. If you comment out the two nl_cache_mngt_provide calls then the crashes go away. Looking at the libnl3 code this is not surprising:

/**
 * Provide a cache for global use
 * @arg cache cache to provide
 *
 * Offers the specified cache to be used by other modules.
 * Only one cache per type may be shared at a time,
 * a previsouly provided caches will be overwritten.
 */
void nl_cache_mngt_provide(struct nl_cache *cache)
{
    struct nl_cache_ops *ops;

    ops = cache_ops_lookup_for_obj(cache->c_ops->co_obj_ops);
    if (!ops)
        BUG();
    else
        ops->co_major_cache = cache;
}

Note the comment that only a single cache can be used at a time - this is a process wide global cache, held in a static global variable:

static struct nl_cache_ops *cache_ops;

This is really awful design from libnl3. The caches really need to be scoped to the nl_sock. It is not sufficient for netcf to simply do a one-time init of the caches itself, because other parts of libvirt also use libnl, so netcf can't assume it is the only owner of the caches.

AFAICT, the only viable option is to *not* register the caches at all. I'm not sure what that will do to performance though.
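For anyone following along, the overwrite behaviour can be modelled without libnl at all. The sketch below is an illustration only: model_cache, model_provide and model_require are hypothetical names, and the single global slot stands in for the co_major_cache field seen in the quoted function. It shows that once a second owner provides its cache, the first owner's lookup hands back the second owner's cache.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a cache; not libnl's struct nl_cache. */
struct model_cache {
    int owner_id;  /* which connection/thread allocated this cache */
};

/* One process-wide slot per cache type, modelling co_major_cache. */
static struct model_cache *major_cache;

/* Allocate a cache tagged with its owner; aborts on allocation failure. */
struct model_cache *model_cache_alloc(int owner_id)
{
    struct model_cache *c = malloc(sizeof(*c));

    if (c == NULL)
        abort();
    c->owner_id = owner_id;
    return c;
}

/* Same shape as the quoted function: silently replaces the previous entry. */
void model_provide(struct model_cache *c)
{
    assert(c != NULL);
    major_cache = c;
}

/* What any caller gets when it asks for "the" cache of this type. */
struct model_cache *model_require(void)
{
    return major_cache;
}
```

If owner 1 provides its cache and owner 2 then provides its own, model_require() returns owner 2's cache to everybody, including owner 1. In the threaded test program every worker does this in a loop, so workers can end up touching each other's caches.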
*** Bug 875741 has been marked as a duplicate of this bug. ***
How about putting a mutex around the guts of nl_cache_mngt_provide? Gnulib's glthread module would be ideal for this.
That alone would not be sufficient - you'd need to mutex protect every other method that uses the global cache. Also I think you'd need to actually protect the users of the cache directly, since they can do "cache lookup....some work...cache insert" and you'd not want the cache instance changing during this time.
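As a rough illustration of the "lookup, work, insert" point, here is a small pthread sketch in which the lock is held across the whole sequence rather than around each step. Everything in it (model_table, model_get_or_insert, model_run) is invented for the example and has nothing to do with libnl's actual data structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define MODEL_KEYS 8
#define MODEL_MAX_THREADS 64

/* Hypothetical shared table standing in for a shared cache. */
struct model_table {
    pthread_mutex_t lock;
    int present[MODEL_KEYS];  /* 1 once a key has been inserted */
    int inserts[MODEL_KEYS];  /* how many times each key was inserted */
};

void model_table_init(struct model_table *t)
{
    size_t i;

    pthread_mutex_init(&t->lock, NULL);
    for (i = 0; i < MODEL_KEYS; i++) {
        t->present[i] = 0;
        t->inserts[i] = 0;
    }
}

/* Lookup, work and insert form ONE critical section. */
void model_get_or_insert(struct model_table *t, size_t key)
{
    assert(key < MODEL_KEYS);

    pthread_mutex_lock(&t->lock);
    if (!t->present[key]) {
        /* "some work" would happen here, still under the lock */
        t->present[key] = 1;
        t->inserts[key]++;
    }
    pthread_mutex_unlock(&t->lock);
}

static void *model_worker(void *data)
{
    struct model_table *t = data;
    size_t key;

    for (key = 0; key < MODEL_KEYS; key++)
        model_get_or_insert(t, key);
    return NULL;
}

/* Runs nthreads workers; returns the largest per-key insert count, or -1 on error. */
int model_run(int nthreads)
{
    struct model_table t;
    pthread_t threads[MODEL_MAX_THREADS];
    int started = 0, max = 0, i;

    if (nthreads < 1 || nthreads > MODEL_MAX_THREADS)
        return -1;
    model_table_init(&t);
    for (i = 0; i < nthreads; i++) {
        if (pthread_create(&threads[i], NULL, model_worker, &t) != 0)
            break;
        started++;
    }
    for (i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    for (i = 0; i < MODEL_KEYS; i++)
        if (t.inserts[i] > max)
            max = t.inserts[i];
    pthread_mutex_destroy(&t.lock);
    return started == nthreads ? max : -1;
}
```

Because the check and the insert sit under the same lock, each key is inserted exactly once however many threads race. If the lock were dropped between the lookup and the insert, two threads could both see the key as missing and both insert it.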
Making accesses to a single cache safe from multiple threads will become incredibly complex and is unlikely to perform well. What seems to make the most sense is to not expose caches registered from one thread to other threads. It shouldn't be hard to convert it to a per-thread list of caches.
I'm not sure that per-thread caches would match the way libvirt uses netcf, and thus libnl. Each nl_sock instance netcf creates is associated with a single virConnectPtr instance. Libvirt serializes access to each virConnectPtr instance, but they can be used by multiple threads. So we really need the caches associated with the nl_sock instances, not threads.
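A sketch of what "caches associated with the connection, not the thread" could look like, purely as a model: the cache lives inside a per-connection struct together with the lock that serializes access, so any thread may use a connection but no state is shared between connections. model_conn and its helpers are hypothetical names, and the array of link names is just a placeholder for a real cache.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_LINKS 4
#define MODEL_NAME_LEN 16

/* Hypothetical per-connection state; one of these per connection handle. */
struct model_conn {
    pthread_mutex_t lock;  /* serializes all access to this connection */
    char link_names[MODEL_MAX_LINKS][MODEL_NAME_LEN];  /* private "cache" */
    size_t nlinks;
};

/* Returns 0 on success, like pthread_mutex_init(). */
int model_conn_init(struct model_conn *conn)
{
    assert(conn != NULL);
    conn->nlinks = 0;
    return pthread_mutex_init(&conn->lock, NULL);
}

/* Adds a link name to this connection's cache; returns its index or -1. */
int model_conn_add_link(struct model_conn *conn, const char *name)
{
    int idx = -1;

    pthread_mutex_lock(&conn->lock);
    if (conn->nlinks < MODEL_MAX_LINKS && strlen(name) < MODEL_NAME_LEN) {
        strcpy(conn->link_names[conn->nlinks], name);
        idx = (int)conn->nlinks++;
    }
    pthread_mutex_unlock(&conn->lock);
    return idx;
}

/* Copies the name at idx into out (MODEL_NAME_LEN bytes); returns 0 or -1. */
int model_conn_link_name(struct model_conn *conn, size_t idx, char *out)
{
    int rc = -1;

    pthread_mutex_lock(&conn->lock);
    if (idx < conn->nlinks) {
        strcpy(out, conn->link_names[idx]);
        rc = 0;
    }
    pthread_mutex_unlock(&conn->lock);
    return rc;
}

void model_conn_close(struct model_conn *conn)
{
    pthread_mutex_destroy(&conn->lock);
}
```

Two connections initialised this way never see each other's entries, and closing one cannot invalidate anything the other holds, whichever threads happen to be calling in.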
I see, the only way to achieve that seems to be by adding a new function that changes the behaviour of libnl to create per socket namespaces for registration. It wouldn't be libnl's default behaviour but libvirt could enable it.
Is there really existing application code that depends on different nl_socks sharing the same cache? Since the nl_sock is passed to rtnl_*_alloc_cache(), it seems perfectly suited to just give each nl_sock its own caches and be done with it.
It's not that simple unfortunately. It is very common to use multiple sockets, for example when a protocol is based on generic netlink and needs to resolve interface indices to interface names and thus maintains a link cache that will be used from the generic netlink socket context. Or when handling multicast notifications which are received on a separate async socket. Why are you providing, or rather sharing, the caches anyway? Where do you access them from again? Couldn't you handle the distribution of caches yourself?
netcf-0.2.3-1.fc18 has been submitted as an update for Fedora 18. https://admin.fedoraproject.org/updates/netcf-0.2.3-1.fc18
netcf-0.2.3-1.fc17 has been submitted as an update for Fedora 17. https://admin.fedoraproject.org/updates/netcf-0.2.3-1.fc17
Package netcf-0.2.3-1.fc18: * should fix your issue, * was pushed to the Fedora 18 testing repository, * should be available at your local mirror within two days. Update it with: # su -c 'yum update --enablerepo=updates-testing netcf-0.2.3-1.fc18' as soon as you are able to. Please go to the following url: https://admin.fedoraproject.org/updates/FEDORA-2012-20855/netcf-0.2.3-1.fc18 then log in and leave karma (feedback).
netcf-0.2.3-1.fc18 has been pushed to the Fedora 18 stable repository. If problems still persist, please make note of it in this bug report.
netcf-0.2.3-1.fc17 has been pushed to the Fedora 17 stable repository. If problems still persist, please make note of it in this bug report.