Bug 1794691 - Duplicate key error reported in 'virQEMUDriverGetDomainCapabilities'
Summary: Duplicate key error reported in 'virQEMUDriverGetDomainCapabilities'
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Advanced Virtualization
Classification: Red Hat
Component: libvirt
Version: 8.1
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: rc
: 8.0
Assignee: Michal Privoznik
QA Contact: Lili Zhu
URL:
Whiteboard:
Depends On: 1791790
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-01-24 11:00 UTC by Michal Privoznik
Modified: 2020-05-12 10:25 UTC (History)
10 users (show)

Fixed In Version: libvirt-6.0.0-3.el8
Doc Type: Bug Fix
Doc Text:
Cause: When generating domain capabilities XML (e.g. virsh domcapabilities), libvirt saves generated caps for possible future use (caching). This saves couple of CPU cycles, because the capabilities need to be constructed exactly once. However, due to a missing mutual exclusion in the code the typical TOCTOU problem might have happened. If two threads were asked to return domain capabilities XML, both correctly identified that no cached data is available. So both proceeded to generate it and tried to add it to the cache. The first one succeeded, while the other failed with a cryptic error message. Consequence: Generating domain capabilities XML might have failed with a cryptic error message. Fix: Add mutexes around cache, so only one thread can work with it at once. Result: Caching works as designed and generating domain capabilities succeeds.
Clone Of: 1791790
Environment:
Last Closed: 2020-05-05 09:55:54 UTC
Type: Bug
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
test.c (1.05 KB, text/plain)
2020-04-15 04:38 UTC, Lili Zhu
no flags Details
repro.c (1.24 KB, text/plain)
2020-04-15 11:58 UTC, Michal Privoznik
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:2017 0 None None None 2020-05-05 09:56:49 UTC

Description Michal Privoznik 2020-01-24 11:00:51 UTC
+++ This bug was initially created as a clone of Bug #1791790 +++

Description of problem:

Several virt-v2v tests fail with:

virt-v2v: error: libguestfs error: could not get libvirt domain 
capabilities: internal error: Duplicate key [code=1 int1=-1]

Version-Release number of selected component (if applicable):

qemu-4.2.0-2.fc32.x86_64
libvirt-daemon-6.0.0-1.fc32.x86_64

How reproducible:

100% with libvirt from Fedora
Does NOT appear to happen with libvirt from git, so this
one could be fixed.

Steps to Reproduce:
1. In virt-v2v source, make && make check

--- Additional comment from Richard W.M. Jones on 2020-01-16 15:30:20 CET ---

I'm pretty convinced this bug is already fixed upstream although
I did not yet identify which exact commit may have fixed it, but
feel free to close it if you want.

--- Additional comment from Richard W.M. Jones on 2020-01-16 18:17:11 CET ---

I just had the error happen with upstream libvirt from git, so
in fact I do NOT think it is fixed.  However it is very intermittent.
I am currently trying to provoke the error while at the same time
using Peter Krempa's enhanced debugging patch.

--- Additional comment from Richard W.M. Jones on 2020-01-16 18:34:18 CET ---

With the enhanced debugging patch:

virt-v2v: error: libguestfs error: could not get libvirt domain 
capabilities: internal error: Duplicate hash table key 
'34:3:pc-i440fx-4.2:/usr/bin/qemu-system-x86_64' [code=1 int1=-1]

This error is definitely *very* intermittent.  It seems to happen
one time in 20 or less.

--- Additional comment from Peter Krempa on 2020-01-17 12:02:54 CET ---

According to the hash key the error comes from:

virQEMUDriverGetDomainCapabilities in src/qemu/qemu_conf.c

but I have no idea why that is happening as I didn't deal with the qemu domain caps cache much.

--- Additional comment from Richard W.M. Jones on 2020-01-17 14:46:31 CET ---

In theory this should work to reproduce the bug, but I've run this for
thousands of iterations and it didn't reproduce it for me.

while killall lt-libvirtd libvirtd >& /dev/null; tools/virsh domcapabilities >& /tmp/log; do echo -n .; done

--- Additional comment from Richard W.M. Jones on 2020-01-17 16:11:04 CET ---

I have a reproducer, although the bug still happens very infrequently.

*Note* that this reproducer will start 256 processes on your system
using a fork bomb approach, so you may want to adjust down the
"NR_JOBS" setting at the top before running, although it may take
even longer to reproduce the bug if you do that.

$ gcc -O2 -g -Wall `pkgconf libvirt --cflags --libs` test.c -o test
$ ./test
$ ./test
$ ./test
$ ./test
$ ./test
[repeat many times until ...]
$ ./test
libvirt:  error : internal error: Duplicate hash table key '34:3:pc-i440fx-4.2:/usr/bin/qemu-system-x86_64'

--- Additional comment from Michal Privoznik on 2020-01-23 16:58:47 CET ---

The problem is that we try to save couple of CPU cycles and cache domain capabilities. However, we don't lock the hash table that holds the cached values. So if you have two threads fighting, they will both try to add a value under the same key. This is the problematic function: https://libvirt.org/git/?p=libvirt.git;a=blob;f=src/qemu/qemu_conf.c;h=b62dd1df52d54ca9f8386beb8b786a9f39dd4854;hb=HEAD#l1368

--- Additional comment from Michal Privoznik on 2020-01-24 11:28:40 CET ---

Patches posted upstream:

https://www.redhat.com/archives/libvir-list/2020-January/msg01004.html

Comment 3 Lili Zhu 2020-04-14 10:01:24 UTC
Tested with:
libvirt-daemon-6.0.0-1.module+el8.2.0+5453+31b2b136.x86_64
qemu-kvm-4.2.0-2.module+el8.2.0+5135+ed3b2489.x86_64

1. Download the attached bash script and compile it
$ gcc -O2 -g -Wall `pkgconf libvirt --cflags --libs` test.c -o test

2. execute the script
$ i=0
$  while true; do i=$((i+1)); echo $i; ./test; done
1
2
3
....
10212
10213
10214
^C

repeat for 10000+ times, could not reproduce

Comment 4 Lili Zhu 2020-04-15 04:28:26 UTC
Retested with:
libvirt-daemon-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64
qemu-kvm-4.2.0-17.module+el8.2.0+6141+0f540f16.x86_64

1. Download the attached bash script and compile it
$ gcc -O2 -g -Wall `pkgconf libvirt --cflags --libs` test.c -o test

2. execute the script
$ i=0
$  while true; do i=$((i+1)); echo $i; ./test; done
1
2
3
....
3025
3026
3027
^C

repeated for 3000+ times, did not hit the error

Comment 5 Lili Zhu 2020-04-15 04:37:03 UTC
Hi, Michal

The attached file is test.c
I tried to reproduce this bug with libvirt-daemon-6.0.0-1, and also with libvirt-daemon-5.6.0-10
can not reproduce.
Please help to check whether there is something wrong with the testing steps. Thanks very much.

Comment 6 Lili Zhu 2020-04-15 04:38:43 UTC
Created attachment 1678916 [details]
test.c

Comment 7 Richard W.M. Jones 2020-04-15 10:02:28 UTC
I think that test is the one which I wrote ...?  I think Peter
had a much better program which used threads to reproduce the bug more
reliably.  Unfortunately I cannot find it in the mailing list.
(https://www.redhat.com/archives/libvir-list/2020-January/msg01006.html)

Comment 8 Michal Privoznik 2020-04-15 11:58:15 UTC
Created attachment 1679013 [details]
repro.c

Yeah,  this bug is not easy to reproduce because it is a race condition. What I did when debugging this was to put sleep(5) into a specific place and then get domcaps from two terminals at once. More details here:

https://www.redhat.com/archives/libvir-list/2020-January/msg01004.html

However, with the attached file and tuning worker pool (min_workers = 2 max_workers = 2 in libvirtd.conf) I am able to reproduce occasionally, I mean, if I revert the fix and don't put the sleep():

https://www.redhat.com/archives/libvir-list/2020-January/msg01007.html

Comment 9 Lili Zhu 2020-04-19 13:09:21 UTC
Thanks very much to Michal and Richard, can produce this bug now

1. put a sleep in the following the lines, then build from source
# git diff 
diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c
index 30637b21ac..f76bbe4b39 100644
--- a/src/qemu/qemu_conf.c
+++ b/src/qemu/qemu_conf.c
@@ -1372,6 +1372,7 @@ virQEMUDriverGetDomainCapabilities(virQEMUDriverPtr driver,
         key = g_strdup_printf("%d:%d:%s:%s", data.arch, data.virttype,
                               NULLSTR(data.machine), NULLSTR(data.path));
 
+       sleep(5);
         if (virHashAddEntry(domCapsCache, key, domCaps) < 0)
             return NULL;
     }

2. And then try to get domcaps from two consoles at once
Terminal 1:
# virsh domcapabilities 
<domainCapabilities>
  <path>/usr/libexec/qemu-kvm</path>
  <domain>kvm</domain>
  <machine>pc-i440fx-rhel7.6.0</machine>
  <arch>x86_64</arch>
  <vcpu max='240'/>
  <iothreads supported='yes'/>
....

Terminal 2:
# virsh domcapabilities 
error: failed to get emulator capabilities
error: internal error: Duplicate key

Comment 10 Lili Zhu 2020-04-19 18:22:21 UTC
Verify this bug with:
libvirt-daemon-6.0.0-17.module+el8.2.0+6257+0d066c28.x86_64.rpm

1. Apply the following patch to rebuild libvirt
From 75610bfea795c4988cccb84491d718492e67474f Mon Sep 17 00:00:00 2001
From: rpm-build <rpm-build>
Date: Sun, 19 Apr 2020 10:06:33 -0400
Subject: [PATCH] Verify bug 1794691

Signed-off-by: rpm-build <rpm-build>
---
 src/qemu/qemu_capabilities.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c
index 0b4ed42..1e22c10 100644
--- a/src/qemu/qemu_capabilities.c
+++ b/src/qemu/qemu_capabilities.c
@@ -2108,6 +2108,7 @@ virQEMUCapsGetDomainCapsCache(virQEMUCapsPtr qemuCaps,
         key = g_strdup_printf("%d:%d:%s:%s", arch, virttype,
                               NULLSTR(machine), path);

+        sleep(5);
         if (virHashAddEntry(cache->cache, key, tempDomCaps) < 0)
             goto cleanup;

--
2.18.2

2. start libvirtd

3. try to get domcapabilities from two consoles at once
Terminal 1:
# virsh domcapabilities 
<domainCapabilities>
  <path>/usr/libexec/qemu-kvm</path>
  <domain>kvm</domain>
  <machine>pc-i440fx-rhel7.6.0</machine>
  <arch>x86_64</arch>
  <vcpu max='240'/>
  <iothreads supported='yes'/>
.....

Terminal 2:
# virsh domcapabilities 
<domainCapabilities>
  <path>/usr/libexec/qemu-kvm</path>
  <domain>kvm</domain>
  <machine>pc-i440fx-rhel7.6.0</machine>
  <arch>x86_64</arch>
  <vcpu max='240'/>
  <iothreads supported='yes'/>
  <os supported='yes'>
....

Comment 11 Lili Zhu 2020-04-19 18:25:24 UTC
As the testing result matches with the expected result, mark the bug as verified

Comment 13 errata-xmlrpc 2020-05-05 09:55:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2017


Note You need to log in before you can comment on or make changes to this bug.