Bug 1894359

Summary: virt-install race: Storage pool not found: no storage pool with matching name <unrelated pool>
Product: [Fedora] Fedora Reporter: Martin Pitt <mpitt>
Component: virt-managerAssignee: Cole Robinson <crobinso>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 32CC: berrange, crobinso, jkonecny, mkolman, philip.wyett, rvykydal
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-11-24 12:51:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
initial reproducer
none
better reproducer none

Description Martin Pitt 2020-11-04 06:22:02 UTC
Description of problem: In the anaconda installer's kickstart tests [1] we run a lot of virt-install commands in parallel to test various scenarios. I recently tried them on a beefy machine with 32 CPUs, and thus 32 parallel test jobs, for ~ 200 tests in parallel.

Of these, two tests failed like this:

VirtualInstall failed: Problem starting virtual install: Command '['virt-install', '-n', 'kstest-onboot-activate-httpks_(47db99ac-f17d-4336-a302-fecc32fd51b3)', '-r', '2048', '--noautoconsole', '--vcpus', '1', '--rng', '/dev/random', '--noreboot', '--graphics', 'vnc', '--disk', 'path=/var/tmp/kstest-onboot-activate-httpks.0o1sra44/disk-a.img,cache=unsafe,bus=sata', '--network', 'user', '--network', 'user', '--network', 'user', '--disk', 'path=/var/tmp/kstest-onboot-activate-httpks.0o1sra44/boot.iso,device=cdrom,readonly=on,shareable=on', '--extra-args', ' debug=1 inst.debug rd.shell=0 rd.emergency=poweroff ip=enp1s0:dhcp inst.ks=http://10.0.2.2:40000/ks.cfg inst.kernel.hung_task_timeout_secs=1200 inst.stage2=hd:CDLABEL=Fedora-S-dvd-x86_64-rawh', '--location', '/var/tmp/kstest-onboot-activate-httpks.0o1sra44/boot.iso,kernel=isolinux/vmlinuz,initrd=isolinux/initrd.img', '--channel', 'tcp,host=127.0.0.1:54347,mode=connect,target_type=virtio,name=org.fedoraproject.anaconda.log.0']' returned non-zero exit status 1.

ERROR    Error: --disk path=/var/tmp/kstest-onboot-activate-httpks.0o1sra44/disk-a.img,cache=unsafe,bus=sata: Storage pool not found: no storage pool with matching name 'kstest-initial-setup-gui.cb45niwk'

Apparently, virt-install creates a new storage pool for each test with a name that matches the domain name:

$ virsh pool-list
 Name                                     State    Autostart
--------------------------------------------------------------
 kstest-onboot-activate-httpks.0o1sra44   active   yes


So you see that virt-install tried to do something about *other* storage pools, in that case 'kstest-initial-setup-gui.cb45niwk'. This is from a parallel test which may clean up its own storage pool at any time. I suppose it doesn't actually try to modify/configure that, but this smells like a race condition when enumerating all storage pools and doing some query?


[1] https://github.com/rhinstaller/kickstart-tests


Version-Release number of selected component (if applicable):

virt-install-3.1.0-1.fc33.noarch

How reproducible: Sometimes


Steps to Reproduce:

For now, the only known way is to run kickstart-tests with massive parallelism on a powerful machine:

  git clone https://github.com/rhinstaller/kickstart-tests
  cd kickstart-tests
  containers/runner/launch all

This downloads a kstest-runner container and runs the test with *user* libvirt inside. (Not sure if user vs. system makes any difference -- the container runs unprivileged and inside the test runs as user, not root).

But I realize that this is a bit unwieldy -- I'm happy to run the reproducer myself with proposed changes, or come up with a better one.

Some questions to you:

 - Do you have some initial hints in which part of the code to look? I'm happy to give that a shot.
 - Do you know how to have a really cheap way to invoke virt-install, to basically just do a no-op? So that we can test the parallelism without actually having to do a Fedora install in each case?

Thanks!

Comment 1 Martin Pitt 2020-11-04 06:24:09 UTC
> virt-install-3.1.0-1.fc33.noarch

Sorry, that was a lie, I looked in the wrong place. This happened with

  virt-install-2.2.1-3.fc32.noarch

The fc33 version is what I have locally, but unfortunately Fedora 33's docker images don't work any more on a RHEL 7 host, so I have to use a F32 image to run the massively parallel test.

Comment 2 Martin Pitt 2020-11-04 15:49:44 UTC
I am trying to get a very small virt-install invocation that causes this effect, without having to endure an actual long installation.

My first naïve attempt was

  qemu-img create -f qcow2 test1.img 50M
  virt-install --network user --memory 128 --pxe --network user --os-variant alpinelinux3.8 --disk path=test1.img,cache=unsafe --name test1

which essentially just goes into a PXE wait loop -- but that's fine, it defines the domain. But it defines just a single storage pool that's named after the host name ("kstest"):

$ virsh pool-list --all
 Name     State    Autostart
------------------------------
 kstest   active   yes


and running the same thing with s/test1/test2/ also does not create a storage pool. Then I thought maybe it's because the real invocation has two disks, the CD-ROM. But

  touch boot.iso
  virt-install --network user --memory 128 --pxe --network user --os-variant alpinelinux3.8 --disk path=test1.img,cache=unsafe --disk path=boot.iso,device=cdrom,readonly=on,shareable=on --name test1

but that doesn't create a new pool either. Neither does  appending ",bus=sata".

The only reference to pools that I find in https://github.com/rhinstaller/kickstart-tests is scripts/launcher/lib/virtual_controller.py `def destroy()`, which removes a test's own storage pool. But that's not what this bug is about.

So I'm missing something. How does virt-install go into that "create separate storage pool" mode?

Comment 3 Martin Pitt 2020-11-04 16:16:26 UTC
Oh, got it -- it's --location! And that requires an actual boot.iso, so:

 
  wget https://download.fedoraproject.org/pub/fedora/linux/development/rawhide/Server/x86_64/os/images/boot.iso
  mkdir test1
  ln ./boot.iso test1/
  qemu-img create -f qcow2 test1.img 50M
  virt-install --network user --memory 128 --network user --disk path=test1.img,cache=unsafe --name test1 --location test1/boot.iso,kernel=isolinux/vmlinuz,initrd=isolinux/initrd.img

This fails, but it's enough to create a "test1" pool. I'll see to building a reproducer out of that, but that's something for tomorrow.

Comment 4 Martin Pitt 2020-11-05 09:54:21 UTC
Created attachment 1726822 [details]
initial reproducer

I have a first working reproducer. It creates four virt-installs in a loop (test{1..4}), which closely resembles what happens in our tests. After a few minutes, one of them fails like this:

$ ./reproducer.sh
ERROR    Error: --disk path=test1/disk.img,cache=unsafe: Storage pool not found: no storage pool with matching uuid '06cae576-f1eb-44b9-a4b3-6fae9302202c' (test3)

This is exactly this bug (test1 erroring on storage pool test3 getting removed underneath).

You can watch the creation/destruction while the script runs with

    watch -n0.2 'virsh pool-list --all; virsh list --all'

Due to its shell/sloppy nature, one failing subprocess won't stop the entire reproducer. Stop the test with `pkill -e reproducer`.

I know this isn't ideal yet, I'll still look for a better one. But at least this confirms that it's not a bug in our kickstart-tests, is _some_ standalone reproducer, and it also happens with virt-install-3.1.0-1.fc33.noarch.

Comment 5 Martin Pitt 2020-11-05 11:00:01 UTC
Created attachment 1726828 [details]
better reproducer

I have a better reproducer now, which greatly improves the chance of hitting the race. It creates 100 storage pools, and removes them in the background while trying to do a *single* virt-install run.

I ran it a few times, and get a 100% "success" (i.e. hit the race) rate:

$ ./reproducer2.sh 
ERROR    Error: --disk path=test1/disk.img,cache=unsafe: Storage pool not found: no storage pool with matching uuid '03957f51-0151-42e5-bf41-3852e27a8654' (pool16)

sometimes it's pool17 or similar, that depends on the timing. But this seems to be reliable enough.

Comment 6 Martin Pitt 2020-11-05 11:04:55 UTC
Now that virt-install doesn't run a thousand times, replacing its --quiet with --debug is insightful:

[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:204) Launched with command line: /usr/bin/virt-install --name test1 --debug --network user --memory 128 --disk path=test1/disk.img,cache=unsafe --location test1/boot.iso,kernel=isolinux/vmlinuz,initrd=isolinux/initrd.img
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (virtinstall:212) Distilled --network options: ['user']
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (virtinstall:144) Distilled --disk options: ['path=test1/disk.img,cache=unsafe']
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:216) Requesting libvirt URI default
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:219) Received libvirt URI qemu:///session
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:1514) Exception parsing inst=<DeviceDisk disk 139935836089552> optstr=path=test1/disk.img,cache=unsafe
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtinst/cli.py", line 1501, in parse
    objs = self._parse(inst is None and self.guest or inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 3308, in _parse
    super()._parse(inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 1471, in _parse
    param.parse_param(self, inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 1135, in parse_param
    xmlutil.set_prop_path(inst, self.propname, self.val)
  File "/usr/share/virt-manager/virtinst/xmlutil.py", line 66, in set_prop_path
    return setattr(parent, pieces[-1], value)
  File "/usr/share/virt-manager/virtinst/devices/disk.py", line 337, in _set_path
    (vol_object, parent_pool) = diskbackend.manage_path(self.conn, newpath)
  File "/usr/share/virt-manager/virtinst/diskbackend.py", line 137, in manage_path
    if not path_is_url(path) and not path_is_network_vol(conn, path):
  File "/usr/share/virt-manager/virtinst/diskbackend.py", line 171, in path_is_network_vol
    for volxml in conn.fetch_all_vols():
  File "/usr/share/virt-manager/virtinst/connection.py", line 273, in fetch_all_vols
    return self._fetch_helper(
  File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
    self._fetch_cache[key] = raw_cb()
  File "/usr/share/virt-manager/virtinst/connection.py", line 222, in _fetch_all_vols_raw
    for poolxmlobj in self.fetch_all_pools():
  File "/usr/share/virt-manager/virtinst/connection.py", line 264, in fetch_all_pools
    return self._fetch_helper(
  File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
    self._fetch_cache[key] = raw_cb()
  File "/usr/share/virt-manager/virtinst/connection.py", line 195, in _fetch_all_pools_raw
    return [self._build_pool_raw(poolobj) for poolobj in ret]
  File "/usr/share/virt-manager/virtinst/connection.py", line 195, in <listcomp>
    return [self._build_pool_raw(poolobj) for poolobj in ret]
  File "/usr/share/virt-manager/virtinst/connection.py", line 190, in _build_pool_raw
    parsexml=poolobj.XMLDesc(0))
  File "/usr/lib64/python3.9/site-packages/libvirt.py", line 3502, in XMLDesc
    if ret is None: raise libvirtError ('virStoragePoolGetXMLDesc() failed', pool=self)
libvirt.libvirtError: Storage pool not found: no storage pool with matching uuid '3d85ec50-f69f-45af-9aa5-dd3151773269' (pool15)
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:255)   File "/usr/bin/virt-install", line 8, in <module>
    virtinstall.runcli()
  File "/usr/share/virt-manager/virtinst/virtinstall.py", line 1158, in runcli
    sys.exit(main())
  File "/usr/share/virt-manager/virtinst/virtinstall.py", line 1145, in main
    guest, installer = build_guest_instance(conn, options)
  File "/usr/share/virt-manager/virtinst/virtinstall.py", line 580, in build_guest_instance
    guest = _build_options_guest(conn, options)
  File "/usr/share/virt-manager/virtinst/virtinstall.py", line 561, in _build_options_guest
    cli.parse_option_strings(options, guest, None)
  File "/usr/share/virt-manager/virtinst/cli.py", line 4391, in parse_option_strings
    parseret = parserobj.parse(inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 1516, in parse
    fail(_("Error: %(cli_flag_name)s %(options)s: %(err)s") %
  File "/usr/share/virt-manager/virtinst/cli.py", line 255, in fail
    log.debug("".join(traceback.format_stack()))

[Thu, 05 Nov 2020 11:02:50 virt-install 41734] ERROR (cli:256) Error: --disk path=test1/disk.img,cache=unsafe: Storage pool not found: no storage pool with matching uuid '3d85ec50-f69f-45af-9aa5-dd3151773269' (pool15)
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:258) 
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtinst/cli.py", line 1501, in parse
    objs = self._parse(inst is None and self.guest or inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 3308, in _parse
    super()._parse(inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 1471, in _parse
    param.parse_param(self, inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 1135, in parse_param
    xmlutil.set_prop_path(inst, self.propname, self.val)
  File "/usr/share/virt-manager/virtinst/xmlutil.py", line 66, in set_prop_path
    return setattr(parent, pieces[-1], value)
  File "/usr/share/virt-manager/virtinst/devices/disk.py", line 337, in _set_path
    (vol_object, parent_pool) = diskbackend.manage_path(self.conn, newpath)
  File "/usr/share/virt-manager/virtinst/diskbackend.py", line 137, in manage_path
    if not path_is_url(path) and not path_is_network_vol(conn, path):
  File "/usr/share/virt-manager/virtinst/diskbackend.py", line 171, in path_is_network_vol
    for volxml in conn.fetch_all_vols():
  File "/usr/share/virt-manager/virtinst/connection.py", line 273, in fetch_all_vols
    return self._fetch_helper(
  File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
    self._fetch_cache[key] = raw_cb()
  File "/usr/share/virt-manager/virtinst/connection.py", line 222, in _fetch_all_vols_raw
    for poolxmlobj in self.fetch_all_pools():
  File "/usr/share/virt-manager/virtinst/connection.py", line 264, in fetch_all_pools
    return self._fetch_helper(
  File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
    self._fetch_cache[key] = raw_cb()
  File "/usr/share/virt-manager/virtinst/connection.py", line 195, in _fetch_all_pools_raw
    return [self._build_pool_raw(poolobj) for poolobj in ret]
  File "/usr/share/virt-manager/virtinst/connection.py", line 195, in <listcomp>
    return [self._build_pool_raw(poolobj) for poolobj in ret]
  File "/usr/share/virt-manager/virtinst/connection.py", line 190, in _build_pool_raw
    parsexml=poolobj.XMLDesc(0))
  File "/usr/lib64/python3.9/site-packages/libvirt.py", line 3502, in XMLDesc
    if ret is None: raise libvirtError ('virStoragePoolGetXMLDesc() failed', pool=self)
libvirt.libvirtError: Storage pool not found: no storage pool with matching uuid '3d85ec50-f69f-45af-9aa5-dd3151773269' (pool15)


So this is a TOCTOU again between _fetch_all_pools_raw() and introspecting a particular pool wiht the .XMLDesc() function.

Comment 7 Martin Pitt 2020-11-05 12:46:19 UTC
I sent a fix to upstream: https://github.com/virt-manager/virt-manager/pull/179

Comment 8 Cole Robinson 2020-11-08 22:03:00 UTC
Martin's fixed were applied upstream now

Comment 9 Martin Pitt 2020-11-24 12:51:55 UTC
The fix got released upstream in v3.2.0, and delivered into F33 in https://bodhi.fedoraproject.org/updates/FEDORA-2020-0782be7dfb

I ran the reproducer again, and it is stable now. So this is fixed in F33, and for F32 we can backport this locally.

Thanks!