Bug 1894359 - virt-install race: Storage pool not found: no storage pool with matching name <unrelated pool>
Summary: virt-install race: Storage pool not found: no storage pool with matching name <unrelated pool>
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: virt-manager
Version: 32
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Cole Robinson
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-11-04 06:22 UTC by Martin Pitt
Modified: 2020-11-24 12:51 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-24 12:51:55 UTC
Type: Bug
Embargoed:


Attachments
initial reproducer (1.28 KB, text/plain)
2020-11-05 09:54 UTC, Martin Pitt
better reproducer (1.42 KB, text/plain)
2020-11-05 11:00 UTC, Martin Pitt


Links
Github virt-manager virt-manager pull 179 (closed): virtinst: Fix TOCTOU in pool and volume enumeration (last updated 2020-11-24 10:12:19 UTC)

Description Martin Pitt 2020-11-04 06:22:02 UTC
Description of problem: In the anaconda installer's kickstart tests [1] we run a lot of virt-install commands in parallel to test various scenarios. I recently ran them on a beefy machine with 32 CPUs, and thus 32 parallel test jobs, for ~200 tests overall.

Of these, two tests failed like this:

VirtualInstall failed: Problem starting virtual install: Command '['virt-install', '-n', 'kstest-onboot-activate-httpks_(47db99ac-f17d-4336-a302-fecc32fd51b3)', '-r', '2048', '--noautoconsole', '--vcpus', '1', '--rng', '/dev/random', '--noreboot', '--graphics', 'vnc', '--disk', 'path=/var/tmp/kstest-onboot-activate-httpks.0o1sra44/disk-a.img,cache=unsafe,bus=sata', '--network', 'user', '--network', 'user', '--network', 'user', '--disk', 'path=/var/tmp/kstest-onboot-activate-httpks.0o1sra44/boot.iso,device=cdrom,readonly=on,shareable=on', '--extra-args', ' debug=1 inst.debug rd.shell=0 rd.emergency=poweroff ip=enp1s0:dhcp inst.ks=http://10.0.2.2:40000/ks.cfg inst.kernel.hung_task_timeout_secs=1200 inst.stage2=hd:CDLABEL=Fedora-S-dvd-x86_64-rawh', '--location', '/var/tmp/kstest-onboot-activate-httpks.0o1sra44/boot.iso,kernel=isolinux/vmlinuz,initrd=isolinux/initrd.img', '--channel', 'tcp,host=127.0.0.1:54347,mode=connect,target_type=virtio,name=org.fedoraproject.anaconda.log.0']' returned non-zero exit status 1.

ERROR    Error: --disk path=/var/tmp/kstest-onboot-activate-httpks.0o1sra44/disk-a.img,cache=unsafe,bus=sata: Storage pool not found: no storage pool with matching name 'kstest-initial-setup-gui.cb45niwk'

Apparently, virt-install creates a new storage pool for each test with a name that matches the domain name:

$ virsh pool-list
 Name                                     State    Autostart
--------------------------------------------------------------
 kstest-onboot-activate-httpks.0o1sra44   active   yes


So virt-install apparently tried to do something with *other* storage pools, in this case 'kstest-initial-setup-gui.cb45niwk'. That pool belongs to a parallel test, which may clean up its own storage pool at any time. I suppose virt-install doesn't actually try to modify/configure it, but this smells like a race condition while enumerating all storage pools and running some query on each of them?
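
In other words, I suspect the classic enumerate-then-query pattern. A minimal sketch of that pattern against the libvirt Python bindings (purely illustrative, this is not virt-install's actual code path):

  import libvirt

  conn = libvirt.open("qemu:///session")
  for pool in conn.listAllStoragePools():   # step 1: enumerate every pool
      xml = pool.XMLDesc(0)                 # step 2: query it -- raises libvirtError if a
                                            # parallel test removed the pool in between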


[1] https://github.com/rhinstaller/kickstart-tests


Version-Release number of selected component (if applicable):

virt-install-3.1.0-1.fc33.noarch

How reproducible: Sometimes


Steps to Reproduce:

For now, the only known way is to run kickstart-tests with massive parallelism on a powerful machine:

  git clone https://github.com/rhinstaller/kickstart-tests
  cd kickstart-tests
  containers/runner/launch all

This downloads a kstest-runner container and runs the tests with *user* libvirt inside. (Not sure if user vs. system makes any difference -- the container runs unprivileged, and inside it the tests run as a regular user, not root.)

But I realize that this is a bit unwieldy -- I'm happy to run the reproducer myself with proposed changes, or come up with a better one.

Some questions to you:

 - Do you have some initial hints about which part of the code to look at? I'm happy to give that a shot.
 - Do you know of a really cheap way to invoke virt-install, basically just a no-op? That way we can test the parallelism without actually having to do a Fedora install in each case.

Thanks!

Comment 1 Martin Pitt 2020-11-04 06:24:09 UTC
> virt-install-3.1.0-1.fc33.noarch

Sorry, that was a lie -- I looked in the wrong place. This happened with

  virt-install-2.2.1-3.fc32.noarch

The fc33 version is what I have locally, but unfortunately Fedora 33's docker images don't work any more on a RHEL 7 host, so I have to use an F32 image to run the massively parallel test.

Comment 2 Martin Pitt 2020-11-04 15:49:44 UTC
I am trying to get a very small virt-install invocation that causes this effect, without having to endure an actual long installation.

My first naïve attempt was

  qemu-img create -f qcow2 test1.img 50M
  virt-install --network user --memory 128 --pxe --network user --os-variant alpinelinux3.8 --disk path=test1.img,cache=unsafe --name test1

which essentially just goes into a PXE wait loop -- but that's fine, it defines the domain. However, it only defines a single storage pool, named after the host name ("kstest"):

$ virsh pool-list --all
 Name     State    Autostart
------------------------------
 kstest   active   yes


and running the same thing with s/test1/test2/ also does not create a storage pool. Then I thought maybe it's because the real invocation has a second disk, the CD-ROM:

  touch boot.iso
  virt-install --network user --memory 128 --pxe --network user --os-variant alpinelinux3.8 --disk path=test1.img,cache=unsafe --disk path=boot.iso,device=cdrom,readonly=on,shareable=on --name test1

But that doesn't create a new pool either. Neither does appending ",bus=sata".

The only reference to pools that I can find in https://github.com/rhinstaller/kickstart-tests is `def destroy()` in scripts/launcher/lib/virtual_controller.py, which removes a test's own storage pool. But that's not what this bug is about.

So I'm missing something. How does virt-install go into that "create separate storage pool" mode?

Comment 3 Martin Pitt 2020-11-04 16:16:26 UTC
Oh, got it -- it's --location! And that requires an actual boot.iso, so:

 
  wget https://download.fedoraproject.org/pub/fedora/linux/development/rawhide/Server/x86_64/os/images/boot.iso
  mkdir test1
  ln ./boot.iso test1/
  qemu-img create -f qcow2 test1.img 50M
  virt-install --network user --memory 128 --network user --disk path=test1.img,cache=unsafe --name test1 --location test1/boot.iso,kernel=isolinux/vmlinuz,initrd=isolinux/initrd.img

This fails, but it's enough to create a "test1" pool. I'll try to build a reproducer out of that, but that's something for tomorrow.

Comment 4 Martin Pitt 2020-11-05 09:54:21 UTC
Created attachment 1726822 [details]
initial reproducer

I have a first working reproducer. It creates four virt-installs in a loop (test{1..4}), which closely resembles what happens in our tests. After a few minutes, one of them fails like this:

$ ./reproducer.sh
ERROR    Error: --disk path=test1/disk.img,cache=unsafe: Storage pool not found: no storage pool with matching uuid '06cae576-f1eb-44b9-a4b3-6fae9302202c' (test3)

This is exactly this bug (test1 erroring because storage pool test3 got removed underneath it).

You can watch the creation/destruction while the script runs with

    watch -n0.2 'virsh pool-list --all; virsh list --all'

Due to its sloppy shell nature, a failing subprocess won't stop the entire reproducer. Stop the test with `pkill -e reproducer`.

I know this isn't ideal yet; I'll keep looking for a better one. But at least it confirms that this is not a bug in our kickstart-tests, provides _some_ standalone reproducer, and shows that the problem also happens with virt-install-3.1.0-1.fc33.noarch.

Comment 5 Martin Pitt 2020-11-05 11:00:01 UTC
Created attachment 1726828 [details]
better reproducer

I have a better reproducer now, which greatly improves the chance of hitting the race. It creates 100 storage pools, and removes them in the background while trying to do a *single* virt-install run.

I ran it a few times and got a 100% "success" rate (i.e. it hit the race every time):

$ ./reproducer2.sh 
ERROR    Error: --disk path=test1/disk.img,cache=unsafe: Storage pool not found: no storage pool with matching uuid '03957f51-0151-42e5-bf41-3852e27a8654' (pool16)

Sometimes it's pool17 or similar; that depends on the timing. But this seems reliable enough.
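
For reference, here is a rough sketch of the same approach using the libvirt Python bindings (an illustration of the idea only, not the attached script; the pool names and paths are placeholders, and it assumes the test1/ setup with boot.iso and disk.img from comment 3):

  #!/usr/bin/env python3
  # Illustration only: define and start many directory pools, then tear them
  # down in a background thread while a single virt-install run enumerates them.
  import os
  import subprocess
  import threading

  import libvirt

  NPOOLS = 100

  conn = libvirt.open("qemu:///session")

  pools = []
  for i in range(NPOOLS):
      path = f"/tmp/racepool{i}"
      os.makedirs(path, exist_ok=True)
      xml = f"<pool type='dir'><name>racepool{i}</name><target><path>{path}</path></target></pool>"
      pool = conn.storagePoolDefineXML(xml, 0)
      pool.create(0)
      pools.append(pool)

  def remove_pools():
      # Tear the pools down while virt-install is still enumerating them.
      for pool in pools:
          try:
              pool.destroy()
              pool.undefine()
          except libvirt.libvirtError:
              pass

  t = threading.Thread(target=remove_pools)
  t.start()
  # virt-install walks all storage pools while the thread above removes them
  subprocess.run(["virt-install", "--name", "test1", "--memory", "128",
                  "--network", "user", "--disk", "path=test1/disk.img,cache=unsafe",
                  "--location", "test1/boot.iso,kernel=isolinux/vmlinuz,initrd=isolinux/initrd.img"])
  t.join()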

Comment 6 Martin Pitt 2020-11-05 11:04:55 UTC
Now that virt-install doesn't run a thousand times, replacing its --quiet with --debug is insightful:

[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:204) Launched with command line: /usr/bin/virt-install --name test1 --debug --network user --memory 128 --disk path=test1/disk.img,cache=unsafe --location test1/boot.iso,kernel=isolinux/vmlinuz,initrd=isolinux/initrd.img
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (virtinstall:212) Distilled --network options: ['user']
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (virtinstall:144) Distilled --disk options: ['path=test1/disk.img,cache=unsafe']
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:216) Requesting libvirt URI default
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:219) Received libvirt URI qemu:///session
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:1514) Exception parsing inst=<DeviceDisk disk 139935836089552> optstr=path=test1/disk.img,cache=unsafe
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtinst/cli.py", line 1501, in parse
    objs = self._parse(inst is None and self.guest or inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 3308, in _parse
    super()._parse(inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 1471, in _parse
    param.parse_param(self, inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 1135, in parse_param
    xmlutil.set_prop_path(inst, self.propname, self.val)
  File "/usr/share/virt-manager/virtinst/xmlutil.py", line 66, in set_prop_path
    return setattr(parent, pieces[-1], value)
  File "/usr/share/virt-manager/virtinst/devices/disk.py", line 337, in _set_path
    (vol_object, parent_pool) = diskbackend.manage_path(self.conn, newpath)
  File "/usr/share/virt-manager/virtinst/diskbackend.py", line 137, in manage_path
    if not path_is_url(path) and not path_is_network_vol(conn, path):
  File "/usr/share/virt-manager/virtinst/diskbackend.py", line 171, in path_is_network_vol
    for volxml in conn.fetch_all_vols():
  File "/usr/share/virt-manager/virtinst/connection.py", line 273, in fetch_all_vols
    return self._fetch_helper(
  File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
    self._fetch_cache[key] = raw_cb()
  File "/usr/share/virt-manager/virtinst/connection.py", line 222, in _fetch_all_vols_raw
    for poolxmlobj in self.fetch_all_pools():
  File "/usr/share/virt-manager/virtinst/connection.py", line 264, in fetch_all_pools
    return self._fetch_helper(
  File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
    self._fetch_cache[key] = raw_cb()
  File "/usr/share/virt-manager/virtinst/connection.py", line 195, in _fetch_all_pools_raw
    return [self._build_pool_raw(poolobj) for poolobj in ret]
  File "/usr/share/virt-manager/virtinst/connection.py", line 195, in <listcomp>
    return [self._build_pool_raw(poolobj) for poolobj in ret]
  File "/usr/share/virt-manager/virtinst/connection.py", line 190, in _build_pool_raw
    parsexml=poolobj.XMLDesc(0))
  File "/usr/lib64/python3.9/site-packages/libvirt.py", line 3502, in XMLDesc
    if ret is None: raise libvirtError ('virStoragePoolGetXMLDesc() failed', pool=self)
libvirt.libvirtError: Storage pool not found: no storage pool with matching uuid '3d85ec50-f69f-45af-9aa5-dd3151773269' (pool15)
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:255)   File "/usr/bin/virt-install", line 8, in <module>
    virtinstall.runcli()
  File "/usr/share/virt-manager/virtinst/virtinstall.py", line 1158, in runcli
    sys.exit(main())
  File "/usr/share/virt-manager/virtinst/virtinstall.py", line 1145, in main
    guest, installer = build_guest_instance(conn, options)
  File "/usr/share/virt-manager/virtinst/virtinstall.py", line 580, in build_guest_instance
    guest = _build_options_guest(conn, options)
  File "/usr/share/virt-manager/virtinst/virtinstall.py", line 561, in _build_options_guest
    cli.parse_option_strings(options, guest, None)
  File "/usr/share/virt-manager/virtinst/cli.py", line 4391, in parse_option_strings
    parseret = parserobj.parse(inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 1516, in parse
    fail(_("Error: %(cli_flag_name)s %(options)s: %(err)s") %
  File "/usr/share/virt-manager/virtinst/cli.py", line 255, in fail
    log.debug("".join(traceback.format_stack()))

[Thu, 05 Nov 2020 11:02:50 virt-install 41734] ERROR (cli:256) Error: --disk path=test1/disk.img,cache=unsafe: Storage pool not found: no storage pool with matching uuid '3d85ec50-f69f-45af-9aa5-dd3151773269' (pool15)
[Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:258) 
Traceback (most recent call last):
  File "/usr/share/virt-manager/virtinst/cli.py", line 1501, in parse
    objs = self._parse(inst is None and self.guest or inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 3308, in _parse
    super()._parse(inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 1471, in _parse
    param.parse_param(self, inst)
  File "/usr/share/virt-manager/virtinst/cli.py", line 1135, in parse_param
    xmlutil.set_prop_path(inst, self.propname, self.val)
  File "/usr/share/virt-manager/virtinst/xmlutil.py", line 66, in set_prop_path
    return setattr(parent, pieces[-1], value)
  File "/usr/share/virt-manager/virtinst/devices/disk.py", line 337, in _set_path
    (vol_object, parent_pool) = diskbackend.manage_path(self.conn, newpath)
  File "/usr/share/virt-manager/virtinst/diskbackend.py", line 137, in manage_path
    if not path_is_url(path) and not path_is_network_vol(conn, path):
  File "/usr/share/virt-manager/virtinst/diskbackend.py", line 171, in path_is_network_vol
    for volxml in conn.fetch_all_vols():
  File "/usr/share/virt-manager/virtinst/connection.py", line 273, in fetch_all_vols
    return self._fetch_helper(
  File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
    self._fetch_cache[key] = raw_cb()
  File "/usr/share/virt-manager/virtinst/connection.py", line 222, in _fetch_all_vols_raw
    for poolxmlobj in self.fetch_all_pools():
  File "/usr/share/virt-manager/virtinst/connection.py", line 264, in fetch_all_pools
    return self._fetch_helper(
  File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
    self._fetch_cache[key] = raw_cb()
  File "/usr/share/virt-manager/virtinst/connection.py", line 195, in _fetch_all_pools_raw
    return [self._build_pool_raw(poolobj) for poolobj in ret]
  File "/usr/share/virt-manager/virtinst/connection.py", line 195, in <listcomp>
    return [self._build_pool_raw(poolobj) for poolobj in ret]
  File "/usr/share/virt-manager/virtinst/connection.py", line 190, in _build_pool_raw
    parsexml=poolobj.XMLDesc(0))
  File "/usr/lib64/python3.9/site-packages/libvirt.py", line 3502, in XMLDesc
    if ret is None: raise libvirtError ('virStoragePoolGetXMLDesc() failed', pool=self)
libvirt.libvirtError: Storage pool not found: no storage pool with matching uuid '3d85ec50-f69f-45af-9aa5-dd3151773269' (pool15)


So this is again a TOCTOU, between _fetch_all_pools_raw() and introspecting a particular pool with the .XMLDesc() function.
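
The general shape of a fix for this kind of TOCTOU is to tolerate pools that disappear between enumeration and introspection. A minimal sketch against the libvirt Python bindings (illustrative only -- see the linked pull request in the next comment for the actual change; fetch_all_pool_xml is just a made-up helper name):

  import libvirt

  def fetch_all_pool_xml(conn):
      """Collect the XML of every storage pool, tolerating pools that vanish
      between enumeration and the XMLDesc() call."""
      xmls = []
      for pool in conn.listAllStoragePools():
          try:
              xmls.append(pool.XMLDesc(0))
          except libvirt.libvirtError:
              # The pool was removed by a parallel process after enumeration;
              # skip it instead of failing the whole virt-install run.
              continue
      return xmls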

Comment 7 Martin Pitt 2020-11-05 12:46:19 UTC
I sent a fix upstream: https://github.com/virt-manager/virt-manager/pull/179

Comment 8 Cole Robinson 2020-11-08 22:03:00 UTC
Martin's fixes have now been applied upstream.

Comment 9 Martin Pitt 2020-11-24 12:51:55 UTC
The fix was released upstream in v3.2.0 and delivered to F33 in https://bodhi.fedoraproject.org/updates/FEDORA-2020-0782be7dfb

I ran the reproducer again, and it is stable now. So this is fixed in F33; for F32 we can backport the fix locally.

Thanks!

