Description of problem:

In the anaconda installer's kickstart tests [1] we run a lot of virt-install commands in parallel to test various scenarios. I recently tried them on a beefy machine with 32 CPUs, and thus 32 parallel test jobs, for ~200 tests. Of these, two tests failed like this:

  VirtualInstall failed: Problem starting virtual install: Command '['virt-install', '-n', 'kstest-onboot-activate-httpks_(47db99ac-f17d-4336-a302-fecc32fd51b3)', '-r', '2048', '--noautoconsole', '--vcpus', '1', '--rng', '/dev/random', '--noreboot', '--graphics', 'vnc', '--disk', 'path=/var/tmp/kstest-onboot-activate-httpks.0o1sra44/disk-a.img,cache=unsafe,bus=sata', '--network', 'user', '--network', 'user', '--network', 'user', '--disk', 'path=/var/tmp/kstest-onboot-activate-httpks.0o1sra44/boot.iso,device=cdrom,readonly=on,shareable=on', '--extra-args', ' debug=1 inst.debug rd.shell=0 rd.emergency=poweroff ip=enp1s0:dhcp inst.ks=http://10.0.2.2:40000/ks.cfg inst.kernel.hung_task_timeout_secs=1200 inst.stage2=hd:CDLABEL=Fedora-S-dvd-x86_64-rawh', '--location', '/var/tmp/kstest-onboot-activate-httpks.0o1sra44/boot.iso,kernel=isolinux/vmlinuz,initrd=isolinux/initrd.img', '--channel', 'tcp,host=127.0.0.1:54347,mode=connect,target_type=virtio,name=org.fedoraproject.anaconda.log.0']' returned non-zero exit status 1.

  ERROR    Error: --disk path=/var/tmp/kstest-onboot-activate-httpks.0o1sra44/disk-a.img,cache=unsafe,bus=sata: Storage pool not found: no storage pool with matching name 'kstest-initial-setup-gui.cb45niwk'

Apparently, virt-install creates a new storage pool for each test, with a name that matches the domain name:

  $ virsh pool-list
   Name                                     State    Autostart
  --------------------------------------------------------------
   kstest-onboot-activate-httpks.0o1sra44   active   yes

So you see that virt-install tried to do something with *other* storage pools, in this case 'kstest-initial-setup-gui.cb45niwk'.
That pool belongs to a parallel test, which may clean up its own storage pool at any time. I suppose virt-install doesn't actually try to modify/configure that pool, but this smells like a race condition when enumerating all storage pools and running some query against each of them?

[1] https://github.com/rhinstaller/kickstart-tests

Version-Release number of selected component (if applicable):
virt-install-3.1.0-1.fc33.noarch

How reproducible:
Sometimes

Steps to Reproduce:
For now, the only known way is to run kickstart-tests with massive parallelism on a powerful machine:

  git clone https://github.com/rhinstaller/kickstart-tests
  cd kickstart-tests
  containers/runner/launch all

This downloads a kstest-runner container and runs the tests with *user* libvirt inside. (Not sure if user vs. system makes any difference -- the container runs unprivileged, and inside it the test runs as a user, not root.) But I realize that this is a bit unwieldy -- I'm happy to run the reproducer myself with proposed changes, or to come up with a better one.

Some questions for you:

- Do you have some initial hints which part of the code to look at? I'm happy to give that a shot.
- Do you know a really cheap way to invoke virt-install, to basically just do a no-op? So that we can test the parallelism without actually having to do a Fedora install in each case?

Thanks!
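To make the suspected failure mode concrete: enumerate a snapshot of objects, then query each one while another process may delete it in between. This plain-Python sketch illustrates that race class with a made-up in-memory "pool registry" and hypothetical names (`pools`, `xml_desc`) -- it is not virt-install's actual code:

```python
# Plain-Python sketch of the suspected race class; the registry and all
# names here are made up for illustration, NOT virt-install's actual code.
pools = {
    "pool15": "<pool name='pool15'/>",
    "pool16": "<pool name='pool16'/>",
}

def xml_desc(name):
    # Stand-in for virStoragePoolGetXMLDesc(): fails if the pool is gone.
    if name not in pools:
        raise KeyError(f"no storage pool with matching name '{name}'")
    return pools[name]

# Step 1: enumerate all pools (the "check").
snapshot = list(pools)

# Step 2: in between, another test's cleanup deletes one of them.
del pools["pool15"]

# Step 3: fetch the XML of every enumerated pool (the "use") -- this now
# fails, even though the caller never cared about pool15 itself.
hit = None
try:
    descriptions = [xml_desc(name) for name in snapshot]
except KeyError as exc:
    hit = str(exc)
print("race hit:", hit)
```

The deletion here is done deterministically between the two steps; in the real parallel test runs the same window is hit only occasionally, which is why the failure is sporadic.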
> virt-install-3.1.0-1.fc33.noarch

Sorry, that was a lie, I looked in the wrong place. This happened with virt-install-2.2.1-3.fc32.noarch. The fc33 version is what I have locally, but unfortunately Fedora 33's docker images don't work any more on a RHEL 7 host, so I have to use an F32 image to run the massively parallel test.
I am trying to get a very small virt-install invocation that causes this effect, without having to endure an actual long installation. My first naïve attempt was

  qemu-img create -f qcow2 test1.img 50M
  virt-install --network user --memory 128 --pxe --network user --os-variant alpinelinux3.8 --disk path=test1.img,cache=unsafe --name test1

which essentially just goes into a PXE wait loop -- but that's fine, it defines the domain. However, it defines just a single storage pool that's named after the host name ("kstest"):

  $ virsh pool-list --all
   Name     State    Autostart
  ------------------------------
   kstest   active   yes

and running the same thing with s/test1/test2/ also does not create a new storage pool.

Then I thought maybe it's because the real invocation has two disks, including the CD-ROM:

  touch boot.iso
  virt-install --network user --memory 128 --pxe --network user --os-variant alpinelinux3.8 --disk path=test1.img,cache=unsafe --disk path=boot.iso,device=cdrom,readonly=on,shareable=on --name test1

but that doesn't create a new pool either. Neither does appending ",bus=sata".

The only reference to pools that I find in https://github.com/rhinstaller/kickstart-tests is `def destroy()` in scripts/launcher/lib/virtual_controller.py, which removes a test's own storage pool. But that's not what this bug is about. So I'm missing something. How does virt-install go into that "create separate storage pool" mode?
Oh, got it -- it's --location! And that requires an actual boot.iso, so:

  wget https://download.fedoraproject.org/pub/fedora/linux/development/rawhide/Server/x86_64/os/images/boot.iso
  mkdir test1
  ln ./boot.iso test1/
  qemu-img create -f qcow2 test1.img 50M
  virt-install --network user --memory 128 --network user --disk path=test1.img,cache=unsafe --name test1 --location test1/boot.iso,kernel=isolinux/vmlinuz,initrd=isolinux/initrd.img

This fails, but it's enough to create a "test1" pool. I'll see to building a reproducer out of that, but that's something for tomorrow.
Created attachment 1726822 [details]
initial reproducer

I have a first working reproducer. It runs four virt-installs (test{1..4}) in a loop, which closely resembles what happens in our tests. After a few minutes, one of them fails like this:

  $ ./reproducer.sh
  ERROR    Error: --disk path=test1/disk.img,cache=unsafe: Storage pool not found: no storage pool with matching uuid '06cae576-f1eb-44b9-a4b3-6fae9302202c' (test3)

This is exactly this bug: test1 errors out because storage pool test3 got removed underneath it. You can watch the creation/destruction while the script runs with

  watch -n0.2 'virsh pool-list --all; virsh list --all'

Due to its sloppy shell nature, one failing subprocess won't stop the entire reproducer. Stop the test with `pkill -e reproducer`.

I know this isn't ideal yet, and I'll still look for a better one. But at least this confirms that it's not a bug in our kickstart-tests, is _some_ standalone reproducer, and that it also happens with virt-install-3.1.0-1.fc33.noarch.
Created attachment 1726828 [details]
better reproducer

I have a better reproducer now, which greatly improves the chance of hitting the race. It creates 100 storage pools and removes them in the background while trying to do a *single* virt-install run. I ran it a few times, and get a 100% "success" (i.e. hit-the-race) rate:

  $ ./reproducer2.sh
  ERROR    Error: --disk path=test1/disk.img,cache=unsafe: Storage pool not found: no storage pool with matching uuid '03957f51-0151-42e5-bf41-3852e27a8654' (pool16)

Sometimes it's pool17 or similar; that depends on the timing. But this seems to be reliable enough.
Now that virt-install doesn't run a thousand times, replacing its --quiet with --debug is insightful:

  [Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:204) Launched with command line: /usr/bin/virt-install --name test1 --debug --network user --memory 128 --disk path=test1/disk.img,cache=unsafe --location test1/boot.iso,kernel=isolinux/vmlinuz,initrd=isolinux/initrd.img
  [Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (virtinstall:212) Distilled --network options: ['user']
  [Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (virtinstall:144) Distilled --disk options: ['path=test1/disk.img,cache=unsafe']
  [Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:216) Requesting libvirt URI default
  [Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:219) Received libvirt URI qemu:///session
  [Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:1514) Exception parsing inst=<DeviceDisk disk 139935836089552> optstr=path=test1/disk.img,cache=unsafe
  Traceback (most recent call last):
    File "/usr/share/virt-manager/virtinst/cli.py", line 1501, in parse
      objs = self._parse(inst is None and self.guest or inst)
    File "/usr/share/virt-manager/virtinst/cli.py", line 3308, in _parse
      super()._parse(inst)
    File "/usr/share/virt-manager/virtinst/cli.py", line 1471, in _parse
      param.parse_param(self, inst)
    File "/usr/share/virt-manager/virtinst/cli.py", line 1135, in parse_param
      xmlutil.set_prop_path(inst, self.propname, self.val)
    File "/usr/share/virt-manager/virtinst/xmlutil.py", line 66, in set_prop_path
      return setattr(parent, pieces[-1], value)
    File "/usr/share/virt-manager/virtinst/devices/disk.py", line 337, in _set_path
      (vol_object, parent_pool) = diskbackend.manage_path(self.conn, newpath)
    File "/usr/share/virt-manager/virtinst/diskbackend.py", line 137, in manage_path
      if not path_is_url(path) and not path_is_network_vol(conn, path):
    File "/usr/share/virt-manager/virtinst/diskbackend.py", line 171, in path_is_network_vol
      for volxml in conn.fetch_all_vols():
    File "/usr/share/virt-manager/virtinst/connection.py", line 273, in fetch_all_vols
      return self._fetch_helper(
    File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
      self._fetch_cache[key] = raw_cb()
    File "/usr/share/virt-manager/virtinst/connection.py", line 222, in _fetch_all_vols_raw
      for poolxmlobj in self.fetch_all_pools():
    File "/usr/share/virt-manager/virtinst/connection.py", line 264, in fetch_all_pools
      return self._fetch_helper(
    File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
      self._fetch_cache[key] = raw_cb()
    File "/usr/share/virt-manager/virtinst/connection.py", line 195, in _fetch_all_pools_raw
      return [self._build_pool_raw(poolobj) for poolobj in ret]
    File "/usr/share/virt-manager/virtinst/connection.py", line 195, in <listcomp>
      return [self._build_pool_raw(poolobj) for poolobj in ret]
    File "/usr/share/virt-manager/virtinst/connection.py", line 190, in _build_pool_raw
      parsexml=poolobj.XMLDesc(0))
    File "/usr/lib64/python3.9/site-packages/libvirt.py", line 3502, in XMLDesc
      if ret is None: raise libvirtError ('virStoragePoolGetXMLDesc() failed', pool=self)
  libvirt.libvirtError: Storage pool not found: no storage pool with matching uuid '3d85ec50-f69f-45af-9aa5-dd3151773269' (pool15)
  [Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:255)
    File "/usr/bin/virt-install", line 8, in <module>
      virtinstall.runcli()
    File "/usr/share/virt-manager/virtinst/virtinstall.py", line 1158, in runcli
      sys.exit(main())
    File "/usr/share/virt-manager/virtinst/virtinstall.py", line 1145, in main
      guest, installer = build_guest_instance(conn, options)
    File "/usr/share/virt-manager/virtinst/virtinstall.py", line 580, in build_guest_instance
      guest = _build_options_guest(conn, options)
    File "/usr/share/virt-manager/virtinst/virtinstall.py", line 561, in _build_options_guest
      cli.parse_option_strings(options, guest, None)
    File "/usr/share/virt-manager/virtinst/cli.py", line 4391, in parse_option_strings
      parseret = parserobj.parse(inst)
    File "/usr/share/virt-manager/virtinst/cli.py", line 1516, in parse
      fail(_("Error: %(cli_flag_name)s %(options)s: %(err)s") %
    File "/usr/share/virt-manager/virtinst/cli.py", line 255, in fail
      log.debug("".join(traceback.format_stack()))
  [Thu, 05 Nov 2020 11:02:50 virt-install 41734] ERROR (cli:256) Error: --disk path=test1/disk.img,cache=unsafe: Storage pool not found: no storage pool with matching uuid '3d85ec50-f69f-45af-9aa5-dd3151773269' (pool15)
  [Thu, 05 Nov 2020 11:02:50 virt-install 41734] DEBUG (cli:258) Traceback (most recent call last):
    File "/usr/share/virt-manager/virtinst/cli.py", line 1501, in parse
      objs = self._parse(inst is None and self.guest or inst)
    File "/usr/share/virt-manager/virtinst/cli.py", line 3308, in _parse
      super()._parse(inst)
    File "/usr/share/virt-manager/virtinst/cli.py", line 1471, in _parse
      param.parse_param(self, inst)
    File "/usr/share/virt-manager/virtinst/cli.py", line 1135, in parse_param
      xmlutil.set_prop_path(inst, self.propname, self.val)
    File "/usr/share/virt-manager/virtinst/xmlutil.py", line 66, in set_prop_path
      return setattr(parent, pieces[-1], value)
    File "/usr/share/virt-manager/virtinst/devices/disk.py", line 337, in _set_path
      (vol_object, parent_pool) = diskbackend.manage_path(self.conn, newpath)
    File "/usr/share/virt-manager/virtinst/diskbackend.py", line 137, in manage_path
      if not path_is_url(path) and not path_is_network_vol(conn, path):
    File "/usr/share/virt-manager/virtinst/diskbackend.py", line 171, in path_is_network_vol
      for volxml in conn.fetch_all_vols():
    File "/usr/share/virt-manager/virtinst/connection.py", line 273, in fetch_all_vols
      return self._fetch_helper(
    File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
      self._fetch_cache[key] = raw_cb()
    File "/usr/share/virt-manager/virtinst/connection.py", line 222, in _fetch_all_vols_raw
      for poolxmlobj in self.fetch_all_pools():
    File "/usr/share/virt-manager/virtinst/connection.py", line 264, in fetch_all_pools
      return self._fetch_helper(
    File "/usr/share/virt-manager/virtinst/connection.py", line 179, in _fetch_helper
      self._fetch_cache[key] = raw_cb()
    File "/usr/share/virt-manager/virtinst/connection.py", line 195, in _fetch_all_pools_raw
      return [self._build_pool_raw(poolobj) for poolobj in ret]
    File "/usr/share/virt-manager/virtinst/connection.py", line 195, in <listcomp>
      return [self._build_pool_raw(poolobj) for poolobj in ret]
    File "/usr/share/virt-manager/virtinst/connection.py", line 190, in _build_pool_raw
      parsexml=poolobj.XMLDesc(0))
    File "/usr/lib64/python3.9/site-packages/libvirt.py", line 3502, in XMLDesc
      if ret is None: raise libvirtError ('virStoragePoolGetXMLDesc() failed', pool=self)
  libvirt.libvirtError: Storage pool not found: no storage pool with matching uuid '3d85ec50-f69f-45af-9aa5-dd3151773269' (pool15)

So this is a TOCTOU again, between _fetch_all_pools_raw() enumerating the pools and introspecting a particular pool with the .XMLDesc() function.
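An obvious defensive pattern for this kind of TOCTOU is to treat a pool that vanishes between enumeration and inspection the same as one that was already gone before the enumeration started: catch the per-pool "not found" error and skip that pool. The actual upstream change may well look different; this is only a generic plain-Python sketch with made-up names (`pools`, `xml_desc`), not the real virtinst code:

```python
# Generic sketch of the skip-on-vanish pattern; names are hypothetical,
# this is not the real virtinst code.
pools = {
    "pool15": "<pool name='pool15'/>",
    "pool16": "<pool name='pool16'/>",
}

def xml_desc(name):
    # Stand-in for virStoragePoolGetXMLDesc(): fails if the pool is gone.
    if name not in pools:
        raise KeyError(f"no storage pool with matching name '{name}'")
    return pools[name]

snapshot = list(pools)   # enumerate all pools
del pools["pool15"]      # concurrent cleanup removes one in between

descriptions = []
for name in snapshot:
    try:
        descriptions.append(xml_desc(name))
    except KeyError:
        # The pool vanished between enumeration and inspection; ignore it,
        # exactly as if it had been deleted before we enumerated.
        continue

print(descriptions)  # only the surviving pool's XML
```

With this pattern the caller still gets a consistent (if slightly stale) view of the remaining pools, instead of the whole command aborting over a pool it never needed.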
I sent a fix upstream: https://github.com/virt-manager/virt-manager/pull/179
Martin's fixes have been applied upstream now.
The fix was released upstream in v3.2.0 and delivered to F33 in https://bodhi.fedoraproject.org/updates/FEDORA-2020-0782be7dfb

I ran the reproducer again, and it is stable now. So this is fixed in F33, and for F32 we can backport the fix locally. Thanks!