
Bug 1088454

Summary: libvirt update killed some qemu VMs
Product: Red Hat Enterprise Linux 7
Component: libvirt
Version: 7.0
Hardware: x86_64
OS: Linux
Status: CLOSED DEFERRED
Severity: high
Priority: unspecified
Target Milestone: rc
Target Release: ---
Reporter: Orion Poplawski <orion>
Assignee: Laine Stump <laine>
QA Contact: Virtualization Bugs <virt-bugs>
Docs Contact:
CC: dpal, dyuan, fjin, jdenemar, mzhan, orion, rbalakri, tzheng, yafu
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-31 18:31:03 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments (flags: none):
- qemu log
- dumpxml
- libvirtd.log
- /var/run/libvirt/qemu/vulcan.xml
- /etc/libvirt/qemu/vulcan.xml

Description Orion Poplawski 2014-04-16 16:17:57 UTC
Description of problem:

The following packages were updated:

Apr 16 03:30:34 Updated: 2:qemu-img-0.12.1.2-2.415.el6_5.7.x86_64
Apr 16 03:30:38 Updated: libvirt-client-0.10.2-29.el6_5.7.x86_64
Apr 16 03:31:01 Updated: krb5-libs-1.10.3-15.el6_5.1.x86_64
Apr 16 03:31:02 Updated: nss-softokn-freebl-3.14.3-10.el6_5.x86_64
Apr 16 03:31:12 Updated: krb5-workstation-1.10.3-15.el6_5.1.x86_64
Apr 16 03:31:15 Updated: libvirt-python-0.10.2-29.el6_5.7.x86_64
Apr 16 03:31:17 Updated: libvirt-0.10.2-29.el6_5.7.x86_64
Apr 16 03:31:18 Updated: 2:qemu-kvm-0.12.1.2-2.415.el6_5.7.x86_64

On libvirtd restart, it appears that the new libvirtd process killed some (but not all) of the running VMs:

2014-04-16 09:31:18.175+0000: 27771: info : libvirt version: 0.10.2, package: 29.el6_5.7 (Scientific Linux, 2014-04-07-05:05:20, sl6.fnal.gov)
2014-04-16 09:31:18.175+0000: 27771: error : virStorageBackendVolOpenCheckMode:1040 : cannot stat file '/dev/vg_root/tpool': No such file or directory
2014-04-16 09:31:18.479+0000: 27771: error : virCommandWait:2319 : internal error Child process (/sbin/vgchange -aln vg_root) unexpected exit status 5:   Logical volume vg_root/root contains a filesystem in use.
  Can't deactivate volume group "vg_root" with 10 open logical volume(s)
 
2014-04-16 09:31:18.479+0000: 27771: error : storageDriverAutostart:115 : Failed to autostart storage pool 'vg_root': internal error Child process (/sbin/vgchange -aln vg_root) unexpected exit status 5:   Logical volume vg_root/root contains a filesystem in use.
  Can't deactivate volume group "vg_root" with 10 open logical volume(s)
 
2014-04-16 09:31:19.311+0000: 27760: error : qemuMonitorIO:614 : internal error End of file from monitor
2014-04-16 09:31:19.614+0000: 27760: error : qemuMonitorIO:614 : internal error End of file from monitor
2014-04-16 09:31:19.862+0000: 27760: error : qemuMonitorIO:614 : internal error End of file from monitor
 
 
qemu/netdrmsdb.log:
qemu: terminating on signal 15 from pid 27760
2014-04-16 09:31:19.614+0000: shutting down


(Also, why are the logs in UTC?)

How reproducible:
Tried downgrading and re-upgrading but could not reproduce.

Comment 3 Laine Stump 2014-04-22 08:37:40 UTC
What indicated that the guests were "killed"? Did they no longer show up in the output of "virsh list"? What did "virsh list --all" show? Or did you look for the qemu process and it was missing?

The error messages from libvirt's log indicate that, during a re-scan of an LVM based storage pool, there was a symlink to an LVM volume in /dev/vg_root (tpool) that was stale, which led to the remainder of the storage pool not being rescanned, but 1) I'm not sure how that could have caused a qemu process to exit, unless it was directly using that LVM volume, and 2) libvirt itself doesn't do anything with those links (which should be pointing to the actual device nodes in /dev/dm-*), so it's unclear how an update/restart of libvirt could cause the link to be broken.
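If the stale-symlink theory is worth checking, dangling links under the volume group's /dev directory can be listed directly; a minimal sketch using GNU find (the path is the one from the log above):

```shell
# List dangling (broken) symlinks under the volume group's /dev directory.
# -xtype l matches symlinks whose target no longer exists (GNU find).
find /dev/vg_root -xtype l
```

A stale /dev/vg_root/tpool link would show up in this output.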

Were the "disappeared" guests using the storage pool that uses /dev/vg_root?

Please attach /var/log/libvirt/qemu/$guestname.log and the output of "virsh dumpxml --inactive $guestname" for at least one of the guests that disappeared.

> (Also, why are the logs in UTC?)

The logs are in UTC because it is impossible to do timezone conversion for log messages in a reliable threadsafe manner.
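Since conversion is left to the reader, a UTC log timestamp can be mapped to local time after the fact; a small sketch with GNU date (the America/Denver zone is an assumption here, matching the ~6-hour offset between the yum and libvirtd timestamps above):

```shell
# Convert a UTC libvirtd timestamp into a chosen local zone (GNU date).
TZ="America/Denver" date -d '2014-04-16 09:31:18 UTC' '+%Y-%m-%d %H:%M:%S %Z'
# -> 2014-04-16 03:31:18 MDT
```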

Comment 5 Orion Poplawski 2014-04-23 15:40:45 UTC
Well, the /var/log/libvirt/qemu/$guestname.log for netdrmsdb above (the same message appears for the other dead guests) is my main evidence that the guests were killed.  I'm afraid I don't remember whether virsh list --all showed anything other than "shut off" for the killed guests.

I think the tpool thing may be a red herring - I see it going back a ways in libvirtd.log without ill effects.

Comment 6 Orion Poplawski 2014-04-23 15:41:58 UTC
Created attachment 888989 [details]
qemu log

Comment 7 Orion Poplawski 2014-04-23 15:42:24 UTC
Created attachment 888990 [details]
dumpxml

Comment 8 Laine Stump 2014-04-24 10:06:08 UTC
Ah, I missed the single line from the qemu logfile at the bottom of the description. So libvirtd definitely sent a SIGTERM to the qemu process.

This can happen when libvirtd encounters one of several different errors while trying to reconnect to the qemu process after a restart, but in all these cases there should be an error log describing the failure (for reference, see the function qemuProcessReconnect() - anything with goto error could have caused the SIGTERM, which is sent by the call to qemuProcessStop()).

Since there is no explanatory log from libvirt, I can see three possibilities: 1) libvirt is incorrectly returning a failure code from a function that isn't actually failing (and thus not logging any failure message), leading to the SIGTERM; 2) libvirt has a failure path that doesn't log a message; or 3) qemu is closing the monitor socket that libvirt is trying to connect to, leading libvirt to believe that qemu is exiting, after which it sends SIGTERM just to be sure.

I do have a very vague memory of an instance of (1) involving (I think) PCI passthrough devices that was fixed quite a long time ago, but unfortunately am unable to find either the bug report or the patch that fixed it; perhaps the fix was implicit in some other patch. At any rate, I'm fairly certain that problem was far enough in the past to be fixed in RHEL6's libvirt.

In the absence of a way to reproduce this, I'm not sure what more we can do other than hope that it pops up somewhere else so that we can collect more data.

Comment 10 Orion Poplawski 2014-06-17 20:43:21 UTC
Created attachment 909772 [details]
libvirtd.log

This just happened again with the libvirt-0.10.2-29.el6_5.9.x86_64 update, this time only to a single guest.  I've been running libvirtd at log_level 2 (level 1 is just too verbose for general use, it seems), so there is a little more info there:

2014-06-17 09:35:14.955+0000: 23347: error : qemuMonitorIO:614 : internal error End of file from monitor
2014-06-17 09:35:14.957+0000: 23347: info : virSecuritySELinuxRestoreSecurityFileLabel:924 : Restoring SELinux context on '/dev/mapper/vg_data-vulcan--disk0'
2014-06-17 09:35:15.075+0000: 23347: info : virSecuritySELinuxSetFileconHelper:794 : Setting SELinux context on '/dev/dm-10' to 'system_u:object_r:fixed_disk_device_t:s0'
2014-06-17 09:35:15.076+0000: 23347: info : virSecurityDACRestoreSecurityFileLabel:340 : Restoring DAC user and group on '/dev/mapper/vg_data-vulcan--disk0'
2014-06-17 09:35:15.076+0000: 23347: info : virSecurityDACSetOwnership:296 : Setting DAC user and group on '/dev/dm-10' to '0:0'

but I guess that is all pretty normal.

Comment 11 Orion Poplawski 2014-06-17 20:54:26 UTC
I tried downgrading/upgrading again, but no real issues, though I do see some of these now in the vulcan qemu log after the downgrade:

block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)
block I/O error in device 'drive-virtio-disk0': Input/output error (5)

Comment 12 Orion Poplawski 2014-06-17 20:58:56 UTC
vulcan went unstable after that and is now having problems:

Buffer I/O error on device vda3, logical block 524292
lost page write due to I/O error on vda3
Buffer I/O error on device vda3, logical block 1572867
lost page write due to I/O error on vda3

Comment 13 Orion Poplawski 2014-06-17 21:09:39 UTC
So what happened this time is I ended up with two qemu-kvm processes for vulcan:

qemu      4306     1 11 07:21 ?        00:52:51 /usr/libexec/qemu-kvm -name vulcan -S -M rhel6.5.0 -cpu Nehalem,+rdtscp,+dca,+pdcm,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -enable-kvm -m 8192 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid ff1fc9d1-5afc-3440-5efc-5656da239222 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vulcan.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/mapper/vg_data-vulcan--disk0,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:16:3e:04:e1:ff,bus=pci.0,addr=0x3 -netdev tap,fd=29,id=hostnet1,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:d4:da:d4,bus=pci.0,addr=0x4 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
qemu     15512     1 16 14:56 ?        00:01:44 /usr/libexec/qemu-kvm -name vulcan -S -M rhel6.5.0 -cpu Nehalem,+rdtscp,+dca,+pdcm,+xtpr,+tm2,+est,+vmx,+ds_cpl,+monitor,+dtes64,+pbe,+tm,+ht,+ss,+acpi,+ds,+vme -enable-kvm -m 8192 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid ff1fc9d1-5afc-3440-5efc-5656da239222 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/vulcan.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/dev/mapper/vg_data-vulcan--disk0,if=none,id=drive-virtio-disk0,format=raw,cache=none,aio=native -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=27,id=hostnet0,vhost=on,vhostfd=28 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=00:16:3e:04:e1:ff,bus=pci.0,addr=0x3 -netdev tap,fd=29,id=hostnet1,vhost=on,vhostfd=30 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:d4:da:d4,bus=pci.0,addr=0x4 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:6 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

This VM is somewhat production, though, so I need to stop messing with it.
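Duplicate instances like the two above can be spotted by looking for repeated -uuid arguments in the process table; a quick sketch (the ps/grep pipeline is just one of several ways to do this):

```shell
# Report qemu -uuid values that occur more than once in the process list;
# a domain should never have two live qemu processes sharing a UUID.
ps -eo args | grep -o '\-uuid [0-9a-f-]*' | sort | uniq -d
```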

Comment 14 Laine Stump 2014-06-18 13:20:48 UTC
That is *definitely* not good, although something else must be at play here - I don't know of any way that restarting libvirtd would cause a new qemu process to be started (hmm, unless maybe it was marked as "autostart" and libvirtd was for some reason unable to "see" the earlier instance). Can you think of any other operation you performed that might have tried to start "vulcan"?

Comment 15 Laine Stump 2014-06-18 13:58:13 UTC
A few of us have been discussing this most recent problem and are curious about the contents of the files /var/run/libvirt/qemu/vulcan.xml and /etc/libvirt/qemu/vulcan.xml. It's a bit of a fishing expedition, but may yield some clues. Can you attach them? (current contents are okay)

One theory is that something in the status xml (from /var/run/...) was unparseable when libvirtd restarted (possibly during the downgrade); this would cause it to not show up on the list of active domains, and if that domain is set to autostart, a new instance might be started. (is that domain set to autostart?)
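Whether a domain is flagged for autostart can be checked either through virsh or via the symlink libvirt keeps under /etc/libvirt/qemu/autostart; a sketch, with "vulcan" being the guest in question:

```shell
# Either query virsh for the autostart setting...
virsh dominfo vulcan | grep -i '^Autostart'
# ...or look for the autostart symlink libvirt maintains on disk:
test -L /etc/libvirt/qemu/autostart/vulcan.xml && echo "autostart enabled"
```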

Comment 16 Orion Poplawski 2014-06-18 16:27:26 UTC
Created attachment 910076 [details]
/var/run/libvirt/qemu/vulcan.xml

I now have one instance running.

Here's how I ended up with two instances:
- downgraded libvirt from 29.el6_5.9 to 29.el6_5.7
  I see this in the libvirtd.log:
2014-06-17 20:48:51.952+0000: 13725: info : virDomainLoadAllConfigs:14738 : Loading config file 'vulcan.xml'
2014-06-17 20:48:51.954+0000: 13725: error : qemuDomainObjPrivateXMLParse:420 : internal error Unknown qemu capabilities flag host-pci-multidomain

and it looks like it starts another vulcan instance (yes, it is set to autostart):

2014-06-17 20:48:52.159+0000: 13725: info : virSecuritySELinuxSetFileconHelper:794 : Setting SELinux context on '/dev/mapper/vg_data-vulcan--disk0' to 'unconfined_u:object_r:svirt_image_t:s0:c519,c1023'


- Upgraded back to 29.el6_5.9

So that may be a separate downgrade issue.

Comment 17 Orion Poplawski 2014-06-18 16:35:17 UTC
Created attachment 910077 [details]
/etc/libvirt/qemu/vulcan.xml

Comment 18 Laine Stump 2014-06-19 07:53:29 UTC
(In reply to Orion Poplawski from comment #16)
>
> 2014-06-17 20:48:51.954+0000: 13725: error :
> qemuDomainObjPrivateXMLParse:420 : internal error Unknown qemu capabilities
> flag host-pci-multidomain

Ugh. I'll repeat that is *not* good. Explanation: during/after the upgrade, a "new" capability was detected (the capability could have already been present in qemu, but libvirt didn't look for it previously). It turns out that this capability has absolutely no effect on the command line sent to qemu except in the very rare case that you want to pass through a PCI device that is in a PCI domain other than 0 (which is only possible on very large/expensive hardware).

But all detected capabilities are stored in the domain status, not just those that are utilized, so it is added to the list of capabilities flags.

Then when libvirtd is downgraded and the older libvirtd starts, it reads in the status files of all running domains. When it encounters this flag that it doesn't recognize, it pukes on it and doesn't mark the domain as running (but also the normally very liberal "kill domains you can't connect to" policy isn't followed in this case, so the qemu process is left running.)
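The detected capabilities Laine describes are recorded in the per-domain status XML under /var/run/libvirt/qemu/, so the flag the older libvirtd choked on can be listed directly; a sketch against the attached vulcan.xml (assuming the flag elements use the flag name='...' form of that libvirt era):

```shell
# List every qemu capability flag recorded in the domain status XML;
# any flag unknown to an older libvirtd will make its parser fail.
grep -o "flag name='[^']*'" /var/run/libvirt/qemu/vulcan.xml
```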

So yes, definitely this is a separate problem from the one that you originally reported, but just as important. I'm going to clone this BZ into a new BZ, leaving in just the last couple comments, so that we can track the fixes separately.

Comment 19 Laine Stump 2014-06-19 08:19:16 UTC
After a short discussion with jdenemar on IRC, we've concluded the following:

1) downgrading libvirt with running domains isn't officially supported, so we don't need to make it work perfectly (as a matter of fact, it's either very difficult or in some cases impossible).

2) still, we should "fail" in a much more graceful manner.

3) even though this is a separate problem from the "libvirt update kills some guests" that originated this BZ, both problems could be remedied (as much as that is possible) in the same way - by introducing a new "zombie" state for domains, and placing problematic domains in that state rather than a) terminating them or b) ignoring them ((a) led to the original problem, (b) led to this new problem).

So I've decided to *not* clone this BZ, but instead use it to track development of the idea of a zombie state for domains. Any domain that encountered problems while being re-attached at libvirtd restart would be placed in this state, and domains in this state could not be managed by libvirt except to destroy them (and of course since they were not "shutdown", a new instance could not be started). This would at least leave the domain accessible via ssh etc, so that it could be shut down gracefully.

The other part of this would be better reporting of the reason for entering the zombie state. In particular, an audit is needed of all error paths that currently lead to termination of the domain, to assure that all of them log an error (apparently this is not the case, since you've had domains terminated with no corresponding error in the log file).

Comment 21 Laine Stump 2015-01-22 18:53:58 UTC
Orion,

Have you ever encountered this problem again? I was never able to reproduce it.

Also, are you in a position to upgrade any systems to RHEL7 and see if the same issue occurs there? We're wondering whether we should possibly move this BZ to RHEL7, if it needs to stay here, or if it can be closed. (We have some definite ideas on how to make *downgrading* less painful in the face of new features that are no longer recognized after the downgrade, but that would be too invasive / apt to lead to regression to make them candidates for RHEL6).

Thanks!

Comment 22 Orion Poplawski 2015-01-28 20:30:15 UTC
Looks like there were four libvirt updates since 2014-06-17:

Aug 06 03:24:28 Updated: libvirt-0.10.2-29.el6_5.11.x86_64
Sep 03 03:09:30 Updated: libvirt-0.10.2-29.el6_5.12.x86_64
Nov 13 09:28:10 Updated: libvirt-0.10.2-46.el6_6.1.x86_64
Nov 19 03:21:43 Updated: libvirt-0.10.2-46.el6_6.2.x86_64

and I didn't notice any machines going down and I can't find any evidence in my logs.

I have not yet moved any of my VM hosts to EL7.

Comment 23 Laine Stump 2015-01-29 15:39:38 UTC
Moving to RHEL7 to track addition of "broken" state for guests that encounter xml parsing problems after upgrade/downgrade.

Comment 24 Laine Stump 2015-02-06 14:50:44 UTC
I just noticed Bug 1169405, which describes a scenario where guests get killed with SIGTERM (15) when libvirtd is restarted *if they had previously been restarted with "virsh reboot"*. Is there any chance this could have been the cause of your terminated guests?

Comment 25 Orion Poplawski 2015-02-06 18:39:14 UTC
It's certainly possible; I'm not sure how to check for sure.  Thanks for the info.

Comment 28 Laine Stump 2016-05-31 18:31:03 UTC
Since there isn't a clear plan on what to do about this, and no reproducers, I'm closing it.