Bug 1225425 - [HC] Hosted Engine VM accessed w/ libgfapi destroyed while adding the host: Failed to find the libvirt domain (code=9)
Summary: [HC] Hosted Engine VM accessed w/ libgfapi destroyed while adding the host: ...
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: vdsm
Classification: oVirt
Component: General
Version: ---
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ovirt-4.3.2
Assignee: Denis Chaplygin
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 1022961 1322852
Blocks:
 
Reported: 2015-05-27 10:46 UTC by Sandro Bonazzola
Modified: 2019-04-24 07:48 UTC (History)
14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-18 07:03:52 UTC
oVirt Team: Gluster
rule-engine: ovirt-4.3?
rule-engine: planning_ack+
rule-engine: devel_ack?
rule-engine: testing_ack?


Attachments (Terms of Use)
sos full report (7.75 MB, application/x-xz)
2015-05-27 10:50 UTC, Sandro Bonazzola
sos report (7.23 MB, application/x-xz)
2015-06-08 11:36 UTC, Sandro Bonazzola


Links
System ID Priority Status Summary Last Updated
oVirt gerrit 76362 ovirt-4.1 MERGED errors: Introduce the common.errors module 2017-05-30 07:04:16 UTC
oVirt gerrit 76363 ovirt-4.1 MERGED errors: Use new errors.Base class 2017-05-30 07:04:09 UTC
oVirt gerrit 76365 ovirt-4.1 MERGED storage: add parse_volume_chain method 2017-05-30 07:03:52 UTC
oVirt gerrit 76366 ovirt-4.1 MERGED storage: extracted path_to_vol_id 2017-05-30 07:03:43 UTC
oVirt gerrit 76367 ovirt-4.1 MERGED storage: Added tests for parse_volume_chain covering error handling. 2017-06-07 08:03:18 UTC
oVirt gerrit 76979 ovirt-4.1 MERGED storage: extracted get_snapshot_xml 2017-06-13 14:05:08 UTC
oVirt gerrit 76980 ovirt-4.1 MERGED storage: Added tests for get_snapshot_xml 2017-06-14 08:20:51 UTC
oVirt gerrit 76981 ovirt-4.1 MERGED storage: Replaced strings with constants in get_disk_snapshot_xml 2017-06-14 09:17:15 UTC
oVirt gerrit 77023 ovirt-4.1 MERGED vm: Remove unneeded return value handling 2017-06-14 09:26:20 UTC
oVirt gerrit 77024 ovirt-4.1 MERGED common: Made UnsupportedOperation subclass of a ContextException 2017-06-14 09:26:31 UTC
oVirt gerrit 77025 ovirt-4.1 MERGED storage: Made get_snapshot_xml a member of Drive. 2017-06-14 09:38:32 UTC
oVirt gerrit 77026 ovirt-4.1 MERGED storage: Made get_snapshot_xml use _getSourceXml 2017-06-14 09:46:36 UTC
oVirt gerrit 77332 ovirt-4.1 MERGED storage: Add index support for blockcommit 2017-06-23 16:01:55 UTC

Description Sandro Bonazzola 2015-05-27 10:46:33 UTC
Description of problem:
Hosted Engine VM destroyed while adding the host:  Failed to find the libvirt domain (code=9)

Version-Release number of selected component (if applicable):
# rpm -qa |egrep "(vdsm|gluster|hosted-engine)"|sort
glusterfs-3.7.0-2.el7.x86_64
glusterfs-api-3.7.0-2.el7.x86_64
glusterfs-cli-3.7.0-2.el7.x86_64
glusterfs-client-xlators-3.7.0-2.el7.x86_64
glusterfs-fuse-3.7.0-2.el7.x86_64
glusterfs-geo-replication-3.7.0-2.el7.x86_64
glusterfs-libs-3.7.0-2.el7.x86_64
glusterfs-rdma-3.7.0-2.el7.x86_64
glusterfs-server-3.7.0-2.el7.x86_64
ovirt-hosted-engine-ha-1.3.0-0.0.master.20150424113553.20150424113551.git7c14f4c.el7.noarch
ovirt-hosted-engine-setup-1.3.0-0.0.master.20150518075146.gitdd9741f.el7.noarch
vdsm-4.17.0-860.git92219e2.el7.noarch
vdsm-cli-4.17.0-860.git92219e2.el7.noarch
vdsm-gluster-4.17.0-860.git92219e2.el7.noarch
vdsm-infra-4.17.0-860.git92219e2.el7.noarch
vdsm-jsonrpc-4.17.0-860.git92219e2.el7.noarch
vdsm-python-4.17.0-860.git92219e2.el7.noarch
vdsm-xmlrpc-4.17.0-860.git92219e2.el7.noarch
vdsm-yajsonrpc-4.17.0-860.git92219e2.el7.noarch

How reproducible:
100%


Steps to Reproduce:
1. Deploy Hosted Engine on first host with vdsm with libgfapi support in HC setup.

It worked fine 1 week ago.

Comment 1 Sandro Bonazzola 2015-05-27 10:50:19 UTC
Created attachment 1030473 [details]
sos full report

Comment 3 Michal Skrivanek 2015-06-04 09:46:17 UTC
Thanks Sandro. I investigated the logs and identified that another process (not vdsm) killed the VM while vdsm was restarted during the upgrade; vdsm therefore lost track of that VM, and on startup its recovery flow failed.
I believe this needs to be investigated from the integration point of view, as it may be a services conflict during the HE deployment.

Comment 4 Sandro Bonazzola 2015-06-08 06:39:01 UTC
(In reply to Michal Skrivanek from comment #3)
> thanks sandro, I investigated the logs and identified there was another
> process (not vdsm) which killed the VM during the time when vdsm was
> restarted during upgrade, therefore it lost track of that VM and on vdsm
> startup in recovery flow it failed
> I believe it needs to be investigated from the integration point of view as
> it may be some services conflict during the HE deployment

Thanks Michal. Dan, has something changed in the way vdsm handles child-process termination? It looks like a vdsm restart is now killing qemu, as if it were in the same cgroup and killed by something that may be systemd.

Comment 5 Aharon Canan 2015-06-08 06:41:07 UTC
If you can provide a workaround (once you have one), that would be great.

Comment 6 Sandro Bonazzola 2015-06-08 11:24:31 UTC
Reproduced today.
In /var/log/messages I see:

Jun  8 13:12:21 minidell journal: vdsm vds WARNING Found 1 VMs from recovery files not reported by libvirt. This should not happen! Will try to recover them.


A few lines before that there is:
Jun  8 13:12:18 minidell systemd: Starting Virtualization daemon...
Jun  8 13:12:18 minidell systemd: Failed to reset devices.list on /machine.slice: Invalid argument
Jun  8 13:12:18 minidell journal: libvirt version: 1.2.8, package: 16.el7_1.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-04-02-12:11:36, x86-024.build
.eng.bos.redhat.com)
Jun  8 13:12:18 minidell journal: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_lxc.so not accessible

And also:
Jun  8 13:12:08 minidell yum[23495]: Installed: 10:qemu-kvm-tools-ev-2.1.2-23.el7_1.3.1.x86_64
Jun  8 13:12:07 minidell systemd: Time has been changed

The qemu log says:
2015-06-08 11:12:20.134+0000: shutting down
qemu: terminating on signal 15 from pid 24264

and the process involved is: 
root     24264  0.1  0.1 1123260 22124 ?       Ssl  13:12   0:00 /usr/sbin/libvirtd --listen

I'm not sure why libvirt and /var/log/messages display logs in different timezones.

So, moving back to virt. Not sure why libvirt sent the term signal.
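One way to line the two logs up is to convert qemu's UTC timestamp into the host's local time. A minimal Python sketch; the UTC+2 offset is an assumption for a host running on CEST, and is the offset that makes the two entries above coincide:

```python
from datetime import datetime, timedelta, timezone

# qemu's log line "2015-06-08 11:12:20 ... shutting down" is in UTC, while
# /var/log/messages stamps entries in the host's local timezone.
qemu_utc = datetime(2015, 6, 8, 11, 12, 20, tzinfo=timezone.utc)

# Assumed host offset: CEST (UTC+2); adjust for the actual host timezone.
local = qemu_utc.astimezone(timezone(timedelta(hours=2)))

print(local.strftime("%H:%M:%S"))  # 13:12:20, matching the journal entries
```

With that conversion, the qemu shutdown at 11:12:20 UTC and the journal activity around 13:12:20 local time describe the same moment.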

Comment 7 Sandro Bonazzola 2015-06-08 11:36:19 UTC
Created attachment 1036316 [details]
sos report

Comment 8 Sandro Bonazzola 2015-06-08 11:51:26 UTC
Also note that vdsm was stopped at 13:12:02:
VM Channels Listener::DEBUG::2015-06-08 13:12:02,452::vmchannels::187::vds::(run) VM channels listener thread has ended

and started at 13:12:21
MainThread::INFO::2015-06-08 13:12:21,793::vdsm::153::vds::(run) (PID: 24464) I am the actual vdsm 4.17.0-914.gitdb9c4de.el7 minidell.home (3.10.0-229.el7.x86_64)

one second after libvirt killed qemu, in the middle of supervdsmd initialization.

Comment 9 Dan Kenigsberg 2015-06-08 11:56:44 UTC
(In reply to Sandro Bonazzola from comment #4)

> Dan is something changed in the way vdsm handle child process
> termination?

No. Vdsm never kills qemu processes directly. It issues a destroy() call to libvirt if it finds a vdsm-looking VM that is not expected (e.g. one supposed to be Down). A big

  loose qemu process with id: %s found, killing it.

message should appear in vdsm.log in this case.

BTW, libvirtd.log uses UTC, and vdsm.log uses the system timezone. This can be confusing, I admit.

Comment 10 Michal Skrivanek 2015-06-29 07:53:32 UTC
There is a bunch of stuff being restarted. Libvirt somehow decides to kill the VM; in principle I'm not that surprised... it's not a good idea to upgrade system packages (even though here it was only the harmless qemu-kvm-tools-ev) while KVM is in use. AFAICT there are no warnings in the libvirt log other than

Jun  8 13:12:18 minidell journal: libvirt version: 1.2.8, package: 16.el7_1.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2015-04-02-12:11:36, x86-024.build.eng.bos.redhat.com)
Jun  8 13:12:18 minidell journal: Module /usr/lib64/libvirt/connection-driver/libvirt_driver_lxc.so not accessible
Jun  8 13:12:19 minidell journal: failed to initialize gluster connection to server: 'minidell.home': Errore di input/output

I'm not sure whether that gluster error is related or not ("Errore di input/output" is Italian for "Input/output error"); you may have better luck with libvirt debug logs.

Moving back to integration, it seems to be related to the actual upgrade/install procedure.

Comment 11 Sandro Bonazzola 2015-06-29 11:23:49 UTC
(In reply to Michal Skrivanek from comment #10)
> There is bunch of stuff being restarted. Libvirt somehow decides to kill the
> VM, In principle I'm not that surprised...it's not a good idea to upgrade
> system packages (even though here it was only a harmless qemu-kvm-tools-ev)
> while the KVM is in use. AFAICT there are no warnings in libvirt log other
> than

May I ask where did you see a package upgrade while the VM was running?


> 
> Jun  8 13:12:18 minidell journal: libvirt version: 1.2.8, package:
> 16.el7_1.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>,
> 2015-04-02-12:11:36, x86-024.build.eng.bos.redhat.com)
> Jun  8 13:12:18 minidell journal: Module
> /usr/lib64/libvirt/connection-driver/libvirt_driver_lxc.so not accessible
> Jun  8 13:12:19 minidell journal: failed to initialize gluster connection to
> server: 'minidell.home': Errore di input/output
> 
> I'm not sure if that gluster error is related or not, you may have better
> luck with libvirt debug logs

Can you please specify how to enable libvirt debug logs?


> Moving back to integration, it seems to be related to the actual
> upgrade/install procedure.

Again, I don't see any upgrade happening.

Comment 12 Michal Skrivanek 2015-06-29 12:01:58 UTC
(In reply to Sandro Bonazzola from comment #11)
> (In reply to Michal Skrivanek from comment #10)
> > There is bunch of stuff being restarted. Libvirt somehow decides to kill the
> > VM, In principle I'm not that surprised...it's not a good idea to upgrade
> > system packages (even though here it was only a harmless qemu-kvm-tools-ev)
> > while the KVM is in use. AFAICT there are no warnings in libvirt log other
> > than
> 
> May I ask where did you see a package upgrade while the VM was running?

again, I'm only guessing based on what I see in the logs. I don't know what exactly it was doing and why:

Jun 08 13:12:01 minidell.home systemd[1]: Stopping Virtual Desktop Server Manager...
Jun 08 13:12:05 minidell.home systemd[1]: Stopped Virtual Desktop Server Manager. 
Jun 08 13:12:05 minidell.home systemd[1]: Stopped Auxiliary vdsm service for running helper functions as root.

^^ stopped vdsm

Jun 08 13:12:08 minidell.home yum[23495]: Installed: 10:qemu-kvm-tools-ev-2.1.2-23.el7_1.3.1.x86_64
Jun 08 13:12:07 minidell.home systemd[1]: Time has been changed
Jun 08 13:12:08 minidell.home systemd[1]: Started Dynamic System Tuning Daemon.
Jun 08 13:12:11 minidell.home systemd[1]: Reloading.

^^ yum install/update and a complete reload of config, e.g.:

Jun 08 13:12:11 minidell.home systemd[1]: Stopping IPv4 firewall with iptables...
Jun 08 13:12:11 minidell.home systemd[1]: Stopped GlusterFS, a clustered file-system server.
Jun 08 13:12:11 minidell.home systemd[1]: Starting GlusterFS, a clustered file-system server...
Jun 08 13:12:14 minidell.home systemd[1]: Stopped Virtualization daemon.
Jun 08 13:12:14 minidell.home systemd[1]: Starting Virtualization daemon.

^^ libvirt is restarted too; there are a few errors on start, e.g.

Jun 08 13:12:19 minidell.home libvirtd[24264]: failed to initialize gluster connection to server: 'minidell.home': Errore di input/output

which I don't know whether it is relevant or not, but within a few seconds the HE VM is gracefully terminated by libvirt.
Other services are in the process of restarting during that time (vdsm as well; however, I don't see any indication that the VM shutdown is related, as on vdsm recovery the VM was already gone).

> > Jun  8 13:12:18 minidell journal: libvirt version: 1.2.8, package:
> > 16.el7_1.3 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>,
> > 2015-04-02-12:11:36, x86-024.build.eng.bos.redhat.com)
> > Jun  8 13:12:18 minidell journal: Module
> > /usr/lib64/libvirt/connection-driver/libvirt_driver_lxc.so not accessible
> > Jun  8 13:12:19 minidell journal: failed to initialize gluster connection to
> > server: 'minidell.home': Errore di input/output
> > 
> > I'm not sure if that gluster error is related or not, you may have better
> > luck with libvirt debug logs
> 
> Can you please specify how to enable libvirt debug logs?

see log_outputs in /etc/libvirt/libvirtd.conf.
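For reference, a minimal debug setup in /etc/libvirt/libvirtd.conf might look like the following. These values are an illustrative sketch; the available filters and levels depend on the libvirt version, so check the libvirt logging documentation:

```
# 1 = debug, 2 = info, 3 = warning, 4 = error
log_level = 1
# restrict verbose output to the qemu driver to keep the log manageable
log_filters = "1:qemu"
log_outputs = "1:file:/var/log/libvirt/libvirtd.log"
```

After editing the file, restart libvirtd for the settings to take effect.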

> > Moving back to integration, it seems to be related to the actual
> > upgrade/install procedure.
> 
> Again, I don't see any upgrade happening.

The system activity as documented by systemd's journal indicates otherwise.

Comment 13 Red Hat Bugzilla Rules Engine 2015-09-22 07:43:39 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 14 Sandro Bonazzola 2016-05-02 10:06:36 UTC
Moving from 4.0 alpha to 4.0 beta since 4.0 alpha has been already released and bug is not ON_QA.

Comment 15 Yaniv Lavi 2016-05-23 13:21:32 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 16 Yaniv Lavi 2016-05-23 13:24:17 UTC
oVirt 4.0 beta has been released, moving to RC milestone.

Comment 19 Sahina Bose 2016-11-30 10:22:32 UTC
This bug is dependent on 1022961

Comment 20 Sahina Bose 2016-12-22 06:27:23 UTC
Moving this to 4.2 as libgfapi access is not available in 4.1

Comment 21 Giuseppe Ragusa 2017-02-17 14:24:43 UTC
Maybe this should be moved to target milestone ovirt-4.1.3 in light of the recent related change on BZ #1022961.

Comment 22 Gianluca Cecchi 2017-07-04 10:27:37 UTC
What about the considerations in comment #21? Can we move the target milestone to any 4.1.x?
Are there any other limitations preventing a 4.1 target?
Thanks,
Gianluca

Comment 23 Yaniv Kaul 2017-07-04 10:39:21 UTC
(In reply to Gianluca Cecchi from comment #22)
> What about comment#21 considerations? Can we move target milestone to any
> 4.1.x?
> Are there any other limitations not to target 4.1?
> Thanks,
> Gianluca

We'll need to add more code (see https://gerrit.ovirt.org/#/q/status:open+project:vdsm+branch:master+topic:libgfapi ) before we can tell whether a backport to 4.1.x is possible.

Comment 24 Sandro Bonazzola 2019-01-28 09:43:53 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 25 Sahina Bose 2019-03-18 07:03:52 UTC
Hosted engine deployment with gfapi support is now a lower priority. Closing this now, please re-open with justification and we can revisit priority.

