Bug 871616

Summary: Guest agent information is missing after a few VM migrations
Product: Red Hat Enterprise Virtualization Manager
Reporter: Oded Ramraz <oramraz>
Component: vdsm
Assignee: Vinzenz Feenstra [evilissimo] <vfeenstr>
Status: CLOSED ERRATA
QA Contact: Jiri Belka <jbelka>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.1.0
CC: abaron, acathrow, bazulay, danken, dcaroest, dyasny, hateya, iheim, italkohe, lpeer, michal.skrivanek, mkenneth, mpavlik, pep, pstehlik, rhev-integ, sgrinber, sputhenp, vipatel, ykaul, zdover
Target Milestone: ---
Keywords: Regression
Target Release: 3.2.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: virt
Fixed In Version: vdsm-4.10.2-16.0.el6ev
Doc Type: Bug Fix
Doc Text:
Previously, guest agent information vanished after virtual machines were migrated several times. This was because the virtual machine channel listener did not handle errors. If an error occurred, VDSM did not try to reconnect, and the connection to the guest was lost for the lifetime of the guest or until VDSM was restarted. A patch to VDSM introduces a mechanism to reconnect to the channel. When an error occurs, the setup callback is called, which gives the handled client a chance to recreate the socket and prepare it for a connect. After that callback is called, the erroneous connection is moved into the unconnected items dict, where it is handled by the event loop. After 5 or more unsuccessful attempts, the reconnect rate is slowed down to the interval specified for the 'read timeout'. Items that are slowed down are moved into the 'reconnect_cooldown' dict. With this patch applied, guest agent information no longer vanishes after several virtual machine migrations.
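The reconnect bookkeeping described in the Doc Text can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the VDSM patch itself: the class name, method names, callbacks and the injectable clock are hypothetical, and only the 'reconnect_cooldown' dict name, the threshold of 5 attempts and the use of the read timeout come from the text above. No real sockets are involved; the connect callback simply returns True or False.

```python
import time


class ChannelListener:
    """Toy model of the reconnect logic from the Doc Text (not VDSM code).

    Every name is hypothetical except 'reconnect_cooldown'.
    """

    MAX_FAST_ATTEMPTS = 5  # "5 or more unsuccessful attempts"

    def __init__(self, read_timeout, clock=time.monotonic):
        self.read_timeout = read_timeout
        self.clock = clock
        self.unconnected = {}         # key -> channel waiting for a connect
        self.connected = {}           # key -> channel currently in use
        self.reconnect_cooldown = {}  # key -> channel being slowed down

    def register(self, key, setup_cb, connect_cb):
        self.unconnected[key] = {
            'setup': setup_cb, 'connect': connect_cb,
            'attempts': 0, 'cooldown_since': None}

    def handle_error(self, key):
        """On error, let the client recreate its socket, then queue a retry."""
        chan = self.connected.pop(key)
        chan['setup']()
        self.unconnected[key] = chan

    def _try_connect(self, key, chan):
        if chan['connect']():
            chan['attempts'] = 0
            self.connected[key] = chan
            return
        chan['attempts'] += 1
        if chan['attempts'] >= self.MAX_FAST_ATTEMPTS:
            # Too many failures: retry only once per read timeout.
            chan['cooldown_since'] = self.clock()
            self.reconnect_cooldown[key] = chan
        else:
            self.unconnected[key] = chan

    def tick(self):
        """One pass of the event loop over channels that are not connected."""
        now = self.clock()
        for key, chan in list(self.reconnect_cooldown.items()):
            if now - chan['cooldown_since'] >= self.read_timeout:
                del self.reconnect_cooldown[key]
                self.unconnected[key] = chan
        for key, chan in list(self.unconnected.items()):
            del self.unconnected[key]
            self._try_connect(key, chan)
```

In this model a channel that keeps failing is retried on every loop pass for its first five attempts and afterwards only once per read timeout, so a guest that is unreachable for a long time does not keep the loop busy, yet the channel is never abandoned.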
Story Points: ---
Clone Of:
: 947888
Environment:
Last Closed: 2013-06-10 20:32:37 UTC
Type: Bug
Bug Depends On:    
Bug Blocks: 947888    
Attachments:
Description | Flags
vdsm and engine logs | none
Guest Agent log | none
logs from local run | none

Description Oded Ramraz 2012-10-30 21:08:37 UTC
Description of problem:

After creating 5 VMs and installing RHEL 6.3 with the guest agent on them, I migrated the VMs between 2 hosts a few times (using automated scripts).
After a few migrations, guest agent info such as the VM's IP was missing in both the UI and the API.

Comment 1 Oded Ramraz 2012-10-30 21:18:53 UTC
Created attachment 635804 [details]
vdsm and engine logs

Comment 2 Barak 2012-11-01 10:10:18 UTC
Oded a few questions:

1 - did the info exist through vdsm-cli ?
2 - can we get the agent logs (debug mode) as well ?

Comment 3 Oded Ramraz 2012-11-13 20:24:18 UTC
(In reply to comment #2)
> Oded a few questions:
> 
> 1 - did the info exist through vdsm-cli ?

The info is visible via vdsmCli after the guest installation process, but it disappears after a few migrations (or hibernate/resume VM operations, since the test performs both actions).
We are able to reproduce this issue easily on a few environments.

> 2 - can we get the agent logs (debug mode) as well ?

Yes, we'll attach those logs soon (hopefully tomorrow).

Comment 5 Barak Dagan 2012-11-15 17:26:37 UTC
Created attachment 645773 [details]
Guest Agent log

Comment 6 Barak Dagan 2012-11-15 17:28:16 UTC
(In reply to comment #2)
> Oded a few questions:
> 
> 1 - did the info exist through vdsm-cli ?
> 2 - can we get the agent logs (debug mode) as well ?

It seems that the IP doesn't return after the VM suspend action

Comment 7 Andrew Cathrow 2012-12-03 11:52:58 UTC
Could the problem be with virtio-serial not working after suspend - we can test to see if shutdown command works, etc.

Also is this after suspend only or is it really caused by migration (impacts severity)

Comment 8 Barak Dagan 2012-12-16 18:27:00 UTC
(In reply to comment #7)
> Could the problem be with virtio-serial not working after suspend - we can
> test to see if shutdown command works, etc.
> 
> Also is this after suspend only or is it really caused by migration (impacts
> severity)

It seems that restarting VDSM solves that issue. The agent can't be uninstalled or reinstalled, probably because virtio-serial is not working; restarting VDSM solves both issues. As for the other questions, the sequence does not seem to be simple. When I manage to find it, I'll give the answers.

Comment 9 Vinzenz Feenstra [evilissimo] 2012-12-17 14:01:26 UTC
I have noticed two things in the logs:

1. The VDSM Host 2 logs are full of SSL certificate validation errors, which leads me to conclude that the Host/Engine setup is misconfigured.
2. The VDSM Host 1 logs seem to show some libvirt connection issues. I am not really sure what they are; however, with RHEVM 3.1 just out the door, I would like to see this reproduced with the RHEVM 3.1 release version.

Also, this issue is not related to the guest agent; it must be somewhere in VDSM, libvirt, qemu or the drivers.

Please try to reproduce this with an appropriately configured setup and please provide fresh logs from the RHEVM 3.1 environment.

Thanks.

Comment 10 Barak Dagan 2012-12-18 12:17:58 UTC
(In reply to comment #7)
> Could the problem be with virtio-serial not working after suspend - we can
> test to see if shutdown command works, etc.
> 
> Also is this after suspend only or is it really caused by migration (impacts
> severity)

the shutdown command is working though

Comment 11 Barak Dagan 2012-12-26 19:02:42 UTC
This happens after a few migrations. I think that I can reproduce it by migrating the VMs between the two hosts using the SDK.

Which logs do you need? Engine and 2 VDSMs?

Comment 12 Vinzenz Feenstra [evilissimo] 2013-01-03 06:56:46 UTC
I personally need only VDSM logs, however if there's a problem somewhere else engine logs might be needed as well.
Therefore add both of them. Thanks.

Comment 13 Vinzenz Feenstra [evilissimo] 2013-01-03 11:56:33 UTC
(In reply to comment #10)
> (In reply to comment #7)
> > Could the problem be with virtio-serial not working after suspend - we can
> > test to see if shutdown command works, etc.
> > 
> > Also is this after suspend only or is it really caused by migration (impacts
> > severity)
> 
> the shutdown command is working though

Well the question is how the shutdown performed in the end. It won't tell anything if the GA version timed out and the ACPI shutdown kicked in.

Comment 14 Barak Dagan 2013-01-03 13:14:21 UTC
(In reply to comment #13)
> (In reply to comment #10)
> > (In reply to comment #7)
> > > Could the problem be with virtio-serial not working after suspend - we can
> > > test to see if shutdown command works, etc.
> > > 
> > > Also is this after suspend only or is it really caused by migration (impacts
> > > severity)
> > 
> > the shutdown command is working though
> 
> Well the question is how the shutdown performed in the end. It won't tell
> anything if the GA version timed out and the ACPI shutdown kicked in.

The shutdown performed smoothly. It seems to be a virtio issue, but I'll let you decide once I manage to get the logs.

Comment 15 Vinzenz Feenstra [evilissimo] 2013-01-04 09:34:02 UTC
Please attach the log files as soon as you have them. Thanks.

Comment 16 Barak Dagan 2013-01-07 13:46:35 UTC
Created attachment 674021 [details]
logs from local run

Comment 31 Frantisek Kobzik 2013-04-11 08:37:57 UTC
*** Bug 870447 has been marked as a duplicate of this bug. ***

Comment 35 Jiri Belka 2013-04-30 10:59:39 UTC
ok, vdsm-4.10.2-16.0.el6ev.x86_64.

Comment 38 errata-xmlrpc 2013-06-10 20:32:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-0886.html