Bug 996146 - [vdsm] VM migration occasionally fails
Summary: [vdsm] VM migration occasionally fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
unspecified
high
Target Milestone: ---
: 3.3.0
Assignee: Omer Frenkel
QA Contact: Artyom
URL:
Whiteboard: virt
Depends On:
Blocks: 1019461
TreeView+ depends on / blocked
 
Reported: 2013-08-12 14:03 UTC by Tareq Alayan
Modified: 2015-09-22 13:09 UTC (History)
16 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-01-21 16:12:44 UTC
oVirt Team: ---
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
all_logs_engine_vdsm_libvirt_host-Depoly (5.73 MB, application/x-tar)
2013-08-12 14:03 UTC, Tareq Alayan
no flags Details
libvirt logs from both hosts (997.19 KB, application/x-compressed-tar)
2013-08-21 08:19 UTC, sefi litmanovich
no flags Details
libvirt logs from origin host 20.8.13 (16.16 MB, application/x-tar)
2013-08-21 09:06 UTC, sefi litmanovich
no flags Details
libvirt logs from destination host 20.8.13 (1.73 MB, application/x-tar)
2013-08-21 09:08 UTC, sefi litmanovich
no flags Details
new libvirtd log from source host (13.39 MB, text/x-log)
2013-09-02 14:57 UTC, sefi litmanovich
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2014:0040 0 normal SHIPPED_LIVE vdsm bug fix and enhancement update 2014-01-21 20:26:21 UTC

Description Tareq Alayan 2013-08-12 14:03:29 UTC
Created attachment 785697 [details]
all_logs_engine_vdsm_libvirt_host-Depoly

Description of problem:
VM migration fails occasionally in automated tests.

Version-Release number of selected component (if applicable):
is9

How reproducible:
75% of the times

Steps to Reproduce:
1. create vm and install os on it
2. migrate


Actual results:
vm fails to migrate

Expected results:
successful migration to destination host

Additional info:
logs attached. 

libvirt log
============
 2013-08-11 14:28:14.902+0000: 30189: error : virNetTLSContextValidCertificate:991 : Certificate failed validation: The certificate hasn't got a known issuer.
 2013-08-11 14:28:14.902+0000: 30189: warning : virNetTLSContextCheckCertificate:1102 : Certificate check failed Certificate failed validation: The certificate hasn't got a known issuer.
 2013-08-11 14:28:14.902+0000: 30189: error : virNetTLSContextCheckCertificate:1105 : authentication failed: Failed to verify peer's certificate
 2013-08-11 14:28:14.902+0000: 30189: debug : do_open:1180 : driver 2 remote returned ERROR
 2013-08-11 14:28:14.902+0000: 30189: error : doPeer2PeerMigrate:2646 : operation failed: Failed to connect to remote libvirt URI qemu+tls://10.35.160.41/system
 2013-08-11 14:28:14.902+0000: 30189: debug : qemuDomainObjEndAsyncJob:955 : Stopping async job: migration out


certificates are valid:
# openssl verify -verbose -CAfile /etc/pki/vdsm/certs/cacert.pem /etc/pki/vdsm/certs/vdsmcert.pem
 /etc/pki/vdsm/certs/vdsmcert.pem: OK
 ]# openssl verify -verbose -CAfile /etc/pki/vdsm/certs/cacert.pem /etc/pki/libvirt/clientcert.pem
 /etc/pki/libvirt/clientcert.pem: OK 
# openssl x509 -in /etc/pki/vdsm/certs/cacert.pem -fingerprint -noout
SHA1 Fingerprint=E4:BE:D2:C6:6C:03:46:F1:41:51:3B:48:FB:E3:94:95:89:30:32:6E

Comment 1 Michal Skrivanek 2013-08-13 13:17:37 UTC
Tareq to check with Alon about possible setup issues...

Comment 2 Omer Frenkel 2013-08-20 14:09:11 UTC
Hi Jiri,
Certificate looks ok but i see this in libvirtd.log:

2013-08-20 13:05:47.190+0000: 5546: error : virNetTLSContextCheckCertDN:418 : Certificate [session] owner does not match the hostname 10.35.0.186
2013-08-20 13:05:47.190+0000: 5546: warning : virNetTLSContextCheckCertificate:1102 : Certificate check failed Certificate [session] owner does not match the hostname 10.35.0.186
2013-08-20 13:05:47.190+0000: 5546: error : virNetTLSContextCheckCertificate:1105 : authentication failed: Failed to verify peer's certificate
2013-08-20 13:05:47.190+0000: 5546: debug : do_open:1180 : driver 2 remote returned ERROR
2013-08-20 13:05:47.190+0000: 5546: error : doPeer2PeerMigrate:2646 : operation failed: Failed to connect to remote libvirt URI qemu+tls://10.35.0.186/system
2013-08-20 13:05:47.190+0000: 5546: debug : qemuDomainObjEndAsyncJob:955 : Stopping async job: migration out

any idea?

Comment 3 Jiri Denemark 2013-08-20 18:30:37 UTC
Well, according to the error message, the certificate's subject does not match 10.35.0.186. I guess the certificate contains a hostname rather than an IP address. Anyway, it's pasting just a few lines from libvirtd.log does not help much. Could you attach a complete log?

Comment 4 sefi litmanovich 2013-08-21 08:19:22 UTC
Created attachment 788750 [details]
libvirt logs from both hosts

Comment 5 Jiri Denemark 2013-08-21 08:27:11 UTC
Unfortunately, the attached logs start 5 minutes after the messages from comment #2. And they don't contain any new instance of the same error. Please, provide logs with the actual error from comment #2.

Comment 6 sefi litmanovich 2013-08-21 09:06:43 UTC
Created attachment 788769 [details]
libvirt logs from origin host 20.8.13

Comment 7 sefi litmanovich 2013-08-21 09:08:20 UTC
Created attachment 788770 [details]
libvirt logs from destination host 20.8.13

Comment 8 Omer Frenkel 2013-08-27 11:44:02 UTC
Are new logs helpful?

Comment 9 Jiri Denemark 2013-08-29 20:01:19 UTC
Whoops, vdsm doesn't configure libvirtd to log debug messages from tls code which I hoped to see in the provided logs. Could you reproduce the issue after adding "1:virnettls" to log_filters option in /etc/libvirt/libvirtd.conf on the source host and restarting libvirtd service? Another and probably easier check is to look at the actual DN of the target host's certificate and see if it matches 10.35.0.186 :-)

Comment 10 sefi litmanovich 2013-09-02 14:57:35 UTC
Created attachment 792892 [details]
new libvirtd log from source host

Comment 11 Jiri Denemark 2013-09-03 07:31:28 UTC
What errors did you get during migration? I don't see "Certificate [session] owner does not match the hostname" anywhere in the new log file. Likely because you used FQDN rather than an IP address in libvirt URI: qemu+tls://monique-vds05.tlv.redhat.com/system

Comment 12 Tareq Alayan 2013-09-03 10:48:33 UTC
Jiri, you can take a look on the machine where this occur and do your own research, slitmano will help you in the info regarding his environment where this is reproducible... 
bottom line migration doesn't work. 
i don't know what is the certificate issue owner ....

Comment 14 Michal Skrivanek 2013-09-05 13:09:33 UTC
Tareq, Sefi,
so do you still see the same issue? From comment #11 I don't understand if the migration is still failing or not

Comment 15 Tareq Alayan 2013-09-08 07:26:32 UTC
yes

Comment 16 Andrew Cathrow 2013-09-09 12:38:14 UTC
Michal - what's next step on this. I don't see need-info on libvirt team or any next steps

Comment 17 Jiri Denemark 2013-09-10 07:03:09 UTC
BTW, to be able to provide any assistance, we need exact reproducer steps, the error message that is reported when migration fails and debug logs from both sides of migration that match the reproducer steps and the error message. So far it seems steps keep changing, the error message as well and we never got complete logs that would match them.

Currently, I have no data that would suggest this is not just a wrong PKI setup. It would be nice to check a 6.4.z environment with working migration and what happens when just libvirt is upgraded to 6.5 version and everything else is left intact.

Comment 18 Vinzenz Feenstra [evilissimo] 2013-09-10 08:48:06 UTC
I was checking it now and it is a certificate problem. (See the output below) The problem/reason for this is that after the re-deployment during the host install new certificates are created and replaced. However libvirtd was not restarted so it had the previous certificate still loaded.
In the previous certificates (thanks to the backups kept in /etc/pki/libvirt/) there was the FQDN used. (Most likely because the hosts were added to the engine with the FQDN and not with the IP) However this time the IP has been used to add the host and the certificates were issued with the IP instead of the FQDN. Now when the migration started it tried to verify the hostname of the sent certificate with the IP it retrieved instead. 10.35.0.186 does not correspond with monique-vds05.tlv.redhat.com therefore it was correctly rejected.

The solution to this problem was to restart the libvirtd service, after this I was able to successfully migrate to 10.35.0.186.

@Alon Bar Lev: Can you confirm that libvirtd is not restarted during the host deploy if it is running already?


[root@rose07 ~]# openssl s_client -showcerts -connect 10.35.0.186:16514
CONNECTED(00000003)
depth=0 O = scl.lab.tlv.redhat.com, CN = monique-vds05.tlv.redhat.com
verify error:num=20:unable to get local issuer certificate
verify return:1
depth=0 O = scl.lab.tlv.redhat.com, CN = monique-vds05.tlv.redhat.com
verify error:num=27:certificate not trusted
verify return:1
depth=0 O = scl.lab.tlv.redhat.com, CN = monique-vds05.tlv.redhat.com
verify error:num=21:unable to verify the first certificate
verify return:1
---
Certificate chain
 0 s:/O=scl.lab.tlv.redhat.com/CN=monique-vds05.tlv.redhat.com
   i:/C=US/O=scl.lab.tlv.redhat.com/CN=slitmano-rhevm.scl.lab.tlv.redhat.com.43384
-----BEGIN CERTIFICATE-----
MIIEdDCCA1ygAwIBAgICEAQwDQYJKoZIhvcNAQEFBQAwZDELMAkGA1UEBhMCVVMx
HzAdBgNVBAoTFnNjbC5sYWIudGx2LnJlZGhhdC5jb20xNDAyBgNVBAMTK3NsaXRt
YW5vLXJoZXZtLnNjbC5sYWIudGx2LnJlZGhhdC5jb20uNDMzODQwIhcRMTMwODI2
MTYwNTU4KzAwMDAXDTE4MDgyNjE2MDU1OFowSDEfMB0GA1UEChMWc2NsLmxhYi50
bHYucmVkaGF0LmNvbTElMCMGA1UEAxMcbW9uaXF1ZS12ZHMwNS50bHYucmVkaGF0
LmNvbTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAMssMnT36LOJlqC6
HmdIw8MwnatUxhQ/8qqlEuYa92mbQATyvybbpCfHJqkPFwlMIaesfxi4obNiQ+Kd
8VJzHboKAJsuiAXoiF5woKHu7Sts6WjkfR758Gv2sAyvNE3n87cafIcGXgEs7Nt6
kvDCMojscf12q6qZrdSC/c0uh0drjIZOz4otKie8pLbwBa+HERDTmFKLMEP4s/iW
Mn5CdMaNU/qwZvWmYWN+pIo6+GjacUIkroT6Ea6EvsHicoIKkQfE+g6XZVcv5nAj
NeSvI4zekmFIdXgSwvagBljkWkaWG1CENs6H7HZAV11xqZtCYbvEDu0X3/mS8sG9
Vf7qvxUCAwEAAaOCAUYwggFCMB0GA1UdDgQWBBTb1C6Qfw/O/KF3q+VbPTVay5MU
ITBSBggrBgEFBQcBAQRGMEQwQgYIKwYBBQUHMAKGNmh0dHA6Ly9zbGl0bWFuby1y
aGV2bS5zY2wubGFiLnRsdi5yZWRoYXQuY29tOjgwL2NhLmNydDCBjwYDVR0jBIGH
MIGEgBTiQfPA6Gm5mx4vThy+XpUhwzw0YKFopGYwZDELMAkGA1UEBhMCVVMxHzAd
BgNVBAoTFnNjbC5sYWIudGx2LnJlZGhhdC5jb20xNDAyBgNVBAMTK3NsaXRtYW5v
LXJoZXZtLnNjbC5sYWIudGx2LnJlZGhhdC5jb20uNDMzODSCAhAAMAkGA1UdEwQC
MAAwDgYDVR0PAQH/BAQDAgWgMCAGA1UdJQEB/wQWMBQGCCsGAQUFBwMBBggrBgEF
BQcDAjANBgkqhkiG9w0BAQUFAAOCAQEAg74bwlK35P3bba7NMeRnrD3YdgRnf4mO
jlXT8olPTpyfgCjJrVFXfC5fg/HuqN29g3apdjEt7qm2vLG6sPmzodcz/V3DyKY3
W+yU7rkowzGFQ5j5nkYWdPcb6C1sqPiJRlYNITuFlxfoQAzMS9cgJZmOx9tPB02A
/Dc6IR5j7SObbPWQur9DBwtDbsmLplAlqyt+lQJvFBeQU5KxCCzoXFfEHiyLLD0y
iUzVsJNAN9aYDafFYhHwwo0vXIxak4cbzdb+qTyEhZ0lOV+SmNZmhsh0xAVdkblb
Rrp6xGzOq0jveynvjpE6zBwKNLnZeLEti9Bhm4JhxU1QB5udkKJe1g==
-----END CERTIFICATE-----
---
Server certificate
subject=/O=scl.lab.tlv.redhat.com/CN=monique-vds05.tlv.redhat.com
issuer=/C=US/O=scl.lab.tlv.redhat.com/CN=slitmano-rhevm.scl.lab.tlv.redhat.com.43384
---
Acceptable client certificate CA names
/C=US/O=scl.lab.tlv.redhat.com/CN=slitmano-rhevm.scl.lab.tlv.redhat.com.43384
---
SSL handshake has read 1993 bytes and written 322 bytes
---
New, TLSv1/SSLv3, Cipher is DHE-RSA-AES256-SHA
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
SSL-Session:
    Protocol  : TLSv1
    Cipher    : DHE-RSA-AES256-SHA
    Session-ID: BE57E74F922DBA0F2902BCABADA005EA6A583022694C7020C1EBF58D9DD8E5E4
    Session-ID-ctx: 
    Master-Key: 413A64EA81BBA17485E074E7770F453969C6C17DF7CBBC7FE06CA2D7897B08B945E795FADA0DB9C29B0E1313DDFEC199
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1378801519
    Timeout   : 300 (sec)
    Verify return code: 21 (unable to verify the first certificate)
---
read:errno=0

Comment 19 Alon Bar-Lev 2013-09-10 09:02:52 UTC
(In reply to Vinzenz Feenstra [evilissimo] from comment #18)
> @Alon Bar Lev: Can you confirm that libvirtd is not restarted during the
> host deploy if it is running already?

Well, it depends on vdsm libvirt reconfigure process. host-deploy executes:

# /etc/init.d/vdsmd reconfigure

And that code is obsoleted in 3.3 in favor of vdsm-tool, and I can see restart in vdsm::lib/vdsm/tool/libvirt_configure.sh.in:

    /usr/bin/vdsm-tool service-restart libvirtd

Newer host-deploy use the newer vdsm-tool, which will not restart libvirt[1], and I already added explicit stop of libvirt in this mode[2], although [1] has not been merged, [2] will be released at next build.

[1] http://gerrit.ovirt.org/#/c/18780/
[2] http://gerrit.ovirt.org/#/c/18784/

Comment 20 Michal Skrivanek 2013-09-16 06:43:17 UTC
AFAIU this issue is fixed then

Comment 21 Tareq Alayan 2013-09-24 14:07:06 UTC
this bug haven't been verified... 

the fix is included in is16?

Comment 22 Dan Kenigsberg 2013-09-24 14:58:13 UTC
The patches mentioned in comment 19 are in IS16. Please reopen the bug if the issue reoccurs.

Comment 23 Artyom 2013-09-27 07:41:31 UTC
Verified
Rhevm - is16
Host1 - vdsm-4.12.0-156.git6e499d6.el6ev, libvirt-0.10.2-27.el6
Host2 - vdsm-4.12.0-138.gitab256be.el6ev, libvirt-0.10.2-24.el6
Vm1 - OS rhel6.5
Vm2 - OS Windows 7

Comment 24 Charlie 2013-11-28 00:27:36 UTC
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to 
minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag.

Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information:

* Cause: What actions or circumstances cause this bug to present.
* Consequence: What happens when the bug presents.
* Fix: What was done to fix the bug.
* Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore')

Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug.

For further details on the Cause, Consequence, Fix, Result format please refer to:

https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes 

Thanks in advance.

Comment 25 errata-xmlrpc 2014-01-21 16:12:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2014-0040.html


Note You need to log in before you can comment on or make changes to this bug.