Created attachment 785697 [details] all_logs_engine_vdsm_libvirt_host-Depoly Description of problem: VM migration fails occasionally in automated tests. Version-Release number of selected component (if applicable): is9 How reproducible: 75% of the times Steps to Reproduce: 1. create vm and install os on it 2. migrate Actual results: vm fails to migrate Expected results: successful migration to destination host Additional info: logs attached. libvirt log ============ 2013-08-11 14:28:14.902+0000: 30189: error : virNetTLSContextValidCertificate:991 : Certificate failed validation: The certificate hasn't got a known issuer. 2013-08-11 14:28:14.902+0000: 30189: warning : virNetTLSContextCheckCertificate:1102 : Certificate check failed Certificate failed validation: The certificate hasn't got a known issuer. 2013-08-11 14:28:14.902+0000: 30189: error : virNetTLSContextCheckCertificate:1105 : authentication failed: Failed to verify peer's certificate 2013-08-11 14:28:14.902+0000: 30189: debug : do_open:1180 : driver 2 remote returned ERROR 2013-08-11 14:28:14.902+0000: 30189: error : doPeer2PeerMigrate:2646 : operation failed: Failed to connect to remote libvirt URI qemu+tls://10.35.160.41/system 2013-08-11 14:28:14.902+0000: 30189: debug : qemuDomainObjEndAsyncJob:955 : Stopping async job: migration out certificates are valid: # openssl verify -verbose -CAfile /etc/pki/vdsm/certs/cacert.pem /etc/pki/vdsm/certs/vdsmcert.pem /etc/pki/vdsm/certs/vdsmcert.pem: OK ]# openssl verify -verbose -CAfile /etc/pki/vdsm/certs/cacert.pem /etc/pki/libvirt/clientcert.pem /etc/pki/libvirt/clientcert.pem: OK # openssl x509 -in /etc/pki/vdsm/certs/cacert.pem -fingerprint -noout SHA1 Fingerprint=E4:BE:D2:C6:6C:03:46:F1:41:51:3B:48:FB:E3:94:95:89:30:32:6E
Tareq to check with Alon about possible setup issues...
Hi Jiri, Certificate looks ok but i see this in libvirtd.log: 2013-08-20 13:05:47.190+0000: 5546: error : virNetTLSContextCheckCertDN:418 : Certificate [session] owner does not match the hostname 10.35.0.186 2013-08-20 13:05:47.190+0000: 5546: warning : virNetTLSContextCheckCertificate:1102 : Certificate check failed Certificate [session] owner does not match the hostname 10.35.0.186 2013-08-20 13:05:47.190+0000: 5546: error : virNetTLSContextCheckCertificate:1105 : authentication failed: Failed to verify peer's certificate 2013-08-20 13:05:47.190+0000: 5546: debug : do_open:1180 : driver 2 remote returned ERROR 2013-08-20 13:05:47.190+0000: 5546: error : doPeer2PeerMigrate:2646 : operation failed: Failed to connect to remote libvirt URI qemu+tls://10.35.0.186/system 2013-08-20 13:05:47.190+0000: 5546: debug : qemuDomainObjEndAsyncJob:955 : Stopping async job: migration out any idea?
Well, according to the error message, the certificate's subject does not match 10.35.0.186. I guess the certificate contains a hostname rather than an IP address. Anyway, it's pasting just a few lines from libvirtd.log does not help much. Could you attach a complete log?
Created attachment 788750 [details] libvirt logs from both hosts
Unfortunately, the attached logs start 5 minutes after the messages from comment #2. And they don't contain any new instance of the same error. Please, provide logs with the actual error from comment #2.
Created attachment 788769 [details] libvirt logs from origin host 20.8.13
Created attachment 788770 [details] libvirt logs from destination host 20.8.13
Are new logs helpful?
Whoops, vdsm doesn't configure libvirtd to log debug messages from tls code which I hoped to see in the provided logs. Could you reproduce the issue after adding "1:virnettls" to log_filters option in /etc/libvirt/libvirtd.conf on the source host and restarting libvirtd service? Another and probably easier check is to look at the actual DN of the target host's certificate and see if it matches 10.35.0.186 :-)
Created attachment 792892 [details] new libvirtd log from source host
What errors did you get during migration? I don't see "Certificate [session] owner does not match the hostname" anywhere in the new log file. Likely because you used FQDN rather than an IP address in libvirt URI: qemu+tls://monique-vds05.tlv.redhat.com/system
Jiri, you can take a look on the machine where this occur and do your own research, slitmano will help you in the info regarding his environment where this is reproducible... bottom line migration doesn't work. i don't know what is the certificate issue owner ....
Tareq, Sefi, so do you still see the same issue? From comment #11 I don't understand if the migration is still failing or not
yes
Michal - what's next step on this. I don't see need-info on libvirt team or any next steps
BTW, to be able to provide any assistance, we need exact reproducer steps, the error message that is reported when migration fails and debug logs from both sides of migration that match the reproducer steps and the error message. So far it seems steps keep changing, the error message as well and we never got complete logs that would match them. Currently, I have no data that would suggest this is not just a wrong PKI setup. It would be nice to check a 6.4.z environment with working migration and what happens when just libvirt is upgraded to 6.5 version and everything else is left intact.
I was checking it now and it is a certificate problem. (See the output below) The problem/reason for this is that after the re-deployment during the host install new certificates are created and replaced. However libvirtd was not restarted so it had the previous certificate still loaded. In the previous certificates (thanks to the backups kept in /etc/pki/libvirt/) there was the FQDN used. (Most likely because the hosts were added to the engine with the FQDN and not with the IP) However this time the IP has been used to add the host and the certificates were issued with the IP instead of the FQDN. Now when the migration started it tried to verify the hostname of the sent certificate with the IP it retrieved instead. 10.35.0.186 does not correspond with monique-vds05.tlv.redhat.com therefore it was correctly rejected. The solution to this problem was to restart the libvirtd service, after this I was able to successfully migrate to 10.35.0.186. @Alon Bar Lev: Can you confirm that libvirtd is not restarted during the host deploy if it is running already? [root@rose07 ~]# openssl s_client -showcerts -connect 10.35.0.186:16514 CONNECTED(00000003) depth=0 O = scl.lab.tlv.redhat.com, CN = monique-vds05.tlv.redhat.com verify error:num=20:unable to get local issuer certificate verify return:1 depth=0 O = scl.lab.tlv.redhat.com, CN = monique-vds05.tlv.redhat.com verify error:num=27:certificate not trusted verify return:1 depth=0 O = scl.lab.tlv.redhat.com, CN = monique-vds05.tlv.redhat.com verify error:num=21:unable to verify the first certificate verify return:1 --- Certificate chain 0 s:/O=scl.lab.tlv.redhat.com/CN=monique-vds05.tlv.redhat.com i:/C=US/O=scl.lab.tlv.redhat.com/CN=slitmano-rhevm.scl.lab.tlv.redhat.com.43384 -----BEGIN CERTIFICATE----- MIIEdDCCA1ygAwIBAgICEAQwDQYJKoZIhvcNAQEFBQAwZDELMAkGA1UEBhMCVVMx HzAdBgNVBAoTFnNjbC5sYWIudGx2LnJlZGhhdC5jb20xNDAyBgNVBAMTK3NsaXRt YW5vLXJoZXZtLnNjbC5sYWIudGx2LnJlZGhhdC5jb20uNDMzODQwIhcRMTMwODI2 MTYwNTU4KzAwMDAXDTE4MDgyNjE2MDU1OFowSDEfMB0GA1UEChMWc2NsLmxhYi50 bHYucmVkaGF0LmNvbTElMCMGA1UEAxMcbW9uaXF1ZS12ZHMwNS50bHYucmVkaGF0 LmNvbTCCASIwDQYJKoZIhvcNAQEBBQADggEPADCCAQoCggEBAMssMnT36LOJlqC6 HmdIw8MwnatUxhQ/8qqlEuYa92mbQATyvybbpCfHJqkPFwlMIaesfxi4obNiQ+Kd 8VJzHboKAJsuiAXoiF5woKHu7Sts6WjkfR758Gv2sAyvNE3n87cafIcGXgEs7Nt6 kvDCMojscf12q6qZrdSC/c0uh0drjIZOz4otKie8pLbwBa+HERDTmFKLMEP4s/iW Mn5CdMaNU/qwZvWmYWN+pIo6+GjacUIkroT6Ea6EvsHicoIKkQfE+g6XZVcv5nAj NeSvI4zekmFIdXgSwvagBljkWkaWG1CENs6H7HZAV11xqZtCYbvEDu0X3/mS8sG9 Vf7qvxUCAwEAAaOCAUYwggFCMB0GA1UdDgQWBBTb1C6Qfw/O/KF3q+VbPTVay5MU ITBSBggrBgEFBQcBAQRGMEQwQgYIKwYBBQUHMAKGNmh0dHA6Ly9zbGl0bWFuby1y aGV2bS5zY2wubGFiLnRsdi5yZWRoYXQuY29tOjgwL2NhLmNydDCBjwYDVR0jBIGH MIGEgBTiQfPA6Gm5mx4vThy+XpUhwzw0YKFopGYwZDELMAkGA1UEBhMCVVMxHzAd BgNVBAoTFnNjbC5sYWIudGx2LnJlZGhhdC5jb20xNDAyBgNVBAMTK3NsaXRtYW5v LXJoZXZtLnNjbC5sYWIudGx2LnJlZGhhdC5jb20uNDMzODSCAhAAMAkGA1UdEwQC MAAwDgYDVR0PAQH/BAQDAgWgMCAGA1UdJQEB/wQWMBQGCCsGAQUFBwMBBggrBgEF BQcDAjANBgkqhkiG9w0BAQUFAAOCAQEAg74bwlK35P3bba7NMeRnrD3YdgRnf4mO jlXT8olPTpyfgCjJrVFXfC5fg/HuqN29g3apdjEt7qm2vLG6sPmzodcz/V3DyKY3 W+yU7rkowzGFQ5j5nkYWdPcb6C1sqPiJRlYNITuFlxfoQAzMS9cgJZmOx9tPB02A /Dc6IR5j7SObbPWQur9DBwtDbsmLplAlqyt+lQJvFBeQU5KxCCzoXFfEHiyLLD0y iUzVsJNAN9aYDafFYhHwwo0vXIxak4cbzdb+qTyEhZ0lOV+SmNZmhsh0xAVdkblb Rrp6xGzOq0jveynvjpE6zBwKNLnZeLEti9Bhm4JhxU1QB5udkKJe1g== -----END CERTIFICATE----- --- Server certificate subject=/O=scl.lab.tlv.redhat.com/CN=monique-vds05.tlv.redhat.com issuer=/C=US/O=scl.lab.tlv.redhat.com/CN=slitmano-rhevm.scl.lab.tlv.redhat.com.43384 --- Acceptable client certificate CA names /C=US/O=scl.lab.tlv.redhat.com/CN=slitmano-rhevm.scl.lab.tlv.redhat.com.43384 --- SSL handshake has read 1993 bytes and written 322 bytes --- New, TLSv1/SSLv3, Cipher is DHE-RSA-AES256-SHA Server public key is 2048 bit Secure Renegotiation IS supported Compression: NONE Expansion: NONE SSL-Session: Protocol : TLSv1 Cipher : DHE-RSA-AES256-SHA Session-ID: BE57E74F922DBA0F2902BCABADA005EA6A583022694C7020C1EBF58D9DD8E5E4 Session-ID-ctx: Master-Key: 413A64EA81BBA17485E074E7770F453969C6C17DF7CBBC7FE06CA2D7897B08B945E795FADA0DB9C29B0E1313DDFEC199 Key-Arg : None Krb5 Principal: None PSK identity: None PSK identity hint: None Start Time: 1378801519 Timeout : 300 (sec) Verify return code: 21 (unable to verify the first certificate) --- read:errno=0
(In reply to Vinzenz Feenstra [evilissimo] from comment #18) > @Alon Bar Lev: Can you confirm that libvirtd is not restarted during the > host deploy if it is running already? Well, it depends on vdsm libvirt reconfigure process. host-deploy executes: # /etc/init.d/vdsmd reconfigure And that code is obsoleted in 3.3 in favor of vdsm-tool, and I can see restart in vdsm::lib/vdsm/tool/libvirt_configure.sh.in: /usr/bin/vdsm-tool service-restart libvirtd Newer host-deploy use the newer vdsm-tool, which will not restart libvirt[1], and I already added explicit stop of libvirt in this mode[2], although [1] has not been merged, [2] will be released at next build. [1] http://gerrit.ovirt.org/#/c/18780/ [2] http://gerrit.ovirt.org/#/c/18784/
AFAIU this issue is fixed then
this bug haven't been verified... the fix is included in is16?
The patches mentioned in comment 19 are in IS16. Please reopen the bug if the issue reoccurs.
Verified Rhevm - is16 Host1 - vdsm-4.12.0-156.git6e499d6.el6ev, libvirt-0.10.2-27.el6 Host2 - vdsm-4.12.0-138.gitab256be.el6ev, libvirt-0.10.2-24.el6 Vm1 - OS rhel6.5 Vm2 - OS Windows 7
This bug is currently attached to errata RHBA-2013:15291. If this change is not to be documented in the text for this errata please either remove it from the errata, set the requires_doc_text flag to minus (-), or leave a "Doc Text" value of "--no tech note required" if you do not have permission to alter the flag. Otherwise to aid in the development of relevant and accurate release documentation, please fill out the "Doc Text" field above with these four (4) pieces of information: * Cause: What actions or circumstances cause this bug to present. * Consequence: What happens when the bug presents. * Fix: What was done to fix the bug. * Result: What now happens when the actions or circumstances above occur. (NB: this is not the same as 'the bug doesn't present anymore') Once filled out, please set the "Doc Type" field to the appropriate value for the type of change made and submit your edits to the bug. For further details on the Cause, Consequence, Fix, Result format please refer to: https://bugzilla.redhat.com/page.cgi?id=fields.html#cf_release_notes Thanks in advance.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2014-0040.html