Bug 810272

Summary: Live migration not working (connection refused)
Product: [Retired] oVirt
Reporter: marcik4
Component: ovirt-node
Assignee: Mike Burns <mburns>
Status: CLOSED CURRENTRELEASE
QA Contact:
Severity: high
Docs Contact:
Priority: high
Version: unspecified
CC: abaron, acathrow, bazulay, danken, dyasny, fdeutsch, iheim, jboggs, jwyatt, mburns, mgoldboi, mishu, mivaho, ovirt-bugs, ovirt-maint, ykaul
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version: 2.4.0
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2012-06-14 13:35:38 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
VDSM logs of both nodes the fail to migrate towards each other. none

Description marcik4 2012-04-05 14:00:04 UTC
Description of problem:

Live migration not working between 2 latest nodes.

Version-Release number of selected component (if applicable):

2.3.0-1.0.fc16.iso
  
Actual results:

2012-04-05 13:33:01.295+0000: 1753: error : virNetSocketNewConnectTCP:432 : unable to connect to server at 'host:16514': Connection refused
2012-04-05 13:33:01.295+0000: 1753: debug : do_open:1078 : driver 8 remote returned ERROR
2012-04-05 13:33:01.295+0000: 1753: error : doPeer2PeerMigrate:2129 : operation failed: Failed to connect to remote libvirt URI qemu+tls://host/system

Migration fails.

Expected results:

Migration not to fail :-)

Additional info:

The problem is in the libvirt config. After changing listen_tls from 0 to 1 in /etc/libvirt/libvirtd.conf, migration started to work.

Comment 1 Andrew Cathrow 2012-04-05 14:51:58 UTC
Can you confirm that the iptables rules are correct and that the two hosts can resolve each other's names?

Comment 2 marcik4 2012-04-05 15:01:35 UTC
Yes - the iptables rules allow all traffic and the hosts can resolve each other.

During migration, the source host tries to connect to the destination libvirtd on port 16514 (it connects to "qemu+tls://host/system", i.e. "host:16514", as in the log). But with the default config, libvirtd listens only on port 16509.
Enabling TLS in the config resulted in libvirtd listening with TLS on port 16514, and migration started to work.
I've also tried redirecting incoming traffic from port 16514 to 16509, but that ended with TLS errors, so I suspect that libvirtd on port 16509 doesn't accept TLS connections.

So as I see it, to fix this either TLS has to be enabled in the libvirtd config (as I did myself) or the migration process has to be changed not to use TLS.
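
The port check described in this comment can be sketched in Python. This is only an illustration, not something vdsm or libvirt ships: the helper names target_port and port_open are made up here, and the only facts taken from the thread and the libvirt defaults are that qemu+tls:// goes to port 16514 and plain TCP goes to 16509.

```python
import socket
from urllib.parse import urlparse

# libvirtd defaults: unencrypted TCP on 16509, TLS on 16514.
DEFAULT_PORTS = {"qemu+tcp": 16509, "qemu+tls": 16514}


def target_port(uri: str) -> int:
    """Return the TCP port a libvirt remote URI would connect to.

    Hypothetical helper: only the two transports above are handled.
    """
    parsed = urlparse(uri)
    if parsed.port is not None:
        return parsed.port  # an explicit port in the URI wins
    return DEFAULT_PORTS[parsed.scheme]


def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if host:port accepts a TCP connection within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts and unresolvable names.
        return False
```

Run from the source host, port_open("host", target_port("qemu+tls://host/system")) returning False would match the "Connection refused" in the log above. It only shows that nothing accepts connections on that port; it says nothing about whether the TLS handshake would succeed.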

Comment 3 Fabian Deutsch 2012-05-15 09:30:41 UTC
I suppose that the two nodes were managed by oVirt Engine, is this correct?

Comment 4 marcik4 2012-05-17 09:26:07 UTC
Yes. All on fresh installs.

Comment 5 Fabian Deutsch 2012-05-18 07:33:47 UTC
Okay, then I'm reassigning this to the vdsm component, as vdsm is responsible for this file after Node is registered to an Engine.

Comment 6 Haim 2012-05-18 14:17:21 UTC
(In reply to comment #5)
> Okay, then I'm reassigning this to the vdsm component, as vdsm is responsible
> for this file after Node is registered to an Engine.

Indeed. Could you please try the following:

- We would like to see the contents of the following files before (current state) and after running the command below:
  * /etc/vdsm/vdsm.conf
  * /etc/libvirt/libvirt.conf
  * /etc/libvirt/qemu.conf
- The command to run: service vdsmd reconfigure

** Please revert your changes, of course.

Also, what version of vdsm are you working with?

Comment 7 Dan Kenigsberg 2012-05-18 18:54:14 UTC
Could you share your /etc/libvirt/libvirtd.conf at the source and at the destination hosts? They should have had a line showing
  listen_tls=1 # by vdsm

If not, does running

  /lib/systemd/systemd-vdsmd

fix your config?

Comment 8 Mike Burns 2012-05-23 22:02:50 UTC
*** Bug 824605 has been marked as a duplicate of this bug. ***

Comment 9 marcik4 2012-05-24 11:15:33 UTC
Sorry, I don't have the testing environment anymore.

But as all my testing was done on clean installs (latest node, Fedora 16 and latest packages installed according to the guide on the website; I even reinstalled everything, just to be sure), it should be very easy to reproduce in a lab.

I can assemble an environment for testing again, but it could take a few weeks.

Comment 10 Michel van Horssen 2012-05-24 15:12:11 UTC
Created attachment 586665 [details]
VDSM logs of both nodes the fail to migrate towards each other.

Comment 11 Michel van Horssen 2012-05-24 15:13:17 UTC
I've reported this problem to the vdsm and node list a few times. Just now found this bug report :)

In my /etc/libvirt/libvirtd.conf on the node I'm migrating away from the line says:

listen_tls = 0

On the node I'm migrating towards it also says:

listen_tls = 0

Or do you want me to attach the files?


I've attached my vdsm logs from both nodes at the moment of migration so you can see whether the problem is the same as the one mentioned by marcik4. See the entry above this one.

Comment 12 Jacob Wyatt 2012-05-24 15:47:30 UTC
I reported this same bug in the oVirt Node section not knowing it was a vdsm issue.  It really is a simple fix.  The default for libvirtd is to enable TLS.  You actually have to add the line listen_tls=0 to disable it.  Deleting or commenting that line out fixes the issue.  Whoever customized that config file for the install just needs to not disable TLS on purpose.
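
To make the "default is on unless the line is present" point concrete, here is a small Python sketch that works out the effective value from the text of a libvirtd.conf. The function name effective_listen_tls is hypothetical, and treating the last uncommented assignment as the winner is an assumption, not something confirmed in this thread.

```python
import re

# Matches an active (uncommented) "listen_tls = N" line, optionally
# followed by a trailing comment such as "# by vdsm".
_LISTEN_TLS = re.compile(r"^\s*listen_tls\s*=\s*(\d+)\s*(?:#.*)?$")


def effective_listen_tls(conf_text: str) -> int:
    """Return the listen_tls value libvirtd would be expected to use."""
    value = 1  # libvirt's default when the key is absent or commented out
    for line in conf_text.splitlines():
        match = _LISTEN_TLS.match(line)
        if match:
            # Assumption: if the key appears more than once, the last one wins.
            value = int(match.group(1))
    return value
```

Feeding it the contents of /etc/libvirt/libvirtd.conf from both hosts gives a quick answer to the question in comment 7: a file containing "listen_tls = 0" yields 0, while an empty file or one with "#listen_tls = 0" yields 1.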

Comment 13 Michel van Horssen 2012-05-25 12:24:39 UTC
Yes, thanks Jacob,

Commenting out "listen_tls = 0" did the trick. Funny, because the instructions for installing VDSM say just that: comment it out before starting. Strange that the node does otherwise :)

I've put it in /config/etc/libvirt/libvirtd.conf so a reboot isn't a problem.

A migration between my 2 nodes worked perfectly.

I'll post it to the list as well so others can find it in the archives.

Comment 14 Mike Burns 2012-05-25 12:55:32 UTC
While I still contend that vdsm should be setting all values correctly that it depends on (and I'm not closing this bug for that reason), I'll submit a patch to ovirt-node that makes sure it's set the same way.

Comment 15 Michel van Horssen 2012-05-25 13:15:57 UTC
I'm with you that vdsm should be doing this; still, thanks for patching ovirt-node so one way or another it gets fixed.

Hasn't this come up before, or is everyone migrating between installed VDSMs and not between nodes?

Comment 16 Mike Burns 2012-05-25 13:36:42 UTC
ovirt-node patch:  http://gerrit.ovirt.org/4813

(In reply to comment #15)
> I'm with you that vdsm should be doing this; still, thanks for patching
> ovirt-node so one way or another it gets fixed.
> 
> Hasn't this come up before, or is everyone migrating between installed VDSMs
> and not between nodes?

I'm surprised more people haven't complained about this, although maybe people figured out what the problem was and just haven't reported it.  Or maybe things are just so stable that no one has needed to migrate to a node.

Another possibility is that no one is really running in production yet.

For whatever reason, this was never flagged as an issue before.

Comment 17 Dan Kenigsberg 2012-05-27 20:05:25 UTC
(In reply to comment #15)
> I'm with you that vdsm should be doing this; still, thanks for patching
> ovirt-node so one way or another it gets fixed.
> 
> Hasn't this come up before, or is everyone migrating between installed VDSMs
> and not between nodes?

Vdsm tries to touch libvirtd.conf as little as possible, so it does not set listen_tls=1, as this is the libvirt default. Vdsm assumes that if someone/something else changes the default, it knows what it is doing. Apparently this was not the case this time.

I believe that Vdsm's minimalistic approach is the Right Thing to do,
and am about to close the Vdsm part of this bug.

Comment 18 Michel van Horssen 2012-05-28 21:09:12 UTC
All parts of oVirt should be in agreement on what to use: listen_tls on or off, so to speak.

So VDSM does not want to touch libvirtd.conf and therefore uses its default, which is listen_tls=1. The node on the other hand doesn't touch anything either and thus also has listen_tls=1. But a migration between nodes goes wrong. The moment you force it to listen_tls=0 we can migrate.

>Vdsm assumes that if someone/something else changes the default, 
> it knows what it is doing. Apparently this was not the case this time.

As far as I can see no one/nothing touched the defaults. We touch the defaults now because otherwise we cannot migrate between nodes because of the defaults :)

I'm not saying that it is therefore a VDSM problem, just that no one changed the defaults.

Comment 19 Jacob Wyatt 2012-05-28 23:01:01 UTC
(In reply to comment #18)
> All parts of oVirt should be in agreement on what to use: listen_tls on or
> off, so to speak.
> 
> So VDSM does not want to touch libvirtd.conf and therefore uses its
> default, which is listen_tls=1. The node on the other hand doesn't touch
> anything either and thus also has listen_tls=1. But a migration between
> nodes goes wrong. The moment you force it to listen_tls=0 we can migrate.
> 
> >Vdsm assumes that if someone/something else changes the default, 
> > it knows what it is doing. Apparently this was not the case this time.
> 
> As far as I can see no one/nothing touched the defaults. We touch the
> defaults now because otherwise we cannot migrate between nodes because of
> the defaults :)
> 
> I'm not saying that it is therefore a VDSM problem, just that no one
> changed the defaults.

I think you have that backwards, Michel.  The variable listen_tls defaults to 1 (http://libvirt.org/remote.html) which is what we want. By simply NOT adding any configuration to libvirtd.conf file you fix the problem.  Someone, at some point, changed libvirtd.conf adding the line listen_tls=0 thereby disabling the ability to use TLS for migration.  It should be easy to determine if this should be handled by the VDSM team or the oVirt Node team by finding out who added listen_tls=0 to the file.  It's not a matter of someone doing something.  It's a matter of stopping someone from doing what they're already doing.

Comment 20 Michel van Horssen 2012-05-29 12:15:23 UTC
Before calling me backward :) Just kidding ;)

Maybe my explanation was off, but what I said was that in the libvirtd.conf file the line listen_tls=0 was commented out by default and thus the default listen_tls=1 was in effect.

With this I could not migrate. After manually uncommenting the line so that listen_tls=0 was used, I could migrate.

So as far as I can see no one changed the default on the node at install.

Default node install? Then listen_tls=1 but migration not working

Maybe we are talking about different things here.

Comment 21 Michel van Horssen 2012-05-29 12:16:53 UTC
oooh no sorry I'm confusing myself now. Scratch that last comment of mine.

Comment 22 Michel van Horssen 2012-05-29 12:18:35 UTC
I am backwards

Comment 23 Itamar Heim 2012-06-03 09:08:06 UTC
mburns - it sounds like no change is needed (this bug is in POST with no patch in comments)

Comment 24 Mike Burns 2012-06-03 11:26:39 UTC
itamar -- see comment 16

Comment 25 Itamar Heim 2012-06-03 11:40:27 UTC
sorry - I missed that patch.
I see it was merged; shouldn't this be in MODIFIED then?

Comment 26 Mike Burns 2012-06-03 11:48:59 UTC
yes

Comment 27 Fabian Deutsch 2012-06-06 11:01:16 UTC
listen_tls=1 requires certificates to reside in /etc/pki/CA - but they aren't part of Node and are only deployed when Node is registered with Engine.
Libvirt will fail to start if the certificates are not available, and furthermore parts of Node's init scripts will also fail.

Vdsm should be the component enabling listen_tls after it has deployed the certificates. Node should set listen_tls=0, as it does not provide the required certificates. (bug #829267)
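
The ordering argued for here (deploy certificates first, enable listen_tls afterwards) could be guarded by a check along these lines. This is a sketch under assumptions: the file list is libvirt's documented default server-side TLS layout, which Node and vdsm may override, and missing_tls_files is a made-up name rather than an existing vdsm function.

```python
from pathlib import Path
from typing import List

# Assumed default locations of the server-side TLS files libvirtd loads
# when listen_tls is enabled; a deployment may configure other paths.
TLS_FILES = (
    "etc/pki/CA/cacert.pem",
    "etc/pki/libvirt/servercert.pem",
    "etc/pki/libvirt/private/serverkey.pem",
)


def missing_tls_files(root: str = "/") -> List[str]:
    """List the expected TLS files that do not exist under root."""
    base = Path(root)
    return [rel for rel in TLS_FILES if not (base / rel).is_file()]
```

An empty result would mean it is reasonable to switch listen_tls on and restart libvirtd; a non-empty one corresponds to the unregistered-Node case described above, where the daemon fails to start. The root parameter only exists so the check can be pointed at a scratch directory.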

Comment 28 Itamar Heim 2012-06-06 12:28:59 UTC
not sure I agree.
I think a better approach would be for libvirt to not start until it has a certificate configured by vdsm bootstrap.
danken - thoughts?

Comment 29 Dan Kenigsberg 2012-06-06 13:45:47 UTC
(In reply to comment #28)
> not sure I agree.
> I think a better approach would be for libvirt to not start until it has a
> certificate configured by vdsm bootstrap.

I'm afraid we need libvirt to already run during bootstrap - we use libvirt to define host management network.

Comment 30 Itamar Heim 2012-06-06 21:04:16 UTC
(In reply to comment #29)
> I'm afraid we need libvirt to already run during bootstrap - we use libvirt
> to define host management network.

so there is no way around node disabling listen_tls for vdsm to enable it?
how does it work on a normal fedora if libvirt is installed on it?

Comment 31 Dan Kenigsberg 2012-06-07 06:55:20 UTC
F17 libvirt works out of the box. I haven't heard a convincing reason why ovirt-node should touch it. Once vdsm takes responsibility, it should bring in the certs, reconfigure libvirt, and restart it.

Comment 32 Fabian Deutsch 2012-06-07 07:01:09 UTC
(In reply to comment #30)
> (In reply to comment #29)
> > I'm afraid we need libvirt to already run during bootstrap - we use libvirt
> > to define host management network.
> 
> so there is no way around node disabling listen_tls for vdsm to enable it?
> how does it work on a normal fedora if libvirt is installed on it?

Both Fedora and oVirt Node have a commented-out listen_tls=0 in their libvirtd.conf. Node's behavior differs because Node's /etc/sysconfig/libvirtd differs and passes the "--listen" argument to libvirtd.

So it works on Fedora without errors because libvirtd is not listening on TCP at all by default (which is required for listen_tls to have any effect [afaiu]).
The change to pass "--listen" to the libvirt daemon is introduced when Node is built.
As far as I understand the situation, there are a few options to prevent the error:

1. stick to F17 defaults and VDSM provides certificates and enables --listen if required

2. at node build time, node enables --listen and sets listen_tls=0 because there are no certificates yet

In both situations libvirt should still be listening on the unix socket.
Similar to what Dan said in comment #31, I don't know a reason why libvirt should be listening on external interfaces by default, so I'd go with 2.
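
The interaction between "--listen" and listen_tls described in this comment can be written down as a small Python function. It is a simplified model for the libvirtd of that era, not a description of libvirt internals: network_ports is a hypothetical name, the ports are the libvirt defaults, and combined short options such as "-dl" are deliberately not handled.

```python
import shlex
from typing import Set

TCP_PORT = 16509  # libvirtd default for unencrypted TCP
TLS_PORT = 16514  # libvirtd default for TLS


def network_ports(libvirtd_args: str, listen_tls: int = 1,
                  listen_tcp: int = 0) -> Set[int]:
    """Ports libvirtd would be expected to open for remote clients.

    libvirtd_args is the daemon's argument string, e.g. what Node's
    /etc/sysconfig/libvirtd passes. The keyword defaults mirror the
    libvirtd.conf defaults (TLS on, plain TCP off).
    """
    args = shlex.split(libvirtd_args)
    if "--listen" not in args and "-l" not in args:
        # Without --listen only the local unix socket is served, so the
        # listen_tls / listen_tcp settings have no effect.
        return set()
    ports = set()
    if listen_tls:
        ports.add(TLS_PORT)
    if listen_tcp:
        ports.add(TCP_PORT)
    return ports
```

In this model, stock Fedora is network_ports(""), which opens nothing, while Node with "--listen" and the default listen_tls=1 needs port 16514 and therefore the certificates. The failing hosts in this bug correspond to "--listen" with listen_tls=0 and TCP enabled, which leaves only 16509 open.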

Comment 33 Itamar Heim 2012-06-07 07:54:21 UTC
(In reply to comment #32)
...
> 1. stick to F17 defaults and VDSM provides certificates and enables --listen
> if required
> 
> 2. at node build time, node enables --listen and sets listen_tls=0 because
> there are no certificates yet
> 
> In both situations libvirt should still be listening on the unix socket.
> Similar to what Dan said in comment #31, I don't know a reason why libvirt
> should be listening on external interfaces by default, so I'd go with 2.

you meant you'd go with option 1, right?

Comment 34 Fabian Deutsch 2012-06-07 07:56:51 UTC
(In reply to comment #33)
> (In reply to comment #32)
> ...
> > 1. stick to F17 defaults and VDSM provides certificates and enables --listen
> > if required
> > 
> > 2. at node build time, node enables --listen and sets listen_tls=0 because
> > there are no certificates yet
> > 
> > In both situations libvirt should still be listening on the unix socket.
> > Similar to what Dan said in comment #31, I don't know a reason why libvirt
> > should be listening on external interfaces by default, so I'd go with 2.
> 
> you meant you'd go with option 1, right?

yes -  a typo.

Comment 35 Fabian Deutsch 2012-06-07 08:44:29 UTC
(In reply to comment #34)
> (In reply to comment #33)
> > (In reply to comment #32)
> > ...
> > > 1. stick to F17 defaults and VDSM provides certificates and enables --listen
> > > if required

The following patch prevents node from touching libvirt config files at build time:
http://gerrit.ovirt.org/#/c/5122/