Red Hat Bugzilla – Bug 1477294
If there is no default route on the compute nodes and ephemeral data is local to the compute nodes, a nova resize fails
Last modified: 2017-09-04 03:04:31 EDT
Description of problem:
- 1 controller
- 2 compute nodes
- ephemeral data configured to be local to the compute nodes.
- during a deployment the compute nic configs have no default gateway defined.
When an instance is resized, this is what happens:
The instance is *seemingly* resized properly on the second compute node; however:
- data is lost: any files created on the ephemeral storage are lost in the resize.
- an abnormal effect I observe is that the original compute node still has /var/lib/nova/instance/<name>/<disks> in place, unlike the case where the default gw is set up.
The customer root-caused the problem as follows:
The problem probably has to do with the fact that on our compute nodes we do not have any default route.
dest_host will be set in _create_migration in nova/compute/resource_tracker.py to self.driver.get_host_ip_addr()
On the libvirt driver this is set to CONF.my_ip.
CONF.my_ip is set, presumably at nova startup, using a method in oslo_utils/netutils.py called get_my_ipv4. It tries to create a socket to destination 192.0.2.0. This presupposes that this address is routable. On our compute nodes it is not, so this defaults to 127.0.0.1.
With dest_host set to 127.0.0.1, the migration/resize will perform a local copy.
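As a hedged illustration of the failure mode described above: an address that a compute node advertises as dest_host is only useful if a different host can reach it, which a loopback address never is. The helper below is a hypothetical sanity check, not nova code; the name is_usable_dest_host and the example addresses are assumptions made for this sketch.

```python
import ipaddress


def is_usable_dest_host(addr):
    """Return True if addr could identify this host to a *different* host.

    Hypothetical helper, not part of nova. Loopback and unspecified
    addresses always resolve to the machine that uses them, so a disk
    copy aimed at one of them never leaves the source host.
    """
    ip = ipaddress.ip_address(addr)
    return not (ip.is_loopback or ip.is_unspecified)


# 127.0.0.1 is what the compute nodes in this report ended up advertising.
print(is_usable_dest_host("127.0.0.1"))
# An example (made-up) control plane address for comparison.
print(is_usable_dest_host("172.16.0.12"))
```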
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Set up an OpenStack env with 1 controller and 2 computes
2. Configure ephemeral storage local to the compute nodes
3. Leave no default routes in the compute nic configs (I have uploaded my yaml files for reference in case they are useful. Please note my yaml files use ceph for all the other storage)
Actual results:
- Ephemeral data lost
Expected results:
- Ephemeral data should be copied over to the second compute node vs a disk being recreated
Additional info:
- The false positive for a resize could cause serious damage for users running applications like Hadoop that depend heavily on local compute storage.
Created attachment 1307721 [details]
yaml files used with director to deploy a stack with no default gw
*** Bug 1477293 has been marked as a duplicate of this bug. ***
It seems that we do currently rely on the default gateway being set to the control plane gateway. It's set in all of the default network configs AFAICT.
Why isn't this being set in the custom network config used to deploy in this case?
As part of security hardening, this customer has specifically set up all the routes they want to allow traffic on.
They are able to successfully ping/ssh from one nova node to another.
In that case, why should they have a default gateway?
(In reply to Ruchika K from comment #7)
> As part of security hardening, this customer has specifically set up all
> the routes they want to allow traffic on.
I'm not an expert on network security but this sounds like a poor substitute for a firewall.
However, if there is a compelling reason to support this then I think it would be best to raise an RFE.
> They are able to successfully ping/ssh from one nova node to another.
> In that case, why should they have a default gateway?
They are on the same network. Routing is only required to connect to an IP on a different network.
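To make the point about routing concrete, here is a minimal sketch using Python's standard ipaddress module. It checks whether a peer falls inside the local interface's own subnet, in which case traffic is delivered directly and no gateway is consulted. The function name on_same_network and the addresses are hypothetical examples, not values from this deployment.

```python
import ipaddress


def on_same_network(local_ip, prefix_len, remote_ip):
    """True if remote_ip falls inside the local interface's own subnet."""
    local_net = ipaddress.ip_interface(f"{local_ip}/{prefix_len}").network
    return ipaddress.ip_address(remote_ip) in local_net


# Two computes on the same /24: reachable without any route via a gateway.
print(on_same_network("172.16.0.11", 24, "172.16.0.12"))
# A peer on another subnet: reaching it needs a matching or default route.
print(on_same_network("172.16.0.11", 24, "172.16.1.12"))
```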
(In reply to Ruchika K from comment #9)
> Why does this one feature of resize need a default gateway? Could it just be the implementation is erroneous?
Resize does not need a default gateway. This is essentially an scp of the disks from src host to target host. As they are on the same network there is no routing involved.
I believe the root cause is that resize needs the computes to report their correct IP address, and in this case they do not.
AFAICT the IP address is determined by the value of my_ip in nova.conf (not currently possible to set this using RHOSP director). If not set it defaults to auto-detection.
To auto-detect the IP address, a socket is opened with the target IP set to 192.0.2.0. This is not a real IP address; it's in the test/doc IP address range. The src IP address of this socket should provide the default IP of the host.
If there is no route found for this IP then the interfaces are inspected. If no default gateway is found then this process fails and the loopback address 127.0.0.1 is used. https://github.com/openstack/oslo.utils/blob/stable/newton/oslo_utils/netutils.py#L259
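The detection described above can be sketched roughly as follows. This is a simplified illustration and not the oslo.utils implementation: it skips the interface inspection step and falls straight back to loopback when no route to the probe address exists. The function name detect_host_ipv4 and the sock_factory parameter (there only so the behaviour can be exercised without touching the network) are assumptions of this sketch.

```python
import socket

PROBE_ADDR = "192.0.2.0"  # documentation/test range, never a real host
LOOPBACK = "127.0.0.1"


def detect_host_ipv4(probe_addr=PROBE_ADDR, probe_port=80,
                     sock_factory=socket.socket):
    """Guess the host's primary IPv4 address from the routing table.

    Connecting a UDP socket sends no packets; it only asks the kernel
    which source address it would use to reach probe_addr. With no
    matching route (for example, no default route) connect() raises
    OSError and this simplified sketch returns the loopback address.
    """
    try:
        sock = sock_factory(socket.AF_INET, socket.SOCK_DGRAM)
    except OSError:
        return LOOPBACK
    try:
        sock.connect((probe_addr, probe_port))
        return sock.getsockname()[0]
    except OSError:
        return LOOPBACK
    finally:
        sock.close()


if __name__ == "__main__":
    # On a host without a route to 192.0.2.0 this prints 127.0.0.1.
    print(detect_host_ipv4())
```

Running this on an affected compute node is a quick way to see what address the auto-detection would be likely to pick.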
So removing the default route has inadvertently broken the IP auto-detection. Every compute node reports the IP address 127.0.0.1. As a result the resize is transferring disks from the src host to the src host via this loopback address instead of from the src host to the target host.
Note: I would also be concerned that the lack of a default route has side-effects for other features/services. oslo_utils is common code used by many OpenStack services. They may also rely on this IP address auto-detection.
> Please clarify why this is an RFE and not a bug.
I think it could be argued either way...
Support for deploying without a default gateway doesn't appear to have been implemented yet.
E.g. for nova-compute, I think we need to add support for setting my_ip in puppet-nova and then set this by default to the control plane IP of the node via tripleo-heat-templates.
Other nova.conf params may require this too (e.g. if they don't default to my_ip).
Other services may need to make similar changes too.
So this sounds like an RFE to support this specific network config in RHOSP director.
That said, I really don't think we should be relying on the IP auto-detection mechanism. It looks more like a sensible default to make development more convenient than something we should rely on in production. I would much rather that director set this explicitly in nova.conf. So I think you could argue that it's a bug.
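As a rough way to check whether a given compute node relies on auto-detection, the sketch below parses nova.conf text and reports whether my_ip is set explicitly. It assumes my_ip lives in the [DEFAULT] section; the function name my_ip_problem and the sample file contents are hypothetical, and the parsing uses Python's configparser rather than oslo.config, so it is only an approximation of how nova reads the file.

```python
import configparser
import ipaddress


def my_ip_problem(nova_conf_text):
    """Return a description of the my_ip problem, or None if it looks fine."""
    # strict=False tolerates repeated keys; interpolation=None leaves
    # any '%' characters in values untouched.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(nova_conf_text)
    value = parser.defaults().get("my_ip")
    if value is None:
        return "my_ip not set: nova falls back to auto-detection"
    if ipaddress.ip_address(value.strip()).is_loopback:
        return "my_ip is a loopback address"
    return None


# Hypothetical nova.conf fragment with my_ip pinned to a control plane IP.
sample = """
[DEFAULT]
debug = False
my_ip = 172.16.0.11
"""
print(my_ip_problem(sample))
```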
For the support case associated with this: for now I would restore the default route to the control plane gateway for compute nodes unless there is a compelling reason not to do this. If there is a good reason not to, then maybe setting a static route on the compute nodes for 192.0.2.0 via the control plane gateway would be a viable workaround to fix IP auto-detection (not tested! last resort!).
Setting this back to needinfo, is there a good reason for removing the default route?