Nodes would end up out of synchronization due to a lack of access to NTP servers. This was because not all nodes routed access to the required servers (NTP, DNS, etc). This fix sets the Undercloud as a gateway for non-Controller nodes. This provides non-Controller nodes with access to external services such as DNS and NTP, which aids synchronization.
Description of problem:
Following an attempt to validate the NTP server bug - https://bugzilla.redhat.com/show_bug.cgi?id=1233916#c17
It seems that if the DNS server address is located on the external or internal network, some overcloud hosts will not reach it, since not every host has an external or internal network.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Can't ping the DNS server configured during cloud deploy from Ceph/Compute nodes
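The reachability check in the step above can be scripted so it is repeatable on each node. This is a minimal sketch, assuming a Linux node with the iputils `ping` binary; the function names are hypothetical and `192.0.2.53` is a placeholder address, not a value from this report.

```python
import subprocess


def build_ping_command(address, count=1, timeout_s=2):
    """Return the argv for a bounded ping (iputils flags: -c count, -W timeout)."""
    return ["ping", "-c", str(count), "-W", str(timeout_s), address]


def can_ping(address):
    """Return True if the address answers the ICMP echo request(s)."""
    result = subprocess.run(
        build_ping_command(address),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

Running `can_ping("192.0.2.53")` on a Compute or Ceph node, with the placeholder replaced by the DNS server set at deploy time, should return `False` while this bug is present.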
Dan, any ideas for how to resolve this?
One possible fix for this is a system management network. I'm tracking that in BZ 1240395.
There is an upstream patch for that bug which is in review: https://review.openstack.org/#/c/199800/
That probably won't make it into GA, but there is a slim chance.
One of the bad outcomes of this is that ntpd on the nodes is not able to sync correctly, so each node ends up with a different time.
The latest on this is that Dan Prince is working on a patch which would allow the ctlplane to be used with static IPs and an external gateway. That would provide a way for nodes to reach NTP without having to rely on the undercloud as a SPOF.
(In reply to Dan Sneddon from comment #5)
> The latest on this is that Dan Prince is working on a patch which would
> allow the ctlplane to be used with static IPs and an external gateway. That
> would provide a way for nodes to reach NTP without having to rely on the
> undercloud as a SPOF.
Should read "reach DNS", but the same approach will also provide access to NTP.
When using network isolation we have the default gateway (which is likely used to access the DNS server) on different networks. For compute/ceph roles the default gateway is going to be on the ctlplane network. For the controller node it would be on the external network (public traffic).
So long as your router (either the undercloud, or a real router) can route traffic from the ctlplane to the external network (where the external DNS server resides), I think this should work fine.
If however you put your DNS server on one of the isolated networks (internal_api, storage, etc) then it would only be accessible by select roles, unless again you've gone and put routes in place on your ctlplane router to handle this.
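The reasoning above can be modelled in a few lines. This is only a sketch: the subnets, the role-to-network mapping, and the `how_reached` helper are hypothetical values chosen for illustration, not the layout of any real deployment.

```python
import ipaddress

# Hypothetical subnets for illustration only; none of these come from this report.
NETWORKS = {
    "ctlplane": ipaddress.ip_network("192.0.2.0/24"),
    "external": ipaddress.ip_network("198.51.100.0/24"),
    "internal_api": ipaddress.ip_network("172.16.2.0/24"),
    "storage": ipaddress.ip_network("172.16.1.0/24"),
}

# Networks attached to each role, and the network holding its default gateway.
ROLES = {
    "controller": {
        "attached": {"ctlplane", "external", "internal_api", "storage"},
        "gateway_on": "external",
    },
    "compute": {
        "attached": {"ctlplane", "internal_api", "storage"},
        "gateway_on": "ctlplane",
    },
    "ceph": {
        "attached": {"ctlplane", "storage"},
        "gateway_on": "ctlplane",
    },
}


def how_reached(role, server, routed_pairs=frozenset()):
    """Return 'direct', 'routed', or 'unreachable' for a role and a server address.

    routed_pairs holds (source_network, destination_network) pairs that the
    gateway on source_network is able to route.
    """
    addr = ipaddress.ip_address(server)
    target = next((name for name, net in NETWORKS.items() if addr in net), None)
    if target is None:
        raise ValueError(f"{server} is not on any modelled network")
    info = ROLES[role]
    if target in info["attached"]:
        return "direct"
    if (info["gateway_on"], target) in routed_pairs:
        return "routed"
    return "unreachable"
```

With a DNS server on the external network, `how_reached("compute", "198.51.100.10")` gives `"unreachable"` until the pair `("ctlplane", "external")` is added to `routed_pairs`, which corresponds to the ctlplane router gaining a route to the external network.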
Could we just treat this as a missing route issue? As in, there needs to be a route added somewhere (either the undercloud or the gateway router) to handle this traffic?
So the problem with comment 7 is:
* it requires a router, of some kind
* the router needs to not serve DHCP because neutron is going to do that
* there needs to be nothing else on the subnet behind the router, because neutron is going to be serving DHCP on it
I'm having a hard time imagining how we will get our *own* IT to let us set that up in the lab, much less explaining to a customer that that's what we need.
I think a better solution is probably to get all nodes onto a network which has external connectivity, but where we do not serve DHCP. What's wrong with just using the external API network for this? The controllers already use that network as a gateway if I'm not mistaken.
Alternatively we could define a gateway for the internal API network, but that seems a little screwy.
OK, having said comment 8 ...
Is the least invasive solution here not simply setting the undercloud up as the external gateway for everything but the controllers? It's already routing traffic on that subnet and serving DHCP... I don't think we should leave it this way, but I think we could ship with it.
OK, I believe the only patch needed to set this up is to enable IP forwarding on the undercloud. I've linked Ben's patches for that.
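Whether IP forwarding is currently enabled on a Linux host can be read from the kernel's `net.ipv4.ip_forward` setting. The sketch below is a minimal check, not the content of the linked patches; the helper names are hypothetical.

```python
from pathlib import Path

# Standard procfs location of the net.ipv4.ip_forward sysctl on Linux.
IP_FORWARD_PATH = Path("/proc/sys/net/ipv4/ip_forward")


def parse_ip_forward(text):
    """Interpret the kernel flag: 0 means disabled, any non-zero value means enabled."""
    return int(text.strip()) != 0


def forwarding_enabled(path=IP_FORWARD_PATH):
    """Read the flag from procfs and report whether IPv4 forwarding is on."""
    return parse_ip_forward(Path(path).read_text())
```

If `forwarding_enabled()` returns `False` on the undercloud, forwarding can be switched on at runtime with `sysctl -w net.ipv4.ip_forward=1`. Whether the undercloud also needs NAT rules for the forwarded traffic depends on the deployment and is not covered by this sketch.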
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.