1809611 – [ocp on osp hackfest] OCP 4.4 Installer fails with multiple dns servers in config file.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Installer
Sub Component:
Version:	4.4
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	low
Target Milestone:	---
Target Release:	4.5.0
Assignee:	Martin André
QA Contact:	David Sanz
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1851267 1851911

Reported:	2020-03-03 13:56 UTC by Chris Janiszewski
Modified:	2020-08-04 18:03 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	Cause: The CoreDNS forward plugin uses a default policy of random to pick the upstream server.
Consequence: Clusters may be unable to resolve the OpenStack API hostname when the user sets multiple external DNS resolvers such as ['172.31.8.1','8.8.8.8'], where 172.31.8.1 can resolve internal hostnames and 8.8.8.8 is a public resolver that cannot.
Fix: Switch to the sequential policy to mimic the libc behavior.
Result: CoreDNS now uses the DNS servers specified via the externalDNS option in order.
Clone Of:
Environment:
Last Closed:	2020-08-04 18:03:31 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 1527	0	None	closed	Bug 1809611: OpenStack: Set coredns forward policy to sequencial	2021-02-05 23:45:19 UTC
Red Hat Product Errata	RHBA-2020:2409	0	None	None	None	2020-08-04 18:03:36 UTC

Description Chris Janiszewski 2020-03-03 13:56:12 UTC

Description of problem:
The OCP 4.4 installation on OSP 16 fails with the following error:

INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main
FATAL failed to initialize the cluster: Working towards 4.4.0-0.nightly-2020-03-01-212047: 99% complete

$ oc logs -n openshift-machine-api machine-api-controllers-8d874cb86-z8xp6 -c machine-controller

E0303 11:41:28.528641       1 controller.go:279] Failed to check if machine "ocpra-vbmm9-worker-2lbhd" exists: Error checking if instance exists (machine/actuator.go 346):                 
Error getting a new instance service from the machine (machine/actuator.go 467): Failed to authenticate provider client: Get https://openstack.home.lab:13000/: dial tcp: lookup openstack.home.lab on 172.30.0.10:53: no such host 

Config file DNS section:

platform:
  openstack:
    cloud: openstack
    computeFlavor: m1.large
    externalDNS: ['172.31.8.1','8.8.8.8']

Version-Release number of the following components:
OCP 4.4.0-0.nightly-2020-03-01-212047

According to the CoreDNS forward plugin documentation (https://coredns.io/plugins/forward/), the default policy is random. Whenever CoreDNS picks the 8.8.8.8 resolver, it cannot resolve internal hostnames such as the OpenStack API endpoint.

CoreDNS should default to the first server and use the second one only if the first one fails.
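
The difference between the two policies can be illustrated with a small simulation. This is only a sketch, not CoreDNS code: the record table, the 172.31.8.50 address, and the function names are hypothetical, and the random model ignores the health checks that the real plugin performs. It assumes, as described above, that an upstream's negative answer is final and that the sequential policy moves to the next server only when an earlier one is unavailable.

```python
import random
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for real resolvers: the names each upstream can answer.
# The 172.31.8.50 address is made up for illustration.
UPSTREAM_RECORDS: Dict[str, Dict[str, str]] = {
    "172.31.8.1": {"openstack.home.lab": "172.31.8.50"},  # internal resolver
    "8.8.8.8": {},  # public resolver, knows no internal names
}

UPSTREAMS: List[str] = ["172.31.8.1", "8.8.8.8"]
NAME = "openstack.home.lab"


def query(upstream: str, name: str) -> Optional[str]:
    """Return the address the upstream knows for name, or None ("no such host")."""
    return UPSTREAM_RECORDS[upstream].get(name)


def resolve_random(upstreams: List[str], name: str, rng: random.Random) -> Optional[str]:
    """Random policy: pick any upstream and accept its answer, even a negative one."""
    return query(rng.choice(upstreams), name)


def resolve_sequential(
    upstreams: List[str],
    name: str,
    is_healthy: Callable[[str], bool] = lambda upstream: True,
) -> Optional[str]:
    """Sequential policy: always ask the first healthy upstream in list order."""
    for upstream in upstreams:
        if is_healthy(upstream):
            return query(upstream, name)
    return None


rng = random.Random(0)
random_results = [resolve_random(UPSTREAMS, NAME, rng) for _ in range(1000)]
random_failures = random_results.count(None)
sequential_results = [resolve_sequential(UPSTREAMS, NAME) for _ in range(1000)]

print(f"random policy: {random_failures} of 1000 lookups failed")
print(f"sequential policy: {sequential_results.count(None)} of 1000 lookups failed")
```

With the random policy a share of the lookups lands on the public resolver and fails, which matches the intermittent "no such host" error in the machine-controller log. With the sequential policy the public resolver is consulted only when the internal one is marked unavailable.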

How reproducible:
Every time

Steps to Reproduce:
1. Deploy OCP 4.4 with two DNS entries, only one of which can resolve the overcloud domain name


Actual results:
The installation fails with the error shown above.

Expected results:
Only the first DNS server is used unless it is not accessible.
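
As a rough illustration of the expected configuration, the sketch below renders a CoreDNS forward stanza from the externalDNS list with an explicit policy. The helper name render_forward_block is hypothetical and this is not the machine-config-operator template; only the forward plugin syntax and its policy values (random, round_robin, sequential) are taken from the CoreDNS documentation linked above.

```python
from typing import List

# Policy names accepted by the CoreDNS forward plugin.
VALID_POLICIES = {"random", "round_robin", "sequential"}


def render_forward_block(upstreams: List[str], policy: str = "sequential") -> str:
    """Render a forward stanza that sends all queries to upstreams in the given order."""
    if not upstreams:
        raise ValueError("at least one upstream resolver is required")
    if policy not in VALID_POLICIES:
        raise ValueError(f"unknown forward policy: {policy}")
    servers = " ".join(upstreams)
    # Doubled braces emit literal braces in an f-string.
    return f"forward . {servers} {{\n    policy {policy}\n}}"


forward_block = render_forward_block(["172.31.8.1", "8.8.8.8"])
print(forward_block)
```

The order of the rendered servers follows the order of the externalDNS list, so with the sequential policy the first entry is preferred.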

Additional info:

Comment 3 David Sanz 2020-04-30 10:36:33 UTC

Verified on 4.5.0-0.nightly-2020-04-29-111042

Comment 5 errata-xmlrpc 2020-08-04 18:03:31 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
