Bug 1809611

Summary: [ocp on osp hackfest] OCP 4.4 Installer fails with multiple dns servers in config file.
Product: OpenShift Container Platform Reporter: Chris Janiszewski <cjanisze>
Component: Installer Assignee: Martin André <m.andre>
Installer sub component: OpenShift on OpenStack QA Contact: David Sanz <dsanzmor>
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: medium CC: m.andre, mfedosin, pprinett
Version: 4.4   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: CoreDNS forward plugin uses a default policy of random to pick the upstream server. Consequence: Clusters may not be able to resolve the OpenStack API hostname when the user sets multiple external DNS resolvers such as ['172.31.8.1','8.8.8.8'], where 172.31.8.1 knows to resolve internal hostnames and 8.8.8.8 is a public resolver that doesn't know how to resolve internal hostnames. Fix: Switch to the sequential policy to simulate the libc behavior. Result: The DNS servers specified via externalDNS option are now used in order by CoreDNS.
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-08-04 18:03:31 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1851267, 1851911    

Description Chris Janiszewski 2020-03-03 13:56:12 UTC
Description of problem:
OCP 4.4 on OSP16 fails with error:

INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
ERROR Cluster operator monitoring Degraded is True with UpdatingAlertmanagerFailed: Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for RouteReady of alertmanager-main: no status available for alertmanager-main
FATAL failed to initialize the cluster: Working towards 4.4.0-0.nightly-2020-03-01-212047: 99% complete

$ oc logs -n openshift-machine-api machine-api-controllers-8d874cb86-z8xp6 -c machine-controller

E0303 11:41:28.528641       1 controller.go:279] Failed to check if machine "ocpra-vbmm9-worker-2lbhd" exists: Error checking if instance exists (machine/actuator.go 346):                 
Error getting a new instance service from the machine (machine/actuator.go 467): Failed to authenticate provider client: Get https://openstack.home.lab:13000/: dial tcp: lookup openstack.home.lab on 172.30.0.10:53: no such host 

Config file DNS section:

platform:
  openstack:                                                                                 
    cloud: openstack                                                                         
    computeFlavor: m1.large                                                                  
    externalDNS: ['172.31.8.1','8.8.8.8']



Version-Release number of the following components:
OCP 4.4.0-0.nightly-2020-03-01-212047

According to the CoreDNS forward plugin documentation (https://coredns.io/plugins/forward/), the default policy is random. Whenever it picks the 8.8.8.8 resolver, the lookup fails, because that resolver cannot resolve the cluster hostname.

We should change it to default to the first server and only use the second one if the first one fails.
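
The difference between the two policies can be illustrated with a small simulation. This is a simplified sketch, not CoreDNS code: the resolver table, the 192.0.2.10 answer, and the function names are hypothetical, and "first one fails" is modelled only as the first server being unreachable.

```python
import random

# Hypothetical data: the names each upstream resolver is able to answer.
UPSTREAMS = {
    "172.31.8.1": {"openstack.home.lab": "192.0.2.10"},  # internal resolver (answer is made up)
    "8.8.8.8": {},                                        # public resolver, no internal names
}


def query(server, name):
    """Return an address, or None to stand in for a 'no such host' answer."""
    return UPSTREAMS[server].get(name)


def resolve_random(servers, name, rng):
    """'random' policy: any listed upstream may be picked for a given query."""
    return query(rng.choice(servers), name)


def resolve_sequential(servers, name, reachable=lambda server: True):
    """'sequential' policy: use the first listed upstream that is reachable."""
    for server in servers:
        if reachable(server):
            return query(server, name)
    return None


if __name__ == "__main__":
    servers = ["172.31.8.1", "8.8.8.8"]  # same order as externalDNS above
    rng = random.Random()
    print("random:    ", [resolve_random(servers, "openstack.home.lab", rng) for _ in range(6)])
    print("sequential:", [resolve_sequential(servers, "openstack.home.lab") for _ in range(6)])
```

With the random policy some lookups in this sketch land on 8.8.8.8 and return nothing, which corresponds to the "no such host" error in the machine-controller log. With the sequential policy the internal resolver answers every lookup as long as it is reachable.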

How reproducible:
Every time

Steps to Reproduce:
1. Deploy OCP 4.4 with two DNS entries where only one of the DNS servers resolves the overcloud domain name


Actual results:
The installation fails.

Expected results:
Only the first DNS server is used unless it is not accessible.

Additional info:

Comment 3 David Sanz 2020-04-30 10:36:33 UTC
Verified on 4.5.0-0.nightly-2020-04-29-111042

Comment 5 errata-xmlrpc 2020-08-04 18:03:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409