Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1100890

Summary: rhc create-app retries for DNS records on the client machine even though the app is created and added to DNS
Product: OpenShift Container Platform
Reporter: Veer Muchandi <veer>
Component: oc
Assignee: Miciah Dashiel Butler Masters <mmasters>
Status: CLOSED NOTABUG
QA Contact: libra bugs <libra-bugs>
Severity: medium
Docs Contact:
Priority: medium
Version: 2.1.0
CC: anli, bleanhar, erich, jokerman, libra-onpremise-devel, mmccomas, nicholas_schuetz, veer
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-09 22:07:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description | Flags
Retry issue when the nameserver entry is not the first one in the list | none
Retries error resolved when you move nameserver entry to the top in /etc/resolv.conf | none

Description Veer Muchandi 2014-05-23 17:42:57 UTC
Description of problem:
When you are on a test client machine, the nameserver entries might be added as DNS records on the client machine. On a Mac client, we use the /etc/resolver/<domainname> file to configure the nameserver entry.
When you create an application on this Mac client, rhc keeps retrying while it waits for the DNS record to be created, and then stops. You have to run git clone manually afterwards.

Based on the analysis so far I think the rhc client (or more specifically the library it uses for name resolution) is stopping at the first name server when it checks for the application name to be available right after creation. It is not looping through the list of name servers.


Version-Release number of selected component (if applicable):
all versions

How reproducible:
Try to create an application from a Mac client machine whose nameserver is configured in the /etc/resolver/<domainname> file.

Or, if you are using a Windows client, make the OpenShift nameserver the second nameserver in the list of DNS servers you configure (not the first).

Steps to Reproduce:
1. Create an application with rhc app create <appname> <cartridge> when the OpenShift nameserver is not the first nameserver in the list configured on the Windows box, or from a Mac client that uses the /etc/resolver/<domainname> configuration. You will see this issue.
2. Now make the OpenShift nameserver entry the first one your client machine queries (on Windows), or, on the Mac client, add your nameserver entry as the first one in /etc/resolv.conf. Try creating an application again. The problem no longer occurs.
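
The workaround in step 2 amounts to moving one nameserver line to the top of the list. A minimal Ruby sketch of that reordering follows. The function name promote_nameserver and the addresses are hypothetical; the function works on a string, so it does not touch /etc/resolv.conf itself, and it assumes one nameserver per line without trailing comments.

```ruby
# Return resolv.conf text with the preferred nameserver listed first.
# All other lines keep their relative order. If the preferred nameserver
# was not present, it is added ahead of the existing nameserver lines.
def promote_nameserver(resolv_conf_text, preferred_ip)
  lines = resolv_conf_text.lines.map(&:chomp)

  # Drop any existing entry for the preferred server so it is not duplicated.
  own_entry = /\A\s*nameserver\s+#{Regexp.escape(preferred_ip)}\s*\z/
  kept = lines.reject { |line| line.match?(own_entry) }

  # Insert it just before the first remaining nameserver line
  # (or at the end if there is none).
  insert_at = kept.index { |line| line.match?(/\A\s*nameserver\s/) } || kept.length
  kept.insert(insert_at, "nameserver #{preferred_ip}")

  kept.join("\n") + "\n"
end
```

Writing the result back to /etc/resolv.conf is left out on purpose: on many systems that file is regenerated by the DHCP client, so a manual edit may not persist.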

Actual results:
Application is created but does not get cloned on your machine

Expected results:
Application is created and cloned on the client machine.

Additional info:

Comment 2 Miciah Dashiel Butler Masters 2014-05-24 06:07:53 UTC
> When you are in a test client machine, the nameserver entries might be added as DNS records on the client machine.

Do you just mean the nameservers are configured on the client machine? I'm not sure what you mean by "as DNS records."

> Based on the analysis so far I think the rhc client (or more specifically the library it uses for name resolution) is stopping at the first name server when it checks for the application name to be available right after creation. It is not looping through the list of name servers.

That sounds like correct behaviour.  Generally, a nameserver does not say, "I don't know of this domain"; rather, it says, "This domain does not exist" (NXDOMAIN).  When we set up OpenShift with its own BIND instance, if we do not configure DNS forwarding on that BIND server (to ask another nameserver about domains it does not have in its own database), then for any domains that you do not add to its database, it will say, "This domain does not exist."

(Arguably, a server is saying, "I don't know" when it returns a SERVFAIL response, but a SERVFAIL response conveys that the server is faulty, not merely missing the information.  I assume we are not talking about faulty nameservers in this bug report.)
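
To illustrate the distinction drawn above, here is a minimal Ruby sketch of a stub resolver walking its nameserver list. It is a simplified model, not the code of glibc, OS X, Windows, or Ruby's resolv library. The function name model_lookup, the hash-based fake servers, and the example names and addresses are made up for illustration. The model assumes that a timeout or SERVFAIL moves on to the next server, while NXDOMAIN is treated as a final answer.

```ruby
# Simplified model of a stub resolver walking its nameserver list.
# Each "server" is either a Hash of name => address, or a Symbol that
# stands in for a failure mode (:timeout or :servfail).
def model_lookup(servers, name)
  servers.each_with_index do |server, index|
    # No usable answer from this server: move on to the next one.
    next if server == :timeout || server == :servfail

    # The server answered. A missing name counts as NXDOMAIN, which is a
    # definite answer, so later servers are never consulted.
    address = server[name]
    return { server: index, status: address ? :ok : :nxdomain, address: address }
  end

  # Every server failed to give a usable answer.
  { server: nil, status: :no_answer, address: nil }
end

# Hypothetical data: an upstream server that does not know the app, and an
# OpenShift BIND instance that does.
upstream  = { "www.example.com" => "192.0.2.10" }
openshift = { "myapp-demo.apps.example.com" => "192.0.2.20" }

p model_lookup([upstream, openshift], "myapp-demo.apps.example.com")
p model_lookup([:timeout, openshift], "myapp-demo.apps.example.com")
```

In this model, listing the OpenShift nameserver second yields NXDOMAIN from the first server and the second is never asked; the second server is reached only when the first gives no usable answer at all.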

Are you saying that OS X will try the second nameserver if the first returns NXDOMAIN?

I remember there was an annoying idiosyncrasy of OS X that some versions of OS X used round-robin whereas others, I believe, just used the first nameserver that returned a response (even if it was NXDOMAIN), same as Linux.  However, trying a second server if the first returns NXDOMAIN would seem to be taking liberties in interpreting how DNS is supposed to work, and taking such liberties can cause false expectations that lead to problems when others follow standards.

> Version-Release number of selected component (if applicable):
> all versions

Do you mean you have seen the bug using multiple versions of Mac OS X, or using multiple versions of rhc?

> How reproducible:
> Try to create an application from a MAC client machine that has name server configuration made in /etc/resolver/<domainname>/
> 
> Or if you are using a windows client, make the openshift name server as the second name server on the list of DNS servers you configure (not the first).

Are you saying that other programs on Microsoft Windows loop through the DNS servers? That is, if you ping a hostname that does not exist in the first nameserver's database, and that nameserver consequently returns an NXDOMAIN response, will ping on MS Windows automatically try the lookup with the second nameserver?

It sounds like there is more variation in how name lookups are performed on different OSes than I had realised.  The rhc tool relies on the core, native-Ruby resolv library for name lookups, which quite possibly gets OS-specific semantics wrong.  However, I'm not sure that the fault is in the Ruby library as opposed to being in the OSes.

Do you know of documentation for OS X or MS Windows that says that this behaviour of trying the next nameserver upon receiving NXDOMAIN is the way that these OSes are intended to behave? (In that case, one could argue that the OSes' specifications were faulty, but even a faulty specification that contradicts standards carries more weight than faulty software that contradicts standards.)

Comment 3 Veer Muchandi 2014-05-26 03:06:07 UTC
(In reply to Miciah Dashiel Butler Masters from comment #2)
> > When you are in a test client machine, the nameserver entries might be added as DNS records on the client machine.
> 
> Do you just mean the nameservers are configured on the client machine? I'm
> not sure what you mean by "as DNS records."
Yes. I mean the nameserver entries (the ones we make in /etc/resolv.conf, for example).
> 
> > Based on the analysis so far I think the rhc client (or more specifically the library it uses for name resolution) is stopping at the first name server when it checks for the application name to be available right after creation. It is not looping through the list of name servers.
> 
> That sounds like correct behaviour.  Generally, a nameserver does not say,
> "I don't know of this domain"; rather, it says, "This domain does not exist"
> (NXDOMAIN).  When we set up OpenShift with its own BIND instance, if we do
> not configure DNS forwarding on that BIND server (to ask another nameserver
> about domains it does not have in its own database), then for any domains
> that you do not add to its database, it will say, "This domain does not
> exist."
> 
> (Arguably, a server is saying, "I don't know" when it returns a SERVFAIL
> response, but a SERVFAIL response conveys that the server is faulty, not
> merely missing the information.  I assume we are not talking about faulty
> nameservers in this bug report.)
> 
> Are you saying that OS X will try the second nameserver if the first returns
> NXDOMAIN?
> 
> I remember there was an annoying idiosyncrasy of OS X that some versions of
> OS X used round-robin whereas others, I believe, just used the first
> nameserver that returned a response (even if it was NXDOMAIN), same as
> Linux.  However, trying a second server if the first returns NXDOMAIN would
> seem to be taking liberties in interpreting how DNS is supposed to work, and
> taking such liberties can cause false expectations that lead to problems
> when others follow standards.
> 
> > Version-Release number of selected component (if applicable):
> > all versions
> 

It seems "rhc create-app" checks with the nameserver multiple times. It is able to reach the broker and node with no problems, which means that even if the nameserver is not configured as the first one on the list, the lookup gets there. But once the application is created, when it tries to resolve the application by name, it seems to check just the first nameserver on the list and gets into the retry loop. The logic seems to differ between resolving a broker/node and resolving the application name.


> Do you mean you have seen the bug using multiple versions of Mac OS X, or
> using multiple versions of rhc?
No. I am using a single version of Mac OS X. This problem has always been there. But when I came across this issue on a Windows client, I understood what is happening:
if the first nameserver on the list cannot resolve the application name, rhc gets into the retry loop. This is consistent whether it is Mac OS X or Windows.
Hence I thought it could be fixed in the rhc client scripts or in the libraries rhc uses to resolve the application name right after application creation.

> 
> > How reproducible:
> > Try to create an application from a MAC client machine that has name server configuration made in /etc/resolver/<domainname>/
> > 
> > Or if you are using a windows client, make the openshift name server as the second name server on the list of DNS servers you configure (not the first).
> 
> Are you saying that other programs on Microsoft Windows loop through the DNS
> servers? That is, if you ping a hostname that does not exist in the first
> nameserver's database, and that nameserver consequently returns an NXDOMAIN
> response, will ping on MS Windows automatically try the lookup with the
> second nameserver?
I think so. If I configure the OpenShift nameserver as the second one in the list, I am able to reach the broker and node from the Windows client. So if the first nameserver cannot resolve the name, the OS goes to the second nameserver. But the rhc script is not able to do that when it checks for the application name (after creating the app).

> 
> It sounds like there is more variation in how name lookups are performed on
> different OSes than I had realised.  The rhc tool relies on the core,
> native-Ruby resolv library for name lookups, which quite possibly gets
> OS-specific semantics wrong.  However, I'm not sure that the fault is in the
> Ruby library as opposed to being in the OSes.
It may be a fault in the Ruby library. The OS seems to be resolving fine, based on what I said above.

> 
> Do you know of documentation for OS X or MS Windows that says that this
> behaviour of trying the next nameserver upon receiving NXDOMAIN is the way
> that these OSes are intended to behave? (In that case, one could argue that
> the OSes' specifications were faulty, but even a faulty specification that
> contradicts standards carries more weight than faulty software that
> contradicts standards.)
I am not sure; I did not check any of these. The fault seems to point to the Ruby library.

Comment 4 Veer Muchandi 2014-05-26 13:11:59 UTC
Today I tried with a RHEL client as well, and the problem is consistent across all the clients (whether Windows, Mac or RHEL). If the nameserver entry that points to the OpenShift DNS is not the first one in the list in /etc/resolv.conf, we get this retry issue.
This means the Ruby library that rhc uses to check whether the application has been added queries the first nameserver in the list and stops. It does not loop through all the nameservers when multiple are configured.
The OS (Windows, Mac or RHEL) does not have such an issue, as it resolves the broker and node.

I have attached two screenshots. The first one shows the case where the nameserver entry for the OpenShift DNS is not the first one in /etc/resolv.conf (here the retry issue occurs). The second one shows the case after I moved the nameserver entry to the top of the list in /etc/resolv.conf, where the issue disappears.

Comment 5 Veer Muchandi 2014-05-26 13:14:38 UTC
Created attachment 899308 [details]
Retry issue when the nameserver entry is not the first one in the list

This was tested on a RHEL box. The issue is pretty common on Mac OS X, as we configure the nameserver in the /etc/resolver/<domain> file, so it always ends up not being the first one in the nameserver list.

Comment 6 Veer Muchandi 2014-05-26 13:15:18 UTC
Created attachment 899309 [details]
Retries error resolved when you move nameserver entry to the top in /etc/resolv.conf

Comment 7 Veer Muchandi 2014-05-26 13:17:52 UTC
(In reply to Veer Muchandi from comment #4)
> Today I tried with RHEL client as well and the problem is consistent across
> all the clients (whether it is Windows, Mac or RHEL). If the nameserver
> entry that points to the OpenshiftDNS is not the first on on the list in
> /etc/resolv.conf, we get this retry issue. 
> This means the ruby library that rhc uses for checking whether the
> application added or not is reaching out to the first nameserver in the list
> and stops. It is not looping through all the nameservers if multiple are
> configured.


> OS (Windows, Mac or RHEL) does not have such an issue as they are resolving
> to the broker and node.
(In the above statement, I meant to say that it is not an OS problem but a Ruby problem.)


> 
> I have attached two screenshots. The first one is when the nameserver entry
> for OpenshiftDNS is not the first one in /etc/resolv.conf (here there is
> retry issue) and the second one when I moved the nameserver entry to top of
> the list in /etc/resolv.conf and the issue disappears.

Comment 8 Miciah Dashiel Butler Masters 2014-05-30 15:15:21 UTC
>Today I tried with RHEL client as well and the problem is consistent across all the clients (whether it is Windows, Mac or RHEL). If the nameserver entry that points to the OpenshiftDNS is not the first on on the list in /etc/resolv.conf, we get this retry issue. 

On RHEL, if the nameserver that points to the OpenShift DNS is not first in /etc/resolv.conf, can other programs besides rhc resolve your OpenShift hosts' names? The expected behaviour, and behaviour I see (I just checked with ping and host), is that they cannot.  That is, the OS trusts the first nameserver's response and doesn't try subsequent nameservers:

    [root@broker ~]# ping -c 1 broker.hosts.miciah-ose262.example.com
    PING broker.hosts.miciah-ose262.example.com (172.16.14.29) 56(84) bytes of data.
    64 bytes from miciah-ose262.novalocal (172.16.14.29): icmp_seq=1 ttl=64 time=0.040 ms
    
    --- broker.hosts.miciah-ose262.example.com ping statistics ---
    1 packets transmitted, 1 received, 0% packet loss, time 1ms
    rtt min/avg/max/mdev = 0.040/0.040/0.040/0.000 ms
    [root@broker ~]# host broker.hosts.miciah-ose262.example.com
    broker.hosts.miciah-ose262.example.com has address 172.16.14.29
    [root@broker ~]# vim /etc/resolv.conf
    (Re-order the nameservers here so that the first nameserver cannot resolve broker.hosts.miciah-ose262.example.com.)
    [root@broker ~]# ping -c 1 broker.hosts.miciah-ose262.example.com
    ping: unknown host broker.hosts.miciah-ose262.example.com
    [root@broker ~]# host broker.hosts.miciah-ose262.example.com
    Host broker.hosts.miciah-ose262.example.com not found: 3(NXDOMAIN)
    [root@broker ~]#
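
The same check can be made per nameserver, independent of their order in /etc/resolv.conf. The following is a rough Ruby sketch using the standard resolv library that rhc relies on. The helper names query_one_server and probe_nameservers are hypothetical; the lookup is passed in as a parameter so the reporting logic can be tried without network access.

```ruby
require "resolv"

# Ask exactly one nameserver, bypassing the system's configured list.
# Resolv::DNS#getaddresses returns an empty Array when no address is found.
def query_one_server(nameserver, hostname)
  Resolv::DNS.open(nameserver: [nameserver]) do |dns|
    dns.getaddresses(hostname).map(&:to_s)
  end
end

# Build a report of nameserver => addresses it returned for the hostname.
# `lookup` defaults to a real DNS query but can be replaced with any
# callable that takes (nameserver, hostname) and returns an Array.
def probe_nameservers(nameservers, hostname, lookup: method(:query_one_server))
  nameservers.each_with_object({}) do |nameserver, report|
    report[nameserver] = lookup.call(nameserver, hostname)
  end
end
```

Calling probe_nameservers with the addresses from /etc/resolv.conf and an application hostname shows which servers know the name: an empty list for the first entry and a non-empty list for a later one matches the situation discussed in this report.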

GNU/Linux's behaviour is standards-compliant and makes sense.  I knew that OS X sometimes behaved differently from GNU/Linux; I thought OS X was just idiotic^H^H^Hsyncratic.  I am surprised that Microsoft Windows also behaves differently from GNU/Linux.

Can you reproduce the issue with OpenShift Online? (You can check by putting your OpenShift Enterprise DNS server first, because it will be unable to resolve the OpenShift Online names, and then trying to create an application to see whether rhc tries the second nameserver.) If so, I'm sure upstream would be happy to figure out a fix or workaround.

Comment 9 Veer Muchandi 2014-06-03 19:16:21 UTC
(In reply to Miciah Dashiel Butler Masters from comment #8)
> >Today I tried with RHEL client as well and the problem is consistent across all the clients (whether it is Windows, Mac or RHEL). If the nameserver entry that points to the OpenshiftDNS is not the first on on the list in /etc/resolv.conf, we get this retry issue. 
> 
> On RHEL, if the nameserver that points to the OpenShift DNS is not first in
> /etc/resolv.conf, can other programs besides rhc resolve your OpenShift
> hosts' names? The expected behaviour, and behaviour I see (I just checked
> with ping and host), is that they cannot.  That is, the OS trusts the first
> nameserver's response and doesn't try subsequent nameservers:
> 
>     [root@broker ~]# ping -c 1 broker.hosts.miciah-ose262.example.com
>     PING broker.hosts.miciah-ose262.example.com (172.16.14.29) 56(84) bytes
> of data.
>     64 bytes from miciah-ose262.novalocal (172.16.14.29): icmp_seq=1 ttl=64
> time=0.040 ms
>     
>     --- broker.hosts.miciah-ose262.example.com ping statistics ---
>     1 packets transmitted, 1 received, 0% packet loss, time 1ms
>     rtt min/avg/max/mdev = 0.040/0.040/0.040/0.000 ms
>     [root@broker ~]# host broker.hosts.miciah-ose262.example.com
>     broker.hosts.miciah-ose262.example.com has address 172.16.14.29
>     [root@broker ~]# vim /etc/resolv.conf
>     (Re-order the nameservers here so that the first nameserver cannot
> resolve broker.hosts.miciah-ose262.example.com.)
>     [root@broker ~]# ping -c 1 broker.hosts.miciah-ose262.example.com
>     ping: unknown host broker.hosts.miciah-ose262.example.com
>     [root@broker ~]# host broker.hosts.miciah-ose262.example.com
>     Host broker.hosts.miciah-ose262.example.com not found: 3(NXDOMAIN)
>     [root@broker ~]#
> 
> GNU/Linux's behaviour is standards-compliant and makes sense.  I knew that
> OS X sometimes behaved differently from GNU/Linux; I thought OS X was just
> idiotic^H^H^Hsyncratic.  I am surprised that Microsoft Windows also behaves
> differently from GNU/Linux.
> 
> Can you reproduce the issue with OpenShift Online? (You can check by putting
> your OpenShift Enterprise DNS server first, because it will be unable to
> resolve the OpenShift Online names, and then trying to create an application
> to see whether rhc tries the second nameserver.) If so, I'm sure upstream
> would be happy to figure out a fix or workaround.

Not sure why your RHEL box is stopping at the first nameserver. See the output below from my machine.

I have configured 209.132.178.95 as the second nameserver. Ping is able to reach the broker at this IP address. I also tried pinging the application from this broker machine, where that server is not configured as the first nameserver on the list, and I am able to reach it.

[veer@broker workspace]$ cat /etc/resolv.conf
; generated by /sbin/dhclient-script
search hsd1.ga.comcast.net. hosts.ipacc.com
nameserver 209.132.179.78
nameserver 209.132.178.95
nameserver 192.168.1.1
nameserver 75.75.75.75
nameserver 75.75.76.76
 
[veer@broker workspace]$ ping -c 2 broker.hosts.pocteam.com
PING broker.hosts.pocteam.com (209.132.178.95) 56(84) bytes of data.
64 bytes from vm-95-178-132-209.osop.rhcloud.com (209.132.178.95): icmp_seq=1 ttl=46 time=64.1 ms
64 bytes from vm-95-178-132-209.osop.rhcloud.com (209.132.178.95): icmp_seq=2 ttl=46 time=66.0 ms
 
--- broker.hosts.pocteam.com ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1067ms
rtt min/avg/max/mdev = 64.107/65.094/66.081/0.987 ms


[veer@broker workspace]$ ping pocteamphptest2-veer1.ose.pocteam.com
PING node.hosts.pocteam.com (209.132.178.93) 56(84) bytes of data.
64 bytes from vm-93-178-132-209.osop.rhcloud.com (209.132.178.93): icmp_seq=1 ttl=46 time=63.7 ms
64 bytes from vm-93-178-132-209.osop.rhcloud.com (209.132.178.93): icmp_seq=2 ttl=46 time=71.6 ms
64 bytes from vm-93-178-132-209.osop.rhcloud.com (209.132.178.93): icmp_seq=3 ttl=46 time=65.3 ms
64 bytes from vm-93-178-132-209.osop.rhcloud.com (209.132.178.93): icmp_seq=4 ttl=46 time=64.2 ms
^C
--- node.hosts.pocteam.com ping statistics ---
4 packets transmitted, 4 received, 0% packet loss, time 3307ms
rtt min/avg/max/mdev = 63.719/66.241/71.612/3.172 ms
 

Comment 10 Veer Muchandi 2014-06-03 19:30:23 UTC
> Can you reproduce the issue with OpenShift Online? (You can check by putting
> your OpenShift Enterprise DNS server first, because it will be unable to
> resolve the OpenShift Online names, and then trying to create an application
> to see whether rhc tries the second nameserver.) If so, I'm sure upstream
> would be happy to figure out a fix or workaround.

Yes, I tested with OpenShift Online and I am able to reproduce this issue there as well. I put the OpenShift Enterprise DNS server first in the nameserver list and changed the rhc setup to point to OpenShift Online. When I try to create an application, it keeps trying to resolve the name. See below.

[veer@broker ~]$ rhc app create onlinephptst php-5.3
Application Options
-------------------
  Domain:     veer1
  Cartridges: php-5.3
  Gear Size:  default
  Scaling:    no
 
Creating application 'onlinephptst' ... done
 
 
Waiting for your DNS name to be available ...     retry # 1 - Waiting for DNS: onlinephptst-veer1.rhcloud.com
    retry # 2 - Waiting for DNS: onlinephptst-veer1.rhcloud.com
    retry # 3 - Waiting for DNS: onlinephptst-veer1.rhcloud.com
    retry # 4 - Waiting for DNS: onlinephptst-veer1.rhcloud.com
    retry # 5 - Waiting for DNS: onlinephptst-veer1.rhcloud.com
    retry # 6 - Waiting for DNS: onlinephptst-veer1.rhcloud.com
    retry # 7 - Waiting for DNS: onlinephptst-veer1.rhcloud.com
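
The retry lines above are consistent with a polling loop of roughly the following shape. This is only a sketch: rhc's actual implementation is not shown in this report, and the function name wait_for_dns, the default of 7 attempts, and the exponential backoff are assumptions made for illustration. The resolver and the pause are passed in so the loop can be exercised without the network or real sleeping.

```ruby
# Poll until `resolve` returns a non-empty result or the attempts run out.
# Returns the attempt number on which the name resolved, or nil if it
# never did.
def wait_for_dns(hostname, resolve:, attempts: 7, pause: ->(seconds) { sleep(seconds) })
  1.upto(attempts) do |attempt|
    return attempt unless resolve.call(hostname).empty?

    # Hypothetical exponential backoff between attempts; no pause is
    # needed after the last one.
    pause.call(2**attempt) unless attempt == attempts
  end
  nil
end
```

If the resolver behind `resolve` only ever consults a nameserver that answers NXDOMAIN for the application name, every attempt returns an empty result and the loop runs to its limit, which is the behaviour seen in the transcript.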