Bug 1401891

Summary: docker-registry login fails when the first DNS server listed in the docker-registry container is down.
Product: OpenShift Container Platform
Reporter: Johnny Liu <jialiu>
Component: Image Registry
Assignee: Oleg Bulatov <obulatov>
Status: CLOSED WORKSFORME
QA Contact: ge liu <geliu>
Severity: low
Priority: low
Version: 3.4.1
CC: aos-bugs, bparees, dyan, jialiu, jokerman, mfojtik, miminar, mmccomas
Target Milestone: ---   
Target Release: 3.7.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2018-01-11 17:15:14 UTC
Type: Bug
Attachments:
docker registry log (flags: none)

Description Johnny Liu 2016-12-06 11:05:03 UTC
Description of problem:
See the following details.

Version-Release number of selected component (if applicable):
openshift v3.4.0.30+e10cc28
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0


How reproducible:
Always

Steps to Reproduce:
1. Set openshift_dns_ip=<a-non-existing-ip>, e.g: 172.30.0.1
2. After installation, log into docker-registry via oc rsh command.
# oc rsh docker-registry-2-n4ms1
sh-4.2$ cat /etc/resolv.conf 
search default.svc.cluster.local svc.cluster.local cluster.local openstacklocal lab.sjc.redhat.com
nameserver 172.30.0.2
nameserver 192.168.2.15
options ndots:5
# curl google.com  => PASS
3. Try to log in to this docker-registry via the docker command.


Actual results:
# docker login -u unused  -p $(oc sa get-token builder -n openshift3) 172.30.19.67:5000
Error response from daemon: Get http://172.30.19.67:5000/v2/: Get http://172.30.19.67:5000/openshift/token?account=unused&client_id=docker&offline_token=true: net/http: request canceled (Client.Timeout exceeded while awaiting headers) (Client.Timeout exceeded while awaiting headers)

An sti build also fails:
# oc logs django-psql-example-1-build -n install-test
<--snip-->
Pushing image 172.30.19.67:5000/install-test/django-psql-example:latest ...
Registry server Address: 
Registry server User Name: serviceaccount
Registry server Email: serviceaccount
Registry server Password: <<non-empty>>
error: build error: Failed to push image: net/http: request canceled (Client.Timeout exceeded while awaiting headers)


Expected results:
When the first DNS server is down but the second DNS server is still available, docker login should succeed.

Additional info:
There are several similar issues on GitHub:
https://github.com/docker/docker/issues/22635#issuecomment-260063252
https://github.com/concourse/concourse/issues/374#issuecomment-211466240

After I corrected dnsIP in node-config.yaml, restarted the node service, and redeployed the docker-registry pod so that a working DNS server was listed first in the container's /etc/resolv.conf, this issue disappeared.
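
For reference, here is a minimal Python sketch of how a stub resolver would read the resolv.conf shown in step 2. It assumes the usual resolv.conf(5) semantics: nameservers are tried in file order, and a name with fewer dots than ndots goes through the search list before being queried as-is. The function names are hypothetical, and the hostname is the master name that appears in the registry log later in this report.

```python
def parse_resolv_conf(text):
    """Return (nameservers, search_domains, options) in file order."""
    nameservers, search, options = [], [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":  # blank line or comment
            continue
        keyword, *values = line.split()
        if keyword == "nameserver" and values:
            nameservers.append(values[0])
        elif keyword == "search":
            search = values
        elif keyword == "options":
            for opt in values:
                key, _, val = opt.partition(":")
                options[key] = int(val) if val.isdigit() else True
    return nameservers, search, options


def candidate_names(name, search, ndots=1):
    """List the names a resolver would query, in order, for one hostname."""
    if name.endswith("."):  # trailing dot: already fully qualified
        return [name]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") >= ndots:
        return [name] + expanded  # enough dots: try the name as-is first
    return expanded + [name]      # too few dots: search list first


# Contents of /etc/resolv.conf copied from step 2 of this report.
RESOLV_CONF = """\
search default.svc.cluster.local svc.cluster.local cluster.local openstacklocal lab.sjc.redhat.com
nameserver 172.30.0.2
nameserver 192.168.2.15
options ndots:5
"""

nameservers, search, options = parse_resolv_conf(RESOLV_CONF)
print("query order:", nameservers)
for candidate in candidate_names("openshift-136.lab.sjc.redhat.com", search, options["ndots"]):
    print("lookup:", candidate)
```

The master hostname has four dots, fewer than ndots:5, so under these assumptions the five search-domain expansions are queried before the bare name, and each of those lookups would start at the first (unreachable) nameserver.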

Comment 1 Scott Dodson 2016-12-06 14:32:34 UTC
Jianlin,

> 1. Set openshift_dns_ip=<a-non-existing-ip>, e.g: 172.30.0.1

So you're saying that isn't a valid ip? That's not the kubernetes service ip?

Can you provide the logs from the registry too?

Given this sounds like a deliberate misconfiguration I'm going to mark this UpcomingRelease.

Comment 2 Johnny Liu 2016-12-07 03:42:49 UTC
Created attachment 1228815 [details]
docker registry log

Comment 3 Johnny Liu 2016-12-07 03:43:56 UTC
(In reply to Scott Dodson from comment #1)
> Jianlin,
> 
> > 1. Set openshift_dns_ip=<a-non-existing-ip>, e.g: 172.30.0.1
> 
> So you're saying that isn't a valid ip? That's not the kubernetes service ip?
> 
Sorry for the typo. I actually set "openshift_dns_ip=172.30.0.2".

> Can you provide the logs from the registry too?
Attached.

Comment 4 Scott Dodson 2017-06-09 02:37:10 UTC
Re-assigning to the Image Registry component. However, reviewing the logs attached in comment #3, this doesn't look like a DNS failure but rather a failure to connect to the API server, based on this log entry:

time="2016-12-07T02:53:43.511149145Z" level=debug msg="invalid token: Get https://openshift-136.lab.sjc.redhat.com:443/oapi/v1/users/~: dial tcp: i/o timeout" go.version=go1.7.3 http.request.host="172.30.245.180:5000" http.request.id=80af04bb-3e33-4502-9c59-5c78c3172260 http.request.method=GET http.request.remoteaddr="10.128.0.1:40022" http.request.uri="/openshift/token?account=unused&client_id=docker&offline_token=true" http.request.useragent="docker/1.12.3 go/go1.6.2 git-commit/8b91553-redhat kernel/3.10.0-514.2.2.el7.x86_64 os/linux arch/amd64 UpstreamClient(Docker-Client/1.12.3 \\(linux\\))" instance.id=f90a06b7-512b-4633-96eb-b8517dce4b08
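
As an aid for reading entries like this one, the following Python sketch splits a key=value log line into fields so the msg text can be inspected on its own. It assumes the quoting in the line is compatible with shell-style splitting (shlex); parse_logfmt is a hypothetical helper, and LOG_LINE is an abridged copy of the entry above.

```python
import shlex


def parse_logfmt(line):
    """Split a key=value log line into a dict, honouring double quotes."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


# Abridged copy of the registry log entry quoted in this comment.
LOG_LINE = (
    'time="2016-12-07T02:53:43.511149145Z" level=debug '
    'msg="invalid token: Get https://openshift-136.lab.sjc.redhat.com:443/oapi/v1/users/~: '
    'dial tcp: i/o timeout" go.version=go1.7.3 http.request.method=GET'
)

fields = parse_logfmt(LOG_LINE)
print(fields["level"])
print(fields["msg"])
```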

If it is a DNS issue, then it'd be the registry's DNS resolver library not properly failing over from one DNS server to the other.
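
To make "failing over" concrete, here is a minimal Python sketch of the expected behaviour: each nameserver is tried in order, and a timeout on one moves on to the next. This is not the registry's resolver code. resolve_with_failover, QueryTimeout and fake_query are hypothetical names, the attempts default is an assumption, and the returned address is a documentation-range placeholder rather than a real answer.

```python
class QueryTimeout(Exception):
    """Raised by the injected query function when a server does not answer."""


def resolve_with_failover(name, nameservers, query, attempts=2):
    """Try every nameserver in order, for up to `attempts` rounds.

    `query(server, name)` is supplied by the caller and must either
    return an address or raise QueryTimeout.
    Returns (address, list_of_servers_tried).
    """
    tried = []
    for _ in range(attempts):
        for server in nameservers:
            tried.append(server)
            try:
                return query(server, name), tried
            except QueryTimeout:
                continue  # fall through to the next nameserver
    raise QueryTimeout(f"no nameserver answered for {name}; tried {tried}")


def fake_query(server, name):
    """Stand-in for a real DNS query: the first server from the report is down."""
    if server == "172.30.0.2":
        raise QueryTimeout(server)
    return "192.0.2.10"  # placeholder address (documentation range)


address, tried = resolve_with_failover(
    "openshift-136.lab.sjc.redhat.com",
    ["172.30.0.2", "192.168.2.15"],
    fake_query,
)
print(address, tried)
```

A resolver that behaves like this reaches the second nameserver after one timeout on the first; whether the caller is still waiting at that point depends on the caller's own timeout.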

Comment 5 Oleg Bulatov 2017-07-10 16:02:57 UTC
With v3.6.0-alpha.2 (go v1.7.5), it tries another DNS server after waiting for 20 seconds. Which timeout do you have for master API calls from the registry?
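
The reason the timeout matters can be sketched with simple arithmetic in Python: if the caller gives up before the 20-second failover completes, the second DNS server is never consulted. The 20 seconds comes from the observation above; the client timeouts (15 and 30 seconds) and the 0.5 seconds of remaining work are hypothetical values chosen only for illustration, and api_call_outcome is a hypothetical helper.

```python
def api_call_outcome(dns_failover_delay_s, client_timeout_s, remaining_work_s=0.0):
    """Classify one API call made while the first nameserver is down.

    The call needs the failover delay plus whatever time the rest of the
    request takes; it succeeds only if that fits inside the client timeout.
    """
    needed_s = dns_failover_delay_s + remaining_work_s
    return "ok" if needed_s < client_timeout_s else "timeout"


DNS_FAILOVER_DELAY_S = 20  # delay observed before the second DNS server is tried

for client_timeout_s in (15, 30):  # hypothetical client timeouts
    outcome = api_call_outcome(DNS_FAILOVER_DELAY_S, client_timeout_s, remaining_work_s=0.5)
    print(f"client timeout {client_timeout_s}s -> {outcome}")
```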

Comment 6 Johnny Liu 2017-07-11 02:38:55 UTC
(In reply to Oleg Bulatov from comment #5)
> v3.6.0-alpha.2, go v1.7.5 - it tries to use another DNS server after waiting
> for 20 second. Which timeout do you have for master api calls from the
> registry?

I am not deeply familiar with how the API and the registry interact. Can you tell me how to get that timeout value?

Comment 8 Ben Parees 2018-01-10 04:37:06 UTC
Much has probably changed in this space since 3.6. Can you confirm it's still an issue in 3.9?

Comment 9 Johnny Liu 2018-01-11 10:44:25 UTC
In 3.9, I have no way to reproduce the scenario from the initial 3.4 report: only dnsIP is now kept in the pod's /etc/resolv.conf, and there is no way to add a second nameserver.

# oc rsh docker-registry-2-bhlbg
sh-4.2$ cat /etc/resolv.conf 
nameserver 172.16.120.117
search default.svc.cluster.local svc.cluster.local cluster.local openstacklocal bb.com
options ndots:5

So this scenario is not applicable to version 3.9.

However, I did some negative testing: I set dnsIP to a wrong IP and then tried docker login, which succeeded.

# oc rsh docker-registry-2-wrcvw
sh-4.2$ cat /etc/resolv.conf 
nameserver 172.16.120.7
search default.svc.cluster.local svc.cluster.local cluster.local openstacklocal bb.com
options ndots:5


# docker login -u unused  -p $(oc sa get-token builder -n openshift) 172.31.43.231:5000
Login Succeeded


# openshift version
openshift v3.9.0-0.16.0
kubernetes v1.9.0-beta1
etcd 3.2.8

Comment 10 Ben Parees 2018-01-11 17:15:14 UTC
Ok, going to close this out then.  Sorry we let it sit so long.

(I am surprised it works even with a completely wrong dns value).