Bug 1482239 - System container install on atomic host - push image to docker registry fails
Summary: System container install on atomic host - push image to docker registry fails
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assignee: Giuseppe Scrivano
QA Contact: Vikas Laad
URL:
Whiteboard:
Depends On: 1489913 1489959
Blocks: 1463574
 
Reported: 2017-08-16 20:19 UTC by Vikas Laad
Modified: 2017-11-28 22:07 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-28 22:07:08 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:3188 0 normal SHIPPED_LIVE Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update 2017-11-29 02:34:54 UTC

Description Vikas Laad 2017-08-16 20:19:27 UTC
Description of problem:
After creating a system container install by passing the following flags to openshift-ansible:
openshift_use_system_containers: true
system_images_registry: registry.access.redhat.com
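For reference, in an openshift-ansible inventory these would usually be set as group variables under [OSEv3:vars]; the INI form below is only a sketch of the same two settings, nothing else in it comes from this report:

[OSEv3:vars]
openshift_use_system_containers=true
system_images_registry=registry.access.redhat.com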

I can see the registry pod running in the default project. When I try to create a new app, the image push fails:
Pushing image docker-registry.default.svc:5000/vlaad/cakephp-mysql-example:latest ...
Registry server Address: 
Registry server User Name: serviceaccount
Registry server Email: serviceaccount
Registry server Password: <<non-empty>>
error: build error: Failed to push image: Get https://docker-registry.default.svc:5000/v1/_ping: dial tcp: lookup docker-registry.default.svc on 172.31.55.188:53: no such host

I also see a warning in the default project events:

25m        25m         1         docker-registry-1-deploy                      Pod                                                         Warning   FailedMount               kubelet, ip-172-31-42-188.us-west-2.compute.internal      Unable to mount volumes for pod "docker-registry-1-deploy_default(e6d98103-82bb-11e7-a589-02618f0bef2c)": timeout expired waiting for volumes to attach/mount for pod "default"/"docker-registry-1-deploy". list of unattached/unmounted volumes=[deployer-token-j0qw7]

oc get pods -n default
docker-registry-1-nx952    1/1       Running   0          28m
registry-console-1-691hc   1/1       Running   0          27m
router-1-t3gwf             1/1       Running   0          29m

Version-Release number of selected component (if applicable):
-bash-4.2# openshift version
openshift v3.6.173.0.5
kubernetes v1.6.1+5115d708d7
etcd 3.2.1


How reproducible:
Always

Steps to Reproduce:
1. Create a system container install on atomic host
2. create cakephp-mysql app
3. see build log

Actual results:
Build fails due to registry push failure

Expected results:
Build should pass

Additional info:
I will attach the openshift-ansible logs and all the events from the default project.

Comment 3 Vikas Laad 2017-08-17 12:41:20 UTC
-bash-4.2# oc get svc -n default
NAME               CLUSTER-IP       EXTERNAL-IP   PORT(S)                   AGE
docker-registry    172.26.84.149    <none>        5000/TCP                  16h
kubernetes         172.24.0.1       <none>        443/TCP,53/UDP,53/TCP     17h
registry-console   172.24.127.251   <none>        9000/TCP                  16h
router             172.25.82.116    <none>        80/TCP,443/TCP,1936/TCP   16h

Comment 4 Michal Fojtik 2017-08-18 08:08:09 UTC
It looks like the DNS name resolved to a wrong IP address. Clayton, do you know if this is a known issue?

Comment 5 Michal Fojtik 2017-08-23 09:13:28 UTC
Moving this to the networking team to investigate the DNS issues (I don't think there is anything wrong in the Docker Registry code).

Comment 6 Ben Bennett 2017-08-31 15:08:24 UTC
It looks like docker-registry.default.svc is just not resolvable. How is DNS set up on that node? Is it using dnsmasq to do split-horizon resolution?

If you ssh to the node and try:
  - nslookup docker-registry.default.svc
  - nslookup docker-registry.default.svc.cluster.local

What happens?

Also, please attach /etc/resolv.conf from the node.

Comment 7 Ben Bennett 2017-08-31 15:12:18 UTC
Also, please grab /etc/dnsmasq.d/origin-dns.conf from the node.

Also from any pod that is running on the node, please grab /etc/resolv.conf from _inside_ the pod.
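For comparison, a pod's /etc/resolv.conf in this kind of setup usually points at the node's DNS and carries the cluster search domains; the values below are only illustrative, not taken from this cluster:

# typical pod /etc/resolv.conf (illustrative values only)
nameserver 10.0.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5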

Comment 10 Scott Dodson 2017-09-08 15:27:14 UTC
Ben,

In 3.6, the dnsmasq config for cluster.local and in-addr.arpa is added dynamically when the node starts, by sending a dbus signal to dnsmasq. In case dnsmasq gets restarted, we also create /etc/dnsmasq.d/node-dnsmasq.conf, which contains the same configuration values. This file is removed when the node service is stopped because we no longer want to intercept in-addr.arpa queries.

While the node service is running, these queries should go to the node's DNS service running on 127.0.0.1:

server=/in-addr.arpa/127.0.0.1
server=/cluster.local/127.0.0.1

The node will have these configuration values specified in /etc/origin/node/node-config.yaml:

dnsBindAddress: 127.0.0.1:53
dnsRecursiveResolvConf: /etc/origin/node/resolv.conf

The second is a resolv.conf which contains the host's default resolvers, so that the node can break the loop for queries it has to recurse.
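To make that concrete, the expected pieces would look roughly like this (the upstream resolver address is only a placeholder):

# /etc/dnsmasq.d/node-dnsmasq.conf -- present while the node service is running
server=/in-addr.arpa/127.0.0.1
server=/cluster.local/127.0.0.1

# /etc/origin/node/resolv.conf -- the host's default resolvers (placeholder address)
nameserver 192.0.2.1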




Vikas,

Can you provide all of the contents of /etc/dnsmasq.d/ by running `more /etc/dnsmasq.d/* | cat`, and also /etc/origin/node/node-config.yaml?

And a log file from `journalctl --no-pager -u dnsmasq`

Comment 11 Vikas Laad 2017-09-08 19:48:32 UTC
I am running into the following bz and can't create a cluster because of it:

https://bugzilla.redhat.com/show_bug.cgi?id=1489959

Comment 12 Gaoyun Pei 2017-09-11 08:51:43 UTC
I also hit this error on a 3.7 cluster installed with system containers. Compared with an rpm-installed env, the /etc/dnsmasq.d/node-dnsmasq.conf file is missing on the nodes.

[root@qe-gpei-node-zone1-primary-1 ~]# more /etc/dnsmasq.d/* |cat
::::::::::::::
/etc/dnsmasq.d/origin-dns.conf
::::::::::::::
no-resolv
domain-needed
no-negcache
max-cache-ttl=1
enable-dbus
bind-interfaces
listen-address=10.240.0.54
::::::::::::::
/etc/dnsmasq.d/origin-upstream-dns.conf
::::::::::::::
server=169.254.169.254


After copying /etc/origin/node/node-dnsmasq.conf to /etc/dnsmasq.d/node-dnsmasq.conf and restarting the dnsmasq service, docker-registry.default.svc could be resolved on the nodes.
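For anyone else hitting this, the manual workaround above amounts to running the following on each affected node (a sketch based on this comment; the paths are the ones quoted above):

cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/node-dnsmasq.conf
systemctl restart dnsmasq
# verify:
nslookup docker-registry.default.svc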


It seems the atomic-openshift-node systemd unit file is lacking the dnsmasq-related configuration.
[root@qe-gpei-node-zone1-primary-1 ~]# cat /etc/systemd/system/atomic-openshift-node.service
[Unit]
After=docker.service
After=openvswitch.service
Wants=docker.service
After=atomic-openshift-node-dep.service
After=atomic-openshift-master.service

[Service]
EnvironmentFile=/etc/sysconfig/atomic-openshift-node
EnvironmentFile=/etc/sysconfig/atomic-openshift-node-dep
ExecStartPre=/bin/bash -c 'export -p > /run/atomic-openshift-node-env'
ExecStart=/bin/runc --systemd-cgroup run 'atomic-openshift-node'
ExecStop=/bin/runc --systemd-cgroup kill 'atomic-openshift-node'
SyslogIdentifier=atomic-openshift-node
Restart=always
RestartSec=5s
WorkingDirectory=/sysroot/ostree/deploy/rhel-atomic-host/var/lib/containers/atomic/atomic-openshift-node.0
RuntimeDirectory=atomic-openshift-node

[Install]
WantedBy=docker.service


The service template inside the latest node image:
[root@qe-gpei-node-zone2-primary-1 ~]# docker run --entrypoint cat registry.x.com/openshift3/node:v3.7.0-0.125.0 /exports/service.template
[Unit]
After=${DOCKER_SERVICE}
After=${OPENVSWITCH_SERVICE}
Wants=${DOCKER_SERVICE}
After=$NAME-dep.service
After=${MASTER_SERVICE}

[Service]
EnvironmentFile=/etc/sysconfig/$NAME
EnvironmentFile=/etc/sysconfig/$NAME-dep
ExecStartPre=/bin/bash -c 'export -p > /run/$NAME-env'
ExecStart=$EXEC_START
ExecStop=$EXEC_STOP
SyslogIdentifier=$NAME
Restart=always
RestartSec=5s
WorkingDirectory=$DESTDIR
RuntimeDirectory=${NAME}

[Install]
WantedBy=docker.service

Comment 13 Gaoyun Pei 2017-09-11 08:55:57 UTC
This issue is blocking QE's testing of the system container environment.

Version-Release number of selected component (if applicable):
openshift-ansible-3.7.0-0.125.0.git.0.91043b6.el7.noarch.rpm
ansible-2.3.2.0-2.el7.noarch
openshift3/node:v3.7.0-0.125.0

Comment 15 Giuseppe Scrivano 2017-09-15 15:30:46 UTC
Proposed PRs here:

https://github.com/openshift/origin/pull/16378
https://github.com/openshift/openshift-ansible/pull/5429

In the meantime, could you verify whether adding these lines to your /etc/systemd/system/atomic-openshift-node.service solves the issue for you?

ExecStartPre=/usr/bin/cp /etc/origin/node/node-dnsmasq.conf /etc/dnsmasq.d/
ExecStartPre=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:/in-addr.arpa/127.0.0.1,/cluster.local/127.0.0.1
ExecStopPost=/usr/bin/rm /etc/dnsmasq.d/node-dnsmasq.conf
ExecStopPost=/usr/bin/dbus-send --system --dest=uk.org.thekelleys.dnsmasq /uk/org/thekelleys/dnsmasq uk.org.thekelleys.SetDomainServers array:string:

and then:

systemctl daemon-reload
systemctl restart atomic-openshift-node
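Assuming the drop-in lines above are in place, a quick sanity check after the restart would be to confirm the drop-in file exists and re-run the lookups requested earlier in this bug:

ls /etc/dnsmasq.d/node-dnsmasq.conf
nslookup docker-registry.default.svc
nslookup docker-registry.default.svc.cluster.local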

Comment 17 Vikas Laad 2017-09-27 15:23:13 UTC
Verified in the following version:

openshift v3.7.0-0.127.0
kubernetes v1.7.0+80709908fd
etcd 3.2.1


Tested all the quickstart apps; all builds completed fine.

Comment 20 errata-xmlrpc 2017-11-28 22:07:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

