Description of problem:
Upgrading to 3.11 on an Atomic host fails during the "Install or Update node system container" task, because the docker service was stopped in a previous task.

TASK [openshift_node : Install or Update node system container] ****************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/node_system_container_install.yml:2
Using module file /usr/share/ansible/openshift-ansible/roles/lib_openshift/library/oc_atomic_container.py
<master.example.com> ESTABLISH SSH CONNECTION FOR USER: cloud-user
<master.example.com> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o PreferredAuthentications=publickey -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=cloud-user -o ConnectTimeout=30 -o ControlPath=/root/.ansible/cp/%h-%r-%p master-01.ocp03.cacc.ch '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-miogwipkuyfmafemkqmtuyfmamhnzdog; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<master.example.com> (1, '\n{"msg": "time=\\"2018-10-19T16:43:36Z\\" level=fatal msg=\\"Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\\" \\n\\n", "failed": true, "rc": 1, "invocation": {"module_args": {"image": "docker:registry.redhat.io/openshift3/ose-node:v3.11", "values": ["DNS_DOMAIN=cluster.local", "DOCKER_SERVICE=docker.service", "ADDTL_MOUNTS=,{\\"source\\": \\"/var/lib/origin/.docker\\", \\"destination\\": \\"/root/.docker\\", \\"type\\": \\"bind\\", \\"options\\": [\\"ro\\", \\"bind\\"]}"], "state": "latest", "name": "atomic-openshift-node"}}}\n', '')
fatal: [master-01.ocp03.cacc.ch]: FAILED! => {
    "changed": false,
    "invocation": {
        "module_args": {
            "image": "docker:registry.redhat.io/openshift3/ose-node:v3.11",
            "name": "atomic-openshift-node",
            "state": "latest",
            "values": [
                "DNS_DOMAIN=cluster.local",
                "DOCKER_SERVICE=docker.service",
                "ADDTL_MOUNTS=,{\"source\": \"/var/lib/origin/.docker\", \"destination\": \"/root/.docker\", \"type\": \"bind\", \"options\": [\"ro\", \"bind\"]}"
            ]
        }
    },
    "msg": "time=\"2018-10-19T16:43:36Z\" level=fatal msg=\"Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\" \n\n",
    "rc": 1
}

This is what the task is doing:

- name: Install or Update node system container
  oc_atomic_container:
    name: "{{ openshift_service_type }}-node"
    image: "{{ system_osn_image }}"
    values:
    - "DNS_DOMAIN={{ openshift.common.dns_domain }}"
    - "DOCKER_SERVICE={{ openshift_docker_service_name }}.service"
    - 'ADDTL_MOUNTS={{ l_node_syscon_add_mounts2 }}'
    state: latest
  vars:
    l_node_syscon_auth_mounts_l: "{{ l_bind_docker_reg_auth | ternary(openshift_node_syscon_auth_mounts_l,[]) }}"
    l_node_syscon_add_mounts_l: "{{ openshift_node_syscon_add_mounts_l | union(l_node_syscon_auth_mounts_l) }}"
    l_node_syscon_add_mounts: ",{{ l_node_syscon_add_mounts_l | lib_utils_oo_l_of_d_to_csv }}"
    l_node_syscon_add_mounts2: "{{ (l_node_syscon_add_mounts != ',') | bool | ternary(l_node_syscon_add_mounts,'') }}"

Basically, the playbook stopped the docker service to kill existing static pods. Here is the task that gets executed:

- name: stop docker to kill static pods
  service:
    name: docker
    state: stopped
  register: l_openshift_node_upgrade_docker_stop_result
  until: not (l_openshift_node_upgrade_docker_stop_result is failed)
  retries: 3
  delay: 30
  when: >
    inventory_hostname in groups['oo_masters_to_config']
    or (l_docker_upgrade is defined and l_docker_upgrade | bool)

The problem is that the node system container is started from the image "docker:registry.redhat.io/openshift3/ose-node:v3.11", which will obviously fail if docker or the container engine is not running. So we either have to remove the "docker:" prefix from the image (which I don't think would be a good idea) or start the docker/container engine before installing the node system container.

Version-Release number of the following components:
ansible-playbook 2.6.5

How reproducible:
Every time

Steps to Reproduce:
1.
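The ADDTL_MOUNTS value that shows up in the failure output is produced by the vars chain in the task above. As a rough illustration, here is a minimal Python sketch of that logic (this is not openshift-ansible code; `l_of_d_to_csv` is an approximation of the `lib_utils_oo_l_of_d_to_csv` filter, and the function and argument names are invented for the example):

```python
import json

def l_of_d_to_csv(mounts):
    # Approximation of lib_utils_oo_l_of_d_to_csv: serialize each mount
    # dict to JSON and join the results with commas.
    return ",".join(json.dumps(m) for m in mounts)

def addtl_mounts(add_mounts, auth_mounts, bind_docker_reg_auth):
    # l_node_syscon_auth_mounts_l: auth mounts only when registry auth is bound
    auth = auth_mounts if bind_docker_reg_auth else []
    # l_node_syscon_add_mounts_l: union of configured and auth mounts
    merged = add_mounts + [m for m in auth if m not in add_mounts]
    # l_node_syscon_add_mounts: a leading comma, then the CSV of mount dicts
    csv = "," + l_of_d_to_csv(merged)
    # l_node_syscon_add_mounts2: drop the value entirely when the list is empty
    return csv if csv != "," else ""

# Reproduces the ADDTL_MOUNTS value seen in the module_args of the log above.
print(addtl_mounts(
    [],
    [{"source": "/var/lib/origin/.docker", "destination": "/root/.docker",
      "type": "bind", "options": ["ro", "bind"]}],
    bind_docker_reg_auth=True,
))
```

Note this mount chain is working as intended here; it only explains where the ADDTL_MOUNTS string in the error output comes from. The failure itself is in resolving the image, not the mounts.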
Upgrade 3.10 deployed on Atomic host to 3.11

Actual results:

TASK [openshift_node : Install or Update node system container] ****************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/node_system_container_install.yml:2
Using module file /usr/share/ansible/openshift-ansible/roles/lib_openshift/library/oc_atomic_container.py
<master.example.com> ESTABLISH SSH CONNECTION FOR USER: cloud-user
<master.example.com> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o PreferredAuthentications=publickey -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=cloud-user -o ConnectTimeout=30 -o ControlPath=/root/.ansible/cp/%h-%r-%p master.example.com '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-miogwipkuyfmafemkqmtuyfmamhnzdog; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<master.example.com> (1, '\n{"msg": "time=\\"2018-10-19T16:43:36Z\\" level=fatal msg=\\"Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\\" \\n\\n", "failed": true, "rc": 1, "invocation": {"module_args": {"image": "docker:registry.redhat.io/openshift3/ose-node:v3.11", "values": ["DNS_DOMAIN=cluster.local", "DOCKER_SERVICE=docker.service", "ADDTL_MOUNTS=,{\\"source\\": \\"/var/lib/origin/.docker\\", \\"destination\\": \\"/root/.docker\\", \\"type\\": \\"bind\\", \\"options\\": [\\"ro\\", \\"bind\\"]}"], "state": "latest", "name": "atomic-openshift-node"}}}\n', '')
fatal: [master.example.com]: FAILED! => {
    "changed": false,
    "invocation": {
        "module_args": {
            "image": "docker:registry.redhat.io/openshift3/ose-node:v3.11",
            "name": "atomic-openshift-node",
            "state": "latest",
            "values": [
                "DNS_DOMAIN=cluster.local",
                "DOCKER_SERVICE=docker.service",
                "ADDTL_MOUNTS=,{\"source\": \"/var/lib/origin/.docker\", \"destination\": \"/root/.docker\", \"type\": \"bind\", \"options\": [\"ro\", \"bind\"]}"
            ]
        }
    },
    "msg": "time=\"2018-10-19T16:43:36Z\" level=fatal msg=\"Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\" \n\n",
    "rc": 1
}

PLAY RECAP **********************************************************************
infra1.example.com         : ok=27   changed=2   unreachable=0   failed=0
infra2.example.com         : ok=26   changed=2   unreachable=0   failed=0
localhost                  : ok=36   changed=0   unreachable=0   failed=0
master.example.com         : ok=361  changed=68  unreachable=0   failed=1
master2.example.com        : ok=212  changed=42  unreachable=0   failed=0
master3.example.com        : ok=212  changed=42  unreachable=0   failed=0
node1.example.com          : ok=26   changed=2   unreachable=0   failed=0
node2.example.com          : ok=26   changed=2   unreachable=0   failed=0
node3.example.com          : ok=26   changed=2   unreachable=0   failed=0

INSTALLER STATUS ***************************************************************
Initialization : Complete (0:02:05)

Failure summary:

1. Hosts:   master.example.com
   Play:    Update master nodes
   Task:    Install or Update node system container
   Message: time="2018-10-19T16:43:36Z" level=fatal msg="Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"
PR created in 3.11: https://github.com/openshift/openshift-ansible/pull/10555
*** Bug 1645164 has been marked as a duplicate of this bug. ***
Cannot reproduce on openshift-ansible-3.11.16-1.git.0.4ac6f81.el7.noarch with the following steps:

1. Fresh install OCP v3.10.45 on atomic hosts.

[root@ip-172-18-3-222 ~]# oc version
oc v3.10.45
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-3-222.ec2.internal:8443
openshift v3.10.45
kubernetes v1.10.0+b81c8f8

2. Edit the inventory file to specify the registry and token for the upgrade:

openshift_image_tag=v3.11.16
openshift_cluster_monitoring_operator_node_selector={"role": "node"}
oreg_auth_user="{{ lookup('env','REG_AUTH_USER2') }}"
oreg_auth_password="{{ lookup('env','REG_AUTH_PASSWORD2') }}"
oreg_url=registry.redhat.io/openshift3/ose-${component}:${version}

3. Run upgrade_control_plane to upgrade the masters first. Upgrade succeeded.

[root@ip-172-18-3-222 ~]# oc version
oc v3.11.16
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-3-222.ec2.internal:8443
openshift v3.11.16
kubernetes v1.11.0+d4cacc0

[root@ip-172-18-3-222 ~]# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-13-218.ec2.internal   Ready     compute   1h        v1.10.0+b81c8f8
ip-172-18-3-222.ec2.internal    Ready     master    1h        v1.11.0+d4cacc0
ip-172-18-8-46.ec2.internal     Ready     <none>    1h        v1.10.0+b81c8f8

TASK [openshift_node : Install or Update node system container] ****************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/node_system_container_install.yml:2
Monday 05 November 2018 03:15:09 +0000 (0:00:03.543) 0:15:04.794 *******
changed: [x] => {"changed": true, "msg": "Extracting to /var/lib/containers/atomic/atomic-openshift-node.0\nsystemctl daemon-reload\nsystemd-tmpfiles --create /etc/tmpfiles.d/atomic-openshift-node.conf\nsystemctl enable atomic-openshift-node\n"}

4. Run the node upgrade. Upgrade succeeded.

[root@ip-172-18-3-222 ~]# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-13-218.ec2.internal   Ready     compute   3h        v1.11.0+d4cacc0
ip-172-18-3-222.ec2.internal    Ready     master    3h        v1.11.0+d4cacc0
ip-172-18-8-46.ec2.internal     Ready     <none>    3h        v1.11.0+d4cacc0

TASK [openshift_node : Install or Update node system container] ****************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/node_system_container_install.yml:2
Monday 05 November 2018 03:45:01 +0000 (0:00:03.651) 0:05:41.271 *******
changed: [x] => {"changed": true, "msg": "Extracting to /var/lib/containers/atomic/atomic-openshift-node.0\nsystemctl daemon-reload\nsystemd-tmpfiles --create /etc/tmpfiles.d/atomic-openshift-node.conf\nsystemctl enable atomic-openshift-node\n"}
Digging into the failed task [Install or Update node system container], I think I found the root cause of the original issue in the description. According to the upgrade log in comment 2:

TASK [openshift_node : Install or Update node system container]
....
fatal: [master-01.ocp03.cacc.ch]: FAILED! => {
    "changed": false,
    "invocation": {
        "module_args": {
            "image": "docker:registry.redhat.io/openshift3/ose-node:v3.11",
            "name": "atomic-openshift-node",
            "state": "latest",
            "values": [
                "DNS_DOMAIN=cluster.local",
                "DOCKER_SERVICE=docker.service",
                "ADDTL_MOUNTS=,{\"source\": \"/var/lib/origin/.docker\", \"destination\": \"/root/.docker\", \"type\": \"bind\", \"options\": [\"ro\", \"bind\"]}"
            ]
        }
    },
    "msg": "time=\"2018-10-19T16:43:36Z\" level=fatal msg=\"Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\" \n\n",
    "rc": 1
}

The name of the image was "docker:registry.redhat.io/openshift3/ose-node:v3.11" instead of "registry.redhat.io/openshift3/ose-node:v3.11". This is why the atomic command depends on the docker daemon. Compare the following two image names when running atomic install while the docker service is stopped:

[root@ip-172-18-11-119 ~]# atomic install --system --system-package=no --name=atomic-openshift-node registry.redhat.io/openshift3/ose-node:v3.11
Getting image source signatures
Copying blob sha256:367d845540573038025f445c654675aa63905ec8682938fb45bc00f40849c37b
 71.46 MB / 71.46 MB [======================================================] 2s
Copying blob sha256:b82a357e4f15fda58e9728fced8558704e3a2e1d100e93ac408edb45fe3a5cb9
 1.27 KB / 1.27 KB [========================================================] 0s
...

[root@ip-172-18-11-119 ~]# atomic install --system --system-package=no --name=atomic-openshift-node docker:registry.redhat.io/openshift3/ose-node:v3.11
FATA[0000] Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?

Now, why does the image name include "docker:"? Let's check the code:

# vim roles/openshift_node/tasks/node_system_container_install.yml
---
- name: Install or Update node system container
  oc_atomic_container:
    name: "{{ openshift_service_type }}-node"
    image: "{{ system_osn_image }}"
    ...

# grep -r "system_osn_image"
roles/openshift_node/defaults/main.yml:system_osn_image: "{{ (system_images_registry == 'docker') | ternary('docker:' + l_osn_image, l_osn_image) }}"
...
# grep -r "system_images_registry"
roles/openshift_node/defaults/main.yml:system_images_registry: "docker"

The default value of system_images_registry is "docker", and it is assembled into "system_osn_image", with the result that the atomic command is affected by the docker daemon. In QE's tests, we specified a correct "system_images_registry" when doing a fresh system container install of v3.10 on atomic hosts, so the subsequent upgrade of the cluster succeeds, because the atomic command pulls the image from the correct registry instead of docker:***. Please refer to the variable info in [1].

[1] https://github.com/openshift/openshift-ansible/blob/master/inventory/hosts.example#L85

Until a fix for this issue is released, the workaround is to specify a correct "system_images_registry" during install/upgrade.

In conclusion, QE considers this not related to the order of tasks in the playbook. If "system_images_registry" must be specified during install/upgrade, we should document it, just as in [1]. If "system_images_registry" can be left unspecified, then its default should be consistent with the other default registries, such as access or redhat.io.

Changing bug status back to wait for the further fix.
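The ternary default quoted above can be modeled in a few lines of Python (this is a sketch of the Jinja expression, not code from the role):

```python
# Sketch of roles/openshift_node/defaults/main.yml:
#   system_osn_image: "{{ (system_images_registry == 'docker')
#                         | ternary('docker:' + l_osn_image, l_osn_image) }}"
def system_osn_image(system_images_registry, l_osn_image):
    # With the default system_images_registry of "docker", the image name is
    # prefixed with "docker:", so `atomic install` tries to read the image
    # from the local docker daemon -- which fails once docker is stopped.
    if system_images_registry == "docker":
        return "docker:" + l_osn_image
    return l_osn_image

img = "registry.redhat.io/openshift3/ose-node:v3.11"
print(system_osn_image("docker", img))  # "docker:"-prefixed, needs dockerd
print(system_osn_image("registry.redhat.io", img))  # plain, pulled from registry
```

This mirrors why setting "system_images_registry" to a real registry avoids the failure: the "docker:" prefix is never added, so the pull does not go through the docker daemon.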
(In reply to liujia from comment #20)
> Digging into the failed task [Install or Update node system container], I
> think I found the root cause of the original issue in the description.
>
> ...
>
> The name of the image was "docker:registry.redhat.io/openshift3/ose-node:v3.11"
> instead of "registry.redhat.io/openshift3/ose-node:v3.11". This is why the
> atomic command depends on the docker daemon. Compare the following two image
> names when running atomic install while the docker service is stopped.

We didn't change this in 3.11; it's been that way for some time. What we did change in 3.11 is that docker is stopped during the upgrade. This might be another suitable fix, but I'm unsure of the trade-offs; I'm pretty sure what is in 3.11 now is good as well.
I still cannot verify whether it works well due to the new blocker bug 1646887. After that bug is fixed, I will test it and do some regression testing against the current reordered tasks. Furthermore, we need to confirm whether this is a documentation bug or not. AFAIK, QE needs to set "system_images_registry" to the correct registry during install/upgrade of a system container OCP. If so, then a fix in openshift-ansible is not necessary and would add risk of regression.

[1] https://github.com/openshift/openshift-ansible/blob/master/inventory/hosts.example#L85
According to comment 20 - comment 23, verified the bug without "system_images_registry" set in the inventory file. QE will add a test case for the new scenario.

Version: openshift-ansible-3.11.41-1.git.0.f711b2d.el7.noarch

Steps:
1. System container install OCP v3.10 on atomic hosts.
2. Ensure no "system_images_registry" is set in the inventory file.
3. Upgrade the above OCP to v3.11.

Upgrade succeeded.

TASK [openshift_node : Install or Update node system container] ****************
changed: [x] => {
    "changed": true,
    "invocation": {
        "module_args": {
            "image": "docker:registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.11.41",
            "name": "atomic-openshift-node",
            "state": "latest",
            "values": [
                "DNS_DOMAIN=cluster.local",
                "DOCKER_SERVICE=docker.service",
                "ADDTL_MOUNTS=,{\"source\": \"/var/lib/origin/.docker\", \"destination\": \"/root/.docker\", \"type\": \"bind\", \"options\": [\"ro\", \"bind\"]}"
            ]
        }
    },
    "msg": "Getting image source signatures\nSkipping fetch of repeat blob sha256:dd7d5adb4579031663c0489591f9516900e3c64727ca9ad0bc4516265703ac92\nSkipping fetch of repeat blob sha256:27e45ca143e19ec3a4f6ff98ffbd470680ddb396c83ae76a9dc5e28ec6ade24d\nSkipping fetch of repeat blob sha256:32344fd2441451bcc9f861a159e2a56face2c56c94aa3750bb8dcbf12ad79ad8\nSkipping fetch of repeat blob sha256:fa80fd61aa7639742065e378c5cc5485244017fd6620e083dcf670cfcea4b4d9\nSkipping fetch of repeat blob sha256:741efa7f627a378459eea6369406e42aed10140fa8677ef34fd0642cf6df7e6d\nSkipping fetch of repeat blob sha256:fde064b4f586ff689338ce3f0bfe99200b410ea7383037fa67e2af8ef7571324\nCopying config sha256:0e8cd1a8361e0c5aae7fcc21ad7461956bfcf466ad655ff562de837a58538b27\n\r 0 B / 4.80 KB \r 4.80 KB / 4.80 KB \r 4.80 KB / 4.80 KB 6s\nWriting manifest to image destination\nStoring signatures\nExtracting to /var/lib/containers/atomic/atomic-openshift-node.0\nsystemctl daemon-reload\nsystemd-tmpfiles --create /etc/tmpfiles.d/atomic-openshift-node.conf\nsystemctl enable atomic-openshift-node\n"
}
Done. Added test case OCP-21243.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:3537