Bug 1641245 - Upgrade to 3.11 on atomic host fails during Install or Update node system container task
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.11.0
Hardware: All
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.11.z
Assignee: Michael Gugino
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-10-20 11:09 UTC by Suresh
Modified: 2022-03-13 15:49 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-11-20 03:10:46 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:3537 0 None None None 2018-11-20 03:11:30 UTC

Description Suresh 2018-10-20 11:09:59 UTC
Description of problem:
Upgrade to 3.11 on atomic host fails during the Install or Update node system container task. This is because the docker service was stopped by a previous task.


TASK [openshift_node : Install or Update node system container] ****************************************************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/node_system_container_install.yml:2
Using module file /usr/share/ansible/openshift-ansible/roles/lib_openshift/library/oc_atomic_container.py
<master.example.com> ESTABLISH SSH CONNECTION FOR USER: cloud-user
<master.example.com> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o PreferredAuthentications=publickey -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=cloud-user -o ConnectTimeout=30 -o ControlPath=/root/.ansible/cp/%h-%r-%p master-01.ocp03.cacc.ch '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-miogwipkuyfmafemkqmtuyfmamhnzdog; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<master.example.com> (1, '\n{"msg": "time=\\"2018-10-19T16:43:36Z\\" level=fatal msg=\\"Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\\" \\n\\n", "failed": true, "rc": 1, "invocation": {"module_args": {"image": "docker:registry.redhat.io/openshift3/ose-node:v3.11", "values": ["DNS_DOMAIN=cluster.local", "DOCKER_SERVICE=docker.service", "ADDTL_MOUNTS=,{\\"source\\": \\"/var/lib/origin/.docker\\", \\"destination\\": \\"/root/.docker\\", \\"type\\": \\"bind\\", \\"options\\": [\\"ro\\", \\"bind\\"]}"], "state": "latest", "name": "atomic-openshift-node"}}}\n', '')
fatal: [master-01.ocp03.cacc.ch]: FAILED! => {
    "changed": false,
    "invocation": {
        "module_args": {
            "image": "docker:registry.redhat.io/openshift3/ose-node:v3.11",
            "name": "atomic-openshift-node",
            "state": "latest",
            "values": [
                "DNS_DOMAIN=cluster.local",
                "DOCKER_SERVICE=docker.service",
                "ADDTL_MOUNTS=,{\"source\": \"/var/lib/origin/.docker\", \"destination\": \"/root/.docker\", \"type\": \"bind\", \"options\": [\"ro\", \"bind\"]}"
            ]
        }
    },
    "msg": "time=\"2018-10-19T16:43:36Z\" level=fatal msg=\"Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\" \n\n",
    "rc": 1
}



This is what the task is doing:

- name: Install or Update node system container
  oc_atomic_container:
    name: "{{ openshift_service_type }}-node"
    image: "{{ system_osn_image }}"
    values:
    - "DNS_DOMAIN={{ openshift.common.dns_domain }}"
    - "DOCKER_SERVICE={{ openshift_docker_service_name }}.service"
    - 'ADDTL_MOUNTS={{ l_node_syscon_add_mounts2 }}'
    state: latest
vars:

 l_node_syscon_auth_mounts_l: "{{ l_bind_docker_reg_auth | ternary(openshift_node_syscon_auth_mounts_l,[]) }}"
 l_node_syscon_add_mounts_l: "{{ openshift_node_syscon_add_mounts_l | union(l_node_syscon_auth_mounts_l) }}"
 l_node_syscon_add_mounts: ",{{ l_node_syscon_add_mounts_l | lib_utils_oo_l_of_d_to_csv }}"
 l_node_syscon_add_mounts2: "{{ (l_node_syscon_add_mounts != ',') | bool | ternary(l_node_syscon_add_mounts,'') }}"
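A rough shell analogue of the Jinja2 logic above (function name and sample values are illustrative, not taken from the role): a leading comma is always prepended to the CSV of mount objects, and the whole value collapses to an empty string when the mount list was empty.

```shell
# Sketch of the ADDTL_MOUNTS assembly; names/values are illustrative.
build_addtl_mounts() {
  csv="$1"                      # CSV of mount JSON objects, possibly empty
  with_comma=",${csv}"
  if [ "$with_comma" = "," ]; then
    echo ""                     # no extra/auth mounts -> empty ADDTL_MOUNTS
  else
    echo "$with_comma"
  fi
}

build_addtl_mounts ''
build_addtl_mounts '{"source": "/var/lib/origin/.docker"}'
```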




Basically, the playbook stopped the docker service to kill existing static pods. Here is the task that gets executed:


- name: stop docker to kill static pods
  service:
    name: docker
    state: stopped
  register: l_openshift_node_upgrade_docker_stop_result
  until: not (l_openshift_node_upgrade_docker_stop_result is failed)
  retries: 3
  delay: 30
  when: >
        inventory_hostname in groups['oo_masters_to_config']
        or (l_docker_upgrade is defined and l_docker_upgrade | bool)
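The retry semantics of that task (retries: 3, delay: 30) can be sketched in shell — run a command until it succeeds or the attempts run out. The function name is illustrative, and the delay is commented out to keep the sketch fast.

```shell
# Shell analogue of Ansible's until/retries/delay loop (illustrative).
try_with_retries() {
  attempts="$1"; shift
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    # sleep 30                  # the task's delay; skipped in this sketch
  done
  return 1
}

try_with_retries 3 true && echo "service stopped"
try_with_retries 3 false || echo "giving up after 3 attempts"
```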



The problem is that, to install the node system container, the playbook uses the image docker:registry.redhat.io/openshift3/ose-node:v3.11, which will obviously fail if docker or the container engine is not running. So either we have to remove the docker: prefix from the image (which I don't think would be a good idea) or start the docker/container engine before installing the node system container.




Version-Release number of the following components:

ansible-playbook 2.6.5

How reproducible:
Every time


Steps to Reproduce:
1. Upgrade 3.10 deployed on Atomic host to 3.11

Actual results:

TASK [openshift_node : Install or Update node system container] ****************************************************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/node_system_container_install.yml:2
Using module file /usr/share/ansible/openshift-ansible/roles/lib_openshift/library/oc_atomic_container.py
<master.example.com> ESTABLISH SSH CONNECTION FOR USER: cloud-user
<master.example.com> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o PreferredAuthentications=publickey -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=cloud-user -o ConnectTimeout=30 -o ControlPath=/root/.ansible/cp/%h-%r-%p master.example.com '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-miogwipkuyfmafemkqmtuyfmamhnzdog; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<master.example.com> (1, '\n{"msg": "time=\\"2018-10-19T16:43:36Z\\" level=fatal msg=\\"Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\\" \\n\\n", "failed": true, "rc": 1, "invocation": {"module_args": {"image": "docker:registry.redhat.io/openshift3/ose-node:v3.11", "values": ["DNS_DOMAIN=cluster.local", "DOCKER_SERVICE=docker.service", "ADDTL_MOUNTS=,{\\"source\\": \\"/var/lib/origin/.docker\\", \\"destination\\": \\"/root/.docker\\", \\"type\\": \\"bind\\", \\"options\\": [\\"ro\\", \\"bind\\"]}"], "state": "latest", "name": "atomic-openshift-node"}}}\n', '')
fatal: [master.example.com]: FAILED! => {
    "changed": false, 
    "invocation": {
        "module_args": {
            "image": "docker:registry.redhat.io/openshift3/ose-node:v3.11", 
            "name": "atomic-openshift-node", 
            "state": "latest", 
            "values": [
                "DNS_DOMAIN=cluster.local", 
                "DOCKER_SERVICE=docker.service", 
                "ADDTL_MOUNTS=,{\"source\": \"/var/lib/origin/.docker\", \"destination\": \"/root/.docker\", \"type\": \"bind\", \"options\": [\"ro\", \"bind\"]}"
            ]
        }
    }, 
    "msg": "time=\"2018-10-19T16:43:36Z\" level=fatal msg=\"Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\" \n\n", 
    "rc": 1
}

PLAY RECAP *********************************************************************************************************************************************************************************************************************************
infra1.example.com     : ok=27   changed=2    unreachable=0    failed=0   
infra2.example.com     : ok=26   changed=2    unreachable=0    failed=0   
localhost                  : ok=36   changed=0    unreachable=0    failed=0   
master.example.com    : ok=361  changed=68   unreachable=0    failed=1   
master2.example.com    : ok=212  changed=42   unreachable=0    failed=0   
master3.example.com    : ok=212  changed=42   unreachable=0    failed=0   
node1.example.com      : ok=26   changed=2    unreachable=0    failed=0   
node2.example.com      : ok=26   changed=2    unreachable=0    failed=0   
node3.example.com   : ok=26   changed=2    unreachable=0    failed=0   


INSTALLER STATUS ***************************************************************************************************************************************************************************************************************************
Initialization  : Complete (0:02:05)


Failure summary:


  1. Hosts:    master.example.com
     Play:     Update master nodes
     Task:     Install or Update node system container
     Message:  time="2018-10-19T16:43:36Z" level=fatal msg="Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?"

Comment 9 Michael Gugino 2018-10-30 13:02:32 UTC
PR created in 3.11: https://github.com/openshift/openshift-ansible/pull/10555

Comment 16 Scott Dodson 2018-11-01 14:50:16 UTC
*** Bug 1645164 has been marked as a duplicate of this bug. ***

Comment 19 liujia 2018-11-05 05:27:05 UTC
Cannot reproduce on openshift-ansible-3.11.16-1.git.0.4ac6f81.el7.noarch with the following steps:

1. Fresh install ocp v3.10.45 on atomic hosts.
[root@ip-172-18-3-222 ~]# oc version
oc v3.10.45
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-3-222.ec2.internal:8443
openshift v3.10.45
kubernetes v1.10.0+b81c8f8

2. Edit inventory file to specify registry and token for upgrade
openshift_image_tag=v3.11.16
openshift_cluster_monitoring_operator_node_selector={"role": "node"}
oreg_auth_user="{{ lookup('env','REG_AUTH_USER2') }}"
oreg_auth_password="{{ lookup('env','REG_AUTH_PASSWORD2') }}"
oreg_url=registry.redhat.io/openshift3/ose-${component}:${version}

3. Run upgrade_control_plane to upgrade master first.
The upgrade succeeded.
[root@ip-172-18-3-222 ~]# oc version
oc v3.11.16
kubernetes v1.11.0+d4cacc0
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-3-222.ec2.internal:8443
openshift v3.11.16
kubernetes v1.11.0+d4cacc0
[root@ip-172-18-3-222 ~]# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-13-218.ec2.internal   Ready     compute   1h        v1.10.0+b81c8f8
ip-172-18-3-222.ec2.internal    Ready     master    1h        v1.11.0+d4cacc0
ip-172-18-8-46.ec2.internal     Ready     <none>    1h        v1.10.0+b81c8f8

TASK [openshift_node : Install or Update node system container] ****************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/node_system_container_install.yml:2
Monday 05 November 2018  03:15:09 +0000 (0:00:03.543)       0:15:04.794 ******* 
changed: [x] => {"changed": true, "msg": "Extracting to /var/lib/containers/atomic/atomic-openshift-node.0\nsystemctl daemon-reload\nsystemd-tmpfiles --create /etc/tmpfiles.d/atomic-openshift-node.conf\nsystemctl enable atomic-openshift-node\n"}

4. Run upgrade node
The upgrade succeeded.
[root@ip-172-18-3-222 ~]# oc get node
NAME                            STATUS    ROLES     AGE       VERSION
ip-172-18-13-218.ec2.internal   Ready     compute   3h        v1.11.0+d4cacc0
ip-172-18-3-222.ec2.internal    Ready     master    3h        v1.11.0+d4cacc0
ip-172-18-8-46.ec2.internal     Ready     <none>    3h        v1.11.0+d4cacc0

TASK [openshift_node : Install or Update node system container] ****************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/node_system_container_install.yml:2
Monday 05 November 2018  03:45:01 +0000 (0:00:03.651)       0:05:41.271 ******* 
changed: [x] => {"changed": true, "msg": "Extracting to /var/lib/containers/atomic/atomic-openshift-node.0\nsystemctl daemon-reload\nsystemd-tmpfiles --create /etc/tmpfiles.d/atomic-openshift-node.conf\nsystemctl enable atomic-openshift-node\n"}

Comment 20 liujia 2018-11-06 07:23:41 UTC
Digging into the failed task [Install or Update node system container], I think I found the root cause of the original issue in the description.

According to the upgrade log in comment 2:
TASK [openshift_node : Install or Update node system container]
....
fatal: [master-01.ocp03.cacc.ch]: FAILED! => {
    "changed": false, 
    "invocation": {
        "module_args": {
            "image": "docker:registry.redhat.io/openshift3/ose-node:v3.11", 
            "name": "atomic-openshift-node", 
            "state": "latest", 
            "values": [
                "DNS_DOMAIN=cluster.local", 
                "DOCKER_SERVICE=docker.service", 
                "ADDTL_MOUNTS=,{\"source\": \"/var/lib/origin/.docker\", \"destination\": \"/root/.docker\", \"type\": \"bind\", \"options\": [\"ro\", \"bind\"]}"
            ]
        }
    }, 
    "msg": "time=\"2018-10-19T16:43:36Z\" level=fatal msg=\"Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?\" \n\n", 
    "rc": 1
}

The image name was "docker:registry.redhat.io/openshift3/ose-node:v3.11" instead of "registry.redhat.io/openshift3/ose-node:v3.11". This is why the atomic command depends on the docker daemon. Compare the following two atomic install invocations, run while the docker service is stopped:

[root@ip-172-18-11-119 ~]# atomic install --system --system-package=no --name=atomic-openshift-node registry.redhat.io/openshift3/ose-node:v3.11
Getting image source signatures
Copying blob sha256:367d845540573038025f445c654675aa63905ec8682938fb45bc00f40849c37b
 71.46 MB / 71.46 MB [======================================================] 2s
Copying blob sha256:b82a357e4f15fda58e9728fced8558704e3a2e1d100e93ac408edb45fe3a5cb9
 1.27 KB / 1.27 KB [========================================================] 0s
...

[root@ip-172-18-11-119 ~]# atomic install --system --system-package=no --name=atomic-openshift-node docker:registry.redhat.io/openshift3/ose-node:v3.11
FATA[0000] Error initializing source docker-daemon:registry.redhat.io/openshift3/ose-node:v3.11: Error loading image from docker engine: Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running? 
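The "docker:" prefix selects the local docker daemon as the image source, which is why only the second command above requires a running docker service. A sketch of how the prefix changes the source (the classification logic and messages here are illustrative, not atomic's actual code):

```shell
# Illustrative: classify where "atomic install" would read the image from.
image_source() {
  case "$1" in
    docker:*) echo "local docker daemon" ;;   # needs docker.service running
    *)        echo "remote registry" ;;       # pulled directly, no daemon
  esac
}

image_source "docker:registry.redhat.io/openshift3/ose-node:v3.11"
image_source "registry.redhat.io/openshift3/ose-node:v3.11"
```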

Now let's look at why the image name includes the "docker:" prefix. Check the code:
# vim roles/openshift_node/tasks/node_system_container_install.yml
---
- name: Install or Update node system container
  oc_atomic_container:
    name: "{{ openshift_service_type }}-node"
    image: "{{ system_osn_image }}"
...

# grep -r "system_osn_image"
roles/openshift_node/defaults/main.yml:system_osn_image: "{{ (system_images_registry == 'docker') | ternary('docker:' + l_osn_image, l_osn_image) }}"
...

# grep -r "system_images_registry"
roles/openshift_node/defaults/main.yml:system_images_registry: "docker"

The default value of system_images_registry is "docker", and it is assembled into "system_osn_image"; as a result, the atomic command depends on the docker daemon.

In QE's test, we needed to specify the correct "system_images_registry" when doing a fresh system-container install of v3.10 on atomic hosts. Upgrading that cluster then succeeds, because the atomic command pulls the image from the correct registry instead of docker:***.

Please refer to the variable info in [1] 

[1] https://github.com/openshift/openshift-ansible/blob/master/inventory/hosts.example#L85

Until the fix for this issue is released, the workaround is to specify the correct "system_images_registry" at install/upgrade time.
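For example, the workaround could be expressed in the inventory like this (the registry value is illustrative; see [1] for the supported settings):

```ini
# Pull node system-container images directly from a registry instead of
# the (default) local docker daemon; value illustrative, see hosts.example.
system_images_registry=registry.access.redhat.com
```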

In conclusion, QE considers this unrelated to the task ordering in the playbook. If "system_images_registry" must be specified during install/upgrade, we should document it as in [1]. If "system_images_registry" can be left unspecified, then its default should be consistent with the other default registries, such as registry.access.redhat.com or registry.redhat.io.

Changing the bug status back to wait for the further fix.

Comment 21 Michael Gugino 2018-11-06 13:10:00 UTC
(In reply to liujia from comment #20)
> Dig into the fail task [Install or Update node system container], I thought
> I found the root cause for the original issue in description.
> 
>...
>
> The name of image was "docker:registry.redhat.io/openshift3/ose-node:v3.11"
> instead of "registry.redhat.io/openshift3/ose-node:v3.11". This is why
> atomic command have relation with docker daemon. Let's look at following two
> image's name when do atomic install while docker service stopped.

We didn't change this in 3.11; it's been that way for some time. What we did change in 3.11 is that docker is stopped during the upgrade.

This might be another suitable fix, but I'm unsure of the trade-offs; I'm fairly sure what is in 3.11 now is good as well.

Comment 22 liujia 2018-11-07 03:25:18 UTC
I still cannot verify whether it works due to new blocker bug 1646887. After that bug is fixed, I will test it and run some regression against the current reordered tasks.

Furthermore, we need to confirm whether this is a documentation bug. AFAIK, QE needs to set "system_images_registry" to the correct registry during install/upgrade of a system-container OCP. If so, then the fix in openshift-ansible is unnecessary and adds regression risk.

[1] https://github.com/openshift/openshift-ansible/blob/master/inventory/hosts.example#L85

Comment 24 liujia 2018-11-09 06:24:28 UTC
According to comment 20 through comment 23, verifying the bug without "system_images_registry" set in the inventory file. QE will add a test case for this new scenario.

Version:
openshift-ansible-3.11.41-1.git.0.f711b2d.el7.noarch

Steps:
1. System-container install OCP v3.10 on atomic hosts.
2. Ensure no "system_images_registry" is set in the inventory file.
3. Upgrade the above OCP to v3.11.

The upgrade succeeded.
TASK [openshift_node : Install or Update node system container] ****************
changed: [x] => {
    "changed": true, 
    "invocation": {
        "module_args": {
            "image": "docker:registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.11.41", 
            "name": "atomic-openshift-node", 
            "state": "latest", 
            "values": [
                "DNS_DOMAIN=cluster.local", 
                "DOCKER_SERVICE=docker.service", 
                "ADDTL_MOUNTS=,{\"source\": \"/var/lib/origin/.docker\", \"destination\": \"/root/.docker\", \"type\": \"bind\", \"options\": [\"ro\", \"bind\"]}"
            ]
        }
    }, 
    "msg": "Getting image source signatures\nSkipping fetch of repeat blob sha256:dd7d5adb4579031663c0489591f9516900e3c64727ca9ad0bc4516265703ac92\nSkipping fetch of repeat blob sha256:27e45ca143e19ec3a4f6ff98ffbd470680ddb396c83ae76a9dc5e28ec6ade24d\nSkipping fetch of repeat blob sha256:32344fd2441451bcc9f861a159e2a56face2c56c94aa3750bb8dcbf12ad79ad8\nSkipping fetch of repeat blob sha256:fa80fd61aa7639742065e378c5cc5485244017fd6620e083dcf670cfcea4b4d9\nSkipping fetch of repeat blob sha256:741efa7f627a378459eea6369406e42aed10140fa8677ef34fd0642cf6df7e6d\nSkipping fetch of repeat blob sha256:fde064b4f586ff689338ce3f0bfe99200b410ea7383037fa67e2af8ef7571324\nCopying config sha256:0e8cd1a8361e0c5aae7fcc21ad7461956bfcf466ad655ff562de837a58538b27\n\r 0 B / 4.80 KB \r 4.80 KB / 4.80 KB \r 4.80 KB / 4.80 KB  6s\nWriting manifest to image destination\nStoring signatures\nExtracting to /var/lib/containers/atomic/atomic-openshift-node.0\nsystemctl daemon-reload\nsystemd-tmpfiles --create /etc/tmpfiles.d/atomic-openshift-node.conf\nsystemctl enable atomic-openshift-node\n"
}

Comment 25 liujia 2018-11-09 06:49:27 UTC
Done. Added test case OCP-21243.

Comment 27 errata-xmlrpc 2018-11-20 03:10:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3537

