Bug 1685072

Summary: Upgrade playbook fails at [openshift_node : stop docker to kill static pods] task in node with CRI-O
Product: OpenShift Container Platform Reporter: Joel Rosental R. <jrosenta>
Component: Cluster Version Operator Assignee: Scott Dodson <sdodson>
Status: CLOSED ERRATA QA Contact: Weihua Meng <wmeng>
Severity: high Docs Contact:
Priority: unspecified    
Version: 3.11.0 CC: aos-bugs, jokerman, mmccomas, wmeng
Target Milestone: ---   
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Previously, the upgrade attempted to stop docker on nodes that had been configured to run only CRI-O, which caused the playbook to fail. The upgrade no longer attempts to stop docker on nodes that are configured only for CRI-O, so upgrades on those nodes complete successfully.
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-04-11 05:38:34 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Joel Rosental R. 2019-03-04 10:09:20 UTC
Description of problem:
While trying to upgrade from OCP 3.11.69 to 3.11.82 in a cluster running CRI-O instead of docker, the upgrade playbook fails with the following error:


2019-02-22 12:53:53,687 p=742 u=sys.openshift |  TASK [openshift_node : stop docker to kill static pods] **************************************************************************************************************
2019-02-22 12:53:53,687 p=742 u=sys.openshift |  task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/stop_services.yml:10
2019-02-22 12:53:53,687 p=742 u=sys.openshift |  Friday 22 February 2019  12:53:53 +0100 (0:00:00.657)       0:31:44.988 ******* 
2019-02-22 12:53:53,745 p=742 u=sys.openshift |  Running systemd
2019-02-22 12:53:53,820 p=742 u=sys.openshift |  Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/systemd.py
2019-02-22 12:53:53,966 p=742 u=sys.openshift |  Escalation succeeded
2019-02-22 12:53:54,121 p=742 u=sys.openshift |  FAILED - RETRYING: stop docker to kill static pods (3 retries left).Result was: {
    "attempts": 1, 
    "changed": false, 
    "invocation": {
        "module_args": {
            "daemon_reload": false, 
            "enabled": null, 
            "force": null, 
            "masked": null, 
            "name": "docker", 
            "no_block": false, 
            "state": "stopped", 
            "user": false
        }
    }, 
    "msg": "Could not find the requested service docker: host", 
    "retries": 4
}

As per xx, the task appears to assume that masters are always running docker:

- name: stop docker to kill static pods
  service:
    name: docker
    state: stopped
  register: l_openshift_node_upgrade_docker_stop_result
  until: not (l_openshift_node_upgrade_docker_stop_result is failed)
  retries: 3
  delay: 30
  when: >
        inventory_hostname in groups['oo_masters_to_config']
        or (l_docker_upgrade is defined and l_docker_upgrade | bool)


Version-Release number of the following components:
openshift-ansible-playbooks-3.11.82-3.git.0.9718d0a.el7.noarch
ansible --version
ansible 2.6.13
config file = /etc/ansible/ansible.cfg
configured module search path = [u'/home/sys.openshift/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
ansible python module location = /usr/lib/python2.7/site-packages/ansible
executable location = /usr/bin/ansible
python version = 2.7.5 (default, Sep 12 2018, 05:31:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]


How reproducible:
Always

Steps to Reproduce:
1. ansible-playbook -i <hosts-file> playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.yml


Actual results:

2019-02-22 12:53:53,687 p=742 u=sys.openshift |  TASK [openshift_node : stop docker to kill static pods] **************************************************************************************************************
2019-02-22 12:53:53,687 p=742 u=sys.openshift |  task path: /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/stop_services.yml:10
2019-02-22 12:53:53,687 p=742 u=sys.openshift |  Friday 22 February 2019  12:53:53 +0100 (0:00:00.657)       0:31:44.988 ******* 
2019-02-22 12:53:53,745 p=742 u=sys.openshift |  Running systemd
2019-02-22 12:53:53,820 p=742 u=sys.openshift |  Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/systemd.py
2019-02-22 12:53:53,966 p=742 u=sys.openshift |  Escalation succeeded
2019-02-22 12:53:54,121 p=742 u=sys.openshift |  FAILED - RETRYING: stop docker to kill static pods (3 retries left).Result was: {
    "attempts": 1, 
    "changed": false, 
    "invocation": {
        "module_args": {
            "daemon_reload": false, 
            "enabled": null, 
            "force": null, 
            "masked": null, 
            "name": "docker", 
            "no_block": false, 
            "state": "stopped", 
            "user": false
        }
    }, 
    "msg": "Could not find the requested service docker: host", 
    "retries": 4
}
2019-02-22 12:54:24,154 p=742 u=sys.openshift |  Running systemd
2019-02-22 12:54:24,243 p=742 u=sys.openshift |  Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/systemd.py
2019-02-22 12:54:24,523 p=742 u=sys.openshift |  Escalation succeeded
2019-02-22 12:54:24,707 p=742 u=sys.openshift |  FAILED - RETRYING: stop docker to kill static pods (2 retries left).Result was: {
    "attempts": 2, 
    "changed": false, 
    "invocation": {
        "module_args": {
            "daemon_reload": false, 
            "enabled": null, 
            "force": null, 
            "masked": null, 
            "name": "docker", 
            "no_block": false, 
            "state": "stopped", 
            "user": false
        }
    }, 
    "msg": "Could not find the requested service docker: host", 
    "retries": 4
}
2019-02-22 12:54:54,740 p=742 u=sys.openshift |  Running systemd
2019-02-22 12:54:54,825 p=742 u=sys.openshift |  Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/systemd.py
2019-02-22 12:54:55,190 p=742 u=sys.openshift |  Escalation succeeded
2019-02-22 12:54:55,357 p=742 u=sys.openshift |  FAILED - RETRYING: stop docker to kill static pods (1 retries left).Result was: {
    "attempts": 3, 
    "changed": false, 
    "invocation": {
        "module_args": {
            "daemon_reload": false, 
            "enabled": null, 
            "force": null, 
            "masked": null, 
            "name": "docker", 
            "no_block": false, 
            "state": "stopped", 
            "user": false
        }
    }, 
    "msg": "Could not find the requested service docker: host", 
    "retries": 4
}
2019-02-22 12:55:25,360 p=742 u=sys.openshift |  Running systemd
2019-02-22 12:55:25,443 p=742 u=sys.openshift |  Using module file /usr/lib/python2.7/site-packages/ansible/modules/system/systemd.py
2019-02-22 12:55:25,744 p=742 u=sys.openshift |  Escalation succeeded
2019-02-22 12:55:25,989 p=742 u=sys.openshift |  fatal: [tux172.xyz.com]: FAILED! => {
    "attempts": 3, 
    "changed": false, 
    "invocation": {
        "module_args": {
            "daemon_reload": false, 
            "enabled": null, 
            "force": null, 
            "masked": null, 
            "name": "docker", 
            "no_block": false, 
            "state": "stopped", 
            "user": false
        }
    }, 
    "msg": "Could not find the requested service docker: host"
}
2019-02-22 12:55:25,993 p=742 u=sys.openshift |  	to retry, use: --limit @/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.retry

2019-02-22 12:55:25,993 p=742 u=sys.openshift |  PLAY RECAP ***********************************************************************************************************************************************************
2019-02-22 12:55:25,993 p=742 u=sys.openshift |  localhost                  : ok=36   changed=0    unreachable=0    failed=0   
2019-02-22 12:55:25,993 p=742 u=sys.openshift |  tux123.xyz.com  : ok=431  changed=77   unreachable=0    failed=0   
2019-02-22 12:55:25,994 p=742 u=sys.openshift |  tux172.xyz.com  : ok=255  changed=48   unreachable=0    failed=1   
2019-02-22 12:55:25,994 p=742 u=sys.openshift |  tux173.xyz.com  : ok=242  changed=39   unreachable=0    failed=0   
2019-02-22 12:55:25,994 p=742 u=sys.openshift |  tux174.xyz.com  : ok=24   changed=1    unreachable=0    failed=0   
2019-02-22 12:55:25,994 p=742 u=sys.openshift |  tux175.xyz.com  : ok=24   changed=1    unreachable=0    failed=0   
2019-02-22 12:55:25,994 p=742 u=sys.openshift |  tux176.xyz.com  : ok=24   changed=1    unreachable=0    failed=0   
2019-02-22 12:55:25,994 p=742 u=sys.openshift |  tux177.xyz.com  : ok=24   changed=1    unreachable=0    failed=0   
2019-02-22 12:55:25,994 p=742 u=sys.openshift |  tux178.xyz.com  : ok=24   changed=1    unreachable=0    failed=0   
2019-02-22 12:55:25,994 p=742 u=sys.openshift |  tux179.xyz.com  : ok=24   changed=1    unreachable=0    failed=0   
2019-02-22 12:55:25,994 p=742 u=sys.openshift |  tux180.xyz.com  : ok=24   changed=1    unreachable=0    failed=0   
2019-02-22 12:55:25,995 p=742 u=sys.openshift |  tux181.xyz.com  : ok=24   changed=1    unreachable=0    failed=0   
2019-02-22 12:55:25,995 p=742 u=sys.openshift |  INSTALLER STATUS *****************************************************************************************************************************************************
2019-02-22 12:55:25,997 p=742 u=sys.openshift |  Initialization  : Complete (0:03:58)
2019-02-22 12:55:25,997 p=742 u=sys.openshift |  Friday 22 February 2019  12:55:25 +0100 (0:01:32.309)       0:33:17.298 ******* 
2019-02-22 12:55:25,997 p=742 u=sys.openshift |  =============================================================================== 
2019-02-22 12:55:26,001 p=742 u=sys.openshift |  openshift_node : update package meta data to speed install later. ------------------------------------------------------------------------------------------- 134.13s
/usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade_pre.yml:13 ----------------------------------------------------------------------------------
2019-02-22 12:55:26,001 p=742 u=sys.openshift |  openshift_node : stop docker to kill static pods ------------------------------------------------------------------------------------------------------------- 92.31s
/usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/stop_services.yml:10 ------------------------------------------------------------------------
2019-02-22 12:55:26,001 p=742 u=sys.openshift |  Run variable sanity checks ----------------------------------------------------------------------------------------------------------------------------------- 59.98s
/usr/share/ansible/openshift-ansible/playbooks/init/sanity_checks.yml:14 --------------------------------------------------------------------------------------------
2019-02-22 12:55:26,001 p=742 u=sys.openshift |  openshift_node : Wait for master API to come back online ----------------------------------------------------------------------------------------------------- 59.88s
/usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/restart.yml:65 ------------------------------------------------------------------------------
2019-02-22 12:55:26,001 p=742 u=sys.openshift |  openshift_excluder : Get available excluder version ---------------------------------------------------------------------------------------------------------- 52.80s
/usr/share/ansible/openshift-ansible/roles/openshift_excluder/tasks/verify_excluder.yml:4 ---------------------------------------------------------------------------
2019-02-22 12:55:26,001 p=742 u=sys.openshift |  openshift_node : Clean up cri-o pods ------------------------------------------------------------------------------------------------------------------------- 39.42s
/usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade/stop_services.yml:31 ------------------------------------------------------------------------
2019-02-22 12:55:26,001 p=742 u=sys.openshift |  openshift_node : Ensure cri-o is updated --------------------------------------------------------------------------------------------------------------------- 38.18s
/usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/upgrade.yml:36 --------------------------------------------------------------------------------------
2019-02-22 12:55:26,001 p=742 u=sys.openshift |  Gathering Facts ---------------------------------------------------------------------------------------------------------------------------------------------- 34.59s
/usr/share/ansible/openshift-ansible/playbooks/openshift-node/private/registry_auth.yml:4 ---------------------------------------------------------------------------
2019-02-22 12:55:26,001 p=742 u=sys.openshift |  Gathering Facts ---------------------------------------------------------------------------------------------------------------------------------------------- 34.48s
/usr/share/ansible/openshift-ansible/playbooks/init/basic_facts.yml:7 -----------------------------------------------------------------------------------------------
2019-02-22 12:55:26,001 p=742 u=sys.openshift |  openshift_excluder : Install docker excluder - yum ----------------------------------------------------------------------------------------------------------- 33.99s
/usr/share/ansible/openshift-ansible/roles/openshift_excluder/tasks/install.yml:9 -----------------------------------------------------------------------------------
2019-02-22 12:55:26,001 p=742 u=sys.openshift |  Gathering Facts ---------------------------------------------------------------------------------------------------------------------------------------------- 33.26s
/usr/share/ansible/openshift-ansible/playbooks/openshift-node/private/registry_auth.yml:25 --------------------------------------------------------------------------
2019-02-22 12:55:26,002 p=742 u=sys.openshift |  openshift_excluder : Install openshift excluder - yum -------------------------------------------------------------------------------------------------------- 32.70s
/usr/share/ansible/openshift-ansible/roles/openshift_excluder/tasks/install.yml:34 ----------------------------------------------------------------------------------
2019-02-22 12:55:26,002 p=742 u=sys.openshift |  openshift_control_plane : Check status of control plane image pre-pull --------------------------------------------------------------------------------------- 31.31s
/usr/share/ansible/openshift-ansible/roles/openshift_control_plane/tasks/pre_pull_poll.yml:2 ------------------------------------------------------------------------
2019-02-22 12:55:26,002 p=742 u=sys.openshift |  openshift_node_group : Wait for the sync daemonset to become ready and available ----------------------------------------------------------------------------- 22.41s
/usr/share/ansible/openshift-ansible/roles/openshift_node_group/tasks/sync.yml:65 -----------------------------------------------------------------------------------
2019-02-22 12:55:26,002 p=742 u=sys.openshift |  Set fact of no_proxy_internal_hostnames ---------------------------------------------------------------------------------------------------------------------- 19.67s
/usr/share/ansible/openshift-ansible/playbooks/init/cluster_facts.yml:42 --------------------------------------------------------------------------------------------
2019-02-22 12:55:26,002 p=742 u=sys.openshift |  Run variable sanity checks ----------------------------------------------------------------------------------------------------------------------------------- 19.49s
/usr/share/ansible/openshift-ansible/playbooks/init/sanity_checks.yml:14 --------------------------------------------------------------------------------------------
2019-02-22 12:55:26,002 p=742 u=sys.openshift |  Gathering Facts ---------------------------------------------------------------------------------------------------------------------------------------------- 18.13s
/usr/share/ansible/openshift-ansible/playbooks/init/cluster_facts.yml:2 ---------------------------------------------------------------------------------------------
2019-02-22 12:55:26,002 p=742 u=sys.openshift |  Gathering Facts ---------------------------------------------------------------------------------------------------------------------------------------------- 18.13s
/usr/share/ansible/openshift-ansible/playbooks/init/version.yml:12 --------------------------------------------------------------------------------------------------
2019-02-22 12:55:26,002 p=742 u=sys.openshift |  Initialize openshift.node.sdn_mtu ---------------------------------------------------------------------------------------------------------------------------- 17.21s
/usr/share/ansible/openshift-ansible/playbooks/init/cluster_facts.yml:60 --------------------------------------------------------------------------------------------
2019-02-22 12:55:26,002 p=742 u=sys.openshift |  Gather Cluster facts ----------------------------------------------------------------------------------------------------------------------------------------- 17.05s
/usr/share/ansible/openshift-ansible/playbooks/init/cluster_facts.yml:27 --------------------------------------------------------------------------------------------
2019-02-22 12:55:26,003 p=742 u=sys.openshift |  Failure summary:


  1. Hosts:    tux172.xyz.com
     Play:     Update master nodes
     Task:     stop docker to kill static pods
     Message:  Could not find the requested service docker: host


Expected results:
It should check whether other runtimes (such as CRI-O) are installed instead of docker.
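
One way such a check could look is sketched below. This is an illustration only, not the patch shipped in the erratum: it reuses the task quoted in the description and adds a guard on openshift_use_crio_only, the variable named in Comment 5. The default(false) filter and the list form of the when clause are assumptions made for this sketch.

```yaml
# Sketch under stated assumptions; the real fix in openshift-ansible may differ.
- name: stop docker to kill static pods
  service:
    name: docker
    state: stopped
  register: l_openshift_node_upgrade_docker_stop_result
  until: not (l_openshift_node_upgrade_docker_stop_result is failed)
  retries: 3
  delay: 30
  # Items in a "when" list are combined with a logical AND.
  when:
    # Assumption: skip the task entirely on hosts that run only CRI-O.
    # default(false) keeps the expression defined when the variable is unset.
    - not (openshift_use_crio_only | default(false) | bool)
    # Original condition from the failing task, unchanged.
    - >
      inventory_hostname in groups['oo_masters_to_config']
      or (l_docker_upgrade is defined and l_docker_upgrade | bool)
```

With a guard of this kind, a master that has no docker service would report the task as skipped instead of exhausting its retries.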

Additional info:

Comment 5 Weihua Meng 2019-03-25 09:56:51 UTC
Fixed.

openshift-ansible-3.11.98-1.git.0.3cfa7c3.el7


The task is now skipped on nodes with openshift_use_crio_only=True:

TASK [openshift_node : stop docker to kill static pods] ************************
skipping: [qe-wmeng3r31169-np-1.0325-5g4.qe.rhcloud.com] => {
    "changed": false, 
    "skip_reason": "Conditional result was False"
}

Comment 7 errata-xmlrpc 2019-04-11 05:38:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0636