Description of problem:
Node evacuation failed during upgrade. The atomic-openshift-master and atomic-openshift-node services weren't started automatically after docker was upgraded to 1.9, so oadm failed to connect to the server.

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.0.85

How reproducible:
always

Steps to Reproduce:
1. Install containerized OSE 3.1
2. Upgrade to OSE 3.2
3. Check the upgrade log
4. Check docker and service status

Actual results:
3.
TASK: [Prepare for Node evacuation] *******************************************
<osecontain-master1.example.com> ESTABLISH CONNECTION FOR USER: root
<osecontain-master1.example.com> REMOTE_MODULE command /usr/local/bin/oadm manage-node osecontain-master1.example.com --schedulable=false
<osecontain-master1.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 osecontain-master1.example.com /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1461833360.62-21386916479163 && echo $HOME/.ansible/tmp/ansible-tmp-1461833360.62-21386916479163'
<osecontain-master1.example.com> PUT /tmp/tmpAMUlYb TO /root/.ansible/tmp/ansible-tmp-1461833360.62-21386916479163/command
<osecontain-master1.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 osecontain-master1.example.com /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1461833360.62-21386916479163/command; rm -rf /root/.ansible/tmp/ansible-tmp-1461833360.62-21386916479163/ >/dev/null 2>&1'
failed: [osecontain-master1.example.com -> osecontain-master1.example.com] => {"changed": true, "cmd": ["/usr/local/bin/oadm", "manage-node", "osecontain-master1.example.com", "--schedulable=false"], "delta": "0:00:02.937325", "end": "2016-04-28 16:49:23.597150", "rc": 1, "start": "2016-04-28 16:49:20.659825", "warnings": []}
stderr:
================================================================================
ATTENTION: You are running oadm via a wrapper around 'docker run openshift3/ose:v3.1.1.6'.
This wrapper is intended only to be used to bootstrap an environment. Please
install client tools on another host once you have granted cluster-admin
privileges to a user.
See https://docs.openshift.com/enterprise/latest/cli_reference/get_started_cli.html
=================================================================================
The connection to the server osecontain-master1.example.com:8443 was refused - did you specify the right host or port?

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
to retry, use: --limit @/root/upgrade.retry

localhost : ok=17 changed=0 unreachable=0 failed=0
osecontain-master1.example.com : ok=161 changed=35 unreachable=0 failed=1
osecontain-node1.example.com : ok=106 changed=27 unreachable=0 failed=0

4.1
[root@osecontain-master1 ~]# docker version
Client:
 Version: 1.9.1
 API version: 1.21
 Package version: docker-1.9.1-25.el7.x86_64
 Go version: go1.4.2
 Git commit: 78ee77d/1.9.1
 Built:
 OS/Arch: linux/amd64

4.2
[root@osecontain-master1 ~]# systemctl status atomic-openshift-node
● atomic-openshift-node.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2016-04-28 16:47:11 CST; 17min ago
 Main PID: 1724 (code=exited, status=0/SUCCESS)

Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.460284 1766 manager.go:1690] Need to restart pod infra container for "mysql-1-m9whk_cakephpmysql(192c77e4-0d1b-11e6-a...t is changed
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.460372 1766 manager.go:1725] Infra Container is being recreated. "mysql" will be restarted.
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.480742 1766 manager.go:1368] Killing container "ec773aa7480a4ff156fa15fb601e96abc3ffb4f3d3884584e16b72f4f4267c04 mysq...grace period
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.525764 1766 manager.go:1690] Need to restart pod infra container for "docker-registry-2-4qjci_default(66c255f2-0d1a-1...t is changed
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.525900 1766 manager.go:1725] Infra Container is being recreated. "registry" will be restarted.
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.526016 1766 manager.go:1368] Killing container "4757bbd90729d410c1a6216e2dbe42cc81408b4f365f5e3d1521eea93e2074bd regi...grace period
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.577946 1766 proxier.go:494] Removing endpoints for "default/router:80-tcp"
Apr 28 16:47:11 osecontain-master1.example.com docker[2142]: atomic-openshift-node
Apr 28 16:47:11 osecontain-master1.example.com systemd[1]: Stopped atomic-openshift-node.service.
Apr 28 16:49:05 osecontain-master1.example.com systemd[1]: Stopped atomic-openshift-node.service.
Hint: Some lines were ellipsized, use -l to show in full.

Expected results:
The upgrade succeeds.

Additional info:
The atomic-openshift-master and atomic-openshift-node services can be started manually.
Created attachment 1152143
Logs for upgrade on Atomic hosts

When upgrading a native HA install on Atomic Host, I hit the same issue. It blocks testing of the containerized OSE upgrade.

-bash-4.2# systemctl status atomic-openshift-master-api
● atomic-openshift-master-api.service - Atomic OpenShift Master API
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-api.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2016-04-29 06:30:13 UTC; 11min ago
     Docs: https://github.com/openshift/origin
  Process: 18174 ExecStop=/usr/bin/docker stop atomic-openshift-master-api (code=exited, status=0/SUCCESS)
  Process: 17413 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 17412 ExecStart=/usr/bin/docker run --rm --privileged --net=host --name atomic-openshift-master-api --env-file=/etc/sysconfig/atomic-openshift-master-api -v /var/lib/origin:/var/lib/origin -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin openshift3/ose:${IMAGE_VERSION} start master api --config=${CONFIG_FILE} $OPTIONS (code=exited, status=2)
  Process: 17407 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-api (code=exited, status=1/FAILURE)
 Main PID: 17412 (code=exited, status=2)

Apr 29 06:29:11 atomic1master1.example.com atomic-openshift-master-api[17412]: [185.391µs] [12.375µs] About to store object in database
Apr 29 06:29:11 atomic1master1.example.com atomic-openshift-master-api[17412]: [475.968183ms] [475.782792ms] END
Apr 29 06:29:12 atomic1master1.example.com atomic-openshift-master-api[17412]: I0429 02:29:12.225890 1 ensure.go:86] Added replication-controller service accounts to the system:replication-co...role: <nil>
Apr 29 06:29:12 atomic1master1.example.com atomic-openshift-master-api[17412]: I0429 02:29:12.551309 1 run_components.go:199] DNS listening at 0.0.0.0:53
Apr 29 06:30:12 atomic1master1.example.com systemd[1]: Stopping Atomic OpenShift Master API...
Apr 29 06:30:13 atomic1master1.example.com atomic-openshift-master-api[18174]: atomic-openshift-master-api
Apr 29 06:30:13 atomic1master1.example.com systemd[1]: atomic-openshift-master-api.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 29 06:30:13 atomic1master1.example.com systemd[1]: Stopped Atomic OpenShift Master API.
Apr 29 06:30:13 atomic1master1.example.com systemd[1]: Unit atomic-openshift-master-api.service entered failed state.
Apr 29 06:30:13 atomic1master1.example.com systemd[1]: atomic-openshift-master-api.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
Quicker and simpler reproducer: install on RHEL Server 7.2 with openshift_image_tag=v3.1.1.6, using the latest openshift-ansible. Make sure this lands you with docker 1.8. Then change to openshift_image_tag=v3.2.0.20 and run the upgrade playbook.
https://github.com/openshift/openshift-ansible/pull/1918

Fixed by using systemctl to restart docker, which is able to automatically restart the dependent services. Ansible's service module does a full stop and start, which does not restart them.
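To illustrate the point above, here is a rough sketch (not the actual change from the PR) of how one might confirm that the dependent services came back after docker was restarted through systemctl. The `report_inactive` helper and the `IS_ACTIVE` variable are hypothetical names used only for this illustration; the unit names are the ones from this report.

```shell
#!/bin/bash
# Probe command for a unit. Defaults to systemd; overridable so the helper
# can be exercised without systemd (hypothetical knob, not from the playbooks).
IS_ACTIVE=${IS_ACTIVE:-"systemctl is-active --quiet"}

# Print every unit passed as an argument that is not active.
# Returns 1 if at least one unit is inactive, 0 otherwise.
report_inactive() {
  local unit rc=0
  for unit in "$@"; do
    # $IS_ACTIVE is intentionally unquoted so it splits into command + flags.
    if ! $IS_ACTIVE "$unit"; then
      echo "$unit"
      rc=1
    fi
  done
  return "$rc"
}

# Intended use on a containerized master (commented out, needs systemd):
#   systemctl restart docker
#   report_inactive atomic-openshift-master atomic-openshift-node
```

If the helper prints nothing after `systemctl restart docker`, the dependent units were restarted along with docker; any unit name it prints is one that stayed down, which is the symptom seen in the original report.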
We will tackle the other issue where we're bouncing docker, then trying to evacuate the node, as part of upcoming improvements to upgrade.
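One way the ordering problem could be handled, sketched here under assumptions rather than taken from the planned upgrade improvements: poll until the master API answers before calling oadm. `wait_until` is a hypothetical helper; the `/healthz` URL and the retry counts in the commented usage are assumptions, while the oadm command is the one from the upgrade log.

```shell
#!/bin/bash
# wait_until ATTEMPTS DELAY CMD [ARGS...]
# Run CMD up to ATTEMPTS times, sleeping DELAY seconds after each failure.
# Returns 0 as soon as CMD succeeds, 1 if it never does.
wait_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Intended use after docker has been bounced (commented out, needs a cluster;
# the health URL is an assumption):
#   wait_until 30 2 curl -ksf https://osecontain-master1.example.com:8443/healthz \
#     && /usr/local/bin/oadm manage-node osecontain-master1.example.com --schedulable=false
```

With a guard like this, the "connection to the server ... was refused" failure would turn into a bounded wait, and the evacuation step would only run once the API is reachable.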
For this bug, node evacuation now passes, so moving to VERIFIED.

TASK: [Prepare for Node evacuation] *******************************************
<host4master.example.com> ESTABLISH CONNECTION FOR USER: root
<host4master.example.com> REMOTE_MODULE command /usr/local/bin/oadm manage-node host4node.example.com --schedulable=false
<host4master.example.com> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 host4master.example.com /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1465383087.54-96651821464828 && echo $HOME/.ansible/tmp/ansible-tmp-1465383087.54-96651821464828'
<host4master.example.com> PUT /tmp/tmp7XkjxL TO /root/.ansible/tmp/ansible-tmp-1465383087.54-96651821464828/command
<host4master.example.com> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 host4master.example.com /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1465383087.54-96651821464828/command; rm -rf /root/.ansible/tmp/ansible-tmp-1465383087.54-96651821464828/ >/dev/null 2>&1'
changed: [host4node.example.com -> host4master.example.com] => {"changed": true, "cmd": ["/usr/local/bin/oadm", "manage-node", "host4node.example.com", "--schedulable=false"], "delta": "0:00:03.309119", "end": "2016-06-08 06:51:30.566772", "rc": 0, "start": "2016-06-08 06:51:27.257653", "stderr": "\n================================================================================\nATTENTION: You are running oadm via a wrapper around 'docker run openshift3/ose:v3.2.1.1'.\nThis wrapper is intended only to be used to bootstrap an environment. Please\ninstall client tools on another host once you have granted cluster-admin\nprivileges to a user. \nSee https://docs.openshift.com/enterprise/latest/cli_reference/get_started_cli.html\n=================================================================================", "stdout": "NAME STATUS AGE\nhost4node.example.com Ready,SchedulingDisabled 2h", "warnings": []}

TASK: [Evacuate Node for Kubelet upgrade] *************************************
<host4master.example.com> ESTABLISH CONNECTION FOR USER: root
<host4master.example.com> REMOTE_MODULE command /usr/local/bin/oadm manage-node host4node.example.com --evacuate --force
<host4master.example.com> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 host4master.example.com /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1465383091.08-106825151439623 && echo $HOME/.ansible/tmp/ansible-tmp-1465383091.08-106825151439623'
<host4master.example.com> PUT /tmp/tmp4NPAR_ TO /root/.ansible/tmp/ansible-tmp-1465383091.08-106825151439623/command
<host4master.example.com> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 host4master.example.com /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1465383091.08-106825151439623/command; rm -rf /root/.ansible/tmp/ansible-tmp-1465383091.08-106825151439623/ >/dev/null 2>&1'
changed: [host4node.example.com -> host4master.example.com] => {"changed": true, "cmd": ["/usr/local/bin/oadm", "manage-node", "host4node.example.com", "--evacuate", "--force"], "delta": "0:00:02.874339", "end": "2016-06-08 06:51:33.655613", "rc": 0, "start": "2016-06-08 06:51:30.781274", "stderr": "\n================================================================================\nATTENTION: You are running oadm via a wrapper around 'docker run openshift3/ose:v3.2.1.1'.\nThis wrapper is intended only to be used to bootstrap an environment. Please\ninstall client tools on another host once you have granted cluster-admin\nprivileges to a user. \nSee https://docs.openshift.com/enterprise/latest/cli_reference/get_started_cli.html\n=================================================================================", "stdout": "\nMigrating these pods on node: host4node.example.com\n\nNAME READY STATUS RESTARTS AGE\ncakephp-mysql-example-1-gao39 1/1 Running 0 16s\nmysql-1-6fwhs 1/1 Running 0 2h\ndocker-registry-2-ppfh8 1/1 Running 0 2h\nrouter-1-f9xba 1/1 Running 0 2h", "warnings": []}

TASK: [Upgrade packages] ******************************************************
skipping: [host4node.example.com]
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1344