Bug 1331380 - Prepare for Node evacuation failed during containerized upgrade
Summary: Prepare for Node evacuation failed during containerized upgrade
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Target Release: 3.2.1
Assignee: Devan Goodwin
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-04-28 12:14 UTC by Anping Li
Modified: 2016-06-27 15:04 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-06-27 15:04:22 UTC
Target Upstream Version:
Embargoed:


Attachments
Logs for upgrade on Atomic hosts (91.09 KB, application/x-gzip)
2016-04-29 06:50 UTC, Anping Li


Links
Red Hat Product Errata RHBA-2016:1344 (SHIPPED_LIVE): Red Hat OpenShift Enterprise atomic-openshift-utils bug fix update. Last updated 2016-06-27 19:03:23 UTC.

Description Anping Li 2016-04-28 12:14:50 UTC
Description of problem:
Node evacuation failed during upgrade. The atomic-openshift-master and atomic-openshift-node services were not started automatically after docker was upgraded to 1.9, so oadm failed to connect to the server.


Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.0.85

How reproducible:
always

Steps to Reproduce:
1. install containerized OSE 3.1
2. upgrade to OSE 3.2
3. check the upgrade log
4. check docker and service status (see the sketch after this list)
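
A minimal shell sketch of the step-4 checks; the service names are the ones that appear in this report:

  docker version                             # confirm docker was upgraded to 1.9
  systemctl status docker                    # docker itself comes back up
  systemctl status atomic-openshift-master   # observed here: inactive (dead)
  systemctl status atomic-openshift-node     # observed here: inactive (dead)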

Actual results:
3. 
TASK: [Prepare for Node evacuation] *******************************************
<osecontain-master1.example.com> ESTABLISH CONNECTION FOR USER: root
<osecontain-master1.example.com> REMOTE_MODULE command /usr/local/bin/oadm manage-node osecontain-master1.example.com --schedulable=false
<osecontain-master1.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 osecontain-master1.example.com /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1461833360.62-21386916479163 && echo $HOME/.ansible/tmp/ansible-tmp-1461833360.62-21386916479163'
<osecontain-master1.example.com> PUT /tmp/tmpAMUlYb TO /root/.ansible/tmp/ansible-tmp-1461833360.62-21386916479163/command
<osecontain-master1.example.com> EXEC ssh -C -tt -v -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 osecontain-master1.example.com /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1461833360.62-21386916479163/command; rm -rf /root/.ansible/tmp/ansible-tmp-1461833360.62-21386916479163/ >/dev/null 2>&1'
failed: [osecontain-master1.example.com -> osecontain-master1.example.com] => {"changed": true, "cmd": ["/usr/local/bin/oadm", "manage-node", "osecontain-master1.example.com", "--schedulable=false"], "delta": "0:00:02.937325", "end": "2016-04-28 16:49:23.597150", "rc": 1, "start": "2016-04-28 16:49:20.659825", "warnings": []}
stderr:
================================================================================
ATTENTION: You are running oadm via a wrapper around 'docker run openshift3/ose:v3.1.1.6'.
This wrapper is intended only to be used to bootstrap an environment. Please
install client tools on another host once you have granted cluster-admin
privileges to a user.
See https://docs.openshift.com/enterprise/latest/cli_reference/get_started_cli.html
=================================================================================

The connection to the server osecontain-master1.example.com:8443 was refused - did you specify the right host or port?

FATAL: all hosts have already failed -- aborting

PLAY RECAP ********************************************************************
           to retry, use: --limit @/root/upgrade.retry

localhost                  : ok=17   changed=0    unreachable=0    failed=0
osecontain-master1.example.com : ok=161  changed=35   unreachable=0    failed=1
osecontain-node1.example.com : ok=106  changed=27   unreachable=0    failed=0

4.1 
[root@osecontain-master1 ~]# docker version
Client:
 Version:         1.9.1
 API version:     1.21
 Package version: docker-1.9.1-25.el7.x86_64
 Go version:      go1.4.2
 Git commit:      78ee77d/1.9.1
 Built:           
 OS/Arch:         linux/amd64

4.2 
[root@osecontain-master1 ~]# systemctl status atomic-openshift-node
● atomic-openshift-node.service
   Loaded: loaded (/etc/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Thu 2016-04-28 16:47:11 CST; 17min ago
 Main PID: 1724 (code=exited, status=0/SUCCESS)

Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.460284    1766 manager.go:1690] Need to restart pod infra container for "mysql-1-m9whk_cakephpmysql(192c77e4-0d1b-11e6-a...t is changed
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.460372    1766 manager.go:1725] Infra Container is being recreated. "mysql" will be restarted.
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.480742    1766 manager.go:1368] Killing container "ec773aa7480a4ff156fa15fb601e96abc3ffb4f3d3884584e16b72f4f4267c04 mysq...grace period
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.525764    1766 manager.go:1690] Need to restart pod infra container for "docker-registry-2-4qjci_default(66c255f2-0d1a-1...t is changed
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.525900    1766 manager.go:1725] Infra Container is being recreated. "registry" will be restarted.
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.526016    1766 manager.go:1368] Killing container "4757bbd90729d410c1a6216e2dbe42cc81408b4f365f5e3d1521eea93e2074bd regi...grace period
Apr 28 16:47:09 osecontain-master1.example.com docker[1724]: I0428 16:47:09.577946    1766 proxier.go:494] Removing endpoints for "default/router:80-tcp"
Apr 28 16:47:11 osecontain-master1.example.com docker[2142]: atomic-openshift-node
Apr 28 16:47:11 osecontain-master1.example.com systemd[1]: Stopped atomic-openshift-node.service.
Apr 28 16:49:05 osecontain-master1.example.com systemd[1]: Stopped atomic-openshift-node.service.
Hint: Some lines were ellipsized, use -l to show in full.


Expected results:
The upgrade succeeds.

Additional info:
The atomic-openshift-master and atomic-openshift-node services can be started manually, as sketched below.
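
A minimal sketch of the manual workaround on this single-master reproducer (on native HA, comment 1 below shows the split atomic-openshift-master-api unit instead):

  systemctl start atomic-openshift-master atomic-openshift-node
  systemctl status atomic-openshift-master atomic-openshift-node   # both should now report active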

Comment 1 Anping Li 2016-04-29 06:50:52 UTC
Created attachment 1152143 [details]
Logs for upgrade on Atomic hosts

When upgrading native HA on Atomic Host, I hit the same issue. It is a test blocker for the containerized OSE upgrade.

-bash-4.2# systemctl status atomic-openshift-master-api
● atomic-openshift-master-api.service - Atomic OpenShift Master API
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-api.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Fri 2016-04-29 06:30:13 UTC; 11min ago
     Docs: https://github.com/openshift/origin
  Process: 18174 ExecStop=/usr/bin/docker stop atomic-openshift-master-api (code=exited, status=0/SUCCESS)
  Process: 17413 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 17412 ExecStart=/usr/bin/docker run --rm --privileged --net=host --name atomic-openshift-master-api --env-file=/etc/sysconfig/atomic-openshift-master-api -v /var/lib/origin:/var/lib/origin -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin openshift3/ose:${IMAGE_VERSION} start master api --config=${CONFIG_FILE} $OPTIONS (code=exited, status=2)
  Process: 17407 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-api (code=exited, status=1/FAILURE)
 Main PID: 17412 (code=exited, status=2)

Apr 29 06:29:11 atomic1master1.example.com atomic-openshift-master-api[17412]: [185.391µs] [12.375µs] About to store object in database
Apr 29 06:29:11 atomic1master1.example.com atomic-openshift-master-api[17412]: [475.968183ms] [475.782792ms] END
Apr 29 06:29:12 atomic1master1.example.com atomic-openshift-master-api[17412]: I0429 02:29:12.225890       1 ensure.go:86] Added replication-controller service accounts to the system:replication-co...role: <nil>
Apr 29 06:29:12 atomic1master1.example.com atomic-openshift-master-api[17412]: I0429 02:29:12.551309       1 run_components.go:199] DNS listening at 0.0.0.0:53
Apr 29 06:30:12 atomic1master1.example.com systemd[1]: Stopping Atomic OpenShift Master API...
Apr 29 06:30:13 atomic1master1.example.com atomic-openshift-master-api[18174]: atomic-openshift-master-api
Apr 29 06:30:13 atomic1master1.example.com systemd[1]: atomic-openshift-master-api.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Apr 29 06:30:13 atomic1master1.example.com systemd[1]: Stopped Atomic OpenShift Master API.
Apr 29 06:30:13 atomic1master1.example.com systemd[1]: Unit atomic-openshift-master-api.service entered failed state.
Apr 29 06:30:13 atomic1master1.example.com systemd[1]: atomic-openshift-master-api.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

Comment 5 Devan Goodwin 2016-05-18 17:54:32 UTC
Quicker and simpler reproducer: install on RHEL Server 7.2 with openshift_image_tag=v3.1.1.6, using the latest openshift-ansible. Make sure this lands you on docker 1.8.

Then change to openshift_image_tag=v3.2.0.20 and run the upgrade playbook, as sketched below.
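
For reference, a sketch of that flow; the inventory section and playbook path are assumptions based on the openshift-ansible layout of the time, not stated in this comment:

  # in the inventory (assumed [OSEv3:vars] section):
  #   openshift_image_tag=v3.1.1.6    -> install, confirm docker 1.8
  #   openshift_image_tag=v3.2.0.20   -> then run the upgrade playbook
  ansible-playbook -i hosts \
      playbooks/byo/openshift-cluster/upgrades/v3_1_to_v3_2/upgrade.yml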

Comment 6 Devan Goodwin 2016-05-19 11:47:32 UTC
https://github.com/openshift/openshift-ansible/pull/1918

Fixed by using systemctl to restart docker, which automatically restarts the dependent services. Ansible's service module does a full stop and start, which does not.
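
In systemctl terms, a minimal sketch of the difference; the dependency-propagation reading is my interpretation of systemd behavior, not spelled out in the PR:

  # Roughly what Ansible's service module does: a stop followed by a start.
  systemctl stop docker     # systemd also stops units that depend on docker.service
  systemctl start docker    # ...but those stopped dependents are not started again

  # What the fix does: a single restart, which systemd propagates to the
  # dependent units, so the atomic-openshift-* services come back up with docker.
  systemctl restart docker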

Comment 7 Devan Goodwin 2016-05-19 11:49:26 UTC
We will tackle the other issue, where we bounce docker and then try to evacuate the node, as part of upcoming improvements to the upgrade playbooks.
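
Presumably that means evacuating before bouncing docker rather than after; a sketch using the oadm calls the playbook already issues:

  oadm manage-node <node> --schedulable=false   # stop scheduling new pods onto the node
  oadm manage-node <node> --evacuate --force    # migrate pods while the API is still reachable
  systemctl restart docker                      # only then bounce docker on that node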

Comment 13 Anping Li 2016-06-08 10:54:32 UTC
For this bug, the node evacuation passes, so moving to VERIFIED.

TASK: [Prepare for Node evacuation] ******************************************* 
<host4master.example.com> ESTABLISH CONNECTION FOR USER: root
<host4master.example.com> REMOTE_MODULE command /usr/local/bin/oadm manage-node host4node.example.com --schedulable=false
<host4master.example.com> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 host4master.example.com /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1465383087.54-96651821464828 && echo $HOME/.ansible/tmp/ansible-tmp-1465383087.54-96651821464828'
<host4master.example.com> PUT /tmp/tmp7XkjxL TO /root/.ansible/tmp/ansible-tmp-1465383087.54-96651821464828/command
<host4master.example.com> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 host4master.example.com /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1465383087.54-96651821464828/command; rm -rf /root/.ansible/tmp/ansible-tmp-1465383087.54-96651821464828/ >/dev/null 2>&1'
changed: [host4node.example.com -> host4master.example.com] => {"changed": true, "cmd": ["/usr/local/bin/oadm", "manage-node", "host4node.example.com", "--schedulable=false"], "delta": "0:00:03.309119", "end": "2016-06-08 06:51:30.566772", "rc": 0, "start": "2016-06-08 06:51:27.257653", "stderr": "\n================================================================================\nATTENTION: You are running oadm via a wrapper around 'docker run openshift3/ose:v3.2.1.1'.\nThis wrapper is intended only to be used to bootstrap an environment. Please\ninstall client tools on another host once you have granted cluster-admin\nprivileges to a user. \nSee https://docs.openshift.com/enterprise/latest/cli_reference/get_started_cli.html\n=================================================================================", "stdout": "NAME                    STATUS                     AGE\nhost4node.example.com   Ready,SchedulingDisabled   2h", "warnings": []}

TASK: [Evacuate Node for Kubelet upgrade] ************************************* 
<host4master.example.com> ESTABLISH CONNECTION FOR USER: root
<host4master.example.com> REMOTE_MODULE command /usr/local/bin/oadm manage-node host4node.example.com --evacuate --force
<host4master.example.com> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 host4master.example.com /bin/sh -c 'mkdir -p $HOME/.ansible/tmp/ansible-tmp-1465383091.08-106825151439623 && echo $HOME/.ansible/tmp/ansible-tmp-1465383091.08-106825151439623'
<host4master.example.com> PUT /tmp/tmp4NPAR_ TO /root/.ansible/tmp/ansible-tmp-1465383091.08-106825151439623/command
<host4master.example.com> EXEC ssh -C -tt -vvv -o ControlMaster=auto -o ControlPersist=60s -o ControlPath="/root/.ansible/cp/ansible-ssh-%h-%p-%r" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o ConnectTimeout=10 host4master.example.com /bin/sh -c 'LANG=C LC_CTYPE=C /usr/bin/python /root/.ansible/tmp/ansible-tmp-1465383091.08-106825151439623/command; rm -rf /root/.ansible/tmp/ansible-tmp-1465383091.08-106825151439623/ >/dev/null 2>&1'
changed: [host4node.example.com -> host4master.example.com] => {"changed": true, "cmd": ["/usr/local/bin/oadm", "manage-node", "host4node.example.com", "--evacuate", "--force"], "delta": "0:00:02.874339", "end": "2016-06-08 06:51:33.655613", "rc": 0, "start": "2016-06-08 06:51:30.781274", "stderr": "\n================================================================================\nATTENTION: You are running oadm via a wrapper around 'docker run openshift3/ose:v3.2.1.1'.\nThis wrapper is intended only to be used to bootstrap an environment. Please\ninstall client tools on another host once you have granted cluster-admin\nprivileges to a user. \nSee https://docs.openshift.com/enterprise/latest/cli_reference/get_started_cli.html\n=================================================================================", "stdout": "\nMigrating these pods on node: host4node.example.com\n\nNAME      READY     STATUS    RESTARTS   AGE\ncakephp-mysql-example-1-gao39   1/1       Running   0         16s\nmysql-1-6fwhs   1/1       Running   0         2h\ndocker-registry-2-ppfh8   1/1       Running   0         2h\nrouter-1-f9xba   1/1       Running   0         2h", "warnings": []}

TASK: [Upgrade packages] ****************************************************** 
skipping: [host4node.example.com]

Comment 15 errata-xmlrpc 2016-06-27 15:04:22 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1344

