Bug 1509837 - Upgrade may fail when restart master controllers
Summary: Upgrade may fail when restart master controllers
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.7.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.7.0
Assignee: Michael Gugino
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-11-06 07:43 UTC by liujia
Modified: 2017-11-28 22:21 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-11-28 22:21:49 UTC
Target Upstream Version:
Embargoed:



Links
System ID: Red Hat Product Errata RHSA-2017:3188
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update
Last Updated: 2017-11-29 02:34:54 UTC

Description liujia 2017-11-06 07:43:09 UTC
Description of problem:
The upgrade of a non-HA containerized OCP deployment failed at task [restart master controllers].
RUNNING HANDLER [restart master controllers] *******************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "msg": "Unable to start service atomic-openshift-master-controllers: Job for atomic-openshift-master-controllers.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-master-controllers.service\" and \"journalctl -xe\" for details.\n"}

More info:
A check on the master host showed that atomic-openshift-master-controllers was running.

Excerpt from the journal log:
<--snip-->
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: Failed to start Atomic OpenShift Master Controllers.
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service failed.
Nov 06 00:42:12 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service holdoff time over, scheduling restart.
Nov 06 00:42:12 qe-jliu-con-master-1 systemd[1]: Starting Atomic OpenShift Master Controllers...
Nov 06 00:42:13 qe-jliu-con-master-1 atomic-openshift-master-controllers[3392]: Error response from daemon: No such container: atomic-openshift-master-controllers

<--snip-->

# systemctl status atomic-openshift-master*
● atomic-openshift-master-api.service - Atomic OpenShift Master API
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-api.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-11-06 00:41:56 EST; 7min ago
     Docs: https://github.com/openshift/origin
  Process: 3086 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 3081 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-api (code=exited, status=1/FAILURE)
 Main PID: 3085 (docker-current)
...
● atomic-openshift-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-controllers.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-11-06 00:42:23 EST; 6min ago
     Docs: https://github.com/openshift/origin
  Process: 3382 ExecStop=/usr/bin/docker stop atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
  Process: 3397 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 3392 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
 Main PID: 3396 (docker-current)
...
● atomic-openshift-master.service
   Loaded: not-found (Reason: No such file or directory)
   Active: active (running) since Mon 2017-11-06 00:41:34 EST; 7min ago
 Main PID: 1843 (docker-current)
   CGroup: /system.slice/atomic-openshift-master.service
           └─1843 /usr/bin/docker-current run --rm --privileged --net=host --name atomic-openshift-master --env-file=/etc/sysconfig/atomic-openshift-master -v /var/lib/origin:/var/lib/origin -v /var/log:/var/log -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin -v /etc/origin/cloudprovider:/etc/origin/cloudprovider -v /etc/pki:/etc/pki:ro openshift3/ose:v3.6.173.0.63 start master --config=/etc/origin/master/master-config.yaml --loglevel=5


# docker images
REPOSITORY                                          TAG                 IMAGE ID            CREATED             SIZE
registry.ops.openshift.com/openshift3/ose           v3.7                e63c03f3ae7b        4 days ago          1.059 GB
registry.ops.openshift.com/openshift3/ose           v3.7.0              e63c03f3ae7b        4 days ago          1.059 GB
registry.ops.openshift.com/openshift3/openvswitch   v3.6.173.0.63       e7d2769a89cf        6 days ago          1.159 GB
registry.ops.openshift.com/openshift3/node          v3.6.173.0.63       f1fe7e034bec        6 days ago          1.157 GB
registry.ops.openshift.com/openshift3/ose           v3.6                ef9ef5dca033        6 days ago          970.6 MB
registry.ops.openshift.com/openshift3/ose           v3.6.173.0.63       ef9ef5dca033        6 days ago          970.6 MB

# docker ps
CONTAINER ID        IMAGE                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
a155f91c04df        openshift3/ose:v3.7.0                  "/usr/bin/openshift s"   8 minutes ago       Up 8 minutes                            atomic-openshift-master-controllers
f9edfcd6b082        openshift3/ose:v3.7.0                  "/usr/bin/openshift s"   8 minutes ago       Up 8 minutes                            atomic-openshift-master-api
403b30456dab        openshift3/node:v3.6.173.0.63          "/usr/local/bin/origi"   9 minutes ago       Up 9 minutes                            atomic-openshift-node
fba2480d2cf5        openshift3/ose:v3.6.173.0.63           "/usr/bin/openshift s"   9 minutes ago       Up 9 minutes                            atomic-openshift-master
dee2ab535184        openshift3/openvswitch:v3.6.173.0.63   "/usr/local/bin/ovs-r"   4 hours ago         Up 4 hours              

Version-Release number of the following components:
openshift-ansible-3.7.0-0.194.0.git.0.e8af207.el7.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Install OCP v3.6 containerized, in a non-HA deployment.
2. Upgrade it to v3.7.

Actual results:
Upgrade failed.

Expected results:
The upgrade succeeds.

Additional info:
The upgrade log and the master-controllers journal log are attached.

Comment 3 Scott Dodson 2017-11-06 13:51:14 UTC
Reading through the journal logs, I think this happens because the API server is still bootstrapping when we first attempt to start the controllers. The controllers stop with the following error and then restart, at which point they succeed.

Nov 06 00:42:00 qe-jliu-con-master-1 atomic-openshift-master-controllers[3222]: F1106 05:42:00.078464       1 client_builder.go:273] unable to get token for service account: error watching
Nov 06 00:42:00 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a


So let's either add retries to these service starts, or ignore the failure and then loop for a period, checking the status of the service.
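
For illustration only, the first option (retrying the service start) could look roughly like the shell sketch below. It is not the change that went into openshift-ansible; the function name retry_cmd, the attempt count, and the delay are all assumptions made up for this example.

```shell
# retry_cmd ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Hypothetical helper: run COMMAND until it exits 0, or until it has
# failed ATTEMPTS times. Sleeps DELAY_SECONDS between attempts.
# Uses "local", which bash and dash both support.
retry_cmd() {
    local attempts=$1 delay=$2
    shift 2
    local n=1
    while true; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# Example with assumed values (needs root on a systemd host):
#   retry_cmd 5 10 systemctl restart atomic-openshift-master-controllers
```

A fixed delay keeps the sketch short. A real fix would have to choose the number of attempts and the delay so that they cover the time the API server needs to finish bootstrapping.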

Comment 4 Michael Gugino 2017-11-06 14:56:47 UTC
@sdodson

I think we should investigate how systemd handles starting the API service when the controllers require it as a dependency.

I will implement the workaround for now, as we're close to freeze.

https://www.freedesktop.org/software/systemd/man/sd_notify.html#
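
As a rough sketch of the ordering problem: sd_notify lets a service tell systemd when it is actually ready, so that dependent units wait for that signal rather than for process start. The same gating can be approximated from outside by polling a readiness probe before starting the dependent service. The helper name wait_until, the timeout values, and the probe api_is_ready are hypothetical and only serve this example.

```shell
# wait_until TIMEOUT_SECONDS INTERVAL_SECONDS PROBE [ARGS...]
# Hypothetical helper: run PROBE repeatedly until it exits 0 (ready)
# or TIMEOUT_SECONDS have elapsed. Returns 0 on ready, 1 on timeout.
wait_until() {
    local timeout=$1 interval=$2
    shift 2
    local deadline
    deadline=$(( $(date +%s) + timeout ))
    until "$@"; do
        if [ "$(date +%s)" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
    done
    return 0
}

# Example with assumed values; api_is_ready stands for whatever check
# proves the API server has finished bootstrapping:
#   wait_until 120 5 api_is_ready \
#       && systemctl start atomic-openshift-master-controllers
```

Note that a probe such as "systemctl is-active" would probably not be enough here, since the status output in the description shows the API unit as active while, per comment 3, it was still bootstrapping.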

Comment 5 Michael Gugino 2017-11-06 17:51:54 UTC
PR Created: https://github.com/openshift/openshift-ansible/pull/6027

Comment 6 Scott Dodson 2017-11-07 21:24:05 UTC
openshift-ansible-3.7.0-0.197.0

Comment 7 liujia 2017-11-08 01:52:44 UTC
Verified on openshift-ansible-3.7.0-0.197.0.git.0.f40c09c.el7.noarch.

Comment 10 errata-xmlrpc 2017-11-28 22:21:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188

