1509837 – Upgrade may fail when restart master controllers

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	3.7.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	3.7.0
Assignee:	Michael Gugino
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-11-06 07:43 UTC by liujia
Modified:	2017-11-28 22:21 UTC
CC List:	3 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2017-11-28 22:21:49 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2017:3188	0	normal	SHIPPED_LIVE	Moderate: Red Hat OpenShift Container Platform 3.7 security, bug, and enhancement update	2017-11-29 02:34:54 UTC

Description liujia 2017-11-06 07:43:09 UTC

Description of problem:
Upgrade of a non-HA containerized OCP cluster failed at task [restart master controllers].
RUNNING HANDLER [restart master controllers] *******************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "msg": "Unable to start service atomic-openshift-master-controllers: Job for atomic-openshift-master-controllers.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-master-controllers.service\" and \"journalctl -xe\" for details.\n"}

===================more info
Checking the master host showed that atomic-openshift-master-controllers was in fact running, even though the handler reported a failure.

Excerpt from the journal log:
<--snip-->
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: Failed to start Atomic OpenShift Master Controllers.
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service failed.
Nov 06 00:42:12 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service holdoff time over, scheduling restart.
Nov 06 00:42:12 qe-jliu-con-master-1 systemd[1]: Starting Atomic OpenShift Master Controllers...
Nov 06 00:42:13 qe-jliu-con-master-1 atomic-openshift-master-controllers[3392]: Error response from daemon: No such container: atomic-openshift-master-controllers

<--snip-->

# systemctl status atomic-openshift-master*
● atomic-openshift-master-api.service - Atomic OpenShift Master API
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-api.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-11-06 00:41:56 EST; 7min ago
     Docs: https://github.com/openshift/origin
  Process: 3086 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 3081 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-api (code=exited, status=1/FAILURE)
 Main PID: 3085 (docker-current)
...
● atomic-openshift-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-controllers.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-11-06 00:42:23 EST; 6min ago
     Docs: https://github.com/openshift/origin
  Process: 3382 ExecStop=/usr/bin/docker stop atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
  Process: 3397 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 3392 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
 Main PID: 3396 (docker-current)
...
● atomic-openshift-master.service
   Loaded: not-found (Reason: No such file or directory)
   Active: active (running) since Mon 2017-11-06 00:41:34 EST; 7min ago
 Main PID: 1843 (docker-current)
   CGroup: /system.slice/atomic-openshift-master.service
           └─1843 /usr/bin/docker-current run --rm --privileged --net=host --name atomic-openshift-master --env-file=/etc/sysconfig/atomic-openshift-master -v /var/lib/origin:/var/lib/origin -v /var/log:/var/log -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin -v /etc/origin/cloudprovider:/etc/origin/cloudprovider -v /etc/pki:/etc/pki:ro openshift3/ose:v3.6.173.0.63 start master --config=/etc/origin/master/master-config.yaml --loglevel=5


# docker images
REPOSITORY                                          TAG                 IMAGE ID            CREATED             SIZE
registry.ops.openshift.com/openshift3/ose           v3.7                e63c03f3ae7b        4 days ago          1.059 GB
registry.ops.openshift.com/openshift3/ose           v3.7.0              e63c03f3ae7b        4 days ago          1.059 GB
registry.ops.openshift.com/openshift3/openvswitch   v3.6.173.0.63       e7d2769a89cf        6 days ago          1.159 GB
registry.ops.openshift.com/openshift3/node          v3.6.173.0.63       f1fe7e034bec        6 days ago          1.157 GB
registry.ops.openshift.com/openshift3/ose           v3.6                ef9ef5dca033        6 days ago          970.6 MB
registry.ops.openshift.com/openshift3/ose           v3.6.173.0.63       ef9ef5dca033        6 days ago          970.6 MB

# docker ps
CONTAINER ID        IMAGE                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
a155f91c04df        openshift3/ose:v3.7.0                  "/usr/bin/openshift s"   8 minutes ago       Up 8 minutes                            atomic-openshift-master-controllers
f9edfcd6b082        openshift3/ose:v3.7.0                  "/usr/bin/openshift s"   8 minutes ago       Up 8 minutes                            atomic-openshift-master-api
403b30456dab        openshift3/node:v3.6.173.0.63          "/usr/local/bin/origi"   9 minutes ago       Up 9 minutes                            atomic-openshift-node
fba2480d2cf5        openshift3/ose:v3.6.173.0.63           "/usr/bin/openshift s"   9 minutes ago       Up 9 minutes                            atomic-openshift-master
dee2ab535184        openshift3/openvswitch:v3.6.173.0.63   "/usr/local/bin/ovs-r"   4 hours ago         Up 4 hours              

Version-Release number of the following components:
openshift-ansible-3.7.0-0.194.0.git.0.e8af207.el7.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Perform a containerized, non-HA install of OCP v3.6.
2. Upgrade it to v3.7.

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
The upgrade log and the master-controllers journal log are attached.

Comment 3 Scott Dodson 2017-11-06 13:51:14 UTC

Reading through the journal logs, I think this happens because the API server is still bootstrapping when we first attempt to start the controllers. The controllers exit with the following error and then restart, at which point they start successfully.

Nov 06 00:42:00 qe-jliu-con-master-1 atomic-openshift-master-controllers[3222]: F1106 05:42:00.078464       1 client_builder.go:273] unable to get token for service account: error watching
Nov 06 00:42:00 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a


So let's either add retries to these service starts, or ignore the failure and then loop for a period checking the status of the service.
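The two options above can be sketched in plain shell. This is only an illustration of the idea, not the change that went into openshift-ansible: the function names `retry` and `wait_active`, the attempt counts, and the delays are all hypothetical.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the two proposed workarounds.

# retry ATTEMPTS DELAY COMMAND...
# Run COMMAND until it succeeds, at most ATTEMPTS times, sleeping DELAY
# seconds between attempts. Returns 0 on success, 1 if every attempt failed.
retry() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$n" -ge "$attempts" ]; then
            echo "retry: '$*' failed after $n attempts" >&2
            return 1
        fi
        n=$((n + 1))
        sleep "$delay"
    done
}

# wait_active UNIT TIMEOUT
# Poll once per second until systemd reports UNIT as active, or give up
# after roughly TIMEOUT seconds. Returns 0 if the unit became active.
wait_active() {
    unit=$1
    timeout=$2
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if systemctl is-active --quiet "$unit"; then
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 1
}
```

Option one would then look like `retry 3 5 systemctl restart atomic-openshift-master-controllers`. Option two would run the restart once, ignore its exit status, and follow it with `wait_active atomic-openshift-master-controllers 60`. The second form relies on systemd's own restart handling, which in the journal above brought the service up a few seconds after the first failed start.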

Comment 4 Michael Gugino 2017-11-06 14:56:47 UTC

@sdodson

I think we should investigate how systemd handles startup of the API service when the controllers unit requires it as a dependency.

I will implement the workaround for now, as we're close to freeze.

https://www.freedesktop.org/software/systemd/man/sd_notify.html#
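As a starting point for that investigation, the sketch below prints whether one unit declares ordering or requirement dependencies on another. The helper names `has_unit` and `check_ordering` are hypothetical, and the sketch makes no claim about what the shipped unit files actually contain.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting systemd dependencies between two units.

# has_unit LIST UNIT
# Succeed if UNIT appears as a whole word in the space-separated LIST.
has_unit() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# check_ordering UNIT DEP
# Report whether UNIT lists DEP in its After=, Requires=, or Wants= properties.
check_ordering() {
    unit=$1
    dep=$2
    for prop in After Requires Wants; do
        # "systemctl show -p PROP UNIT" prints a single "PROP=a b c" line.
        line=$(systemctl show -p "$prop" "$unit")
        if has_unit "${line#*=}" "$dep"; then
            echo "$unit: $prop includes $dep"
        else
            echo "$unit: $prop does not include $dep"
        fi
    done
}
```

On a master host this could be run as `check_ordering atomic-openshift-master-controllers.service atomic-openshift-master-api.service`. Note that ordering alone only delays the controllers until systemd considers the API unit started. Whether that coincides with the API server being ready depends on the unit's `Type=` setting: with `Type=notify`, systemd waits for the service to send `READY=1` through sd_notify, which is what the link above documents.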

Comment 5 Michael Gugino 2017-11-06 17:51:54 UTC

PR Created: https://github.com/openshift/openshift-ansible/pull/6027

Comment 6 Scott Dodson 2017-11-07 21:24:05 UTC

openshift-ansible-3.7.0-0.197.0

Comment 7 liujia 2017-11-08 01:52:44 UTC

Verified on openshift-ansible-3.7.0-0.197.0.git.0.f40c09c.el7.noarch.

Comment 10 errata-xmlrpc 2017-11-28 22:21:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188
