Bug 1509837

Summary: Upgrade may fail when restarting master controllers
Product: OpenShift Container Platform
Reporter: liujia <jiajliu>
Component: Cluster Version Operator
Assignee: Michael Gugino <mgugino>
Status: CLOSED ERRATA
QA Contact: liujia <jiajliu>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.7.0
CC: aos-bugs, jokerman, mmccomas
Target Milestone: ---
Target Release: 3.7.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-11-28 22:21:49 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description liujia 2017-11-06 07:43:09 UTC
Description of problem:
Upgrade of a non-HA containerized OCP failed at the task [restart master controllers].
RUNNING HANDLER [restart master controllers] *******************************************************************************************
fatal: [x.x.x.x]: FAILED! => {"changed": false, "failed": true, "msg": "Unable to start service atomic-openshift-master-controllers: Job for atomic-openshift-master-controllers.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-master-controllers.service\" and \"journalctl -xe\" for details.\n"}

===================more info
Checking the master host showed that atomic-openshift-master-controllers was in fact running.

Some journal log output:
<--snip-->
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: Failed to start Atomic OpenShift Master Controllers.
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: Unit atomic-openshift-master-controllers.service entered failed state.
Nov 06 00:42:07 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service failed.
Nov 06 00:42:12 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service holdoff time over, scheduling restart.
Nov 06 00:42:12 qe-jliu-con-master-1 systemd[1]: Starting Atomic OpenShift Master Controllers...
Nov 06 00:42:13 qe-jliu-con-master-1 atomic-openshift-master-controllers[3392]: Error response from daemon: No such container: atomic-openshift-master-controllers

<--snip-->

# systemctl status atomic-openshift-master*
● atomic-openshift-master-api.service - Atomic OpenShift Master API
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-api.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-11-06 00:41:56 EST; 7min ago
     Docs: https://github.com/openshift/origin
  Process: 3086 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 3081 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-api (code=exited, status=1/FAILURE)
 Main PID: 3085 (docker-current)
...
● atomic-openshift-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/etc/systemd/system/atomic-openshift-master-controllers.service; enabled; vendor preset: disabled)
   Active: active (running) since Mon 2017-11-06 00:42:23 EST; 6min ago
     Docs: https://github.com/openshift/origin
  Process: 3382 ExecStop=/usr/bin/docker stop atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
  Process: 3397 ExecStartPost=/usr/bin/sleep 10 (code=exited, status=0/SUCCESS)
  Process: 3392 ExecStartPre=/usr/bin/docker rm -f atomic-openshift-master-controllers (code=exited, status=1/FAILURE)
 Main PID: 3396 (docker-current)
...
● atomic-openshift-master.service
   Loaded: not-found (Reason: No such file or directory)
   Active: active (running) since Mon 2017-11-06 00:41:34 EST; 7min ago
 Main PID: 1843 (docker-current)
   CGroup: /system.slice/atomic-openshift-master.service
           └─1843 /usr/bin/docker-current run --rm --privileged --net=host --name atomic-openshift-master --env-file=/etc/sysconfig/atomic-openshift-master -v /var/lib/origin:/var/lib/origin -v /var/log:/var/log -v /var/run/docker.sock:/var/run/docker.sock -v /etc/origin:/etc/origin -v /etc/origin/cloudprovider:/etc/origin/cloudprovider -v /etc/pki:/etc/pki:ro openshift3/ose:v3.6.173.0.63 start master --config=/etc/origin/master/master-config.yaml --loglevel=5


# docker images
REPOSITORY                                          TAG                 IMAGE ID            CREATED             SIZE
registry.ops.openshift.com/openshift3/ose           v3.7                e63c03f3ae7b        4 days ago          1.059 GB
registry.ops.openshift.com/openshift3/ose           v3.7.0              e63c03f3ae7b        4 days ago          1.059 GB
registry.ops.openshift.com/openshift3/openvswitch   v3.6.173.0.63       e7d2769a89cf        6 days ago          1.159 GB
registry.ops.openshift.com/openshift3/node          v3.6.173.0.63       f1fe7e034bec        6 days ago          1.157 GB
registry.ops.openshift.com/openshift3/ose           v3.6                ef9ef5dca033        6 days ago          970.6 MB
registry.ops.openshift.com/openshift3/ose           v3.6.173.0.63       ef9ef5dca033        6 days ago          970.6 MB

# docker ps
CONTAINER ID        IMAGE                                  COMMAND                  CREATED             STATUS              PORTS               NAMES
a155f91c04df        openshift3/ose:v3.7.0                  "/usr/bin/openshift s"   8 minutes ago       Up 8 minutes                            atomic-openshift-master-controllers
f9edfcd6b082        openshift3/ose:v3.7.0                  "/usr/bin/openshift s"   8 minutes ago       Up 8 minutes                            atomic-openshift-master-api
403b30456dab        openshift3/node:v3.6.173.0.63          "/usr/local/bin/origi"   9 minutes ago       Up 9 minutes                            atomic-openshift-node
fba2480d2cf5        openshift3/ose:v3.6.173.0.63           "/usr/bin/openshift s"   9 minutes ago       Up 9 minutes                            atomic-openshift-master
dee2ab535184        openshift3/openvswitch:v3.6.173.0.63   "/usr/local/bin/ovs-r"   4 hours ago         Up 4 hours              

Version-Release number of the following components:
openshift-ansible-3.7.0-0.194.0.git.0.e8af207.el7.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Install containerized OCP v3.6 in a non-HA deployment.
2. Upgrade it to v3.7.

Actual results:
Upgrade failed.

Expected results:
Upgrade succeeds.

Additional info:
The upgrade log and the master-controllers journal log are attached.

Comment 3 Scott Dodson 2017-11-06 13:51:14 UTC
Reading through the journal logs, I think this happens because the API server is still bootstrapping when we first attempt to start the controllers. The controllers stop with the following error, then restart, at which point they succeed.

Nov 06 00:42:00 qe-jliu-con-master-1 atomic-openshift-master-controllers[3222]: F1106 05:42:00.078464       1 client_builder.go:273] unable to get token for service account: error watching
Nov 06 00:42:00 qe-jliu-con-master-1 systemd[1]: atomic-openshift-master-controllers.service: main process exited, code=exited, status=255/n/a


So let's either add retries to these service starts, or ignore the failure and then loop for a period checking the status of the service.
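
To illustrate the two options above, here is a minimal sketch in Python that combines them: it retries the start, and after a failed start it waits and checks whether the unit came up on its own (for example through systemd's automatic restart, as seen in the journal). This is not the openshift-ansible fix itself. The function names, the retry count, and the delay are assumptions made for illustration; only `systemctl start` and `systemctl is-active --quiet` are real commands.

```python
import subprocess
import time


def systemctl_start(unit):
    """Try to start the unit; return True if systemctl reports success."""
    result = subprocess.run(["systemctl", "start", unit], check=False)
    return result.returncode == 0


def systemctl_is_active(unit):
    """Return True if `systemctl is-active` reports the unit as active."""
    result = subprocess.run(
        ["systemctl", "is-active", "--quiet", unit], check=False
    )
    return result.returncode == 0


def start_with_retries(unit, start=systemctl_start,
                       is_active=systemctl_is_active,
                       retries=5, delay=10.0, sleep=time.sleep):
    """Start `unit`, tolerating failed attempts (hypothetical helper).

    A failed start is not fatal: wait `delay` seconds, then check the
    unit's status, because it may have been restarted in the meantime.
    Returns the number of the attempt that ended with an active unit.
    """
    for attempt in range(1, retries + 1):
        if start(unit):
            return attempt
        sleep(delay)  # give an automatic restart time to take effect
        if is_active(unit):
            return attempt
    raise RuntimeError(f"{unit} not active after {retries} attempts")
```

Called as `start_with_retries("atomic-openshift-master-controllers")`, this would only raise if the unit was still inactive after every attempt. The `start`, `is_active`, and `sleep` parameters exist so the logic can be exercised without a running systemd.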

Comment 4 Michael Gugino 2017-11-06 14:56:47 UTC
@sdodson

I think we should investigate how systemd handles starting the API service when the controllers service requires it as a dependency.

I will implement the workaround for now, as we're close to freeze.

https://www.freedesktop.org/software/systemd/man/sd_notify.html#
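
For context on the sd_notify link, a rough sketch of the readiness protocol it documents: a service started by systemd with Type=notify sends "READY=1" as a datagram to the Unix socket named in the NOTIFY_SOCKET environment variable, and systemd treats the unit as started only after that message arrives. Whether the master API service could use this is exactly what would need investigating; nothing in this report says it does. The helper below is hypothetical and written in Python purely for illustration.

```python
import os
import socket


def sd_notify(state, env=os.environ):
    """Send a state string such as "READY=1" to systemd's notify socket.

    Returns False when NOTIFY_SOCKET is unset (the process was not
    started by systemd with notification enabled), True once the
    datagram has been sent.
    """
    path = env.get("NOTIFY_SOCKET")
    if not path:
        return False
    addr = os.fsencode(path)
    if addr.startswith(b"@"):
        # A leading "@" denotes a Linux abstract-namespace socket,
        # which is addressed with a leading NUL byte instead.
        addr = b"\0" + addr[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(state.encode("utf-8"), addr)
    return True
```

A service would call `sd_notify("READY=1")` once it can actually serve requests, so that units ordered after it are not started while it is still bootstrapping.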

Comment 5 Michael Gugino 2017-11-06 17:51:54 UTC
PR Created: https://github.com/openshift/openshift-ansible/pull/6027

Comment 6 Scott Dodson 2017-11-07 21:24:05 UTC
openshift-ansible-3.7.0-0.197.0

Comment 7 liujia 2017-11-08 01:52:44 UTC
Verified on openshift-ansible-3.7.0-0.197.0.git.0.f40c09c.el7.noarch.

Comment 10 errata-xmlrpc 2017-11-28 22:21:49 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:3188