Bug 1420700

Summary: Unexpected master controller service restart when running the upgrade playbook
Product: OpenShift Container Platform
Reporter: liujia <jiajliu>
Component: Cluster Version Operator
Assignee: Steve Milner <smilner>
Status: CLOSED NOTABUG
QA Contact: Anping Li <anli>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.5.0
CC: anli, aos-bugs, jchaloup, jokerman, mmccomas, sdodson
Target Milestone: ---
Flags: sdodson: needinfo-
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-21 14:01:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description liujia 2017-02-09 10:32:29 UTC
Description of problem:
Triggered an upgrade from OCP 3.4 to the latest OCP 3.5 with hooks configured. The upgrade succeeded, but the master_hook output shows that one master's controller service was restarted unexpectedly before the upgrade reached the restart-master task.

TASK [debug] *******************************************************************
ok: [openshift-x.x.x.x] => {}

MSG:

Running master upgrade hook /root/work/playbooks/master_hook.yml

TASK [include] *****************************************************************
included: /root/work/playbooks/master_hook.yml for openshift-x.x.x.x

TASK [check master service status] *********************************************
changed: [openshift-x.x.x.x]

TASK [debug] *******************************************************************
ok: [openshift-x.x.x.x] => {}

MSG:

● atomic-openshift-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-master-controllers.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2017-02-09 03:04:26 EST; 51s ago
......
......
TASK [notice and ensure user to restart master service or system] **************
[notice and ensure user to restart master service or system]
Masters "services" will be restarted,press ENTER to start the master upgrade or CTRL-C to abort.:
ok: [openshift-x.x.x.x]
......
......
TASK [Restart master controllers] **********************************************
changed: [openshift-x.x.x.x]

TASK [debug] *******************************************************************
ok: [openshift-x.x.x.x] => {}

MSG:

Running master post-upgrade hook /root/work/playbooks/post_master_hook.yml

TASK [include] *****************************************************************
included: /root/work/playbooks/post_master_hook.yml for openshift-119.lab.eng.nay.redhat.com

TASK [check master service status] *********************************************
changed: [openshift-x.x.x.x]

TASK [debug] *******************************************************************
ok: [openshift-x.x.x.x] => {}

MSG:

● atomic-openshift-master-controllers.service - Atomic OpenShift Master Controllers
   Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-master-controllers.service; enabled; vendor preset: disabled)
   Active: active (running) since Thu 2017-02-09 03:06:45 EST; 860ms ago
.......

Version-Release number of selected component (if applicable):
atomic-openshift-utils-3.5.4-1.git.0.034b615.el7.noarch

How reproducible:
always (tested twice, happened twice)

Steps to Reproduce:
1. Install OCP 3.4 on an HA environment.
2. Edit the inventory file to add three hooks:
openshift_master_upgrade_pre_hook=/root/work/playbooks/pre_master_hook.yml (pauses so the user can confirm the upcoming master upgrade)
openshift_master_upgrade_hook=/root/work/playbooks/master_hook.yml (checks and prints the master service status, then pauses so the user can confirm the upcoming master service restart)
openshift_master_upgrade_post_hook=/root/work/playbooks/post_master_hook.yml (checks and prints the master and node service status after the master service restart)
3. Trigger the upgrade:
# ansible-playbook -i /tmp/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade.yml

Actual results:
The atomic-openshift-master-controllers service was restarted unexpectedly.

Expected results:
Master services should be restarted only by the restart-master tasks.

Additional info:

Comment 1 liujia 2017-02-09 10:34:39 UTC
# cat pre_master_hook.yml 
---
- name: ensure user agree to start an upgrade
  pause:
      prompt: "Master upgrade of \"{{ inventory_hostname }}\" is about to start, press ENTER to start the master upgrade or CTRL-C to abort."
 
# cat master_hook.yml 
---
- name: check master service status
  shell: systemctl status atomic-openshift-master*
  register: status
- debug: msg="{{ status.stdout }}"

- name: notice and ensure user to restart master service or system
  pause:
      prompt: "Masters \"{{ openshift.common.rolling_restart_mode }}\" will be restarted,press ENTER to start the master upgrade or CTRL-C to abort."
  when: openshift.common.rolling_restart_mode is defined

# cat post_master_hook.yml
---
- name: check master service status
  shell: systemctl status atomic-openshift-master*
  register: masters
- debug: msg="{{ masters.stdout }}"

- name: check node service status
  shell: systemctl status atomic-openshift-node
  register: nodes
- debug: msg="{{ nodes.stdout }}"

Comment 2 liujia 2017-02-09 10:36:42 UTC
Created attachment 1248776 [details]
upgrade.log

Comment 3 liujia 2017-02-09 10:37:34 UTC
Created attachment 1248777 [details]
master controller service.log

Comment 4 Steve Milner 2017-02-16 20:40:00 UTC
The logs show that between 03:04:26 and 03:06:44 the master controller could not access the master API and then restarted without a graceful request. It also suffered multiple "Unexpected EOF during watch stream event decoding" events. I wonder if the controller is coming back faster than the master API can respond, causing the master controller to bounce (as the systemd unit for the master controller enforces a restart).
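
For reference, the unit's restart policy can be confirmed from a hook task in the same style as the reporter's playbooks. This is only a minimal sketch (the unit name is taken from the status output above; Restart and RestartSec are standard systemd properties):

---
# sketch: dump the unit's restart directives to confirm whether systemd
# itself will bounce the controller when it exits without a stop request
- name: show restart policy of the master controllers unit
  command: systemctl show atomic-openshift-master-controllers -p Restart -p RestartSec
  register: restart_policy
- debug: msg="{{ restart_policy.stdout }}"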

Comment 5 Jan Chaloupka 2017-02-17 12:46:33 UTC
Quick question: does it matter how many times a master controller is restarted during a master upgrade at all? 

Observation:
openshift-119.lab.eng.nay.redhat.com:
  Before TASK [Restart master controllers]:
    Active: active (running) since Thu 2017-02-09 03:04:26 EST; 51s ago
  After TASK [Restart master controllers]:
    Active: active (running) since Thu 2017-02-09 03:06:45 EST; 860ms ago

openshift-149.lab.eng.nay.redhat.com:
  Before TASK [Restart master controllers]:
   Active: active (running) since Thu 2017-02-09 03:10:55 EST; 27s ago
  After TASK [Restart master controllers]:
   Active: active (running) since Thu 2017-02-09 03:16:44 EST; 867ms ago

openshift-151.lab.eng.nay.redhat.com
  Before TASK [Restart master controllers]:
   Active: active (running) since Thu 2017-02-09 01:07:25 EST; 2h 15min ago
  After TASK [Restart master controllers]:
   Active: active (running) since Thu 2017-02-09 03:25:43 EST; 807ms ago

The master controller was restarted on only two of the master machines, each with a different time delta.
Given that the third master machine's controller had already been active for 2h 15min, it was not restarted before the "TASK [Restart master controllers]".
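
A more precise comparison would be to record the unit's exact activation time from the hooks instead of parsing the "... ago" text. A minimal sketch in the style of the existing hooks (ActiveEnterTimestamp is a standard systemd property; the unit name is taken from the report):

---
# sketch: capture the exact time the controller unit last entered the
# active state, so before/after deltas do not depend on "ago" rounding
- name: record controller activation timestamp
  command: systemctl show atomic-openshift-master-controllers -p ActiveEnterTimestamp
  register: controller_active_since
- debug: msg="{{ controller_active_since.stdout }}"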

Comment 9 Anping Li 2017-02-21 10:52:08 UTC
There is a similar bug: https://bugzilla.redhat.com/show_bug.cgi?id=1385530. If the installer cannot fix it, I think it may affect downtime. What can we do about these bugs?

Comment 10 Jan Chaloupka 2017-02-21 11:32:52 UTC
I don't think this is critical. The master controller can be restarted even when no upgrade is running; the chance is just higher during an upgrade. Still, it is expected that a master node will be "temporarily" non-operational during an upgrade. Do we guarantee any maximum downtime for each master node during the upgrade?

Comment 11 Scott Dodson 2017-02-21 14:01:32 UTC
I agree with the assessment that this is normal and non-fatal. Without load balancer orchestration provided via hooks, we should expect the lease-holding controller to fail as API servers cycle.

We should still look into the referenced bug 1385530 but I don't think that should be considered a blocker unless a controller is left in a state where it never recovers without manual intervention.
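
For illustration, "load balancer orchestration provided via hooks" could look something like the sketch below: a pre hook task that drains the master's API server from an haproxy balancer before the restart, with a matching "enable server" task in the post hook. The backend name, server name, and admin socket path are assumptions and would have to match the actual balancer configuration; socat must be available on the balancer host:

---
# sketch of a pre hook task: take this master out of the haproxy rotation
# before its API server and controller are cycled; names/paths are assumed
- name: drain master API server from the load balancer
  shell: echo "disable server atomic-openshift-api/{{ inventory_hostname }}" | socat stdio /var/run/haproxy.sock
  delegate_to: "{{ groups['lb'][0] }}"
  when: "'lb' in groups and groups['lb'] | length > 0"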

Comment 12 liujia 2017-02-22 02:54:33 UTC
@Jan, Scott

I agree that the master controller can be restarted even when no upgrade is running (that is a separate issue; we may follow up on it in later testing).

But in this case the restart happened only during the package upgrade step, so it appears that something the tooling does during the upgrade caused the master controller to restart. I wonder whether that is an expected result; if not, I think it should be recorded as a bug even if it is really normal and not fatal.

Comment 13 Jan Chaloupka 2017-02-22 11:38:32 UTC
Mostly likely when a master is restarted a resource watch opened between the master and the master controller can get malformed and result in EOF error. First of all, that is something that needs to be fixed in Kubernetes itself (if it is possible) or the master controller should try to renew the resource watch instead of getting restarted. There are bugs already reported which track the issue. What we could do is to measure how much time it takes to have the master controller restarted and try to accommodate this information into the upgrade play. However, we can not guarantee the time constraint as the upgrade play is waiting for a master to come back once it is restarted and that can take some deterministic intervals of time. Besides, we don't know if the master controller is restarted after of before the master. Would be great if the ansible could provide a timestamp of task completion (maybe running with -vvv could provide that information).