Bug 1420700
Summary: | Unexpected master controller service restart when running the upgrade playbook | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | liujia <jiajliu>
Component: | Cluster Version Operator | Assignee: | Steve Milner <smilner>
Status: | CLOSED NOTABUG | QA Contact: | Anping Li <anli>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 3.5.0 | CC: | anli, aos-bugs, jchaloup, jokerman, mmccomas, sdodson
Target Milestone: | --- | Flags: | sdodson: needinfo-
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2017-02-21 14:01:32 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
liujia
2017-02-09 10:32:29 UTC
```
# cat pre_master_hook.yml
---
- name: ensure user agree to start an upgrade
  pause:
    prompt: "Master upgrade of \"{{ inventory_hostname }}\" is about to start, press ENTER to start the master upgrade or CTRL-C to abort."
```

```
# cat master_hook.yml
---
- name: check master service status
  shell: systemctl status atomic-openshift-master*
  register: status

- debug: msg="{{ status.stdout }}"

- name: notice and ensure user to restart master service or system
  pause:
    prompt: "Masters \"{{ openshift.common.rolling_restart_mode }}\" will be restarted, press ENTER to start the master upgrade or CTRL-C to abort."
  when: openshift.common.rolling_restart_mode is defined
```

```
# cat post_master_hook.yml
---
- name: check master service status
  shell: systemctl status atomic-openshift-master*
  register: masters

- debug: msg="{{ masters.stdout }}"

- name: check node service status
  shell: systemctl status atomic-openshift-node
  register: nodes

- debug: msg="{{ nodes.stdout }}"
```

Created attachment 1248776 [details]
upgrade.log

Created attachment 1248777 [details]
master controller service.log
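For context, hook files like these are wired into the upgrade run through the installer's upgrade hook variables. A minimal sketch, assuming the openshift_master_upgrade_*_hook variables of the openshift-ansible master upgrade plays and hypothetical file paths:

```yaml
# group_vars/OSEv3.yml -- the hook variable names are those documented for
# the openshift-ansible master upgrade plays; the paths are hypothetical.
openshift_master_upgrade_pre_hook: /usr/share/custom/pre_master_hook.yml
openshift_master_upgrade_hook: /usr/share/custom/master_hook.yml
openshift_master_upgrade_post_hook: /usr/share/custom/post_master_hook.yml
```

As I understand the plays, the pre hook runs before each master is upgraded, the main hook runs after the upgrade but before the service restart, and the post hook runs after the master has been upgraded and restarted.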
The logs show that between 03:04:26 and 03:06:44 the master controller could not access the master API and was then restarted without a graceful request. It also suffered multiple "Unexpected EOF during watch stream event decoding" events. I wonder if it is coming back faster than the master API can respond, causing the master controller to bounce (the systemd unit for the master controller ensures it is restarted). Quick question: does it matter at all how many times a master controller is restarted during a master upgrade?

Observation:

```
openshift-119.lab.eng.nay.redhat.com:
  Before TASK [Restart master controllers]: Active: active (running) since Thu 2017-02-09 03:04:26 EST; 51s ago
  After  TASK [Restart master controllers]: Active: active (running) since Thu 2017-02-09 03:06:45 EST; 860ms ago

openshift-149.lab.eng.nay.redhat.com:
  Before TASK [Restart master controllers]: Active: active (running) since Thu 2017-02-09 03:10:55 EST; 27s ago
  After  TASK [Restart master controllers]: Active: active (running) since Thu 2017-02-09 03:16:44 EST; 867ms ago

openshift-151.lab.eng.nay.redhat.com:
  Before TASK [Restart master controllers]: Active: active (running) since Thu 2017-02-09 01:07:25 EST; 2h 15min ago
  After  TASK [Restart master controllers]: Active: active (running) since Thu 2017-02-09 03:25:43 EST; 807ms ago
```

The master controller was restarted on only two of the three master machines, each with a different time delta. Given that the third master machine shows a delta of 2h 15min, it was not restarted before "TASK [Restart master controllers]".

There is a similar bug: https://bugzilla.redhat.com/show_bug.cgi?id=1385530. If the installer cannot fix this, I think it may affect the downtime. What can we do about these bugs?

I don't think this is anything critical. The master controller can be restarted even when no upgrade is running; the chance is just higher during an upgrade. Still, it is expected that a master node will be temporarily non-operational during an upgrade. Do we guarantee any maximum downtime for each master node during an upgrade?

I agree with the assessment that this is normal and non-fatal. Without load balancer orchestration provided via hooks, we should expect the lease-holding controller to fail as API servers cycle. We should still look into the referenced bug 1385530, but I don't think it should be considered a blocker unless a controller is left in a state where it never recovers without manual intervention.

@Jan, Scott: I agree that the master controller can be restarted even when no upgrade is running (that is a separate issue; maybe we should follow up on it in later testing). But this issue happened only during the package upgrade process, so it seems that whatever the tooling does during the upgrade caused the master controller to restart. I wonder whether that is an expected result; if not, I think it should be recorded as a bug, even if it is genuinely normal and not fatal.

Most likely, when a master is restarted, a resource watch opened between the master and the master controller can become malformed and result in the EOF error. First of all, that is something that needs to be fixed in Kubernetes itself (if possible), or the master controller should try to renew the resource watch instead of getting restarted. There are already bugs reported that track this issue. What we could do is measure how much time it takes for the master controller to be restarted and try to accommodate that information in the upgrade play.
However, we cannot guarantee the time constraint, as the upgrade play waits for a master to come back once it is restarted, and that can take a non-deterministic amount of time. Besides, we don't know whether the master controller is restarted after or before the master. It would be great if Ansible could provide a timestamp of task completion (maybe running with -vvv provides that information).
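As a side note on the automatic bounce discussed above: whether systemd brings the controllers unit back by itself can be confirmed from the unit's restart policy. A minimal sketch in Ansible, assuming the RPM-based unit name atomic-openshift-master-controllers:

```yaml
---
# Sketch: read the restart policy of the controllers unit to confirm that
# systemd restarts it automatically whenever it exits.
- hosts: masters
  tasks:
    - name: read restart policy of the controllers unit
      command: systemctl show atomic-openshift-master-controllers -p Restart -p RestartSec
      register: restart_policy
      changed_when: false

    - debug:
        msg: "{{ restart_policy.stdout_lines }}"
```

If Restart= turns out to be always (or on-failure), every non-graceful exit during the upgrade is followed by an automatic restart, which would match the bouncing seen in the logs.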
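On measuring how long the controller restart takes: systemd records the moment a unit last became active in its ActiveEnterTimestamp property, so capturing it before and after the restart tasks gives a per-host restart window. A minimal sketch, again assuming the atomic-openshift-master-controllers unit name:

```yaml
---
# Sketch: capture when the controllers unit last entered the active state;
# comparing the value taken before and after the upgrade tasks gives the
# restart window on each master.
- hosts: masters
  tasks:
    - name: record controllers start timestamp
      command: systemctl show atomic-openshift-master-controllers -p ActiveEnterTimestamp
      register: controllers_started
      changed_when: false

    - debug:
        msg: "{{ inventory_hostname }}: {{ controllers_started.stdout }}"
```

As for per-task timestamps from Ansible itself, the profile_tasks callback plugin (enabled via callback_whitelist = profile_tasks in ansible.cfg) prints a timestamp and duration for each task, which would answer the timestamp question without resorting to -vvv.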