1618663 – Upgrade hang on the task [openshift_console : Waiting for console rollout to complete]

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	3.11.0
Assignee:	Samuel Padgett
QA Contact:	liujia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-08-17 09:11 UTC by liujia
Modified:	2018-09-25 11:22 UTC
CC List:	6 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-09-25 11:22:40 UTC
Target Upstream Version:
Embargoed:

Description liujia 2018-08-17 09:11:10 UTC

Description of problem:
The upgrade hangs (for at least 15 minutes) on the task [openshift_console : Waiting for console rollout to complete]. The playbook needs a timeout for this task.

---
- name: Waiting for console rollout to complete
  # `oc rollout status` will block until either the rollout succeeds or `spec.progressDeadlineSeconds` elapse.
  # A zero return code indicates the rollout succeeded.
  # https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment
  command: >
    {{ openshift_client_binary }} rollout status deployment/console --config={{ openshift.common.config_base }}/master/admin.kubeconfig -n openshift-console
  changed_when: false
  # Ignore errors so we can log troubleshooting info on failures.
  ignore_errors: yes
  register: console_rollout_status

Output of running this command manually while the playbook was hung:
[root@jliu-10-master-etcd-nfs-1 ~]# oc rollout status deployment/console --config=/etc/origin/master/admin.kubeconfig -n openshift-console
Waiting for deployment spec update to be observed...
Waiting for deployment spec update to be observed...
error: watch closed before Until timeout

Version-Release number of the following components:
openshift-ansible-3.11.0-0.16.0.git.0.e82689aNone.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Install OCP v3.10 with the default registry (registry.access.redhat.com).
2. Set oreg_url to registry.dev.redhat.io, with the correct username and password for the registry.
3. Run the upgrade against that cluster.

Actual results:
The upgrade hangs.

Expected results:
The upgrade succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 3 liujia 2018-08-21 07:12:18 UTC

This blocks upgrade testing.

Comment 4 liujia 2018-08-21 07:13:18 UTC

Still hitting it on openshift-ansible-3.11.0-0.19.0.git.0.ebd1bf9None.noarch

Comment 5 Samuel Padgett 2018-08-22 18:43:11 UTC

We should add a timeout. In the meantime, can you include the output from a few commands to help debug?

$ oc get pods -n openshift-console
$ oc get events -n openshift-console
$ oc logs deploy/console -n openshift-console

Comment 6 Samuel Padgett 2018-08-22 19:06:13 UTC

Can you confirm you definitely waited more than 10 minutes? `oc rollout status` should fail after `spec.progressDeadlineSeconds` is exceeded. It's set to 600s right now:

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_console/files/console-template.yaml#L55

Comment 7 Samuel Padgett 2018-08-22 19:09:54 UTC

> error: watch closed before Until timeout

That seems to be the same as

https://github.com/kubernetes/kubernetes/issues/40224

cc Tomas

Comment 8 Tomáš Nožička 2018-08-23 10:51:05 UTC

Yep, I have opened multiple PRs upstream leading to a fix, hopefully for 1.12, but since they change a lot of internals and require approval superpowers, it's going a bit slowly.

The premature timeout usually happens on an API timeout, so I am not sure whether that's the cause of your issue or just a manifestation of something else being slow or broken. It's usually around 5 minutes, and I'd guess the console should roll out by then. Timing out the command to see how long it actually waited is a good first step.
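
A minimal sketch of that first step, assuming GNU coreutils `timeout` is available on the master. The helper name `timed_run` and the 900-second limit in the example are illustrative and not part of the playbook. It runs a command under a time limit and reports how long it actually waited, along with the exit status (GNU `timeout` exits with 124 when the limit is hit).

```shell
#!/usr/bin/env bash
# timed_run LIMIT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND under GNU `timeout` and prints the
# elapsed wall-clock time and exit status to stderr.
timed_run() {
  local limit="$1"; shift
  local start end rc=0
  start=$(date +%s)
  # Capture the exit status without aborting under `set -e`.
  timeout "$limit" "$@" || rc=$?
  end=$(date +%s)
  echo "elapsed=$((end - start))s rc=${rc}" >&2
  return "$rc"
}

# Example (assumes the kubeconfig path used in the report):
# timed_run 900 oc rollout status deployment/console \
#   --config=/etc/origin/master/admin.kubeconfig -n openshift-console
```

An exit status of 124 would mean the command was still blocked when the limit expired; any other non-zero status comes from `oc` itself.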

Comment 9 liujia 2018-08-23 10:53:19 UTC

Did not hit it on openshift-ansible-3.11.0-0.20.0.git.0.ec6d8caNone.noarch.

If I hit it again, I will capture the information requested above.

Comment 10 Samuel Padgett 2018-08-23 11:42:36 UTC

> Timing out the command to see how long it actually waited is a good first step.

Tomas, the `oc rollout status` command should fail after progressDeadlineSeconds, correct? At least that's what the doc says:

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#deployment-status

```
You can check if a Deployment has failed to progress by using kubectl rollout status. kubectl rollout status returns a non-zero exit code if the Deployment has exceeded the progression deadline.
```

I remember testing this when I added the check, and it worked for me.

jiajliu - can you confirm you definitely waited more than 10 minutes or are you estimating?

Comment 11 Tomáš Nožička 2018-08-23 12:08:39 UTC

Correct. But not if you hit the Until issue first. Something, usually the API timeout or the LB, kills the open connection for the GET call, closing the watcher and causing the "error: watch closed before Until timeout".
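
One possible way to tolerate an early watch closure, sketched under assumptions: re-run the status check whenever it exits non-zero, until an overall time budget is spent. `retry_until` is a hypothetical helper, not something the playbook does. It cannot tell a closed watch apart from a genuine rollout failure, so it retries on any non-zero exit, and it relies on GNU `timeout`, which means the command must be an external program rather than a shell function.

```shell
#!/usr/bin/env bash
# retry_until BUDGET_SECONDS PAUSE_SECONDS COMMAND [ARGS...]
# Hypothetical helper: retries COMMAND until it succeeds or the budget runs out.
retry_until() {
  local budget="$1" pause="$2"; shift 2
  local deadline=$(( $(date +%s) + budget ))
  local remaining
  while :; do
    remaining=$(( deadline - $(date +%s) ))
    if [ "$remaining" -le 0 ]; then
      echo "gave up after ${budget}s" >&2
      return 1
    fi
    # Bound each attempt by the time left, so a hung attempt is cut off
    # at the deadline.
    if timeout "$remaining" "$@"; then
      return 0
    fi
    sleep "$pause"
  done
}

# Example (assumes the kubeconfig path used in the report):
# retry_until 900 5 oc rollout status deployment/console \
#   --config=/etc/origin/master/admin.kubeconfig -n openshift-console
```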

Comment 12 Samuel Padgett 2018-08-23 12:19:34 UTC

Yeah, the watch closed error happened when jiajliu ran the command manually, not as part of the install. But could it be an indication that something is generally wrong? We roll out the console right after the masters are updated.

Thanks for confirming on the `rollout status` timeout.

Comment 13 Tomáš Nožička 2018-08-23 12:53:44 UTC

Yeah, likely something else is wrong. I would have to see `oc get deploy,rs,po -o yaml` and `oc get events -o yaml`, and possibly master logs at loglevel 4, to identify the cause. It's hard to say whether the master upgrade is the cause without more information. I don't see a reason why that approach shouldn't work after updating the masters, given they come up fine.

(btw. API server restart is another cause for the closed watch.)
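
The diagnostic commands requested in this thread could be captured in one pass with something like the sketch below. `collect_console_debug` and the `OC` and `NS` variables are hypothetical names used for illustration; the `oc` subcommands are the ones quoted in the comments above, with the console namespace added.

```shell
#!/usr/bin/env bash
# collect_console_debug OUTDIR
# Hypothetical helper: saves the output of each diagnostic command to its own
# file under OUTDIR. OC and NS are illustrative overrides, not playbook settings.
collect_console_debug() {
  local out="$1"
  local oc="${OC:-oc}" ns="${NS:-openshift-console}"
  mkdir -p "$out"
  # Keep going if one command fails; a partial capture is still useful.
  "$oc" get deploy,rs,po -n "$ns" -o yaml > "$out/resources.yaml" 2>&1 || true
  "$oc" get events -n "$ns" -o yaml > "$out/events.yaml" 2>&1 || true
  "$oc" logs deploy/console -n "$ns" > "$out/console.log" 2>&1 || true
}

# Example: collect_console_debug /tmp/console-debug
```

Master logs at loglevel 4 are not covered here, since how to gather them depends on how the masters are run.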

Comment 14 liujia 2018-08-24 01:05:17 UTC

> jiajliu - can you confirm you definitely waited more than 10
> minutes or are you estimating?

Sure, more than 10 minutes. I had to abort the playbook manually after more than 30 minutes.

Comment 15 Samuel Padgett 2018-08-24 14:55:21 UTC

We will need more detail to debug this further.

Are we able to remove the test blocker flag since you can no longer reproduce? Can you let us know if the next upgrade works?

Comment 16 liujia 2018-08-27 01:16:01 UTC

I am not hitting it now, so I am removing the testblocker flag.

Comment 17 Scott Dodson 2018-09-04 11:56:30 UTC

Moving to ON_QA. If this is no longer happening, let's CLOSED NOTABUG this.

Comment 19 liujia 2018-09-25 06:16:53 UTC

Did not hit it on openshift-ansible-3.11.14-1.git.0.65a0c0c.el7.noarch, whether or not the registry was changed during the upgrade.