Description of problem:
Upgrade hangs (for at least 15 minutes) on the task [openshift_console : Waiting for console rollout to complete]. The playbook needs a timeout for this task.

---
- name: Waiting for console rollout to complete
  # `oc rollout status` will block until either the rollout succeeds or `spec.progressDeadlineSeconds` elapse.
  # A zero return code indicates the rollout succeeded.
  # https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#complete-deployment
  command: >
    {{ openshift_client_binary }} rollout status deployment/console
    --config={{ openshift.common.config_base }}/master/admin.kubeconfig
    -n openshift-console
  changed_when: false
  # Ignore errors so we can log troubleshooting info on failures.
  ignore_errors: yes
  register: console_rollout_status

Running the same command manually while the playbook was hung:

[root@jliu-10-master-etcd-nfs-1 ~]# oc rollout status deployment/console --config=/etc/origin/master/admin.kubeconfig -n openshift-console
Waiting for deployment spec update to be observed...
Waiting for deployment spec update to be observed...
error: watch closed before Until timeout

Version-Release number of the following components:
openshift-ansible-3.11.0-0.16.0.git.0.e82689aNone.noarch

How reproducible:
sometimes

Steps to Reproduce:
1. Install ocp v3.10 with default registry(registry.access.redhat.com)
2. Set oreg_url to registry.dev.redhat.io, with the correct username/password for the registry
3. Run upgrade against above ocp

Actual results:
Upgrade hangs.

Expected results:
Upgrade succeeds.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
This blocks upgrade testing.
Still hit it on openshift-ansible-3.11.0-0.19.0.git.0.ebd1bf9None.noarch
We should add a timeout. In the meantime, can you include the output from a few commands to help debug?

$ oc get pods -n openshift-console
$ oc get events -n openshift-console
$ oc logs deploy/console -n openshift-console
Can you confirm you definitely waited more than 10 minutes? `oc rollout status` should fail after `spec.progressDeadlineSeconds` is exceeded. It's set to 600s right now: https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_console/files/console-template.yaml#L55
> error: watch closed before Until timeout

That seems to be the same as https://github.com/kubernetes/kubernetes/issues/40224

cc Tomas
Yep, I have multiple PRs open upstream leading to a fix, hopefully for 1.12, but since they change a lot of internals and many of them require approval superpowers, it's going a bit slowly. The premature timeout usually happens on an API timeout, so I am not sure whether that's the cause of your issue or just a manifestation of something else being slow or broken. It's usually around 5m; I'd guess the console should roll out by that time. Timing out the command to see how long it actually waited is a good first step.
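A rough sketch of that first step: the wrapper below runs a command under a hard deadline using coreutils `timeout` and reports how long the command actually ran. The function name `run_with_deadline` and the 900-second value in the usage comment are illustrative assumptions, not something that exists in openshift-ansible.

```shell
# Hypothetical helper: run a command with a hard deadline and report elapsed time.
# Usage: run_with_deadline SECONDS COMMAND [ARGS...]
run_with_deadline() {
  local limit="$1"; shift
  local start end rc
  start=$(date +%s)
  rc=0
  # coreutils `timeout` exits with 124 when it had to kill the command.
  timeout "$limit" "$@" || rc=$?
  end=$(date +%s)
  echo "exit=$rc elapsed=$((end - start))s" >&2
  return "$rc"
}

# Example invocation (assumed values, not executed here):
# run_with_deadline 900 oc rollout status deployment/console \
#   --config=/etc/origin/master/admin.kubeconfig -n openshift-console
```

An exit code of 124 would mean the wrapper's deadline fired before `oc rollout status` returned on its own; any other non-zero code comes from the command itself.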
Did not hit it on openshift-ansible-3.11.0-0.20.0.git.0.ec6d8caNone.noarch. If I hit it again, I will capture the info requested above.
> Timing out the command to see how long it actually waited is a good first step.

Tomas, the `oc rollout status` command should fail after progressDeadlineSeconds, correct? At least that's what the doc says:

https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#deployment-status

```
You can check if a Deployment has failed to progress by using kubectl rollout status. kubectl rollout status returns a non-zero exit code if the Deployment has exceeded the progression deadline.
```

I remember testing this when I added the check, and it worked for me.

jiajliu - can you confirm you definitely waited more than 10 minutes or are you estimating?
Correct. But not if you hit the Until issues first. Something, usually the API timeout or the LB, kills the open connection for the GET call, closing the watcher and causing the "error: watch closed before Until timeout".
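One possible way to cope with a dropped watch is to re-run the status command until it succeeds or an overall deadline passes. The sketch below is only an assumption about how that could look; `retry_until_deadline` is a hypothetical name, and the loop does not try to tell a closed watch apart from a genuine rollout failure, so it retries on any non-zero exit.

```shell
# Hypothetical helper: re-run a command until it succeeds or an overall deadline passes.
# Usage: retry_until_deadline DEADLINE_SECONDS PAUSE_SECONDS COMMAND [ARGS...]
retry_until_deadline() {
  local deadline="$1" pause="$2"; shift 2
  local start now
  start=$(date +%s)
  while :; do
    # Success on any attempt ends the loop immediately.
    if "$@"; then
      return 0
    fi
    now=$(date +%s)
    if [ $((now - start)) -ge "$deadline" ]; then
      echo "giving up after $((now - start))s" >&2
      return 1
    fi
    sleep "$pause"
  done
}

# Example invocation (assumed values, not executed here):
# retry_until_deadline 600 5 oc rollout status deployment/console -n openshift-console
```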
Yeah, the watch closed error happened when jiajliu ran the command manually, not as part of the install. But it could be an indication that something is generally wrong? We roll out the console right after the masters are updated. Thanks for confirming on the `rollout status` timeout.
Yeah, likely something else is wrong, I would have to see `oc get deploy,rs,po -o yaml` and `oc get events -o yaml`, possibly master logs at loglevel 4 to identify the cause. Hard to say if master's upgrade is the cause without more information. I don't see a reason why that approach shouldn't work after updating masters, given they come up fine. (btw. API server restart is another cause for the closed watch.)
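For the next time this reproduces, the requested output could be gathered with something like the sketch below. The function name, the output file names, and the `OC`/`NAMESPACE` variables are assumptions for illustration; the `oc get` invocations are the ones asked for above, scoped to the openshift-console namespace used elsewhere in this report.

```shell
# Hypothetical helper: dump the requested objects and events to files for attaching to the bug.
# OC and NAMESPACE can be overridden; the defaults are taken from this report.
OC=${OC:-oc}
NAMESPACE=${NAMESPACE:-openshift-console}

collect_console_debug() {
  local outdir="$1"
  mkdir -p "$outdir"
  # Keep going if one command fails so the other output is still captured.
  $OC get deploy,rs,po -o yaml -n "$NAMESPACE" > "$outdir/objects.yaml" 2>&1 \
    || echo "warning: could not dump deploy,rs,po" >&2
  $OC get events -o yaml -n "$NAMESPACE" > "$outdir/events.yaml" 2>&1 \
    || echo "warning: could not dump events" >&2
}

# Example invocation (not executed here):
# collect_console_debug /tmp/console-debug
```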
> jiajliu - can you confirm you definitely waited more than 10
> minutes or are you estimating?

Sure, more than 10 minutes. I had to abort the playbook manually after more than 30 minutes.
We will need more detail to debug this further. Are we able to remove the test blocker flag since you can no longer reproduce? Can you let us know if the next upgrade works?
Cannot reproduce it now, so removing the testblocker flag.
Moving to ON_QA. If this is no longer happening, let's close this as NOTABUG.
Did not hit it on openshift-ansible-3.11.14-1.git.0.65a0c0c.el7.noarch, with or without changing the registry during upgrade.