Description of problem:
When the network connection or the system itself is slow, the playbook times out waiting for the multus pods to roll out, probably because the images are still being downloaded from the registry. The pods eventually come up successfully. The playbook expects them to be up within 2 minutes, but in my environment they came up after 4 minutes.

kube-system   kube-multus-amd64-qqmmc           0/1   ContainerCreating   0   3m
kube-system   kube-ovs-cni-plugin-amd64-fb6hr   0/1   ContainerCreating   0   3m

Version-Release number of selected component (if applicable):
kubevirt-ansible-0.9.2-4.9c5b566.noarch

How reproducible:
Sometimes

Steps to Reproduce:
1. ansible-playbook -i inventory -e@/usr/share/ansible/kubevirt-ansible/vars/all.yml -e@/usr/share/ansible/kubevirt-ansible/vars/cnv.yml -e "registry_url=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888" /usr/share/ansible/kubevirt-ansible/playbooks/kubevirt.yml

Actual results:
TASK [network-multus : Wait until multus is running] ***************************
FAILED - RETRYING: Wait until multus is running (20 retries left).
FAILED - RETRYING: Wait until multus is running (19 retries left).
FAILED - RETRYING: Wait until multus is running (18 retries left).
FAILED - RETRYING: Wait until multus is running (17 retries left).
FAILED - RETRYING: Wait until multus is running (16 retries left).
FAILED - RETRYING: Wait until multus is running (15 retries left).
FAILED - RETRYING: Wait until multus is running (14 retries left).
FAILED - RETRYING: Wait until multus is running (13 retries left).
FAILED - RETRYING: Wait until multus is running (12 retries left).
FAILED - RETRYING: Wait until multus is running (11 retries left).
FAILED - RETRYING: Wait until multus is running (10 retries left).
FAILED - RETRYING: Wait until multus is running (9 retries left).
FAILED - RETRYING: Wait until multus is running (8 retries left).
FAILED - RETRYING: Wait until multus is running (7 retries left).
FAILED - RETRYING: Wait until multus is running (6 retries left).
FAILED - RETRYING: Wait until multus is running (5 retries left).
FAILED - RETRYING: Wait until multus is running (4 retries left).
FAILED - RETRYING: Wait until multus is running (3 retries left).
FAILED - RETRYING: Wait until multus is running (2 retries left).
FAILED - RETRYING: Wait until multus is running (1 retries left).
fatal: [localhost]: FAILED! => {"attempts": 20, "changed": true, "cmd": "oc -n kube-system get daemonset | grep kube-multus-amd64 | awk '{ if ($3 == $4) print \"0\"; else print \"1\"}'", "delta": "0:00:00.210703", "end": "2019-01-08 04:30:34.156477", "rc": 0, "start": "2019-01-08 04:30:33.945774", "stderr": "", "stderr_lines": [], "stdout": "1", "stdout_lines": ["1"]}
    to retry, use: --limit @/usr/share/ansible/kubevirt-ansible/playbooks/kubevirt.retry

Expected results:
The playbook keeps waiting for the pods to come up while they are still in the ContainerCreating state.
Additional info:

PLAY [Initial configuration] ***************************************************

TASK [Login As Super User] *****************************************************
skipping: [localhost]

TASK [Config kubernetes client binary] *****************************************
skipping: [localhost]

TASK [Config openshift client binary] ******************************************
ok: [localhost]

PLAY [Initial configuration] ***************************************************

TASK [Login As Super User] *****************************************************
skipping: [localhost]

TASK [Config kubernetes client binary] *****************************************
skipping: [localhost]

TASK [Config openshift client binary] ******************************************
ok: [localhost]

PLAY [nodes masters] ***********************************************************

TASK [Gathering Facts] *********************************************************
ok: [172.16.0.25]
ok: [172.16.0.24]
ok: [172.16.0.16]

TASK [remove multus config from nodes on deprovisioning] ***********************
skipping: [172.16.0.16] => (item=/etc/cni/net.d/00-multus.conf)
skipping: [172.16.0.16] => (item=/etc/cni/net.d/multus.d)
skipping: [172.16.0.24] => (item=/etc/cni/net.d/00-multus.conf)
skipping: [172.16.0.24] => (item=/etc/cni/net.d/multus.d)
skipping: [172.16.0.25] => (item=/etc/cni/net.d/00-multus.conf)
skipping: [172.16.0.25] => (item=/etc/cni/net.d/multus.d)

TASK [make sure ovs is installed] **********************************************
skipping: [172.16.0.16]
skipping: [172.16.0.24]
skipping: [172.16.0.25]

TASK [enable and start OVS] ****************************************************
skipping: [172.16.0.16]
skipping: [172.16.0.24]
skipping: [172.16.0.25]

TASK [Create /etc/pcidp] *******************************************************
skipping: [172.16.0.16]
skipping: [172.16.0.24]
skipping: [172.16.0.25]

TASK [Configure SR-IOV DP allocation pool] *************************************
skipping: [172.16.0.16]
skipping: [172.16.0.24]
skipping: [172.16.0.25]

TASK [Fix SELinux labels for /var/lib/kubelet/device-plugins/] *****************
skipping: [172.16.0.16]
skipping: [172.16.0.24]
skipping: [172.16.0.25]

PLAY [Deploy network roles] ****************************************************

TASK [network-multus : include_tasks] ******************************************
included: /usr/share/ansible/kubevirt-ansible/roles/network-multus/tasks/provision.yml for localhost

TASK [network-multus : Check if namespace "kube-system" exists] ****************
changed: [localhost]

TASK [network-multus : Create kube-system namespace] ***************************
skipping: [localhost]

TASK [network-multus : openshift cni config] ***********************************
ok: [localhost]

TASK [network-multus : kubernetes cni config] **********************************
skipping: [localhost]

TASK [network-multus : Render multus deployment yaml] **************************
changed: [localhost]

TASK [network-multus : Create multus Resources] ********************************
changed: [localhost]

TASK [network-multus : Render cni plugins deployment yaml] *********************
skipping: [localhost]

TASK [network-multus : Create cni plugins Resources] ***************************
skipping: [localhost]

TASK [network-multus : Render OVS deployment yaml] *****************************
changed: [localhost]

TASK [network-multus : Create ovs Resources] ***********************************
changed: [localhost]

TASK [network-multus : Render ovs-vsctl deployment yaml] ***********************
changed: [localhost]

TASK [network-multus : Create ovs-vsctl resources] *****************************
changed: [localhost]

TASK [network-multus : Render SR-IOV DP deployment yaml] ***********************
skipping: [localhost]

TASK [network-multus : Create SR-IOV DP resources] *****************************
skipping: [localhost]

TASK [network-multus : Render SR-IOV CNI deployment yaml] **********************
skipping: [localhost]

TASK [network-multus : Create SR-IOV CNI resources] ****************************
skipping: [localhost]

TASK [network-multus : Render SR-IOV network CRD yaml] *************************
skipping: [localhost]

TASK [network-multus : Create SR-IOV network CRD] ******************************
skipping: [localhost]

TASK [network-multus : Wait until multus is running] ***************************
FAILED - RETRYING: Wait until multus is running (20 retries left).
FAILED - RETRYING: Wait until multus is running (19 retries left).
FAILED - RETRYING: Wait until multus is running (18 retries left).
FAILED - RETRYING: Wait until multus is running (17 retries left).
FAILED - RETRYING: Wait until multus is running (16 retries left).
FAILED - RETRYING: Wait until multus is running (15 retries left).
FAILED - RETRYING: Wait until multus is running (14 retries left).
FAILED - RETRYING: Wait until multus is running (13 retries left).
FAILED - RETRYING: Wait until multus is running (12 retries left).
FAILED - RETRYING: Wait until multus is running (11 retries left).
FAILED - RETRYING: Wait until multus is running (10 retries left).
FAILED - RETRYING: Wait until multus is running (9 retries left).
FAILED - RETRYING: Wait until multus is running (8 retries left).
FAILED - RETRYING: Wait until multus is running (7 retries left).
FAILED - RETRYING: Wait until multus is running (6 retries left).
FAILED - RETRYING: Wait until multus is running (5 retries left).
FAILED - RETRYING: Wait until multus is running (4 retries left).
FAILED - RETRYING: Wait until multus is running (3 retries left).
FAILED - RETRYING: Wait until multus is running (2 retries left).
FAILED - RETRYING: Wait until multus is running (1 retries left).
fatal: [localhost]: FAILED! => {"attempts": 20, "changed": true, "cmd": "oc -n kube-system get daemonset | grep kube-multus-amd64 | awk '{ if ($3 == $4) print \"0\"; else print \"1\"}'", "delta": "0:00:00.210703", "end": "2019-01-08 04:30:34.156477", "rc": 0, "start": "2019-01-08 04:30:33.945774", "stderr": "", "stderr_lines": [], "stdout": "1", "stdout_lines": ["1"]}
    to retry, use: --limit @/usr/share/ansible/kubevirt-ansible/playbooks/kubevirt.retry

PLAY RECAP *********************************************************************
172.16.0.16                : ok=1    changed=0    unreachable=0    failed=0
172.16.0.24                : ok=1    changed=0    unreachable=0    failed=0
172.16.0.25                : ok=1    changed=0    unreachable=0    failed=0
localhost                  : ok=11   changed=7    unreachable=0    failed=1
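For reference, the failing task is essentially a retry loop around the command shown in the failure output above. A minimal sketch of such a task in Ansible (the register name and the retries/delay values are illustrative only, not the role's actual defaults):

- name: Wait until multus is running
  shell: >
    oc -n kube-system get daemonset
    | grep kube-multus-amd64
    | awk '{ if ($3 == $4) print "0"; else print "1"}'
  register: multus_rollout
  until: multus_rollout.stdout == "0"
  # Raising retries (and/or delay) gives slow registries more time to
  # finish pulling images before the task gives up.
  retries: 40
  delay: 10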
@sscheink Even the new 24*10s timeout wasn't enough to bring multus up in our environment.
@Lukas have you seen this issue since?
Is the timeout happening because the images are pulling? Or is the cluster very slow to respond?
(In reply to Nelly Credi from comment #2)
> @Lukas have you seen this issue since?

I haven't played with clusters for a while, so I cannot confirm at the moment.

(In reply to Ryan Hallisey from comment #3)
> Is the timeout happening because the images are pulling? Or is the cluster
> very slow to respond?

It was waiting on pulling/downloading images, this I know for sure.
(In reply to Lukas Bednar from comment #4)
> It was waiting on pulling/downloading images, this I know for sure.

Thanks Lukas. Though not ideal, customers can pre-pull images to mitigate this (does that need to be documented?). We'll want to be careful about increasing the retries too high and making the UX suffer. In my opinion, we can leave the retries as they are here. WDYT?
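For the record, pre-pulling could be as simple as a small play run against all nodes before the installation, along the lines of the sketch below (hypothetical: the image names and tags are placeholders and would have to match the actual CNV images for the release):

- hosts: nodes
  become: yes
  tasks:
    - name: Pre-pull the network images so the rollout wait does not include pull time
      command: "docker pull {{ item }}"
      loop:
        # Placeholder image references -- substitute the real CNV image names/tags.
        - "{{ registry_url }}/kube-multus-amd64:latest"
        - "{{ registry_url }}/kube-ovs-cni-plugin-amd64:latest"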
(In reply to Ryan Hallisey from comment #5)
> (In reply to Lukas Bednar from comment #4)
> > It was waiting on pulling/downloading images, this I know for sure.
>
> Thanks Lukas. Though not ideal, customers can pre-pull images to mitigate

You just hit the nail on the head: as part of consuming static builds we are pre-pulling images on all nodes. We started doing this two weeks ago, which is probably the reason we don't see this issue anymore (I assume we simply don't hit it, because I don't hear people complaining about it).

> this (does that need to be documented?). We'll want to be careful about increasing

If we expect customers to pre-pull images, then it should definitely be documented.

> the retries too high and making the UX suffer. In my opinion, we can leave the
> retries as they are here. WDYT?

I believe the best option would be to add two checks there: instead of waiting for the container to be ready, we could have 1) a timeout for pulling the image, and then 2) a timeout for the container to become ready.

A second question I have here is why that network container is so big. Is it possible to reduce its size?
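To illustrate the two-check idea, something along these lines could work (just a sketch; the task names, register names, and timeout values are assumptions, not existing kubevirt-ansible code, and it assumes the daemonset pods have already been created):

# 1) Budget for the image pull: wait until no multus pods are stuck in
#    ContainerCreating, i.e. the images have been pulled.
- name: Wait until multus images are pulled
  shell: >
    oc -n kube-system get pods --no-headers
    | grep kube-multus-amd64
    | grep -c ContainerCreating || true
  register: multus_pulling
  until: multus_pulling.stdout == "0"
  retries: 60
  delay: 10

# 2) Budget for the container to become ready, using the same check the role
#    runs today (taken from the failure output above).
- name: Wait until multus is running
  shell: >
    oc -n kube-system get daemonset
    | grep kube-multus-amd64
    | awk '{ if ($3 == $4) print "0"; else print "1"}'
  register: multus_ready
  until: multus_ready.stdout == "0"
  retries: 20
  delay: 10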
> I believe the best option would be to add two checks there: instead of waiting for
> the container to be ready, we could have 1) a timeout for pulling the image, and
> then 2) a timeout for the container to become ready.

I like this idea, not because it would solve anything, but because it would provide a better error message.

> A second question I have here is why that network container is so big.
> Is it possible to reduce its size?

I don't think it's about its size (it isn't THAT big). It's about the order: we pull Multus first, and we pull the base rhel7 image as a dependency, which takes most of the time. We'd like to look into this next week, but I would not block the release on it, as a simple workaround (retry) exists.
Hey - what needs to be documented for this BZ? Should I add a step to the installation procedure, or did you want a Known Issue w/ workaround added to the Release Notes? Thanks.
I hope we can address the Known Issue in the code; but if we don't, I've suggested some text for an out-of-product note.
Yes, I am afraid we'd need to document it away.
Hey Dan, can you please review the Known Issue I added for this BZ? Thanks. https://github.com/openshift/openshift-docs/pull/13374/commits/9d13a15ea86bf892a1bea5fdd1bc18976807dbe7
Ack
What's the status? As a reminder, this is targeted for 1.4.
Should this bug be ON_QA? The patch [1] is not merged yet.

[1] https://github.com/openshift/openshift-docs/pull/13374/commits/6db4a6c13f4d06e70f9d7b6516ac66690c6a6de1
(In reply to Meni Yakove from comment #16)
> Should this bug be ON_QA? The patch [1] is not merged yet.

Speak with Pan, because if "merging" means it may get published when other teams publish the latest OCP documentation... then definitely no, we want to verify this before merging.
If it's safe to merge, then OK, it's on Pan's side to merge ASAP.
(In reply to Federico Simoncelli from comment #17)
> (In reply to Meni Yakove from comment #16)
> > Should this bug be ON_QA? The patch [1] is not merged yet.
>
> Speak with Pan, because if "merging" means it may get published when other
> teams publish the latest OCP documentation... then definitely no, we want to
> verify this before merging.
> If it's safe to merge, then OK, it's on Pan's side to merge ASAP.

Hi Federico/Meni/Dan, this particular PR is for all of the release notes for 1.4. For that reason, I have to keep it open until I'm sure that no other release notes are needed. I won't change the Known Issues text for this BZ unless I hear otherwise.

(In general, CNV doc PRs can be merged to master without being published anywhere, because the CNV docs are not on docs.okd.io, the upstream version of the OpenShift docs. I would have to cherry-pick to 3.11 for them to show up on docs.openshift.com.)

HTH
Looks good to me.