Bug 1664274 - [kubevirt-ansible] When network connection or system itself is slow, playbook timeouts on waiting for multus pods to rollout
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Networking
Version: 1.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 1.4
Assignee: Sebastian Scheinkman
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-08 10:00 UTC by Lukas Bednar
Modified: 2019-03-05 11:20 UTC
CC: 9 users

Fixed In Version:
Doc Type: Known Issue
Doc Text:
Cause: Slow connection to your container registry.
Consequence: 4 minutes are not enough to pull in the multus image and its underlying layers.
Workaround (if any): Wait a bit and try again.
Result:
Clone Of:
Environment:
Last Closed: 2019-03-05 11:20:20 UTC
Target Upstream Version:
Embargoed:



Description Lukas Bednar 2019-01-08 10:00:55 UTC
Description of problem:
When the network connection or the system itself is slow, the playbook times out while waiting for the multus pods to roll out.

This is probably due to downloading images from the registry.
The pods eventually come up successfully, but the playbook expects them to be up within 2 minutes; in my environment they came up after about 4 minutes.
kube-system                         kube-multus-amd64-qqmmc                                       0/1       ContainerCreating   0          3m
kube-system                         kube-ovs-cni-plugin-amd64-fb6hr                               0/1       ContainerCreating   0          3m

Version-Release number of selected component (if applicable):
kubevirt-ansible-0.9.2-4.9c5b566.noarch

How reproducible: sometimes


Steps to Reproduce:
1. ansible-playbook -i inventory -e@/usr/share/ansible/kubevirt-ansible/vars/all.yml -e@/usr/share/ansible/kubevirt-ansible/vars/cnv.yml -e "registry_url=brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888" /usr/share/ansible/kubevirt-ansible/playbooks/kubevirt.yml

Actual results:

TASK [network-multus : Wait until multus is running] ***************************
FAILED - RETRYING: Wait until multus is running (20 retries left).
FAILED - RETRYING: Wait until multus is running (19 retries left).
FAILED - RETRYING: Wait until multus is running (18 retries left).
FAILED - RETRYING: Wait until multus is running (17 retries left).
FAILED - RETRYING: Wait until multus is running (16 retries left).
FAILED - RETRYING: Wait until multus is running (15 retries left).
FAILED - RETRYING: Wait until multus is running (14 retries left).
FAILED - RETRYING: Wait until multus is running (13 retries left).
FAILED - RETRYING: Wait until multus is running (12 retries left).
FAILED - RETRYING: Wait until multus is running (11 retries left).
FAILED - RETRYING: Wait until multus is running (10 retries left).
FAILED - RETRYING: Wait until multus is running (9 retries left).
FAILED - RETRYING: Wait until multus is running (8 retries left).
FAILED - RETRYING: Wait until multus is running (7 retries left).
FAILED - RETRYING: Wait until multus is running (6 retries left).
FAILED - RETRYING: Wait until multus is running (5 retries left).
FAILED - RETRYING: Wait until multus is running (4 retries left).
FAILED - RETRYING: Wait until multus is running (3 retries left).
FAILED - RETRYING: Wait until multus is running (2 retries left).
FAILED - RETRYING: Wait until multus is running (1 retries left).
fatal: [localhost]: FAILED! => {"attempts": 20, "changed": true, "cmd": "oc -n kube-system get daemonset | grep kube-multus-amd64 | awk '{ if ($3 == $4) print \"0\"; else print \"1\"}'", "delta": "0:00:00.210703", "end": "2019-01-08 04:30:34.156477", "rc": 0, "start": "2019-01-08 04:30:33.945774", "stderr": "", "stderr_lines": [], "stdout": "1", "stdout_lines": ["1"]}
    to retry, use: --limit @/usr/share/ansible/kubevirt-ansible/playbooks/kubevirt.retry

Expected results:
The playbook should keep waiting for the pods to come up while they are still in the ContainerCreating state.
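
For reference, the failing task polls the daemonset status with the command shown in the error above; the awk expression compares two columns of the oc get daemonset output (most likely CURRENT and READY) and prints "0" once they match. A minimal reconstruction of that wait follows; the Ansible keywords and the delay value are assumptions, only the command and the 20 attempts come from the error output:

- name: Wait until multus is running
  shell: >
    oc -n kube-system get daemonset
    | grep kube-multus-amd64
    | awk '{ if ($3 == $4) print "0"; else print "1"}'
  register: multus_rollout
  until: multus_rollout.stdout == "0"
  retries: 20   # 20 attempts, as seen in the error output
  delay: 6      # assumption: roughly 2 minutes total; comment 1 mentions a later 24*10s variant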


Additional info:
PLAY [Initial configuration] ***************************************************

TASK [Login As Super User] *****************************************************
skipping: [localhost]

TASK [Config kubernetes client binary] *****************************************
skipping: [localhost]

TASK [Config openshift client binary] ******************************************
ok: [localhost]

PLAY [Initial configuration] ***************************************************

TASK [Login As Super User] *****************************************************
skipping: [localhost]

TASK [Config kubernetes client binary] *****************************************
skipping: [localhost]

TASK [Config openshift client binary] ******************************************
ok: [localhost]

PLAY [nodes masters] ***********************************************************

TASK [Gathering Facts] *********************************************************
ok: [172.16.0.25]
ok: [172.16.0.24]
ok: [172.16.0.16]

TASK [remove multus config from nodes on deprovisioning] ***********************
skipping: [172.16.0.16] => (item=/etc/cni/net.d/00-multus.conf)
skipping: [172.16.0.16] => (item=/etc/cni/net.d/multus.d)
skipping: [172.16.0.24] => (item=/etc/cni/net.d/00-multus.conf)
skipping: [172.16.0.24] => (item=/etc/cni/net.d/multus.d)
skipping: [172.16.0.25] => (item=/etc/cni/net.d/00-multus.conf)
skipping: [172.16.0.25] => (item=/etc/cni/net.d/multus.d)

TASK [make sure ovs is installed] **********************************************
skipping: [172.16.0.16]
skipping: [172.16.0.24]
skipping: [172.16.0.25]

TASK [enable and start OVS] ****************************************************
skipping: [172.16.0.16]
skipping: [172.16.0.24]
skipping: [172.16.0.25]

TASK [Create /etc/pcidp] *******************************************************
skipping: [172.16.0.16]
skipping: [172.16.0.24]
skipping: [172.16.0.25]

TASK [Configure SR-IOV DP allocation pool] *************************************
skipping: [172.16.0.16]
skipping: [172.16.0.24]
skipping: [172.16.0.25]

TASK [Fix SELinux labels for /var/lib/kubelet/device-plugins/] *****************
skipping: [172.16.0.16]
skipping: [172.16.0.24]
skipping: [172.16.0.25]

PLAY [Deploy network roles] ****************************************************

TASK [network-multus : include_tasks] ******************************************
included: /usr/share/ansible/kubevirt-ansible/roles/network-multus/tasks/provision.yml for localhost

TASK [network-multus : Check if namespace \"kube-system\" exists] ****************
changed: [localhost]

TASK [network-multus : Create kube-system namespace] ***************************
skipping: [localhost]

TASK [network-multus : openshift cni config] ***********************************
ok: [localhost]

TASK [network-multus : kubernetes cni config] **********************************
skipping: [localhost]

TASK [network-multus : Render multus deployment yaml] **************************
changed: [localhost]

TASK [network-multus : Create multus Resources] ********************************
changed: [localhost]

TASK [network-multus : Render cni plugins deployment yaml] *********************
skipping: [localhost]

TASK [network-multus : Create cni plugins Resources] ***************************
skipping: [localhost]

TASK [network-multus : Render OVS deployment yaml] *****************************
changed: [localhost]

TASK [network-multus : Create ovs Resources] ***********************************
changed: [localhost]

TASK [network-multus : Render ovs-vsctl deployment yaml] ***********************
changed: [localhost]

TASK [network-multus : Create ovs-vsctl resources] *****************************
changed: [localhost]

TASK [network-multus : Render SR-IOV DP deployment yaml] ***********************
skipping: [localhost]

TASK [network-multus : Create SR-IOV DP resources] *****************************
skipping: [localhost]

TASK [network-multus : Render SR-IOV CNI deployment yaml] **********************
skipping: [localhost]

TASK [network-multus : Create SR-IOV CNI resources] ****************************
skipping: [localhost]

TASK [network-multus : Render SR-IOV network CRD yaml] *************************
skipping: [localhost]

TASK [network-multus : Create SR-IOV network CRD] ******************************
skipping: [localhost]

TASK [network-multus : Wait until multus is running] ***************************
FAILED - RETRYING: Wait until multus is running (20 retries left).
FAILED - RETRYING: Wait until multus is running (19 retries left).
FAILED - RETRYING: Wait until multus is running (18 retries left).
FAILED - RETRYING: Wait until multus is running (17 retries left).
FAILED - RETRYING: Wait until multus is running (16 retries left).
FAILED - RETRYING: Wait until multus is running (15 retries left).
FAILED - RETRYING: Wait until multus is running (14 retries left).
FAILED - RETRYING: Wait until multus is running (13 retries left).
FAILED - RETRYING: Wait until multus is running (12 retries left).
FAILED - RETRYING: Wait until multus is running (11 retries left).
FAILED - RETRYING: Wait until multus is running (10 retries left).
FAILED - RETRYING: Wait until multus is running (9 retries left).
FAILED - RETRYING: Wait until multus is running (8 retries left).
FAILED - RETRYING: Wait until multus is running (7 retries left).
FAILED - RETRYING: Wait until multus is running (6 retries left).
FAILED - RETRYING: Wait until multus is running (5 retries left).
FAILED - RETRYING: Wait until multus is running (4 retries left).
FAILED - RETRYING: Wait until multus is running (3 retries left).
FAILED - RETRYING: Wait until multus is running (2 retries left).
FAILED - RETRYING: Wait until multus is running (1 retries left).
fatal: [localhost]: FAILED! => {"attempts": 20, "changed": true, "cmd": "oc -n kube-system get daemonset | grep kube-multus-amd64 | awk '{ if ($3 == $4) print \"0\"; else print \"1\"}'", "delta": "0:00:00.210703", "end": "2019-01-08 04:30:34.156477", "rc": 0, "start": "2019-01-08 04:30:33.945774", "stderr": "", "stderr_lines": [], "stdout": "1", "stdout_lines": ["1"]}
    to retry, use: --limit @/usr/share/ansible/kubevirt-ansible/playbooks/kubevirt.retry

PLAY RECAP *********************************************************************
172.16.0.16                : ok=1    changed=0    unreachable=0    failed=0  
172.16.0.24                : ok=1    changed=0    unreachable=0    failed=0  
172.16.0.25                : ok=1    changed=0    unreachable=0    failed=0  
localhost                  : ok=11   changed=7    unreachable=0    failed=1

Comment 1 Lukas Bednar 2019-01-10 16:34:34 UTC
@sscheink Even the new 24*10s timeout wasn't enough to bring multus up in our environment.

Comment 2 Nelly Credi 2019-01-28 13:28:47 UTC
@Lukas have you seen this issue since?

Comment 3 Ryan Hallisey 2019-01-28 13:57:37 UTC
Is the timeout happening because the images are pulling? Or is the cluster very slow to respond?

Comment 4 Lukas Bednar 2019-01-31 12:35:42 UTC
(In reply to Nelly Credi from comment #2)
> @Lukas have you seen this issue since?

I haven't played with the clusters for a while, so I cannot confirm at the moment.

(In reply to Ryan Hallisey from comment #3)
> Is the timeout happening because the images are pulling? Or is the cluster
> very slow to respond?

It was waiting for images to be pulled / downloaded; this I know for sure.

Comment 5 Ryan Hallisey 2019-01-31 12:53:43 UTC
(In reply to Lukas Bednar from comment #4)
> It was waiting for images to be pulled / downloaded; this I know for sure.

Thanks Lukas.  Though not ideal, customers can pre-pull images to mitigate this (needs to be documented?). We'll want to be careful about increasing the retry too high and causing UX to suffer. In my opinion, we can leave the retry as is here. WDYT?
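
A minimal sketch of the pre-pull mitigation, run once against all nodes before the playbook; it assumes a docker runtime, and the image variables are placeholders that must be replaced with the registry and multus image actually used by the deployment:

- hosts: nodes
  become: yes
  tasks:
    - name: Pre-pull the multus image so the daemonset rollout does not wait on the registry
      command: docker pull "{{ registry_url }}/{{ multus_image }}"
      # registry_url matches the variable passed to kubevirt.yml;
      # multus_image is a placeholder for the image:tag deployed by the network-multus role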

Comment 6 Lukas Bednar 2019-01-31 13:44:39 UTC
(In reply to Ryan Hallisey from comment #5)
> (In reply to Lukas Bednar from comment #4)
> > It was waiting for images to be pulled / downloaded; this I know for sure.
> 
> Thanks Lukas.  Though not ideal, customers can pre-pull images to mitigate

You just hit the nail on the head. As part of consuming the static build we are pre-pulling images on all nodes; we started doing this two weeks ago, and that is probably the reason why we don't see this issue anymore (I assume we don't hit it because I don't hear people complaining about it).

> this (needs to be documented?). We'll want to be careful about increasing

If we expect customers to pre-pull images, then it should definitely be documented.

> the retry too high and causing UX to suffer. In my opinion, we can leave the
> retry as is here. WDYT?

I believe the best approach would be to add two checks there: instead of a single wait for the container to be ready,
we can have 1) a timeout for pulling the image, and then 2) a timeout for waiting until the container becomes ready (see the sketch at the end of this comment).

The second question I have here is: why is that network container so big?
Is it possible to reduce its size?
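
A rough sketch of the two-check idea (not the actual role code; the grep patterns, retry counts, and delays are assumptions):

# Check 1: generous timeout for the image pull -- wait until no multus pod is stuck in ContainerCreating
# (note: a fuller version would also verify that the pods have been scheduled at all)
- name: Wait until the multus image has been pulled
  shell: >
    oc -n kube-system get pods
    | grep kube-multus-amd64
    | grep -c ContainerCreating || true
  register: multus_pulling
  until: multus_pulling.stdout == "0"
  retries: 60   # assumption: allow ~10 minutes for slow registries
  delay: 10

# Check 2: short timeout for the containers to become ready once the image is on the node
- name: Wait until multus is running
  shell: >
    oc -n kube-system get daemonset
    | grep kube-multus-amd64
    | awk '{ if ($3 == $4) print "0"; else print "1"}'
  register: multus_ready
  until: multus_ready.stdout == "0"
  retries: 12   # assumption
  delay: 10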

Comment 7 Dan Kenigsberg 2019-01-31 15:41:59 UTC
> I believe the best approach would be to add two checks there: instead of a single wait for the container to be ready,
> we can have 1) a timeout for pulling the image, and then 2) a timeout for waiting until the container becomes ready.

I like this idea, not because it would solve anything, but because it is going to provide a better error message.

> The second question I have here is: why is that network container so big?
> Is it possible to reduce its size?

I don't think it's about its size (it isn't THAT big). It's about the order. We pull Multus first, and we pull base rhel7 as a dependency, which takes most of the time.

We'd like to look into this next week, but I would not block the release on it, as a simple workaround (retry) exists.

Comment 8 Pan Ousley 2019-02-01 18:23:07 UTC
Hey - what needs to be documented for this BZ? Should I add a step to the installation procedure, or did you want a Known Issue w/ workaround added to the Release Notes? Thanks.

Comment 9 Dan Kenigsberg 2019-02-03 06:41:39 UTC
I hope we can address the Known Issue in the code itself; but if we don't, I've suggested some text for an out-of-product note.

Comment 12 Dan Kenigsberg 2019-02-04 13:27:33 UTC
Yes, I am afraid we'd need to document it away.

Comment 13 Pan Ousley 2019-02-05 21:19:01 UTC
Hey Dan, can you please review the Known Issue I added for this BZ? Thanks.

https://github.com/openshift/openshift-docs/pull/13374/commits/9d13a15ea86bf892a1bea5fdd1bc18976807dbe7

Comment 14 Dan Kenigsberg 2019-02-06 05:55:21 UTC
Ack

Comment 15 Federico Simoncelli 2019-02-11 09:48:58 UTC
What's the status? As a reminder, this is targeted for 1.4.

Comment 16 Meni Yakove 2019-02-11 09:53:02 UTC
Should this bug be ON_QA? The patch [1] is not merged yet.

[1] https://github.com/openshift/openshift-docs/pull/13374/commits/6db4a6c13f4d06e70f9d7b6516ac66690c6a6de1

Comment 17 Federico Simoncelli 2019-02-11 10:38:05 UTC
(In reply to Meni Yakove from comment #16)
> Should this bug be ON_QA? The patch [1] is not merged yet.

Speak with Pan, because if "merging" means that it may get published when other teams publish the latest OCP documentation... then definitely no, we want to verify this before merging.
If it's safe to merge, then OK, it's on Pan's side to merge ASAP.

Comment 18 Pan Ousley 2019-02-11 12:37:32 UTC
(In reply to Federico Simoncelli from comment #17)
> (In reply to Meni Yakove from comment #16)
> > Should this bug be ON_QA? The patch [1] is not merged yet.
> 
> Speak with Pan, because if "merging" means that it may get published when
> other teams publish the latest OCP documentation... then definitely no, we
> want to verify this before merging.
> If it's safe to merge, then OK, it's on Pan's side to merge ASAP.

Hi Federico/Meni/Dan, this particular PR is for all of the release notes for 1.4. For that reason, I have to keep it open until I'm sure that there are no other release notes needed. I won't change the Known Issues text for this BZ unless I hear otherwise.

(In general, CNV doc PRs can be merged to master without publishing anywhere because the CNV docs are not on docs.okd.io (the upstream version of the OpenShift docs). I would have to cherrypick to 3.11 for them to show up on docs.openshift.com)

HTH

Comment 19 Meni Yakove 2019-02-11 12:53:46 UTC
Looks good to me.

