1538338 – "openstack overcloud container image upload" fails in the middle of uploading images to the local registry.

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-tripleo-common
Sub Component:
Version:	13.0 (Queens)
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	beta
Target Release:	13.0 (Queens)
Assignee:	Toure Dunnon
QA Contact:	Alexander Chuzhoy
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1559062

Reported:	2018-01-24 20:59 UTC by Alexander Chuzhoy
Modified:	2018-06-27 13:44 UTC
CC List:	14 users
Fixed In Version:	openstack-tripleo-common-8.5.1-0.20180304032202.e8d9da9.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1559062
Environment:
Last Closed:	2018-06-27 13:43:23 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Launchpad	1746305	None	None	None	2018-01-30 17:52:07 UTC
Launchpad	1749663	None	None	None	2018-03-01 15:48:31 UTC
OpenStack gerrit	539383	None	MERGED	Make container image upload more resilient	2020-12-15 21:47:33 UTC
OpenStack gerrit	548914	None	MERGED	Use tenacity for image upload retries	2020-12-15 21:47:33 UTC
RDO	12744	None	None	None	2018-03-21 15:39:00 UTC
Red Hat Product Errata	RHEA-2018:2086	None	None	None	2018-06-27 13:44:09 UTC

Description Alexander Chuzhoy 2018-01-24 20:59:34 UTC

"openstack overcloud container image upload" fails in the middle of uploading images to the local registry.


Environment:
openstack-tripleo-heat-templates-8.0.0-0.20180103192341.el7ost.noarch
openstack-puppet-modules-11.0.0-0.20171011152327.71ad01c.el7ost.noarch
instack-undercloud-8.1.1-0.20171223221738.el7ost.noarch



Steps to reproduce:
Try to upload container images to the local registry:


(undercloud) [stack@undercloud74 ~]$ openstack overcloud container image upload --verbose --config-file /home/stack/container_images.yaml


START with options: [u'overcloud', u'container', u'image', u'upload', u'--verbose', u'--config-file', u'/home/stack/container_images.yaml']
command: overcloud container image upload -> tripleoclient.v1.container_image.UploadImage (auth=False)
Using config files: [u'/home/stack/container_images.yaml']
imagename: registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest
Completed upload for docker image registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest
imagename: docker-registry.engineering.redhat.com/rhosp13/openstack-aodh-api:13.0-20180112.1
imagename: docker-registry.engineering.redhat.com/rhosp13/openstack-aodh-notifier:13.0-20180112.1
imagename: docker-registry.engineering.redhat.com/rhosp13/openstack-ceilometer-notification:13.0-20180112.1
imagename: docker-registry.engineering.redhat.com/rhosp13/openstack-cinder-volume:13.0-20180112.1
imagename: docker-registry.engineering.redhat.com/rhosp13/openstack-gnocchi-api:13.0-20180112.1
None: None
END return value: 1



The workaround is to re-run the command.

Comment 2 Emilien Macchi 2018-01-25 17:34:45 UTC

If the workaround is to run again, I guess we need some sort of "retry" mechanism. Steve any thoughts on how we can address that?

Comment 3 Alexander Chuzhoy 2018-01-25 20:51:56 UTC

To clarify: need to re-run until the upload is successful (i.e. one re-run is not enough).

Comment 4 Steve Baker 2018-01-26 02:12:41 UTC

We already retry the "docker pull" since hub.docker.com is flaky, so we can add a retry to the "docker push" as well.

Comment 5 Emilien Macchi 2018-01-29 16:39:08 UTC

We are having trouble reproducing the issue in Omri's environment.
While running the image prepare, which does the docker pull/push, I noticed that the dockerd-current process was sometimes using 150% CPU.
I wouldn't be surprised if TripleO deployment fails to push OSP13 containers into the registry because dockerd-current fails to process all the requests to push containers in the registry.

Before adding a retry function to the push command, I would try to increase the undercloud flavor and probably give more CPU/memory and see if that helps.

We need to keep in mind that the undercloud here is a VM, and a lack of resources could cause this problem. Of course, we can expect our customers to hit the same issue in production, so we have to determine what size the undercloud needs to be and whether a retry on docker push is needed.

Comment 11 Steve Baker 2018-01-31 00:16:24 UTC

Can you please report how many CPU cores on the undercloud flavors which are having issues?

I'm going to propose the following for calculating the worker count instead of hardcoding it to 4:

max(2, cpu_count/2)

This would create 3 workers for a 6 core undercloud, and 4 workers for an 8 core undercloud.

If this issue has *ever* been seen on an 8 core flavor then we'll need a different equation.
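
The proposed equation can be sketched in Python as follows. This is only an illustration of the arithmetic, not the code that was merged upstream: the function name worker_count is hypothetical, and integer division for cpu_count/2 is an assumption made so that the result is a whole number of workers.

```python
import multiprocessing


def worker_count(cpu_count=None):
    """Return the number of upload workers for a given CPU count.

    Hypothetical helper: implements max(2, cpu_count/2) using integer
    division (an assumption), so there are never fewer than 2 workers.
    """
    if cpu_count is None:
        # Fall back to the number of CPUs reported by the running host.
        cpu_count = multiprocessing.cpu_count()
    return max(2, cpu_count // 2)


# The examples from the comment above: 6 cores and 8 cores.
print(worker_count(6))  # 3
print(worker_count(8))  # 4
```

With this equation, a host with 4 or fewer cores still gets 2 workers, and the worker count only exceeds the previous hardcoded value of 4 on hosts with 10 or more cores.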

Comment 12 Alexander Chuzhoy 2018-01-31 00:22:26 UTC

(undercloud) [stack@undercloud74 ~]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                12
On-line CPU(s) list:   0-11
Thread(s) per core:    1
Core(s) per socket:    1
Socket(s):             12
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel Xeon E3-12xx v2 (Ivy Bridge)
Stepping:              9
CPU MHz:               2199.998
BogoMIPS:              4399.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              4096K
NUMA node0 CPU(s):     0-11
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt

Comment 13 Steve Baker 2018-01-31 00:53:04 UTC

Upstream fix has a retry loop for pushes, and a worker count based on CPU count
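
For readers unfamiliar with the idea, a retry loop around a push could look roughly like the sketch below. It is not the upstream patch: push_with_retry, its parameters, and the callable passed in as push are all hypothetical names, and the attempt count and delay are arbitrary values chosen for illustration.

```python
import time


def push_with_retry(push, image, attempts=5, delay=1.0):
    """Call push(image), retrying on failure up to `attempts` times.

    `push` is any callable that raises an exception when the push fails
    (a stand-in for whatever performs the real "docker push").
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return push(image)
        except Exception as exc:  # a real uploader would catch a narrower error
            last_error = exc
            if attempt < attempts:
                # Wait before the next attempt so the daemon can recover.
                time.sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

The loop re-raises the last error once the attempts are exhausted, so a persistent failure still ends the command with a non-zero result instead of being silently swallowed.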

Comment 17 Alexander Chuzhoy 2018-02-14 22:48:15 UTC

I don't encounter the issue anymore:

Environment:
openstack-tripleo-heat-templates-8.0.0-0.20180122224017.el7ost.noarch
instack-undercloud-8.1.1-0.20180117134321.el7ost.noarch
openstack-puppet-modules-11.0.0-0.20171011152327.71ad01c.el7ost.noarch

Comment 18 Alexander Chuzhoy 2018-02-14 23:10:27 UTC

Switched to verified based on comment #17.

Will re-open if it reproduces.

Comment 19 Alexander Chuzhoy 2018-02-15 18:03:30 UTC

The issue was reproduced.

Comment 21 Steve Baker 2018-03-01 11:08:48 UTC

Upstream has found a different variant of this bug:
https://bugs.launchpad.net/tripleo/+bug/1749663

Which I've done a fix for:
https://review.openstack.org/#/c/548914/

I'm setting NEEDINFO on Alex for an opinion on whether we should use this bz for this tenacity based fix.

Comment 22 Alex Schultz 2018-03-01 15:48:07 UTC

Yes, we can. I'll add it to this BZ. We'll need to ensure the package availability.

Comment 26 Omri Hochman 2018-04-04 13:35:49 UTC

Unable to reproduce with:
openstack-tripleo-common-8.5.1-0.20180326153322.91f52e9.el7ost.noarch

Comment 28 errata-xmlrpc 2018-06-27 13:43:23 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086
