"openstack overcloud container image upload" fails in the middle of uploading images to the local regitry.
Steps to reproduce:
Try to upload container images to the local registry:
(undercloud) [stack@undercloud74 ~]$ openstack overcloud container image upload --verbose --config-file /home/stack/container_images.yaml
START with options: [u'overcloud', u'container', u'image', u'upload', u'--verbose', u'--config-file', u'/home/stack/container_images.yaml']
command: overcloud container image upload -> tripleoclient.v1.container_image.UploadImage (auth=False)
Using config files: [u'/home/stack/container_images.yaml']
Completed upload for docker image registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest
END return value: 1
The workaround is to re-run the command.
If the workaround is to run again, I guess we need some sort of "retry" mechanism. Steve, any thoughts on how we can address that?
To clarify: the command needs to be re-run until the upload succeeds (i.e. one re-run is not enough).
We already retry the "docker pull" since hub.docker.com is flaky, so we can add a retry for the "docker push" as well.
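For reference, a rough sketch of what a push retry could look like (the function name, attempt count, and backoff are my assumptions, not the actual tripleoclient code):

import subprocess
import time

def docker_push_with_retry(image, attempts=5, delay=3):
    # Hypothetical wrapper mirroring the existing pull retry idea.
    for attempt in range(1, attempts + 1):
        try:
            subprocess.check_call(['docker', 'push', image])
            return
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            # back off a little longer on each failed push
            time.sleep(delay * attempt)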
We are having trouble reproducing the issue in Omri's environment.
While running the image prepare, which does the docker pull/push, I noticed that the dockerd-current process was sometimes taking 150% of CPU.
I wouldn't be surprised if TripleO deployment fails to push OSP13 containers into the registry because dockerd-current fails to process all the requests to push containers in the registry.
Before adding a retry function to the push command, I would try increasing the undercloud flavor, giving it more CPU/memory, and see if that helps.
We need to keep in mind that the undercloud here is a VM, and a lack of resources could be causing this problem. Of course, we can expect our customers to hit the same issue in production, which means we have to determine what size the undercloud needs to be and whether a retry on docker push is needed.
Can you please report how many CPU cores are on the undercloud flavors that are having issues?
I'm going to propose the following for calculating the worker count instead of hardcoding it to 4.
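A minimal sketch of the idea (the floor of 2 and ceiling of 8 are assumptions on my part; the behaviour implied by the numbers below is half the CPU cores):

import multiprocessing

def compute_worker_count():
    # assumed bounds: at least 2 workers, at most 8; otherwise cores // 2
    return min(max(2, multiprocessing.cpu_count() // 2), 8)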
This would create 3 workers for a 6 core undercloud, and 4 workers for an 8 core undercloud.
If this issue has *ever* been seen on an 8 core flavor then we'll need a different equation.
(undercloud) [stack@undercloud74 ~]$ lscpu
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model name: Intel Xeon E3-12xx v2 (Ivy Bridge)
CPU MHz: 2199.998
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt
The upstream fix has a retry loop for pushes and a worker count based on the CPU count.
I don't encounter the issue anymore:
Switched to verified based on comment #17.
Will re-open if reproduces.
The issue was reproduced.
Upstream has found a different variant of this bug:
I've done a fix for it here:
I'm setting NEEDINFO on Alex for an opinion on whether we should use this BZ for the tenacity-based fix.
Yes, we can. I'll add it to this BZ. We'll need to ensure the package is available.
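For illustration, a tenacity-based push retry might look something like this (the decorator parameters and function are assumptions, not the actual patch):

import subprocess
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(reraise=True,
       stop=stop_after_attempt(5),
       wait=wait_exponential(multiplier=1, max=30))
def push_image(image):
    # tenacity re-invokes this on failure, with exponential backoff
    subprocess.check_call(['docker', 'push', image])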
Unable to reproduce with:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.