"openstack overcloud container image upload" fails in the middle of uploading images to the local regitry.
Steps to reproduce:
Try to upload container images to the local registry:
(undercloud) [stack@undercloud74 ~]$ openstack overcloud container image upload --verbose --config-file /home/stack/container_images.yaml
START with options: [u'overcloud', u'container', u'image', u'upload', u'--verbose', u'--config-file', u'/home/stack/container_images.yaml']
command: overcloud container image upload -> tripleoclient.v1.container_image.UploadImage (auth=False)
Using config files: [u'/home/stack/container_images.yaml']
Completed upload for docker image registry.access.redhat.com/rhceph/rhceph-3-rhel7:latest
END return value: 1
The workaround is to re-run the command.
If the workaround is to run again, I guess we need some sort of "retry" mechanism. Steve, any thoughts on how we can address that?
To clarify: the command needs to be re-run until the upload succeeds (i.e. one re-run is not enough).
We already retry the "docker pull" since hub.docker.com is flaky, so we can add a retry for the "docker push" as well.
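For reference, a rough sketch of what a push retry could look like (the function name, attempt count, and backoff are my assumptions, not the actual tripleoclient code):

import subprocess
import time

def docker_push_with_retry(image, attempts=5, delay=3):
    # Hypothetical wrapper mirroring the existing pull retry idea.
    for attempt in range(1, attempts + 1):
        try:
            subprocess.check_call(['docker', 'push', image])
            return
        except subprocess.CalledProcessError:
            if attempt == attempts:
                raise
            # back off a little longer on each failed push
            time.sleep(delay * attempt)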
We are having trouble reproducing the issue in Omri's environment.
While running the image prepare, which does the docker pull/push, I noticed that the dockerd-current process was sometimes taking 150% of CPU.
I wouldn't be surprised if TripleO deployment fails to push OSP13 containers into the registry because dockerd-current fails to process all the requests to push containers in the registry.
Before adding a retry function to the push command, I would try increasing the undercloud flavor, giving it more CPU/memory, and see if that helps.
We need to keep in mind that the undercloud here is a VM, and a lack of resources could be causing this problem. Of course, we can expect our customers to hit the same issue in production, which means we have to determine what size the undercloud needs to be and whether a retry on docker push is needed.
Can you please report how many CPU cores are on the undercloud flavors that are having issues?
I'm going to propose the following for calculating the worker count instead of hardcoding it to 4.
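A minimal sketch of the idea (the floor of 2 and ceiling of 8 are assumptions on my part; the behaviour implied by the numbers below is half the CPU cores):

import multiprocessing

def compute_worker_count():
    # assumed bounds: at least 2 workers, at most 8; otherwise cores // 2
    return min(max(2, multiprocessing.cpu_count() // 2), 8)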
This would create 3 workers for a 6 core undercloud, and 4 workers for an 8 core undercloud.
If this issue has *ever* been seen on an 8 core flavor then we'll need a different equation.
(undercloud) [stack@undercloud74 ~]$ lscpu
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
On-line CPU(s) list: 0-11
Thread(s) per core: 1
Core(s) per socket: 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model name: Intel Xeon E3-12xx v2 (Ivy Bridge)
CPU MHz: 2199.998
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 4096K
NUMA node0 CPU(s): 0-11
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 cx16 sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm fsgsbase smep erms xsaveopt
The upstream fix has a retry loop for pushes and a worker count based on the CPU count.
I don't encounter the issue anymore:
Switched to verified based on comment #17.
Will re-open if reproduces.
The issue was reproduced.
Upstream has found a different variant of this bug:
I've done a fix for it here:
I'm setting NEEDINFO on Alex for an opinion on whether we should use this BZ for the tenacity-based fix.
Yes, we can. I'll add it to this BZ. We'll need to ensure the package is available.
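For illustration, a tenacity-based push retry might look something like this (the decorator parameters and function are assumptions, not the actual patch):

import subprocess
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(reraise=True,
       stop=stop_after_attempt(5),
       wait=wait_exponential(multiplier=1, max=30))
def push_image(image):
    # tenacity re-invokes this on failure, with exponential backoff
    subprocess.check_call(['docker', 'push', image])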
Unable to reproduce with:
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.