Red Hat Bugzilla – Bug 1593715
[UPGRADES] DockerInsecureRegistryAddress parameter is not propagated to OC nodes during upgrade
Last modified: 2018-09-20 22:23:40 EDT
Description of problem:
-----------------------
Registries specified within DockerInsecureRegistryAddress in the container-images.yaml inventory file are not propagated to /etc/sysconfig/docker on the overcloud nodes.

Excerpt from container-images.yaml:

  DockerInsecureRegistryAddress:
  - registry.one.example.com
  - registry.two.example.com

After overcloud upgrade prepare:

openstack overcloud upgrade prepare --stack qe-Cloud-1 \
  --templates /usr/share/openstack-tripleo-heat-templates \
  -e /home/stack/composable_roles/roles/nodes.yaml \
  -e /home/stack/composable_roles/internal.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
  -e /home/stack/composable_roles/network/network-environment.yaml \
  -e /home/stack/composable_roles/enable-tls.yaml \
  -e /home/stack/composable_roles/inject-trust-anchor.yaml \
  -e /home/stack/composable_roles/public_vip.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
  -e /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
  -e /home/stack/composable_roles/hostnames.yaml \
  -e /home/stack/composable_roles/debug.yaml \
  -e /home/stack/composable_roles/config_heat.yaml \
  -e /home/stack/composable_roles/docker-images.yaml \
  -e /home/stack/composable_roles/docker-images.yaml \
  --roles-file /home/stack/composable_roles/roles/roles_data.yaml 2>&1

[root@controller-0 ~]# awk '/INSECURE_REGISTRY=/' /etc/sysconfig/docker
INSECURE_REGISTRY="--insecure-registry 192.168.24.1:8787"

registry.one.example.com and registry.two.example.com are not in /etc/sysconfig/docker.

Problem:
========
Overcloud upgrade fails when trying to fetch new docker images:

u'TASK [Pull latest Haproxy images] **********************************************',
u'fatal: [192.168.24.15]: FAILED!
=> {"changed": true, "cmd": ["docker", "pull", "rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-haproxy:2018-06-15.2"], "delta": "0:00:00.040523", "end": "2018-06-21 12:22:24.503019", "msg": "non-zero return code", "rc": 1, "start": "2018-06-21 12:22:24.462496", "stderr": "Get https://rhos-qe-mirror-qeos.usersys.redhat.com:5000/v1/_ping: http: server gave HTTP response to HTTPS client", "stderr_lines": ["Get https://rhos-qe-mirror-qeos.usersys.redhat.com:5000/v1/_ping: http: server gave HTTP response to HTTPS client"], "stdout": "Trying to pull repository rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-haproxy ... ", "stdout_lines": ["Trying to pull repository rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-haproxy ... "]}', u'fatal: [192.168.24.23]: FAILED! => {"changed": true, "cmd": ["docker", "pull", "rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-haproxy:2018-06-15.2"], "delta": "0:00:00.036477", "end": "2018-06-21 12:22:24.566352", "msg": "non-zero return code", "rc": 1, "start": "2018-06-21 12:22:24.529875", "stderr": "Get https://rhos-qe-mirror-qeos.usersys.redhat.com:5000/v1/_ping: http: server gave HTTP response to HTTPS client", "stderr_lines": ["Get https://rhos-qe-mirror-qeos.usersys.redhat.com:5000/v1/_ping: http: server gave HTTP response to HTTPS client"], "stdout": "Trying to pull repository rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-haproxy ... ", "stdout_lines": ["Trying to pull repository rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-haproxy ... "]}', u'fatal: [192.168.24.14]: FAILED! 
=> {"changed": true, "cmd": ["docker", "pull", "rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-haproxy:2018-06-15.2"], "delta": "0:00:00.042652", "end": "2018-06-21 12:22:24.575743", "msg": "non-zero return code", "rc": 1, "start": "2018-06-21 12:22:24.533091", "stderr": "Get https://rhos-qe-mirror-qeos.usersys.redhat.com:5000/v1/_ping: http: server gave HTTP response to HTTPS client", "stderr_lines": ["Get https://rhos-qe-mirror-qeos.usersys.redhat.com:5000/v1/_ping: http: server gave HTTP response to HTTPS client"], "stdout": "Trying to pull repository rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-haproxy ... ", "stdout_lines": ["Trying to pull repository rhos-qe-mirror-qeos.usersys.redhat.com:5000/rhosp13/openstack-haproxy ... "]}', u''

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-tripleo-heat-templates-8.0.2-35.el7ost.noarch
python-tripleoclient-9.2.1-12.el7ost.noarch

How reproducible:
-----------------
So far 100%

Steps to Reproduce:
-------------------
1. Deploy RHOS-12
2. Upgrade the undercloud to RHOS-13
3. Set up the latest repos on the overcloud nodes
4. Prepare a container images file that points to registries that differ from the ones used during deploy

Additional info:
----------------
Virtual env with composable roles.
After checking the registries on all nodes, it seems that nodes in the CephStorage and Compute roles got the registry configured correctly, while nodes in the ControllerOpenstack, Database, Messaging, and Networker roles have 'wrong' (stale) entries.
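To spot the nodes with 'wrong' entries, a per-node check along these lines could be used. This is only a sketch: the expected registry list is taken from the container-images.yaml excerpt above, and by default it runs against a generated sample file (mirroring the state seen on the affected controllers) rather than the real /etc/sysconfig/docker.

```shell
# Sketch: report which expected insecure registries are missing from the
# docker sysconfig. Point DOCKER_SYSCONFIG at /etc/sysconfig/docker to
# check a real overcloud node; the sample-file default is for a dry run.
DOCKER_SYSCONFIG="${DOCKER_SYSCONFIG:-./docker.sysconfig.sample}"
[ -f "$DOCKER_SYSCONFIG" ] || \
  echo 'INSECURE_REGISTRY="--insecure-registry 192.168.24.1:8787"' > "$DOCKER_SYSCONFIG"

# Expected registries, per the container-images.yaml excerpt above.
EXPECTED="registry.one.example.com registry.two.example.com"

for reg in $EXPECTED; do
  if grep -q -- "--insecure-registry ${reg}" "$DOCKER_SYSCONFIG"; then
    echo "OK      ${reg}"
  else
    echo "MISSING ${reg}"
  fi
done > registry-check.txt
cat registry-check.txt
```

On the affected roles this would print MISSING for both registries, since only the original 192.168.24.1:8787 entry survives the upgrade prepare.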
Root cause:
* During `upgrade prepare`, config management is disabled, so /etc/sysconfig/docker does not get the new values at that point.
* During `upgrade run`, we first run the upgrade_tasks and then the normal deploy tasks. /etc/sysconfig/docker is updated early in the deploy tasks.
** For services managed by Paunch, the re-fetching of images and creation of containers is handled by Paunch *after* the docker config has been updated, so there is no problem.
** For services managed by pacemaker, the new images are fetched and tagged in upgrade_tasks, i.e. *before* the docker config has been updated.
** The above explains why this issue only appears on nodes which run some pacemaker-managed containers.
* The issue was triggered by switching to a different registry for the upgrade than was used for the deploy, with both registries used in "insecure" mode.

Impact:
* This problem only appears when the overcloud fetches images from an "insecure registry" and the insecure registry used for the upgrade differs from the one used in the last preceding config management run (likely a preceding `overcloud deploy` command).
* In production environments it is likely that users either point to the CDN (not an insecure registry, so the issue is not triggered) or keep using the same insecure registry (e.g. on the undercloud) for both deploy and upgrade (which again would not trigger the problem).

Workarounds:
If using a different insecure registry for the upgrade than for the deploy, I see these options to avoid the upgrade failure:
* Before upgrading, run an `overcloud deploy` with the new registry URLs added to the DockerInsecureRegistryAddress parameter.
* Update /etc/sysconfig/docker on the overcloud nodes manually to add the registry URLs (only necessary on nodes which run containers managed by pacemaker).
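The second workaround (manually editing /etc/sysconfig/docker) could be sketched roughly as follows. The registry names are the examples from this report; for safety the sketch edits a generated sample copy by default, so it can be tried without touching a live node.

```shell
# Sketch of the manual workaround: append extra --insecure-registry flags
# to the INSECURE_REGISTRY line. Point DOCKER_SYSCONFIG at
# /etc/sysconfig/docker to apply this for real on a pacemaker node.
DOCKER_SYSCONFIG="${DOCKER_SYSCONFIG:-./docker.sysconfig}"
[ -f "$DOCKER_SYSCONFIG" ] || \
  echo 'INSECURE_REGISTRY="--insecure-registry 192.168.24.1:8787"' > "$DOCKER_SYSCONFIG"

# Registries to add; substitute the ones used for your upgrade.
NEW_REGISTRIES="registry.one.example.com registry.two.example.com"

for reg in $NEW_REGISTRIES; do
  # Append the flag only if it is not already present (idempotent).
  if ! grep -q -- "--insecure-registry ${reg}" "$DOCKER_SYSCONFIG"; then
    sed -i "s|^INSECURE_REGISTRY=\"\(.*\)\"|INSECURE_REGISTRY=\"\1 --insecure-registry ${reg}\"|" "$DOCKER_SYSCONFIG"
  fi
done
cat "$DOCKER_SYSCONFIG"

# After editing the real file, restart docker for the change to take
# effect (pacemaker will restart the managed containers):
# systemctl restart docker
```

Note this keeps the existing 192.168.24.1:8787 entry and only appends the new registries, which matches the first workaround's intent of listing both the old and new registries in DockerInsecureRegistryAddress.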
Based on the above investigation I'll triage this as medium/medium, but feel free to adjust or post more feedback.