Bug 1765003 - Changing registry namespace and prefix breaks update of HA services.
Summary: Changing registry namespace and prefix breaks update of HA services.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.0 (Train)
Hardware: x86_64
OS: Linux
Priority: high
Severity: medium
Target Milestone: rc
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: RHOS Maint
QA Contact: Sasha Smolyak
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-24 07:01 UTC by Sofer Athlan-Guyot
Modified: 2023-09-14 05:44 UTC
CC List: 7 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.2-0.20200315025718.033aae9.el8ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1824260
Environment:
Last Closed: 2020-07-29 07:49:26 UTC
Target Upstream Version:
Embargoed:


Attachments


Links:
Red Hat Product Errata RHBA-2020:3148 (last updated 2020-07-29 07:50:20 UTC)

Description Sofer Athlan-Guyot 2019-10-24 07:01:51 UTC
Description of problem:

Updating from GA to RHOS_TRUNK-15.0-RHEL-8-20191017.n.0 with the default namespace and prefix doesn't work. In particular, none of the HA services are updated, and each breaks in one way or another depending on how the pcmklatest mechanism is implemented for the service (OVN, for instance, behaves differently than MariaDB).
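For reference, the pcmklatest mechanism pins each pacemaker bundle to a fixed image reference ending in :pcmklatest, and a minor update is supposed to re-point that tag at the newly pulled image. Roughly (a sketch using the names seen in this environment; <new-tag> stands for the incoming tag, and this is not the literal update task):

# The bundle definition keeps a stable reference, e.g.:
#   192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest
# The update is expected to move that tag onto the new image, roughly:
sudo podman tag \
    192.168.24.1:8787/rhosp15/openstack-mariadb:<new-tag> \
    192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest
# If the namespace/prefix change, the retag happens under the new repository
# path (rh-osbs/rhosp15-openstack-mariadb) instead, so the reference the
# bundle still points at is never refreshed.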

Note that the update doesn't "fail" during the controller update even though the control plane is broken; it fails during the compute update:


Error running ['podman', 'run', '--name', 'nova_wait_for_compute_service', '--label', 'config_id=tripleo_step4', '--label', 'container_name=nova_wait_for_compute_service', '--label', 'managed_by=paunch', '--label', 'config_data={\"command\": \"/container-config-scripts/pyshim.sh /container-config-scripts/nova_wait_for_compute_service.py\

2019-10-23 18:20:44 |         "  File \"/container-config-scripts/nova_wait_for_compute_service.py\", line 102, in <module>",
2019-10-23 18:20:44 |         "    service_list = nova.services.list(binary='nova-compute')",
2019-10-23 18:20:44 |         "  File \"/usr/lib/python3.6/site-packages/novaclient/v2/services.py\", line 52, in list",
2019-10-23 18:20:44 |         "    return self._list(url, \"services\")",
2019-10-23 18:20:44 |         "  File \"/usr/lib/python3.6/site-packages/novaclient/base.py\", line 254, in _list",
2019-10-23 18:20:44 |         "    resp, body = self.api.client.get(url)",
2019-10-23 18:20:44 |         "  File \"/usr/lib/python3.6/site-packages/novaclient/client.py\", line 72, in request",
2019-10-23 18:20:44 |         "  File \"/usr/lib/python3.6/site-packages/keystoneauth1/identity/generic/base.py\", line 208, in get_auth_ref",
2019-10-23 18:20:44 |         "    return self._plugin.get_auth_ref(session, **kwargs)",
2019-10-23 18:20:44 |         "keystoneauth1.exceptions.connection.ConnectFailure: Unable to establish connection to http://172.17.1.108:5000/v3/auth/tokens: HTTPConnectionPool(host='172.17.1.108', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc61d06fc88>: Failed to establish a new connection: [Errno 113] No route to host',))",
2019-10-23 18:20:44 |         "urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x7fc61d09ac18>: Failed to establish a new connection: [Errno 113] No route to host",
2019-10-23 18:20:44 |         "urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='172.17.1.108', port=5000): Max retries exceeded with url: /v3/auth/tokens (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fc61d09ac18>: Failed to establish a new connection: [Errno 113] No route to host',))",


This is because the OVN DBs are down.

On the controller, this is the status of the cluster:

Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-0 (version 2.0.1-4.el8_0.4-0eb7991564) - partition with quorum
Last updated: Thu Oct 24 06:49:35 2019
Last change: Wed Oct 23 21:49:07 2019 by root via crm_resource on controller-2

15 nodes configured
46 resources configured

Online: [ controller-0 controller-1 controller-2 ]
GuestOnline: [ galera-bundle-0@controller-1 galera-bundle-1@controller-2 galera-bundle-2@controller-0 rabbitmq-bundle-0@controller-1 rabbitmq-bundle-1@controller-0 rabbitmq-bundle-2@controller-2 redis-bundle-0@controller-1 redis-bundle-1@controller-0 redis-bundle-2@controller-2 ]

Full list of resources:

 podman container set: galera-bundle [192.168.24.1:8787/rhosp15/openstack-mariadb:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):        Master controller-1
   galera-bundle-1      (ocf::heartbeat:galera):        Master controller-2
   galera-bundle-2      (ocf::heartbeat:galera):        Master controller-0
 podman container set: rabbitmq-bundle [192.168.24.1:8787/rhosp15/openstack-rabbitmq:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):      Started controller-1
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):      Started controller-0
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):      Started controller-2
 podman container set: redis-bundle [192.168.24.1:8787/rhosp15/openstack-redis:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis): Slave controller-1
   redis-bundle-1       (ocf::heartbeat:redis): Master controller-0
   redis-bundle-2       (ocf::heartbeat:redis): Slave controller-2
 ip-192.168.24.15       (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-10.0.0.110  (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.1.72 (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-172.17.1.108        (ocf::heartbeat:IPaddr2):       Stopped
 ip-172.17.3.110        (ocf::heartbeat:IPaddr2):       Started controller-0
 ip-172.17.4.102        (ocf::heartbeat:IPaddr2):       Started controller-0
 podman container set: haproxy-bundle [192.168.24.1:8787/rhosp15/openstack-haproxy:pcmklatest]
   haproxy-bundle-podman-0      (ocf::heartbeat:podman):        Started controller-2
   haproxy-bundle-podman-1      (ocf::heartbeat:podman):        Started controller-0
   haproxy-bundle-podman-2      (ocf::heartbeat:podman):        Started controller-1
 podman container set: ovn-dbs-bundle [192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest]
   ovn-dbs-bundle-0     (ocf::ovn:ovndb-servers):       Stopped
   ovn-dbs-bundle-1     (ocf::ovn:ovndb-servers):       Stopped
   ovn-dbs-bundle-2     (ocf::ovn:ovndb-servers):       Stopped
 podman container: openstack-cinder-volume [192.168.24.1:8787/rhosp15/openstack-cinder-volume:pcmklatest]
   openstack-cinder-volume-podman-0     (ocf::heartbeat:podman):        Started controller-2

Failed Resource Actions:
* ovn-dbs-bundle-podman-0_start_0 on controller-2 'unknown error' (1): call=100, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest',
    last-rc-change='Wed Oct 23 21:31:29 2019', queued=0ms, exec=616ms
* ovn-dbs-bundle-podman-1_start_0 on controller-2 'unknown error' (1): call=112, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest',
    last-rc-change='Wed Oct 23 21:31:32 2019', queued=0ms, exec=290ms
* ovn-dbs-bundle-podman-2_start_0 on controller-2 'unknown error' (1): call=115, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest',
    last-rc-change='Wed Oct 23 21:31:35 2019', queued=0ms, exec=263ms
* openstack-cinder-volume-podman-0_start_0 on controller-0 'unknown error' (1): call=156, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp15/openstack-cinder-volume:pcmklatest',
    last-rc-change='Wed Oct 23 21:28:32 2019', queued=0ms, exec=292ms
* ovn-dbs-bundle-podman-1_start_0 on controller-0 'unknown error' (1): call=142, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest',
    last-rc-change='Wed Oct 23 20:54:11 2019', queued=0ms, exec=303ms
* ovn-dbs-bundle-podman-2_start_0 on controller-0 'unknown error' (1): call=158, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest',
    last-rc-change='Wed Oct 23 21:28:37 2019', queued=0ms, exec=283ms
* ovn-dbs-bundle-podman-0_start_0 on controller-0 'unknown error' (1): call=152, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest',
    last-rc-change='Wed Oct 23 21:06:29 2019', queued=0ms, exec=347ms
* openstack-cinder-volume-podman-0_start_0 on controller-1 'unknown error' (1): call=132, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp15/openstack-cinder-volume:pcmklatest',
    last-rc-change='Wed Oct 23 21:28:30 2019', queued=0ms, exec=294ms
* ovn-dbs-bundle-podman-2_start_0 on controller-1 'unknown error' (1): call=134, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest',
    last-rc-change='Wed Oct 23 21:28:34 2019', queued=0ms, exec=276ms
* ovn-dbs-bundle-podman-1_start_0 on controller-1 'unknown error' (1): call=111, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest',
    last-rc-change='Wed Oct 23 21:09:25 2019', queued=0ms, exec=287ms
* ovn-dbs-bundle-podman-0_start_0 on controller-1 'unknown error' (1): call=101, status=complete, exitreason='failed to pull image 192.168.24.1:8787/rhosp15/openstack-ovn-northd:pcmklatest',
    last-rc-change='Wed Oct 23 21:09:22 2019', queued=0ms, exec=354ms

And here are the images related to the cluster:

[heat-admin@controller-0 ~]$ sudo podman images | grep pcmklatest
192.168.24.1:8787/rh-osbs/rhosp15-openstack-cinder-volume             pcmklatest   74e6fe302909   39 hours ago   1.22 GB
192.168.24.1:8787/rh-osbs/rhosp15-openstack-ovn-northd                pcmklatest   d0e090f75aa9   39 hours ago   720 MB
192.168.24.1:8787/rh-osbs/rhosp15-openstack-redis                     pcmklatest   2d20cc6fa3aa   39 hours ago   550 MB
192.168.24.1:8787/rh-osbs/rhosp15-openstack-rabbitmq                  pcmklatest   66e9ddfe41bf   39 hours ago   590 MB
192.168.24.1:8787/rh-osbs/rhosp15-openstack-haproxy                   pcmklatest   ac940eaa469d   39 hours ago   548 MB
192.168.24.1:8787/rh-osbs/rhosp15-openstack-mariadb                   pcmklatest   40210064f9e0   39 hours ago   763 MB
192.168.24.1:8787/rhosp15/openstack-redis                             pcmklatest   cb55f02698e9   5 weeks ago    502 MB
192.168.24.1:8787/rhosp15/openstack-haproxy                           pcmklatest   c5826c9e9bed   5 weeks ago    500 MB
192.168.24.1:8787/rhosp15/openstack-rabbitmq                          pcmklatest   df24602a69cc   5 weeks ago    543 MB
192.168.24.1:8787/rhosp15/openstack-mariadb                           pcmklatest   5a9441eaa9e4   5 weeks ago    706 MB
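So after the update the pcmklatest tag only got (re)created under the new rh-osbs/rhosp15-openstack-* names, while the bundles still point at the old rhosp15/openstack-* names; for ovn-northd and cinder-volume no old-name pcmklatest image is even present locally, which matches the services that fail. A quick, purely illustrative way to spot the mismatch on a controller:

# What the bundles will try to use (old namespace/prefix):
sudo pcs status | grep pcmklatest
# What actually carries the pcmklatest tag after the update (new namespace/prefix):
sudo podman images --format '{{.Repository}}:{{.Tag}}' | grep pcmklatest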


Quick analysis:

When updating from GA to RHOS_TRUNK-15.0-RHEL-8-20191017.n.0, both the prefix and the path part of the namespace change.

For 1017:

---
container-image-prepare:
  namespace: registry-proxy.engineering.redhat.com/rh-osbs
  prefix: rhosp15-openstack-
  tag: 20191014.2
puddle:
  rhosp: 15.0
  id: RHOS_TRUNK-15.0-RHEL-8-20191011.n.0

For GA:

container-image-prepare:
  namespace: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15
  prefix: openstack-
  tag: 20190926.1
puddle:
  rhosp: 15.0
  id: RHOS_TRUNK-15.0-RHEL-8-20190926.n.0


This leads to different image names:

 - 1017: rh-osbs/rhosp15-openstack-mariadb
 - GA: rhosp15/openstack-mariadb
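In other words, the repository path is composed from the path part of the namespace plus the prefix, and the :pcmklatest alias is keyed on that full name. A rough shell illustration (variable names here are just for the example, not actual TripleO parameters):

# GA combination
namespace='brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhosp15'
prefix='openstack-'
echo "${namespace#*/}/${prefix}mariadb"    # rhosp15/openstack-mariadb

# 1017 combination
namespace='registry-proxy.engineering.redhat.com/rh-osbs'
prefix='rhosp15-openstack-'
echo "${namespace#*/}/${prefix}mariadb"    # rh-osbs/rhosp15-openstack-mariadb

Since the alias created at deployment time lives under the old name, the update's retag under the new name never reaches it.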


How reproducible:

Always, as soon as you change the path part of the namespace and/or the prefix of the images.

Steps to Reproduce:
1. Install OSP 15 with a namespace/prefix leading to image names of the form rhosp15/openstack-<image-name>.
2. Update container-prepare.yaml with a namespace/prefix leading to image names of the form rh-osbs/rhosp15-openstack-<image-name>.
3. Run the update.

Actual results:

The update breaks: the HA bundles keep referencing the old pcmklatest image names, the image pulls fail, and the control plane is left degraded (the OVN DB servers end up stopped).

Additional information:

I think this kind of namespace/prefix change would break updates all the way back to at least OSP 13 (ever since the pcmklatest mechanism has existed). But we never had a bug for it, so my guess is that it's not something that has happened in real deployments.

I set a medium severity because we first need to assess whether this will affect only test environments or whether customers will face it as well.

Comment 2 Sofer Athlan-Guyot 2019-10-24 07:11:43 UTC
Hi,

so, first, a needinfo for rhos-delivery.

We would like to know if the namespace/prefix change (related to quay.io) will happen for customers as well. If not, then it's purely a testing issue. But as this problem won't have an easy solution, what would be our best option for testing the update without a namespace change? I can see two solutions there:
 - either we deploy GA using quay.io-style image names;
 - or we make the phase 1 image names compatible with GA.

Something else?

Thanks,

Comment 3 Jon Schlueter 2019-10-24 09:26:23 UTC
This will happen in the field if a customer deploys from latest and then later tries to shift to Satellite so they can better control what is exposed to their production environment, and when.

Comment 10 Alex McLeod 2020-06-16 12:28:46 UTC
If this bug requires doc text for errata release, please set the 'Doc Type' and provide draft text according to the template in the 'Doc Text' field. The documentation team will review, edit, and approve the text.

If this bug does not require doc text, please set the 'requires_doc_text' flag to '-'.

Comment 14 errata-xmlrpc 2020-07-29 07:49:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:3148

Comment 15 Red Hat Bugzilla 2023-09-14 05:44:56 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

