Bug 1813642 - Controller update from 13.0.7 to 13.0.11 is failing (OSP Director 13)
Summary: Controller update from 13.0.7 to 13.0.11 is failing (OSP Director 13)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: z12
Target Release: 13.0 (Queens)
Assignee: Sofer Athlan-Guyot
QA Contact: David Rosenfeld
URL:
Whiteboard:
Duplicates: 1837872
Depends On:
Blocks: epmosp13bugs 1802301
 
Reported: 2020-03-15 06:06 UTC by Sai Ram Peesapati
Modified: 2020-06-24 12:49 UTC
CC: 13 users

Fixed In Version: openstack-tripleo-heat-templates-8.4.1-54.el7ost
Doc Type: Bug Fix
Doc Text:
This update fixes a bug that caused failure of upgrades from the OpenStack Platform maintenance release of 10 July 2019 (RHOSP 13.0.7). Specifically, it fixes a cell management error in OpenStack Compute (nova). Now you can upgrade to newer OpenStack versions from the OpenStack Platform maintenance release of 10 July 2019 (RHOSP 13.0.7).
Clone Of:
Environment:
Last Closed: 2020-06-24 11:33:20 UTC
Target Upstream Version:
Embargoed:


Attachments
ansible-error.json (158.32 KB, text/plain), 2020-03-15 15:14 UTC, Sai Ram Peesapati
update run output (1.53 MB, text/plain), 2020-03-20 23:48 UTC, Sai Ram Peesapati


Links
OpenStack gerrit 720090 (MERGED): Make sure nova_api_ensure_cell0_database_url is deleted. Last updated 2021-02-20 08:30:46 UTC
Red Hat Product Errata RHBA-2020:2718, last updated 2020-06-24 11:33:56 UTC

Description Sai Ram Peesapati 2020-03-15 06:06:13 UTC
Description of problem:
While testing an OSP Director 13 controller update from rhosp-release 13.0.7 to 13.0.11, we ran into the issue below:

"Running container: nova_api_ensure_cell0_database_url", 
"$ docker ps -a --filter label=container_name=nova_api_ensure_cell0_database_url --filter label=config_id=tripleo_step3 --format {{.Names}}", 
"Did not find container with \"['docker', 'ps', '-a', '--filter', 'label=container_name=nova_api_ensure_cell0_database_url', '--filter', 'label=config_id=tripleo_step3', '--format', '{{.Names}}']\"
 - retrying without config_id",
 "$ docker ps -a --filter label=container_name=nova_api_ensure_cell0_database_url --format {{.Names}}",
"nova_api_ensure_cell0_database_url", 
"$ docker run --name nova_api_ensure_cell0_database_url --label config_id=tripleo_step3 --label container_name=nova_api_ensure_cell0_database_url --label managed_by=paunch --label config_data={\"start_order\": 3, \"image\": \"registry.access.redhat.com/rhosp13/openstack-nova-api:13.0-114\", \"environment\": [\"TRIPLEO_CONFIG_HASH=249fea517d1efae4911e08838277b246\"], \"command\": \"/usr/bin/bootstrap_host_exec nova_api /nova_api_ensure_cell0_database_url.sh\", \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/var/log/containers/nova:/var/log/nova\", \"/var/log/containers/httpd/nova-api:/var/log/httpd\", \"/var/lib/config-data/nova/etc/my.cnf.d/tripleo.cnf:/etc/my.cnf.d/tripleo.cnf:ro\", \"/var/lib/config-data/nova/etc/nova/:/etc/nova/:ro\", \"/var/log/containers/nova:/var/log/nova\", \"/var/lib/config-data/puppet-generated/nova/:/var/lib/kolla/config_files/src:ro\", \"/var/lib/docker-config-scripts/nova_api_ensure_cell0_database_url.sh:/nova_api_ensure_cell0_database_url.sh:ro\"], \"net\": \"host\", \"detach\": false} --env=TRIPLEO_CONFIG_HASH=249fea517d1efae4911e08838277b246 --net=host --user=root --volume=/etc/hosts:/etc/hosts:ro --volume=/etc/localtime:/etc/localtime:ro --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro --volume=/dev/log:/dev/log --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro --volume=/etc/puppet:/etc/puppet:ro --volume=/var/log/containers/nova:/var/log/nova --volume=/var/log/containers/httpd/nova-api:/var/log/httpd --volume=/var/lib/config-data/nova/etc/my.cnf.d/tripleo.cnf:/etc/my.cnf.d/tripleo.cnf:ro --volume=/var/lib/config-data/nova/etc/nova/:/etc/nova/:ro --volume=/var/log/containers/nova:/var/log/nova --volume=/var/lib/config-data/puppet-generated/nova/:/var/lib/kolla/config_files/src:ro --volume=/var/lib/docker-config-scripts/nova_api_ensure_cell0_database_url.sh:/nova_api_ensure_cell0_database_url.sh:ro --cpuset-cpus=0,1,2,3 registry.access.redhat.com/rhosp13/openstack-nova-api:13.0-114 /usr/bin/bootstrap_host_exec nova_api /nova_api_ensure_cell0_database_url.sh", 
"/usr/bin/docker-current: Error response from daemon: Conflict. The container name \"/nova_api_ensure_cell0_database_url\" is already in use by container 580df6343e9af347fdf157e1f00fe37e0155c02ed368263ce8fc08466fcf7824. 
You have to remove (or rename) that container to be able to reuse that name..", 
"See '/usr/bin/docker-current run --help'.",
"Error running ['docker', 'run', '--name', u'nova_api_ensure_cell0_database_url', '--label', 'config_id=tripleo_step3', '--label', 'container_name=nova_api_ensure_cell0_database_url', '--label', 'managed_by=paunch', '--label', 'config_data={\"start_order\": 3, \"image\": \"registry.access.redhat.com/rhosp13/openstack-nova-api:13.0-114\", \"environment\": [\"TRIPLEO_CONFIG_HASH=249fea517d1efae4911e08838277b246\"], \"command\": \"/usr/bin/bootstrap_host_exec nova_api /nova_api_ensure_cell0_database_url.sh\", \"user\": \"root\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/var/log/containers/nova:/var/log/nova\", \"/var/log/containers/httpd/nova-api:/var/log/httpd\", \"/var/lib/config-data/nova/etc/my.cnf.d/tripleo.cnf:/etc/my.cnf.d/tripleo.cnf:ro\", \"/var/lib/config-data/nova/etc/nova/:/etc/nova/:ro\", \"/var/log/containers/nova:/var/log/nova\", \"/var/lib/config-data/puppet-generated/nova/:/var/lib/kolla/config_files/src:ro\", \"/var/lib/docker-config-scripts/nova_api_ensure_cell0_database_url.sh:/nova_api_ensure_cell0_database_url.sh:ro\"], \"net\": \"host\", \"detach\": false}', '--env=TRIPLEO_CONFIG_HASH=249fea517d1efae4911e08838277b246', '--net=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/log/containers/nova:/var/log/nova', '--volume=/var/log/containers/httpd/nova-api:/var/log/httpd', '--volume=/var/lib/config-data/nova/etc/my.cnf.d/tripleo.cnf:/etc/my.cnf.d/tripleo.cnf:ro', '--volume=/var/lib/config-data/nova/etc/nova/:/etc/nova/:ro', '--volume=/var/log/containers/nova:/var/log/nova', '--volume=/var/lib/config-data/puppet-generated/nova/:/var/lib/kolla/config_files/src:ro', '--volume=/var/lib/docker-config-scripts/nova_api_ensure_cell0_database_url.sh:/nova_api_ensure_cell0_database_url.sh:ro', '--cpuset-cpus=0,1,2,3', 'registry.access.redhat.com/rhosp13/openstack-nova-api:13.0-114', '/usr/bin/bootstrap_host_exec', 'nova_api', '/nova_api_ensure_cell0_database_url.sh']. [125]",
"stderr: /usr/bin/docker-current: Error response from daemon: 
Conflict. The container name \"/nova_api_ensure_cell0_database_url\" is already in use by container 580df6343e9af347fdf157e1f00fe37e0155c02ed368263ce8fc08466fcf7824. You have to remove (or rename) that container to be able to reuse that name.."



Version-Release number of selected component (if applicable):
13.0.7 overcloud deployment versions:
openstack-tripleo-heat-templates-8.3.1-54.el7ost
rhosp-director-images-13.0-20190627.1.el7ost
rhosp-director-images-ipa-13.0-20190627.1.el7ost
Update versions:
openstack-tripleo-heat-templates-8.4.1-42.el7ost


How reproducible:
When trying to update overcloud from rhosp-release 13.0.7 to 13.0.11.


Steps to Reproduce:
1. Install fresh undercloud 13.0.11 and swap openstack-tripleo-heat-templates with openstack-tripleo-heat-templates-8.3.1-54.el7ost
2. Install rhosp-director-images-13.0-20190627.1.el7ost and rhosp-director-images-ipa-13.0-20190627.1.el7ost and update glance with 13.0.7 overcloud images.
3. Deploy the 13.0.7 overcloud using the overcloud_images.yaml from [1] in the additional info, by running "openstack overcloud deploy --templates -e /home/stack/templates/overcloud_images.yaml --ntp-server <ntp-ip>"
4. Once the deployment is up, update openstack-tripleo-heat-templates to openstack-tripleo-heat-templates-8.4.1-42.el7ost
5. Run below commands
a. source stackrc
b. sudo openstack overcloud container image prepare --namespace=registry.access.redhat.com/rhosp13 --prefix=openstack- --tag-from-label {version}-{release} --output-env-file=/home/stack/templates/overcloud_images.yaml
c. openstack overcloud update prepare --templates -e /home/stack/templates/overcloud_images.yaml --ntp-server <ntp-ip>
d. Run "openstack overcloud update run --nodes Controller"


Actual results:
Conflict. The container name \"/nova_api_ensure_cell0_database_url\" is already in use by container 580df6343e9af347fdf157e1f00fe37e0155c02ed368263ce8fc08466fcf7824. You have to remove (or rename) that container to be able to reuse that name.."


Expected results:
Successfully update overcloud controller nodes from 13.0.7 to 13.0.11.


Additional info:
It looks like in openstack-tripleo-heat-templates-8.3.1-54.el7ost the "nova_api_ensure_cell0_database_url" container starts at "step: 5", whereas in openstack-tripleo-heat-templates-8.4.1-42.el7ost it moved to "step: 3".
As a result, during the controller update the "nova_api_ensure_cell0_database_url" container from 13.0.7 was still present when step 3 ran, and when the newer openstack-tripleo-heat-templates tried to start "nova_api_ensure_cell0_database_url" at step 3, it hit this error:
Conflict. The container name \"/nova_api_ensure_cell0_database_url\" is already in use by container 580df6343e9af347fdf157e1f00fe37e0155c02ed368263ce8fc08466fcf7824. You have to remove (or rename) that container to be able to reuse that name.."
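
If this is right, the leftover container should still carry the old step in its config_id label. A quick check on the affected controller (run as root; these mirror the commands paunch itself runs, per the log above):

docker ps -a --filter label=container_name=nova_api_ensure_cell0_database_url --format '{{.Names}} {{.Status}}'
docker inspect --type container --format '{{index .Config.Labels "config_id"}}' nova_api_ensure_cell0_database_url

If the inspect prints tripleo_step5 rather than tripleo_step3, the container is a stale leftover from the 13.0.7 deployment.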


[1] overcloud_images.yaml
parameter_defaults:
  DockerAodhApiImage: registry.access.redhat.com/rhosp13/openstack-aodh-api:13.0-76
  DockerAodhConfigImage: registry.access.redhat.com/rhosp13/openstack-aodh-api:13.0-76
  DockerAodhEvaluatorImage: registry.access.redhat.com/rhosp13/openstack-aodh-evaluator:13.0-76
  DockerAodhListenerImage: registry.access.redhat.com/rhosp13/openstack-aodh-listener:13.0-75
  DockerAodhNotifierImage: registry.access.redhat.com/rhosp13/openstack-aodh-notifier:13.0-76
  DockerCeilometerCentralImage: registry.access.redhat.com/rhosp13/openstack-ceilometer-central:13.0-73
  DockerCeilometerComputeImage: registry.access.redhat.com/rhosp13/openstack-ceilometer-compute:13.0-75
  DockerCeilometerConfigImage: registry.access.redhat.com/rhosp13/openstack-ceilometer-central:13.0-73
  DockerCeilometerNotificationImage: registry.access.redhat.com/rhosp13/openstack-ceilometer-notification:13.0-75
  DockerCinderApiImage: registry.access.redhat.com/rhosp13/openstack-cinder-api:13.0-79
  DockerCinderConfigImage: registry.access.redhat.com/rhosp13/openstack-cinder-api:13.0-79
  DockerCinderSchedulerImage: registry.access.redhat.com/rhosp13/openstack-cinder-scheduler:13.0-81
  DockerCinderVolumeImage: registry.access.redhat.com/rhosp13/openstack-cinder-volume:13.0-79
  DockerClustercheckConfigImage: registry.access.redhat.com/rhosp13/openstack-mariadb:13.0-77
  DockerClustercheckImage: registry.access.redhat.com/rhosp13/openstack-mariadb:13.0-77
  DockerCrondConfigImage: registry.access.redhat.com/rhosp13/openstack-cron:13.0-82
  DockerCrondImage: registry.access.redhat.com/rhosp13/openstack-cron:13.0-82
  DockerGlanceApiConfigImage: registry.access.redhat.com/rhosp13/openstack-glance-api:13.0-78
  DockerGlanceApiImage: registry.access.redhat.com/rhosp13/openstack-glance-api:13.0-78
  DockerGnocchiApiImage: registry.access.redhat.com/rhosp13/openstack-gnocchi-api:13.0-76
  DockerGnocchiConfigImage: registry.access.redhat.com/rhosp13/openstack-gnocchi-api:13.0-76
  DockerGnocchiMetricdImage: registry.access.redhat.com/rhosp13/openstack-gnocchi-metricd:13.0-77
  DockerGnocchiStatsdImage: registry.access.redhat.com/rhosp13/openstack-gnocchi-statsd:13.0-76
  DockerHAProxyConfigImage: registry.access.redhat.com/rhosp13/openstack-haproxy:13.0-79
  DockerHAProxyImage: registry.access.redhat.com/rhosp13/openstack-haproxy:13.0-79
  DockerIscsidConfigImage: registry.access.redhat.com/rhosp13/openstack-iscsid:13.0-74
  DockerIscsidImage: registry.access.redhat.com/rhosp13/openstack-iscsid:13.0-74
  DockerKeystoneConfigImage: registry.access.redhat.com/rhosp13/openstack-keystone:13.0-74
  DockerKeystoneImage: registry.access.redhat.com/rhosp13/openstack-keystone:13.0-74
  DockerMemcachedConfigImage: registry.access.redhat.com/rhosp13/openstack-memcached:13.0-76
  DockerMemcachedImage: registry.access.redhat.com/rhosp13/openstack-memcached:13.0-76
  DockerMysqlClientConfigImage: registry.access.redhat.com/rhosp13/openstack-mariadb:13.0-77
  DockerMysqlConfigImage: registry.access.redhat.com/rhosp13/openstack-mariadb:13.0-77
  DockerMysqlImage: registry.access.redhat.com/rhosp13/openstack-mariadb:13.0-77
  DockerNeutronDHCPImage: registry.access.redhat.com/rhosp13/openstack-neutron-dhcp-agent:13.0-85
  DockerNeutronL3AgentImage: registry.access.redhat.com/rhosp13/openstack-neutron-l3-agent:13.0-83
  DockerNeutronMetadataImage: registry.access.redhat.com/rhosp13/openstack-neutron-metadata-agent:13.0-86
  DockerNovaApiImage: registry.access.redhat.com/rhosp13/openstack-nova-api:13.0-84
  DockerNovaComputeImage: registry.access.redhat.com/rhosp13/openstack-nova-compute:13.0-92
  DockerNovaConductorImage: registry.access.redhat.com/rhosp13/openstack-nova-conductor:13.0-82
  DockerNovaConfigImage: registry.access.redhat.com/rhosp13/openstack-nova-api:13.0-84
  DockerNovaConsoleauthImage: registry.access.redhat.com/rhosp13/openstack-nova-consoleauth:13.0-82
  DockerNovaLibvirtConfigImage: registry.access.redhat.com/rhosp13/openstack-nova-compute:13.0-92
  DockerNovaLibvirtImage: registry.access.redhat.com/rhosp13/openstack-nova-libvirt:13.0-95
  DockerNovaMetadataImage: registry.access.redhat.com/rhosp13/openstack-nova-api:13.0-84
  DockerNovaPlacementConfigImage: registry.access.redhat.com/rhosp13/openstack-nova-placement-api:13.0-83
  DockerNovaPlacementImage: registry.access.redhat.com/rhosp13/openstack-nova-placement-api:13.0-83
  DockerNovaSchedulerImage: registry.access.redhat.com/rhosp13/openstack-nova-scheduler:13.0-84
  DockerNovaVncProxyImage: registry.access.redhat.com/rhosp13/openstack-nova-novncproxy:13.0-85
  DockerOpenvswitchImage: registry.access.redhat.com/rhosp13/openstack-neutron-openvswitch-agent:13.0-84
  DockerPankoApiImage: registry.access.redhat.com/rhosp13/openstack-panko-api:13.0-76
  DockerPankoConfigImage: registry.access.redhat.com/rhosp13/openstack-panko-api:13.0-76
  DockerRabbitmqConfigImage: registry.access.redhat.com/rhosp13/openstack-rabbitmq:13.0-78
  DockerRabbitmqImage: registry.access.redhat.com/rhosp13/openstack-rabbitmq:13.0-78
  DockerRedisConfigImage: registry.access.redhat.com/rhosp13/openstack-redis:13.0-79
  DockerRedisImage: registry.access.redhat.com/rhosp13/openstack-redis:13.0-79
  DockerSwiftAccountImage: registry.access.redhat.com/rhosp13/openstack-swift-account:13.0-74
  DockerSwiftConfigImage: registry.access.redhat.com/rhosp13/openstack-swift-proxy-server:13.0-76
  DockerSwiftContainerImage: registry.access.redhat.com/rhosp13/openstack-swift-container:13.0-77
  DockerSwiftObjectImage: registry.access.redhat.com/rhosp13/openstack-swift-object:13.0-74
  DockerSwiftProxyImage: registry.access.redhat.com/rhosp13/openstack-swift-proxy-server:13.0-76
  DockerNeutronSriovImage: registry.access.redhat.com/rhosp13/openstack-neutron-sriov-agent:13.0-83

Comment 1 Sai Ram Peesapati 2020-03-15 15:14:15 UTC
Created attachment 1670327 [details]
ansible-error.json

Comment 3 Sunny Verma 2020-03-20 14:48:42 UTC
Hey guys, 
Any update on this?

Thanks.

Comment 4 Sofer Athlan-Guyot 2020-03-20 17:38:52 UTC
Hi,

so we've seen that error before and it was usually due to the
container not being properly removed in the first place.

> It looks like in openstack-tripleo-heat-templates-8.3.1-54.el7ost,
> "nova_api_ensure_cell0_database_url" container starts at "step: 5",
> whereas in openstack-tripleo-heat-templates-8.4.1-42.el7ost starting
> of "nova_api_ensure_cell0_database_url" container moved to "step:
> 3".

This is change id I7b5f6e0a2c8ba77fd575cf1a1003a1553f96efff, and it's
only in z7: z6 doesn't have it, and z8 already has the next change
(the switch to step3).

> Due to which during the controller update,
> "nova_api_ensure_cell0_database_url" from 13.0.7 was still running
> when step is at 3 and when latest openstack-tripleo-heat-temaplates
> are trying to start "nova_api_ensure_cell0_database_url" at step 3,
> this is causing the issue: Conflict. The container name
> \"/nova_api_ensure_cell0_database_url\" is already in use by
> container
> 580df6343e9af347fdf157e1f00fe37e0155c02ed368263ce8fc08466fcf7824. You
> have to remove (or rename) that container to be able to reuse that
> name.."

Those containers should be ephemeral.  My guess here is that it was
not properly removed by docker, and that causes the issue.
As I said, I've seen this before; the problem is that a failure during delete
doesn't cause an error, so it stays unnoticed until we run the update.

It's only a guess because:
 - we need more logs:
   - complete output of the update run
   - sos-report of the overcloud node where the error happened
   - sos-report of the undercloud.
 - the way it was tested is not supported as far as I can tell.

In all cases, I've triggered a job that does the update from z7 to
z11[1] in order to have a reference point.  If it's really caused by
I7b5f6e0a2c8ba77fd575cf1a1003a1553f96efff we will know right away as
this will reproduce systematically.  If it's a removal issue then only
your log will tell.

I'll have the result on Monday, in the meantime if you can provide the
above log that would be helpful.

Thanks,

Comment 5 Sai Ram Peesapati 2020-03-20 23:48:28 UTC
Created attachment 1672132 [details]
update run output

Comment 6 Sai Ram Peesapati 2020-03-20 23:48:45 UTC
Hi,

Thanks for the response. 1813642-ansible.log (attached to this bug) has the complete output of the update run.

As for the sos-reports of the overcloud and undercloud after the failure, we uploaded them to the ftp "dropbox.redhat.com" under /incoming, as sosreport-overcloud-controller-0-1813642-2020-03-20-orbwngp.tar.xz (overcloud controller) and sosreport-undercloud13-1813642-2020-03-20-zdbbguh.tar.xz (undercloud).

Thanks,
Sai

Comment 7 Sunny Verma 2020-03-23 20:58:52 UTC
Hey Sofer Athlan-Guyot ,
Any update on this?

Comment 8 Sofer Athlan-Guyot 2020-03-25 18:42:08 UTC
Hi,

so I was able to reproduce this in a lab. I will be able to analyse further tomorrow, and hopefully come up with a definitive answer.

Regards,

Comment 9 Sofer Athlan-Guyot 2020-03-26 18:29:21 UTC
Hi,

So here is the chain of events:

 - update tripleo-heat-templates to openstack-tripleo-heat-templates-11.3.2-0.20200324120625.c3a8eb4.el8ost
 - which switches nova_api_ensure_cell0_database_url from step5 to step3

In the logs:

 - in the paunch logs we have a series of deletions:

2020-03-24 23:07:07.996 55867 DEBUG paunch [  ] $ docker rm nova_api_ensure_default_cell
2020-03-24 23:07:08.051 55867 DEBUG paunch [  ] nova_api_ensure_default_cell

2020-03-24 23:07:08.051 55867 DEBUG paunch [  ]
2020-03-24 23:07:08.052 55867 DEBUG paunch [  ] $ docker inspect --type container --format {{index .Config.Labels "config_data"}} nova_api_map_cell0
2020-03-24 23:07:08.116 55867 DEBUG paunch [  ] {"start_order": 1, "command": "/usr/bin/bootstrap_host_exec nova_api su nova -s /bin/bash -c '/usr/bin/nova-manage cell_v2 map_cell0'", "user": "root", "volumes": ["/etc/hosts:/etc/hosts:ro", "/etc/localtime:/etc/localtime:ro", "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro", "/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro", "/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro", "/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro", "/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro", "/dev/log:/dev/log", "/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro", "/etc/puppet:/etc/puppet:ro", "/var/log/containers/nova:/var/log/nova", "/var/log/containers/httpd/nova-api:/var/log/httpd", "/var/lib/config-data/nova/etc/my.cnf.d/tripleo.cnf:/etc/my.cnf.d/tripleo.cnf:ro", "/var/lib/config-data/nova/etc/nova/:/etc/nova/:ro"], "image": "192.168.24.1:8787/rhosp13/openstack-nova-api:2019-07-29-grades", "detach": false, "net": "host"}

2020-03-24 23:07:08.117 55867 DEBUG paunch [  ]
2020-03-24 23:07:08.125 55867 DEBUG paunch [  ] Deleting container (changed config_data): nova_api_map_cell0
2020-03-24 23:07:08.125 55867 DEBUG paunch [  ] $ docker stop nova_api_map_cell0
2020-03-24 23:07:08.180 55867 DEBUG paunch [  ] nova_api_map_cell0

2020-03-24 23:07:08.181 55867 DEBUG paunch [  ]
2020-03-24 23:07:08.181 55867 DEBUG paunch [  ] $ docker rm nova_api_map_cell0
2020-03-24 23:07:08.242 55867 DEBUG paunch [  ] nova_api_map_cell0

but nothing about nova_api_ensure_cell0_database_url being deleted and
no error whatsoever.

Then it tries to run it, and it fails because the container was never
deleted in the first place.

   2020-03-24 23:08:13.269 55867 DEBUG paunch [ ] $ docker run --name
   nova_api_ensure_cell0_database_url --label config_id=tripleo_step3
   --label container_name=nova_api_ensure_cell0_database_url --label
   managed_by=paunch --label config_data={"start_order": 3, "image":
   "192.168.24.1:8787/rh-osbs/rhosp13-openstack-nova-api:20200323.1",
   "environment":
   ["TRIPLEO_CONFIG_HASH=2bce43a73f636ff057f68b65bdd839cf"], "command":
   "/usr/bin/bootstrap_host_exec nova_api
   /nova_api_ensure_cell0_database_url.sh", "user": "root", "volumes":
   ["/etc/hosts:/etc/hosts:ro", "/etc/localtime:/etc/localtime:ro",
   "/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro",
   "/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro",
   "/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro",
   "/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro",
   "/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro", "/dev/log:/dev/log",
   "/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro",
   "/etc/puppet:/etc/puppet:ro",
   "/var/log/containers/nova:/var/log/nova",
   "/var/log/containers/httpd/nova-api:/var/log/httpd",
   "/var/lib/config-data/nova/etc/my.cnf.d/tripleo.cnf:/etc/my.cnf.d/tripleo.cnf:ro",
   "/var/lib/config-data/nova/etc/nova/:/etc/nova/:ro",
   "/var/log/containers/nova:/var/log/nova",
   "/var/lib/config-data/puppet-generated/nova/:/var/lib/kolla/config_files/src:ro",
   "/var/lib/docker-config-scripts/nova_api_ensure_cell0_database_url.sh:/nova_api_ensure_cell0_database_url.sh:ro"],
   "net": "host", "detach": false}
   --env=TRIPLEO_CONFIG_HASH=2bce43a73f636ff057f68b65bdd839cf --net=host
   --user=root --volume=/etc/hosts:/etc/hosts:ro
   --volume=/etc/localtime:/etc/localtime:ro
   --volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro
   --volume=/etc/pki/ca-trust/source/anchors:/etc/pki/ca-trust/source/anchors:ro
   --volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro
   --volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro
   --volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro
   --volume=/dev/log:/dev/log
   --volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro
   --volume=/etc/puppet:/etc/puppet:ro
   --volume=/var/log/containers/nova:/var/log/nova
   --volume=/var/log/containers/httpd/nova-api:/var/log/httpd
   --volume=/var/lib/config-data/nova/etc/my.cnf.d/tripleo.cnf:/etc/my.cnf.d/tripleo.cnf:ro
   --volume=/var/lib/config-data/nova/etc/nova/:/etc/nova/:ro
   --volume=/var/log/containers/nova:/var/log/nova
   --volume=/var/lib/config-data/puppet-generated/nova/:/var/lib/kolla/config_files/src:ro
   --volume=/var/lib/docker-config-scripts/nova_api_ensure_cell0_database_url.sh:/nova_api_ensure_cell0_database_url.sh:ro
   --cpuset-cpus=0,1,2,3,4,5,6,7
   192.168.24.1:8787/rh-osbs/rhosp13-openstack-nova-api:20200323.1
   /usr/bin/bootstrap_host_exec nova_api
   /nova_api_ensure_cell0_database_url.sh
   
   2020-03-24 23:08:13.315 55867 DEBUG paunch [ ]
   /usr/bin/docker-current: Error response from daemon: Conflict. The
   container name "/nova_api_ensure_cell0_database_url" is already in use
   by container
   d959dfe22826f52363ce4f1d9dcea454b4331ebaf9bf1f63b6b43f0a7f3d5427. You
   have to remove (or rename) that container to be able to reuse that
   name..

So the interesting bit in this is "config_id".  It is rightfully set to
tripleo_step3 in the run command, but this information is also encoded
in the existing container's label.  So on the live environment we have:

[root@controller-2 ~]# docker inspect nova_api_ensure_cell0_database_url|jq '.[]|.Config.Labels.config_id'
"tripleo_step5"

So I wonder if this is why paunch doesn't try to delete it in the
first place.
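
If so, the mismatch should be visible with the same label filters
paunch uses; a quick sanity check (assuming the stale container from
the log above):

[root@controller-2 ~]# docker ps -a --filter label=config_id=tripleo_step3 --format '{{.Names}}' | grep nova_api_ensure_cell0_database_url
(no output: the step3 reconciliation never sees the container)
[root@controller-2 ~]# docker ps -a --filter label=config_id=tripleo_step5 --format '{{.Names}}' | grep nova_api_ensure_cell0_database_url
nova_api_ensure_cell0_database_url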

The workaround you've applied is correct (renaming the container).
You could just as well delete it without risk, because in the end that's
what we should do (detect that it has moved to step3 and ensure the
delete action happens there as well).

I'm discussing the next course of action right now.

Thanks,

Comment 10 Emilien Macchi 2020-03-26 19:39:13 UTC
Sofer, I found the problem in the env you gave to me.

2 things:

- the version of Paunch was outdated: python-paunch-2.5.0-4.el7ost.noarch, while the latest is python-paunch-2.5.3-3.el7ost.noarch, which includes the patches that fix https://bugzilla.redhat.com/show_bug.cgi?id=1790792 (the same bug you're hitting now, I think).
- A backport in Paunch was missing (not sure you need it in your case, but it's good to have it): https://code.engineering.redhat.com/gerrit/189510

So now I wonder why the paunch rpm wasn't updated on the overclouds?

Comment 11 Sofer Athlan-Guyot 2020-03-27 10:42:16 UTC
Hi Emilien,

(In reply to Emilien Macchi from comment #10)
> Sofer, I found the problem in the env you gave to me.
> 
> 2 things:
> 
> - the version of Paunch was outdated: python-paunch-2.5.0-4.el7ost.noarch,
> while the latest is python-paunch-2.5.3-3.el7ost.noarch, which includes
> the patches that fix https://bugzilla.redhat.com/show_bug.cgi?id=1790792
> (the same bug you're hitting now, I think)

Oh, you've worked on controller-0, while the update started with controller-2

On ctl-2:

rpm -qa | grep paunch
python-paunch-2.5.3-3.el7ost.noarch

and it was updated during update:

/var/log/messages:Mar 24 22:43:41 controller-2 yum[935175]: Updated: python-paunch-2.5.3-3.el7ost.noarch

and before paunch was triggered (see timeline in #9).  So paunch was 2.5.3-3 at the time of the error.

> - A backport in Paunch was missing (not sure you need it in your case, but
> it's good to have it): https://code.engineering.redhat.com/gerrit/189510

Let's try to work on this again today; I'm not sure this patch would help.

Thanks,

Comment 12 Sai Ram Peesapati 2020-04-01 20:32:08 UTC
Hi Sofer,

Any update on this?

Comment 13 Sai Ram Peesapati 2020-04-07 19:18:05 UTC
Hi Sofer,

Any update on this?

Comment 14 Mike Orazi 2020-04-07 19:33:40 UTC
Sofer (or Emilien),

Can we get some updates to this BZ at your nearest convenience?

Comment 15 Emilien Macchi 2020-04-08 13:01:15 UTC
Hi folks,

The last time I looked at this BZ, I realized that a backport into Paunch was missing. I went ahead and did it, and today I built python-paunch-2.5.3-4.el7ost which should include all the needed backports in OSP13.

I would like us to retry this scenario and pull the latest paunch, to see if we can reproduce the issue.
Like I said in previous comments, a bunch of issues related to this BZ were fixed in a paunch version newer than what was on controller-2 when the update failed.

Comment 16 Sunny Verma 2020-04-13 16:41:00 UTC
Thanks Emilien for the update.

1. I see latest python-paunch available today is python-paunch-2.5.3-3.el7ost on rhel-7-server-openstack-13-rpms repository. 

When do you think we will have python-paunch-2.5.3-4.el7ost RPM published in the repo?

2. Is there any ETA from whoever is trying to reproduce this issue, to see if it fixes all the issues or not? As this is a blocker for us, a reasonable ETA would help us plan accordingly on our side.

Thanks.

Comment 17 Emilien Macchi 2020-04-13 17:43:59 UTC
(In reply to Sunny Verma from comment #16)
> Thanks Emillien for update. 
> 
> 1. I see latest python-paunch available today is
> python-paunch-2.5.3-3.el7ost on rhel-7-server-openstack-13-rpms repository. 
> 
> When do you think we will have python-paunch-2.5.3-4.el7ost RPM published in
> the repo?

AFAIK, OSP13 z12 GA is scheduled for June 3rd.

> 2. Is there any ETA from whoever is trying to reproduce this issue, to see
> if it fixes all the issues or not? As this is a blocker for us, a reasonable
> ETA would help us plan accordingly on our side.
> 
> Thanks.

I don't know on my side. Sofer, please let me know if DF needs to help on that one.

Comment 20 Sofer Athlan-Guyot 2020-04-14 22:17:29 UTC
Hi,

so we tested paunch at the version mentioned, but the problem still happened.  Based on the analysis in comment #9, we are going to:

 1. make a first Queens-only patch whose sole purpose is to make sure that nova_api_ensure_cell0_database_url is deleted before reaching the paunch stage;
 2. work on a more long-term solution where those containers are truly ephemeral and get destroyed after being run.

Item 1 is implemented in the new review attached to this bz; a sketch of the idea follows.
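
For illustration only, the cleanup in item 1 boils down to something like the shell below, executed before paunch reconciles step3. This is a sketch of the idea, not the literal patch; the real change is the review linked on this bug.

# Remove the leftover bootstrap container whatever config_id label it
# carries, so paunch can recreate it at its new step.
if docker ps -a --format '{{.Names}}' | grep -qx nova_api_ensure_cell0_database_url; then
    docker rm -f nova_api_ensure_cell0_database_url
fi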

Thanks,

Comment 30 Alex Schultz 2020-05-20 18:15:34 UTC
*** Bug 1837872 has been marked as a duplicate of this bug. ***

Comment 36 errata-xmlrpc 2020-06-24 11:33:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2718

