Bug 1779165 - [OSP16] Minor update failed on undercloud update stage when creating container storage, 'name is in use'
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.0 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ga
Target Release: 16.0 (Train on RHEL 8.1)
Assignee: RHOS Maint
QA Contact: Sasha Smolyak
Duplicates: 1779308 1788276 1791110
 
Reported: 2019-12-03 12:58 UTC by Roman Safronov
Modified: 2020-02-26 17:16 UTC
CC: 16 users

Fixed In Version: openstack-tripleo-heat-templates-11.3.1-0.20200105035857.f739706.el8ost.noarch
Doc Type: Known Issue
Doc Text:
Cause: Podman performance issues cause updates to fail under load. Consequence: During the update, old containers are not deleted, and the update errors out due to naming conflicts. Workaround (if any): Comment 23 describes a workaround for the naming conflict. Result: The update fails.
Clone Of:
Environment:
Last Closed: 2020-02-06 14:43:39 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1855090 0 None None None 2019-12-04 11:19:09 UTC
Launchpad 1856086 0 None None None 2020-01-02 22:44:53 UTC
Launchpad 1856324 0 None None None 2019-12-23 14:56:39 UTC
OpenStack gerrit 695921 0 'None' MERGED Cast single value list as a string for config id 2021-01-31 00:08:41 UTC
OpenStack gerrit 696673 0 'None' MERGED Fix action Apply ignoring managed-by arg 2021-01-31 00:08:41 UTC
OpenStack gerrit 697280 0 'None' ABANDONED Fix container config IDs lookup by lists vs str 2021-01-31 00:09:25 UTC
OpenStack gerrit 697696 0 None ABANDONED Retry removing containers. 2021-01-31 00:08:41 UTC
OpenStack gerrit 698806 0 None MERGED Use paunch to handle container removal. 2021-01-31 00:09:25 UTC
OpenStack gerrit 698962 0 None ABANDONED podman: force rm --storage and unlink lingering overlayfs 2021-01-31 00:08:41 UTC
OpenStack gerrit 698999 0 None ABANDONED podman: force rm --storage and unlink lingering overlayfs 2021-01-31 00:08:41 UTC
Red Hat Product Errata RHEA-2020:0283 0 None None None 2020-02-06 14:43:56 UTC

Description Roman Safronov 2019-12-03 12:58:05 UTC
Description of problem:
The minor update CI job to the latest passed_phase1 puddle (RHOS_TRUNK-16.0-RHEL-8-20191202.n.1) failed.

Link to the failed job:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/8/


Note: my previous attempt to test the OSP16 minor update with this job (RHOS_TRUNK-16.0-RHEL-8-20191115.n.0 to RHOS_TRUNK-16.0-RHEL-8-20191126.n.2) passed this stage successfully.



Version-Release number of selected component (if applicable):
update from RHOS_TRUNK-16.0-RHEL-8-20191115.n.0 to the latest passed_phase1 (RHOS_TRUNK-16.0-RHEL-8-20191202.n.1)

How reproducible:
Tried once and the issue occurred

Steps to Reproduce:
1. Run the CI job: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-update-16_director-rhel-virthost-3cont_2comp_2net-ipv4-geneve-composable/

specify the following parameters for the job:
PRODUCT_BUILD:   RHOS_TRUNK-16.0-RHEL-8-20191115.n.0
UPDATE_TO:   passed_phase1
DIRECTOR_UPDATE_BUILD:   passed_phase1

Actual results:
Build failed on 'undercloud update' stage


Expected results:
Build succeeded, minor update performed properly

Additional info:


from undercloud_update.log

2019-12-03 12:25:19 | TASK [Debug output for task: Start containers for step 1] **********************
2019-12-03 12:25:19 | Tuesday 03 December 2019  12:25:19 +0000 (0:00:09.610)       0:07:13.261 ******
2019-12-03 12:25:19 | fatal: [undercloud-0]: FAILED! => {
2019-12-03 12:25:19 |     "failed_when_result": true,
2019-12-03 12:25:19 |     "outputs.stdout_lines | default([]) | union(outputs.stderr_lines | default([]))": [
2019-12-03 12:25:19 |         "70e83b809fcabc50f22e7536b4407fe424790193fed052ece54687aa71bd4db2",
2019-12-03 12:25:19 |         "",
2019-12-03 12:25:19 |         "197328514ab778e313faf99331f505c93fe00a9e9ebcacd3abefa14af4a17c0f",
2019-12-03 12:25:19 |         "20b417c537fe6d6a34a73404f5fbff492da53aac7b3971c8fa369d805a3612b1",
2019-12-03 12:25:19 |         "20811f8d666cd2a4898caf29cc8326ee50c5ea64503db2cd968c5fc9094e8605",
2019-12-03 12:25:19 |         "12329d46802a64053b2662b1c536b708e97d65f62b012bac13d7870a76c7899a",
2019-12-03 12:25:19 |         "Trying to pull undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-haproxy:20191202.1...Getting image source signatures",
2019-12-03 12:25:19 |         "Copying blob sha256:54e50146173df64120e677675f80286fa9bfbf4686147534cbb203aa7e6925f6",
2019-12-03 12:25:19 |         "Copying blob sha256:641d7cc5cbc48a13c68806cf25d5bcf76ea2157c3181e1db4f5d0edae34954ac",
2019-12-03 12:25:19 |         "Copying blob sha256:93f0c8c37476696dd27c6a731860e252d452c0f512cf4cc694252f3d3ff862dc",
2019-12-03 12:25:19 |         "Copying blob sha256:c65691897a4d140d441e2024ce086de996b1c4620832b90c973db81329577274",
2019-12-03 12:25:19 |         "Copying config sha256:70e83b809fcabc50f22e7536b4407fe424790193fed052ece54687aa71bd4db2",
2019-12-03 12:25:19 |         "Writing manifest to image destination",
2019-12-03 12:25:19 |         "Storing signatures",
2019-12-03 12:25:19 |         "Trying to pull undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-keepalived:20191202.1...Getting image source signatures",
2019-12-03 12:25:19 |         "Copying blob sha256:b1a5153b62cf35f43fc48264ec54dcc87c4404d2253df7a3d9985909e54751c6",
2019-12-03 12:25:19 |         "Copying config sha256:197328514ab778e313faf99331f505c93fe00a9e9ebcacd3abefa14af4a17c0f",
2019-12-03 12:25:19 |         "Trying to pull undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:20191202.1...Getting image source signatures",
2019-12-03 12:25:19 |         "Copying blob sha256:7483ec0d2594f5fe6beb0d9647c23d201d1f4f3d79075138a834191018b8a92f",
2019-12-03 12:25:19 |         "Copying config sha256:20b417c537fe6d6a34a73404f5fbff492da53aac7b3971c8fa369d805a3612b1",
2019-12-03 12:25:19 |         "Trying to pull undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-memcached:20191202.1...Getting image source signatures",
2019-12-03 12:25:19 |         "Copying blob sha256:5f0aec986338681d4d16eb7381444a1d65659727200d26adcae361352090253c",
2019-12-03 12:25:19 |         "Copying config sha256:20811f8d666cd2a4898caf29cc8326ee50c5ea64503db2cd968c5fc9094e8605",
2019-12-03 12:25:19 |         "Trying to pull undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-rabbitmq:20191202.1...Getting image source signatures",
2019-12-03 12:25:19 |         "Copying blob sha256:126b4d8036b67596339911734d94d14a9bf73b6053469eb1ae965f74a2f478ad",
2019-12-03 12:25:19 |         "Copying config sha256:12329d46802a64053b2662b1c536b708e97d65f62b012bac13d7870a76c7899a",
2019-12-03 12:25:19 |         "Error: error creating container storage: the container name \"keepalived\" is already in use by \"4462204d064c68eea99bb045f41ab8799babc1be6ac02d4de9537e2b46753c01\". You have to remove that container to be able to reuse that name.: that name is already in use",
2019-12-03 12:25:19 |         "Error: error creating container storage: the container name \"memcached\" is already in use by \"05202a6bb6f462144f6309a6f09139a81b68d7da70cf6bfdf37dcb47f99be7f1\". You have to remove that container to be able to reuse that name.: that name is already in use",
2019-12-03 12:25:19 |         "Error: error creating container storage: the container name \"mysql_init_logs\" is already in use by \"9d068617a1726d972a6e522d75a31867e830dce57936a440cef28a0b0f1d1bbe\". You have to remove that container to be able to reuse that name.: that name is already in use",
2019-12-03 12:25:19 |         "Error: error creating container storage: the container name \"rabbitmq_init_logs\" is already in use by \"26a748d9fe52bcc2cf8ddd708ba5949843a3abafef33255a2722067cbde214ef\". You have to remove that container to be able to reuse that name.: that name is already in use",
2019-12-03 12:25:19 |         "Error: error creating container storage: the container name \"haproxy\" is already in use by \"54031a0f83beb91d289d3a569ba441775ffe0d058f48d7086c10646dbc3aadfa\". You have to remove that container to be able to reuse that name.: that name is already in use",
2019-12-03 12:25:19 |         "Error: error creating container storage: the container name \"rabbitmq_bootstrap\" is already in use by \"39bae6226cba8c48a6964d1a1b2216f3fd1786e58bbb8215773f73b5a981a353\". You have to remove that container to be able to reuse that name.: that name is already in use",
2019-12-03 12:25:19 |         "Error: error creating container storage: the container name \"rabbitmq\" is already in use by \"bd159d1e85a715b7aa16f31295013d300db7316e0944b3ee92c36684823fc0dc\". You have to remove that container to be able to reuse that name.: that name is already in use"
2019-12-03 12:25:19 |     ]
2019-12-03 12:25:19 | }
2019-12-03 12:25:19 | 
2019-12-03 12:25:19 | NO MORE HOSTS LEFT *************************************************************
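
For reference, the conflicting container can be inspected and removed by hand (a minimal sketch; "keepalived" is just the first name from the errors above):

  $ sudo podman ps -a --filter name=keepalived --format '{{.ID}} {{.Names}} {{.Status}}'
  $ sudo podman rm -f keepalived   # frees the name so the new container can be created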

Comment 1 Cédric Jeanneret 2019-12-03 15:49:36 UTC
Hello!

I think the issue is related to paunch, and should be corrected already, at least upstream: https://review.opendev.org/#/c/696589/

Would you be able to apply it in your env and confirm it corrects your issue? I'll check if it's already downstream...

Thank you!

Cheers,

C.

Comment 3 Sofer Athlan-Guyot 2019-12-03 18:01:10 UTC
*** Bug 1779308 has been marked as a duplicate of this bug. ***

Comment 4 Sofer Athlan-Guyot 2019-12-03 21:29:23 UTC
Hi,

So I'm investigating that error as well, since I hit it in the update CI[1].  I was about to set up the job to test when I noticed that the puddle RHOS_TRUNK-16.0-RHEL-8-20191202.n.1[2] already has that patch in, as it ships python3-paunch-5.3.1-0.20191202200345.8b47eb2.el8ost.noarch.rpm, which includes that review.  See the changelog there[3].

So it looks like that patch didn't solve the issue.  Moving this back to assignee.


[1] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/16/
[2] http://rhos-qe-mirror-tlv.usersys.redhat.com/rcm-guest/puddles/OpenStack/16.0-RHEL-8/RHOS_TRUNK-16.0-RHEL-8-20191202.n.1/compose/OpenStack/x86_64/os/Packages/
[3] https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=1025918

Comment 5 Sofer Athlan-Guyot 2019-12-03 21:36:10 UTC
FYI, we also checked that https://review.opendev.org/#/c/695929/ was in the puddle, and it is: the downstream patch is 184ae87 and the package is 0.4.1-0.20191202200351.184ae87.el8ost

Comment 7 Bogdan Dobrelya 2019-12-04 09:57:42 UTC
I can confirm tripleo-ansible-0.4.1-0.20191202200351.184ae87.el8ost.noarch.rpm in that puddle also contains https://review.opendev.org/696673 . So we need to figure out which cause remains unaddressed.

Comment 8 Bogdan Dobrelya 2019-12-04 10:58:11 UTC
@Sofer, the job [1] https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/16/ paunch.log shows:

2019-12-03 14:38:43.580 47869 WARNING paunch [  ] Did not find container with "['podman', 'ps', '-a', '--filter', 'label=container_name=ironic_pxe_http', '--filter', "label=config_id=['tripleo_step4']", '--format', '{{.Names}}']" - retrying without config_id

That means that https://review.opendev.org/695921 was **not** applied (yet?) to tripleo-ansible. Paunch then executes the "bad" (pre-update) code and still hits the config_id=['foo'] problem, which https://review.opendev.org/695921 is expected to fix. I'm unsure why that happens: I'd expect the tripleo-ansible lib to be updated from the RPM on the hosts before any paunch executions while the minor update is in progress.

Comment 9 Bogdan Dobrelya 2019-12-04 11:06:19 UTC
Maybe I'm wrong, and the events logged around 2019-12-03 14:38:43 are actually expected, as they represent an installation of the unpatched environment, while the update starts later, at 17:09:30.

Comment 10 Bogdan Dobrelya 2019-12-04 11:15:19 UTC
I discovered that, for example, the original keepalived container (deployed before the minor update started) has "config_id": "['tripleo_step1']"

That's why the code updated from the RPMs cannot find it by the fixed config_id=tripleo_step1 label!

So we need to provide a fallback here as well: the RPM update fixes the "bad" code, but not the bad state.
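
To illustrate the mismatch (a sketch using podman's label filter, with the step from the keepalived example above):

  # containers deployed after the fix carry a plain string label and match this:
  $ sudo podman ps -a --filter label=config_id=tripleo_step1 --format '{{.Names}}'
  # containers deployed before the fix carry the stringified list, so a fallback
  # lookup has to query the old form too:
  $ sudo podman ps -a --filter "label=config_id=['tripleo_step1']" --format '{{.Names}}'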

Comment 11 Carlos Camacho 2019-12-04 14:02:06 UTC
The fixes are being tested here: https://review.opendev.org/#/c/695242 All the patches are merged upstream.

Comment 12 Sofer Athlan-Guyot 2019-12-04 14:28:42 UTC
Hi,

@ccamacho as I said, all the patches are in the puddle but the error still persists.  Now, what Bogdan says in https://review.opendev.org/697280 is that it's failing because the container wasn't deployed with the fix from RHOS_TRUNK-16.0-RHEL-8-20191202.n.1, which is true, as it was deployed with RHOS_TRUNK-16.0-RHEL-8-20191126.n.2 in order to run an update.

If we are seeing this only because we're starting from a puddle that will never see the light of GA, and starting from 26.n.2 is fine, then I think https://review.opendev.org/#/c/695242 is not strictly required downstream (but still required upstream).  I'll try to deploy 26.n.2 with only paunch taken from 02.n.1 to test that hypothesis.

Comment 13 Bogdan Dobrelya 2019-12-04 14:44:59 UTC
I think we can close this as a testing-only issue then. The upstream bug will be kept in progress and backported for Train.

Comment 14 Sofer Athlan-Guyot 2019-12-04 14:47:51 UTC
So before closing it, I'll test #12, and once it's working I'll share the workaround with all downstream CI until we get a new puddle.

Comment 15 Sofer Athlan-Guyot 2019-12-05 09:33:49 UTC
Hi, so the first test failed.

In this one I deployed 02.n.1 with all the above patches but the last one, then updated it to 02.n.1. This has the right version of paunch and should have prevented the issue altogether. It should have been an easy test, even if not conclusive, as nothing should have happened since we were updating to the same version.

So something happened :)  It failed during "Run container-puppet tasks (generate config) during step 1" from the common_tasks with "You have to remove that container to be able to reuse that name.: that name is already in use"

It was on container-puppet-cinder at that time.

Full logs are there: https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/17/

So now we have two issues:
 - the "name already in use" error still persists even with the latest paunch
 - "Run container-puppet" isn't skipped when there is obviously nothing to change

Comment 16 Sofer Athlan-Guyot 2019-12-05 16:04:20 UTC
So, I confirm that the job running the update from RHOS_TRUNK-16.0-RHEL-8-20191202.n.1 to RHOS_TRUNK-16.0-RHEL-8-20191202.n.1 fails during the controller update, in the common_step, during puppet configuration:


2019-12-05 01:31:40 |         "2019-12-05 01:30:41,272 WARNING: 13199 -- Retrying running container: cinder",
2019-12-05 01:31:40 |         "2019-12-05 01:30:44,395 ERROR: 13199 -- ['/usr/bin/podman', 'run', '--user', '0', '--name', 'container-puppet-cinder', '--env', 'PUPPET_TAGS=file,file_line,concat,augeas,cron,cinder_
config,cinder_type,file,concat,file_line,cinder_config,file,concat,file_line,cinder_config,file,concat,file_line,cinder_config,file,concat,file_line', '--env', 'NAME=cinder', '--env', 'HOSTNAME=controller-2', '--e
nv', 'NO_ARCHIVE=', '--env', 'STEP=6', '--env', 'NET_HOST=true', '--env', 'DEBUG=False', '--volume', '/etc/localtime:/etc/localtime:ro', '--volume', '/tmp/tmprzcc2b6j:/etc/config.pp:ro', '--volume', '/etc/puppet/:
/tmp/puppet-etc/:ro', '--volume', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume', '/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume', '/var/lib/config-data:/var/lib/config-data/:rw', '--volume', '/var/lib/container-puppet/puppetlabs/facter.conf:/etc/puppetlabs/facter/facter.conf:ro', '--volume', '/var/lib/container-puppet/puppetlabs/:/opt/puppetlabs/:ro', '--volume', '/dev/log:/dev/log:rw', '--rm', '--log-driver', 'k8s-file', '--log-opt', 'path=/var/log/containers/stdouts/container-puppet-cinder.log', '--security-opt', 'label=disable', '--volume', '/usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro', '--entrypoint', '/var/lib/container-puppet/container-puppet.sh', '--net', 'host', '--volume', '/etc/hosts:/etc/hosts:ro', '--volume', '/var/lib/container-puppet/container-puppet.sh:/var/lib/container-puppet/container-puppet.sh:ro', 'undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-cinder-api:20191202.1'] run failed after Error: error creating container storage: the container name \"container-puppet-cinder\" is already in use by \"81838cad0d3b57e72d09ad6df8d4c494d1c5112a8984741e36f9e09890895e58\". You have to remove that container to be able to reuse that name.: that name is already in use",
2019-12-05 01:31:40 |         " attempt(s): 3",
2019-12-05 01:31:40 |         "2019-12-05 01:30:44,395 WARNING: 13199 -- Retrying running container: cinder",
2019-12-05 01:31:40 |         "2019-12-05 01:30:44,395 ERROR: 13199 -- Failed running container for cinder",
2019-12-05 01:31:40 |         "2019-12-05 01:30:44,395 INFO: 13199 -- Finished processing puppet configs for cinder",


2019-12-05 01:31:40 |         "2019-12-05 01:31:13,009 ERROR: 13205 -- ['/usr/bin/podman', 'run', '--user', '0', '--name', 'container-puppet-neutron', '--env', 'PUPPET_TAGS=file,file_line,concat,augeas,cron,neutro
n_config,neutron_api_config,neutron_plugin_ml2', '--env', 'NAME=neutron', '--env', 'HOSTNAME=controller-2', '--env', 'NO_ARCHIVE=', '--env', 'STEP=6', '--env', 'NET_HOST=true', '--env', 'DEBUG=False', '--volume', 
'/etc/localtime:/etc/localtime:ro', '--volume', '/tmp/tmpycgmu4lr:/etc/config.pp:ro', '--volume', '/etc/puppet/:/tmp/puppet-etc/:ro', '--volume', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--vo
lume', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume', '/etc/pki/tls/cert.pem:/et
c/pki/tls/cert.pem:ro', '--volume', '/var/lib/config-data:/var/lib/config-data/:rw', '--volume', '/var/lib/container-puppet/puppetlabs/facter.conf:/etc/puppetlabs/facter/facter.conf:ro', '--volume', '/var/lib/cont
ainer-puppet/puppetlabs/:/opt/puppetlabs/:ro', '--volume', '/dev/log:/dev/log:rw', '--rm', '--log-driver', 'k8s-file', '--log-opt', 'path=/var/log/containers/stdouts/container-puppet-neutron.log', '--security-opt'
, 'label=disable', '--volume', '/usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro', '--entrypoint', '/var/lib/container-puppet/container-puppet.sh', '--net', 'host', '--volume', '/etc/ho
sts:/etc/hosts:ro', '--volume', '/var/lib/container-puppet/container-puppet.sh:/var/lib/container-puppet/container-puppet.sh:ro', 'undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-neutron-server-o
vn:20191202.1'] run failed after Error: error creating container storage: the container name \"container-puppet-neutron\" is already in use by \"34afb03989a595129c10986722bbc62650cfce2d26fbb5250adcb5eb4bccf6a0\". 
You have to remove that container to be able to reuse that name.: that name is already in use",
2019-12-05 01:31:40 |         "2019-12-05 01:31:13,009 WARNING: 13205 -- Retrying running container: neutron",
2019-12-05 01:31:40 |         "2019-12-05 01:31:13,009 ERROR: 13205 -- Failed running container for neutron",


which ends up in:

2019-12-05 01:31:40 |         "2019-12-05 01:31:38,648 ERROR: 13198 -- ERROR configuring cinder",
2019-12-05 01:31:40 |         "2019-12-05 01:31:38,648 ERROR: 13198 -- ERROR configuring neutron"

This is in the artifact: undercloud.tar.gz in undercloud-0/home/stack/overcloud_update_run_Controller.log

Comment 17 Sofer Athlan-Guyot 2019-12-05 16:52:53 UTC
OK, I think I've found the real cause:

 "time=\"2019-12-04T21:55:51Z\" level=error msg=\"Error removing container 81838cad0d3b57e72d09ad6df8d4c494d1c5112a8984741e36f9e09890895e58: error removing container 81838cad0d3b57e72d09ad6df8d4c494d1c5112a8984741e36f9e09890895e58 root filesystem: unlinkat /var/lib/containers/storage/overlay/c5b17e5ef5e37441917b557a5359997ee70c043f943c14209a77c8d59a5c6e06/merged: device or resource busy\" "

in undercloud-0/var/lib/mistral/qe-Cloud-0/ansible.log on the undercloud. So it seems we fail to remove the container, hence paunch fails later on.
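
If a node ends up in that state, a rough manual-recovery sketch (the layer and container ids are the ones from the log line above; the unmount may need a retry if something is still holding the mount):

  # find the lingering overlay mount for the busy layer
  $ grep c5b17e5ef5e3 /proc/mounts
  # detach it, then remove the leftover container so the name is freed
  $ sudo umount /var/lib/containers/storage/overlay/c5b17e5ef5e37441917b557a5359997ee70c043f943c14209a77c8d59a5c6e06/merged
  $ sudo podman rm -f 81838cad0d3b57e72d09ad6df8d4c494d1c5112a8984741e36f9e09890895e58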

Comment 18 Sofer Athlan-Guyot 2019-12-09 07:53:49 UTC
Hi Bogdan, did you have a chance to look at the logs of the update from the 02.n.1 puddle to itself?  They show errors while removing containers (puppet-config-related ones), which therefore cannot be re-created during the update.  There is an example log in https://bugzilla.redhat.com/show_bug.cgi?id=1779165#c15

Thanks,

Comment 19 Sofer Athlan-Guyot 2019-12-09 16:43:47 UTC
Hi,

some news: either the latest puddle has some fixes that help with that issue, or the issue doesn't happen all the time:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/upgrades/view/update/job/DFG-upgrades-updates-16-from-passed_phase1-HA-ipv4/ job 20 (RHOS_TRUNK-16.0-RHEL-8-20191206.n.1 -> itself) and job 19 (RHOS_TRUNK-16.0-RHEL-8-20191202.n.1 -> RHOS_TRUNK-16.0-RHEL-8-20191206.n.1)

So we can currently lower the priority of this one and wait for a few more runs before calling it closed.

Comment 21 Sofer Athlan-Guyot 2019-12-10 20:57:26 UTC
I had another instance of this issue during undercloud deployment, at task: Run container-puppet tasks (generate config) during step 1

(while the podman run says STEP=6, which is a little confusing)

First:

2019-12-10 13:20:34.177 16627 WARNING tripleoclient.v1.tripleo_deploy.Deploy [  ]         "2019-12-10 18:20:04,631 ERROR: 28609 -- ['/usr/bin/podman', 'run', '--user', '0', '--name', 'container-puppet-mysql', '--env', 'PUPPET_TAGS=file,file_line,concat,augeas,cron,file', '--env', 'NAME=mysql', '--env', 'HOSTNAME=undercloud-0', '--env', 'NO_ARCHIVE=', '--env', 'STEP=6', '--env', 'NET_HOST=true', '--env', 'DEBUG=False', '--volume', '/etc/localtime:/etc/localtime:ro', '--volume', '/tmp/tmpwwj184iz:/etc/config.pp:ro', '--volume', '/etc/puppet/:/tmp/puppet-etc/:ro', '--volume', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume', '/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume', '/var/lib/config-data:/var/lib/config-data/:rw', '--volume', '/var/lib/container-puppet/puppetlabs/facter.conf:/etc/puppetlabs/facter/facter.conf:ro', '--volume', '/var/lib/container-puppet/puppetlabs/:/opt/puppetlabs/:ro', '--volume', '/dev/log:/dev/log:rw', '--rm', '--log-driver', 'k8s-file', '--log-opt', 'path=/var/log/containers/stdouts/container-puppet-mysql.log', '--security-opt', 'label=disable', '--volume', '/usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro', '--entrypoint', '/var/lib/container-puppet/container-puppet.sh', '--net', 'host', '--volume', '/etc/hosts:/etc/hosts:ro', '--volume', '/var/lib/container-puppet/container-puppet.sh:/var/lib/container-puppet/container-puppet.sh:ro', 'undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:20191202.1'] run failed after standard_init_linux.go:211: exec user process caused \"no such file or directory\"",

2019-12-10 13:20:34.177 16627 WARNING tripleoclient.v1.tripleo_deploy.Deploy [  ]         "time=\"2019-12-10T18:20:01Z\" level=error msg=\"Error removing container bf0cdbdc0bc76ad5753962fa351574f6f5d1bfa443d677f2b
2839e4ccc2a0986: error removing container bf0cdbdc0bc76ad5753962fa351574f6f5d1bfa443d677f2b2839e4ccc2a0986 root filesystem: unlinkat /var/lib/containers/storage/overlay/7831a0135e378847e876b09c64769a6cbec5d5dc7aee
6d2f39accbdcdddd6df6/merged: device or resource busy\" ",
2019-12-10 13:20:34.177 16627 WARNING tripleoclient.v1.tripleo_deploy.Deploy [  ]         " attempt(s): 1",
2019-12-10 13:20:34.177 16627 WARNING tripleoclient.v1.tripleo_deploy.Deploy [  ]         "2019-12-10 18:20:04,631 WARNING: 28609 -- Retrying running container: mysql",
2019-12-10 13:20:34.177 16627 WARNING tripleoclient.v1.tripleo_deploy.Deploy [  ]         "2019-12-10 18:20:13,083 WARNING: 28610 -- + mkdir -p /etc/puppet",
2019-12-10 13:20:34.177 16627 WARNING tripleoclient.v1.tripleo_deploy.Deploy [  ]         "+ '[' -n file,file_line,concat,augeas,cron,mistral_config,mistral_config,mistral_config,mistral_config,user,group ']'",
2019-12-10 13:20:34.178 16627 WARNING tripleoclient.v1.tripleo_deploy.Deploy [  ]         "+ TAGS='--tags \"file,file_line,concat,augeas,cron,mistral_config,mistral_config,mistral_config,mistral_config,user,group\
"'",

then:

2019-12-10 13:20:34.188 16627 WARNING tripleoclient.v1.tripleo_deploy.Deploy [  ]         "2019-12-10 18:20:16,118 ERROR: 28609 -- ['/usr/bin/podman', 'run', '--user', '0', '--name', 'container-puppet-mysql', '--env', 'PUPPET_TAGS=file,file_line,concat,augeas,cron,file', '--env', 'NAME=mysql', '--env', 'HOSTNAME=undercloud-0', '--env', 'NO_ARCHIVE=', '--env', 'STEP=6', '--env', 'NET_HOST=true', '--env', 'DEBUG=False', '--volume', '/etc/localtime:/etc/localtime:ro', '--volume', '/tmp/tmpwwj184iz:/etc/config.pp:ro', '--volume', '/etc/puppet/:/tmp/puppet-etc/:ro', '--volume', '/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume', '/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume', '/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume', '/var/lib/config-data:/var/lib/config-data/:rw', '--volume', '/var/lib/container-puppet/puppetlabs/facter.conf:/etc/puppetlabs/facter/facter.conf:ro', '--volume', '/var/lib/container-puppet/puppetlabs/:/opt/puppetlabs/:ro', '--volume', '/dev/log:/dev/log:rw', '--rm', '--log-driver', 'k8s-file', '--log-opt', 'path=/var/log/containers/stdouts/container-puppet-mysql.log', '--security-opt', 'label=disable', '--volume', '/usr/share/openstack-puppet/modules/:/usr/share/openstack-puppet/modules/:ro', '--entrypoint', '/var/lib/container-puppet/container-puppet.sh', '--net', 'host', '--volume', '/etc/hosts:/etc/hosts:ro', '--volume', '/var/lib/container-puppet/container-puppet.sh:/var/lib/container-puppet/container-puppet.sh:ro', 'undercloud-0.ctlplane.redhat.local:8787/rh-osbs/rhosp16-openstack-mariadb:20191202.1'] run failed after Error: error creating container storage: the container name \"container-puppet-mysql\" is already in use by \"bf0cdbdc0bc76ad5753962fa351574f6f5d1bfa443d677f2b2839e4ccc2a0986\". You have to remove that container to be able to reuse that name.: that name is already in use",

So it seems we are retrying, but since the container wasn't properly removed, we keep failing.

Full log there https://rhos-ci-staging-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-upgrades-updates-16-from-passed_phase1-composable-ipv6-scale-up/11/
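
For completeness, the retry approach proposed upstream (https://review.opendev.org/697696) amounts to something like this (a sketch, not the actual patch; the container name is the one from the log above):

  # retry the forced removal a few times; the busy overlay mount sometimes clears up
  for i in 1 2 3; do
      sudo podman rm -f container-puppet-mysql && break
      sleep 5
  done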

Comment 22 Sofer Athlan-Guyot 2019-12-10 21:00:09 UTC
Note the first error, "run failed after standard_init_linux.go:211: exec user process caused \"no such file or directory\"", is symptomatic of calling a binary that isn't there (search the internet for it; it comes up all the time).

Comment 23 Luke Short 2019-12-10 21:11:18 UTC
We were able to find a temporary workaround.

1. Log into the Overcloud node(s) where the container failed to be created.
2. Modify the "/var/lib/containers/storage/overlay-containers/containers.json" file and remove the dictionary with information about the related "container-puppet-*" container. In my case, I had to do this for container-puppet-neutron and container-puppet-iscsid.
3. Delete any existing containers that relate to the service: `sudo podman ps -a | grep neutron; sudo podman rm $CONTAINER_NAME`
4. Re-run the `openstack overcloud deploy` command.

Regarding 2, here is an example dictionary that we removed:

{"id":"ec32b6dc3c8bc65dc5d39efc140e95efc8a49c3d09b46110b9721aca6303540c","names":["container-puppet-iscsid"],"image":"add324c23e9e73a0e0c8cef414a3285f9b3fcc93c6bdbe69bb0d5364f7d5c2d1","layer":"c3555bca24533c899509e4caf880c5a714faef220b241411e7e5c7fa474eddc9","metadata":"{\"image-name\":\"lshort-1.ctlplane.localdomain:8787/rh-osbs/rhosp16-openstack-iscsid:20191126.1\",\"image-id\":\"add324c23e9e73a0e0c8cef414a3285f9b3fcc93c6bdbe69bb0d5364f7d5c2d1\",\"name\":\"container-puppet-iscsid\",\"created-at\":1575419988}","created":"2019-12-04T00:39:48.21833944Z","flags":{"MountLabel":"system_u:object_r:container_file_t:s0:c105,c634","ProcessLabel":""}}

Comment 32 Sofer Athlan-Guyot 2020-01-07 12:20:50 UTC
*** Bug 1788276 has been marked as a duplicate of this bug. ***

Comment 33 pweeks 2020-01-08 13:26:30 UTC
GA blocker since this BZ is dependent on podman 1.6

Comment 37 shreshtha joshi 2020-01-17 12:42:25 UTC
The fix is in the compose. Can we get a blocker ack to add it to the errata?

Comment 41 David Rosenfeld 2020-01-29 14:40:41 UTC
The job DFG-df-splitstack-16-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-update failed due to this BZ. It is now passing:

https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/splitstack/job/DFG-df-splitstack-16-virsh-3cont_3comp_3ceph-blacklist-1compute-1control-update/12/

Comment 43 errata-xmlrpc 2020-02-06 14:43:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2020:0283

Comment 44 Ollie Walsh 2020-02-26 17:16:44 UTC
*** Bug 1791110 has been marked as a duplicate of this bug. ***

