Bug 1876717

Summary: RHOSP16.1 - podman "cannot remove container <container ID> as it is running - running or paused containers cannot be removed without force: container state improper"
Product: Red Hat OpenStack
Component: ceph-ansible
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Severity: unspecified
Priority: unspecified
Status: CLOSED DUPLICATE
Reporter: XinhuaLi <xili>
Assignee: Guillaume Abrioux <gabrioux>
QA Contact: Yogev Rabl <yrabl>
CC: gfidente, johfulto, m.andre
Type: Bug
Last Closed: 2020-09-21 15:46:40 UTC

Description XinhuaLi 2020-09-08 03:28:26 UTC
Description of problem:
We can see the container keep restarting and complaining "cannot remove container". At the same time, these log messages keep flooding in.
Only one ceph-mon container is actually running at that time.
As a workaround, "podman restart <container ID>" does not work; only "podman stop <container ID>" restores the state.
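For reference, a minimal sketch of the workaround (the container ID is the one from the logs below; take the real one from "podman ps -a"):

-------------------------------------------------------------------------------------
# "podman restart" does not recover a container stuck in this state
podman restart 638a2692f6d041eaeb9f66a1d8b85a53c15721c96af74a6eeafb1c319f6d6725
# stopping it instead clears the state so the service can be started again
podman stop 638a2692f6d041eaeb9f66a1d8b85a53c15721c96af74a6eeafb1c319f6d6725
-------------------------------------------------------------------------------------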

-------------------------------------------------------------------------------------
Sep  4 03:15:47 overcloud-controller-0 systemd[1]: Stopped Ceph Monitor.
Sep  4 03:15:47 overcloud-controller-0 systemd[1]: Starting Ceph Monitor...
Sep  4 03:15:47 overcloud-controller-0 podman[709253]: Error: cannot remove container 638a2692f6d041eaeb9f66a1d8b85a53c15721c96af74a6eeafb1c319f6d6725 as it is running - running or paused containers cannot be removed without force: container state improper
Sep  4 03:15:47 overcloud-controller-0 podman[709276]: Error: error creating container storage: the container name "ceph-mon-overcloud-controller-0" is already in use by "638a2692f6d041eaeb9f66a1d8b85a53c15721c96af74a6eeafb1c319f6d6725". You have to remove that container to be able to reuse that name.: that name is already in use
Sep  4 03:15:47 overcloud-controller-0 systemd[1]: ceph-mon: Control process exited, code=exited status=125
Sep  4 03:15:47 overcloud-controller-0 systemd[1]: ceph-mon: Failed with result 'exit-code'.
Sep  4 03:15:47 overcloud-controller-0 systemd[1]: Failed to start Ceph Monitor.
-------------------------------------------------------------------------------------

Version-Release number of selected component (if applicable):
-------------------------------------------------------------------------------------
RHOSP 16.1
rhceph-4-rhel8:4-32 
podman-1.6.4-15.module+el8.2.0+7290+954fb593.x86_64 
podman-docker-1.6.4-15.module+el8.2.0+7290+954fb593.noarch  
-------------------------------------------------------------------------------------

How reproducible:
There is no exact reproduction procedure yet; the problem occurs intermittently.

Steps to Reproduce:
1.
2.
3.

Actual results:
The container cannot restart correctly and keeps flooding the logs.

Expected results:
The container can start/restart without error.

Additional info:
It seems there could be something related to container state detection inside podman.
Could you please help to check?
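For what it is worth, the state podman believes the container is in can be checked with standard podman commands (the ID below is the one from the logs above):

-------------------------------------------------------------------------------------
# what podman reports as the container's current state
podman inspect --format '{{.State.Status}}' 638a2692f6d041eaeb9f66a1d8b85a53c15721c96af74a6eeafb1c319f6d6725
# list all containers (including stopped ones) using that name
podman ps -a --filter name=ceph-mon-overcloud-controller-0
-------------------------------------------------------------------------------------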

Regards.
Sam

Comment 1 John Fulton 2020-09-21 15:46:40 UTC
This message:

the container name "ceph-mon-overcloud-controller-0" is already in use by "638a2692f6d041eaeb9f66a1d8b85a53c15721c96af74a6eeafb1c319f6d6725". You have to remove that container to be able to reuse that name.: that name is already in use

Is from the ceph-mon systemd unit file failing to start the ceph-mon container because that container name is already in use. The unit file needs to be updated so that it removes the older container when its name is already in use; once that container is removed, the new container will be able to start. The old container, 638a..., might not be running correctly, but parts of it are left over and need to be cleaned up.
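For illustration only, the kind of cleanup the unit file needs is roughly the following (a sketch, not the exact ceph-ansible template; the unit and container names are taken from the logs above, and the run options are abbreviated):

-------------------------------------------------------------------------------------
[Service]
# remove any leftover container holding the name before starting a new one;
# the leading "-" tells systemd to ignore a failure if nothing is left to remove
ExecStartPre=-/usr/bin/podman rm -f ceph-mon-overcloud-controller-0
ExecStart=/usr/bin/podman run --name ceph-mon-overcloud-controller-0 <image and options>
-------------------------------------------------------------------------------------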

The unit file shouldn't be hand edited; it is managed by ceph-ansible. ceph-ansible has been updated in how it manages the unit file to avoid this problem, and the bug was fixed in bz 1858865. That bug also documents that this problem can result in the cinder-volume service being down.

Ensure you have the errata from bug 1858865 (ceph-ansible-4.0.25.1-1.el8cp) on your UNDERCLOUD and then run a stack update. This will result in ceph-ansible configuring your unit files so that you don't have the problem.
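Roughly, the steps look like this (the deploy command is a placeholder; reuse your original overcloud deploy command and environment files):

-------------------------------------------------------------------------------------
# on the undercloud: confirm the fixed ceph-ansible build is installed
rpm -q ceph-ansible        # expect ceph-ansible-4.0.25.1-1.el8cp or newer
# re-run your original deploy command to trigger a stack update so that
# ceph-ansible rewrites the ceph-mon unit files
openstack overcloud deploy --templates -e <your original environment files>
-------------------------------------------------------------------------------------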

*** This bug has been marked as a duplicate of bug 1858865 ***