Bug 1850479

Summary: Docker container stuck in restarting state after stack update
Product: Red Hat OpenStack
Reporter: vivek koul <vkoul>
Component: openstack-tripleo-heat-templates
Assignee: Alex Schultz <aschultz>
Status: CLOSED ERRATA
QA Contact: David Rosenfeld <drosenfe>
Severity: high
Docs Contact:
Priority: high
Version: 13.0 (Queens)
CC: aschultz, bshephar, emacchi, mburns, ykulkarn
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-8.4.1-64.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-28 18:23:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description vivek koul 2020-06-24 11:45:28 UTC
Description of problem:
After a stack update, the following containers got stuck in the restarting state:

~~~
swift_proxy
heat_api_cfn
swift_container_auditor
swift_object_expirer
swift_object_updater
swift_container_replicator
swift_account_auditor
swift_account_server
swift_object_replicator
swift_container_server
swift_rsync
swift_account_reaper
swift_account_replicator
swift_object_auditor
swift_object_server
swift_container_updater
~~~

Here are the docker logs:

~~~
INFO:__main__:Copying /var/lib/kolla/config_files/src/etc/mtab to /etc/mtab
INFO:__main__:Deleting /etc/hostname
ERROR:__main__:Unexpected error:
Traceback (most recent call last):
  File "/usr/local/bin/kolla_set_configs", line 411, in main
    execute_config_strategy(config)
  File "/usr/local/bin/kolla_set_configs", line 377, in execute_config_strategy
    copy_config(config)
  File "/usr/local/bin/kolla_set_configs", line 306, in copy_config
    config_file.copy()
  File "/usr/local/bin/kolla_set_configs", line 150, in copy
    self._merge_directories(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 97, in _merge_directories
    os.path.join(dest, to_copy))
  File "/usr/local/bin/kolla_set_configs", line 99, in _merge_directories
    self._copy_file(source, dest)
  File "/usr/local/bin/kolla_set_configs", line 75, in _copy_file
    self._delete_path(dest)
  File "/usr/local/bin/kolla_set_configs", line 108, in _delete_path
    os.remove(path)
OSError: [Errno 16] Device or resource busy: '/etc/hostname'
~~~
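
For reference, Errno 16 in the traceback above is EBUSY. The following is a minimal sketch, not part of kolla_set_configs, that pulls the errno name and the affected path out of such a log line. The helper name parse_oserror_line is hypothetical.

```python
import errno
import re

# Matches lines such as:
#   OSError: [Errno 16] Device or resource busy: '/etc/hostname'
_OSERROR_RE = re.compile(r"OSError: \[Errno (\d+)\] (.+?): '([^']+)'")


def parse_oserror_line(line):
    """Return (errno_name, message, path) for an OSError log line, or None."""
    match = _OSERROR_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, match.group(2), match.group(3)


if __name__ == "__main__":
    sample = "OSError: [Errno 16] Device or resource busy: '/etc/hostname'"
    print(parse_oserror_line(sample))
```

Grouping failures by path this way makes it easy to see whether every restarting container is failing on the same file.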

Comment 3 Michele Baldessari 2020-06-24 12:36:51 UTC
*** Bug 1850475 has been marked as a duplicate of this bug. ***

Comment 5 Alex Schultz 2020-06-24 16:24:55 UTC
*** Bug 1850472 has been marked as a duplicate of this bug. ***

Comment 6 Alex Schultz 2020-06-24 16:33:01 UTC
This is very likely a docker bug or overlayfs issue. You shouldn't have any <containername>-<randomchars> containers. If you get those, it points to the containers not being able to be removed correctly. Please try manually removing the original containers and the incorrectly named ones and rerunning the paunch items. Additionally, is there a reason the customer is still on z7? There might be a fix in a newer version of docker.
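
One way to spot the incorrectly named containers is to compare the container names on the node against the expected names. This is a rough sketch only: the function name is hypothetical, and the suffix pattern (lowercase letters or digits after a hyphen) is an assumption, since the exact format of the random characters is not stated here.

```python
import re


def find_suffixed_containers(names, expected):
    """Group names that look like '<expected>-<suffix>' under their base name.

    The suffix pattern (one or more lowercase letters or digits) is an
    assumption made for illustration only.
    """
    suffixed = {}
    for base in expected:
        pattern = re.compile(r"^" + re.escape(base) + r"-[a-z0-9]+$")
        hits = [name for name in names if pattern.match(name)]
        if hits:
            suffixed[base] = hits
    return suffixed


if __name__ == "__main__":
    # Illustrative names only; the suffixed name below is made up.
    present = ["heat_api_cfn", "heat_api_cfn-x1y2z3ab", "swift_proxy"]
    print(find_suffixed_containers(present, ["heat_api_cfn", "swift_proxy"]))
```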

Comment 7 Brendan Shephard 2020-06-25 05:20:16 UTC
(In reply to Alex Schultz from comment #6)
> This is very likely a docker bug or overlayfs issue. You shouldn't have any
> <containername>-<randomchars> containers. If you get those, it points to
> the containers not being able to be removed correctly. Please try manually
> removing the original containers and the incorrectly named ones and
> rerunning the paunch items. Additionally, is there a reason the customer is
> still on z7? There might be a fix in a newer version of docker.

Hey Alex,

For this one, I asked them to try removing heat_api_cfn and recreating it using paunch:

paunch apply --debug --file /var/lib/tripleo-config/hashed-docker-container-startup-config-step_4.json --config-id=tripleo-step_4 --managed-by=tripleo-Controller

That seems to have created the containername-randomchars containers. I'm not 100% sure why it did that. I guess the original containers might have been missing a label somewhere? (Or possibly it's just an issue with old z7 paunch / kolla configurations?)


Main issue is the one with "OSError: [Errno 16] Device or resource busy: '/etc/hostname'".   It does seem very similar to that issue we had with /etc/hosts that was related to NTP. In that case, it was possible to simply "touch" /etc/hosts to resolve the issue. But in those cases, we were mounting /etc/hosts inside the container, which isn't happening with /etc/hostname.

But we did give that a shot as well:

find /var/lib/docker -name hostname -type f -exec touch {} +


This didn't help either. The container seems to still be having that same issue. 


As far as I know, /etc/hostname will be referring to a file on the overlayfs, which should be recreated when they delete the heat_api_cfn container and re-launch it with paunch. I'm not sure if it's happening on all three Controllers or just this one. Maybe we should just try stopping the container, deleting the image, and then having paunch recreate it again:

[root@overcloud-controller-0 ~]# docker inspect heat_api_cfn -f '{{.Config.Image}}'
192.168.24.1:8787/rhosp13/openstack-heat-api-cfn:13.0-111
[root@overcloud-controller-0 ~]# docker stop heat_api_cfn && docker rm heat_api_cfn
[root@overcloud-controller-0 ~]# docker rmi 192.168.24.1:8787/rhosp13/openstack-heat-api-cfn:13.0-111
[root@overcloud-controller-0 ~]# paunch apply --file /var/lib/tripleo-config/hashed-docker-container-startup-config-step_4.json --config-id=tripleo-step_4 --managed-by=tripleo-Controller
[root@overcloud-controller-0 ~]# docker ps | grep heat_api_cfn
52d66ebdbf11        192.168.24.1:8787/rhosp13/openstack-heat-api-cfn:13.0-111                "dumb-init --singl..."   About a minute ago   Up About a minute (healthy)                             heat_api_cfn

If the other controllers aren't seeing the issue, then maybe completely removing it and re-downloading it from Satellite might help?
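
The stop / remove / re-pull sequence above could be scripted roughly as follows. This is only a sketch: the function name is hypothetical, it builds the same commands shown above without running them, and the image reference must come from `docker inspect` on the affected node.

```python
import shlex


def rebuild_commands(container, image, startup_config, config_id, managed_by):
    """Return the command sequence used above to recreate one container."""
    return [
        ["docker", "stop", container],
        ["docker", "rm", container],
        ["docker", "rmi", image],
        [
            "paunch", "apply",
            "--file", startup_config,
            "--config-id=" + config_id,
            "--managed-by=" + managed_by,
        ],
    ]


if __name__ == "__main__":
    commands = rebuild_commands(
        container="heat_api_cfn",
        image="192.168.24.1:8787/rhosp13/openstack-heat-api-cfn:13.0-111",
        startup_config="/var/lib/tripleo-config/"
                       "hashed-docker-container-startup-config-step_4.json",
        config_id="tripleo-step_4",
        managed_by="tripleo-Controller",
    )
    # Print for review rather than executing anything.
    for command in commands:
        print(" ".join(shlex.quote(part) for part in command))
```

Printing the commands instead of running them keeps the sketch safe to try on any machine. Each list can be handed to subprocess.run on the controller once reviewed.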

Comment 8 Emilien Macchi 2020-06-25 15:48:56 UTC
(In reply to Brendan Shephard from comment #7)

> That seems to have created the containername-randomchars containers. I'm not
> 100% sure why it did that. I guess the original containers might have been
> missing a label somewhere? (Or possibly it's just an issue with old z7
> paunch / kolla configurations?)

I think it was a bug in Paunch that we fixed in recent zstreams:
https://bugzilla.redhat.com/show_bug.cgi?id=1835828
https://bugzilla.redhat.com/show_bug.cgi?id=1813642
and also: https://bugzilla.redhat.com/show_bug.cgi?id=1790792

> 
> 
> Main issue is the one with "OSError: [Errno 16] Device or resource busy:
> '/etc/hostname'".   It does seem very similar to that issue we had with
> /etc/hosts that was related to NTP. In that case, it was possible to simply
> "touch" /etc/hosts to resolve the issue. But in those cases, we were
> mounting /etc/hosts inside the container, which isn't happening with
> /etc/hostname.
> 
> But we did give that a shot as well:
> 
> find /var/lib/docker -name hostname -type f -exec touch {} +
> 
> 
> This didn't help either. The container seems to still be having that same
> issue. 
> 
> 
> As far as I know, /etc/hostname will be referring to a file on the
> overlayfs, which should be recreated when they delete the heat_api_cfn
> container and re-launch it with paunch. I'm not sure if it's happening on
> all three Controllers or just this one. Maybe we should just try stopping
> the container, deleting the image and then having paunch recreate it again

Yes, it sounds like a plan.

> [root@overcloud-controller-0 ~]# docker inspect heat_api_cfn -f
> '{{.Config.Image}}'
> 192.168.24.1:8787/rhosp13/openstack-heat-api-cfn:13.0-111
> [root@overcloud-controller-0 ~]# docker stop heat_api_cfn && docker rm
> heat_api_cfn
> [root@overcloud-controller-0 ~]# docker rmi
> 192.168.24.1:8787/rhosp13/openstack-heat-api-cfn:13.0-111
> [root@overcloud-controller-0 ~]# paunch apply --file
> /var/lib/tripleo-config/hashed-docker-container-startup-config-step_4.json
> --config-id=tripleo-step_4 --managed-by=tripleo-Controller
> [root@overcloud-controller-0 ~]# docker ps | grep heat_api_cfn
> 52d66ebdbf11       
> 192.168.24.1:8787/rhosp13/openstack-heat-api-cfn:13.0-111               
> "dumb-init --singl..."   About a minute ago   Up About a minute (healthy)   
> heat_api_cfn
> 
> If the other controllers aren't seeing the issue, then maybe completely
> removing it and re-downloading it from Satellite might help?

Yes, I would do that.

Comment 11 Alex Schultz 2020-06-29 16:35:56 UTC
So I believe this is a limitation in docker. Per https://github.com/moby/moby/issues/9295#issuecomment-140350434, /etc/hostname is a special file in docker and cannot be updated in this way. We already exclude /etc/hosts, but it seems that we also need to exclude /etc/hostname when we generate the configuration files for the containers. The workaround is to remove the hostname file in the puppet-generated folder for this container. Once you do that, you should be able to start the container. I'll look into patching this out. It's a THT patch similar to what we did to exclude /etc/hosts: https://review.opendev.org/#/c/706532/
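
The idea of the exclusion, in rough form, is to drop both special files from the set of generated files before they are handed to the container. This is an illustration of the idea only, not the THT patch itself; the function and constant names are hypothetical.

```python
# Files that docker manages itself and that cannot be replaced from inside
# a running container, per the discussion above.
EXCLUDED_PATHS = frozenset({"/etc/hosts", "/etc/hostname"})


def filter_generated_files(paths, excluded=EXCLUDED_PATHS):
    """Return the paths that are safe to copy into the container."""
    kept = []
    for path in paths:
        # Compare as absolute paths so 'etc/hostname' and '/etc/hostname'
        # are treated the same way.
        normalized = "/" + path.lstrip("/")
        if normalized not in excluded:
            kept.append(path)
    return kept


if __name__ == "__main__":
    generated = ["/etc/hosts", "/etc/hostname", "/etc/heat/heat.conf"]
    print(filter_generated_files(generated))
```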

Comment 13 Alex Schultz 2020-06-29 18:26:22 UTC
Confirmed you cannot replace /etc/hostname when running containers under docker.

[cloud-user@aschultz-docker-test rhbz1850479]$ cat Dockerfile 
FROM centos:8
COPY run.sh /
COPY hostname.new /
CMD bash /run.sh
[cloud-user@aschultz-docker-test rhbz1850479]$ cat run.sh
#!/bin/bash
set -ex
rm -f /etc/hostname
cp /hostname.new /etc/hostname
[cloud-user@aschultz-docker-test rhbz1850479]$ sudo docker logs testing
+ rm -f /etc/hostname
rm: cannot remove '/etc/hostname': Device or resource busy


The workaround is to remove /var/lib/config-data/puppet-generated/*/etc/hostname
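
A minimal sketch of that workaround, assuming the default path above. The function name is hypothetical, and it defaults to a dry run so the matches can be reviewed before anything is deleted.

```python
import glob
import os


def remove_generated_hostname_files(
        root="/var/lib/config-data/puppet-generated", dry_run=True):
    """Find <root>/*/etc/hostname and remove the matches unless dry_run."""
    pattern = os.path.join(root, "*", "etc", "hostname")
    matches = sorted(glob.glob(pattern))
    if not dry_run:
        for path in matches:
            os.remove(path)
    return matches


if __name__ == "__main__":
    # Dry run: only list what would be removed.
    for path in remove_generated_hostname_files():
        print(path)
```

After removing the files, restart the affected containers so that kolla_set_configs no longer tries to replace /etc/hostname.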

Comment 27 David Rosenfeld 2020-09-28 12:53:24 UTC
Deployed the overcloud and performed a stack update. On the controller nodes, checked the status of the containers from the description. They were all up, e.g.:

sudo docker ps | grep swift_object_replicator
80686d440f1d        192.168.24.1:8787/rh-osbs/rhosp13-openstack-swift-object:20200916.1                "dumb-init --singl..."   13 hours ago        Up 13 hours                                 swift_object_replicator

sudo docker ps | grep swift_container_server
48d377f2da84        192.168.24.1:8787/rh-osbs/rhosp13-openstack-swift-container:20200916.1             "dumb-init --singl..."   13 hours ago        Up 13 hours (healthy)                       swift_container_server

sudo docker ps | grep swift_rsync
5ac29efe40e1        192.168.24.1:8787/rh-osbs/rhosp13-openstack-swift-object:20200916.1                "dumb-init --singl..."   13 hours ago        Up 13 hours                                 swift_rsync
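
The same check can be run across all containers from the description at once. This sketch assumes the input comes from `docker ps -a --format '{{.Names}} {{.Status}}'` (one "name status" pair per line). The function name is hypothetical and the sample statuses are illustrative, not taken from this environment.

```python
def containers_not_up(ps_output, expected):
    """Return expected container names whose status does not start with 'Up'.

    Containers missing from the output are reported as well.
    """
    statuses = {}
    for line in ps_output.splitlines():
        parts = line.split(None, 1)
        if not parts:
            continue  # skip blank lines
        name = parts[0]
        status = parts[1] if len(parts) > 1 else ""
        statuses[name] = status
    return sorted(
        name for name in expected
        if not statuses.get(name, "").startswith("Up")
    )


if __name__ == "__main__":
    sample = (
        "swift_rsync Up 13 hours\n"
        "swift_container_server Up 13 hours (healthy)\n"
        "swift_proxy Restarting (1) 5 seconds ago\n"
    )
    print(containers_not_up(
        sample, ["swift_rsync", "swift_container_server", "swift_proxy"]))
```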

Comment 34 errata-xmlrpc 2020-10-28 18:23:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4388