Bug 1850479
Summary: | Docker container stuck in restarting state after stack update | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | vivek koul <vkoul>
Component: | openstack-tripleo-heat-templates | Assignee: | Alex Schultz <aschultz>
Status: | CLOSED ERRATA | QA Contact: | David Rosenfeld <drosenfe>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 13.0 (Queens) | CC: | aschultz, bshephar, emacchi, mburns, ykulkarn
Target Milestone: | --- | Keywords: | Triaged, ZStream
Target Release: | --- | |
Hardware: | Unspecified | OS: | Unspecified
Fixed In Version: | openstack-tripleo-heat-templates-8.4.1-64.el7ost | Doc Type: | If docs needed, set a value
Last Closed: | 2020-10-28 18:23:50 UTC | Type: | Bug
Description

vivek koul, 2020-06-24 11:45:28 UTC
*** Bug 1850475 has been marked as a duplicate of this bug. ***

*** Bug 1850472 has been marked as a duplicate of this bug. ***

Alex Schultz (comment #6):

This is very likely a docker bug or an overlayfs issue. You shouldn't have any `<containername>-<randomchars>` containers; if you see them, it points to the containers not being removable correctly. Please try manually removing both the original containers and the incorrectly named ones, then rerun the paunch items. Additionally, is there a reason the customer is still on z7? There might be a fix in a newer version of docker.

Brendan Shephard (comment #7):

(In reply to Alex Schultz from comment #6)

Hey Alex,

For this one, I asked them to try removing heat_api_cfn and recreating it using paunch:

```
paunch apply --debug --file /var/lib/tripleo-config/hashed-docker-container-startup-config-step_4.json --config-id=tripleo-step_4 --managed-by=tripleo-Controller
```

That seems to have created the `containername-randomchars` containers. I'm not 100% sure why it did that; I guess the original containers might have been missing a label somewhere? (Or possibly it's just an issue with the old z7 paunch / kolla configurations?)

The main issue is the one with "OSError: [Errno 16] Device or resource busy: '/etc/hostname'". It seems very similar to the issue we had with /etc/hosts that was related to NTP. In that case, it was possible to simply "touch" /etc/hosts to resolve the issue. But in those cases we were mounting /etc/hosts inside the container, which isn't happening with /etc/hostname.
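A quick way to spot the leftover `<containername>-<randomchars>` containers mentioned above is to filter the names reported by `docker ps --format '{{.Names}}'`. This is a hypothetical sketch, not from the bug report: the exact suffix alphabet used by the broken rename is an assumption, and the sample names are made up.

```python
import re

def find_leftovers(names, expected):
    """Return names that look like <expected>-<randomchars> duplicates.

    The lowercase-alphanumeric suffix pattern is an assumption based on
    the container names observed in this bug.
    """
    pattern = re.compile(r"^%s-[a-z0-9]+$" % re.escape(expected))
    return [n for n in names if pattern.match(n)]

# Hypothetical output of `docker ps --format '{{.Names}}'`:
names = ["heat_api_cfn", "heat_api_cfn-x8f2k1", "swift_rsync"]
print(find_leftovers(names, "heat_api_cfn"))  # → ['heat_api_cfn-x8f2k1']
```

Any names this flags would be candidates for manual `docker rm` before rerunning the paunch items.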
But we did give that a shot as well:

```
find /var/lib/docker -name hostname -type f -exec touch {} +
```

This didn't help either; the container still has the same issue.

As far as I know, /etc/hostname will be referring to a file on the overlayfs, which should be recreated when they delete the heat_api_cfn container and re-launch it with paunch. I'm not sure if it's happening on all three Controllers or just this one. Maybe we should just try stopping the container, deleting the image, and then having paunch recreate it again:

```
[root@overcloud-controller-0 ~]# docker inspect heat_api_cfn -f '{{.Config.Image}}'
192.168.24.1:8787/rhosp13/openstack-heat-api-cfn:13.0-111
[root@overcloud-controller-0 ~]# docker stop heat_api_cfn && docker rm heat_api_cfn
[root@overcloud-controller-0 ~]# docker rmi 192.168.24.1:8787/rhosp13/openstack-heat-api-cfn:13.0-111
[root@overcloud-controller-0 ~]# paunch apply --file /var/lib/tripleo-config/hashed-docker-container-startup-config-step_4.json --config-id=tripleo-step_4 --managed-by=tripleo-Controller
[root@overcloud-controller-0 ~]# docker ps | grep heat_api_cfn
52d66ebdbf11  192.168.24.1:8787/rhosp13/openstack-heat-api-cfn:13.0-111  "dumb-init --singl..."  About a minute ago  Up About a minute (healthy)  heat_api_cfn
```

If the other controllers aren't seeing the issue, then maybe completely removing it and re-downloading it from Satellite might help?

(In reply to Brendan Shephard from comment #7)
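The `find … -exec touch {} +` attempt above can be expressed as a small sketch, with the docker root as a parameter so the logic can be exercised against any directory tree (the real /var/lib/docker path needs root access). This mirrors the command from the comment; it is not code from the bug report.

```python
import os

def touch_hostname_files(root):
    """Touch every file named 'hostname' under root, like
    `find <root> -name hostname -type f -exec touch {} +`."""
    touched = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name == "hostname":
                path = os.path.join(dirpath, name)
                os.utime(path, None)  # update mtime, same as `touch` on an existing file
                touched.append(path)
    return touched
```

As the comment notes, this did not resolve the EBUSY error here, which is consistent with /etc/hostname being managed by the docker daemon rather than being an ordinary overlayfs file.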
I think it was a bug in Paunch that we fixed in recent zstreams:

https://bugzilla.redhat.com/show_bug.cgi?id=1835828
https://bugzilla.redhat.com/show_bug.cgi?id=1813642

and also:

https://bugzilla.redhat.com/show_bug.cgi?id=1790792

> Maybe we should just try stopping the container, deleting the image and then having paunch recreate it again

Yes, it sounds like a plan.
> If the other controllers aren't seeing the issue, then maybe completely removing it and re-downloading it from Satellite might help?

Yes, I would do that.

Alex Schultz:

So I believe this is a limitation in docker. Per https://github.com/moby/moby/issues/9295#issuecomment-140350434, it seems that /etc/hostname is a special file in docker and cannot be updated in this way. We already exclude /etc/hosts, but it seems that we also need to exclude /etc/hostname when we generate the configuration file for the containers. The workaround would be to remove the hostname file in the puppet-generated folder for this container. Once you do that, you should be able to start the container. I'll look into patching this out. It's a THT patch similar to what we did to exclude /etc/hosts: https://review.opendev.org/#/c/706532/

Confirmed you cannot replace /etc/hostname when running containers under docker:

```
[cloud-user@aschultz-docker-test rhbz1850479]$ cat Dockerfile
FROM centos:8
COPY run.sh /
COPY hostname.new /
CMD bash /run.sh
[cloud-user@aschultz-docker-test rhbz1850479]$ cat run.sh
#!/bin/bash
set -ex
rm -f /etc/hostname
cp /hostname.new /etc/hostname
[cloud-user@aschultz-docker-test rhbz1850479]$ sudo docker logs testing
+ rm -f /etc/hostname
rm: cannot remove '/etc/hostname': Device or resource busy
```

The workaround is to remove /var/lib/config-data/puppet-generated/*/etc/hostname

Verification: Deployed overcloud and performed a stack update. On controller nodes, checked the status of the containers from the description. They were all up, e.g.:

```
sudo docker ps | grep swift_object_replicator
80686d440f1d  192.168.24.1:8787/rh-osbs/rhosp13-openstack-swift-object:20200916.1  "dumb-init --singl..."  13 hours ago  Up 13 hours  swift_object_replicator
sudo docker ps | grep swift_container_server
48d377f2da84  192.168.24.1:8787/rh-osbs/rhosp13-openstack-swift-container:20200916.1  "dumb-init --singl..."  13 hours ago  Up 13 hours (healthy)  swift_container_server
```
```
sudo docker ps | grep swift_rsync
5ac29efe40e1  192.168.24.1:8787/rh-osbs/rhosp13-openstack-swift-object:20200916.1  "dumb-init --singl..."  13 hours ago  Up 13 hours  swift_rsync
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat OpenStack Platform 13.0 director bug fix advisory), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4388
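The workaround applied above (removing /var/lib/config-data/puppet-generated/*/etc/hostname) can be sketched as follows. This is illustrative only, not from the bug report: the base path is a parameter so the logic can be tested outside a real controller, where the default would need root access.

```python
import glob
import os

def remove_generated_hostname(base="/var/lib/config-data/puppet-generated"):
    """Remove each <base>/*/etc/hostname so the config-copy step no longer
    tries to overwrite the docker-managed /etc/hostname in the container."""
    removed = []
    for path in glob.glob(os.path.join(base, "*", "etc", "hostname")):
        os.remove(path)
        removed.append(path)
    return removed
```

After removing the file, the affected container (heat_api_cfn here) could be started again; the permanent fix landed in tripleo-heat-templates per the "Fixed In Version" above.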