Bug 1652406
| Summary: | Director deployed OCP 3.11: docker service gets restarted during scale outs or stack updates causing an outage | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Marius Cornea <mcornea> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Martin André <m.andre> |
| Status: | CLOSED ERRATA | QA Contact: | Marius Cornea <mcornea> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 14.0 (Rocky) | CC: | athomas, dbecker, gchamoul, m.andre, mburns, mfedosin, morazi, sclewis |
| Target Milestone: | rc | Keywords: | Triaged |
| Target Release: | 14.0 (Rocky) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | ansible-role-container-registry-1.0.1-0.20181003162447.ddf8d09.el7ost openstack-tripleo-heat-templates-9.0.1-0.20181013060904.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2019-01-11 11:54:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |

Description
Marius Cornea
2018-11-22 01:42:43 UTC
Created attachment 1508152: logs.tar.gz

I monitored the pod status and app availability during the scale out and am attaching the results. Here are my observations so far:

pod status (oc_pods.log): at ~2018-11-22 21:06:19 haproxy becomes unreachable, so the oc command doesn't provide any status until 2018-11-22 21:06:44, when we start seeing pods getting into Error or CrashLoopBackOff state.

app availability (http_response.log): at 2018-11-22 21:07:23 the server starts responding with 503, and it recovers at 2018-11-22 21:09:34.

/var/lib/mistral/openshift/ansible.log: at 2018-11-22 21:06:18 docker gets restarted on the master nodes.

At first glance it appears this is because /etc/sysconfig/docker-network changed: https://github.com/openstack/ansible-role-container-registry/blob/master/tasks/docker.yml#L100-L107

This is how tripleo sets the docker-network file:

[root@openshift-master-0 heat-admin]# cat /etc/sysconfig/docker-network
# /etc/sysconfig/docker-network
DOCKER_NETWORK_OPTIONS=' --bip=172.31.0.1/24'

This is how the file looks after the initial deployment:

# /etc/sysconfig/docker-network
DOCKER_NETWORK_OPTIONS=' --mtu=1450'

So to summarize, I believe the issue here is that tripleo restarts docker on an already existing deployment when it shouldn't. Is there any way we could avoid this?

Changing the title to reflect the new findings and requesting back the blocker flag.

I believe this has to do with the prerequisites.yml playbook being included when it should not. I'll submit a patch soon.

Created Launchpad issue https://bugs.launchpad.net/tripleo/+bug/1804790

The patch at https://review.openstack.org/#/c/619713/ should fix the generated openshift-ansible playbook for updates.

@Marius, sorry I didn't read your comment until the end. I think your analysis is correct and we should find a way to prevent tripleo from restarting docker. That being said, re-running the prerequisites playbook on the existing nodes was also a problem, so my patch is still valid :)

One more difference is in /etc/sysconfig/docker:
after tripleo configuration:
# /etc/sysconfig/docker
# Modify these options if you want to change the way the docker daemon runs
OPTIONS='-H unix:///run/docker.sock -H unix:///var/lib/openstack/docker.sock --log-driver=journald --signature-verification=false --iptables=false --live-restore'
if [ -z "${DOCKER_CERT_PATH}" ]; then
DOCKER_CERT_PATH=/etc/docker
fi
# Do not add registries in this file anymore. Use /etc/containers/registries.conf
# instead. For more information reference the registries.conf(5) man page.
# Location used for temporary files, such as those created by
# docker load and build operations. Default is /var/lib/docker/tmp
# Can be overriden by setting the following environment variable.
# DOCKER_TMPDIR=/var/tmp
# Controls the /etc/cron.daily/docker-logrotate cron job status.
# To disable, uncomment the line below.
# LOGROTATE=false
# docker-latest daemon can be used by starting the docker-latest unitfile.
# To use docker-latest client, uncomment below lines
#DOCKERBINARY=/usr/bin/docker-latest
#DOCKERDBINARY=/usr/bin/dockerd-latest
#DOCKER_CONTAINERD_BINARY=/usr/bin/docker-containerd-latest
#DOCKER_CONTAINERD_SHIM_BINARY=/usr/bin/docker-containerd-shim-latest
INSECURE_REGISTRY='--insecure-registry 192.168.24.1:8787'
==========================================================================
after openshift-ansible configuration:
# /etc/sysconfig/docker
# Modify these options if you want to change the way the docker daemon runs
OPTIONS='-H unix:///run/docker.sock -H unix:///var/lib/openstack/docker.sock --log-driver=journald --signature-verification=false --iptables=false --live-restore'
if [ -z "${DOCKER_CERT_PATH}" ]; then
DOCKER_CERT_PATH=/etc/docker
fi
# Do not add registries in this file anymore. Use /etc/containers/registries.conf
# instead. For more information reference the registries.conf(5) man page.
# Location used for temporary files, such as those created by
# docker load and build operations. Default is /var/lib/docker/tmp
# Can be overriden by setting the following environment variable.
# DOCKER_TMPDIR=/var/tmp
# Controls the /etc/cron.daily/docker-logrotate cron job status.
# To disable, uncomment the line below.
# LOGROTATE=false
# docker-latest daemon can be used by starting the docker-latest unitfile.
# To use docker-latest client, uncomment below lines
#DOCKERBINARY=/usr/bin/docker-latest
#DOCKERDBINARY=/usr/bin/dockerd-latest
#DOCKER_CONTAINERD_BINARY=/usr/bin/docker-containerd-latest
#DOCKER_CONTAINERD_SHIM_BINARY=/usr/bin/docker-containerd-shim-latest
INSECURE_REGISTRY='--insecure-registry 192.168.24.1:8787'
ADD_REGISTRY='--add-registry registry.redhat.io'
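
The differences between the tripleo-written and the deployed files can be hard to spot by eye in dumps like the ones above. The following is a minimal Python sketch, not part of either tool, that compares the simple NAME='value' assignments of two sysconfig files. The helper names parse_sysconfig and changed_settings are hypothetical, the docker-network values are copied from this report, and the /etc/sysconfig/docker strings are excerpts reduced to the registry lines. It assumes every setting fits on one line and ignores shell logic such as the if block.

```python
import shlex


def parse_sysconfig(text):
    """Return {NAME: value} for simple one-line NAME=value assignments.

    Comments, blank lines and shell control lines (if/fi) are skipped.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if not name.isidentifier():
            continue
        parts = shlex.split(value)  # removes the surrounding shell quotes
        settings[name] = parts[0].strip() if parts else ""
    return settings


def changed_settings(before, after):
    """Map each differing variable to its (before, after) pair; None means absent."""
    a, b = parse_sysconfig(before), parse_sysconfig(after)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }


# Values copied from this report.
tripleo_network = "# /etc/sysconfig/docker-network\nDOCKER_NETWORK_OPTIONS=' --bip=172.31.0.1/24'\n"
deployed_network = "# /etc/sysconfig/docker-network\nDOCKER_NETWORK_OPTIONS=' --mtu=1450'\n"

# Excerpts of /etc/sysconfig/docker (registry lines only).
tripleo_docker = "INSECURE_REGISTRY='--insecure-registry 192.168.24.1:8787'\n"
ansible_docker = (
    "INSECURE_REGISTRY='--insecure-registry 192.168.24.1:8787'\n"
    "ADD_REGISTRY='--add-registry registry.redhat.io'\n"
)

if __name__ == "__main__":
    print(changed_settings(tripleo_network, deployed_network))
    print(changed_settings(tripleo_docker, ansible_docker))
```

Any variable reported here is one that the two tools disagree on, which is the kind of change that would make a configuration task rewrite the file and notify a docker restart.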
Proposed fix on review https://review.openstack.org/#/c/620621/

The inappropriate openshift-ansible docker restarts should be fixed with https://review.openstack.org/#/c/619713/ while the tripleo ones should be fixed with https://review.openstack.org/#/c/621241/ and https://review.openstack.org/#/c/620621/.

No doc text required.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045