Bug 1652406

Summary: Director deployed OCP 3.11: docker service gets restarted during scale outs or stack updates causing an outage
Product: Red Hat OpenStack
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Target Release: 14.0 (Rocky)
Target Milestone: rc
Status: CLOSED ERRATA
Reporter: Marius Cornea <mcornea>
Assignee: Martin André <m.andre>
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: urgent
Keywords: Triaged
CC: athomas, dbecker, gchamoul, m.andre, mburns, mfedosin, morazi, sclewis
Hardware: Unspecified
OS: Unspecified
Fixed In Version: ansible-role-container-registry-1.0.1-0.20181003162447.ddf8d09.el7ost, openstack-tripleo-heat-templates-9.0.1-0.20181013060904.el7ost
Last Closed: 2019-01-11 11:54:47 UTC
Type: Bug
Attachments:
logs.tar.gz

Description Marius Cornea 2018-11-22 01:42:43 UTC
Description of problem:
Director deployed OCP 3.11: openshift-monitoring pods end up in CrashLoopBackOff after scale out:

[root@openshift-master-0 heat-admin]# oc get pods --all-namespaces | grep -v Running | grep -v Complete
NAMESPACE               NAME                                           READY     STATUS             RESTARTS   AGE
openshift-monitoring    prometheus-operator-5677fb6f87-xzdw5           0/1       CrashLoopBackOff   17         1h


Checking the infra node where the pod was running, we can see:

[root@openshift-infra-0 heat-admin]# docker logs -f k8s_prometheus-operator_prometheus-operator-5677fb6f87-xzdw5_openshift-monitoring_cfed5b0c-ede6-11e8-8571-525400112488_19
ts=2018-11-22T01:34:30.683149725Z caller=main.go:130 msg="Starting Prometheus Operator version '0.23.1'."
ts=2018-11-22T01:34:30.687595956Z caller=main.go:193 msg="Unhandled error received. Exiting..." err="communicating with server failed: Get https://172.30.0.1:443/version?timeout=32s: dial tcp 172.30.0.1:443: connect: network is unreachable"

Checking openvswitch logs:

[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovsdb-server.log 
2018-11-21T22:57:24.935Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2018-11-21T22:57:24.946Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.10.0
2018-11-21T22:57:34.961Z|00003|memory|INFO|4248 kB peak resident set size after 10.0 seconds
2018-11-21T22:57:34.961Z|00004|memory|INFO|cells:38 json-caches:1 monitors:2 sessions:1
2018-11-21T23:43:13.575Z|00005|jsonrpc|WARN|unix#78: receive error: Connection reset by peer
2018-11-21T23:43:13.575Z|00006|reconnect|WARN|unix#78: connection dropped (Connection reset by peer)
2018-11-21T23:43:39.723Z|00007|jsonrpc|WARN|unix#87: receive error: Connection reset by peer
2018-11-21T23:43:39.724Z|00008|reconnect|WARN|unix#87: connection dropped (Connection reset by peer)
2018-11-21T23:44:05.943Z|00009|jsonrpc|WARN|unix#94: receive error: Connection reset by peer
2018-11-21T23:44:05.943Z|00010|reconnect|WARN|unix#94: connection dropped (Connection reset by peer)
[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovs-vswitchd.log 
2018-11-22T00:21:52.727Z|00181|connmgr|INFO|br0<->unix#362: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:22:46.366Z|00182|connmgr|INFO|br0<->unix#368: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:40:39.588Z|00183|connmgr|INFO|br0<->unix#449: 3 flow_mods in the last 0 s (3 adds)
2018-11-22T00:40:39.595Z|00184|connmgr|INFO|br0<->unix#451: 1 flow_mods in the last 0 s (1 adds)
2018-11-22T01:01:12.115Z|00185|bridge|INFO|bridge br0: added interface vethe6d048e0 on port 14
2018-11-22T01:01:12.127Z|00186|connmgr|INFO|br0<->unix#547: 4 flow_mods in the last 0 s (4 adds)
2018-11-22T01:01:12.150Z|00187|connmgr|INFO|br0<->unix#549: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.027Z|00188|connmgr|INFO|br0<->unix#551: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.051Z|00189|connmgr|INFO|br0<->unix#553: 4 flow_mods in the last 0 s (4 deletes)
2018-11-22T01:01:33.086Z|00190|bridge|INFO|bridge br0: deleted interface vethe6d048e0 on port 14

After running 'systemctl restart openvswitch' on the infra node the pod was able to start successfully.


Version-Release number of selected component (if applicable):
2018-11-21.2 puddle

How reproducible:
Not always.

Steps to Reproduce:
1. Deploy OCP with 3 master + 2 infra + 2 worker nodes
2. Add one master node

Actual results:
The scale-out operation completes fine, but some infra pods end up in CrashLoopBackOff state.

Expected results:
All pods remain in Running state.

Additional info:

Comment 3 Marius Cornea 2018-11-23 02:51:21 UTC
Created attachment 1508152 [details]
logs.tar.gz

I monitored the pod status and app availability during the scale-out. Attaching the results.

Here are my observations so far:

pod status (oc_pods.log): at ~2018-11-22 21:06:19 haproxy becomes unreachable, so the oc command provides no status until 2018-11-22 21:06:44, when we start seeing pods enter the Error or CrashLoopBackOff state.

app availability (http_response.log): at 2018-11-22 21:07:23 the server starts responding with 503; it recovers at 2018-11-22 21:09:34.

/var/lib/mistral/openshift/ansible.log: at 2018-11-22 21:06:18 docker gets restarted on the master nodes. At first glance it appears to be because /etc/sysconfig/docker-network changed:

https://github.com/openstack/ansible-role-container-registry/blob/master/tasks/docker.yml#L100-L107
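For context, the linked task list follows the standard Ansible template-plus-handler pattern: a template task rewrites /etc/sysconfig/docker-network and notifies a docker restart handler whenever the rendered content differs from what is on disk. A minimal sketch of the pattern (task and file names are illustrative, not the role's actual code):

```yaml
# tasks/docker.yml -- illustrative sketch of the restart-on-change pattern
- name: Write docker network options
  template:
    src: docker-network.j2
    dest: /etc/sysconfig/docker-network
  notify: restart docker    # handler fires whenever the rendered file changes

# handlers/main.yml
- name: restart docker
  service:
    name: docker
    state: restarted
```

Because openshift-ansible later rewrites the same file with different options (--mtu vs --bip, see below), every subsequent run of the tripleo role sees a diff, notifies the handler, and bounces docker.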

This is how tripleo sets the docker-network file:
[root@openshift-master-0 heat-admin]# cat /etc/sysconfig/docker-network 
# /etc/sysconfig/docker-network
DOCKER_NETWORK_OPTIONS=' --bip=172.31.0.1/24'

This is how the file looks after the initial deployment:
# /etc/sysconfig/docker-network
DOCKER_NETWORK_OPTIONS=' --mtu=1450'

To summarize, I believe the issue here is that tripleo restarts docker on an already existing deployment when it shouldn't. Is there any way we could avoid this?
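One conceivable way to avoid bouncing docker on existing nodes would be to gate the file change (or at least the restart) behind a flag that only applies during initial deployment. A hedged sketch of such a guard, with a purely hypothetical variable name:

```yaml
# Hypothetical guard -- the variable name is illustrative only,
# not an actual ansible-role-container-registry variable.
- name: Write docker network options
  template:
    src: docker-network.j2
    dest: /etc/sysconfig/docker-network
  notify: restart docker
  when: container_registry_initial_deploy | default(false) | bool
```

Skipping the task on update runs would leave whatever openshift-ansible wrote in place, so the file never changes and the restart handler never fires.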

Comment 4 Marius Cornea 2018-11-23 02:53:24 UTC
Changing the title to reflect the new findings and requesting back the blocker flag.

Comment 5 Martin André 2018-11-23 10:05:54 UTC
I believe this has to do with the prerequisites.yml playbook being included when it should not be. I'll submit a patch soon.

Comment 6 Martin André 2018-11-23 10:13:05 UTC
Created Launchpad issue https://bugs.launchpad.net/tripleo/+bug/1804790

Comment 7 Martin André 2018-11-23 10:28:03 UTC
The patch at https://review.openstack.org/#/c/619713/ should fix the generated openshift-ansible playbook for updates.

Comment 8 Martin André 2018-11-23 13:50:33 UTC
@Marius, sorry I didn't read your comment until the end. I think your analysis is correct, and we should find a way to prevent tripleo from restarting docker. That being said, re-running the prerequisites playbook on the existing nodes was also a problem, so my patch is still valid :)

Comment 9 Marius Cornea 2018-11-23 15:38:31 UTC
One more difference is in /etc/sysconfig/docker:

after tripleo configuration:

# /etc/sysconfig/docker

# Modify these options if you want to change the way the docker daemon runs
OPTIONS='-H unix:///run/docker.sock -H unix:///var/lib/openstack/docker.sock --log-driver=journald --signature-verification=false --iptables=false --live-restore'
if [ -z "${DOCKER_CERT_PATH}" ]; then
    DOCKER_CERT_PATH=/etc/docker
fi

# Do not add registries in this file anymore. Use /etc/containers/registries.conf
# instead. For more information reference the registries.conf(5) man page.

# Location used for temporary files, such as those created by
# docker load and build operations. Default is /var/lib/docker/tmp
# Can be overriden by setting the following environment variable.
# DOCKER_TMPDIR=/var/tmp

# Controls the /etc/cron.daily/docker-logrotate cron job status.
# To disable, uncomment the line below.
# LOGROTATE=false

# docker-latest daemon can be used by starting the docker-latest unitfile.
# To use docker-latest client, uncomment below lines
#DOCKERBINARY=/usr/bin/docker-latest
#DOCKERDBINARY=/usr/bin/dockerd-latest
#DOCKER_CONTAINERD_BINARY=/usr/bin/docker-containerd-latest
#DOCKER_CONTAINERD_SHIM_BINARY=/usr/bin/docker-containerd-shim-latest
INSECURE_REGISTRY='--insecure-registry 192.168.24.1:8787'

==========================================================================
after openshift-ansible configuration:

# /etc/sysconfig/docker

# Modify these options if you want to change the way the docker daemon runs
OPTIONS='-H unix:///run/docker.sock -H unix:///var/lib/openstack/docker.sock --log-driver=journald --signature-verification=false --iptables=false --live-restore'
if [ -z "${DOCKER_CERT_PATH}" ]; then
    DOCKER_CERT_PATH=/etc/docker
fi

# Do not add registries in this file anymore. Use /etc/containers/registries.conf
# instead. For more information reference the registries.conf(5) man page.

# Location used for temporary files, such as those created by
# docker load and build operations. Default is /var/lib/docker/tmp
# Can be overriden by setting the following environment variable.
# DOCKER_TMPDIR=/var/tmp

# Controls the /etc/cron.daily/docker-logrotate cron job status.
# To disable, uncomment the line below.
# LOGROTATE=false

# docker-latest daemon can be used by starting the docker-latest unitfile.
# To use docker-latest client, uncomment below lines
#DOCKERBINARY=/usr/bin/docker-latest
#DOCKERDBINARY=/usr/bin/dockerd-latest
#DOCKER_CONTAINERD_BINARY=/usr/bin/docker-containerd-latest
#DOCKER_CONTAINERD_SHIM_BINARY=/usr/bin/docker-containerd-shim-latest
INSECURE_REGISTRY='--insecure-registry 192.168.24.1:8787'
ADD_REGISTRY='--add-registry registry.redhat.io'

Comment 10 Mike Fedosin 2018-11-28 15:06:36 UTC
Proposed fix on review https://review.openstack.org/#/c/620621/

Comment 11 Martin André 2018-12-03 10:34:29 UTC
The inappropriate openshift-ansible docker restarts should be fixed with https://review.openstack.org/#/c/619713/, while the tripleo ones should be fixed with https://review.openstack.org/#/c/621241/ and https://review.openstack.org/#/c/620621/.

Comment 29 Martin André 2019-01-10 10:16:51 UTC
No doc text required.

Comment 30 errata-xmlrpc 2019-01-11 11:54:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045