Bug 1652406 - Director deployed OCP 3.11: docker service gets restarted during scale outs or stack updates causing an outage
Summary: Director deployed OCP 3.11: docker service gets restarted during scale outs or stack updates causing an outage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: 14.0 (Rocky)
Assignee: Martin André
QA Contact: Marius Cornea
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-11-22 01:42 UTC by Marius Cornea
Modified: 2019-01-11 11:54 UTC
CC List: 8 users

Fixed In Version: ansible-role-container-registry-1.0.1-0.20181003162447.ddf8d09.el7ost openstack-tripleo-heat-templates-9.0.1-0.20181013060904.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-11 11:54:47 UTC
Target Upstream Version:
Embargoed:


Attachments
logs.tar.gz (2.20 MB, application/x-gzip)
2018-11-23 02:51 UTC, Marius Cornea


Links
System ID Private Priority Status Summary Last Updated
Launchpad 1804790 0 None None None 2018-11-23 10:13:05 UTC
OpenStack gerrit 619713 0 'None' MERGED Rework the generated openshift-ansible playbook 2020-10-21 14:32:15 UTC
OpenStack gerrit 620621 0 'None' MERGED Allow to skip docker reconfiguration during stack update 2020-10-21 14:32:15 UTC
OpenStack gerrit 621241 0 'None' MERGED Allow to skip docker reconfiguration 2020-10-21 14:32:15 UTC
Red Hat Product Errata RHEA-2019:0045 0 None None None 2019-01-11 11:54:58 UTC

Description Marius Cornea 2018-11-22 01:42:43 UTC
Description of problem:
Director deployed OCP 3.11: openshift-monitoring pods end up in CrashLoopBackOff after scale out:

[root@openshift-master-0 heat-admin]# oc get pods --all-namespaces | grep -v Running | grep -v Complete
NAMESPACE               NAME                                           READY     STATUS             RESTARTS   AGE
openshift-monitoring    prometheus-operator-5677fb6f87-xzdw5           0/1       CrashLoopBackOff   17         1h


Checking the infra node where the pod was running, we can see:

[root@openshift-infra-0 heat-admin]# docker logs -f k8s_prometheus-operator_prometheus-operator-5677fb6f87-xzdw5_openshift-monitoring_cfed5b0c-ede6-11e8-8571-525400112488_19
ts=2018-11-22T01:34:30.683149725Z caller=main.go:130 msg="Starting Prometheus Operator version '0.23.1'."
ts=2018-11-22T01:34:30.687595956Z caller=main.go:193 msg="Unhandled error received. Exiting..." err="communicating with server failed: Get https://172.30.0.1:443/version?timeout=32s: dial tcp 172.30.0.1:443: connect: network is unreachable"

Checking openvswitch logs:

[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovsdb-server.log 
2018-11-21T22:57:24.935Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2018-11-21T22:57:24.946Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.10.0
2018-11-21T22:57:34.961Z|00003|memory|INFO|4248 kB peak resident set size after 10.0 seconds
2018-11-21T22:57:34.961Z|00004|memory|INFO|cells:38 json-caches:1 monitors:2 sessions:1
2018-11-21T23:43:13.575Z|00005|jsonrpc|WARN|unix#78: receive error: Connection reset by peer
2018-11-21T23:43:13.575Z|00006|reconnect|WARN|unix#78: connection dropped (Connection reset by peer)
2018-11-21T23:43:39.723Z|00007|jsonrpc|WARN|unix#87: receive error: Connection reset by peer
2018-11-21T23:43:39.724Z|00008|reconnect|WARN|unix#87: connection dropped (Connection reset by peer)
2018-11-21T23:44:05.943Z|00009|jsonrpc|WARN|unix#94: receive error: Connection reset by peer
2018-11-21T23:44:05.943Z|00010|reconnect|WARN|unix#94: connection dropped (Connection reset by peer)
[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovs-vswitchd.log 
2018-11-22T00:21:52.727Z|00181|connmgr|INFO|br0<->unix#362: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:22:46.366Z|00182|connmgr|INFO|br0<->unix#368: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:40:39.588Z|00183|connmgr|INFO|br0<->unix#449: 3 flow_mods in the last 0 s (3 adds)
2018-11-22T00:40:39.595Z|00184|connmgr|INFO|br0<->unix#451: 1 flow_mods in the last 0 s (1 adds)
2018-11-22T01:01:12.115Z|00185|bridge|INFO|bridge br0: added interface vethe6d048e0 on port 14
2018-11-22T01:01:12.127Z|00186|connmgr|INFO|br0<->unix#547: 4 flow_mods in the last 0 s (4 adds)
2018-11-22T01:01:12.150Z|00187|connmgr|INFO|br0<->unix#549: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.027Z|00188|connmgr|INFO|br0<->unix#551: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.051Z|00189|connmgr|INFO|br0<->unix#553: 4 flow_mods in the last 0 s (4 deletes)
2018-11-22T01:01:33.086Z|00190|bridge|INFO|bridge br0: deleted interface vethe6d048e0 on port 14

After running 'systemctl restart openvswitch' on the infra node, the pod was able to start successfully.
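
In case it helps others hitting the same symptom, the workaround boils down to restarting openvswitch on the affected node and waiting for the pod to be rescheduled; a minimal sketch (the namespace and pod names below are from this reproducer and will differ per environment):

# On the affected infra node: restart openvswitch so the SDN flows get reprogrammed
systemctl restart openvswitch

# From a master node: watch the crashing pod until it goes back to Running
oc -n openshift-monitoring get pods -w | grep prometheus-operator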


Version-Release number of selected component (if applicable):
2018-11-21.2 puddle

How reproducible:
Not always.

Steps to Reproduce:
1. Deploy OCP with 3 master + 2 infra + 2 worker nodes
2. Add one master node (e.g. by re-running the overcloud deploy with an increased master count, as sketched below)
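
For reference, this kind of director scale out is typically triggered by re-running the overcloud deploy command with an increased role count. A rough sketch, assuming the master role is named OpenShiftMaster and follows the usual <Role>Count parameter convention (the environment file names here are placeholders, not the actual files used):

# hypothetical environment file bumping the master count from 3 to 4
cat > ~/scale-master.yaml <<'EOF'
parameter_defaults:
  OpenShiftMasterCount: 4
EOF

# re-run the same deploy command used for the initial deployment, adding the new file
openstack overcloud deploy --templates \
  -e ~/openshift-env.yaml \
  -e ~/scale-master.yaml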

Actual results:
The scale out operation completes successfully, but some infra pods are left in CrashLoopBackOff state.

Expected results:
All pods remain in Running state.

Additional info:

Comment 3 Marius Cornea 2018-11-23 02:51:21 UTC
Created attachment 1508152 [details]
logs.tar.gz

I monitored the pod status and app availability during the scale out. Attaching the results.

Here are my observations so far:

pod status (oc_pods.log): at ~2018-11-22 21:06:19 haproxy becomes unreachable, so the oc command reports no status until 2018-11-22 21:06:44, when we start seeing pods go into Error or CrashLoopBackOff state.

app availability (http_response.log): at 2018-11-22 21:07:23 the server starts responding with 503, and it recovers at 2018-11-22 21:09:34.
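
For context, the monitoring was just two polling loops, each run in its own shell, roughly like the following (a sketch; APP_URL is a placeholder for the route being probed, and the output files correspond to the attached oc_pods.log and http_response.log):

# poll pod status every few seconds (produces oc_pods.log)
while true; do
    date -u '+%Y-%m-%d %H:%M:%S'
    oc get pods --all-namespaces | grep -v Running | grep -v Complete
    sleep 5
done >> oc_pods.log 2>&1

# poll application availability via the exposed route (produces http_response.log)
while true; do
    echo "$(date -u '+%Y-%m-%d %H:%M:%S') $(curl -s -o /dev/null -w '%{http_code}' "$APP_URL")"
    sleep 5
done >> http_response.log 2>&1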

/var/lib/mistral/openshift/ansible.log: at 2018-11-22 21:06:18 docker gets restarted on the master nodes. From an initial look, it appears this is because /etc/sysconfig/docker-network changed:

https://github.com/openstack/ansible-role-container-registry/blob/master/tasks/docker.yml#L100-L107
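
In other words, the role re-templates /etc/sysconfig/docker-network on every run and, when the rendered content differs from what openshift-ansible left behind, it notifies a restart of the docker service. The effect is roughly equivalent to the following shell logic (a sketch of the behaviour, not the actual Ansible task):

# tripleo wants to render the file with --bip, but openshift-ansible rewrote it with --mtu=1450,
# so the content differs, the file gets rewritten and docker is restarted, which causes the outage
wanted="DOCKER_NETWORK_OPTIONS=' --bip=172.31.0.1/24'"
if ! grep -qxF "$wanted" /etc/sysconfig/docker-network; then
    printf '# /etc/sysconfig/docker-network\n%s\n' "$wanted" > /etc/sysconfig/docker-network
    systemctl restart docker
fi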

This is how tripleo sets the docker-network file:
[root@openshift-master-0 heat-admin]# cat /etc/sysconfig/docker-network 
# /etc/sysconfig/docker-network
DOCKER_NETWORK_OPTIONS=' --bip=172.31.0.1/24'

This is how the file looks after the initial deployment:
# /etc/sysconfig/docker-network
DOCKER_NETWORK_OPTIONS=' --mtu=1450'

So, to summarize, I believe the issue is that tripleo restarts docker on an already existing deployment when it shouldn't. Is there any way we could avoid this?

Comment 4 Marius Cornea 2018-11-23 02:53:24 UTC
Changing the title to reflect the new findings and requesting the blocker flag back.

Comment 5 Martin André 2018-11-23 10:05:54 UTC
I believe this has to do with the prerequisites.yml playbook being included when it should not be. I'll submit a patch soon.

Comment 6 Martin André 2018-11-23 10:13:05 UTC
Created Launchpad issue https://bugs.launchpad.net/tripleo/+bug/1804790

Comment 7 Martin André 2018-11-23 10:28:03 UTC
The patch at https://review.openstack.org/#/c/619713/ should fix the generated openshift-ansible playbook for updates.

Comment 8 Martin André 2018-11-23 13:50:33 UTC
@Marius, sorry, I didn't read your comment all the way to the end. I think your analysis is correct and we should find a way to prevent tripleo from restarting docker. That being said, re-running the prerequisites playbook on the existing nodes was also a problem, so my patch is still valid :)

Comment 9 Marius Cornea 2018-11-23 15:38:31 UTC
One more difference is in /etc/sysconfig/docker:

after tripleo configuration:

# /etc/sysconfig/docker

# Modify these options if you want to change the way the docker daemon runs
OPTIONS='-H unix:///run/docker.sock -H unix:///var/lib/openstack/docker.sock --log-driver=journald --signature-verification=false --iptables=false --live-restore'
if [ -z "${DOCKER_CERT_PATH}" ]; then
    DOCKER_CERT_PATH=/etc/docker
fi

# Do not add registries in this file anymore. Use /etc/containers/registries.conf
# instead. For more information reference the registries.conf(5) man page.

# Location used for temporary files, such as those created by
# docker load and build operations. Default is /var/lib/docker/tmp
# Can be overriden by setting the following environment variable.
# DOCKER_TMPDIR=/var/tmp

# Controls the /etc/cron.daily/docker-logrotate cron job status.
# To disable, uncomment the line below.
# LOGROTATE=false

# docker-latest daemon can be used by starting the docker-latest unitfile.
# To use docker-latest client, uncomment below lines
#DOCKERBINARY=/usr/bin/docker-latest
#DOCKERDBINARY=/usr/bin/dockerd-latest
#DOCKER_CONTAINERD_BINARY=/usr/bin/docker-containerd-latest
#DOCKER_CONTAINERD_SHIM_BINARY=/usr/bin/docker-containerd-shim-latest
INSECURE_REGISTRY='--insecure-registry 192.168.24.1:8787'

==========================================================================
after openshift-ansible configuration:

# /etc/sysconfig/docker

# Modify these options if you want to change the way the docker daemon runs
OPTIONS='-H unix:///run/docker.sock -H unix:///var/lib/openstack/docker.sock --log-driver=journald --signature-verification=false --iptables=false --live-restore'
if [ -z "${DOCKER_CERT_PATH}" ]; then
    DOCKER_CERT_PATH=/etc/docker
fi

# Do not add registries in this file anymore. Use /etc/containers/registries.conf
# instead. For more information reference the registries.conf(5) man page.

# Location used for temporary files, such as those created by
# docker load and build operations. Default is /var/lib/docker/tmp
# Can be overriden by setting the following environment variable.
# DOCKER_TMPDIR=/var/tmp

# Controls the /etc/cron.daily/docker-logrotate cron job status.
# To disable, uncomment the line below.
# LOGROTATE=false

# docker-latest daemon can be used by starting the docker-latest unitfile.
# To use docker-latest client, uncomment below lines
#DOCKERBINARY=/usr/bin/docker-latest
#DOCKERDBINARY=/usr/bin/dockerd-latest
#DOCKER_CONTAINERD_BINARY=/usr/bin/docker-containerd-latest
#DOCKER_CONTAINERD_SHIM_BINARY=/usr/bin/docker-containerd-shim-latest
INSECURE_REGISTRY='--insecure-registry 192.168.24.1:8787'
ADD_REGISTRY='--add-registry registry.redhat.io'
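
For what it's worth, the only delta between the two versions above is the ADD_REGISTRY line appended by openshift-ansible. A quick way to confirm that on a node, assuming a copy of the tripleo-rendered file was saved aside beforehand (the /root/docker.sysconfig.tripleo path is hypothetical):

# compare the saved tripleo-rendered file with the current one;
# the expected output is only the appended ADD_REGISTRY line
diff -u /root/docker.sysconfig.tripleo /etc/sysconfig/docker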

Comment 10 Mike Fedosin 2018-11-28 15:06:36 UTC
Proposed a fix for review: https://review.openstack.org/#/c/620621/

Comment 11 Martin André 2018-12-03 10:34:29 UTC
The inappropriate docker restarts coming from openshift-ansible should be fixed by https://review.openstack.org/#/c/619713/, while the ones coming from tripleo should be fixed by https://review.openstack.org/#/c/621241/ and https://review.openstack.org/#/c/620621/.
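
Once those patches are in, an easy way to confirm on a deployed node that a stack update or scale out no longer bounces docker is to compare the daemon's start timestamp before and after the operation; a small sketch using standard systemd/docker commands:

# record docker's start time before the stack update
systemctl show docker -p ActiveEnterTimestamp

# ... run the stack update / scale out ...

# afterwards the timestamp should be unchanged and no containers should show a recent restart
systemctl show docker -p ActiveEnterTimestamp
docker ps --format '{{.Names}} {{.Status}}'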

Comment 29 Martin André 2019-01-10 10:16:51 UTC
No doc text required.

Comment 30 errata-xmlrpc 2019-01-11 11:54:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

