Description of problem:

Director deployed OCP 3.11: openshift-monitoring pods end up in CrashLoopBackOff after scale out:

[root@openshift-master-0 heat-admin]# oc get pods --all-namespaces | grep -v Running | grep -v Complete
NAMESPACE              NAME                                   READY   STATUS             RESTARTS   AGE
openshift-monitoring   prometheus-operator-5677fb6f87-xzdw5   0/1     CrashLoopBackOff   17         1h

Checking the infra node where the pod was running we can see:

[root@openshift-infra-0 heat-admin]# docker logs -f k8s_prometheus-operator_prometheus-operator-5677fb6f87-xzdw5_openshift-monitoring_cfed5b0c-ede6-11e8-8571-525400112488_19
ts=2018-11-22T01:34:30.683149725Z caller=main.go:130 msg="Starting Prometheus Operator version '0.23.1'."
ts=2018-11-22T01:34:30.687595956Z caller=main.go:193 msg="Unhandled error received. Exiting..." err="communicating with server failed: Get https://172.30.0.1:443/version?timeout=32s: dial tcp 172.30.0.1:443: connect: network is unreachable"

Checking the openvswitch logs:

[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovsdb-server.log
2018-11-21T22:57:24.935Z|00001|vlog|INFO|opened log file /var/log/openvswitch/ovsdb-server.log
2018-11-21T22:57:24.946Z|00002|ovsdb_server|INFO|ovsdb-server (Open vSwitch) 2.10.0
2018-11-21T22:57:34.961Z|00003|memory|INFO|4248 kB peak resident set size after 10.0 seconds
2018-11-21T22:57:34.961Z|00004|memory|INFO|cells:38 json-caches:1 monitors:2 sessions:1
2018-11-21T23:43:13.575Z|00005|jsonrpc|WARN|unix#78: receive error: Connection reset by peer
2018-11-21T23:43:13.575Z|00006|reconnect|WARN|unix#78: connection dropped (Connection reset by peer)
2018-11-21T23:43:39.723Z|00007|jsonrpc|WARN|unix#87: receive error: Connection reset by peer
2018-11-21T23:43:39.724Z|00008|reconnect|WARN|unix#87: connection dropped (Connection reset by peer)
2018-11-21T23:44:05.943Z|00009|jsonrpc|WARN|unix#94: receive error: Connection reset by peer
2018-11-21T23:44:05.943Z|00010|reconnect|WARN|unix#94: connection dropped (Connection reset by peer)

[root@openshift-infra-0 heat-admin]# tail -10 /var/log/openvswitch/ovs-vswitchd.log
2018-11-22T00:21:52.727Z|00181|connmgr|INFO|br0<->unix#362: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:22:46.366Z|00182|connmgr|INFO|br0<->unix#368: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T00:40:39.588Z|00183|connmgr|INFO|br0<->unix#449: 3 flow_mods in the last 0 s (3 adds)
2018-11-22T00:40:39.595Z|00184|connmgr|INFO|br0<->unix#451: 1 flow_mods in the last 0 s (1 adds)
2018-11-22T01:01:12.115Z|00185|bridge|INFO|bridge br0: added interface vethe6d048e0 on port 14
2018-11-22T01:01:12.127Z|00186|connmgr|INFO|br0<->unix#547: 4 flow_mods in the last 0 s (4 adds)
2018-11-22T01:01:12.150Z|00187|connmgr|INFO|br0<->unix#549: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.027Z|00188|connmgr|INFO|br0<->unix#551: 2 flow_mods in the last 0 s (2 deletes)
2018-11-22T01:01:33.051Z|00189|connmgr|INFO|br0<->unix#553: 4 flow_mods in the last 0 s (4 deletes)
2018-11-22T01:01:33.086Z|00190|bridge|INFO|bridge br0: deleted interface vethe6d048e0 on port 14

After running 'systemctl restart openvswitch' on the infra node, the pod was able to start successfully.

Version-Release number of selected component (if applicable):
2018-11-21.2 puddle

How reproducible:
Not always.

Steps to Reproduce:
1. Deploy OCP with 3 master + 2 infra + 2 worker nodes
2. Add one master node

Actual results:
The scale-out operation completes fine, but some infra pods end up in CrashLoopBackOff state.

Expected results:
All pods remain in Running state.

Additional info:
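For reference, the workaround described above can be scripted roughly as follows. This is a minimal sketch: the namespace and node names are the ones from this report, the oc commands are meant to run from a master node, and systemctl on the affected infra node.

# Find pods stuck outside Running/Completed and the nodes they run on
oc get pods --all-namespaces -o wide | grep -v Running | grep -v Complete

# On the infra node hosting the crash-looping pod (openshift-infra-0 here),
# restart Open vSwitch -- this is what allowed the pod to start again
systemctl restart openvswitch

# Back on a master, watch the pod come back up
oc -n openshift-monitoring get pods -w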
Created attachment 1508152 [details]
logs.tar.gz

I monitored the pod status and app availability during the scale out. Attaching the results. Here are my observations so far:

Pod status (oc_pods.log): at ~2018-11-22 21:06:19 haproxy becomes unreachable, so the oc command doesn't provide any status until 2018-11-22 21:06:44, when we start seeing pods getting into Error or CrashLoopBackOff state.

App availability (http_response.log): at 2018-11-22 21:07:23 the server starts responding with 503 and it recovers at 2018-11-22 21:09:34.

/var/lib/mistral/openshift/ansible.log: at 2018-11-22 21:06:18 docker gets restarted on the master nodes. From an initial look it appears this is because /etc/sysconfig/docker-network changed:
https://github.com/openstack/ansible-role-container-registry/blob/master/tasks/docker.yml#L100-L107

This is how tripleo sets the docker-network file:

[root@openshift-master-0 heat-admin]# cat /etc/sysconfig/docker-network
# /etc/sysconfig/docker-network
DOCKER_NETWORK_OPTIONS=' --bip=172.31.0.1/24'

This is how the file looks after the initial deployment:

# /etc/sysconfig/docker-network
DOCKER_NETWORK_OPTIONS=' --mtu=1450'

So, to summarize, I believe the issue here is that tripleo restarts docker on an already existing deployment when it shouldn't. Is there any way we could avoid this?
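To double-check that correlation on a master node, something like the following can be used. This is a rough sketch: the log path is the one quoted above, and the grep patterns are assumptions about how the task and restart messages are worded.

# Current contents of the file tripleo rewrites, and docker's last start time
cat /etc/sysconfig/docker-network
systemctl show docker --property=ActiveEnterTimestamp

# Look for the docker-network change / docker restart in the deployment log
grep -n -i 'docker-network\|restart docker' /var/lib/mistral/openshift/ansible.log | tail -20

# Docker restarts also show up in the journal on the node itself
journalctl -u docker --no-pager | grep -i 'starting docker' | tail -5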
Changing the title to reflect the new findings and requesting back the blocker flag.
I believe this has to do with the prerequisites.yml playbook being included when it should not. I'll submit a patch soon.
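Until a patch lands, one way to confirm this theory is to check whether the generated playbooks pull in prerequisites.yml for all nodes rather than only the new ones. A rough sketch, assuming the generated playbooks live under the mistral working directory mentioned earlier:

# Find generated playbooks that reference prerequisites.yml
grep -rn 'prerequisites' /var/lib/mistral/openshift/ --include='*.yml' 2>/dev/null

# Show the surrounding lines to see whether the import is limited to the
# newly added nodes or runs against every existing node as well
grep -rn -A3 'import_playbook.*prerequisites' /var/lib/mistral/openshift/ 2>/dev/null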
Created Launchpad issue https://bugs.launchpad.net/tripleo/+bug/1804790
The patch at https://review.openstack.org/#/c/619713/ should fix the generated openshift-ansible playbook for updates.
@Marius, sorry, I didn't read your comment all the way to the end. I think your analysis is correct and we should find a way to prevent tripleo from restarting docker. That being said, re-running the prerequisites playbook on the existing nodes was also a problem, so my patch is still valid :)
One more difference is in /etc/sysconfig/docker.

After tripleo configuration:

# /etc/sysconfig/docker

# Modify these options if you want to change the way the docker daemon runs
OPTIONS='-H unix:///run/docker.sock -H unix:///var/lib/openstack/docker.sock --log-driver=journald --signature-verification=false --iptables=false --live-restore'
if [ -z "${DOCKER_CERT_PATH}" ]; then
    DOCKER_CERT_PATH=/etc/docker
fi

# Do not add registries in this file anymore. Use /etc/containers/registries.conf
# instead. For more information reference the registries.conf(5) man page.

# Location used for temporary files, such as those created by
# docker load and build operations. Default is /var/lib/docker/tmp
# Can be overriden by setting the following environment variable.
# DOCKER_TMPDIR=/var/tmp

# Controls the /etc/cron.daily/docker-logrotate cron job status.
# To disable, uncomment the line below.
# LOGROTATE=false

# docker-latest daemon can be used by starting the docker-latest unitfile.
# To use docker-latest client, uncomment below lines
#DOCKERBINARY=/usr/bin/docker-latest
#DOCKERDBINARY=/usr/bin/dockerd-latest
#DOCKER_CONTAINERD_BINARY=/usr/bin/docker-containerd-latest
#DOCKER_CONTAINERD_SHIM_BINARY=/usr/bin/docker-containerd-shim-latest
INSECURE_REGISTRY='--insecure-registry 192.168.24.1:8787'

==========================================================================

After openshift-ansible configuration:

# /etc/sysconfig/docker

# Modify these options if you want to change the way the docker daemon runs
OPTIONS='-H unix:///run/docker.sock -H unix:///var/lib/openstack/docker.sock --log-driver=journald --signature-verification=false --iptables=false --live-restore'
if [ -z "${DOCKER_CERT_PATH}" ]; then
    DOCKER_CERT_PATH=/etc/docker
fi

# Do not add registries in this file anymore. Use /etc/containers/registries.conf
# instead. For more information reference the registries.conf(5) man page.

# Location used for temporary files, such as those created by
# docker load and build operations. Default is /var/lib/docker/tmp
# Can be overriden by setting the following environment variable.
# DOCKER_TMPDIR=/var/tmp

# Controls the /etc/cron.daily/docker-logrotate cron job status.
# To disable, uncomment the line below.
# LOGROTATE=false

# docker-latest daemon can be used by starting the docker-latest unitfile.
# To use docker-latest client, uncomment below lines
#DOCKERBINARY=/usr/bin/docker-latest
#DOCKERDBINARY=/usr/bin/dockerd-latest
#DOCKER_CONTAINERD_BINARY=/usr/bin/docker-containerd-latest
#DOCKER_CONTAINERD_SHIM_BINARY=/usr/bin/docker-containerd-shim-latest
INSECURE_REGISTRY='--insecure-registry 192.168.24.1:8787'
ADD_REGISTRY='--add-registry registry.redhat.io'
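A simple way to catch this kind of drift during a scale out is to snapshot the files beforehand and diff them afterwards. A minimal sketch (the snapshot paths are arbitrary):

# Before the scale out, on each existing node
cp /etc/sysconfig/docker /root/docker.before
cp /etc/sysconfig/docker-network /root/docker-network.before

# After the scale out, show what changed -- here the ADD_REGISTRY line added
# by openshift-ansible and any DOCKER_NETWORK_OPTIONS rewrite by tripleo
diff -u /root/docker.before /etc/sysconfig/docker
diff -u /root/docker-network.before /etc/sysconfig/docker-network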
Proposed fix on review https://review.openstack.org/#/c/620621/
The inappropriate openshift-ansible docker restarts should be fixed by https://review.openstack.org/#/c/619713/, while the tripleo ones should be fixed by https://review.openstack.org/#/c/621241/ and https://review.openstack.org/#/c/620621/.
No doc text required.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045