Created attachment 1419413 [details]
Annotated console interaction trying to debug the issue - eventually shows console route starting to work

Description of problem:

During a standard openshift-ansible upgrade from 3.7->3.9 (GA), the playbooks failed because of an issue gathering the health of the console over the course of about 40 minutes. After this excessive period of time passed, the console route suddenly started working again without any action from us.

The route issue:

"curl", "-k", "https://webconsole.openshift-web-console.svc/healthz"

resulted in

"Failed connect to webconsole.openshift-web-console.svc:443; No route to host"

Version-Release number of selected component (if applicable):
OCP v3.9.14

How reproducible:
Unknown

Additional info:
- Attachment shows annotated interactions on the cluster console to determine what was happening. It was during this interaction that the route became available again.
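For anyone trying to reproduce the failure window, the check can be approximated from any master. This is a minimal sketch of a polling loop against the same URL as above, not the actual openshift-ansible task; the retry count and sleep interval are illustrative assumptions:

#!/bin/bash
# Poll the console health endpoint the way the upgrade effectively does.
# URL is taken from the report; retries/interval are illustrative only.
URL="https://webconsole.openshift-web-console.svc/healthz"

for i in $(seq 1 60); do
    if curl -k -s -f "$URL" >/dev/null; then
        echo "console healthy after ${i} attempt(s)"
        exit 0
    fi
    sleep 10
done

echo "console never became healthy" >&2
exit 1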
Weibin, can you try to reproduce this? Unfortunately I can't tell what time things started working from the associated information, so I can't correlate things to the logs.

Justin: If you see this happen again, can you grab the output from:

iptables-save
ovs-ofctl dump-flows br0 -O OpenFlow13
oc get ep <service-name>
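To make that easier to capture in the moment, here is a small wrapper around those three commands. It assumes it runs as root on an affected master, that br0 is the OVS bridge (as in the command above), and that the service of interest is webconsole in openshift-web-console; adjust as needed:

#!/bin/bash
# Collect the requested diagnostics into one timestamped directory
# so they can be attached to this bug.
OUT="/tmp/bz-diag-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$OUT"

iptables-save                          > "$OUT/iptables-save.txt"
ovs-ofctl dump-flows br0 -O OpenFlow13 > "$OUT/ovs-flows.txt"
oc get ep webconsole -n openshift-web-console -o wide > "$OUT/endpoints.txt"

echo "diagnostics written to $OUT"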
This did happen again on east-2a's upgrade today. Unfortunately, it occurred before I saw your request, so I wasn't able to collect that output. I worked around the issue this time by restarting atomic-openshift-node on all the masters. Almost immediately after restarting the node process on one of the masters, the webconsole check by openshift-ansible was satisfied and the upgrade moved on.
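For anyone else hitting this during an upgrade, a sketch of that workaround follows; the master hostnames are placeholders, so substitute your own inventory:

#!/bin/bash
# Restart the node service on each master, then re-check the console.
# Hostnames below are hypothetical examples.
MASTERS="master1.example.com master2.example.com master3.example.com"

for m in $MASTERS; do
    ssh "$m" 'systemctl restart atomic-openshift-node'
done

curl -k https://webconsole.openshift-web-console.svc/healthz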
Reproduced the above issue in an ec2 setup when upgrading v3.7 to v3.9.

curl failed because the web-console pod used the wrong repository (registry.access.redhat.com/openshift3/ose-web-console) after the update to v3.9. After updating the web-console pod to use the new repository (registry.reg-aws.openshift.com:443/openshift3/ose-web-console), the pod ran correctly and curl passed.

#### Test log:

After upgrade to v3.9:

[root@ip-172-18-2-112 ~]# oc version
oc v3.9.24
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-2-112.ec2.internal:8443
openshift v3.9.24
kubernetes v1.9.1+a0ce1bc657

[root@ip-172-18-2-112 ~]# curl -k https://webconsole.openshift-web-console.svc/healthz
curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused

[root@ip-172-18-2-112 ~]# oc project openshift-web-console
Already on project "openshift-web-console" on server "https://ip-172-18-2-112.ec2.internal:8443".

[root@ip-172-18-2-112 ~]# oc get all
NAME                DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/webconsole   1         1         1            0           8m

NAME                     DESIRED   CURRENT   READY   AGE
rs/webconsole-894b9c44   1         1         0       8m

NAME                           READY   STATUS             RESTARTS   AGE
po/webconsole-894b9c44-qbqm7   0/1     ImagePullBackOff   0          8m

NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
svc/webconsole   ClusterIP   172.30.30.34   <none>        443/TCP   8m

[root@ip-172-18-2-112 ~]# oc describe po/webconsole-894b9c44-qbqm7
Name:           webconsole-894b9c44-qbqm7
Namespace:      openshift-web-console
Node:           ip-172-18-2-112.ec2.internal/172.18.2.112
Start Time:     Fri, 20 Apr 2018 15:27:23 -0400
Labels:         pod-template-hash=45065700
                webconsole=true
Annotations:    openshift.io/scc=restricted
Status:         Pending
IP:             10.129.0.11
Controlled By:  ReplicaSet/webconsole-894b9c44
Containers:
  webconsole:
    Container ID:
    Image:         registry.access.redhat.com/openshift3/ose-web-console:v3.9.24
    Image ID:
    Port:          8443/TCP
    Command:
      /usr/bin/origin-web-console
      --audit-log-path=-
      -v=0
      --config=/var/webconsole-config/webconsole-config.yaml
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  100Mi
    Liveness:  exec [/bin/sh -i -c if [[ ! -f /tmp/webconsole-config.hash ]]; then \
                 md5sum /var/webconsole-config/webconsole-config.yaml > /tmp/webconsole-config.hash; \
               elif [[ $(md5sum /var/webconsole-config/webconsole-config.yaml) != $(cat /tmp/webconsole-config.hash) ]]; then \
                 exit 1; \
               fi && curl -k -f https://0.0.0.0:8443/console/] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:8443/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from webconsole-token-6m7gz (ro)
      /var/serving-cert from serving-cert (rw)
      /var/webconsole-config from webconsole-config (rw)
Conditions:
  Type           Status
  Initialized    True
  Ready          False
  PodScheduled   True
Volumes:
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webconsole-serving-cert
    Optional:    false
  webconsole-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      webconsole-config
    Optional:  false
  webconsole-token-6m7gz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webconsole-token-6m7gz
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason                 Age               From                                   Message
  ----     ------                 ----              ----                                   -------
  Normal   Scheduled              8m                default-scheduler                      Successfully assigned webconsole-894b9c44-qbqm7 to ip-172-18-2-112.ec2.internal
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-2-112.ec2.internal  MountVolume.SetUp succeeded for volume "webconsole-config"
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-2-112.ec2.internal  MountVolume.SetUp succeeded for volume "webconsole-token-6m7gz"
  Warning  FailedMount            8m (x2 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  MountVolume.SetUp failed for volume "serving-cert" : secrets "webconsole-serving-cert" not found
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-2-112.ec2.internal  MountVolume.SetUp succeeded for volume "serving-cert"
  Normal   Pulling                8m (x2 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  pulling image "registry.access.redhat.com/openshift3/ose-web-console:v3.9.24"
  Warning  Failed                 8m (x2 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  Failed to pull image "registry.access.redhat.com/openshift3/ose-web-console:v3.9.24": rpc error: code = Unknown desc = error parsing HTTP 404 response body: invalid character 'F' looking for beginning of value: "File not found.\""
  Warning  Failed                 8m (x2 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  Error: ErrImagePull
  Normal   SandboxChanged         7m (x6 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  Pod sandbox changed, it will be killed and re-created.
  Normal   BackOff                7m (x4 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  Back-off pulling image "registry.access.redhat.com/openshift3/ose-web-console:v3.9.24"
  Warning  Failed                 3m (x22 over 8m)  kubelet, ip-172-18-2-112.ec2.internal  Error: ImagePullBackOff
[root@ip-172-18-2-112 ~]#

## Workaround:

# check which registry the working pods use
[root@ip-172-18-2-112 ~]# oc describe pods router-1-w6t24 -n default | grep Image:
    Image:  registry.reg-aws.openshift.com:443/openshift3/ose-haproxy-router:v3.7.44

# update the image to registry.reg-aws.openshift.com:443/openshift3/ose-web-console in the deployment
[root@ip-172-18-2-112 ~]# oc edit deployment webconsole
deployment "webconsole" edited

[root@ip-172-18-2-112 ~]# oc get pod
NAME                        READY   STATUS    RESTARTS   AGE
webconsole-894b9c44-28t4m   1/1     Running   0          5m

[root@ip-172-18-2-112 ~]# curl -k https://webconsole.openshift-web-console.svc/healthz
ok
[root@ip-172-18-2-112 ~]#
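As a side note, the same image fix can be applied non-interactively instead of via oc edit. This is a sketch using oc set image with the registry and container name taken from the describe output above; verify the tag matches your cluster before running it:

# Point the webconsole deployment at the reachable registry; the
# replica set then rolls out a fresh pod with the corrected image.
oc set image deployment/webconsole \
    webconsole=registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.9.24 \
    -n openshift-web-console

# Wait for the rollout, then re-run the health check.
oc rollout status deployment/webconsole -n openshift-web-console
curl -k https://webconsole.openshift-web-console.svc/healthz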
In the starter environments we've seen the problem resolve itself with no further action on our part. We've also seen it resolve immediately after restarting the node service. In these environments the images were always set properly, so I don't think comment 7 is the same problem. If I had to guess, I'd say this is iptables randomly burning CPU while propagating services. According to Justin this has happened at least twice.
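If that hunch is worth chasing, something like the following on an affected master would give a rough indication of whether iptables rule churn is plausible. This is a heuristic check, not a definitive test, and the log message text varies by version:

# A very large NAT table and a slow save suggest service propagation
# through iptables could be the bottleneck.
time iptables-save -t nat > /dev/null
iptables-save -t nat | grep -c '^-A KUBE-'

# Look for proxy rule-sync activity in the node logs (exact wording of
# the sync messages differs between releases).
journalctl -u atomic-openshift-node --since '10 min ago' | grep -i 'sync.*iptables' | tail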
If @sdodson's hunch is right, then this is caused by https://bugzilla.redhat.com/show_bug.cgi?id=1569128
I believe that this did not happen with the 3.10 upgrade, and it has not been reported by any customers going to 3.9.