Bug 1565222

Summary: [starter-us-east-2] webconsole inaccessible for 40 minutes after upgrade from 3.7->3.9
Product: OpenShift Container Platform
Component: Networking
Version: 3.9.0
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: unspecified
Reporter: Justin Pierce <jupierce>
Assignee: Ben Bennett <bbennett>
QA Contact: Meng Bo <bmeng>
CC: aos-bugs, bbennett, jokerman, jupierce, mmccomas
Flags: jupierce: needinfo-
Type: Bug
Last Closed: 2018-06-06 14:17:33 UTC
Attachments:
  Annotated console interaction trying to debug the issue - eventually shows console route starting to work

Description Justin Pierce 2018-04-09 16:22:30 UTC
Created attachment 1419413 [details]
Annotated console interaction trying to debug the issue - eventually shows console route starting to work

Description of problem:
During a standard openshift-ansible upgrade from 3.7 to 3.9 (GA), the playbooks failed because they could not verify the health of the console over the course of about 40 minutes. After this excessive period of time had passed, the console route suddenly started working again without any action from us.

The route issue:
  curl -k https://webconsole.openshift-web-console.svc/healthz
resulted in "Failed connect to webconsole.openshift-web-console.svc:443; No route to host"
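A minimal sketch of a probe loop that would record how long the service stays unreachable (same health URL as above; the 10-second interval is arbitrary):

  # poll the web console health endpoint until it answers, timestamping each failure
  while ! curl -sk --max-time 5 https://webconsole.openshift-web-console.svc/healthz; do
      date
      sleep 10
  done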

Version-Release number of selected component (if applicable):
OCP v3.9.14

How reproducible:
Unknown

Additional info:
- The attachment shows annotated interactions on the cluster console while trying to determine what was happening. It was during this interaction that the route became available again.

Comment 2 Ben Bennett 2018-04-10 15:58:32 UTC
Weibin, can you try to reproduce this?  Unfortunately I can't tell what time things started working from the associated information, so I can't correlate things to the logs.

Justin: If you see this happen again, can you grab the output from:
  iptables-save
  ovs-ofctl dump-flows br0 -O OpenFlow13
  oc get ep <service-name>
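For this bug's web console service specifically, a one-shot capture of those might look like the following (a sketch; the namespace and service name are taken from this report, the output paths are arbitrary):

  ts=$(date +%s)
  iptables-save > /tmp/iptables.$ts.txt
  ovs-ofctl dump-flows br0 -O OpenFlow13 > /tmp/ovs-flows.$ts.txt
  oc get ep webconsole -n openshift-web-console -o wide > /tmp/webconsole-ep.$ts.txt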

Comment 4 Justin Pierce 2018-04-10 17:35:55 UTC
This did happen again during east-2a's upgrade today. Unfortunately, it occurred before your request. I worked around the issue this time by restarting atomic-openshift-node on all the masters.
Almost immediately after restarting the node process on one of the masters, openshift-ansible's webconsole check was satisfied and the upgrade moved on.
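(For reference, that workaround amounts to roughly the following; a sketch assuming three masters reachable over SSH, with placeholder hostnames:)

  for master in master-0 master-1 master-2; do
      ssh "$master" 'systemctl restart atomic-openshift-node'
  done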

Comment 7 Weibin Liang 2018-04-20 20:13:01 UTC
Reproduced the above issue in an ec2 setup when upgrading from v3.7 to v3.9.

The curl failed because the web-console pod was using the wrong repository (registry.access.redhat.com/openshift3/ose-web-console) after the update to v3.9.

After updating the web-console pod to use the new repository (registry.reg-aws.openshift.com:443/openshift3/ose-web-console), the pod ran correctly and the curl passed.

#### Test log:

After upgrade to v3.9

[root@ip-172-18-2-112 ~]# oc version
oc v3.9.24
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://ip-172-18-2-112.ec2.internal:8443
openshift v3.9.24
kubernetes v1.9.1+a0ce1bc657
[root@ip-172-18-2-112 ~]# 

[root@ip-172-18-2-112 ~]# curl -k https://webconsole.openshift-web-console.svc/healthz
curl: (7) Failed connect to webconsole.openshift-web-console.svc:443; Connection refused
[root@ip-172-18-2-112 ~]# oc project openshift-web-console
Already on project "openshift-web-console" on server "https://ip-172-18-2-112.ec2.internal:8443".
[root@ip-172-18-2-112 ~]# oc get all
NAME                DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/webconsole   1         1         1            0           8m

NAME                     DESIRED   CURRENT   READY     AGE
rs/webconsole-894b9c44   1         1         0         8m

NAME                           READY     STATUS             RESTARTS   AGE
po/webconsole-894b9c44-qbqm7   0/1       ImagePullBackOff   0          8m

NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
svc/webconsole   ClusterIP   172.30.30.34   <none>        443/TCP   8m
[root@ip-172-18-2-112 ~]# oc describe po/webconsole-894b9c44-qbqm7
Name:           webconsole-894b9c44-qbqm7
Namespace:      openshift-web-console
Node:           ip-172-18-2-112.ec2.internal/172.18.2.112
Start Time:     Fri, 20 Apr 2018 15:27:23 -0400
Labels:         pod-template-hash=45065700
                webconsole=true
Annotations:    openshift.io/scc=restricted
Status:         Pending
IP:             10.129.0.11
Controlled By:  ReplicaSet/webconsole-894b9c44
Containers:
  webconsole:
    Container ID:  
    Image:         registry.access.redhat.com/openshift3/ose-web-console:v3.9.24
    Image ID:      
    Port:          8443/TCP
    Command:
      /usr/bin/origin-web-console
      --audit-log-path=-
      -v=0
      --config=/var/webconsole-config/webconsole-config.yaml
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Requests:
      cpu:     100m
      memory:  100Mi
    Liveness:  exec [/bin/sh -i -c if [[ ! -f /tmp/webconsole-config.hash ]]; then \
  md5sum /var/webconsole-config/webconsole-config.yaml > /tmp/webconsole-config.hash; \
elif [[ $(md5sum /var/webconsole-config/webconsole-config.yaml) != $(cat /tmp/webconsole-config.hash) ]]; then \
  exit 1; \
fi && curl -k -f https://0.0.0.0:8443/console/] delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:8443/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from webconsole-token-6m7gz (ro)
      /var/serving-cert from serving-cert (rw)
      /var/webconsole-config from webconsole-config (rw)
Conditions:
  Type           Status
  Initialized    True 
  Ready          False 
  PodScheduled   True 
Volumes:
  serving-cert:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webconsole-serving-cert
    Optional:    false
  webconsole-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      webconsole-config
    Optional:  false
  webconsole-token-6m7gz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  webconsole-token-6m7gz
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  node-role.kubernetes.io/master=true
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
Events:
  Type     Reason                 Age               From                                   Message
  ----     ------                 ----              ----                                   -------
  Normal   Scheduled              8m                default-scheduler                      Successfully assigned webconsole-894b9c44-qbqm7 to ip-172-18-2-112.ec2.internal
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-2-112.ec2.internal  MountVolume.SetUp succeeded for volume "webconsole-config"
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-2-112.ec2.internal  MountVolume.SetUp succeeded for volume "webconsole-token-6m7gz"
  Warning  FailedMount            8m (x2 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  MountVolume.SetUp failed for volume "serving-cert" : secrets "webconsole-serving-cert" not found
  Normal   SuccessfulMountVolume  8m                kubelet, ip-172-18-2-112.ec2.internal  MountVolume.SetUp succeeded for volume "serving-cert"
  Normal   Pulling                8m (x2 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  pulling image "registry.access.redhat.com/openshift3/ose-web-console:v3.9.24"
  Warning  Failed                 8m (x2 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  Failed to pull image "registry.access.redhat.com/openshift3/ose-web-console:v3.9.24": rpc error: code = Unknown desc = error parsing HTTP 404 response body: invalid character 'F' looking for beginning of value: "File not found.\""
  Warning  Failed                 8m (x2 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  Error: ErrImagePull
  Normal   SandboxChanged         7m (x6 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  Pod sandbox changed, it will be killed and re-created.
  Normal   BackOff                7m (x4 over 8m)   kubelet, ip-172-18-2-112.ec2.internal  Back-off pulling image "registry.access.redhat.com/openshift3/ose-web-console:v3.9.24"
  Warning  Failed                 3m (x22 over 8m)  kubelet, ip-172-18-2-112.ec2.internal  Error: ImagePullBackOff
[root@ip-172-18-2-112 ~]# 


## Workaround:
[root@ip-172-18-2-112 ~]# oc describe pods router-1-w6t24 -n default | grep Image:
Image:          registry.reg-aws.openshift.com:443/openshift3/ose-haproxy-router:v3.7.44
# update the pod to use registry.reg-aws.openshift.com:443/openshift3/ose-web-console
[root@ip-172-18-2-112 ~]# oc edit po/webconsole-894b9c44-qbqm7
deployment "webconsole" edited

[root@ip-172-18-2-112 ~]# oc get pod
NAME                        READY     STATUS    RESTARTS   AGE
webconsole-894b9c44-28t4m   1/1       Running   0          5m
[root@ip-172-18-2-112 ~]# curl -k https://webconsole.openshift-web-console.svc/healthz
ok
[root@ip-172-18-2-112 ~]#
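An equivalent non-interactive fix would be to point the deployment at the working registry directly (a sketch, assuming the deployment and container are both named "webconsole" as shown above, and the v3.9.24 tag from the failing pod):

  oc -n openshift-web-console set image deployment/webconsole \
      webconsole=registry.reg-aws.openshift.com:443/openshift3/ose-web-console:v3.9.24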

Comment 8 Scott Dodson 2018-05-01 20:00:55 UTC
In the starter environments we've seen the problem resolve itself with no further action on our part. We have also seen it immediately resolved by restarting the node service. In these environments the images were always set properly, so I don't think comment 7 is the same problem. If I had to guess, I would imagine this is iptables randomly burning CPU while propagating services.
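One quick way to test that hunch on an affected master while the curl is failing would be to check whether the service rules ever made it into iptables (a sketch, using the webconsole ClusterIP 172.30.30.34 from comment 7; the IP differs per cluster):

  iptables-save | grep -c 172.30.30.34    # 0 would mean the service rules were never programmed
  journalctl -u atomic-openshift-node --since "1 hour ago" | grep -i iptables | tail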

According to Justin this has happened at least twice.

Comment 9 Ben Bennett 2018-05-02 15:34:51 UTC
If @sdodson's hunch is right, then this is caused by https://bugzilla.redhat.com/show_bug.cgi?id=1569128

Comment 10 Ben Bennett 2018-06-06 14:17:33 UTC
I believe that this did not happen with the 3.10 upgrade, and it has not been reported by any customers going to 3.9.