Bug 1766851 - Upgrade fails because the ingress operator goes degraded
Summary: Upgrade fails because the ingress operator goes degraded
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.3.0
Assignee: Miciah Dashiel Butler Masters
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-10-30 05:41 UTC by liujia
Modified: 2022-08-04 22:24 UTC
3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-01-23 11:09:47 UTC
Target Upstream Version:
Embargoed:




Links:
- GitHub: openshift/cluster-ingress-operator pull 321 (closed): "Bug 1766851: Add migration for affinity and deployment strategy" (last updated 2021-02-03 15:32:53 UTC)
- Red Hat Product Errata: RHBA-2020:0062 (last updated 2020-01-23 11:10:15 UTC)

Description liujia 2019-10-30 05:41:36 UTC
Description of problem:
Upgrading a UPI/GCP cluster from 4.2.2 to 4.3.0-0.nightly-2019-10-29-073252 fails.

# ./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.2.2     True        True          70m     Unable to apply 4.3.0-0.nightly-2019-10-29-073252: the cluster operator ingress is degraded 

# ./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
...
dns                                        4.3.0-0.nightly-2019-10-29-073252   True        False         False      99m
image-registry                             4.3.0-0.nightly-2019-10-29-073252   True        False         False      91m
ingress                                    4.3.0-0.nightly-2019-10-29-073252   True        False         True       91m
...
machine-api                                4.3.0-0.nightly-2019-10-29-073252   True        False         False      100m
machine-config                             4.2.2                               True        False         False      99m
...

===========================================================================
# ./oc describe co ingress
Name:         ingress
Namespace:    
...
Status:
  Conditions:
    Last Transition Time:  2019-10-30T02:55:27Z
    Message:               Some ingresscontrollers are degraded: default
    Reason:                IngressControllersDegraded
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2019-10-30T02:45:26Z
    Message:               desired and current number of IngressControllers are equal
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2019-10-30T02:14:24Z
    Message:               desired and current number of IngressControllers are equal
    Status:                True
    Type:                  Available
...

Checked the ingresscontroller: the new deployment timed out progressing.
# oc get IngressController default -o yaml
...
  - lastTransitionTime: "2019-10-30T02:55:27Z"
    message: 'The deployment failed (reason: ProgressDeadlineExceeded) with message:
      ReplicaSet "router-default-5c94bd7d94" has timed out progressing.'
    reason: DeploymentFailed
    status: "True"
    type: Degraded
...
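
For context, ProgressDeadlineExceeded is the standard Deployment signal that a rollout has stalled: if the Deployment makes no progress for spec.progressDeadlineSeconds (600s by default), the controller sets Progressing=False with that reason, and the ingress operator surfaces this as Degraded on the ingresscontroller. A minimal sketch of the fields involved (standard Kubernetes field names; the values shown are illustrative defaults, not read from this cluster):

  apiVersion: apps/v1
  kind: Deployment
  metadata:
    name: router-default
    namespace: openshift-ingress
  spec:
    progressDeadlineSeconds: 600      # rollout must make progress within this window
    strategy:
      type: RollingUpdate
      rollingUpdate:
        maxSurge: 25%                 # extra pods allowed during the rollout
        maxUnavailable: 25%
  status:
    conditions:
    - type: Progressing
      status: "False"
      reason: ProgressDeadlineExceeded
      message: ReplicaSet "router-default-5c94bd7d94" has timed out progressing.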

Checked that the newly deployed router pod is stuck in Pending. It is the surge pod from the rolling update: the new ReplicaSet has scaled to 1 while the old one still holds 2 replicas.
# oc get all -n openshift-ingress
NAME                                  READY   STATUS    RESTARTS   AGE
pod/router-default-5c94bd7d94-xl4pt   0/1     Pending   0          68m
pod/router-default-78f49dfd9-2p4pk    1/1     Running   0          104m
pod/router-default-78f49dfd9-hx5vm    1/1     Running   0          104m

NAME                              TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                      AGE
service/router-default            LoadBalancer   172.30.12.124   35.239.159.216   80:30778/TCP,443:30771/TCP   104m
service/router-internal-default   ClusterIP      172.30.31.142   <none>           80/TCP,443/TCP,1936/TCP      104m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/router-default   2/2     1            2           104m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/router-default-5c94bd7d94   1         1         0       68m
replicaset.apps/router-default-78f49dfd9    2         2         2       104m

# oc describe pod/router-default-5c94bd7d94-xl4pt -n openshift-ingress
Name:                 router-default-5c94bd7d94-xl4pt
Namespace:            openshift-ingress
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
                      ingresscontroller.operator.openshift.io/hash=576796cc5d
                      pod-template-hash=5c94bd7d94
Annotations:          openshift.io/scc: restricted
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/router-default-5c94bd7d94
Containers:
  router:
    Image:       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:83defbeee71841e5d3f2d9d5a971f3fb89605fdc4503c5a7d60af9609bf1a5bb
    Ports:       80/TCP, 443/TCP, 1936/TCP
    Host Ports:  0/TCP, 0/TCP, 0/TCP
    Requests:
      cpu:      100m
      memory:   256Mi
    Liveness:   http-get http://:1936/healthz delay=10s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://:1936/healthz/ready delay=10s timeout=1s period=10s #success=1 #failure=3
    Environment:
      DEFAULT_CERTIFICATE_DIR:       /etc/pki/tls/private
      ROUTER_CANONICAL_HOSTNAME:     apps.jliu-bug.qe.gcp.devcluster.openshift.com
      ROUTER_CIPHERS:                TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256:ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
      ROUTER_METRICS_TLS_CERT_FILE:  /etc/pki/tls/metrics-certs/tls.crt
      ROUTER_METRICS_TLS_KEY_FILE:   /etc/pki/tls/metrics-certs/tls.key
      ROUTER_METRICS_TYPE:           haproxy
      ROUTER_SERVICE_NAME:           default
      ROUTER_SERVICE_NAMESPACE:      openshift-ingress
      ROUTER_THREADS:                4
      SSL_MIN_VERSION:               TLSv1.2
      STATS_PASSWORD:                <set to the key 'statsPassword' in secret 'router-stats-default'>  Optional: false
      STATS_PORT:                    1936
      STATS_USERNAME:                <set to the key 'statsUsername' in secret 'router-stats-default'>  Optional: false
    Mounts:
      /etc/pki/tls/metrics-certs from metrics-certs (ro)
      /etc/pki/tls/private from default-certificate (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from router-token-jpcxg (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-certificate:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  router-certs-default
    Optional:    false
  metrics-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  router-metrics-certs-default
    Optional:    false
  router-token-jpcxg:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  router-token-jpcxg
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  kubernetes.io/os=linux
                 node-role.kubernetes.io/worker=
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/5 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taints that the pod didn't tolerate, 3 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/5 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taints that the pod didn't tolerate, 3 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/5 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity, 1 node(s) didn't satisfy existing pods anti-affinity rules, 1 node(s) had taints that the pod didn't tolerate, 3 node(s) didn't match node selector.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/5 nodes are available: 2 node(s) didn't match pod affinity/anti-affinity, 2 node(s) didn't satisfy existing pods anti-affinity rules, 3 node(s) didn't match node selector.
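
These events describe a scheduling deadlock: the two running 4.2 router pods already occupy the only workers that match the node selector, and the surged 4.3 pod carries pod (anti-)affinity constraints that rule out co-scheduling with them, so 0/5 nodes qualify. A sketch of the kind of required anti-affinity rule that produces messages like these (the exact rule the operator applies is not shown in this report and may differ; this shape is inferred from the event text and from the linked fix, "Add migration for affinity and deployment strategy"):

  spec:
    template:
      spec:
        affinity:
          podAntiAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
            - topologyKey: kubernetes.io/hostname     # at most one matching pod per node
              labelSelector:
                matchLabels:
                  ingresscontroller.operator.openshift.io/deployment-ingresscontroller: default

With only two schedulable workers, both already hosting pods that match the selector, a third (surge) pod can never be placed, and the rollout stalls until the progress deadline expires.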


Version-Release number of selected component (if applicable):
4.3.0-0.nightly-2019-10-29-073252

How reproducible:
always

Steps to Reproduce:
1. Upgrade a UPI/GCP cluster from 4.2.2 to a 4.3 nightly build.

Actual results:
The upgrade fails with the ingress operator degraded.

Expected results:
The upgrade succeeds.

Additional info:
Refer to the must-gather logs.

Comment 7 liujia 2019-11-12 06:26:40 UTC
Version:4.3.0-0.nightly-2019-11-12-000306

Upgrading from v4.2.4 to the latest 4.3.0-0.nightly-2019-11-12-000306 succeeded.

# ./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.3.0-0.nightly-2019-11-12-000306   True        False         False      112m
cloud-credential                           4.3.0-0.nightly-2019-11-12-000306   True        False         False      127m
cluster-autoscaler                         4.3.0-0.nightly-2019-11-12-000306   True        False         False      122m
console                                    4.3.0-0.nightly-2019-11-12-000306   True        False         False      73m
dns                                        4.3.0-0.nightly-2019-11-12-000306   True        False         False      126m
image-registry                             4.3.0-0.nightly-2019-11-12-000306   True        False         False      86m
ingress                                    4.3.0-0.nightly-2019-11-12-000306   True        False         False      81m
insights                                   4.3.0-0.nightly-2019-11-12-000306   True        False         False      127m
kube-apiserver                             4.3.0-0.nightly-2019-11-12-000306   True        False         False      126m
kube-controller-manager                    4.3.0-0.nightly-2019-11-12-000306   True        False         False      124m
kube-scheduler                             4.3.0-0.nightly-2019-11-12-000306   True        False         False      125m
machine-api                                4.3.0-0.nightly-2019-11-12-000306   True        False         False      127m
machine-config                             4.3.0-0.nightly-2019-11-12-000306   True        False         False      126m
marketplace                                4.3.0-0.nightly-2019-11-12-000306   True        False         False      79m
monitoring                                 4.3.0-0.nightly-2019-11-12-000306   True        False         False      75m
network                                    4.3.0-0.nightly-2019-11-12-000306   True        False         False      126m
node-tuning                                4.3.0-0.nightly-2019-11-12-000306   True        False         False      85m
openshift-apiserver                        4.3.0-0.nightly-2019-11-12-000306   True        False         False      83m
openshift-controller-manager               4.3.0-0.nightly-2019-11-12-000306   True        False         False      124m
openshift-samples                          4.3.0-0.nightly-2019-11-12-000306   True        False         False      98m
operator-lifecycle-manager                 4.3.0-0.nightly-2019-11-12-000306   True        False         False      126m
operator-lifecycle-manager-catalog         4.3.0-0.nightly-2019-11-12-000306   True        False         False      126m
operator-lifecycle-manager-packageserver   4.3.0-0.nightly-2019-11-12-000306   True        False         False      73m
service-ca                                 4.3.0-0.nightly-2019-11-12-000306   True        False         False      127m
service-catalog-apiserver                  4.3.0-0.nightly-2019-11-12-000306   True        False         False      123m
service-catalog-controller-manager         4.3.0-0.nightly-2019-11-12-000306   True        False         False      123m
storage                                    4.3.0-0.nightly-2019-11-12-000306   True        False         False      99m

# ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.3.0-0.nightly-2019-11-12-000306   True        False         70m     Cluster version is 4.3.0-0.nightly-2019-11-12-000306

# ./oc get all -n openshift-ingress
NAME                                  READY   STATUS    RESTARTS   AGE
pod/router-default-54c44cb495-flhrk   1/1     Running   0          128m
pod/router-default-54c44cb495-zldq9   1/1     Running   0          132m

NAME                              TYPE           CLUSTER-IP      EXTERNAL-IP    PORT(S)                      AGE
service/router-default            LoadBalancer   172.30.141.82   35.222.217.2   80:32378/TCP,443:31958/TCP   164m
service/router-internal-default   ClusterIP      172.30.53.70    <none>         80/TCP,443/TCP,1936/TCP      164m

NAME                             READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/router-default   2/2     2            2           164m

NAME                                        DESIRED   CURRENT   READY   AGE
replicaset.apps/router-default-54c44cb495   2         2         2       133m
replicaset.apps/router-default-5b4cc4c6f6   0         0         0       164m
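
For anyone re-verifying this on another cluster, the rollout can be confirmed with standard oc commands (same resource names as used throughout this report):

  # oc -n openshift-ingress rollout status deployment/router-default
  # oc get co ingress
  # oc -n openshift-ingress-operator get ingresscontroller default -o yaml

Here the old ReplicaSet scaled down to 0 and the new one is fully ready, so no surge pod was left Pending this time.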

Comment 9 errata-xmlrpc 2020-01-23 11:09:47 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0062

