Bug 1687940

Summary: Creating an IngressController Never Achieves Desired Deployment AvailableReplicas
Product: OpenShift Container Platform
Reporter: Daneyon Hansen <dhansen>
Component: Networking
Assignee: Dan Mace <dmace>
Networking sub component: router
QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
CC: aos-bugs, dmace
Version: 4.1.0
Target Milestone: ---
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-06-04 10:45:33 UTC
Type: Bug

Description Daneyon Hansen 2019-03-12 17:23:43 UTC
Description of problem:
When creating a non-default IngressController, the dependent deployment never achieves the default (2) available replicas.

Version-Release number of selected component (if applicable):
4.0.0-0.alpha-2019-03-12-024440

How reproducible:
Always

Steps to Reproduce:
1. Install OpenShift
2. Create an ingresscontroller:

kind: IngressController
apiVersion: operator.openshift.io/v1
metadata:
  name: test0
  namespace: openshift-ingress-operator
spec:
  domain: tests0.<YOUR_INGRESS_DOMAIN>

3. Check the ingresscontroller:

$ oc get ingresscontroller/test0 -n openshift-ingress-operator -o yaml | grep availableReplicas

Actual results:

availableReplicas
  availableReplicas: 1

Expected results:

Note: The default number of replicas for an ingresscontroller is 2.

availableReplicas
  availableReplicas: 2

Additional info:

$ oc logs deploy/router-test0 -n openshift-ingress
Found 2 pods, using pod/router-test0-566cfb6db8-gvl4z
I0312 15:07:03.471696       1 template.go:299] Starting template router (4.0.0-20-g80b8c3d)
I0312 15:07:03.475628       1 metrics.go:147] Router health and metrics port listening at 0.0.0.0:1936 on HTTP and HTTPS
E0312 15:07:03.491998       1 haproxy.go:392] can't scrape HAProxy: dial unix /var/lib/haproxy/run/haproxy.sock: connect: no such file or directory
I0312 15:07:03.515420       1 router.go:482] Router reloaded:
 - Proxy protocol on, checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
I0312 15:07:03.515451       1 router.go:255] Router is including routes in all namespaces
E0312 15:07:03.519758       1 reflector.go:205] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: Failed to list *v1.Route: Unauthorized
E0312 15:07:04.532099       1 reflector.go:322] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: Failed to watch *v1.Route: the server has asked for the client to provide credentials (get routes.route.openshift.io)
E0312 15:07:04.720898       1 status.go:171] Unable to write router status for openshift-monitoring/prometheus-k8s: Unauthorized
I0312 15:07:04.750700       1 router.go:482] Router reloaded:
 - Proxy protocol on, checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).
E0312 15:07:05.544733       1 reflector.go:322] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: Failed to watch *v1.Route: the server has asked for the client to provide credentials (get routes.route.openshift.io)
E0312 15:07:06.548378       1 reflector.go:205] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: Failed to list *v1.Route: Unauthorized
E0312 15:07:07.551659       1 reflector.go:205] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: Failed to list *v1.Route: Unauthorized
E0312 15:07:07.723293       1 status.go:171] Unable to write router status for openshift-monitoring/prometheus-k8s: Unauthorized
E0312 15:07:08.553918       1 reflector.go:205] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: Failed to list *v1.Route: Unauthorized
E0312 15:07:09.556289       1 reflector.go:205] github.com/openshift/router/pkg/router/controller/factory/factory.go:112: Failed to list *v1.Route: Unauthorized
I0312 15:07:09.756840       1 router.go:482] Router reloaded:
 - Proxy protocol on, checking http://localhost:80 ...
 - Health check ok : 0 retry attempt(s).

Comment 1 Daneyon Hansen 2019-03-12 19:37:14 UTC
A 'describe' of the pod in question shows that it is not scheduled because of anti-affinity rules:

$ oc describe po/router-test0-566cfb6db8-zjfsf -n openshift-ingress
<SNIP>
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  2h (x751 over 3h)    default-scheduler  0/6 nodes are available: 3 node(s) didn't match node selector, 3 node(s) didn't match pod affinity/anti-affinity, 3 node(s) didn't satisfy existing pods anti-affinity rules.

Is it required that router pods from different ingress controllers NOT be scheduled to the same nodes?

Comment 2 Dan Mace 2019-03-13 21:32:10 UTC
Yeah, the anti-affinity rule is incomplete. It needs an additional selector to ensure anti-affinity is scoped to a particular ingresscontroller.
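
For illustration, a scoped rule might look roughly like the pod template fragment below. This is a sketch, not the operator's actual manifest: the `app: router` label and the `example.openshift.io/ingresscontroller` label key are hypothetical, and only the `affinity` field names and the `kubernetes.io/hostname` topology key come from the Kubernetes pod spec.

```yaml
# Sketch only: the label keys and values are assumptions, not the operator's real labels.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchLabels:
          app: router                                    # hypothetical: matches any router pod
          example.openshift.io/ingresscontroller: test0  # hypothetical: limits the rule to this ingresscontroller
      topologyKey: kubernetes.io/hostname                # at most one matching pod per node
```

With only the first label in the selector, router pods from every ingresscontroller repel each other. Adding the second label would make the test0 pods avoid only other test0 pods.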

Comment 3 Dan Mace 2019-03-21 12:52:14 UTC
We changed the anti-affinity rule to be preferred rather than required, which should enable horizontal scaling but also allow for surge pods to be scheduled on nodes during a deployment.
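
As a rough sketch of the difference, a preferred anti-affinity term in a pod template could look like the fragment below. The `app: router` label and the weight value are assumptions for illustration and are not taken from the actual fix. The field names are the standard Kubernetes ones.

```yaml
# Sketch only: label and weight are illustrative assumptions.
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100                      # 1-100; higher means a stronger preference
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: router                # hypothetical router pod label
        topologyKey: kubernetes.io/hostname
```

A preferred term only influences node scoring, so the scheduler can still co-locate router pods when no other node is eligible. A required term leaves the pod Pending in that case.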

Comment 5 Hongan Li 2019-03-22 06:06:21 UTC
Will verify with the next nightly build that contains the fix.

Comment 6 Hongan Li 2019-03-25 01:43:24 UTC
Verified with 4.0.0-0.nightly-2019-03-23-222829; the issue has been fixed.

$ oc get ingresscontrollers.operator.openshift.io test0 -n openshift-ingress-operator -o yaml
---
status:
  availableReplicas: 2
---

$ oc get pod -n openshift-ingress -o wide
NAME                              READY   STATUS    RESTARTS   AGE   IP            NODE                             NOMINATED NODE
router-default-65dc774d97-6hw5z   1/1     Running   0          13m   10.129.2.12   ip-172-31-134-125.ec2.internal   <none>
router-default-65dc774d97-b8wh2   1/1     Running   0          13m   10.131.0.12   ip-172-31-151-75.ec2.internal    <none>
router-default-65dc774d97-mkvmm   1/1     Running   0          12m   10.128.2.10   ip-172-31-162-21.ec2.internal    <none>
router-test0-649fd8d759-rtgj8     1/1     Running   0          98s   10.131.0.13   ip-172-31-151-75.ec2.internal    <none>
router-test0-649fd8d759-zcpqs     1/1     Running   0          98s   10.128.2.11   ip-172-31-162-21.ec2.internal    <none>

Comment 8 errata-xmlrpc 2019-06-04 10:45:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758