Bug 1740258

Summary: Adding a second IngressController produces TLS handshake errors from Prometheus.
Product: OpenShift Container Platform
Reporter: Ryan Howe <rhowe>
Component: Networking
Networking sub component: router
Assignee: Dan Mace <dmace>
QA Contact: Hongan Li <hongli>
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
CC: aos-bugs, jdesousa, nagrawal, rcarrata, talessio
Version: 4.1.z
Keywords: NeedsTestCase
Target Release: 4.1.z
Hardware: Unspecified
OS: Unspecified
Last Closed: 2019-09-25 07:27:53 UTC
Type: Bug
Bug Depends On: 1724498

Description Ryan Howe 2019-08-12 14:27:26 UTC
Description of problem:

When a second IngressController object is added to the cluster, the logs for all routers show the following error, originating from the Prometheus pod IPs:

```
http: TLS handshake error from 10.131.0.218:59612: remote error: tls: bad certificate
```

Version-Release number of selected component (if applicable):
4.1.9 

How reproducible:
100%

Steps to Reproduce:
1. Add a second IngressController to the cluster:

# oc create -n openshift-ingress-operator -f - <<EOF
apiVersion: operator.openshift.io/v1
kind: IngressController
metadata:
  finalizers:
  - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
  generation: 1
  name: test
  selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/test
spec:
  domain: apps2.example.ocp.com
  replicas: 1
EOF

2. Review the router logs:

# oc logs -n openshift-ingress router-test-xxx
# oc logs -n openshift-ingress router-default-xxxx


Actual results:

LOGS:

I0812 14:19:41.952951       1 logs.go:49] http: TLS handshake error from 10.128.2.204:54680: remote error: tls: bad certificate
I0812 14:19:41.953043       1 logs.go:49] http: TLS handshake error from 10.131.0.225:36654: remote error: tls: bad certificate


Prometheus is unable to scrape router metrics.
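To confirm that the handshake failures come from the Prometheus pods, the peer IPs in the log lines above can be extracted and compared against the Prometheus pod IPs (e.g. from `oc -n openshift-monitoring get pods -o wide`). A minimal sketch; the `failing_peer_ips` helper is hypothetical, not part of any OpenShift tooling:

```python
import re

# Pull the peer IPs out of router log lines that report TLS handshake
# failures, so they can be cross-checked against Prometheus pod IPs.
HANDSHAKE_ERR = re.compile(
    r"TLS handshake error from (\d{1,3}(?:\.\d{1,3}){3}):\d+"
)

def failing_peer_ips(log_lines):
    """Return the unique source IPs behind 'TLS handshake error' lines."""
    return sorted({m.group(1) for line in log_lines
                   if (m := HANDSHAKE_ERR.search(line))})

# Sample lines taken from the report above:
logs = [
    "I0812 14:19:41.952951       1 logs.go:49] http: TLS handshake error "
    "from 10.128.2.204:54680: remote error: tls: bad certificate",
    "I0812 14:19:41.953043       1 logs.go:49] http: TLS handshake error "
    "from 10.131.0.225:36654: remote error: tls: bad certificate",
]
print(failing_peer_ips(logs))  # ['10.128.2.204', '10.131.0.225']
```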


Expected results:

Able to add a second IngressController without breaking Prometheus metrics.

Comment 1 Hongan Li 2019-08-13 09:14:19 UTC
This issue was fixed in 4.2 by https://github.com/openshift/cluster-ingress-operator/pull/242, and it has the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1724498.

Comment 5 Dan Mace 2019-08-15 15:35:42 UTC
(In reply to Hongan Li from comment #3)
> workaround is updating the selector in servicemonitor resource for each
> ingresscontroller, for example:
> 
> ### update servicemonitor for default ingresscontroller
> $ oc get servicemonitor router-default -o yaml -n openshift-ingress
> <---snip--->
> spec:
> <---snip--->
>   selector: {}
> 
> $ oc edit servicemonitor router-default -n openshift-ingress
>   selector:
>     matchLabels:
>       ingresscontroller.operator.openshift.io/owning-ingresscontroller: default
> 
> 
> ### update servicemonitor for test ingresscontroller
> $ oc edit servicemonitor router-test -n openshift-ingress
>   selector:
>     matchLabels:
>       ingresscontroller.operator.openshift.io/owning-ingresscontroller: test

Just to be clear, while this is a possible solution in the context of a formal support exception, we don't have an exception yet, and manually editing this resource IS NOT SUPPORTED. Doing so could make the cluster unsupported or unable to be upgraded.

Please DO NOT execute this patch in a production cluster for which support is expected.
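The quoted (unsupported) workaround works because the ServiceMonitor shipped with `selector: {}`, and in Kubernetes label-selector semantics an empty selector matches every service in the namespace; each router's ServiceMonitor therefore also picked up the other router's service, and Prometheus scraped pods whose serving certificate was issued for a different service name. A minimal sketch of matchLabels matching (simplified model with assumed service names, not the operator's actual code):

```python
# Simplified model of ServiceMonitor service selection: a matchLabels
# selector matches a service iff every key/value pair is present in the
# service's labels. Crucially, an empty selector matches EVERYTHING.
def selects(match_labels, service_labels):
    return all(service_labels.get(k) == v for k, v in match_labels.items())

OWNER = "ingresscontroller.operator.openshift.io/owning-ingresscontroller"

# Hypothetical services for the two ingresscontrollers in this report:
services = {
    "router-internal-default": {OWNER: "default"},
    "router-internal-test": {OWNER: "test"},
}

def targets(match_labels):
    """Service names a ServiceMonitor with this selector would scrape."""
    return sorted(name for name, labels in services.items()
                  if selects(match_labels, labels))

# `selector: {}` scrapes both router services, including the one whose
# certificate does not match:
print(targets({}))                  # ['router-internal-default', 'router-internal-test']
# The workaround pins each ServiceMonitor to its own ingresscontroller:
print(targets({OWNER: "default"}))  # ['router-internal-default']
print(targets({OWNER: "test"}))     # ['router-internal-test']
```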

Comment 7 Hongan Li 2019-09-20 05:26:04 UTC
Verified with 4.1.17; the issue has been fixed.

$ oc -n openshift-ingress-operator get ingresscontroller
NAME      AGE
default   74m
test      3m50s

$ oc -n openshift-ingress logs router-test-6b4ddc8b47-bnxcx | grep -i error

Comment 9 errata-xmlrpc 2019-09-25 07:27:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2820