Bug 1699324

Summary:	improper certificates, can crush a cluster
Product:	OpenShift Container Platform	Reporter:	Eric Rich <erich>
Component:	Networking	Assignee:	Dan Mace <dmace>
Networking sub component:	router	QA Contact:	Hongan Li <hongli>
Status:	CLOSED NOTABUG	Docs Contact:
Severity:	low
Priority:	low	CC:	aos-bugs, bbennett, mkhan
Version:	4.1.0	Keywords:	NeedsTestCase
Target Milestone:	---
Target Release:	4.2.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-08-07 19:22:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1664187

Description Eric Rich 2019-04-12 12:26:19 UTC

Description of problem: Providing a non-wild card certificate to a wildcard route, causes the cluster to crash. 


Version-Release number of selected component (if applicable): 4.1 Beta 3


How reproducible: 100% 


Steps to Reproduce:
1: Create some certificates (mimics a customer getting their own certificates) 

DOMAIN=$(oc get ingresscontroller.operator default -n openshift-ingress-operator -o jsonpath='{.status.domain}{"\n"}' | sed 's/apps/gears/'); openssl req -new -newkey rsa:2048 -days 5 -nodes -x509 -subj "/C=US/ST=NC/L=Raleigh/O=Red Hat Inc./OU=CEE/CN=$DOMAIN" -keyout server.key -out server.crt

2: Put the certificate into a secret 

oc create secret tls broken-cert --cert=server.crt --key=server.key -n openshift-ingress

3: Tell the cluster to use this certificate as the default certificate for all wildcard domains, note my certificate is not a wild card certificate. 

oc patch ingresscontroller.operator default --type=merge -p "{\"spec\":{\"defaultCertificate\":{\"name\": \"broken-cert\"}}}" -n openshift-ingress-operator


Actual results: This breaks a cluster


Expected results: The ingress operator validates that the cert you pointed to is a wildcard before using it, and log an error if not.  


Additional info:

Comment 1 Mo 2019-04-12 13:25:30 UTC

Moving to routing as they maintain ingress operator.

Comment 2 Dan Mace 2019-07-24 17:57:34 UTC

I'm not sure I agree with characterization of this bug. The claims I see are that providing an "invalid" certificate:

1. "causes the cluster to crash."
2. "This breaks a cluster"

Can you more clearly describe what you mean by "crash" and "break"? I've executed your steps and the ingresscontroller happily uses the provided certificate.

The ingresscontroller API doesn't describe any validation rules applied to the contents of the certificate secret[1]. Whether or not some validation _should_ be applied seems like a useful discussion, but can we agree that such a discussion should take place in the context of an RFE and not a bug report?

[1] https://github.com/openshift/api/blob/master/operator/v1/types_ingress.go#L89

Comment 3 Eric Rich 2019-07-24 19:00:48 UTC

(In reply to Dan Mace from comment #2)
> I'm not sure I agree with characterization of this bug. The claims I see are
> that providing an "invalid" certificate:
> 
> 1. "causes the cluster to crash."
> 2. "This breaks a cluster"
> 
> Can you more clearly describe what you mean by "crash" and "break"? I've
> executed your steps and the ingresscontroller happily uses the provided
> certificate.
> 
> The ingresscontroller API doesn't describe any validation rules applied to
> the contents of the certificate secret[1]. Whether or not some validation
> _should_ be applied seems like a useful discussion, but can we agree that
> such a discussion should take place in the context of an RFE and not a bug
> report?
> 
> [1]
> https://github.com/openshift/api/blob/master/operator/v1/types_ingress.go#L89

Router no longer runs. 
Without a router authentication no longer works.

Comment 4 Dan Mace 2019-07-24 19:49:58 UTC

(In reply to Eric Rich from comment #3)
> (In reply to Dan Mace from comment #2)
> > I'm not sure I agree with characterization of this bug. The claims I see are
> > that providing an "invalid" certificate:
> > 
> > 1. "causes the cluster to crash."
> > 2. "This breaks a cluster"
> > 
> > Can you more clearly describe what you mean by "crash" and "break"? I've
> > executed your steps and the ingresscontroller happily uses the provided
> > certificate.
> > 
> > The ingresscontroller API doesn't describe any validation rules applied to
> > the contents of the certificate secret[1]. Whether or not some validation
> > _should_ be applied seems like a useful discussion, but can we agree that
> > such a discussion should take place in the context of an RFE and not a bug
> > report?
> > 
> > [1]
> > https://github.com/openshift/api/blob/master/operator/v1/types_ingress.go#L89
> 
> Router no longer runs. 
> Without a router authentication no longer works.

I didn't see crash loops when running the test. I am seeing the invalid cert being served. If there's actually a router crash occurring, I would expect status of the ingresscontroller/operator to reflect that. Can you provide the output of must-gather following a crash if that's what you're seeing? Given a reference to a well-formed certificate data in a valid secret reference I would expect the certificate to be used and served as-is, and so far what I see matches those expectations.

I do agree that changing the certificate has potentially serious downstream implications, and I think our API docs should more clearly articulate the risks. We can certainly improve the API docs right away.

That said, my claim is still that we're conforming to the API we publish. I'll let the rest of the team weigh in here, but my current position is that we need to spec out the behavior in an RFE or story.

Comment 5 Dan Mace 2019-08-07 19:22:10 UTC

Can't reproduce a crash, and some sort of validation of the certificate would be an RFE. I'm going to close this one. If I've misunderstood and you have some steps to reproduce a crash, please let me know and re-open.

Comment 6 Eric Rich 2019-08-07 22:29:06 UTC

https://jira.coreos.com/browse/RFE-298 filed for this.