Bug 1309192

Summary:	The latest cassandra image encounters a fatal exception during initialization
Product:	OpenShift Container Platform	Reporter:	Xia Zhao <xiazhao>
Component:	Hawkular	Assignee:	Matt Wringe <mwringe>
Status:	CLOSED ERRATA	QA Contact:	chunchen <chunchen>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.2.0	CC:	aos-bugs, bleanhar, ccoleman, jliggitt, mwringe, sdodson, tdawson, wsun
Target Milestone:	---	Keywords:	Regression, TestBlocker
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2016-05-12 16:29:08 UTC	Type:	Bug

Comment 1 Jeff Cantrill 2016-02-17 13:13:10 UTC

@Brenton could this be a problem with the build of the image?

Comment 2 Matt Wringe 2016-02-17 14:24:13 UTC

What is the output of 'oc get service hawkular-cassandra-nodes'

Comment 3 Matt Wringe 2016-02-17 15:29:17 UTC

This also looks to be the same issue opened here: https://bugzilla.redhat.com/show_bug.cgi?id=1307170

The logs from the first time Cassandra fails would be very useful here. The attached logs show the failures that occur when the Cassandra instance restarts with leftover files in the pod storage.

The issue usually occurs when Cassandra tries to connect to an invalid endpoint and ends up in an error state.

This can be caused by the 'hawkular-cassandra-nodes' service not being a headless service, in which case Cassandra tries to connect to the service endpoint instead of the individual components behind the service.

It can also happen if the 'hawkular-cassandra-nodes' hostname resolves to something other than the Cassandra instances.
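
For anyone checking the first cause, here is a minimal sketch of the headless test, assuming the service definition has been exported as JSON (for example with 'oc get service hawkular-cassandra-nodes -o json'). A headless service is one whose spec.clusterIP is the literal string "None". The helper name is_headless, the sample manifest, and the port number in it are illustrative assumptions, not taken from the actual deployment.

```python
import json


def is_headless(service: dict) -> bool:
    """Return True if the service manifest declares clusterIP: None."""
    return service.get("spec", {}).get("clusterIP") == "None"


# Hypothetical manifest, trimmed to the relevant fields.
# The port number is a placeholder, not the real service definition.
sample = json.loads("""
{
  "kind": "Service",
  "metadata": {"name": "hawkular-cassandra-nodes"},
  "spec": {
    "clusterIP": "None",
    "ports": [{"port": 9042, "targetPort": 9042}]
  }
}
""")

print(is_headless(sample))
```

If this reports False for the real service, Cassandra would be handed the service endpoint rather than the individual pods, which matches the failure mode described above.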

Comment 8 Xia Zhao 2016-02-19 02:46:09 UTC

OK, understood. Thanks for the confirmation.

Comment 9 chunchen 2016-02-19 09:11:41 UTC

I hit similar errors in the logging-deployer pod, such as "Invalid value: 9300: must be equal to targetPort when clusterIP = None".

Tested with the EFK images below:

openshift3/logging-deployment   74de3e4b37f8
openshift3/logging-auth-proxy   a28d3494ea25
openshift3/logging-fluentd      581e80e4e569
openshift3/logging-kibana       1d7701631584
openshift3/logging-elasticsearch 338955b2e0fd

[chunchen@F17-CCY daily]$ oc get svc
NAME                 CLUSTER_IP      EXTERNAL_IP   PORT(S)    SELECTOR                                  AGE
logging-es           172.31.13.232   <none>        9200/TCP   component=es,provider=openshift           11m
logging-es-ops       172.31.187.78   <none>        9200/TCP   component=es-ops,provider=openshift       11m
logging-kibana       172.31.59.142   <none>        443/TCP    component=kibana,provider=openshift       11m
logging-kibana-ops   172.31.83.99    <none>        443/TCP    component=kibana-ops,provider=openshift   11m

Comment 10 Matt Wringe 2016-02-19 14:04:22 UTC

Yes, this issue is also expected when running the 3.1 images on 3.2 and is due to a backwards compatibility issue.

Images meant for 3.1 will not necessarily work with 3.2 since OpenShift is not fully backwards compatible between releases.

We should hopefully have the images meant for 3.2 built soon.

Comment 12 Brenton Leanhardt 2016-02-20 17:28:18 UTC

Matt,

Can you explain the backwards incompatibility you are referring to?  It's a pretty serious problem at upgrade time if 3.1 logging images don't work with 3.2.  We have to be able to upgrade in a rolling manner.  I'm CC'ing Jordan on this bug.

Comment 13 Jordan Liggitt 2016-02-20 18:10:57 UTC

Do you have the service yaml that is considered invalid?

Comment 14 Jordan Liggitt 2016-02-20 18:54:56 UTC

Never mind, I see it.

Validation change in https://github.com/openshift/origin/blob/49f17578b3ab07b9975eed0d8ed0de2122ae1d63/Godeps/_workspace/src/k8s.io/kubernetes/pkg/api/validation/validation.go#L1739-L1743

Change between (newly) invalid service definition and current valid definition:
https://github.com/openshift/origin-metrics/commit/9779da1d9e164d5481ba8cc1674f014ae2b32f82#diff-425bae80a2c9ecd3bb04943580836e3f

Upstream change that tightened the validation:
https://github.com/kubernetes/kubernetes/pull/17862
https://github.com/kubernetes/kubernetes/issues/17634
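
To make the rule concrete, here is a rough Python paraphrase of the tightened check, reconstructed only from the error text quoted in comment 9. It is not the upstream Go code. The function name validate_headless_ports is hypothetical, and reporting the port value in the message is an assumption; this bug does not say which of the two values the real validator prints.

```python
def validate_headless_ports(service: dict) -> list:
    """Return error strings for headless services whose targetPort differs from port."""
    errors = []
    spec = service.get("spec", {})
    # The rule only applies to headless services (clusterIP: None).
    if spec.get("clusterIP") != "None":
        return errors
    for index, entry in enumerate(spec.get("ports", [])):
        port = entry.get("port")
        # An omitted targetPort defaults to the value of port.
        target = entry.get("targetPort", port)
        if target != port:
            # Assumption: the reported value is the port, not the targetPort.
            errors.append(
                f"spec.ports[{index}]: Invalid value: {port}: "
                "must be equal to targetPort when clusterIP = None"
            )
    return errors


# Hypothetical headless service with mismatched ports.
bad = {"spec": {"clusterIP": "None", "ports": [{"port": 9300, "targetPort": 9301}]}}
for message in validate_headless_ports(bad):
    print(message)
```

A definition like the one above was accepted before 3.2 and is rejected once the check is in place, which is why older deployer images fail on a newer master.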

Comment 15 Jordan Liggitt 2016-02-20 19:18:26 UTC

Possible fix in https://github.com/openshift/origin/pull/7495

Our options are:

1. Disable the validation, continuing to allow invalid targetPort values for headless services. This will allow bad values in the system, and if anything dealing with headless services starts using the targetPort (not sure why they would, but still...) it would be invalid.

2. Override targetPort values for headless services to match the port (ignoring what is specified). This means users won't be told they're setting an invalid targetPort, and will have no feedback as to why their specified value is getting thrown out.

3. Break compatibility for invalid values that were tolerated prior to 3.2
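
As an illustration of option 2 only, here is a minimal Python sketch of silently overriding targetPort for headless services. It is not the change in the pull request above, and the function name override_headless_target_ports is hypothetical. Option 1 would simply be the absence of any check.

```python
import copy


def override_headless_target_ports(service: dict) -> dict:
    """Return a copy of the service with targetPort forced to port when headless."""
    result = copy.deepcopy(service)
    spec = result.get("spec", {})
    if spec.get("clusterIP") == "None":
        for entry in spec.get("ports", []):
            # Whatever the user specified is discarded without any feedback.
            entry["targetPort"] = entry["port"]
    return result


# Hypothetical headless service with a mismatched targetPort.
submitted = {"spec": {"clusterIP": "None", "ports": [{"port": 9300, "targetPort": 9301}]}}
stored = override_headless_target_ports(submitted)
print(stored["spec"]["ports"][0]["targetPort"])
```

The sketch shows the drawback noted in option 2: the stored object differs from what was submitted, and nothing tells the user why.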


Preferences?

Comment 16 Jordan Liggitt 2016-02-20 20:45:35 UTC

Option 4: push for either 1 or 2 upstream. It's a compatibility issue for them too

Comment 17 Clayton Coleman 2016-02-22 00:51:34 UTC

#1 upstream sounds correct to me, because this breaks backwards compatibility for any Kube deployment.

Comment 20 Jordan Liggitt 2016-02-23 04:40:55 UTC

The breaking validation change for headless service targetPort fields is being reverted in https://github.com/openshift/origin/pull/7495

The case-sensitivity change will remain, since that is part of the parser now in use for performance reasons, and because documented API fields still work correctly.
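
This comment does not say which fields were affected by the case-sensitivity change, so the following is only a generic illustration of the difference between the two matching behaviours. Both helper names and the miscased key are hypothetical.

```python
def lookup_case_sensitive(obj: dict, field: str):
    """Match the key exactly as documented."""
    return obj.get(field)


def lookup_case_insensitive(obj: dict, field: str):
    """Match the key regardless of letter case."""
    wanted = field.lower()
    for key, value in obj.items():
        if key.lower() == wanted:
            return value
    return None


# Hypothetical port entry written with a miscased key.
entry = {"port": 9300, "targetport": 9300}

print(lookup_case_insensitive(entry, "targetPort"))
print(lookup_case_sensitive(entry, "targetPort"))
```

With exact matching, the miscased key is not recognised, while a key spelled as documented (targetPort) is found by either helper.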

Comment 22 Xia Zhao 2016-02-24 07:04:16 UTC

Tested on OSE 3.2 with the latest metrics images. The metrics pods run well and this bug is fixed:
openshift3/metrics-hawkular-metrics               latest              0939fae5e762
openshift3/metrics-deployer                       latest              5b12fd896d9d
openshift3/metrics-heapster                       latest              91e9f7156877
openshift3/metrics-cassandra                      latest              6798b0f4381a

Please change the bug status to ON_QA and I will then close it. Thanks!

Comment 23 Xia Zhao 2016-02-25 03:03:26 UTC

Set to verified based on my comment #22

Comment 25 errata-xmlrpc 2016-05-12 16:29:08 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064