Bug 1309192
| Summary: | The latest cassandra image encounters a fatal exception during initialization | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Xia Zhao <xiazhao> |
| Component: | Hawkular | Assignee: | Matt Wringe <mwringe> |
| Status: | CLOSED ERRATA | QA Contact: | chunchen <chunchen> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.2.0 | CC: | aos-bugs, bleanhar, ccoleman, jliggitt, mwringe, sdodson, tdawson, wsun |
| Target Milestone: | --- | Keywords: | Regression, TestBlocker |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-12 16:29:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Comment 1
Jeff Cantrill
2016-02-17 13:13:10 UTC
What is the output of 'oc get service hawkular-cassandra-nodes'? This also looks to be the same issue opened here: https://bugzilla.redhat.com/show_bug.cgi?id=1307170

The logs from the first time Cassandra fails would be very useful here. The logs attached are the failures which occur when the Cassandra instance restarts with left-over files in the pod storage. The issue usually occurs when Cassandra tries to connect to an invalid endpoint and ends up in an error state. This can be caused by the 'hawkular-cassandra-nodes' service not being a headless service (it tries to connect to the service endpoint instead of the individual components behind the service), or by the 'hawkular-cassandra-nodes' hostname resolving to something other than the Cassandra instances.

OK, understood. Thanks for the confirmation. Met similar errors in the logging-deployer pod, like "Invalid value: 9300: must be equal to targetPort when clusterIP = None". Tested with the EFK images below:

    openshift3/logging-deployment     74de3e4b37f8
    openshift3/logging-auth-proxy     a28d3494ea25
    openshift3/logging-fluentd        581e80e4e569
    openshift3/logging-kibana         1d7701631584
    openshift3/logging-elasticsearch  338955b2e0fd

    [chunchen@F17-CCY daily]$ oc get svc
    NAME                 CLUSTER_IP      EXTERNAL_IP   PORT(S)    SELECTOR                                  AGE
    logging-es           172.31.13.232   <none>        9200/TCP   component=es,provider=openshift           11m
    logging-es-ops       172.31.187.78   <none>        9200/TCP   component=es-ops,provider=openshift       11m
    logging-kibana       172.31.59.142   <none>        443/TCP    component=kibana,provider=openshift       11m
    logging-kibana-ops   172.31.83.99    <none>        443/TCP    component=kibana-ops,provider=openshift   11m

Yes, this issue is also expected when running the 3.1 images on 3.2 and is due to a backwards compatibility issue. Images meant for 3.1 will not necessarily work with 3.2, since OpenShift is not fully backwards compatible between releases. We should hopefully have the images meant for 3.2 built soon.

Matt, can you explain the backwards incompatibility you are referring to?
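For context, a headless peer-discovery service of the kind described above would look roughly like the sketch below. This is illustrative only: the port number and selector labels are assumptions, not copied from the actual origin-metrics templates.

```yaml
# Sketch of a headless service for Cassandra node discovery.
# With clusterIP: None, a DNS lookup of hawkular-cassandra-nodes
# returns the pod IPs directly, so Cassandra peers connect to each
# other instead of to a load-balanced virtual service IP.
# Port and selector values here are illustrative assumptions.
apiVersion: v1
kind: Service
metadata:
  name: hawkular-cassandra-nodes
spec:
  clusterIP: None          # headless: no virtual IP, no kube-proxy
  selector:
    type: hawkular-cassandra
  ports:
  - name: cql-port
    port: 9042
    targetPort: 9042       # must equal port under the 3.2 validation
```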
It's a pretty serious problem at upgrade time if 3.1 logging images don't work with 3.2. We have to be able to upgrade in a rolling manner. I'm CC'ing Jordan on this bug.

Do you have the service yaml that is considered invalid? Nevermind, I see it.

Validation change in https://github.com/openshift/origin/blob/49f17578b3ab07b9975eed0d8ed0de2122ae1d63/Godeps/_workspace/src/k8s.io/kubernetes/pkg/api/validation/validation.go#L1739-L1743

Change between the (newly) invalid service definition and the current valid definition: https://github.com/openshift/origin-metrics/commit/9779da1d9e164d5481ba8cc1674f014ae2b32f82#diff-425bae80a2c9ecd3bb04943580836e3f

Upstream change that tightened the validation: https://github.com/kubernetes/kubernetes/pull/17862 and https://github.com/kubernetes/kubernetes/issues/17634

Possible fix in https://github.com/openshift/origin/pull/7495

Our options are:

1. Disable the validation, continuing to allow invalid targetPort values for headless services. This will allow bad values in the system, and if anything dealing with headless services starts using the targetPort (not sure why they would, but still...), it would be invalid.
2. Override targetPort values for headless services to match the port (ignoring what is specified). This means users won't be told they're setting an invalid targetPort, and will have no feedback as to why their specified value is getting thrown out.
3. Break compatibility for invalid values that were tolerated prior to 3.2.

Preferences?

Option 4: push for either 1 or 2 upstream. It's a compatibility issue for them too.

#1 upstream sounds correct to me, because this breaks backwards compatibility for any Kube deployment.

The breaking validation change for headless service targetPort fields is being reverted in https://github.com/openshift/origin/pull/7495. The case-sensitivity change will remain, since that is part of the parser now in use for performance reasons, and because documented API fields still work correctly.
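The validation change discussed above can be illustrated with a minimal before/after sketch. The field values (port 9300, the named targetPort) are hypothetical stand-ins chosen to match the error message quoted earlier; the real diff is in the origin-metrics commit linked above.

```yaml
# Rejected by the tightened 3.2 validation: a headless service
# (clusterIP: None) whose targetPort differs from the port itself.
# Produces: "Invalid value: 9300: must be equal to targetPort
# when clusterIP = None"
spec:
  clusterIP: None
  ports:
  - port: 9300
    targetPort: transport   # hypothetical named port
---
# Accepted: targetPort matches port (or is omitted entirely,
# in which case it defaults to the port value).
spec:
  clusterIP: None
  ports:
  - port: 9300
    targetPort: 9300
```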
Tested on OSE 3.2 with the latest metrics images; the metrics pods run well, and this bug is fixed:

    openshift3/metrics-hawkular-metrics  latest  0939fae5e762
    openshift3/metrics-deployer          latest  5b12fd896d9d
    openshift3/metrics-heapster          latest  91e9f7156877
    openshift3/metrics-cassandra         latest  6798b0f4381a

Please change bug status to ON_QA, I will then close it. Thanks!

Set to verified based on my comment #22.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064