Bug 1307170 - hawkular-cassandra deployment issues
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Metrics
Version: 3.1.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: medium
Assigned To: Matt Wringe
QA Contact: chunchen
Reported: 2016-02-12 16:55 EST by Kent Hua
Modified: 2016-09-29 22:16 EDT (History)
Doc Type: Bug Fix
Last Closed: 2016-05-12 12:28:57 EDT
Type: Bug
Attachments:
metrics-deployer pod log (15.03 KB, text/plain), 2016-02-12 16:55 EST, Kent Hua
hawkular-cassandra-1-eredm - initial error (12.85 KB, text/plain), 2016-02-12 16:56 EST, Kent Hua
hawkular-cassandra-1-eredm - subsequent errors (10.74 KB, text/plain), 2016-02-12 16:56 EST, Kent Hua
Cassandra logs, OSE 3.1.1.6 (11.69 KB, text/plain), 2016-02-17 16:15 EST, Matt Wringe
origin container images (107.00 KB, text/plain), 2016-03-01 19:04 EST, Kent Hua

Description Kent Hua 2016-02-12 16:55:35 EST
Created attachment 1123647 [details]
metrics-deployer pod log

Description of problem:
The hawkular-cassandra pod does not deploy when metrics-deployer is executed. It initially fails with this error:
java.lang.RuntimeException: Unable to gossip with any seeds

Subsequent pod retries result in:
org.apache.cassandra.exceptions.ConfigurationException: Found system keyspace files, but they couldn't be loaded!

Issue occurs with USE_PERSISTENT_STORAGE true and false.

Scenario below is USE_PERSISTENT_STORAGE=false


Version-Release number of selected component (if applicable):
OSE v3.1.1.6
metrics-deployer and related pods, v3.1.1

How reproducible:
Every time

Steps to Reproduce:
Fresh install of OSE 3.1.1.6 via advanced ansible installer
  
Setup steps:
HAWKULAR_METRICS_HOSTNAME=ose-master.kenthua.com

oc project openshift-infra

oc create -f - <<API
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-deployer
secrets:
- name: metrics-deployer
API

oadm policy add-role-to-user edit system:serviceaccount:openshift-infra:metrics-deployer
oadm policy add-cluster-role-to-user cluster-reader system:serviceaccount:openshift-infra:heapster
oc secrets new metrics-deployer nothing=/dev/null

oc process -f /usr/share/ansible/openshift-ansible/roles/openshift_examples/files/examples/infrastructure-templates/enterprise/metrics-deployer.yaml -v \
IMAGE_PREFIX=registry.access.redhat.com/openshift3/,\
IMAGE_VERSION=3.1.1,\
HAWKULAR_METRICS_HOSTNAME=$HAWKULAR_METRICS_HOSTNAME,\
USE_PERSISTENT_STORAGE=false \
| oc create -f -


# DNS inside the metrics-deployer pod looks good.  
# timing was tricky for this one because it goes by fast once running

[root@ose-master ~]# oc exec -it metrics-deployer-do0lt /bin/bash
id: cannot find name for user ID 1000
<ployer-do0lt deploy]$ curl https://kubernetes.default.svc:443 -k
{
  "paths": [
    "/api",
    "/api/v1",
    "/controllers",
    "/healthz",
    "/healthz/ping",
    "/healthz/ready",
    "/logs/",
    "/metrics",
    "/oapi",
    "/oapi/v1",
    "/swaggerapi/"
  ]
}
[I have no name!@metrics-deployer-do0lt deploy]$

# Initial hawkular-cassandra-1-eredm failure
# timing was tricky for this because it seems to happen only when the pod first tries to run

INFO  18:09:03 Starting Encrypted Messaging Service on SSL port 7001
java.lang.RuntimeException: Unable to gossip with any seeds
	at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1328)
	at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:543)
	at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:754)
	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:688)
	at org.apache.cassandra.service.StorageService.initServer(StorageService.java:580)
	at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:292)
	at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:488)
	at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:595)
Exception (java.lang.RuntimeException) encountered during startup: Unable to gossip with any seeds
ERROR 18:09:34 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
	at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1328) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]

# Subsequent hawkular-cassandra-1-eredm failures

INFO  18:09:56 Initializing system.range_xfers
ERROR 18:09:57 Fatal exception during initialization
org.apache.cassandra.exceptions.ConfigurationException: Found system keyspace files, but they couldn't be loaded!
	at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:744) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]

# Heapster fails because it's waiting for hawkular-metrics to start

F0212 13:10:59.329583       1 heapster.go:67] Get https://hawkular-metrics:443/hawkular/metrics/metrics?type=gauge: dial tcp 172.30.215.241:443: no route to hos

# Other info

[root@ose-master ~]# oc get svc
NAME                       CLUSTER_IP       EXTERNAL_IP   PORT(S)                               SELECTOR                  AGE
hawkular-cassandra         172.30.37.33     <none>        9042/TCP,9160/TCP,7000/TCP,7001/TCP   type=hawkular-cassandra   2m
hawkular-cassandra-nodes   None             <none>        9042/TCP,9160/TCP,7000/TCP,7001/TCP   type=hawkular-cassandra   2m
hawkular-metrics           172.30.215.241   <none>        443/TCP                               name=hawkular-metrics     2m
heapster                   172.30.65.108    <none>        80/TCP                                name=heapster             2m





Actual results:
[root@ose-master ~]# oc get pods
NAME                         READY     STATUS             RESTARTS   AGE
hawkular-cassandra-1-eredm   0/1       CrashLoopBackOff   47         3h
hawkular-metrics-5b067       0/1       Pending            0          3h
heapster-4gwen               0/1       CrashLoopBackOff   46         3h
metrics-deployer-do0lt       0/1       Completed          0          3h

Expected results:
Cluster metrics running successfully, with all 3 pods (cassandra, metrics, and heapster) in the Running state.

Additional info:
Comment 1 Kent Hua 2016-02-12 16:56 EST
Created attachment 1123648 [details]
hawkular-cassandra-1-eredm - initial error
Comment 2 Kent Hua 2016-02-12 16:56 EST
Created attachment 1123649 [details]
hawkular-cassandra-1-eredm - subsequent errors
Comment 3 Matt Wringe 2016-02-12 17:08:56 EST
Can reproduce locally, looking into this
Comment 4 Matt Wringe 2016-02-12 17:13:23 EST
Correction: I cannot reproduce this locally with the OSE images; they work for me. I was able to reproduce the same error message with an older origin-metrics image, but that is a different issue.
Comment 5 Matt Wringe 2016-02-12 17:49:11 EST
The issue appears to be that Cassandra is resolving 'hawkular-cassandra-nodes' when this shouldn't be resolvable yet. The 'hawkular-cassandra-nodes' hostname should point to a service which is created as part of the metrics install, so it should not be resolvable when metrics is first being deployed and Cassandra is starting up.

Is there anything else running in your project, such as existing Cassandra instances, when you deploy this? Have you made any other modifications to the metrics deployment?

What happens when you add the 'REDEPLOY=true' when you process the template? This should redeploy metrics (but be warned it is a full redeploy, so any existing components will be deleted)

oc process -f /usr/share/ansible/openshift-ansible/roles/openshift_examples/files/examples/infrastructure-templates/enterprise/metrics-deployer.yaml -v \
IMAGE_PREFIX=registry.access.redhat.com/openshift3/,\
IMAGE_VERSION=3.1.1,\
HAWKULAR_METRICS_HOSTNAME=$HAWKULAR_METRICS_HOSTNAME,\
USE_PERSISTENT_STORAGE=false,REDEPLOY=true \
| oc create -f -
Comment 6 Kent Hua 2016-02-12 17:58:11 EST
(In reply to Matt Wringe from comment #5)
> 
> Is there anything else running in your project, such as existing Cassandra
> instances, when you deploy this? Have you made any other modifications to
> the metrics deployment?

No existing cassandra instances.  No modifications to the metrics-deployer.yaml.  Using only the parameters for changes.

> 
> What happens when you add the 'REDEPLOY=true' when you process the template?
> This should redeploy metrics (but be warned it is a full redeploy, so any
> existing components will be deleted)

Same issue: the gossip error on the first try, then the "Found system keyspace files" error on subsequent retries.


[root@ose-master ~]# oc describe pod hawkular-cassandra-1-8pszg
Name:				hawkular-cassandra-1-8pszg
Namespace:			openshift-infra
Image(s):			registry.access.redhat.com/openshift3/metrics-cassandra:3.1.1
Node:				ose-node2.kenthua.com/172.16.118.131
Start Time:			Fri, 12 Feb 2016 14:53:53 -0800
Labels:				metrics-infra=hawkular-cassandra,name=hawkular-cassandra-1,type=hawkular-cassandra
Status:				Running
Reason:
Message:
IP:				10.1.1.6
Replication Controllers:	hawkular-cassandra-1 (1/1 replicas created)
Containers:
  hawkular-cassandra-1:
    Container ID:	docker://753789df0a55d80fb664daed8ea9578cc7db7a54ee40ec9b93eaec726b8be55d
    Image:		registry.access.redhat.com/openshift3/metrics-cassandra:3.1.1
    Image ID:		docker://227b99b94d054dd99b148ae5024f646e40775463b6a961f90f5f85b309b17a6e
    QoS Tier:
      cpu:			BestEffort
      memory:			BestEffort
    State:			Waiting
      Reason:			CrashLoopBackOff
    Last Termination State:	Terminated
      Reason:			Error
      Exit Code:		100
      Started:			Fri, 12 Feb 2016 14:56:16 -0800
      Finished:			Fri, 12 Feb 2016 14:56:20 -0800
    Ready:			False
    Restart Count:		5
    Environment Variables:
      CASSANDRA_MASTER:	true
      POD_NAMESPACE:	openshift-infra (v1:metadata.namespace)
Conditions:
  Type		Status
  Ready 	False
Volumes:
  cassandra-data:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  hawkular-cassandra-secrets:
    Type:	Secret (a secret that should populate this volume)
    SecretName:	hawkular-cassandra-secrets
  cassandra-token-5rmta:
    Type:	Secret (a secret that should populate this volume)
    SecretName:	cassandra-token-5rmta
Events:
  FirstSeen	LastSeen	Count	From				SubobjectPath				Reason		Message
  ─────────	────────	─────	────				─────────────				──────		───────
  3m		3m		1	{kubelet ose-node2.kenthua.com}	implicitly required container POD	Created		Created with docker id 3eec5325b357
  3m		3m		1	{kubelet ose-node2.kenthua.com}	implicitly required container POD	Started		Started with docker id 3eec5325b357
  3m		3m		1	{kubelet ose-node2.kenthua.com}	implicitly required container POD	Pulled		Container image "openshift3/ose-pod:v3.1.1.6" already present on machine
  3m		3m		1	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Created		Created with docker id b79873255f83
  3m		3m		1	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Started		Started with docker id b79873255f83
  3m		3m		1	{scheduler }								Scheduled	Successfully assigned hawkular-cassandra-1-8pszg to ose-node2.kenthua.com
  2m		2m		1	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Started		Started with docker id 1f3a12eb1f51
  2m		2m		1	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Created		Created with docker id 1f3a12eb1f51
  2m		2m		1	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Created		Created with docker id 5e8b1f4feaf6
  2m		2m		1	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Started		Started with docker id 5e8b1f4feaf6
  2m		2m		1	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Created		Created with docker id 5cd4a53bceae
  2m		2m		1	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Started		Started with docker id 5cd4a53bceae
  3m		1m		5	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Pulled		Container image "registry.access.redhat.com/openshift3/metrics-cassandra:3.1.1" already present on machine
  1m		1m		1	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Started		Started with docker id 753789df0a55
  1m		1m		1	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Created		Created with docker id 753789df0a55
  2m		8s		15	{kubelet ose-node2.kenthua.com}	spec.containers{hawkular-cassandra-1}	Backoff		Back-off restarting failed docker container
Comment 7 Matt Wringe 2016-02-16 09:12:20 EST
The errors you are seeing appear to be the result of something else on your system resolving the 'hawkular-cassandra-nodes' hostname. Are you sure there isn't something which is already resolving this hostname?
Comment 8 Kent Hua 2016-02-16 15:18:34 EST
Nothing appears to be resolving 'hawkular-cassandra-nodes'

I tried to do it with my current setup.

[root@ose-master ~]# oc exec -it docker-registry-1-u8zje /bin/bash
bash-4.2$ curl hawkular-cassandra-nodes
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><meta http-equiv="refresh" content="0;url=http://finder.cox.net/main?ParticipantID=96e687opkbv4scrood8k84drs6gw5duf&FailedURI=http%3A%2F%2Fhawkular-cassandra-nodes%2F&FailureMode=1&Implementation=&AddInType=4&Version=pywr1.0&ClientLocation=us"/><script type="text/javascript">url="http://finder.cox.net/main?ParticipantID=96e687opkbv4scrood8k84drs6gw5duf&FailedURI=http%3A%2F%2Fhawkular-cassandra-nodes%2F&FailureMode=1&Implementation=&AddInType=4&Version=pywr1.0&ClientLocation=us";if(top.location!=location){var w=window,d=document,e=d.documentElement,b=d.body,x=w.innerWidth||e.clientWidth||b.clientWidth,y=w.innerHeight||e.clientHeight||b.clientHeight;url+="&w="+x+"&h="+y;}window.location.replace(url);</script></head><body></body></html>

bash-4.2$
bash-4.2$ curl redhatabc.com
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><meta http-equiv="refresh" content="0;url=http://finder.cox.net/main?ParticipantID=96e687opkbv4scrood8k84drs6gw5duf&FailedURI=http%3A%2F%2Fredhatabc.com%2F&FailureMode=1&Implementation=&AddInType=4&Version=pywr1.0&ClientLocation=us"/><script type="text/javascript">url="http://finder.cox.net/main?ParticipantID=96e687opkbv4scrood8k84drs6gw5duf&FailedURI=http%3A%2F%2Fredhatabc.com%2F&FailureMode=1&Implementation=&AddInType=4&Version=pywr1.0&ClientLocation=us";if(top.location!=location){var w=window,d=document,e=d.documentElement,b=d.body,x=w.innerWidth||e.clientWidth||b.clientWidth,y=w.innerHeight||e.clientHeight||b.clientHeight;url+="&w="+x+"&h="+y;}window.location.replace(url);</script></head><body></body></html>bash-4.2$




I also set up a DNS server for my OSE instances to point to. The results below are from before metrics-deployer is run, both on the OSE host and inside a container. The same results occur afterwards, as expected.

[root@ose-master ~]# curl hawkular-cassandra
curl: (6) Could not resolve host: hawkular-cassandra; Name or service not known
[root@ose-master ~]# curl hawkular-cassandra-nodes
curl: (6) Could not resolve host: hawkular-cassandra-nodes; Name or service not known

[root@ose-master ~]# oc exec -it docker-registry-1-u8zje /bin/bash
bash-4.2$ curl hawkular-cassandra
curl: (6) Could not resolve host: hawkular-cassandra; Name or service not known
bash-4.2$ curl hawkular-cassandra-nodes
curl: (6) Could not resolve host: hawkular-cassandra-nodes; Name or service not known
bash-4.2$
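
In the first pair of `curl` calls in this comment, both `hawkular-cassandra-nodes` and the made-up name `redhatabc.com` returned a redirect page from finder.cox.net rather than a resolution error, which may indicate that an upstream resolver answers for names that do not exist. A minimal sketch for probing that behaviour from any host or container follows. The function names and the probe label are hypothetical and only for illustration; `socket.getaddrinfo` is the standard library call for going through the system resolver.

```python
import socket
import uuid


def resolve(name):
    """Return the sorted addresses the system resolver gives for name, or [] if none."""
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        # The resolver reported that the name is unknown.
        return []
    return sorted({info[4][0] for info in infos})


def answers_for_unknown_names(resolver=resolve):
    """Look up a random label that should never exist.

    Any answer suggests that something upstream rewrites negative
    responses into addresses. The probe label is an arbitrary,
    hypothetical name with no dots, like the short service name.
    """
    probe = "nxdomain-probe-" + uuid.uuid4().hex
    return bool(resolver(probe))


# Example use on the affected host:
#   print(resolve("hawkular-cassandra-nodes"))
#   print(answers_for_unknown_names())
```

If `answers_for_unknown_names()` returns True on a node, a short name such as `hawkular-cassandra-nodes` can be expected to resolve to some address even before the service exists.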
Comment 9 Matt Wringe 2016-02-16 15:51:26 EST
A Cassandra instance always checks if there are other Cassandra instances running to see if it can connect to them. It uses the 'hawkular-cassandra-nodes' hostname to determine this.

If the 'hawkular-cassandra-nodes' hostname could not be resolved, then there should have been an entry in the logs you posted here about it: https://bugzilla.redhat.com/attachment.cgi?id=1123648

But there is no entry in those logs corresponding to the hostname not being resolvable, which leads me to believe the system is resolving the hostname to something else.

The root cause here appears to be that the system is resolving the 'hawkular-cassandra-nodes' hostname to something which is not a running Cassandra instance. This causes the Cassandra instance to write its configuration files in such a manner that requires manual intervention to get working again on subsequent re-starts (which is the error you are seeing).

Something else which can cause this exact problem would be if the cassandra-nodes-service was not a headless service. But from your initial comment it appears that this is not the case.

We can double check this by running 'oc get -o json service hawkular-cassandra-nodes' and making sure the portalIP value is 'None'.

Can you please confirm if your docker registry is running in the same project as the metric components? If not, then you will need to check the full hostname which is 'hawkular-cassandra-nodes.openshift-infra.svc.cluster.local' assuming the metrics components were deployed to the 'openshift-infra' project.
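
The headless check described above can also be scripted. This is a minimal sketch that takes the JSON text printed by the `oc get -o json service ...` command; it assumes the address field sits under `spec`, as in Kubernetes v1 Service objects, and accepts either the older `portalIP` name or the newer `clusterIP` name. The function name `is_headless` is hypothetical.

```python
import json


def is_headless(service_json):
    """Return True when every cluster address field present is the literal "None"."""
    spec = json.loads(service_json).get("spec", {})
    # Older output uses 'portalIP'; newer output uses 'clusterIP'. Check whichever exist.
    values = [spec[key] for key in ("clusterIP", "portalIP") if key in spec]
    return bool(values) and all(value == "None" for value in values)
```

A service that has a real address in either field, or no address field at all, is reported as not headless.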
Comment 10 Kent Hua 2016-02-16 18:12:00 EST
You are correct in that my metrics are in openshift-infra, so I didn't properly qualify, but even if I do it still gives the same errors.

[root@ose-master ~]# oc project openshift-infra
Now using project "openshift-infra" on server "https://ose-master.kenthua.com:8443".

[root@ose-master ~]# oc get pods

[root@ose-master ~]# oc get svc

[root@ose-master ~]# oc project default
Now using project "default" on server "https://ose-master.kenthua.com:8443".

[root@ose-master ~]# oc exec -it docker-registry-1-u8zje /bin/bash
bash-4.2$ curl hawkular-cassandra-nodes.openshift-infra.svc.cluster.local
curl: (6) Could not resolve host: hawkular-cassandra-nodes.openshift-infra.svc.cluster.local; Name or service not known


[root@ose-master ~]# oc get -o json service hawkular-cassandra-nodes | grep portalIP
        "portalIP": "None",
Comment 11 Matt Wringe 2016-02-17 16:15 EST
Created attachment 1128012 [details]
Cassandra logs, OSE 3.1.1.6

Attaching a log running on OSE v3.1.1.6-10-g15b47fc with the 3.1.1 metric components.

The log contains the expected message about trying to connect to 'hawkular-cassandra-nodes':

"WARN  20:42:57 UnknownHostException for service 'hawkular-cassandra-nodes'. It may not be up yet. Trying again"
Comment 12 Matt Wringe 2016-02-17 16:27:07 EST
I have attached the logs that I am seeing and the expected message about the hawkular-cassandra-nodes being an unknown host.

OSE version: v3.1.1.6-10-g15b47fc [fresh install, RHEL7 base]
Metrics version tag: 3.1.1

This message is missing from the logs at https://bugzilla.redhat.com/attachment.cgi?id=1123648, which is very suspicious.

This type of error would be expected on OSE 3.2 or a newer version of OpenShift Origin, but not on OSE v3.1.1.6. And this error would have been caused by an issue with the 'hawkular-cassandra-nodes' service not being headless.
Comment 14 Kent Hua 2016-02-17 18:13:04 EST
You are running off a different build of OSE.  

My installed version from this repo: rhel-7-server-ose-3.1-rpms

[root@ose-master ~]# rpm -qa | grep atomic-openshift
atomic-openshift-clients-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-node-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-utils-3.0.13-1.git.0.5e8c5c7.el7aos.noarch
atomic-openshift-master-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
tuned-profiles-atomic-openshift-node-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64

A yum update only returns: atomic-openshift-utils-3.0.35-1.git.0.6a386dd.el7aos


I deleted the "hawkular-cassandra-nodes" service while the hawkular-cassandra pod was still in pending.  I still got the same error without the UnknownHostException.
Comment 15 Matt Wringe 2016-02-18 09:21:16 EST
> You are running off a different build of OSE.  

I should be running from the latest released OSE rpms. Have you tried updating?

atomic-openshift-utils-3.0.35-1.git.0.6a386dd.el7aos.noarch
atomic-openshift-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
atomic-openshift-clients-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
tuned-profiles-atomic-openshift-node-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
atomic-openshift-node-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
atomic-openshift-master-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64


> I deleted the "hawkular-cassandra-nodes" service while the hawkular-cassandra pod was still in pending.  I still got the same error without the UnknownHostException.

Yes, I would expect this to happen as all evidence seems to point to something else resolving the 'hawkular-cassandra-nodes' hostname instead of the 'hawkular-cassandra-nodes' service.
Comment 16 Kent Hua 2016-02-18 15:09:28 EST
(In reply to Matt Wringe from comment #15)
> > You are running off a different build of OSE.  
> 
> I should be running from the latest released OSE rpms. Have you tried
> updating?
> 
> atomic-openshift-utils-3.0.35-1.git.0.6a386dd.el7aos.noarch
> atomic-openshift-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
> atomic-openshift-clients-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
> tuned-profiles-atomic-openshift-node-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
> atomic-openshift-node-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
> atomic-openshift-master-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
> atomic-openshift-sdn-ovs-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
> 
> 

Not seeing this version as available in my repo.  

[root@ose-master ~]# yum list atomic-openshift-master
Loaded plugins: product-id, search-disabled-repos, subscription-manager
Installed Packages
atomic-openshift-master.x86_64                                    3.1.1.6-1.git.0.b57e8bd.el7aos                                     @rhel-7-server-ose-3.1-rpms

[root@ose-master ~]# yum upgrade atomic-openshift-master
Loaded plugins: product-id, search-disabled-repos, subscription-manager
No packages marked for update
Comment 17 Matt Wringe 2016-02-19 09:39:32 EST
Hmm, I wonder why the versions appear to be different. I followed the prerequisites which should have all the same repos enabled.

Would it be possible to install the hello-pod example in the openshift-infra project (https://raw.githubusercontent.com/openshift/origin/master/examples/hello-openshift/hello-pod.json)?

This should be the same project where you have deployed the metrics components, and they should be installed (even if they are failing) while you run the following checks.

And then run 'docker exec -u root -it $CONTAINER_ID bash'

Once inside the container run: 'yum install bind-utils' and then 'dig +short +search hawkular-cassandra-nodes' and post the output of the 'dig' command.

I can't reproduce this with a clean 3.1.1.6 install with the 3.1.1 origin metrics components. Is there anything special with your setup or network configuration?
Comment 18 Kent Hua 2016-02-19 13:04:09 EST
I couldn't get a shell into hello-openshift. I tried to get rhel-tools, but that wouldn't start up. So I just pulled an image off Docker Hub with bind, enabled RunAsAny in OpenShift, and got into the container.

root@bind-1-r9xyq:/# dig +short +search hawkular-cassandra-nodes
root@bind-1-r9xyq:/# dig +search hawkular-cassandra-nodes

; <<>> DiG 9.9.5-3ubuntu0.7-Ubuntu <<>> +search hawkular-cassandra-nodes
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 19709
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;hawkular-cassandra-nodes.localdomain. IN A

root@bind-1-r9xyq:/# dig +short +search hawkular-cassandra-nodes.openshift-inf>
root@bind-1-r9xyq:/# dig +search hawkular-cassandra-nodes.openshift-infra.svc.>

; <<>> DiG 9.9.5-3ubuntu0.7-Ubuntu <<>> +search hawkular-cassandra-nodes.openshift-infra.svc.cluster.local
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 47366
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available

;; QUESTION SECTION:
;hawkular-cassandra-nodes.openshift-infra.svc.cluster.local.localdomain.	IN A
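
The QUESTION SECTION lines above show `localdomain` appended to both names, so the container's search list evidently contained it. As a rough illustration of how a conventional stub resolver builds the names it tries from a search list and an `ndots` threshold (the semantics described in resolv.conf(5)), here is a minimal sketch. The function name is hypothetical, and the search list and `ndots=1` are assumptions inferred from the output rather than read from the container's resolv.conf; `dig` itself may order or stop its queries differently.

```python
def candidate_names(name, search, ndots=1):
    """List the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        # A trailing dot marks the name as absolute: no search list is applied.
        return [name]
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: try the name as given first, then fall back to the search list.
        return [absolute] + searched
    # Too few dots: try the search list first, then the name as given.
    return searched + [absolute]
```

With `search=["localdomain"]`, the short name expands to `hawkular-cassandra-nodes.localdomain.` before the bare name is tried, and the long service name is also retried with `.localdomain.` appended.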
Comment 19 Matt Wringe 2016-02-19 13:54:23 EST
Would it be possible for you to verify what the 'CASSANDRA_NODES_SERVICE_NAME' environment variable is for the Cassandra container?
Comment 20 Kent Hua 2016-02-19 14:33:18 EST
[root@ose-master ~]# oc exec -it hawkular-cassandra-1-qgyzt env
...
CASSANDRA_NODES_SERVICE_NAME=hawkular-cassandra-nodes
...
Comment 23 Kent Hua 2016-03-01 16:41:57 EST
It does have something to do with name resolution / DNS.  Here is my deployment layout.

Working  
kenthua.com - Locally managed DNS via bind with aliases pointing to the appropriate IPs.
ose-master.example.com 172.xx
ose-node1.example.com 172.xx
ose-node2.example.com 172.xx
*.cloudapps.example.com 172.xx

My OSE is pointing to my local bind DNS first, then vmware fusion dns second.


NOT Working
kenthua.com - Externally managed domain and DNS, with aliases pointing to the appropriate IPs.
ose-master.kenthua.com 172.xx
ose-node1.kenthua.com 172.xx
ose-node2.kenthua.com 172.xx
*.cloudapps.kenthua.com 172.xx

My OSE is pointing to my vmware fusion instance as the default DNS, which eventually will end up resolving to my ISP dns to resolve "*.kenthua.com".  For everything else, OSE doesn't seem to have a problem resolving hostnames.



Versions didn't matter; I've now tried the latest: "3.1.1.6-4.git.21.cd70c35.el7aos.x86_64". What is the hawkular-cassandra pod doing differently that makes it work only with the local DNS?
Comment 24 Matt Wringe 2016-03-01 17:07:50 EST
The hawkular-cassandra pod is trying to connect to the 'hawkular-cassandra-nodes' hostname which should correspond to a headless service. Headless services are meant not to return a single address for a service but multiple addresses corresponding to all the running instances behind the service.

The other thing I can think of is that the hawkular-cassandra pod does different things based on whether it can resolve the hostname or not. So if your external DNS is resolving 'hawkular-cassandra-nodes' to an IP address, then it will try to connect to that, which will cause exactly the issue you outlined. But it doesn't appear from your debugging that this is the case.

Have you noticed anything different between those two dns setups with respect to resolving hostnames which should not resolve to anything?
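
To make the branching described in this comment concrete, here is a deliberately simplified sketch of seed selection driven by what the service name resolves to. It is a hypothetical illustration only, not the actual start-up logic of the metrics-cassandra image; the function name and the example addresses in the note below are made up.

```python
def choose_seeds(resolved_addresses, own_ip):
    """Pick gossip seeds from the addresses the nodes service name resolved to.

    Hypothetical logic for illustration; the real image may behave differently.
    """
    # Ignore this pod's own address if the headless service already lists it.
    peers = [ip for ip in resolved_addresses if ip != own_ip]
    if not peers:
        # Name unknown, or only this pod is behind it: start as the first node.
        return [own_ip]
    # Otherwise assume every returned address is a Cassandra peer to gossip with.
    return peers
```

Under this sketch, an empty lookup result yields `[own_ip]`, while a lookup that returns an unrelated address (for example `203.0.113.7`) makes that address the only seed. That second case would be consistent with the "Unable to gossip with any seeds" failure reported in this bug.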
Comment 25 Kent Hua 2016-03-01 17:49:56 EST
I did some basic testing within the environments, outside of metrics-deployer.  Both had the same curl and dig responses when trying to resolve a non-existent hostname.
Comment 26 Matt Wringe 2016-03-01 18:05:04 EST
Would it be possible to test with the origin-metrics containers? The Cassandra one running there does things in a slightly different manner and may provide some better debugging information. At the very least it will output what IP address it is trying to connect to.

Other than that, I would probably have to create a test container for Cassandra with more debugging enabled.
Comment 27 Kent Hua 2016-03-01 19:04 EST
Created attachment 1132085 [details]
origin container images

This is the external DNS use case.

I deployed origin containers and all of the pods are in a "Running" state without any issue.

IMAGE_PREFIX=openshift/origin-,\
IMAGE_VERSION=latest,\

I was also able to hit: https://ose-master.kenthua.com/hawkular/metrics without an issue.

0.13.0-SNAPSHOT
(Git SHA1 - 7dee24acfcfb3beac356e2c4d83b7b1704ebf82f)
Metrics Service :STARTED
Comment 28 Matt Wringe 2016-03-18 13:07:55 EDT
This should be fixed in our 3.2 images, since they have been updated to use the same mechanism that origin-metrics uses (and which is verified to be fixed).
Comment 29 Xia Zhao 2016-03-21 04:37:46 EDT
Installed OSE 3.2 with the latest puddle (2016-03-18.4) and tested with the latest metrics images pulled from brew. All the metrics pods are running and the metrics UI looks good. Closing the issue as fixed.

Here are the images I used:
docker images|grep metrics| awk '{print $1"    "$3}' |awk -F'/' '{print $2"/"$3}'
openshift3/metrics-deployer    d3b5bd02c6ad
openshift3/metrics-hawkular-metrics    0d825e62d05a
openshift3/metrics-heapster    9a6aa3a55a44
openshift3/metrics-cassandra    2f9af4d01e97
Comment 31 errata-xmlrpc 2016-05-12 12:28:57 EDT
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064
