Created attachment 1123647 [details]
metrics-deployer pod log
Description of problem:
hawkular-cassandra pod does not deploy as part of executing metrics-deployer.
hawkular-cassandra initially fails with one error:
java.lang.RuntimeException: Unable to gossip with any seeds
subsequent pod retries result in:
org.apache.cassandra.exceptions.ConfigurationException: Found system keyspace files, but they couldn't be loaded!
Issue occurs with USE_PERSISTENT_STORAGE true and false. Scenario below is USE_PERSISTENT_STORAGE=false
Version-Release number of selected component (if applicable):
OSE v3.1.1.6
metrics-deployer and related pods, v3.1.1
How reproducible:
Every time
Steps to Reproduce:
Fresh install of OSE 3.1.1.6 via advanced ansible installer
Setup steps:
HAWKULAR_METRICS_HOSTNAME=ose-master.kenthua.com
oc project openshift-infra
oc create -f - <<API
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-deployer
secrets:
- name: metrics-deployer
API
oadm policy add-role-to-user edit system:serviceaccount:openshift-infra:metrics-deployer
oadm policy add-cluster-role-to-user cluster-reader system:serviceaccount:openshift-infra:heapster
oc secrets new metrics-deployer nothing=/dev/null
oc process -f /usr/share/ansible/openshift-ansible/roles/openshift_examples/files/examples/infrastructure-templates/enterprise/metrics-deployer.yaml -v \
IMAGE_PREFIX=registry.access.redhat.com/openshift3/,\
IMAGE_VERSION=3.1.1,\
HAWKULAR_METRICS_HOSTNAME=$HAWKULAR_METRICS_HOSTNAME,\
USE_PERSISTENT_STORAGE=false \
| oc create -f -
# DNS inside the metrics-deployer pod looks good.
# timing was tricky for this one because it goes by fast once running
[root@ose-master ~]# oc exec -it metrics-deployer-do0lt /bin/bash
id: cannot find name for user ID 1000
<ployer-do0lt deploy]$ curl https://kubernetes.default.svc:443 -k
{
  "paths": [
    "/api",
    "/api/v1",
    "/controllers",
    "/healthz",
    "/healthz/ping",
    "/healthz/ready",
    "/logs/",
    "/metrics",
    "/oapi",
    "/oapi/v1",
    "/swaggerapi/"
  ]
}[I have no name!@metrics-deployer-do0lt deploy]$
# Initial hawkular-cassandra-1-eredm failure
# timing was tricky for this because it seems to happen only when the pod first tries to run
INFO 18:09:03 Starting Encrypted Messaging Service on SSL port 7001
java.lang.RuntimeException: Unable to gossip with any seeds
  at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1328)
  at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:543)
  at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:754)
  at org.apache.cassandra.service.StorageService.initServer(StorageService.java:688)
  at org.apache.cassandra.service.StorageService.initServer(StorageService.java:580)
  at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:292)
  at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:488)
  at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:595)
Exception (java.lang.RuntimeException) encountered during startup: Unable to gossip with any seeds
ERROR 18:09:34 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
  at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1328) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
# Subsequent hawkular-cassandra-1-eredm failures
INFO 18:09:56 Initializing system.range_xfers
ERROR 18:09:57 Fatal exception during initialization
org.apache.cassandra.exceptions.ConfigurationException: Found system keyspace files, but they couldn't be loaded!
  at org.apache.cassandra.db.SystemKeyspace.checkHealth(SystemKeyspace.java:744) ~[apache-cassandra-2.2.1.redhat-2.jar:2.2.1.redhat-2]
# Heapster fails because it's waiting for hawkular-metrics to start
F0212 13:10:59.329583 1 heapster.go:67] Get https://hawkular-metrics:443/hawkular/metrics/metrics?type=gauge: dial tcp 172.30.215.241:443: no route to hos
# Other info
[root@ose-master ~]# oc get svc
NAME                       CLUSTER_IP       EXTERNAL_IP   PORT(S)                               SELECTOR                  AGE
hawkular-cassandra         172.30.37.33     <none>        9042/TCP,9160/TCP,7000/TCP,7001/TCP   type=hawkular-cassandra   2m
hawkular-cassandra-nodes   None             <none>        9042/TCP,9160/TCP,7000/TCP,7001/TCP   type=hawkular-cassandra   2m
hawkular-metrics           172.30.215.241   <none>        443/TCP                               name=hawkular-metrics     2m
heapster                   172.30.65.108    <none>        80/TCP                                name=heapster             2m
Actual results:
[root@ose-master ~]# oc get pods
NAME                         READY   STATUS             RESTARTS   AGE
hawkular-cassandra-1-eredm   0/1     CrashLoopBackOff   47         3h
hawkular-metrics-5b067       0/1     Pending            0          3h
heapster-4gwen               0/1     CrashLoopBackOff   46         3h
metrics-deployer-do0lt       0/1     Completed          0          3h
Expected results:
cluster metrics successfully running with all 3 pods (cassandra, metrics and heapster running)
Additional info:
Created attachment 1123648 [details] hawkular-cassandra-1-eredm - initial error
Created attachment 1123649 [details] hawkular-cassandra-1-eredm - subsequent errors
Can reproduce locally, looking into this
Correction: I cannot reproduce this locally with the OSE images; they work for me. I was able to reproduce the same error message with an older origin-metrics image, but that is a different issue.
The issue appears to be that Cassandra is resolving 'hawkular-cassandra-nodes' when this shouldn't be resolvable yet. The 'hawkular-cassandra-nodes' hostname should point to a service which is created as part of the metrics install; it should not be resolvable when first deploying metrics and when Cassandra is starting up.
Is there anything else running in your project, such as existing Cassandra instances, when you deploy this? Have you made any other modifications to the metrics deployment?
What happens when you add 'REDEPLOY=true' when you process the template? This should redeploy metrics (but be warned, it is a full redeploy, so any existing components will be deleted).
oc process -f /usr/share/ansible/openshift-ansible/roles/openshift_examples/files/examples/infrastructure-templates/enterprise/metrics-deployer.yaml -v \
IMAGE_PREFIX=registry.access.redhat.com/openshift3/,\
IMAGE_VERSION=3.1.1,\
HAWKULAR_METRICS_HOSTNAME=$HAWKULAR_METRICS_HOSTNAME,\
USE_PERSISTENT_STORAGE=false,REDEPLOY=true \
| oc create -f -
(In reply to Matt Wringe from comment #5)
> Is there anything else running in your project, such as existing Cassandra
> instances, when you deploy this? Have you made any other modifications to
> the metrics deployment?
No existing cassandra instances. No modifications to the metrics-deployer.yaml. Using only the parameters for changes.
> What happens when you add the 'REDEPLOY=true' when you process the template?
> This should redeploy metrics (but be warned it is a full redeploy, so any
> existing components will be deleted)
Same issue. gossip error on first try, then finding system keyspace on subsequent retries.
[root@ose-master ~]# oc describe pod hawkular-cassandra-1-8pszg
Name: hawkular-cassandra-1-8pszg
Namespace: openshift-infra
Image(s): registry.access.redhat.com/openshift3/metrics-cassandra:3.1.1
Node: ose-node2.kenthua.com/172.16.118.131
Start Time: Fri, 12 Feb 2016 14:53:53 -0800
Labels: metrics-infra=hawkular-cassandra,name=hawkular-cassandra-1,type=hawkular-cassandra
Status: Running
Reason:
Message:
IP: 10.1.1.6
Replication Controllers: hawkular-cassandra-1 (1/1 replicas created)
Containers:
  hawkular-cassandra-1:
    Container ID: docker://753789df0a55d80fb664daed8ea9578cc7db7a54ee40ec9b93eaec726b8be55d
    Image: registry.access.redhat.com/openshift3/metrics-cassandra:3.1.1
    Image ID: docker://227b99b94d054dd99b148ae5024f646e40775463b6a961f90f5f85b309b17a6e
    QoS Tier:
      cpu: BestEffort
      memory: BestEffort
    State: Waiting
      Reason: CrashLoopBackOff
    Last Termination State: Terminated
      Reason: Error
      Exit Code: 100
      Started: Fri, 12 Feb 2016 14:56:16 -0800
      Finished: Fri, 12 Feb 2016 14:56:20 -0800
    Ready: False
    Restart Count: 5
    Environment Variables:
      CASSANDRA_MASTER: true
      POD_NAMESPACE: openshift-infra (v1:metadata.namespace)
Conditions:
  Type Status
  Ready False
Volumes:
  cassandra-data:
    Type: EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
  hawkular-cassandra-secrets:
    Type: Secret (a secret that should populate this volume)
    SecretName: hawkular-cassandra-secrets
  cassandra-token-5rmta:
    Type: Secret (a secret that should populate this volume)
    SecretName: cassandra-token-5rmta
Events:
  FirstSeen  LastSeen  Count  From  SubobjectPath  Reason  Message
  ─────────  ────────  ─────  ────  ─────────────  ──────  ───────
  3m  3m  1  {kubelet ose-node2.kenthua.com}  implicitly required container POD  Created  Created with docker id 3eec5325b357
  3m  3m  1  {kubelet ose-node2.kenthua.com}  implicitly required container POD  Started  Started with docker id 3eec5325b357
  3m  3m  1  {kubelet ose-node2.kenthua.com}  implicitly required container POD  Pulled  Container image "openshift3/ose-pod:v3.1.1.6" already present on machine
  3m  3m  1  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Created  Created with docker id b79873255f83
  3m  3m  1  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Started  Started with docker id b79873255f83
  3m  3m  1  {scheduler }  Scheduled  Successfully assigned hawkular-cassandra-1-8pszg to ose-node2.kenthua.com
  2m  2m  1  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Started  Started with docker id 1f3a12eb1f51
  2m  2m  1  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Created  Created with docker id 1f3a12eb1f51
  2m  2m  1  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Created  Created with docker id 5e8b1f4feaf6
  2m  2m  1  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Started  Started with docker id 5e8b1f4feaf6
  2m  2m  1  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Created  Created with docker id 5cd4a53bceae
  2m  2m  1  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Started  Started with docker id 5cd4a53bceae
  3m  1m  5  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Pulled  Container image "registry.access.redhat.com/openshift3/metrics-cassandra:3.1.1" already present on machine
  1m  1m  1  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Started  Started with docker id 753789df0a55
  1m  1m  1  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Created  Created with docker id 753789df0a55
  2m  8s  15  {kubelet ose-node2.kenthua.com}  spec.containers{hawkular-cassandra-1}  Backoff  Back-off restarting failed docker container
The errors you are seeing appear to be the result of something else on your system resolving the 'hawkular-cassandra-nodes' hostname. Are you sure there isn't something which is already resolving this hostname?
Nothing appears to be resolving 'hawkular-cassandra-nodes'. I tried it with my current setup.
[root@ose-master ~]# oc exec -it docker-registry-1-u8zje /bin/bash
bash-4.2$ curl hawkular-cassandra-nodes
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><meta http-equiv="refresh" content="0;url=http://finder.cox.net/main?ParticipantID=96e687opkbv4scrood8k84drs6gw5duf&FailedURI=http%3A%2F%2Fhawkular-cassandra-nodes%2F&FailureMode=1&Implementation=&AddInType=4&Version=pywr1.0&ClientLocation=us"/><script type="text/javascript">url="http://finder.cox.net/main?ParticipantID=96e687opkbv4scrood8k84drs6gw5duf&FailedURI=http%3A%2F%2Fhawkular-cassandra-nodes%2F&FailureMode=1&Implementation=&AddInType=4&Version=pywr1.0&ClientLocation=us";if(top.location!=location){var w=window,d=document,e=d.documentElement,b=d.body,x=w.innerWidth||e.clientWidth||b.clientWidth,y=w.innerHeight||e.clientHeight||b.clientHeight;url+="&w="+x+"&h="+y;}window.location.replace(url);</script></head><body></body></html>
bash-4.2$
bash-4.2$ curl redhatabc.com
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"><html><head><meta http-equiv="refresh" content="0;url=http://finder.cox.net/main?ParticipantID=96e687opkbv4scrood8k84drs6gw5duf&FailedURI=http%3A%2F%2Fredhatabc.com%2F&FailureMode=1&Implementation=&AddInType=4&Version=pywr1.0&ClientLocation=us"/><script type="text/javascript">url="http://finder.cox.net/main?ParticipantID=96e687opkbv4scrood8k84drs6gw5duf&FailedURI=http%3A%2F%2Fredhatabc.com%2F&FailureMode=1&Implementation=&AddInType=4&Version=pywr1.0&ClientLocation=us";if(top.location!=location){var w=window,d=document,e=d.documentElement,b=d.body,x=w.innerWidth||e.clientWidth||b.clientWidth,y=w.innerHeight||e.clientHeight||b.clientHeight;url+="&w="+x+"&h="+y;}window.location.replace(url);</script></head><body></body></html>bash-4.2$
I also set up a DNS server for my OSE instances to point to. The results below are from before metrics-deployer is run, both on the OSE host and inside a container. The same results occur after the run, as expected.
[root@ose-master ~]# curl hawkular-cassandra
curl: (6) Could not resolve host: hawkular-cassandra; Name or service not known
[root@ose-master ~]# curl hawkular-cassandra-nodes
curl: (6) Could not resolve host: hawkular-cassandra-nodes; Name or service not known
[root@ose-master ~]# oc exec -it docker-registry-1-u8zje /bin/bash
cbash-4.2$ curl hawkular-cassandra
curl: (6) Could not resolve host: hawkular-cassandra; Name or service not known
bash-4.2$ curl hawkular-cassandra-nodes
curl: (6) Could not resolve host: hawkular-cassandra-nodes; Name or service not known
bash-4.2$
A Cassandra instance always checks if there are other Cassandra instances running to see if it can connect to them. It uses the 'hawkular-cassandra-nodes' hostname to determine this.
If the 'hawkular-cassandra-nodes' hostname could not be resolved, then there should have been an entry about it in the logs you posted here: https://bugzilla.redhat.com/attachment.cgi?id=1123648
But there is no entry in those logs corresponding to the hostname not being resolvable, which leads me to believe the system is resolving the hostname to something else.
The root cause here appears to be that the system is resolving the 'hawkular-cassandra-nodes' hostname to something which is not a running Cassandra instance. This causes the Cassandra instance to write its configuration files in a manner that requires manual intervention to get working again on subsequent restarts (which is the error you are seeing).
Something else which can cause this exact problem would be if the cassandra-nodes service was not a headless service. But from your initial comment it appears that it's not the case. We can double-check this with 'oc get -o json service hawkular-cassandra-nodes' and making sure the portalIP value is 'None'.
Can you please confirm whether your docker registry is running in the same project as the metrics components? If not, then you will need to check the full hostname, which is 'hawkular-cassandra-nodes.openshift-infra.svc.cluster.local' assuming the metrics components were deployed to the 'openshift-infra' project.
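A minimal sketch of that headless-service check, assuming the service JSON carries the address under either "portalIP" or "clusterIP" (which field appears depends on the API version) and that a plain grep is good enough for a quick look. The helper name is_headless is made up for this example; it is not part of oc.

```shell
# is_headless (hypothetical helper): reads a Service definition in JSON on
# stdin and succeeds only if its portalIP/clusterIP is the literal "None".
is_headless() {
  grep -Eq '"(portalIP|clusterIP)"[[:space:]]*:[[:space:]]*"None"'
}

# Example use against the cluster (assumes the openshift-infra project):
#   oc get -o json service hawkular-cassandra-nodes -n openshift-infra | is_headless \
#     && echo "headless" || echo "NOT headless"
```

If this reports "NOT headless", the service has a single cluster address, and Cassandra would be handed that address instead of the addresses of the individual pods.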
You are correct that my metrics are in openshift-infra, so I didn't fully qualify the hostname, but even when I do it still gives the same errors.
[root@ose-master ~]# oc project openshift-infra
Now using project "openshift-infra" on server "https://ose-master.kenthua.com:8443".
[root@ose-master ~]# oc get pods
[root@ose-master ~]# oc get svc
[root@ose-master ~]# oc project default
Now using project "default" on server "https://ose-master.kenthua.com:8443".
[root@ose-master ~]# oc exec -it docker-registry-1-u8zje /bin/bash
bash-4.2$ curl hawkular-cassandra-nodes.openshift-infra.svc.cluster.local
curl: (6) Could not resolve host: hawkular-cassandra-nodes.openshift-infra.svc.cluster.local; Name or service not known
[root@ose-master ~]# oc get -o json service hawkular-cassandra-nodes | grep portalIP
"portalIP": "None",
Created attachment 1128012 [details]
Cassandra logs, OSE 3.1.1.6
Attaching a log from a run on OSE v3.1.1.6-10-g15b47fc with the 3.1.1 metric components. The log contains the expected message about trying to connect to 'hawkular-cassandra-nodes':
"WARN 20:42:57 UnknownHostException for service 'hawkular-cassandra-nodes'. It may not be up yet. Trying again"
I have attached the logs that I am seeing, which include the expected message about hawkular-cassandra-nodes being an unknown host.
OSE version: v3.1.1.6-10-g15b47fc [fresh install, RHEL7 base]
Metrics version tag: 3.1.1
This message is missing from the logs at https://bugzilla.redhat.com/attachment.cgi?id=1123648, which is very suspicious.
This type of error would be expected on OSE 3.2 or a newer version of OpenShift Origin, but not on OSE v3.1.1.6. In that case it would have been caused by an issue with the 'hawkular-cassandra-nodes' service not being headless.
You are running off a different build of OSE.
My installed version from this repo: rhel-7-server-ose-3.1-rpms
[root@ose-master ~]# rpm -qa | grep atomic-openshift
atomic-openshift-clients-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-node-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-utils-3.0.13-1.git.0.5e8c5c7.el7aos.noarch
atomic-openshift-master-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
tuned-profiles-atomic-openshift-node-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.1.6-1.git.0.b57e8bd.el7aos.x86_64
A yum update only returns:
atomic-openshift-utils-3.0.35-1.git.0.6a386dd.el7aos
I deleted the "hawkular-cassandra-nodes" service while the hawkular-cassandra pod was still in pending. I still got the same error without the UnknownHostException.
> You are running off a different build of OSE. I should be running from the latest released OSE rpms. Have you tried updating? atomic-openshift-utils-3.0.35-1.git.0.6a386dd.el7aos.noarch atomic-openshift-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64 atomic-openshift-clients-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64 tuned-profiles-atomic-openshift-node-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64 atomic-openshift-node-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64 atomic-openshift-master-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64 atomic-openshift-sdn-ovs-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64 > I deleted the "hawkular-cassandra-nodes" service while the hawkular-cassandra pod was still in pending. I still got the same error without the UnknownHostException. Yes, I would expect this to happen as all evidence seems to point to something else resolving the 'hawkular-cassandra-nodes' hostname instead of the 'hawkular-cassandra-nodes' service.
(In reply to Matt Wringe from comment #15)
> > You are running off a different build of OSE.
>
> I should be running from the latest released OSE rpms. Have you tried
> updating?
>
> atomic-openshift-utils-3.0.35-1.git.0.6a386dd.el7aos.noarch
> atomic-openshift-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
> atomic-openshift-clients-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
> tuned-profiles-atomic-openshift-node-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
> atomic-openshift-node-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
> atomic-openshift-master-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
> atomic-openshift-sdn-ovs-3.1.1.6-2.git.10.15b47fc.el7aos.x86_64
Not seeing this version as available in my repo.
[root@ose-master ~]# yum list atomic-openshift-master
Loaded plugins: product-id, search-disabled-repos, subscription-manager
Installed Packages
atomic-openshift-master.x86_64 3.1.1.6-1.git.0.b57e8bd.el7aos @rhel-7-server-ose-3.1-rpms
[root@ose-master ~]# yum upgrade atomic-openshift-master
Loaded plugins: product-id, search-disabled-repos, subscription-manager
No packages marked for update
Hmm, I wonder why the versions appear to be different. I followed the prerequisites, which should have all the same repos enabled.
Would it be possible to install the hello-pod example in the openshift-infra project (https://raw.githubusercontent.com/openshift/origin/master/examples/hello-openshift/hello-pod.json)? This should be the same project where you have deployed the metrics components, and they should be installed (even if they are failing) while running the following checks.
Then run 'docker exec -u root -it $CONTAINER_ID bash'. Once inside the container, run 'yum install bind-utils' and then 'dig +short +search hawkular-cassandra-nodes', and post the output of the 'dig' command.
I can't reproduce this with a clean 3.1.1.6 install with the 3.1.1 origin metrics components. Is there anything special about your setup or network configuration?
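To make the 'dig' result easier to compare between environments, here is a small sketch that summarizes the output of 'dig +short +search'. It assumes only IPv4 A records matter here, and the helper name summarize_dig_short is invented for this example.

```shell
# summarize_dig_short (hypothetical helper): reads the output of
# `dig +short +search <name>` on stdin. Prints "unresolved" when no IPv4
# address lines are present, otherwise "resolved: <addresses>".
summarize_dig_short() {
  # Keep only lines that look like dotted-quad IPv4 addresses and join them.
  addrs=$(awk '/^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$/ { printf "%s%s", sep, $0; sep = " " }')
  if [ -z "$addrs" ]; then
    echo "unresolved"
  else
    echo "resolved: $addrs"
  fi
}

# Example use inside the container:
#   dig +short +search hawkular-cassandra-nodes | summarize_dig_short
```

Any address reported while no Cassandra pod is running would point at something other than the metrics service answering for the name.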
I couldn't get a shell into hello-openshift. I tried to get rhel-tools, but that wouldn't start up. So I just pulled an image off docker hub with bind, enabled RunAsAny in openshift and got into the container.
root@bind-1-r9xyq:/# dig +short +search hawkular-cassandra-nodes
root@bind-1-r9xyq:/# dig +search hawkular-cassandra-nodes
; <<>> DiG 9.9.5-3ubuntu0.7-Ubuntu <<>> +search hawkular-cassandra-nodes
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 19709
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;hawkular-cassandra-nodes.localdomain. IN A
root@bind-1-r9xyq:/# dig +short +search hawkular-cassandra-nodes.openshift-inf>
root@bind-1-r9xyq:/# dig +search hawkular-cassandra-nodes.openshift-infra.svc.>
; <<>> DiG 9.9.5-3ubuntu0.7-Ubuntu <<>> +search hawkular-cassandra-nodes.openshift-infra.svc.cluster.local
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 47366
;; flags: qr rd; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 0
;; WARNING: recursion requested but not available
;; QUESTION SECTION:
;hawkular-cassandra-nodes.openshift-infra.svc.cluster.local.localdomain. IN A
Would it be possible for you to verify what the 'CASSANDRA_NODES_SERVICE_NAME' environment variable is for the Cassandra container?
[root@ose-master ~]# oc exec -it hawkular-cassandra-1-qgyzt env
...
CASSANDRA_NODES_SERVICE_NAME=hawkular-cassandra-nodes
...
It does have something to do with name resolution / DNS. Here is my deployment layout.
Working
kenthua.com - Locally managed DNS via bind with aliases pointing to the appropriate IPs.
ose-master.example.com 172.xx
ose-node1.example.com 172.xx
ose-node2.example.com 172.xx
*.cloudapps.example.com 172.xx
My OSE is pointing to my local bind DNS first, then vmware fusion dns second.
NOT Working
kenthua.com - Externally managed domain and DNS, with aliases pointing to the appropriate IPs.
ose-master.kenthua.com 172.xx
ose-node1.kenthua.com 172.xx
ose-node2.kenthua.com 172.xx
*.cloudapps.kenthua.com 172.xx
My OSE is pointing to my vmware fusion instance as the default DNS, which eventually ends up at my ISP DNS to resolve "*.kenthua.com".
Everything else in OSE doesn't seem to have a problem resolving hostnames.
Versions didn't matter; I've now tried the latest: "3.1.1.6-4.git.21.cd70c35.el7aos.x86_64"
What is the hawkular-cassandra pod doing differently or uniquely that makes it only work with the local DNS?
The hawkular-cassandra pod is trying to connect to the 'hawkular-cassandra-nodes' hostname, which should correspond to a headless service. Headless services are meant to return not a single address for a service but multiple addresses corresponding to all the running instances behind the service.
The other thing I can think of is that the hawkular-cassandra pod does different things depending on whether it can resolve the hostname or not. So if your external DNS is resolving 'hawkular-cassandra-nodes' to an IP address, then it will try to connect to that, which will cause exactly the issue you outlined. But it doesn't appear from your debugging that this is the case.
Have you noticed anything different between those two DNS setups with respect to resolving hostnames which should not resolve to anything?
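One way to compare the two DNS setups is to look up a name that should not exist and see whether the resolver answers anyway (earlier in this thread, a curl to a nonexistent name came back with an ISP search page). This is only a sketch: the helper name probe_bogus_name and the made-up hostname are assumptions for illustration, and it uses 'getent hosts' as the default lookup because that follows the same resolver configuration as most applications on the host.

```shell
# probe_bogus_name (hypothetical helper): tries to resolve a name that should
# not exist. Prints "resolves" if the lookup succeeds anyway, else "unresolved".
# $1 is the lookup command to use; it defaults to "getent hosts".
probe_bogus_name() {
  resolver=${1:-getent hosts}
  # A short, dot-free name so the resolver's search path is applied to it,
  # the same way it is for 'hawkular-cassandra-nodes'.
  bogus="no-such-host-$$-test"
  if $resolver "$bogus" >/dev/null 2>&1; then
    echo "resolves"
  else
    echo "unresolved"
  fi
}

# Example use, on the OSE host and again inside a pod:
#   probe_bogus_name
```

A "resolves" result under one DNS setup and "unresolved" under the other would be the kind of difference worth reporting here.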
I did some basic testing within the environments, outside of metrics-deployer. Both had the same curl and dig responses when trying to resolve a non-existent hostname.
Would it be possible to test with the origin-metrics containers? The Cassandra one running there does things in a slightly different manner and may provide some better debugging information. At the very least it will output what IP address it is trying to connect to. Other than that, I would probably have to create a test container for Cassandra with more debugging enabled.
Created attachment 1132085 [details]
origin container images
This is the external DNS use case. I deployed origin containers and all of the pods are in a "Running" state without any issue.
IMAGE_PREFIX=openshift/origin-,\
IMAGE_VERSION=latest,\
I was also able to hit https://ose-master.kenthua.com/hawkular/metrics without an issue:
0.13.0-SNAPSHOT (Git SHA1 - 7dee24acfcfb3beac356e2c4d83b7b1704ebf82f)
Metrics Service :STARTED
This should be fixed in our 3.2 images, since they have been updated to use the same mechanism that origin-metrics uses (and which is verified to be fixed).
Installed OSE 3.2 with the latest puddle (2016-03-18.4) and tested with the latest metrics images pulled from brew. All the metrics pods are running and the metrics UI looks good. Closing the issue as fixed.
Here are the images I used:
docker images|grep metrics| awk '{print $1" "$3}' |awk -F'/' '{print $2"/"$3}'
openshift3/metrics-deployer d3b5bd02c6ad
openshift3/metrics-hawkular-metrics 0d825e62d05a
openshift3/metrics-heapster 9a6aa3a55a44
openshift3/metrics-cassandra 2f9af4d01e97
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064