Bug 1285468 - OSE 3.1.0.4 metrics do not show for non Resource Limited projects
Status: CLOSED NOTABUG
Product: OpenShift Container Platform
Classification: Red Hat
Component: Metrics
Version: 3.1.0
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Matt Wringe
QA Contact: chunchen
Reported: 2015-11-25 11:32 EST by Boris Kurktchiev
Modified: 2016-09-29 22:16 EDT
5 users

Doc Type: Bug Fix
Last Closed: 2016-01-11 13:33:26 EST
Type: Bug

Description Boris Kurktchiev 2015-11-25 11:32:41 EST
Description of problem:

I am unable to get metrics to show up for projects without resource limits/quotas, and even in the projects that have them I only seem to get memory metrics and no CPU metrics. In unrestricted projects I just have empty graphs.

Version-Release number of selected component (if applicable):
----> oc get all
CONTROLLER                   CONTAINER(S)                      IMAGE(S)                                           SELECTOR                              REPLICAS                  AGE
hawkular-cassandra-1         hawkular-cassandra-1              openshift/origin-metrics-cassandra:latest          name=hawkular-cassandra-1             1                         1m
hawkular-metrics             hawkular-metrics                  openshift/origin-metrics-hawkular-metrics:latest   name=hawkular-metrics                 1                         1m
heapster                     heapster                          openshift/origin-metrics-heapster:latest           name=heapster                         1                         1m
NAME                         HOST/PORT                         PATH                                               SERVICE                               LABELS                    INSECURE POLICY   TLS TERMINATION
hawkular-metrics             ose-metrics.ose.devapps.unc.edu                                                      hawkular-metrics                      metrics-infra=support                       passthrough
NAME                         CLUSTER_IP                        EXTERNAL_IP                                        PORT(S)                               SELECTOR                  AGE
hawkular-cassandra           172.30.99.158                     <none>                                             9042/TCP,9160/TCP,7000/TCP,7001/TCP   type=hawkular-cassandra   1m
hawkular-cassandra-nodes     None                              <none>                                             9042/TCP,9160/TCP,7000/TCP,7001/TCP   type=hawkular-cassandra   1m
hawkular-metrics             172.30.28.56                      <none>                                             443/TCP                               name=hawkular-metrics     1m
heapster                     172.30.200.193                    <none>                                             80/TCP                                name=heapster             1m
NAME                         READY                             STATUS                                             RESTARTS                              AGE
hawkular-cassandra-1-hs7b5   1/1                               Running                                            0                                     1m
hawkular-metrics-ba6id       1/1                               Running                                            0                                     1m
heapster-qqs1i               1/1                               Running                                            2                                     1m

----> rpm -qa | grep atomic
atomic-openshift-utils-3.0.13-1.git.0.5e8c5c7.el7aos.noarch
tuned-profiles-atomic-openshift-node-3.1.0.4-1.git.4.b6c7cd2.el7aos.x86_64
atomic-openshift-master-3.1.0.4-1.git.4.b6c7cd2.el7aos.x86_64
atomic-openshift-clients-3.1.0.4-1.git.4.b6c7cd2.el7aos.x86_64
atomic-openshift-3.1.0.4-1.git.4.b6c7cd2.el7aos.x86_64
atomic-openshift-node-3.1.0.4-1.git.4.b6c7cd2.el7aos.x86_64
atomic-openshift-sdn-ovs-3.1.0.4-1.git.4.b6c7cd2.el7aos.x86_64

How reproducible:
Deploy with oc process -f /usr/share/ansible/openshift-ansible/roles/openshift_examples/files/examples/infrastructure-templates/enterprise/metrics-deployer.yaml -v CASSANDRA_PV_SIZE=2Gi,CASSANDRA_NODES=2,HAWKULAR_METRICS_HOSTNAME=ose-metrics.ose.devapps.unc.edu,USE_PERSISTENT_STORAGE=false | oc create -f -

Steps to Reproduce:
1.
2.
3.

Actual results:
No metrics, except in some Resource-limited projects

Expected results:
 Metrics everywhere

Additional info:
I have tried redeploying using REDEPLOY=true
I have tried deploying with Persistent Storage
Comment 1 Matt Wringe 2015-11-25 14:18:25 EST
Can you post a screenshot of what you are seeing for the limited containers? Eg the ones where you are seeing a graph?
Comment 2 Boris Kurktchiev 2015-11-25 20:13:31 EST
(In reply to Matt Wringe from comment #1)
> Can you post a screenshot of what you are seeing for the limited containers?
> Eg the ones where you are seeing a graph?

https://www.dropbox.com/s/7c9fk0xz8p9tsh9/NoMetrics.png?dl=0 basically empty graphs

In a quota project I get this:

https://www.dropbox.com/s/m77nwxb3l92hg5f/Metrics.png?dl=0

You can see that even there I get no CPU information, only memory.
Comment 3 Matt Wringe 2015-12-02 09:04:42 EST
I can't reproduce this issue.

Are you seeing anything in the Heapster logs? Or the Hawkular Metrics logs?
Comment 4 Samuel Padgett 2015-12-02 10:36:51 EST
Boris, can you attach the output of the following command?

oc get -o yaml pod mwmattermost-18-gofky -n mwmattermost
Comment 5 Boris Kurktchiev 2015-12-02 10:40:54 EST
(In reply to Matt Wringe from comment #3)
> I can't reproduce this issue.
> 
> Are you seeing anything in the Heapster logs? Or the Hawkular Metrics logs?

As soon as you tell me what to look for, I am happy to look. On a quick glance I am not seeing much of anything, but I am not familiar enough with either product to be 100% sure :/
Comment 6 Boris Kurktchiev 2015-12-02 10:42:23 EST
root@osmaster0s:~:
----> oc get pods mwmattermost-18-gofky
NAME                    READY     STATUS    RESTARTS   AGE
mwmattermost-18-gofky   1/1       Running   0          6d
root@osmaster0s:~:
----> oc get pods mwmattermost-18-gofky -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"mwservices","name":"mwmattermost-18","uid":"505d3aeb-91fb-11e5-866b-005056a6874f","apiVersion":"v1","resourceVersion":"3224694"}}
    openshift.io/deployment-config.latest-version: "18"
    openshift.io/deployment-config.name: mwmattermost
    openshift.io/deployment.name: mwmattermost-18
    openshift.io/scc: restricted
  creationTimestamp: 2015-11-25T16:47:26Z
  generateName: mwmattermost-18-
  labels:
    app: mwmattermost
    deployment: mwmattermost-18
    deploymentconfig: mwmattermost
  name: mwmattermost-18-gofky
  namespace: mwservices
  resourceVersion: "3224709"
  selfLink: /api/v1/namespaces/mwservices/pods/mwmattermost-18-gofky
  uid: 3584cb46-9394-11e5-ac32-005056a6874f
spec:
  containers:
  - env:
    - name: DB_HOST
      value: 172.30.134.98
    - name: DB_NAME
      value: mattermost
    - name: DB_PASS
      value: matterm0st
    - name: DB_TYPE
      value: mysql
    - name: DB_USER
      value: mattermost
    image: 172.30.16.236:5000/mwservices/mwmattermost@sha256:5d3a1cc959ce23609ab316420f0533ac22685d0711f2fc997e6ed3ceae25043a
    imagePullPolicy: Always
    name: mwmattermost
    ports:
    - containerPort: 8080
      protocol: TCP
    resources: {}
    securityContext:
      privileged: false
    terminationMessagePath: /dev/termination-log
    volumeMounts:
    - mountPath: /opt/local/mattermost/data
      name: mwmattermost-volume-1
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-2xueq
      readOnly: true
  dnsPolicy: ClusterFirst
  host: osnode1s.devapps.unc.edu
  imagePullSecrets:
  - name: default-dockercfg-igm5g
  nodeName: osnode1s.devapps.unc.edu
  nodeSelector:
    region: primary
    zone: vipapps
  restartPolicy: Always
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: mwmattermost-volume-1
    persistentVolumeClaim:
      claimName: mwmattermost
  - name: default-token-2xueq
    secret:
      secretName: default-token-2xueq
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2015-11-25T16:47:28Z
    status: "True"
    type: Ready
  containerStatuses:
  - containerID: docker://f1ac5e8eecc0aa5ba13fb56899760bf48c4fffd0cdf644ef97c15997933b75f3
    image: 172.30.16.236:5000/mwservices/mwmattermost@sha256:5d3a1cc959ce23609ab316420f0533ac22685d0711f2fc997e6ed3ceae25043a
    imageID: docker://876b3723ee777b25a5bc22289c27075cf6587239ec3f6761a62d8e8469875db8
    lastState: {}
    name: mwmattermost
    ready: true
    restartCount: 0
    state:
      running:
        startedAt: 2015-11-25T16:47:28Z
  hostIP: 152.19.229.208
  phase: Running
  podIP: 10.1.1.17
  startTime: 2015-11-25T16:47:26Z
Comment 7 Boris Kurktchiev 2015-12-02 10:45:13 EST
(In reply to Matt Wringe from comment #3)
> I can't reproduce this issue.
> 
> Are you seeing anything in the Heapster logs? Or the Hawkular Metrics logs?

biggest thing I am seeing in the heapster logs is:
2015/12/01 15:40:49 http: TLS handshake error from 10.1.0.1:57300: tls: no cipher suite supported by both client and server
971
2015/12/01 15:40:49 http: TLS handshake error from 10.1.0.1:57301: tls: no cipher suite supported by both client and server
972
2015/12/01 15:40:49 http: TLS handshake error from 10.1.0.1:57302: tls: no cipher suite supported by both client and server
973
2015/12/01 15:40:49 http: TLS handshake error from 10.1.0.1:57303: tls: no cipher suite supported by both client and server
974
2015/12/01 15:44:00 http: TLS handshake error from 10.1.0.1:35213: tls: unsupported SSLv2 handshake received

As far as I can tell Hawkular and Cassandra are not throwing any errors and only the heapster log is the one tossing the above.
Comment 8 Boris Kurktchiev 2015-12-02 10:50:59 EST
(In reply to Boris Kurktchiev from comment #7)
> (In reply to Matt Wringe from comment #3)
> > I can't reproduce this issue.
> > 
> > Are you seeing anything in the Heapster logs? Or the Hawkular Metrics logs?
> 
> biggest thing I am seeing in the heapster logs is:
> 2015/12/01 15:40:49 http: TLS handshake error from 10.1.0.1:57300: tls: no
> cipher suite supported by both client and server
> 971
> 2015/12/01 15:40:49 http: TLS handshake error from 10.1.0.1:57301: tls: no
> cipher suite supported by both client and server
> 972
> 2015/12/01 15:40:49 http: TLS handshake error from 10.1.0.1:57302: tls: no
> cipher suite supported by both client and server
> 973
> 2015/12/01 15:40:49 http: TLS handshake error from 10.1.0.1:57303: tls: no
> cipher suite supported by both client and server
> 974
> 2015/12/01 15:44:00 http: TLS handshake error from 10.1.0.1:35213: tls:
> unsupported SSLv2 handshake received
> 
> As far as I can tell Hawkular and Cassandra are not throwing any errors and
> only the heapster log is the one tossing the above.

And recently these have started to pop up in the log:
W1202 14:18:49.380194       1 reflector.go:224] /tmp/gopath/src/k8s.io/heapster/sources/pods.go:173: watch of *api.Pod ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [3468623/3468145]) [3469622]
1000
W1202 15:18:50.433236       1 reflector.go:224] /tmp/gopath/src/k8s.io/heapster/sources/pods.go:173: watch of *api.Pod ended with: 401: The event in requested index is outdated and cleared (the requested history has been cleared [3470101/3469623]) [3471100]
Comment 9 Matt Wringe 2015-12-02 11:25:18 EST
Hmm, this may be part of the root cause:

"http: TLS handshake error from 10.1.0.1:57300: tls: no cipher suite supported by both client and server"

The problem is that it's really strange that it can only receive a very specific type of metric. I would have expected all metrics or nothing.

It's also really strange that I have not heard anything similar to this from anyone else, which is why I am suspecting something slightly different with the setup or install. But I don't see anything special in what you are doing.
Comment 10 Boris Kurktchiev 2015-12-02 11:32:31 EST
(In reply to Matt Wringe from comment #9)
> Hmm, this may be part of the root cause:
> 
> "http: TLS handshake error from 10.1.0.1:57300: tls: no cipher suite
> supported by both client and server"
> 
> The problem is that it's really strange that it can only receive a very
> specific type of metric. I would have expected all metrics or nothing.
> 
> It's also really strange that I have not heard anything similar to this from
> anyone else, which is why I am suspecting something slightly different with
> the setup or install. But I don't see anything special in what you are doing.

Not sure; the original install was 3.0.2 from whatever state the ansible-playbooks GitHub repo was in at the time, and the upgrade to 3.1 was done with the included playbooks. Also, only RAM shows up for the resource-limited projects, no CPU, and obviously still nothing in all other projects.
Comment 11 Matt Wringe 2015-12-02 13:57:25 EST
Can you please post the full logs from the heapster container somewhere?
Comment 12 Boris Kurktchiev 2015-12-02 14:09:25 EST
(In reply to Matt Wringe from comment #11)
> Can you please post the full logs from the heapster container somewhere?

https://gist.github.com/ebalsumgo/83f826ef2d9bfff00e8f
Comment 13 Matt Wringe 2015-12-02 16:56:02 EST
From the log:

"2015/11/30 08:08:08 http: TLS handshake error from 10.1.2.1:39503: tls: first record does not look like a TLS handshake
2015/11/30 08:08:08 http: TLS handshake error from 10.1.2.1:42048: tls: unsupported SSLv2 handshake received"

Can you verify whether 10.1.2.1 is the IP address for one of your nodes?

Did you do anything special to set up your node certificates? Or are you using the default generated ones?
Comment 14 Boris Kurktchiev 2015-12-03 09:25:53 EST
(In reply to Matt Wringe from comment #13)
> From the log:
> 
> "2015/11/30 08:08:08 http: TLS handshake error from 10.1.2.1:39503: tls:
> first record does not look like a TLS handshake
> 2015/11/30 08:08:08 http: TLS handshake error from 10.1.2.1:42048: tls:
> unsupported SSLv2 handshake received"
> 
> Can you verify whether 10.1.2.1 is the IP address for one of your nodes?
> 
> Did you do anything special to set up your node certificates? Or are you
> using the default generated ones?

root@osmaster0s:~:
----> oc get hostsubnets
NAME                         HOST                         HOST IP          SUBNET
osmaster0s.devapps.unc.edu   osmaster0s.devapps.unc.edu   152.19.229.206   10.1.0.0/24
osnode0s.devapps.unc.edu     osnode0s.devapps.unc.edu     152.19.229.207   10.1.2.0/24
osnode1s.devapps.unc.edu     osnode1s.devapps.unc.edu     152.19.229.208   10.1.1.0/24
osnode2s.devapps.unc.edu     osnode2s.devapps.unc.edu     152.19.229.209   10.1.3.0/24

Looks like that IP should live on one of my nodes, yes.
Comment 15 Boris Kurktchiev 2015-12-03 10:11:44 EST
For the record, here is the debug output from heapster: https://gist.github.com/ebalsumgo/d7e199abfdf2947b148d
Comment 16 Boris Kurktchiev 2015-12-03 11:43:18 EST
(In reply to Boris Kurktchiev from comment #15)
> For the record, here is the debug output from heapster:
> https://gist.github.com/ebalsumgo/d7e199abfdf2947b148d

Got it working by following these instructions to replace my 3.0.2 certs:

On each node with cert IP errors
================================
1. Determine what subject alt names are already in place for the node's serving certificate:
   openssl x509 -in /etc/origin/node/server.crt -text -noout | grep -A 1 "Subject Alternative Name"

	If the output shows:
		X509v3 Subject Alternative Name: 
    		DNS:mynode, DNS:mynode.mydomain.com, IP: 1.2.3.4
	then your subject alt names are:
		mynode
		mynode.mydomain.com
		1.2.3.4

2. Determine the IP address the node will register, listed in /etc/origin/node/node-config.yaml as the "nodeIP" key. For example:
	nodeIP: "10.10.10.1"
  
  This should match the IP in the log error about the node certificate. If the IP address is not listed as a subject alt name in the node certificate, it needs to be added.
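
   Putting steps 1 and 2 together, a quick check on each node might look like this (a sketch, assuming the default /etc/origin/node paths used above):

	# Show the subject alt names currently in the node's serving certificate
	openssl x509 -in /etc/origin/node/server.crt -text -noout | grep -A 1 "Subject Alternative Name"
	# Show the IP the node registers with; it must appear in the SAN list above
	grep nodeIP /etc/origin/node/node-config.yaml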


On the master
=============
1. Make a tmp dir and run this:
	signing_opts="--signer-cert=/etc/origin/master/ca.crt --signer-key=/etc/origin/master/ca.key --signer-serial=/etc/origin/master/ca.serial.txt"

2. For each node, run:
	oadm ca create-server-cert --cert=$nodename/server.crt --key=$nodename/server.key --hostnames=<existing subject alt names>,<new node IP> $signing_opts

	For example:
	oadm ca create-server-cert --cert=mynode/server.crt --key=mynode/server.key --hostnames=mynode,mynode.mydomain.com,1.2.3.4,10.10.10.1 $signing_opts


Replace node serving certs
==========================
1. back up the existing /etc/origin/node/server.{crt,key} files on each node
2. copy the generated $nodename/server.{crt,key} files to each node under /etc/origin/node/
3. restart the node service
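
For reference, one way to confirm the replacement took effect (a sketch: the heapster pod name is taken from the output earlier in this bug, and the openshift-infra project is an assumption; use whichever project the metrics deployer was run in):

	# Verify the regenerated certificate now lists the node's IP as a subject alt name
	openssl x509 -in /etc/origin/node/server.crt -text -noout | grep -A 1 "Subject Alternative Name"
	# The TLS handshake errors should no longer appear in the Heapster log
	oc logs heapster-qqs1i -n openshift-infra | grep -i "tls handshake error"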
Comment 17 Matt Wringe 2016-01-11 13:33:26 EST
Since it appears that this was just an issue with certificates, I will be closing this issue. If you run into a similar issue again, please let us know.
