Bug 1448462 - Hawkular Metrics does not work after installing logging via the advanced install; it fails with "Error: the service account for Hawkular Metrics does not have permission to view resources in this namespace. View permissions are required for Hawkular Metrics to function properly."
Summary: Hawkular Metrics does not work after installing logging via the advanced install; it fails with a "service account does not have permission to view resources" error.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Hawkular
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: urgent
Target Milestone: ---
Target Release: 3.7.0
Assignee: Matt Wringe
QA Contact: Liming Zhou
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-05-05 13:09 UTC by Miheer Salunke
Modified: 2021-06-10 12:16 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-06-29 00:24:12 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Miheer Salunke 2017-05-05 13:09:31 UTC
Description of problem:

Hawkular Metrics does not work after installing logging via the advanced install; the pod fails with "Error: the service account for Hawkular Metrics does not have permission to view resources in this namespace. View permissions are required for Hawkular Metrics to function properly."
We also ran the command suggested in the error message, but it did not help.
The hawkular service account is also listed under the view role in the openshift-infra project.
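For reference, one way to double-check that binding from the CLI (a sketch; assumes oc is logged in with sufficient rights and the rolebinding carries the default name "view"):

oc get rolebinding view -n openshift-infra -o yaml
# the subjects list should include:
#   system:serviceaccount:openshift-infra:hawkular
oc policy who-can list pods -n openshift-infra
# the hawkular service account should appear in the output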

Version-Release number of selected component (if applicable):
OCP 3.5

How reproducible:
Reproduced on the customer side.

Steps to Reproduce:
1. Install logging via the advanced install.

Actual results:
Hawkular pod fails with "2017-05-05 06:59:22 Starting Hawkular Metrics
Error: the service account for Hawkular Metrics does not have permission to view resources in this namespace. View permissions are required for Hawkular Metrics to function properly.
Usually this can be resolved by running: oc adm policy add-role-to-user view system:serviceaccount:openshift-infra:hawkular -n openshift-infra"
even when the view role is assigned to the hawkular service account in the openshift-infra project.

Expected results:
It should not fail with "2017-05-05 06:59:22 Starting Hawkular Metrics
Error: the service account for Hawkular Metrics does not have permission to view resources in this namespace. View permissions are required for Hawkular Metrics to function properly.
Usually this can be resolved by running: oc adm policy add-role-to-user view system:serviceaccount:openshift-infra:hawkular -n openshift-infra"
even when the view role is assigned to the hawkular service account in the openshift-infra project.

Additional info:

Comment 6 Miheer Salunke 2017-05-19 10:12:55 UTC
Hi,

We tried the following ->

Adding '-DKUBERNETES_MASTER_URL=https://kubernetes.default.svc:443' (rather than https://kubernetes.default.svc.cluster.local) to the rc of hawkular-metrics from the web console lets the pod run without that error.

(Not related to this issue, but adding it from the web console wraps the URL in extra '' quotes.)
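For reference, a minimal sketch of making the same change from the CLI instead, which avoids the quoting problem (assumes the property is passed as an argument in the rc's container command):

oc edit rc hawkular-metrics -n openshift-infra
# change the existing argument
#   -DKUBERNETES_MASTER_URL=https://kubernetes.default.svc.cluster.local
# to
#   -DKUBERNETES_MASTER_URL=https://kubernetes.default.svc:443
# then redeploy the rc so a new pod picks up the change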



Failure -> check haw-debug-pod-logs.txt
Things we tried ->

1) Tried with https://kubernetes.default.svc.cluster.local in -DKUBERNETES_MASTER_URL in the hawkular rc, which fails.

2) Then we ran oc debug <hawkular pod>, and

3) from the debug pod ran "sh-4.2$ curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt --max-time 10 --connect-timeout 10 -H "Authorization: Bearer `cat /var/run/secrets/kubernetes.io/serviceaccount/token`" ${MASTER_URL:-https://kubernetes.default.svc.cluster.local}/api/${KUBERNETES_API_VERSION:-v1}/namespaces/${POD_NAMESPACE}/replicationcontrollers/hawkular-metrics -v"

which fails with "* Peer's certificate issuer has been marked as not trusted by the user."

then ->

sh-4.2$ echo $MASTER_URL
https://kubernetes.default.svc.cluster.local

sh-4.2$ MASTER_URL=https://kubernetes.default.svc:443

and the curl worked->

sh-4.2$ curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt --max-time 10 --connect-timeout 10 -H "Authorization: Bearer `cat /var/run/secrets/kubernetes.io/serviceaccount/token`" ${MASTER_URL:-https://kubernetes.default.svc:443}/api/${KUBERNETES_API_VERSION:-v1}/namespaces/${POD_NAMESPACE}/replicationcontrollers/hawkular-metrics -v


I have added some more details in the attachment.

So is it that the advanced install, while generating certs, failed to take the full hostname https://kubernetes.default.svc.cluster.local (i.e. the ".cluster.local" part) into account and generated the cert only for https://kubernetes.default.svc? If that is the case, is there a way to re-generate the certs and put them into the secrets again for metrics?
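One way to check which hostnames a serving certificate actually covers (a sketch, to be run from a debug pod in the cluster; substitute the master's real address when testing from outside):

echo | openssl s_client -connect kubernetes.default.svc:443 2>/dev/null \
  | openssl x509 -noout -text | grep -A1 'Subject Alternative Name'
# if kubernetes.default.svc.cluster.local is missing from the SAN list, the
# cert itself is the problem; if the SAN list looks wrong entirely, DNS may
# be routing the connection to something other than the master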

Comment 8 Matt Wringe 2017-05-23 21:23:21 UTC
(In reply to Miheer Salunke from comment #6)
> Hi,
> 
> We tried the following ->
> 
> Adding '-DKUBERNETES_MASTER_URL=https://kubernetes.default.svc:443' than
> https://kubernetes.default.svc.cluster.local in the rc of hawkular metrics
> from the web console helps the pod to run without that issue.

So when you set this, does everything start to work then or are you still running into issues?

Normally, "kubernetes.default.svc.cluster.local" is the hostname we use and we don't usually run into issue with it. It looks like in this environment only "https://bugzilla.redhat.com/show_bug.cgi?id=1411427".

> 
> (Not related to this issue but adding it from the web console adds more ''
> quotes on the url)
> 
> 
> 
> Failure -> check haw-debug-pod-logs.txt
> Things we tried ->
> 
> 1) tried with https://kubernetes.default.svc.cluster.local in the
> -DKUBERNETES_MASTER_URL of rc of hawkular which fails
> 
> 2)then we oc debug <hawkular pod> and 
> 
> 3)from the debug pod ran "sh-4.2$ curl --cacert
> /var/run/secrets/kubernetes.io/serviceaccount/ca.crt --max-time 10
> --connect-timeout 10 -H "Authorization: Bearer `cat
> /var/run/secrets/kubernetes.io/serviceaccount/token`"
> ${MASTER_URL:-https://kubernetes.default.svc.cluster.local}/api/
> ${KUBERNETES_API_VERSION:-v1}/namespaces/${POD_NAMESPACE}/
> replicationcontrollers/hawkular-metrics -v"
> 
> which fails with "* Peer's certificate issuer has been marked as not trusted
> by the user."
> 
> then ->
> 
> sh-4.2$ echo $MASTER_URL
> https://kubernetes.default.svc.cluster.local
> 
> sh-4.2$ MASTER_URL=https://kubernetes.default.svc:443
> 
> and the curl worked->
> 
> sh-4.2$ curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
> --max-time 10 --connect-timeout 10 -H "Authorization: Bearer `cat
> /var/run/secrets/kubernetes.io/serviceaccount/token`"
> ${MASTER_URL:-https://kubernetes.default.svc:443}/api/
> ${KUBERNETES_API_VERSION:-v1}/namespaces/${POD_NAMESPACE}/
> replicationcontrollers/hawkular-metrics -v
> 
> 
> I have added some more details in the attachment.
> 
> So is it that the advanced install while generating certs for hawkular
> failed in considering to take the hostname
> https://kubernetes.default.svc.cluster.local i.e ".cluster.local" part and
> just generated the cert using the  https://kubernetes.default.svc If yes
> that is the case then is there a way to re-generate the certs and put them
> in secrets again for metrics?

It has nothing to do with the certs that Hawkular Metrics creates or utilises; it has to do with the certs that the master API endpoint uses.

If the cluster is not using the default 'kubernetes.default.svc.cluster.local' hostname, an alternative one can be specified when installing metrics.
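For an ansible-based install that would be an inventory variable, e.g. (a sketch; the variable name is the one referenced in comment 13 below, and the URL is an example):

openshift_metrics_master_url=https://kubernetes.default.svc:443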

Is this being installed using ansible or the deployer pod?

Comment 9 John Sanda 2017-06-28 15:01:14 UTC
Lowering priority and bumping target release since the NEEDINFO flag has been set for a while now.

Comment 10 Takayoshi Kimura 2017-06-29 00:24:12 UTC
This is caused by the customer's DNS setup: they set the router wildcard domain on their regular host domain, i.e. *.example.com alongside hosts like master.example.com. With "search example.com" in resolv.conf, many DNS queries wrongly resolve to router addresses. We don't support this conflicting wildcard DNS domain / actual host DNS domain setup; the wildcard DNS should be set to a unique domain such as *.apps.example.com.
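To illustrate the failure mode (a sketch; example.com stands in for the customer's domain): pods resolve names through the search path, so a lookup that should fail instead hits the wildcard:

# resolv.conf in the pod contains: search example.com
dig +short kubernetes.default.svc.cluster.local.example.com
# -> returns the router address via the *.example.com wildcard, so the pod
#    connects to the router and sees the router's certificate instead of the
#    master's (hence the "not trusted" curl error in comment 6)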

Comment 11 Gleidson Nascimento 2017-08-06 21:49:13 UTC
I'm facing a similar issue to the one described by the reporter, and I followed the troubleshooting steps and recommendations described here. What I would like to add is that the recommendations here work; however, it takes more than 10 seconds for the master to answer. The startup script 'hawkular-metrics-wrapper.sh' has a hardcoded timeout of 10 seconds, so I am hitting the same service account error even though my environment seems to be set up correctly.

In a debug container, increasing the timeout to 30 seconds seems to fix the issue.
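For anyone who wants to reproduce the check with a longer timeout, a sketch based on the curl from comment 6 (assumes the standard service account mounts are present in the pod):

TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
curl --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
     --max-time 30 --connect-timeout 30 \
     -H "Authorization: Bearer $TOKEN" \
     https://kubernetes.default.svc.cluster.local/api/v1/namespaces/openshift-infra/replicationcontrollers/hawkular-metrics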

Comment 12 Matt Wringe 2017-08-08 16:06:27 UTC
(In reply to Gleidson Nascimento from comment #11)
> I'm facing a similar issue as described by reporter and I followed the
> troubleshooting steps and/or recommendations described here. What I would
> like to add to this issue is that the recommendations here works, however,
> it takes more than 10 seconds for master to answer. The startup script
> 'hawkular-metrics-wrapper.sh' has a hardcoded timeout of 10 seconds, so I am
> facing the same service account error, even though my environment seems to
> be correctly setup. 
> 
> On a Debug container, increasing the timeout to 30 seconds seems to fix the
> issue.

Taking more than 10s to get a response back from the master suggests the master is probably not performing well enough for metrics to function.

Are you sure your master is functioning properly? Why is it taking so long to get a response back from a simple query that should be almost instantaneous?

The original issue here was closed because it was a problem with the DNS setup. If you are experiencing an issue due to a timeout being too short, can you please open a new issue?

Comment 13 Gleidson Nascimento 2017-08-08 21:26:28 UTC
The master is functioning properly, since other items (router, registry, cockpit, kibana, etc.) were successfully deployed using the ansible installer.

Indeed, there is some sort of DNS issue in my deployment, as kubernetes.default.svc is not resolving from inside the pods. To keep troubleshooting this particular issue, I changed the master address to the actual address of the master host. Then I attempted the curl tests as described in this issue and, although they work, the response from the master takes too long to arrive.

I can see that the script's handling of responses has improved from 1.5 to 3.6. Still, the error just mentions an issue reaching the master. Since at this point we know it is either a DNS issue or the master taking too long to answer, it would be good to improve the level of detail in the error message inside the script.
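A hypothetical sketch of the kind of distinction the wrapper could make (MASTER_HOST and STARTUP_TIMEOUT are illustrative names, not the script's actual variables):

if ! getent hosts "$MASTER_HOST" >/dev/null; then
  echo "Error: cannot resolve $MASTER_HOST - check cluster DNS"
elif ! curl -s --max-time "$STARTUP_TIMEOUT" "$MASTER_URL" >/dev/null; then
  echo "Error: $MASTER_URL did not answer within ${STARTUP_TIMEOUT}s - the master may be slow"
fi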

BTW, for whoever arrives here from Google, here's what I did to fix it (see the inventory sketch below):

1. Deployed from scratch using the ansible installer from GitHub on the 3.5 tag (I used openshift-ansible-3.5.110-1).
2. In my hosts file, added openshift_metrics_master_url, openshift_metrics_startup_timeout and openshift_hosted_metrics_deployer_version, set to my actual master host, a 900-second startup timeout, and OCP 3.5 respectively.
3. Deployed just OCP, then later ran byo/openshift-cluster/openshift-metrics.yml to add metrics.

Done this way, I got no DNS issues and metrics work just fine.
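A minimal sketch of the step-2 inventory entries (the master URL is a placeholder, and the exact value format for the deployer version is an assumption):

[OSEv3:vars]
openshift_metrics_master_url=https://master.example.com:8443
openshift_metrics_startup_timeout=900
openshift_hosted_metrics_deployer_version=3.5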

Comment 14 Matt Wringe 2017-08-09 18:59:51 UTC
(In reply to Gleidson Nascimento from comment #13)
> The master is properly functioning since other items were successfully
> deployed - router, registry, cockpit, kibana, etc - using the ansible
> installer.
> 
> Indeed there's some sort of DNS issue in my deployment, as
> kubernetes.default.svc is not resolving from inside the pods. To keep
> troubleshooting this particular issue I've changed the master address to the
> actual address of the master host. Then, I attempted the curl tests as
> described in the issue and, despite the fact it works, takes too long for
> the response from master to arrive. 
> 
> I can see that the script handling of responses has improved from 1.5 to
> 3.6. Still, the error just mention an issue reaching the master. If at this
> point we know it's either because of DNS issues or because of master taking
> too long to answer, would be good to improve the level of details given in
> the error message inside the script.
> 
> BTW, to whoever arrived here from google, here's what I did to fix:
> 
> 1- Deployed from scratch using ansible installer from github on 3.5 tag (I
> used openshift-ansible-3.5.110-1)
> 2 - On my hosts file, I've added openshift_metrics_master_url,
> openshift_metrics_startup_timeout and
> openshift_hosted_metrics_deployer_version, to match my actual master host,
> to 900 seconds of startup and to match OCP 3.5
> 3- Deployed just OCP, then later run the
> byo/openshift-cluster/openshift-metrics.yml to add metrics.
> 
> Doing like this I've got no DNS issues and metrics are just fine.

Can you please open another bugzilla? This one is already closed and was due to a different problem. We need to make sure we are only tracking a single problem in each bugzilla.

