Bug 1805044 - No mem/filesystem/Network Utilization in VM overview
Summary: No mem/filesystem/Network Utilization in VM overview
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Container Native Virtualization (CNV)
Classification: Red Hat
Component: Virtualization
Version: 2.3.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 2.4.0
Assignee: Daniel Belenky
QA Contact: zhe peng
URL:
Whiteboard:
Duplicates: 1829252
Depends On:
Blocks:
 
Reported: 2020-02-20 07:26 UTC by Guohua Ouyang
Modified: 2020-07-28 19:09 UTC
CC List: 19 users

Fixed In Version: hco-bundle-registry-container-v2.3.0-299
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-28 19:09:44 UTC
Target Upstream Version:
Embargoed:


Attachments
no data for mem even guest agent is installed (74.43 KB, image/png)
2020-02-20 10:44 UTC, Guohua Ouyang
Patch to ensure that Prometheus honors our labels (46 bytes, patch)
2020-03-04 12:17 UTC, Daniel Belenky
Patch to openshift-console to ensure correct labels are used in the query (47 bytes, patch)
2020-03-04 12:18 UTC, Daniel Belenky
VM metrics on openshift-console with fixes applied (80.73 KB, image/png)
2020-03-04 12:19 UTC, Daniel Belenky
Fix kubevirt service monitor relabel configs (46 bytes, patch)
2020-03-15 17:47 UTC, Daniel Belenky
vm_status (33.45 KB, image/png)
2020-06-25 19:08 UTC, Israel Pinto


Links
Red Hat Product Errata RHSA-2020:3194 - last updated 2020-07-28 19:09:55 UTC

Description Guohua Ouyang 2020-02-20 07:26:26 UTC
Description of problem:
Create a Fedora VM and let it run for about one hour, then check the Memory/Filesystem/Network Utilization in the VM overview; they are not available. However, CPU Utilization is displayed.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-17-022408

How reproducible:
100%

Steps to Reproduce:
1. Create a VM and let it run for one hour
2. Check the Memory Utilization in the VM overview
3.

Actual results:


Expected results:


Additional info:

Comment 1 Tomas Jelinek 2020-02-20 07:38:06 UTC
Please make sure the guest agent is installed in the guest. This is where these statistics are read from.

Comment 2 Guohua Ouyang 2020-02-20 08:23:43 UTC
What's the guest agent for Linux (Fedora)?

Comment 3 Tomas Jelinek 2020-02-20 09:54:33 UTC
The qemu guest agent - it should be in the Fedora repos.

Comment 4 Guohua Ouyang 2020-02-20 10:44:21 UTC
Created attachment 1664310 [details]
no data for mem even guest agent is installed

Comment 5 Rastislav Wagner 2020-02-21 13:04:06 UTC
The response from prometheus-tenancy returns no data for memory/filesystem/network; however, if I use just the Prometheus endpoint, the data are there.
The queries we run look like:
sum(kubevirt_vmi_storage_traffic_bytes_total{exported_namespace='<%= namespace %>',name='<%= vmName %>'})
sum(kubevirt_vmi_network_traffic_bytes_total{type='rx',exported_namespace='<%= namespace %>',name='<%= vmName %>'})
sum(kubevirt_vmi_network_traffic_bytes_total{type='tx',exported_namespace='<%= namespace %>',name='<%= vmName %>'})

The CPU metric works fine for both prometheus and prometheus-tenancy:
pod:container_cpu_usage:sum{namespace='<%= namespace %>',pod='<%= launcherPodName %>'}
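For illustration, with hypothetical values filled in (namespace 'default', launcher pod 'virt-launcher-fedora-abcd1'), the rendered CPU query would look like:
pod:container_cpu_usage:sum{namespace='default',pod='virt-launcher-fedora-abcd1'}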

Can somebody from the monitoring team take a look, please?

Comment 6 Rastislav Wagner 2020-02-21 13:18:24 UTC
Changing to CNV SSP team.
All kubevirt_* metric results are stored under the openshift-cnv namespace, which probably causes issues when using prometheus-tenancy.

We have a VM called Fedora in the default namespace.
Running a query such as sum(kubevirt_vmi_storage_traffic_bytes_total{exported_namespace='<%= namespace %>',name='<%= vmName %>'}) against prometheus-tenancy for the default namespace does not work, since all metrics are stored in the openshift-cnv namespace. The user, however, does not necessarily have access to the openshift-cnv namespace, so they cannot see the results. The metrics for a particular VM should be stored under the VM's namespace.
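To make the mismatch concrete (hypothetical labels, assuming the Fedora VM above lives in the 'default' namespace), the series Prometheus actually stores looks roughly like:
kubevirt_vmi_storage_traffic_bytes_total{namespace='openshift-cnv',exported_namespace='default',name='fedora'}
while prometheus-tenancy enforces namespace='default' on the query, so nothing matches.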

Comment 7 Daniel Belenky 2020-02-25 13:50:17 UTC
Hi, can you switch to the admin user and check whether you see the metrics, please?

TL;DR: we can't easily solve this right now because OpenShift does not yet fully support monitoring of user workloads.

Long version:
Thanks to Rastislav's comment, I reached out to the monitoring team to better understand
how prometheus-tenancy works in OCP. The thing is, the UI component (openshift-console)
is not querying the "standard" Prometheus endpoint. Instead, it goes through the thanos-querier
component, which checks whether the user has permission to view the namespace from which the
metric originated (openshift-cnv in our case) and rejects the query otherwise. So this
is actually working as expected.

Proposed solution:
We can deploy a ServiceMonitor object in any namespace where our user is creating VMs,
*but* this is not yet supported by OpenShift (tech preview in 4.3): https://docs.openshift.com/container-platform/4.3/monitoring/monitoring-your-own-services.html
It also requires the user to explicitly enable this feature in the cluster config. We need to decide
internally whether we want to start implementing this feature and be an early adopter...

Comment 8 Daniel Belenky 2020-02-25 13:56:15 UTC
Just to add: the fact that the metrics exist on the "standard" Prometheus endpoint is proof that the metrics are being exported and collected properly.

Comment 9 Rastislav Wagner 2020-02-25 14:39:04 UTC
We always use prometheus-tenancy on the VM dashboard, even with admin users.

What we could do in the UI is check whether the user has access to Prometheus and use the non-tenancy endpoint in those cases, which would show the metrics.

Comment 10 Daniel Belenky 2020-02-25 17:37:34 UTC
I don't think that we are supposed to bypass the tenancy. Users with insufficient permissions should not be able to see VMs that they are not supposed to.

What we will do (for now) is deploy a custom cluster role, a group, and a cluster role binding for that group. Cluster admins will have the option to
assign their users to this group, which will allow those users to monitor the VMs. It should be the cluster admin's responsibility to decide whether a user
can create and monitor VMs.

Comment 11 Fabian Deutsch 2020-02-25 21:08:12 UTC
> I don't think that we are supposed to bypass the tenancy. Users with insufficient permissions should not be able to see VMs that they are not supposed to.

+1

Daniel, wrt comment #7: Please share the relevant OCP jira issue which is tracking this feature. OCP should tell us how to solve this situation.
A solution can then be discussed on https://issues.redhat.com/browse/CNV-4112

Comment 12 Guohua Ouyang 2020-02-26 02:17:27 UTC
It should not be a permission issue; the user I logged in with is kubeadmin, which should be a cluster admin.
It works for the VM's pod, but not for the VM:

- On the VM's pod detail page, all metrics are showing.
- On Home -> Overview -> Cluster Utilization, all metrics are showing.
- On VM overview, Memory/Filesystem/Network Utilization are missing.

Comment 13 Guohua Ouyang 2020-02-26 03:46:16 UTC
I'm not so sure whether it's a permission issue now. I ran the master console locally (which does not require a login) against the cluster, and the Memory/Filesystem/Network Utilization shows fine.

Comment 14 Daniel Belenky 2020-02-26 08:43:40 UTC
Can I somehow get access to the cluster that you are running? I'd like to inspect this setup and see how it's configured with regard to permissions, because by default you should not be able to view any resources without performing a login, so I assume that you are getting some default user that has enough permissions to view them.

Comment 15 Radim Hrazdil 2020-02-26 11:06:30 UTC
I believe what Guohua refers to is using the bridge, which is a dev utility that hosts the UI on the local machine but uses the oc client to communicate with the remote cluster.
So you're effectively signed in with the user that you logged in with via oc. Presumably kubeadmin.

Comment 16 Radim Hrazdil 2020-02-26 11:10:10 UTC
Correction - it uses the bearer token generated by oc login for API access.

Comment 17 Fabian Deutsch 2020-02-27 09:43:08 UTC
The important bit is: take a regular production cluster, log in as a user, and check if the metrics are available to a regular user. (Oren Cohen is running a production cluster.)

If the metrics are available for pods to a regular user, then the same should be possible for VMs.

Comment 19 Guohua Ouyang 2020-02-28 04:19:49 UTC
@Daniel, The issue exists on newer OCP 4.4.0-0.nightly-2020-02-26-063555, fyi.

Comment 20 Daniel Belenky 2020-03-01 11:23:05 UTC
(In reply to Guohua Ouyang from comment #19)
> @Daniel, The issue exists on newer OCP 4.4.0-0.nightly-2020-02-26-063555,
> fyi.

What version of thanos-proxy does this cluster run? May I have access to this environment?

I can't reproduce this issue on CNV's production cluster, which is running 4.4.0-0.nightly-2020-02-28-200834 (cnv.engineering.redhat.com).
@Fabian, this is the cluster that Oren runs.

I also wasn't able to reproduce it locally using the OCP kubevirt-ci provider, which runs 4.4.0-0.nightly-2020-02-22-102956

Comment 21 Daniel Belenky 2020-03-01 16:56:23 UTC
Finally, the metrics disappeared and I was able to reproduce. However, I still don't have a satisfying answer as to why the metrics showed up at first.
Maybe the UI team will be able to shed more light on how they query Prometheus? When this issue does not reproduce, the console does not even send the namespace query param... maybe it has several authentication flows?

The namespace query parameter that openshift-console computes is wrong - it computes it from the VMI object. Instead, it should use the namespace in which KubeVirt was deployed.
Here is the console code that is apparently responsible for forming the query template: https://github.com/openshift/console/blob/master/frontend/packages/kubevirt-plugin/src/components/dashboards-page/vm-dashboard/queries.ts#L44
and here is the code that actually renders it: https://github.com/openshift/console/blob/ba3408ba2e9b28d8c4151f5d1908bae946324bd9/frontend/packages/kubevirt-plugin/src/components/dashboards-page/vm-dashboard/vm-utilization.tsx#L67
and the namespace is computed here: https://github.com/openshift/console/blob/ba3408ba2e9b28d8c4151f5d1908bae946324bd9/frontend/packages/kubevirt-plugin/src/components/dashboards-page/vm-dashboard/vm-utilization.tsx#L53

What happens next is that the console sends the query to the thanos-proxy endpoint, which authenticates and **passes the query to prom-label-proxy**, which enforces that the namespace from the console's query argument is explicitly
set on the query to Prometheus (and the metric does not exist in the specified namespace at all... so it returns nothing).

If we take the same query but replace the namespace with the correct value, the data is returned as expected.
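For example, with a hypothetical VM 'fedora' in the 'default' namespace, the two variants would look roughly like:
# As enforced by the tenancy proxy - the metric is not stored under the VM's namespace, so this returns nothing:
sum(kubevirt_vmi_network_traffic_bytes_total{type='rx',namespace='default',exported_namespace='default',name='fedora'})
# Same query with the namespace replaced by the value under which the metrics are actually stored - data is returned:
sum(kubevirt_vmi_network_traffic_bytes_total{type='rx',namespace='openshift-cnv',exported_namespace='default',name='fedora'})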


To summarize, there are several issues here:

1. There is a bug in openshift-console - the namespace query argument should be calculated differently.
2. In KubeVirt, we should deploy a group that has the required permissions to view metrics, document it properly, and let cluster admins decide whether they want to let users monitor VMs.
* This group is a workaround until https://issues.redhat.com/browse/MON-845 is in production.

Do we want to split this bug into separate bugs? UI and Virt?

Comment 22 Rastislav Wagner 2020-03-02 08:33:36 UTC
@Daniel how can we tell from the UI in which namespace CNV/KubeVirt is installed? Is it safe to assume that it's 'openshift-cnv' if we have the branded UI, and the 'openshift' namespace for OKD?

Comment 23 Rastislav Wagner 2020-03-02 08:36:39 UTC
I can't search for CNV pods as a regular user who can access only some namespaces.

Comment 24 Daniel Belenky 2020-03-02 10:05:39 UTC
(In reply to Rastislav Wagner from comment #22)
> @Daniel how can we tell from the UI in which namespace CNV/KubeVirt is
> installed? Is it safe to assume that it's 'openshift-cnv' if we have the
> branded UI, and the 'openshift' namespace for OKD?

I don't think it's safe to assume that, because this is configurable by users.
What we could do, in theory, is label VMI objects with the handler's namespace... I'm not sure about the security implications, though.
It would allow users who don't have access to view the namespace in which CNV was deployed to know that it exists. I don't think that's
so bad, but it's not great either... On the other hand, the user will learn that this namespace exists anyway when the query is
issued by the console, so I think this change is safe enough.

Let me bring this up in the next virt team meeting and I'll update here...

Comment 25 Daniel Belenky 2020-03-04 12:17:26 UTC
Created attachment 1667473 [details]
Patch to ensure that Prometheus honors our labels

Comment 26 Daniel Belenky 2020-03-04 12:18:31 UTC
Created attachment 1667474 [details]
Patch to openshift-console to ensure correct labels are used in the query

Comment 27 Daniel Belenky 2020-03-04 12:19:22 UTC
Created attachment 1667475 [details]
VM metrics on openshift-console with fixes applied

Comment 30 Daniel Belenky 2020-03-15 17:47:58 UTC
Created attachment 1670361 [details]
Fix kubevirt service monitor relabel configs

Comment 31 Daniel Belenky 2020-03-15 17:48:57 UTC
The last PR, https://github.com/kubevirt/kubevirt/pull/3146, is awaiting code review.
It also features a new test case to ensure that this functionality will not break.

Comment 32 Tomas Jelinek 2020-04-30 12:05:03 UTC
*** Bug 1829252 has been marked as a duplicate of this bug. ***

Comment 33 Yaacov Zamir 2020-05-13 15:59:40 UTC
@Daniel hi, https://github.com/kubevirt/kubevirt/pull/3146 is merged; should this move to MODIFIED?

Comment 34 Daniel Belenky 2020-05-13 16:04:26 UTC
Forgot to change the status. Thank you!

Comment 35 Radim Hrazdil 2020-06-02 08:06:37 UTC
@Daniel Hello, since BZ 1810002 depends on this and is already ON_QA, shouldn't this be moved to ON_QA as well?

Comment 36 Daniel Belenky 2020-06-02 08:31:44 UTC
@Radim usually QE takes the BZ when they start testing it AFAIK... I moved it to MODIFIED when the code was merged.

Comment 37 Israel Pinto 2020-06-25 19:07:30 UTC
Verified with:
CNV 2.4
KubeVirt: v0.30.1
VM status attached

Comment 38 Israel Pinto 2020-06-25 19:08:11 UTC
Created attachment 1698838 [details]
vm_status

Comment 41 errata-xmlrpc 2020-07-28 19:09:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3194

