Description of problem:
Create a Fedora VM and let it run for about one hour, then check Memory/Filesystem/Network Utilization in the VM overview; they are not available. However, CPU Utilization is displayed.

Version-Release number of selected component (if applicable):
4.4.0-0.nightly-2020-02-17-022408

How reproducible:
100%

Steps to Reproduce:
1. Create a VM and let it run for one hour
2. Check Memory Utilization in the VM overview

Actual results:
Memory/Filesystem/Network Utilization show no data.

Expected results:
All utilization metrics are displayed in the VM overview.

Additional info:
Please make sure the guest agent is installed in the guest. That is where these statistics are read from.
What's the guest agent for Linux (Fedora)?
The qemu guest agent - it should be in the Fedora repos.
Created attachment 1664310 [details] no data for memory even though the guest agent is installed
The response from prometheus-tenancy returns no data for memory/filesystem/network; however, if I use the plain prometheus endpoint, the data are there. The queries we run look like:

sum(kubevirt_vmi_storage_traffic_bytes_total{exported_namespace='<%= namespace %>',name='<%= vmName %>'})
sum(kubevirt_vmi_network_traffic_bytes_total{type='rx',exported_namespace='<%= namespace %>',name='<%= vmName %>'})
sum(kubevirt_vmi_network_traffic_bytes_total{type='tx',exported_namespace='<%= namespace %>',name='<%= vmName %>'})

The CPU metric works fine for both prometheus and prometheus-tenancy:

pod:container_cpu_usage:sum{namespace='<%= namespace %>',pod='<%= launcherPodName %>'}

Can somebody from the monitoring team take a look, please?
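For illustration, here is a minimal sketch of how query templates like the ones above could be rendered before being sent to the Prometheus endpoint. The `render` helper and the substituted values ("default", "fedora-vm") are hypothetical examples, not the actual console code.

```python
# Sketch: rendering EJS-style <%= key %> placeholders in a query template.
# The helper and the example values are illustrative assumptions.
def render(template: str, **params: str) -> str:
    """Substitute each <%= key %> placeholder with its value."""
    for key, value in params.items():
        template = template.replace("<%= " + key + " %>", value)
    return template

storage_query = render(
    "sum(kubevirt_vmi_storage_traffic_bytes_total{"
    "exported_namespace='<%= namespace %>',name='<%= vmName %>'})",
    namespace="default",   # hypothetical VM namespace
    vmName="fedora-vm",    # hypothetical VM name
)
print(storage_query)
# sum(kubevirt_vmi_storage_traffic_bytes_total{exported_namespace='default',name='fedora-vm'})
```

The rendered query is what then goes to either the plain prometheus endpoint (where it returns data) or the prometheus-tenancy endpoint (where it comes back empty, per the comment above).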
Changing to the CNV SSP team. All kubevirt_* metric results are stored under the openshift-cnv namespace, which probably causes issues when using prometheus-tenancy.

We have a VM called Fedora in the default namespace. Running a query such as sum(kubevirt_vmi_storage_traffic_bytes_total{exported_namespace='<%= namespace %>',name='<%= vmName %>'}) against prometheus-tenancy for the default namespace doesn't work, since all metrics are stored in the openshift-cnv namespace. The user, however, does not have to have access to the openshift-cnv namespace, so they cannot see the results. The metrics for a particular VM should be stored under the VM's namespace.
Hi, can you switch to an admin user and check whether you see the metrics, please?

TL;DR: we can't easily solve this right now because OpenShift does not yet fully support monitoring of user workloads.

Long version: Thanks to Rastislav's comment, I reached out to the monitoring team to better understand how prometheus-tenancy works in OCP. The thing is, the UI component (openshift-console) is not querying the "standard" Prometheus endpoint. Instead, it goes through the thanos-querier component, which checks whether the user has permission to view the namespace in which the metric originated (openshift-cnv in our case) and rejects the query otherwise. So this is actually working as expected.

Proposal for a solution: We can deploy a ServiceMonitor object in any arbitrary namespace where our user is creating VMs, *but* this is not yet supported by OpenShift (tech preview in 4.3): https://docs.openshift.com/container-platform/4.3/monitoring/monitoring-your-own-services.html and it requires the user to explicitly enable this feature in their cluster config. We need to decide internally whether we want to start implementing this feature and be an early adopter...
Just to add: the fact that the metrics exist on the "standard" Prometheus endpoint proves that the metrics are being exported and collected properly.
We always use prometheus-tenancy on the VM dashboard, even with admin users. What we could do on the UI is check whether the user has access to prometheus and use the non-tenancy endpoint in those cases, which would show the metrics.
I don't think that we are supposed to bypass the tenancy. Users with insufficient permissions should not be able to see VMs that they are not supposed to see.

What we will do (for now) is deploy a custom cluster-role, a group, and a cluster-role-binding to that group. Cluster admins will have the option to assign their users to this group, which will allow them to monitor the VMs. It should be the cluster admin's responsibility to decide whether a user can or can't create and monitor VMs.
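A rough sketch of what the proposed cluster-role, group, and cluster-role-binding could look like. All names here (kubevirt-monitoring, kubevirt-monitors) and the exact rule contents are hypothetical illustrations, not the actual objects KubeVirt ships; the precise resources and verbs the tenancy proxy checks are an assumption.

```yaml
# Sketch only - illustrative RBAC objects in the spirit of the proposal above.
# Names and rules are assumptions, not the shipped manifests.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubevirt-monitoring          # hypothetical name
rules:
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get"]                   # example rule; actual requirements may differ
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubevirt-monitoring          # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kubevirt-monitoring
subjects:
  - apiGroup: rbac.authorization.k8s.io
    kind: Group
    name: kubevirt-monitors          # cluster admins add users to this group
```

Binding the role to a group rather than to individual users matches the comment above: the cluster admin opts users in by adding them to the group.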
> I don't think that we are supposed to bypass the tenancy. Users with insufficient permissions should not be able to see VMs that they are not supposed to. +1 Daniel, wrt comment #7: Please share the relevant OCP jira issue which is tracking this feature. OCP should tell us how to solve this situation. A solution can then be discussed on https://issues.redhat.com/browse/CNV-4112
It should not be a permission issue; the user I logged in with is kubeadmin, which should be a cluster admin. It works for the VM's pod, but not the VM:
- On the VM's pod detail page, all metrics are showing.
- On Home -> Overview -> Cluster Utilization, all metrics are showing.
- On the VM overview, Memory/Filesystem/Network Utilization are missing.
I'm not so sure it's a permission issue now. I ran the master console locally (which does not require a login) against the cluster, and Memory/Filesystem/Network Utilization show up fine.
Can I somehow get access to the cluster that you are running? I'd like to inspect this setup and see how it's configured with regards to permissions because by default, you should not be able to view any resources w/o performing a login so I assume that you are getting some default user that has enough permissions to view it.
I believe what Guohua refers to is using the bridge, which is a dev utility that hosts the UI on a local machine but uses the oc client to communicate with the remote cluster. So you're effectively signed in with the user that you logged in with via oc. Presumably kubeadmin.
Correction - it uses the bearer token generated by oc login for API access.
The important bit is: Take a regular, production cluster, login as a user, and check if the metrics are available to a regular user. (Oren Cohen is running a production cluster). If the metrics are available for pods to a regular user, then the same should be possible for VMs.
@Daniel, The issue exists on newer OCP 4.4.0-0.nightly-2020-02-26-063555, fyi.
(In reply to Guohua Ouyang from comment #19)
> @Daniel, The issue exists on newer OCP 4.4.0-0.nightly-2020-02-26-063555,
> fyi.

What version of thanos-proxy does this cluster run? May I have access to this environment? I can't reproduce this issue on CNV's production cluster, which is running 4.4.0-0.nightly-2020-02-28-200834 (cnv.engineering.redhat.com).

@Fabian, this is the cluster that Oren runs. I also wasn't able to reproduce it locally using the OCP kubevirt-ci provider, which runs 4.4.0-0.nightly-2020-02-22-102956.
Finally, the metrics disappeared and I was able to reproduce. Though, I still don't have a satisfying answer for why the metrics showed up at first. Maybe the UI team will be able to shed more light on how they query Prometheus? When this issue does not reproduce, the console does not even send the namespace query param... maybe it has several authentication flows?

The namespace query parameter that openshift-console computes is wrong - it computes it from the VMI object. Instead, it should use the namespace in which kubevirt was deployed.

Here is the console code that apparently is responsible for forming the query template:
https://github.com/openshift/console/blob/master/frontend/packages/kubevirt-plugin/src/components/dashboards-page/vm-dashboard/queries.ts#L44
here is the code that actually renders it:
https://github.com/openshift/console/blob/ba3408ba2e9b28d8c4151f5d1908bae946324bd9/frontend/packages/kubevirt-plugin/src/components/dashboards-page/vm-dashboard/vm-utilization.tsx#L67
and here is where the namespace is computed:
https://github.com/openshift/console/blob/ba3408ba2e9b28d8c4151f5d1908bae946324bd9/frontend/packages/kubevirt-plugin/src/components/dashboards-page/vm-dashboard/vm-utilization.tsx#L53

What happens next is that the console sends the query to the thanos-proxy endpoint, which authenticates and **passes the query to prometheus-label-proxy**, which enforces the namespace from the console's query argument to be explicitly set on the query to Prometheus (and the metric does not exist in the specified namespace at all... so it returns nothing). If we take the same query but replace the namespace with the correct value, the data is returned as expected.

To summarize, there are several issues here:
1. There is a bug in openshift-console - the namespace query argument should be calculated differently.
2.
In KubeVirt, we should deploy a group that has the requested permissions to view metrics, document it properly, and let cluster-admins decide whether they want to let users monitor VMs or not.

* this group is a workaround until https://issues.redhat.com/browse/MON-845 is in production.

Do we want to split this bug into separate bugs? UI and Virt?
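The namespace-enforcement step described in the previous comment can be modeled roughly as follows. This is an illustration only, not the actual prometheus-label-proxy implementation; the `enforce_namespace` helper and the example VM name are assumptions.

```python
# Simplified illustration (NOT the real prometheus-label-proxy code):
# the proxy forces the namespace label taken from the console's
# ?namespace= query parameter onto the selector before the query
# reaches Prometheus.
def enforce_namespace(query: str, namespace: str) -> str:
    # Naive injection into the first label selector, for illustration only.
    return query.replace("{", "{namespace='" + namespace + "',", 1)

# The console computed namespace="default" from the VMI object, but the
# kubevirt_* series live under namespace="openshift-cnv", so the enforced
# query matches nothing and the dashboard shows no data.
query = "sum(kubevirt_vmi_network_traffic_bytes_total{type='rx',name='fedora-vm'})"
print(enforce_namespace(query, "default"))
# sum(kubevirt_vmi_network_traffic_bytes_total{namespace='default',type='rx',name='fedora-vm'})
```

This mirrors the observation above: replacing the enforced namespace with the correct value makes the same query return data.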
@Daniel hi, how can we tell from the UI in which namespace CNV/KubeVirt is installed? Is it safe to assume that it's 'openshift-cnv' if we have a branded UI, and the 'openshift' namespace for OKD?
I can't search for CNV pods as a regular user who can access only some namespaces.
(In reply to Rastislav Wagner from comment #22)
> @Daniel how can we tell from UI in which namespace the CNV/Kubevirt is
> installed ? Is it safe to assume that its 'openshift-cnv' if we have branded
> UI and 'openshift' namespace for okd?

I don't think it's safe to assume that, because this is configurable by users. What we could do, in theory, is label VMI objects with the handler's namespace... I'm not sure about the security implications, though. It would allow users who don't have access to view the namespace in which CNV was deployed to know it exists. I don't think that's that bad, but it's not that good either... On the other hand, the user will learn that this namespace exists anyway when the query is issued by the console, so I think this change is safe enough. Let me bring this up in the next virt team meeting and I'll update here...
Created attachment 1667473 [details] Patch to ensure that Prometheus honors our labels
Created attachment 1667474 [details] Patch to openshift-console to ensure correct labels are used in the query
Created attachment 1667475 [details] VM metrics on openshift-console with fixes applied
Created attachment 1670361 [details] Fix kubevirt service monitor relabel configs
https://github.com/kubevirt/kubevirt/pull/3146 - the last PR is awaiting code review. It also features a new test case to ensure that this functionality will not break.
*** Bug 1829252 has been marked as a duplicate of this bug. ***
@Daniel hi, https://github.com/kubevirt/kubevirt/pull/3146 is merged; should this move to MODIFIED?
Forgot to change the status. Thank you!
@Daniel Hello, since BZ 1810002 depends on this and is already ON_QA, shouldn't this be moved to ON_QA as well?
@Radim usually QE takes the BZ when they start testing it AFAIK... I moved it to MODIFIED when the code was merged.
Verified with:
CNV 2.4
KubeVirt: v0.30.1
Attached VM status
Created attachment 1698838 [details] vm_status
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3194