Bug 2271135

Summary: [cee/sd][ceph-dashboard] Grafana portal doesn't show Graphs in the RGW sync overview page
Product: [Red Hat Storage] Red Hat Ceph Storage Reporter: avan <athakkar>
Component: Ceph-Dashboard Assignee: avan <athakkar>
Status: CLOSED ERRATA QA Contact: Chaithra <ckulal>
Severity: medium Docs Contact: Akash Raj <akraj>
Priority: unspecified    
Version: 6.1CC: aakobi, akraj, asriram, athakkar, ceph-eng-bugs, cephqe-warriors, ckulal, jolmomar, nia, trchakra, tserlin, vpapnoi
Target Milestone: ---   
Target Release: 7.1   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: ceph-18.2.1-148.el9cp Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 2270223 Environment:
Last Closed: 2024-06-13 14:30:20 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 2270223, 2304284    
Bug Blocks: 2267614, 2271133, 2298578, 2298579    

Description avan 2024-03-22 22:06:43 UTC
+++ This bug was initially created as a clone of Bug #2270223 +++

Description of problem:

On RHCS 6.1z4 and lower versions, the Ceph Grafana dashboard doesn't show the graphs on the RGW Sync Overview page; the panels complain about "No Data". If I set `mgr/prometheus/exclude_perf_counters` to false, the graphs get populated. The OSD Overview page also fails to show most of its graphs (OSD read latency, write latency, highest read latency, highest write latency, etc.), and this doesn't change when `mgr/prometheus/exclude_perf_counters` is set to false.
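
For reference, the parameter mentioned above can be inspected, set, and reset with the standard `ceph config` commands; a minimal sketch (no endorsement of the workaround implied):

# ceph config get mgr mgr/prometheus/exclude_perf_counters
# ceph config set mgr mgr/prometheus/exclude_perf_counters false
Reset to the default (true) by removing the override:
# ceph config rm mgr mgr/prometheus/exclude_perf_counters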

So it looks like the ceph-exporter, which was introduced in RHCS 6.1, is not working properly.

Version-Release number of selected component (if applicable):
17.2.6-196.el9cp

How reproducible:

It happens consistently, in both the customer environment and in my lab setup.


Steps to Reproduce:

- Install two RHCS 6.1z2 clusters.
- Set up RGW multisite between them.
- Check the RGW Sync Overview page in the Grafana portal for both sites. The graphs show "No Data".
- Set the parameter `mgr/prometheus/exclude_perf_counters` to false.
- Check the graphs on the RGW Sync Overview page again. They now show data.
- Reset the parameter.
- Upgrade the cluster to the latest RHCS 6.1z4 and check the graphs. They show "No Data" again.
- Set the parameter to false again. The graphs are now populated properly.


Actual results:

Graphs are showing "No data"

Expected results:

Graphs should be populated with actual data

Additional info:

A similar issue was observed in BZ#2224967, where Engineering provided the workaround of setting the parameter mgr/prometheus/exclude_perf_counters to false, and the fix was supposed to be introduced in RHCS 6.1z2. But it looks like the issue is still not fixed, or only partially fixed.

--- Additional comment from Tridibesh Chakraborty on 2024-03-19 04:09:39 UTC ---

Impacted customer: ENEDIS
Account No: 1571872
Strategic: Yes

SFDC case#: 03766106

--- Additional comment from Tridibesh Chakraborty on 2024-03-19 04:22:25 UTC ---



--- Additional comment from Tridibesh Chakraborty on 2024-03-19 04:26:06 UTC ---



--- Additional comment from Tridibesh Chakraborty on 2024-03-19 04:27:02 UTC ---



--- Additional comment from Tridibesh Chakraborty on 2024-03-19 04:29:01 UTC ---



--- Additional comment from Tridibesh Chakraborty on 2024-03-19 04:39:44 UTC ---

Hi,

If I set `exclude_perf_counters` to true, which is the default on RHCS 6.1z2, I can see the graphs under the OSD Overview section, but the graphs under the RGW Sync Overview are missing; see attachment# 2022430 [details]. If I set exclude_perf_counters to false, the RGW Sync Overview section shows data, but the OSD Overview section does not; see attachment# 2022428 [details] & 2022429. So please let me know whether we are hitting a bug here and the ceph-exporter is not able to send the data properly.

Thanks,
Tridibesh

--- Additional comment from avan on 2024-03-19 06:45:55 UTC ---

Hi @Tridibesh ,

Firstly, I recommend avoiding setting exclude_perf_counters to false, as the default setting is already true. This adjustment shouldn't be considered a solution unless there are clear indications that metrics aren't being sourced from the ceph-exporter. Upon inspecting the "Distribution of PGs per OSD" panel, it's evident that the data is sourced from the ceph_osd_numpg metrics provided by the ceph-exporter. Hence, there doesn't seem to be any missing metric issue with the exporter; rather, it might be related to queries or Grafana panels.

Upon investigation, I suggest maintaining exclude_perf_counters set to true. Setting it to false implies fetching daemon performance metrics from the Prometheus module as well. However, since the ceph-exporter already exposes these metrics, enabling this option may result in duplicate metric entries from different sources. This duplication could lead to issues with the OSD Grafana panels.

Regarding the RGW Sync Overview dashboard issue, could you please provide the values of all metrics starting with ceph_data_sync_from_zone_ from the Prometheus UI? This information will help in further troubleshooting the problem.
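
If it is easier than the UI, the same values can be pulled with the Prometheus HTTP API; a hedged sketch (hostname is a placeholder, port 9095 is the default used elsewhere in this bug):

# curl -s 'http://<prometheus-host>:9095/api/v1/label/__name__/values' | jq -r '.data[]' | grep '^ceph_data_sync_from_zone_'
# curl -s 'http://<prometheus-host>:9095/api/v1/query' --data-urlencode 'query={__name__=~"ceph_data_sync_from_zone_.*"}' | jq '.data.result'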

--- Additional comment from Tridibesh Chakraborty on 2024-03-19 08:31:30 UTC ---

Hi Avan,

I asked the customer to hold off on making any changes to exclude_perf_counters. It is still set to true in the customer environment.

From the Prometheus portal I can see there are graphs for the metrics starting with `ceph_data_sync_from_zone_`, and I have attached a screenshot of them. Please let me know if this is the information you are looking for.

Thanks,
Tridibesh

--- Additional comment from Tridibesh Chakraborty on 2024-03-19 08:32:36 UTC ---



--- Additional comment from Juan Miguel Olmo on 2024-03-19 10:25:18 UTC ---

Hi @trchakra

From the description you provided:
"""
On RHCS 6.1z4 and lower versions, Ceph Grafana dashboard doesn't show the Graphs in RGW sync overview page as it complains about "No Data". If I set `mgr/prometheus/exclude_perf_counters` as false, then the Graphs are getting populated. Also OSD overview page doesn't show most of the graphs like OSD read latency, write latency, highest read latency, highest write latency etc. This doesn't change with setting `mgr/prometheus/exclude_perf_counters` as false
"""

It could be related to some issue in the ceph exporter. Could you verify the following points:

1. Verify Ceph exporter is running in all the cluster hosts:

There must be the same number of ceph exporters as nodes (in this example, 3/3):
# ceph orch ls 
...
node-exporter              ?:9100           3/3  14s ago    4w   *      
...

Another command to determine on which nodes the ceph exporter is running:
# ceph orch ps
...
node-exporter.chatest-node-00  chatest-node-00.cephlab.com  *:9100            running (2m)     2m ago   4w    15.7M        -  1.4.0            f543b9528048  4cf47e8d40ea  
node-exporter.chatest-node-01  chatest-node-01.cephlab.com  *:9100            running (2m)     2m ago   4w    20.5M        -  1.4.0            f543b9528048  1047d5d914f9  
node-exporter.chatest-node-02  chatest-node-02.cephlab.com  *:9100            running (2m)     2m ago   4w    20.7M        -  1.4.0            f543b9528048  c565146f59cf  
...


2. Verify there are no errors in the ceph exporter:
Connect to the node where the ceph exporter is not running, or to the node where we want to verify the ceph exporter behavior:

locate the ceph exporter service:
# systemctl list-units | grep @ceph-exporter
  ceph-6a765cea-ca9a-11ee-8d6a-5254009cc347.service                                  loaded active running   Ceph ceph-exporter.chatest-node-00 for 6a765cea-ca9a-11ee-8d6a-5254009cc347

Verify the service state (any error starting the service or preventing it from running will appear here):
# systemctl status ceph-6a765cea-ca9a-11ee-8d6a-5254009cc347.service


Verify the logs (any error getting metrics from the ceph daemons on the node or providing metrics to Prometheus will appear here):
# journalctl -f -u ceph-6a765cea-ca9a-11ee-8d6a-5254009cc347.service


3. If there are no errors in the ceph exporter, any other trouble with metrics clients (like Grafana, Ceph Dashboard, etc.) must be investigated using Prometheus.

- Try to obtain the metrics using the Prometheus interface directly.
Use cephadm to locate the Prometheus server:
# ceph orch ps | grep prometheus  
prometheus.chatest-node-00     chatest-node-00.cephlab.com  *:9095            running (15m)     4m ago   4w     109M        -  2.39.1           a9984c8b5d29  a36c909d3832 

Connect to Prometheus at: http://chatest-node-00.cephlab.com:9095
Note: replace the node name with the IP if you do not have DNS for your nodes.

- Verify that all ceph exporters appear as "scrape targets" in the Prometheus configuration.
You must have the same number of scrape targets as nodes.
Note: you can also see this information in the Prometheus UI (Status/Targets).
# curl -s  http://chatest-node-00.cephlab.com:9095/api/v1/targets |  jq '.data.activeTargets[] | "\(.labels.job): \(.scrapeUrl)"' | grep ceph-exporter
"ceph-exporter: http://192.168.122.85:9926/metrics"
"ceph-exporter: http://192.168.122.113:9926/metrics"
"ceph-exporter: http://192.168.122.99:9926/metrics"

- Check for scraping errors using the Prometheus UI or the CLI.
# curl -s  http://chatest-node-00.cephlab.com:9095/api/v1/targets

- Using the Prometheus ui, verify you can get the ceph metrics.
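
For example, an instant query for ceph_osd_numpg (a metric already confirmed earlier in this bug to come from the exporter) via the HTTP API; a sketch, reusing the same hostname:

# curl -s 'http://chatest-node-00.cephlab.com:9095/api/v1/query?query=ceph_osd_numpg' | jq '.data.result | length'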

Please verify. If everything is fine, we can rule out the ceph exporter and start looking for the error in the Ceph Dashboard.

--- Additional comment from Juan Miguel Olmo on 2024-03-19 10:47:05 UTC ---

@trchakra

In my previous command examples, the output I copied was not the right one; here is the correct output:
1. Verify Ceph exporter is running in all the cluster hosts:

There must be the same number of ceph exporters as nodes (in this example, 3/3):
# ceph orch ls 
...
ceph-exporter                               3/3  101s ago   4w   *         
...

Another command to determine on which nodes the ceph exporter is running:
# ceph orch ps
...
ceph-exporter.chatest-node-00  chatest-node-00.cephlab.com                    running (74m)     2m ago   4w    23.6M        -  18.2.1-40.el9cp  4530dc529292  c3a0d2efebeb  
ceph-exporter.chatest-node-01  chatest-node-01.cephlab.com                    running (74m)     2m ago   4w    20.4M        -  18.2.1-40.el9cp  4530dc529292  ca4592f01e1b  
ceph-exporter.chatest-node-02  chatest-node-02.cephlab.com                    running (74m)     2m ago   4w    11.1M        -  18.2.1-40.el9cp  4530dc529292  4c24d5545d62  
...

--- Additional comment from avan on 2024-03-19 11:16:39 UTC ---

(In reply to Juan Miguel Olmo from comment #10)
> Hi @trchakra
> 
> By the description you provided us:
> """
> On RHCS 6.1z4 and lower versions, Ceph Grafana dashboard doesn't show the
> Graphs in RGW sync overview page as it complains about "No Data". If I set
> `mgr/prometheus/exclude_perf_counters` as false, then the Graphs are getting
> populated. Also OSD overview page doesn't show most of the graphs like OSD
> read latency, write latency, highest read latency, highest write latency
> etc. This doesn't change with setting `mgr/prometheus/exclude_perf_counters`
> as false
> """
> 
> It could be related to some issue in ceph exporter. Could you verify the
> following points:
> 
> 1. Verify Ceph exporter is running in all the cluster hosts:
> 
> It must be the same number of ceph exporters than nodes, in this example 3/3)
> # ceph orch ls 
> ...
> node-exporter              ?:9100           3/3  14s ago    4w   *      
> ...
> 
> Another command to determine in which nodes ceph exporter is running:
> # ceph orch ps
> ...
> node-exporter.chatest-node-00  chatest-node-00.cephlab.com  *:9100          
> running (2m)     2m ago   4w    15.7M        -  1.4.0           
> f543b9528048  4cf47e8d40ea  
> node-exporter.chatest-node-01  chatest-node-01.cephlab.com  *:9100          
> running (2m)     2m ago   4w    20.5M        -  1.4.0           
> f543b9528048  1047d5d914f9  
> node-exporter.chatest-node-02  chatest-node-02.cephlab.com  *:9100          
> running (2m)     2m ago   4w    20.7M        -  1.4.0           
> f543b9528048  c565146f59cf  
> ...
> 
> 
> 2. Verify no errors in Ceph exporter:
> Connect to the node where ceph exporter is not running or the node where we
> want to verify ceph exporter behavior:
> 
> locate the ceph exporter service:
> # systemctl list-units | grep @ceph-exporter
>  
> ceph-6a765cea-ca9a-11ee-8d6a-5254009cc347.
> service                                  loaded active running   Ceph
> ceph-exporter.chatest-node-00 for 6a765cea-ca9a-11ee-8d6a-5254009cc347
> 
> Verify the service state: (any error strating or preventing the service
> running will appear here)
> # systemctl status
> ceph-6a765cea-ca9a-11ee-8d6a-5254009cc347.
> service
> 
> 
> Verify the logs: (any error getting metrics from ceph node daemons or
> providing metrics to propmetheus will appear here)
> # journalctl -f -u
> ceph-6a765cea-ca9a-11ee-8d6a-5254009cc347.
> service
> 
> 
> 3. If no errors in ceph exporter, any other trouble with metrics clients
> (like grafana, Ceph dashboard, etc), must be investigated using Prometheus.
> 
> - Try to obtain the metrics using the prometheus interface directly
> Use cephadm to locate the prometheus server:
> # ceph orch ps | grep prometheus  
> prometheus.chatest-node-00     chatest-node-00.cephlab.com  *:9095          
> running (15m)     4m ago   4w     109M        -  2.39.1          
> a9984c8b5d29  a36c909d3832 
> 
> Connect to prometheus in: http://chatest-node-00.cephlab.com:9095 
> Note: replace the node name for the ip if you do not have dns for your nodes
> 
> - Verify that all ceph exporters appears as "scrape targets" in the
> prometheus configuration
> You must have the same name of scrape targets as nodes
> Note, you can see also this information using the prometheus UI
> (status/targets)
> # curl -s  http://chatest-node-00.cephlab.com:9095/api/v1/targets |  jq
> '.data.activeTargets[] | "\(.labels.job): \(.scrapeUrl)"' | grep
> ceph-exporter
> "ceph-exporter: http://192.168.122.85:9926/metrics"
> "ceph-exporter: http://192.168.122.113:9926/metrics"
> "ceph-exporter: http://192.168.122.99:9926/metrics"
> 
> - Verify errors in scraping using the UI interface or the CLI.
> # curl -s  http://chatest-node-00.cephlab.com:9095/api/v1/targets
> 
> - Using the Prometheus ui, verify you can get the ceph metrics.
> 
> Please verify. If everything well, we can discard ceph exporter and we
> should start to search the error in the ceph dashboard.

Let's refrain from making assumptions here, especially since the user has already confirmed the behavior of the Grafana panels, indicating that the exporter is indeed running and delivering metrics correctly. It seems the issue lies more in the filtering of metrics in the RGW Sync Overview than in the exporter itself. To prevent any further confusion for the user, it's best to focus our attention on resolving the filtering issue. Thank you.

--- Additional comment from avan on 2024-03-19 11:20:32 UTC ---

(In reply to Tridibesh Chakraborty from comment #8)
> Hi Avan,
> 
> I asked customer to hold on from making any changes on the
> exclude_perf_counters. It is still set as true in customer environment. 
> 
> From the Prometheus portal I can see there are graphs for the metrics
> starting with `ceph_data_sync_from_zone_` and I have attached their
> screenshot. Please let me know if you are looking for this same information. 
> 
> Thanks,
> Tridibesh

Thank you for providing the information. I noticed that all metrics have the job label set to "ceph". Have you set exclude_perf_counters to false? If so, I would recommend setting it to true and then verifying if these metrics are still being exposed. Additionally, could you please share a screenshot of the table view instead of the graph view in the Prometheus UI?

--- Additional comment from Tridibesh Chakraborty on 2024-03-20 02:09:32 UTC ---

Hi Avan,

Yes, I had exclude_perf_counters set to false. Now I have set it to true and checked the Prometheus portal for the metrics starting with `ceph_data_sync_from_zone_`. I noticed it shows "Empty query result".

Thanks,
Tridibesh

--- Additional comment from Tridibesh Chakraborty on 2024-03-21 04:13:59 UTC ---

Hi Avan,

I have noticed there are two types of metric names in Prometheus: ceph_data_sync_from_zone_* and ceph_data_sync_from_<source_zone_name>_*. For the `ceph_data_sync_from_zone_*` queries, the table and graphs show "Empty query result", whereas for the ceph_data_sync_from_<source_zone_name>_* queries I can see the table is populated via the ceph-exporter job. I have attached a screenshot FYR.

Thanks, 
Tridibesh

--- Additional comment from Tridibesh Chakraborty on 2024-03-21 04:15:38 UTC ---



--- Additional comment from avan on 2024-03-21 11:28:15 UTC ---

(In reply to Tridibesh Chakraborty from comment #15)
> Hi Avan,
> 
> I have noticed there are 2 type of queries in the Promethues,
> ceph_data_sync_from_zone_* and ceph_data_sync_from_<source_zone_name>_*. For
> the `ceph_data_sync_from_zone_` queries, the table and graphs are showing
> "Empty query result", where as for the queries
> ceph_data_sync_from_<source_zone_name>_* , I can see the table is getting
> populated via job ceph-exporter. I have attached a screenshot FYR.
> 
> Thanks, 
> Tridibesh

Hi Tridibesh,
Thanks for the update, this is helpful. I think this explains the issue: the way these metrics are exposed by the exporter doesn't match what the queries in the Grafana panels use. We'd need to fix this in the exporter so that these metrics are exposed in the ceph_data_sync_from_zone_* format that the Grafana panels expect. That fix would probably land in the next 6.1 release, 6.1z6?
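
To make the mismatch concrete, the two shapes look roughly like this (the fetch_bytes suffix is purely illustrative, not taken from the customer data):

exporter today:  ceph_data_sync_from_<source_zone_name>_fetch_bytes
panels expect:   ceph_data_sync_from_zone_fetch_bytes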

--- Additional comment from Tridibesh Chakraborty on 2024-03-22 04:55:08 UTC ---

Hi Avan,

I have a doubt about exposing the metrics in the ceph_data_sync_from_zone_* format that the Grafana panels expect. If a customer has multiple zones configured with multisite, will it still work, or would it be better to keep the metric names as they are now in ceph-exporter and modify the Grafana queries?

Also, do you have a date for RHCS 6.1z6? I can see RHCS 6.1z5 is scheduled for 27th March, but there is no mention of z6.

Thanks,
Tridibesh

--- Additional comment from avan on 2024-03-22 07:47:48 UTC ---

(In reply to Tridibesh Chakraborty from comment #18)
> Hi Avan,
> 
> I have a doubt regarding making the metrics in format
> ceph_data_sync_from_zone_* which grafana panels are expecting. If customer
> have multiple zones configured with multisite, will it still work or it will
> be better to have metrics named as how it is now in ceph-exporter and modify
> Grafana queries. 
> 
> Also, do you have any date for RHCS 6.1z6? I can see RHCS 6.1z5 is scheduled
> on 27th March, but there is no mention about z6. 
> 
> Thanks,
> Tridibesh

Indeed, upon reviewing the screenshots you shared earlier regarding the `job=ceph` metrics, it's evident that there's a label named "source_zone", which carries the zone name for those metrics. This label is what the ceph-exporter needs to adopt as well. Therefore, even when dealing with multiple zones, the same queries would suffice, provided we update the way the metrics are exposed by the ceph-exporter.
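
As a sketch of how a single set of queries would then cover any number of zones once the exporter carries the zone as a label (the metric suffix is again illustrative, not confirmed from the dashboards here):

ceph_data_sync_from_zone_fetch_bytes{source_zone="<source_zone_name>"}
sum by (source_zone) (ceph_data_sync_from_zone_fetch_bytes)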

Regarding the timeline for 6.1z6, I'm uncertain of its planned release schedule. @nizam, would you happen to have any insights into when the next 6.1 z stream is scheduled for release?

Comment 1 avan 2024-04-10 09:04:28 UTC
We're considering including this bug in the blocker stage, as users will rely on the sync status and overview for RGW multisite functionality. So marking it as a blocker.

Comment 7 errata-xmlrpc 2024-06-13 14:30:20 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Critical: Red Hat Ceph Storage 7.1 security, enhancements, and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2024:3925