Bug 1907198 - ceph and grafana dashboards are inaccessible post deployment with code 503
Summary: ceph and grafana dashboards are inaccessible post deployment with code 503
Keywords:
Status: CLOSED DUPLICATE of bug 1856999
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 16.1 (Train)
Hardware: x86_64
OS: Linux
unspecified
medium
Target Milestone: ---
Assignee: RHOS Maint
QA Contact: Joe H. Rahme
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-12-13 15:19 UTC by Punit Kundal
Modified: 2020-12-18 16:57 UTC (History)
4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-12-18 16:57:29 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
ceph_dashboard_default (54.24 KB, image/png)
2020-12-13 15:19 UTC, Punit Kundal
ceph_dashboard_stack_upda (106.39 KB, image/png)
2020-12-13 16:59 UTC, Punit Kundal

Description Punit Kundal 2020-12-13 15:19:25 UTC
Created attachment 1738748 [details]
ceph_dashboard_default

Description of problem:

After a fresh deployment of RHOSP 16.1 with Ceph, the Grafana and Ceph dashboards are inaccessible.

We notice that haproxy binds to the ctlplane VIP for the Ceph and Grafana dashboards.

It is also seen that the backend server IP(s) are on the ctlplane network.

The ceph-ansible playbooks, however, configure the dashboards to run on the storage network instead of ctlplane.

Here are the observations:

haproxy binds on the ctlplane VIP for the Ceph and Grafana dashboards:

+++
[root@overcloud-controller-0 ~]# netstat -tunpl | grep -i 8444
tcp        0      0 192.168.24.215:8444     0.0.0.0:*               LISTEN      752542/haproxy      
[root@overcloud-controller-0 ~]# netstat -tunpl | grep -i 3100
tcp        0      0 192.168.24.215:3100     0.0.0.0:*               LISTEN      752542/haproxy  
+++

the backend servers are running on the storage network:

+++
[root@overcloud-controller-0 ~]# podman exec -it d1ea7a60368a ceph mgr services
{
    "dashboard": "http://overcloud-controller-0.storage.labrh2251.com:8444/",
    "prometheus": "http://overcloud-controller-0.labrh2251.com:9283/"
}
[root@overcloud-controller-0 ~]# 
+++
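The mgr output above can be parsed to see which network the backends actually advertise. A minimal sketch (the JSON is copied verbatim from the transcript above):

```python
import json
from urllib.parse import urlparse

# Output of `ceph mgr services`, copied from the transcript above.
mgr_services = json.loads("""
{
    "dashboard": "http://overcloud-controller-0.storage.labrh2251.com:8444/",
    "prometheus": "http://overcloud-controller-0.labrh2251.com:9283/"
}
""")

# The dashboard is advertised on the .storage hostname, i.e. the storage
# network -- not on ctlplane, where haproxy points its backends.
dashboard = urlparse(mgr_services["dashboard"])
print(dashboard.hostname)  # overcloud-controller-0.storage.labrh2251.com
print(dashboard.port)      # 8444
```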

If we curl the ctlplane VIP on ports 8444 and 3100, where haproxy is listening:

ceph dashboard:

+++
[root@overcloud-controller-0 ~]# curl -vL http://192.168.24.215:8444
* Rebuilt URL to: http://192.168.24.215:8444/
*   Trying 192.168.24.215...
* TCP_NODELAY set
* Connected to 192.168.24.215 (192.168.24.215) port 8444 (#0)
> GET / HTTP/1.1
> Host: 192.168.24.215:8444
> User-Agent: curl/7.61.1
> Accept: */*
> 
* HTTP 1.0, assume close after body
< HTTP/1.0 503 Service Unavailable
< Cache-Control: no-cache
< Connection: close
< Content-Type: text/html
< 
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
* Closing connection 0
[root@overcloud-controller-0 ~]# 
+++

The same happens on port 3100:

+++
[root@overcloud-controller-0 ~]# curl -vL http://192.168.24.215:3100
* Rebuilt URL to: http://192.168.24.215:3100/
*   Trying 192.168.24.215...
* TCP_NODELAY set
* Connected to 192.168.24.215 (192.168.24.215) port 3100 (#0)
> GET / HTTP/1.1
> Host: 192.168.24.215:3100
> User-Agent: curl/7.61.1
> Accept: */*
> 
* HTTP 1.0, assume close after body
< HTTP/1.0 503 Service Unavailable
< Cache-Control: no-cache
< Connection: close
< Content-Type: text/html
< 
<html><body><h1>503 Service Unavailable</h1>
No server is available to handle this request.
</body></html>
* Closing connection 0
[root@overcloud-controller-0 ~]# 
+++

Here is the haproxy config that is deployed by default:

+++
[root@overcloud-controller-0 ~]# cat /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg | grep -i ceph_dashboard -A10
listen ceph_dashboard
  bind 192.168.24.215:8444 transparent
  mode http
  balance source
  http-check expect rstatus 2[0-9][0-9]
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request set-header X-Forwarded-Port %[dst_port]
  option httpchk HEAD /
  server overcloud-controller-0.ctlplane.labrh2251.com 192.168.24.213:8444 check fall 5 inter 2000 rise 2

[root@overcloud-controller-0 ~]# cat /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg | grep -i ceph_grafana -A10
listen ceph_grafana
  bind 192.168.24.215:3100 transparent
  mode http
  balance source
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request set-header X-Forwarded-Port %[dst_port]
  option httpchk HEAD /
  server overcloud-controller-0.ctlplane.labrh2251.com 192.168.24.213:3100 check fall 5 inter 2000 rise 2
+++
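The 503 follows directly from this config: haproxy health-checks the backend at its ctlplane address (192.168.24.213:8444), but nothing listens there, so the backend is marked down and haproxy answers "No server is available". A small, hypothetical helper mirroring what the health check effectively determines (plain TCP reachability; a sketch, not TripleO code):

```python
import socket

def port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# In the broken deployment, a check against the ctlplane backend address
# (192.168.24.213:8444) fails, while the storage address (172.16.40.16:8444)
# accepts connections -- exactly the curl behaviour shown in this report.
```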

The root cause is in service_net_map.j2.yaml: the network used for these services is named 'storage_dashboard', and since no network by that name exists in this deployment, the mapping defaults to ctlplane:

+++
[root@undercloud16 ~]# grep -e CephDashboar -e CephGra /usr/share/openstack-tripleo-heat-templates/network/service_net_map.j2.yaml 
      CephDashboardNetwork: {{ _service_nets.get('storage_dashboard', 'ctlplane') }}
      CephGrafanaNetwork: {{ _service_nets.get('storage_dashboard', 'ctlplane') }}
+++
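The Jinja2 `_service_nets.get(...)` call behaves like Python's `dict.get`: if the key is missing, the default is returned. A minimal illustration using the networks actually deployed in this lab:

```python
# Networks deployed in this lab (see the net-list below) --
# note there is no 'storage_dashboard' network among them.
service_nets = {
    "ctlplane": "ctlplane",
    "internal_api": "internal_api",
    "tenant": "tenant",
    "storage": "storage",
    "external": "external",
    "storage_mgmt": "storage_mgmt",
}

# Mirrors service_net_map.j2.yaml: the missing key triggers the fallback.
ceph_dashboard_network = service_nets.get("storage_dashboard", "ctlplane")
print(ceph_dashboard_network)  # ctlplane
```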

Here are the networks in this lab:

+++
(undercloud) [stack@undercloud16 ~]$ neutron net-list
neutron CLI is deprecated and will be removed in the future. Use openstack CLI instead.
+--------------------------------------+--------------+----------------------------------+-------------------------------------------------------+
| id                                   | name         | tenant_id                        | subnets                                               |
+--------------------------------------+--------------+----------------------------------+-------------------------------------------------------+
| 1b2da26a-6ff5-4664-af65-d015bd178a2b | ctlplane     | d73920b84d9043fea60173e8c27d3bfd | dbef2f39-75ef-41f8-8dda-49a7308e16a7 192.168.24.0/24  |
| 4759c631-d429-4fa4-b952-f6c9c0ae5e23 | internal_api | d73920b84d9043fea60173e8c27d3bfd | 1d6a347d-ee60-423f-af03-5b9b28adfcbf 172.16.20.0/24   |
| 67dd666a-51a6-4967-a9e8-22235e5d0d9b | tenant       | d73920b84d9043fea60173e8c27d3bfd | e3fd7eaa-5b8a-4339-a95b-0d1208c286fa 172.16.50.0/24   |
| acdb84c6-aee6-4d50-b8d0-08bdf568fe58 | storage      | d73920b84d9043fea60173e8c27d3bfd | fdc0d41f-76dc-47cc-9979-42c38f200276 172.16.40.0/24   |
| e3ab2496-32f7-47ef-a943-b7103727110f | external     | d73920b84d9043fea60173e8c27d3bfd | faed5981-dccc-455b-91a1-7b8aace621b4 192.168.122.0/24 |
| edc047be-22e3-4317-b203-f35dd3d41a46 | storage_mgmt | d73920b84d9043fea60173e8c27d3bfd | 7dfe3a0f-863d-4b6c-995d-1e117da652a7 172.16.60.0/24   |
+--------------------------------------+--------------+----------------------------------+-------------------------------------------------------+
+++

If we reach the backend server IP directly, where the Ceph and Grafana dashboards are listening:

+++
[root@overcloud-controller-0 ~]# curl -vL http://172.16.40.16:8444
* Rebuilt URL to: http://172.16.40.16:8444/
*   Trying 172.16.40.16...
* TCP_NODELAY set
* Connected to 172.16.40.16 (172.16.40.16) port 8444 (#0)
> GET / HTTP/1.1
> Host: 172.16.40.16:8444
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: text/html;charset=utf-8
< Server: CherryPy/8.9.1
< Date: Sun, 13 Dec 2020 14:22:06 GMT
< Content-Language: en-US
< Vary: Accept-Language, Accept-Encoding
< Last-Modified: Tue, 24 Nov 2020 19:19:12 GMT
< Accept-Ranges: bytes
< Content-Length: 1193
< 
<!doctype html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Red Hat Ceph Storage</title>

  <script>
    document.write('<base href="' + document.location+ '" />');
  </script>
+++

And the same directly against Grafana on port 3100:

+++
[root@overcloud-controller-0 ~]# curl -vL http://172.16.40.16:3100
* Rebuilt URL to: http://172.16.40.16:3100/
*   Trying 172.16.40.16...
* TCP_NODELAY set
* Connected to 172.16.40.16 (172.16.40.16) port 3100 (#0)
> GET / HTTP/1.1
> Host: 172.16.40.16:3100
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Content-Type: text/html; charset=UTF-8
< Set-Cookie: grafana_sess=54e469e89ae5c194; Path=/; HttpOnly
< Date: Sun, 13 Dec 2020 15:13:58 GMT
< Transfer-Encoding: chunked
< 
<!DOCTYPE html>
<html lang="en">

<head>
  <meta charset="utf-8">
  <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
  <meta name="viewport" content="width=device-width">
  <meta name="theme-color" content="#000">
+++

We can see that the dashboards are reachable when we hit the correct IP directly, so the default haproxy configuration needs to be fixed.

Version-Release number of selected component (if applicable):

+++
[root@undercloud16 ~]# rpm -qa | grep -i tripleo-heat
openstack-tripleo-heat-templates-11.3.2-1.20200914170156.el8ost.noarch
+++

How reproducible:
Always

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:
Screenshots show that accessing the dashboards via the haproxy VIP ends in a 503.

Comment 1 Punit Kundal 2020-12-13 16:58:54 UTC
Here are some more details.

This issue is caused by the ServiceNetMapDefaults parameter. If we set:

+++
(undercloud) [stack@undercloud16 ~]$ grep -i ServiceNetMap ~/templates/network-environment.yaml -A2
  ServiceNetMap:
    CephDashboardNetwork: storage
    CephGrafanaNetwork: storage
(undercloud) [stack@undercloud16 ~]$ 
+++
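A quick, hypothetical pre-deployment sanity check in the same spirit: verify that every network referenced by a ServiceNetMap override actually exists in the deployment (network names taken from this lab):

```python
# Networks deployed in this lab.
deployed_networks = {"ctlplane", "internal_api", "tenant",
                     "storage", "external", "storage_mgmt"}

# The override from network-environment.yaml above.
service_net_map = {
    "CephDashboardNetwork": "storage",
    "CephGrafanaNetwork": "storage",
}

# Any service mapped to a network that does not exist would silently
# fall back to ctlplane, which is exactly the bug described here.
missing = {svc: net for svc, net in service_net_map.items()
           if net not in deployed_networks}
print(missing)  # {} -- both services now map to an existing network
```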

then run a stack update:

+++
Wait for puppet host configuration to finish --------------------------- 29.19s
Wait for puppet host configuration to finish --------------------------- 29.14s
Wait for puppet host configuration to finish --------------------------- 29.14s
Run tripleo-container-image-prepare logged to: /var/log/tripleo-container-image-prepare.log -- 27.63s
Wait for container-puppet tasks (bootstrap tasks) for step 3 to finish -- 25.31s
Wait for puppet host configuration to finish --------------------------- 22.11s
Wait for containers to start for step 4 using paunch ------------------- 21.81s
Pre-fetch all the containers ------------------------------------------- 20.52s
Run puppet on the host to apply IPtables rules ------------------------- 16.58s
tripleo-hieradata : Render hieradata from template --------------------- 14.94s
tripleo-kernel : Set extra sysctl options ------------------------------- 9.96s
tripleo-keystone-resources : Async creation of Keystone user ------------ 9.35s
tripleo-keystone-resources : Async creation of Keystone admin endpoint --- 8.97s

Ansible passed.
Overcloud configuration completed.
Overcloud Endpoint: http://192.168.122.202:5000
Overcloud Horizon Dashboard URL: http://192.168.122.202:80/dashboard
Overcloud rc file: /home/stack/overcloudrc
Overcloud Deployed without error
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=4, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.205', 45238)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=5, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.205', 60856)>
sys:1: ResourceWarning: unclosed <ssl.SSLSocket fd=7, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=6, laddr=('192.168.24.205', 38006), raddr=('192.168.24.205', 13989)>

real    62m50.039s
user    0m18.178s
sys     0m2.159s
+++

This sets the backend IP(s) correctly on the storage network; the VIP still points to the ctlplane network:

+++
[root@overcloud-controller-0 ~]# cat /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg | grep -i ceph_dashboard -A10
listen ceph_dashboard
  bind 192.168.24.215:8444 transparent
  mode http
  balance source
  http-check expect rstatus 2[0-9][0-9]
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request set-header X-Forwarded-Port %[dst_port]
  option httpchk HEAD /
  server overcloud-controller-0.storage.labrh2251.com 172.16.40.16:8444 check fall 5 inter 2000 rise 2

[root@overcloud-controller-0 ~]# cat /var/lib/config-data/puppet-generated/haproxy/etc/haproxy/haproxy.cfg | grep -i ceph_grafana -A10
listen ceph_grafana
  bind 192.168.24.215:3100 transparent
  mode http
  balance source
  http-request set-header X-Forwarded-Proto https if { ssl_fc }
  http-request set-header X-Forwarded-Proto http if !{ ssl_fc }
  http-request set-header X-Forwarded-Port %[dst_port]
  option httpchk HEAD /
  server overcloud-controller-0.storage.labrh2251.com 172.16.40.16:3100 check fall 5 inter 2000 rise 2
+++
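After the update, the regenerated config can also be checked programmatically. A sketch that pulls the backend `server` addresses out of a haproxy `listen` section (config text abbreviated from the output above):

```python
# Abbreviated from the regenerated haproxy.cfg shown above.
HAPROXY_CFG = """\
listen ceph_dashboard
  bind 192.168.24.215:8444 transparent
  mode http
  option httpchk HEAD /
  server overcloud-controller-0.storage.labrh2251.com 172.16.40.16:8444 check fall 5 inter 2000 rise 2
"""

def backend_addrs(cfg: str, section: str) -> list:
    """Collect the ip:port of 'server' lines inside the named listen section."""
    addrs, in_section = [], False
    for line in cfg.splitlines():
        stripped = line.strip()
        if stripped.startswith("listen "):
            in_section = stripped.split()[1] == section
        elif in_section and stripped.startswith("server "):
            addrs.append(stripped.split()[2])
    return addrs

# The backend now sits on the storage network (172.16.40.x).
print(backend_addrs(HAPROXY_CFG, "ceph_dashboard"))  # ['172.16.40.16:8444']
```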

We are now able to access the dashboard; however, while browsing through it we get code 500 errors at many different points. Some examples:

cluster -> hosts  code 500
cluster -> monitor code 500

The data still appears to be displayed, but we keep getting these alerts.

I am attaching a screenshot of the same.

Regards,
Punit

Comment 2 Punit Kundal 2020-12-13 16:59:44 UTC
Created attachment 1738772 [details]
ceph_dashboard_stack_upda

