
Bug 2414677

Summary: Prometheus module error causing cluster to go to HEALTH_ERR
Product: [Red Hat Storage] Red Hat Ceph Storage
Reporter: Eric Smith <esmith>
Component: Ceph-Mgr Plugins
Assignee: anmol babu <anbabu>
Ceph-Mgr Plugins sub component: prometheus
QA Contact: Manisha Saini <msaini>
Status: CLOSED ERRATA
Docs Contact: ceph-docs <ceph-docs>
Severity: medium
Priority: medium
CC: aasharma, anbabu, bkunal, eric, jcaratza, ngangadh, pdhange, rkachach, shbhosal, timnguye, vdas, ygayam
Version: 8.1
Flags: esmith: needinfo-
Target Milestone: ---
Target Release: 9.0
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: ceph-20.1.0-131
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 2421743
Environment:
Last Closed: 2026-01-29 07:03:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 2421743    

Description Eric Smith 2025-11-12 20:49:12 UTC
Description of problem: When we enable the prometheus module, it will sometimes produce the following error:

2025-11-12T19:36:47.357+0000 7f6f90a6b640  0 [prometheus ERROR cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Error in 'start' listener <bound method Server.start of <cherrypy._cpserver.Server object at 0x7f6fba4d9cd0>>
Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cherrypy/process/wspbus.py", line 230, in publish
    output.append(listener(*args, **kwargs))
  File "/lib/python3.9/site-packages/cherrypy/_cpserver.py", line 180, in start
    super(Server, self).start()
  File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 184, in start
    self.wait()
  File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 246, in wait
    raise self.interrupt
  File "/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/lib64/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
    self.httpserver.start()
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1844, in start
    self.prepare()
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1806, in prepare
    self._connections = connections.ConnectionManager(self)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 131, in __init__
    server.socket.fileno(),
AttributeError: 'NoneType' object has no attribute 'fileno'

The failure then surfaces in the cluster health status:

[ceph: root@dal1-qz2-sr2-rk044-s18 /]# ceph -s
  cluster:
    id:     868971c8-bffe-11f0-b45d-7cc2554980d4
    health: HEALTH_ERR
            Module 'prometheus' has failed: AttributeError("'NoneType' object has no attribute 'fileno'")


Version-Release number of selected component (if applicable): 19.2.1 (RHCS 8.1)


How reproducible: 75%



Steps to Reproduce:
1. Deploy Ceph
2. Enable the prometheus module
3. Receive the error

Actual results: 

[ceph: root@dal1-qz2-sr2-rk044-s18 /]# ceph -s
  cluster:
    id:     868971c8-bffe-11f0-b45d-7cc2554980d4
    health: HEALTH_ERR
            Module 'prometheus' has failed: AttributeError("'NoneType' object has no attribute 'fileno'")


Expected results: Prometheus module is enabled successfully


Additional info: Other stack traces precede this one and lead us to believe the port may already be in use:

2025-11-12T19:36:47.256+0000 7f6f720ee640  0 [prometheus ERROR cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Error in HTTP server: shutting down
Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
    self.httpserver.start()
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1844, in start
    self.prepare()
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1799, in prepare
    raise socket.error(msg)
OSError: No socket could be created -- (('::', 9283, 0, 0): [Errno 98] Address already in use)

2025-11-12T19:36:47.256+0000 7f6f720ee640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus STOPPING
2025-11-12T19:36:47.256+0000 7f6f720ee640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('::', 9283)) already shut down
2025-11-12T19:36:47.257+0000 7f6f720ee640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus STOPPED
2025-11-12T19:36:47.257+0000 7f6f720ee640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus EXITING
2025-11-12T19:36:47.257+0000 7f6f720ee640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus EXITED
2025-11-12T19:36:47.269+0000 7f6f718ed640  0 [prometheus ERROR cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Error in HTTP server: shutting down
Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
    self.httpserver.start()
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1844, in start
    self.prepare()
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1806, in prepare
    self._connections = connections.ConnectionManager(self)
  File "/lib/python3.9/site-packages/cheroot/connections.py", line 131, in __init__
    server.socket.fileno(),
AttributeError: 'NoneType' object has no attribute 'fileno'
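
For reference, the "[Errno 98] Address already in use" failure in the first trace can be reproduced outside Ceph with two plain sockets. This is only a minimal sketch: it uses the IPv4 loopback address and an OS-assigned port instead of '::' and 9283, and the helper names bind_listener and second_bind_errno are made up for illustration.

```python
import errno
import socket


def bind_listener(port: int = 0) -> socket.socket:
    """Bind and listen on the IPv4 loopback; port 0 lets the OS pick a free port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind(("127.0.0.1", port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def second_bind_errno(port: int):
    """Try to bind a second listener on the same port; return the errno, or None on success."""
    try:
        bind_listener(port).close()
    except OSError as exc:
        return exc.errno
    return None


first = bind_listener()                 # plays the role of the first engine start
port = first.getsockname()[1]
err = second_bind_errno(port)           # plays the role of the second engine start
first.close()
print(errno.errorcode.get(err), err)
```

The second bind fails because the first listener still owns the port, which is the same class of error as the OSError logged above.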

Comment 1 Eric Smith 2025-11-12 20:54:03 UTC
SOS report: https://ibm.box.com/s/t3a3ymtcey5y4hei2cmyuejcotuu02cz

Comment 2 Eric Smith 2025-11-12 20:54:44 UTC
We have to fail the manager to clear the error.

Comment 3 Bipin Kunal 2025-11-27 06:20:54 UTC
I see the error `OSError: No socket could be created -- (('::', 9283, 0, 0): [Errno 98] Address already in use)`. Did you check what is using port 9283?

Comment 4 Eric Smith 2025-12-01 13:30:28 UTC
What information is needed? Why was the needinfo flag added?

Comment 5 Timothy Nguyễn 2025-12-01 16:25:56 UTC
We want to know what is consuming your port.

Comment 6 Eric Smith 2025-12-01 16:29:34 UTC
9283 is only bound to by the manager - we don't have any software that binds to 9283.
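
A quick way to confirm whether anything is accepting connections on a port is a connect probe. This is a rough sketch only: the function name is_port_listening is hypothetical, and the probe reports whether a listener exists, not which process owns it (a tool such as ss would be needed for that). The demo runs against a throwaway listener rather than the real exporter port.

```python
import socket


def is_port_listening(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) is accepted."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return probe.connect_ex((host, port)) == 0


# Demo against a temporary listener on an OS-assigned port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen()
demo_port = listener.getsockname()[1]

while_open = is_port_listening(demo_port)    # listener is up
listener.close()
after_close = is_port_listening(demo_port)   # listener is gone
print(while_open, after_close)
```

On a mgr host the equivalent check would be is_port_listening(9283), with 9283 taken from the log above.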

Comment 7 Timothy Nguyễn 2025-12-01 18:39:49 UTC
Can Prashant and I be given access to the SOS report?

Comment 8 Timothy Nguyễn 2025-12-01 18:50:15 UTC
Also, to clarify: are you deploying the cluster, disabling Prometheus, and then re-enabling it? On cluster deployment the Prometheus module is automatically deployed.

Comment 9 Eric Smith 2025-12-01 20:08:49 UTC
Here's an updated link: https://ibm.box.com/s/t3a3ymtcey5y4hei2cmyuejcotuu02cz

Once you have it downloaded, can you let me know so that I can shut off access again?

Comment 10 Timothy Nguyễn 2025-12-01 22:55:15 UTC
I have it downloaded.

Comment 11 Eric Smith 2025-12-02 12:35:48 UTC
We enable the prometheus module during deployment if it is not already enabled.

Comment 12 Timothy Nguyễn 2025-12-02 16:12:23 UTC
Is the mgr bound to 9283 a healthy mgr daemon?

Comment 13 Eric Smith 2025-12-02 23:14:45 UTC
Yes, I believe so - manager fail-over works just fine.

Comment 17 anmol babu 2025-12-10 14:25:13 UTC
Can you enable my access to https://ibm.box.com/s/t3a3ymtcey5y4hei2cmyuejcotuu02cz ?

Comment 18 Eric Smith 2025-12-10 15:57:44 UTC
Hi @anbabu - I've reenabled the share link - you should be able to access it now.

Comment 20 anmol babu 2025-12-10 16:09:25 UTC
Thanks Eric, I have downloaded it.
While I look at the logs, and given that you confirmed the issue is not due to mgr fail-over, I have a few additional questions I need your input on:
1. Was there only 1 mgr on the node, or more than 1?
2. Was prometheus enabled and disabled back to back, within, let's say, a minute?
3. Were any config changes triggered?
4. As a workaround, have we tried changing the value of mgr/prometheus/server_port, as described in https://docs.ceph.com/en/octopus/mgr/prometheus/#configuration ?
5. Were the system resources running at or near full utilization, prompting restarts?
6. Was this issue limited to the prometheus mgr module, or did other manager modules (or the manager in general) have an issue that manifested as a prometheus module crash?

Comment 21 Eric Smith 2025-12-10 16:26:15 UTC
Hi @anbabu - is there something else needed from me?

Comment 22 Timothy Nguyễn 2025-12-10 16:33:39 UTC
Eric are you not able to see confidential comments?

Comment 24 Eric Smith 2025-12-10 16:54:47 UTC
1. Was there only 1 mgr on the node, or more than 1?

There was only 1 manager at the time of the error (very soon after bootstrap).

2. Was prometheus enabled and disabled back to back, within, let's say, a minute?

No, we only run the module enable once for each module we're attempting to enable.

3. Were any config changes triggered?

This is part of a greenfield installation - it's immediately post bootstrap of the cluster.

4. As a workaround, have we tried changing the value of mgr/prometheus/server_port, as described in https://docs.ceph.com/en/octopus/mgr/prometheus/#configuration ?

No, we have not tried changing the port.

5. Were the system resources running at or near full utilization, prompting restarts?

No, these nodes have 1TB of memory and 128 CPUs - they were very underutilized at the time of the error.

6. Was this issue limited to the prometheus mgr module, or did other manager modules (or the manager in general) have an issue that manifested as a prometheus module crash?

Only with the prometheus mgr module.

Comment 25 Eric Smith 2025-12-10 16:55:25 UTC
Tim was correct - I was unable to see confidential comments.

Comment 26 Timothy Nguyễn 2025-12-10 17:49:23 UTC
3. Were there any config changes triggered ?

Not sure if it's relevant, but Eric did provide the initial config file they pass to --config during bootstrap. Perhaps it could trigger a config change?

[global]
bluefs_buffered_io = false
bluestore_cache_autotune = true
bluestore_cache_size = 3221225472
bluestore_compression_max_blob_size = 65536
bluestore_compression_min_blob_size = 8192
bluestore_default_buffered_write = true
bluestore_default_buffered_read = true
bluestore_deferred_batch_ops = 16
bluestore_extent_map_shard_min_size = 50
bluestore_extent_map_shard_max_size = 200
bluestore_extent_map_shard_target_size = 100
bluestore_max_blob_size = 65536
bluestore_min_alloc_size = 4096
bluestore_min_alloc_size_ssd = 4096
bluestore_prefer_deferred_size = 0
log_to_file = true
log_to_stderr = false
mon_cluster_log_to_file = true
mon_cluster_log_to_stderr = false
mon_max_pg_per_osd = 1000
mon_pg_warn_max_object_skew = 100000.000000
mon_pg_warn_min_per_osd = 0
ms_bind_msgr1 = false
ms_client_mode = secure
ms_cluster_mode = secure
ms_service_mode = secure
ms_mon_client_mode = secure
ms_mon_cluster_mode = secure
ms_mon_service_mode = secure
osd_pool_default_pg_autoscale_mode = off
rbd_cache = false
rbd_disable_zero_copy_writes = false
log_to_journald = false
mon_cluster_log_to_journald = false
[mgr]
mgr/cephadm/yes_i_know = true
mgr/cephadm/no_five_one_rgw = true
mgr/prometheus/exclude_perf_counters = false
[mon]
auth_allow_insecure_global_id_reclaim = false
mon_allow_pool_size_one = true
mon_config_key_max_entry_size = 4194304
[osd]
osd_memory_target = 34359738368
osd_op_complaint_time = 2.000000
osd_op_num_threads_per_shard = 2
osd_scrub_load_threshold = 0.010000
mon_max_pg_per_osd = 1000
osd_memory_target_autotune = false

Comment 27 Bipin Kunal 2025-12-11 05:15:09 UTC
Copying comment #23 for Eric, with the hostname replaced by <hostname>.

Below are my observations from the logs:

The StandbyModule successfully starts on port 9283 (<hostname> lines 71-76):
2025-11-12T19:36:41.651+0000 7f6fb0beb640  0 [prometheus INFO root] server_addr: :: server_port: 9283
2025-11-12T19:36:41.651+0000 7f6fb0beb640  0 [prometheus INFO root] Starting engine...
2025-11-12T19:36:41.651+0000 7f6fb0beb640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:41] ENGINE Bus STARTING
2025-11-12T19:36:41.755+0000 7f6fb0beb640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:41] ENGINE Serving on http://:::9283
2025-11-12T19:36:41.755+0000 7f6fb0beb640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:41] ENGINE Bus STARTED
2025-11-12T19:36:41.755+0000 7f6fb0beb640  0 [prometheus INFO root] Engine started.


The standby module then shuts down, probably because the standby is now becoming active (<hostname> lines 77-74, and as seen in the logs below):
2025-11-12T19:36:46.234+0000 7f73d1450640  1 mgr handle_mgr_map Activating!
2025-11-12T19:36:46.234+0000 7f73d1450640  1 mgr handle_mgr_map I am now activating
2025-11-12T19:36:46.240+0000 7f6faa3de640  0 [prometheus INFO root] Stopping engine...
2025-11-12T19:36:46.240+0000 7f6faa3de640  0 [prometheus INFO root] Stopped engine
2025-11-12T19:36:46.240+0000 7f6fb0beb640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE Bus STOPPING
2025-11-12T19:36:46.261+0000 7f6fb0beb640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('::', 9283)) shut down
2025-11-12T19:36:46.261+0000 7f6fb0beb640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE Bus STOPPED
2025-11-12T19:36:46.261+0000 7f6fb0beb640  0 [prometheus INFO root] Engine stopped.


The active module then starts prometheus again (<hostname> lines 119-124):
2025-11-12T19:36:46.592+0000 7f6faa3de640  0 [prometheus DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2025-11-12T19:36:46.595+0000 7f6faa3de640  1 mgr load Constructed class from module: prometheus
2025-11-12T19:36:46.596+0000 7f6faa3de640  0 [rbd_support DEBUG root] setting log level based on debug_mgr: INFO (2/5)
2025-11-12T19:36:46.597+0000 7f6f9126c640  0 [prometheus INFO root] server_addr: :: server_port: 9283
2025-11-12T19:36:46.597+0000 7f6f9126c640  0 [prometheus INFO root] Cache enabled
2025-11-12T19:36:46.597+0000 7f6f9026a640  0 [prometheus INFO root] starting metric collection thread

Now something triggers config_notify, as indicated by the log line "Restarting engine..." (<hostname> lines 148-151):
2025-11-12T19:36:46.682+0000 7f6f90a6b640  0 [prometheus INFO root] Restarting engine...
2025-11-12T19:36:46.684+0000 7f6f90a6b640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE Bus STOPPING
2025-11-12T19:36:46.684+0000 7f6f90a6b640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE HTTP Server None already shut down
2025-11-12T19:36:46.685+0000 7f6f90a6b640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE Bus STOPPED

The restart above seems to cause the race condition (<hostname> lines 154-156):
2025-11-12T19:36:47.254+0000 7f6f9126c640  0 [prometheus INFO root] Starting engine...
2025-11-12T19:36:47.254+0000 7f6f90a6b640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus STARTING
2025-11-12T19:36:47.254+0000 7f6f9126c640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus STARTING

The engine is started twice in quick succession (from threads 7f6f9126c640 and 7f6f90a6b640), and almost immediately we see the error:
2025-11-12T19:36:47.256+0000 7f6f720ee640  0 [prometheus ERROR cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Error in HTTP server: shutting down
Traceback (most recent call last):
  File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
    self.httpserver.start()
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1844, in start
    self.prepare()
  File "/lib/python3.9/site-packages/cheroot/server.py", line 1799, in prepare
    raise socket.error(msg)
OSError: No socket could be created -- (('::', 9283, 0, 0): [Errno 98] Address already in use)
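
The overlapping starts can also be picked out of a mgr log mechanically. The sketch below is a crude heuristic rather than a proper analysis tool: it flags any timestamp at which more than one thread logged "ENGINE Bus STARTING". The regular expression is an assumption based only on the log lines quoted in this bug, and concurrent_bus_starts is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Assumed layout, inferred from the lines quoted in this report:
# <timestamp> <thread id> <level number> [prometheus <LEVEL> <logger>] <message>
LOG_LINE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>[0-9a-f]+)\s+\d+\s+"
    r"\[prometheus (?P<level>\w+) (?P<logger>[\w.]+)\]\s+(?P<msg>.*)$"
)


def concurrent_bus_starts(lines):
    """Return {timestamp: thread ids} for 'Bus STARTING' logged by more than one thread."""
    starts = defaultdict(set)
    for line in lines:
        match = LOG_LINE.match(line)
        if match and match.group("msg").endswith("ENGINE Bus STARTING"):
            starts[match.group("ts")].add(match.group("thread"))
    return {ts: threads for ts, threads in starts.items() if len(threads) > 1}


sample = [
    "2025-11-12T19:36:47.254+0000 7f6f9126c640  0 [prometheus INFO root] Starting engine...",
    "2025-11-12T19:36:47.254+0000 7f6f90a6b640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus STARTING",
    "2025-11-12T19:36:47.254+0000 7f6f9126c640  0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus STARTING",
]
overlaps = concurrent_bus_starts(sample)
print(overlaps)
```

Matching on identical millisecond timestamps will miss starts that are a few milliseconds apart, so a real check would likely compare against a small time window instead.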

Comment 28 Timothy Nguyễn 2025-12-11 19:16:18 UTC
This is Anmol Babu's PR: https://github.com/ceph/ceph/pull/66605

I agree with the changes here. The issue was that engine start was being called from both the serve loop and config_notify, so one call bound the port while the other tried to bind it after it was already bound.
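
To illustrate the general idea of serializing the two start paths, here is a toy sketch. It is not taken from the PR and does not use CherryPy: GuardedServer is a hypothetical stand-in that binds a plain socket, and the thread names only mirror the two callers described above. A lock makes start() idempotent, so whichever caller arrives second returns without attempting a second bind.

```python
import socket
import threading


class GuardedServer:
    """Toy stand-in for an HTTP engine: start() binds a port at most once."""

    def __init__(self, host: str = "127.0.0.1", port: int = 0):
        self._addr = (host, port)   # port 0 lets the OS pick a free port
        self._lock = threading.Lock()
        self._sock = None
        self.bind_calls = 0

    def start(self) -> bool:
        # Serialize callers; a second caller sees the bound socket and backs off.
        with self._lock:
            if self._sock is not None:
                return False
            sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            try:
                sock.bind(self._addr)
                sock.listen()
            except OSError:
                sock.close()
                raise
            self._sock = sock
            self.bind_calls += 1
            return True

    def stop(self) -> None:
        with self._lock:
            if self._sock is not None:
                self._sock.close()
                self._sock = None


server = GuardedServer()
results, errors = [], []


def worker():
    try:
        results.append(server.start())
    except OSError as exc:
        errors.append(exc)


# Two callers race to start the same server, like serve() and config_notify().
threads = [threading.Thread(target=worker, name=n) for n in ("serve", "config_notify")]
for t in threads:
    t.start()
for t in threads:
    t.join()
server.stop()
print(sorted(results), errors, server.bind_calls)
```

Exactly one caller performs the bind and the other gets False back, so no "Address already in use" error is raised regardless of which thread wins.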

Comment 35 errata-xmlrpc 2026-01-29 07:03:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: Red Hat Ceph Storage 9.0 Security and Enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2026:1536