Bug 2414677
| Summary: | Prometheus module error causing cluster to go to HEALTH_ERR | |||
|---|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | Eric Smith <esmith> | |
| Component: | Ceph-Mgr Plugins | Assignee: | anmol babu <anbabu> | |
| Ceph-Mgr Plugins sub component: | prometheus | QA Contact: | Manisha Saini <msaini> | |
| Status: | CLOSED ERRATA | Docs Contact: | ceph-docs <ceph-docs> | |
| Severity: | medium | |||
| Priority: | medium | CC: | aasharma, anbabu, bkunal, eric, jcaratza, ngangadh, pdhange, rkachach, shbhosal, timnguye, vdas, ygayam | |
| Version: | 8.1 | Flags: | esmith: needinfo- | |
| Target Milestone: | --- | |||
| Target Release: | 9.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | ceph-20.1.0-131 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 2421743 (view as bug list) | Environment: | ||
| Last Closed: | 2026-01-29 07:03:42 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 2421743 | |||
We have to fail the manager to clear the error.

I see that there is an error: `OSError: No socket could be created -- (('::', 9283, 0, 0): [Errno 98] Address already in use)`. Did you check what is consuming port 9283?
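A quick way to answer the "is the port in use" half of that question is to try binding it. The sketch below is only illustrative: the helper name `port_is_free` is hypothetical, and the default host `'::'` is taken from the error message above. It reports whether the port can be bound right now, not which process holds it; for that, a tool such as `ss` or `lsof` on the mgr host is still needed.

```python
import errno
import socket


def port_is_free(port, host="::"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    # Assumption: a ':' in the host means an IPv6 literal such as '::'.
    family = socket.AF_INET6 if ":" in host else socket.AF_INET
    with socket.socket(family, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError as exc:
            if exc.errno == errno.EADDRINUSE:
                # Same condition as "[Errno 98] Address already in use" on Linux.
                return False
            raise  # any other failure (permissions, no IPv6) is not our question
    return True
```

Calling `port_is_free(9283)` on the mgr host while the module is in the failed state would show whether something still holds the port. The probe socket is closed as soon as the check finishes, so it does not itself keep the port busy.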
What information is needed? Why was the needinfo flag added?

We want to know what is consuming your port.

9283 is only bound to by the manager - we don't have any software that binds to 9283.

Can Prashant and I be given access to the SOS report? Also, to clarify: are you deploying the cluster and disabling Prometheus, then re-enabling it? On cluster deployment the Prometheus module is automatically deployed.

Here's an updated link: https://ibm.box.com/s/t3a3ymtcey5y4hei2cmyuejcotuu02cz
Once you have it downloaded, can you let me know so that I can shut off access again?

I have it downloaded.

We enable the prometheus module during deployment if it is not already enabled.

Is the mgr bound to 9283 a healthy mgr daemon?

Yes, I believe so - manager fail-over works just fine.

Can you enable my access to https://ibm.box.com/s/t3a3ymtcey5y4hei2cmyuejcotuu02cz ?

Hi @anbabu - I've re-enabled the share link - you should be able to access it now.

Thanks Eric, I have downloaded it. While I look at the logs, given you confirmed the issue is not due to mgr fail-over, I have a few additional questions I need your input on:
1. Was there only 1 mgr on the same node, or more than 1?
2. Was there a back-to-back enable/disable of prometheus in under, let's say, a minute?
3. Were there any config changes triggered?
4. As a workaround, have we tried changing the value of mgr/prometheus/server_port as in https://docs.ceph.com/en/octopus/mgr/prometheus/#configuration ?
5. Were the system resources running at full or near-full utilization, prompting restarts?
6. Was this issue only with the prometheus mgr module, or with other manager modules as well, or did the manager in general have some issue that manifested as a prometheus module crash?

Hi @anbabu - is there something else needed from me?

Eric, are you not able to see confidential comments?

1. Was there only 1 mgr on the same node, or more than 1?
There is only 1 manager at the time of the error (very soon after bootstrap).
2. Was there a back-to-back enable/disable of prometheus in under, let's say, a minute?
No, we only run the module enable once for each module we're attempting to enable.
3. Were there any config changes triggered?
This is part of a greenfield installation - it's immediately post bootstrap of the cluster.
4. As a workaround, have we tried changing the value of mgr/prometheus/server_port as in https://docs.ceph.com/en/octopus/mgr/prometheus/#configuration ?
No, we have not tried changing the port.
5. Were the system resources running at full or near-full utilization, prompting restarts?
No, these nodes have 1TB of memory and 128 CPUs - they are very underutilized at the time of the error.
6. Was this issue only with the prometheus mgr module, or with other manager modules as well, or did the manager in general have some issue that manifested as a prometheus module crash?
Only with the prometheus mgr module.

Tim was correct - I was unable to see confidential comments.

3. Were there any config changes triggered?
Not sure if it's relevant, but Eric did provide the initial config file they pass to --config during bootstrap. Perhaps it could trigger a config change?
    [global]
    bluefs_buffered_io = false
    bluestore_cache_autotune = true
    bluestore_cache_size = 3221225472
    bluestore_compression_max_blob_size = 65536
    bluestore_compression_min_blob_size = 8192
    bluestore_default_buffered_write = true
    bluestore_default_buffered_read = true
    bluestore_deferred_batch_ops = 16
    bluestore_extent_map_shard_min_size = 50
    bluestore_extent_map_shard_max_size = 200
    bluestore_extent_map_shard_target_size = 100
    bluestore_max_blob_size = 65536
    bluestore_min_alloc_size = 4096
    bluestore_min_alloc_size_ssd = 4096
    bluestore_prefer_deferred_size = 0
    log_to_file = true
    log_to_stderr = false
    mon_cluster_log_to_file = true
    mon_cluster_log_to_stderr = false
    mon_max_pg_per_osd = 1000
    mon_pg_warn_max_object_skew = 100000.000000
    mon_pg_warn_min_per_osd = 0
    ms_bind_msgr1 = false
    ms_client_mode = secure
    ms_cluster_mode = secure
    ms_service_mode = secure
    ms_mon_client_mode = secure
    ms_mon_cluster_mode = secure
    ms_mon_service_mode = secure
    osd_pool_default_pg_autoscale_mode = off
    rbd_cache = false
    rbd_disable_zero_copy_writes = false
    log_to_journald = false
    mon_cluster_log_to_journald = false
    [mgr]
    mgr/cephadm/yes_i_know = true
    mgr/cephadm/no_five_one_rgw = true
    mgr/prometheus/exclude_perf_counters = false
    [mon]
    auth_allow_insecure_global_id_reclaim = false
    mon_allow_pool_size_one = true
    mon_config_key_max_entry_size = 4194304
    [osd]
    osd_memory_target = 34359738368
    osd_op_complaint_time = 2.000000
    osd_op_num_threads_per_shard = 2
    osd_scrub_load_threshold = 0.010000
    mon_max_pg_per_osd = 1000
    osd_memory_target_autotune = false

Copy of comment #23 for Eric. Replaced the hostname with <hostname>.

Below are my observations from the logs:

StandbyModule successfully starts on port 9283 (<hostname> lines 71-76)

    2025-11-12T19:36:41.651+0000 7f6fb0beb640 0 [prometheus INFO root] server_addr: :: server_port: 9283
    2025-11-12T19:36:41.651+0000 7f6fb0beb640 0 [prometheus INFO root] Starting engine...
    2025-11-12T19:36:41.651+0000 7f6fb0beb640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:41] ENGINE Bus STARTING
    2025-11-12T19:36:41.755+0000 7f6fb0beb640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:41] ENGINE Serving on http://:::9283
    2025-11-12T19:36:41.755+0000 7f6fb0beb640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:41] ENGINE Bus STARTED
    2025-11-12T19:36:41.755+0000 7f6fb0beb640 0 [prometheus INFO root] Engine started.

Standby shutdown, probably because the standby is now becoming active (<hostname> lines 77-74 and as seen from the logs below)

    2025-11-12T19:36:46.234+0000 7f73d1450640 1 mgr handle_mgr_map Activating!
    2025-11-12T19:36:46.234+0000 7f73d1450640 1 mgr handle_mgr_map I am now activating
    2025-11-12T19:36:46.240+0000 7f6faa3de640 0 [prometheus INFO root] Stopping engine...
    2025-11-12T19:36:46.240+0000 7f6faa3de640 0 [prometheus INFO root] Stopped engine
    2025-11-12T19:36:46.240+0000 7f6fb0beb640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE Bus STOPPING
    2025-11-12T19:36:46.261+0000 7f6fb0beb640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('::', 9283)) shut down
    2025-11-12T19:36:46.261+0000 7f6fb0beb640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE Bus STOPPED
    2025-11-12T19:36:46.261+0000 7f6fb0beb640 0 [prometheus INFO root] Engine stopped.
And again the active module restarts prometheus (<hostname> lines 119-124)

    2025-11-12T19:36:46.592+0000 7f6faa3de640 0 [prometheus DEBUG root] setting log level based on debug_mgr: INFO (2/5)
    2025-11-12T19:36:46.595+0000 7f6faa3de640 1 mgr load Constructed class from module: prometheus
    2025-11-12T19:36:46.596+0000 7f6faa3de640 0 [rbd_support DEBUG root] setting log level based on debug_mgr: INFO (2/5)
    2025-11-12T19:36:46.597+0000 7f6f9126c640 0 [prometheus INFO root] server_addr: :: server_port: 9283
    2025-11-12T19:36:46.597+0000 7f6f9126c640 0 [prometheus INFO root] Cache enabled
    2025-11-12T19:36:46.597+0000 7f6f9026a640 0 [prometheus INFO root] starting metric collection thread

Now, something triggers config_notify (as indicated by the log line "Restarting engine...", <hostname> lines 148-151)

    2025-11-12T19:36:46.682+0000 7f6f90a6b640 0 [prometheus INFO root] Restarting engine...
    2025-11-12T19:36:46.684+0000 7f6f90a6b640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE Bus STOPPING
    2025-11-12T19:36:46.684+0000 7f6f90a6b640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE HTTP Server None already shut down
    2025-11-12T19:36:46.685+0000 7f6f90a6b640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:46] ENGINE Bus STOPPED

and the above seems to be causing the race condition (<hostname> lines 154-156)

    2025-11-12T19:36:47.254+0000 7f6f9126c640 0 [prometheus INFO root] Starting engine...
    2025-11-12T19:36:47.254+0000 7f6f90a6b640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus STARTING
    2025-11-12T19:36:47.254+0000 7f6f9126c640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus STARTING

The engine is started twice in quick succession, by two different threads, and almost immediately we see the error:

    2025-11-12T19:36:47.256+0000 7f6f720ee640 0 [prometheus ERROR cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Error in HTTP server: shutting down
    Traceback (most recent call last):
      File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
        self.httpserver.start()
      File "/lib/python3.9/site-packages/cheroot/server.py", line 1844, in start
        self.prepare()
      File "/lib/python3.9/site-packages/cheroot/server.py", line 1799, in prepare
        raise socket.error(msg)
    OSError: No socket could be created -- (('::', 9283, 0, 0): [Errno 98] Address already in use)

This is Anmol Babu's PR: https://github.com/ceph/ceph/pull/66605

I agree with the changes here. The issue was engine start being called in both the serve loop and config_notify, so one call bound the port and the other tried to bind it while the first still held it.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: Red Hat Ceph Storage 9.0 Security and Enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2026:1536
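The failure mode described above (two start calls racing to bind the same port) can be reproduced in isolation with plain sockets. This is a minimal sketch, not the mgr module's code: `try_start`, `race_for_port` and `free_port` are hypothetical names, each thread stands in for one "start engine" caller (the serve loop and config_notify in the analysis), and it binds to IPv4 loopback on a kernel-chosen port rather than `'::'` on 9283.

```python
import errno
import socket
import threading

HOST = "127.0.0.1"  # assumption: loopback instead of the '::' wildcard in the logs


def free_port():
    """Ask the kernel for a currently unused TCP port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.bind((HOST, 0))
        return probe.getsockname()[1]


def try_start(port, outcomes, barrier):
    """Stand-in for one 'start engine' call: bind and listen on the port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    barrier.wait()  # release both callers together so they race for the port
    try:
        sock.bind((HOST, port))
        sock.listen()
        outcomes.append(("bound", sock))  # keep the socket open: port stays held
    except OSError as exc:
        sock.close()
        outcomes.append((errno.errorcode.get(exc.errno, "unknown"), None))


def race_for_port():
    """Run two concurrent start attempts and report how each one ended."""
    port = free_port()
    outcomes = []
    barrier = threading.Barrier(2)
    callers = [
        threading.Thread(target=try_start, args=(port, outcomes, barrier))
        for _ in range(2)
    ]
    for caller in callers:
        caller.start()
    for caller in callers:
        caller.join()
    # Release the port only after both attempts have finished.
    for _, sock in outcomes:
        if sock is not None:
            sock.close()
    return sorted(status for status, _ in outcomes)
```

Whichever thread reaches `bind` first wins; the other is refused with `EADDRINUSE`, the same errno that appears in the traceback. Nothing external has to be listening on the port for this to happen, which is consistent with the reporter's statement that only the manager binds 9283.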
Description of problem:
When we enable the prometheus module, it will sometimes produce the following error:

    2025-11-12T19:36:47.357+0000 7f6f90a6b640 0 [prometheus ERROR cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Error in 'start' listener <bound method Server.start of <cherrypy._cpserver.Server object at 0x7f6fba4d9cd0>>
    Traceback (most recent call last):
      File "/lib/python3.9/site-packages/cherrypy/process/wspbus.py", line 230, in publish
        output.append(listener(*args, **kwargs))
      File "/lib/python3.9/site-packages/cherrypy/_cpserver.py", line 180, in start
        super(Server, self).start()
      File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 184, in start
        self.wait()
      File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 246, in wait
        raise self.interrupt
      File "/lib64/python3.9/threading.py", line 980, in _bootstrap_inner
        self.run()
      File "/lib64/python3.9/threading.py", line 917, in run
        self._target(*self._args, **self._kwargs)
      File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
        self.httpserver.start()
      File "/lib/python3.9/site-packages/cheroot/server.py", line 1844, in start
        self.prepare()
      File "/lib/python3.9/site-packages/cheroot/server.py", line 1806, in prepare
        self._connections = connections.ConnectionManager(self)
      File "/lib/python3.9/site-packages/cheroot/connections.py", line 131, in __init__
        server.socket.fileno(),
    AttributeError: 'NoneType' object has no attribute 'fileno'

This is passed into Ceph:

    [ceph: root@dal1-qz2-sr2-rk044-s18 /]# ceph -s
      cluster:
        id:     868971c8-bffe-11f0-b45d-7cc2554980d4
        health: HEALTH_ERR
                Module 'prometheus' has failed: AttributeError("'NoneType' object has no attribute 'fileno'")

Version-Release number of selected component (if applicable):
19.2.1 (RHCS 8.1)

How reproducible:
75%

Steps to Reproduce:
1. Deploy Ceph
2. Enable the prometheus module
3. Receive the error

Actual results:

    [ceph: root@dal1-qz2-sr2-rk044-s18 /]# ceph -s
      cluster:
        id:     868971c8-bffe-11f0-b45d-7cc2554980d4
        health: HEALTH_ERR
                Module 'prometheus' has failed: AttributeError("'NoneType' object has no attribute 'fileno'")

Expected results:
Prometheus module is enabled successfully

Additional info:
There are other stack traces preceding this that lead us to believe the port is possibly in use:

    2025-11-12T19:36:47.256+0000 7f6f720ee640 0 [prometheus ERROR cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Error in HTTP server: shutting down
    Traceback (most recent call last):
      File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
        self.httpserver.start()
      File "/lib/python3.9/site-packages/cheroot/server.py", line 1844, in start
        self.prepare()
      File "/lib/python3.9/site-packages/cheroot/server.py", line 1799, in prepare
        raise socket.error(msg)
    OSError: No socket could be created -- (('::', 9283, 0, 0): [Errno 98] Address already in use)
    2025-11-12T19:36:47.256+0000 7f6f720ee640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus STOPPING
    2025-11-12T19:36:47.256+0000 7f6f720ee640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE HTTP Server cherrypy._cpwsgi_server.CPWSGIServer(('::', 9283)) already shut down
    2025-11-12T19:36:47.257+0000 7f6f720ee640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus STOPPED
    2025-11-12T19:36:47.257+0000 7f6f720ee640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus EXITING
    2025-11-12T19:36:47.257+0000 7f6f720ee640 0 [prometheus INFO cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Bus EXITED
    2025-11-12T19:36:47.269+0000 7f6f718ed640 0 [prometheus ERROR cherrypy.error] [12/Nov/2025:19:36:47] ENGINE Error in HTTP server: shutting down
    Traceback (most recent call last):
      File "/lib/python3.9/site-packages/cherrypy/process/servers.py", line 225, in _start_http_thread
        self.httpserver.start()
      File "/lib/python3.9/site-packages/cheroot/server.py", line 1844, in start
        self.prepare()
      File "/lib/python3.9/site-packages/cheroot/server.py", line 1806, in prepare
        self._connections = connections.ConnectionManager(self)
      File "/lib/python3.9/site-packages/cheroot/connections.py", line 131, in __init__
        server.socket.fileno(),
    AttributeError: 'NoneType' object has no attribute 'fileno'