Description of problem:
Integrating Ceph with existing enterprise monitoring tools requires Ceph to at least send an SNMP trap to an SNMP trap destination server (ideally to a list of several).

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Integration of Ceph with existing enterprise monitoring systems is not possible because no SNMP traps are generated on status changes.

Expected results:
At a minimum, a change in cluster health from healthy to any other state should generate an SNMP trap, sent to a list of configured SNMP trap destination servers.

Additional info:
An alternative implementation would be at the ceph-mon/mgr level; however, this would require individual configuration for every Ceph cluster. Using the dashboard as the central point of monitoring could perhaps provide either a single configuration for all clusters (changes for all clusters reported to the same destinations, as an initial solution) or a more sophisticated setup that separates SNMP trap destinations per cluster, so that traps can be routed differently depending on assignment within organisations.
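To make the expected behaviour concrete, here is a minimal sketch of the decision logic only: when a cluster leaves the healthy state, one trap request is produced per configured destination, with an optional per-cluster override. It does not send any SNMP traffic. The names `TrapRequest` and `traps_for_transition` and the destination strings are hypothetical and used purely for illustration; only the `HEALTH_OK` status string comes from Ceph itself.

```python
from dataclasses import dataclass

HEALTHY = "HEALTH_OK"  # Ceph's status string for a healthy cluster


@dataclass(frozen=True)
class TrapRequest:
    """Hypothetical record describing one trap to be sent."""
    cluster: str
    destination: str
    old_status: str
    new_status: str


def traps_for_transition(cluster, old_status, new_status,
                         default_destinations, per_cluster=None):
    """Return the trap requests for a health change, or [] if none apply.

    A trap is requested only when the cluster moves from healthy to any
    other state, matching the "Expected results" in the report.
    """
    if old_status != HEALTHY or new_status == HEALTHY:
        return []
    # Assumption: a per-cluster list, when present, replaces the defaults.
    destinations = (per_cluster or {}).get(cluster, default_destinations)
    return [TrapRequest(cluster, dest, old_status, new_status)
            for dest in destinations]
```

With a single shared destination list this corresponds to the "one config for all" option; passing `per_cluster` corresponds to the more sophisticated setup with separate destinations per cluster.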
An approach explored in the past consisted of:

- ceph-mgr ==> Prometheus exporter ==> Prometheus ==> Prometheus AlertManager ==> HTTP Webhook API ==> Prometheus SNMPTrapper Webhook (https://github.com/chrusty/prometheus_webhook_snmptrapper)

However, that latter project has shown no activity for two years. On the other hand, another webhook integration (https://github.com/maxwo/snmp_notifier) has been released recently. Both rely on the Net-SNMP stack.

That said, Ceph-Dashboard is not strictly required for this. However, the current upstream approach is to expose AlertManager in Dashboard, so technically we could reserve a place there for the UI.

Pros:
- No code changes required in Ceph, as long as all metrics to be sent as 'traps' are already exported to Prometheus.
- Prometheus and Alertmanager are already building blocks.
- No major reliability caveats, given that SNMP traps should not be used alone anyway when reliability is a key concern.

Cons:
- Complexity is moved to the deployment/configuration stage.
- No FOSS license assessment has been performed yet on those projects.
- Both projects seem to have marginal community adoption/response (little or no track record of issues or bug-fixing activity), which leaves a big question mark over the quality of the code and of the SNMP implementation.
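For reference, the last hop of that pipeline (Alertmanager webhook to trap sender) could look roughly like the sketch below. It only parses the JSON body that Alertmanager POSTs to a webhook receiver and flattens each alert into a simple record that a trap sender could consume. It is not taken from either of the linked projects, it does not emit SNMP, and the function name `alerts_to_traps` and the output field names are hypothetical.

```python
import json


def alerts_to_traps(body):
    """Flatten an Alertmanager webhook body into per-alert trap records.

    `body` is the raw JSON (bytes or str) of one webhook POST. The output
    field names are illustrative, not an established trap layout.
    """
    payload = json.loads(body)
    traps = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        traps.append({
            "name": labels.get("alertname", "unknown"),
            "severity": labels.get("severity", "unknown"),
            # Per-alert status is "firing" or "resolved"; fall back to
            # the group-level status if it is missing.
            "status": alert.get("status", payload.get("status", "unknown")),
            "summary": annotations.get("summary")
                       or annotations.get("description", ""),
        })
    return traps
```

In a real deployment this function would sit behind an HTTP endpoint configured as a webhook receiver in Alertmanager, and each record would be handed to whatever SNMP stack is chosen (both projects above use Net-SNMP).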
Level-setting the severity of this defect to "High" with a bulk update. Please refine it to a closer value, as defined by the severity definitions at https://bugzilla.redhat.com/page.cgi?id=fields.html#bug_severity
Backport PR: https://github.com/ceph/ceph/pull/44529
*** This bug has been marked as a duplicate of bug 1259160 ***