Bug 2155498 - test_ceph_osd_stopped is failing on IBM Z
Summary: test_ceph_osd_stopped is failing on IBM Z
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenShift Data Foundation
Classification: Red Hat Storage
Component: rook
Version: 4.12
Hardware: s390x
OS: Linux
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Travis Nielsen
QA Contact: Neha Berry
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-12-21 11:03 UTC by Sujoy Batabyal
Modified: 2023-08-09 17:03 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2023-01-17 15:11:23 UTC
Embargoed:


Attachments
Failed test case test_ceph_osd_stopped on ODF 4.12 (257.76 KB, text/plain)
2022-12-21 12:50 UTC, Sujoy Batabyal
Test case test_ceph_osd_stopped (246.98 KB, text/plain)
2023-01-13 14:57 UTC, Sujoy Batabyal

Description Sujoy Batabyal 2022-12-21 11:03:43 UTC
Description of problem (please be as detailed as possible and provide log
snippets): tests/manage/monitoring/prometheus/test_deployment_status.py::test_ceph_osd_stopped is failing on IBM Z with the following error:

Assertion error: the alerts are not counted as expected.

Version of all relevant components (if applicable): OCP 4.12, ODF 4.12.


Does this issue impact your ability to continue to work with the product
(please explain in detail what is the user impact)?
No

Is there any workaround available to the best of your knowledge?
No

Rate from 1 - 5 the complexity of the scenario you performed that caused this
bug (1 - very simple, 5 - very complex)?


Can this issue reproducible?
Yes

Can this issue reproduce from the UI?


If this is a regression, please provide more details to justify this:


Steps to Reproduce:
1.
2.
3.


Actual results: Incorrect number of CephOSDDiskNotResponding alerts.


Expected results: The test verifies that an appropriate alert is raised when a Ceph OSD is down, and that the alert is cleared when the OSD disk is back online.



Additional info:

Comment 2 Sujoy Batabyal 2022-12-21 11:08:35 UTC
Message: AssertionError: Incorrect number of CephOSDDiskNotResponding alerts (2 instead of 1 with states: ['pending']).
Alerts: [{'labels': {'alertname': 'CephOSDDiskNotResponding', 'ceph_daemon': 'osd.2', 'container': 'mgr', 'device': '/dev/dm-4', 'disk': '2', 'endpoint': 'http-metrics', 'host': 'worker-1.ocs-ci-large.test.ocs', 'instance': '10.129.4.30:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-77b6fb49bc-kkzqf', 'service': 'rook-ceph-mgr', 'severity': 'critical'}, 'annotations': {'description': 'Disk device /dev/dm-4 not responding, on host worker-1.ocs-ci-large.test.ocs.', 'message': 'Disk not responding', 'severity_level': 'error', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-12-10T14:35:12.677556721Z', 'value': '2e+00'}, {'labels': {'alertname': 'CephOSDDiskNotResponding', 'ceph_daemon': 'osd.2', 'container': 'mgr', 'device': '/dev/dm-4', 'disk': '2', 'endpoint': 'http-metrics', 'host': 'worker-1.ocs-ci-large.test.ocs', 'instance': '10.129.4.30:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-77b6fb49bc-kkzqf', 'service': 'rook-ceph-mgr', 'severity': 'critical'}, 'annotations': {'description': 'Disk device /dev/dm-4 not responding, on host worker-1.ocs-ci-large.test.ocs.', 'message': 'Disk not responding', 'severity_level': 'error', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-12-10T14:35:12.677556721Z', 'value': '2e+00'}]
Type: None

Text:
measure_stop_ceph_osd = {'first_run': True, 'metadata': None, 'prometheus_alerts': [{'activeAt': '2022-12-10T14:34:11.581193756Z', 'annotation...cf1ff25) pacific (stable)', 'container': 'mgr', ...}, 'state': 'pending', ...}, ...], 'result': 'rook-ceph-osd-2', ...}

    @tier4c
    @pytest.mark.polarion_id("OCS-900")
    @skipif_managed_service
    def test_ceph_osd_stopped(measure_stop_ceph_osd):
        """
        Test that there is appropriate alert related to situation when ceph osd
        is down. Alert is cleared when osd disk is back online.
        """
        api = prometheus.PrometheusAPI()
    
        # get alerts from time when manager deployment was scaled down
        alerts = measure_stop_ceph_osd.get("prometheus_alerts")
        for target_label, target_msg, target_states, target_severity, ignore in [
            (
                constants.ALERT_OSDDISKNOTRESPONDING,
                "Disk not responding",
                ["pending"],
                "error",
                False,
            ),
            (
                constants.ALERT_DATARECOVERYTAKINGTOOLONG,
                "Data recovery is slow",
                ["pending"],
                "warning",
                True,
            ),
            (
                constants.ALERT_CLUSTERWARNINGSTATE,
                "Storage cluster is in degraded state",
                ["pending"],
                "warning",
                False,
            ),
        ]:
>           prometheus.check_alert_list(
                label=target_label,
                msg=target_msg,
                alerts=alerts,
                states=target_states,
                severity=target_severity,
                ignore_more_occurences=ignore,
            )

tests/manage/monitoring/prometheus/test_deployment_status.py:146: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

label = 'CephOSDDiskNotResponding', msg = 'Disk not responding'
alerts = [{'activeAt': '2022-12-10T14:34:11.581193756Z', 'annotations': {'description': 'Storage cluster is in warning state fo... (68de1f204d3c34ec62bd59fae7a9814accf1ff25) pacific (stable)', 'container': 'mgr', ...}, 'state': 'pending', ...}, ...]
states = ['pending'], severity = 'error', ignore_more_occurences = False

    def check_alert_list(
        label, msg, alerts, states, severity="warning", ignore_more_occurences=True
    ):
        """
        Check list of alerts that there are alerts with requested label and
        message for each provided state. If some alert is missing then this check
        fails.
    
        Args:
            label (str): Alert label
            msg (str): Alert message
            alerts (list): List of alerts to check
            states (list): List of states to check, order is important
            ignore_more_occurences (bool): If true then there is checkced only
                occurence of alert with requested label, message and state but
                it is not checked if there is more of occurences than one.
        """
    
        target_alerts = [
            alert for alert in alerts if alert.get("labels").get("alertname") == label
        ]
    
        logger.info(f"Checking properties of found {label} alerts")
        if ignore_more_occurences:
            for state in states:
                delete = False
                for key, alert in reversed(list(enumerate(target_alerts))):
                    if alert.get("state") == state:
                        if delete:
                            d_msg = f"Ignoring {alert} as alert already appeared."
                            logger.debug(d_msg)
                            target_alerts.pop(key)
                        else:
                            delete = True
        assert_msg = (
            f"Incorrect number of {label} alerts ({len(target_alerts)} "
            f"instead of {len(states)} with states: {states})."
            f"\nAlerts: {target_alerts}"
        )
>       assert len(target_alerts) == len(states), assert_msg
E       AssertionError: Incorrect number of CephOSDDiskNotResponding alerts (2 instead of 1 with states: ['pending']).
E       Alerts: [{'labels': {'alertname': 'CephOSDDiskNotResponding', 'ceph_daemon': 'osd.2', 'container': 'mgr', 'device': '/dev/dm-4', 'disk': '2', 'endpoint': 'http-metrics', 'host': 'worker-1.ocs-ci-large.test.ocs', 'instance': '10.129.4.30:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-77b6fb49bc-kkzqf', 'service': 'rook-ceph-mgr', 'severity': 'critical'}, 'annotations': {'description': 'Disk device /dev/dm-4 not responding, on host worker-1.ocs-ci-large.test.ocs.', 'message': 'Disk not responding', 'severity_level': 'error', 'storage_type': 'ceph'}, 'state': 'pending', 'activeAt': '2022-12-10T14:35:12.677556721Z', 'value': '2e+00'}, {'labels': {'alertname': 'CephOSDDiskNotResponding', 'ceph_daemon': 'osd.2', 'container': 'mgr', 'device': '/dev/dm-4', 'disk': '2', 'endpoint': 'http-metrics', 'host': 'worker-1.ocs-ci-large.test.ocs', 'instance': '10.129.4.30:9283', 'job': 'rook-ceph-mgr', 'managedBy': 'ocs-storagecluster', 'namespace': 'openshift-storage', 'pod': 'rook-ceph-mgr-a-77b6fb49bc-kkzqf', 'service': 'rook-ceph-mgr', 'severity': 'critical'}, 'annotations': {'description': 'Disk device /dev/dm-4 not responding, on host worker-1.ocs-ci-large.test.ocs.', 'message': 'Disk not responding', 'severity_level': 'error', 'storage_type': 'ceph'}, 'state': 'firing', 'activeAt': '2022-12-10T14:35:12.677556721Z', 'value': '2e+00'}]

ocs_ci/utility/prometheus.py:61: AssertionError
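
For illustration, here is a minimal, self-contained sketch (not the ocs-ci code) of the length check from check_alert_list above, using the two alert records observed in this run. Because ignore_more_occurences is False for this alert, the de-duplication branch is skipped, so both records (pending and firing) are compared against the single expected state ['pending'] and the assertion fails with 2 != 1:

    # Minimal sketch, not ocs-ci code: reproduce the length check from check_alert_list.
    alerts = [
        {"labels": {"alertname": "CephOSDDiskNotResponding"}, "state": "pending"},
        {"labels": {"alertname": "CephOSDDiskNotResponding"}, "state": "firing"},
    ]
    states = ["pending"]  # what the failing test iteration expects for this alert

    # Keep only the alerts with the requested alertname (both records match).
    target_alerts = [
        a for a in alerts if a["labels"]["alertname"] == "CephOSDDiskNotResponding"
    ]

    # ignore_more_occurences is False in this iteration, so nothing is dropped
    # and the length check fails: 2 != 1, raising AssertionError as in the traceback.
    assert len(target_alerts) == len(states), (
        f"Incorrect number of CephOSDDiskNotResponding alerts "
        f"({len(target_alerts)} instead of {len(states)} with states: {states})."
    )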

Comment 3 Sujoy Batabyal 2022-12-21 12:50:25 UTC
Created attachment 1933907 [details]
Failed test case test_ceph_osd_stopped on ODF 4.12

Comment 8 Filip Balák 2023-01-12 08:16:00 UTC
This looks like an automation issue. According to the logs, there are two records for the CephOSDDiskNotResponding alert: one in Pending state and one in Firing state. This is expected. The test is correctly written to look for both states among the triggered alerts: https://github.com/red-hat-storage/ocs-ci/blob/master/tests/manage/monitoring/prometheus/test_deployment_status.py#L133.
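
For illustration, a minimal sketch (not the ocs-ci test itself) of how the same two alert records pass the length check shown in the traceback when both states are expected, which is the behaviour the master-branch parametrization referenced above relies on:

    # Sketch only: with both states expected, the two records from this run pass
    # the length check in check_alert_list (2 == 2).
    alerts = [
        {"labels": {"alertname": "CephOSDDiskNotResponding"}, "state": "pending"},
        {"labels": {"alertname": "CephOSDDiskNotResponding"}, "state": "firing"},
    ]
    states = ["pending", "firing"]

    target_alerts = [
        a for a in alerts if a["labels"]["alertname"] == "CephOSDDiskNotResponding"
    ]
    assert len(target_alerts) == len(states)  # passes: 2 == 2
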
@Sujoy, can you please share the test logs so we can investigate?
In any case, I think this can be closed as NOTABUG.

Comment 9 Sujoy Batabyal 2023-01-13 14:57:28 UTC
Created attachment 1937823 [details]
Test case test_ceph_osd_stopped

@fbalak, attached are the test logs.

Comment 10 Travis Nielsen 2023-01-17 15:11:23 UTC
Closing since it's not a bug, as Filip mentioned.

