Bug 1873353

Summary: The deleted PrometheusRule is still in the thanos-ruler UI
Product: OpenShift Container Platform
Reporter: hongyan li <hongyli>
Component: Monitoring
Assignee: Sergiusz Urbaniak <surbania>
Status: CLOSED ERRATA
QA Contact: hongyan li <hongyli>
Severity: medium
Docs Contact:
Priority: unspecified
Version: 4.6
CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, spasquie, surbania
Target Milestone: ---
Keywords: Regression
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:35:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
alert rule on Monitoring-Alerting Rules UI (flags: none)

Description hongyan li 2020-08-28 02:19:23 UTC
Created attachment 1712904 [details]
alert rule on Monitoring-Alerting Rules UI

Still see:
- the deleted PrometheusRule on the Monitoring -> Alerting Rules page
- the alerts on the Monitoring -> Alerts page
- the deleted PrometheusRule in the thanos-ruler UI


Description of problem:
The deleted PrometheusRule is still in the thanos-ruler UI

Steps: enable user workload monitoring and do the following to create a PrometheusRule
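
For reference, user workload monitoring can be enabled via the cluster-monitoring-config ConfigMap; a minimal sketch, assuming the 4.6 enableUserWorkload flag (the flag name may differ on earlier nightly builds):
# oc apply -f - << EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF
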
***********************************************
# oc new-project test3
# oc create -f - << EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test3.rules
spec:
  groups:
  - name: alerting rules
    rules:
    - alert: Watchdog
      expr: vector(1)
      labels:
        severity: none
      message:
        This is an alert meant to ensure that the entire alerting pipeline is functional.
EOF
***********************************************
The PrometheusRule is present in the rules-configmap-reloader container:
# oc -n openshift-user-workload-monitoring exec -c rules-configmap-reloader thanos-ruler-user-workload-0 -- cat /etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0/test3-test3.rules.yaml
groups:
- name: alerting rules
  rules:
  - alert: Watchdog
    expr: vector(1)
    labels:
      namespace: test3
      severity: none

# oc -n openshift-user-workload-monitoring exec -c rules-configmap-reloader thanos-ruler-user-workload-0 -- ls -al /etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0/test3-test3.rules.yaml
lrwxrwxrwx. 1 root 1000420000 29 Apr  2 12:04 /etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0/test3-test3.rules.yaml -> ..data/test3-test3.rules.yaml
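
The ..data symlink indicates the rule file is projected from a ConfigMap generated by the prometheus-operator; the generated object can be inspected directly (the ConfigMap name below is inferred from the mount path above):
# oc -n openshift-user-workload-monitoring get configmap thanos-ruler-user-workload-rulefiles-0 -o yaml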

Delete the PrometheusRule in test3 by deleting the project:
# oc delete project test3
project.project.openshift.io "test3" deleted

# oc -n test3 get PrometheusRule
No resources found in test3 namespace.

# oc -n openshift-user-workload-monitoring logs -c rules-configmap-reloader thanos-ruler-user-workload-0 
2020/08/28 01:33:11 Watching directory: "/etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0"
2020/08/28 01:33:13 config map updated
2020/08/28 01:33:13 successfully triggered reload
2020/08/28 01:34:35 config map updated
2020/08/28 01:34:35 successfully triggered reload
2020/08/28 01:45:50 config map updated
2020/08/28 01:45:50 successfully triggered reload
2020/08/28 01:53:53 config map updated
2020/08/28 01:53:53 successfully triggered reload
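
The thanos-ruler container's own logs can be checked for the corresponding rule reload (the exact log wording varies by Thanos version):
# oc -n openshift-user-workload-monitoring logs -c thanos-ruler thanos-ruler-user-workload-0 | grep -i reload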


The PrometheusRule file is also removed from the rules-configmap-reloader container:
# oc -n openshift-user-workload-monitoring exec -c rules-configmap-reloader thanos-ruler-user-workload-0 -- cat /etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0/test3-test3.rules.yaml
cat: /etc/thanos/rules/thanos-ruler-user-workload-rulefiles-0/test3-test3.rules.yaml: No such file or directory
command terminated with exit code 1
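
The rule can also be confirmed gone from the generated ConfigMap itself (same ConfigMap name as assumed above; no output is expected):
# oc -n openshift-user-workload-monitoring get configmap thanos-ruler-user-workload-rulefiles-0 -o yaml | grep test3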

Checked after more than 20 minutes; the related alert and alerting rule can still be found in the UI (see the attached screenshot).

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-user-workload-monitoring exec -c thanos-ruler thanos-ruler-user-workload-1 -- curl -k -H "Authorization: Bearer $token" https://thanos-ruler.openshift-user-workload-monitoring.svc:9091/api/v1/rules | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   741  100   741    0     0  15052      0 --:--:-- --:--:-- --:--:-- 15122
{
  "status": "success",
  "data": {
    "groups": [
      {
        "name": "alerting rules",
        "file": "",
        "rules": [
          {
            "state": "firing",
            "name": "Watchdog",
            "query": "vector(1)",
            "duration": 0,
            "labels": {
              "namespace": "test3",
              "severity": "none",
              "thanos_ruler_replica": "thanos-ruler-user-workload-0"
            },
            "annotations": {},
            "alerts": [
              {
                "labels": {
                  "alertname": "Watchdog",
                  "namespace": "test3",
                  "severity": "none"
                },
                "annotations": {},
                "state": "firing",
                "activeAt": "2020-08-28T01:45:57.078951644Z",
                "value": "1e+00",
                "partialResponseStrategy": "ABORT"
              }
            ],
            "health": "ok",
            "evaluationTime": 0.0018852,
            "lastEvaluation": "2020-08-28T02:09:12.079090599Z",
            "type": "alerting"
          }
        ],
        "interval": 15,
        "evaluationTime": 0,
        "lastEvaluation": "0001-01-01T00:00:00Z",
        "partial_response_strategy": "WARN",
        "partialResponseStrategy": "ABORT"
      }
    ]
  }
}
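
The stale alert can also be confirmed through the alerts endpoint on the same service, assuming the deployed Thanos version exposes the Prometheus-compatible /api/v1/alerts API:
# oc -n openshift-user-workload-monitoring exec -c thanos-ruler thanos-ruler-user-workload-1 -- curl -k -H "Authorization: Bearer $token" https://thanos-ruler.openshift-user-workload-monitoring.svc:9091/api/v1/alerts | jq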

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-08-27-005538

How reproducible:
Always

Steps to Reproduce:
1. See the description

Actual results:
The deleted PrometheusRule and its alerts remain visible in the thanos-ruler UI and on the console monitoring pages for more than 20 minutes after deletion.

Expected results:
The rule and its alerts disappear from the UI shortly after the PrometheusRule is deleted.

Additional info:

Comment 1 hongyan li 2020-08-28 02:25:12 UTC
Similar bug for 4.5
https://bugzilla.redhat.com/show_bug.cgi?id=1820180

Comment 3 hongyan li 2020-08-28 03:21:18 UTC
The alerting rule and its alerts keep displaying in the UI until a new PrometheusRule is created in a different namespace.

Comment 4 hongyan li 2020-08-28 03:22:07 UTC
After deletion, the alerting rule and its alerts keep displaying in the UI until a new PrometheusRule is created in a different namespace.
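
A minimal sketch of such a rule in a different namespace (hypothetical test4 project and rule names) that, per the observation above, flushes the stale entries:
# oc new-project test4
# oc create -f - << EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: test4.rules
spec:
  groups:
  - name: dummy rules
    rules:
    - alert: DummyWatchdog
      expr: vector(1)
      labels:
        severity: none
EOF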

Comment 5 Simon Pasquier 2020-08-28 09:44:09 UTC
I can reproduce with the upstream version of Thanos. I'll file an issue + PR there.

Comment 7 Simon Pasquier 2020-08-31 08:01:47 UTC
The upstream PR has been merged; we need to wait for the Thanos bump in our downstream fork.

Comment 10 hongyan li 2020-09-10 04:16:39 UTC
Tested with payload 4.6.0-0.nightly-2020-09-09-173545.
The deleted PrometheusRule disappears in less than a minute.
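
The rules API query from the description can be re-run to verify, for example:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-user-workload-monitoring exec -c thanos-ruler thanos-ruler-user-workload-1 -- curl -k -H "Authorization: Bearer $token" https://thanos-ruler.openshift-user-workload-monitoring.svc:9091/api/v1/rules | jq '.data.groups'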

Comment 12 errata-xmlrpc 2020-10-27 16:35:35 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196