Bug 2060091 - CMO produces invalid alertmanager statefulset if console cluster .status.consoleURL is unset
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.11.0
Assignee: aaleman
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 2060756
 
Reported: 2022-03-02 16:56 UTC by aaleman
Modified: 2022-08-10 10:52 UTC
CC: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 2060756 (view as bug list)
Environment:
Last Closed: 2022-08-10 10:51:53 UTC
Target Upstream Version:
Embargoed:


Attachments:


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-monitoring-operator pull 1576 0 None open Bug 2060091: Properly deal with an empty console URL 2022-03-02 17:02:43 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:52:26 UTC

Description aaleman 2022-03-02 16:56:23 UTC
Description of problem:

If the console cluster object's status (or status.consoleURL) is unset, the CMO sets up the alertmanager with the argument `-web.external-url=https:/monitoring`, which is invalid and puts it into a crashloop, logging "level=error ts=2022-03-02T16:53:47.773Z caller=main.go:369 msg="failed to determine external URL" err="\"monitoring\": invalid \"\" scheme, only 'http' and 'https' are supported""


Version-Release number of selected component (if applicable):


How reproducible: 100%


Steps to Reproduce:
1. Disable the CVO and console clusteroperator
2. Set the console status to empty: `oc proxy & curl -v -XPATCH -H "Accept: application/json" -H "Content-Type: application/merge-patch+json" -H "User-Agent: kubectl/v1.23.4 (linux/amd64) kubernetes/e6c093d" 'http://127.0.0.1:8001/apis/config.openshift.io/v1/consoles/cluster/status?fieldManager=kubectl-edit' --data '{"status":null}'`
3. Delete the CMO pod, because it does not react to changes in clusteroperators, ref https://bugzilla.redhat.com/show_bug.cgi?id=2060083

Actual results:

The alertmanager is invalid and crashloops, logging

```
level=error ts=2022-03-02T16:53:47.773Z caller=main.go:369 msg="failed to determine external URL" err="\"monitoring\": invalid \"\" scheme, only 'http' and 'https' are supported"
```

because it has a CLI argument of `--web.external-url=monitoring`



Expected results:

A working alertmanager is produced


Additional info:
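The failure mode can be sketched as follows (a hypothetical simplification of the operator logic, not the actual cluster-monitoring-operator code; `buildAlertmanagerArgs` is an illustrative helper): when `status.consoleURL` is empty, appending the `/monitoring` path anyway produces a flag value with no valid scheme, whereas guarding on the empty string, as the linked PR does, simply omits the flag and lets Alertmanager derive the URL itself.

```go
package main

import "fmt"

// buildAlertmanagerArgs sketches the guard the fix introduces: the
// --web.external-url flag is only emitted when the console URL is known.
// (Hypothetical helper for illustration; the real CMO code differs.)
func buildAlertmanagerArgs(consoleURL string) []string {
	args := []string{
		"--config.file=/etc/alertmanager/config/alertmanager.yaml",
		"--web.route-prefix=/",
	}
	if consoleURL != "" {
		args = append(args, "--web.external-url="+consoleURL+"/monitoring")
	}
	return args
}

func main() {
	// Empty console URL: flag omitted, Alertmanager starts fine.
	fmt.Println(buildAlertmanagerArgs(""))
	// Populated console URL: flag present with a full https:// URL.
	fmt.Println(buildAlertmanagerArgs("https://console.example.com"))
}
```

Without the guard, the empty-URL case would yield `--web.external-url=/monitoring`, exactly the value the crashlooping pods show below.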

Comment 4 Junqi Zhao 2022-03-08 04:55:45 UTC
the fix is in 4.11.0-0.nightly-2022-03-04-063157; tested with it, still the same issue.
before setting status.consoleURL to null for console/cluster
# oc get console/cluster -o jsonpath="{.status.consoleURL}"
https://console-openshift-console.apps.qe-daily-0308.qe.devcluster.openshift.com

# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
        - --web.external-url=https://console-openshift-console.apps.qe-daily-0308.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
        - --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/

scale down CVO/console-operator
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=0
# oc -n openshift-console-operator scale deploy console-operator --replicas=0

# oc -n openshift-cluster-version get deploy
NAME                       READY   UP-TO-DATE   AVAILABLE   AGE
cluster-version-operator   0/0     0            0           3h56m
# oc -n openshift-console-operator get deploy
NAME               READY   UP-TO-DATE   AVAILABLE   AGE
console-operator   0/0     0            0           3h43m


# oc proxy
open another terminal
# curl -v -XPATCH  -H "Accept: application/json" -H "Content-Type: application/merge-patch+json" -H "User-Agent: kubectl/v1.23.4 (linux/amd64) kubernetes/e6c093d" 'http://127.0.0.1:8001/apis/config.openshift.io/v1/consoles/cluster/status?fieldManager=kubectl-edit' --data '{"status":null}'

# oc get console/cluster -o jsonpath="{.status.consoleURL}"
no result

# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0                            6/6     Running            0             6m44s
alertmanager-main-1                            5/6     CrashLoopBackOff   4 (64s ago)   2m52s
prometheus-k8s-0                               6/6     Running            0             8m9s
prometheus-k8s-1                               6/6     Running            0             10m
cluster-monitoring-operator-66cb5487b9-8l7sl   2/2     Running            0             99m


# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
        - --web.external-url=/monitoring

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
        - --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/

# oc -n openshift-monitoring describe pod alertmanager-main-1
...
  alertmanager:
    Container ID:  cri-o://e502d6f2cde8ff678bc26bd074f662cf364441192bef214dd28e7fb1fdd61596
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58
    Ports:         9094/TCP, 9094/UDP
    Host Ports:    0/TCP, 0/UDP
    Args:
      --config.file=/etc/alertmanager/config/alertmanager.yaml
      --storage.path=/alertmanager
      --data.retention=120h
      --cluster.listen-address=[$(POD_IP)]:9094
      --web.listen-address=127.0.0.1:9093
      --web.external-url=/monitoring
      --web.route-prefix=/
      --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
      --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
      --cluster.reconnect-timeout=5m
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   level=info ts=2022-03-08T04:11:10.810Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=72e0ff6e1bacdb3e9ced559bc905bf4501eb8b61)"
level=info ts=2022-03-08T04:11:10.810Z caller=main.go:226 build_context="(go=go1.17.5, user=root@e377fc787659, date=20220304-03:55:17)"
level=info ts=2022-03-08T04:11:10.829Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=error ts=2022-03-08T04:11:10.857Z caller=main.go:369 msg="failed to determine external URL" err="\"/monitoring\": invalid \"\" scheme, only 'http' and 'https' are supported"
level=info ts=2022-03-08T04:11:10.857Z caller=cluster.go:680 component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=28.253187ms

delete the CMO pod
# oc -n openshift-monitoring delete pod cluster-monitoring-operator-66cb5487b9-8l7sl
pod "cluster-monitoring-operator-66cb5487b9-8l7sl" deleted

still the same issue
# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0                            6/6     Running   0             39m
alertmanager-main-1                            5/6     Error     3 (27s ago)   48s
cluster-monitoring-operator-66cb5487b9-8qwpb   2/2     Running   0             16m
prometheus-k8s-0                               6/6     Running   0             20m
prometheus-k8s-1                               6/6     Running   0             21m

# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
        - --web.external-url=/monitoring

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
        - --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/

# oc -n openshift-monitoring get po alertmanager-main-0 -oyaml | grep "web.external-url"
    - --web.external-url=https://console-openshift-console.apps.qe-daily-0308.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get po alertmanager-main-1 -oyaml | grep "web.external-url"
    - --web.external-url=/monitoring

# oc -n openshift-monitoring get po prometheus-k8s-0 -oyaml | grep "web.external-url"
    - --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/

# oc -n openshift-monitoring get po prometheus-k8s-1 -oyaml | grep "web.external-url"
    - --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/

# oc -n openshift-monitoring describe pod alertmanager-main-1
...
Containers:
  alertmanager:
    Container ID:  cri-o://8c733eba4cbc223a0598bafa6d95356de050da0151f3608f233402efc5c8342c
    Image:         quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58
    Image ID:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58
    Ports:         9094/TCP, 9094/UDP
    Host Ports:    0/TCP, 0/UDP
    Args:
      --config.file=/etc/alertmanager/config/alertmanager.yaml
      --storage.path=/alertmanager
      --data.retention=120h
      --cluster.listen-address=[$(POD_IP)]:9094
      --web.listen-address=127.0.0.1:9093
      --web.external-url=/monitoring
      --web.route-prefix=/
      --cluster.peer=alertmanager-main-0.alertmanager-operated:9094
      --cluster.peer=alertmanager-main-1.alertmanager-operated:9094
      --cluster.reconnect-timeout=5m
    State:       Waiting
      Reason:    CrashLoopBackOff
    Last State:  Terminated
      Reason:    Error
      Message:   level=info ts=2022-03-08T04:45:04.955Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=72e0ff6e1bacdb3e9ced559bc905bf4501eb8b61)"
level=info ts=2022-03-08T04:45:04.955Z caller=main.go:226 build_context="(go=go1.17.5, user=root@e377fc787659, date=20220304-03:55:17)"
level=info ts=2022-03-08T04:45:04.968Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=error ts=2022-03-08T04:45:05.007Z caller=main.go:369 msg="failed to determine external URL" err="\"/monitoring\": invalid \"\" scheme, only 'http' and 'https' are supported"
level=info ts=2022-03-08T04:45:05.007Z caller=cluster.go:680 component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=38.734031ms

# oc -n openshift-monitoring rsh -c alertmanager alertmanager-main-0
sh-4.4$ /bin/alertmanager --help
...
      --web.external-url=WEB.EXTERNAL-URL  
                                 The URL under which Alertmanager is externally reachable (for example, if Alertmanager is served via a reverse proxy). Used for generating relative and absolute links back to Alertmanager itself. If the
                                 URL has a path portion, it will be used to prefix all HTTP endpoints served by Alertmanager. If omitted, relevant URL components will be derived automatically.
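The validation behind the error message above boils down to parsing the flag value and rejecting any scheme other than http or https; a value like `/monitoring` parses as a relative URL with an empty scheme. A minimal reproduction of that check using Go's `net/url` (a sketch of the same logic, not Alertmanager's actual source):

```go
package main

import (
	"fmt"
	"net/url"
)

// validateExternalURL mimics the scheme check Alertmanager applies to
// --web.external-url at startup (illustrative, not the upstream code).
func validateExternalURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		// "/monitoring" parses as a relative reference with an empty
		// scheme, which is exactly why the pod crashloops.
		return fmt.Errorf("%q: invalid %q scheme, only 'http' and 'https' are supported", raw, u.Scheme)
	}
	return nil
}

func main() {
	fmt.Println(validateExternalURL("/monitoring"))                    // error: empty scheme
	fmt.Println(validateExternalURL("https://example.com/monitoring")) // <nil>
}
```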

Comment 6 aaleman 2022-03-08 19:45:46 UTC
Junqi Zhao, I cannot reproduce this; for me it works in that version:

Cluster has same version as the one you mentioned:

$ k get clusterversion version -ojson|jq .status.desired
{
  "image": "registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-03-04-063157",
  "version": "4.11.0-0.nightly-2022-03-04-063157"
}


The console doesn't have the URL in the status:

$ k get console cluster -ojson|jq .status
null

The statefulset doesn't have `web.external-url` in the alertmanager args:

$ k get statefulset -n openshift-monitoring alertmanager-main -ojson|jq '.spec.template.spec.containers[0].args'
[
  "--config.file=/etc/alertmanager/config/alertmanager.yaml",
  "--storage.path=/alertmanager",
  "--data.retention=120h",
  "--cluster.listen-address=[$(POD_IP)]:9094",
  "--web.listen-address=127.0.0.1:9093",
  "--web.route-prefix=/",
  "--cluster.peer=alertmanager-main-0.alertmanager-operated:9094",
  "--cluster.peer=alertmanager-main-1.alertmanager-operated:9094",
  "--cluster.reconnect-timeout=5m"
]

Alertmanager is running:

$ k get pod|rg alertmanager
alertmanager-main-0                           6/6     Running   0          9m8s
alertmanager-main-1                           6/6     Running   0          9m39s

Alertmanager pod doesn't have the `web.external-url` arg either:

$ k get pod alertmanager-main-0 -ojson|jq '.spec.containers[0].args'
[
  "--config.file=/etc/alertmanager/config/alertmanager.yaml",
  "--storage.path=/alertmanager",
  "--data.retention=120h",
  "--cluster.listen-address=[$(POD_IP)]:9094",
  "--web.listen-address=127.0.0.1:9093",
  "--web.route-prefix=/",
  "--cluster.peer=alertmanager-main-0.alertmanager-operated:9094",
  "--cluster.peer=alertmanager-main-1.alertmanager-operated:9094",
  "--cluster.reconnect-timeout=5m"
]

Are you sure your cluster is at 4.11.0-0.nightly-2022-03-04-063157 and not some other version that doesn't have the patch?

> I think the scenario in Comment 0 may be invalid based on the --web.external-url help

It can happen during cluster creation and results in a crashlooping alertmanager pod; that is what this bug is about.

Comment 8 Junqi Zhao 2022-03-11 07:27:33 UTC
tested with 4.11.0-0.nightly-2022-03-09-235248, no issue now
# oc get console/cluster -o jsonpath="{.status.consoleURL}"
no result

# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0                            6/6     Running   0          76s
alertmanager-main-1                            6/6     Running   0          109s
cluster-monitoring-operator-5699fc45d8-q7m9r   2/2     Running   0          118s
prometheus-k8s-0                               6/6     Running   0          86s
prometheus-k8s-1                               6/6     Running   0          104s

# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
        - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
        - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring delete pod cluster-monitoring-operator-5699fc45d8-q7m9r
pod "cluster-monitoring-operator-5699fc45d8-q7m9r" deleted

# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0                            6/6     Running   0          41s
alertmanager-main-1                            6/6     Running   0          73s
cluster-monitoring-operator-5699fc45d8-2dvbg   2/2     Running   0          83s
prometheus-k8s-0                               6/6     Running   0          52s
prometheus-k8s-1                               6/6     Running   0          68s


# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
no result

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
        - --web.external-url=https://prometheus-k8s.openshift-monitoring.svc:9091

# oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml | grep "web.external-url"
no result

# oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml | grep "web.external-url"
no result

# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep "web.external-url"
    - --web.external-url=https://prometheus-k8s.openshift-monitoring.svc:9091

# oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml | grep "web.external-url"
    - --web.external-url=https://prometheus-k8s.openshift-monitoring.svc:9091

restore the cluster
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=1
# oc -n openshift-console-operator scale deploy console-operator --replicas=1

# oc -n openshift-monitoring delete pod cluster-monitoring-operator-5699fc45d8-2dvbg 
pod "cluster-monitoring-operator-5699fc45d8-4786r" deleted

# oc get console/cluster -o jsonpath="{.status.consoleURL}"
https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com

# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0                            6/6     Running   0          52s
alertmanager-main-1                            6/6     Running   0          85s
cluster-monitoring-operator-5699fc45d8-cv4wv   2/2     Running   0          94s
prometheus-k8s-0                               6/6     Running   0          63s
prometheus-k8s-1                               6/6     Running   0          80s

# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
        - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
        - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

# oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml | grep "web.external-url"
    - --web.external-url=https:/console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring

Comment 14 errata-xmlrpc 2022-08-10 10:51:53 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

