Bug 2060091
| Summary: | CMO produces invalid alertmanager statefulset if console cluster .status.consoleURL is unset | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | aaleman |
| Component: | Monitoring | Assignee: | aaleman |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.10 | CC: | amuller, anpicker, jfajersk, spasquie |
| Target Milestone: | --- | | |
| Target Release: | 4.11.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2060756 (view as bug list) | Environment: | |
| Last Closed: | 2022-08-10 10:51:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 2060756 | | |
Description
aaleman 2022-03-02 16:56:23 UTC

The fix is in 4.11.0-0.nightly-2022-03-04-063157; tested with it, and the issue is still the same.

Before setting status.consoleURL to null for console/cluster:
# oc get console/cluster -o jsonpath="{.status.consoleURL}"
https://console-openshift-console.apps.qe-daily-0308.qe.devcluster.openshift.com
# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-daily-0308.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
- --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/
Scale down the CVO and console-operator:
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=0
# oc -n openshift-console-operator scale deploy console-operator --replicas=0
# oc -n openshift-cluster-version get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
cluster-version-operator 0/0 0 0 3h56m
# oc -n openshift-console-operator get deploy
NAME READY UP-TO-DATE AVAILABLE AGE
console-operator 0/0 0 0 3h43m
# oc proxy
Open another terminal and clear the console status by patching the status subresource through the proxy:
# curl -v -XPATCH -H "Accept: application/json" -H "Content-Type: application/merge-patch+json" -H "User-Agent: kubectl/v1.23.4 (linux/amd64) kubernetes/e6c093d" 'http://127.0.0.1:8001/apis/config.openshift.io/v1/consoles/cluster/status?fieldManager=kubectl-edit' --data '{"status":null}'
# oc get console/cluster -o jsonpath="{.status.consoleURL}"
no result
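The `{"status": null}` body relies on JSON merge-patch semantics (RFC 7386), where a null value deletes the key, so the entire console status, including consoleURL, is cleared; with the operators scaled down, nothing re-populates it. A small Go illustration of those semantics using the github.com/evanphx/json-patch library (the document and values here are made up for illustration):

```go
package main

import (
	"fmt"

	jsonpatch "github.com/evanphx/json-patch"
)

func main() {
	// A hypothetical console object with a populated status.
	doc := []byte(`{"spec":{},"status":{"consoleURL":"https://console.apps.example.com"}}`)
	// The same patch body sent by the curl call above: null deletes the key.
	patch := []byte(`{"status":null}`)

	merged, err := jsonpatch.MergePatch(doc, patch)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(merged)) // {"spec":{}} -- status is gone entirely
}
```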
# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0 6/6 Running 0 6m44s
alertmanager-main-1 5/6 CrashLoopBackOff 4 (64s ago) 2m52s
prometheus-k8s-0 6/6 Running 0 8m9s
prometheus-k8s-1 6/6 Running 0 10m
cluster-monitoring-operator-66cb5487b9-8l7sl 2/2 Running 0 99m
# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
- --web.external-url=/monitoring
# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
- --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/
# oc -n openshift-monitoring describe pod alertmanager-main-1
...
alertmanager:
Container ID: cri-o://e502d6f2cde8ff678bc26bd074f662cf364441192bef214dd28e7fb1fdd61596
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58
Ports: 9094/TCP, 9094/UDP
Host Ports: 0/TCP, 0/UDP
Args:
--config.file=/etc/alertmanager/config/alertmanager.yaml
--storage.path=/alertmanager
--data.retention=120h
--cluster.listen-address=[$(POD_IP)]:9094
--web.listen-address=127.0.0.1:9093
--web.external-url=/monitoring
--web.route-prefix=/
--cluster.peer=alertmanager-main-0.alertmanager-operated:9094
--cluster.peer=alertmanager-main-1.alertmanager-operated:9094
--cluster.reconnect-timeout=5m
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Message: level=info ts=2022-03-08T04:11:10.810Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=72e0ff6e1bacdb3e9ced559bc905bf4501eb8b61)"
level=info ts=2022-03-08T04:11:10.810Z caller=main.go:226 build_context="(go=go1.17.5, user=root@e377fc787659, date=20220304-03:55:17)"
level=info ts=2022-03-08T04:11:10.829Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=error ts=2022-03-08T04:11:10.857Z caller=main.go:369 msg="failed to determine external URL" err="\"/monitoring\": invalid \"\" scheme, only 'http' and 'https' are supported"
level=info ts=2022-03-08T04:11:10.857Z caller=cluster.go:680 component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=28.253187ms
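For reference, the failure matches Alertmanager's validation of --web.external-url: the value must parse as a URL with an http or https scheme, and a bare path like `/monitoring` parses with an empty scheme. A minimal Go sketch of that check (a reconstruction for illustration; checkExternalURL is a hypothetical helper, not Alertmanager's actual function):

```go
package main

import (
	"fmt"
	"net/url"
)

// checkExternalURL sketches the validation behind the error above:
// the external URL must carry an http or https scheme.
func checkExternalURL(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("%q: invalid %q scheme, only 'http' and 'https' are supported", raw, u.Scheme)
	}
	return nil
}

func main() {
	// A bare path parses with an empty scheme, so validation fails and the
	// container exits, producing the CrashLoopBackOff seen above.
	fmt.Println(checkExternalURL("/monitoring"))
	// A fully qualified URL passes.
	fmt.Println(checkExternalURL("https://console.example.com/monitoring"))
}
```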
Delete the CMO pod:
# oc -n openshift-monitoring delete pod cluster-monitoring-operator-66cb5487b9-8l7sl
pod "cluster-monitoring-operator-66cb5487b9-8l7sl" deleted
Still the same issue:
# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0 6/6 Running 0 39m
alertmanager-main-1 5/6 Error 3 (27s ago) 48s
cluster-monitoring-operator-66cb5487b9-8qwpb 2/2 Running 0 16m
prometheus-k8s-0 6/6 Running 0 20m
prometheus-k8s-1 6/6 Running 0 21m
# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
- --web.external-url=/monitoring
# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
- --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/
# oc -n openshift-monitoring get po alertmanager-main-0 -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-daily-0308.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get po alertmanager-main-1 -oyaml | grep "web.external-url"
- --web.external-url=/monitoring
# oc -n openshift-monitoring get po prometheus-k8s-0 -oyaml | grep "web.external-url"
- --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/
# oc -n openshift-monitoring get po prometheus-k8s-1 -oyaml | grep "web.external-url"
- --web.external-url=https://prometheus-k8s-openshift-monitoring.apps.qe-daily-0308.qe.devcluster.openshift.com/
# oc -n openshift-monitoring describe pod alertmanager-main-1
...
Containers:
alertmanager:
Container ID: cri-o://8c733eba4cbc223a0598bafa6d95356de050da0151f3608f233402efc5c8342c
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58
Image ID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:5273885946234d1b5c0ad3a21d9359243d0f44cfcbaa7e19213fb26989710c58
Ports: 9094/TCP, 9094/UDP
Host Ports: 0/TCP, 0/UDP
Args:
--config.file=/etc/alertmanager/config/alertmanager.yaml
--storage.path=/alertmanager
--data.retention=120h
--cluster.listen-address=[$(POD_IP)]:9094
--web.listen-address=127.0.0.1:9093
--web.external-url=/monitoring
--web.route-prefix=/
--cluster.peer=alertmanager-main-0.alertmanager-operated:9094
--cluster.peer=alertmanager-main-1.alertmanager-operated:9094
--cluster.reconnect-timeout=5m
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Message: level=info ts=2022-03-08T04:45:04.955Z caller=main.go:225 msg="Starting Alertmanager" version="(version=0.23.0, branch=rhaos-4.10-rhel-8, revision=72e0ff6e1bacdb3e9ced559bc905bf4501eb8b61)"
level=info ts=2022-03-08T04:45:04.955Z caller=main.go:226 build_context="(go=go1.17.5, user=root@e377fc787659, date=20220304-03:55:17)"
level=info ts=2022-03-08T04:45:04.968Z caller=cluster.go:671 component=cluster msg="Waiting for gossip to settle..." interval=2s
level=error ts=2022-03-08T04:45:05.007Z caller=main.go:369 msg="failed to determine external URL" err="\"/monitoring\": invalid \"\" scheme, only 'http' and 'https' are supported"
level=info ts=2022-03-08T04:45:05.007Z caller=cluster.go:680 component=cluster msg="gossip not settled but continuing anyway" polls=0 elapsed=38.734031ms
# oc -n openshift-monitoring rsh -c alertmanager alertmanager-main-0
sh-4.4$ /bin/alertmanager --help
...
--web.external-url=WEB.EXTERNAL-URL
    The URL under which Alertmanager is externally reachable (for example, if Alertmanager is served via a reverse proxy). Used for generating relative and absolute links back to Alertmanager itself. If the URL has a path portion, it will be used to prefix all HTTP endpoints served by Alertmanager. If omitted, relevant URL components will be derived automatically.
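The "derived automatically" part explains why CMO also passes an explicit --web.route-prefix=/ (visible in the pod args above): without it, Alertmanager would default the route prefix to the path portion of the external URL. A minimal Go sketch of that fallback, assuming the documented behavior (defaultRoutePrefix is a hypothetical helper, not Alertmanager's actual code):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// defaultRoutePrefix sketches the documented fallback: when
// --web.route-prefix is not given, the path portion of
// --web.external-url becomes the HTTP route prefix.
func defaultRoutePrefix(externalURL, explicit string) (string, error) {
	prefix := explicit
	if prefix == "" {
		u, err := url.Parse(externalURL)
		if err != nil {
			return "", err
		}
		prefix = u.Path
	}
	// Normalize to a single leading slash.
	return "/" + strings.Trim(prefix, "/"), nil
}

func main() {
	// Without the explicit flag, /monitoring would prefix every endpoint.
	fmt.Println(defaultRoutePrefix("https://console.example.com/monitoring", ""))
	// CMO pins the prefix to "/" so the pod serves at the root behind its proxy.
	fmt.Println(defaultRoutePrefix("https://console.example.com/monitoring", "/"))
}
```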
Junqi Zhao, I cannot reproduce this; for me it works on that version.
The cluster has the same version as the one you mentioned:
$ k get clusterversion version -ojson|jq .status.desired
{
"image": "registry.ci.openshift.org/ocp/release:4.11.0-0.nightly-2022-03-04-063157",
"version": "4.11.0-0.nightly-2022-03-04-063157"
}
The console doesn't have the URL in its status:
$ k get console cluster -ojson|jq .status
null
The statefulset doesn't have `web.external-url` in the alertmanager args:
$ k get statefulset -n openshift-monitoring alertmanager-main -ojson|jq '.spec.template.spec.containers[0].args'
[
"--config.file=/etc/alertmanager/config/alertmanager.yaml",
"--storage.path=/alertmanager",
"--data.retention=120h",
"--cluster.listen-address=[$(POD_IP)]:9094",
"--web.listen-address=127.0.0.1:9093",
"--web.route-prefix=/",
"--cluster.peer=alertmanager-main-0.alertmanager-operated:9094",
"--cluster.peer=alertmanager-main-1.alertmanager-operated:9094",
"--cluster.reconnect-timeout=5m"
]
Alertmanager is running:
$ k get pod|rg alertmanager
alertmanager-main-0 6/6 Running 0 9m8s
alertmanager-main-1 6/6 Running 0 9m39s
Alertmanager pod doesn't have the `web.external-url` arg either:
$ k get pod alertmanager-main-0 -ojson|jq '.spec.containers[0].args'
[
"--config.file=/etc/alertmanager/config/alertmanager.yaml",
"--storage.path=/alertmanager",
"--data.retention=120h",
"--cluster.listen-address=[$(POD_IP)]:9094",
"--web.listen-address=127.0.0.1:9093",
"--web.route-prefix=/",
"--cluster.peer=alertmanager-main-0.alertmanager-operated:9094",
"--cluster.peer=alertmanager-main-1.alertmanager-operated:9094",
"--cluster.reconnect-timeout=5m"
]
Are you sure your cluster is at 4.11.0-0.nightly-2022-03-04-063157 and not some other version that doesn't have the patch?
> I kindly think the scenario in Comment 0 is invalid, based on the --web.external-url help
It can happen during cluster creation and results in a crash-looping Alertmanager pod; that is what this bug is about.
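The fix verified below amounts to CMO omitting the flag entirely when no console URL is available, rather than emitting a bare path. A simplified Go sketch of that shape (a hypothetical helper for illustration, not CMO's actual code):

```go
package main

import "fmt"

// alertmanagerExternalURLArgs sketches the fixed behavior: the
// --web.external-url flag is emitted only when the console URL is set,
// so an unset .status.consoleURL yields no flag instead of "/monitoring".
func alertmanagerExternalURLArgs(consoleURL string) []string {
	if consoleURL == "" {
		return nil
	}
	return []string{"--web.external-url=" + consoleURL + "/monitoring"}
}

func main() {
	fmt.Println(alertmanagerExternalURLArgs(""))                                 // [] -- flag omitted
	fmt.Println(alertmanagerExternalURLArgs("https://console.apps.example.com")) // [--web.external-url=...]
}
```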
Tested with 4.11.0-0.nightly-2022-03-09-235248; no issue now.
# oc get console/cluster -o jsonpath="{.status.consoleURL}"
no result
# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0 6/6 Running 0 76s
alertmanager-main-1 6/6 Running 0 109s
cluster-monitoring-operator-5699fc45d8-q7m9r 2/2 Running 0 118s
prometheus-k8s-0 6/6 Running 0 86s
prometheus-k8s-1 6/6 Running 0 104s
# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring delete pod cluster-monitoring-operator-5699fc45d8-q7m9r
pod "cluster-monitoring-operator-5699fc45d8-q7m9r" deleted
# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0 6/6 Running 0 41s
alertmanager-main-1 6/6 Running 0 73s
cluster-monitoring-operator-5699fc45d8-2dvbg 2/2 Running 0 83s
prometheus-k8s-0 6/6 Running 0 52s
prometheus-k8s-1 6/6 Running 0 68s
# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
no result
# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
- --web.external-url=https://prometheus-k8s.openshift-monitoring.svc:9091
# oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml | grep "web.external-url"
no result
# oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml | grep "web.external-url"
no result
# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep "web.external-url"
- --web.external-url=https://prometheus-k8s.openshift-monitoring.svc:9091
# oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml | grep "web.external-url"
- --web.external-url=https://prometheus-k8s.openshift-monitoring.svc:9091
Restore the cluster:
# oc -n openshift-cluster-version scale deploy cluster-version-operator --replicas=1
# oc -n openshift-console-operator scale deploy console-operator --replicas=1
# oc -n openshift-monitoring delete pod cluster-monitoring-operator-5699fc45d8-2dvbg
pod "cluster-monitoring-operator-5699fc45d8-4786r" deleted
# oc get console/cluster -o jsonpath="{.status.consoleURL}"
https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com
# oc -n openshift-monitoring get pod | grep -E "alertmanager-main|prometheus-k8s|cluster-monitoring-operator"
alertmanager-main-0 6/6 Running 0 52s
alertmanager-main-1 6/6 Running 0 85s
cluster-monitoring-operator-5699fc45d8-cv4wv 2/2 Running 0 94s
prometheus-k8s-0 6/6 Running 0 63s
prometheus-k8s-1 6/6 Running 0 80s
# oc -n openshift-monitoring get sts alertmanager-main -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get sts prometheus-k8s -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get pod alertmanager-main-0 -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get pod alertmanager-main-1 -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get pod prometheus-k8s-0 -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
# oc -n openshift-monitoring get pod prometheus-k8s-1 -oyaml | grep "web.external-url"
- --web.external-url=https://console-openshift-console.apps.qe-ui411-0311.qe.devcluster.openshift.com/monitoring
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069