Created attachment 1485230 [details]
pod output

Description of problem:
Upgrade from 3.10 to 3.11 fails (on origin).

Version-Release number of selected component (if applicable):

How reproducible:
100%

Steps to Reproduce:
1. Deploy a 3.10 cluster (1 master-infra, 2 compute)
2. Upgrade to 3.11
3. Upgrade fails.

Actual results:
Failure summary:

  1. Host:  fedora1.mguginolocal.com
     Play:  Configure Cluster Monitoring Operator
     Task:  Wait for the ServiceMonitor CRD to be created

Expected results:
Monitoring and the associated CRD are created.

Additional info:
Text files attached.
Created attachment 1485231 [details]
playbook failure
Could you share the state of all pods in the `openshift-monitoring` namespace?
(In reply to Frederic Branczyk from comment #2)
> Could you share the state of all pods in the `openshift-monitoring`
> namespace?

Frederic, I have attached the pod output in the first attachment. Is there other information you are looking for specifically? The cluster is still online; let me know what commands to run and I'll get you the info.
For a basic understanding of what's going on, could you just share

kubectl -n openshift-monitoring get pods

and this one in a separate attachment:

kubectl -n openshift-monitoring get pods -oyaml
# oc -n openshift-monitoring logs prometheus-operator-6c9fddd47f-qdhz8
standard_init_linux.go:178: exec user process caused "operation not permitted"
Interesting. Could you share the deployment that makes up the Prometheus Operator?

kubectl -n openshift-monitoring get deploy prometheus-operator -oyaml
(In reply to Frederic Branczyk from comment #8)
> Interesting. Could you share the deployment that makes up the Prometheus
> Operator?
>
> kubectl -n openshift-monitoring get deploy prometheus-operator -oyaml

https://gist.github.com/michaelgugino/287ceae35291a8ef2c0a86fb8891a5d3
I see what the problem is. This is an origin cluster, and a version bump is missing for origin. I'll work on it.
In the latest upgrade testing, cluster monitoring was upgraded to v3.11.11 on OCP and deployed successfully, so this is not a test blocker for OCP.
Opened a PR for the bump: https://github.com/openshift/openshift-ansible/pull/10220
The PR and cherry-pick to 3.11 got merged.
Not a 3.11.0 release blocker; moving to 3.11.z.
Issue is fixed with:

# rpm -qa | grep openshift-ansible
openshift-ansible-3.11.20-1.git.0.734e601.el7.noarch
openshift-ansible-docs-3.11.20-1.git.0.734e601.el7.noarch
openshift-ansible-roles-3.11.20-1.git.0.734e601.el7.noarch
openshift-ansible-playbooks-3.11.20-1.git.0.734e601.el7.noarch

Please change to ON_QA.

NOTE: Please set the following parameters if you want to use PVs; 3.11 does not attach PVs by default:

openshift_cluster_monitoring_operator_prometheus_storage_enabled=true
openshift_cluster_monitoring_operator_prometheus_storage_capacity={xx}Gi
openshift_cluster_monitoring_operator_alertmanager_storage_enabled=true
openshift_cluster_monitoring_operator_alertmanager_storage_capacity={xx}Gi
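For reference, a minimal sketch of how those storage variables would sit in an openshift-ansible inventory file. The group name and the `50Gi`/`2Gi` capacities are illustrative assumptions, not values from this bug; substitute your own sizes for the `{xx}` placeholders above.

```ini
# Hypothetical inventory snippet -- [OSEv3:vars] is the standard
# openshift-ansible variable group; capacities below are example values only.
[OSEv3:vars]
openshift_cluster_monitoring_operator_prometheus_storage_enabled=true
openshift_cluster_monitoring_operator_prometheus_storage_capacity=50Gi
openshift_cluster_monitoring_operator_alertmanager_storage_enabled=true
openshift_cluster_monitoring_operator_alertmanager_storage_capacity=2Gi
```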
Junqi let me know if you need anything else from our side.
(In reply to minden from comment #16)
> Junqi let me know if you need anything else from our side.

Thanks, I will set it to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0024