Bug 1631481 - Upgrade fails because monitoring fails to deploy
Summary: Upgrade fails because monitoring fails to deploy
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Frederic Branczyk
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-20 16:47 UTC by Michael Gugino
Modified: 2019-01-10 09:04 UTC
CC List: 2 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-10 09:03:58 UTC
Target Upstream Version:
Embargoed:


Attachments
pod output (6.89 KB, text/plain), 2018-09-20 16:47 UTC, Michael Gugino
playbook failure (7.22 KB, text/plain), 2018-09-20 16:47 UTC, Michael Gugino


Links
Red Hat Product Errata RHBA-2019:0024, last updated 2019-01-10 09:04:05 UTC

Description Michael Gugino 2018-09-20 16:47:17 UTC
Created attachment 1485230 [details]
pod output

Description of problem:
Upgrade from 3.10 to 3.11 fails (on Origin).

Version-Release number of selected component (if applicable):


How reproducible: 100%


Steps to Reproduce:
1.  Deploy 3.10 cluster (1 master-infra, 2 compute)
2.  Upgrade to 3.11 (see the example upgrade command after these steps)
3.  Upgrade fails.
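
For reference, step 2 was presumably driven by the standard openshift-ansible upgrade playbook, along these lines (the inventory path is a placeholder; the playbook path is as documented for the 3.11 upgrade):

# ansible-playbook -i /path/to/inventory \
    /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_11/upgrade.yml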

Actual results:
Failure summary:


  1. fedora1.mguginolocal.com
     Configure Cluster Monitoring Operator
     Wait for the ServiceMonitor CRD to be created
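
One way to check by hand whether the CRD that task waits on exists (assuming the standard CRD name registered by the Prometheus Operator):

# oc get crd servicemonitors.monitoring.coreos.com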


Expected results:
Monitoring and the associated CRD are created.

Additional info:
See attached text files.

Comment 1 Michael Gugino 2018-09-20 16:47:41 UTC
Created attachment 1485231 [details]
playbook failure

Comment 2 Frederic Branczyk 2018-09-21 10:13:51 UTC
Could you share the state of all pods in the `openshift-monitoring` namespace?

Comment 3 Michael Gugino 2018-09-21 15:28:28 UTC
(In reply to Frederic Branczyk from comment #2)
> Could you share the state of all pods in the `openshift-monitoring`
> namespace?

Frederic, I have attached the pod output in the first attachment. Is there any other information you are looking for specifically?

Cluster is still online, let me know what commands to run and I'll get you the info.

Comment 4 Frederic Branczyk 2018-09-21 16:18:07 UTC
For a basic understanding of what's going on, could you just share

kubectl -n openshift-monitoring get pods

And this one in a separate attachment

kubectl -n openshift-monitoring get pods -oyaml

Comment 7 Michael Gugino 2018-09-21 17:11:04 UTC
# oc -n openshift-monitoring logs prometheus-operator-6c9fddd47f-qdhz8
standard_init_linux.go:178: exec user process caused "operation not permitted"
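
In case it helps with triage, the image that pod is running can be read back with something like the following (pod name taken from above; standard jsonpath):

# oc -n openshift-monitoring get pod prometheus-operator-6c9fddd47f-qdhz8 \
    -o jsonpath='{.spec.containers[0].image}'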

Comment 8 Frederic Branczyk 2018-09-24 09:19:32 UTC
Interesting. Could you share the deployment that makes up the Prometheus Operator?

kubectl -n openshift-monitoring get deploy prometheus-operator -oyaml

Comment 9 Michael Gugino 2018-09-24 14:10:54 UTC
(In reply to Frederic Branczyk from comment #8)
> Interesting. Could you share the deployment that makes up the Prometheus
> Operator?
> 
> kubectl -n openshift-monitoring get deploy prometheus-operator -oyaml

https://gist.github.com/michaelgugino/287ceae35291a8ef2c0a86fb8891a5d3

Comment 10 Frederic Branczyk 2018-09-24 16:14:48 UTC
I see what the problem is. This is an Origin cluster, and a version bump is missing for Origin. I'll work on it.
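
For anyone following along, the image currently deployed (see the gist in comment 9) can be pulled out and compared against what the 3.11 playbooks expect with something like:

# oc -n openshift-monitoring get deploy prometheus-operator \
    -o jsonpath='{.spec.template.spec.containers[0].image}'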

Comment 11 Junqi Zhao 2018-09-25 06:08:59 UTC
The latest upgrade test upgraded cluster monitoring to v3.11.11 on OCP, and cluster monitoring deployed successfully, so this is not a test blocker for OCP.

Comment 12 Frederic Branczyk 2018-09-25 16:10:07 UTC
Opened a PR for the bump: https://github.com/openshift/openshift-ansible/pull/10220

Comment 13 Frederic Branczyk 2018-09-28 20:30:27 UTC
The PR and cherry-pick to 3.11 got merged.

Comment 14 N. Harrison Ripps 2018-10-03 18:32:00 UTC
Not a 3.11.0 release blocker; moving to 3.11.z.

Comment 15 Junqi Zhao 2018-10-08 08:19:27 UTC
The issue is fixed with:
# rpm -qa | grep openshift-ansible
openshift-ansible-3.11.20-1.git.0.734e601.el7.noarch
openshift-ansible-docs-3.11.20-1.git.0.734e601.el7.noarch
openshift-ansible-roles-3.11.20-1.git.0.734e601.el7.noarch
openshift-ansible-playbooks-3.11.20-1.git.0.734e601.el7.noarch

Please change the status to ON_QA.

NOTE:
Please set the following parameters if you want to use PVs; 3.11 does not attach PVs by default:
openshift_cluster_monitoring_operator_prometheus_storage_enabled=true
openshift_cluster_monitoring_operator_prometheus_storage_capacity={xx}Gi
openshift_cluster_monitoring_operator_alertmanager_storage_enabled=true
openshift_cluster_monitoring_operator_alertmanager_storage_capacity={xx}Gi
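
These are openshift-ansible inventory variables, so they belong in the [OSEv3:vars] section of the inventory; a minimal sketch with purely illustrative capacities:

[OSEv3:vars]
openshift_cluster_monitoring_operator_prometheus_storage_enabled=true
openshift_cluster_monitoring_operator_prometheus_storage_capacity=50Gi
openshift_cluster_monitoring_operator_alertmanager_storage_enabled=true
openshift_cluster_monitoring_operator_alertmanager_storage_capacity=2Gi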

Comment 16 minden 2018-10-08 09:39:36 UTC
Junqi let me know if you need anything else from our side.

Comment 17 Junqi Zhao 2018-10-08 09:40:45 UTC
(In reply to minden from comment #16)
> Junqi let me know if you need anything else from our side.

Thanks, I will set it to VERIFIED.

Comment 19 errata-xmlrpc 2019-01-10 09:03:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0024

