Bug 1678475
| Field | Value |
| --- | --- |
| Summary | install/upgrade perf: cluster monitoring operator deploys operands serially rather than in parallel |
| Product | OpenShift Container Platform |
| Component | Monitoring |
| Version | 4.1.0 |
| Target Release | 4.1.0 |
| Status | CLOSED ERRATA |
| Severity | medium |
| Priority | unspecified |
| Hardware | Unspecified |
| OS | Unspecified |
| Reporter | Seth Jennings <sjenning> |
| Assignee | Sergiusz Urbaniak <surbania> |
| QA Contact | Junqi Zhao <juzhao> |
| CC | ccoleman, fbranczy, mloibl, sponnaga, surbania |
| Type | Bug |
| Last Closed | 2019-06-04 10:44:14 UTC |
Created attachment 1538613 [details]
kubechart-2.png
I see the changes, but there now seems to be a lot of time where nothing is happening (see the new attachment).
Basically:
t-0 - CMO starts
+1m - prom operator starts (>1m image pull time)
+2m - prom operator running
+3m - everything except prom and prom-adapter starts
+6m - prom and prom-adapter start
By my observation, things are starting more in parallel, but because components sit idle waiting between steps, the overall rollout still takes about the same amount of time :-/
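Much of that dead time comes from the operator blocking on each component's rollout before starting the next. A minimal sketch of such a rollout wait, assuming client-go and the apimachinery wait package (a hypothetical helper for illustration, not the CMO's actual code): when waits like this run serially, each one adds its full duration to the critical path.

```go
package tasks

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForDeploymentRollout is a hypothetical helper (not the CMO's actual
// implementation) that blocks until a Deployment is fully rolled out.
func waitForDeploymentRollout(client kubernetes.Interface, ns, name string) error {
	return wait.Poll(5*time.Second, 5*time.Minute, func() (bool, error) {
		d, err := client.AppsV1().Deployments(ns).Get(context.TODO(), name, metav1.GetOptions{})
		if err != nil {
			return false, err
		}
		want := int32(1)
		if d.Spec.Replicas != nil {
			want = *d.Spec.Replicas
		}
		// Done once the controller has observed the latest spec and all
		// desired replicas are updated and available.
		done := d.Status.ObservedGeneration >= d.Generation &&
			d.Status.UpdatedReplicas == want &&
			d.Status.AvailableReplicas == want
		return done, nil
	})
}
```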
From the chart, monitoring components are deployed in parallel now, but it took about 12 minutes to roll out all the monitoring components. Other components such as openshift-kube-scheduler-operator and openshift-marketplace also took about 12 minutes to roll out all components:

cluster-monitoring-operator-775cccc768-b7sj7  2019-04-09T21:22:59.633855813-04:00  2019-04-09T21:34:03.680993632-04:00
node-exporter-qtj7g                           2019-04-09T21:22:59.632973119-04:00  2019-04-09T21:34:03.680989644-04:00
node-exporter-7r4r4                           2019-04-09T21:22:59.634100905-04:00  2019-04-09T21:34:03.680997416-04:00
node-exporter-fmgxk                           2019-04-09T21:23:14.330596029-04:00  2019-04-09T21:34:03.680992082-04:00
node-exporter-r6xxk                           2019-04-09T21:25:21.89682946-04:00   2019-04-09T21:34:03.680997923-04:00
prometheus-operator-5ff75f95fc-k854z          2019-04-09T21:25:22.795222761-04:00  2019-04-09T21:34:03.680995725-04:00
node-exporter-lvt8c                           2019-04-09T21:25:32.744990996-04:00  2019-04-09T21:34:03.680990378-04:00
node-exporter-nnvpc                           2019-04-09T21:26:09.737593613-04:00  2019-04-09T21:34:03.680998403-04:00
telemeter-client-8d885568b-9prt4              2019-04-09T21:26:29.870418508-04:00  2019-04-09T21:34:03.680993034-04:00
kube-state-metrics-697cd6f695-wsvmf           2019-04-09T21:26:43.857177907-04:00  2019-04-09T21:34:03.680997163-04:00
grafana-56879d5757-bbxvg                      2019-04-09T21:27:18.222775024-04:00  2019-04-09T21:34:03.680994595-04:00
alertmanager-main-0                           2019-04-09T21:27:25.66836269-04:00   2019-04-09T21:34:03.680994115-04:00
alertmanager-main-1                           2019-04-09T21:27:47.949750711-04:00  2019-04-09T21:34:03.68099098-04:00
alertmanager-main-2                           2019-04-09T21:28:17.606583558-04:00  2019-04-09T21:34:03.680991418-04:00
prometheus-k8s-1                              2019-04-09T21:28:51.757987889-04:00  2019-04-09T21:34:03.681027516-04:00
prometheus-k8s-0                              2019-04-09T21:29:53.943373903-04:00  2019-04-09T21:34:03.681027175-04:00
prometheus-adapter-7cc8fbcbd-4ldtm            2019-04-09T21:30:13.559570559-04:00  2019-04-09T21:34:03.680995126-04:00
prometheus-adapter-7cc8fbcbd-9ttsx            2019-04-09T21:30:13.559570559-04:00  2019-04-09T21:34:03.680995126-04:00

Created attachment 1554009 [details]
kubechart -3
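The per-pod creation timestamps above can be collected with a short client-go program. This is a sketch assuming an external kubeconfig in $KUBECONFIG; it is not necessarily how the kubechart attachments were produced:

```go
package main

import (
	"context"
	"fmt"
	"os"
	"sort"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes a kubeconfig path is set in $KUBECONFIG.
	cfg, err := clientcmd.BuildConfigFromFlags("", os.Getenv("KUBECONFIG"))
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pods, err := client.CoreV1().Pods("openshift-monitoring").List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// Sort by creation time to reconstruct the rollout order.
	sort.Slice(pods.Items, func(i, j int) bool {
		return pods.Items[i].CreationTimestamp.Before(&pods.Items[j].CreationTimestamp)
	})
	for _, p := range pods.Items {
		fmt.Printf("%-45s %s\n", p.Name, p.CreationTimestamp.Format(time.RFC3339Nano))
	}
}
```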
payload: 4.0.0-0.nightly-2019-04-05-165550

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758
Created attachment 1536117 [details]
cmo-kubechart.png

Currently, the CMO takes about 6-8m to roll out all the monitoring components. However, it does so serially rather than in parallel (see the attached kubechart). Is there a reason for this? If not, let's parallelize this: the CMO is deployed late in install/upgrade and is on the critical path to completing install/upgrade. The telemeter-client is currently the last pod to start in the cluster.
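For illustration, independent operand tasks could be fanned out with golang.org/x/sync/errgroup. The task names and functions below are hypothetical stand-ins, not the CMO's actual task list, and components with real ordering dependencies (anything needing the prometheus-operator's CRDs, for example) would still have to wait:

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/sync/errgroup"
)

// Hypothetical task functions standing in for per-component reconcile tasks.
func deployNodeExporter(ctx context.Context) error     { return nil }
func deployKubeStateMetrics(ctx context.Context) error { return nil }
func deployTelemeterClient(ctx context.Context) error  { return nil }

func main() {
	tasks := map[string]func(context.Context) error{
		"node-exporter":      deployNodeExporter,
		"kube-state-metrics": deployKubeStateMetrics,
		"telemeter-client":   deployTelemeterClient,
	}

	// Run every independent task concurrently; the first error cancels
	// the shared context and is returned from Wait.
	g, ctx := errgroup.WithContext(context.Background())
	for name, run := range tasks {
		name, run := name, run // capture loop variables
		g.Go(func() error {
			if err := run(ctx); err != nil {
				return fmt.Errorf("deploying %s: %w", name, err)
			}
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		fmt.Println("rollout failed:", err)
	}
}
```

With this shape, total rollout time approaches the slowest single task (plus any ordered prerequisites) instead of the sum of all tasks.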