Bug 1670005 - console-operator stuck in pending state if replica count of worker node is null
Summary: console-operator stuck in pending state if replica count of worker node is null
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Alex Crawford
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks: 1664187
 
Reported: 2019-01-28 11:01 UTC by Abhishek
Modified: 2020-05-11 13:55 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-02-18 18:40:11 UTC
Target Upstream Version:
Embargoed:
aabhishe: needinfo-



Description Abhishek 2019-01-28 11:01:21 UTC
Description of problem:

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 Abhishek 2019-01-28 11:29:38 UTC
Description of problem: console-operator is stuck in a pending state when there is no worker node in OpenShift 4.0. The masters are tainted with effect NoSchedule, so the installer fails while trying to verify whether the console is up. The console-operator deployment should have a matching toleration defined.


How reproducible:

Always

Steps to Reproduce:

1. Leave the worker machine pool's replica count empty in install-config.yaml (fragment below; a fragment with an explicit count follows it for comparison)

~~~
- name: master
  platform:
    aws:
      type: m5.xlarge
  replicas: 1
- name: worker
  platform: {}
  replicas: 
  creationTimestamp: null
~~~
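
For comparison, the same worker pool with the replica count set explicitly (the value here is only illustrative) would look like this:

~~~
- name: worker
  platform: {}
  replicas: 3
~~~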

Actual results:

time="2019-01-25T21:58:14+05:30" level=debug msg="Still waiting for the console route: the server could not find the requested resource (get routes.route.openshift.io)"
time="2019-01-25T21:58:51+05:30" level=debug msg="Still waiting for the console route: the server is currently unable to handle the request (get routes.route.openshift.io)"
time="2019-01-25T21:59:28+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:00:06+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:00:43+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:01:21+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:01:58+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:02:35+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:03:13+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:03:50+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:04:28+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:05:05+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:05:42+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:06:20+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:06:57+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:07:02+05:30" level=fatal msg="waiting for openshift-console URL: context deadline exceeded"


Expected results:
It should install successfully

Comment 2 W. Trevor King 2019-01-28 20:28:39 UTC
> The deployment of console-operator should have toleration defined.

Agreed, but this is a console-operator issue, and not an installer issue.  I *think* that the right component for that is Management Console, although that could also be some out-of-cluster management UI.  I'll optimistically redirect to Management Console, and we can redirect again if I'm guessing wrong ;).

Comment 3 Samuel Padgett 2019-01-28 20:35:55 UTC
(In reply to W. Trevor King from comment #2)
> I'll optimistically redirect to Management Console, and we can redirect again if I'm guessing wrong ;).

You found us!

Management Console is the right component

Comment 4 Samuel Padgett 2019-01-28 20:41:28 UTC
I'm unclear why you would see this, as the console operator has a toleration (and node selector) for master nodes:

https://github.com/spadgett/console-operator/blob/37991a619ba244c2c9204f84a1b8262a24f19725/manifests/05-operator.yaml#L16-L21

```yaml
      nodeSelector:
        node-role.kubernetes.io/master: ""
      tolerations:
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule
```

Can you check events in the openshift-console namespace? (Also openshift-console-operator if it exists, but that change just merged today.)
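
One way to pull those events (namespace names as given above; which namespaces exist depends on the build, as noted):

```console
oc get events -n openshift-console
oc get events -n openshift-console-operator
```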

Comment 5 Samuel Padgett 2019-01-28 20:42:48 UTC
Also pod logs for the console-operator if it is in fact running.
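
For example (the namespace and deployment name here are assumptions; adjust to whatever `oc get pods` actually shows):

```console
oc get pods -n openshift-console-operator
oc logs -n openshift-console-operator deployment/console-operator
```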

Comment 8 Samuel Padgett 2019-01-29 14:59:08 UTC
Note that if I specify 0 workers, the installer gives me this error:

FATAL failed to fetch Terraform Variables: failed to load asset "Install Config": invalid "install-config.yaml" file: machines[1].replicas: Invalid value: 0: number of replicas must be positive

So it looks like leaving `replicas` empty is different than explicitly specifying 0. I'm not sure if it's defaulted to a non-zero value or if that's an installer bug. Trevor?
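
To make the distinction concrete, the variant that produces the validation error above is an explicit zero (fragment modeled on the snippet in comment 1); an empty `replicas:` presumably parses as unset rather than zero, which is why it slips past that check:

~~~
- name: worker
  platform: {}
  replicas: 0
~~~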

Comment 9 Samuel Padgett 2019-01-29 15:15:15 UTC
I was able to reproduce. The cluster monitoring operator is failing, which prevents the CVO from ever getting to the console operator. I'm not convinced this is a valid configuration, however. Sending this back to the install team for evaluation.

I0129 15:11:15.878366       1 operatorstatus.go:110] ClusterOperator /cluster-monitoring-operator is not done; it is available=false, progressing=true, failing=true
I0129 15:11:16.878261       1 operatorstatus.go:84] ClusterOperator /cluster-monitoring-operator is reporting (v1.ClusterOperatorStatus) {
 Conditions: ([]v1.ClusterOperatorStatusCondition) (len=3 cap=4) {
  (v1.ClusterOperatorStatusCondition) {
   Type: (v1.ClusterStatusConditionType) (len=9) "Available",
   Status: (v1.ConditionStatus) (len=5) "False",
   LastTransitionTime: (v1.Time) 2019-01-29 15:08:27 +0000 UTC,
   Reason: (string) "",
   Message: (string) ""
  },
  (v1.ClusterOperatorStatusCondition) {
   Type: (v1.ClusterStatusConditionType) (len=11) "Progressing",
   Status: (v1.ConditionStatus) (len=4) "True",
   LastTransitionTime: (v1.Time) 2019-01-29 15:08:56 +0000 UTC,
   Reason: (string) "",
   Message: (string) (len=22) "Rolling out the stack."
  },
  (v1.ClusterOperatorStatusCondition) {
   Type: (v1.ClusterStatusConditionType) (len=7) "Failing",
   Status: (v1.ConditionStatus) (len=4) "True",
   LastTransitionTime: (v1.Time) 2019-01-29 15:08:27 +0000 UTC,
   Reason: (string) "",
   Message: (string) (len=234) "Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Service failed: updating Service object failed: services \"prometheus-operator\" is forbidden: caches not synchronized"
  }
 },
 Versions: ([]v1.OperandVersion) <nil>,
 RelatedObjects: ([]v1.ObjectReference) <nil>,
 Extension: (runtime.RawExtension) &RawExtension{Raw:nil,}
}
I0129 15:11:16.878276       1 operatorstatus.go:110] ClusterOperator /cluster-monitoring-operator is not done; it is available=false, progressing=true, failing=true
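
A quick way to confirm that it is monitoring, rather than the console, that the CVO is blocked on (standard OCP 4 resources; output will vary):

```console
oc get clusteroperators
oc get pods -n openshift-monitoring
```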

Comment 15 W. Trevor King 2019-01-29 18:35:57 UTC
> 1. The value for worker node replica count is kept empty
> ...
> - name: worker
>   platform: {}
>   replicas: 

This is definitely not a supported approach to having no workers.  What it should be doing is giving you the platform default (three for AWS).  I've filed [1] to close this loophole.  We have medium-term plans to allow folks to configure zero workers (via 'replicas: 0') [2], but we're not there yet.

[1]: https://github.com/openshift/installer/pull/1146
[2]: https://github.com/openshift/installer/pull/958

Comment 16 Alex Crawford 2019-02-13 22:47:46 UTC
Installing with 0 workers does need to be supported. We need this for bring-your-own-host, for example. A minimal OpenShift installation shouldn't require any workers.

Sending this back over to you, Sam.

Comment 17 Samuel Padgett 2019-02-13 23:23:55 UTC
Alex, the console tolerates running on masters. In fact, it requires it. The CVO is not getting to console at all, however. I believe a different component is the problem. (See my comments above.)

Comment 18 Alex Crawford 2019-02-13 23:51:56 UTC
Sam, I didn't see the private comment. Sorry for the noise.

Abhishek, can you try this again with the latest installer? That should tell you which component failed to install properly (though, it might also prevent you from creating a cluster without any workers).
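
A re-run with debug logging (invocation shown for the 4.0-era installer; exact flags may differ by build) should surface which ClusterOperator the install is actually waiting on:

```console
openshift-install create cluster --dir ./install-dir --log-level debug
```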

Comment 19 W. Trevor King 2019-02-14 00:28:25 UTC
> The cluster monitoring operator is failing...

This may be bug 1671137, although I'm not sure if that blocked the cluster-version operator or not.

Comment 20 Alex Crawford 2019-02-18 18:40:11 UTC
Closing due to inactivity.

Comment 21 Eric Rich 2019-02-18 19:24:30 UTC
This uses OpenShift Ansible and should not be considered an OCP 4 bug.

