Description of problem:

Version-Release number of the following components:
rpm -q openshift-ansible
rpm -q ansible
ansible --version

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Please include the entire output from the last TASK line through the end of output if an error is generated

Expected results:

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
Description of problem:
console-operator is stuck in a pending state without a worker node in OpenShift 4.0. The master is tainted with effect NoSchedule. The installer fails while trying to verify whether the console is up. The deployment of console-operator should have a toleration defined.

How reproducible:
Always

Steps to Reproduce:
1. The value for the worker node replica count is left empty:
~~~
- name: master
  platform:
    aws:
      type: m5.xlarge
  replicas: 1
- name: worker
  platform: {}
  replicas:
creationTimestamp: null
~~~

Actual results:
time="2019-01-25T21:58:14+05:30" level=debug msg="Still waiting for the console route: the server could not find the requested resource (get routes.route.openshift.io)"
time="2019-01-25T21:58:51+05:30" level=debug msg="Still waiting for the console route: the server is currently unable to handle the request (get routes.route.openshift.io)"
time="2019-01-25T21:59:28+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:00:06+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:00:43+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:01:21+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:01:58+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:02:35+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:03:13+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:03:50+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:04:28+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:05:05+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:05:42+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:06:20+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:06:57+05:30" level=debug msg="Still waiting for the console route..."
time="2019-01-25T22:07:02+05:30" level=fatal msg="waiting for openshift-console URL: context deadline exceeded"

Expected results:
The installation should complete successfully.
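For reference, a minimal sketch of the checks that can be run against the partially installed cluster while the installer is still waiting. It assumes the default installer asset directory layout (auth/kubeconfig) and the usual console namespace name; adjust paths if --dir was used.

```bash
# Use the kubeconfig the installer wrote into the asset directory (default location; assumption).
export KUBECONFIG=./auth/kubeconfig

# Which nodes exist, and what taints do they carry?
oc get nodes
oc describe nodes | grep -A2 Taints

# Has the console been rolled out at all, and is its route present?
oc get pods -n openshift-console
oc get routes -n openshift-console
```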
> The deployment of console-operator should have toleration defined.

Agreed, but this is a console-operator issue, not an installer issue. I *think* that the right component for that is Management Console, although that could also be some out-of-cluster management UI. I'll optimistically redirect to Management Console, and we can redirect again if I'm guessing wrong ;).
(In reply to W. Trevor King from comment #2)
> I'll optimistically redirect to Management Console, and we can redirect again if I'm guessing wrong ;).

You found us! Management Console is the right component.
I'm unclear why you would see this, as the console operator has a toleration (and node selector) for master nodes:

https://github.com/spadgett/console-operator/blob/37991a619ba244c2c9204f84a1b8262a24f19725/manifests/05-operator.yaml#L16-L21

```yaml
nodeSelector:
  node-role.kubernetes.io/master: ""
tolerations:
- key: node-role.kubernetes.io/master
  operator: Exists
  effect: NoSchedule
```

Can you check events in the openshift-console namespace? (Also openshift-console-operator if it exists, but that change just merged today.)
Also pod logs for the console-operator if it is in fact running.
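In case it helps, a sketch of the commands for gathering that information. The openshift-console-operator namespace and the console-operator deployment name are assumptions based on the manifest linked above, and may not exist yet depending on whether the namespace split has landed in your payload.

```bash
# Events in the console namespaces (the -operator namespace may not exist yet).
oc get events -n openshift-console
oc get events -n openshift-console-operator

# Pod status and logs for the operator, if it is actually running.
oc get pods -n openshift-console-operator
oc logs -n openshift-console-operator deploy/console-operator
```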
Note that if I specify 0 workers, the installer gives me this error:

FATAL failed to fetch Terraform Variables: failed to load asset "Install Config": invalid "install-config.yaml" file: machines[1].replicas: Invalid value: 0: number of replicas must be positive

So it looks like leaving `replicas` empty is different from explicitly specifying 0. I'm not sure if it's defaulted to a non-zero value or if that's an installer bug. Trevor?
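For completeness, this is roughly how that validation error can be reproduced; the asset directory name is arbitrary and the exact error prefix may vary depending on which target you invoke.

```bash
# Generate a fresh install-config, then set the worker pool to an explicit zero.
openshift-install create install-config --dir=zero-workers
# (edit zero-workers/install-config.yaml: worker pool -> replicas: 0)

# Any target that loads the install-config should trip the same replicas validation.
openshift-install create cluster --dir=zero-workers
```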
I was able to reproduce. The cluster monitoring operator is failing, which prevents the CVO from ever getting to the console operator. I'm not convinced this is a valid configuration, however. Sending this back to the install team for evaluation.

I0129 15:11:15.878366       1 operatorstatus.go:110] ClusterOperator /cluster-monitoring-operator is not done; it is available=false, progressing=true, failing=true
I0129 15:11:16.878261       1 operatorstatus.go:84] ClusterOperator /cluster-monitoring-operator is reporting (v1.ClusterOperatorStatus) {
 Conditions: ([]v1.ClusterOperatorStatusCondition) (len=3 cap=4) {
  (v1.ClusterOperatorStatusCondition) {
   Type: (v1.ClusterStatusConditionType) (len=9) "Available",
   Status: (v1.ConditionStatus) (len=5) "False",
   LastTransitionTime: (v1.Time) 2019-01-29 15:08:27 +0000 UTC,
   Reason: (string) "",
   Message: (string) ""
  },
  (v1.ClusterOperatorStatusCondition) {
   Type: (v1.ClusterStatusConditionType) (len=11) "Progressing",
   Status: (v1.ConditionStatus) (len=4) "True",
   LastTransitionTime: (v1.Time) 2019-01-29 15:08:56 +0000 UTC,
   Reason: (string) "",
   Message: (string) (len=22) "Rolling out the stack."
  },
  (v1.ClusterOperatorStatusCondition) {
   Type: (v1.ClusterStatusConditionType) (len=7) "Failing",
   Status: (v1.ConditionStatus) (len=4) "True",
   LastTransitionTime: (v1.Time) 2019-01-29 15:08:27 +0000 UTC,
   Reason: (string) "",
   Message: (string) (len=234) "Failed to rollout the stack. Error: running task Updating Prometheus Operator failed: reconciling Prometheus Operator Service failed: updating Service object failed: services \"prometheus-operator\" is forbidden: caches not synchronized"
  }
 },
 Versions: ([]v1.OperandVersion) <nil>,
 RelatedObjects: ([]v1.ObjectReference) <nil>,
 Extension: (runtime.RawExtension) &RawExtension{Raw:nil,}
}
I0129 15:11:16.878276       1 operatorstatus.go:110] ClusterOperator /cluster-monitoring-operator is not done; it is available=false, progressing=true, failing=true
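A sketch of how to confirm which operator the CVO is stuck on; the ClusterOperator name below is taken from the log above, and the openshift-monitoring deployment name is an assumption.

```bash
# Overall operator status as reported to the CVO.
oc get clusteroperators

# Details for the failing operator (name taken from the CVO log above).
oc describe clusteroperator cluster-monitoring-operator

# The monitoring operator's own pods and logs.
oc get pods -n openshift-monitoring
oc logs -n openshift-monitoring deploy/cluster-monitoring-operator
```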
> 1. The value for worker node replica count is kept empty
> ...
> - name: worker
>   platform: {}
>   replicas:

This is definitely not a supported approach to having no workers. What it should be doing is giving you the platform default (three for AWS). I've filed [1] to close this loophole. We have medium-term plans to allow folks to configure zero workers (via 'replicas: 0') [2], but we're not there yet.

[1]: https://github.com/openshift/installer/pull/1146
[2]: https://github.com/openshift/installer/pull/958
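To see what an empty `replicas` actually resolves to without launching a cluster, one option (a sketch; the rendered directory layout is the installer's current default and may change) is to render the manifests and inspect the generated worker MachineSets:

```bash
# Render manifests from an install-config that leaves worker replicas empty.
openshift-install create manifests --dir=empty-replicas

# The worker MachineSet manifests should show the defaulted replica count.
grep -R "replicas" empty-replicas/openshift/
```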
Installing with 0 workers does need to be supported. We need this for bring-your-own-host, for example. A minimal OpenShift installation shouldn't require any workers. Sending this back over to you, Sam.
Alex, the console tolerates running on masters. In fact, it requires it. The CVO is not getting to console at all, however. I believe a different component is the problem. (See my comments above.)
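To tell "console pods pending on the master taint" apart from "CVO never created the console at all", a quick sketch; the deployment name "console" is an assumption.

```bash
# If this returns NotFound, the CVO never got as far as creating the console.
oc get deployment console -n openshift-console

# If the deployment exists but its pods are Pending, scheduling/tolerations are the problem instead.
oc get pods -n openshift-console -o wide
```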
Sam, I didn't see the private comment. Sorry for the noise. Abhishek, can you try this again with the latest installer? That should tell you which component failed to install properly (though, it might also prevent you from creating a cluster without any workers).
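A sketch of the retry with the newer installer, assuming the binary is already on PATH; debug logging makes it clearer which cluster operator the install is actually stuck on.

```bash
# Confirm which installer build is in use.
openshift-install version

# Retry in a fresh asset directory with debug logging enabled.
openshift-install create cluster --dir=retry --log-level=debug
```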
> The cluster monitoring operator is failing...

This may be bug 1671137, although I'm not sure if that blocked the cluster-version operator or not.
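One way to check whether the monitoring failure actually blocked the cluster-version operator (a sketch; the CVO namespace and deployment names are the usual ones, but treat them as assumptions):

```bash
# The ClusterVersion conditions show whether the CVO is stuck and on what.
oc get clusterversion -o yaml

# CVO logs carry the ordering/blocking details.
oc logs -n openshift-cluster-version deploy/cluster-version-operator
```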
Closing due to inactivity.
This uses OpenShift Ansible and should not be considered an OCP 4 bug.