[Description of problem]

ES is not working after upgrading from OCP 4.4 to 4.5. The EO operator continually logs the error:

~~~
{"level":"error","ts":1598251293.5835052,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile Elasticsearch deployment spec: Could not create node resource: Deployment.apps \"elasticsearch-cdm-6bm7w9o8-1\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string(nil): `selector` does not match template `labels`, spec.template.spec.containers: Required value]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
~~~

Also, in the CLO it is possible to see the error below before the previous error starts to appear in the EO operator:

~~~
nodeConditions:
  elasticsearch-cdm-6bm7w9o8-1:
    - lastTransitionTime: '2020-08-21T16:38:51Z'
      message: >-
        0/8 nodes are available: 2 Insufficient memory, 2 node(s) were
        unschedulable, 4 node(s) didn't match node selector.
      reason: Unschedulable
      status: 'True'
      type: Unschedulable
~~~

The nodes now have enough memory, but the EO does not try to create the deployments, so there are no replicas and no pods for elasticsearch-cdm-6bm7w9o8-1 and elasticsearch-cdm-6bm7w9o8-2.

[Version-Release number of selected component (if applicable)]

clusterlogging.4.5.0-202008100413.p0

[How reproducible]

The customer was upgrading from OCP 4.4 to 4.5 and, in the middle of it, there were not enough node resources to schedule the ES pods.

[Actual results]

After the nodes have enough resources, the ES should continue with the upgrade phase.
But it's not happening; the EO continues logging the error:

~~~
{"level":"error","ts":1598251293.5835052,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile Elasticsearch deployment spec: Could not create node resource: Deployment.apps \"elasticsearch-cdm-6bm7w9o8-1\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string(nil): `selector` does not match template `labels`, spec.template.spec.containers: Required value]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
~~~

And the deployments for elasticsearch-cdm-6bm7w9o8-1 and elasticsearch-cdm-6bm7w9o8-2 are not created.

Attaching must-gather and logging-dump.

Regards,
Oscar
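For context, the fields named in the reconcile error are the minimum Kubernetes requires of any Deployment, so the operator is effectively submitting an empty node spec. Below is a minimal sketch of what a valid elasticsearch-cdm Deployment would have to carry; the label key, container name and image are illustrative assumptions, not values taken from the cluster:

~~~
# Minimal sketch only; all values below are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: elasticsearch-cdm-6bm7w9o8-1
  namespace: openshift-logging
spec:
  selector:                             # spec.selector: Required value
    matchLabels:
      node-name: elasticsearch-cdm-6bm7w9o8-1
  template:
    metadata:
      labels:                           # must match spec.selector
        node-name: elasticsearch-cdm-6bm7w9o8-1
    spec:
      containers:                       # spec.template.spec.containers: Required value
      - name: elasticsearch
        image: <elasticsearch-image>    # placeholder
~~~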
Looking at the state provided, it looks like the EO needs a restart. The internal state is corrupt, but it is a temporary state anyway. Setting this to low priority and severity as it will not block 4.6.
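As a hedged example, restarting the operator here just means deleting its pod so the operator Deployment recreates it and the in-memory state is rebuilt; the namespace and label selector below are the usual ones for the EO on 4.5 and may differ on a given cluster:

~~~
# Illustrative only; adjust namespace/label selector to the actual EO install.
oc -n openshift-operators-redhat delete pod -l name=elasticsearch-operator
~~~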
@Jonas I am not sure if your customer is the same as Oscar's. But from what I can tell, the operator cannot proceed with your third ES pod because of:

~~~
- conditions:
  - lastTransitionTime: "2020-08-20T11:24:58Z"
    message: '0/13 nodes are available: 2 Insufficient memory, 2 node(s) had volume
      node affinity conflict, 3 node(s) were unschedulable, 6 node(s) didn''t match
      node selector.'
    reason: Unschedulable
    status: "True"
    type: Unschedulable
  deploymentName: elasticsearch-cdm-jw5kn67g-3
  upgradeStatus:
    upgradePhase: controllerUpdated
~~~

The node selector seems to be the issue here.

@Jonas + @Oscar Do you have Jaeger tracing on these clusters?
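To check whether the node selector really is the blocker, one approach (a sketch, assuming the usual CR name `instance` in `openshift-logging`; the field path may differ per release) is to compare the configured selector with the labels of the schedulable nodes:

~~~
# Illustrative only.
oc -n openshift-logging get clusterlogging instance -o jsonpath='{.spec.logStore.elasticsearch.nodeSelector}'
oc get nodes --show-labels
~~~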
The PR provided helps resolve the conditions without manual intervention in the status.nodes field. However, the case can still happen in the super unlikely event that the deployments get deleted but the status update never makes it to the API server, e.g. when the apiserver/etcd is degraded.

Deleting nodes from operators is a super scary thing to me, but the design of the operator carries a lot of tradeoffs back from the early 4.x times when Kubernetes was the young kid on the block. We need to document the workaround for the manual intervention until a newer design of the elasticsearch-operator is in place.

----

To reproduce this bug, follow these instructions:

1. Fresh install 4.5.z on a cluster.
2. Install clusterlogging with 3 nodes for ES - no need for selectors.
3. Wait until the setup is fully working.
4. Increase the ES resources in the CL CR to make the ES nodes go "Unschedulable".
5. Switch the CL + ES CRs to Unmanaged.
6. Delete one of the three ES deployments.
7. Switch the resources in the CL CR back to normal.
8. Reset Unmanaged to Managed and see what happens.

---

Manual intervention for the super rare case (a command sketch follows below):

- Set the ClusterLogging + Elasticsearch CRs to "Unmanaged"
- Drop the "Unschedulable" status for all non-existing nodes from "status.nodes" of the Elasticsearch CR
- Set the ClusterLogging + Elasticsearch CRs back to "Managed"
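For the manual intervention above, a hedged sketch of the corresponding commands, assuming the usual CR names `instance` and `elasticsearch`; how the status is edited may differ if the status subresource is enabled on the Elasticsearch CRD:

~~~
# Illustrative only.
# 1) Set both CRs to Unmanaged:
oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
oc -n openshift-logging patch elasticsearch elasticsearch --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
# 2) Remove the stale "Unschedulable" entries for the non-existing nodes from status.nodes,
#    e.g. via: oc -n openshift-logging edit elasticsearch elasticsearch
# 3) Set both CRs back to Managed:
oc -n openshift-logging patch clusterlogging instance --type merge -p '{"spec":{"managementState":"Managed"}}'
oc -n openshift-logging patch elasticsearch elasticsearch --type merge -p '{"spec":{"managementState":"Managed"}}'
~~~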
This bug is verified; the issue is fixed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4198