Bug 1873652

Summary: Upgrade from 4.4 to 4.5: Failed to reconcile Elasticsearch deployment spec: Could not create node resource: Deployment.apps is invalid
Product: OpenShift Container Platform Reporter: Oscar Casal Sanchez <ocasalsa>
Component: Logging    Assignee: Periklis Tsirakidis <periklis>
Status: CLOSED ERRATA QA Contact: Giriyamma <gkarager>
Severity: medium Docs Contact:
Priority: high    
Version: 4.5    CC: aivaraslaimikis, anli, aos-bugs, jnordell, ngirard, periklis
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: logging-exploration
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Cause: Cluster or Operator upgrades may restart or crash cluster nodes while the Elasticsearch CR status field is not yet up to date with all available information. Consequence: The Elasticsearch operator fails to reconcile the Elasticsearch deployment spec, repeatedly logs "Deployment.apps ... is invalid", and never recreates the deleted Elasticsearch deployments. Fix: The fix eagerly prunes deleted Elasticsearch nodes from the status field, taking the crash/restart scenario into account while reconciling the Elasticsearch CR status. Result: The pruning keeps the set of Elasticsearch nodes in the status field aligned with the nodes that actually exist, even when the status is stale after a cluster upgrade.
Story Points: ---
Clone Of:
: 1878032 (view as bug list) Environment:
Last Closed: 2020-10-27 15:10:28 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1878032    

Description Oscar Casal Sanchez 2020-08-28 22:02:16 UTC
[Description of problem]

ES is not working after upgrading from OCP 4.4 to 4.5; the Elasticsearch operator (EO) continually logs the error:
~~~
{"level":"error","ts":1598251293.5835052,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile Elasticsearch deployment spec: Could not create node resource: Deployment.apps \"elasticsearch-cdm-6bm7w9o8-1\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string(nil): `selector` does not match template `labels`, spec.template.spec.containers: Required value]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
~~~
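
For anyone scanning these logs, a small sketch that pulls just the error strings out of the EO's structured JSON log; the operator's namespace and deployment name are assumptions for a default install and may differ on your cluster:

~~~
# Extract only the error messages from the operator's JSON log lines
# (namespace/deployment name assumed, adjust for your install).
oc logs deployment/elasticsearch-operator -n openshift-operators-redhat \
  | jq -rR 'fromjson? | select(.level == "error") | .error'
~~~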

Also, in the CLO it is possible to see the condition below, which appeared before the previous error started showing up in the EO:
~~~
       nodeConditions:
          elasticsearch-cdm-6bm7w9o8-1:
            - lastTransitionTime: '2020-08-21T16:38:51Z'
              message: >-
                0/8 nodes are available: 2 Insufficient memory, 2 node(s) were
                unschedulable, 4 node(s) didn't match node selector.
              reason: Unschedulable
              status: 'True'
              type: Unschedulable
~~~

Now the nodes have enough memory, but the EO does not try to create the deployments, so there is no replica and no pod for elasticsearch-cdm-6bm7w9o8-1 and elasticsearch-cdm-6bm7w9o8-2.
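
A quick way to confirm the symptom, assuming the default openshift-logging namespace and the CR name shown in the log above:

~~~
# The deployments for the affected cdm nodes should be missing from this list.
oc get deployments -n openshift-logging

# Compare with what the operator still tracks in its status; the stale entries live here.
oc get elasticsearch elasticsearch -n openshift-logging \
  -o jsonpath='{.status.nodes[*].deploymentName}'
~~~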


[Version-Release number of selected component (if applicable)]
clusterlogging.4.5.0-202008100413.p0

[How reproducible]
The customer was upgrading from OCP 4.4 to 4.5 and, in the middle of the upgrade, there were not enough nodes available to schedule the ES pods.


[Actual results]
After the nodes have enough resources again, ES should continue with the upgrade phase. But that is not happening; the EO keeps giving the error:

~~~
{"level":"error","ts":1598251293.5835052,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile Elasticsearch deployment spec: Could not create node resource: Deployment.apps \"elasticsearch-cdm-6bm7w9o8-1\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string(nil): `selector` does not match template `labels`, spec.template.spec.containers: Required value]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
~~~

And the deployments for elasticsearch-cdm-6bm7w9o8-1 and elasticsearch-cdm-6bm7w9o8-2 are not created.

Attaching must-gather and logging-dump

Regards,
Oscar

Comment 3 Periklis Tsirakidis 2020-09-01 14:30:38 UTC
Looking at the state provided, it looks like the EO needs a restart. The internal state is corrupt, but it is a temporary state anyway. Setting this to low priority and severity as it will not block 4.6.
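
If a restart is wanted without waiting, a sketch; the namespace and pod label are assumptions for a default install:

~~~
# Deleting the operator pod forces a restart and a rebuild of its in-memory state
# (namespace and label assumed, adjust for your install).
oc delete pod -n openshift-operators-redhat -l name=elasticsearch-operator
~~~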

Comment 6 Periklis Tsirakidis 2020-09-02 07:19:18 UTC
@Jonas

I am not sure if your customer is the same as Oscar's. But from what I can tell, the operator cannot proceed with your third ES pod because of:

  - conditions:
    - lastTransitionTime: "2020-08-20T11:24:58Z"
      message: '0/13 nodes are available: 2 Insufficient memory, 2 node(s) had volume node affinity conflict, 3 node(s) were unschedulable, 6 node(s) didn''t match node selector.'
      reason: Unschedulable
      status: "True"
      type: Unschedulable
    deploymentName: elasticsearch-cdm-jw5kn67g-3
    upgradeStatus:
      upgradePhase: controllerUpdated

The node selector seems to be the issue here.
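
To check whether the node selector is really the blocker, a rough sketch; the infra label below is only an example selector, read the real one from the CR first:

~~~
# Show the node selector the ES nodes are generated with.
oc get elasticsearch elasticsearch -n openshift-logging -o yaml | grep -A3 nodeSelector

# List the nodes that actually carry that label (example label shown) and their headroom.
oc get nodes -l node-role.kubernetes.io/infra=
oc describe node <matching-node> | grep -A8 'Allocated resources'
~~~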


@Jonas + @Oscar

Do you have Jaeger tracing on these clusters?

Comment 10 Periklis Tsirakidis 2020-09-09 11:42:50 UTC
The PR provided helps resolve the condition without manual intervention in the status.nodes field. However, the case can still happen in the very unlikely event that the deployments get deleted but the status update never makes it to the API server, e.g. when the apiserver/etcd is degraded.

Deleting nodes from an operator is a super scary thing to me, but the operator's design carries a lot of tradeoffs from the early 4.x days when Kubernetes was the new kid on the block.

We need to document the workaround for the manual intervention until a newer design of the elasticsearch-operator is in place.

----

To reproduce this bug, follow these instructions (a command-level sketch follows the list):

1. Fresh install 4.5.z on a cluster
2. Install clusterlogging with 3 nodes for ES - no need for selectors
3. Wait until the setup is fully workable.
4. Increase the ES resources in the CL CR so that the ES nodes go "Unschedulable"
5. Switch CL+ES CR to Unmanaged
6. Delete one of the three ES deployments
7. Switch Resources in CL CR back to normal
8. Reset Unmanaged to Managed and see what happens.
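
A command-level sketch of the steps above, assuming the default CR names (clusterlogging/instance and elasticsearch/elasticsearch in openshift-logging); the memory value in step 4 is only an example chosen to exceed what the nodes can schedule:

~~~
# 4. Bump the ES memory request in the CL CR beyond node capacity so the ES nodes go Unschedulable.
oc patch clusterlogging instance -n openshift-logging --type merge \
  -p '{"spec":{"logStore":{"elasticsearch":{"resources":{"requests":{"memory":"128Gi"}}}}}}'

# 5. Switch the CL and ES CRs to Unmanaged.
oc patch clusterlogging instance -n openshift-logging --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
oc patch elasticsearch elasticsearch -n openshift-logging --type merge -p '{"spec":{"managementState":"Unmanaged"}}'

# 6. Delete one of the three ES deployments (the name differs per cluster).
oc delete deployment <one-elasticsearch-cdm-deployment> -n openshift-logging

# 7. Set the resources in the CL CR back to a schedulable value, then
# 8. flip the CRs back to Managed and watch the EO logs.
oc patch elasticsearch elasticsearch -n openshift-logging --type merge -p '{"spec":{"managementState":"Managed"}}'
oc patch clusterlogging instance -n openshift-logging --type merge -p '{"spec":{"managementState":"Managed"}}'
~~~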


---

Manual intervention for the super rare case (a sketch with oc commands follows the list):
- Set ClusterLogging + Elasticsearch CRs into "Unmanaged"
- Drop the "Unschedulable" entries for all nodes that no longer exist from "status.nodes" of the Elasticsearch CR
- Set ClusterLogging + Elasticsearch CRs into "Managed"
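
A sketch of that intervention with oc, assuming the default CR names; depending on how the CRD handles the status subresource, the status edit may need to go through the API directly rather than a plain edit:

~~~
# 1. Take both CRs out of the operators' hands.
oc patch clusterlogging instance -n openshift-logging --type merge -p '{"spec":{"managementState":"Unmanaged"}}'
oc patch elasticsearch elasticsearch -n openshift-logging --type merge -p '{"spec":{"managementState":"Unmanaged"}}'

# 2. Remove the entries for nodes whose deployments no longer exist from
#    status.nodes in the Elasticsearch CR.
oc edit elasticsearch elasticsearch -n openshift-logging

# 3. Hand control back to the operators.
oc patch elasticsearch elasticsearch -n openshift-logging --type merge -p '{"spec":{"managementState":"Managed"}}'
oc patch clusterlogging instance -n openshift-logging --type merge -p '{"spec":{"managementState":"Managed"}}'
~~~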

Comment 13 Giriyamma 2020-09-16 12:09:58 UTC
This bug is verified; the issue is fixed.

Comment 16 errata-xmlrpc 2020-10-27 15:10:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198