Bug 1873652 - Upgrade from 4.4 to 4.5: Failed to reconcile Elasticsearch deployment spec: Could not create node resource: Deployment.apps is invalid
Summary: Upgrade from 4.4 to 4.5: Failed to reconcile Elasticsearch deployment spec: Could not create node resource: Deployment.apps is invalid
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Periklis Tsirakidis
QA Contact: Giriyamma
URL:
Whiteboard: logging-exploration
Depends On:
Blocks: 1878032
 
Reported: 2020-08-28 22:02 UTC by Oscar Casal Sanchez
Modified: 2024-03-25 16:23 UTC
CC: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Cluster or Operator upgrades may restart or crash cluster nodes while the Elasticsearch CR status field is not yet up to date with all available information. Consequence: The operator keeps reconciling against stale entries in the status field for Elasticsearch nodes that no longer exist and cannot recreate their deployments. Fix: The fix takes the crash/restart scenario into account while reconciling the Elasticsearch CR status and eagerly prunes from the status field any Elasticsearch nodes whose deployments were deleted. Result: The status field stays aligned with the set of Elasticsearch nodes that actually exist, even when a cluster upgrade left it out of date.
Clone Of:
Cloned to: 1878032
Environment:
Last Closed: 2020-10-27 15:10:28 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift elasticsearch-operator pull 481 0 None closed Bug 1873652: Update cluster status after populating nodes 2021-02-18 13:23:35 UTC
Red Hat Knowledge Base (Solution) 5384191 0 None None None 2020-09-09 10:40:19 UTC
Red Hat Product Errata RHBA-2020:4198 0 None None None 2020-10-27 15:12:43 UTC

Description Oscar Casal Sanchez 2020-08-28 22:02:16 UTC
[Description of problem]

ES is not working after upgrading from OCP 4.4 to 4.5; the Elasticsearch Operator (EO) continually logs the error:
~~~
{"level":"error","ts":1598251293.5835052,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile Elasticsearch deployment spec: Could not create node resource: Deployment.apps \"elasticsearch-cdm-6bm7w9o8-1\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string(nil): `selector` does not match template `labels`, spec.template.spec.containers: Required value]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
~~~
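
For reference, the three validation errors correspond to fields every Deployment must carry: spec.selector, template labels that match that selector, and at least one container. A hypothetical minimal example for comparison (names are illustrative, not what the operator generates):
~~~
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example
spec:
  selector:                  # "spec.selector: Required value"
    matchLabels:
      app: example
  template:
    metadata:
      labels:
        app: example         # must match spec.selector ("`selector` does not match template `labels`")
    spec:
      containers:            # "spec.template.spec.containers: Required value"
        - name: example
          image: registry.example.com/example:latest
~~~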

Also, in the CLO it is possible to see the condition below before the previous error starts appearing in the EO:
~~~
       nodeConditions:
          elasticsearch-cdm-6bm7w9o8-1:
            - lastTransitionTime: '2020-08-21T16:38:51Z'
              message: >-
                0/8 nodes are available: 2 Insufficient memory, 2 node(s) were
                unschedulable, 4 node(s) didn't match node selector.
              reason: Unschedulable
              status: 'True'
              type: Unschedulable
~~~

But now the nodes have enough memory, yet the EO does not try to create the deployments, so there are no replicas and no pods for elasticsearch-cdm-6bm7w9o8-1 and elasticsearch-cdm-6bm7w9o8-2.


[Version-Release number of selected component (if applicable)]
clusterlogging.4.5.0-202008100413.p0

[How reproducible]
The customer was upgrading from OCP 4.4 to 4.5 and, in the middle of the upgrade, there were not enough nodes available to schedule the ES pods.


[Actual results]
After the nodes have enough resources again, ES should continue with the upgrade. This is not happening; the EO keeps logging the error:

~~~
{"level":"error","ts":1598251293.5835052,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile Elasticsearch deployment spec: Could not create node resource: Deployment.apps \"elasticsearch-cdm-6bm7w9o8-1\" is invalid: [spec.selector: Required value, spec.template.metadata.labels: Invalid value: map[string]string(nil): `selector` does not match template `labels`, spec.template.spec.containers: Required value]","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
~~~

And the deployments for elasticsearch-cdm-6bm7w9o8-1 and elasticsearch-cdm-6bm7w9o8-2 are not created.

Attaching must-gather and logging-dump

Regards,
Oscar

Comment 3 Periklis Tsirakidis 2020-09-01 14:30:38 UTC
Looking at the state provided, it looks like the EO needs a restart. The internal state is corrupt, but it is only a temporary state anyway. Putting this to low priority and severity as it will not block 4.6.

Comment 6 Periklis Tsirakidis 2020-09-02 07:19:18 UTC
@Jonas

I am not sure if your customer is the same as Oscar's. But from what I can tell, the operator cannot proceed with your third ES pod because of:

  - conditions:
    - lastTransitionTime: "2020-08-20T11:24:58Z"
      message: '0/13 nodes are available: 2 Insufficient memory, 2 node(s) had volume node affinity conflict, 3 node(s) were unschedulable, 6 node(s) didn''t match node selector.'
      reason: Unschedulable
      status: "True"
      type: Unschedulable
    deploymentName: elasticsearch-cdm-jw5kn67g-3
    upgradeStatus:
      upgradePhase: controllerUpdated

The node selector seems to be an issue here.
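
For anyone checking this on their side, a hedged sketch of where the ES node selector lives in the ClusterLogging CR (field paths assumed from the standard CR layout; values are illustrative):
~~~
apiVersion: logging.openshift.io/v1
kind: ClusterLogging
metadata:
  name: instance
  namespace: openshift-logging
spec:
  logStore:
    type: elasticsearch
    elasticsearch:
      nodeCount: 3
      nodeSelector:                          # assumed field; ES pods stay Pending if no node matches
        node-role.kubernetes.io/worker: ""
~~~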


@Jonas + @Oscar

Do you have Jaeger tracing on these clusters?

Comment 10 Periklis Tsirakidis 2020-09-09 11:42:50 UTC
The PR provided helps resolve the conditions without manual intervention in the status.nodes field. However, the case can still happen in the very unlikely event that the deployments get deleted but the status update never makes it to the API server, e.g. when the apiserver/etcd is degraded.
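
To illustrate that corner case, the leftover entry in the Elasticsearch CR would look roughly like the fragment below (structure taken from the status shown in comment 6; the deployment it names no longer exists):
~~~
status:
  nodes:
    - deploymentName: elasticsearch-cdm-jw5kn67g-3   # deployment already deleted, entry never pruned
      upgradeStatus:
        upgradePhase: controllerUpdated
      conditions:
        - type: Unschedulable
          status: "True"
          reason: Unschedulable
~~~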

Deleting nodes from operators is a super scary thing to me, but the design of the operator carries a lot of tradeoffs from the early 4.x days, when Kubernetes was the new kid on the block.

We need to document the workaround for the manual intervention until a newer design of the elasticsearch-operator is in place.

----

To reproduce this bug, follow these instructions:

1. Fresh install 4.5.z on a cluster.
2. Install cluster logging with 3 nodes for ES - no need for selectors.
3. Wait until the setup is fully working.
4. Raise the ES resources in the CL CR so that the ES nodes go "Unschedulable" (see the sketch after this list).
5. Switch the CL + ES CRs to Unmanaged.
6. Delete one of the three ES deployments.
7. Set the resources in the CL CR back to normal.
8. Reset Unmanaged to Managed and see what happens.
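
A hedged sketch of steps 4 and 5 (field paths assumed from the default ClusterLogging/Elasticsearch CRs; the memory value just needs to exceed what any node can offer):
~~~
# Step 4: in the ClusterLogging CR, request more memory than any node has
spec:
  logStore:
    elasticsearch:
      resources:
        requests:
          memory: 512Gi        # deliberately unsatisfiable

# Step 5: in both the ClusterLogging and Elasticsearch CRs
spec:
  managementState: Unmanaged
~~~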


---

Manual intervention for the super rare case (a sketch follows this list):
- Set the ClusterLogging + Elasticsearch CRs to "Unmanaged"
- Drop the "Unschedulable" entries for all non-existent nodes from "status.nodes" of the Elasticsearch CR
- Set the ClusterLogging + Elasticsearch CRs back to "Managed"
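
A hedged sketch of those three steps, assuming the default resource names (`instance` for ClusterLogging, `elasticsearch` for Elasticsearch) in the `openshift-logging` namespace:
~~~
# 1. In both the ClusterLogging and Elasticsearch CRs:
spec:
  managementState: Unmanaged

# 2. In the Elasticsearch CR, remove the whole status.nodes entry for every
#    deployment that no longer exists, e.g.:
status:
  nodes:
    - deploymentName: elasticsearch-cdm-6bm7w9o8-1   # deployment gone -> delete this entry
      conditions:
        - type: Unschedulable
          status: "True"
#    (if the CRD's status subresource rejects a plain edit, patch the status subresource directly)

# 3. Finally, set both CRs back to:
spec:
  managementState: Managed
~~~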

Comment 13 Giriyamma 2020-09-16 12:09:58 UTC
This bug is verified; the issue is fixed.

Comment 16 errata-xmlrpc 2020-10-27 15:10:28 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198

