Bug 1873652
| Summary: | Upgrade from 4.4 to 4.5: Failed to reconcile Elasticsearch deployment spec: Could not create node resource: Deployment.apps is invalid | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Oscar Casal Sanchez <ocasalsa> |
| Component: | Logging | Assignee: | Periklis Tsirakidis <periklis> |
| Status: | CLOSED ERRATA | QA Contact: | Giriyamma <gkarager> |
| Severity: | medium | Docs Contact: | |
| Priority: | high | | |
| Version: | 4.5 | CC: | aivaraslaimikis, anli, aos-bugs, jnordell, ngirard, periklis |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | logging-exploration | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | | |
| | 1878032 (view as bug list) | Environment: | |
| Last Closed: | 2020-10-27 15:10:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1878032 | | |

Doc Text:

Cause: Cluster or operator upgrades may restart or crash cluster nodes while the Elasticsearch CR status field is not up to date with all available information, because of that restart or crash.

Consequence: Reconciling the Elasticsearch deployment spec fails with "Could not create node resource: Deployment.apps is invalid", because the status field still references nodes whose deployments no longer exist.

Fix: The fix takes the crash/restart scenario into account while reconciling the Elasticsearch CR status: it eagerly prunes any Elasticsearch nodes from the status field if their deployments were deleted.

Result: The pruning addresses the situation where the status field is not up to date after a cluster upgrade, and aligns the status field with the set of Elasticsearch nodes that actually exist.
Description
Oscar Casal Sanchez 2020-08-28 22:02:16 UTC
Looking at the state provided, it looks like the elasticsearch-operator (EO) needs a restart. The internal state is corrupt, but it is a temporary state anyway. Putting this at low priority and severity as it will not block 4.6.

@Jonas I am not sure if your customer is the same as Oscar's, but from what I can tell, the operator cannot proceed with your third ES pod because of:

    - conditions:
      - lastTransitionTime: "2020-08-20T11:24:58Z"
        message: '0/13 nodes are available: 2 Insufficient memory, 2 node(s) had volume node affinity conflict, 3 node(s) were unschedulable, 6 node(s) didn''t match node selector.'
        reason: Unschedulable
        status: "True"
        type: Unschedulable
      deploymentName: elasticsearch-cdm-jw5kn67g-3
      upgradeStatus:
        upgradePhase: controllerUpdated

The node selector seems to be an issue here.

@Jonas + @Oscar Do you have Jaeger tracing on these clusters?

The PR provided helps resolve the conditions without manual intervention in the status.nodes field. However, the case can still happen in the super unlikely event that the deployments get deleted but the status update never makes it to the API server, e.g. when the API server or etcd is degraded. Deleting nodes from an operator is a super scary thing to me, but the design of the operator carries a lot of trade-offs from the early 4.x days when Kubernetes was the new kid on the block. We need to document the workaround for the manual intervention until a newer design of the elasticsearch-operator is accomplished.

----

To reproduce this bug, follow these instructions:

1. Fresh install 4.5.z on a cluster.
2. Install cluster logging with 3 nodes for ES; no node selectors are needed.
3. Wait until the setup is fully working.
4. Change the ES resources in the ClusterLogging CR so that the ES nodes go "Unschedulable".
5. Switch the ClusterLogging and Elasticsearch CRs to Unmanaged.
6. Delete one of the three ES deployments.
7. Switch the resources in the ClusterLogging CR back to normal.
8. Reset Unmanaged to Managed and see what happens.

---

Manual intervention for the super rare case:

- Set the ClusterLogging and Elasticsearch CRs to "Unmanaged".
- Drop the "Unschedulable" status for all non-existing nodes from "status.nodes" of the Elasticsearch CR.
- Set the ClusterLogging and Elasticsearch CRs back to "Managed".

This bug is verified; the issue is fixed.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198
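For illustration, here is a minimal Go sketch of the eager-pruning idea described in the doc text and the comments above: status entries whose backing Deployment no longer exists are dropped before the status is reconciled. The type and function names (`NodeStatus`, `DeploymentName`, `pruneMissingNodes`) are hypothetical and chosen only for this sketch; the actual elasticsearch-operator uses its own API types and a Kubernetes client to look up the existing Deployments.

```go
package main

import "fmt"

// NodeStatus is a simplified stand-in for an entry in the Elasticsearch CR's
// status.nodes list; the real operator's type carries more fields.
type NodeStatus struct {
	DeploymentName string
	Conditions     []string
}

// pruneMissingNodes keeps only the status entries whose Deployment still
// exists, so stale conditions (e.g. Unschedulable) left behind by a deleted
// deployment cannot block reconciliation.
func pruneMissingNodes(statusNodes []NodeStatus, existingDeployments map[string]bool) []NodeStatus {
	pruned := make([]NodeStatus, 0, len(statusNodes))
	for _, node := range statusNodes {
		if existingDeployments[node.DeploymentName] {
			pruned = append(pruned, node)
		}
	}
	return pruned
}

func main() {
	// Scenario from the reproduction steps: the third deployment was deleted
	// while the CRs were Unmanaged, but its stale status entry remains.
	status := []NodeStatus{
		{DeploymentName: "elasticsearch-cdm-jw5kn67g-1"},
		{DeploymentName: "elasticsearch-cdm-jw5kn67g-2"},
		{DeploymentName: "elasticsearch-cdm-jw5kn67g-3", Conditions: []string{"Unschedulable"}},
	}
	existing := map[string]bool{
		"elasticsearch-cdm-jw5kn67g-1": true,
		"elasticsearch-cdm-jw5kn67g-2": true,
	}

	for _, node := range pruneMissingNodes(status, existing) {
		fmt.Println("keeping status entry for", node.DeploymentName)
	}
}
```

Under these assumptions, the pruned status no longer references the deleted deployment, which is what removes the need for the manual workaround in the common case.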