Bug 1906641
| Field | Value |
|---|---|
| Summary | Elasticsearch cluster can't be fully upgraded after upgrading logging from 4.5 to latest 4.6 |
| Product | OpenShift Container Platform |
| Component | Logging |
| Status | CLOSED ERRATA |
| Severity | urgent |
| Priority | urgent |
| Version | 4.6 |
| Target Milestone | --- |
| Target Release | 4.6.z |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | logging-core |
| Fixed In Version | |
| Reporter | Qiaoling Tang <qitang> |
| Assignee | Jeff Cantrill <jcantril> |
| QA Contact | Qiaoling Tang <qitang> |
| Docs Contact | |
| CC | achakrat, anli, aos-bugs, hkang, jcantril, mchebbi, periklis, wking |
| Keywords | Regression |
| Doc Type | Bug Fix |
| Story Points | --- |
| Clone Of | |
| Environment | |
| Last Closed | 2021-01-25 20:21:05 UTC |
| Type | --- |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | 1905910 |
| Bug Blocks | 1915840 |
| Attachments | |

Doc Text:

Cause: An earlier change to certificate generation caused the CLO to improperly regenerate certificates.

Consequence: The CLO could regenerate certificates while the EO was trying to restart the cluster, leaving the EO unable to communicate with the cluster and the individual nodes unable to cluster among themselves because of mismatched certs.

Fix: Properly store all certs in the master secret and properly extract them to the CLO's working directory.

Result: During reconciliation the CLO has all the certificates in its working directory and can properly evaluate whether they need to be regenerated. Since they should not have expired, the CLO will not regenerate them, which allows the EO to communicate with the ES cluster without certificate changes mid-upgrade.
Description
Qiaoling Tang
2020-12-11 02:33:57 UTC
Sorry: "big change" -> "big chance".

Created attachment 1746137 [details]
must-gather

Hit the same issue when upgrading logging from released 4.6 (4.6.0-202011221454.p0) to latest 4.7 (4.7.0-202101092121.p0); details are in the attachment.
We should be able to work around this issue by restarting the ES pods (`oc delete pods -l component=elasticsearch`) so the operator and ES cluster are utilizing the same certificates.

(In reply to Qiaoling Tang from comment #14)
> Tested in clusterlogging.4.6.0-202101162152.p0, here are the steps and
>
> 3. upgrade from 4.5 to 4.6: upgrade CLO and EO at the same time:
> need to do the workaround (oc delete pods -l component=elasticsearch) twice,
> then the upgrade can be successful
>
> 4. upgrade from 4.6 to 4.6, upgrade EO and CLO at the same time
>
> upgrade from released version (clusterlogging.4.6.0-202011221454.p0) to
> latest 4.6: need to do the workaround (oc delete pods -l
> component=elasticsearch), then the logging stack could be upgraded

These issues are likely the same, and we may need to get the change into 4.5. I'm betting it is the lower CLO, which does not have the fix, that is recreating the certs, causing issues when the EO tries to upgrade. The cases you identify as having the issue are all ones going from a non-fixed to a fixed version. The successful case is the one listed below, which goes from a fixed version to a fixed version.

> upgrade from non-released version to latest 4.6: succeeded
>
> @Jeff,
> my concern to verify this bz is: if customers set `installPlanApproval:
> Automatic` in their subscriptions when deploying logging 4.6, after we release
> the new logging 4.6, the CLO and EO will be upgraded at the same time, then the
> customers need to do the workaround. Will the customers accept the
> workaround?

I can't comment on whether customers will accept the workaround, though it would be desirable to resolve this for them.
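The restart workaround discussed above can be captured as a small helper. This is only a sketch: the `openshift-logging` namespace is the usual default for the logging stack but is an assumption here, and `LOGGING_NS`/`workaround_cmd` are names invented for this example, not part of `oc` or the operators.

```shell
# Sketch of the workaround from the comments above: restart the ES pods
# so the operator and the ES cluster pick up the same certificates.
# The namespace is assumed to be the default openshift-logging; override
# it with LOGGING_NS if your deployment differs.
NS="${LOGGING_NS:-openshift-logging}"

# workaround_cmd (a helper invented for this sketch) prints the restart
# command; pipe its output to `sh` to actually run it against a cluster.
workaround_cmd() {
  echo "oc delete pods -n $NS -l component=elasticsearch"
}

workaround_cmd
```

Per comment #14, a simultaneous 4.5 to 4.6 upgrade may need this restart twice before the upgrade completes.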
We likely need to get the change into 4.5 too, as I believe we backported earlier cert changes to 4.5. I believe the correct solution is to release https://github.com/openshift/cluster-logging-operator/pull/858 first, which will bring all the cert changes into 4.5 and keep the CLO from prematurely regenerating the certs.

Scenario 1: upgrade logging from 4.6 release to 4.6 latest -- upgrade CLO first. Result: Pass
Scenario 2: upgrade logging from 4.6 release to 4.6 latest -- upgrade EO first. Result: Pass
Scenario 3: upgrade logging from 4.6 release to 4.6 latest -- upgrade CLO and EO at the same time. Result: Fail

The pull/858 is for 4.5. I don't think that can fix Scenario 3.

(In reply to Anping Li from comment #18)
> Scenario 1: upgrade logging from 4.6 release to 4.6 latest -- upgrade CLO
> first. Result: Pass
> Scenario 2: upgrade logging from 4.6 release to 4.6 latest -- upgrade EO
> first. Result: Pass
> Scenario 3: upgrade logging from 4.6 release to 4.6 latest -- upgrade CLO
> and EO at the same time. Result: Fail
>
> The pull/858 is for 4.5. I don't think that can fix Scenario 3.

You are correct. It will only fix a 4.5 to 4.6-latest upgrade, assuming 4.5 has the cert change. For 4.6 to 4.6 latest it will not, and the only possibility is to upgrade only the EO and allow it to settle before the CLO, or to manually intervene to remove the pods.

Given #c18, where it looks like we have no good options, we may still hold this PR until 4.5 lands to resolve those cases, but it might benefit users on 4.6. I defer to QE here.

*** Bug 1916911 has been marked as a duplicate of this bug. ***

*** Bug 1918441 has been marked as a duplicate of this bug. ***

@Anping, moving this back to ON_QA as there are no good options to address your finding other than to document the workaround. The cert changes are partially in 4.6, and this change will correct them, but any earlier deployments will not have them.
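The failure mode described in this thread is a certificate mismatch: the CLO regenerates certs mid-upgrade, so the stored certs no longer match what the running ES pods loaded. A generic way to confirm such a mismatch is to compare fingerprints. The sketch below only assumes `openssl` is available; the `fingerprint` and `same_cert` helpers are invented for this example, and the input files are whatever PEM certs you extract (for instance, one from the CLO's secret and one copied out of an ES pod) — nothing here is part of `oc` or the operators.

```shell
# Compare two PEM certificates by SHA-256 fingerprint -- a generic check
# for the "mismatched certs" failure mode described in this bug. Helper
# names are invented for this sketch.
fingerprint() {
  openssl x509 -noout -fingerprint -sha256 -in "$1"
}

same_cert() {
  # Succeeds only when both certs have identical fingerprints.
  [ "$(fingerprint "$1")" = "$(fingerprint "$2")" ]
}
```

Usage: `same_cert secret-es.crt pod-es.crt || echo "mismatch: restart the ES pods"`.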
There is a Jira issue filed to guard against this, but it would need to be backported and would otherwise hold up 4.6 changes in general. I'm not certain how best to convey that upgrading should be:

1. Upgrade the CLO followed by the EO (or vice versa)
2. If there is an issue, run `oc delete pod -l component=elasticsearch` to restart the ES pods and resolve it

@jeff, how can the customer/support team know the workaround before they hit this issue?

Moved to verified; the doc includes it in the z-stream update for 4.6.13.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.13 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0173
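For reference, the sequenced upgrade recommended above (upgrade one operator, let it settle, then the other, with the pod restart as a fallback) can be sketched as a dry-run script. Everything here is an illustration under stated assumptions: `install-clo-xxxx` and `install-eo-xxxx` are placeholder InstallPlan names (look up the real ones with `oc get installplans -n openshift-logging`, which exist only when `installPlanApproval: Manual` is set), and the `run`/`DRY_RUN` device is invented for this sketch.

```shell
# Sequenced-upgrade sketch for the guidance above. With DRY_RUN=1 (the
# default here) commands are printed rather than executed; set DRY_RUN=0
# to run them against a real cluster. InstallPlan names are placeholders.
NS="${LOGGING_NS:-openshift-logging}"

run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# 1. Approve one operator's upgrade first (CLO here) and let it finish.
run oc -n "$NS" patch installplan install-clo-xxxx --type merge -p '{"spec":{"approved":true}}'

# 2. Only then approve the other (EO), giving the ES cluster time to settle.
run oc -n "$NS" patch installplan install-eo-xxxx --type merge -p '{"spec":{"approved":true}}'
run oc -n "$NS" wait --for=condition=Ready pod -l component=elasticsearch --timeout=600s

# 3. If the ES cluster still cannot form, fall back to the workaround.
run oc -n "$NS" delete pods -l component=elasticsearch
```

The point of the ordering is simply to avoid the failing Scenario 3 from comment #18, where the CLO and EO upgrade at the same time.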