Bug 1840274 - During upgrade, if CLO upgrades before EO fluentd writes to *-write index instead of alias
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jeff Cantrill
QA Contact: Anping Li
URL:
Whiteboard:
Depends On:
Blocks: 1843271
Reported: 2020-05-26 16:40 UTC by ewolinet
Modified: 2020-10-27 16:01 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-10-27 16:01:23 UTC
Target Upstream Version:
Embargoed:



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-logging-operator pull 535 | 0 | None | closed | Bug 1840274: creating init container to block waiting for es version 6+ | 2020-10-07 01:38:06 UTC
Github | openshift origin-aggregated-logging pull 1897 | 0 | None | closed | Bug 1840274: Add script to let fluentd check ES version | 2020-10-07 01:37:57 UTC
Github | openshift origin-aggregated-logging pull 1928 | 0 | None | closed | Bug 1840274: Enable ruby from scl if exists | 2020-10-07 01:38:05 UTC
Red Hat Product Errata | RHBA-2020:4196 | 0 | None | None | None | 2020-10-27 16:01:27 UTC

Description ewolinet 2020-05-26 16:40:45 UTC
Description of problem:
During an upgrade from 4.4 to 4.5, if CLO upgrades before EO does, fluentd is updated to write to a '*-write' endpoint, which should be an alias. If EO has not yet created the alias, fluentd causes an index with that name to be created instead. When EO then upgrades, this can cause problems with index management or loss of the data that fluentd has already written.

Version-Release number of selected component (if applicable):
4.4 -> 4.5

How reproducible:
Always

Steps to Reproduce:
1. Upgrade CLO
2. Check ES indices

Actual results:
Fluentd incorrectly causes a '*-write' index to be created.

Expected results:
Fluentd should wait until the alias is in place and then proceed to push its logs
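
The expected behaviour can be sketched as a small retry loop. This is only an illustration, not the fix that was merged: `wait_until` and `alias_exists` are hypothetical helper names, `ES_URL` is an assumed environment variable, and the TLS and authentication options a real cluster needs are left out. It relies on the Elasticsearch alias API, where a HEAD request to `/_alias/<name>` answers 200 only when an alias of that name exists.

```shell
#!/bin/sh
# Hypothetical helper: run a check command up to MAX_TRIES times,
# sleeping DELAY seconds after each failed attempt.
# Usage: wait_until MAX_TRIES DELAY CMD [ARGS...]
wait_until() {
  max_tries=$1
  delay=$2
  shift 2
  try=1
  while [ "$try" -le "$max_tries" ]; do
    if "$@"; then
      return 0          # the check passed, stop waiting
    fi
    try=$((try + 1))
    sleep "$delay"
  done
  return 1              # gave up: the check never passed
}

# Hypothetical check: succeed only if NAME exists as an alias.
# ES_URL is an assumption; certificates and credentials are omitted.
alias_exists() {
  status=$(curl -s -o /dev/null -w '%{http_code}' -I "$ES_URL/_alias/$1")
  [ "$status" = "200" ]
}

# Example (not run here): wait up to 5 minutes for the app-write alias.
# wait_until 30 10 alias_exists app-write
```

With a gate like this in front of fluentd, no write is attempted until the alias exists, so no index can be created under the alias name.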

Additional info:

Comment 4 Anping Li 2020-06-23 03:51:39 UTC
1) The PATH for ruby and the library path for libruby.so.2.5 should be exported:
$ oc logs fluentd-k6t8c -c fluentd-init
./wait_for_es_version.sh: line 3: ruby: command not found

$ docker run -it --entrypoint /opt/rh/rh-ruby25/root/usr/bin/ruby ose-logging-fluentd:v4.5.0 --help
/opt/rh/rh-ruby25/root/usr/bin/ruby: error while loading shared libraries: libruby.so.2.5: cannot open shared object file: No such file or directory

2) wait_for_es_version.sh should not be executed when only fluentd is deployed.
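
One way to address item 1 is to source the Software Collections enable script when it is present, which normally sets PATH and LD_LIBRARY_PATH for the collection. This is a sketch only: `enable_scl_ruby` is a hypothetical function name, and the default path `/opt/rh/rh-ruby25/enable` is an assumption based on the usual SCL layout and the ruby path shown above, not something confirmed in this report.

```shell
#!/bin/sh
# Hypothetical helper: source the SCL enable script if it exists,
# otherwise do nothing so that images without SCL keep working.
# The default path is an assumption based on the usual SCL layout.
enable_scl_ruby() {
  scl_enable=${1:-/opt/rh/rh-ruby25/enable}
  if [ -f "$scl_enable" ]; then
    # Sourcing (not executing) so the exported variables
    # apply to the current shell and its children.
    . "$scl_enable"
  fi
}

# Example (not run here):
# enable_scl_ruby
# ruby --version
```

Making the call conditional keeps the same script usable on images where ruby is already on the default PATH.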

Comment 6 Anping Li 2020-06-24 01:49:52 UTC
Verified on the CI images

1) Upgrade CLO to 4.6. One fluentd pod is in Init:CrashLoopBackOff.
$ oc get pods
fluentd-2cxs9                                   1/1     Running                 0          7m42s
fluentd-2mkwn                                   1/1     Running                 0          7m42s
fluentd-c6vs2                                   1/1     Running                 0          7m42s
fluentd-fkdmv                                   1/1     Running                 0          7m42s
fluentd-qcvgv                                   1/1     Running                 0          7m42s
fluentd-rtcsn                                   1/1     Running                 0          7m42s
fluentd-vn8fd                                   0/1     Init:CrashLoopBackOff   5          4m33s
$ oc logs fluentd-vn8fd -c fluentd-init
Elasticsearch is currently version: 5.6.16 - Expecting it to be at least: 6
2) Upgrade EO to 4.6. The ES pods are not Ready during the upgrade; no data is received and no -write index is created.

3) After the EO upgrade, the infra-000001 and app-000001 indices are created in the ES cluster and fluentd starts to upgrade. The docs.count increases in both the old indices (.operations.xxx and project.xxx) and the new indices (infra-000001 and app-000001).
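
The version gate seen in step 1 can be illustrated with a short shell sketch. It is not the merged wait_for_es_version.sh (which calls ruby); `es_major` and `check_es_version` are hypothetical names, and only the message format is taken from the init container log above. A non-zero exit status makes the init container fail and be retried, which is consistent with the Init:CrashLoopBackOff state while ES is still at 5.6.16.

```shell
#!/bin/sh
# Hypothetical helper: print the major component of a dotted version.
es_major() {
  printf '%s\n' "${1%%.*}"
}

# Hypothetical helper: succeed if CURRENT has at least major version MINIMUM,
# otherwise print a message in the format of the log above and fail.
# Usage: check_es_version CURRENT MINIMUM
check_es_version() {
  current=$1
  minimum=$2
  if [ "$(es_major "$current")" -ge "$minimum" ]; then
    return 0
  fi
  echo "Elasticsearch is currently version: $current - Expecting it to be at least: $minimum"
  return 1
}

# Example (not run here); how the running version is obtained is left out:
# check_es_version "$ES_VERSION" 6 || exit 1
```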

Comment 8 errata-xmlrpc 2020-10-27 16:01:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

