Bug 1888943 - ClusterLogForwarder with Output.Secret causes fluentd crash loop
Summary: ClusterLogForwarder with Output.Secret causes fluentd crash loop
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.6
Hardware: All
OS: All
urgent
urgent
Target Milestone: ---
: 4.7.0
Assignee: Alan Conway
QA Contact: Qiaoling Tang
Rolfe Dlugy-Hegwer
URL:
Whiteboard: logging-core
Depends On:
Blocks: 1929420
TreeView+ depends on / blocked
 
Reported: 2020-10-16 13:33 UTC by Alan Conway
Modified: 2023-12-15 19:48 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, the fluentd collector pod went into a crash loop when the ClusterLogForwarder had an incorrectly-configured secret. The current release fixes this issue. Now, the ClusterLogForwarder validates the secrets and reports any errors in its status field. As a result, it does not cause the Fluentd collector pod to crash. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1888943[*1888943*])
Clone Of:
: 1929420 (view as bug list)
Environment:
Last Closed: 2021-02-24 11:21:19 UTC
Target Upstream Version:
Embargoed:
aconway: needinfo+


Attachments (Terms of Use)
YAML file to reproduce the problem. (13.88 KB, text/plain)
2020-10-16 13:33 UTC, Alan Conway
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-logging-operator pull 823 0 None closed Bug 1888943: ClusterLogForwarder with Output.Secret causes fluentd crash loop 2021-02-16 20:30:48 UTC
Github openshift cluster-logging-operator pull 872 0 None closed Bug 1888943: update spec doc for url and secret fields. 2021-02-16 20:30:48 UTC
Red Hat Bugzilla 1834414 0 low CLOSED Allow enable mutual TLS in cluster logging Log Forwarding feature 2023-12-15 17:52:21 UTC
Red Hat Product Errata RHBA-2021:0652 0 None None None 2021-02-24 11:22:11 UTC

Internal Links: 1834414

Description Alan Conway 2020-10-16 13:33:36 UTC
Created attachment 1722069 [details]
YAML file to reproduce the problem.

Description of problem:

Creating a CLF instance with an Output.Secret causes the CLO fluentd to crash-loop because of a missing file mount.

Version-Release number of selected component (if applicable):

Current master

How reproducible: 100%


Steps to Reproduce:
1. Deploy the CLO
2. oc create -f clf-secret.yaml # attached
3. oc get pod -l component=fluentd
4. oc logs $(oc get pod -l component=fluentd -o name)

Actual results:

fluentd pod in CrashLoopBackoff state logs contains error like this: 

    /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/config/literal_parser.rb:171:in `instance_eval': No such file or directory @ rb_sysopen - /var/run/ocp-collector/secrets/test-secret/shared_key (Errno::ENOENT)


Expected results:

No errors, fluentd running.

Additional info:

Missing file is supposed to be mounted from a secret, but for some reason this is not happening.

Question: do we even need the shared_key anymore? fluentd 1.0 can do TLS without shared keys, and the shared key adds no extra security since its value is fixed and plainly visible in the source code.

Comment 1 Alan Conway 2020-10-16 14:02:25 UTC
This is a case of bad error handling. The secret is expected to have a shared_key entry, but this is not checked. If the entry is missing, we crash trying to open a nil file name.

Proposed fix: verify all the secret keys before generating fluent configuration, take appropriate action or set a degraded status.

Comment 2 Alan Conway 2020-10-16 20:56:01 UTC
Another problem: If more than one ClusterLogForwarder output has a Secret, the CLO fails to reconcile with 

{"level":"error","ts":"2020-10-16T20:50:01Z","logger":"cluster-logging-operator","msg":"clusterlogforwarder-controller returning, error","cause":{"msg":"Unable to reconcile collection for \"instance\": Failure creating Fluentd Daemonset DaemonSet.apps \"fluentd\" is invalid: spec.template.spec.containers[0].volumeMounts[15].mountPath: Invalid value: \"/var/run/ocp-collector/secrets/receiver\": must be unique"}}

Comment 3 Jeff Cantrill 2020-10-19 14:18:35 UTC
(In reply to Alan Conway from comment #2)
> Another problem: If more than one ClusterLogForwarder output has a Secret,
> the CLO fails to reconcile with 
> 
> {"level":"error","ts":"2020-10-16T20:50:01Z","logger":"cluster-logging-
> operator","msg":"clusterlogforwarder-controller returning,
> error","cause":{"msg":"Unable to reconcile collection for \"instance\":
> Failure creating Fluentd Daemonset DaemonSet.apps \"fluentd\" is invalid:
> spec.template.spec.containers[0].volumeMounts[15].mountPath: Invalid value:
> \"/var/run/ocp-collector/secrets/receiver\": must be unique"}}

More then one secret with the same name?  Secrets should be treated as a SET and if they are not then this is a bug

Comment 4 Ahmed Anwar 2020-10-21 11:47:33 UTC
Ran into this issue, and found a work around.

Basically if you have multiple outputs using the same secret, the CLO won't update the ds properly.

Reported in this bug https://bugzilla.redhat.com/show_bug.cgi?id=1890072

Comment 5 Jeff Cantrill 2020-10-23 15:19:58 UTC
Setting UpcomingSprint as unable to resolve before EOD

Comment 6 Alan Conway 2020-11-19 19:12:49 UTC
Fix in progress, overhaul of secret validation & security config. Will also fix https://bugzilla.redhat.com/show_bug.cgi?id=1834414

Comment 7 Alan Conway 2020-12-02 19:48:21 UTC
Fix PR https://github.com/openshift/cluster-logging-operator/pull/823 awaiting merge.

Comment 9 Qiaoling Tang 2021-02-02 08:00:14 UTC
Verified with cluster-logging.5.0.0-32

When the secret doesn't have `shared_key`, the fluentd pods can start and can forward logs to the fluentd server.

Comment 10 Rolfe Dlugy-Hegwer 2021-02-02 21:46:12 UTC
Moved the contents of the Doc Text field here for future reference:

Feature: ClusterLogForwarder TLS configuration 

Reason: Improved control of TLS, need to clarify the text

Result: 

This is the spec documentation comment, update the appropriate user doc to be consistent.

	// URL to send log records to.
	//
	// An absolute URL, with a scheme. Valid schemes depend on `type`.
	// Special schemes `tcp`, `tls`, `udp` and `udps` are used for types that
	// have no scheme of their own. For example, to send syslog records using secure UDP:
	//
	//     { type: syslog, url: udps://syslog.example.com:1234 }
	//
	// Basic TLS is enabled if the URL scheme requires it (for example 'https' or 'tls').
	// The 'username@password' part of `url` is ignored.
	// Any additional authentication material is in the `secret`.
	// See the `secret` field for more details.
	//
	// +kubebuilder:validation:Pattern:=`^$|[a-zA-z]+:\/\/.*`
	// +optional
	URL string `json:"url,omitempty"`

	OutputTypeSpec `json:",inline"`

	// Secret for authentication.
	// Name of a secret in the same namespace as the cluster logging operator.
	//
	// For client authentication, set secret keys `tls.crt` and `tls.key` to the client certificate and private key.
	//
	// To use your own certificate authority, set secret key `ca-bundle.crt`.
	//
	// Depending on the `type` there may be other secret keys that have meaning.
	//
	// +optional
	Secret *OutputSecretSpec `json:"secret,omitempty"`

Comment 11 Rolfe Dlugy-Hegwer 2021-02-02 21:59:33 UTC
Please describe this enhancement for our release notes using the Doc Text field.

For examples of release notes for Logging-related enhancents, please [see the current release notes][1]

(The Doc Text field is poorly named. It doesn't create a request to update the product documentation. Instead, please [create a Jira task][2] with:
- Project = Docs for Red Hat Developers (RHDEVDOCS) 
- Component = Logging)


[1]https://docs.openshift.com/container-platform/4.6/release_notes/ocp-4-6-release-notes.html#ocp-4-6-logging

[2]https://issues.redhat.com/secure/CreateIssue!default.jspa

Comment 12 Jeff Cantrill 2021-02-03 18:51:59 UTC
Moving to POST to satisfy the merge bot

Comment 13 Jeff Cantrill 2021-02-03 19:28:22 UTC
Moving back to verified per #c9

Comment 14 Alan Conway 2021-02-03 23:03:10 UTC
Updated doc text, created a documentation JIRA: https://issues.redhat.com/browse/LOG-1064

Comment 17 Alan Conway 2021-02-09 19:47:45 UTC
Doc text is good.

Comment 20 errata-xmlrpc 2021-02-24 11:21:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652


Note You need to log in before you can comment on or make changes to this bug.