1888943 – ClusterLogForwarder with Output.Secret causes fluentd crash loop

Bug 1888943 - ClusterLogForwarder with Output.Secret causes fluentd crash loop

Summary: ClusterLogForwarder with Output.Secret causes fluentd crash loop

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	4.6
Hardware:	All
OS:	All
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.7.0
Assignee:	Alan Conway
QA Contact:	Qiaoling Tang
Docs Contact:	Rolfe Dlugy-Hegwer
URL:
Whiteboard:	logging-core
Depends On:
Blocks:	1929420
TreeView+	depends on / blocked

Reported:	2020-10-16 13:33 UTC by Alan Conway
Modified:	2023-12-15 19:48 UTC (History)
CC List:	8 users (show)
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:	* Previously, the fluentd collector pod went into a crash loop when the ClusterLogForwarder had an incorrectly-configured secret. The current release fixes this issue. Now, the ClusterLogForwarder validates the secrets and reports any errors in its status field. As a result, it does not cause the Fluentd collector pod to crash. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1888943[1888943])
Clone Of:
Clones:	1929420 (view as bug list)
Environment:
Last Closed:	2021-02-24 11:21:19 UTC
Target Upstream Version:
Embargoed:
Flags:	aconway: needinfo+

Attachments	(Terms of Use)
YAML file to reproduce the problem. (13.88 KB, text/plain) 2020-10-16 13:33 UTC, Alan Conway	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-logging-operator pull 823	None	closed	Bug 1888943: ClusterLogForwarder with Output.Secret causes fluentd crash loop	2021-02-16 20:30:48 UTC
Github	openshift cluster-logging-operator pull 872	None	closed	Bug 1888943: update spec doc for url and secret fields.	2021-02-16 20:30:48 UTC
Red Hat Bugzilla	1834414	low	CLOSED	Allow enable mutual TLS in cluster logging Log Forwarding feature	2023-12-15 17:52:21 UTC
Red Hat Product Errata	RHBA-2021:0652	None	None	None	2021-02-24 11:22:11 UTC

Internal Links: 1834414

Description Alan Conway 2020-10-16 13:33:36 UTC

Created attachment 1722069 [details]
YAML file to reproduce the problem.

Description of problem:

Creating a CLF instance with an Output.Secret causes the CLO fluentd to crash-loop because of a missing file mount.

Version-Release number of selected component (if applicable):

Current master

How reproducible: 100%


Steps to Reproduce:
1. Deploy the CLO
2. oc create -f clf-secret.yaml # attached
3. oc get pod -l component=fluentd
4. oc logs $(oc get pod -l component=fluentd -o name)

Actual results:

fluentd pod in CrashLoopBackoff state logs contains error like this: 

    /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/config/literal_parser.rb:171:in `instance_eval': No such file or directory @ rb_sysopen - /var/run/ocp-collector/secrets/test-secret/shared_key (Errno::ENOENT)


Expected results:

No errors, fluentd running.

Additional info:

Missing file is supposed to be mounted from a secret, but for some reason this is not happening.

Question: do we even need the shared_key anymore? fluentd 1.0 can do TLS without shared keys, and the shared key adds no extra security since its value is fixed and plainly visible in the source code.

Comment 1 Alan Conway 2020-10-16 14:02:25 UTC

This is a case of bad error handling. The secret is expected to have a shared_key entry, but this is not checked. If the entry is missing, we crash trying to open a nil file name.

Proposed fix: verify all the secret keys before generating fluent configuration, take appropriate action or set a degraded status.

Comment 2 Alan Conway 2020-10-16 20:56:01 UTC

Another problem: If more than one ClusterLogForwarder output has a Secret, the CLO fails to reconcile with 

{"level":"error","ts":"2020-10-16T20:50:01Z","logger":"cluster-logging-operator","msg":"clusterlogforwarder-controller returning, error","cause":{"msg":"Unable to reconcile collection for \"instance\": Failure creating Fluentd Daemonset DaemonSet.apps \"fluentd\" is invalid: spec.template.spec.containers[0].volumeMounts[15].mountPath: Invalid value: \"/var/run/ocp-collector/secrets/receiver\": must be unique"}}

Comment 3 Jeff Cantrill 2020-10-19 14:18:35 UTC

(In reply to Alan Conway from comment #2)
> Another problem: If more than one ClusterLogForwarder output has a Secret,
> the CLO fails to reconcile with 
> 
> {"level":"error","ts":"2020-10-16T20:50:01Z","logger":"cluster-logging-
> operator","msg":"clusterlogforwarder-controller returning,
> error","cause":{"msg":"Unable to reconcile collection for \"instance\":
> Failure creating Fluentd Daemonset DaemonSet.apps \"fluentd\" is invalid:
> spec.template.spec.containers[0].volumeMounts[15].mountPath: Invalid value:
> \"/var/run/ocp-collector/secrets/receiver\": must be unique"}}

More then one secret with the same name?  Secrets should be treated as a SET and if they are not then this is a bug

Comment 4 Ahmed Anwar 2020-10-21 11:47:33 UTC

Ran into this issue, and found a work around.

Basically if you have multiple outputs using the same secret, the CLO won't update the ds properly.

Reported in this bug https://bugzilla.redhat.com/show_bug.cgi?id=1890072

Comment 5 Jeff Cantrill 2020-10-23 15:19:58 UTC

Setting UpcomingSprint as unable to resolve before EOD

Comment 6 Alan Conway 2020-11-19 19:12:49 UTC

Fix in progress, overhaul of secret validation & security config. Will also fix https://bugzilla.redhat.com/show_bug.cgi?id=1834414

Comment 7 Alan Conway 2020-12-02 19:48:21 UTC

Fix PR https://github.com/openshift/cluster-logging-operator/pull/823 awaiting merge.

Comment 9 Qiaoling Tang 2021-02-02 08:00:14 UTC

Verified with cluster-logging.5.0.0-32

When the secret doesn't have `shared_key`, the fluentd pods can start and can forward logs to the fluentd server.

Comment 10 Rolfe Dlugy-Hegwer 2021-02-02 21:46:12 UTC

Moved the contents of the Doc Text field here for future reference:

Feature: ClusterLogForwarder TLS configuration 

Reason: Improved control of TLS, need to clarify the text

Result: 

This is the spec documentation comment, update the appropriate user doc to be consistent.

	// URL to send log records to.
	//
	// An absolute URL, with a scheme. Valid schemes depend on `type`.
	// Special schemes `tcp`, `tls`, `udp` and `udps` are used for types that
	// have no scheme of their own. For example, to send syslog records using secure UDP:
	//
	//     { type: syslog, url: udps://syslog.example.com:1234 }
	//
	// Basic TLS is enabled if the URL scheme requires it (for example 'https' or 'tls').
	// The 'username@password' part of `url` is ignored.
	// Any additional authentication material is in the `secret`.
	// See the `secret` field for more details.
	//
	// +kubebuilder:validation:Pattern:=`^$|[a-zA-z]+:\/\/.*`
	// +optional
	URL string `json:"url,omitempty"`

	OutputTypeSpec `json:",inline"`

	// Secret for authentication.
	// Name of a secret in the same namespace as the cluster logging operator.
	//
	// For client authentication, set secret keys `tls.crt` and `tls.key` to the client certificate and private key.
	//
	// To use your own certificate authority, set secret key `ca-bundle.crt`.
	//
	// Depending on the `type` there may be other secret keys that have meaning.
	//
	// +optional
	Secret *OutputSecretSpec `json:"secret,omitempty"`

Comment 11 Rolfe Dlugy-Hegwer 2021-02-02 21:59:33 UTC

Please describe this enhancement for our release notes using the Doc Text field.

For examples of release notes for Logging-related enhancents, please [see the current release notes][1]

(The Doc Text field is poorly named. It doesn't create a request to update the product documentation. Instead, please [create a Jira task][2] with:
- Project = Docs for Red Hat Developers (RHDEVDOCS) 
- Component = Logging)


[1]https://docs.openshift.com/container-platform/4.6/release_notes/ocp-4-6-release-notes.html#ocp-4-6-logging

[2]https://issues.redhat.com/secure/CreateIssue!default.jspa

Comment 12 Jeff Cantrill 2021-02-03 18:51:59 UTC

Moving to POST to satisfy the merge bot

Comment 13 Jeff Cantrill 2021-02-03 19:28:22 UTC

Moving back to verified per #c9

Comment 14 Alan Conway 2021-02-03 23:03:10 UTC

Updated doc text, created a documentation JIRA: https://issues.redhat.com/browse/LOG-1064

Comment 17 Alan Conway 2021-02-09 19:47:45 UTC

Doc text is good.

Comment 20 errata-xmlrpc 2021-02-24 11:21:19 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652

Note You need to log in before you can comment on or make changes to this bug.