Bug 1888943

Summary: ClusterLogForwarder with Output.Secret causes fluentd crash loop
Product: OpenShift Container Platform
Reporter: Alan Conway <aconway>
Component: Logging
Assignee: Alan Conway <aconway>
Status: CLOSED ERRATA
QA Contact: Qiaoling Tang <qitang>
Severity: urgent
Docs Contact: Rolfe Dlugy-Hegwer <rdlugyhe>
Priority: urgent
Version: 4.6
CC: aanwar, anli, aos-bugs, gferrazs, jcantril, periklis, qitang, rdlugyhe
Target Milestone: ---
Flags: aconway: needinfo+
Target Release: 4.7.0
Hardware: All
OS: All
Whiteboard: logging-core
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
* Previously, the Fluentd collector pod went into a crash loop when the ClusterLogForwarder had an incorrectly configured secret. The current release fixes this issue. Now, the ClusterLogForwarder validates secrets and reports any errors in its status field. As a result, it no longer causes the Fluentd collector pod to crash. (link:https://bugzilla.redhat.com/show_bug.cgi?id=1888943[*1888943*])
Story Points: ---
Clone Of:
: 1929420 (view as bug list)
Environment:
Last Closed: 2021-02-24 11:21:19 UTC
Type: Bug
Bug Blocks: 1929420    
Attachments:
YAML file to reproduce the problem. (flags: none)

Description Alan Conway 2020-10-16 13:33:36 UTC
Created attachment 1722069 [details]
YAML file to reproduce the problem.

Description of problem:

Creating a ClusterLogForwarder (CLF) instance with an Output.Secret causes the fluentd collector deployed by the cluster-logging-operator (CLO) to crash-loop because of a missing file mount.

Version-Release number of selected component (if applicable):

Current master

How reproducible: 100%


Steps to Reproduce:
1. Deploy the CLO
2. oc create -f clf-secret.yaml # attached (sketched below)
3. oc get pod -l component=fluentd
4. oc logs $(oc get pod -l component=fluentd -o name)
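
The attached clf-secret.yaml is not reproduced in this report. The following is a minimal sketch of the kind of forwarder and secret that trigger the crash, inferred from the error under "Actual results": the secret name test-secret comes from the error path, while the output name, pipeline name, and URL are illustrative.

    apiVersion: logging.openshift.io/v1
    kind: ClusterLogForwarder
    metadata:
      name: instance
      namespace: openshift-logging
    spec:
      outputs:
      - name: fluentd-remote                   # illustrative name
        type: fluentdForward
        url: tls://fluentd.example.com:24224   # illustrative URL
        secret:
          name: test-secret                    # secret name taken from the error path
      pipelines:
      - name: forward-app-logs                 # illustrative name
        inputRefs:
        - application
        outputRefs:
        - fluentd-remote
    ---
    # A secret that lacks the shared_key entry the generated fluentd config expects.
    apiVersion: v1
    kind: Secret
    metadata:
      name: test-secret
      namespace: openshift-logging
    type: Opaque
    stringData:
      tls.crt: "<client certificate PEM>"      # placeholder value
      tls.key: "<client private key PEM>"      # placeholder value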

Actual results:

The fluentd pod is in CrashLoopBackOff state, and its logs contain an error like this:

    /usr/local/share/gems/gems/fluentd-1.7.4/lib/fluent/config/literal_parser.rb:171:in `instance_eval': No such file or directory @ rb_sysopen - /var/run/ocp-collector/secrets/test-secret/shared_key (Errno::ENOENT)


Expected results:

No errors, fluentd running.

Additional info:

The missing file is supposed to be mounted from a secret, but for some reason this is not happening.
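
For context, each output secret appears to be mounted into the collector container under /var/run/ocp-collector/secrets/<secret-name>, and Kubernetes turns each key of the secret into a file in that directory, so a secret without a shared_key entry means the file the generated config reads never exists. A rough sketch of the relevant DaemonSet fragment follows (the volume name is an assumption, not the operator's exact output):

    spec:
      template:
        spec:
          containers:
          - name: fluentd
            volumeMounts:
            - name: test-secret               # assumed volume name
              mountPath: /var/run/ocp-collector/secrets/test-secret
              readOnly: true
          volumes:
          - name: test-secret
            secret:
              secretName: test-secret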

Question: do we even need the shared_key anymore? fluentd 1.0 can do TLS without shared keys, and the shared key adds no extra security since its value is fixed and plainly visible in the source code.

Comment 1 Alan Conway 2020-10-16 14:02:25 UTC
This is a case of bad error handling. The secret is expected to have a shared_key entry, but this is not checked. If the entry is missing, we crash trying to open a nil file name.

Proposed fix: verify all the secret keys before generating the fluentd configuration, and either take appropriate action or set a degraded status.
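
For illustration only, surfacing the problem on the ClusterLogForwarder status (as the doc text above describes) might look roughly like the following; the condition reason and message are hypothetical, and the exact shape of the status may differ:

    status:
      outputs:
        fluentd-remote:                       # output name from the sketch above
        - type: Ready
          status: "False"
          reason: MissingResource             # hypothetical reason
          message: secret "test-secret" is missing required key "shared_key"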

Comment 2 Alan Conway 2020-10-16 20:56:01 UTC
Another problem: If more than one ClusterLogForwarder output has a Secret, the CLO fails to reconcile with 

{"level":"error","ts":"2020-10-16T20:50:01Z","logger":"cluster-logging-operator","msg":"clusterlogforwarder-controller returning, error","cause":{"msg":"Unable to reconcile collection for \"instance\": Failure creating Fluentd Daemonset DaemonSet.apps \"fluentd\" is invalid: spec.template.spec.containers[0].volumeMounts[15].mountPath: Invalid value: \"/var/run/ocp-collector/secrets/receiver\": must be unique"}}

Comment 3 Jeff Cantrill 2020-10-19 14:18:35 UTC
(In reply to Alan Conway from comment #2)
> Another problem: If more than one ClusterLogForwarder output has a Secret,
> the CLO fails to reconcile with 
> 
> {"level":"error","ts":"2020-10-16T20:50:01Z","logger":"cluster-logging-
> operator","msg":"clusterlogforwarder-controller returning,
> error","cause":{"msg":"Unable to reconcile collection for \"instance\":
> Failure creating Fluentd Daemonset DaemonSet.apps \"fluentd\" is invalid:
> spec.template.spec.containers[0].volumeMounts[15].mountPath: Invalid value:
> \"/var/run/ocp-collector/secrets/receiver\": must be unique"}}

More than one secret with the same name? Secrets should be treated as a SET, and if they are not, then this is a bug.

Comment 4 Ahmed Anwar 2020-10-21 11:47:33 UTC
Ran into this issue and found a workaround.

Basically, if you have multiple outputs using the same secret, the CLO won't update the DaemonSet properly.

Reported in this bug https://bugzilla.redhat.com/show_bug.cgi?id=1890072
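
For illustration, the failing shape from comment 2 and this comment is two outputs that reference the same secret, which (before the fix) produced two identical volume mounts at /var/run/ocp-collector/secrets/receiver. The output names and URLs below are made up; the secret name comes from the reconcile error:

    spec:
      outputs:
      - name: forward-a                       # illustrative name
        type: fluentdForward
        url: tls://receiver-a.example.com:24224
        secret:
          name: receiver                      # same secret referenced twice
      - name: forward-b                       # illustrative name
        type: fluentdForward
        url: tls://receiver-b.example.com:24224
        secret:
          name: receiver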

Comment 5 Jeff Cantrill 2020-10-23 15:19:58 UTC
Setting UpcomingSprint as unable to resolve before EOD

Comment 6 Alan Conway 2020-11-19 19:12:49 UTC
Fix in progress: an overhaul of secret validation and security configuration. It will also fix https://bugzilla.redhat.com/show_bug.cgi?id=1834414

Comment 7 Alan Conway 2020-12-02 19:48:21 UTC
Fix PR https://github.com/openshift/cluster-logging-operator/pull/823 awaiting merge.

Comment 9 Qiaoling Tang 2021-02-02 08:00:14 UTC
Verified with cluster-logging.5.0.0-32

When the secret doesn't have `shared_key`, the fluentd pods can start and can forward logs to the fluentd server.

Comment 10 Rolfe Dlugy-Hegwer 2021-02-02 21:46:12 UTC
Moved the contents of the Doc Text field here for future reference:

Feature: ClusterLogForwarder TLS configuration 

Reason: Improved control of TLS, need to clarify the text

Result: 

This is the spec documentation comment; update the appropriate user doc to be consistent with it.

	// URL to send log records to.
	//
	// An absolute URL, with a scheme. Valid schemes depend on `type`.
	// Special schemes `tcp`, `tls`, `udp` and `udps` are used for types that
	// have no scheme of their own. For example, to send syslog records using secure UDP:
	//
	//     { type: syslog, url: udps://syslog.example.com:1234 }
	//
	// Basic TLS is enabled if the URL scheme requires it (for example 'https' or 'tls').
	// The 'username@password' part of `url` is ignored.
	// Any additional authentication material is in the `secret`.
	// See the `secret` field for more details.
	//
	// +kubebuilder:validation:Pattern:=`^$|[a-zA-z]+:\/\/.*`
	// +optional
	URL string `json:"url,omitempty"`

	OutputTypeSpec `json:",inline"`

	// Secret for authentication.
	// Name of a secret in the same namespace as the cluster logging operator.
	//
	// For client authentication, set secret keys `tls.crt` and `tls.key` to the client certificate and private key.
	//
	// To use your own certificate authority, set secret key `ca-bundle.crt`.
	//
	// Depending on the `type` there may be other secret keys that have meaning.
	//
	// +optional
	Secret *OutputSecretSpec `json:"secret,omitempty"`
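
To make the `secret` comment above concrete, here is a sketch of a secret carrying the keys it names; the values are placeholders and the name is illustrative, while the namespace follows the "same namespace as the cluster logging operator" rule (openshift-logging):

    apiVersion: v1
    kind: Secret
    metadata:
      name: receiver                     # referenced from an output's secret.name
      namespace: openshift-logging       # same namespace as the cluster logging operator
    type: Opaque
    data:
      tls.crt: <base64-encoded client certificate>
      tls.key: <base64-encoded client private key>
      ca-bundle.crt: <base64-encoded CA bundle>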

Comment 11 Rolfe Dlugy-Hegwer 2021-02-02 21:59:33 UTC
Please describe this enhancement for our release notes using the Doc Text field.

For examples of release notes for Logging-related enhancements, please [see the current release notes][1]

(The Doc Text field is poorly named. It doesn't create a request to update the product documentation. Instead, please [create a Jira task][2] with:
- Project = Docs for Red Hat Developers (RHDEVDOCS) 
- Component = Logging)


[1]https://docs.openshift.com/container-platform/4.6/release_notes/ocp-4-6-release-notes.html#ocp-4-6-logging

[2]https://issues.redhat.com/secure/CreateIssue!default.jspa

Comment 12 Jeff Cantrill 2021-02-03 18:51:59 UTC
Moving to POST to satisfy the merge bot

Comment 13 Jeff Cantrill 2021-02-03 19:28:22 UTC
Moving back to verified per #c9

Comment 14 Alan Conway 2021-02-03 23:03:10 UTC
Updated doc text, created a documentation JIRA: https://issues.redhat.com/browse/LOG-1064

Comment 17 Alan Conway 2021-02-09 19:47:45 UTC
Doc text is good.

Comment 20 errata-xmlrpc 2021-02-24 11:21:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Errata Advisory for Openshift Logging 5.0.0), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:0652