Description of problem:

When running operations that can happen on multiple nodes (such as removing an ssh key), it would be good if the error logging were more descriptive, either in the broker logs or in what is reported to the user. Example:

rhc sshkey-remove KEY
Removing the key 'KEY ...
Please sign in to start a new session to BROKER.
Password: ***********
Application config change did not complete on 1 gear. Please try again and contact support if the issue persists.

Version-Release number of selected component (if applicable):

How reproducible:
very

Steps to Reproduce:
1. Add an ssh key to an application.
2. Make /var/lib/openshift read-only (remount the file system, e.g. `mount -o remount,ro /var/lib/openshift`).
3. Try to remove the ssh key.

Actual results:
You get the error message above.

Expected results:
It would be nice to know (at a minimum) which gear this failed on, so you can track it down to a node and investigate further. Since a user's ssh keys can be applied to hundreds of apps (or a domain), you could have to look through hundreds of platform.log files to find an error message like:

December 11 15:52:10 INFO openshift-agent: request end: action=execute_parallel, requestid=a78e91c665d15f4a9f5e4865f643d880, senderid=BROKER, statuscode=0, data={:output=>[{:tag=>{"op_id"=>"548a03f94ce738f90f000006"}, :gear=>"5436de41536c95f89600002e", :job=>{:cartridge=>"openshift-origin-node", :action=>"authorized-ssh-key-batch-remove", :args=>{"--with-app-uuid"=>"5436de41536c95f89600002e", "--with-app-name"=>"NAME", "--with-expose-ports"=>true, "--with-container-uuid"=>"5436de41536c95f89600002e", "--with-container-name"=>"NAME", "--with-namespace"=>"NAMESPACE", "--with-uid"=>5494, "--with-request-id"=>"021dcedfdede1e9e9cd4e2eede0c9ead", "--with-ssh-keys"=>[{"key"=>"KEY", "type"=>"ssh-rsa", "comment"=>"543576784ce7381dd6000001-jalagarslaekey", "content"=>"KEY"}]}}, :result_stdout=>"Read-only file system - /var/lib/openshift/5436de41536c95f89600002e//.ssh/authorized_keys", :result_stderr=>"", :result_exit_code=>1, :result_addtl_params=>nil}], :exitcode=>0}

Additional info:
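For what it's worth, the execute_parallel reply captured above is a plain Ruby structure, so the failing gear uuid and the node-side stdout are already in hand by the time the broker logs it. A minimal sketch of pulling those fields out of a reply of that shape (the variable names are illustrative, not the actual broker plumbing):

# Minimal sketch: pick out the failed ops from a reply shaped like the
# log entry above. reply_data is illustrative sample data.
reply_data = {
  :output => [
    { :tag => { "op_id" => "548a03f94ce738f90f000006" },
      :gear => "5436de41536c95f89600002e",
      :result_stdout => "Read-only file system - /var/lib/openshift/5436de41536c95f89600002e//.ssh/authorized_keys",
      :result_stderr => "",
      :result_exit_code => 1 }
  ],
  :exitcode => 0
}

failures = reply_data[:output].select { |r| r[:result_exit_code] != 0 }
failures.each do |r|
  puts "op #{r[:tag]['op_id']} failed on gear #{r[:gear]}: #{r[:result_stdout]}"
end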
This appears to be the relevant code:

https://github.com/openshift/origin-server/blob/06d74092e7c881ff33542e62edc69584266389b5/controller/app/models/pending_app_op_group.rb#L122-L143

It looks like we have the information that we would need in order to log more detailed output on the broker or to return a more detailed error message to the user. I have two questions:

• Would providing the gear id to the user as part of the current error message be useful, or would it be better instead to log the error, with the related gear ids, to the broker logs?

• Should we provide a full list of gear ids, or cap it to, say, the first 3 gear ids where the operation failed, so you would get something like, "Application config change did not complete on 10 gears: 5436de41536c95f89600002d, 5436de41536c95f89600002e, 5436de41536c95f89600002f, and 7 others."?
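To make the second option concrete, here is a minimal sketch of the capped message format. The method name config_change_error and the cap of 3 are made up for illustration; nothing here is existing broker code:

# Sketch of the proposed capped error message.
CAP = 3

def config_change_error(failed_gear_ids)
  count = failed_gear_ids.size
  noun  = count == 1 ? "gear" : "gears"
  msg   = "Application config change did not complete on #{count} #{noun}: " +
          failed_gear_ids.first(CAP).join(", ")
  msg += ", and #{count - CAP} others" if count > CAP
  msg + ". Please try again and contact support if the issue persists."
end

# Ten failed gears => "... on 10 gears: <first three ids>, and 7 others."
ids = %w[5436de41536c95f89600002d 5436de41536c95f89600002e 5436de41536c95f89600002f]
puts config_change_error(ids + Array.new(7) { |i| "gear-#{i}" })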
(In reply to Miciah Dashiel Butler Masters from comment #5)
> • Would providing the gear id to the user as part of the current error
> message be useful, or would it be better instead to log the error, with the
> related gear ids, to the broker logs?
>
> • Should we provide a full list of gear ids, or cap it to, say, the first 3
> gear ids where the operation failed, so you would get something like,
> "Application config change did not complete on 10 gears:
> 5436de41536c95f89600002d, 5436de41536c95f89600002e,
> 5436de41536c95f89600002f, and 7 others."?

The issue here is that the following message to the user is not descriptive enough to explain why the operation cannot be completed:

> Application config change did not complete on 1 gear. Please try again and contact support if the issue persists.

Looking at the logs from the perspective of an administrator, it is clear why this happens:

> result_stdout=>"Read-only file system - /var/lib/openshift/5436de41536c95f89600002e//.ssh/authorized_keys", :result_stderr=>"", :result_exit_code=>1, :result_addtl_params=>nil}], :exitcode=>0}

However, the difficulty is finding out which node this occurs on:

> It would be nice to know (at a minimum) which gear this failed on, so you can track it down to a node and investigate further.

I can use tools like `oo-app-info` to get a complete list of nodes for the application, but based on the error message:

> Application config change did not complete on 1 gear. Please try again and contact support if the issue persists.

I don't know where the issue occurred. The key here is knowing which node the failure happened on, not which gears are failing.

It might be possible to report the RequestID like we do with other messages, because then we could at least grep for the operation id that failed on the subset of nodes we know houses the application (not a complete fix for this request, but it's something):

> requestid=a78e91c665d15f4a9f5e4865f643d880, senderid=BROKER

The other option is to "over-share" which node (hostname) the failure happens on.
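If we went that route, a hypothetical broker-side sketch might look like the following. It assumes each gear record carries the hosting node's hostname (the Gear model's server_identity field, if I recall correctly; treat that as an assumption), and the helper and sample data are made up for illustration:

# Hypothetical sketch, not actual broker code: log each failed gear
# together with the node that hosts it, so an admin can go straight
# to the right node's platform.log.
Gear = Struct.new(:uuid, :server_identity)

def log_failed_gears(gears, failures)
  failures.each do |gear_id, stdout|
    gear = gears.find { |g| g.uuid == gear_id }
    node = gear ? gear.server_identity : "unknown node"
    puts "config change failed on gear #{gear_id} (node: #{node}): #{stdout}"
  end
end

gears    = [Gear.new("5436de41536c95f89600002e", "node1.example.com")]
failures = { "5436de41536c95f89600002e" => "Read-only file system" }
log_failed_gears(gears, failures)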