Bug 1173337 - [RFE] More descriptive logging on where errors occur with pending operations.
Summary: [RFE] More descriptive logging on where errors occur with pending operations.
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RFE
Version: 2.1.0
Hardware: Unspecified
OS: Unspecified
unspecified
unspecified
Target Milestone: ---
: ---
Assignee: Mike Barrett
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-12-11 21:30 UTC by Eric Rich
Modified: 2019-02-15 13:56 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-04-13 17:17:52 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)

Description Eric Rich 2014-12-11 21:30:18 UTC
Description of problem:

When running operations that can happen on multiple nodes (like removing an ssh key) it would be good if the error logging was more descriptive in the broker logs or what is reported to the user. 

Example: 

rhc sshkey-remove KEY
Removing the key 'KEY ... Please sign in to start a new session to BROKER.
Password: ***********
Application config change did not complete on 1 gear. Please try again and contact support if the issue persists.

Version-Release number of selected component (if applicable): 


How reproducible: very


Steps to Reproduce:
1. add an ssh key to an application 
2. make /var/lib/openshift read only (re mount the file system) 
3. try and remove the ssh key. 

Actual results:

You get the error message above. 

Expected results:

It would be nice to (at a minimum) know what gear this failed on. So you can track this down to a node and investigate further. Since a users ssh keys can be applied to 100's of apps or (a domain) you could have to look at 100's of platform.log to find an error message like: 

December 11 15:52:10 INFO openshift-agent: request end: action=execute_parallel, requestid=a78e91c665d15f4a9f5e4865f643d880, senderid=BROKER, statuscode=0, data={:output=>[{:tag=>{"op_id"=>"548a03f94ce738f90f000006"}, :gear=>"5436de41536c95f89600002e", :job=>{:cartridge=>"openshift-origin-node", :action=>"authorized-ssh-key-batch-remove", :args=>{"--with-app-uuid"=>"5436de41536c95f89600002e", "--with-app-name"=>"NAME", "--with-expose-ports"=>true, "--with-container-uuid"=>"5436de41536c95f89600002e", "--with-container-name"=>"NAME", "--with-namespace"=>"NAMESPACE", "--with-uid"=>5494, "--with-request-id"=>"021dcedfdede1e9e9cd4e2eede0c9ead", "--with-ssh-keys"=>[{"key"=>KEY", "type"=>"ssh-rsa", "comment"=>"543576784ce7381dd6000001-jalagarslaekey", "content"=>"KEY"}]}}, :result_stdout=>"Read-only file system - /var/lib/openshift/5436de41536c95f89600002e//.ssh/authorized_keys", :result_stderr=>"", :result_exit_code=>1, :result_addtl_params=>nil}], :exitcode=>0}

Additional info:

Comment 5 Miciah Dashiel Butler Masters 2015-09-25 15:57:28 UTC
This appears to be the relevant code:

https://github.com/openshift/origin-server/blob/06d74092e7c881ff33542e62edc69584266389b5/controller/app/models/pending_app_op_group.rb#L122-L143

Looks like we have the information that we would need in order to log more detailed output on the broker or to return a more detailed error message to the user.  I have two questions:

• Would providing the gear id to the user as part of the current error message be useful, or would it be better instead to log the error, with the related gear ids, to the broker logs?

• Should we provide a full list of gear ids, or cap it to, say, the first 3 gear ids where the operation failed, so you would get something like, "Application config change did not complete on 10 gears: 5436de41536c95f89600002d, 5436de41536c95f89600002e, 5436de41536c95f89600002f, and 7 others."?

Comment 6 Eric Rich 2015-10-05 12:18:36 UTC
(In reply to Miciah Dashiel Butler Masters from comment #5)

> • Would providing the gear id to the user as part of the current error
> message be useful, or would it be better instead to log the error, with the
> related gear ids, to the broker logs?
> 
> • Should we provide a full list of gear ids, or cap it to, say, the first 3
> gear ids where the operation failed, so you would get something like,
> "Application config change did not complete on 10 gears:
> 5436de41536c95f89600002d, 5436de41536c95f89600002e,
> 5436de41536c95f89600002f, and 7 others."?

The issue here is that the following message to the user's is not descriptive enough to explain why the operation can not be completed. 

> Application config change did not complete on 1 gear. Please try again and contact support if the issue persists.

In looking at the logs from the perspective of an "administrator" it clear why this happens.

> result_stdout=>"Read-only file system - /var/lib/openshift/5436de41536c95f89600002e//.ssh/authorized_keys", :result_stderr=>"", :result_exit_code=>1, :result_addtl_params=>nil}], :exitcode=>0}

However the issue is finding out what node this issue occurs on. 

> It would be nice to (at a minimum) know what gear this failed on. So you can track this down to a node and investigate further.

I can use tools like `oo-app-info` to get a complete list of nodes for the application, however based on the error message: 

> Application config change did not complete on 1 gear. Please try again and contact support if the issue persists.

I don't know where the issue occurred. The key here is knowing what node the failure happened on, and not what gear's are failing. 

It might be possible to report on the RequestID like we do with other messages because then at least grep for the operation id that failed, in the sub-set of nodes we know that houses the application (not a complete fix for this request, but its something). 

> requestid=a78e91c665d15f4a9f5e4865f643d880, senderid=BROKER

The other option is to "over" share what node (hostname) the failure happens on.


Note You need to log in before you can comment on or make changes to this bug.