Description of problem:

When running operations that can happen on multiple nodes (such as removing an ssh key), it would be good if the error logging were more descriptive, either in the broker logs or in what is reported to the user. Example:

rhc sshkey-remove KEY
Removing the key 'KEY ...
Please sign in to start a new session to BROKER.
Password: ***********
Application config change did not complete on 1 gear. Please try again and contact support if the issue persists.

Version-Release number of selected component (if applicable):

How reproducible:
very

Steps to Reproduce:
1. Add an ssh key to an application.
2. Make /var/lib/openshift read-only (remount the file system, e.g. `mount -o remount,ro /var/lib/openshift`).
3. Try to remove the ssh key.

Actual results:
You get the error message above.

Expected results:
It would be nice to know (at a minimum) which gear this failed on, so you can track it down to a node and investigate further. Since a user's ssh keys can be applied to hundreds of apps (or a domain), you could have to look through hundreds of platform.log files to find an error message like:

December 11 15:52:10 INFO openshift-agent: request end: action=execute_parallel, requestid=a78e91c665d15f4a9f5e4865f643d880, senderid=BROKER, statuscode=0, data={:output=>[{:tag=>{"op_id"=>"548a03f94ce738f90f000006"}, :gear=>"5436de41536c95f89600002e", :job=>{:cartridge=>"openshift-origin-node", :action=>"authorized-ssh-key-batch-remove", :args=>{"--with-app-uuid"=>"5436de41536c95f89600002e", "--with-app-name"=>"NAME", "--with-expose-ports"=>true, "--with-container-uuid"=>"5436de41536c95f89600002e", "--with-container-name"=>"NAME", "--with-namespace"=>"NAMESPACE", "--with-uid"=>5494, "--with-request-id"=>"021dcedfdede1e9e9cd4e2eede0c9ead", "--with-ssh-keys"=>[{"key"=>"KEY", "type"=>"ssh-rsa", "comment"=>"543576784ce7381dd6000001-jalagarslaekey", "content"=>"KEY"}]}}, :result_stdout=>"Read-only file system - /var/lib/openshift/5436de41536c95f89600002e//.ssh/authorized_keys", :result_stderr=>"", :result_exit_code=>1, :result_addtl_params=>nil}], :exitcode=>0}

Additional info:
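For what it's worth, the execute_parallel reply captured above is a plain Ruby structure, so the failing gear uuid and the node-side stdout are already in hand by the time the broker logs it. A minimal sketch of pulling those fields out of a reply of that shape (the variable names are illustrative, not the actual broker plumbing):

# Minimal sketch: pick out the failed ops from a reply shaped like the
# log entry above. reply_data is illustrative sample data.
reply_data = {
  :output => [
    { :tag => { "op_id" => "548a03f94ce738f90f000006" },
      :gear => "5436de41536c95f89600002e",
      :result_stdout => "Read-only file system - /var/lib/openshift/5436de41536c95f89600002e//.ssh/authorized_keys",
      :result_stderr => "",
      :result_exit_code => 1 }
  ],
  :exitcode => 0
}

failures = reply_data[:output].select { |r| r[:result_exit_code] != 0 }
failures.each do |r|
  puts "op #{r[:tag]['op_id']} failed on gear #{r[:gear]}: #{r[:result_stdout]}"
end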
This appears to be the relevant code:

https://github.com/openshift/origin-server/blob/06d74092e7c881ff33542e62edc69584266389b5/controller/app/models/pending_app_op_group.rb#L122-L143

It looks like we have the information that we would need in order to log more detailed output on the broker or to return a more detailed error message to the user. I have two questions:

• Would providing the gear id to the user as part of the current error message be useful, or would it be better instead to log the error, with the related gear ids, to the broker logs?

• Should we provide a full list of gear ids, or cap it to, say, the first 3 gear ids where the operation failed, so you would get something like, "Application config change did not complete on 10 gears: 5436de41536c95f89600002d, 5436de41536c95f89600002e, 5436de41536c95f89600002f, and 7 others."?
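To make the second option concrete, here is a minimal sketch of the capped message format. The method name config_change_error and the cap of 3 are made up for illustration; nothing here is existing broker code:

# Sketch of the proposed capped error message.
CAP = 3

def config_change_error(failed_gear_ids)
  count = failed_gear_ids.size
  noun  = count == 1 ? "gear" : "gears"
  msg   = "Application config change did not complete on #{count} #{noun}: " +
          failed_gear_ids.first(CAP).join(", ")
  msg += ", and #{count - CAP} others" if count > CAP
  msg + ". Please try again and contact support if the issue persists."
end

# Ten failed gears => "... on 10 gears: <first three ids>, and 7 others."
ids = %w[5436de41536c95f89600002d 5436de41536c95f89600002e 5436de41536c95f89600002f]
puts config_change_error(ids + Array.new(7) { |i| "gear-#{i}" })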
(In reply to Miciah Dashiel Butler Masters from comment #5)
> • Would providing the gear id to the user as part of the current error
> message be useful, or would it be better instead to log the error, with the
> related gear ids, to the broker logs?
>
> • Should we provide a full list of gear ids, or cap it to, say, the first 3
> gear ids where the operation failed, so you would get something like,
> "Application config change did not complete on 10 gears:
> 5436de41536c95f89600002d, 5436de41536c95f89600002e,
> 5436de41536c95f89600002f, and 7 others."?

The issue here is that the following message to the user is not descriptive enough to explain why the operation cannot be completed:

> Application config change did not complete on 1 gear. Please try again and contact support if the issue persists.

Looking at the logs from the perspective of an administrator, it is clear why this happens:

> result_stdout=>"Read-only file system - /var/lib/openshift/5436de41536c95f89600002e//.ssh/authorized_keys", :result_stderr=>"", :result_exit_code=>1, :result_addtl_params=>nil}], :exitcode=>0}

However, the difficulty is finding out which node this occurs on:

> It would be nice to know (at a minimum) which gear this failed on, so you can track it down to a node and investigate further.

I can use tools like `oo-app-info` to get a complete list of nodes for the application, but based on the error message:

> Application config change did not complete on 1 gear. Please try again and contact support if the issue persists.

I don't know where the issue occurred. The key here is knowing which node the failure happened on, not which gears are failing.

It might be possible to report the RequestID like we do with other messages, because then we could at least grep for the operation id that failed on the subset of nodes we know houses the application (not a complete fix for this request, but it's something):

> requestid=a78e91c665d15f4a9f5e4865f643d880, senderid=BROKER

The other option is to "over-share" which node (hostname) the failure happens on.
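If we went that route, a hypothetical broker-side sketch might look like the following. It assumes each gear record carries the hosting node's hostname (the Gear model's server_identity field, if I recall correctly; treat that as an assumption), and the helper and sample data are made up for illustration:

# Hypothetical sketch, not actual broker code: log each failed gear
# together with the node that hosts it, so an admin can go straight
# to the right node's platform.log.
Gear = Struct.new(:uuid, :server_identity)

def log_failed_gears(gears, failures)
  failures.each do |gear_id, stdout|
    gear = gears.find { |g| g.uuid == gear_id }
    node = gear ? gear.server_identity : "unknown node"
    puts "config change failed on gear #{gear_id} (node: #{node}): #{stdout}"
  end
end

gears    = [Gear.new("5436de41536c95f89600002e", "node1.example.com")]
failures = { "5436de41536c95f89600002e" => "Read-only file system" }
log_failed_gears(gears, failures)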