Bug 1707061

Summary: no termination message provided by failing openshift-controller-manager-operator pod
Product: OpenShift Container Platform Reporter: Luis Sanchez <sanchezl>
Component: Build Assignee: Luis Sanchez <sanchezl>
Status: CLOSED ERRATA QA Contact: wewang <wewang>
Severity: high Docs Contact:
Priority: high    
Version: 4.1.0 CC: aos-bugs, jokerman, mfojtik, mmccomas, wzheng
Target Milestone: ---   
Target Release: 4.1.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-06-04 10:48:31 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description Luis Sanchez 2019-05-06 17:35:50 UTC
The openshift-controller-manager-operator pod does not provide a termination message, hindering debugging efforts when the pods are crash looping.

At minimum, the pod's terminationMessagePolicy should be "FallbackToLogsOnError".

See https://kubernetes.io/docs/tasks/debug-application-cluster/determine-reason-pod-failure/#customizing-the-termination-message

Expected Results:
The termination message should appear in a pod container's .status.lastState.terminated.message field.
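
A minimal sketch of how both the policy and the resulting message could be checked with `oc`, assuming a logged-in session with read access to the namespace. The helper names `show_policy` and `show_last_termination_message` and the `NS` variable are hypothetical and introduced only for illustration; the jsonpath expressions follow the field named above and the standard Deployment pod template layout.

```shell
# Namespace taken from this report; set NS to inspect a different one.
NS="${NS:-openshift-controller-manager-operator}"

# Print each deployment's name followed by the terminationMessagePolicy
# of its containers.
show_policy() {
  oc get deployment -n "$NS" \
    -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.spec.template.spec.containers[*].terminationMessagePolicy}{"\n"}{end}'
}

# Print each pod's name followed by the termination message recorded for
# its containers' previous run (empty if the container never terminated).
show_last_termination_message() {
  oc get pods -n "$NS" \
    -o jsonpath='{range .items[*]}{.metadata.name}{":\n"}{.status.containerStatuses[*].lastState.terminated.message}{"\n"}{end}'
}
```

With FallbackToLogsOnError, Kubernetes fills the message from the tail of the container log when the container exits with an error and wrote nothing to its termination message file. Per the Kubernetes documentation, that log output is limited to 2048 bytes or 80 lines, whichever is smaller, so the message may begin in the middle of a log line.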

Comment 1 Luis Sanchez 2019-05-06 18:55:55 UTC
Fixed by https://bugzilla.redhat.com/show_bug.cgi?id=1707061

Comment 4 wewang 2019-05-08 02:35:04 UTC
Tested in version:
$ oc get clusterversion 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-07-201043   True        False         21m     Cluster version is 4.1.0-0.nightly-2019-05-07-201043
payload: registry.svc.ci.openshift.org/ocp/release@sha256:41319d522be6f0c739c38f6699320360732ee94d7f669d0646459cfd867c9963

1. The deployment now sets FallbackToLogsOnError:
$ oc get deployment -o yaml -n openshift-controller-manager-operator |grep -i "terminationMessagePolicy"
          terminationMessagePolicy: FallbackToLogsOnError

2. Check the pod: the status.lastState.terminated.message field now contains a message, but it begins with "er -n openshift-controller-manager because it changed". What is "er"? Perhaps it should be the word "error". @Luis Sanchez, could you help confirm? Thanks.

$ oc get pods -o yaml -n openshift-controller-manager-operator      
      lastState:
        terminated:
          containerID: cri-o://acf9ade72f4486756a181783061d6856ae68603d2ccd85cd21819bdb6e843a4e
          exitCode: 255
          finishedAt: "2019-05-08T01:47:46Z"
          message: |
            er -n openshift-controller-manager because it changed
            I0508 01:47:28.779571       1 status_controller.go:160] clusteroperator/openshift-controller-manager diff {"status":{"conditions":[{"lastTransitionTime":"2019-05-08T01:46:28Z","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2019-05-08T01:46:33Z","message":"Progressing: daemonset/controller-manager: observed generation is 6, desired generation is 7.","reason":"ProgressingDesiredStateNotYetAchieved","status":"True","type":"Progressing"},{"lastTransitionTime":"2019-05-08T01:46:33Z","message":"Available: no daemon pods available on any node.","reason":"AvailableNoPodsAvailable","status":"False","type":"Available"},{"lastTransitionTime":"2019-05-08T01:46:28Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}
            I0508 01:47:28.794513       1 event.go:221] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-controller-manager-operator", Name:"openshift-controller-manager-operator", UID:"fda5aabd-7132-11e9-bab6-025511648778", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for operator openshift-controller-manager changed: Progressing message changed from "Progressing: daemonset/controller-manager: observed generation is 5, desired generation is 6." to "Progressing: daemonset/controller-manager: observed generation is 6, desired generation is 7."
            I0508 01:47:46.398872       1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.key (current: "9aa132c772c9c7c3049c798e2e975534c716d5dcb2f759f0a457f114fa31222c", lastKnown: "")
            W0508 01:47:46.398907       1 builder.go:108] Restart triggered because of file /var/run/secrets/serving-cert/tls.key was created
            F0508 01:47:46.398965       1 leaderelection.go:65] leaderelection lost
            I0508 01:47:46.400265       1 observer_polling.go:78] Observed change: file:/var/run/secrets/serving-cert/tls.crt (current: "3014a8d059742d8afc5be688ce0d2f2a5b563770adcbf41db36fb8a09e40f13e", lastKnown: "")
          reason: Error

Comment 5 Wenjing Zheng 2019-05-08 08:16:28 UTC
The above error seems unrelated to this bug (it does not occur in a newer version). Since the policy is set as shown below, I will verify this bug now:
$ oc get deployment -o yaml -n openshift-controller-manager-operator |grep -i "terminationMessagePolicy"
          terminationMessagePolicy: FallbackToLogsOnError
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-07-233329   True        False         5h18m   Cluster version is 4.1.0-0.nightly-2019-05-07-233329

Comment 7 errata-xmlrpc 2019-06-04 10:48:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758