Bug 1970015

Summary: [master] Missing logs in case cluster installation failed
Product: OpenShift Container Platform
Component: assisted-installer
Sub component: Deployment Operator
Version: 4.8
Reporter: Eran Cohen <ercohen>
Assignee: Fred Rolland <frolland>
QA Contact: bjacot
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Keywords: Triaged
CC: akrzos, alazar, aos-bugs, frolland, itsoiref, mfilanov, sasha
Flags: ercohen: needinfo-
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: KNI-EDGE-JUKE-4.8 AI-Team-Hive
Type: Bug
Clones: 1971288
Bug Blocks: 1971288
Last Closed: 2021-06-29 12:21:35 UTC

Description Eran Cohen 2021-06-09 15:56:26 UTC
Description of problem:
I tried to get the logs of a failed installation and got this response from the link:
{"code":"500","href":"","id":500,"kind":"Error","reason":"No log files were found"}

Version-Release number of selected component (if applicable):


How reproducible:

1/1
Steps to Reproduce:
1. Create the following resources:
- BMH
- pull-secret
- a manifest that will fail the cluster installation (see content below, though it doesn't really matter how you fail the installation)
- infra-env
- agent-cluster-install (with the manifestsConfigMapRef)
- cluster-deployment

2. Describe the AgentClusterInstall once the installation fails
3. Curl the logs link from the AgentClusterInstall status to get the logs (see the sketch after these steps)
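
A minimal sketch of steps 2-3, assuming an AgentClusterInstall named test-cluster in the assisted-installer namespace (both names are hypothetical) and that the logs link is the one exposed under status.debugInfo.logsURL (the "DebugInfo URL" mentioned in comment 2):

# Inspect the AgentClusterInstall once the installation has failed
oc describe agentclusterinstall test-cluster -n assisted-installer

# Extract the logs URL from the status and fetch the log tarball
LOGS_URL=$(oc get agentclusterinstall test-cluster -n assisted-installer \
  -o jsonpath='{.status.debugInfo.logsURL}')
curl -k -o cluster-logs.tar.gz "$LOGS_URL"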


manifest content:

kind: ConfigMap
apiVersion: v1
metadata:
  name: single-node-manifests
  namespace: assisted-installer
data:
  99_master_kernel_arg.yaml: |
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: master
      name: 02-master-workload-partitioning
    spec:
      config:
        ignition:
          version: 3.2.0
        storage:
          files:
          - contents:
              source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS53b3JrbG9hZHMubWFuYWdlbWVudF0KYWN0aXZhdGlvbl9hbm5vdGF0aW9uID0gInRhcmdldC53b3JrbG9hZC5vcGVuc2hpZnQuaW8vbWFuYWdlbWVudCIKYW5ub3RhdGlvbl9wcmVmaXggPSAicmVzb3VyY2VzLndvcmtsb2FkLm9wZW5zaGlmdC5pbyIKcmVzb3VyY2VzID0geyAiY3B1c2hhcmVzIiA9IDAsICJjcHVzZXQiID0gIjAtMSw1Mi01MyIgfQ==
            mode: 420
            overwrite: true
            path: /etc/crio/crio.conf.d/01-workload-partitioning
            user:
              name: root
          - contents:
              source: data:text/plain;charset=utf-8;base64,ewogICJtYW5hZ2VtZW50IjogewogICAgImNwdXNldCI6ICIwLTEsNTItNTMiCiAgfQp9
            mode: 420
            overwrite: true
            path: /etc/kubernetes/openshift-workload-pinning
            user:
              name: root
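
For reference, decoding the two base64 payloads above yields the CRI-O workload-management config and the kubelet workload-pinning file:

/etc/crio/crio.conf.d/01-workload-partitioning:
[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0-1,52-53" }

/etc/kubernetes/openshift-workload-pinning:
{
  "management": {
    "cpuset": "0-1,52-53"
  }
}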



Actual results:

{"code":"500","href":"","id":500,"kind":"Error","reason":"No log files were found"}

Expected results:
Get the cluster logs

Additional info:

I can log in to the host, and the assisted-installer-controller did upload the logs:
time="2021-06-08T16:09:49Z" level=info msg="Cluster installation failed."
time="2021-06-08T16:09:49Z" level=info msg="Waiting for all go routines to finish"
time="2021-06-08T16:09:49Z" level=info msg="Finished PostInstallConfigs"
time="2021-06-08T16:09:49Z" level=info msg="Finished UpdateBMHs"
time="2021-06-08T16:09:49Z" level=info msg="Finished UploadLogs"
time="2021-06-08T16:09:49Z" level=info msg="Finished all"

Comment 1 Igal Tsoiref 2021-06-10 07:34:17 UTC
In this specific case we shouldn't get controller logs, because the controller was started after the cluster had already timed out. However, we are still missing the assisted-installer logs, which should be sent before reboot.

Comment 2 Fred Rolland 2021-06-15 10:55:33 UTC
I tried the following scenario with the kube API:
- Start install of SNO
- After reboot, suspend the VM
- Wait for installation to fail
- Download the logs via DebugInfo URL

The logs were available, without the controller log, as expected.

@ercohen Anything I missed? Do you have the assisted service log?

Comment 3 Fred Rolland 2021-06-15 11:28:40 UTC
The issue reproduces if the service pod is restarted after the installation failure:

{"code":"500","href":"","id":500,"kind":"Error","reason":"No log files were found"}

The logs are not persisted across the service pod restart.

@mfilanov WDYT?

Comment 4 Eran Cohen 2021-06-15 12:07:39 UTC
I suspect the logs are saved on the pod filesystem and not on a PV.
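
One way to check (a sketch; the app=assisted-service label and the /data path are assumptions based on a default operator deployment):

# Show the service pod's volumes to see whether a PVC backs its data dir
oc get pod -n assisted-installer -l app=assisted-service \
  -o jsonpath='{.items[0].spec.volumes}'

# List the uploaded log files under the service's working directory
oc exec -n assisted-installer deploy/assisted-service -- ls /data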

Comment 5 Michael Filanov 2021-06-15 13:01:13 UTC
If we are talking about test-infra then this is probably the case.
If it happened with the operator then it's a bug.

Comment 6 Fred Rolland 2021-06-15 14:25:14 UTC
Tested with an environment installed via the operator, with a PV backing the filesystem; the logs are persisted across a restart (see the sketch below).

@ercohen WDYT?
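
For context, a minimal sketch of such an operator setup, assuming the AgentServiceConfig API from the assisted-service operator (the storage sizes here are illustrative); the filesystemStorage PVC is what keeps the uploaded logs across service pod restarts:

apiVersion: agent-install.openshift.io/v1beta1
kind: AgentServiceConfig
metadata:
  name: agent
spec:
  # Backs the service's database
  databaseStorage:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  # Backs the service's working filesystem, where uploaded host and
  # controller logs are stored; without a PV here the logs are lost
  # when the service pod restarts
  filesystemStorage:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 100Gi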

Comment 7 Alexander Chuzhoy 2021-06-17 13:28:07 UTC
Reproduced the issue.

Note: once the cluster was deployed successfully, the link actually started to work and I was able to download the log tarball.