Bug 1970015

Summary: [master] Missing logs in case cluster installation failed
Product: OpenShift Container Platform
Component: assisted-installer
Sub component: Deployment Operator
Version: 4.8
Reporter: Eran Cohen <ercohen>
Assignee: Fred Rolland <frolland>
QA Contact: bjacot
Status: CLOSED NOTABUG
Severity: urgent
Priority: urgent
Keywords: Triaged
CC: akrzos, alazar, aos-bugs, frolland, itsoiref, mfilanov, sasha
Flags: ercohen: needinfo-
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard: KNI-EDGE-JUKE-4.8 AI-Team-Hive
Type: Bug
Clones: 1971288
Bug Blocks: 1971288
Last Closed: 2021-06-29 12:21:35 UTC

Description Eran Cohen 2021-06-09 15:56:26 UTC
Description of problem:
I tried to get the logs of a failed installation and got this response from the link:
{"code":"500","href":"","id":500,"kind":"Error","reason":"No log files were found"}

Version-Release number of selected component (if applicable):


How reproducible:

1/1
Steps to Reproduce:
1. Create the following resources:
- BMH
- pull-secret
- a manifest that will fail the cluster installation (see content below, though it doesn't really matter how you fail the installation)
- infra-env
- agent-cluster-install (with the manifestsConfigMapRef)
- cluster-deployment

2. Describe the AgentClusterInstall once the installation fails
3. Curl the logs link from the AgentClusterInstall status to get the logs (see the sketch after these steps)
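
A minimal sketch of steps 2-3, assuming an AgentClusterInstall named test-cluster in the assisted-installer namespace (both names are hypothetical) and that the logs link is the one exposed under status.debugInfo.logsURL (the "DebugInfo URL" mentioned in comment 2):

# Inspect the AgentClusterInstall once the installation has failed
oc describe agentclusterinstall test-cluster -n assisted-installer

# Extract the logs URL from the status and fetch the log tarball
LOGS_URL=$(oc get agentclusterinstall test-cluster -n assisted-installer \
  -o jsonpath='{.status.debugInfo.logsURL}')
curl -k -o cluster-logs.tar.gz "$LOGS_URL"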


manifest content:

kind: ConfigMap
apiVersion: v1
metadata:
  name: single-node-manifests
  namespace: assisted-installer
data:
  99_master_kernel_arg.yaml: |
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: master
      name: 02-master-workload-partitioning
    spec:
      config:
        ignition:
          version: 3.2.0
        storage:
          files:
          - contents:
              source: data:text/plain;charset=utf-8;base64,W2NyaW8ucnVudGltZS53b3JrbG9hZHMubWFuYWdlbWVudF0KYWN0aXZhdGlvbl9hbm5vdGF0aW9uID0gInRhcmdldC53b3JrbG9hZC5vcGVuc2hpZnQuaW8vbWFuYWdlbWVudCIKYW5ub3RhdGlvbl9wcmVmaXggPSAicmVzb3VyY2VzLndvcmtsb2FkLm9wZW5zaGlmdC5pbyIKcmVzb3VyY2VzID0geyAiY3B1c2hhcmVzIiA9IDAsICJjcHVzZXQiID0gIjAtMSw1Mi01MyIgfQ==
            mode: 420
            overwrite: true
            path: /etc/crio/crio.conf.d/01-workload-partitioning
            user:
              name: root
          - contents:
              source: data:text/plain;charset=utf-8;base64,ewogICJtYW5hZ2VtZW50IjogewogICAgImNwdXNldCI6ICIwLTEsNTItNTMiCiAgfQp9
            mode: 420
            overwrite: true
            path: /etc/kubernetes/openshift-workload-pinning
            user:
              name: root
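
For reference, decoding the two base64 payloads above yields the CRI-O workload-management config and the kubelet workload-pinning file:

/etc/crio/crio.conf.d/01-workload-partitioning:
[crio.runtime.workloads.management]
activation_annotation = "target.workload.openshift.io/management"
annotation_prefix = "resources.workload.openshift.io"
resources = { "cpushares" = 0, "cpuset" = "0-1,52-53" }

/etc/kubernetes/openshift-workload-pinning:
{
  "management": {
    "cpuset": "0-1,52-53"
  }
}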



Actual results:

{"code":"500","href":"","id":500,"kind":"Error","reason":"No log files were found"}

Expected results:
Get the cluster logs

Additional info:

I can log in to the host, and the assisted-installer-controller did upload the logs:
time="2021-06-08T16:09:49Z" level=info msg="Cluster installation failed."
time="2021-06-08T16:09:49Z" level=info msg="Waiting for all go routines to finish"
time="2021-06-08T16:09:49Z" level=info msg="Finished PostInstallConfigs"
time="2021-06-08T16:09:49Z" level=info msg="Finished UpdateBMHs"
time="2021-06-08T16:09:49Z" level=info msg="Finished UploadLogs"
time="2021-06-08T16:09:49Z" level=info msg="Finished all"

Comment 1 Igal Tsoiref 2021-06-10 07:34:17 UTC
In this specific case we shouldn't get controller logs, because the controller was started after the cluster had already timed out. However, we are still missing the assisted-installer logs, which should be sent before reboot.

Comment 2 Fred Rolland 2021-06-15 10:55:33 UTC
I tried the following scenario with the kube API:
- Start install of SNO
- After reboot, suspend the VM
- Wait for installation to fail
- Download the logs via DebugInfo URL

The logs were available, without the controller log, as expected.

@ercohen Anything I missed? Do you have the assisted service log?

Comment 3 Fred Rolland 2021-06-15 11:28:40 UTC
The issue reproduces if the service pod is restarted after the installation failure:

{"code":"500","href":"","id":500,"kind":"Error","reason":"No log files were found"}

The logs are not persisted across the service pod restart.

@mfilanov WDYT?

Comment 4 Eran Cohen 2021-06-15 12:07:39 UTC
I suspect the logs are saved on the pod filesystem and not on a PV.
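
One way to check (a sketch; the app=assisted-service label and the /data path are assumptions based on a default operator deployment):

# Show the service pod's volumes to see whether a PVC backs its data dir
oc get pod -n assisted-installer -l app=assisted-service \
  -o jsonpath='{.items[0].spec.volumes}'

# List the uploaded log files under the service's working directory
oc exec -n assisted-installer deploy/assisted-service -- ls /data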

Comment 5 Michael Filanov 2021-06-15 13:01:13 UTC
If we are talking about test-infra then this is probably the case.
If it happened with the operator then it's a bug.

Comment 6 Fred Rolland 2021-06-15 14:25:14 UTC
Tested with an environment installed via the operator, with a PV backing the filesystem; the logs are persisted across a restart (see the sketch below).

@ercohen WDYT?
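
For context, a minimal sketch of such an operator setup, assuming the AgentServiceConfig API from the assisted-service operator (the storage sizes here are illustrative); the filesystemStorage PVC is what keeps the uploaded logs across service pod restarts:

apiVersion: agent-install.openshift.io/v1beta1
kind: AgentServiceConfig
metadata:
  name: agent
spec:
  # Backs the service's database
  databaseStorage:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  # Backs the service's working filesystem, where uploaded host and
  # controller logs are stored; without a PV here the logs are lost
  # when the service pod restarts
  filesystemStorage:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 100Gi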

Comment 7 Alexander Chuzhoy 2021-06-17 13:28:07 UTC
Reproduced the issue.

Note: once the cluster was deployed successfully, the link actually started to work and I was able to download the log tarball.