Bug 1856928

Summary: Non yaml files in INSTALL_DIR/manifests or INSTALL_DIR/openshift directories prevent bootstrap from completing
Product: OpenShift Container Platform Reporter: Peter Larsen <plarsen>
Component: Installer Assignee: Abhinav Dahiya <adahiya>
Installer sub component: openshift-installer QA Contact: jima
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: high CC: adahiya, aos-bugs, deads, jokerman, mfojtik, mpatel, pkramp, sbatsche, wking
Version: 4.4   
Target Milestone: ---   
Target Release: 4.6.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-10-27 16:14:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Comment 2 Abhinav Dahiya 2020-07-14 17:48:37 UTC
Moving to the API server team to look at why the bootstrap kube-apiserver is not running.

Comment 3 David Eads 2020-07-14 18:11:30 UTC
1. install-gather's bootstrap/containers does not list kube-apiserver-**, which suggests the static pod/container did not start.
2. the bootstrap/pods/fe980b372dd6.log indicates that we rendered that manifest and you can see it in rendered-assets/bootstrap-manifests/kube-apiserver-pod.yaml
3. kubelet log in bootstrap/journals/kubelet.log has 

pod_workers.go:191] Error syncing pod d19abf4f1a86772b81f1494dac1ad358 ("bootstrap-kube-apiserver-openshiftdev-w4jxt-bootstrap_kube-system(d19abf4f1a86772b81f1494dac1ad358)"), skipping: failed to "CreatePodSandbox" for "bootstrap-kube-apiserver-openshiftdev-w4jxt-bootstrap_kube-system(d19abf4f1a86772b81f1494dac1ad358)" with CreatePodSandboxError: "CreatePodSandbox for pod \"bootstrap-kube-apiserver-openshiftdev-w4jxt-bootstrap_kube-system(d19abf4f1a86772b81f1494dac1ad358)\" failed: rpc error: code = Unknown desc = `/usr/bin/runc --root /run/runc start 1a10ffe4c992a373a8b26f0fb3cc622cb89e8b4960a19571025daff9e3d47ba4` failed: exit status 1"

   which I don't know how to interpret since normally even failed containers have logs.  Perhaps a failure to create the container in question?


Questions to node team: 

1. Is there evidence that the kubelet was able to start the container's process, or is this an indication that the container could not be started?
2. If the container did start, can you help retrieve the container logs so we can debug why the kube-apiserver failed?

Comment 13 Abhinav Dahiya 2020-07-17 18:56:12 UTC
Core failure is:
```
Jul 14 14:20:08 openshiftdev-w4jxt-bootstrap bootkube.sh[2250]: Assert creation failed: failed to load some manifests:
Jul 14 14:20:08 openshiftdev-w4jxt-bootstrap bootkube.sh[2250]: ".openshift_install.log": unable to convert asset ".openshift_install.log" from YAML to JSON: yaml: line 44: mapping values are not allowed in this context
```

Peter, please make sure you are not putting the installer's log file in `INSTALL_DIR/manifests`; that is what is causing cluster-bootstrap to fail.
`INSTALL_DIR/manifests` is only meant for k8s manifests.

Comment 14 Abhinav Dahiya 2020-07-20 17:26:03 UTC
I think it makes sense for the installer to validate the manifests it loads from install_dir/openshift or install_dir/manifests:

- make sure these are YAML or JSON files
- also maybe enforce that these are k8s objects
  (make sure to handle the case of multiple manifests in a single file if we do end up adding this specific validation)
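
A minimal sketch of the first check proposed above, for illustration only. It is not the installer's actual implementation (the installer is written in Go); the function name `split_manifest_candidates` and the accepted extension set are assumptions. It only filters by file extension and does not verify that a file contains valid k8s objects or handle multi-document files.

```python
from pathlib import Path

# Assumption: only files with these extensions are treated as manifests.
MANIFEST_EXTENSIONS = {".yaml", ".yml", ".json"}


def split_manifest_candidates(directory):
    """Return (accepted, rejected) file names found directly in `directory`.

    Hypothetical helper: a file is accepted only if its extension looks like
    YAML or JSON. Subdirectories are ignored.
    """
    accepted, rejected = [], []
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue
        if entry.suffix.lower() in MANIFEST_EXTENSIONS:
            accepted.append(entry.name)
        else:
            # For example, a stray installer log copied into the directory.
            rejected.append(entry.name)
    return accepted, rejected
```

Run against install_dir/manifests and install_dir/openshift, a check like this would put a stray `.openshift_install.log` in the rejected list before it reaches cluster-bootstrap.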

Comment 15 Peter Larsen 2020-07-20 20:08:48 UTC
(In reply to Abhinav Dahiya from comment #13)
> ```
> Core failure is
> Jul 14 14:20:08 openshiftdev-w4jxt-bootstrap bootkube.sh[2250]: Assert
> creation failed: failed to load some manifests:
> Jul 14 14:20:08 openshiftdev-w4jxt-bootstrap bootkube.sh[2250]:
> ".openshift_install.log": unable to convert asset ".openshift_install.log"
> from YAML to JSON: yaml: line 44: mapping values are not allowed in this
> context
> ```
> 
> Peter please make sure you are putting the log file for the installer in
> `INSTALL_DIR/manifests` that is causing the cluster-bootstrap to fail.
> `INSTALL_DIR/manifests` is only mean for k8s manifests.

Abhinav,
I think this explains it. The install has been wiped, and a new one was successful, but a file put in the wrong directory is without a doubt the reason for the failure. This helps a lot; thanks for identifying the root cause. When we run customer-guided installs, we'll be more careful to check for this issue.

Comment 20 Russell Teague 2020-09-15 15:58:31 UTC
Updating bug assignment because it is in the current sprint.

Comment 21 jima 2020-09-17 06:57:51 UTC
Verified on 4.6.0-0.nightly-2020-09-16-175801 and passed.

1. Prepare install-config.yaml
2. Create manifests file
# ./openshift-install create manifests --dir upi
INFO Consuming Install Config from target directory 
WARNING Making control-plane schedulable by setting MastersSchedulable to true for Scheduler cluster settings 
INFO Manifests created in: upi/manifests and upi/openshift 
3. Copy .openshift_install.log to upi/manifests
4. Create ignition file
# ./openshift-install create ignition-configs --dir upi
INFO Consuming Common Manifests from target directory 
INFO Consuming Openshift Manifests from target directory 
INFO Consuming OpenShift Install (Manifests) from target directory 
INFO Consuming Worker Machines from target directory 
INFO Consuming Master Machines from target directory 
INFO Ignition-Configs created in: upi and upi/auth 

Checked that .openshift_install.log is not loaded into bootstrap.ign.

Ran the same procedure with 4.6.0-fc.3-x86_64, which does not include the fix; there, .openshift_install.log was loaded into bootstrap.ign.
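
The check on bootstrap.ign can be scripted. The following is a rough sketch, not the procedure QA actually used: it assumes the Ignition config is JSON with file entries under `storage.files`, each carrying a `path`, and the helper name `ignition_paths_named` is hypothetical.

```python
import json
from pathlib import PurePosixPath


def ignition_paths_named(ignition_json, file_name):
    """Return every file path in the Ignition config whose base name is `file_name`.

    Assumption: file entries live under storage.files and each has a "path".
    """
    config = json.loads(ignition_json)
    files = config.get("storage", {}).get("files", [])
    matches = []
    for entry in files:
        path = entry.get("path", "")
        if PurePosixPath(path).name == file_name:
            matches.append(path)
    return matches
```

Passing the contents of upi/bootstrap.ign and the log file's name should give an empty list on a build with the fix and a non-empty list on a build without it.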

Comment 23 errata-xmlrpc 2020-10-27 16:14:36 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196