Bug 1371955

Summary: OSE deployment failed at 85%, ansible: "Unable to start service atomic-openshift-node"
Product: Red Hat Quickstart Cloud Installer
Reporter: Antonin Pagac <apagac>
Component: Installation - OpenShift
Assignee: Dylan Murray <dymurray>
Status: VERIFIED
QA Contact: Sudhir Mallamprabhakara <smallamp>
Severity: unspecified
Docs Contact: Derek <dcadzow>
Priority: unspecified
Version: 1.0
CC: bthurber, dymurray, tsanders
Target Milestone: ga
Keywords: Triaged
Target Release: 1.0
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Attachments:
Description      Flags
deployment.log   none
ansible.log      none

Description Antonin Pagac 2016-08-31 14:02:35 UTC
Description of problem:
Setup: RHV engine + 1 hypervisor, OSE master + 2 nodes; bare-metal deployment on Supermicro machines. In the past, this setup successfully completed OSE deployments with 1 node.

From ansible.log:

'fatal: [rhev-ose-ose-master1.example.com]: FAILED! => {"changed": false, "failed": true, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}'
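For triage, the JSON payload in a failure line like the one above can be pulled out so that the msg field is readable on its own. The following is a minimal sketch, not part of QCI or Ansible: the names RESULT_RE, parse_failure and failure_messages are hypothetical, and it assumes each failure is logged on a single line in one of the two shapes quoted in this report.

```python
import json
import re

# Covers both shapes quoted in this report (assumed to be on one line each):
#   fatal: [host]: FAILED! => {...}
#   failed: [host] => {...}
RESULT_RE = re.compile(
    r"(?:fatal|failed): \[(?P<host>[^\]]+)\]:?(?: FAILED!)? => (?P<payload>\{.*\})$"
)


def parse_failure(line):
    """Return (host, result_dict) for a failure line, or None if it does not match."""
    # Strip whitespace and the single quotes the line was wrapped in when pasted.
    match = RESULT_RE.search(line.strip().strip("'"))
    if match is None:
        return None
    return match.group("host"), json.loads(match.group("payload"))


def failure_messages(log_path):
    """Yield (host, msg) for every failure line in an ansible.log-style file."""
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            parsed = parse_failure(line)
            if parsed is not None:
                host, result = parsed
                yield host, result.get("msg", "").strip()
```

Calling failure_messages on a local copy of the attached ansible.log would list each failing host with its message, assuming the log uses the same line format as the excerpt.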

RHV is online and I can SSH to the OSE master node. However, I cannot see the reason for the failure. Attaching logs.
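The error message itself points at systemctl status and journalctl for details. A sketch of how that output could be collected on the OSE master follows. It is an illustration only: the helper names diagnostic_commands and collect are hypothetical, the 200-line limit is arbitrary, and it uses journalctl -u to narrow the journal to the failing unit instead of the journalctl -xe that the message suggests.

```python
import subprocess

# Unit name taken from the error message in ansible.log.
UNIT = "atomic-openshift-node.service"


def diagnostic_commands(unit=UNIT, lines=200):
    """Build the commands to run; nothing is executed here."""
    return [
        ["systemctl", "status", "--no-pager", unit],
        ["journalctl", "--no-pager", "-u", unit, "-n", str(lines)],
    ]


def collect(commands, run=subprocess.run):
    """Run each command and return {command line: combined stdout and stderr}."""
    report = {}
    for cmd in commands:
        proc = run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            universal_newlines=True,
        )
        report[" ".join(cmd)] = proc.stdout
    return report


# Example, to be run on the OSE master itself:
#   for name, output in collect(diagnostic_commands()).items():
#       print("==>", name)
#       print(output)
```

The run parameter exists only so the collection step can be exercised without systemd, for example by passing a stub in a test.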

Version-Release number of selected component (if applicable):
QCI-1.0-RHEL-7-20160830.t.0

How reproducible:
Seen once so far.

Steps to Reproduce:
1. Start deployment of OSE with two nodes on top of RHV with one hypervisor
2. OSE deployment fails at 85%

Actual results:
Deployment of OSE failed

Expected results:
Deployment of OSE successful

Additional info:

Comment 1 Antonin Pagac 2016-08-31 14:03:14 UTC
Created attachment 1196395: deployment.log

Comment 2 Antonin Pagac 2016-08-31 14:03:41 UTC
Created attachment 1196396: ansible.log

Comment 3 Antonin Pagac 2016-08-31 14:12:07 UTC
After I resume the failed task in Satellite, it fails again:

'failed: [rhev-ose-ose-master1.example.com] => {"changed": true, "cmd": "atomic-openshift-installer -u -c /tmp/atomic-openshift-installer.answers.cfg.yml install", "delta": "0:00:22.073518", "end": "2016-08-31 14:06:55.753432", "rc": 1, "start": "2016-08-31 14:06:33.679914", "warnings": []}'

The atomic-openshift-node.service unit is running on the master.

Comment 4 Dylan Murray 2016-08-31 15:38:47 UTC
https://github.com/fusor/fusor/pull/1200

This PR addresses the issue where resuming the failed task failed again. The original installation failure is an issue in atomic-openshift-installer itself, not a bug on our side.

Comment 6 Dylan Murray 2016-09-01 15:48:15 UTC
The fix is included in the compose as of 8/31.

Comment 7 Antonin Pagac 2016-09-11 15:45:30 UTC
Verified in QCI-1.0-RHEL-7-20160908.1.

I'm now able to resume the task in Satellite.