Bug 1380902

Summary: Customized deployment with OS::TripleO::NodeUserData fails
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: python-tripleoclient
Assignee: Jiri Stransky <jstransk>
Status: CLOSED ERRATA
QA Contact: Marius Cornea <mcornea>
Severity: urgent
Priority: medium
Version: 10.0 (Newton)
CC: dbecker, dtantsur, hbrock, jcoufal, jjoyce, jslagle, jstransk, mburns, mcornea, morazi, rhel-osp-director-maint, sasha, shardy
Target Milestone: rc
Keywords: Regression, Triaged
Target Release: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: python-tripleoclient-5.3.0-1.el7ost
Doc Type: If docs needed, set a value
Last Closed: 2016-12-14 16:06:55 UTC
Type: Bug

Description Marius Cornea 2016-10-01 09:49:54 UTC
Description of problem:
Customized deployment with OS::TripleO::NodeUserData fails

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-5.0.0-0.20160929150845.4cdc4fc.el7ost.noarch

How reproducible:
100%

Steps to Reproduce:
1. Deploy overcloud:

source ~/stackrc
export THT=/usr/share/openstack-tripleo-heat-templates
openstack overcloud deploy --templates \
-e $THT/environments/network-isolation.yaml \
-e $THT/environments/network-management.yaml \
-e ~/templates/network-environment.yaml \
-e $THT/environments/storage-environment.yaml \
-e ~/templates/disk-layout.yaml \
-e ~/templates/wipe-disk-env.yaml \
-e ~/templates/enable-tls.yaml \
-e ~/templates/inject-trust-anchor.yaml \
-e ~/templates/tls-endpoints-public-ip.yaml \
-e ~/templates/ssl-ports.yaml \
--control-scale 3 \
--control-flavor controller \
--compute-scale 1 \
--compute-flavor compute \
--ceph-storage-scale 1 \
--ceph-storage-flavor ceph \
--ntp-server ntp.server.com 


Actual results:

Deployment fails with:
 u'message': u"No connection adapters were found for 'file:///home/stack/templates/wipe-disk.sh'",
 u'status': u'FAILED'}


Expected results:
Deployment succeeds.

Additional info:
cat ~/templates/wipe-disk-env.yaml 
resource_registry:
  OS::TripleO::NodeUserData: /home/stack/templates/wipe-disks.yaml

cat /home/stack/templates/wipe-disks.yaml
heat_template_version: 2014-10-16

description: >
  Wipe and convert all disks to GPT (except the disk containing the root file system)

resources:
  userdata:
    type: OS::Heat::MultipartMime
    properties:
      parts:
      - config: {get_resource: wipe_disk}

  wipe_disk:
    type: OS::Heat::SoftwareConfig
    properties:
      config: {get_file: wipe-disk.sh}

outputs:
  OS::stack_id:
    value: {get_resource: userdata}

Comment 1 James Slagle 2016-10-03 12:50:17 UTC
steve, can you triage this one? i guess it might be tripleoclient/tripleo-common related in which case we could send it over to dougal.

Comment 3 James Slagle 2016-10-06 13:28:17 UTC
jistr, can you take a look at this one? steve already has a few others on his plate

Comment 4 Jiri Stransky 2016-10-06 13:29:03 UTC
Sure thing.

Comment 5 Jiri Stransky 2016-10-06 13:55:41 UTC
Marius, can you please post wipe-disk.sh too (or a link)?

Comment 6 Jiri Stransky 2016-10-06 14:07:49 UTC
Actually i can try with a no-op .sh as well hopefully, i just wanted to reproduce the issue as closely as possible.

Comment 7 Marius Cornea 2016-10-06 17:16:27 UTC
wipe-disk.sh:

#!/bin/bash
if [[ `hostname` = *"stor"* ]]
then
  echo "Number of disks detected: $(lsblk -no NAME,TYPE,MOUNTPOINT | grep "disk" | awk '{print $1}' | wc -l)"
  for DEVICE in `lsblk -no NAME,TYPE,MOUNTPOINT | grep "disk" | awk '{print $1}'`
  do
    ROOTFOUND=0
    echo "Checking /dev/$DEVICE..."
    echo "Number of partitions on /dev/$DEVICE: $(expr $(lsblk -n /dev/$DEVICE | awk '{print $7}' | wc -l) - 1)"
    for MOUNTS in `lsblk -n /dev/$DEVICE | awk '{print $7}'`
    do
      if [ "$MOUNTS" = "/" ]
      then
        ROOTFOUND=1
      fi
    done
    if [ $ROOTFOUND = 0 ]
    then
      echo "Root not found in /dev/${DEVICE}"
      echo "Wiping disk /dev/${DEVICE}"
      sgdisk -Z /dev/${DEVICE}
      sgdisk -g /dev/${DEVICE}
    else
      echo "Root found in /dev/${DEVICE}"
    fi
  done
fi

Comment 8 Jiri Stransky 2016-10-07 11:06:03 UTC
Indeed this should probably be fixed in tripleoclient as suggested in #1. Documenting some rationale below w/r/t the problem and the fix.

The problem
-----------

I was able to reproduce and debug. This is due to a conceptual problem: we now deploy from Swift rather than locally, so we don't have all the files available. We already have special processing in place because of this, but it doesn't go all the way. (Full feature parity with the previous state, incl. handling of absolute path links in get_file, is likely not achievable, especially while keeping the same CLI interface.)

All externally referenced files already get uploaded into Swift as `user-files/<hash(original path)>-<file name>`. This naming scheme changes file names and relative paths. Our current solution amends the resource registry to work with the new names/paths, but we don't scan and edit the heat templates themselves for { get_file: some_file_name } references.
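A minimal sketch of why the hashed naming breaks a relative get_file link. The helper name `hashed_swift_name` and the choice of SHA-256 are assumptions made for illustration only; this comment does not say which hash tripleoclient uses.

```python
import hashlib
import posixpath


def hashed_swift_name(original_path):
    """Hypothetical stand-in for `user-files/<hash(original path)>-<file name>`."""
    # Assumption: SHA-256 of the original path; the real hash is not named here.
    digest = hashlib.sha256(original_path.encode("utf-8")).hexdigest()
    return f"user-files/{digest}-{posixpath.basename(original_path)}"


stored_template = hashed_swift_name("/home/stack/templates/wipe-disks.yaml")
stored_script = hashed_swift_name("/home/stack/templates/wipe-disk.sh")

# Where a relative {get_file: wipe-disk.sh} would point if it were resolved
# next to the stored template.
resolved = posixpath.join(posixpath.dirname(stored_template), "wipe-disk.sh")

print(resolved)                   # user-files/wipe-disk.sh
print(resolved == stored_script)  # False: the script was stored under a hashed name
```

The relative link no longer lands on the uploaded script, because the hash prefix changes the file name.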


Workaround
----------

An immediate workaround is to move the ~/templates directory into a subdirectory of the location passed as --templates. Since we generally recommend not modifying that directory, this is not an ideal solution.


Proposed solution
-----------------

I'd like to avoid scanning and editing the templates w/r/t get_file references. I think we could make relative get_file links work by changing the naming scheme in Swift to `user-files/<full file path>`, which would preserve both names and relative paths between files.
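A rough sketch of the idea, assuming a hypothetical helper `full_path_swift_name` that maps a local path to `user-files/<full file path>`. It only illustrates that relative links between two files survive such a mapping; it is not the tripleoclient implementation.

```python
import posixpath


def full_path_swift_name(original_path):
    """Hypothetical mapping of a local path to `user-files/<full file path>`."""
    return "user-files/" + original_path.lstrip("/")


stored_template = full_path_swift_name("/home/stack/templates/wipe-disks.yaml")
stored_script = full_path_swift_name("/home/stack/templates/wipe-disk.sh")

# A relative {get_file: wipe-disk.sh}, resolved next to the stored template,
# now lands on the stored script.
resolved = posixpath.join(posixpath.dirname(stored_template), "wipe-disk.sh")

print(stored_template)            # user-files/home/stack/templates/wipe-disks.yaml
print(resolved == stored_script)  # True
```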

Unfortunately, i still don't have success even with this approach; i'm getting the same error. When i download wipe-disks.yaml from Swift, it looks like this:

{"outputs": {"OS::stack_id": {"value": {"get_resource": "userdata"}}}, "heat_template_version": "2014-10-16", "description": "Wipe and convert all disks to GPT (except the disk containing the root file system)\n", "resources": {"userdata": {"type": "OS::Heat::MultipartMime", "properties": {"parts": [{"config": {"get_resource": "wipe_disk"}}]}}, "wipe_disk": {"type": "OS::Heat::SoftwareConfig", "properties": {"config": {"get_file": "file:///home/stack/userdata-bz/wipe-disk.sh"}}}}}

Obviously something is processing the template and replacing the relative link with an absolute one. I suspect this is done by heatclient itself when processing the passed-in environment files.
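For illustration only: resolving a relative get_file value against the file:// URL of the template that contains it yields exactly the kind of absolute link seen above. The sketch uses Python's standard `urljoin`; that is an assumption about the kind of resolution involved, not a quote of heatclient code, and the template path is inferred from the link in the downloaded file.

```python
from urllib.parse import urljoin

# Assumed location of the template in the reproduction environment.
template_url = "file:///home/stack/userdata-bz/wipe-disks.yaml"

# The relative link as written in the template: {get_file: wipe-disk.sh}
relative_link = "wipe-disk.sh"

absolute_link = urljoin(template_url, relative_link)
print(absolute_link)  # file:///home/stack/userdata-bz/wipe-disk.sh
```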

I'm debugging further and looking at how we can solve this.

Comment 9 Jiri Stransky 2016-10-07 16:28:32 UTC
Unfortunately, heatclient does indeed replace the relative links with absolute ones when processing the passed-in environment files and other files referenced from them. Looks like we can't avoid parsing and editing the external templates in the end. Working on a patch to parse through the templates and fix the links.
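A hedged sketch of what "parse through the templates and fix the links" could look like; it is not the merged patch. The function `rewrite_get_file` and the Swift object name used as the replacement value are hypothetical. The input is a trimmed copy of the template downloaded from Swift in comment 8.

```python
import json


def rewrite_get_file(node, url_to_object):
    """Return a copy of a parsed template with get_file values remapped.

    url_to_object is a hypothetical mapping from the absolute file:// URL
    to whatever name the file was uploaded under.
    """
    if isinstance(node, dict):
        rewritten = {}
        for key, value in node.items():
            if key == "get_file" and isinstance(value, str):
                # Leave links we have no mapping for untouched.
                rewritten[key] = url_to_object.get(value, value)
            else:
                rewritten[key] = rewrite_get_file(value, url_to_object)
        return rewritten
    if isinstance(node, list):
        return [rewrite_get_file(item, url_to_object) for item in node]
    return node


template = json.loads(
    '{"resources": {"wipe_disk": {"type": "OS::Heat::SoftwareConfig", '
    '"properties": {"config": {"get_file": '
    '"file:///home/stack/userdata-bz/wipe-disk.sh"}}}}}'
)

# Hypothetical target name; the real replacement value depends on the patch.
mapping = {
    "file:///home/stack/userdata-bz/wipe-disk.sh":
        "user-files/home/stack/userdata-bz/wipe-disk.sh",
}

fixed = rewrite_get_file(template, mapping)
print(fixed["resources"]["wipe_disk"]["properties"]["config"]["get_file"])
```

The walk returns a new structure rather than editing in place, so the original parsed template stays available for comparison.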

Comment 10 Jaromir Coufal 2016-10-10 02:00:31 UTC
Wrong DFG by mistake, returning back to DF.

Comment 11 Jiri Stransky 2016-10-13 08:36:14 UTC
Merged to master and stable/newton.

Comment 13 Alexander Chuzhoy 2016-10-14 20:17:19 UTC
*** Bug 1385153 has been marked as a duplicate of this bug. ***

Comment 18 errata-xmlrpc 2016-12-14 16:06:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-2948.html