Bug 1579409

Summary: Deployment fails in step 1 with "not a directory" when mounting "/etc/ssh/ssh_known_hosts" in scale deployments
Product: Red Hat OpenStack Reporter: Sai Sindhur Malleni <smalleni>
Component: openstack-tripleo-heat-templates Assignee: Alex Schultz <aschultz>
Status: CLOSED ERRATA QA Contact: Gurenko Alex <agurenko>
Severity: high Docs Contact:
Priority: high    
Version: 13.0 (Queens) CC: abond, ahrechan, aschultz, bdobreli, dbecker, dhill, egallen, jhakimra, jschluet, lyarwood, mburns, morazi, owalsh, racedoro, smalleni, sputhenp, therve
Target Milestone: rc Keywords: OtherQA, Triaged
Target Release: 13.0 (Queens)   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-8.0.2-25.el7ost Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1590368 (view as bug list) Environment:
Last Closed: 2018-06-27 13:57:08 UTC Type: Bug
Embargoed:
Bug Depends On:    
Bug Blocks: 1590368, 1590370, 1590375, 1590376    

Description Sai Sindhur Malleni 2018-05-17 15:01:56 UTC
Description of problem:
When deploying OSP13 in the Red Hat Scale Lab with 3 controllers and 150 compute nodes, the deployment fails in step 1 with the following error message.

overcloud.AllNodesDeploySteps.ControllerDeployment_Step1.1:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 4de9d804-9db7-4904-b3f5-823b7117369f
  status: CREATE_FAILED
  status_reason: |
    Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status
 code: 2
  deploy_stdout: |
    ...
            "Error running ['docker', 'run', '--name', 'memcached', '--label', 'config_id=tripleo_step1', '--labe
l', 'container_name=memcached', '--label', 'managed_by=paunch', '--label', 'config_data={\"start_order\": 1, \"co
mmand\": [\"/bin/bash\", \"-c\", \"source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m $
{CACHESIZE} -c ${MAXCONN} $OPTIONS >> /var/log/memcached.log 2>&1\"], \"volumes\": [\"/etc/hosts:/etc/hosts:ro\",
 \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki
/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki
/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \
"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/var/lib/config-data/m
emcached/etc/sysconfig/memcached:/etc/sysconfig/memcached:ro\", \"/var/log/containers/memcached:/var/log/\"], \"i
mage\": \"192.168.0.1:8787/rhosp13/openstack-memcached:2018-05-07.2\", \"net\": \"host\", \"restart\": \"always\"
, \"privileged\": false}', '--detach=true', '--net=host', '--privileged=false', '--restart=always', '--volume=/et
c/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/p
ki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--vo
lume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/ce
rt.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_k
nown_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/config-data/memcached/etc/sysconfig/mem
cached:/etc/sysconfig/memcached:ro', '--volume=/var/log/containers/memcached:/var/log/', '192.168.0.1:8787/rhosp1
3/openstack-memcached:2018-05-07.2', '/bin/bash', '-c', 'source /etc/sysconfig/memcached; /usr/bin/memcached -p $
{PORT} -u ${USER} -m ${CACHESIZE} -c ${MAXCONN} $OPTIONS >> /var/log/memcached.log 2>&1']. [125]", 
            "stdout: 6645bfdc5b90ce52921f1f8f875ac83a5b878b3d30e3293e5dbbbb39e4a5ae67", 
            "stderr: /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:2
47: starting container process caused \"process_linux.go:364: container init caused \\\"rootfs_linux.go:54: mount
ing \\\\\\\"/etc/ssh/ssh_known_hosts\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2/2e613821885a32bebdb894820
d7c2e38a86c7b2024e029a12819dd5aed722751/merged\\\\\\\" at \\\\\\\"/var/lib/docker/overlay2/2e613821885a32bebdb894
820d7c2e38a86c7b2024e029a12819dd5aed722751/merged/etc/ssh/ssh_known_hosts\\\\\\\" caused \\\\\\\"not a directory\
\\\\\\"\\\"\""


Logging into the controllers, we see all containers have exited for a similar reason:
CONTAINER ID        IMAGE                                                           COMMAND                  CREATED             STATUS                   PORTS               NAMES
b66e67777b4b        192.168.0.1:8787/rhosp13/openstack-mariadb:2018-05-07.2         "/bin/bash -c '/us..."   7 hours ago         Exited (0) 7 hours ago                       mysql_image_tag
6645bfdc5b90        192.168.0.1:8787/rhosp13/openstack-memcached:2018-05-07.2       "/bin/bash -c 'sou..."   7 hours ago         Created                                      memcached
25cdfcf14239        192.168.0.1:8787/rhosp13/openstack-haproxy:2018-05-07.2         "/bin/bash -c '/us..."   7 hours ago         Exited (0) 7 hours ago                       haproxy_image_tag
10e7c2b0878a        192.168.0.1:8787/rhosp13/openstack-mariadb:2018-05-07.2         "bash -ec 'if [ -e..."   7 hours ago         Created                                      mysql_bootstrap
822abff373f4        192.168.0.1:8787/rhosp13/openstack-redis:2018-05-07.2           "/bin/bash -c '/us..."   7 hours ago         Exited (0) 7 hours ago                       redis_image_tag
81619c0ad428        192.168.0.1:8787/rhosp13/openstack-rabbitmq:2018-05-07.2        "/bin/bash -c '/us..."   7 hours ago         Exited (0) 7 hours ago                       rabbitmq_image_tag
eccdf31c874a        192.168.0.1:8787/rhosp13/openstack-cinder-volume:2018-05-07.2   "/bin/bash -c '/us..."   7 hours ago         Exited (0) 7 hours ago                       cinder_volume_image_tag
000560604e74        192.168.0.1:8787/rhosp13/openstack-rabbitmq:2018-05-07.2        "kolla_start"            7 hours ago         Exited (0) 7 hours ago                       rabbitmq_bootstrap
573e30c4c8c2        192.168.0.1:8787/rhosp13/openstack-memcached:2018-05-07.2       "/bin/bash -c 'sou..."   7 hours ago         Exited (0) 7 hours ago                       memcached_init_logs
8a5d446f88d2        192.168.0.1:8787/rhosp13/openstack-mariadb:2018-05-07.2         "chown -R mysql: /..."   7 hours ago         Exited (0) 7 hours ago                       mysql_data_ownership


Version-Release number of selected component (if applicable):
OSP13

How reproducible:
100% in this environment

Steps to Reproduce:
1. Deploy OSP13

Actual results:
Deployment fails in step 1 when creating the controllers

Expected results:
Deployment should succeed

Additional info:

Comment 1 Sai Sindhur Malleni 2018-05-17 15:12:49 UTC
Error in /var/log/messages
May 17 03:25:36 overcloud-controller-2 os-collect-config: tron\", \n        \"2018-05-17 07:25:12,037 INFO: 43388
 -- Finished processing puppet configs for neutron\"\n    ]\n}\n\nTASK [Start containers for step 1] ************
*********************************\nok: [localhost]\n\nTASK [Debug output for task which failed: Start containers 
for step 1] *********\nfatal: [localhost]: FAILED! => {\n    \"changed\": false, \n    \"failed_when_result\": tr
ue, \n    \"outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))\": [\n        \"stdout: Tryi
ng to pull repository 192.168.0.1:8787/rhosp13/openstack-cinder-volume ... \", \n        \"2018-05-07.2: Pulling 
from 192.168.0.1:8787/rhosp13/openstack-cinder-volume\", \n        \"d1fe25896eb5: Already exists\", \n        \"
001d79f68470: Already exists\", \n        \"ce78fb3ed5d6: Already exists\", \n        \"a89d0cba8f6c: Already exi
sts\", \n        \"838ec8b40498: Already exists\", \n        \"5f042e87ffad: Pulling fs layer\", \n        \"5f04
2e87ffad: Verifying Checksum\", \n        \"5f042e87ffad: Download complete\", \n        \"5f042e87ffad: Pull com
plete\", \n        \"Digest: sha256:7fe518739efb0f98290be9bcbccea486fa03e289e3fcac7ffa488726af712b82\", \n       
 \"Status: Downloaded newer image for 192.168.0.1:8787/rhosp13/openstack-cinder-volume:2018-05-07.2\", \n        
\"\", \n        \"stderr: \", \n        \"stdout: \", \n        \"stdout: 7aa209c5d9d7920a1276a24c1aef8703f0a8b67
c63a14471c22fd0c7a2871f41\", \n        \"Error running ['docker', 'run', '--name', 'mysql_bootstrap', '--label', 
'config_id=tripleo_step1', '--label', 'container_name=mysql_bootstrap', '--label', 'managed_by=paunch', '--label'
, 'config_data={\\\"start_order\\\": 1, \\\"image\\\": \\\"192.168.0.1:8787/rhosp13/openstack-mariadb:2018-05-07.
2\\\", \\\"environment\\\": [\\\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\\\", \\\"KOLLA_BOOTSTRAP=True\\\", \\\"DB_MAX_
TIMEOUT=60\\\", \\\"DB_CLUSTERCHECK_PASSWORD=GGfRWd4aKaD8QMzZGVRsBYzpz\\\", \\\"DB_ROOT_PASSWORD=yeYjBrfeJA\\\", 
\\\"TRIPLEO_CONFIG_HASH=0c61cca0c998ac3ce4571c22aac74b76\\\"], \\\"command\\\": [\\\"bash\\\", \\\"-ec\\\", \\\"i
f [ -e /var/lib/mysql/mysql ]; then exit 0; fi\\\\\\\\necho -e \\\\\\\\\\\"\\\\\\\\\\\\\\\\n[mysqld]\\\\\\\\\\\\\
\\\nwsrep_provider=none\\\\\\\\\\\" >> /etc/my.cnf\\\\\\\\nkolla_set_configs\\\\\\\\nsudo -u mysql -E kolla_exten
d_start\\\\\\\\nmysqld_safe --skip-networking --wsrep-on=OFF &\\\\\\\\ntimeout ${DB_MAX_TIMEOUT} /bin/bash -c \\\
\'until mysqladmin -uroot -p\\\\\\\\\\\"${DB_ROOT_PASSWORD}\\\\\\\\\\\" ping 2>/dev/null; do sleep 1; done\\\\'\\
\\\\\\nmysql -uroot -p\\\\\\\\\\\"${DB_ROOT_PASSWORD}\\\\\\\\\\\" -e \\\\\\\\\\\"CREATE USER \\\\'clustercheck\\\
\'@\\\\'localhost\\\\' IDENTIFIED BY \\\\'${DB_CLUSTERCHECK_PASSWORD}\\\\';\\\\\\\\\\\"\\\\\\\\nmysql -uroot -p\\
\\\\\\\\\"${DB_ROOT_PASSWORD}\\\\\\\\\\\" -e \\\\\\\\\\\"GRANT PROCESS ON *.* TO \\\\'clustercheck\\\\'@\\\\'loca
lhost\\\\' WITH GRANT OPTION;\\\\\\\\\\\"\\\\\\\\ntimeout ${DB_MAX_TIMEOUT} mysqladmin -uroot -p\\\\\\\\\\\"${DB_
ROOT_PASSWORD}\\\\\\\\\\\" shutdown\\\"], \\\"user\\\": \\\"root\\\", \\\"volumes\\\": [\\\"/etc/hosts:/etc/hosts
:ro\\\", \\\"/etc/localtime:/etc/localtime:ro\\\", \\\"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro
\\\", \\\"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\\\", \\\"/etc/pki/tls/certs/ca-bun
dle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\\\", \\\"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\\
\", \\\"/dev/log:/dev/log\\\", \\\"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\\\", \\\"/etc/puppet:/etc
/puppet:ro\\\", \\\"/var/lib/kolla/config_files/mysql.json:/var/lib/kolla/config_files/config.json\\\", \\\"/var/
lib/config-data/puppet-generated/mysql/:/var/lib/kolla/config_files/src:ro\\\", \\\"/var/lib/mysql:/var/lib/mysql
\\\"], \\\"net\\\": \\\"host\\\", \\\"detach\\\": false}', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=KOLLA_BOOTSTRAP=True', '--env=DB_MAX_TIMEOUT=60', '--env=DB_CLUSTERCHECK_PASSWORD=GGfRWd4aKaD8QMzZGVRsBYzpz', '--env$
DB_ROOT_PASSWORD=yeYjBrfeJA', '--env=TRIPLEO_CONFIG_HASH=0c61cca0c998ac3ce4571c22aac74b76', '--net=host', '--use$
=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-$
rust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca$
bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--$
olume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_know$
_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/kolla/config_file$
/mysql.json:/var/lib/kolla/config_files/config.json', '--volume=/var/lib/config-data/puppet-generated/mysql/:/va$
/lib/kolla/config_files/src:ro', '--volume=/var/lib/mysql:/var/lib/mysql', '192.168.0.1:8787/rhosp13/openstack-m$
riadb:2018-05-07.2', 'bash', '-ec', 'if [ -e /var/lib/mysql/mysql ]; then exit 0; fi\\\\necho -e \\\"\\\\\\\\n[m$
sqld]\\\\\\\\nwsrep_provider=none\\\" >> /etc/my.cnf\\\\nkolla_set_configs\\\\nsudo -u mysql -E kolla_extend_sta$
t\\\\nmysqld_safe --skip-networking --wsrep-on=OFF &\\\\ntimeout ${DB_MAX_TIMEOUT} /bin/bash -c \\\\'until mysql$
dmin -uroot -p\\\"${DB_ROOT_PASSWORD}\\\" ping 2>/dev/null; do sleep 1; done\\\\'\\\\nmysql -uroot -p\\\"${DB_RO$
T_PASSWORD}\\\" -e \\\"CREATE USER \\\\'clustercheck\\\\'@\\\\'localhost\\\\' IDENTIFIED BY \\\\'${DB_CLUSTERCHE$
K_PASSWORD}\\\\';\\\"\\\\nmysql -uroot -p\\\"${DB_ROOT_PASSWORD}\\\" -e \\\"GRANT PROCESS ON *.* TO \\\\'cluster$
heck\\\\'@\\\\'localhost\\\\' WITH GRANT OPTION;\\\"\\\\ntimeout ${DB_MAX_TIMEOUT} mysqladmin -uroot -p\\\"${DB_$
OOT_PASSWORD}\\\" shutdown']. [125]\", \n        \"stderr: container_linux.go:247: starting container process ca$
sed \\\"process_linux.go:364: container init caused \\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\"/etc/ss$
/ssh_known_hosts\\\\\\\\\\\\\\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/overlay2/be879fedd6d7638887900c7dd5b6$
29a4a2c5371f0770a1fb914bb593fa946b7/merged\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/overlay2/be879fedd$
d7638887900c7dd5b6329a4a2c5371f0770a1fb914bb593fa946b7/merged/etc/ssh/ssh_known_hosts\\\\\\\\\\\\\\\" caused \\\$
\\\\\\\\\\\"not a directory\\\\\\\\\\\\\\\"\\\\\\\"\\\"\", \n        \"/usr/bin/docker-current: Error response f$
om daemon: oci runtime error: container_linux.go:247: starting container process caused \\\"process_linux.go:364$
 container init caused \\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\"/etc/ssh/ssh_known_hosts\\\\\\\\\\\\\
\\" to rootfs \\\\\\\\\\\\\\\"/var/lib/docker/overlay2/be879fedd6d7638887900c7dd5b6329a4a2c5371f0770a1fb914bb593f
a946b7/merged\\\\\\\\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/overlay2/be879fedd6d7638887900c7dd5b6329a4a2c537
1f0770a1fb914bb593fa946b7/merged/etc/ssh/ssh_known_hosts\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"not a directory\\
\\\\\\\\\\\\\"\\\\\\\"\\\"\", \n        \": Are you trying to mount a directory onto a file (or vice-versa)? Chec
k if the specified host path exists and is the expected type.\", \n        \"time=\\\"2018-05-17T07:25:23Z\\\" le
vel=error msg=\\\"error getting events from daemon: net/http: request canceled\\\" \", \n        \"Error running 
['docker', 'run', '--name', 'memcached', '--label', 'config_id=tripleo_step1', '--label', 'container_name=memcach
ed', '--label', 'managed_by=paunch', '--label', 'config_data={\\\"start_order\\\": 1, \\\"command\\\": [\\\"/bin$bash\\\", \\\"-c\\\", \\\"source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m ${CACHESIZ
E} -c ${MAXCONN} $OPTIONS >> /var/log/memcached.log 2>&1\\\"], \\\"volumes\\\": [\\\"/etc/hosts:/etc/hosts:ro\\\"
, \\\"/etc/localtime:/etc/localtime:ro\\\", \\\"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\\\", \
\\"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\\\", \\\"/etc/pki/tls/certs/ca-bundle.tru
st.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\\\", \\\"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\\\", \\\
"/dev/log:/dev/log\\\", \\\"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\\\", \\\"/etc/puppet:/etc/puppet
:ro\\\", \\\"/var/lib/config-data/memcached/etc/sysconfig/memcached:/etc/sysconfig/memcached:ro\\\", \\\"/var/log
/containers/memcached:/var/log/\\\"], \\\"image\\\": \\\"192.168.0.1:8787/rhosp13/openstack-memcached:2018-05-07.
2\\\", \\\"net\\\": \\\"host\\\", \\\"restart\\\": \\\"always\\\", \\\"privileged\\\": false}', '--detach=true', 
'--net=host', '--privileged=false', '--restart=always', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/local
time:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pk
i/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:
/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=
/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/p
uppet:ro', '--volume=/var/lib/config-data/memcached/etc/sysconfig/memcached:/etc/sysconfig/memcached:ro', '--volu
me=/var/log/containers/memcached:/var/log/', '192.168.0.1:8787/rhosp13/openstack-memcached:2018-05-07.2', '/bin/b
ash', '-c', 'source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m ${CACHESIZE} -c ${MAXCO
NN} $OPTIONS >> /var/log/memcached.log 2>&1']. [125]\", \n        \"stdout: a8dfcac4264e770c0fbbecc4b7929e655f027
c298403dd4e10920a806afc6334\", \n        \"stderr: /usr/bin/docker-current: Error response from daemon: oci runti
me error: container_linux.go:247: starting container process caused \\\"process_linux.go:364: container init caus
ed \\\\\\\"rootfs_linux.go:54: mounting \\\\\\\\\\\\\\\"/etc/ssh/ssh_known_hosts\\\\\\\\\\\\\\\" to rootfs \\\\\\
\\\\\\\\\"/var/lib/docker/overlay2/aaa75d2837288b054eb06d596c87ea8ac4023349eea0ff82b3f4e2aa7778a939/merged\\\\\\\
\\\\\\\\" at \\\\\\\\\\\\\\\"/var/lib/docker/overlay2/aaa75d2837288b054eb06d596c87ea8ac4023349eea0ff82b3f4e2aa777
8a939/merged/etc/ssh/ssh_known_hosts\\\\\\\\\\\\\\\" caused \\\\\\\\\\\\\\\"not a directory\\\\\\\\\\\\\\\"\\\\\\
\"\\\"\"\n    ]\n}\n\tto retry, use: --limit @/var/lib/heat-config/heat-config-ansible/3cbb2f2e-8a80-45d2-8435-b1
394ea2a1a5_playbook.retry\n\nPLAY RECAP *********************************************************************\nlo
calhost

Comment 2 Sai Sindhur Malleni 2018-05-17 15:17:29 UTC
docker -v
Docker version 1.13.1, build 94f4240/1.13.1

Comment 3 Sai Sindhur Malleni 2018-05-17 15:54:01 UTC
/etc/ssh/ssh_known_hosts seems to be a directory on the overcloud node instead of a file.

Comment 6 Alex Schultz 2018-05-18 00:03:35 UTC
So /etc/ssh/ssh_known_hosts will be turned into a directory if it does not exist prior to being mounted by docker as a shared volume. Docker will create it as a directory.

Comment 7 Sai Sindhur Malleni 2018-05-18 00:27:53 UTC
So is this expected behavior?

Comment 8 Bogdan Dobrelya 2018-05-18 08:30:52 UTC
This seems like a bug in t-h-t's docker/services/containers-common.yaml, the host prep tasks should ensure /etc/ssh/ssh_known_hosts is created as a file before starting any containers. Otherwise, some of the early containers prolly race with ssh in creating ssh_known_hosts file
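
The guard suggested above can be sketched roughly like this (a hypothetical helper, not the actual tripleo-heat-templates change): make sure the bind-mount source already exists as a regular file before any container references it, since docker creates a missing source path as a directory.

```python
from pathlib import Path

def ensure_known_hosts_file(path="/etc/ssh/ssh_known_hosts"):
    """Make sure a bind-mount source exists as a regular file.

    If docker is asked to bind-mount a host path that does not exist, it
    creates the source as a *directory*; any later attempt to treat the
    path as a file then fails with "not a directory".  Creating an empty
    file up front removes that race.
    """
    p = Path(path)
    if p.is_dir():
        raise RuntimeError(
            f"{path} is a directory; remove it before deploying")
    p.touch(exist_ok=True)  # no-op if the file is already there
    return p
```

Running a helper like this from host prep tasks, before paunch starts any step-1 containers, would make the mount source deterministic regardless of whether the SshKnownHostsDeployment script has run yet.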

Comment 9 Ollie Walsh 2018-05-18 09:02:04 UTC
(In reply to Alex Schultz from comment #6)
> So /etc/ssh/ssh_know_hosts will be turned into a directory if it does not
> exist prior to being mounted by docker as a shared volume. Docker will
> create it as a dir.

Ah, then boo docker! Nobody likes magicians!

(In reply to Bogdan Dobrelya from comment #8)
> This seems like a bug in t-h-t's docker/services/containers-common.yaml, the
> host prep tasks should ensure /etc/ssh/ssh_known_hosts is created as a file
> before starting any containers. Otherwise, some of the early containers
> prolly race with ssh in creating ssh_known_hosts file

I think we're just missing https://review.openstack.org/386716 for SshKnownHostsDeployment.

Trying to reproduce in CI by slowing down the SshKnownHostsDeployment, but it would still be helpful if you can try to reproduce (-> needinfo)

Comment 13 Ollie Walsh 2018-05-18 12:57:51 UTC
(In reply to Ollie Walsh from comment #9)
> (In reply to Alex Schultz from comment #6)
> > So /etc/ssh/ssh_know_hosts will be turned into a directory if it does not
> > exist prior to being mounted by docker as a shared volume. Docker will
> > create it as a dir.
> 
> Ah, then boo docker! Nobody likes magicians!
> 
> (In reply to Bogdan Dobrelya from comment #8)
> > This seems like a bug in t-h-t's docker/services/containers-common.yaml, the
> > host prep tasks should ensure /etc/ssh/ssh_known_hosts is created as a file
> > before starting any containers. Otherwise, some of the early containers
> > prolly race with ssh in creating ssh_known_hosts file
> 
> I think we're just missing https://review.openstack.org/386716 for
> SshKnownHostsDeployment.

To clarify, I meant that we need a similar patch but for the SshKnownHostsDeployment resource - https://review.openstack.org/569366

Comment 14 Sai Sindhur Malleni 2018-05-18 15:21:41 UTC
Copy-pasting the reproducer from email.

So to work around this, I deleted my stack and started deploying my cloud in increments of 50 compute nodes. Deployment went fine until I scaled up to 103 nodes (100 compute + 3 controllers), but when trying to scale to 150 compute nodes I saw a similar error[2] from multiple compute nodes (the ones newly deployed during the scale-up) and the deployment failed in step 3.

In this gist[3] we can clearly see that /etc/ssh/ssh_known_hosts is a file on one of the computes when the deployment was only 100 nodes, but a directory on a newly deployed compute during the scale-up to 150 compute nodes.

At this point I'm not sure what is causing /etc/ssh/ssh_known_hosts to turn into a directory, and why it happens only in a large-scale deployment (not sure if it's a matter of chance). Would appreciate your assistance, as we have limited time on this hardware to complete scale testing for OSP13.


[2]- https://gist.githubusercontent.com/smalleni/199a36c3bc5e7fb20f6fdcbeefe46876/raw/a5e218f4597a69aec8291511b4e924575292a542/gistfile1.txt
[3]- https://gist.github.com/smalleni/1cccd85e0cd1763fe01a7a1b40e9c8c0

Comment 17 Sai Sindhur Malleni 2018-05-18 15:54:27 UTC
Trying this patch, we ended up in a similar state. In fact, more hosts reported this problem in step 3 and the deployment failed.

            "Error running ['docker', 'run', '--name', 'iscsid', '--label', 'config_id=tripleo_step3', '--label',
 'container_name=iscsid', '--label', 'managed_by=paunch', '--label', 'config_data={\"environment\": [\"KOLLA_CONF
IG_STRATEGY=COPY_ALWAYS\"], \"start_order\": 2, \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc
/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/tls/certs/ca-bundle.c
rt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.t
rust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_ho
sts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/var/lib/kolla/config_files/iscsid.json:/var
/lib/kolla/config_files/config.json:ro\", \"/dev/:/dev/\", \"/run/:/run/\", \"/sys:/sys\", \"/lib/modules:/lib/mo
dules:ro\", \"/etc/iscsi:/var/lib/kolla/config_files/src-iscsid:ro\"], \"image\": \"192.168.0.1:8787/rhosp13/open
stack-iscsid:2018-05-07.2\", \"net\": \"host\", \"restart\": \"always\", \"privileged\": true}', '--detach=true',
 '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--net=host', '--privileged=true', '--restart=always', '--volume=/etc
/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pk
i/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--vol
ume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cer
t.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_kn
own_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/kolla/config_files/iscsid.json:/var/lib/
kolla/config_files/config.json:ro', '--volume=/dev/:/dev/', '--volume=/run/:/run/', '--volume=/sys:/sys', '--volu
me=/lib/modules:/lib/modules:ro', '--volume=/etc/iscsi:/var/lib/kolla/config_files/src-iscsid:ro', '192.168.0.1:8
787/rhosp13/openstack-iscsid:2018-05-07.2']. [125]", 
            "stdout: 55092d394343a91b114fefae665f6321247caad4319d0d4d106b6d73f799ba73", 
            "stderr: /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:2
47: starting container process caused \"process_linux.go:364: container init caused \\\"rootfs_linux.go:54: mount
ing \\\\\\\"/etc/ssh/ssh_known_hosts\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2/02b388d07e48bcb8c53a8c6b0
b66b2879dc31021419af71852b56b8038673e39/merged\\\\\\\" at \\\\\\\"/var/lib/docker/overlay2/02b388d07e48bcb8c53a8c
6b0b66b2879dc31021419af71852b56b8038673e39/merged/etc/ssh/ssh_known_hosts\\\\\\\" caused \\\\\\\"not a directory\
\\\\\\"\\\"\""
        ]

Comment 18 Ollie Walsh 2018-05-18 17:50:42 UTC
Ok, I think we have two issues:

May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,078] (heat-config) [INFO] deploy_server_id=ac09fbda-6df1-4bc8-98e4-11ced79beac2
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,078] (heat-config) [INFO] deploy_action=CREATE
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [INFO] deploy_stack_id=overcloud-R620ComputeSshKnownHostsDeployment-sjfak6rrt3jz/714c38fc-2acb-4a7a-840f-4fa6c8e5c7ec
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [INFO] deploy_resource_name=80
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [INFO] deploy_signal_transport=CFN_SIGNAL
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [INFO] deploy_signal_id=http://192.168.0.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A2010502813db475f900567fe5fef6ea9%3Astacks/overcloud-R620ComputeSshKnownHostsDeployment-sjfak6rrt3jz/714c38fc-2acb-4a7a-840f-4fa6c8e5c7ec/resources/80?Timestamp=2018-05-18T04%3A48%3A12Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=5d15ecfa3bb1445bb4fdb5ce4feae637&SignatureVersion=2&Signature=PssuQ1AGxsitYxhVoZJlXaumoKekH4ZzpGW3jreXAL4%3D
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [INFO] deploy_signal_verb=POST
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/db593a0f-9229-4f6a-8ee5-e867b9734ff2
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: Traceback (most recent call last):
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: File "/usr/libexec/heat-config/hooks/script", line 96, in <module>
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: sys.exit(main(sys.argv))
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: File "/usr/libexec/heat-config/hooks/script", line 66, in main
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: stderr=subprocess.PIPE, env=env)
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: errread, errwrite)
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: raise child_exception
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: OSError: [Errno 7] Argument list too long
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,088] (heat-config) [ERROR] Error running /usr/libexec/heat-config/hooks/script. [1]
May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,088] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/db593a0f-9229-4f6a-8ee5-e867b9734ff2.json < /var/lib/heat-config/deployed/db593a0f-9229-4f6a-8ee5-e867b9734ff2.notify.json

First is the obvious one: the config script isn't being run due to Errno 7 "Argument list too long". I assume that comes from execve returning E2BIG to indicate "The total number of bytes in the environment (envp) and argument list (argv) is too large."
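
The Errno 7 can be reproduced in isolation (a hypothetical, Linux-specific sketch, not taken from the heat-config code): exec'ing a trivial binary with one oversized environment string makes execve() fail the same way.

```python
import errno
import subprocess

def exec_with_big_env(nbytes):
    """Exec /bin/true with one huge environment variable.

    Returns None on success, or the OSError errno on failure.  On Linux,
    a single argv/envp string larger than MAX_ARG_STRLEN (128 KiB with
    4 KiB pages) makes execve() fail with E2BIG, which Python surfaces
    as OSError [Errno 7] Argument list too long -- the same error the
    heat-config hook hit above.
    """
    env = {"KNOWN_HOSTS": "x" * nbytes}
    try:
        subprocess.check_call(["/bin/true"], env=env)
        return None
    except OSError as e:
        return e.errno
```

A value of a few KiB exec's fine; a value the size of the known_hosts input logged in this comment (137497 bytes) does not.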

The second issue is that the resource didn't fail in Heat.

[root@overcloud-r620compute-80 heat-config]# cat /var/lib/heat-config/deployed/db593a0f-9229-4f6a-8ee5-e867b9734ff2.notify.json
{}


Back to the first issue...

Checking the limits:

IIUC the limit on Linux is 1/4 of the stack size:

[root@overcloud-r620compute-80 ~]# ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 256605
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 256605
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

[root@overcloud-r620compute-80 ~]# xargs --show-limits
Your environment variables take up 2106 bytes
POSIX upper limit on argument length (this system): 2092998
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2090892
Size of command buffer we are actually using: 131072
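
The 1/4-of-stack rule can be checked directly with a small sketch (values depend on the host; on Linux the total argv+envp budget ARG_MAX is derived from the stack rlimit):

```python
import os
import resource

# On Linux, ARG_MAX -- the total space allowed for argv plus envp -- is
# roughly max(soft_stack_limit / 4, 128 KiB).  With the default 8 MiB
# stack shown by `ulimit -s` above, that gives 2 MiB; xargs reports this
# value minus its own environment size and a safety margin.
soft_stack, _hard = resource.getrlimit(resource.RLIMIT_STACK)
arg_max = os.sysconf("SC_ARG_MAX")

print("stack soft limit (bytes):", soft_stack)
print("ARG_MAX (bytes):         ", arg_max)
```

Note this is the *total* budget; as the sizes logged below show, the failing input here is nowhere near 2 MiB, which hints the binding constraint is something else.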

Adding some logging to the heat-config hook to log the env sizes:

[root@overcloud-r620compute-80 ~]# cat /var/lib/heat-config/deployed/db593a0f-9229-4f6a-8ee5-e867b9734ff2.json | /usr/libexec/heat-config/hooks/script 
[2018-05-18 17:38:07,818] (heat-config) [INFO] Input known_hosts size=137497
[2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_server_id size=36
[2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_action size=6
[2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_stack_id size=94
[2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_resource_name size=2
[2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_signal_transport size=10
[2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_signal_id size=397
[2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_signal_verb size=4
[2018-05-18 17:38:07,820] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/db593a0f-9229-4f6a-8ee5-e867b9734ff2
Traceback (most recent call last):
  File "/usr/libexec/heat-config/hooks/script", line 97, in <module>
    sys.exit(main(sys.argv))
  File "/usr/libexec/heat-config/hooks/script", line 67, in main
    stderr=subprocess.PIPE, env=env)
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 7] Argument list too long


So it's not clear where the limit is coming from. Python maybe?
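
One plausible explanation (my reading, not confirmed in this report): the failure is not the total argv+envp size but the kernel's per-string cap. Linux limits each individual argument or environment string to MAX_ARG_STRLEN = 32 pages, independently of ARG_MAX, and the known_hosts input logged above is just over that cap. A quick check of the arithmetic:

```python
# Linux caps each single argv/envp string at MAX_ARG_STRLEN = 32 pages,
# separately from the total ARG_MAX budget.  The known_hosts input that
# heat-config passes to the hook as an environment variable (size taken
# from the log above) exceeds that per-string cap, so execve() fails
# with E2BIG even though the total environment is far below ARG_MAX.
PAGE_SIZE = 4096                  # typical x86_64 page size
MAX_ARG_STRLEN = 32 * PAGE_SIZE   # 131072 bytes

known_hosts_size = 137497         # "Input known_hosts size" in the log

assert known_hosts_size > MAX_ARG_STRLEN
print(f"known_hosts exceeds the per-string cap by "
      f"{known_hosts_size - MAX_ARG_STRLEN} bytes")
```

That would also explain why only large scale-outs hit this: known_hosts grows with node count, and it crossed 128 KiB somewhere between 103 and 153 nodes.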

Comment 29 Artem Hrechanychenko 2018-05-24 17:51:41 UTC
VERIFIED
2018-05-23.1


(undercloud) [stack@undercloud-0 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| id                                   | stack_name | stack_status    | creation_time        | updated_time | project                          |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| e26232f7-43c5-4d95-a1fe-02a5d478ced2 | overcloud  | CREATE_COMPLETE | 2018-05-24T16:13:59Z | None         | f5c14a90eac147d0b615f3ab686105fd |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| f27bdc2d-45bb-4902-a7e8-8ba4eec06d68 | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.20 |
| 66a48fd2-a37c-4326-88a1-7f6464d251b2 | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.19 |
| 84b26d4a-4fe3-4e36-8c78-951f844a4703 | compute-2    | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
| 71299a4d-e6eb-4653-8f91-4b1f0e0d7a67 | compute-3    | ACTIVE | -          | Running     | ctlplane=192.168.24.22 |
| b96105c1-3e06-427d-9f27-a32344c689f3 | compute-4    | ACTIVE | -          | Running     | ctlplane=192.168.24.21 |
| 59dd61a1-79e7-4fd4-865a-9b9aaf318fe5 | compute-5    | ACTIVE | -          | Running     | ctlplane=192.168.24.24 |
| 5eafcfc9-acda-4f22-b6db-2ae464b54380 | compute-6    | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| 78e0f34a-dddf-489d-8f5d-99d5a353a4c7 | compute-7    | ACTIVE | -          | Running     | ctlplane=192.168.24.14 |
| d20fe950-9fb6-4dae-8f14-f856232cedd0 | compute-8    | ACTIVE | -          | Running     | ctlplane=192.168.24.7  |
| aca3c22b-dca8-496d-beb1-e2ca67dc520b | compute-9    | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| 0287ffee-3fa2-4576-b4a2-95bdc21bd9b9 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.13 |
| ae000cda-26d7-4330-a115-905ab0962a41 | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| f0174a5e-df17-4338-ad99-e31d81929c54 | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+


[heat-admin@compute-5 ~]$ ls -la /etc/ssh/ssh_known_hosts 
-rw-r--r--. 1 root root 8675 May 24 16:31 /etc/ssh/ssh_known_hosts

Comment 31 errata-xmlrpc 2018-06-27 13:57:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086