Bug 1579409
Summary: Deployment fails in step 1 with "not a directory" when mounting "/etc/ssh/ssh_known_hosts" in scale deployments

| Field | Value | Field | Value |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Sai Sindhur Malleni <smalleni> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Alex Schultz <aschultz> |
| Status: | CLOSED ERRATA | QA Contact: | Gurenko Alex <agurenko> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 13.0 (Queens) | CC: | abond, ahrechan, aschultz, bdobreli, dbecker, dhill, egallen, jhakimra, jschluet, lyarwood, mburns, morazi, owalsh, racedoro, smalleni, sputhenp, therve |
| Target Milestone: | rc | Keywords: | OtherQA, Triaged |
| Target Release: | 13.0 (Queens) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | openstack-tripleo-heat-templates-8.0.2-25.el7ost | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1590368 (view as bug list) | Environment: | |
| Last Closed: | 2018-06-27 13:57:08 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1590368, 1590370, 1590375, 1590376 | | |
Description (Sai Sindhur Malleni, 2018-05-17 15:01:56 UTC)
Error in /var/log/messages:

```
May 17 03:25:36 overcloud-controller-2 os-collect-config:
    "2018-05-17 07:25:12,037 INFO: 43388 -- Finished processing puppet configs for neutron"

TASK [Start containers for step 1] *********************************************
ok: [localhost]

TASK [Debug output for task which failed: Start containers for step 1] *********
fatal: [localhost]: FAILED! => {
    "changed": false,
    "failed_when_result": true,
    "outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
        "stdout: Trying to pull repository 192.168.0.1:8787/rhosp13/openstack-cinder-volume ...",
        "2018-05-07.2: Pulling from 192.168.0.1:8787/rhosp13/openstack-cinder-volume",
        "d1fe25896eb5: Already exists",
        "001d79f68470: Already exists",
        "ce78fb3ed5d6: Already exists",
        "a89d0cba8f6c: Already exists",
        "838ec8b40498: Already exists",
        "5f042e87ffad: Pulling fs layer",
        "5f042e87ffad: Verifying Checksum",
        "5f042e87ffad: Download complete",
        "5f042e87ffad: Pull complete",
        "Digest: sha256:7fe518739efb0f98290be9bcbccea486fa03e289e3fcac7ffa488726af712b82",
        "Status: Downloaded newer image for 192.168.0.1:8787/rhosp13/openstack-cinder-volume:2018-05-07.2",
        "stderr: ",
        "stdout: ",
        "stdout: 7aa209c5d9d7920a1276a24c1aef8703f0a8b67c63a14471c22fd0c7a2871f41",

        Error running ['docker', 'run', '--name', 'mysql_bootstrap',
            '--label', 'config_id=tripleo_step1', '--label', 'container_name=mysql_bootstrap',
            '--label', 'managed_by=paunch', '--label', 'config_data={...}',
            '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=KOLLA_BOOTSTRAP=True',
            '--env=DB_MAX_TIMEOUT=60', '--env=DB_CLUSTERCHECK_PASSWORD=GGfRWd4aKaD8QMzZGVRsBYzpz',
            '--env=DB_ROOT_PASSWORD=yeYjBrfeJA',
            '--env=TRIPLEO_CONFIG_HASH=0c61cca0c998ac3ce4571c22aac74b76',
            '--net=host', '--user=root',
            '--volume=/etc/hosts:/etc/hosts:ro',
            '--volume=/etc/localtime:/etc/localtime:ro',
            '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro',
            '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro',
            '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro',
            '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro',
            '--volume=/dev/log:/dev/log',
            '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro',
            '--volume=/etc/puppet:/etc/puppet:ro',
            '--volume=/var/lib/kolla/config_files/mysql.json:/var/lib/kolla/config_files/config.json',
            '--volume=/var/lib/config-data/puppet-generated/mysql/:/var/lib/kolla/config_files/src:ro',
            '--volume=/var/lib/mysql:/var/lib/mysql',
            '192.168.0.1:8787/rhosp13/openstack-mariadb:2018-05-07.2',
            'bash', '-ec', '
                if [ -e /var/lib/mysql/mysql ]; then exit 0; fi
                echo -e "\n[mysqld]\nwsrep_provider=none" >> /etc/my.cnf
                kolla_set_configs
                sudo -u mysql -E kolla_extend_start
                mysqld_safe --skip-networking --wsrep-on=OFF &
                timeout ${DB_MAX_TIMEOUT} /bin/bash -c 'until mysqladmin -uroot -p"${DB_ROOT_PASSWORD}" ping 2>/dev/null; do sleep 1; done'
                mysql -uroot -p"${DB_ROOT_PASSWORD}" -e "CREATE USER 'clustercheck'@'localhost' IDENTIFIED BY '${DB_CLUSTERCHECK_PASSWORD}';"
                mysql -uroot -p"${DB_ROOT_PASSWORD}" -e "GRANT PROCESS ON *.* TO 'clustercheck'@'localhost' WITH GRANT OPTION;"
                timeout ${DB_MAX_TIMEOUT} mysqladmin -uroot -p"${DB_ROOT_PASSWORD}" shutdown
            ']. [125]

        stderr: container_linux.go:247: starting container process caused
            "process_linux.go:364: container init caused "rootfs_linux.go:54:
            mounting "/etc/ssh/ssh_known_hosts" to rootfs
            "/var/lib/docker/overlay2/be879fedd6d7638887900c7dd5b6329a4a2c5371f0770a1fb914bb593fa946b7/merged"
            at "/var/lib/docker/overlay2/be879fedd6d7638887900c7dd5b6329a4a2c5371f0770a1fb914bb593fa946b7/merged/etc/ssh/ssh_known_hosts"
            caused "not a directory"""

        /usr/bin/docker-current: Error response from daemon: oci runtime error:
            container_linux.go:247: starting container process caused
            "process_linux.go:364: container init caused "rootfs_linux.go:54:
            mounting "/etc/ssh/ssh_known_hosts" to rootfs
            "/var/lib/docker/overlay2/be879fedd6d7638887900c7dd5b6329a4a2c5371f0770a1fb914bb593fa946b7/merged"
            at "/var/lib/docker/overlay2/be879fedd6d7638887900c7dd5b6329a4a2c5371f0770a1fb914bb593fa946b7/merged/etc/ssh/ssh_known_hosts"
            caused "not a directory"""

        : Are you trying to mount a directory onto a file (or vice-versa)?
          Check if the specified host path exists and is the expected type.

        time="2018-05-17T07:25:23Z" level=error msg="error getting events from daemon: net/http: request canceled"

        Error running ['docker', 'run', '--name', 'memcached',
            '--label', 'config_id=tripleo_step1', '--label', 'container_name=memcached',
            '--label', 'managed_by=paunch', '--label', 'config_data={...}',
            '--detach=true', '--net=host', '--privileged=false', '--restart=always',
            '--volume=/etc/hosts:/etc/hosts:ro',
            '--volume=/etc/localtime:/etc/localtime:ro',
            '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro',
            '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro',
            '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro',
            '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro',
            '--volume=/dev/log:/dev/log',
            '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro',
            '--volume=/etc/puppet:/etc/puppet:ro',
            '--volume=/var/lib/config-data/memcached/etc/sysconfig/memcached:/etc/sysconfig/memcached:ro',
            '--volume=/var/log/containers/memcached:/var/log/',
            '192.168.0.1:8787/rhosp13/openstack-memcached:2018-05-07.2',
            '/bin/bash', '-c',
            'source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m ${CACHESIZE} -c ${MAXCONN} $OPTIONS >> /var/log/memcached.log 2>&1']. [125]

        stdout: a8dfcac4264e770c0fbbecc4b7929e655f027c298403dd4e10920a806afc6334

        stderr: /usr/bin/docker-current: Error response from daemon: oci runtime error:
            container_linux.go:247: starting container process caused
            "process_linux.go:364: container init caused "rootfs_linux.go:54:
            mounting "/etc/ssh/ssh_known_hosts" to rootfs
            "/var/lib/docker/overlay2/aaa75d2837288b054eb06d596c87ea8ac4023349eea0ff82b3f4e2aa7778a939/merged"
            at "/var/lib/docker/overlay2/aaa75d2837288b054eb06d596c87ea8ac4023349eea0ff82b3f4e2aa7778a939/merged/etc/ssh/ssh_known_hosts"
            caused "not a directory"""
    ]
}

    to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/3cbb2f2e-8a80-45d2-8435-b1394ea2a1a5_playbook.retry

PLAY RECAP *********************************************************************
localhost
```

```
$ docker -v
Docker version 1.13.1, build 94f4240/1.13.1
```

/etc/ssh/ssh_known_hosts seems to be a directory on the overcloud node instead of a file.
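For illustration, a minimal sketch of the failure mode (not from the report; hypothetical paths, assumes a Docker host with the centos:7 image available), followed by the pre-creation guard the comments below converge on:

```python
import os
import subprocess

path = '/tmp/ssh_known_hosts_demo'  # hypothetical stand-in for /etc/ssh/ssh_known_hosts

# 1. Bind-mount a host path that does not exist yet: the docker daemon
#    silently creates the missing source as a *directory*.
subprocess.call(['docker', 'run', '--rm',
                 '--volume=%s:/etc/ssh/ssh_known_hosts:ro' % path,
                 'centos:7', '/bin/true'])
print(os.path.isdir(path))  # True: docker created a directory, not a file

# 2. Any image whose rootfs already has /etc/ssh/ssh_known_hosts as a file
#    then fails to start with "not a directory", since a directory cannot be
#    mounted over a file. The guard is to create the source as an empty file
#    before any container mounts it:
if os.path.isdir(path):
    os.rmdir(path)
if not os.path.exists(path):
    open(path, 'a').close()  # touch the file up front
```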
Alex Schultz (comment #6):

So /etc/ssh/ssh_known_hosts will be turned into a directory if it does not exist prior to being mounted by docker as a shared volume. Docker will create it as a dir.

So is this expected behavior?

Bogdan Dobrelya (comment #8):

This seems like a bug in t-h-t's docker/services/containers-common.yaml: the host prep tasks should ensure /etc/ssh/ssh_known_hosts is created as a file before starting any containers. Otherwise, some of the early containers prolly race with ssh in creating the ssh_known_hosts file.

Ollie Walsh (comment #9):

(In reply to Alex Schultz from comment #6)
> So /etc/ssh/ssh_known_hosts will be turned into a directory if it does not
> exist prior to being mounted by docker as a shared volume. Docker will
> create it as a dir.

Ah, then boo docker! Nobody likes magicians!

(In reply to Bogdan Dobrelya from comment #8)
> This seems like a bug in t-h-t's docker/services/containers-common.yaml, the
> host prep tasks should ensure /etc/ssh/ssh_known_hosts is created as a file
> before starting any containers. Otherwise, some of the early containers
> prolly race with ssh in creating the ssh_known_hosts file

I think we're just missing https://review.openstack.org/386716 for SshKnownHostsDeployment.

Trying to reproduce in CI by slowing down the SshKnownHostsDeployment, but it would still be helpful if you can try to reproduce (-> needinfo).

Ollie Walsh:

(In reply to Ollie Walsh from comment #9)
> I think we're just missing https://review.openstack.org/386716 for
> SshKnownHostsDeployment.

To clarify, I meant that we need a similar patch but for the SshKnownHostsDeployment resource - https://review.openstack.org/569366

Sai Sindhur Malleni:

Copy-pasting the reproducer from email. To work around this, I deleted my stack and started deploying my cloud in increments of 50 compute nodes. Deployment went fine until I scaled up to 103 nodes (100 compute + 3 controllers), but when trying to scale to 150 compute nodes I saw a similar error [2] from multiple compute nodes (the ones that were newly deployed during the scale-up), and the deployment failed in step 3. In this gist [3] we can clearly see that /etc/ssh/ssh_known_hosts is a file on one of the computes when the deploy was only 100 nodes big, but it is a directory on a newly deployed compute during the scale-up to 150 compute nodes. At this point I'm not sure what is causing /etc/ssh/ssh_known_hosts to turn into a directory, and then only when doing a large-scale deployment (not sure if it's a matter of chance). Would appreciate your assistance, as we have limited time on this hardware to complete scale testing for OSP13.

[2] https://gist.githubusercontent.com/smalleni/199a36c3bc5e7fb20f6fdcbeefe46876/raw/a5e218f4597a69aec8291511b4e924575292a542/gistfile1.txt
[3] https://gist.github.com/smalleni/1cccd85e0cd1763fe01a7a1b40e9c8c0

Trying this patch, I ended up in a similar state. Actually, more hosts reported this problem in step 3 and the deploy failed:
"Error running ['docker', 'run', '--name', 'iscsid', '--label', 'config_id=tripleo_step3', '--label', 'container_name=iscsid', '--label', 'managed_by=paunch', '--label', 'config_data={\"environment\": [\"KOLLA_CONF IG_STRATEGY=COPY_ALWAYS\"], \"start_order\": 2, \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc /localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/tls/certs/ca-bundle.c rt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.t rust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_ho sts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/var/lib/kolla/config_files/iscsid.json:/var /lib/kolla/config_files/config.json:ro\", \"/dev/:/dev/\", \"/run/:/run/\", \"/sys:/sys\", \"/lib/modules:/lib/mo dules:ro\", \"/etc/iscsi:/var/lib/kolla/config_files/src-iscsid:ro\"], \"image\": \"192.168.0.1:8787/rhosp13/open stack-iscsid:2018-05-07.2\", \"net\": \"host\", \"restart\": \"always\", \"privileged\": true}', '--detach=true', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--net=host', '--privileged=true', '--restart=always', '--volume=/etc /hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pk i/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--vol ume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cer t.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_kn own_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/kolla/config_files/iscsid.json:/var/lib/ kolla/config_files/config.json:ro', '--volume=/dev/:/dev/', '--volume=/run/:/run/', '--volume=/sys:/sys', '--volu me=/lib/modules:/lib/modules:ro', '--volume=/etc/iscsi:/var/lib/kolla/config_files/src-iscsid:ro', '192.168.0.1:8 787/rhosp13/openstack-iscsid:2018-05-07.2']. 
[125]", "stdout: 55092d394343a91b114fefae665f6321247caad4319d0d4d106b6d73f799ba73", "stderr: /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:2 47: starting container process caused \"process_linux.go:364: container init caused \\\"rootfs_linux.go:54: mount ing \\\\\\\"/etc/ssh/ssh_known_hosts\\\\\\\" to rootfs \\\\\\\"/var/lib/docker/overlay2/02b388d07e48bcb8c53a8c6b0 b66b2879dc31021419af71852b56b8038673e39/merged\\\\\\\" at \\\\\\\"/var/lib/docker/overlay2/02b388d07e48bcb8c53a8c 6b0b66b2879dc31021419af71852b56b8038673e39/merged/etc/ssh/ssh_known_hosts\\\\\\\" caused \\\\\\\"not a directory\ \\\\\\"\\\"\"" ] Ok, I think we have two issues: May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,078] (heat-config) [INFO] deploy_server_id=ac09fbda-6df1-4bc8-98e4-11ced79beac2 May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,078] (heat-config) [INFO] deploy_action=CREATE May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [INFO] deploy_stack_id=overcloud-R620ComputeSshKnownHostsDeployment-sjfak6rrt3jz/714c38fc-2acb-4a7a-840f-4fa6c8e5c7ec May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [INFO] deploy_resource_name=80 May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [INFO] deploy_signal_transport=CFN_SIGNAL May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [INFO] deploy_signal_id=http://192.168.0.1:8000/v1/signal/arn%3Aopenstack%3Aheat%3A%3A2010502813db475f900567fe5fef6ea9%3Astacks/overcloud-R620ComputeSshKnownHostsDeployment-sjfak6rrt3jz/714c38fc-2acb-4a7a-840f-4fa6c8e5c7ec/resources/80?Timestamp=2018-05-18T04%3A48%3A12Z&SignatureMethod=HmacSHA256&AWSAccessKeyId=5d15ecfa3bb1445bb4fdb5ce4feae637&SignatureVersion=2&Signature=PssuQ1AGxsitYxhVoZJlXaumoKekH4ZzpGW3jreXAL4%3D May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [INFO] deploy_signal_verb=POST May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,079] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/db593a0f-9229-4f6a-8ee5-e867b9734ff2 May 18 00:50:01 overcloud-r620compute-80 os-collect-config: Traceback (most recent call last): May 18 00:50:01 overcloud-r620compute-80 os-collect-config: File "/usr/libexec/heat-config/hooks/script", line 96, in <module> May 18 00:50:01 overcloud-r620compute-80 os-collect-config: sys.exit(main(sys.argv)) May 18 00:50:01 overcloud-r620compute-80 os-collect-config: File "/usr/libexec/heat-config/hooks/script", line 66, in main May 18 00:50:01 overcloud-r620compute-80 os-collect-config: stderr=subprocess.PIPE, env=env) May 18 00:50:01 overcloud-r620compute-80 os-collect-config: File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__ May 18 00:50:01 overcloud-r620compute-80 os-collect-config: errread, errwrite) May 18 00:50:01 overcloud-r620compute-80 os-collect-config: File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child May 18 00:50:01 overcloud-r620compute-80 os-collect-config: raise child_exception May 18 00:50:01 overcloud-r620compute-80 os-collect-config: OSError: [Errno 7] Argument list too long May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,088] (heat-config) [ERROR] Error running /usr/libexec/heat-config/hooks/script. 
[1] May 18 00:50:01 overcloud-r620compute-80 os-collect-config: [2018-05-18 00:50:01,088] (heat-config) [DEBUG] Running heat-config-notify /var/lib/heat-config/deployed/db593a0f-9229-4f6a-8ee5-e867b9734ff2.json < /var/lib/heat-config/deployed/db593a0f-9229-4f6a-8ee5-e867b9734ff2.notify.json First is the obvious one, config script isn't being run due to Errno 7 "Argument list too long". I assume that comes from execve returning E2BIG to indicate "The total number of bytes in the environment (envp) and argument list (argv) is too large." Second issue is that the resource didn't fail in heat. [root@overcloud-r620compute-80 heat-config]# cat /var/lib/heat-config/deployed/db593a0f-9229-4f6a-8ee5-e867b9734ff2.notify.json {} Back to the first issue... Checking the limits: IIUC the limit on linux is 1/4 of the stack size: [root@overcloud-r620compute-80 ~]# ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 256605 max locked memory (kbytes, -l) 64 max memory size (kbytes, -m) unlimited open files (-n) 1024 pipe size (512 bytes, -p) 8 POSIX message queues (bytes, -q) 819200 real-time priority (-r) 0 stack size (kbytes, -s) 8192 cpu time (seconds, -t) unlimited max user processes (-u) 256605 virtual memory (kbytes, -v) unlimited file locks (-x) unlimited [root@overcloud-r620compute-80 ~]# xargs --show-limits Your environment variables take up 2106 bytes POSIX upper limit on argument length (this system): 2092998 POSIX smallest allowable upper limit on argument length (all systems): 4096 Maximum length of command we could actually use: 2090892 Size of command buffer we are actually using: 131072 Adding some logging to the heat-config hook to log the env sizes: [root@overcloud-r620compute-80 ~]# cat /var/lib/heat-config/deployed/db593a0f-9229-4f6a-8ee5-e867b9734ff2.json | /usr/libexec/heat-config/hooks/script [2018-05-18 17:38:07,818] (heat-config) [INFO] Input known_hosts size=137497 [2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_server_id size=36 [2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_action size=6 [2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_stack_id size=94 [2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_resource_name size=2 [2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_signal_transport size=10 [2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_signal_id size=397 [2018-05-18 17:38:07,819] (heat-config) [INFO] Input deploy_signal_verb size=4 [2018-05-18 17:38:07,820] (heat-config) [DEBUG] Running /var/lib/heat-config/heat-config-script/db593a0f-9229-4f6a-8ee5-e867b9734ff2 Traceback (most recent call last): File "/usr/libexec/heat-config/hooks/script", line 97, in <module> sys.exit(main(sys.argv)) File "/usr/libexec/heat-config/hooks/script", line 67, in main stderr=subprocess.PIPE, env=env) File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__ errread, errwrite) File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child raise child_exception OSError: [Errno 7] Argument list too long So it's not clear where the limit is coming from. Python maybe? 
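A likely answer, not stated in the bug itself: besides the overall argv+envp budget (about 1/4 of the stack rlimit, roughly 2 MB here, which is what xargs reports), Linux's execve(2) also caps each single argument or environment string at MAX_ARG_STRLEN, 32 pages, i.e. 131072 bytes with 4 KiB pages. The 137497-byte known_hosts input exceeds that per-string cap on its own, so execve() fails with E2BIG regardless of the total size, and Python 2's subprocess surfaces it as OSError [Errno 7]. A minimal sketch (hypothetical sizes) that reproduces the error on any Linux host:

```python
import os
import subprocess

env = dict(os.environ)
# One env string above MAX_ARG_STRLEN (131072 bytes), mirroring the
# 137497-byte known_hosts input logged above.
env['known_hosts'] = 'x' * 137497

try:
    subprocess.check_call(['/bin/true'], env=env)
except OSError as e:
    # errno 7 (E2BIG): "Argument list too long", the same failure the
    # heat-config hook hit when exporting known_hosts to the script's env.
    print('errno=%d: %s' % (e.errno, e.strerror))
```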
VERIFIED 2018-05-23.1

```
(undercloud) [stack@undercloud-0 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| id                                   | stack_name | stack_status    | creation_time        | updated_time | project                          |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+
| e26232f7-43c5-4d95-a1fe-02a5d478ced2 | overcloud  | CREATE_COMPLETE | 2018-05-24T16:13:59Z | None         | f5c14a90eac147d0b615f3ab686105fd |
+--------------------------------------+------------+-----------------+----------------------+--------------+----------------------------------+

(undercloud) [stack@undercloud-0 ~]$ nova list
/usr/lib/python2.7/site-packages/requests/packages/urllib3/connection.py:344: SubjectAltNameWarning: Certificate for 192.168.24.2 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| f27bdc2d-45bb-4902-a7e8-8ba4eec06d68 | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.20 |
| 66a48fd2-a37c-4326-88a1-7f6464d251b2 | compute-1    | ACTIVE | -          | Running     | ctlplane=192.168.24.19 |
| 84b26d4a-4fe3-4e36-8c78-951f844a4703 | compute-2    | ACTIVE | -          | Running     | ctlplane=192.168.24.18 |
| 71299a4d-e6eb-4653-8f91-4b1f0e0d7a67 | compute-3    | ACTIVE | -          | Running     | ctlplane=192.168.24.22 |
| b96105c1-3e06-427d-9f27-a32344c689f3 | compute-4    | ACTIVE | -          | Running     | ctlplane=192.168.24.21 |
| 59dd61a1-79e7-4fd4-865a-9b9aaf318fe5 | compute-5    | ACTIVE | -          | Running     | ctlplane=192.168.24.24 |
| 5eafcfc9-acda-4f22-b6db-2ae464b54380 | compute-6    | ACTIVE | -          | Running     | ctlplane=192.168.24.10 |
| 78e0f34a-dddf-489d-8f5d-99d5a353a4c7 | compute-7    | ACTIVE | -          | Running     | ctlplane=192.168.24.14 |
| d20fe950-9fb6-4dae-8f14-f856232cedd0 | compute-8    | ACTIVE | -          | Running     | ctlplane=192.168.24.7  |
| aca3c22b-dca8-496d-beb1-e2ca67dc520b | compute-9    | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| 0287ffee-3fa2-4576-b4a2-95bdc21bd9b9 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.13 |
| ae000cda-26d7-4330-a115-905ab0962a41 | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
| f0174a5e-df17-4338-ad99-e31d81929c54 | controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.24.9  |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+

[heat-admin@compute-5 ~]$ ls -la /etc/ssh/ssh_known_hosts
-rw-r--r--. 1 root root 8675 May 24 16:31 /etc/ssh/ssh_known_hosts
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086