Description of problem:

During execution of the following scale-up playbook:

  ansible-playbook -i /tmp/inventory.yaml /usr/share/ansible/openshift-ansible/playbooks/scaleup.yml --private-key=/tmp/openshift-dev.pem -v

it ended with a non-zero return code because of this warning:

PLAY RECAP *********************************************************************
localhost                              : ok=1  changed=1  unreachable=0 failed=0 skipped=3 rescued=0 ignored=0
rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com : ok=37 changed=29 unreachable=0 failed=0 skipped=7 rescued=0 ignored=0
rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28 changed=20 unreachable=0 failed=1 skipped=7 rescued=0 ignored=0
rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28 changed=20 unreachable=0 failed=1 skipped=7 rescued=0 ignored=0

2020-03-11 07:05:59,341 - WARNING - ocs_ci.utility.utils.run_cmd.421 - Command stderr: [WARNING]: The value 1 (type int) in a string field was converted to u'1' (type string). If this does not look like what you expect, quote the entire value to ensure it does not change.
command terminated with exit code 2

cat /tmp/inventory.yaml
#/usr/share/ansible/openshift-ansible/playbooks/scaleup.yml --private-key=/tmp/openshift-dev.pem -v.
[all:vars]
ansible_user=test
ansible_become=True
openshift_kubeconfig_path=/tmp/kubeconfig
openshift_pull_secret_path=/tmp/pull-secret

[new_workers]
rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com
rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com
rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com

cat /usr/share/ansible/openshift-ansible/playbooks/scaleup.yml
---
- name: Pre-scaleup checks
  hosts: localhost
  connection: local
  gather_facts: no
  tasks:
  - import_role:
      name: openshift_node
      tasks_from: scaleup_checks.yml

- name: install nodes
  hosts: new_workers
  roles:
  - openshift_node

Version-Release number of the following components:

rpm -q openshift-ansible
openshift-ansible-4.3.6-202003090404.git.0.612bb09.el7.noarch

rpm -q ansible
ansible-2.8.7-1.el7ae.noarch

ansible --version
ansible 2.8.7
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jun 11 2019, 14:33:56) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]

How reproducible:

Steps to Reproduce:
1. Prepare RHEL nodes
2. Run the command described above
3. The playbook passes, but the return code is not 0, which breaks the automation

Actual results:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/nberry-dc7-rhcos/nberry-dc7-rhcos_20200311T060425/logs/ocs-ci-logs-1583907322/by_outcome/failed/tests/ecosystem/deployment/test_deployment.py/test_deployment/logs
Look for the ansible-playbook command in the log above.

Expected results:
The warning should not appear, and the playbook should end with RC 0.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag.
We do not have such output. A must-gather collected after this failure is here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/nberry-dc7-rhcos/nberry-dc7-rhcos_20200311T060425/logs/failed_testcase_ocs_logs_1583907322/deployment_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/
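The int-to-string conversion warning itself comes from Ansible coercing an unquoted numeric value into a string-typed field; the log does not show which variable triggered it. As a generic illustration only (the variable name below is hypothetical and not taken from this inventory), quoting the value keeps it a string, which is what the warning text recommends:

[all:vars]
# hypothetical variable, shown only to illustrate the quoting advice from the warning
some_numeric_setting="1"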
The playbook did not complete successfully for all hosts. Two of the nodes failed because of an issue pulling the release image. An exit code of '2' is expected when there are failed hosts. TASK [openshift_node : Pull release image] ************************************* fatal: [rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com]: FAILED! => {"changed": true, "cmd": ["podman", "pull", "--tls-verify=False", "--authfile", "/tmp/ansible.GCwW1y/pull-secret.json", "quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3"], "delta": "0:00:07.230095", "end": "2020-03-11 12:33:58.304901", "msg": "non-zero return code", "rc": 125, "start": "2020-03-11 12:33:51.074806", "stderr": "Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures\nCopying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60\nCopying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82\nCopying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a\nCopying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc\nCopying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379\nCopying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a\ntime=\"2020-03-11T12:33:58+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage372764296/1\\\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer\"\nFailed\nError: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage372764296/1\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer", "stderr_lines": ["Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures", "Copying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60", "Copying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82", "Copying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a", "Copying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc", "Copying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379", "Copying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a", "time=\"2020-03-11T12:33:58+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage372764296/1\\\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer\"", "Failed", "Error: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull 
quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage372764296/1\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer"], "stdout": "", "stdout_lines": []} fatal: [rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com]: FAILED! => {"changed": true, "cmd": ["podman", "pull", "--tls-verify=False", "--authfile", "/tmp/ansible.JGeE6n/pull-secret.json", "quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3"], "delta": "0:00:09.737676", "end": "2020-03-11 12:34:00.838081", "msg": "non-zero return code", "rc": 125, "start": "2020-03-11 12:33:51.100405", "stderr": "Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures\nCopying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a\nCopying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc\nCopying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82\nCopying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60\nCopying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a\nCopying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379\ntime=\"2020-03-11T12:34:00+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage101571689/2\\\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer\"\nFailed\nError: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage101571689/2\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer", "stderr_lines": ["Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures", "Copying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a", "Copying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc", "Copying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82", "Copying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60", "Copying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a", "Copying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379", "time=\"2020-03-11T12:34:00+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage101571689/2\\\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer\"", "Failed", "Error: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull 
quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage101571689/2\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer"], "stdout": "", "stdout_lines": []} changed: [rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com] => {"changed": true, "cmd": ["podman", "pull", "--tls-verify=False", "--authfile", "/tmp/ansible.XeiKYS/pull-secret.json", "quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3"], "delta": "0:00:14.166416", "end": "2020-03-11 12:34:05.233500", "rc": 0, "start": "2020-03-11 12:33:51.067084", "stderr": "Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures\nCopying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc\nCopying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a\nCopying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60\nCopying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82\nCopying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379\nCopying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a\nCopying config sha256:d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571\nWriting manifest to image destination\nStoring signatures", "stderr_lines": ["Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures", "Copying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc", "Copying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a", "Copying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60", "Copying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82", "Copying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379", "Copying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a", "Copying config sha256:d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571", "Writing manifest to image destination", "Storing signatures"], "stdout": "d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571", "stdout_lines": ["d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571"]}
Oh, I saw failed=0, so I thought there were no failures there. Do we know why this could happen? Or is it possible to set some global retry for such failures so that those tasks are retried? Thanks.

Vijay, the only thing I can think of is the same issue we had before with conflicting IP addresses in the network. Did you talk to the PNT guys about this? Thanks.
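As a caller-side sketch (assuming the automation can simply re-invoke ansible-playbook, and not describing anything openshift-ansible does itself): Ansible can write a <playbook>.retry file listing the hosts that failed, and a follow-up run can be limited to just those hosts with --limit. This retries whole hosts on a re-run rather than individual tasks, so it is only a partial answer to the global-retry question. Enabling it in ansible.cfg would look roughly like:

[defaults]
# write scaleup.retry when hosts fail, so a follow-up run can
# target only those hosts via: --limit @<save_path>/scaleup.retry
retry_files_enabled = True
retry_files_save_path = /tmp/ansible-retries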
The PLAY RECAP shows the status of tasks for all the hosts for that playbook run.

PLAY RECAP *********************************************************************
localhost                              : ok=1  changed=1  unreachable=0 failed=0 skipped=3 rescued=0 ignored=0
rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com : ok=37 changed=29 unreachable=0 failed=0 skipped=7 rescued=0 ignored=0
rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28 changed=20 unreachable=0 failed=1 skipped=7 rescued=0 ignored=0
rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28 changed=20 unreachable=0 failed=1 skipped=7 rescued=0 ignored=0

You can see above that both rhel1-1 and rhel1-2 reported failed=1.

As for retrying, I recently added retries to `podman pull` tasks in an attempt to be more resilient to transient network issues. Example: https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/tasks/config.yml#L129-L132

These changes will be in 4.5, and there is no plan to backport them to earlier releases. I was seeing occasional failures in CI for these tasks, but it has not been a significant issue since 4.1 GA.
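The exact task in the linked config.yml is not reproduced here, but the Ansible retries/until pattern it relies on looks roughly like the sketch below (the task name, variables, and retry counts are placeholders, not the actual openshift-ansible code):

- name: Pull release image
  command: "podman pull --tls-verify=False --authfile {{ pull_secret_path }} {{ release_image }}"
  register: pull_result
  # keep re-running the command until it exits 0, for up to
  # 3 extra attempts with a 5 second pause between attempts
  until: pull_result.rc == 0
  retries: 3
  delay: 5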
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days