Bug 1812407 - Execution of ansible playbook for cluster expansion returns RC 2 and stderr is: [WARNING]: The value 1 (type int) in a string field was converted to u'1' (type string). [NEEDINFO]
Summary: Execution of ansible playbook for cluster expansion returns RC 2 and stderr i...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
urgent
urgent
Target Milestone: ---
Assignee: Russell Teague
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-03-11 09:35 UTC by Petr Balogh
Modified: 2020-03-11 17:50 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-03-11 12:40:02 UTC
Target Upstream Version:
pbalogh: needinfo? (vavuthu)



Description Petr Balogh 2020-03-11 09:35:33 UTC
Description of problem:
During execution of the following scale-up playbook:
ansible-playbook -i /tmp/inventory.yaml /usr/share/ansible/openshift-ansible/playbooks/scaleup.yml --private-key=/tmp/openshift-dev.pem -v


It ended up with a non-zero return code because of this warning:
PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0   
rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com : ok=37   changed=29   unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28   changed=20   unreachable=0    failed=1    skipped=7    rescued=0    ignored=0   
rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28   changed=20   unreachable=0    failed=1    skipped=7    rescued=0    ignored=0   


2020-03-11 07:05:59,341 - WARNING - ocs_ci.utility.utils.run_cmd.421 - Command stderr: [WARNING]: The value 1 (type int) in a string field was converted to u'1' (type
string). If this does not look like what you expect, quote the entire value to
ensure it does not change.
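For reference, this warning is typically emitted when a bare integer is passed where Ansible expects a string, and quoting the value silences it. A minimal hypothetical illustration (the variable name is made up; the report does not identify the offending field):

```yaml
# Triggers the warning when the field expects a string:
#   node_label_value: 1
# Quoting keeps the value a string and avoids the conversion warning:
node_label_value: "1"
```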
command terminated with exit code 2


cat /tmp/inventory.yaml
[all:vars]
ansible_user=test
ansible_become=True

openshift_kubeconfig_path=/tmp/kubeconfig
openshift_pull_secret_path=/tmp/pull-secret

[new_workers]
rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com
rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com
rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com



cat /usr/share/ansible/openshift-ansible/playbooks/scaleup.yml
---
- name: Pre-scaleup checks
  hosts: localhost
  connection: local
  gather_facts: no
  tasks:
  - import_role:
      name: openshift_node
      tasks_from: scaleup_checks.yml

- name: install nodes
  hosts: new_workers
  roles:
  - openshift_node

Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-4.3.6-202003090404.git.0.612bb09.el7.noarch
rpm -q ansible
ansible-2.8.7-1.el7ae.noarch
ansible --version
ansible 2.8.7
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jun 11 2019, 14:33:56) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]

How reproducible:

Steps to Reproduce:
1. Prepare rhel nodes
2. Run command described above
3. Playbook passes but RC is not 0, which breaks automation

Actual results:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/nberry-dc7-rhcos/nberry-dc7-rhcos_20200311T060425/logs/ocs-ci-logs-1583907322/by_outcome/failed/tests/ecosystem/deployment/test_deployment.py/test_deployment/logs

Look for ansible-playbook command in log above

Expected results:
No such warning is shown and the playbook ends with RC 0

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
We do not have such output.

Must gather after this failure here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/nberry-dc7-rhcos/nberry-dc7-rhcos_20200311T060425/logs/failed_testcase_ocs_logs_1583907322/deployment_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/

Comment 1 Russell Teague 2020-03-11 12:40:02 UTC
The playbook did not complete successfully for all hosts.  Two of the nodes failed because of an issue pulling the release image. An exit code of '2' is expected when there are failed hosts.

TASK [openshift_node : Pull release image] *************************************
fatal: [rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com]: FAILED! => {"changed": true, "cmd": ["podman", "pull", "--tls-verify=False", "--authfile", "/tmp/ansible.GCwW1y/pull-secret.json", "quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3"], "delta": "0:00:07.230095", "end": "2020-03-11 12:33:58.304901", "msg": "non-zero return code", "rc": 125, "start": "2020-03-11 12:33:51.074806", "stderr": "Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures\nCopying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60\nCopying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82\nCopying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a\nCopying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc\nCopying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379\nCopying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a\ntime=\"2020-03-11T12:33:58+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage372764296/1\\\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer\"\nFailed\nError: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage372764296/1\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer", "stderr_lines": ["Trying to pull 
quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures", "Copying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60", "Copying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82", "Copying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a", "Copying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc", "Copying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379", "Copying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a", "time=\"2020-03-11T12:33:58+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage372764296/1\\\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer\"", "Failed", "Error: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage372764296/1\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer"], "stdout": "", "stdout_lines": []}
fatal: [rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com]: FAILED! => {"changed": true, "cmd": ["podman", "pull", "--tls-verify=False", "--authfile", "/tmp/ansible.JGeE6n/pull-secret.json", "quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3"], "delta": "0:00:09.737676", "end": "2020-03-11 12:34:00.838081", "msg": "non-zero return code", "rc": 125, "start": "2020-03-11 12:33:51.100405", "stderr": "Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures\nCopying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a\nCopying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc\nCopying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82\nCopying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60\nCopying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a\nCopying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379\ntime=\"2020-03-11T12:34:00+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage101571689/2\\\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer\"\nFailed\nError: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage101571689/2\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer", "stderr_lines": ["Trying to pull 
quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures", "Copying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a", "Copying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc", "Copying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82", "Copying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60", "Copying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a", "Copying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379", "time=\"2020-03-11T12:34:00+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage101571689/2\\\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer\"", "Failed", "Error: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage101571689/2\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer"], "stdout": "", "stdout_lines": []}
changed: [rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com] => {"changed": true, "cmd": ["podman", "pull", "--tls-verify=False", "--authfile", "/tmp/ansible.XeiKYS/pull-secret.json", "quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3"], "delta": "0:00:14.166416", "end": "2020-03-11 12:34:05.233500", "rc": 0, "start": "2020-03-11 12:33:51.067084", "stderr": "Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures\nCopying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc\nCopying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a\nCopying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60\nCopying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82\nCopying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379\nCopying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a\nCopying config sha256:d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571\nWriting manifest to image destination\nStoring signatures", "stderr_lines": ["Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures", "Copying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc", "Copying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a", "Copying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60", "Copying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82", "Copying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379", "Copying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a", "Copying config 
sha256:d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571", "Writing manifest to image destination", "Storing signatures"], "stdout": "d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571", "stdout_lines": ["d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571"]}

Comment 2 Petr Balogh 2020-03-11 17:27:04 UTC
Oh, I saw failed=0 on the first host and thought there were no failures.
Do we know why this could happen? Or is it possible to set some global retry so that such failed tasks are retried?

Thanks.


Vijay, the only thing I can think of is the same issue we had before with conflicting IP addresses on the network.
Did you talk to the PNT team about this?

Thanks.

Comment 3 Russell Teague 2020-03-11 17:50:00 UTC
The PLAY RECAP shows the status of tasks for all the hosts for that playbook run.

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0   
rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com : ok=37   changed=29   unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28   changed=20   unreachable=0    failed=1    skipped=7    rescued=0    ignored=0   
rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28   changed=20   unreachable=0    failed=1    skipped=7    rescued=0    ignored=0   

You can see above that both rhel1-1 and rhel1-2 reported failed=1.

As for retrying, I recently added retries to `podman pull` tasks in an attempt to be more resilient to transient network issues.

Example:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/tasks/config.yml#L129-L132

These changes will be in 4.5 and there is no plan to backport them to earlier releases.  I was seeing occasional failures in CI for these tasks, but it has not been a significant issue since 4.1 GA.
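For anyone looking for the shape of that change, a retried pull task in Ansible looks roughly like this (the task name, variables, and retry counts here are illustrative; see the linked config.yml for the actual upstream task):

```yaml
# Sketch of a retried podman pull task using Ansible's until/retries loop.
# pull_secret_path and release_image are placeholder variables.
- name: Pull release image
  command: "podman pull --authfile {{ pull_secret_path }} {{ release_image }}"
  register: pull_result
  until: pull_result.rc == 0
  retries: 3
  delay: 5
```

With `until`/`retries`, a failing `command` task is re-run up to the retry limit before being marked failed, which rides out transient network errors like the connection resets seen above.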

