Bug 1812407

Summary: Execution of ansible playbook for cluster expansion returns RC 2 and stderr is: [WARNING]: The value 1 (type int) in a string field was converted to u'1' (type string).
Product: OpenShift Container Platform Reporter: Petr Balogh <pbalogh>
Component: Installer Assignee: Russell Teague <rteague>
Installer sub component: openshift-ansible QA Contact: Johnny Liu <jialiu>
Status: CLOSED NOTABUG Docs Contact:
Severity: urgent    
Priority: urgent CC: vavuthu
Version: 4.3.0 Keywords: Automation
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-11 12:40:02 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Petr Balogh 2020-03-11 09:35:33 UTC
Description of problem:
During execution of the following scale-up playbook:
ansible-playbook -i /tmp/inventory.yaml /usr/share/ansible/openshift-ansible/playbooks/scaleup.yml --private-key=/tmp/openshift-dev.pem -v


It ended up with a non-zero return code, apparently because of this warning:
PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0   
rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com : ok=37   changed=29   unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28   changed=20   unreachable=0    failed=1    skipped=7    rescued=0    ignored=0   
rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28   changed=20   unreachable=0    failed=1    skipped=7    rescued=0    ignored=0   


2020-03-11 07:05:59,341 - WARNING - ocs_ci.utility.utils.run_cmd.421 - Command stderr: [WARNING]: The value 1 (type int) in a string field was converted to u'1' (type
string). If this does not look like what you expect, quote the entire value to
ensure it does not change.
command terminated with exit code 2
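
For context, the warning itself is about Ansible coercing an unquoted integer into a string-typed field. A minimal sketch of how such a coercion can arise, assuming the flagged value came from an unquoted inventory entry (the variable names below are hypothetical and are not used by this playbook):

[all:vars]
# Unquoted, an INI inventory value like this is parsed as an int and later
# coerced to u'1', which produces the warning seen above:
unquoted_setting=1
# Quoted, the value stays a string and no warning is emitted:
quoted_setting='1'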


cat /tmp/inventory.yaml
[all:vars]
ansible_user=test
ansible_become=True

openshift_kubeconfig_path=/tmp/kubeconfig
openshift_pull_secret_path=/tmp/pull-secret

[new_workers]
rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com
rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com
rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com



cat /usr/share/ansible/openshift-ansible/playbooks/scaleup.yml
---
- name: Pre-scaleup checks
  hosts: localhost
  connection: local
  gather_facts: no
  tasks:
  - import_role:
      name: openshift_node
      tasks_from: scaleup_checks.yml

- name: install nodes
  hosts: new_workers
  roles:
  - openshift_node

Version-Release number of the following components:
rpm -q openshift-ansible
openshift-ansible-4.3.6-202003090404.git.0.612bb09.el7.noarch
rpm -q ansible
ansible-2.8.7-1.el7ae.noarch
ansible --version
ansible 2.8.7
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Jun 11 2019, 14:33:56) [GCC 4.8.5 20150623 (Red Hat 4.8.5-39)]

How reproducible:

Steps to Reproduce:
1. Prepare RHEL nodes
2. Run the command described above
3. The playbook passes, but the RC is not 0, which breaks the automation

Actual results:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/nberry-dc7-rhcos/nberry-dc7-rhcos_20200311T060425/logs/ocs-ci-logs-1583907322/by_outcome/failed/tests/ecosystem/deployment/test_deployment.py/test_deployment/logs

Look for ansible-playbook command in log above

Expected results:
The warning should not appear and the playbook should end with RC 0.

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
We do not have such output.

Must gather after this failure here:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/nberry-dc7-rhcos/nberry-dc7-rhcos_20200311T060425/logs/failed_testcase_ocs_logs_1583907322/deployment_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/

Comment 1 Russell Teague 2020-03-11 12:40:02 UTC
The playbook did not complete successfully for all hosts.  Two of the nodes failed because of an issue pulling the release image. An exit code of '2' is expected when there are failed hosts.

TASK [openshift_node : Pull release image] *************************************
fatal: [rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com]: FAILED! => {"changed": true, "cmd": ["podman", "pull", "--tls-verify=False", "--authfile", "/tmp/ansible.GCwW1y/pull-secret.json", "quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3"], "delta": "0:00:07.230095", "end": "2020-03-11 12:33:58.304901", "msg": "non-zero return code", "rc": 125, "start": "2020-03-11 12:33:51.074806", "stderr": "Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures\nCopying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60\nCopying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82\nCopying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a\nCopying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc\nCopying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379\nCopying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a\ntime=\"2020-03-11T12:33:58+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage372764296/1\\\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer\"\nFailed\nError: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage372764296/1\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer", "stderr_lines": ["Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures", "Copying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60", "Copying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82", "Copying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a", "Copying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc", "Copying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379", "Copying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a", "time=\"2020-03-11T12:33:58+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage372764296/1\\\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer\"", "Failed", "Error: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage372764296/1\": read tcp 10.46.27.217:40438->99.84.222.79:443: read: connection reset by peer"], "stdout": "", "stdout_lines": []}
fatal: [rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com]: FAILED! => {"changed": true, "cmd": ["podman", "pull", "--tls-verify=False", "--authfile", "/tmp/ansible.JGeE6n/pull-secret.json", "quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3"], "delta": "0:00:09.737676", "end": "2020-03-11 12:34:00.838081", "msg": "non-zero return code", "rc": 125, "start": "2020-03-11 12:33:51.100405", "stderr": "Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures\nCopying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a\nCopying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc\nCopying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82\nCopying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60\nCopying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a\nCopying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379\ntime=\"2020-03-11T12:34:00+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage101571689/2\\\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer\"\nFailed\nError: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage101571689/2\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer", "stderr_lines": ["Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures", "Copying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a", "Copying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc", "Copying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82", "Copying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60", "Copying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a", "Copying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379", "time=\"2020-03-11T12:34:00+05:30\" level=error msg=\"Error pulling image ref //quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: Error writing blob: error storing blob to file \\\"/var/tmp/storage101571689/2\\\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer\"", "Failed", "Error: error pulling image \"quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3\": unable to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3: unable to pull image: Error writing blob: error storing blob to file \"/var/tmp/storage101571689/2\": read tcp 10.46.27.219:48276->99.84.222.83:443: read: connection reset by peer"], "stdout": "", "stdout_lines": []}
changed: [rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com] => {"changed": true, "cmd": ["podman", "pull", "--tls-verify=False", "--authfile", "/tmp/ansible.XeiKYS/pull-secret.json", "quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3"], "delta": "0:00:14.166416", "end": "2020-03-11 12:34:05.233500", "rc": 0, "start": "2020-03-11 12:33:51.067084", "stderr": "Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures\nCopying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc\nCopying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a\nCopying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60\nCopying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82\nCopying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379\nCopying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a\nCopying config sha256:d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571\nWriting manifest to image destination\nStoring signatures", "stderr_lines": ["Trying to pull quay.io/openshift-release-dev/ocp-release@sha256:ea7ac3ad42169b39fce07e5e53403a028644810bee9a212e7456074894df40f3...Getting image source signatures", "Copying blob sha256:6d1814359c742980a065213f5968a5e80121728e122a7568e8d708087c42dcbc", "Copying blob sha256:104cc276e39f95e92c9e08dd76b6c8db89b6f4093f4dd7d3975f3df1e683f78a", "Copying blob sha256:964d57e311cc40e34746a347692678b801ad54144e367eef117e66d43d749c60", "Copying blob sha256:92d368863ef4e46361306339fae7473b2a569eedcda94f99f472a6545c70ef82", "Copying blob sha256:875cc07082e20f772339a6f18de201d3640ebaf0021afba6c4abe5e21ed39379", "Copying blob sha256:e88ff14059f505f5faddad1237a07f4378fe0dcc0346fff7fd20576fe289638a", "Copying config sha256:d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571", "Writing manifest to image destination", "Storing signatures"], "stdout": "d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571", "stdout_lines": ["d06ffb37f9543ca2c57592912bbeef82ee225b3e3c7b8470cde4d5873584e571"]}
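
(As a side note for the calling automation, a hedged sketch, not taken from ocs-ci, of keying off the playbook's exit status rather than its stderr warnings; ansible-playbook returns 2 when one or more hosts fail:

ansible-playbook -i /tmp/inventory.yaml \
    /usr/share/ansible/openshift-ansible/playbooks/scaleup.yml \
    --private-key=/tmp/openshift-dev.pem -v
rc=$?
# Exit code 0 means all hosts succeeded; 2 means one or more hosts failed,
# regardless of any warnings printed to stderr.
if [ "$rc" -ne 0 ]; then
    echo "scaleup.yml failed with exit code $rc" >&2
    exit "$rc"
fi
)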

Comment 2 Petr Balogh 2020-03-11 17:27:04 UTC
Oh, I saw failed=0, so I thought there were no failures there.
Do we know why this could happen? Or is it possible to set some global retry so that such failed tasks are retried?

Thanks.


Vijay, the only thing I can think of is the same issue we had before with conflicting IP addresses on the network.
Did you talk to the PNT guys about this?

Thanks.

Comment 3 Russell Teague 2020-03-11 17:50:00 UTC
The PLAY RECAP shows the status of tasks for all hosts in that playbook run.

PLAY RECAP *********************************************************************
localhost                  : ok=1    changed=1    unreachable=0    failed=0    skipped=3    rescued=0    ignored=0   
rhel1-0.nberry-dc7-rhcos.qe.rh-ocs.com : ok=37   changed=29   unreachable=0    failed=0    skipped=7    rescued=0    ignored=0   
rhel1-1.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28   changed=20   unreachable=0    failed=1    skipped=7    rescued=0    ignored=0   
rhel1-2.nberry-dc7-rhcos.qe.rh-ocs.com : ok=28   changed=20   unreachable=0    failed=1    skipped=7    rescued=0    ignored=0   

You can see above that both rhel1-1 and rhel1-2 reported failed=1.

As for retrying, I recently added retries to `podman pull` tasks in an attempt to be more resilient to transient network issues.

Example:
https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/tasks/config.yml#L129-L132
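
For illustration, a minimal sketch of that retry pattern (not the exact openshift-ansible task; the variable names here are hypothetical):

- name: Pull release image
  command: "podman pull --tls-verify=False --authfile {{ pull_secret_file }} {{ release_image }}"
  register: pull_result
  # Retry the pull a few times so a transient 'connection reset by peer'
  # does not fail the whole play.
  retries: 3
  delay: 10
  until: pull_result.rc == 0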

These changes will be in 4.5 and there is no plan to backport them to earlier releases.  I was seeing occasional failures in CI for these tasks, but it has not been a significant issue since 4.1 GA.

Comment 4 Red Hat Bugzilla 2023-09-14 05:54:10 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days