Bug 1690093
Summary: | python command not in rhel8-based rhcs4 container image (only python3) | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Ceph Storage | Reporter: | John Fulton <johfulto>
Component: | Container | Assignee: | Dimitri Savineau <dsavinea>
Status: | CLOSED ERRATA | QA Contact: | Yogev Rabl <yrabl>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 4.0 | CC: | ceph-eng-bugs, gabrioux, gfidente, tserlin, vashastr, yrabl
Target Milestone: | rc | |
Target Release: | 4.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | rhceph-4.0-rhel8:latest | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2020-01-31 14:44:57 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1594251 | |
Description
John Fulton
2019-03-18 18:59:18 UTC
You might also see this error when debugging locally like this:

[root@overcloud-computehci-0 ~]# ./ceph-osd-run.sh 0
2019-03-15 23:10:16 /opt/ceph-container/bin/entrypoint.sh: OSD id 0 does not exist
[root@overcloud-computehci-0 ~]#

> You might also see this error when debugging locally like this:
>
> [root@overcloud-computehci-0 ~]# ./ceph-osd-run.sh 0
> 2019-03-15 23:10:16 /opt/ceph-container/bin/entrypoint.sh: OSD id 0 does not exist
In a containerized deployment you need to use the device name, not the OSD id.
> In a containerized deployment you need to use the device name, not the OSD id.
Never mind, I forgot that RHCS 4 is based on Nautilus, so it's ceph-volume only (the statement was only true for ceph-disk deployments with containers).
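A quick way to confirm whether a given image actually ships a python command (the missing binary tracked by this bug) is to query the image directly. This is a minimal sketch, assuming podman on the host and the image tag quoted later in this report, and is not part of the original reproduction steps; the --entrypoint override simply bypasses the image's startup script:

    # Sketch: check which interpreters the image provides.
    podman run --rm --entrypoint sh \
        docker-registry.upshift.redhat.com/ceph/rhceph-4.0-rhel8:latest \
        -c 'command -v python || echo "python: not found"; command -v python3 || echo "python3: not found"'
    # An unfixed rhel8-based image reports python as missing (only python3 is
    # present); a fixed image resolves both.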
Reproduced:

TASK [ceph-osd : wait for all osd to be up] ************************************
task path: /usr/share/ceph-ansible/roles/ceph-osd/tasks/openstack_config.yml:2
Friday 10 May 2019 18:27:47 +0000 (0:00:00.314) 0:04:05.481 ************
FAILED - RETRYING: wait for all osd to be up (60 retries left).
[identical retry messages repeat from 59 down to 2 retries left]
FAILED - RETRYING: wait for all osd to be up (1 retries left).
fatal: [ceph-2]: FAILED! => changed=false
  attempts: 60
  test "$(podman exec ceph-mon-controller-0 ceph --cluster ceph -s -f json | python -c 'import sys, json; print(json.load(sys.stdin)["osdmap"]["osdmap"]["num_osds"])')" -gt 0 && test "$(podman exec ceph-mon-controller-0 ceph --cluster ceph -s -f json | python -c 'import sys, json; print(json.load(sys.stdin)["osdmap"]["osdmap"]["num_osds"])')" = "$(podman exec ceph-mon-controller-0 ceph --cluster ceph -s -f json | python -c 'import sys, json; print(json.load(sys.stdin)["osdmap"]["osdmap"]["num_up_osds"])')"
  delta: '0:00:01.748742'
  end: '2019-05-10 18:39:48.793998'
  rc: 1
  start: '2019-05-10 18:39:47.045256'

NO MORE HOSTS LEFT *************************************************************

PLAY RECAP *********************************************************************
ceph-0       : ok=102 changed=10 unreachable=0 failed=0 skipped=184 rescued=0 ignored=0
ceph-1       : ok=100 changed=10 unreachable=0 failed=0 skipped=179 rescued=0 ignored=0
ceph-2       : ok=101 changed=10 unreachable=0 failed=1 skipped=178 rescued=0 ignored=0
compute-0    : ok=31  changed=0  unreachable=0 failed=0 skipped=86  rescued=0 ignored=0
controller-0 : ok=189 changed=22 unreachable=0 failed=0 skipped=307 rescued=0 ignored=0

parameter_defaults:
  ContainerImagePrepare:
  - push_destination: true
    set:
      ceph_image: rhceph-4.0-rhel8
      ceph_namespace: docker-registry.upshift.redhat.com/ceph
      ceph_tag: latest

tar logs - https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/df/view/deployment/job/DFG-df-deployment-15-virthost-1cont_1comp_3ceph-no_UC_SSL-no_OC_SSL-ceph-ipv6-vlan-RHELOSP-31817/3/artifact/

(In reply to Artem Hrechanychenko from comment #8)
> Reproduced:

Artem,

We need to be careful here. It is very easy to reproduce this error for reasons other than the root cause of this bug, e.g. unclean disks or more OSDs than there is time to bring up. The new docker-registry.upshift.redhat.com/ceph/rhceph-4.0-rhel8:latest container does have a python command (the missing python command is the root cause of this bug; unfixed versions only had a python3 command). Feel free to launch the container and verify that you have a python command directly.

If you reproduce the issue and keep the system running, then please ping me and I will help you debug it on that live system. I don't doubt that you are seeing the issue you reported in #8; I just don't think THIS bug is the root cause, since I see the container has the necessary binary. Let's figure out why you're running into the issue you reported and go from there. Please ping me after you reproduce and keep the system running.

John

Agreed with John because this doesn't seem to be the same issue.
The original issue was related to the python command not being present in the rhceph 4 container (python3 only).
In your situation the python command is executed on the host:
> podman exec ceph-mon-controller-0 ceph --cluster ceph -s -f json | python -c 'import ....'
Only 'ceph --cluster ceph -s -f json' is executed in the container; the rest of the pipeline runs on the host.
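To make the split concrete, here is an illustrative sketch; the container name and the JSON one-liner are taken from the log above, and this is not the ceph-ansible task itself:

    # The host shell parses the pipe, so only the command given to podman exec
    # runs inside the container; the python one-liner runs on the host.
    podman exec ceph-mon-controller-0 ceph --cluster ceph -s -f json \
        | python -c 'import sys, json; print(json.load(sys.stdin)["osdmap"]["osdmap"]["num_osds"])'

    # If the whole pipeline had to run inside the container, it would have to be
    # wrapped in a shell there, and on an unfixed image it would need python3:
    podman exec ceph-mon-controller-0 sh -c \
        'ceph --cluster ceph -s -f json | python3 -c "import sys, json; print(json.load(sys.stdin)[\"osdmap\"][\"osdmap\"][\"num_osds\"])"'

In other words, the check in comment #8 depends on the python available on the host, not on the python inside the container.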
Confirmed that Artem had a DIFFERENT issue, related to IPv6. More details in https://bugzilla.redhat.com/show_bug.cgi?id=1710319. The fix for THIS bug (not related to IPv6, but related to python in the Ceph container) is ready to be tested with: docker-registry.upshift.redhat.com/ceph/rhceph-4.0-rhel8:latest

Verified

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0313