Bug 1566561 - Failure on deployment: "stderr: /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 29\\\"\"."
Summary: Failure on deployment: "stderr: /usr/bin/docker-current: Error response from ...
Keywords:
Status: CLOSED DUPLICATE of bug 1562035
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Saravanan KR
QA Contact: Yariv
URL:
Whiteboard:
Depends On: 1562035 1577745
Blocks:
Reported: 2018-04-12 14:06 UTC by Yolanda Robla
Modified: 2018-05-30 13:37 UTC (History)
CC List: 16 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-30 13:37:54 UTC
Target Upstream Version:


Attachments
sosreport on the controller that fails (14.63 MB, application/x-xz)
2018-04-12 14:52 UTC, Yolanda Robla
sosreport on the compute-0 that fails (11.80 MB, application/x-xz)
2018-04-12 15:47 UTC, Yolanda Robla

Description Yolanda Robla 2018-04-12 14:06:48 UTC
Description of problem:

I tried deploying OSP 12 on RHEL 7.5, and the deployment cannot complete. It fails with a docker error:

overcloud.AllNodesDeploySteps.ComputeDeployment_Step4.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 3d87c07c-ae48-44b6-b2f6-b47bcf6350dc
  status: CREATE_FAILED
  status_reason: |
    Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 2
  deploy_stdout: |
    ...
            "stderr: /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 29\\\"\".", 
            "stdout: 3297a0da1f515b589d875a21c70cbc148e5b6da8f4ae0a61576d2e963ced6776", 
            "stdout: e0f349bd96d1a30cd0135870d8dd1c7be7c724f5e1e2e850c33165e5fe5d045b"
        ]
    }
    	to retry, use: --limit @/var/lib/heat-config/heat-config-ansible/00cce50b-25c4-4ac6-85a3-7709aece00db_playbook.retry
    
    PLAY RECAP *********************************************************************
    localhost                  : ok=7    changed=2    unreachable=0    failed=1   
    
    (truncated, view all with --long)
  deploy_stderr: |

These are my docker packages:

python-docker-pycreds-1.10.6-3.el7.noarch
python-docker-py-1.10.6-3.el7.noarch
docker-client-1.13.1-58.git87f2fab.el7.x86_64
docker-common-1.13.1-58.git87f2fab.el7.x86_64
python-heat-agent-docker-cmd-1.4.0-1.el7ost.noarch
docker-rhel-push-plugin-1.13.1-58.git87f2fab.el7.x86_64
docker-1.13.1-58.git87f2fab.el7.x86_64

Comment 1 Yolanda Robla 2018-04-12 14:52:43 UTC
Created attachment 1420875
sosreport on the controller that fails

Comment 2 Alex Schultz 2018-04-12 15:33:44 UTC
Can you please provide the full output of 'openstack stack failures list overcloud -f yaml'? I'm not seeing the error in the sosreport, and I'm not sure the node that failed is the node provided in the sosreport.

Comment 3 Alex Schultz 2018-04-12 15:41:26 UTC
Actual error:

            "Error running ['docker', 'run', '--name', 'ceilometer_agent_compute', '--label', 'config_id=tripleo_step4', '--label', 'container_name=ceilometer_agent_compute', '--label', 'managed_by=paunch', '--label', 'config_data={\"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\", \"TRIPLEO_CONFIG_HASH=3b9ff94d55e51b37915cd2e78aea42de\"], \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/var/lib/kolla/config_files/ceilometer_agent_compute.json:/var/lib/kolla/config_files/config.json:ro\", \"/var/lib/config-data/puppet-generated/ceilometer/:/var/lib/kolla/config_files/src:ro\", \"/var/run/libvirt:/var/run/libvirt:ro\", \"/var/log/containers/ceilometer:/var/log/ceilometer\"], \"image\": \"registry.access.redhat.com/rhosp12/openstack-ceilometer-compute:latest\", \"net\": \"host\", \"restart\": \"always\", \"privileged\": false}', '--detach=true', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=TRIPLEO_CONFIG_HASH=3b9ff94d55e51b37915cd2e78aea42de', '--net=host', '--privileged=false', '--restart=always', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', 
'--volume=/var/lib/kolla/config_files/ceilometer_agent_compute.json:/var/lib/kolla/config_files/config.json:ro', '--volume=/var/lib/config-data/puppet-generated/ceilometer/:/var/lib/kolla/config_files/src:ro', '--volume=/var/run/libvirt:/var/run/libvirt:ro', '--volume=/var/log/containers/ceilometer:/var/log/ceilometer', 'registry.access.redhat.com/rhosp12/openstack-ceilometer-compute:latest']. [125]
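
The failing container name is buried inside the long paunch `docker run` argument list. A minimal sketch for pulling it out of saved failure output, assuming each failure is reported on a line of the form `Error running ['docker', 'run', '--name', '<container>', ...` as above (the function name `failed_container_names` is hypothetical, not part of any tool):

```shell
# failed_container_names: read failure output on stdin and print each
# container name that appears in an "Error running ['docker', 'run', ...]"
# line, followed by how many times it was seen.
failed_container_names() {
  sed -n "s/.*Error running \['docker', 'run', '--name', '\([^']*\)'.*/\1/p" |
    sort |
    uniq -c |
    awk '{print $2 ": " $1}'
}
```

For example, the output of the command requested in Comment 2 could be piped through it: `openstack stack failures list overcloud -f yaml | failed_container_names`. Adding `--long`, as the truncated output in the description hints, should avoid missing entries.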

Comment 4 Yolanda Robla 2018-04-12 15:47:40 UTC
Created attachment 1420917
sosreport on the compute-0 that fails

Comment 5 Alex Schultz 2018-04-12 15:51:20 UTC
So this is likely the same issue reported in Bug 1562035. It seems to be related to the SR-IOV configuration being attempted.

Comment 6 Alex Schultz 2018-04-12 15:55:09 UTC
Apr 12 13:42:35 overcloud-compute-0 dockerd-current[17824]: nsenter: failed to unshare namespaces: Invalid argument
Apr 12 13:42:35 overcloud-compute-0 oci-systemd-hook[28408]: systemdhook <debug>: 8ae658bd92fc: Skipping as container command is kolla_start, not init or systemd
Apr 12 13:42:35 overcloud-compute-0 oci-umount[28410]: umounthook <debug>: 8ae658bd92fc: only runs in prestart stage, ignoring
Apr 12 13:42:35 overcloud-compute-0 dockerd-current[17824]: container_linux.go:247: starting container process caused "process_linux.go:245: running exec setns process for init caused \"exit status 29\""
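
To check another node for the same signature, the two daemon messages above can be filtered out of its logs. A minimal sketch; the function name `filter_setns_failures` is hypothetical, and the patterns are taken only from the log lines quoted in this comment:

```shell
# filter_setns_failures: read log lines on stdin and print only those that
# carry the nsenter/unshare failure or the resulting setns exec error.
filter_setns_failures() {
  grep -E 'nsenter: failed to unshare namespaces|running exec setns process for init caused'
}
```

A possible use, assuming the daemon logs to the journal under a unit named `docker` (an assumption; adjust to the actual unit): `journalctl -u docker --no-pager | filter_setns_failures`.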

Comment 7 Steve Baker 2018-04-12 21:33:04 UTC
Could you please try the suggestion in this comment and reply with your results?

https://bugzilla.redhat.com/show_bug.cgi?id=1562035#c5

Comment 8 Yolanda Robla 2018-04-13 09:45:24 UTC
I tried, but it still fails in the same way

Comment 9 Yolanda Robla 2018-04-13 12:20:12 UTC
I can confirm it's an interaction with tuned. If I deploy without my tuned profile, the deploy succeeds. The code I used for tuned is:

              yum install -y tuned-profiles-cpu-partitioning

              tuned_conf_path="/etc/tuned/cpu-partitioning-variables.conf"
              if [ -n "$TUNED_CORES" ]; then
                grep -q "^isolated_cores" $tuned_conf_path
                if [ "$?" -eq 0 ]; then
                  sed -i 's/^isolated_cores=.*/isolated_cores=$TUNED_CORES/' $tuned_conf_path
                else
                  echo "isolated_cores=$TUNED_CORES" >> $tuned_conf_path
                fi
                tuned-adm profile cpu-partitioning
              fi

As soon as I disabled that change, the deploy worked.
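
For anyone reproducing this outside a template, here is a standalone sketch of the same `isolated_cores` update. It assumes the snippet above has `$TUNED_CORES` substituted by a templating step before the shell runs. In a plain shell script, the single-quoted `sed` expression would write the literal text `$TUNED_CORES`, so this version uses double quotes. The function name `set_isolated_cores` is hypothetical:

```shell
# set_isolated_cores CONF CORES
# Set isolated_cores=CORES in CONF, replacing an existing entry or appending
# one. Does nothing when CORES is empty.
# Assumes CORES holds only digits, commas and hyphens, so it is safe to embed
# in the sed expression. Uses GNU sed's -i option.
set_isolated_cores() {
  [ -n "$2" ] || return 0
  if grep -q '^isolated_cores=' "$1" 2>/dev/null; then
    sed -i "s/^isolated_cores=.*/isolated_cores=$2/" "$1"
  else
    echo "isolated_cores=$2" >> "$1"
  fi
}
```

With the path from the snippet above, a call would look like `set_isolated_cores /etc/tuned/cpu-partitioning-variables.conf "$TUNED_CORES"`, followed by `tuned-adm profile cpu-partitioning` as before.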

Comment 10 Gurenko Alex 2018-04-17 15:32:54 UTC
As with the related, referenced BZ, assigning to DF:NFV dradaz and marking as untriaged. Once the issue is diagnosed, it can be reassigned back to us if the fix is a change to the container config.

Comment 11 Luis Arizmendi 2018-04-19 06:29:38 UTC
I faced the same issue with the nova-compute container during Step4 on a DPDK node.

In my case, if you deploy again without changing anything, the deployment gets past that step (I had two DPDK compute nodes, so I had to run overcloud deploy 3 times to complete it).

I'm using RHEL 7.4 but the latest container image tags (12.0-20180405.1).

As additional info, I don't use the Mistral workflow to calculate derived parameters (I did it manually). I've also seen above that you include isolcpus in KernelArgs instead of using the IsolCpusList parameter (which I use in my deployment):

Parameters used:

    TunedProfileName: "cpu-partitioning"
    OvsPmdCoreList: 2,34,18,50
    NovaVcpuPinSet: 3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47,19,51,20,52,21,53,22,54,23,55,24,56,25,57,26,58,27,59,28,60,29,61,30,62,31,63
    IsolCpusList: 2,34,3,35,4,36,5,37,6,38,7,39,8,40,9,41,10,42,11,43,12,44,13,45,14,46,15,47,18,50,19,51,20,52,21,53,22,54,23,55,24,56,25,57,26,58,27,59,28,60,29,61,30,62,31,63
    OvsDpdkCoreList: 0,32,16,48
    OvsDpdkMemoryChannels: 4
    NovaReservedHostMemory: 28672
    OvsDpdkSocketMemory: 6144,1024
    KernelArgs: default_hugepagesz=1GB hugepagesz=1G hugepages=32 iommu=pt intel_iommu=on
    NeutronDatapathType: "netdev"
 

Error log:

"outputs.stdout_lines|default([])|union(outputs.stderr_lines|default([]))": [
            "Error running ['docker', 'run', '--name', 'nova_compute', '--label', 'config_id=tripleo_step4', '--label', 'container_name=nova_compute', '--label', 'managed_by=paunch', '--label', 'config_data={\"ipc\": \"host\", \"image\": \"192.168.128.11:8787/rhosp12/openstack-nova-compute:12.0-20180405.1\", \"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\", \"TRIPLEO_CONFIG_HASH=424fde46f2697dc3bb7f50f6a8ad3689\"], \"user\": \"nova\", \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/var/lib/kolla/config_files/nova_compute.json:/var/lib/kolla/config_files/config.json:ro\", \"/var/lib/config-data/puppet-generated/nova_libvirt/:/var/lib/kolla/config_files/src:ro\", \"/etc/ceph:/var/lib/kolla/config_files/src-ceph:ro\", \"/dev:/dev\", \"/lib/modules:/lib/modules:ro\", \"/etc/iscsi:/etc/iscsi\", \"/run:/run\", \"/var/lib/nova:/var/lib/nova:shared\", \"/var/lib/libvirt:/var/lib/libvirt\", \"/var/log/containers/nova:/var/log/nova\", \"/sys/class/net:/sys/class/net\", \"/sys/bus/pci:/sys/bus/pci\"], \"net\": \"host\", \"privileged\": true, \"restart\": \"always\"}', '--detach=true', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=TRIPLEO_CONFIG_HASH=424fde46f2697dc3bb7f50f6a8ad3689', '--net=host', '--ipc=host', '--privileged=true', '--restart=always', '--user=nova', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', '--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/kolla/config_files/nova_compute.json:/var/lib/kolla/config_files/config.json:ro', '--volume=/var/lib/config-data/puppet-generated/nova_libvirt/:/var/lib/kolla/config_files/src:ro', '--volume=/etc/ceph:/var/lib/kolla/config_files/src-ceph:ro', '--volume=/dev:/dev', '--volume=/lib/modules:/lib/modules:ro', '--volume=/etc/iscsi:/etc/iscsi', '--volume=/run:/run', '--volume=/var/lib/nova:/var/lib/nova:shared', '--volume=/var/lib/libvirt:/var/lib/libvirt', '--volume=/var/log/containers/nova:/var/log/nova', '--volume=/sys/class/net:/sys/class/net', '--volume=/sys/bus/pci:/sys/bus/pci', '192.168.128.11:8787/rhosp12/openstack-nova-compute:12.0-20180405.1']. [125]", 
            "", 
            "stdout: f2b2e112081e5c4a0e25c6b12ef3a75c4091126e2e29e470ab930e4537891c74", 
            "stderr: /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 29\\\"\".", 
            "stdout: 0bdb69b5c147966400d9a12757da3e5148eedc5e7eac6238ba639b29d889fc2e", 
            "stderr: ", 
            "stdout: 44535769e4197b6634834584fd3abb12bff7e103c27ec52f0ab4a870d5a9961f"
        ]
    }

Comment 12 Yolanda Robla 2018-05-10 08:27:51 UTC
This blocks fast forward upgrade for telcos as well. These telcos have had tuned profiles on their deployments from the start, and when I run fast forward upgrade on my computes, I cannot upgrade them.

Comment 13 Yolanda Robla 2018-05-14 09:56:47 UTC
Today I also hit it with tuned disabled. It seems to happen randomly across containers, but I mostly saw failures on crond, iscsid, nova-libvirt and neutron.

Comment 15 Saravanan KR 2018-05-30 13:37:54 UTC

*** This bug has been marked as a duplicate of bug 1562035 ***

