Red Hat Bugzilla – Bug 1562035
DPDK: Deployment fails intermittently only when tuned profile "cpu-partitioning" is activated
Last modified: 2018-07-24 02:25:08 EDT
Created attachment 1414748 [details]
stack failure log with tuned enabled deployment

Description of problem:
Deployment failed, with "docker run" failing intermittently with the error log below:
-----------------------------
"2018-03-21 06:57:02,719 ERROR: 12318 -- Failed running docker-puppet.py for iscsid",
"2018-03-21 06:57:02,719 ERROR: 12318 -- nsenter: failed to unshare namespaces: Invalid argument",
"container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 29\\\"\"",
"/usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 29\\\"\".",
"time=\"2018-03-21T06:57:02Z\" level=error msg=\"error getting events from daemon: net/http: request canceled\" ",
-----------------------------
* Happens only with RHEL; verified on CentOS, where it does not happen.
* It is not always the same docker image; the failure occurs randomly on different containers and at different steps of the deployment.
* Sometimes the deployment is successful when deployed again (stack update), but not always.
* Deployment is successful when the tuned profile is removed from the parameters (2 out of 2 passed).
From the logs, it looks similar to this known issue: https://github.com/moby/moby/issues/34971 - the reporter is on CentOS7, though. I wonder if the kernel parameters for namespaces in RHEL are different from CentOS.

Could you please give the versions of your rpms, and the command used to deploy, so we can try to reproduce?

Thanks
(In reply to Emilien Macchi from comment #1)
> From the logs, it looks similar to this known issue:
> https://github.com/moby/moby/issues/34971 - the reporter is on CentOS7
> though. I wonder if the kernel parameters for namespaces in RHEL are
> different from CentOS.
>
> Could you please give the version of your rpms? and the command used to
> deploy so we can try to reproduce.
>
> Thanks

Able to deploy successfully by reducing the docker-puppet process count to 1, with the Tuned profile enabled:

DockerPuppetProcessCount: 1

I am re-deploying a few times to ensure that it always works. I will update the BZ.
(In reply to Saravanan KR from comment #2)
> (In reply to Emilien Macchi from comment #1)
> Able to deploy successfully by reducing the docker puppet instances to 1 and
> with Tuned profile enabled.
>
> DockerPuppetProcessCount: 1
>
> I am re-deploying few times to ensure that it works always. I will update
> the BZ.

It failed with the same error on the 2nd run. Tested with the OSP-13 puddle 2018-03-16.1.

DPDK-specific parameters:
--------------------------------
parameter_defaults:
  ComputeOvsDpdkParameters:
    KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=64 intel_iommu=on iommu=pt isolcpus=10-87"
    TunedProfileName: "cpu-partitioning"
    IsolCpusList: "10-87"
    OvsDpdkSocketMemory: "2048,2048"
    OvsDpdkMemoryChannels: "4"
    NovaReservedHostMemory: 4096
    OvsPmdCoreList: "10,11,22,23"
    OvsDpdkCoreList: "0,1,2,3,4,5,6,7,8,9"
    NovaVcpuPinSet: ['12-21','24-87']
--------------------------------
Just to make a general comment: the purpose of docker-puppet.py is to use puppet to write configuration files, so it is optimised for concurrency. This means that if docker-puppet.py modifies any state on the host, there is a risk that a race will be triggered, resulting in unpredictable errors. Since you've tried with DockerPuppetProcessCount: 1, that may not be the cause of this particular issue, but it sounds like the host *is* being modified, and we should make the required change to do this in a separate "puppet apply" call rather than using docker-puppet.py.

Looking at https://github.com/moby/moby/issues/34971, I suppose we could be hitting some kind of namespace limit. I'll keep investigating.
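To illustrate the concurrency point above, here is a minimal sketch (not the actual docker-puppet.py source; the worker body and volume names are stand-ins) of the pattern docker-puppet.py uses: each config volume is handed to a multiprocessing pool sized by DockerPuppetProcessCount, so several "docker run" invocations can hit the host at once, and any worker step that mutates shared host state can race.

```python
# Illustrative sketch of docker-puppet.py's fan-out, not the real code.
import multiprocessing
import os


def apply_config(config_volume):
    # Stand-in for the real worker, which launches
    # "docker run --name docker-puppet-<config_volume> ..." and runs puppet
    # inside the container. Here we only report which process handled it.
    return (config_volume, os.getpid())


def run_all(config_volumes, process_count=3):
    # With process_count > 1 the workers run concurrently; this is the
    # behaviour that DockerPuppetProcessCount: 1 disables as a workaround.
    with multiprocessing.Pool(processes=process_count) as pool:
        return pool.map(apply_config, config_volumes)


if __name__ == '__main__':
    results = run_all(['iscsid', 'nova_libvirt', 'neutron'], process_count=3)
    print(sorted(v for v, _ in results))
```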
According to this[1], it might be worth adding this to your KernelArgs:

user_namespace.enable=1

Could you please try this with the DockerPuppetProcessCount: 1 parameter and post any failed stack log?

[1] https://github.com/opencontainers/runc/issues/1343
(In reply to Steve Baker from comment #5)
> According to this[1] it might be worth adding this to your KernelArgs.
>
> user_namespace.enable=1
>
> Could you please try this with the DockerPuppetProcessCount:1 parameter and
> post any failed stack log?
>
> [1] https://github.com/opencontainers/runc/issues/1343

Yes, I have tried with both the kernel args and the process count, but it doesn't help:
-------------------------------------
parameter_defaults:
  ComputeOvsDpdkParameters:
    KernelArgs: "default_hugepagesz=1GB hugepagesz=1G hugepages=64 intel_iommu=on iommu=pt isolcpus=10-87 user_namespace.enable=1"
    ExtraSysctlSettings:
      user.max_user_namespaces:
        value: 15000
-------------------------------------
It also started happening on RHEL 7.5, even with the latest docker packages in RHEL. This issue never occurs in CentOS-based deployments. If I disable TunedProfileName alone, then the deployment is successful.
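When testing KernelArgs changes like the one above, it is worth confirming that the argument actually made it onto the booted kernel command line. A small helper for that (an illustration for this discussion, not part of any OSP tooling) can parse the contents of /proc/cmdline from a deployed compute node:

```python
# Hypothetical helper: look up a kernel boot argument in a /proc/cmdline
# string. Returns the value of a key=value argument, True for a bare flag,
# or None if the argument is absent.


def kernel_arg(cmdline, name):
    for token in cmdline.split():
        if token == name:
            return True
        if token.startswith(name + '='):
            return token.split('=', 1)[1]
    return None


# Sample taken from the KernelArgs used in this bug.
sample = ("default_hugepagesz=1GB hugepagesz=1G hugepages=64 intel_iommu=on "
          "iommu=pt isolcpus=10-87 user_namespace.enable=1")
print(kernel_arg(sample, 'user_namespace.enable'))  # '1'
print(kernel_arg(sample, 'isolcpus'))               # '10-87'
```

On the node itself you would pass `open('/proc/cmdline').read()` instead of the sample string.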
Assigning to DF:NFV dradaz and marking as untriaged. Once the issue is diagnosed it can already be reassigned back to us if the fix is changing the container config.
*** Bug 1570412 has been marked as a duplicate of this bug. ***
On trying with puddle 2018-04-19.2, I am unable to reproduce this issue. Asked QE for confirmation. The root cause of the BZ is still unknown; all that is known is that it is related to user namespaces, but the solutions suggested above do not work. If it is reproduced again, we will have to work with the container and tuned teams to fix this issue. Moving it to ON_QA to ensure that QE validates this scenario on the latest puddle.
Verified with SRIOV with puddle 2018-04-26.1
The issue appears back in the latest 2018-05-04.1 puddle.
(In reply to Maxim Babushkin from comment #12)
> The issue appears back in the latest 2018-05-04.1 puddle.

Can you paste the error logs?

openstack stack failures list --long overcloud
Created attachment 1434177 [details] stack_failures_log
I face similar failures in paunch container start also on overcloud.AllNodesDeploySteps.ComputeOvsDpdkDeployment_Step4.0 "Error running ['docker', 'run', '--name', 'ceilometer_agent_compute', '--label', 'config_id=tripleo_step4', '--label', 'container_name=ceilometer_agent_compute', '--label', 'managed_by=paunch', '--label', 'config_data={\"environment\": [\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\", \"TRIPLEO_CONFIG_HASH=6b3b5b2d9dd55807977f98e19c7c715d\"], \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/var/lib/kolla/config_files/ceilometer_agent_compute.json:/var/lib/kolla/config_files/config.json:ro\", \"/var/lib/config-data/puppet-generated/ceilometer/:/var/lib/kolla/config_files/src:ro\", \"/var/run/libvirt:/var/run/libvirt:ro\", \"/var/log/containers/ceilometer:/var/log/ceilometer\"], \"image\": \"192.0.10.1:8787/rhosp13/openstack-ceilometer-compute:2018-05-07.2\", \"net\": \"host\", \"restart\": \"always\", \"privileged\": false}', '--detach=true', '--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS', '--env=TRIPLEO_CONFIG_HASH=6b3b5b2d9dd55807977f98e19c7c715d', '--net=host', '--privileged=false', '--restart=always', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro', '--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro', '--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro', '--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro', '--volume=/dev/log:/dev/log', 
'--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro', '--volume=/etc/puppet:/etc/puppet:ro', '--volume=/var/lib/kolla/config_files/ceilometer_agent_compute.json:/var/lib/kolla/config_files/config.json:ro', '--volume=/var/lib/config-data/puppet-generated/ceilometer/:/var/lib/kolla/config_files/src:ro', '--volume=/var/run/libvirt:/var/run/libvirt:ro', '--volume=/var/log/containers/ceilometer:/var/log/ceilometer', '192.0.10.1:8787/rhosp13/openstack-ceilometer-compute:2018-05-07.2']. [125]", "stdout: 811a9940b8afa258ccb467cd765a9b24edbce059e106ef125dc1a79686df3e32", "stderr: /usr/bin/docker-current: Error response from daemon: oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:245: running exec setns process for init caused \\\"exit status 29\\\"\".",
We're seeing this issue also in Pike.
Reproduced it standalone by running the docker-puppet script directly on the compute nodes. Out of 5 trials, the same error occurred once. Trying to figure out whether it can be reproduced directly with a docker command.

export PROCESS_COUNT=3
export NET_HOST='true'
python /var/lib/docker-puppet/docker-puppet.py

http://paste.openstack.org/show/720984/
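When looping a standalone reproduction like the one above, it helps to classify each failure by its stderr, so the intermittent setns error can be counted separately from unrelated failures. A hypothetical helper for that (the pattern is taken from the attached stack failure logs; the function itself is not part of any existing tooling):

```python
# Classify a failed "docker run" by whether its stderr matches the setns
# "exit status 29" signature reported in this bug.
import re

# Signature seen in the attached logs:
#   running exec setns process for init caused \"exit status 29\"
SETNS_PATTERN = re.compile(
    r'running exec setns process for init caused .*exit status 29')


def is_setns_failure(stderr_text):
    return bool(SETNS_PATTERN.search(stderr_text))


# Sample stderr line copied from the logs in this bug.
sample = ('/usr/bin/docker-current: Error response from daemon: oci runtime '
          'error: container_linux.go:247: starting container process caused '
          '"process_linux.go:245: running exec setns process for init caused '
          '\\"exit status 29\\"".')
print(is_setns_failure(sample))  # True
```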
*** Bug 1578355 has been marked as a duplicate of this bug. ***
cat <<EOF>/home/stack/nsenter_error_tht.patch
diff --git a/docker/docker-puppet.py b/docker/docker-puppet.py
index 599ed14b..89e92ce4 100755
--- a/docker/docker-puppet.py
+++ b/docker/docker-puppet.py
@@ -275,6 +275,7 @@ def mp_puppet_config((config_volume, puppet_tags, manifest, config_image, volume
         pull_image(config_image)

     dcmd = ['/usr/bin/docker', 'run',
+            '--pid', 'host',
             '--user', 'root',
             '--name', 'docker-puppet-%s' % config_volume,
             '--env', 'PUPPET_TAGS=%s' % puppet_tags,
EOF
sudo patch -d /usr/share/openstack-tripleo-heat-templates/ -p1 < /home/stack/nsenter_error_tht.patch

sudo yum install libguestfs-tools -y
cd /home/stack/images
sudo su
mkdir -p mount_image
export LIBGUESTFS_BACKEND=direct
guestmount -a overcloud-full.qcow2 -m /dev/sda mount_image
patch -d mount_image/usr/lib/python2.7/site-packages/ -p1 < /home/stack/nsenter_error_paunch.patch
exit
openstack overcloud image upload --update-existing
cat <<EOF>/home/stack/nsenter_error_paunch.patch
diff --git a/paunch/builder/compose1.py b/paunch/builder/compose1.py
index 5472d37..a041bc0 100644
--- a/paunch/builder/compose1.py
+++ b/paunch/builder/compose1.py
@@ -144,6 +144,7 @@ class ComposeV1Builder(object):
         for v in cconfig.get('environment', []):
             if v:
                 cmd.append('--env=%s' % v)
+        cmd.append('--pid=host')
         if cconfig.get('remove', False):
             cmd.append('--rm')
         if cconfig.get('interactive', False):
EOF
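The effect of the paunch patch above can be sketched as follows. This is a simplified stand-in for the real ComposeV1Builder (the function name and the trimmed-down config handling are assumptions for illustration): the generated `docker run` argument list gains `--pid=host`, so the container shares the host PID namespace and docker never has to setns into a fresh one.

```python
# Simplified sketch of the patched command builder; not the real
# paunch ComposeV1Builder.


def build_docker_cmd(name, cconfig):
    cmd = ['docker', 'run', '--name', name]
    for v in cconfig.get('environment', []):
        if v:
            cmd.append('--env=%s' % v)
    cmd.append('--pid=host')  # the one-line workaround from the patch
    if cconfig.get('net'):
        cmd.append('--net=%s' % cconfig['net'])
    cmd.append(cconfig['image'])
    return cmd


# Container config values taken from the ceilometer_agent_compute log above.
cmd = build_docker_cmd('ceilometer_agent_compute', {
    'environment': ['KOLLA_CONFIG_STRATEGY=COPY_ALWAYS'],
    'net': 'host',
    'image': '192.0.10.1:8787/rhosp13/openstack-ceilometer-compute:2018-05-07.2',
})
print('--pid=host' in cmd)  # True
```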
*** Bug 1581711 has been marked as a duplicate of this bug. ***
*** Bug 1566561 has been marked as a duplicate of this bug. ***
Verified with kernel 3.10.0-862.6.3.el7.x86_64, with Tuned enabled. Please refer to the NFV CI (the THT change is already merged):
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/nfv/view/nfv/
As per comment #46, this issue has been verified by QE, and this kernel is available as part of the OSP13 19th July update. Moving to the verified state.
Closing it, as the required kernel version is already part of the 19th July 2018 release of OSP13.