Bug 1523043 - OC deployment fails with oci runtime error applying cgroup configuration
Summary: OC deployment fails with oci runtime error applying cgroup configuration
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-containers
Version: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: zstream
Target Release: 12.0 (Pike)
Assignee: Dan Prince
QA Contact: Omri Hochman
Docs Contact: Andrew Burden
URL:
Whiteboard:
Keywords: Triaged, ZStream
Duplicates: 1525229 1550588 (view as bug list)
Depends On: 1532586 1543575
Blocks:
 
Reported: 2017-12-07 04:39 UTC by Alexander Chuzhoy
Modified: 2018-06-15 22:18 UTC (History)
CC: 31 users

Clone Of:
Clones: 1543575 (view as bug list)
Last Closed: 2018-04-23 14:25:02 UTC




External Trackers
Tracker ID Priority Status Summary Last Updated
Launchpad 1744954 None None None 2018-01-31 13:54 UTC

Description Alexander Chuzhoy 2017-12-07 04:39:37 UTC
OC deployment with ipv6+vlan fails: invalid header field value \\"oci runtime error: container_linux.go:247

Environment:
openstack-tripleo-heat-templates-7.0.3-18.el7ost.noarch
openstack-puppet-modules-11.0.0-1.el7ost.noarch
instack-undercloud-7.4.3-5.el7ost.noarch

Steps to reproduce:
Attempt a deploy with ipv6:
openstack overcloud deploy --templates \
--libvirt-type kvm \
-e /home/stack/templates/nodes_data.yaml \
-e  /usr/share/openstack-tripleo-heat-templates/environments/ceph-ansible/ceph-ansible.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation-v6.yaml \
-e /home/stack/virt/network/network-environment-v6.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/enable-tls.yaml \
-e /home/stack/virt/public_vip.yaml \
-e /usr/share/openstack-tripleo-heat-templates/environments/ssl/tls-endpoints-public-ip.yaml \
-e /home/stack/inject-trust-anchor-hiera.yaml \
-e /home/stack/rhos12.yaml



Looking for errors in heat, I see the following:
 \"Error running ['docker', 'run', '--name', 'rabbitmq_image_tag', '--label', 'config_id=tripleo_step1', '--label', 'container_name=rabbitmq_image_tag', '--label', 'managed_by=paunch', '--label', 'config_data={\\"start_order\\": 1, \\"command\\": [\\"/bin/bash\\", \\"-c\\", \\"/usr/bin/docker tag \\'192.168.24.1:8787/rhosp12/openstack-rabbitmq:12.0-20171201.1\\' \\'192.168.24.1:8787/rhosp12/openstack-rabbitmq:pcmklatest\\'\\"], \\"user\\": \\"root\\", \\"volumes\\": [\\"/etc/hosts:/etc/hosts:ro\\", \\"/etc/localtime:/etc/localtime:ro\\", \\"/dev/shm:/dev/shm:rw\\", \\"/etc/sysconfig/docker:/etc/sysconfig/docker:ro\\", \\"/usr/bin:/usr/bin:ro\\", \\"/var/run/docker.sock:/var/run/docker.sock:rw\\"], \\"image\\": \\"192.168.24.1:8787/rhosp12/openstack-rabbitmq:12.0-20171201.1\\", \\"detach\\": false, \\"net\\": \\"host\\"}', '--net=host', '--user=root', '--volume=/etc/hosts:/etc/hosts:ro', '--volume=/etc/localtime:/etc/localtime:ro', '--volume=/dev/shm:/dev/shm:rw', '--volume=/etc/sysconfig/docker:/etc/sysconfig/docker:ro', '--volume=/usr/bin:/usr/bin:ro', '--volume=/var/run/docker.sock:/var/run/docker.sock:rw', '192.168.24.1:8787/rhosp12/openstack-rabbitmq:12.0-20171201.1', '/bin/bash', '-c', \\"/usr/bin/docker tag '192.168.24.1:8787/rhosp12/openstack-rabbitmq:12.0-20171201.1' '192.168.24.1:8787/rhosp12/openstack-rabbitmq:pcmklatest'\\"]. [125]\", 
 \"/usr/bin/docker-current: Error response from daemon: invalid header field value \\"oci runtime error: container_linux.go:247: starting container process caused \\\\"process_linux.go:258: applying cgroup configuration for process caused \\\\\\\\"write /sys/fs/cgroup/pids/system.slice/docker-0642d71adf65f90fac83693d33be8857e9b1c4a5c69254357ea04fdeadf10c49.scope/cgroup.procs: no such device\\\\\\\\"\\\\"\\n\\".\", 


Checking the logs on controller I see the following error message:
Dec 06 22:26:25 overcloud-controller-0 oci-umount[33118]: umounthook <error>: 3fa2cdcfe1e6: Failed to read directory /usr/share/oci-umount/oci-umount.d: No such file or directory

Comment 1 Martin André 2017-12-07 09:00:22 UTC
The "umounthook <error>: 40d5622b04b3: Failed to read directory /usr/share/oci-umount/oci-umount.d: No such file or directory" error is certainly concerning but I do not think that is the cause of the issue we're seeing since it also shows up on "healthy" nodes.

According to https://github.com/moby/moby/issues/17653 there seems to be a known issue with the systemd cgroups driver, and it is now recommended to use the cgroupfs driver (with native.cgroupdriver=cgroupfs), but I have absolutely no idea of the implications.
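Presumably that switch would mean something like the following in /etc/sysconfig/docker (an untested sketch; the other OPTIONS flags shown are illustrative, not copied from these nodes):

```
# /etc/sysconfig/docker -- hypothetical, untested sketch
# Append the exec-opt to whatever OPTIONS the host already sets,
# then restart the docker service:
OPTIONS='--log-driver=journald --exec-opt native.cgroupdriver=cgroupfs'
```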

Some more info:

[heat-admin@overcloud-controller-2 ~]$ sudo docker ps -a
CONTAINER ID        IMAGE                                                           COMMAND                  CREATED             STATUS                    PORTS               NAMES
0fc1ea1995c2        192.168.24.1:8787/rhosp12/openstack-mariadb:12.0-20171201.1     "/bin/bash -c '/usr/b"   10 hours ago        Exited (0) 10 hours ago                       mysql_image_tag
d953b2a5373b        192.168.24.1:8787/rhosp12/openstack-memcached:12.0-20171201.1   "/bin/bash -c 'source"   10 hours ago        Up 10 hours                                   memcached
fa989369dbf3        192.168.24.1:8787/rhosp12/openstack-haproxy:12.0-20171201.1     "/bin/bash -c '/usr/b"   10 hours ago        Exited (0) 10 hours ago                       haproxy_image_tag
cb7c0b3c844c        192.168.24.1:8787/rhosp12/openstack-mariadb:12.0-20171201.1     "bash -ecx 'if [ -e /"   10 hours ago        Exited (0) 10 hours ago                       mysql_bootstrap
ad27bd9c00d5        192.168.24.1:8787/rhosp12/openstack-redis:12.0-20171201.1       "/bin/bash -c '/usr/b"   10 hours ago        Exited (0) 10 hours ago                       redis_image_tag
0642d71adf65        192.168.24.1:8787/rhosp12/openstack-rabbitmq:12.0-20171201.1    "/bin/bash -c '/usr/b"   10 hours ago        Created                                       rabbitmq_image_tag
b6933a2f5745        192.168.24.1:8787/rhosp12/openstack-rabbitmq:12.0-20171201.1    "kolla_start"            10 hours ago        Exited (0) 10 hours ago                       rabbitmq_bootstrap
25bea91ba36c        192.168.24.1:8787/rhosp12/openstack-memcached:12.0-20171201.1   "/bin/bash -c 'source"   10 hours ago        Exited (0) 10 hours ago                       memcached_init_logs
a46e74f2f80a        192.168.24.1:8787/rhosp12/openstack-mariadb:12.0-20171201.1     "chown -R mysql: /var"   10 hours ago        Exited (0) 10 hours ago                       mysql_data_ownership

[heat-admin@overcloud-controller-2 ~]$ sudo docker logs rabbitmq_image_tag                                                                                                                                           
container_linux.go:247: starting container process caused "process_linux.go:258: applying cgroup configuration for process caused \"write /sys/fs/cgroup/pids/system.slice/docker-0642d71adf65f90fac83693d33be8857e9b1c4a5c69254357ea04fdeadf10c49.scope/cgroup.procs: no such device\""

[heat-admin@overcloud-controller-2 ~]$ sudo docker info
Containers: 9
 Running: 1
 Paused: 0
 Stopped: 8
Images: 18
Server Version: 1.12.6
Storage Driver: overlay2
 Backing Filesystem: xfs
 Native Overlay Diff: true
Logging Driver: journald
Cgroup Driver: systemd
Plugins:
 Volume: local
 Network: null host bridge overlay
 Authorization: rhel-push-plugin
Swarm: inactive
Runtimes: docker-runc runc
Default Runtime: docker-runc
Security Options: seccomp
Kernel Version: 3.10.0-693.11.1.el7.x86_64
Operating System: Red Hat Enterprise Linux Server 7.4 (Maipo)
OSType: linux
Architecture: x86_64
Number of Docker Hooks: 3
CPUs: 8
Total Memory: 31.26 GiB
Name: overcloud-controller-2
ID: Z7Y7:QR7R:Q35Z:JN57:JT5B:Q4CB:QLKM:4TFY:L54O:P23C:TDJE:4RKR
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://registry.access.redhat.com/v1/
Insecure Registries:
 192.168.24.1:8787
 127.0.0.0/8
Registries: registry.access.redhat.com (secure), docker.io (secure)

Comment 2 Yurii Prokulevych 2017-12-07 09:42:58 UTC
Got the same issue during a minor update
...
     u'        "Trying to pull repository 192.168.24.1:8787/rhosp12/openstack-ceilometer-notification-docker ... ", ',
     u'        "12.0-20171201.1: Pulling from 192.168.24.1:8787/rhosp12/openstack-ceilometer-notification-docker", ',
     u'        "243dc7b9e786: Already exists", ',
     u'        "550516fb1c76: Already exists", ',
     u'        "d0b13a963636: Already exists", ',
     u'        "9e15370858a9: Already exists", ',
     u'        "5b5d4699b9fb: Already exists", ',
     u'        "a554773d8409: Pulling fs layer", ',
     u'        "a554773d8409: Verifying Checksum", ',
     u'        "a554773d8409: Download complete", ',
     u'        "a554773d8409: Pull complete", ',
     u'        "Digest: sha256:4c793db2cbaaa8d506e5ae46ce3b3a77f2e8a3230021815f6152e7253bb966fd", ',
     u'        "Error running [\'docker\', \'run\', \'--name\', \'horizon\', \'--label\', \'config_id=tripleo_step3\', \'--label\', \'container_name=horizon\', \'--label\', \'managed_by=paunch\', \'--label\', \'conf
    ig_data={\\"environment\\": [\\"KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\\", \\"ENABLE_IRONIC=yes\\", \\"ENABLE_MANILA=yes\\", \\"ENABLE_SAHARA=yes\\", \\"ENABLE_CLOUDKITTY=no\\", \\"ENABLE_FREEZER=no\\", \\"ENABLE_FWA
    AS=no\\", \\"ENABLE_KARBOR=no\\", \\"ENABLE_DESIGNATE=no\\", \\"ENABLE_MAGNUM=no\\", \\"ENABLE_MISTRAL=no\\", \\"ENABLE_MURANO=no\\", \\"ENABLE_NEUTRON_LBAAS=no\\", \\"ENABLE_SEARCHLIGHT=no\\", \\"ENABLE_SENLIN=
    no\\", \\"ENABLE_SOLUM=no\\", \\"ENABLE_TACKER=no\\", \\"ENABLE_TROVE=no\\", \\"ENABLE_WATCHER=no\\", \\"ENABLE_ZAQAR=no\\", \\"ENABLE_ZUN=no\\", \\"TRIPLEO_CONFIG_HASH=00aefaf228b0ca7aa445b3952d87fbca\\"], \\"v
    olumes\\": [\\"/etc/hosts:/etc/hosts:ro\\", \\"/etc/localtime:/etc/localtime:ro\\", \\"/etc/puppet:/etc/puppet:ro\\", \\"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\\", \\"/etc/pki/tls/certs/ca-bu
    ndle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\\", \\"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\\", \\"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\\", \\"/dev/log:/dev/log
    \\", \\"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\\", \\"/var/lib/kolla/config_files/horizon.json:/var/lib/kolla/config_files/config.json:ro\\", \\"/var/lib/config-data/puppet-generated/horizon/:/var/
    lib/kolla/config_files/src:ro\\", \\"/var/log/containers/horizon:/var/log/horizon\\", \\"/var/log/containers/httpd/horizon:/var/log/httpd\\", \\"\\", \\"\\"], \\"image\\": \\"192.168.24.1:8787/rhosp12/openstack-
    horizon-docker:12.0-20171201.1Waiting for messages on queue '29bb1c58-b3b9-47f3-ac49-b459e0374747' with no timeout.
    inor update failed with: {u'status': u'FAILED', u'execution': {u'name': u'tripleo.package_update.v1.update_nodes', u'created_at': u'2017-12-07 08:59:23', u'id': u'2182486a-e6eb-4c14-8700-ea7d582c962f', u'params
    ': {u'namespace': u''}, u'input': {u'inventory_file': u"[undercloud]\nlocalhost\n\n[undercloud:vars]\nusername = admin\novercloud_keystone_url = https://10.0.0.101:13000/v2.0\nproject_name = admin\nundercloud_se
    rvice_list = ['openstack-nova-compute', 'openstack-heat-engine', 'openstack-ironic-conductor', 'openstack-swift-container', 'openstack-swift-object', 'openstack-mistral-engine']\novercloud_horizon_url = https://
    10.0.0.101:443/dashboard\nos_auth_token = gAAAAABaKQLl-DlR0O8mB56fcrYZ0ruDdEOnpcNw1K58ExR_BJCIDQMZmP7HoqIIzgltYinKc4zxYTUVPzxYfc3dQlmzuYRC4MKH6kajYrYLraqsVsGzB0AfXp-YkUkF2zmHbIuZN2WpWuXY8mUhpdBBKOy-nAIp79c-ykE1E
    IXSQKNwhywb4c4\novercloud_admin_password = Vd3cvaRM9W2h2UrFpfhERmB9K\nauth_url = https://192.168.24.2:13000/\nansible_connection = local\nundercloud_swift_url = https://192.168.24.2:13808/v1/AUTH_6a052c6cf58a4a6
    0a3c6e4519477f42d\nplan = overcloud\n\n[controller-0]\n192.168.24.15\n\n[controller-0:vars]\ndeploy_server_id = fdcf3353-4513-4c23-8607-7143ca197ca9\n\n[controller-1]\n192.168.24.19\n\n[controller-1:vars]\ndeplo
    y_server_id = 9d97088d-b495-4a31-9c33-51ab0be09eec\n\n[controller-2]\n192.168.24.14\n\n[controller-2:vars]\ndeploy_server_id = 73bc306c-214a-45c1-8a9b-4ba37e9ab213\n\n[Controller:vars]\nrole_name = Controller\na
    nsible_ssh_user = heat-admin\nbootstrap_server_id = fdcf3353-4513-4c23-8607-7143ca197ca9\n\n[Controller:children]\ncontroller-0\ncontroller-1\ncontroller-2\n\n[compute-0]\n192.168.24.11\n\n[compute-0:vars]\ndepl
    oy_server_id = 6836ca7e-3f8e-43ef-91dd-50be4124379c\n\n[compute-1]\n192.168.24.7\n\n[compute-1:vars]\ndeploy_server_id = 5712379a-44ed-4075-ab7e-cb1c53487725\n\n[Compute:vars]\nrole_name = Compute\nansible_ssh_u
    ser = heat-admin\nbootstrap_server_id = fdcf3353-4513-4c23-8607-7143ca197ca9\n\n[Compute:children]\ncompute-0\ncompute-1\n\n[ceph-0]\n192.168.24.9\n\n[ceph-0:vars]\ndeploy_server_id = a63f8d94-7aad-46b6-8fc2-bd4
    ff501bbeb\n\n[ceph-1]\n192.168.24.16\n\n[ceph-1:vars]\ndeploy_server_id = 4e9a327a-1948-47fc-927b-8e58e0d19aff\n\n[ceph-2]\n192.168.24.10\n\n[ceph-2:vars]\ndeploy_server_id = 6fe90fbb-36b8-4fe8-8e79-3b844a7bd8bc
    \n\n[CephStorage:vars]\nrole_name = CephStorage\nansible_ssh_user = heat-admin\nbootstrap_server_id = fdcf3353-4513-4c23-8607-7143ca197ca9\n\n[CephStorage:children]\nceph-0\nceph-1\nceph-2\n\n[overcloud:children
    ]\nCephStorage\nCompute\nController\n\n[aodh_evaluator:vars]\nansible_ssh_user = heat-admin\n\n[aodh_evaluator:children]\nController\n\n[kernel:vars]\nansible_ssh_user = heat-admin\n\n[kernel:children]\nCephStor
    age\nCompute\nController\n\n[neutron_metadata:vars]\nansible_ssh_user = heat-admin\n\n[neutron_metadata:children]\nController\n\n[pacemaker:vars]\nansible_ssh_user = heat-admin\n\n[pacemaker:children]\nControlle
    r\n\n[nova_placement:vars]\nansible_ssh_user = heat-admin\n\n[nova_placement:children]\nController\n\n[snmp:vars]\nansible_ssh_user = heat-admin\n\n[snmp:children]\nCephStorage\nCompute\nController\n\n[heat_api:
    vars]\nansible_ssh_user = heat-admin\n\n[heat_api:children]\nController\n\n[cinder_api:vars]\nansible_ssh_user = heat-admin\n\n[cinder_api:children]\nController\n\n[ceph_client:vars]\nansible_ssh_user = heat-adm
    in\n\n[ceph_client:children]\nCompute\n\n[ceph_mon:vars]\nansible_ssh_user = heat-admin\n\n[ceph_mon:children]\nController\n\n[aodh_listener:vars]\nansible_ssh_user = heat-admin\n\n[aodh_listener:children]\nCont
    roller\n\n[swift_ringbuilder:vars]\nansible_ssh_user = heat-admin\n\n[swift_ringbuilder:children]\nController\n\n[neutron_dhcp:vars]\nansible_ssh_user = heat-admin\n\n[neutron_dhcp:children]\nController\n\n[gnoc
    chi_api:vars]\nansible_ssh_user = heat-admin\n\n[gnocchi_api:children]\nController\n\n[timezone:vars]\nansible_ssh_user = heat-admin\n\n[timezone:children]\nCephStorage\nCompute\nController\n\n[ceilometer_agent_
    central:vars]\nansible_ssh_user = heat-admin\n\n[ceilometer_agent_central:children]\nController\n\n[heat_api_cloudwatch_disabled:vars]\nansible_ssh_user = heat-admin\n\n[heat_api_cloudwatch_disabled:children]\nC
    ontroller\n\n[aodh_notifier:vars]\nansible_ssh_user = heat-admin\n\n[aodh_notifier:children]\nController\n\n[tripleo_firewall:vars]\nansible_ssh_user = heat-admin\n\n[tripleo_firewall:children]\nCephStorage\nCom
    pute\nController\n\n[swift_storage:vars]\nansible_ssh_user = heat-admin\n\n[swift_storage:children]\nController\n\n[redis:vars]\nansible_ssh_user = heat-admin\n\n[redis:children]\nController\n\n[gnocchi_statsd:v
    ars]\nansible_ssh_user = heat-admin\n\n[gnocchi_statsd:children]\nController\n\n[iscsid:vars]\nansible_ssh_user = heat-admin\n\n[iscsid:children]\nCompute\nController\n\n[nova_conductor:vars]\nansible_ssh_user =
     heat-admin\n\n[nova_conductor:children]\nController\n\n[mysql_client:vars]\nansible_ssh_user = heat-admin\n\n[mysql_client:children]\nCephStorage\nCompute\nController\n\n[nova_consoleauth:vars]\nansible_ssh_use
    r = heat-admin\n\n[nova_consoleauth:children]\nController\n\n[glance_api:vars]\nansible_ssh_user = heat-admin\n\n[glance_api:children]\nController\n\n[keystone:vars]\nansible_ssh_user = heat-admin\n\n[keystone:c
    hildren]\nController\n\n[cinder_volume:vars]\nansible_ssh_user = heat-admin\n\n[cinder_volume:children]\nController\n\n[ceilometer_collector_disabled:vars]\nansible_ssh_user = heat-admin\n\n[ceilometer_collector
    _disabled:children]\nController\n\n[ceilometer_agent_notification:vars]\nansible_ssh_user = heat-admin\n\n[ceilometer_agent_notification:children]\nController\n\n[memcached:vars]\nansible_ssh_user = heat-admin\n
    \n[memcached:children]\nController\n\n[haproxy:vars]\nansible_ssh_user = heat-admin\n\n[haproxy:children]\nController\n\n[mongodb_disabled:vars]\nansible_ssh_user = heat-admin\n\n[mongodb_disabled:children]\nCon
    troller\n\n[neutron_plugin_ml2:vars]\nansible_ssh_user = heat-admin\n\n[neutron_plugin_ml2:children]\nCompute\nController\n\n[nova_api:vars]\nansible_ssh_user = heat-admin\n\n[nova_api:children]\nController\n\n[
    aodh_api:vars]\nansible_ssh_user = heat-admin\n\n[aodh_api:children]\nController\n\n[nova_metadata:vars]\nansible_ssh_user = heat-admin\n\n[nova_metadata:children]\nController\n\n[heat_engine:vars]\nansible_ssh_
    user = heat-admin\n\n[heat_engine:children]\nController\n\n[ntp:vars]\nansible_ssh_user = heat-admin\n\n[ntp:children]\nCephStorage\nCompute\nController\n\n[ceilometer_expirer_disabled:vars]\nansible_ssh_user =
    heat-admin\n\n[ceilometer_expirer_disabled:children]\nController\n\n[ceilometer_api_disabled:vars]\nansible_ssh_user = heat-admin\n\n[ceilometer_api_disabled:children]\nController\n\n[nova_migration_target:vars]
    \nansible_ssh_user = heat-admin\n\n[nova_migration_target:children]\nCompute\n\n[cinder_scheduler:vars]\nansible_ssh_user = heat-admin\n\n[cinder_scheduler:children]\nController\n\n[gnocchi_metricd:vars]\nansibl
    e_ssh_user = heat-admin\n\n[gnocchi_metricd:children]\nController\n\n[tripleo_packages:vars]\nansible_ssh_user = heat-admin\n\n[tripleo_packages:children]\nCephStorage\nCompute\nController\n\n[nova_scheduler:vars]\nansible_ssh_user = heat-admin\n\n[nova_scheduler:children]\nController\n\n[nova_compute:vars]\nansible_ssh_user = heat-admin\n\n[nova_compute:children]\nCompute\n\n[ceph_osd:vars]\nansible_ssh_user = heat-admin\n\n[ceph_osd:children]\nCephStorage\n\n[logrotate_crond:vars]\nansible_ssh_user = heat-admin\n\n[logrotate_crond:children]\nCephStorage\nCompute\nController\n\n[neutron_ovs_agent:vars]\nansible_ssh_user = heat-admin\n\n[neutron_ovs_agent:children]\nCompute\nController\n\n[swift_proxy:vars]\nansible_ssh_user = heat-admin\n\n[swift_proxy:children]\nController\n\n[sshd:vars]\nansible_ssh_user = heat-admin\n\n[sshd:children]\nCephStorage\nCompute\nController\n\n[mysql:vars]\nansible_ssh_user = heat-admin\n\n[mysql:children]\nController\n\n[ceilometer_agent_compute:vars]\nansible_ssh_user = heat-admin\n\n[ceilometer_agent_compute:children]\nCompute\n\n[neutron_l3:vars]\nansible_ssh_user = heat-admin\n\n[neutron_l3:children]\nController\n\n[nova_libvirt:vars]\nansible_ssh_user = heat-admin\n\n[nova_libvirt:children]\nCompute\n\n[rabbitmq:vars]\nansible_ssh_user = heat-admin\n\n[rabbitmq:children]\nController\n\n[tuned:vars]\nansible_ssh_user = heat-admin\n\n[tuned:children]\nCephStorage\nCompute\nController\n\n[panko_api:vars]\nansible_ssh_user = heat-admin\n\n[panko_api:children]\nController\n\n[horizon:vars]\nansible_ssh_user = heat-admin\n\n[horizon:children]\nController\n\n[neutron_api:vars]\nansible_ssh_user = heat-admin\n\n[neutron_api:children]\nController\n\n[ca_certs:vars]\nansible_ssh_user = heat-admin\n\n[ca_certs:children]\nCephStorage\nCompute\nController\n\n[heat_api_cfn:vars]\nansible_ssh_user = heat-admin\n\n[heat_api_cfn:children]\nController\n\n[docker:vars]\nansible_ssh_user = 
heat-admin\n\n[docker:children]\nCephStorage\nCompute\nController\n\n[nova_vnc_proxy:vars]\nansible_ssh_user = heat-admin\n\n[nova_vnc_proxy:children]\nController\n\n[clustercheck:vars]\nansible_ssh_user = heat-admin\n\n[clustercheck:children]\nController\n\n", u'queue_name': u'29bb1c58-b3b9-47f3-ac49-b459e0374747', u'playbook': u'update_steps_playbook.yaml', u'ansible_extra_env_variables': {u'ANSIBLE_HOST_KEY_CHECKING': u'False'}, u'module_path': u'/usr/share/ansible-modules', u'nodes': u'Controller', u'node_user': u'heat-admin', u'ansible_queue_name': u'update'}, u'spec': {u'tasks': {u'node_update': {u'name': u'node_update', u'on-error': u'node_update_failed', u'on-success': [{u'node_update_passed': u'<% task().result.returncode = 0 %>'}, {u'node_update_failed': u'<% task().result.returncode != 0 %>'}], u'publish': {u'output': u'<% task(node_update).result %>'}, u'version': u'2.0', u'action': u'tripleo.ansible-playbook', u'input': {u'remote_user': u'<% $.node_user %>', u'become_user': u'root', u'ssh_private_key': u'<% $.private_key %>', u'verbosity': 0, u'queue_name': u'<% $.ansible_queue_name %>', u'extra_env_variables': u'<% $.ansible_extra_env_variables %>', u'inventory': u'<% $.inventory_file %>', u'module_path': u'<% $.module_path %>', u'become': True, u'limit_hosts': u'<% $.nodes %>', u'playbook': u'<% $.tmp_path %>/<% $.playbook %>'}, u'type': u'direct'}, u'get_private_key': {u'name': u'get_private_key', u'on-success': u'node_update', u'publish': {u'private_key': u'<% task(get_private_key).result %>'}, u'version': u'2.0', u'action': u'tripleo.validations.get_privkey', u'type': u'direct'}, u'node_update_failed': {u'version': u'2.0', u'type': u'direct', u'name': u'node_update_failed', u'publish': {u'status': u'FAILED', u'message': u'Failed to update nodes - <% $.nodes %>, please see the logs.'}, u'on-success': u'notify_zaqar'}, u'node_update_passed': {u'version': u'2.0', u'type': u'direct', u'name': u'node_update_passed', u'publish': {u'status': u'SUCCESS', 
u'message': u'Updated nodes - <% $.nodes %>'}, u'on-success': u'notify_zaqar'}, u'notify_zaqar': {u'retry': u'count=5 delay=1', u'name': u'notify_zaqar', u'on-success': [{u'fail': u'<% $.get(\'status\') = "FAILED" %>'}], u'version': u'2.0', u'action': u'zaqar.queue_post', u'input': {u'queue_name': u'<% $.queue_name %>', u'messages': {u'body': {u'type': u'tripleo.package_update.v1.update_nodes', u'payload': {u'status': u'<% $.status %>', u'execution': u'<% execution() %>'}}}}, u'type': u'direct'}, u'download_config': {u'name': u'download_config', u'on-error': u'node_update_failed', u'on-success': u'get_private_key', u'publish': {u'tmp_path': u'<% task(download_config).result %>'}, u'version': u'2.0', u'action': u'tripleo.config.download_config', u'type': u'direct'}}, u'name': u'update_nodes', u'tags': [u'tripleo-common-managed'], u'version': u'2.0', u'input': [{u'node_user': u'heat-admin'}, u'nodes', u'playbook', u'inventory_file', {u'queue_name': u'tripleo'}, {u'ansible_queue_name': u'tripleo'}, {u'module_path': u'/usr/share/ansible-modules'}, {u'ansible_extra_env_variables': {u'ANSIBLE_HOST_KEY_CHECKING': u'False'}}], u'description': u'Take a container and perform an update nodes by nodes'}}}
    \\", \\"net\\": \\"host\\", \\"restart\\": \\"always\\", \\"privileged\\": false}\', \'--detach=true\', \'--env=KOLLA_CONFIG_STRATEGY=COPY_ALWAYS\', \'--env=ENABLE_IRONIC=yes\', \'--env=ENABLE_MANILA=yes\', \'--env=ENABLE_SAHARA=yes\', \'--env=ENABLE_CLOUDKITTY=no\', \'--env=ENABLE_FREEZER=no\', \'--env=ENABLE_FWAAS=no\', \'--env=ENABLE_KARBOR=no\', \'--env=ENABLE_DESIGNATE=no\', \'--env=ENABLE_MAGNUM=no\', \'--env=ENABLE_MISTRAL=no\', \'--env=ENABLE_MURANO=no\', \'--env=ENABLE_NEUTRON_LBAAS=no\', \'--env=ENABLE_SEARCHLIGHT=no\', \'--env=ENABLE_SENLIN=no\', \'--env=ENABLE_SOLUM=no\', \'--env=ENABLE_TACKER=no\', \'--env=ENABLE_TROVE=no\', \'--env=ENABLE_WATCHER=no\', \'--env=ENABLE_ZAQAR=no\', \'--env=ENABLE_ZUN=no\', \'--env=TRIPLEO_CONFIG_HASH=00aefaf228b0ca7aa445b3952d87fbca\', \'--net=host\', \'--privileged=false\', \'--restart=always\', \'--volume=/etc/hosts:/etc/hosts:ro\', \'--volume=/etc/localtime:/etc/localtime:ro\', \'--volume=/etc/puppet:/etc/puppet:ro\', \'--volume=/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\', \'--volume=/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\', \'--volume=/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\', \'--volume=/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\', \'--volume=/dev/log:/dev/log\', \'--volume=/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\', \'--volume=/var/lib/kolla/config_files/horizon.json:/var/lib/kolla/config_files/config.json:ro\', \'--volume=/var/lib/config-data/puppet-generated/horizon/:/var/lib/kolla/config_files/src:ro\', \'--volume=/var/log/containers/horizon:/var/log/horizon\', \'--volume=/var/log/containers/httpd/horizon:/var/log/httpd\', \'192.168.24.1:8787/rhosp12/openstack-horizon-docker:12.0-20171201.1\']. [125]", ',
     u'        "stdout: 6b51b137c2539503f5c7403600390a32fe5ad1b4ff3551349d80ae235353923e", ',
     u'        "stderr: /usr/bin/docker-current: Error response from daemon: invalid header field value \\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\"process_linux.go:258: applying cgroup configuration for process caused \\\\\\\\\\\\\\"write /sys/fs/cgroup/pids/system.slice/docker-6b51b137c2539503f5c7403600390a32fe5ad1b4ff3551349d80ae235353923e.scope/cgroup.procs: no such device\\\\\\\\\\\\\\"\\\\\\"\\\\n\\".", ',
     u'        "stdout: 0b5d54fd2499da7608c08b3c834452266b22dd76c0d472260bf4c496b230335e", ',
     u'        "stderr: Unable to find image \'192.168.24.1:8787/rhosp12/openstack-swift-account-docker:12.0-20171201.1\' locally", ',


docker logs horizon
container_linux.go:247: starting container process caused "process_linux.go:258: applying cgroup configuration for process caused \"write /sys/fs/cgroup/pids/system.slice/docker-6b51b137c2539503f5c7403600390a32fe5ad1b4ff3551349d80ae235353923e.scope/cgroup.procs: no such device\""


Packages:
-------------------------------------------------------
docker-rhel-push-plugin-1.12.6-68.gitec8512b.el7.x86_64
python-docker-pycreds-1.10.6-3.el7.noarch
docker-common-1.12.6-68.gitec8512b.el7.x86_64
python-heat-agent-docker-cmd-1.4.0-1.el7ost.noarch
docker-client-1.12.6-68.gitec8512b.el7.x86_64
docker-1.12.6-68.gitec8512b.el7.x86_64
python-docker-py-1.10.6-3.el7.noarch

libcgroup-0.41-13.el7.x86_64
libcgroup-tools-0.41-13.el7.x86_64

Comment 3 Alexander Chuzhoy 2017-12-07 23:52:38 UTC
Not sure if related, but playing with docker I was able to run into the same error when I tried to start a container from an image built with a 'VOLUME /' instruction.

"[root@undercloud-0 docker]# docker run -d --name nisim exportedroot
26c65ccd8e6172185260f24899e1cf59b2e55a9913df427f107229378cb74216
/usr/bin/docker-current: Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"open /proc/self/fd: no such file or directory\"\n".
"
Once I set the VOLUME to a particular directory inside the image and rebuilt it, I was able to run containers from it.
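For illustration, a minimal hypothetical reproducer (the base image name is made up for the example):

```
# Hypothetical Dockerfile that triggers the 'open /proc/self/fd' failure above:
FROM rhel7
VOLUME /
# Changing the last line to a specific directory, e.g. 'VOLUME /data', and
# rebuilding avoids the error, per the observation above.
```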

Comment 4 Martin André 2017-12-08 13:04:26 UTC
Hello Dan, we'll need help from your team here.

We've seen containers failing with:

container_linux.go:247: starting container process caused "process_linux.go:258: applying cgroup configuration for process caused \"write /sys/fs/cgroup/pids/system.slice/docker-0642d71adf65f90fac83693d33be8857e9b1c4a5c69254357ea04fdeadf10c49.scope/cgroup.procs: no such device\""

And according to https://github.com/moby/moby/issues/17653 it is a known issue with systemd cgroups driver and people recommended to use the cgroupfs driver instead. Do you think it is a valid workaround? Anything we need to know if we switch from systemd to cgroupfs driver?

Comment 5 Omri Hochman 2017-12-13 13:28:11 UTC
Raising severity/priority, as this bug was also reported by CI_TEAM in non_IPv6_vlan

Possible duplicate bug reported -> https://bugzilla.redhat.com/show_bug.cgi?id=1525229

Comment 6 Jaromir Coufal 2017-12-13 13:58:47 UTC
*** Bug 1525229 has been marked as a duplicate of this bug. ***

Comment 8 Mark McLoughlin 2017-12-13 14:11:44 UTC
Looks similar to https://github.com/openshift/origin/issues/16246

Adding Vikas also

Comment 9 Jon Schlueter 2017-12-13 14:23:14 UTC
python-docker-py is being pulled from RHEL for overcloud-full images

python-docker-py-1.10.6-3.el7

Comment 10 Mark McLoughlin 2017-12-13 14:26:40 UTC
Vikas took a look and basically said:

1) this is a very rare race condition (see https://github.com/openshift/origin/issues/16246) that has only ever been reported as reproduced in CI

2) they are not yet sure of a fix, and so it will be some time before any fix will be available in RHEL

Based on that, I think we need to proceed, but with very clear advice for customers on what to do next if they hit this

Comment 11 Mark McLoughlin 2017-12-13 14:33:25 UTC
Note, in bug #1514511, it looks like Marian Krcmarik encountered this issue when trying to reproduce the other systemd/containers issue:

https://bugzilla.redhat.com/show_bug.cgi?id=1514511#c7

Comment 15 Daniel Walsh 2017-12-13 15:44:03 UTC
Let's disable oci-register-machine for now.

Set 
/etc/oci-register-machine.conf 
# Disable oci-register-machine by setting the disabled field to true
disabled : true

Which will stop it from running and failing.

This is only needed if you are running systemd in a container. Even in that case it is not fully needed.

You could also just remove oci-register-machine package from the host.
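As a runnable sketch of that workaround (writing to a local file here instead of /etc so the example does not need root; in practice the file is /etc/oci-register-machine.conf on each host):

```shell
# Sketch of the workaround from this comment; real path is /etc/oci-register-machine.conf.
conf=oci-register-machine.conf
printf '%s\n' \
  '# Disable oci-register-machine by setting the disabled field to true' \
  'disabled : true' > "$conf"
# Verify the setting took:
grep -q '^disabled : true' "$conf" && echo "oci-register-machine disabled"
```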

Comment 17 Mark McLoughlin 2017-12-13 16:32:42 UTC
oci-register-machine-0-3.14.gitcd1e331.el7_4 from bug #1514511 disables oci-register-machine

If I understand Dan correctly, that's our fix ...

Comment 18 Mark McLoughlin 2017-12-13 16:49:00 UTC
Wait ... we also disabled it in puppet-tripleo already:

https://code.engineering.redhat.com/gerrit/#/c/124023/

and this fix should be in puppet-tripleo-7.4.3-9.el7ost and later.

https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=627532

Comment 19 Mark McLoughlin 2017-12-13 16:53:22 UTC
And yet, according to the logs in:

https://bugzilla.redhat.com/show_bug.cgi?id=1525229#c1

we have apparently reproduced this with puppet-tripleo-7.4.3-11.el7ost.noarch

Comment 20 Mark McLoughlin 2017-12-13 16:57:18 UTC
also confirmed in https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/DFG-network-neutron-12_director-rhel-virthost-3cont_2comp-ipv4-vxlan/122/artifact/

that /etc/oci-register-machine.conf contains:

# Disable oci-register-machine by setting the disabled field to true
disabled : true

Comment 29 Dan Prince 2017-12-20 15:21:23 UTC
(In reply to Daniel Walsh from comment #15)
> Lets disable oci-register-machine for now.
> 
> Set 
> /etc/oci-register-machine.conf 
> # Disable oci-register-machine by setting the disabled field to true
> disabled : true
> 
> Which will stop it from running and failing.
> 
> This is only needed if you are running systemd in a container. Even in that
> case it is not fully needed.
> 
> You could also just remove oci-register-machine package from the host.

In our environment we are already setting /etc/oci-register-machine.conf's disabled: true.

In order to be able to remove the oci-register-machine package entirely we'd need to update the RPM deps. Currently I get this:

error: Failed dependencies:
	oci-register-machine >= 1:0-3.10 is needed by (installed) docker-common-2:1.12.6-61.git85d7426.el7.x86_64
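The dependency can be double-checked with an rpm query (a sketch; the package name in the comment is inferred from the dependency error above):

```
# Ask rpm which installed packages require oci-register-machine:
rpm -q --whatrequires oci-register-machine
# -> docker-common pulls it in, so removing the package outright would first
#    need docker-common's RPM dependencies updated.
```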

Comment 30 Dan Prince 2017-12-20 15:27:15 UTC
Ah. So now I'm wondering if perhaps the missing piece here on our side is a 'systemctl restart systemd-machined'. The puppet-tripleo patch currently would restart only the docker service, not systemd-machined.

Comment 31 Dan Prince 2017-12-20 15:48:56 UTC
Could you confirm if we need to restart anything along with the /etc/oci-register-machine.conf settings change?

systemctl restart systemd-machined?

----

Also, see the RPM dependency issue with regards to removing the oci-register-machine package.

Comment 32 Daniel Walsh 2017-12-20 18:14:28 UTC
No, the configuration for oci-register-machine is read for every container start; you should not need to restart any services.

Comment 33 Vikas Choudhary 2018-01-07 21:10:19 UTC
Here is the detailed analysis of this issue

https://github.com/openshift/origin/issues/16246#issuecomment-355852817

Comment 35 Vikas Choudhary 2018-01-09 06:11:43 UTC
https://github.com/lnykryn/systemd-rhel/issues/180

Comment 36 Mark McLoughlin 2018-01-10 15:00:34 UTC
Awesome. Bug #1532586 is the underlying systemd bug, but we think we don't need that fix because we have already disabled oci-register-machine.

Comment 38 Bogdan Dobrelya 2018-01-24 11:16:59 UTC
It is reproduced upstream as a promotion blocker https://bugs.launchpad.net/tripleo/+bug/1744954

Comment 39 Emilien Macchi 2018-01-24 15:55:34 UTC
It's not a promotion blocker (as we promoted this morning), but it is definitely a bug (possibly a race condition).

Comment 40 Emilien Macchi 2018-01-30 18:19:47 UTC
We are having a similar if not the same situation in TripleO gate at this time:
https://bugs.launchpad.net/tripleo/+bug/1746298

The issue is critical: it makes our jobs fail randomly and it blocks the OSP13 production chain at this time. Note that oci-register-machine is already disabled.

AFAIU the comments, we shouldn't need the systemd fix, but can someone confirm?

Comment 41 Emilien Macchi 2018-01-30 19:03:07 UTC
Dan, thanks for disabling the hook by default: https://github.com/projectatomic/oci-register-machine/commit/66691c3d0805c41e7336a364934e3e144e97a20f
but it seems we still hit a similar issue at this time, see https://bugs.launchpad.net/tripleo/+bug/1746298

A full journal can be found here: http://logs.openstack.org/46/538346/1/gate/tripleo-ci-centos-7-scenario004-multinode-oooq-container/c0e6264/logs/subnode-2/var/log/journal.txt.gz

(can be found by grepping "oci runtime error").
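A small helper along those lines (a sketch, not from the bug; the grep string is the one mentioned above, and the gzipped filename in the usage note is an assumption):

```shell
# Pull the failing container starts out of a journal dump fetched from
# the log server. Works on any text fed to it.
find_oci_errors() {
    grep -n 'oci runtime error' "$@"
}
# usage (filename is an assumption):
#   zcat journal.txt.gz | find_oci_errors -
```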

I was wondering if we missed something and if you could help.

Thanks a lot

Comment 42 Emilien Macchi 2018-01-31 13:57:58 UTC
For some reason, another bug report was created on Launchpad: https://bugs.launchpad.net/tripleo/+bug/1744954

One patch was proposed but I'm not sure it'll really help: https://review.openstack.org/539537

Note: we don't have the latest version of oci-register-machine in TripleO CI yet.

Comment 43 Emilien Macchi 2018-01-31 22:24:36 UTC
I realized I wanted Dan's thoughts but missed the NEEDINFO flag.

Comment 44 Alan Pevec 2018-02-01 02:03:27 UTC
> Note: we don't have the latest version of oci-register-machine in TripleO CI
> yet.

Sorry for the misleading info in LP; I looked closer, and the oci-register-machine version we have does have disabled: true
https://bugs.launchpad.net/tripleo/+bug/1744954/comments/7

tl;dr disabling oci-register-machine does not help in the current case, and I'm not sure how it helped back in December...

Comment 45 Daniel Walsh 2018-02-02 19:18:30 UTC
This looks weird.

Jan 30 16:26:42 centos-7-citycloud-sto2-0002270221 oci-umount[12679]: umounthook <error>: 4e2c2bba7cb3: Failed to read directory /usr/share/oci-umount/oci-umount.d: No such file or directory

Does the oci-umount package include this directory? If you create this directory, does everything work?
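The suggested workaround can be sketched as follows (not verified in the bug; the path is taken verbatim from the error message, and creating it under /usr/share requires root):

```shell
# Make sure the directory oci-umount expects actually exists.
ensure_dir() {
    [ -d "$1" ] || mkdir -p "$1"
}

ensure_dir /usr/share/oci-umount/oci-umount.d \
    || echo "could not create /usr/share/oci-umount/oci-umount.d (need root?)"
```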

Comment 46 Vikas Choudhary 2018-02-05 05:35:04 UTC
(In reply to Mark McLoughlin from comment #36)
> Awesome, bug #1532586 is the underlying systemd bug but we think we don't
> need this fix because we have disabled oci-register-machine already

@Mark I doubt that. The race between runc and systemd is not related to enabling/disabling oci-register-machine. If runc is the runtime, one should make sure that this fix, https://github.com/opencontainers/runc/pull/1683, is in there, or that systemd has the fix for bug #1532586.

Disabling oci-register-machine will, most probably, help with a different race.

Comment 47 Vikas Choudhary 2018-02-05 05:41:48 UTC
There are two different races. The one related to joining the pids cgroup, IMO, will not be worked around by disabling oci-register-machine.

Comment 49 Vikas Choudhary 2018-02-06 10:21:19 UTC
(In reply to Emilien Macchi from comment #48)
> dwalsh: not sure yet but I'll try to create the directory and see how that
> works.
> 
> Vikas: can you please look at
> https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset017-master/d559096/undercloud/home/jenkins/overcloud_prep_containers.log.txt.gz#_2018-02-02_22_01_51
> 
> A lot of logs are available here:
> https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset017-master/d559096/undercloud/var/log/
> 
> And the rpm versions:
> https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset017-master/d559096/rpm-qa.txt.gz
> 
> Thanks

Verified from the logs you shared: docker in use is 1.12.6, which uses runc at this commit:
https://github.com/projectatomic/runc/commit/c5d311627d39439c5b1cc35c67a51c9c6ccda648

The fix from opencontainers/runc, https://github.com/opencontainers/runc/pull/1683, is not there. Therefore, as I said in the previous comment, that fix should be backported to projectatomic/runc to avoid this failure.
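One heuristic way to check a node for such a backport (a sketch, not from the bug; it assumes the backport would mention the upstream PR number in the docker-runc RPM changelog, which is only a convention, not a guarantee):

```shell
# Heuristic: scan a changelog for a reference to opencontainers/runc PR 1683.
has_1683_backport() {
    grep -Eqi 'pull/1683|#1683' -
}

if rpm -q --changelog docker-runc 2>/dev/null | has_1683_backport; then
    echo "docker-runc changelog mentions PR 1683"
else
    echo "no mention of PR 1683 in docker-runc changelog"
fi
```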

Comment 51 Daniel Walsh 2018-02-07 14:58:05 UTC
@run

Comment 52 Daniel Walsh 2018-02-07 15:02:18 UTC
Antonio can you see if we can get this patch back ported to docker-runc?

Comment 55 Alan Pevec 2018-02-08 18:14:40 UTC
FYI I cloned this to RHEL/docker bug 1543575

Comment 56 Alex Schultz 2018-03-02 00:31:31 UTC
*** Bug 1550588 has been marked as a duplicate of this bug. ***

Comment 57 Alex Schultz 2018-04-23 14:25:02 UTC
The new docker version addresses this. See bug 1543575

