Bug 1881033 - [Assisted-4.6][Staging] Agent unable to start container during discovery "podman[1973]: Error: error creating container storage: size for layer "layer_sha" is unknown, failing getSize()
Summary: [Assisted-4.6][Staging] Agent unable to start container during discovery "pod...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: assisted-installer
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: unspecified
Target Milestone: ---
Target Release: ---
Assignee: Ori Amizur
QA Contact: Yuri Obshansky
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-09-21 12:04 UTC by nshidlin
Modified: 2022-08-28 08:45 UTC (History)

Fixed In Version: OCP-Metal-v1.0.9.5
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-28 08:45:59 UTC
Target Upstream Version:
Embargoed:



Description nshidlin 2020-09-21 12:04:33 UTC
Description of problem:
When hosts are booted from the ISO, some or all of them are not discovered by the service. Agent logs:
Sep 21 11:41:29 master-0-1 podman[1860]: Trying to pull registry.stage.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-15...
Sep 21 11:41:34 master-0-1 podman[1860]: Getting image source signatures
Sep 21 11:41:36 master-0-1 podman[1860]: Copying blob sha256:9d0d09b1ea44d90760e85af2ce11a721c1fd9c646015000e4453b10c661fd21c
Sep 21 11:41:36 master-0-1 podman[1860]: Copying blob sha256:8e7ef64fd2cf1d5fbafc966ca5339400a2a8d26dcd5025cae0bffeb76155b005
Sep 21 11:41:36 master-0-1 podman[1860]: Copying blob sha256:c4d668e229cd131e0a8e4f8218dca628d9cf9697572875e355fe4b247b6aa9f0
Sep 21 11:41:36 master-0-1 podman[1860]: Copying blob sha256:ec1681b6a383e4ecedbeddd5abc596f3de835aed6db39a735f62395c8edbff30
Sep 21 11:42:55 master-0-1 podman[1860]: Copying config sha256:8a15567bae378372803735ca3e4359cd5d91057b30ae54631c3ba82a7e6660fc
Sep 21 11:42:55 master-0-1 podman[1860]: Writing manifest to image destination
Sep 21 11:42:55 master-0-1 podman[1860]: Storing signatures
Sep 21 11:42:59 master-0-1 systemd[1]: agent.service: Start-pre operation timed out. Terminating.
Sep 21 11:42:59 master-0-1 systemd[1]: agent.service: Failed with result 'timeout'.
Sep 21 11:42:59 master-0-1 systemd[1]: Failed to start agent.service.
Sep 21 11:43:03 master-0-1 systemd[1]: agent.service: Service RestartSec=3s expired, scheduling restart.
Sep 21 11:43:03 master-0-1 systemd[1]: agent.service: Scheduled restart job, restart counter is at 6.
Sep 21 11:43:03 master-0-1 systemd[1]: Stopped agent.service.
Sep 21 11:43:03 master-0-1 systemd[1]: Starting agent.service...
Sep 21 11:43:03 master-0-1 podman[1973]: Trying to pull registry.stage.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-15...
Sep 21 11:43:08 master-0-1 podman[1973]: Getting image source signatures
Sep 21 11:43:08 master-0-1 podman[1973]: Copying blob sha256:ec1681b6a383e4ecedbeddd5abc596f3de835aed6db39a735f62395c8edbff30
Sep 21 11:43:08 master-0-1 podman[1973]: Copying blob sha256:c4d668e229cd131e0a8e4f8218dca628d9cf9697572875e355fe4b247b6aa9f0
Sep 21 11:43:08 master-0-1 podman[1973]: Copying blob sha256:9d0d09b1ea44d90760e85af2ce11a721c1fd9c646015000e4453b10c661fd21c
Sep 21 11:43:09 master-0-1 podman[1973]: Copying blob sha256:8e7ef64fd2cf1d5fbafc966ca5339400a2a8d26dcd5025cae0bffeb76155b005
Sep 21 11:43:36 master-0-1 podman[1973]: Copying config sha256:8a15567bae378372803735ca3e4359cd5d91057b30ae54631c3ba82a7e6660fc
Sep 21 11:43:36 master-0-1 podman[1973]: Writing manifest to image destination
Sep 21 11:43:36 master-0-1 podman[1973]: Storing signatures
Sep 21 11:43:36 master-0-1 podman[1973]: Error: error creating container storage: size for layer "053c169f70b03d18472b4004472910fdb2465d55ae9335c422f2cd4b7479a21e" is unknown, failing getSize()
Sep 21 11:43:36 master-0-1 systemd[1]: agent.service: Control process exited, code=exited status=125
Sep 21 11:43:36 master-0-1 systemd[1]: agent.service: Failed with result 'exit-code'.
Sep 21 11:43:36 master-0-1 systemd[1]: Failed to start agent.service.
Sep 21 11:43:40 master-0-1 systemd[1]: agent.service: Service RestartSec=3s expired, scheduling restart.
Sep 21 11:43:40 master-0-1 systemd[1]: agent.service: Scheduled restart job, restart counter is at 7.
Sep 21 11:43:40 master-0-1 systemd[1]: Stopped agent.service.




Version-Release number of selected component (if applicable):
Staging:
{
    "release_tag": "v1.0.9.4-ds",
    "versions": {
        "assisted-ignition-generator": "quay.io/ocpmetal/assisted-ignition-generator:v1.0.9.4",
        "assisted-installer": "registry.stage.redhat.io/openshift4/assisted-installer-rhel8:v4.6.0-19",
        "assisted-installer-controller": "registry.stage.redhat.io/openshift4/assisted-installer-reporter-rhel8:v4.6.0-15",
        "assisted-installer-service": "quay.io/app-sre/assisted-service:b793c52",
        "discovery-agent": "registry.stage.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-15",
        "image-builder": "quay.io/app-sre/assisted-iso-create:b793c52"
    }
}


How reproducible:
This does not reproduce every time. When it does, it sometimes affects all the nodes and sometimes only one node.

Steps to Reproduce:
1. Create Cluster
2. Generate and download ISO
3. Boot nodes into ISO

Actual results:
Node discovery fails

Expected results:
Nodes to be discovered by service

Additional info:

Comment 1 Ronnie Lazar 2020-09-21 12:25:05 UTC
This may be a timeout in the start-pre step of the systemd service.

Comment 2 Lital Alon 2020-09-21 14:03:07 UTC
I have one environment (out of 2) that consistently reproduces this issue (seal32).
Hosts reach up to 20 retries, and the agent fails to start.
When I run the pull manually:
sudo podman pull registry.stage.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-15

the issue resolves, so I guess it is a timeout issue.
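The manual workaround above can be sketched as a small retry wrapper around `podman pull`. This is only an illustration of the workaround, not the fix that was merged for this bug. The function name `pull_image`, the 300-second timeout, and the three attempts are assumptions, and the command may need root privileges as in the manual run.

```python
import subprocess

# Image reference quoted in this report.
AGENT_IMAGE = "registry.stage.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-15"


def pull_image(image, attempts=3, timeout_s=300, run=subprocess.run):
    """Run `podman pull` up to `attempts` times; return True on the first success.

    `run` is injectable so the retry logic can be exercised without podman.
    The default attempts and timeout are illustrative values, not tuned ones.
    """
    for _ in range(attempts):
        try:
            result = run(["podman", "pull", image], timeout=timeout_s, check=False)
        except subprocess.TimeoutExpired:
            # The pull was killed after timeout_s seconds; try again.
            continue
        if result.returncode == 0:
            return True
    return False


if __name__ == "__main__":
    print("pulled" if pull_image(AGENT_IMAGE) else "pull failed")
```

Giving the pull its own generous timeout, separate from the service start, mirrors what the manual pull does: the image is fully stored before the agent container is created.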

Comment 3 Yuri Obshansky 2020-09-21 15:21:59 UTC
Reproducible on ocp-edge33.lab.eng.tlv2.redhat.com
1 master node failed
2 masters and 2 workers are OK
Sep 21 15:18:24 master-0-1 podman[6154]: Error: error creating container storage: size for layer "ccf04fbd6e1943f648d1c2980e96038edc02b543c597556>
Sep 21 15:18:24 master-0-1 systemd[1]: agent.service: Control process exited, code=exited status=125
Sep 21 15:18:24 master-0-1 systemd[1]: agent.service: Failed with result 'exit-code'.
Sep 21 15:18:24 master-0-1 systemd[1]: Failed to start agent.service.

Comment 4 Ori Amizur 2020-09-22 07:38:36 UTC
PR - https://github.com/openshift/assisted-service/pull/406

Comment 5 nshidlin 2020-09-22 18:05:34 UTC
issue persists on staging:
{
    "release_tag": "v1.0.9.5-ds",
    "versions": {
        "assisted-ignition-generator": "quay.io/ocpmetal/assisted-ignition-generator:v1.0.9.5",
        "assisted-installer": "registry.stage.redhat.io/openshift4/assisted-installer-rhel8:v4.6.0-19",
        "assisted-installer-controller": "registry.stage.redhat.io/openshift4/assisted-installer-reporter-rhel8:v4.6.0-15",
        "assisted-installer-service": "quay.io/app-sre/assisted-service:7fd51db",
        "discovery-agent": "registry.stage.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-15",
        "image-builder": "quay.io/app-sre/assisted-iso-create:7fd51db"
    }
}

Comment 6 nshidlin 2020-09-23 11:49:26 UTC
Verified on staging:

{
    "release_tag": "v1.0.9.5-ds",
    "versions": {
        "assisted-ignition-generator": "quay.io/ocpmetal/assisted-ignition-generator:v1.0.9.5",
        "assisted-installer": "registry.stage.redhat.io/openshift4/assisted-installer-rhel8:v4.6.0-19",
        "assisted-installer-controller": "registry.stage.redhat.io/openshift4/assisted-installer-reporter-rhel8:v4.6.0-15",
        "assisted-installer-service": "quay.io/app-sre/assisted-service:27cfe0d",
        "discovery-agent": "registry.stage.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-15",
        "image-builder": "quay.io/app-sre/assisted-iso-create:27cfe0d"
    }
}
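Comments 5 and 6 report the same release_tag (v1.0.9.5-ds), so the tag alone does not show what changed between the failing and the verified staging deployments. A minimal sketch of a per-component comparison follows; the helper name `changed_components` is hypothetical, and the two dictionaries are copied from the manifests in those comments.

```python
# "versions" maps copied from comment 5 (issue persists) and comment 6 (verified).
COMMENT_5 = {
    "assisted-ignition-generator": "quay.io/ocpmetal/assisted-ignition-generator:v1.0.9.5",
    "assisted-installer": "registry.stage.redhat.io/openshift4/assisted-installer-rhel8:v4.6.0-19",
    "assisted-installer-controller": "registry.stage.redhat.io/openshift4/assisted-installer-reporter-rhel8:v4.6.0-15",
    "assisted-installer-service": "quay.io/app-sre/assisted-service:7fd51db",
    "discovery-agent": "registry.stage.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-15",
    "image-builder": "quay.io/app-sre/assisted-iso-create:7fd51db",
}
COMMENT_6 = {
    "assisted-ignition-generator": "quay.io/ocpmetal/assisted-ignition-generator:v1.0.9.5",
    "assisted-installer": "registry.stage.redhat.io/openshift4/assisted-installer-rhel8:v4.6.0-19",
    "assisted-installer-controller": "registry.stage.redhat.io/openshift4/assisted-installer-reporter-rhel8:v4.6.0-15",
    "assisted-installer-service": "quay.io/app-sre/assisted-service:27cfe0d",
    "discovery-agent": "registry.stage.redhat.io/openshift4/assisted-installer-agent-rhel8:v4.6.0-15",
    "image-builder": "quay.io/app-sre/assisted-iso-create:27cfe0d",
}


def changed_components(before, after):
    """Return {component: (old_image, new_image)} for every entry that differs."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)
    return changes


if __name__ == "__main__":
    for name, (old, new) in changed_components(COMMENT_5, COMMENT_6).items():
        print(f"{name}: {old} -> {new}")
```

For these two manifests only assisted-installer-service and image-builder differ (7fd51db versus 27cfe0d); the discovery-agent image is unchanged.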

