Bug 1730403 - sysroot partition is full and the partition cannot be grown. [NEEDINFO]
Summary: sysroot partition is full and the partition cannot be grown.
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.1.z
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Steve Milner
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-16 15:32 UTC by jmselmi
Modified: 2019-07-25 14:02 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-25 14:02:21 UTC
Target Upstream Version:
smilner: needinfo? (jmselmi)



Description jmselmi 2019-07-16 15:32:27 UTC
Description of problem:

The sysroot partition is full, and it's mounted on a plain disk partition (no LVM).
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        7.6G     0  7.6G   0% /dev
tmpfs           7.7G     0  7.7G   0% /dev/shm
tmpfs           7.7G  751M  6.9G  10% /run
tmpfs           7.7G     0  7.7G   0% /sys/fs/cgroup
/dev/nvme0n1p3   15G   15G   20K 100% /sysroot
/dev/nvme0n1p2  976M  135M  774M  15% /boot
tmpfs           1.6G     0  1.6G   0% /run/user/1000

[root@ip-10-0-1-182 libexec]# lsblk
NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
nvme0n1     259:0    0  16G  0 disk
├─nvme0n1p1 259:1    0   1M  0 part
├─nvme0n1p2 259:2    0   1G  0 part /boot
└─nvme0n1p3 259:3    0  15G  0 part /sysroot


I extended the EBS volume size (the setup is on AWS), then tried `oc debug node/` to run `/usr/libexec/coreos-growpart`.

Failed result:
--------------
oc debug node/ip-10-0-1-182.eu-west-3.compute.internal
Starting pod/ip-10-0-1-182eu-west-3computeinternal-debug ...
To use host binaries, run `chroot /host`

Removing debug pod ...
Error from server (BadRequest): container "container-00" in pod "ip-10-0-1-182eu-west-3computeinternal-debug" is not available

I SSH'd into the machine and tried to extend it, but hit an issue:
[root@ip-10-0-1-182 libexec]# ./coreos-growpart /sysroot
mkdir: cannot create directory '/tmp/growpart.19846': No space left on device
FAILED: failed to make temp dir
meta-data=/dev/nvme0n1p3         isize=512    agcount=4, agsize=982848 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=1
data     =                       bsize=4096   blocks=3931392, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=2560, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

There is no documentation on docs.openshift.com about how to extend a partition.

This is on OCP 4.1.4.
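For reference, a rough sketch of the manual steps (assuming an XFS root on /dev/nvme0n1p3 as in the lsblk output above, and that some space has been freed first so temp files can be created — verify device names on your own node):

```shell
# Hypothetical sketch of manually growing /sysroot after resizing the
# EBS volume. Device names come from the lsblk output above.
DEV=/dev/nvme0n1p3

# nvme naming: /dev/nvme0n1p3 -> disk /dev/nvme0n1, partition 3
DISK=${DEV%p*}
PART=${DEV##*p}
echo "disk=$DISK part=$PART"

# With space freed (growpart needs to write a temp dir under /tmp):
# growpart "$DISK" "$PART"   # grow the partition table entry
# xfs_growfs /sysroot        # grow the XFS filesystem to fill it
```

The growpart/xfs_growfs lines are left commented because they modify the partition table and filesystem in place.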

Comment 1 Steve Milner 2019-07-16 16:44:40 UTC
Thanks for the report. 

These lines ...

> mkdir: cannot create directory '/tmp/growpart.19846': No space left on device
> FAILED: failed to make temp dir

seem to indicate that growpart was unsuccessful because there was no space left. Freeing up space first may allow you to run growpart.

Can you provide the deployment information (rpm-ostree status) as well as the size of container storage (sudo du -d 1 -ch /var/lib/containers/)? It's possible some space can be freed up in one of these locations. Checking podman specific storage (sudo podman system df), looking at the images (sudo podman images) and containers (sudo podman ps -a) may also be helpful in finding items which can be removed for space.
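As a sketch, the diagnostics requested above could be run on the node (assuming root access, e.g. via `oc debug node/<node>` followed by `chroot /host`); the prune commands at the end are a suggested addition for reclaiming space, not part of the original request:

```shell
# Sketch of the diagnostics above, run as root on the node.
rpm-ostree status                                         # deployment information
du -d 1 -ch /var/lib/containers/ | sort -h | tail -n 5    # biggest consumers
podman system df                                          # podman storage summary
podman ps -a                                              # containers that could be removed
podman images                                             # images that could be removed

# If stopped containers / unused images turn up, pruning may free
# enough space for growpart's temp dir:
# podman container prune -f
# podman image prune -f
```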

Comment 2 Ben Breard 2019-07-16 17:40:11 UTC
How did you end up w/ a 15G / ? 

Looking at the UPI on AWS documentation, they recommend resizing the image to 120G at deployment time. 15G is much too small for an OCP deployment.

Resources:
  Master0:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: !Ref RhcosAmi
      BlockDeviceMappings:
      - DeviceName: /dev/xvda
        Ebs:
          VolumeSize: "120"
          VolumeType: "gp2"

Comment 3 jmselmi 2019-07-17 08:31:36 UTC
(In reply to Steve Milner from comment #1)
> Thanks for the report. 
> 
> These lines ...
> 
> > mkdir: cannot create directory '/tmp/growpart.19846': No space left on device
> > FAILED: failed to make temp dir
> 
> seem to indicate that growpart was unsuccessful because there was no space
> left. Freeing up space first may end up allowing you to run growpart.
> 
> Can you provide the deployment information (rpm-ostree status) as well as
> the size of container storage (sudo du -d 1 -ch /var/lib/containers/)? It's
> possible some space can be freed up in one of these locations. Checking
> podman specific storage (sudo podman system df), looking at the images (sudo
> podman images) and containers (sudo podman ps -a) may also be helpful in
> finding items which can be removed for space.

I lost the master now, can't ssh into nor debug. I will do it once recovered.

Comment 4 jmselmi 2019-07-17 08:32:22 UTC
(In reply to Ben Breard from comment #2)
> How did you end up w/ a 15G / ? 
> 
> Looking at the UPI on AWS documentation, they recommend resizing the image
> to 120G at deployment time. 15G is much too small for an OCP deployment.
> 
> Resources:
>   Master0:
>     Type: AWS::EC2::Instance
>     Properties:
>       ImageId: !Ref RhcosAmi
>       BlockDeviceMappings:
>       - DeviceName: /dev/xvda
>         Ebs:
>           VolumeSize: "120"
>           VolumeType: "gp2"

After checking, something is missing in the master configuration.

Comment 5 jmselmi 2019-07-17 08:46:26 UTC
Sorry, I see in the machine configuration that we already have the 120GB volume configured:
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  selfLink: >-
    /apis/machine.openshift.io/v1beta1/namespaces/openshift-machine-api/machines/silver-56dsw-master-0
  resourceVersion: '27146009'
  name: silver-56dsw-master-0
  uid: 41613ac5-a80f-11e9-bd33-0a5dc4646e7c
  creationTimestamp: '2019-07-16T21:18:28Z'
  generation: 1
  namespace: openshift-machine-api
  finalizers:
    - machine.machine.openshift.io
  labels:
    machine.openshift.io/cluster-api-cluster: silver-56dsw
    machine.openshift.io/cluster-api-machine-role: master
    machine.openshift.io/cluster-api-machine-type: master
spec:
  metadata:
    creationTimestamp: null
  providerSpec:
    value:
      userDataSecret:
        name: master-user-data
      placement:
        availabilityZone: eu-west-3a
        region: eu-west-3
      credentialsSecret:
        name: aws-cloud-credentials
      instanceType: m5.xlarge
      metadata:
        creationTimestamp: null
      publicIp: null
      blockDevices:
        - ebs:
            iops: 0
            volumeSize: 120
            volumeType: gp2
      securityGroups:
        - filters:
            - name: 'tag:Name'
              values:
                - silver-56dsw-master-sg
      kind: AWSMachineProviderConfig
      loadBalancers:
        - name: silver-56dsw-ext
          type: network
        - name: silver-56dsw-int
          type: network
      tags:
        - name: kubernetes.io/cluster/silver-56dsw
          value: owned
        - name: auto_shut_bool
          value: 'True'
      deviceIndex: 0
      ami:
        id: ami-064c1a19b5600d4bb
      subnet:
        filters:
          - name: 'tag:Name'
            values:
              - silver-56dsw-private-eu-west-3a
      apiVersion: awsproviderconfig.openshift.io/v1beta1
      iamInstanceProfile:
        id: silver-56dsw-master-profile
status:
  lastUpdated: '2019-07-16T21:18:49Z'
  providerStatus:
    apiVersion: awsproviderconfig.openshift.io/v1beta1
    conditions:
      - lastProbeTime: '2019-07-16T21:18:49Z'
        lastTransitionTime: '2019-07-16T21:18:49Z'
        message: >-
          error launching instance: error getting subnet IDs: no subnet IDs were
          found,
        reason: MachineCreationFailed
        status: 'True'
        type: MachineCreation
    kind: AWSMachineProviderStatus

Comment 6 Steve Milner 2019-07-17 14:28:31 UTC
Based on

> [root@ip-10-0-1-182 libexec]# lsblk
> NAME        MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
> nvme0n1     259:0    0  16G  0 disk
> ├─nvme0n1p1 259:1    0   1M  0 part
> ├─nvme0n1p2 259:2    0   1G  0 part /boot
> └─nvme0n1p3 259:3    0  15G  0 part /sysroot

it doesn't look like there is a 120G device mounted for image storage, and the system itself, as noted, is 16G in total (15G for /), which, if also used for image storage, could run out of space quickly.
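One way to check whether the EBS resize actually reached the instance (device name taken from the lsblk output above; these paths won't exist on other hardware):

```shell
# Sketch: verify the underlying disk size as the kernel sees it.
lsblk -b -dn -o SIZE /dev/nvme0n1    # disk size in bytes
cat /sys/block/nvme0n1/size          # size in 512-byte sectors

# Converting sectors to bytes: a 16 GiB disk reports 33554432 sectors.
SECTORS=33554432
echo $(( SECTORS * 512 ))            # 17179869184 bytes = 16 GiB
```

If the disk still reports ~16G, the resize never propagated to this instance, which would be consistent with the MachineCreationFailed status shown in comment 5.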

