Bug 1666390 - Prometheus persistent volume backed by GlusterFS PersistentVolume changes fs to read-only
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: core
Version: cns-3.10
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Raghavendra G
QA Contact: RamaKasturi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-01-15 17:21 UTC by Andre Costa
Modified: 2023-09-18 00:15 UTC
CC: 29 users

Fixed In Version: glusterfs-6.0-8
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-11-11 04:42:08 UTC
Embargoed:


Attachments
Results from grep (1.59 MB, application/x-tar)
2019-05-03 13:30 UTC, Andre Costa

Description Andre Costa 2019-01-15 17:21:00 UTC
Description of problem:
Volume appears healthy on the gluster cluster and inside the Prometheus pod, but the customer sees the filesystem change to read-only when Prometheus tries to write to its database.

level=error ts=2019-01-07T03:39:52.83844915Z caller=compact.go:432 component=tsdb msg="removed tmp folder after failed compaction" err="lstat /prometheus/01D0K6AC3B0YGTNSWQ1VM14KTV.tmp/chunks/000001: read-only file system"
level=error ts=2019-01-07T03:39:52.847779918Z caller=db.go:305 component=tsdb msg="compaction failed" err="compact [/prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V /prometheus/01D0JP9XQWBKER257F3WZXWYXR /prometheus/01D0JX5N01TMG2V2M5JD262R26]: 4 errors: write compaction: write chunks: write /prometheus/01D0K6AC3B0YGTNSWQ1VM14KTV.tmp/chunks/000001: read-only file system; setting compaction failed for block: /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V: open /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JP9XQWBKER257F3WZXWYXR: open /prometheus/01D0JP9XQWBKER257F3WZXWYXR/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JX5N01TMG2V2M5JD262R26: open /prometheus/01D0JX5N01TMG2V2M5JD262R26/meta.json.tmp: read-only file system"
level=error ts=2019-01-07T03:40:55.535068802Z caller=compact.go:432 component=tsdb msg="removed tmp folder after failed compaction" err="lstat /prometheus/01D0K6C9F8Y6Q4XFQMSXRHNK5V.tmp/chunks/000001: read-only file system"
level=error ts=2019-01-07T03:40:55.545717844Z caller=db.go:305 component=tsdb msg="compaction failed" err="compact [/prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V /prometheus/01D0JP9XQWBKER257F3WZXWYXR /prometheus/01D0JX5N01TMG2V2M5JD262R26]: 4 errors: write compaction: write chunks: write /prometheus/01D0K6C9F8Y6Q4XFQMSXRHNK5V.tmp/chunks/000001: read-only file system; setting compaction failed for block: /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V: open /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JP9XQWBKER257F3WZXWYXR: open /prometheus/01D0JP9XQWBKER257F3WZXWYXR/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JX5N01TMG2V2M5JD262R26: open /prometheus/01D0JX5N01TMG2V2M5JD262R26/meta.json.tmp: read-only file system"
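
A minimal way to confirm what the pod itself sees (the pod name below is a placeholder; the namespace is taken from the PV dump further down) would be along these lines:

  oc -n az-tech-monitoring rsh <prometheus-pod-name>
  mount | grep /prometheus          # "ro" here means the fuse mount itself was remounted read-only
  touch /prometheus/.rw-test && rm -f /prometheus/.rw-test   # fails with EROFS while writes are blocked

Note that gluster can return EROFS to the application even while the mount flags still show rw, so the touch test is the more reliable check.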


Version-Release number of selected component (if applicable):

How reproducible:
Unknown

Steps to Reproduce:

Actual results:

Expected results:

GlusterFS info:

sh-4.2# mount |grep "brick_be1e435cb58f398745fe661d710c164f"
/dev/mapper/vg_7b13b83c5a6f709443f2911174919a74-brick_be1e435cb58f398745fe661d710c164f on /var/lib/heketi/mounts/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f type xfs (rw,noatime,seclabel,nouuid,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)
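
The brick mount above still shows rw; a rough way to rule out the brick XFS itself having been remounted read-only on any of the gluster nodes (for example after an I/O error) would be:

  grep brick_be1e435cb58f398745fe661d710c164f /proc/mounts   # "ro" here would mean the brick filesystem flipped
  dmesg -T | grep -iE 'xfs|read-only|i/o error'              # look for XFS shutdown or remount messages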

sh-4.2# lvs -a | grep 'brick_be1e435cb58f398745fe661d710c164f'
  brick_be1e435cb58f398745fe661d710c164f      vg_7b13b83c5a6f709443f2911174919a74 Vwi-aotz-- 100.00g tp_be1e435cb58f398745fe661d710c164f        99.99                                  
sh-4.2# lvs |grep be1e
  brick_be1e435cb58f398745fe661d710c164f vg_7b13b83c5a6f709443f2911174919a74 Vwi-aotz-- 100.00g tp_be1e435cb58f398745fe661d710c164f        99.99                                  
  tp_be1e435cb58f398745fe661d710c164f    vg_7b13b83c5a6f709443f2911174919a74 twi-aotz-- 100.00g                                            99.99  5.62                 
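
For reference, the two trailing columns in the default lvs output above are Data% and Meta%, i.e. the thin pool tp_be1e435cb58f398745fe661d710c164f is at 99.99% data usage. A more explicit query using standard lvm2 field names would be:

  lvs -o lv_name,pool_lv,data_percent,metadata_percent vg_7b13b83c5a6f709443f2911174919a74

Whether or not it is related to this report, a thin pool that reaches 100% data usage starts failing writes on the bricks, so the 99.99% figure is worth keeping an eye on.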

sh-4.2# lvdisplay /dev/mapper/vg_7b13b83c5a6f709443f2911174919a74-brick_be1e435cb58f398745fe661d710c164f
  --- Logical volume ---
  LV Path                /dev/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f
  LV Name                brick_be1e435cb58f398745fe661d710c164f
  VG Name                vg_7b13b83c5a6f709443f2911174919a74
  LV UUID                6Z3Ka6-yc3Y-qjnK-ukap-jvev-TNDV-QvrQRu
  LV Write Access        read/write
  LV Creation host, time crp-prod-glusterfs05.srv.allianz, 2018-07-18 09:04:39 +0000
  LV Pool name           tp_be1e435cb58f398745fe661d710c164f
  LV Status              available
  # open                 1
  LV Size                100.00 GiB
  Mapped size            99.99%
  Current LE             25600
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:424


NOTE: I cannot stop the volume as it is in use, so I started it with the "force" option instead.

sh-4.2# gluster vol start vol_a3ceb43ae2afb2ca88473e9e47a51dbf force
volume start: vol_a3ceb43ae2afb2ca88473e9e47a51dbf: success

sh-4.2# gluster vol heal vol_a3ceb43ae2afb2ca88473e9e47a51dbf info
Brick 10.16.77.24:/var/lib/heketi/mounts/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.20:/var/lib/heketi/mounts/vg_0a7e1052758ea35c3a27b5842e14e8b4/brick_ed909162e25120d770c8bcbba152e6e4/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.23:/var/lib/heketi/mounts/vg_7b95e143984218f8535eb1bfb273377c/brick_cb286acc39576c577c0906bc9a0d2feb/brick
Status: Connected
Number of entries: 0

sh-4.2# gluster vol status vol_a3ceb43ae2afb2ca88473e9e47a51dbf
Status of volume: vol_a3ceb43ae2afb2ca88473e9e47a51dbf
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.16.77.24:/var/lib/heketi/mounts/vg
_7b13b83c5a6f709443f2911174919a74/brick_be1
e435cb58f398745fe661d710c164f/brick         49156     0          Y       2737 
Brick 10.16.77.20:/var/lib/heketi/mounts/vg
_0a7e1052758ea35c3a27b5842e14e8b4/brick_ed9
09162e25120d770c8bcbba152e6e4/brick         49156     0          Y       24338
Brick 10.16.77.23:/var/lib/heketi/mounts/vg
_7b95e143984218f8535eb1bfb273377c/brick_cb2
86acc39576c577c0906bc9a0d2feb/brick         49155     0          Y       3822 
Self-heal Daemon on localhost               N/A       N/A        Y       28878
Self-heal Daemon on 10.16.77.22             N/A       N/A        Y       23050
Self-heal Daemon on 10.16.77.25             N/A       N/A        Y       20195
Self-heal Daemon on crp-prod-glusterfs02.sr
v.allianz                                   N/A       N/A        Y       29402
Self-heal Daemon on 10.16.77.23             N/A       N/A        Y       27702
Self-heal Daemon on crp-prod-glusterfs01.sr
v.allianz                                   N/A       N/A        Y       2988 
 
Task Status of Volume vol_a3ceb43ae2afb2ca88473e9e47a51dbf
------------------------------------------------------------------------------
There are no active volume tasks
 
sh-4.2# gluster vol heal vol_a3ceb43ae2afb2ca88473e9e47a51dbf info summary
Brick 10.16.77.24:/var/lib/heketi/mounts/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 10.16.77.20:/var/lib/heketi/mounts/vg_0a7e1052758ea35c3a27b5842e14e8b4/brick_ed909162e25120d770c8bcbba152e6e4/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 10.16.77.23:/var/lib/heketi/mounts/vg_7b95e143984218f8535eb1bfb273377c/brick_cb286acc39576c577c0906bc9a0d2feb/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
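
Since the bricks, heal counters and volume status above all look clean on the server side, one hedged next step is to check the gluster fuse client log on the OpenShift node that runs the Prometheus pod (the exact log file name depends on the mount path, so the glob below is approximate). A common reason for a replicated fuse mount to start returning EROFS is loss of client quorum, and that is logged on the client side:

  # run on the node hosting the Prometheus pod
  grep -iE 'read-only|EROFS|quorum' /var/log/glusterfs/*.log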

PV Dump:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    Description: 'Gluster-Internal: Dynamically provisioned PV'
    gluster.kubernetes.io/heketi-volume-id: a3ceb43ae2afb2ca88473e9e47a51dbf
    gluster.org/type: file
    kubernetes.io/createdby: heketi-dynamic-provisioner
    pv.beta.kubernetes.io/gid: "2198"
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/glusterfs
    volume.beta.kubernetes.io/mount-options: auto_unmount
  creationTimestamp: null
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-98fb0df9-8a69-11e8-8a9f-005056885d84
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: prometheus-data-dev-2
    namespace: az-tech-monitoring
    resourceVersion: "376140698"
    uid: 98fb0df9-8a69-11e8-8a9f-005056885d84
  glusterfs:
    endpoints: glusterfs-dynamic-prometheus-data-dev-2
    path: vol_a3ceb43ae2afb2ca88473e9e47a51dbf
  persistentVolumeReclaimPolicy: Delete
  storageClassName: dynamic-and-replicated
status:
  phase: Bound
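
For completeness, the claim and endpoints referenced by this PV can be cross-checked with standard oc queries (object names taken from the dump above):

  oc -n az-tech-monitoring get pvc prometheus-data-dev-2 -o yaml
  oc -n az-tech-monitoring get endpoints glusterfs-dynamic-prometheus-data-dev-2 -o yaml
  oc get pv pvc-98fb0df9-8a69-11e8-8a9f-005056885d84 -o yaml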

Additional info:

Comment 10 Sanju 2019-02-15 09:08:03 UTC
Any updates on this?

Comment 60 Andre Costa 2019-05-03 13:30:20 UTC
Created attachment 1562445 [details]
Results from grep

Comment 80 Melanie Falz 2019-10-08 14:21:37 UTC
The customer faced the issue again today with InfluxDB and OCS (3.11.3) and re-opened the case.
Please review the provided data and let me know the findings.
supportshell.prod.useraccess-us-west-2.redhat.com : /02277972/INFLUXDB_08_OKT
Thanks and regards,
Melanie

Comment 83 Melanie Falz 2019-10-09 07:14:18 UTC
Hello Mohit,

we have crp-prod-largeapps053.srv.allianz on supportshell (provided yesterday).
Can you please specify which system you need the sos-report collected from?
"All the nodes" could mean a lot.

Many thanks for your support,

Melanie

Comment 90 Nithya Balachandran 2019-10-09 09:36:05 UTC
(In reply to Melanie Falz from comment #83)
> Hello Mohit,
> 
> we have crp-prod-largeapps053.srv.allianz on supportshell (provided
> yesterday).

This looks like it was captured in December. That will not have information relevant to the issue reported with Influxdb.

Comment 92 Melanie Falz 2019-10-09 11:40:57 UTC
Hello Mohit,

the customer uploaded the requested data.

He sent all /var/log/glusterfsd.log* and glfsheal-vol_623950910a3c501e5dc2df493f7ced82.log* files from all 6 gluster nodes.
Please extract the log files from the previous posts for these logs on each gluster node.

He tried to find the related log files in /var/log/glusterfs/bricks but could not find any log files on the gluster nodes matching the VGs and bricks below (see the note on brick log naming after the list).

vol_623950910a3c501e5dc2df493f7ced82
Brick 10.16.77.21:/var/lib/heketi/mounts/vg_f723106e5bab792cb49161e684bc8176/brick_dece77aeff0bfc2f1acb726e82f76cf2/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.24:/var/lib/heketi/mounts/vg_f02c133d69613ed8fd6ca0b1f4d3aff6/brick_5ce73c3e2551eaff4546c4f667bf6f6d/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.25:/var/lib/heketi/mounts/vg_af928a22a7c82fe8785097bb71fb9704/brick_9d6c09ec9751ad34f08ad9a6cbaba5e6/brick
Status: Connected
Number of entries: 0
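
(Note on brick log naming, a general glusterfs convention rather than anything specific to this case: brick logs live under /var/log/glusterfs/bricks/ with the slashes of the brick path replaced by dashes, so the first brick above would normally log to something like
  /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_f723106e5bab792cb49161e684bc8176-brick_dece77aeff0bfc2f1acb726e82f76cf2-brick.log
on 10.16.77.21. A search such as
  find /var/log/glusterfs -name '*brick_dece77aeff0bfc2f1acb726e82f76cf2*'
on each node should locate the file if it exists.)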


Please find and review the new data on supportshell within the DIR: 02277972/GLUSTER_LOGS 


290-02277972-glusterlogs.tar     320-02277972-glusterlogs.tar.gz  350-var_log_logfiles.tar.gz  380-var_log_logfiles.tar.gz
300-02277972-glusterlogs.tar.gz  330-02277972-glusterlogs.tar     360-var_log_logfiles.tar.gz  390-var_log_logfiles.tar.gz
310-02277972-glusterlogs.tar.gz  340-02277972-glusterlogs.tar     370-var_log_logfiles.tar.gz  400-var_log_logfiles.tar.gz


thanks

Melanie

Comment 95 Melanie Falz 2019-10-10 07:08:42 UTC
Hello again,

the customer has now provided all_bricks_logs_in_oct_19.

Please check on supportshell in 02277972/GLUSTER_LOGS

drwxrwx---+ 2 yank yank  137 Oct 10 05:09 410-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank  137 Oct 10 05:09 420-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank  137 Oct 10 05:09 430-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank  137 Oct 10 05:09 440-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank  137 Oct 10 05:09 450-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank  137 Oct 10 05:09 460-var_logs_oct_19.tar.gz

and let me know the outcome

thanks
Melanie

Comment 99 Yaniv Kaul 2019-11-08 19:24:14 UTC
It's not part of the released 3.5.0?

Comment 100 Mohit Agrawal 2019-11-11 04:36:20 UTC
Yes, it is part of rhgs 3.5.0.

Comment 102 Red Hat Bugzilla 2023-09-18 00:15:17 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

