Bug 1666390

Summary: Prometheus persistent volume backed by GlusterFS PersistentVolume changes fs to read-only
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Andre Costa <andcosta>
Component: core
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED CURRENTRELEASE
QA Contact: RamaKasturi <knarra>
Severity: high
Priority: high
Docs Contact:
Version: cns-3.10
CC: amukherj, andcosta, aos-bugs, aos-storage-staff, bjarolim, bkunal, bmchugh, jsafrane, kramdoss, ksubrahm, madam, mchangir, mfalz, moagrawa, nbalacha, nchilaka, ndevos, olim, pasik, pprakash, psony, puebele, ravishankar, rgowdapp, rhs-bugs, rtalur, sheggodu, srakonde, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version: glusterfs-6.0-8
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-11-11 04:42:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: Results from grep (flags: none)

Description Andre Costa 2019-01-15 17:21:00 UTC
Description of problem:
The volume appears healthy on the GlusterFS cluster and inside the Prometheus pod, but the customer sees the filesystem change to read-only when Prometheus tries to write to its database.

level=error ts=2019-01-07T03:39:52.83844915Z caller=compact.go:432 component=tsdb msg="removed tmp folder after failed compaction" err="lstat /prometheus/01D0K6AC3B0YGTNSWQ1VM14KTV.tmp/chunks/000001: read-only file system"
level=error ts=2019-01-07T03:39:52.847779918Z caller=db.go:305 component=tsdb msg="compaction failed" err="compact [/prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V /prometheus/01D0JP9XQWBKER257F3WZXWYXR /prometheus/01D0JX5N01TMG2V2M5JD262R26]: 4 errors: write compaction: write chunks: write /prometheus/01D0K6AC3B0YGTNSWQ1VM14KTV.tmp/chunks/000001: read-only file system; setting compaction failed for block: /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V: open /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JP9XQWBKER257F3WZXWYXR: open /prometheus/01D0JP9XQWBKER257F3WZXWYXR/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JX5N01TMG2V2M5JD262R26: open /prometheus/01D0JX5N01TMG2V2M5JD262R26/meta.json.tmp: read-only file system"
level=error ts=2019-01-07T03:40:55.535068802Z caller=compact.go:432 component=tsdb msg="removed tmp folder after failed compaction" err="lstat /prometheus/01D0K6C9F8Y6Q4XFQMSXRHNK5V.tmp/chunks/000001: read-only file system"
level=error ts=2019-01-07T03:40:55.545717844Z caller=db.go:305 component=tsdb msg="compaction failed" err="compact [/prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V /prometheus/01D0JP9XQWBKER257F3WZXWYXR /prometheus/01D0JX5N01TMG2V2M5JD262R26]: 4 errors: write compaction: write chunks: write /prometheus/01D0K6C9F8Y6Q4XFQMSXRHNK5V.tmp/chunks/000001: read-only file system; setting compaction failed for block: /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V: open /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JP9XQWBKER257F3WZXWYXR: open /prometheus/01D0JP9XQWBKER257F3WZXWYXR/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JX5N01TMG2V2M5JD262R26: open /prometheus/01D0JX5N01TMG2V2M5JD262R26/meta.json.tmp: read-only file system"
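A quick way to confirm whether the mount is still returning "read-only file system" at the time of investigation is a write probe from inside the pod. This is only a hedged sketch: the namespace az-tech-monitoring comes from the PV dump further below, and the pod name is a placeholder that has to be looked up first.

# List the Prometheus pods and pick the affected one (the pod name below is a placeholder)
oc -n az-tech-monitoring get pods -o wide

# Probe the glusterfs mount from inside the pod; an EROFS error here confirms the client-side read-only state
oc -n az-tech-monitoring exec <prometheus-pod> -- sh -c 'touch /prometheus/.rw_probe && rm /prometheus/.rw_probe && echo mount-is-writable'

# Mount flags as seen inside the pod; note they may still show "rw" even while writes fail
oc -n az-tech-monitoring exec <prometheus-pod> -- sh -c 'mount | grep /prometheus'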


Version-Release number of selected component (if applicable):

How reproducible:
Unknown

Steps to Reproduce:

Actual results:

Expected results:

GlusterFS info:

sh-4.2# mount |grep "brick_be1e435cb58f398745fe661d710c164f"
/dev/mapper/vg_7b13b83c5a6f709443f2911174919a74-brick_be1e435cb58f398745fe661d710c164f on /var/lib/heketi/mounts/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f type xfs (rw,noatime,seclabel,nouuid,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)

sh-4.2# lvs -a | grep 'brick_be1e435cb58f398745fe661d710c164f'
  brick_be1e435cb58f398745fe661d710c164f      vg_7b13b83c5a6f709443f2911174919a74 Vwi-aotz-- 100.00g tp_be1e435cb58f398745fe661d710c164f        99.99                                  
sh-4.2# lvs |grep be1e
  brick_be1e435cb58f398745fe661d710c164f vg_7b13b83c5a6f709443f2911174919a74 Vwi-aotz-- 100.00g tp_be1e435cb58f398745fe661d710c164f        99.99                                  
  tp_be1e435cb58f398745fe661d710c164f    vg_7b13b83c5a6f709443f2911174919a74 twi-aotz-- 100.00g                                            99.99  5.62                 

sh-4.2# lvdisplay /dev/mapper/vg_7b13b83c5a6f709443f2911174919a74-brick_be1e435cb58f398745fe661d710c164f
  --- Logical volume ---
  LV Path                /dev/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f
  LV Name                brick_be1e435cb58f398745fe661d710c164f
  VG Name                vg_7b13b83c5a6f709443f2911174919a74
  LV UUID                6Z3Ka6-yc3Y-qjnK-ukap-jvev-TNDV-QvrQRu
  LV Write Access        read/write
  LV Creation host, time crp-prod-glusterfs05.srv.allianz, 2018-07-18 09:04:39 +0000
  LV Pool name           tp_be1e435cb58f398745fe661d710c164f
  LV Status              available
  # open                 1
  LV Size                100.00 GiB
  Mapped size            99.99%
  Current LE             25600
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:424
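Side note on the output above: both the brick LV and its thin pool report 99.99% data usage (the Data% column in lvs and "Mapped size" in lvdisplay). A nearly full thin pool is one common reason for the underlying XFS brick being forced read-only, so it may be worth ruling out. A hedged sketch of the checks follows; the extension size is illustrative only, and whether this applies here would need confirmation from the kernel log on the brick host.

# Confirm data and metadata usage of the thin pool backing the brick
lvs -a -o lv_name,vg_name,data_percent,metadata_percent vg_7b13b83c5a6f709443f2911174919a74

# Look for thin-pool-full or XFS shutdown/remount messages on the brick host
dmesg | grep -iE 'thin|no space|xfs'

# If the pool really is exhausted and the VG has free extents, it could be grown (size is an example only)
vgs vg_7b13b83c5a6f709443f2911174919a74
lvextend -L +20G vg_7b13b83c5a6f709443f2911174919a74/tp_be1e435cb58f398745fe661d710c164f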


NOTE: I cannot stop the volume as it is in use, so I started it with the "force" option instead.

sh-4.2# gluster vol start vol_a3ceb43ae2afb2ca88473e9e47a51dbf force
volume start: vol_a3ceb43ae2afb2ca88473e9e47a51dbf: success

sh-4.2# gluster vol heal vol_a3ceb43ae2afb2ca88473e9e47a51dbf info
Brick 10.16.77.24:/var/lib/heketi/mounts/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.20:/var/lib/heketi/mounts/vg_0a7e1052758ea35c3a27b5842e14e8b4/brick_ed909162e25120d770c8bcbba152e6e4/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.23:/var/lib/heketi/mounts/vg_7b95e143984218f8535eb1bfb273377c/brick_cb286acc39576c577c0906bc9a0d2feb/brick
Status: Connected
Number of entries: 0

sh-4.2# gluster vol status vol_a3ceb43ae2afb2ca88473e9e47a51dbf
Status of volume: vol_a3ceb43ae2afb2ca88473e9e47a51dbf
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.16.77.24:/var/lib/heketi/mounts/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f/brick    49156    0    Y    2737
Brick 10.16.77.20:/var/lib/heketi/mounts/vg_0a7e1052758ea35c3a27b5842e14e8b4/brick_ed909162e25120d770c8bcbba152e6e4/brick    49156    0    Y    24338
Brick 10.16.77.23:/var/lib/heketi/mounts/vg_7b95e143984218f8535eb1bfb273377c/brick_cb286acc39576c577c0906bc9a0d2feb/brick    49155    0    Y    3822
Self-heal Daemon on localhost                            N/A    N/A    Y    28878
Self-heal Daemon on 10.16.77.22                          N/A    N/A    Y    23050
Self-heal Daemon on 10.16.77.25                          N/A    N/A    Y    20195
Self-heal Daemon on crp-prod-glusterfs02.srv.allianz     N/A    N/A    Y    29402
Self-heal Daemon on 10.16.77.23                          N/A    N/A    Y    27702
Self-heal Daemon on crp-prod-glusterfs01.srv.allianz     N/A    N/A    Y    2988
 
Task Status of Volume vol_a3ceb43ae2afb2ca88473e9e47a51dbf
------------------------------------------------------------------------------
There are no active volume tasks
 
sh-4.2# gluster vol heal vol_a3ceb43ae2afb2ca88473e9e47a51dbf info summary
Brick 10.16.77.24:/var/lib/heketi/mounts/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 10.16.77.20:/var/lib/heketi/mounts/vg_0a7e1052758ea35c3a27b5842e14e8b4/brick_ed909162e25120d770c8bcbba152e6e4/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 10.16.77.23:/var/lib/heketi/mounts/vg_7b95e143984218f8535eb1bfb273377c/brick_cb286acc39576c577c0906bc9a0d2feb/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

PV Dump:
apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    Description: 'Gluster-Internal: Dynamically provisioned PV'
    gluster.kubernetes.io/heketi-volume-id: a3ceb43ae2afb2ca88473e9e47a51dbf
    gluster.org/type: file
    kubernetes.io/createdby: heketi-dynamic-provisioner
    pv.beta.kubernetes.io/gid: "2198"
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/glusterfs
    volume.beta.kubernetes.io/mount-options: auto_unmount
  creationTimestamp: null
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-98fb0df9-8a69-11e8-8a9f-005056885d84
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: prometheus-data-dev-2
    namespace: az-tech-monitoring
    resourceVersion: "376140698"
    uid: 98fb0df9-8a69-11e8-8a9f-005056885d84
  glusterfs:
    endpoints: glusterfs-dynamic-prometheus-data-dev-2
    path: vol_a3ceb43ae2afb2ca88473e9e47a51dbf
  persistentVolumeReclaimPolicy: Delete
  storageClassName: dynamic-and-replicated
status:
  phase: Bound
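
For reference, the dump above can be reproduced with the OpenShift client, and the heketi-volume-id annotation is what ties the PV to the gluster volume used in the commands earlier (all names are taken from the dump itself):

# Re-dump the PV and its claim
oc get pv pvc-98fb0df9-8a69-11e8-8a9f-005056885d84 -o yaml
oc -n az-tech-monitoring get pvc prometheus-data-dev-2 -o yaml

# The annotation gluster.kubernetes.io/heketi-volume-id (a3ceb43ae2afb2ca88473e9e47a51dbf)
# corresponds to the gluster volume name vol_a3ceb43ae2afb2ca88473e9e47a51dbf in spec.glusterfs.path.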

Additional info:

Comment 10 Sanju 2019-02-15 09:08:03 UTC
Any updates on this?

Comment 60 Andre Costa 2019-05-03 13:30:20 UTC
Created attachment 1562445 [details]
Results from grep

Comment 80 Melanie Falz 2019-10-08 14:21:37 UTC
The customer faced the issue again today with InfluxDB and OCS (3.11.3) and re-opened the case.
Please review the provided data and let me know the findings.
supportshell.prod.useraccess-us-west-2.redhat.com : /02277972/INFLUXDB_08_OKT
Thanks and regards,
Melanie

Comment 83 Melanie Falz 2019-10-09 07:14:18 UTC
Hello Mohit,

We already have the data from crp-prod-largeapps053.srv.allianz on supportshell (provided yesterday).
Can you please specify which system you need the sosreport collected from?
"All the nodes" could mean a lot.

Many thanks for your support,

Melanie

Comment 90 Nithya Balachandran 2019-10-09 09:36:05 UTC
(In reply to Melanie Falz from comment #83)
> Hello Mohit,
> 
> we have on supportshell (provided yesterday) the
> crp-prod-largeapps053.srv.allianz.

This looks like it was captured in December. It will not have information relevant to the issue reported with InfluxDB.

Comment 92 Melanie Falz 2019-10-09 11:40:57 UTC
Hello Mohit,

The customer uploaded the requested data.

He sent all /var/log/glusterfsd.log* and glfsheal-vol_623950910a3c501e5dc2df493f7ced82.log* files from all 6 gluster nodes.
Please extract the log files referenced in the previous posts for these logs on each gluster node.

He tried to find the related log files in /var/log/glusterfs/bricks, but could not find log files on any gluster node matching the VGs and bricks below (see the search sketch after the brick list).

vol_623950910a3c501e5dc2df493f7ced82
Brick 10.16.77.21:/var/lib/heketi/mounts/vg_f723106e5bab792cb49161e684bc8176/brick_dece77aeff0bfc2f1acb726e82f76cf2/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.24:/var/lib/heketi/mounts/vg_f02c133d69613ed8fd6ca0b1f4d3aff6/brick_5ce73c3e2551eaff4546c4f667bf6f6d/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.25:/var/lib/heketi/mounts/vg_af928a22a7c82fe8785097bb71fb9704/brick_9d6c09ec9751ad34f08ad9a6cbaba5e6/brick
Status: Connected
Number of entries: 0
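
One hedged way to locate the matching brick logs on each gluster node: brick log files are normally named after the full brick path with the slashes replaced by dashes under /var/log/glusterfs/bricks/, so a search such as the following (brick IDs copied from the list above) may help.

# Expected name pattern, e.g. on 10.16.77.21:
#   /var/log/glusterfs/bricks/var-lib-heketi-mounts-vg_f723106e5bab792cb49161e684bc8176-brick_dece77aeff0bfc2f1acb726e82f76cf2-brick.log
ls /var/log/glusterfs/bricks/ | grep -E 'dece77aeff0bfc2f1acb726e82f76cf2|5ce73c3e2551eaff4546c4f667bf6f6d|9d6c09ec9751ad34f08ad9a6cbaba5e6'

# Fallback: search all gluster logs for references to the affected volume
grep -rl 'vol_623950910a3c501e5dc2df493f7ced82' /var/log/glusterfs/ 2>/dev/null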


Please find and review the new data on supportshell within the DIR: 02277972/GLUSTER_LOGS 


290-02277972-glusterlogs.tar     320-02277972-glusterlogs.tar.gz  350-var_log_logfiles.tar.gz  380-var_log_logfiles.tar.gz
300-02277972-glusterlogs.tar.gz  330-02277972-glusterlogs.tar     360-var_log_logfiles.tar.gz  390-var_log_logfiles.tar.gz
310-02277972-glusterlogs.tar.gz  340-02277972-glusterlogs.tar     370-var_log_logfiles.tar.gz  400-var_log_logfiles.tar.gz


thanks

Melanie

Comment 95 Melanie Falz 2019-10-10 07:08:42 UTC
Hello again,

The customer has now provided all_bricks_logs_in_oct_19.

Please check on supportshell in 02277972/GLUSTER_LOGS

drwxrwx---+ 2 yank yank  137 Oct 10 05:09 410-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank  137 Oct 10 05:09 420-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank  137 Oct 10 05:09 430-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank  137 Oct 10 05:09 440-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank  137 Oct 10 05:09 450-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank  137 Oct 10 05:09 460-var_logs_oct_19.tar.gz

and let me know the outcome

thanks
Melanie

Comment 99 Yaniv Kaul 2019-11-08 19:24:14 UTC
It's not part of the released 3.5.0?

Comment 100 Mohit Agrawal 2019-11-11 04:36:20 UTC
Yes, it is part of RHGS 3.5.0.

Comment 102 Red Hat Bugzilla 2023-09-18 00:15:17 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days