Description of problem:
The volume appears healthy on the GlusterFS cluster and from inside the Prometheus pod, but the customer sees the filesystem turn read-only whenever Prometheus tries to write to its database:

level=error ts=2019-01-07T03:39:52.83844915Z caller=compact.go:432 component=tsdb msg="removed tmp folder after failed compaction" err="lstat /prometheus/01D0K6AC3B0YGTNSWQ1VM14KTV.tmp/chunks/000001: read-only file system"
level=error ts=2019-01-07T03:39:52.847779918Z caller=db.go:305 component=tsdb msg="compaction failed" err="compact [/prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V /prometheus/01D0JP9XQWBKER257F3WZXWYXR /prometheus/01D0JX5N01TMG2V2M5JD262R26]: 4 errors: write compaction: write chunks: write /prometheus/01D0K6AC3B0YGTNSWQ1VM14KTV.tmp/chunks/000001: read-only file system; setting compaction failed for block: /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V: open /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JP9XQWBKER257F3WZXWYXR: open /prometheus/01D0JP9XQWBKER257F3WZXWYXR/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JX5N01TMG2V2M5JD262R26: open /prometheus/01D0JX5N01TMG2V2M5JD262R26/meta.json.tmp: read-only file system"
level=error ts=2019-01-07T03:40:55.535068802Z caller=compact.go:432 component=tsdb msg="removed tmp folder after failed compaction" err="lstat /prometheus/01D0K6C9F8Y6Q4XFQMSXRHNK5V.tmp/chunks/000001: read-only file system"
level=error ts=2019-01-07T03:40:55.545717844Z caller=db.go:305 component=tsdb msg="compaction failed" err="compact [/prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V /prometheus/01D0JP9XQWBKER257F3WZXWYXR /prometheus/01D0JX5N01TMG2V2M5JD262R26]: 4 errors: write compaction: write chunks: write /prometheus/01D0K6C9F8Y6Q4XFQMSXRHNK5V.tmp/chunks/000001: read-only file system; setting compaction failed for block: /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V: open /prometheus/01D0JFE6FWZ0R7RPR1EYC3SM5V/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JP9XQWBKER257F3WZXWYXR: open /prometheus/01D0JP9XQWBKER257F3WZXWYXR/meta.json.tmp: read-only file system; setting compaction failed for block: /prometheus/01D0JX5N01TMG2V2M5JD262R26: open /prometheus/01D0JX5N01TMG2V2M5JD262R26/meta.json.tmp: read-only file system"

Version-Release number of selected component (if applicable):

How reproducible:
Unknown

Steps to Reproduce:

Actual results:

Expected results:

GlusterFS info:

sh-4.2# mount |grep "brick_be1e435cb58f398745fe661d710c164f"
/dev/mapper/vg_7b13b83c5a6f709443f2911174919a74-brick_be1e435cb58f398745fe661d710c164f on /var/lib/heketi/mounts/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f type xfs (rw,noatime,seclabel,nouuid,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota)

sh-4.2# lvs -a | grep 'brick_be1e435cb58f398745fe661d710c164f'
  brick_be1e435cb58f398745fe661d710c164f vg_7b13b83c5a6f709443f2911174919a74 Vwi-aotz-- 100.00g tp_be1e435cb58f398745fe661d710c164f 99.99

sh-4.2# lvs |grep be1e
  brick_be1e435cb58f398745fe661d710c164f vg_7b13b83c5a6f709443f2911174919a74 Vwi-aotz-- 100.00g tp_be1e435cb58f398745fe661d710c164f 99.99
  tp_be1e435cb58f398745fe661d710c164f    vg_7b13b83c5a6f709443f2911174919a74 twi-aotz-- 100.00g                                     99.99  5.62
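Note that the lvs output above shows the thin pool backing this brick (tp_be1e435cb58f398745fe661d710c164f) at 99.99% data usage. As a minimal sketch (not from the original report), a check like the following could be run on each gluster node to spot nearly-full thin pools across heketi-managed bricks; the 95% threshold is an arbitrary assumption for illustration:

# List thin pools only (lv_attr starts with 't') and warn when data or
# metadata usage crosses a threshold. 95 is an assumed example value.
lvs --noheadings --separator ' ' -S 'lv_attr=~"^t"' \
    -o vg_name,lv_name,data_percent,metadata_percent |
while read -r vg lv data meta; do
    # data/meta look like "99.99"; compare the integer part only
    if [ "${data%.*}" -ge 95 ] || [ "${meta%.*}" -ge 95 ]; then
        echo "WARNING: thin pool ${vg}/${lv} nearly full (data=${data}% meta=${meta}%)"
    fi
done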
sh-4.2# lvdisplay /dev/mapper/vg_7b13b83c5a6f709443f2911174919a74-brick_be1e435cb58f398745fe661d710c164f
  --- Logical volume ---
  LV Path                /dev/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f
  LV Name                brick_be1e435cb58f398745fe661d710c164f
  VG Name                vg_7b13b83c5a6f709443f2911174919a74
  LV UUID                6Z3Ka6-yc3Y-qjnK-ukap-jvev-TNDV-QvrQRu
  LV Write Access        read/write
  LV Creation host, time crp-prod-glusterfs05.srv.allianz, 2018-07-18 09:04:39 +0000
  LV Pool name           tp_be1e435cb58f398745fe661d710c164f
  LV Status              available
  # open                 1
  LV Size                100.00 GiB
  Mapped size            99.99%
  Current LE             25600
  Segments               1
  Allocation             inherit
  Read ahead sectors     auto
  - currently set to     256
  Block device           253:424

NOTE: I cannot stop the volume as it is in use, so I started it with the "force" option instead.

sh-4.2# gluster vol start vol_a3ceb43ae2afb2ca88473e9e47a51dbf force
volume start: vol_a3ceb43ae2afb2ca88473e9e47a51dbf: success

sh-4.2# gluster vol heal vol_a3ceb43ae2afb2ca88473e9e47a51dbf info
Brick 10.16.77.24:/var/lib/heketi/mounts/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.20:/var/lib/heketi/mounts/vg_0a7e1052758ea35c3a27b5842e14e8b4/brick_ed909162e25120d770c8bcbba152e6e4/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.23:/var/lib/heketi/mounts/vg_7b95e143984218f8535eb1bfb273377c/brick_cb286acc39576c577c0906bc9a0d2feb/brick
Status: Connected
Number of entries: 0

sh-4.2# gluster vol status vol_a3ceb43ae2afb2ca88473e9e47a51dbf
Status of volume: vol_a3ceb43ae2afb2ca88473e9e47a51dbf
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.16.77.24:/var/lib/heketi/mounts/vg
_7b13b83c5a6f709443f2911174919a74/brick_be1
e435cb58f398745fe661d710c164f/brick         49156     0          Y       2737
Brick 10.16.77.20:/var/lib/heketi/mounts/vg
_0a7e1052758ea35c3a27b5842e14e8b4/brick_ed9
09162e25120d770c8bcbba152e6e4/brick         49156     0          Y       24338
Brick 10.16.77.23:/var/lib/heketi/mounts/vg
_7b95e143984218f8535eb1bfb273377c/brick_cb2
86acc39576c577c0906bc9a0d2feb/brick         49155     0          Y       3822
Self-heal Daemon on localhost               N/A       N/A        Y       28878
Self-heal Daemon on 10.16.77.22             N/A       N/A        Y       23050
Self-heal Daemon on 10.16.77.25             N/A       N/A        Y       20195
Self-heal Daemon on crp-prod-glusterfs02.sr
v.allianz                                   N/A       N/A        Y       29402
Self-heal Daemon on 10.16.77.23             N/A       N/A        Y       27702
Self-heal Daemon on crp-prod-glusterfs01.sr
v.allianz                                   N/A       N/A        Y       2988

Task Status of Volume vol_a3ceb43ae2afb2ca88473e9e47a51dbf
------------------------------------------------------------------------------
There are no active volume tasks

sh-4.2# gluster vol heal vol_a3ceb43ae2afb2ca88473e9e47a51dbf info summary
Brick 10.16.77.24:/var/lib/heketi/mounts/vg_7b13b83c5a6f709443f2911174919a74/brick_be1e435cb58f398745fe661d710c164f/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 10.16.77.20:/var/lib/heketi/mounts/vg_0a7e1052758ea35c3a27b5842e14e8b4/brick_ed909162e25120d770c8bcbba152e6e4/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
Number of entries possibly healing: 0

Brick 10.16.77.23:/var/lib/heketi/mounts/vg_7b95e143984218f8535eb1bfb273377c/brick_cb286acc39576c577c0906bc9a0d2feb/brick
Status: Connected
Total Number of entries: 0
Number of entries in heal pending: 0
Number of entries in split-brain: 0
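Since the volume itself reports healthy while the mount turns read-only inside the pod, one hypothetical check (not part of the data collected above) is to look on the gluster nodes for kernel-level XFS or dm-thin errors and for any brick filesystem the kernel has remounted read-only:

# Look for recent XFS / thin-pool errors in the kernel log
dmesg | grep -iE 'xfs|device-mapper: thin|read-only' | tail -n 20
# List any filesystems currently mounted read-only
awk '$4 ~ /^ro(,|$)/ {print $1, $2, $4}' /proc/mounts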
PV Dump:

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    Description: 'Gluster-Internal: Dynamically provisioned PV'
    gluster.kubernetes.io/heketi-volume-id: a3ceb43ae2afb2ca88473e9e47a51dbf
    gluster.org/type: file
    kubernetes.io/createdby: heketi-dynamic-provisioner
    pv.beta.kubernetes.io/gid: "2198"
    pv.kubernetes.io/bound-by-controller: "yes"
    pv.kubernetes.io/provisioned-by: kubernetes.io/glusterfs
    volume.beta.kubernetes.io/mount-options: auto_unmount
  creationTimestamp: null
  finalizers:
  - kubernetes.io/pv-protection
  name: pvc-98fb0df9-8a69-11e8-8a9f-005056885d84
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 100Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: prometheus-data-dev-2
    namespace: az-tech-monitoring
    resourceVersion: "376140698"
    uid: 98fb0df9-8a69-11e8-8a9f-005056885d84
  glusterfs:
    endpoints: glusterfs-dynamic-prometheus-data-dev-2
    path: vol_a3ceb43ae2afb2ca88473e9e47a51dbf
  persistentVolumeReclaimPolicy: Delete
  storageClassName: dynamic-and-replicated
status:
  phase: Bound

Additional info:
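For reference, the binding and the gluster endpoints behind this PV can be cross-checked from the OpenShift side. These commands are illustrative only, with the object names taken from the PV dump above:

oc get pv pvc-98fb0df9-8a69-11e8-8a9f-005056885d84 -o yaml
oc -n az-tech-monitoring get pvc prometheus-data-dev-2
oc -n az-tech-monitoring get endpoints glusterfs-dynamic-prometheus-data-dev-2 -o yaml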
Any updates on this?
Created attachment 1562445 [details] Results from grep
The customer faced the issue again today with InfluxDB and OCS (3.11.3) and re-opened the case. Please review the provided data and let me know the findings:

supportshell.prod.useraccess-us-west-2.redhat.com : /02277972/INFLUXDB_08_OKT

Thanks and regards,
Melanie
Hello Mohit,

we have crp-prod-largeapps053.srv.allianz on supportshell (provided yesterday). Can you please specify the systems from which you need the sos-reports collected? "All the nodes" could mean a lot.

Many thanks for your support,
Melanie
(In reply to Melanie Falz from comment #83)
> Hello Mohit,
>
> we have crp-prod-largeapps053.srv.allianz on supportshell
> (provided yesterday).

This looks like it was captured in December; it will not have information relevant to the issue reported with InfluxDB.
Hello Mohit,

the customer uploaded the requested data: he sent all /var/log/glusterfsd.log* and glfsheal-vol_623950910a3c501e5dc2df493f7ced82.log* files from all 6 gluster nodes. Please extract these logs for each gluster node from the archives in the previous posts.

He tried to find the related log files in /var/log/glusterfs/bricks but could not find, on any gluster node, log files matching the VGs and bricks below:

vol_623950910a3c501e5dc2df493f7ced82

Brick 10.16.77.21:/var/lib/heketi/mounts/vg_f723106e5bab792cb49161e684bc8176/brick_dece77aeff0bfc2f1acb726e82f76cf2/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.24:/var/lib/heketi/mounts/vg_f02c133d69613ed8fd6ca0b1f4d3aff6/brick_5ce73c3e2551eaff4546c4f667bf6f6d/brick
Status: Connected
Number of entries: 0

Brick 10.16.77.25:/var/lib/heketi/mounts/vg_af928a22a7c82fe8785097bb71fb9704/brick_9d6c09ec9751ad34f08ad9a6cbaba5e6/brick
Status: Connected
Number of entries: 0

Please find and review the new data on supportshell in the directory 02277972/GLUSTER_LOGS:

290-02277972-glusterlogs.tar
300-02277972-glusterlogs.tar.gz
310-02277972-glusterlogs.tar.gz
320-02277972-glusterlogs.tar.gz
330-02277972-glusterlogs.tar
340-02277972-glusterlogs.tar
350-var_log_logfiles.tar.gz
360-var_log_logfiles.tar.gz
370-var_log_logfiles.tar.gz
380-var_log_logfiles.tar.gz
390-var_log_logfiles.tar.gz
400-var_log_logfiles.tar.gz

Thanks,
Melanie
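Regarding the missing brick logs: glusterd names brick logs under /var/log/glusterfs/bricks/ after the brick path, with '/' replaced by '-' and the leading slash dropped. A sketch for locating the log for the first brick above (the brick path is taken from this comment; the name translation is the standard gluster convention):

brick=/var/lib/heketi/mounts/vg_f723106e5bab792cb49161e684bc8176/brick_dece77aeff0bfc2f1acb726e82f76cf2/brick
# translate the brick path into its log file name and list rotated copies too
ls -l /var/log/glusterfs/bricks/$(echo "$brick" | sed -e 's|^/||' -e 's|/|-|g').log*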
Hello again,

the customer has now provided all_bricks_logs_in_oct_19. Please check on supportshell in 02277972/GLUSTER_LOGS:

drwxrwx---+ 2 yank yank 137 Oct 10 05:09 410-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank 137 Oct 10 05:09 420-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank 137 Oct 10 05:09 430-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank 137 Oct 10 05:09 440-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank 137 Oct 10 05:09 450-var_logs_oct_19.tar.gz
drwxrwx---+ 2 yank yank 137 Oct 10 05:09 460-var_logs_oct_19.tar.gz

and let me know the outcome.

Thanks,
Melanie
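A hypothetical first pass over these uploads (treating the listed names as the tarballs themselves; adjust paths if supportshell stores each archive inside a directory of that name) could extract them and grep for the read-only symptom:

# extract each October archive into its own directory
for f in 4[1-6]0-var_logs_oct_19.tar.gz; do
    mkdir -p "extracted/${f%.tar.gz}"
    tar -xzf "$f" -C "extracted/${f%.tar.gz}"
done
# search the extracted logs for read-only / EROFS errors
grep -riE 'read-only file system|EROFS' extracted/ | sort | head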
It's not part of the released 3.5.0?
Yes, it is part of RHGS 3.5.0.
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.