Bug 2272495
| Summary: | [RDR] CephFS snapshots: VolumeSnapshot->ReadyToUse is false and stays false forever due to timeout error | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat OpenShift Data Foundation | Reporter: | Aman Agrawal <amagrawa> |
| Component: | ceph | Assignee: | Radoslaw Zarzynski <rzarzyns> |
| ceph sub component: | RADOS | QA Contact: | Elad <ebenahar> |
| Status: | ASSIGNED --- | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | assingh, bhubbard, bmekhiss, bniver, gfarnum, khiremat, mrajanna, muagarwa, nojha, odf-bz-bot, rar, rzarzyns, sheggodu, sostapov, vshankar |
| Version: | 4.15 | Keywords: | Reopened |
| Target Milestone: | --- | Flags: | gfarnum: needinfo? (rzarzyns), khiremat: needinfo-, bhubbard: needinfo? (rzarzyns), sheggodu: needinfo? (rzarzyns) |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2024-04-11 12:51:21 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Aman Agrawal
2024-04-01 16:26:43 UTC
This is the only read call in the cephfs client debug log that's reading data off the OSD (data pool):
```
2024-05-06T11:36:57.841121364Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 10 client.584545 _read_async 0x10000020458.head(faked_ino=0 nref=10 ll_ref=0 cap_refs={2048=1} open={1=1} mode=100640 size=158/0 nlink=1 btime=2024-04-30T20:53:25.974991+0000 mtime=2024-04-30T20:53:25.975615+0000 ctime=2024-04-30T20:53:26.386700+0000 change_attr=7 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[0x10000020458 ts 0/0 objects 0 dirty_or_tx 0] parents=0x100000066cb.head[".meta"] 0x5634ec816a00) 0~4096
2024-05-06T11:36:57.841121364Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 10 client.584545 min_bytes=131072 max_bytes=16777216 max_periods=4
2024-05-06T11:36:57.841121364Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 10 client.584545.objecter _op_submit op 0x5634ec41fe00
2024-05-06T11:36:57.841135304Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 20 client.584545.objecter _calc_target epoch 7696 base 10000020458.00000000 @12 precalc_pgid 0 pgid 0.0 is_read
2024-05-06T11:36:57.841135304Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 20 client.584545.objecter _calc_target target 10000020458.00000000 @12 -> pgid 12.53bb8e39
2024-05-06T11:36:57.841140932Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 10 client.584545.objecter _calc_target raw pgid 12.53bb8e39 -> actual 12.19 acting [0,1,2] primary 0
2024-05-06T11:36:57.841179626Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 20 client.584545.objecter _get_session s=0x5634ec88e480 osd=0 3
2024-05-06T11:36:57.841179626Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 10 client.584545.objecter _op_submit oid 10000020458.00000000 '@12' '@12' [read 0~158] tid 1 osd.0
2024-05-06T11:36:57.841183709Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 20 client.584545.objecter get_session s=0x5634ec88e480 osd=0 3
2024-05-06T11:36:57.841183709Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 15 client.584545.objecter _session_op_assign 0 1
2024-05-06T11:36:57.841199599Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 15 client.584545.objecter _send_op 1 to 12.19 on osd.0
```
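The mapping the objecter computed (object 10000020458.00000000 in pool 12 -> pg 12.53bb8e39 -> 12.19, acting [0,1,2], primary osd.0) can be cross-checked from the cluster side. A minimal sketch, assuming access to the Ceph toolbox or another node with admin credentials; the data-pool name below is a placeholder, the object name and PG IDs come from the log above:
```
# Placeholder pool name -- substitute the CephFS data pool that has pool id 12.
ceph osd map <cephfs-data-pool> 10000020458.00000000
# Should report something along the lines of:
#   ... object '10000020458.00000000' -> pg 12.53bb8e39 (12.19) -> up ([0,1,2], p0) acting ([0,1,2], p0)

# Then check that the PG itself is healthy and that osd.0 really is the primary:
ceph pg 12.19 query | head -n 40
```
If the acting set or primary reported here differs from what the client computed, the client may simply be working from a stale osdmap rather than hitting a dead OSD.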
And later...
```
2024-05-06T11:37:12.813418983Z debug 2024-05-06T11:37:12.812+0000 7f9c18b67640 10 client.584545.objecter tick
2024-05-06T11:37:12.813418983Z debug 2024-05-06T11:37:12.812+0000 7f9c18b67640 2 client.584545.objecter tid 1 on osd.0 is laggy
2024-05-06T11:37:12.813418983Z debug 2024-05-06T11:37:12.812+0000 7f9c18b67640 10 client.584545.objecter _maybe_request_map subscribing (onetime) to next osd map
2024-05-06T11:37:12.813551139Z debug 2024-05-06T11:37:12.812+0000 7f9cedb7c640 20 --2- 10.131.1.11:0/1469616214 >> [v2:242.1.255.250:3300/0,v1:242.1.255.250:6789/0] conn(0x5634ec834400 0x5634ec6bf180 secure :-1 s=READY pgs=110714 cs=0 l=1 rev1=1 crypto rx=0x5634ec799e00 tx=0x5634ec535f80 comp rx=0 tx=0).write_message sending m=0x5634ec8ba4e0 seq=7 src=client.584545 off=0
2024-05-06T11:37:12.814280714Z debug 2024-05-06T11:37:12.813+0000 7f9c16362640 10 client.584545.objecter ms_dispatch 0x5634e28b8000 osd_map(7697..7697 src has 7082..7697)
2024-05-06T11:37:12.814280714Z debug 2024-05-06T11:37:12.813+0000 7f9c16362640 3 client.584545.objecter handle_osd_map got epochs [7697,7697] > 7696
2024-05-06T11:37:12.814280714Z debug 2024-05-06T11:37:12.813+0000 7f9c16362640 3 client.584545.objecter handle_osd_map decoding incremental epoch 7697
2024-05-06T11:37:12.814468790Z debug 2024-05-06T11:37:12.813+0000 7f9c16362640 10 client.584545.objecter checking op 1
2024-05-06T11:37:12.814468790Z debug 2024-05-06T11:37:12.813+0000 7f9c16362640 20 client.584545.objecter _calc_target epoch 7697 base 10000020458.00000000 @12 precalc_pgid 0 pgid 0.0 is_read
2024-05-06T11:37:12.814468790Z debug 2024-05-06T11:37:12.813+0000 7f9c16362640 20 client.584545.objecter _calc_target target 10000020458.00000000 @12 -> pgid 12.53bb8e39
2024-05-06T11:37:12.814468790Z debug 2024-05-06T11:37:12.813+0000 7f9c16362640 20 client.584545.objecter dump_active .. 0 homeless
2024-05-06T11:37:12.814468790Z debug 2024-05-06T11:37:12.813+0000 7f9c16362640 20 client.584545.objecter 1 12.53bb8e39 osd.0 10000020458.00000000 [read 0~158]
```
The "tid 1 on osd.0 is laggy" log appears over and over again and I believe the [read 0~158] osd_op is tid 1 to osd.0.
OK. So, the update above was filtered with grep `client.584545`, just to see what is happening to the call at the client/objecter level, not necessarily at the messenger layer. That is what follows here: the [read 0~158] is being sent to osd.0, so the objecter initiates a connection to the endpoint [v2:242.1.255.254:6800/2685472366,v1:242.1.255.254:6801/2685472366]. Then, soon after that:
```
2024-05-06T11:36:57.841199599Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 1 -- 10.131.1.11:0/1469616214 --> [v2:242.1.255.254:6800/2685472366,v1:242.1.255.254:6801/2685472366] -- osd_op(unknown.0.0:1 12.19 12:9c71ddca:::10000020458.00000000:head [read 0~158] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e7696) -- 0x5634ec6ef000 con 0x5634ec835000
2024-05-06T11:36:57.841199599Z debug 2024-05-06T11:36:57.840+0000 7f9ced37b640 20 -- 10.131.1.11:0/1469616214 >> [v2:242.1.255.254:6800/2685472366,v1:242.1.255.254:6801/2685472366] conn(0x5634ec835000 msgr2=0x5634ec836b00 unknown :-1 s=STATE_CONNECTING l=0).process
2024-05-06T11:36:57.841204861Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 5 --2- 10.131.1.11:0/1469616214 >> [v2:242.1.255.254:6800/2685472366,v1:242.1.255.254:6801/2685472366] conn(0x5634ec835000 0x5634ec836b00 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).send_message enqueueing message m=0x5634ec6ef000 type=42 osd_op(client.584545.0:1 12.19 12:9c71ddca:::10000020458.00000000:head [read 0~158] snapc 0=[] ondisk+read+known_if_redirected+supports_pool_eio e7696)
2024-05-06T11:36:57.841204861Z debug 2024-05-06T11:36:57.840+0000 7f9c27504640 15 --2- 10.131.1.11:0/1469616214 >> [v2:242.1.255.254:6800/2685472366,v1:242.1.255.254:6801/2685472366] conn(0x5634ec835000 0x5634ec836b00 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=0 rev1=0 crypto rx=0 tx=0 comp rx=0 tx=0).send_message inline write is denied, reschedule m=0x5634ec6ef000
```
Is the inline write denied and rescheduled because the (OSD) connection is still in progress (which is hinted at by the "nonblock connect inprogress" below)?
```
2024-05-06T11:36:57.841393513Z debug 2024-05-06T11:36:57.840+0000 7f9ced37b640 20 Event(0x5634e35b0988 nevent=5000 time_id=56).create_file_event create event started fd=83 mask=1 original mask is 0
2024-05-06T11:36:57.841393513Z debug 2024-05-06T11:36:57.840+0000 7f9ced37b640 20 EpollDriver.add_event add event fd=83 cur_mask=0 add_mask=1 to 11
2024-05-06T11:36:57.841393513Z debug 2024-05-06T11:36:57.840+0000 7f9ced37b640 20 Event(0x5634e35b0988 nevent=5000 time_id=56).create_file_event create event end fd=83 mask=1 current mask is 1
2024-05-06T11:36:57.841393513Z debug 2024-05-06T11:36:57.840+0000 7f9ced37b640 10 NetHandler reconnect reconnect: 114 Operation already in progress
2024-05-06T11:36:57.841401866Z debug 2024-05-06T11:36:57.840+0000 7f9ced37b640 10 -- 10.131.1.11:0/1469616214 >> [v2:242.1.255.254:6800/2685472366,v1:242.1.255.254:6801/2685472366] conn(0x5634ec835000 msgr2=0x5634ec836b00 unknown :-1 s=STATE_CONNECTING_RE l=0).process nonblock connect inprogress
```
After going through many messenger-related logs that I could not fully make sense of, log messages like the following are still showing up at this point:
```
2024-05-06T11:37:07.288541431Z debug 2024-05-06T11:37:07.287+0000 7f9cee37d640 10 -- 10.131.1.11:0/751099213 >> [v2:242.1.255.254:6800/2685472366,v1:242.1.255.254:6801/2685472366] conn(0x5634ec428000 msgr2=0x5634ec014100 unknown :-1 s=STATE_CONNECTING_RE l=1).process nonblock connect inprogress
```
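Since the connection toward osd.0's endpoint seems to sit in STATE_CONNECTING_RE, a basic reachability check from the client node/pod would help separate a messenger problem from a plain network one. A rough sketch; the addresses are taken from the logs above, and the availability of these tools in the pod is an assumption:
```
ceph osd find 0                    # confirm osd.0's advertised addresses and host
nc -zv -w 5 242.1.255.254 6800     # does a plain TCP connect to the v2 port complete from the client?
nc -zv -w 5 242.1.255.254 6801     # and the v1 port
```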
At "2024-05-06T11:43:10.691169634" the "connect improgress" log messages are still seen: ``` 2024-05-06T11:43:10.691169634Z debug 2024-05-06T11:43:10.690+0000 7f9cedb7c640 10 -- 10.131.1.11:0/1651991364 >> [v2:242.1.255.254:6800/2685472366,v1:242.1.255.254:6801/2685472366] conn(0x5634ec8ea400 msgr2=0x5634 ec8f8580 unknown :-1 s=STATE_CONNECTING_RE l=1).process nonblock connect inprogress ``` What's happening at the messenger layer? cc Radek. Radek/Brad - https://tracker.ceph.com/issues/65986 Now I see that the sync is progressing well on this setup. Probably due to pod restart/node reboot. |