Created attachment 1295819 [details]
script to reproduce

Description of problem:
In Gnocchi, we recently encountered data corruption when "rados_osd_op_timeout" is set. After digging, we found that aio_read() does not return the expected data and does not report any error.

The issue on the Gnocchi side: https://github.com/gnocchixyz/gnocchi/pull/190
It has been worked around by using read() instead of aio_read().

The bug was discovered by RDO CI with Ceph version 10.2.7, but I can reproduce it on many other versions.

How reproducible:
I have attached a Python script to reproduce it.

Actual results:
When "rados_osd_op_timeout" is set, the data returned by aio_read() is corrupted.

Expected results:
No corruption
Actual output of the script:

no timeout read(): 'my fancy blob' : True
with timeout read(): 'my fancy blob' : True
no timeout aio_read(): 'my fancy blob' : True
with timeout aio_read(): 'no timeout ai' : False

The last line shows that aio_read() doesn't return the expected blob.
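For readers without access to the attachment, the comparison behind that output can be sketched roughly as follows. This is not the attached script, only a minimal illustration of the same check using the python-rados bindings. The helper names (read_sync, read_async, compare_reads, run_against_cluster), the conffile path, the pool name "rbd", the object name and the timeout value of 10 seconds are all assumptions made for this example.

```python
import threading


def read_sync(ioctx, name, length):
    """Blocking read(), the path Gnocchi switched to as a workaround."""
    return ioctx.read(name, length, 0)


def read_async(ioctx, name, length, wait_seconds=30.0):
    """Read through aio_read() and return (return_value, data)."""
    done = threading.Event()
    result = {}

    def on_complete(completion, data):
        # python-rados invokes the aio_read() callback with the
        # completion object and the bytes that were read.
        result["rc"] = completion.get_return_value()
        result["data"] = data
        done.set()

    ioctx.aio_read(name, length, 0, on_complete)
    if not done.wait(wait_seconds):
        raise RuntimeError("aio_read() callback did not fire")
    return result["rc"], result["data"]


def compare_reads(ioctx, name, expected):
    """Compare both read paths against the bytes that were written."""
    sync_data = read_sync(ioctx, name, len(expected))
    rc, async_data = read_async(ioctx, name, len(expected))
    return {
        "read_ok": sync_data == expected,
        "aio_read_ok": async_data == expected,
        "aio_rc": rc,
    }


def run_against_cluster(conffile="/etc/ceph/ceph.conf", pool="rbd"):
    """Hypothetical driver: needs python-rados and a reachable cluster."""
    import rados  # imported lazily so the helpers above work without it

    blob = b"my fancy blob"
    # Assumed timeout value; the report only says the option is set.
    for label, conf in (("no timeout", {}),
                        ("with timeout", {"rados_osd_op_timeout": "10"})):
        cluster = rados.Rados(conffile=conffile, conf=conf)
        cluster.connect()
        ioctx = cluster.open_ioctx(pool)
        try:
            ioctx.write_full("repro-object", blob)
            print(label, compare_reads(ioctx, "repro-object", blob))
        finally:
            ioctx.close()
            cluster.shutdown()
```

Calling run_against_cluster() on an affected version would be expected to report aio_read_ok as False only in the "with timeout" case, if it behaves as described in this report.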
Created attachment 1295832 [details]
The script to reproduce
Created attachment 1295834 [details]
The script to reproduce

I have added the result of rados_aio_get_return_value().
Retargeting, since this only affects jewel and earlier. I'm not sure exactly which commit fixed it.
Fixed in luminous and later; not severe enough to warrant patching jewel.