introduce lazy prefetch #551
Open
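Implements the `MEGFILE_READER_LAZY_PREFETCH` environment variable, which makes a reader lazy: creating the reader does not send any request right away, and requests start only on the first read. Operations such as seek can be performed in the meantime.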
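A minimal sketch of the lazy behavior described above, not megfile's actual implementation. `LazyReader` and `fetch_range` are hypothetical names invented for illustration, and the file size is passed in directly instead of being looked up remotely.

```python
class LazyReader:
    """Toy reader: no data request is made until the first read()."""

    def __init__(self, fetch_range, size):
        # fetch_range is a hypothetical callable(start, end) -> bytes,
        # standing in for a ranged request against remote storage.
        self._fetch_range = fetch_range
        self._size = size
        self._offset = 0

    def seek(self, offset):
        # Only moves the cursor; never touches the backend.
        self._offset = max(0, min(offset, self._size))
        return self._offset

    def read(self, n):
        end = min(self._offset + n, self._size)
        if end <= self._offset:
            return b""
        # The first (and every) backend request happens here, at read time.
        data = self._fetch_range(self._offset, end)
        self._offset = end
        return data


data = bytes(range(256)) * 4
requests = []


def fetch_range(start, end):
    # Record each backend request so the laziness can be observed.
    requests.append((start, end))
    return data[start:end]


reader = LazyReader(fetch_range, len(data))
reader.seek(512)
requests_before_read = len(requests)
chunk = reader.read(16)
```

In this sketch, creating the reader and seeking leave `requests` empty; only `read` adds an entry, and that entry starts at the seek position rather than at the beginning of the file.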
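Background: in production use, a large number of file handles are opened, and each one reads only part of the file starting from the middle, which leads to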
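Changes:
- Added the `MEGFILE_READER_LAZY_PREFETCH` environment variable. Because this is an experimental feature, it is not exposed on the `smart_open` interface and is controlled only through the environment variable for now.
- Tracked down why the reader reads at creation time and found two places:
  - `_get_content_size` previously read the first block of data as an optimization. With `MEGFILE_READER_LAZY_PREFETCH` set, it now only issues a head request.
  - `_is_pickle` calls `read(2)`, which triggers a prefetch and costs even more than `_get_content_size`. With `MEGFILE_READER_LAZY_PREFETCH` set, it is now turned off by default.
- `max_buffer_size=0`: this creates another problem. The reader then reads exactly as much as each call asks for, which produces a great many small requests, so `buffered=True` is turned on at the same time. However, the default buffer size of `io.BufferedReader` is then only 16K, so each request is still very small for s3. `block_size=8M` is therefore assigned to the `buffer_size` of `io.BufferedReader`, so that each read still fetches an 8M block.
- Added some logging to `LRUCacheFutureManager` and `ShareCacheFutureManager` to make it easier to check whether the cache behaves as expected. Also renamed the variables previously called `key` in `ShareCacheFutureManager` to `name`, to match the naming used in the reader.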
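
The effect of passing a larger `buffer_size` to `io.BufferedReader` can be illustrated with the standard library alone. This is a sketch under stated assumptions, not code from this PR: `CountingRaw` and `count_backend_reads` are hypothetical helpers, and the in-memory payload stands in for an unbuffered remote reader.

```python
import io


class CountingRaw(io.RawIOBase):
    """Stand-in for an unbuffered remote reader; counts backend reads."""

    def __init__(self, payload):
        self._payload = payload
        self._pos = 0
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Each call models one request to the backend.
        self.calls += 1
        chunk = self._payload[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def count_backend_reads(buffer_size, payload_size=1 << 20, read_size=1024):
    """Read the whole payload in small pieces and count backend calls."""
    raw = CountingRaw(b"x" * payload_size)
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    while reader.read(read_size):
        pass
    return raw.calls


small = count_backend_reads(16 * 1024)        # small buffer
large = count_backend_reads(8 * 1024 * 1024)  # 8M buffer, as in block_size
print(small, large)
```

With the same 1 KiB caller reads, the larger buffer should need far fewer backend reads, which is the reason given above for assigning the block size to `buffer_size`.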