introduce lazy prefetch #551

bbtfr · 2025-08-23T08:38:23Z

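Implements the MEGFILE_READER_LAZY_PREFETCH environment variable, which makes a reader lazy: after the reader is created it does not start requesting right away, and only begins requesting on the first read. Operations such as seek can be performed in the meantime.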
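To illustrate the intended behavior, here is a minimal sketch of a lazy reader. It is not megfile's implementation: `LazyReader` and the `fetch(start, end)` callable (standing in for a ranged request) are hypothetical names used only for this example, and the file size is assumed to be known up front.

```python
import io


class LazyReader:
    """Hypothetical lazy reader: no data request happens until read()."""

    def __init__(self, fetch, size):
        # fetch(start, end) -> bytes stands in for a ranged request (assumption).
        self._fetch = fetch
        self._size = size
        self._offset = 0

    def seek(self, offset, whence=io.SEEK_SET):
        # Seeking only moves the offset; it never triggers a request.
        if whence == io.SEEK_SET:
            target = offset
        elif whence == io.SEEK_CUR:
            target = self._offset + offset
        elif whence == io.SEEK_END:
            target = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        self._offset = max(0, min(target, self._size))
        return self._offset

    def tell(self):
        return self._offset

    def read(self, size=-1):
        # The first request is issued here, starting at the current offset.
        if size is None or size < 0:
            size = self._size - self._offset
        end = min(self._offset + size, self._size)
        if end <= self._offset:
            return b""
        data = self._fetch(self._offset, end)
        self._offset = end
        return data
```

With a reader like this, opening a handle and seeking to the middle of the file costs nothing; the only request made is the one that covers the bytes actually read.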

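Background: our workload opens a large number of file handles and, each time, reads only a portion starting from the middle of the file. This causes two problems:

The prefetch reader starts reading 128M from the beginning of the file as soon as the handle is opened, so startup is slow.
Some files can tolerate slow reads, but we want reading them to use less memory to avoid OOM.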

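Changes in detail:

Added the MEGFILE_READER_LAZY_PREFETCH environment variable. Because this is an experimental feature, it is not exposed through the smart_open interface and is controlled only by the environment variable for now. I looked into why the reader reads data as soon as it is created and found two places:
1. _get_content_size used to read the first block of data as an optimization; when MEGFILE_READER_LAZY_PREFETCH is set, it now only does a head request.
2. _is_pickle calls read(2), which triggers prefetch and costs even more than _get_content_size; it is now disabled by default when MEGFILE_READER_LAZY_PREFETCH is set.
Sometimes we also want the reader to use less memory, so we set max_buffer_size=0. This creates another problem: the reader then fetches exactly as many bytes as each read asks for, which produces a very large number of small requests. We therefore also turn on buffered=True, but the io.BufferedReader default buffer size is only 16K, so each request is still very small for s3. To fix this, block_size=8M is passed as the buffer_size of io.BufferedReader, so that each read still fetches an 8M chunk.
Added some logging to LRUCacheFutureManager and ShareCacheFutureManager to make it easier to check whether the cache behaves as expected. Also renamed the variables previously called key in ShareCacheFutureManager to name, to match the naming used in the reader.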
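The effect of the io.BufferedReader buffer_size on request count can be sketched with the standard library alone. `CountingRaw` and `count_raw_reads` are hypothetical helpers for this illustration, with each `readinto` call standing in for one request to the underlying storage; the sizes are scaled down from the 8M block size so the example runs quickly.

```python
import io


class CountingRaw(io.RawIOBase):
    """In-memory raw stream that counts how often it is asked for data."""

    def __init__(self, data):
        self._data = data
        self._pos = 0
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Each call here models one request to the storage backend.
        self.calls += 1
        chunk = self._data[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


def count_raw_reads(buffer_size, total=1 << 20, step=4096):
    """Read `total` bytes in `step`-sized reads and return the raw call count."""
    raw = CountingRaw(bytes(total))
    reader = io.BufferedReader(raw, buffer_size=buffer_size)
    while reader.read(step):
        pass
    return raw.calls


small = count_raw_reads(16 * 1024)    # small buffer, many raw reads
large = count_raw_reads(256 * 1024)   # larger buffer, far fewer raw reads
print(small, large)
```

The caller's read size stays the same in both runs; only the buffer size changes how many times the raw stream is hit, which is the reason for passing the block size through as buffer_size.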

codecov · 2025-08-23T08:40:38Z

Codecov Report

❌ Patch coverage is 97.22222% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 99.00%. Comparing base (0dd2329) to head (30bdc88).

Files with missing lines	Patch %	Lines
megfile/utils/__init__.py	66.66%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #551      +/-   ##
==========================================
- Coverage   99.01%   99.00%   -0.02%     
==========================================
  Files          44       44              
  Lines        6490     6510      +20     
==========================================
+ Hits         6426     6445      +19     
- Misses         64       65       +1


bbtfr · 2025-08-23T09:02:56Z

megfile/lib/base_prefetch_reader.py

@@ -82,9 +82,7 @@ def __init__(

        self._offset = 0
        self._cached_buffer = None
-        self._block_index = None  # Current block index


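Found that _block_index must be an int, and the later _seek_buffer sets it to 0. Initializing it to None here is misleading, so I removed it for now.

How about self._block_index = 0 instead? If it is not set in init, I worry it will be missed later.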

LoveEatCandy · 2025-08-26T02:28:43Z

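How about having _is_pickle keep only the check by file extension? When this read(2) was added it was not really made compatible with the case where the start of the file is not read, and it now looks like the drawbacks outweigh the benefits.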
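An extension-only check along the lines suggested here could look like the sketch below. This is not the existing _is_pickle code: the function name `is_pickle_by_extension` and the extension list are assumptions made for illustration.

```python
import os

# Assumed set of pickle extensions; the real list may differ.
PICKLE_EXTENSIONS = (".pkl", ".pickle")


def is_pickle_by_extension(path: str) -> bool:
    """Guess whether a path is a pickle file without reading any bytes."""
    extension = os.path.splitext(path)[1].lower()
    return extension in PICKLE_EXTENSIONS
```

Because the check looks only at the path, it never touches the reader and cannot trigger a prefetch, at the cost of missing pickle files that have an unusual or missing extension.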

bbtfr force-pushed the liyang/lazy-prefetch branch from 787245d to f7960bf Compare August 23, 2025 08:50

introduce lazy prefetch

30bdc88

bbtfr force-pushed the liyang/lazy-prefetch branch from f7960bf to 30bdc88 Compare August 23, 2025 08:54

bbtfr requested a review from LoveEatCandy August 23, 2025 09:16
