bbtfr (Member) commented Aug 23, 2025

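Implements the MEGFILE_READER_LAZY_PREFETCH environment variable, which makes a reader lazy: after the reader is created it issues no request until the first read, and operations such as seek can be performed in the meantime.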
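
To illustrate the intended behaviour, here is a minimal sketch of a lazy reader. It is not megfile's implementation: `LazyRangeReader`, `fetch_range`, and `lazy_prefetch_enabled` are hypothetical names, and the set of values accepted for the environment variable is an assumption.

```python
import os


def lazy_prefetch_enabled(environ=os.environ):
    """Hypothetical helper: which values count as 'enabled' is an assumption."""
    value = environ.get("MEGFILE_READER_LAZY_PREFETCH", "")
    return value.strip().lower() in ("1", "true", "yes", "on")


class LazyRangeReader:
    """Toy reader: in lazy mode nothing is requested until the first read()."""

    def __init__(self, fetch_range, size, block_size=4, lazy=True):
        self._fetch_range = fetch_range  # callable(start, end) -> bytes
        self._size = size
        self._offset = 0
        self._cache = (0, b"")  # (start offset, cached bytes)
        if not lazy:
            # Eager mode mimics the old behaviour: fetch from offset 0 on open.
            self._cache = (0, fetch_range(0, min(block_size, size)))

    def seek(self, offset):
        # Seeking only moves the cursor; it never triggers a request.
        self._offset = max(0, min(offset, self._size))
        return self._offset

    def read(self, n):
        start, end = self._offset, min(self._offset + n, self._size)
        cache_start, cache = self._cache
        if cache_start <= start and end <= cache_start + len(cache):
            data = cache[start - cache_start:end - cache_start]
        else:
            data = self._fetch_range(start, end)
        self._offset = end
        return data


payload = b"0123456789abcdef"
requests = []


def fetch_range(start, end):
    # Stand-in for a remote range request; records what was asked for.
    requests.append((start, end))
    return payload[start:end]


reader = LazyRangeReader(fetch_range, len(payload), lazy=True)
reader.seek(10)
requests_before_read = list(requests)  # still empty: open and seek are free
chunk = reader.read(4)
```

In this sketch the only request issued is the one covering the bytes actually read, which is the property the environment variable is meant to provide for readers that start in the middle of a file.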

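Background: our workloads open a large number of file handles, and each one reads only a portion starting from the middle of the file. This causes two problems:

  1. The prefetch reader starts reading 128M from the beginning of the file as soon as the handle is opened, so startup is slow.
  2. Some files can tolerate slow reads, but we want reading them to use less memory and avoid OOM.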

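Changes:

  1. Added the MEGFILE_READER_LAZY_PREFETCH environment variable. Because this is an experimental feature, it is not exposed through the smart_open interface and is controlled only by the environment variable for now. I looked into why a reader starts reading as soon as it is created and found two places:
    1. _get_content_size used to read the first block of data as an optimization; with MEGFILE_READER_LAZY_PREFETCH set, it now only issues a head request.
    2. _is_pickle calls read(2), which triggers prefetch and costs even more than _get_content_size; with MEGFILE_READER_LAZY_PREFETCH set, this check is now disabled by default.
  2. Sometimes we also want the reader to use less memory, in which case max_buffer_size=0 is set. That creates another problem: the reader then fetches exactly as many bytes as each read asks for, producing a large number of small requests, so buffered=True is enabled as well. However, the default io.BufferedReader buffer is only 16K, which still makes each S3 request very small. The change therefore passes block_size=8M as the buffer_size of io.BufferedReader, so each fetch is still an 8M block.
  3. Added some logging to LRUCacheFutureManager and ShareCacheFutureManager to make it easier to check whether the cache behaves as expected. Also renamed the variables previously called key in ShareCacheFutureManager to name, matching the terminology used in the reader.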
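
The effect described in item 2 can be reproduced with the standard library alone. The sketch below uses a hypothetical `RecordingRaw` class as a stand-in for the unbuffered reader and records how many bytes each underlying request asks for; it says nothing about how megfile wires this up internally.

```python
import io


class RecordingRaw(io.RawIOBase):
    """In-memory raw stream that records how many bytes each request asks for."""

    def __init__(self, data):
        super().__init__()
        self._data = data
        self._pos = 0
        self.request_sizes = []

    def readable(self):
        return True

    def readinto(self, b):
        # Each call stands in for one remote range request.
        self.request_sizes.append(len(b))
        chunk = self._data[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


BLOCK_SIZE = 8 * 2**20  # 8 MiB, mirroring block_size in the description

payload = bytes(range(256)) * 64  # 16 KiB of sample data

# Default buffer size: the underlying request is small.
small = RecordingRaw(payload)
small_reader = io.BufferedReader(small)
small_reader.read(16)

# buffer_size=BLOCK_SIZE: the same 16-byte read asks the raw stream for a full block.
large = RecordingRaw(payload)
large_reader = io.BufferedReader(large, buffer_size=BLOCK_SIZE)
head = large_reader.read(16)
```

Comparing `small.request_sizes` with `large.request_sizes` shows that the caller's read size stays the same while the size of the underlying request follows `buffer_size`.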

codecov bot commented Aug 23, 2025

Codecov Report

❌ Patch coverage is 97.22222% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 99.00%. Comparing base (0dd2329) to head (30bdc88).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| megfile/utils/__init__.py | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #551      +/-   ##
==========================================
- Coverage   99.01%   99.00%   -0.02%     
==========================================
  Files          44       44              
  Lines        6490     6510      +20     
==========================================
+ Hits         6426     6445      +19     
- Misses         64       65       +1     

@bbtfr bbtfr force-pushed the liyang/lazy-prefetch branch from 787245d to f7960bf Compare August 23, 2025 08:50
@bbtfr bbtfr force-pushed the liyang/lazy-prefetch branch from f7960bf to 30bdc88 Compare August 23, 2025 08:54
@@ -82,9 +82,7 @@ def __init__(

self._offset = 0
self._cached_buffer = None
self._block_index = None # Current block index
Member Author

I found that _block_index must be an int: _seek_buffer later sets it to 0, so initializing it to None here is misleading. Removed it for now.

Collaborator

How about self._block_index = 0 instead? If it is not set in __init__, I worry it will get missed later.

@bbtfr bbtfr requested a review from LoveEatCandy August 23, 2025 09:16
LoveEatCandy (Collaborator) commented

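How about having _is_pickle keep only the extension-based check? When that read(2) was added, the case of not reading from the start of the file was not really accounted for; it now looks like the drawbacks outweigh the benefits.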
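
As a rough sketch of what an extension-only check could look like: `looks_like_pickle` and the extension list below are assumptions for illustration, not megfile's actual _is_pickle or its real list of suffixes.

```python
import os

# Assumed list of suffixes; not taken from megfile.
PICKLE_EXTENSIONS = (".pkl", ".pickle")


def looks_like_pickle(path):
    """Decide from the file name alone, so no bytes are read from the file."""
    _, ext = os.path.splitext(path)
    return ext.lower() in PICKLE_EXTENSIONS
```

Because nothing is read, a check like this cannot trigger a prefetch; the trade-off is that a pickle stored under an unrelated extension would no longer be detected.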
