BM-1473: Zeroecco/bento #1011
Conversation
LGTM, but I would get @austinabell's eyes on it before merge.
The NVIDIA compat change is the only thing I think is blocking; the other comments are somewhat opinionated.
```
@@ -75,6 +87,13 @@ services:
      - 6379:6379
    volumes:
      - redis-data:/data
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru --save 900 1 --appendonly yes
```
Where does the 2gb number come from? I'm wondering about the reason for this default; it seems like it would cause issues for provers locking a lot of large orders.
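One possible way to avoid hard-coding the cap (an untested sketch; the `REDIS_MAXMEMORY` and `REDIS_MAXMEMORY_POLICY` variables are hypothetical and not part of this PR) would be to parameterize it from `.env`:

```yaml
services:
  redis:
    # Placeholder image; keep whatever image the compose file already pins.
    image: redis:7
    ports:
      - 6379:6379
    volumes:
      - redis-data:/data
    # Hypothetical overrides: provers locking many large orders can raise the
    # cap or switch the eviction policy (e.g. back to noeviction) via .env.
    command: >
      redis-server
      --maxmemory ${REDIS_MAXMEMORY:-2gb}
      --maxmemory-policy ${REDIS_MAXMEMORY_POLICY:-allkeys-lru}
      --save 900 1
      --appendonly yes
```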
```
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile
```
Noting that this change will add a lot of friction for provers who build locally. It is already incredibly time consuming to build just the broker, and we have been getting a lot of complaints.
It seems like this could be done in two ways: either also include the `image` in the declaration, so that if you don't pass `--build` it uses the prebuilt image and if you do it rebuilds (a sketch of that variant follows the example below), or alternatively add a `compose.local.yml` that looks something like:
```yaml
services:
  exec_agent0:
    image: null
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile
  exec_agent1:
    image: null
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile
  aux_agent:
    image: null
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile
  gpu_prove_agent0:
    image: null
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile
  snark_agent:
    image: null
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile
```
(I haven't extensively tested both; there could be a reason why one might be worse.)
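For the first variant, a minimal sketch of one service (the image name and tag here are placeholders, not the real published image) would look like:

```yaml
services:
  exec_agent0:
    # Placeholder image reference; `docker compose up` uses this prebuilt image
    # when it is present, while `docker compose up --build` rebuilds from source.
    image: ghcr.io/example/bento-agent:latest
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile
```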
My suggestion is that we really have two options here:
- pre-build the binaries and pull them from somewhere directly into the container
- bake the images
I favor the first option. If compiling is a larger issue, we should pre-compile the binaries and post them in releases, IMHO.
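As a rough sketch of the first option (untested; the base image tag, release URL, and binary name below are all placeholders), the agent dockerfile could download a prebuilt binary instead of compiling it:

```dockerfile
# Sketch only: fetch a prebuilt agent binary from a release instead of compiling.
# The base image, release URL, and binary name are placeholders.
FROM nvidia/cuda:12.9.1-runtime-ubuntu24.04

ARG BENTO_VERSION=v0.0.0
ADD https://github.com/example/boundless/releases/download/${BENTO_VERSION}/bento-agent /usr/local/bin/bento-agent
RUN chmod +x /usr/local/bin/bento-agent

ENTRYPOINT ["/usr/local/bin/bento-agent"]
```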
If we agree on my suggestion, I am fine to begin setting up the binary build process.
```
@@ -26,7 +31,6 @@ x-exec-agent-common: &exec-agent-common
    cpus: 3
    environment:
      <<: *base-environment
      LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-/usr/local/cuda-12.2/compat/}
```
Why remove this instead of updating to 12.9?
Not needed in my testing with 12.9 on Ampere and Blackwell cards.
Seems like this is the case, although the specific use case is being able to run these on machines without a CUDA GPU, and I'm not certain it works on all machines; that is harder to test. Perhaps it's fine for this to be removed, and we can see if anyone complains.
I had to remove this line to make it work with a 5090.
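If some machines do turn out to need the compat libraries, a possible middle ground (sketch only, not tested across driver setups) would be to keep the variable host-overridable without baking in a cuda-12.2 default:

```yaml
# Fragment; assumes the existing base-environment anchor from the compose file.
x-exec-agent-common: &exec-agent-common
  environment:
    <<: *base-environment
    # Empty by default; hosts that still need the CUDA compat libs can set
    # LD_LIBRARY_PATH (e.g. /usr/local/cuda-12.9/compat/) in their .env.
    LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-}
```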
🎉 move bento over to boundless
🎉 start redis before starting bento stuff
🎉 update nvidia to 12.9.1
🐛 drop sm_code=90 as it messes with mixed envs