BM-1473: Zeroecco/bento #1011


Open · zeroecco wants to merge 12 commits into main from Zeroecco/bento

Conversation

zeroecco (Contributor):

🎉 move bento over to boundless
🎉 start redis before starting bento stuff
🎉 update nvidia to 12.9.1
🐛 drop sm_code=90 as it messes with mixed envs

github-actions bot changed the title from "Zeroecco/bento" to "BM-1473: Zeroecco/bento" (Aug 12, 2025)

willpote (Contributor) left a comment:

LGTM, but I would get @austinabell's eyes on it before merge.

austinabell (Contributor) left a comment:

The NVIDIA compat is the only thing I think is blocking; the other comments are somewhat opinionated.

@@ -75,6 +87,13 @@ services:
- 6379:6379
volumes:
- redis-data:/data
command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru --save 900 1 --appendonly yes

Contributor:

Where does the 2gb number come from? I'm wondering about the reason for this default; it seems like it would cause issues for provers locking a lot of large orders.
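
One way to address this would be a sketch along these lines (assuming the service is named redis in the compose file; REDIS_MAXMEMORY is a hypothetical variable name), exposing the limit through compose variable substitution so operators can raise it without editing the file:

services:
  redis:
    # the 2gb default is kept, but a deployment can override it via .env
    command: >
      redis-server
      --maxmemory ${REDIS_MAXMEMORY:-2gb}
      --maxmemory-policy allkeys-lru
      --save 900 1
      --appendonly yes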

Comment on lines +14 to +16
build:
context: .
dockerfile: dockerfiles/agent.dockerfile

Contributor:

Noting that this change will add a lot of friction for provers who build locally. It is already incredibly time-consuming to build just the broker, and we have been getting a lot of complaints.

It seems like this could be done in two ways: either also include the image in the declaration, so that if you don't pass --build it uses the prebuilt image and if you do it rebuilds, or have a compose.local.yml that looks something like:

# compose.local.yml: override each agent service to build locally instead of using the published image
services:
  exec_agent0:
    image: null
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile

  exec_agent1:
    image: null
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile

  aux_agent:
    image: null
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile

  gpu_prove_agent0:
    image: null
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile

  snark_agent:
    image: null
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile 

(I haven't extensively tested both; there could be a reason why one might be worse.)
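
For comparison, the first alternative could look like the sketch below (the image name is a placeholder for whatever prebuilt tag the compose file currently references): keeping image and build on the same service means a plain docker compose up uses the published image, while docker compose up --build recompiles locally.

services:
  exec_agent0:
    image: <prebuilt-agent-image>   # placeholder for the existing published image tag
    build:
      context: .
      dockerfile: dockerfiles/agent.dockerfile

The override-file route would instead be run as docker compose -f compose.yml -f compose.local.yml up --build (assuming the base file is compose.yml); later files override matching keys in earlier ones.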

zeroecco (Author):

My suggestion is that we really have two options here:

  1. pre-build the binaries and pull them from somewhere directly into the container
  2. bake the images.

I favor the first option. If compiling is the larger issue here, we should pre-compile the binaries and post them in releases, IMHO.
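
As a rough illustration of option 1 (a sketch only; the base image, binary name, and paths below are assumptions, not anything defined in this PR), an agent service could run a release-downloaded binary on top of a runtime-only image instead of compiling inside the container:

services:
  exec_agent0:
    image: ubuntu:24.04   # assumption: a runtime-only base; GPU agents would need a CUDA runtime image instead
    volumes:
      - ./bin/agent:/usr/local/bin/agent:ro   # hypothetical prebuilt binary fetched from a GitHub release
    entrypoint: ["/usr/local/bin/agent"]

That keeps docker compose up free of any compile step, at the cost of building, publishing, and versioning the binaries alongside each release.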

zeroecco (Author):

If we agree on my suggestion, I'm fine to begin setting up the binary build process.

@@ -26,7 +31,6 @@ x-exec-agent-common: &exec-agent-common
cpus: 3
environment:
<<: *base-environment
LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-/usr/local/cuda-12.2/compat/}

Contributor:

Why remove this instead of updating to 12.9?

zeroecco (Author):

Not needed in my testing with 12.9 on Ampere and Blackwell cards.

Contributor:

Seems like this is the case, although the specific use case is being able to run these on machines without a CUDA GPU, and I'm not certain it works on all machines -- that is harder to test. Perhaps it's fine for this to be removed and we see if anyone complains.

Comment:

I had to remove this line to make it work with a 5090.
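
If an escape hatch turns out to be needed, one sketch (assuming the existing x-exec-agent-common anchor from the diff above; the empty default is the only change) would keep the variable in the compose file but default it to nothing, so only hosts that still need the compat libraries set it in .env:

x-exec-agent-common: &exec-agent-common
  environment:
    <<: *base-environment
    # empty by default; hosts that need the CUDA compat libraries can point this at
    # the compat directory of their installed toolkit via .env
    LD_LIBRARY_PATH: ${LD_LIBRARY_PATH:-}

Setting it to an empty string is not exactly the same as dropping the variable, but it keeps the knob available without forcing the 12.2 compat path on anyone.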
