Commit a5f615d: Merge branch 'main' into agent-revamp (2 parents: a51f812 + ae514f5)

33 files changed (+1560, -122 lines)

.changeset/chilly-laws-smile.md

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+---
+"@browserbasehq/stagehand": patch
+---
+
+add webvoyager evals

.changeset/cruel-onions-live.md

Lines changed: 0 additions & 5 deletions
This file was deleted.

.changeset/neat-walls-walk.md

Lines changed: 0 additions & 5 deletions
This file was deleted.

.changeset/slimy-cars-matter.md

Lines changed: 5 additions & 0 deletions

@@ -0,0 +1,5 @@
+---
+"@browserbasehq/stagehand": patch
+---
+
+add support for custom baseUrl within openai provider
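
Both changeset files above follow the standard Changesets convention: YAML frontmatter mapping each affected package to a bump type, followed by the release note. As a sketch, an equivalent entry can be written by hand (the filename `example-entry.md` is hypothetical; the `changeset` CLI normally generates a random name like `chilly-laws-smile.md`):

```shell
# Sketch: a hand-written Changesets entry equivalent to the ones above.
# The filename is hypothetical; the CLI normally picks a random one.
mkdir -p .changeset
cat > .changeset/example-entry.md <<'EOF'
---
"@browserbasehq/stagehand": patch
---

add webvoyager evals
EOF
# The entry has exactly two frontmatter delimiter lines.
grep -c '^---$' .changeset/example-entry.md
```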

.github/ISSUE_TEMPLATE/bug_report.md

Lines changed: 76 additions & 0 deletions

@@ -0,0 +1,76 @@
+---
+name: Bug report
+about: Detailed descriptions help us resolve faster
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Before submitting an issue, please:**
+
+- [ ] Check the [documentation](https://docs.stagehand.dev/) for relevant information
+- [ ] Search existing [issues](https://github.com/browserbase/stagehand/issues) to avoid duplicates
+
+## Environment Information
+
+Please provide the following information to help us reproduce and resolve your issue:
+
+**Stagehand:**
+
+- Language/SDK: [TypeScript, Python, MCP…]
+- Stagehand version: [e.g., 1.0.0]
+
+**AI Provider:**
+
+- Provider: [e.g., OpenAI, Anthropic, Azure OpenAI]
+- Model: [e.g., gpt-4o, claude-3-7-sonnet-latest]
+
+## Issue Description
+
+```
+[Describe the current behavior here]
+
+```
+
+### Steps to Reproduce
+
+1.
+2.
+3.
+
+### Minimal Reproduction Code
+
+```tsx
+// Your minimal reproduction code here
+import { Stagehand } from '@browserbasehq/stagehand';
+
+const stagehand = new Stagehand({
+  // IMPORTANT: include your stagehand config
+});
+
+// Steps that reproduce the issue
+
+```
+
+### Error Messages / Log trace
+
+```
+[Paste error messages/logs here]
+
+```
+
+### Screenshots / Videos
+
+```
+[Attach screenshots or videos here]
+
+```
+
+### Related Issues
+
+Are there any related issues or PRs?
+
+- Related to: #[issue number]
+- Duplicate of: #[issue number]
+- Blocks: #[issue number]
.github/ISSUE_TEMPLATE/feature_request.md

Lines changed: 23 additions & 0 deletions

@@ -0,0 +1,23 @@
+---
+name: Feature request
+about: Suggest an idea for this project
+title: ''
+labels: ''
+assignees: ''
+
+---
+
+**Is your feature request related to a problem? Please describe.**
+A clear and concise description of what the problem is. Ex. I'm always frustrated when [...]
+
+**Describe the solution you'd like**
+A clear and concise description of what you want to happen.
+
+**Describe alternatives you've considered**
+A clear and concise description of any alternative solutions or features you've considered.
+
+**Are you willing to contribute to implementing this feature or fix?**
+
+- [ ] Yes, I can submit a PR
+- [ ] Yes, but I need guidance
+- [ ] No, I cannot contribute at this time

.github/workflows/ci.yml

Lines changed: 74 additions & 1 deletion

@@ -12,7 +12,7 @@ on:
 
 env:
   EVAL_MODELS: "openai/gpt-4.1,google/gemini-2.0-flash,anthropic/claude-3-5-sonnet-latest"
-  EVAL_CATEGORIES: "observe,act,combination,extract,targeted_extract"
+  EVAL_CATEGORIES: "observe,act,combination,extract,targeted_extract,agent"
   EVAL_MAX_CONCURRENCY: 25
   EVAL_TRIAL_COUNT: 5
 
@@ -29,6 +29,7 @@ jobs:
       run-act: ${{ steps.check-labels.outputs.run-act }}
       run-observe: ${{ steps.check-labels.outputs.run-observe }}
       run-targeted-extract: ${{ steps.check-labels.outputs.run-targeted-extract }}
+      run-agent: ${{ steps.check-labels.outputs.run-agent }}
    steps:
      - id: check-labels
        run: |
@@ -40,6 +41,7 @@
            echo "run-act=true" >> $GITHUB_OUTPUT
            echo "run-observe=true" >> $GITHUB_OUTPUT
            echo "run-targeted-extract=true" >> $GITHUB_OUTPUT
+            echo "run-agent=true" >> $GITHUB_OUTPUT
            exit 0
          fi
 
@@ -49,6 +51,7 @@
          echo "run-act=${{ contains(github.event.pull_request.labels.*.name, 'act') }}" >> $GITHUB_OUTPUT
          echo "run-observe=${{ contains(github.event.pull_request.labels.*.name, 'observe') }}" >> $GITHUB_OUTPUT
          echo "run-targeted-extract=${{ contains(github.event.pull_request.labels.*.name, 'targeted-extract') }}" >> $GITHUB_OUTPUT
+          echo "run-agent=${{ contains(github.event.pull_request.labels.*.name, 'agent') }}" >> $GITHUB_OUTPUT
 
  run-lint:
    runs-on: ubuntu-latest
@@ -562,3 +565,73 @@
            echo "Eval summary not found for targeted_extract category. Failing CI."
            exit 1
          fi
+
+  run-agent-evals:
+    needs: [run-targeted-extract-evals, determine-evals]
+    runs-on: ubuntu-latest
+    timeout-minutes: 90 # Agent evals can be long-running
+    env:
+      OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
+      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
+      GOOGLE_GENERATIVE_AI_API_KEY: ${{ secrets.GOOGLE_GENERATIVE_AI_API_KEY }}
+      BRAINTRUST_API_KEY: ${{ secrets.BRAINTRUST_API_KEY }}
+      BROWSERBASE_API_KEY: ${{ secrets.BROWSERBASE_API_KEY }}
+      BROWSERBASE_PROJECT_ID: ${{ secrets.BROWSERBASE_PROJECT_ID }}
+      HEADLESS: true
+      EVAL_ENV: browserbase
+      # Use agent models for agent evals in CI
+      EVAL_AGENT_MODELS: "computer-use-preview-2025-03-11,claude-3-7-sonnet-latest"
+      EVAL_TRIAL_COUNT: 2 # Reduce trials for agent evals
+      EVAL_MAX_CONCURRENCY: 10 # Lower concurrency for agent evals
+    steps:
+      - name: Check out repository code
+        uses: actions/checkout@v4
+
+      - name: Check for 'agent' label
+        id: label-check
+        run: |
+          if [ "${{ needs.determine-evals.outputs.run-agent }}" != "true" ]; then
+            echo "has_label=false" >> $GITHUB_OUTPUT
+            echo "No label for AGENT. Exiting with success."
+          else
+            echo "has_label=true" >> $GITHUB_OUTPUT
+          fi
+
+      - name: Set up Node.js
+        if: needs.determine-evals.outputs.run-agent == 'true'
+        uses: actions/setup-node@v4
+        with:
+          node-version: "20"
+
+      - name: Install dependencies
+        if: needs.determine-evals.outputs.run-agent == 'true'
+        run: |
+          rm -rf node_modules
+          npm i -g pnpm
+          pnpm install --no-frozen-lockfile
+
+      - name: Build Stagehand
+        if: needs.determine-evals.outputs.run-agent == 'true'
+        run: pnpm run build
+
+      - name: Run Agent Evals
+        if: needs.determine-evals.outputs.run-agent == 'true'
+        run: pnpm run evals category agent
+
+      - name: Log Agent Evals Performance
+        if: needs.determine-evals.outputs.run-agent == 'true'
+        run: |
+          experimentName=$(jq -r '.experimentName' eval-summary.json)
+          echo "View results at https://www.braintrust.dev/app/Browserbase/p/stagehand/experiments/${experimentName}"
+          if [ -f eval-summary.json ]; then
+            agent_score=$(jq '.categories.agent' eval-summary.json)
+            echo "Agent category score: $agent_score%"
+            # Lower threshold for agent evals since they're complex
+            if (( $(echo "$agent_score < 50" | bc -l) )); then
+              echo "Agent category score is below 50%. Failing CI."
+              exit 1
+            fi
+          else
+            echo "Eval summary not found for agent category. Failing CI."
+            exit 1
+          fi

CHANGELOG.md

Lines changed: 6 additions & 0 deletions

@@ -1,5 +1,11 @@
 # @browserbasehq/stagehand
 
+## 2.4.4
+
+### Patch Changes
+
+- [#1012](https://github.com/browserbase/stagehand/pull/1012) [`9e8c173`](https://github.com/browserbase/stagehand/commit/9e8c17374fdc8fbe7f26e6cf802c36bd14f11039) Thanks [@miguelg719](https://github.com/miguelg719)! - Fix disabling api validation whenever a customLLM client is provided
+
 ## 2.4.3
 
 ### Patch Changes

README.md

Lines changed: 1 addition & 0 deletions

@@ -125,6 +125,7 @@ pnpm playwright install
 pnpm run build
 pnpm run example # run the blank script at ./examples/example.ts
 pnpm run example 2048 # run the 2048 example at ./examples/2048.ts
+pnpm run evals -man # see evaluation suite options
 ```
 
 Stagehand is best when you have an API key for an LLM provider and Browserbase credentials. To add these to your project, run:

evals/CHANGELOG.md

Lines changed: 7 additions & 0 deletions

@@ -1,5 +1,12 @@
 # @browserbasehq/stagehand-evals
 
+## 1.0.8
+
+### Patch Changes
+
+- Updated dependencies [[`9e8c173`](https://github.com/browserbase/stagehand/commit/9e8c17374fdc8fbe7f26e6cf802c36bd14f11039)]:
+  - @browserbasehq/stagehand@2.4.4
+
 ## 1.0.7
 
 ### Patch Changes
