Skip to content

Conversation

tjgreen42
Copy link
Contributor

@tjgreen42 tjgreen42 commented Jun 23, 2025

This PR fixes #193:

  • Introduces a transaction-level lock on index updates to guard against a host of race conditions (in tape.rs, in meta_page.rs, and in index page updates).
  • Adds accompanying Python-based regression tests. (Concurrency testing is not well-supported by PGRX).

tjgreen42 and others added 17 commits June 23, 2025 09:28
Set up Python reproduction script for page corruption issue with concurrent
diskann insertions. Successfully reproduces the assertion failure:
"assertion failed: (*header).pd_special >= SizeOfPageHeaderData as u16"

- Add issue_193_repro.py with adapted connection string for local development
- Add Python dependencies (SQLAlchemy, asyncpg, pgvector, numpy, greenlet)
- Add run_repro.sh helper script for easy testing
- Add documentation in scripts/README.md
- Update CLAUDE.md with additional development context

Issue occurs with parallelism > 1 but works fine with parallelism = 1.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Resolves assertion failure: "(*header).pd_special >= SizeOfPageHeaderData as u16"
that occurred during concurrent vector insertions with high parallelism.

Root cause: Race condition when multiple threads call Tape::resume() and try to
modify the same "newest" page simultaneously, causing page corruption during WAL
registration in WritablePage::modify().

Solution: Modified insert_storage() to use create_tape_safely() which always
creates new pages during insertions instead of resuming existing pages.

Trade-offs:
- ✅ Eliminates race condition and ensures data integrity
- ✅ Maintains performance for concurrent workloads
- ❌ Uses more disk space due to reduced page sharing
- ❌ Temporary fix - proper locking would be more elegant

Testing:
- Reproduced issue consistently before fix (60% failure rate)
- All tests pass after fix (0% failure rate across multiple runs)
- Verified with parallelism up to 10 and batch sizes up to 300

TODO: Implement proper per-relation locking for page resumption to restore
space efficiency while maintaining concurrency safety.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
…tion race condition

## Summary
- Implement Python test infrastructure for multi-process concurrency testing
- Fix GitHub issue #193: concurrent insertion race condition in diskann index
- Add comprehensive testing documentation and CI integration

## Python Test Infrastructure
- **Directory structure**: Organized tests/ with concurrency/ and integration/ subdirs
- **Test framework**: pytest with psycopg2 for reliable database connections
- **Fixtures**: Clean database state management and extension setup
- **Test categories**: Concurrency tests (reproduce issue #193) and integration tests
- **Test runners**: Shell script and Makefile integration for easy execution
- **CI integration**: GitHub Actions workflow for multi-version PostgreSQL testing

## Concurrency Fix (Issue #193)
- **Root cause**: Race condition when multiple processes resume writing to same page
- **Solution**: Modified create_tape_safely() to use Tape::new() instead of Tape::resume()
- **Trade-off**: Uses more disk space but ensures data integrity
- **Validation**: Concurrency tests now pass, reproducing and validating the fix

## Files Added
- `tests/` - Complete Python test suite with concurrency and integration tests
- `TESTING.md` - Comprehensive testing documentation and developer guide
- `.github/workflows/python_tests.yml` - CI workflow for Python tests
- `scripts/run-python-tests.sh` - Smart test runner with environment detection

## Files Modified
- `pgvectorscale/src/access_method/build.rs` - Fix for concurrent insertion safety and functional regression test
- `Makefile` - Added Python test targets
- `README.md` - Link to testing documentation

## Files Moved
- `scripts/issue_193_repro.py` → `tests/concurrency/issue_193_original.py` - Organized into test structure

## Testing Results
- ✅ All integration tests pass (extension installation, index creation, queries)
- ✅ Concurrency tests validate the fix (previously failing, now passing)
- ✅ Multi-process testing works correctly (separate database connections)

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add common ignore patterns for Python development:
- __pycache__/ directories and compiled Python files
- .pytest_cache/ for pytest temporary files
- .DS_Store for macOS system files
- .claude/ for Claude Code workspace files

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Fix deprecated actions/upload-artifact@v3 that was causing CI failures.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add --no-default-features flag to prevent pg17 default from conflicting with matrix PostgreSQL versions.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Explicitly set PGRX_PG_VERSION and clean cargo cache to prevent multiple PostgreSQL version features from conflicting.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Ensure pgrx is reinitialized with the correct PostgreSQL version from the matrix to prevent pg14/pg15 feature conflicts.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Remove Docker service and use the custom-built PostgreSQL instance where pgvector and pgvectorscale are properly installed.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Quote variables to prevent word splitting
- Use direct exit code checks instead of $?
- Add exit on cd failure
- Fix array vs string parameter handling
- Add shellcheck disable for venv/bin/activate

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Change from md5 to trust authentication and simplify database setup to avoid password prompts.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Move test files from subdirectories to root tests/ directory
- Remove empty concurrency/, integration/, and fixtures/ subdirectories
- Update TESTING.md to reflect flattened structure with marker-based organization
- Use pytest markers (@pytest.mark.concurrency, @pytest.mark.integration) instead of directory-based categorization
- Simplify test discovery and execution with standard pytest conventions

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Update GitHub workflow to use pytest markers (-m concurrency) instead of directory paths
- Update Makefile test-concurrency and test-integration targets to use markers
- Remove pytest cache with outdated directory references
- Maintain functionality of make test-concurrency and make test-integration commands

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
- Remove Python version matrix, use fixed Python 3.11
- Reduces CI job count from 6 to 3 (3 PG versions instead of 3 PG × 2 Python)
- Python 3.11 provides sufficient coverage for integration tests
- Update artifact naming to remove Python version reference

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@tjgreen42 tjgreen42 marked this pull request as ready for review June 24, 2025 03:32
@tjgreen42 tjgreen42 requested a review from a team as a code owner June 24, 2025 03:32
@tjgreen42 tjgreen42 requested a review from cevian June 24, 2025 15:29
tjgreen42 and others added 3 commits June 26, 2025 16:48
Quote PYTEST_ARGS variable to prevent word splitting and globbing issues.
This addresses SC2086 warnings and follows bash best practices for variable expansion.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
@tjgreen42 tjgreen42 requested a review from syvb June 26, 2025 23:55
@tjgreen42 tjgreen42 merged commit fa6f1a3 into main Jun 30, 2025
34 checks passed
@tjgreen42 tjgreen42 deleted the tj/concurrent_insertion_safety branch June 30, 2025 16:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: assertion failed: (*header).pd_special >= SizeOfPageHeaderData as u16
2 participants