Downsample Timeseries client #320

Draft · wants to merge 15 commits into base: main-v2

7 changes: 7 additions & 0 deletions CHANGELOG.md
@@ -1,8 +1,15 @@
# Changes

## June 9, 2025
- Added DownsampledTimeseriesClient with downsampling methods LTTB (Largest Triangle Three Buckets) and Decimate for improved performance with large datasets

## June 6, 2025
- Added SequentialRecordingsTable plugin for NWB files to visualize SequentialRecordingsTable neurodata type

## June 5, 2025
- Modernized Python package structure with pyproject.toml configuration. Removed legacy setup.py, setup.cfg, and setup.cfg.j2 files
- Added option for using local neurosift server with the CLI
- Fixed CORS policy in local file access server to allow any localhost port for development

## May 20, 2025
- Added support for resolving NWB file URLs from dandiset path in NwbPage
220 changes: 220 additions & 0 deletions docs/llm_docs/nwb_read_in_neurosift.md
@@ -0,0 +1,220 @@
# NWB Data Reading in Neurosift

This document explains how Neurosift reads and processes NWB (Neurodata Without Borders) files, including the architecture, optimizations, and technical details relevant for contributors.

## Overview

Neurosift uses a multi-layered approach to read NWB files efficiently in the browser. The system supports both traditional HDF5 files and optimized LINDI (Linked Data Interface) files, with intelligent format detection and performance optimizations.

## Architecture Components

### 1. Entry Point (`src/pages/NwbPage/NwbPage.tsx`)
- Handles URL processing and format detection
- Manages DANDI API integration for asset resolution
- Coordinates LINDI optimization attempts

### 2. HDF5 Interface Layer (`src/pages/NwbPage/hdf5Interface.ts`)
- Central abstraction for all file operations
- Implements caching and request deduplication
- Manages authentication and error handling
- Provides React hooks for component integration

### 3. Remote File Access (`src/remote-h5-file/`)
- Core file reading implementation
- Supports multiple file formats (HDF5, LINDI)
- Handles HTTP range requests and chunking
- Web Worker integration for non-blocking operations

## Data Flow

```
URL Input → Format Detection → LINDI Check → File Access → Caching → Visualization
     ↓              ↓               ↓              ↓           ↓           ↓
NwbPage.tsx → hdf5Interface → tryGetLindiUrl → RemoteH5File* → Cache → Plugins
```

### Step-by-Step Process

1. **URL Resolution**: Convert DANDI paths to direct download URLs
2. **LINDI Detection**: Check for optimized `.lindi.json` or `.lindi.tar` files
3. **File Access**: Use appropriate reader (HDF5 or LINDI)
4. **Data Loading**: Lazy load only required data with chunking
5. **Caching**: Store results to avoid redundant requests
6. **Visualization**: Pass data to type-specific plugins

## File Format Support

### Traditional HDF5 Files
- **Access Method**: HTTP range requests via Web Workers
- **Worker URL**: `https://tempory.net/js/RemoteH5Worker.js`
- **Chunk Size**: 100KB default (configurable)
- **Limitations**: Slower metadata access, requires full header parsing

### LINDI Files (Optimized)
- **Format**: JSON-based reference file system
- **Metadata**: Instant access to all HDF5 structure
- **Data Storage**: References to external URLs or embedded chunks
- **Location**: `https://lindi.neurosift.org/[dandi|dandi-staging]/dandisets/{id}/assets/{asset_id}/nwb.lindi.json`
- **Tar Support**: `.lindi.tar` files containing both metadata and data

## Performance Optimizations

### 1. LINDI Priority System
```typescript
if (isDandiAssetUrl(url) && currentDandisetId && tryUsingLindi) {
  const lindiUrl = await tryGetLindiUrl(url, currentDandisetId);
  if (lindiUrl) return { url: lindiUrl }; // 10-100x faster metadata access
}
```

### 2. Lazy Loading Strategy
- **Groups**: Load structure on-demand
- **Datasets**: Load metadata separately from data
- **Data**: Only load when visualization requires it

### 3. HTTP Range Requests
- Load only required byte ranges from large files
- Configurable chunk sizes for optimal network usage
- Automatic retry logic for failed requests
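
For example, a single chunk read can be expressed with a standard `Range` header. This is an illustrative sketch only (the chunk size matches the documented 100KB default, but the helper name is not from the codebase):

```typescript
// Illustrative sketch of a byte-range read; not the actual implementation.
const DEFAULT_CHUNK_SIZE = 100 * 1024; // documented 100KB default

async function fetchByteRange(
  url: string,
  start: number,
  size: number = DEFAULT_CHUNK_SIZE
): Promise<ArrayBuffer> {
  const response = await fetch(url, {
    headers: { Range: `bytes=${start}-${start + size - 1}` },
  });
  if (response.status !== 206 && response.status !== 200) {
    throw new Error(`Range request failed with status ${response.status}`);
  }
  return response.arrayBuffer();
}
```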

### 4. Multi-Level Caching
- **In-Memory**: Groups, datasets, and data results
- **Request Deduplication**: Prevent duplicate network calls
- **Status Tracking**: Monitor ongoing operations
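
A minimal sketch of the deduplication idea (the helper and key scheme are illustrative, not the actual cache in `hdf5Interface.ts`): concurrent callers asking for the same group or dataset share a single in-flight promise.

```typescript
// Illustrative request deduplication: concurrent callers share one promise.
const inFlight = new Map<string, Promise<unknown>>();

function deduplicated<T>(key: string, load: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>; // cache hit (pending or resolved)
  const promise = load();
  inFlight.set(key, promise);
  return promise;
}

// Usage: both calls below result in a single network request.
// deduplicated(`${url}|/acquisition`, () => getHdf5Group(url, "/acquisition"));
// deduplicated(`${url}|/acquisition`, () => getHdf5Group(url, "/acquisition"));
```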

### 5. Web Workers
- Non-blocking file operations
- Prevents UI freezing during large data loads
- Single worker by default (configurable)
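
The worker interaction can be pictured roughly as below; the message shape is an assumption for illustration, not the actual `RemoteH5Worker` protocol.

```typescript
// Rough sketch of offloading a read to the worker; the message format is an
// assumption. A real implementation would tag messages with request IDs so
// concurrent requests are not mixed up.
const worker = new Worker("https://tempory.net/js/RemoteH5Worker.js");

function requestFromWorker<T>(message: unknown): Promise<T> {
  return new Promise((resolve, reject) => {
    const onMessage = (e: MessageEvent) => {
      worker.removeEventListener("message", onMessage);
      resolve(e.data as T);
    };
    worker.addEventListener("message", onMessage);
    worker.addEventListener("error", reject, { once: true });
    worker.postMessage(message);
  });
}
```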

## Technical Limits and Constraints

### Data Size Limits
```typescript
const maxNumElements = 1e7; // 10 million elements maximum
if (totalSize > maxNumElements) {
  throw new Error(`Dataset too large: ${formatSize(totalSize)} > ${formatSize(maxNumElements)}`);
}
```

### Slicing Constraints
- Maximum 3 dimensions can be sliced simultaneously
- Slice parameters must be valid integers
- Format: `[[start, end], [start, end], ...]`
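
For example, loading the first 1000 rows of the first two columns of a 2-D dataset might look like this (the `slice` option name is assumed here for illustration):

```typescript
// Illustrative slicing call; the option name `slice` is an assumption.
const data = await getHdf5DatasetData(nwbUrl, "/acquisition/ts/data", {
  slice: [
    [0, 1000], // first dimension: rows 0..999
    [0, 2],    // second dimension: columns 0..1
  ],
});
```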

### Authentication Requirements
- DANDI API key required for embargoed datasets
- Automatic detection of authentication errors
- User notification system for access issues
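
A hedged sketch of how an API key could be attached to asset requests (DANDI's REST API uses the `token` authorization scheme; the storage key name and flow here are assumptions):

```typescript
// Hypothetical sketch: attach a DANDI API key for embargoed assets.
const dandiApiKey = localStorage.getItem("dandiApiKey"); // storage key assumed
const headers: Record<string, string> = {};
if (dandiApiKey) {
  headers["Authorization"] = `token ${dandiApiKey}`;
}
const resp = await fetch(assetUrl, { headers });
if (resp.status === 401 || resp.status === 403) {
  // Surface the problem so the UI can prompt the user for a key
  throw new Error("Authentication required for embargoed dataset");
}
```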

## Key Implementation Details

### Core Functions

#### `getHdf5Group(url, path)`
- Returns HDF5 group structure with subgroups and datasets
- Implements caching to avoid redundant requests
- Used for building file hierarchy views

#### `getHdf5Dataset(url, path)`
- Returns dataset metadata (shape, dtype, attributes)
- Does not load actual data
- Essential for understanding data structure before loading

#### `getHdf5DatasetData(url, path, options)`
- Loads actual array data with optional slicing
- Supports cancellation via `Canceler` objects
- Handles BigInt conversion for compatibility
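
A typical sequence outside React might look like this (illustrative; the return shapes follow the descriptions above):

```typescript
// Inspect structure and metadata first, then load only the data that is needed.
const group = await getHdf5Group(nwbUrl, "/acquisition");
console.log(group); // subgroups and datasets under /acquisition

const dataset = await getHdf5Dataset(nwbUrl, "/acquisition/ts/data");
console.log(dataset?.shape, dataset?.dtype); // metadata only, no array data yet

// Only now pay the cost of transferring the array values
const values = await getHdf5DatasetData(nwbUrl, "/acquisition/ts/data", {});
```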

### React Integration
```typescript
// Hook-based API for components
const group = useHdf5Group(url, "/acquisition");
const dataset = useHdf5Dataset(url, "/data/timeseries");
const { data, errorMessage } = useHdf5DatasetData(url, "/data/values");
```

### Error Handling
- Network timeout handling (3-minute default)
- Authentication error detection and user notification
- Graceful fallbacks for failed LINDI attempts
- CORS issue mitigation strategies
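
For example, the timeout behavior can be expressed with an `AbortController` (a sketch assuming the documented 3-minute default, not the actual code):

```typescript
// Sketch of a timeout wrapper around fetch using AbortController.
async function fetchWithTimeout(
  url: string,
  timeoutMs: number = 3 * 60 * 1000 // documented 3-minute default
): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}
```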

## DANDI Integration

### Asset URL Resolution
```typescript
// Convert DANDI paths to download URLs
const response = await fetch(
  `https://api.dandiarchive.org/api/dandisets/${dandisetId}/versions/${version}/assets/?glob=${path}`
);
const data = await response.json();
const assetId = data.results[0].asset_id;
const downloadUrl = `https://api.dandiarchive.org/api/assets/${assetId}/download/`;
```

### LINDI URL Construction
```typescript
const aa = staging ? "dandi-staging" : "dandi";
const lindiUrl = `https://lindi.neurosift.org/${aa}/dandisets/${dandisetId}/assets/${assetId}/nwb.lindi.json`;
```
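
Putting the two snippets together, a fallback check might look like this (illustrative only; the real `tryGetLindiUrl` takes different arguments and may behave differently):

```typescript
// Illustrative existence check with graceful fallback to the original URL.
async function tryGetLindiUrlSketch(
  dandisetId: string,
  assetId: string,
  staging: boolean
): Promise<string | undefined> {
  const aa = staging ? "dandi-staging" : "dandi";
  const lindiUrl = `https://lindi.neurosift.org/${aa}/dandisets/${dandisetId}/assets/${assetId}/nwb.lindi.json`;
  try {
    const resp = await fetch(lindiUrl, { method: "HEAD" });
    return resp.ok ? lindiUrl : undefined; // fall back to plain HDF5 when missing
  } catch {
    return undefined;
  }
}
```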

## Contributing Guidelines

### Adding New File Formats
1. Implement `RemoteH5FileX` interface in `src/remote-h5-file/lib/`
2. Add format detection logic in `hdf5Interface.ts`
3. Update `getMergedRemoteH5File` for multi-file support
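
The shape of such a reader, based on the operations documented above, might look roughly like this (the actual `RemoteH5FileX` interface in `src/remote-h5-file/lib/` may differ):

```typescript
// Hedged skeleton of a new format reader; method names mirror the operations
// described in this document, not the exact RemoteH5FileX definition.
interface MyFormatReader {
  getGroup(path: string): Promise<unknown | undefined>; // structure
  getDataset(path: string): Promise<unknown | undefined>; // metadata (shape, dtype)
  getDatasetData(
    path: string,
    opts: { slice?: [number, number][] }
  ): Promise<number[] | undefined>; // array values
}
```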

### Performance Considerations
- Always prefer LINDI files when available
- Implement proper caching for new data types
- Use Web Workers for CPU-intensive operations
- Consider memory usage for large datasets

### Testing Large Files
- Test with files >1GB to verify chunking works
- Verify LINDI fallback mechanisms
- Test authentication flows with embargoed data
- Check error handling for network failures

### Plugin Development
- Use provided hooks (`useHdf5Group`, `useHdf5Dataset`, etc.)
- Implement proper loading states and error handling
- Consider data slicing for large arrays
- Follow lazy loading patterns
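
A minimal plugin component following these patterns might look like the sketch below (the import path and props are assumptions; only the hooks are taken from this document):

```tsx
// Illustrative plugin sketch; the import path and props are assumptions.
import { FunctionComponent } from "react";
import { useHdf5Dataset, useHdf5DatasetData } from "../hdf5Interface";

const MyTimeseriesView: FunctionComponent<{ nwbUrl: string; path: string }> = ({
  nwbUrl,
  path,
}) => {
  const dataset = useHdf5Dataset(nwbUrl, `${path}/data`); // metadata first
  const { data, errorMessage } = useHdf5DatasetData(nwbUrl, `${path}/data`);

  if (errorMessage) return <div>Error: {errorMessage}</div>;
  if (!dataset || !data) return <div>Loading...</div>;
  return <div>Loaded {(data as number[]).length} values</div>; // cast for illustration
};

export default MyTimeseriesView;
```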

## Debugging and Monitoring

### Status Bar Integration
The system provides real-time statistics in the status bar:
- `numGroups / numDatasets / numDatasetDatas`: Operation counters
- Loading indicators for active operations
- Error notifications for failed requests

### Console Logging
- LINDI detection attempts and results
- Authentication error details
- Performance metrics and timing
- Cache hit/miss information

### Common Issues
1. **CORS Errors**: Usually resolved by LINDI files or proper headers
2. **Authentication Failures**: Check DANDI API key configuration
3. **Large Dataset Errors**: Implement proper slicing
4. **Worker Loading Failures**: Verify CDN accessibility

## Future Improvements

### Potential Optimizations
- Implement progressive loading for very large datasets
- Add compression support for data transfers
- Enhance caching with persistence across sessions
- Improve error recovery mechanisms

### Format Extensions
- Support for additional HDF5-compatible formats
- Enhanced LINDI features (compression, encryption)
- Integration with cloud storage providers
- Real-time streaming capabilities

This architecture enables Neurosift to efficiently handle NWB files ranging from megabytes to gigabytes while providing responsive user interactions and comprehensive error handling.
7 changes: 6 additions & 1 deletion package-lock.json

Some generated files are not rendered by default.

1 change: 1 addition & 0 deletions package.json
@@ -21,6 +21,7 @@
"@types/plotly.js": "^2.35.2",
"@types/react-syntax-highlighter": "^15.5.13",
"@vercel/analytics": "^1.4.1",
"downsample": "^1.4.0",
"mathjs": "^14.2.1",
"nifti-reader-js": "^0.7.0",
"numcodecs": "^0.3.1",
26 changes: 16 additions & 10 deletions python/neurosift/TemporaryDirectory.py
@@ -4,40 +4,46 @@
import random
import tempfile

class TemporaryDirectory():
    def __init__(self, *, remove: bool=True, prefix: str='tmp'):

class TemporaryDirectory:
    def __init__(self, *, remove: bool = True, prefix: str = "tmp"):
        self._remove = remove
        self._prefix = prefix

    def __enter__(self) -> str:
        tmpdir = tempfile.gettempdir()
        self._path = f'{tmpdir}/{self._prefix}_{_random_string(8)}'
        self._path = f"{tmpdir}/{self._prefix}_{_random_string(8)}"
        os.mkdir(self._path)
        return self._path

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self._remove:
            if not os.getenv('KACHERY_CLOUD_KEEP_TEMP_FILES') == '1':
            if not os.getenv("KACHERY_CLOUD_KEEP_TEMP_FILES") == "1":
                _rmdir_with_retries(self._path, num_retries=5)

    def path(self):
        return self._path


def _rmdir_with_retries(dirname: str, num_retries: int, delay_between_tries: float=1):
def _rmdir_with_retries(dirname: str, num_retries: int, delay_between_tries: float = 1):
    for retry_num in range(1, num_retries + 1):
        if not os.path.exists(dirname):
            return
        try:
            shutil.rmtree(dirname)
            break
        except: # pragma: no cover
        except:  # pragma: no cover
            if retry_num < num_retries:
                print('Retrying to remove directory: {}'.format(dirname))
                print("Retrying to remove directory: {}".format(dirname))
                time.sleep(delay_between_tries)
            else:
                raise Exception('Unable to remove directory after {} tries: {}'.format(num_retries, dirname))
                raise Exception(
                    "Unable to remove directory after {} tries: {}".format(
                        num_retries, dirname
                    )
                )


def _random_string(num_chars: int) -> str:
    chars = 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789'
    return ''.join(random.choice(chars) for _ in range(num_chars))
    chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
    return "".join(random.choice(chars) for _ in range(num_chars))