# NWB Data Reading in Neurosift

This document explains how Neurosift reads and processes NWB (Neurodata Without Borders) files, including the architecture, optimizations, and technical details relevant for contributors.

## Overview

Neurosift uses a multi-layered approach to read NWB files efficiently in the browser. The system supports both traditional HDF5 files and optimized LINDI (Linked Data Interface) files, with automatic format detection and performance optimizations.

## Architecture Components

### 1. Entry Point (`src/pages/NwbPage/NwbPage.tsx`)
- Handles URL processing and format detection
- Manages DANDI API integration for asset resolution
- Coordinates LINDI optimization attempts

### 2. HDF5 Interface Layer (`src/pages/NwbPage/hdf5Interface.ts`)
- Central abstraction for all file operations
- Implements caching and request deduplication
- Manages authentication and error handling
- Provides React hooks for component integration

### 3. Remote File Access (`src/remote-h5-file/`)
- Core file reading implementation
- Supports multiple file formats (HDF5, LINDI)
- Handles HTTP range requests and chunking
- Web Worker integration for non-blocking operations

## Data Flow

```
URL Input → Format Detection → LINDI Check → File Access → Caching → Visualization
     ↓              ↓              ↓              ↓           ↓           ↓
NwbPage.tsx → hdf5Interface → tryGetLindiUrl → RemoteH5File* → Cache → Plugins
```

### Step-by-Step Process

1. **URL Resolution**: Convert DANDI paths to direct download URLs
2. **LINDI Detection**: Check for optimized `.lindi.json` or `.lindi.tar` files
3. **File Access**: Use the appropriate reader (HDF5 or LINDI)
4. **Data Loading**: Lazily load only the required data, using chunked reads
5. **Caching**: Store results to avoid redundant requests
6. **Visualization**: Pass data to type-specific plugins (the whole pipeline is sketched below)
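
A hedged TypeScript sketch of this pipeline, for orientation only. `tryGetLindiUrl` is the helper named in the diagram above; the other functions are hypothetical placeholders for the real implementations in `NwbPage.tsx` and `hdf5Interface.ts`:

```typescript
// Hypothetical signatures standing in for the real helpers:
declare function resolveDandiUrl(inputUrl: string): Promise<string>;
declare function tryGetLindiUrl(url: string, dandisetId: string): Promise<string | undefined>;
declare function openRemoteFile(url: string): Promise<{ getGroup: (path: string) => Promise<unknown> }>;
declare function renderWithPlugin(group: unknown): void;

// Sketch of steps 1-6, not the actual implementation.
async function loadNwb(inputUrl: string, dandisetId?: string) {
  const url = await resolveDandiUrl(inputUrl); // 1. URL resolution
  const lindiUrl = dandisetId ? await tryGetLindiUrl(url, dandisetId) : undefined; // 2. LINDI detection
  const file = await openRemoteFile(lindiUrl ?? url); // 3. file access
  const group = await file.getGroup("/acquisition"); // 4-5. lazy, cached loading
  renderWithPlugin(group); // 6. hand off to a type-specific plugin
}
```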

## File Format Support

### Traditional HDF5 Files
- **Access Method**: HTTP range requests via Web Workers
- **Worker URL**: `https://tempory.net/js/RemoteH5Worker.js`
- **Chunk Size**: 100KB default (configurable)
- **Limitations**: Slower metadata access; requires full header parsing

### LINDI Files (Optimized)
- **Format**: JSON-based reference file system
- **Metadata**: Instant access to the full HDF5 structure
- **Data Storage**: References to external URLs or embedded chunks
- **Location**: `https://lindi.neurosift.org/[dandi|dandi-staging]/dandisets/{id}/assets/{asset_id}/nwb.lindi.json`
- **Tar Support**: `.lindi.tar` files containing both metadata and data

## Performance Optimizations

### 1. LINDI Priority System
```typescript
if (isDandiAssetUrl(url) && currentDandisetId && tryUsingLindi) {
  const lindiUrl = await tryGetLindiUrl(url, currentDandisetId);
  if (lindiUrl) return { url: lindiUrl }; // 10-100x faster metadata access
}
```
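
A plausible shape for `tryGetLindiUrl`, shown as a hedged sketch: probe the conventional LINDI location (documented above) with a HEAD request and fall back to plain HDF5 on any failure. The asset-id parsing is an assumption, not the actual implementation:

```typescript
// Hedged sketch: check whether a precomputed LINDI index exists for this asset.
async function tryGetLindiUrl(url: string, dandisetId: string): Promise<string | undefined> {
  // Extract the asset id from a DANDI download URL (assumed URL shape).
  const m = url.match(/assets\/([a-f0-9-]+)/);
  if (!m) return undefined;
  const lindiUrl = `https://lindi.neurosift.org/dandi/dandisets/${dandisetId}/assets/${m[1]}/nwb.lindi.json`;
  try {
    const resp = await fetch(lindiUrl, { method: "HEAD" });
    return resp.ok ? lindiUrl : undefined;
  } catch {
    return undefined; // no LINDI index: caller falls back to plain HDF5
  }
}
```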

### 2. Lazy Loading Strategy
- **Groups**: Load structure on demand
- **Datasets**: Load metadata separately from data
- **Data**: Only load when a visualization requires it

### 3. HTTP Range Requests
- Load only the required byte ranges from large files
- Configurable chunk sizes for optimal network usage
- Automatic retry logic for failed requests (a sketch of a single range read follows)
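
A hedged sketch of one range read, assuming standard `fetch` with a `Range` header; the real chunking and retry logic lives in `src/remote-h5-file/`:

```typescript
// Hedged sketch: read one byte range from a remote file.
// The 100KB default matches the chunk size noted under File Format Support.
async function readRange(url: string, offset: number, length = 100 * 1024): Promise<ArrayBuffer> {
  const resp = await fetch(url, {
    headers: { Range: `bytes=${offset}-${offset + length - 1}` },
  });
  if (resp.status !== 206 && resp.status !== 200) {
    throw new Error(`Range request failed with status ${resp.status}`); // caller retries
  }
  return resp.arrayBuffer();
}
```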

### 4. Multi-Level Caching
- **In-Memory**: Groups, datasets, and data results
- **Request Deduplication**: Prevent duplicate network calls (see the sketch below)
- **Status Tracking**: Monitor ongoing operations
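
A minimal sketch of request deduplication, assuming a promise-keyed map as is typical for this pattern (not the actual cache in `hdf5Interface.ts`):

```typescript
// Hedged sketch: concurrent callers for the same key share one in-flight
// promise instead of issuing duplicate network requests.
const inFlight = new Map<string, Promise<unknown>>();

function dedupe<T>(key: string, load: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>; // reuse the pending request
  const promise = load().finally(() => inFlight.delete(key));
  inFlight.set(key, promise);
  return promise;
}
```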

### 5. Web Workers
- Non-blocking file operations
- Prevents UI freezing during large data loads
- Single worker by default (configurable)
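
Since browsers only allow `new Worker()` with a same-origin script URL, loading the worker from the CDN URL noted earlier typically goes through a blob URL. A hedged sketch of that pattern (the actual loading code is in `src/remote-h5-file/`):

```typescript
// Hedged sketch: fetch a cross-origin worker script and start it via a blob
// URL, since `new Worker(url)` requires a same-origin script URL.
async function createRemoteH5Worker(): Promise<Worker> {
  const resp = await fetch("https://tempory.net/js/RemoteH5Worker.js");
  const code = await resp.text();
  const blobUrl = URL.createObjectURL(new Blob([code], { type: "application/javascript" }));
  return new Worker(blobUrl);
}
```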

## Technical Limits and Constraints

### Data Size Limits
```typescript
const maxNumElements = 1e7; // 10 million elements maximum
if (totalSize > maxNumElements) {
  throw new Error(`Dataset too large: ${formatSize(totalSize)} > ${formatSize(maxNumElements)}`);
}
```

### Slicing Constraints
- At most 3 dimensions can be sliced simultaneously
- Slice parameters must be valid integers
- Format: `[[start, end], [start, end], ...]` (example below)
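
For example, against a 3-dimensional dataset one might request the first 1000 rows and first 8 channels while leaving the last dimension whole. The option name `slice` is an assumption here; see `getHdf5DatasetData` below:

```typescript
// Hedged example: slice two of three dimensions (at most 3 may be sliced).
const data = await getHdf5DatasetData(url, "/acquisition/ElectricalSeries/data", {
  slice: [[0, 1000], [0, 8]], // [[start, end], ...] with integer bounds
});
```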

### Authentication Requirements
- DANDI API key required for embargoed datasets
- Automatic detection of authentication errors
- User notification system for access issues

## Key Implementation Details

### Core Functions

#### `getHdf5Group(url, path)`
- Returns the HDF5 group structure with subgroups and datasets
- Implements caching to avoid redundant requests
- Used for building file hierarchy views

#### `getHdf5Dataset(url, path)`
- Returns dataset metadata (shape, dtype, attributes)
- Does not load actual data
- Essential for understanding data structure before loading

#### `getHdf5DatasetData(url, path, options)`
- Loads actual array data with optional slicing
- Supports cancellation via `Canceler` objects
- Handles BigInt conversion for compatibility (combined usage sketched below)
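
Putting the three functions together, a typical hedged usage pattern; the `Canceler` shape shown is an assumption based on the description above:

```typescript
// Hedged sketch combining the three core functions.
const group = await getHdf5Group(url, "/acquisition"); // structure only
const dataset = await getHdf5Dataset(url, "/acquisition/ElectricalSeries/data"); // shape/dtype, no data
console.log(dataset.shape, dataset.dtype);

// Only now pay for the actual data transfer, with cancellation support.
const canceler = { onCancel: [] as (() => void)[] }; // assumed Canceler shape
const data = await getHdf5DatasetData(url, "/acquisition/ElectricalSeries/data", {
  slice: [[0, 1000]],
  canceler,
});
```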

### React Integration
```typescript
// Hook-based API for components
const group = useHdf5Group(url, "/acquisition");
const dataset = useHdf5Dataset(url, "/data/timeseries");
const { data, errorMessage } = useHdf5DatasetData(url, "/data/values");
```

### Error Handling
- Network timeout handling (3-minute default; sketched below)
- Authentication error detection and user notification
- Graceful fallbacks for failed LINDI attempts
- CORS issue mitigation strategies
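
A hedged sketch of the timeout behavior, assuming a standard `AbortController` pattern around `fetch`; the real handling also feeds the user notification system described above:

```typescript
// Hedged sketch: abort a request after 3 minutes (the default noted above)
// and surface authentication failures distinctly.
async function fetchWithTimeout(url: string, timeoutMs = 3 * 60 * 1000): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const resp = await fetch(url, { signal: controller.signal });
    if (resp.status === 401 || resp.status === 403) {
      throw new Error("Authentication error: a DANDI API key may be required");
    }
    return resp;
  } finally {
    clearTimeout(timer);
  }
}
```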

## DANDI Integration

### Asset URL Resolution
```typescript
// Convert DANDI paths to download URLs
const response = await fetch(
  `https://api.dandiarchive.org/api/dandisets/${dandisetId}/versions/${version}/assets/?glob=${path}`
);
const data = await response.json();
const assetId = data.results[0].asset_id;
const downloadUrl = `https://api.dandiarchive.org/api/assets/${assetId}/download/`;
```

### LINDI URL Construction
```typescript
const aa = staging ? "dandi-staging" : "dandi";
const lindiUrl = `https://lindi.neurosift.org/${aa}/dandisets/${dandisetId}/assets/${assetId}/nwb.lindi.json`;
```

## Contributing Guidelines

### Adding New File Formats
1. Implement the `RemoteH5FileX` interface in `src/remote-h5-file/lib/` (a rough sketch follows)
2. Add format detection logic in `hdf5Interface.ts`
3. Update `getMergedRemoteH5File` for multi-file support
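
A hedged sketch of the rough shape a new reader needs to satisfy. The member names below are assumptions derived from the core functions documented above, not copied from the actual `RemoteH5FileX` definition:

```typescript
// Hedged sketch: minimal shape for a new format reader.
type RemoteGroup = { subgroups: string[]; datasets: string[] };
type RemoteDataset = { shape: number[]; dtype: string };

interface RemoteH5FileLike {
  getGroup(path: string): Promise<RemoteGroup | undefined>;
  getDataset(path: string): Promise<RemoteDataset | undefined>;
  getDatasetData(
    path: string,
    options: { slice?: [number, number][] },
  ): Promise<unknown>;
}
```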

### Performance Considerations
- Always prefer LINDI files when available
- Implement proper caching for new data types
- Use Web Workers for CPU-intensive operations
- Consider memory usage for large datasets

### Testing Large Files
- Test with files >1GB to verify that chunking works
- Verify LINDI fallback mechanisms
- Test authentication flows with embargoed data
- Check error handling for network failures

### Plugin Development
- Use the provided hooks (`useHdf5Group`, `useHdf5Dataset`, etc.)
- Implement proper loading states and error handling
- Consider data slicing for large arrays
- Follow lazy loading patterns (a skeleton plugin is sketched below)
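
A minimal hedged plugin skeleton using the hooks from the React Integration section. `TimeseriesPlot`, the dataset path, and the import path are hypothetical:

```tsx
import React from "react";
// Hook import path is an assumption; the hooks themselves are documented above.
import { useHdf5Dataset, useHdf5DatasetData } from "../NwbPage/hdf5Interface";

declare function TimeseriesPlot(props: { shape: number[]; data: unknown }): JSX.Element; // hypothetical

const MyTimeseriesPlugin: React.FC<{ url: string }> = ({ url }) => {
  const dataset = useHdf5Dataset(url, "/acquisition/ElectricalSeries/data");
  const { data, errorMessage } = useHdf5DatasetData(url, "/acquisition/ElectricalSeries/data");
  if (errorMessage) return <div>Error: {errorMessage}</div>; // error state
  if (!dataset || !data) return <div>Loading…</div>; // loading state
  return <TimeseriesPlot shape={dataset.shape} data={data} />;
};

export default MyTimeseriesPlugin;
```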

## Debugging and Monitoring

### Status Bar Integration
The system provides real-time statistics in the status bar:
- `numGroups / numDatasets / numDatasetDatas`: Operation counters
- Loading indicators for active operations
- Error notifications for failed requests

### Console Logging
- LINDI detection attempts and results
- Authentication error details
- Performance metrics and timing
- Cache hit/miss information

### Common Issues
1. **CORS Errors**: Usually resolved by LINDI files or proper headers
2. **Authentication Failures**: Check DANDI API key configuration
3. **Large Dataset Errors**: Implement proper slicing
4. **Worker Loading Failures**: Verify CDN accessibility

## Future Improvements

### Potential Optimizations
- Implement progressive loading for very large datasets
- Add compression support for data transfers
- Enhance caching with persistence across sessions
- Improve error recovery mechanisms

### Format Extensions
- Support for additional HDF5-compatible formats
- Enhanced LINDI features (compression, encryption)
- Integration with cloud storage providers
- Real-time streaming capabilities

This architecture enables Neurosift to efficiently handle NWB files ranging from megabytes to gigabytes while providing responsive user interactions and comprehensive error handling.