# edf2parquet

Simple utility package to convert EDF/EDF+ files into the Apache Parquet format
while preserving the EDF file header information and signal header metadata, with some nice enhanced features:

- handling of non-strictly EDF-compliant .EDF headers (e.g. UTF-8 characters in the header)
- automatic conversion of the EDF file header start date and signal sampling frequency to a pd.DatetimeIndex with the correct timezone and frequency for easy Pandas interoperability (at the cost of slightly bigger file sizes, of course)
- skipping of specific signals during conversion
- bundling of signals with the same sampling frequency into a single parquet file
- splitting of EDF files by non-use periods (e.g. if a file consists of multiple continuous nights and you want to split it into a single file per night)
- compression of the resulting parquet files
## Installation

### Requirements
The package was tested with the pinned versions in the `requirements.txt` file.
If something does not work, try installing these exact versions. I would particularly advise
using matching or more recent versions of PyArrow and Pandas (Pandas version 2.0 is important,
as it uses the underlying Arrow data structures itself, so it will break with anything older).
### Converting:

```python
import pytz

from edf2parquet.converters import AdvancedEdfToParquetConverter  # adjust the import path to your installation if needed

my_edf_file_path = "path_to_my_edf_file.edf"                # REPLACE WITH YOUR EDF FILE PATH
my_parquet_output_dir = "path_to_my_parquet_output_dir"     # REPLACE WITH YOUR PARQUET OUTPUT DIRECTORY

converter = AdvancedEdfToParquetConverter(
    edf_file_path=my_edf_file_path,            # path to the EDF file
    exclude_signals=["Audio"],                 # list of signals to exclude from the conversion
    parquet_output_dir=my_parquet_output_dir,  # output directory (will be created if it does not exist)
    group_by_sampling_freq=True,               # group signals with the same sampling frequency into single parquet files
    datetime_index=True,                       # automatically add a pd.DatetimeIndex to the resulting parquet files
    local_timezone=(pytz.timezone("Europe/Zurich"),
                    pytz.timezone("Europe/Zurich")),  # (timezone of the EDF file, timezone of its start_date); the same for most cases
    compression_codec="GZIP",                  # compression codec for the resulting parquet files
    split_non_use_by_col="MY_COLUMN",          # optional: column used to split the file by non-use periods
                                               # (e.g. multiple continuous nights -> one file per night);
                                               # see the AdvancedEdfToParquetConverter docstring for details
)

converter.convert()
```
### Reading:

```python
reader.get_signal_headers()
```

Check the `examples.ipynb` notebook for detailed outputs.
## Todo

- [x] Allow to bundle signals with the same sampling rate into a single parquet file.
- [ ] Provide a high level user API.
- [ ] Enable (possibly distributed) parallel processing to efficiently convert a whole directory of EDF files.
- [ ] Provide a high level API to convert EDF files with the same sampling frequency (fs) into a single parquet file with a single row per signal.