Labels #789

Open · Aske-Rosted wants to merge 99 commits into main
Conversation

Aske-Rosted (Collaborator)
Hey all.

This PR has been in progress for a lot longer than I initially expected and has also grown quite unruly in size. I will do my best to describe everything that has been added and/or changed.

Discussion about the resulting labels is both encouraged and very much appreciated. There may well be others in the IceCube collaboration who have been working on similar MC labels for training machine learning algorithms, and I would especially like to hear your input.

The aim was to add labels that can be used for training machine learning algorithms. Two main approaches have been taken: a calorimetric approach, in which the labels try to describe all the energy deposited in and around the detector during the event window, and a pseudo-truth particle approach, in which the labels try to describe the particle (or close bunch of particles) which produced a signal in the detector.

I am currently processing a large dataset using the new labels and could upload some plots at a later time if necessary. I have created some plots looking at the energy distributions of these new labels compared to the Homogenized-Qtot and the InIceNeutrino energy in order to verify, at least to some degree, that the labels are working as intended. (Example below)
[figure: energy distributions of the new labels compared to Homogenized-Qtot and InIceNeutrino energy]
I also looked into the SHAP values of BDTs trained on the MC-truth labels, with the regression target being the energy of the in-ice neutrino.
[figure: SHAP values of BDTs trained on the MC-truth labels]

As there are quite a lot of these plots and they are quite difficult to parse, I will not upload all of them here.

The new label extractors

src/graphnet/data/extractors/icecube/i3calorimetry.py

  • e_entrance_track_: Total energy at the time of entrance of all tracks entering the detector volume.
  • e_deposited_track_: Total deposited energy of all tracks entering the detector volume.
  • e_cascade_: Total energy of cascades contained inside the detector volume.
  • e_visible_: e_entrance_track_ + e_cascade_.
  • fraction_primary_: e_visible_ as a fraction of the energy of the primary particle(s).
  • fraction_cascade_: e_cascade_/e_visible_. (These relations are sketched in code right after this list.)
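
The derived labels follow directly from the base quantities; a minimal sketch with placeholder values (the variable names are illustrative, not the extractor's actual internals):

```python
# Illustrative relations between the calorimetry labels.
e_entrance_track = 40.0  # example values in GeV, arbitrary
e_cascade = 10.0
e_primary = 100.0        # total energy of the primary particle(s)

e_visible = e_entrance_track + e_cascade
fraction_primary = e_visible / e_primary
fraction_cascade = e_cascade / e_visible if e_visible > 0 else 0.0
print(e_visible, fraction_primary, fraction_cascade)  # 50.0 0.5 0.2
```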

src/graphnet/data/extractors/icecube/i3highesteparticleextractor.py

  • e_fraction_: e_on_entrance_ divided by the energy of the primary.
  • distance_: For cascades, starting tracks and contained tracks, this is the distance from the center of the detector to the interaction vertex; for stopping and throughgoing tracks, it is the distance to where the particle first enters the detector volume.
  • e_on_entrance_: Energy of the Highest Energy Particle (HEP) entering the detector volume, taken when the particle or its products first become visible; the definition of this energy varies depending on whether the HEP is a starting particle, a track entering the detector, or a bundle.
  • zenith_: Zenith angle of the HEP.
  • azimuth_: Azimuth angle of the HEP.
  • dir_x_: x direction of the HEP.
  • dir_y_: y direction of the HEP.
  • dir_z_: z direction of the HEP. (The direction convention is sketched after this list.)
  • pos_x_: x position of the HEP (see distance_).
  • pos_y_: y position of the HEP (see distance_).
  • pos_z_: z position of the HEP (see distance_).
  • time_: Time of the HEP at the given position.
  • length_: Full length of the HEP track (should maybe be removed to avoid confusion with visible_length_).
  • visible_length_: Visible length of the track, or the length of the maximum expansion of the cascade inside the detector.
  • trackness_: Fraction of energy produced by track-like interactions.
  • interaction_shape_: Shape of the interaction.
  • particle_type_: Particle type of the HEP.
  • containment_: Containment of the HEP.
  • parent_type_: The particle id of the parent of the HEP.
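
For reference, the Cartesian direction labels follow from the angular ones; a sketch assuming the standard IceCube convention, where (zenith, azimuth) describe where the particle came from, so the propagation direction is the negated unit vector:

```python
import numpy as np

def direction_from_angles(zenith: float, azimuth: float) -> np.ndarray:
    """Propagation direction from (zenith, azimuth), assuming the usual
    IceCube convention (angles point back along the arrival direction)."""
    return np.array([
        -np.sin(zenith) * np.cos(azimuth),  # dir_x_
        -np.sin(zenith) * np.sin(azimuth),  # dir_y_
        -np.cos(zenith),                    # dir_z_
    ])

# A particle at zenith = 0 (arriving from straight above) travels downwards:
print(direction_from_angles(0.0, 0.0))  # [-0. -0. -1.]
```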

examples/01_icetray/05_convert_i3_files_advanced.py

  • An example script which does not function as a test, since the available files do not contain the necessary frames, but is intended to give the user an idea of some of the more advanced options available.

src/graphnet/data/extractors/icecube/utilities/gcd_hull.py

  • A wrapper around the scipy.spatial ConvexHull class which can be used to efficiently determine whether a point, or an array of points, is located inside or outside the volume spanned by a collection of points (see the sketch after this list).
  • It is combined with MuonGun.extruded_polygon to also be able to determine the intersection points of vectors with the surface of the hull.
  • Finally, a sphere approximation can be used to more efficiently determine which of a number of vectors intersect the smallest sphere enveloping the hull.
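
The core point-in-hull test can be done directly with the hull's facet equations; a minimal sketch of the idea (the function name and details are illustrative, not the actual gcd_hull.py API):

```python
import numpy as np
from scipy.spatial import ConvexHull

def points_in_hull(points: np.ndarray, hull: ConvexHull, tol: float = 1e-12) -> np.ndarray:
    """Boolean mask of which points lie inside the hull.

    hull.equations stores one facet plane per row as [normal | offset];
    a point x is inside if normal @ x + offset <= 0 for every facet.
    """
    normals, offsets = hull.equations[:, :-1], hull.equations[:, -1]
    return np.all(points @ normals.T + offsets <= tol, axis=1)

# Usage: span a hull over (e.g.) DOM positions, then test candidate vertices.
rng = np.random.default_rng(0)
dom_positions = rng.uniform(-500.0, 500.0, size=(100, 3))
hull = ConvexHull(dom_positions)
vertices = np.array([[0.0, 0.0, 0.0], [1e4, 0.0, 0.0]])
print(points_in_hull(vertices, hull))  # expected: [ True False]
```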

Additions: Completely new additions to GraphNet

src/graphnet/data/dataconverter.py

  • Allow for specification of an output directory when merging files.

src/graphnet/data/extractors/icecube/i3filtermapextractor.py

  • Class for extracting the boolean values of the I3FilterMap, which has a somewhat special construction (see the sketch below).
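
The special construction is that each entry of the filter map is an I3FilterResult object rather than a plain bool. A hedged sketch of the extraction idea, assuming the standard FilterMask key, a dict-like map, and the usual I3FilterResult fields:

```python
def extract_filter_booleans(frame, key: str = "FilterMask") -> dict:
    """Flatten an I3FilterResultMap into {filter_name: 0/1}.

    Each map value is an I3FilterResult whose booleans have to be read
    off its fields; a filter 'passed' only if both its condition and
    its prescale did.
    """
    return {
        name: int(result.condition_passed and result.prescale_passed)
        for name, result in frame[key].items()
    }
```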

src/graphnet/data/extractors/icecube/utilities/containments.py

  • Adds enums of containment and containment types, inspired by what already exists in icetray. (Not all of these are necessarily used in the extractors I have added; an illustrative sketch follows.)
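
As a rough illustration of what such enums look like (the names and values below are placeholders modeled on icetray's conventions, not the actual contents of containments.py):

```python
from enum import IntEnum

class TrackContainment(IntEnum):
    """Illustrative containment categories for a track in the detector."""
    NO_INTERSECTION = 0  # never enters the instrumented volume
    THROUGHGOING = 1     # enters and exits
    STARTING = 2         # vertex inside, exits
    STOPPING = 3         # enters, stops inside
    CONTAINED = 4        # starts and stops inside
```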

src/graphnet/data/extractors/icecube/utilities/i3_filters.py

  • TableFilter: only events containing the specified table are converted.
  • ChargeFilter: only events with the required charge are converted. (A sketch of the filter pattern follows.)
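
Both filters plug into the existing I3Filter pattern in i3_filters.py; a hedged sketch of a custom filter in that style (the exact hook name on the base class may differ from the sketch below):

```python
from graphnet.data.extractors.icecube.utilities.i3_filters import I3Filter

class MyTableFilter(I3Filter):
    """Keep only frames that contain a given table (frame key)."""

    def __init__(self, table_name: str) -> None:
        super().__init__()
        self._table_name = table_name

    def _keep_frame(self, frame) -> bool:
        # Frames missing the table are skipped during conversion.
        return self._table_name in frame
```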

src/graphnet/data/extractors/icecube/utilities/i3_in_ice_neutrino.py

  • Utility function for getting the highest-energy in-ice neutrino (sketched below).
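
A minimal sketch of the idea, assuming standard icetray dataclasses (this is not the actual implementation):

```python
from icecube import dataclasses

def highest_energy_in_ice_neutrino(frame):
    """Return the highest-energy in-ice neutrino in the I3MCTree, or None."""
    tree = frame["I3MCTree"]
    in_ice = dataclasses.I3Particle.LocationType.InIce
    neutrinos = [
        p for p in tree
        if p.is_neutrino and p.location_type == in_ice
    ]
    return max(neutrinos, key=lambda p: p.energy, default=None)
```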

src/graphnet/data/extractors/icecube/utilities/gcd_hull.py

  • Described above under the new label extractors.

Major changes: files that have seen major changes necessitated by the file conversion

src/graphnet/data/extractors/extractor.py

  • Allow for exclusion of extracted data. This can be useful if you want to use one of the standard extractors but want to remove e.g. a str input which would otherwise impact the database performance.

src/graphnet/data/extractors/icecube/i3extractor.py

  • Proper initialization of gcd_file and i3_file.
  • get_primaries function which can be inherited by classes. This requires defining whether the file is a CORSIKA file, since determining the primary of the muon bundle has to be handled differently.
  • check_primary_energy function, which tries to handle cases where the energy of the primary particle is not set. Usually an identical particle exists as a daughter of the particle with the missing energy, which can then be inserted as the new primary. This function can be called both on a list of primaries and on a single primary. (A sketch of the idea follows.)
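
A sketch of the idea behind check_primary_energy, under the assumption of an I3MCTree-like interface with an unset energy represented as NaN (not the actual implementation):

```python
import math

def check_primary_energy(tree, primary):
    """If the primary's energy is unset (NaN), look for an identical
    daughter that does carry the energy and promote it to primary."""
    if not math.isnan(primary.energy):
        return primary
    for daughter in tree.get_daughters(primary):
        if daughter.type == primary.type and not math.isnan(daughter.energy):
            return daughter
    return primary  # no replacement found; caller must handle missing energy
```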

src/graphnet/data/extractors/icecube/i3dictvalueextractor.py

  • Extract all values from an I3Map key in the frame.

src/graphnet/data/utilities/string_selection_resolver.py

  • There are quite a few differences between the string selections that can be done using the pandas string selection resolver and the string selection resolver of sqlite. This change allows the user to determine which of the two methods should be used. Parquet does not support string selection, and therefore only the pandas selection resolver can be used in that case. (A toy illustration of the dialect difference follows.)
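
A toy illustration of the dialect difference the resolver has to bridge (synthetic data, standard libraries only):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"energy": [5.0, 20.0], "pid": [12, 14]})

# pandas .query() dialect: 'and', '=='
print(df.query("energy > 10 and pid == 14"))

# SQLite WHERE dialect for the same selection: 'AND', '='
con = sqlite3.connect(":memory:")
df.to_sql("truth", con, index=False)
print(con.execute("SELECT * FROM truth WHERE energy > 10 AND pid = 14").fetchall())
```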

src/graphnet/data/writers/sqlite_writer.py

  • remove_originals: Allows the user to toggle whether the input files should be deleted after merging. This is done only after all files have been merged, so files are not removed if the script terminates with an error. (This, however, also limits the usefulness of removing the files, since you would still need to have the full space available.)
  • reset_integer_primary_key: Allows the user to indicate that the key used for indexing the sqlite db needs to be reset. This is useful in instances where different processes were used to create the db files which are being merged.

src/graphnet/utilities/filesys.py

  • Allows the input for file conversion to be either a folder or a list of I3 files. This means that you do not have to copy the files that you wish to convert into another folder when making a sub-selection.

Minor changes: all the minor stuff I encountered and had to fix while working on this contribution.

.github/workflows/build.yml

  • Fix libboost issue with workflows.

examples/01_icetray/01_convert_i3_files.py

  • Add missing f-string marker.

src/graphnet/data/dataconverter.py

  • Try to catch instances of empty dataframes, which can occur if no events in the i3 file meet the selection criteria for conversion.
  • Properly update the number of workers if the number of workers is greater than the number of input files.
  • Make the converter a little more verbose.

src/graphnet/data/dataset/dataset.py

  • Parsing of the super-selection argument.
  • Dynamic handling of the truth array.

src/graphnet/data/extractors/combine_extractors.py

  • Properly set the gcd_file value for combined extractors

src/graphnet/data/pre_configured/dataconverters.py

  • Allow passing max_table_size to the SQLiteWriter.

@sevmag (Collaborator) commented Aug 14, 2025

This PR is closed by #807, isn't it?
