Labels #789

Open · Aske-Rosted wants to merge 99 commits into main
Conversation

Aske-Rosted (Collaborator)
Hey all.

This PR has been in progress for a lot longer than I initially expected and has also grown quite unruly in size. I will do my best to describe everything that has been added and/or changed.

Discussion about the resulting labels is both encouraged and very much appreciated. There may well be others in the IceCube collaboration who have been working on similar MC labels for training machine learning algorithms, and I would especially like to hear your input.

The aim was to add labels that can be used for training machine learning algorithms. Two main approaches have been taken: a calorimetric approach, in which the labels try to describe all the energy deposited in and around the detector during the event window, and a pseudo-truth particle approach, in which the labels try to describe the particle (or close bunch of particles) which produced a signal in the detector.

I am currently processing a large dataset using the new labels and could upload some plots at a later time if necessary. I have created some plots looking at the energy distributions of these new labels compared to the Homogenized-Qtot and the InIceNeutrino energy in order to verify, at least to some degree, that the labels are working as intended. (Example below)
[figure: energy distributions of the new labels compared to Homogenized-Qtot and InIceNeutrino energy]
I also looked into the SHAP values of BDTs trained on the MC-truth labels, with the regression target being the energy of the in-ice neutrino.
[figure: SHAP values of BDTs trained on the MC-truth labels]

As there are quite a lot of these plots and they are quite difficult to parse, I will not upload all of them here.

The new label extractors

src/graphnet/data/extractors/icecube/i3calorimetry.py

  • e_entrance_track_: Total energy at the time of entrance of all tracks entering the detector volume.
  • e_deposited_track_: Total deposited energy of all tracks entering the detector volume.
  • e_cascade_: Total energy of cascades contained inside the detector volume.
  • e_visible_: e_entrance_track_ + e_cascade_.
  • fraction_primary_: e_visible_ as a fraction of the energy of the primary particle(s).
  • fraction_cascade_: e_cascade_/e_visible_. (These relations are sketched in code right after this list.)
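
The derived labels follow directly from the base quantities; a minimal sketch with placeholder values (the variable names are illustrative, not the extractor's actual internals):

```python
# Illustrative relations between the calorimetry labels.
e_entrance_track = 40.0  # example values in GeV, arbitrary
e_cascade = 10.0
e_primary = 100.0        # total energy of the primary particle(s)

e_visible = e_entrance_track + e_cascade
fraction_primary = e_visible / e_primary
fraction_cascade = e_cascade / e_visible if e_visible > 0 else 0.0
print(e_visible, fraction_primary, fraction_cascade)  # 50.0 0.5 0.2
```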

src/graphnet/data/extractors/icecube/i3highesteparticleextractor.py

  • e_fraction_: e_on_entrance_ divided by the energy of the primary.
  • distance_: For cascades, starting tracks and contained tracks, this is the distance from the center of the detector to the interaction vertex; for stopping and throughgoing tracks, it is the distance to where the particle first enters the detector volume.
  • e_on_entrance_: Energy of the Highest Energy Particle (HEP) entering the detector volume, taken when the particle or its products first become visible; the definition of this energy varies depending on whether the HEP is a starting particle, a track entering the detector, or a bundle.
  • zenith_: Zenith angle of the HEP.
  • azimuth_: Azimuth angle of the HEP.
  • dir_x_: x direction of the HEP.
  • dir_y_: y direction of the HEP.
  • dir_z_: z direction of the HEP. (The direction convention is sketched after this list.)
  • pos_x_: x position of the HEP (see distance_).
  • pos_y_: y position of the HEP (see distance_).
  • pos_z_: z position of the HEP (see distance_).
  • time_: Time of the HEP at the given position.
  • length_: Full length of the HEP track (should maybe be removed to avoid confusion with visible_length_).
  • visible_length_: Visible length of the track, or the length of the maximum expansion of the cascade inside the detector.
  • trackness_: Fraction of energy produced by track-like interactions.
  • interaction_shape_: Shape of the interaction.
  • particle_type_: Particle type of the HEP.
  • containment_: Containment of the HEP.
  • parent_type_: The particle id of the parent of the HEP.
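
For reference, the Cartesian direction labels follow from the angular ones; a sketch assuming the standard IceCube convention, where (zenith, azimuth) describe where the particle came from, so the propagation direction is the negated unit vector:

```python
import numpy as np

def direction_from_angles(zenith: float, azimuth: float) -> np.ndarray:
    """Propagation direction from (zenith, azimuth), assuming the usual
    IceCube convention (angles point back along the arrival direction)."""
    return np.array([
        -np.sin(zenith) * np.cos(azimuth),  # dir_x_
        -np.sin(zenith) * np.sin(azimuth),  # dir_y_
        -np.cos(zenith),                    # dir_z_
    ])

# A particle at zenith = 0 (arriving from straight above) travels downwards:
print(direction_from_angles(0.0, 0.0))  # [-0. -0. -1.]
```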

examples/01_icetray/05_convert_i3_files_advanced.py

  • An example script which does not function as a test, since the available files do not contain the necessary frames, but is intended to give the user an idea of some of the more advanced options available.

src/graphnet/data/extractors/icecube/utilities/gcd_hull.py

  • A wrapper around the scipy.spatial ConvexHull class which can be used to efficiently determine whether a point, or an array of points, is located inside or outside the volume spanned by a collection of points (see the sketch after this list).
  • It is combined with MuonGun.extruded_polygon to also be able to determine the intersection points of vectors with the surface of the hull.
  • Finally, a sphere approximation can be used to more efficiently determine which of a number of vectors intersect the smallest sphere enveloping the hull.
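
The core point-in-hull test can be done directly with the hull's facet equations; a minimal sketch of the idea (the function name and details are illustrative, not the actual gcd_hull.py API):

```python
import numpy as np
from scipy.spatial import ConvexHull

def points_in_hull(points: np.ndarray, hull: ConvexHull, tol: float = 1e-12) -> np.ndarray:
    """Boolean mask of which points lie inside the hull.

    hull.equations stores one facet plane per row as [normal | offset];
    a point x is inside if normal @ x + offset <= 0 for every facet.
    """
    normals, offsets = hull.equations[:, :-1], hull.equations[:, -1]
    return np.all(points @ normals.T + offsets <= tol, axis=1)

# Usage: span a hull over (e.g.) DOM positions, then test candidate vertices.
rng = np.random.default_rng(0)
dom_positions = rng.uniform(-500.0, 500.0, size=(100, 3))
hull = ConvexHull(dom_positions)
vertices = np.array([[0.0, 0.0, 0.0], [1e4, 0.0, 0.0]])
print(points_in_hull(vertices, hull))  # expected: [ True False]
```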

Additions: Completely new additions to GraphNet

src/graphnet/data/dataconverter.py

  • Allow for specification of an output directory when merging files.

src/graphnet/data/extractors/icecube/i3filtermapextractor.py

  • Class for extracting the boolean values of the I3FilterMap, which has a somewhat special construction (see the sketch below).
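
The special construction is that each entry of the filter map is an I3FilterResult object rather than a plain bool. A hedged sketch of the extraction idea, assuming the standard FilterMask key, a dict-like map, and the usual I3FilterResult fields:

```python
def extract_filter_booleans(frame, key: str = "FilterMask") -> dict:
    """Flatten an I3FilterResultMap into {filter_name: 0/1}.

    Each map value is an I3FilterResult whose booleans have to be read
    off its fields; a filter 'passed' only if both its condition and
    its prescale did.
    """
    return {
        name: int(result.condition_passed and result.prescale_passed)
        for name, result in frame[key].items()
    }
```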

src/graphnet/data/extractors/icecube/utilities/containments.py

  • Adds enums of containment and containment types, inspired by what already exists in icetray. (Not all of these are necessarily used in the extractors I have added; an illustrative sketch follows.)
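
As a rough illustration of what such enums look like (the names and values below are placeholders modeled on icetray's conventions, not the actual contents of containments.py):

```python
from enum import IntEnum

class TrackContainment(IntEnum):
    """Illustrative containment categories for a track in the detector."""
    NO_INTERSECTION = 0  # never enters the instrumented volume
    THROUGHGOING = 1     # enters and exits
    STARTING = 2         # vertex inside, exits
    STOPPING = 3         # enters, stops inside
    CONTAINED = 4        # starts and stops inside
```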

src/graphnet/data/extractors/icecube/utilities/i3_filters.py

  • TableFilter: only events containing the specified table are converted.
  • ChargeFilter: only events with the required charge are converted. (A sketch of the filter pattern follows.)
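
Both filters plug into the existing I3Filter pattern in i3_filters.py; a hedged sketch of a custom filter in that style (the exact hook name on the base class may differ from the sketch below):

```python
from graphnet.data.extractors.icecube.utilities.i3_filters import I3Filter

class MyTableFilter(I3Filter):
    """Keep only frames that contain a given table (frame key)."""

    def __init__(self, table_name: str) -> None:
        super().__init__()
        self._table_name = table_name

    def _keep_frame(self, frame) -> bool:
        # Frames missing the table are skipped during conversion.
        return self._table_name in frame
```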

src/graphnet/data/extractors/icecube/utilities/i3_in_ice_neutrino.py

  • Utility function for getting the highest-energy in-ice neutrino (sketched below).
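
A minimal sketch of the idea, assuming standard icetray dataclasses (this is not the actual implementation):

```python
from icecube import dataclasses

def highest_energy_in_ice_neutrino(frame):
    """Return the highest-energy in-ice neutrino in the I3MCTree, or None."""
    tree = frame["I3MCTree"]
    in_ice = dataclasses.I3Particle.LocationType.InIce
    neutrinos = [
        p for p in tree
        if p.is_neutrino and p.location_type == in_ice
    ]
    return max(neutrinos, key=lambda p: p.energy, default=None)
```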

src/graphnet/data/extractors/icecube/utilities/gcd_hull.py

  • Described above under the new label extractors.

Major changes: files that have seen major changes necessitated by the file conversion

src/graphnet/data/extractors/extractor.py

  • Allow for exclusion of extracted data. This can be useful if you want to use one of the standard extractors but want to remove e.g. a str input which would otherwise impact the database performance.

src/graphnet/data/extractors/icecube/i3extractor.py

  • Proper initialization of gcd_file and i3_file.
  • get_primaries function which can be inherited by classes. This requires defining whether the file is a CORSIKA file, since determining the primary of the muon bundle has to be handled differently.
  • check_primary_energy function, which tries to handle cases where the energy of the primary particle is not set. Usually an identical particle exists as a daughter of the particle with the missing energy, which can then be inserted as the new primary. This function can be called both on a list of primaries and on a single primary. (A sketch of the idea follows.)
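
A sketch of the idea behind check_primary_energy, under the assumption of an I3MCTree-like interface with an unset energy represented as NaN (not the actual implementation):

```python
import math

def check_primary_energy(tree, primary):
    """If the primary's energy is unset (NaN), look for an identical
    daughter that does carry the energy and promote it to primary."""
    if not math.isnan(primary.energy):
        return primary
    for daughter in tree.get_daughters(primary):
        if daughter.type == primary.type and not math.isnan(daughter.energy):
            return daughter
    return primary  # no replacement found; caller must handle missing energy
```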

src/graphnet/data/extractors/icecube/i3dictvalueextractor.py

  • Extract all values from an I3Map key in the frame.

src/graphnet/data/utilities/string_selection_resolver.py

  • There are quite a few differences between the string selections that can be done using the pandas string selection resolver and the string selection resolver of sqlite. This change allows the user to determine which of the two methods should be used. Parquet does not support string selection, and therefore only the pandas selection resolver can be used in that case. (A toy illustration of the dialect difference follows.)
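
A toy illustration of the dialect difference the resolver has to bridge (synthetic data, standard libraries only):

```python
import sqlite3
import pandas as pd

df = pd.DataFrame({"energy": [5.0, 20.0], "pid": [12, 14]})

# pandas .query() dialect: 'and', '=='
print(df.query("energy > 10 and pid == 14"))

# SQLite WHERE dialect for the same selection: 'AND', '='
con = sqlite3.connect(":memory:")
df.to_sql("truth", con, index=False)
print(con.execute("SELECT * FROM truth WHERE energy > 10 AND pid = 14").fetchall())
```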

src/graphnet/data/writers/sqlite_writer.py

  • remove_originals: Allows the user to toggle whether the input files should be deleted after merging. This is done only after all files have been merged, so files are not removed if the script terminates with an error. (This, however, also limits the usefulness of removing the files, since you would still need to have the full space available.)
  • reset_integer_primary_key: Allows the user to indicate that the key used for indexing the sqlite db needs to be reset. This is useful in instances where different processes were used to create the db files which are being merged.

src/graphnet/utilities/filesys.py

  • Allows the input for file conversion to be either a folder or a list of I3 files. This means that you do not have to copy the files that you wish to convert into another folder when making a sub-selection.

Minor changes: all the minor stuff I encountered and had to fix while working on this contribution.

.github/workflows/build.yml

  • Fix libboost issue with workflows.

examples/01_icetray/01_convert_i3_files.py

  • Add missing f-string marker.

src/graphnet/data/dataconverter.py

  • Try to catch instances of empty dataframes, which can occur if no events in the i3 file meet the selection criteria for conversion.
  • Properly update the number of workers if the number of workers is greater than the number of input files.
  • Make the converter a little more verbose.

src/graphnet/data/dataset/dataset.py

  • Parsing of the super-selection argument.
  • Dynamic handling of the truth array.

src/graphnet/data/extractors/combine_extractors.py

  • Properly set the gcd_file value for combined extractors

src/graphnet/data/pre_configured/dataconverters.py

  • Allow passing max_table_size to the SQLiteWriter.

@sevmag (Collaborator) commented Aug 14, 2025

This PR is closed by #807, isn't it?
