Semantic Dataset Creation for OpenJ9 and OpenJDK #1067

@anirudhsengar

A semantic dataset is a collection of data in which each Java file is represented by a set of learned, high-dimensional features that capture the code's underlying structure, context, and relationships, rather than only its surface-level quantitative properties.

Here is a detailed breakdown of all the features required.

Class Identification

  • name: The name of the software project (e.g., log4j).
  • version: The version of the project (e.g., 1.1).
  • name: The fully qualified name of the class being measured (e.g., org.apache.log4j.xml.examples.XCategory).

Complexity Metrics

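  • wmc (Weighted Methods per Class): The sum of the complexities of a class's methods. When every method is given a weight of 1, as the ckjm tool does, this is simply a count of the methods in the class. A higher WMC means more methods and is often associated with greater complexity and more potential for bugs.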
  • rfc (Response for a Class): The number of methods in the class plus the number of unique methods called by methods in that class. A high RFC indicates that an object of the class can have a wide-ranging response to a message, making it complex and difficult to test.
  • loc (Lines of Code): The total number of lines of code in the class file. A simple but effective measure of size.
  • max_cc (Maximum Cyclomatic Complexity): The highest cyclomatic complexity value among all methods in the class. Cyclomatic complexity measures the number of linearly independent paths through a method's source code. A high value indicates a very complex method.
  • avg_cc (Average Cyclomatic Complexity): The average cyclomatic complexity across all methods in the class.
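
As a rough illustration of how the three per-class complexity values relate, the sketch below derives wmc, max_cc, and avg_cc from a mapping of method name to cyclomatic complexity. The function name `complexity_summary` and the input shape are assumptions made for this example; they are not taken from any existing tool, and wmc is treated as a plain method count.

```python
def complexity_summary(method_cc):
    """Summarise per-method cyclomatic complexity for one class.

    method_cc: dict mapping method name -> cyclomatic complexity (int).
    This input shape is hypothetical; a real extractor would produce it
    from parsed source or bytecode.
    """
    if not method_cc:
        # A class with no methods has nothing to measure.
        return {"wmc": 0, "max_cc": 0, "avg_cc": 0.0}
    values = list(method_cc.values())
    return {
        "wmc": len(values),                 # unit weight per method
        "max_cc": max(values),              # most complex single method
        "avg_cc": sum(values) / len(values),
    }


if __name__ == "__main__":
    print(complexity_summary({"parse": 1, "validate": 4, "emit": 1}))
```

A class with one very branchy method and many trivial ones shows a high max_cc but a low avg_cc, which is why both columns are kept.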

Coupling Metrics (How connected a class is to others)

  • cbo (Coupling Between Object classes): The number of other classes to which a class is coupled (i.e., it uses their methods or instance variables, or they use its). High coupling can lead to ripple effects where a change in one class requires changes in many others.
  • ca (Afferent Couplings): "Incoming" coupling. The number of other classes that depend on this class. A high Ca means many classes rely on this one, so it carries more responsibility and changes to it are riskier.
  • ce (Efferent Couplings): "Outgoing" coupling. The number of other classes that this class depends on. A high Ce means the class has many external dependencies, which can make it fragile.
  • ic (Inheritance Coupling): The number of parent classes a class is coupled to through inheritance.
  • cbm (Coupling Between Methods): The total number of new or redefined methods to which the class's inherited methods are coupled. Like ic, it describes coupling introduced through inheritance.
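
To make the direction of ca and ce concrete, here is a minimal sketch that derives both from a class-level dependency map, and takes cbo as the number of distinct classes coupled in either direction (matching the "uses or is used by" wording above). The function name `coupling_counts` and the dictionary input are assumptions for illustration only.

```python
def coupling_counts(deps):
    """Compute ca, ce and cbo for every class in a dependency map.

    deps: dict mapping class name -> set of class names it depends on.
    This structure is hypothetical; a real pipeline would build it from
    imports, field types, and method calls.
    """
    classes = set(deps) | {t for targets in deps.values() for t in targets}
    # Outgoing edges, ignoring self-references.
    ce = {c: set(deps.get(c, ())) - {c} for c in classes}
    # Incoming edges are the outgoing edges reversed.
    ca = {c: set() for c in classes}
    for src, targets in ce.items():
        for target in targets:
            ca[target].add(src)
    return {
        c: {"ca": len(ca[c]), "ce": len(ce[c]), "cbo": len(ca[c] | ce[c])}
        for c in classes
    }


if __name__ == "__main__":
    example = {"A": {"B", "C"}, "B": {"C"}, "C": set()}
    for name, counts in sorted(coupling_counts(example).items()):
        print(name, counts)
```

In the example, `C` depends on nothing but is used by both other classes, so it has ce = 0 and ca = 2.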

Cohesion Metrics (How well a class's members work together)

  • lcom (Lack of Cohesion in Methods): The number of method pairs in a class that share no instance variables, minus the number of pairs that share at least one (floored at 0). A high LCOM value suggests the class is trying to do too many unrelated things and should perhaps be split.
  • lcom3: A specific, normalized variant of the LCOM metric. It often provides a more intuitive scale (e.g., 0 to 2).
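
The two cohesion values can be sketched from a map of method name to the set of instance fields that method touches. The code below assumes the pair-counting definition for lcom (pairs sharing no field minus pairs sharing one, floored at 0) and the Henderson-Sellers formula for lcom3. The function names, the input shape, and the choice to return `None` when lcom3 is undefined are assumptions of this example, not the behaviour of a specific tool.

```python
from itertools import combinations


def lcom(method_fields):
    """method_fields: dict mapping method name -> set of fields it uses."""
    disjoint = sharing = 0
    for fields_a, fields_b in combinations(method_fields.values(), 2):
        if fields_a & fields_b:
            sharing += 1
        else:
            disjoint += 1
    return max(disjoint - sharing, 0)


def lcom3(method_fields, all_fields):
    """Henderson-Sellers variant: (mean accesses per field - m) / (1 - m)."""
    m = len(method_fields)
    a = len(all_fields)
    if m < 2 or a == 0:
        return None  # undefined here; how tools handle this case varies
    accesses = sum(
        sum(1 for used in method_fields.values() if field in used)
        for field in all_fields
    )
    return (accesses / a - m) / (1 - m)


if __name__ == "__main__":
    methods = {"getX": {"x"}, "setX": {"x"}, "log": set()}
    print(lcom(methods), lcom3(methods, {"x"}))
```

For the sample class, two of the three method pairs share nothing and one pair shares `x`, so lcom is 1.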

Inheritance Metrics

  • dit (Depth of Inheritance Tree): The length of the path from the class to the root of the inheritance tree (i.e., Object). Deep inheritance hierarchies can be complex to understand.
  • noc (Number of Children): The number of immediate subclasses of a class. A high NOC may indicate an improper abstraction or that the class is being heavily reused.
  • mfa (Measure of Functional Abstraction): The ratio of the number of methods inherited by a class to the total number of methods available to it.
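
The inheritance metrics follow directly from a child-to-parent map. The sketch below is a simplified single-inheritance model with hypothetical names (`dit`, `noc`, `mfa`, and the `parent` dictionary); it treats the root as depth 0, so a class that directly extends Object has a dit of 1.

```python
def dit(cls, parent):
    """Number of superclass links from cls up to the root of the tree.

    parent: dict mapping class name -> superclass name (None for the root).
    """
    depth = 0
    while parent.get(cls) is not None:
        cls = parent[cls]
        depth += 1
    return depth


def noc(cls, parent):
    """Number of classes whose immediate superclass is cls."""
    return sum(1 for sup in parent.values() if sup == cls)


def mfa(inherited_methods, declared_methods):
    """Inherited methods as a share of all methods available to the class."""
    total = inherited_methods + declared_methods
    return inherited_methods / total if total else 0.0


if __name__ == "__main__":
    tree = {"Object": None, "Base": "Object", "Mid": "Base",
            "Leaf": "Mid", "Other": "Base"}
    print(dit("Leaf", tree), noc("Base", tree), mfa(6, 2))
```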

Other Metrics

  • npm (Number of Public Methods): The count of public methods in a class. This is the class's public interface.
  • dam (Data Access Metric): A measure of the ratio of private/protected attributes to all attributes in a class. It relates to data encapsulation.
  • moa (Measure of Aggregation): The number of attributes in a class that are of another user-defined class type (i.e., a measure of has-a relationships).
  • cam (Cohesion Among Methods of Class): A cohesion metric that computes the relatedness among methods based on their parameter lists.
  • amc (Average Method Complexity): The average method size, measured as the number of lines of code per method (LOC / WMC).
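
Two of these ratios are simple enough to show as arithmetic. The sketch below assumes field visibility is available as plain strings and that wmc is a method count; the function names and input shapes are hypothetical.

```python
def dam(field_visibility):
    """Share of fields that are private or protected.

    field_visibility: dict mapping field name -> visibility string
    ("private", "protected", "public", or "package").
    """
    if not field_visibility:
        return None  # no fields, so the ratio is undefined in this sketch
    hidden = sum(
        1 for vis in field_visibility.values()
        if vis in ("private", "protected")
    )
    return hidden / len(field_visibility)


def amc(loc, wmc):
    """Average lines of code per method (LOC / WMC)."""
    return loc / wmc if wmc else 0.0


if __name__ == "__main__":
    fields = {"a": "private", "b": "public", "c": "protected", "d": "private"}
    print(dam(fields), amc(120, 8))
```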

The Target Variable

  • bug: The dependent variable you are trying to predict. In this dataset, it represents the number of defects (bugs) that were later found and fixed in that specific class. A value of 0 means no bugs were reported for it.
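
One practical wrinkle when loading such a dataset is that the project and the class are both listed under the key name, so a dictionary-based CSV reader would silently keep only one of them. The sketch below assumes a CSV layout with the columns in the order given above (trimmed to a few metrics), renames the two name columns, and adds a binary `defective` flag. The binarisation, the renamed keys `project` and `class_name`, and the sample rows (including the metric values and the second class) are made up for illustration.

```python
import csv
import io

# Made-up sample rows; the metric values are not real measurements.
SAMPLE = (
    "name,version,name,wmc,loc,bug\n"
    "log4j,1.1,org.apache.log4j.xml.examples.XCategory,3,45,0\n"
    "log4j,1.1,org.example.Hypothetical,12,300,2\n"
)


def read_rows(text):
    """Parse the CSV text and disambiguate the duplicated 'name' header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    columns = []
    names_seen = 0
    for col in header:
        if col == "name":
            # First occurrence is the project, second is the class.
            columns.append("project" if names_seen == 0 else "class_name")
            names_seen += 1
        else:
            columns.append(col)
    rows = []
    for record in reader:
        row = dict(zip(columns, record))
        row["bug"] = int(row["bug"])
        row["defective"] = row["bug"] > 0  # optional binary target
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in read_rows(SAMPLE):
        print(row["class_name"], row["bug"], row["defective"])
```

Keeping the raw `bug` count alongside the flag leaves both regression and classification open as modelling choices.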

Sub-issue of adoptium/aqa-tests#6272
