Open
Description
A semantic dataset is a collection of data in which each Java file is represented by a set of learned, high-dimensional features that capture the code's underlying structure, context, and relationships, rather than only its surface-level quantitative properties.
Here is a detailed breakdown of all the features required.
Class Identification
- name: The name of the software project (e.g., log4j).
- version: The version of the project (e.g., 1.1).
- name: The fully qualified name of the class being measured (e.g., org.apache.log4j.xml.examples.XCategory).
Complexity Metrics
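- wmc (Weighted Methods per Class): The sum of the complexities of the methods in a class. When every method is given a weight of 1, this reduces to a count of the methods. A higher WMC means more (or more complex) methods and is often associated with greater complexity and more potential for bugs.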
- rfc (Response for a Class): The number of methods in the class plus the number of unique methods called by methods in that class. A high RFC indicates that an object of the class can have a wide-ranging response to a message, making it complex and difficult to test.
- loc (Lines of Code): The total number of lines of code in the class file. A simple but effective measure of size.
- max_cc (Maximum Cyclomatic Complexity): The highest cyclomatic complexity value among all methods in the class. Cyclomatic complexity measures the number of linearly independent paths through a method's source code. A high value indicates a very complex method.
- avg_cc (Average Cyclomatic Complexity): The average cyclomatic complexity across all methods in the class.
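As a rough illustration of how wmc, max_cc, and avg_cc relate, the sketch below derives all three from per-method cyclomatic complexity values. The function name `summarize_complexity` and the sample method names and values are hypothetical, and wmc is computed with unit weights (a plain method count), which is an assumption rather than something this issue specifies.

```python
from statistics import mean


def summarize_complexity(method_cc):
    """Return (wmc, max_cc, avg_cc) for one class.

    method_cc maps a method name to its cyclomatic complexity.
    wmc is computed with unit weights, i.e. as a method count (assumption).
    """
    if not method_cc:
        return 0, 0, 0.0
    values = list(method_cc.values())
    return len(values), max(values), mean(values)


# Hypothetical per-method complexities for a single class.
example = {"parse": 7, "reset": 1, "toString": 1}
wmc, max_cc, avg_cc = summarize_complexity(example)
print(wmc, max_cc, avg_cc)
```

A class with one very complex method and many trivial ones shows a large gap between max_cc and avg_cc, which is why both columns are kept.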
Coupling Metrics (How connected a class is to others)
- cbo (Coupling Between Object classes): The number of other classes to which a class is coupled (i.e., it uses their methods or instance variables, or they use its). High coupling can lead to ripple effects where a change in one class requires changes in many others.
- ca (Afferent Couplings): "Incoming" coupling. The number of other classes that depend on this class. A high Ca means many classes rely on this one, so it is a core component and changes to it are more likely to affect other code.
- ce (Efferent Couplings): "Outgoing" coupling. The number of other classes that this class depends on. A high Ce means the class has many external dependencies, which can make it fragile.
- ic (Inheritance Coupling): The number of parent classes a class is coupled to through inheritance.
- cbm (Coupling Between Methods): The total number of new or redefined methods in a class to which its inherited methods are coupled. Like ic, it captures coupling introduced through inheritance.
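The difference between ca and ce is easiest to see on a small dependency graph. The sketch below counts both from a mapping of each class to the classes it uses; `coupling_counts` and the class names are hypothetical, and how the dependencies are extracted from Java code is outside its scope.

```python
def coupling_counts(depends_on):
    """Return (ca, ce) dictionaries for every class in the graph.

    depends_on maps a class name to the set of class names it uses.
    Self-references are ignored.
    """
    classes = set(depends_on)
    for targets in depends_on.values():
        classes |= targets

    ce = {cls: 0 for cls in classes}  # outgoing: classes this one depends on
    ca = {cls: 0 for cls in classes}  # incoming: classes that depend on this one
    for cls, targets in depends_on.items():
        for target in targets - {cls}:
            ce[cls] += 1
            ca[target] += 1
    return ca, ce


# Hypothetical dependency graph.
deps = {
    "Logger": {"Appender", "Layout"},
    "Appender": {"Layout"},
    "Layout": set(),
}
ca, ce = coupling_counts(deps)
print("ca:", ca)
print("ce:", ce)
```

In this toy graph `Layout` has the highest ca (everything depends on it) and `Logger` has the highest ce (it depends on everything else).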
Cohesion Metrics (How well a class's members work together)
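- lcom (Lack of Cohesion in Methods): Compares the pairs of methods in a class that share no instance variables with the pairs that share at least one. It is the number of non-sharing pairs minus the number of sharing pairs, with a floor of 0. A high LCOM value suggests the class is trying to do too many unrelated things and should perhaps be split.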
- lcom3: A specific, normalized variant of the LCOM metric. It often provides a more intuitive scale (e.g., 0 to 2).
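A minimal sketch of the two cohesion metrics, assuming the Chidamber-Kemerer pair-counting definition for lcom (non-sharing pairs minus sharing pairs, floored at 0) and the Henderson-Sellers formula for lcom3. The function names and sample data are hypothetical. The lcom3 helper considers only fields that some method accesses, which is a simplification, since a full implementation would also count fields that no method touches.

```python
from itertools import combinations


def lcom(method_fields):
    """method_fields maps a method name to the set of fields it accesses."""
    sharing = non_sharing = 0
    for a, b in combinations(method_fields.values(), 2):
        if a & b:
            sharing += 1
        else:
            non_sharing += 1
    return max(non_sharing - sharing, 0)


def lcom3(method_fields):
    """Henderson-Sellers variant; returns None where it is undefined."""
    m = len(method_fields)
    fields = set().union(*method_fields.values())
    if m < 2 or not fields:
        return None
    # Average number of methods that access each field.
    avg_access = sum(
        sum(1 for used in method_fields.values() if field in used)
        for field in fields
    ) / len(fields)
    return (avg_access - m) / (1 - m)


# Hypothetical class with three methods and two fields.
example = {"open": {"file"}, "close": {"file"}, "log": {"logger"}}
print(lcom(example), lcom3(example))
```

When every method uses every field, lcom3 is 0, and values above 1 arise only when some fields are accessed by no method at all.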
Inheritance Metrics
- dit (Depth of Inheritance Tree): The length of the path from the class to the root of the inheritance tree (i.e., Object). Deep inheritance hierarchies can be complex to understand.
- noc (Number of Children): The number of immediate subclasses of a class. A high NOC may indicate an improper abstraction or that the class is being heavily reused.
- mfa (Measure of Functional Abstraction): The ratio of the number of methods inherited by a class to the total number of methods available to it.
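The sketch below computes dit and noc from a map of each class to its direct superclass. The helper names and the small hierarchy are hypothetical, and it assumes the root (`Object`) has depth 0, so a class extending `Object` directly has dit 1. Tools differ on this convention, so the values may be offset by one from a given dataset.

```python
def dit(cls, parent_of):
    """Number of inheritance steps from cls up to the root class.

    parent_of maps a class to its direct superclass; the root has no entry.
    Assumes the hierarchy is acyclic.
    """
    depth = 0
    while cls in parent_of:
        cls = parent_of[cls]
        depth += 1
    return depth


def noc(cls, parent_of):
    """Number of immediate subclasses of cls."""
    return sum(1 for parent in parent_of.values() if parent == cls)


# Hypothetical hierarchy.
hierarchy = {
    "Category": "Object",
    "Logger": "Category",
    "XCategory": "Category",
}
print(dit("XCategory", hierarchy), noc("Category", hierarchy))
```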
Other Metrics
- npm (Number of Public Methods): The count of public methods in a class. This is the class's public interface.
- dam (Data Access Metric): A measure of the ratio of private/protected attributes to all attributes in a class. It relates to data encapsulation.
- moa (Measure of Aggregation): The number of attributes in a class that are of another user-defined class type (i.e., a measure of has-a relationships).
- cam (Cohesion Among Methods of Class): A cohesion metric that computes the relatedness among methods based on their parameter lists.
- amc (Average Method Complexity): The average size of a method in the class, approximated here as the average number of lines of code per method (LOC / WMC).
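Two of the ratios above are simple enough to show directly. This is a sketch under the definitions given in this list: dam as the share of private or protected attributes, and amc as LOC divided by WMC. The function names, the visibility labels, and the sample numbers are hypothetical.

```python
def dam(visibilities):
    """Share of attributes that are private or protected.

    visibilities holds one visibility label per attribute.
    Returns None for a class with no attributes, where the ratio is undefined.
    """
    if not visibilities:
        return None
    hidden = sum(1 for v in visibilities if v in ("private", "protected"))
    return hidden / len(visibilities)


def amc(loc, wmc):
    """Average lines of code per method; 0.0 when the class has no methods."""
    return loc / wmc if wmc else 0.0


# Hypothetical class: four attributes, 120 lines of code, 8 methods.
print(dam(["private", "private", "public", "protected"]))
print(amc(120, 8))
```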
The Target Variable
- bug: The dependent variable you are trying to predict. In this dataset, it represents the number of defects (bugs) that were later found and fixed in that specific class. A value of 0 means no bugs were reported for it.
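Defect-prediction setups often collapse the bug count into a binary defective/clean label. Whether this issue intends a count or a binary target is not stated, so the sketch below is only one possible treatment. The function names and sample counts are hypothetical.

```python
def to_binary_label(bug_count):
    """Map a defect count to 1 (defective) or 0 (clean)."""
    if bug_count < 0:
        raise ValueError("bug count cannot be negative")
    return int(bug_count > 0)


def defect_rate(bug_counts):
    """Fraction of classes with at least one reported defect."""
    labels = [to_binary_label(count) for count in bug_counts]
    return sum(labels) / len(labels) if labels else 0.0


# Hypothetical bug counts for four classes.
print(defect_rate([0, 2, 0, 1]))
```

Checking the defect rate first is useful because defect datasets tend to be imbalanced, which affects the choice of evaluation metric.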
Sub-issue of adoptium/aqa-tests#6272
smlambert