proposal/sec/body.tex
Lines changed: 55 additions & 51 deletions
@@ -2,22 +2,23 @@ \section{Introduction}
\label{sec:intro}

\textit{Facial emotion recognition} (FER)~\cite{Ko18,JainSS19} is a topic of significant interest and ongoing debate,
-not only in our daily life, but also in the fields of \textit{artificial intelligence} (AI) and computer vision.
+not only in our daily lives but also in the fields of \textit{artificial intelligence} (AI) and computer vision.
In this short proposal, we aim to leverage several \textit{deep neural networks} (DNNs),
which contain convolution layers and residual/attention blocks,
to detect and interpret six basic universally recognized and expressed human facial emotions
(i.e., happiness, surprise, sadness, anger, disgust, and fear).
To make our model more transparent,
-we explain this emotion classification task with \textit{class activation mapping} (CAM) and \textit{gradient-weighted class activation mapping} (Grad-CAM).
+we explain this emotion classification task with \textit{class activation mapping} (CAM)
+and \textit{gradient-weighted class activation mapping} (Grad-CAM).

The structure of this report is arranged as follows.
% \Cref{sec:related} contains the related work of our research.
In \Cref{sec:approach},
we address the datasets we collected and the model architecture we implemented.
The preliminary evaluation results of our models are given in \Cref{sec:result}.
-\Cref{sec:optim} describes the optimization strategies we have plan to investigate in the coming weeks.
-\Cref{fig:result} illustrates the empirical results of our current best model,
-an overview of our time schedule for the entire final project is given in \Cref{fig:schedule}.
+\Cref{sec:optim} describes the optimization strategies we plan to investigate in the coming weeks.
+\Cref{fig:result} illustrates the empirical results of our current best model.
+An overview of our time schedule for the entire final project is given in \Cref{fig:schedule}.
Our code and supplementary material are available at \url{https://github.com/werywjw/SEP-CVDL}.
% add demo: see https://github.com/werywjw/SEP-CVDL/blob/main/paper/Selvaraju_Cogswell_Grad-CAM.pdf
@@ -34,23 +35,13 @@ \subsection{Dataset Acquisition and Processing}
TFEID~\cite{tfeid,LiGL22},
as well as the video database DISFA~\cite{MavadatiMBTC13},
from public institutions and GitHub repositories~\footnote{\url{https://github.com/spenceryee/CS229}}.
-Based on these databases, we created a dataset by augmentation to increase variety,
-full details of augmentation (see \Cref{sec:optim:aug} for details). % is given in~
+Based on these databases, we created a dataset by augmentation to increase the variety;
+full details of the augmentation are given in \Cref{sec:optim:aug}.
Regarding the content of the pictures used, we exclusively analyze human faces representing six emotions.
That is,
we generalized a folder structure annotating the labels 1 (surprise), 2 (fear), 3 (disgust), 4 (happiness), 5 (sadness), and 6 (anger).
Besides the original format of images and videos, we set standards for extracting frames from the videos,
-resize training pictures to 64x64 pixels, and save them as the JPG format.
-
-The images are converted to greyscale with three channels,
-as our original \textit{convolutional neural network} (CNN) is designed to work with three-channel inputs with random rotation and crop.
-Emotions were assigned tags to each individual picture in a CSV file to facilitate further processing in the model.
-We create a custom dataset, which is a collection of data relating to all training images we collected,
-using PyTorch~\footnote{\url{https://pytorch.org}},
-as it includes plenty existing functions to load various custom datasets in domain libraries such as \texttt{TorchVision}, \texttt{TorchText}, \texttt{TorchAudio}, and \texttt{TorchRec}.
-
-% a specific problem you're working on.
-% In essence, a custom dataset can be comprised of almost anything.
+resizing training pictures to 64x64 pixels, and saving them in the JPG format.
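As a rough illustration of this preprocessing step, a minimal sketch using OpenCV is given below; the frame step, paths, and naming scheme are assumptions for illustration, not our fixed extraction standard.
\begin{verbatim}
import os
import cv2  # OpenCV: video decoding, resizing, JPG writing

def extract_frames(video_path, out_dir, step=10, size=(64, 64)):
    """Save every `step`-th frame of a video as a 64x64 JPG image."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frame = cv2.resize(frame, size)
            cv2.imwrite(os.path.join(out_dir, "frame_%05d.jpg" % saved), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved
\end{verbatim}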
\begin{figure}[ht]
\centering
@@ -62,6 +53,16 @@ \subsection{Dataset Acquisition and Processing}
\label{fig:result}
\end{figure}

+The images are converted to greyscale with three channels,
+as our original \textit{convolutional neural network} (CNN) is designed to work with three-channel inputs with random rotation and crop.
+Emotion tags were assigned to each individual picture in a CSV file to facilitate further processing in the model.
+We create a custom dataset, which is a collection of data relating to all training images we collected,
+using PyTorch~\footnote{\url{https://pytorch.org}},
+as it includes plenty of existing functions to load various custom datasets in domain libraries such as \texttt{TorchVision}, \texttt{TorchText}, \texttt{TorchAudio}, and \texttt{TorchRec}.
+
+% a specific problem you're working on.
+% In essence, a custom dataset can be comprised of almost anything.
+

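A minimal sketch of such a custom dataset is shown below; it assumes a hypothetical \texttt{labels.csv} with columns \texttt{filename} and \texttt{label} (values 1 to 6) and is not the exact implementation in our repository.
\begin{verbatim}
import csv
import torch
from PIL import Image
from torch.utils.data import Dataset
from torchvision import transforms

class EmotionDataset(Dataset):
    """64x64 face images with emotion labels read from a CSV file."""
    def __init__(self, csv_path, img_dir):
        with open(csv_path) as f:
            self.samples = [(r["filename"], int(r["label"]))
                            for r in csv.DictReader(f)]
        self.img_dir = img_dir
        self.transform = transforms.Compose([
            transforms.Grayscale(num_output_channels=3),  # greyscale, 3 channels
            transforms.RandomRotation(10),
            transforms.RandomCrop(64, padding=4),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        name, label = self.samples[idx]
        img = Image.open(self.img_dir + "/" + name)
        return self.transform(img), torch.tensor(label - 1)  # 0-based class index
\end{verbatim}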
\subsection{Model Architecture}
We implement an emotion classification model from scratch with four convolution layers at the very beginning.
Following each convolutional layer,
@@ -80,7 +81,7 @@ \subsection{Model Architecture}
we add the residual connections,
as they allow gradients to flow through the network more easily, improving the training for deep architectures.
Moreover,
-we add squeeze and excitation (SE) blocks to apply channel-wise attention.
+we add \textit{squeeze and excitation} (SE) blocks to apply channel-wise attention.
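The following sketch illustrates how one residual block with an SE stage could look in PyTorch; the channel counts and reduction ratio are illustrative assumptions rather than our exact configuration.
\begin{verbatim}
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Channel-wise attention: squeeze (global average pool), then excite."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)  # per-channel weights in [0, 1]
        return x * w

class ResidualSEBlock(nn.Module):
    """Two conv-BN stages, SE attention, and a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels))
        self.se = SEBlock(channels)

    def forward(self, x):
        # The residual connection lets gradients bypass the block.
        return torch.relu(x + self.se(self.body(x)))
\end{verbatim}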
We report all the training, testing, and validation accuracy in \% to compare the performance of our models.
\Cref{fig:result} shows the test result aggregated from the database RAF-DB~\footnote{\url{https://www.kaggle.com/datasets/shuvoalok/raf-db-dataset}}.
+Different combinations of functions from the \texttt{pytorch.transforms} library are tested for augmentation, alongside already established filters. % that have been developed.
As seen in \Cref{tab:model},
-our CNN without random augmentation outperforms the other models in terms of the accuracy,
-indicating that this kind of augmentation is not able to help our model predict the correct the label,
+our CNN without random augmentation outperforms the other models in terms of accuracy,
+indicating that this kind of augmentation is not able to help our model predict the correct label,
thus we later aim to optimize with other augmentation techniques to capture more representative features of different emotions.
Further research is oriented toward papers conducting similar investigations~\cite{ZeilerF14,li_reliable_2017,VermaMRMV23}.
@@ -136,7 +138,7 @@ \section{Preliminary Results}
}
\caption{Accuracy (\%) for different models in our experiments
(Note that Aug stands for data augmentation, SE for squeeze and excitation, and Res for residual connections;
Additionally, we guide the training process to enhance the recognition and handling of real-world variations.
During the project, we pursue various approaches.
-We are implementing different combinations of functions from the \texttt{pytorch.transforms} library and testing already established filters that have been developed.
% in other research contexts.
Meanwhile, we create various replications of existing photos by randomly altering different properties such as size, brightness, color channels, or perspectives.
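One candidate pipeline for these random alterations, sketched with \texttt{torchvision.transforms} (the concrete filters and parameters are still under evaluation), is:
\begin{verbatim}
from torchvision import transforms

# Randomly alter size, brightness/colour, orientation, and perspective.
augment = transforms.Compose([
    transforms.RandomResizedCrop(64, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.3, contrast=0.2, saturation=0.2),
    transforms.RandomHorizontalFlip(),
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),
    transforms.ToTensor(),
])
\end{verbatim}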
To further analyze the separate scores of each class of the model,
+we write a script that takes a folder path as input and iterates through the images inside a subfolder to record the performance of the model with respect to each emotion class.
+The output is a CSV file containing the corresponding classification scores.
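A minimal sketch of this script is given below; the class order, folder layout, and file names are assumptions for illustration.
\begin{verbatim}
import csv
import os
import torch
from PIL import Image
from torchvision import transforms

LABELS = ["surprise", "fear", "disgust", "happiness", "sadness", "anger"]
prep = transforms.Compose([transforms.Grayscale(3),
                           transforms.Resize((64, 64)),
                           transforms.ToTensor()])

def score_folder(model, folder, out_csv="scores.csv"):
    """Record per-class softmax scores for every image in each subfolder."""
    model.eval()
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "subfolder"] + LABELS)
        for emotion in sorted(os.listdir(folder)):
            subdir = os.path.join(folder, emotion)
            if not os.path.isdir(subdir):
                continue
            for name in sorted(os.listdir(subdir)):
                img = prep(Image.open(os.path.join(subdir, name))).unsqueeze(0)
                with torch.no_grad():
                    probs = torch.softmax(model(img), dim=1).squeeze(0)
                writer.writerow([name, emotion]
                                + ["%.4f" % p for p in probs.tolist()])
\end{verbatim}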
+
+\subsection{CAM and Grad-CAM} % aggregate Class Activation Mapping (CAM)
\label{sec:optim:cam}

-Generally speaking, Class Activation Mapping is a visualization technique designed to highlight the regions of an image or video that contribute the most to the prediction of a specific class by a neural network,
-typically the final convolutional layer of a CNN before the fully connected layers.
-Technically, CAM generates a heatmap that highlights the important regions of the image in terms of the decision of the model.
-Besides proposing a method to visualize the discriminative regions of a classification-trained CNN,
-we adapte this approach from \citet{ZhouKLOT16} to localize objects without providing the model with any bounding box annotations.
-The model can thus learn the classification task with class labels and is then able to localize the object of a specific class in an image.
+In general,
+CAM helps interpret CNN decisions by providing visual cues about the regions that influenced the classification,
+as it highlights the important regions of an image or a video,
+aiding in the understanding of the behavior of the model,
+which is especially useful for model debugging and improvement.
+Besides proposing a method to visualize the discriminative regions of a CNN trained for the classification task, % classification-trained
+we adopt this approach from \citet{ZhouKLOT16} to localize objects without providing the model with any bounding box annotations.
+The model can therefore learn the classification task with class labels and is then able to localize the object of a specific class in an image or video.

+% Technically, CAM generates a heatmap that highlights the important regions of the image in terms of the decision of the model.
% CAM is a technique popularly used in CNNs to visualize and understand the regions of an input image that contribute most to a particular class prediction.
% Model Architecture:
% CAM is typically applied to the final convolutional layer of a CNN, just before the fully connected layers.
% CAM Process:
-The final convolutional layer produces feature maps, and the GAP layer computes the average value of each feature map.
-The weights connecting the feature maps to the output class are obtained.
-The weighted combination of feature maps, representing the importance of each spatial location, is used to generate the CAM heatmap.
+% The final convolutional layer produces feature maps, and
% Application:
-CAM helps interpret CNN decisions by providing visual cues about the regions that influenced the classification.
-It aids in understanding the model's behavior and can be useful for model debugging and improvement.
-The global average pooling (GAP) layer is used to obtain a spatial average of the feature maps.
-
-\subsection{Grad-CAM} % aggregate
-\label{sec:optim:gcam}
-
+% The GAP layer is used to obtain a spatial average of the feature maps.
+% The \textit{global average pooling} (GAP) layer computes the average value of each feature map to obtain a spatial average of feature maps.
+% The weights connecting the feature maps to the output class are obtained.
+% The weighted combination of feature maps, representing the importance of each spatial location, is used to generate the CAM heatmap.
+% CAM is a visualization technique designed to highlight the important regions of an image or video that contribute the most to the prediction of a specific class by a neural network,
+% typically the final convolutional layer of a CNN before the fully connected layers.
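For reference, a minimal sketch of this computation is given below, assuming the final convolutional feature maps and the weights of the fully connected classifier are available; the names are illustrative rather than taken from our code.
\begin{verbatim}
import torch
import torch.nn.functional as F

def compute_cam(feature_maps, fc_weights, class_idx, size=(64, 64)):
    """CAM: class-specific weighted sum of the last conv feature maps."""
    # feature_maps: (C, H, W); fc_weights: (num_classes, C)
    weights = fc_weights[class_idx]                           # (C,)
    cam = torch.einsum("c,chw->hw", weights, feature_maps)    # spatial importance
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalise to [0, 1]
    return F.interpolate(cam[None, None], size=size,
                         mode="bilinear", align_corners=False)[0, 0]
\end{verbatim}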
Although CAM can provide valuable insights into the decision-making process of deep learning models, especially CNNs,
-CAM must be implemented in the last layer of a CNN,
+CAM must be implemented in the last layer of a CNN or before the fully connected layer,
% Grad-CAM can be implemented with every architecture without big effort.
-We thus follow up Gradient-weighted CAM~\cite{SelvarajuCDVPB17},
+We will therefore also compare against Gradient-weighted CAM~\cite{SelvarajuCDVPB17},
introduced as a technique that is easier to implement with different architectures.
-This task will be implemented by using the libraries of Pytorch and OpenCV~\footnote{~\url{https://opencv.org}}.
-
-\subsection{Table of Classification scores}
-\label{sec:optim:csv}
-To further analyze the separate scores of the each class of the model,
-we wrote a script that takes a folder path as input and iterates through the images inside a subfolder.
-The output is a CSV file representing the corresponding classification scores.
+This task will be implemented using the PyTorch and OpenCV~\footnote{~\url{https://opencv.org}} libraries.
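A rough sketch of how Grad-CAM could be realized with PyTorch hooks and an OpenCV overlay is given below; \texttt{model} and \texttt{target\_layer} are placeholders for our network and its last convolutional layer, not fixed names from our code.
\begin{verbatim}
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, image_tensor, class_idx):
    """Weight the target layer's activations by spatially averaged gradients."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(
        lambda m, i, o: acts.update(v=o))
    h2 = target_layer.register_full_backward_hook(
        lambda m, gi, go: grads.update(v=go[0]))
    logits = model(image_tensor.unsqueeze(0))
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()
    w = grads["v"].mean(dim=(2, 3), keepdim=True)              # GAP of gradients
    cam = F.relu((w * acts["v"]).sum(dim=1)).squeeze(0)        # weighted activations
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.detach().cpu().numpy()

def overlay(cam, bgr_image):
    """Blend the heatmap onto the original image with OpenCV."""
    cam = cv2.resize(cam, (bgr_image.shape[1], bgr_image.shape[0]))
    heat = cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)
    return cv2.addWeighted(bgr_image, 0.6, heat, 0.4, 0)
\end{verbatim}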
She also takes part in the explainable AI and Grad-CAM work.
\item\textbf{Mahdi Mohammadi} implemented the augmentation, did the literature research, the research for the conclusion, data preprocessing, and the CAM-image inquiry.
\item\textbf{Jiawen Wang} implemented the model architecture, training and testing infrastructure, and optimization strategies.
-In the specific writing part, she also draw the figures and tables and improved this report from other team members.
+In the writing part, she also drew the figures and tables and improved this report based on feedback from other team members.