
Commit fc4d741

updated report 2 pages
1 parent f4bbace commit fc4d741


3 files changed: +58 -54 lines changed


proposal/main.brf

Lines changed: 3 additions & 3 deletions
@@ -10,6 +10,6 @@
1010
\backcite {li_reliable_2017}{{2}{3}{section.3}}
1111
\backcite {VermaMRMV23}{{2}{3}{section.3}}
1212
\backcite {ZeilerF14}{{2}{3}{section.3}}
13-
\backcite {HeZRS16}{{2}{\caption@xref {??}{ on input line 136}}{table.caption.3}}
14-
\backcite {ZhouKLOT16}{{2}{4.2}{subsection.4.2}}
15-
\backcite {SelvarajuCDVPB17}{{3}{4.3}{subsection.4.3}}
13+
\backcite {HeZRS16}{{2}{\caption@xref {??}{ on input line 138}}{table.caption.3}}
14+
\backcite {ZhouKLOT16}{{2}{4.3}{subsection.4.3}}
15+
\backcite {SelvarajuCDVPB17}{{2}{4.3}{subsection.4.3}}

proposal/sec/body.tex

Lines changed: 55 additions & 51 deletions
@@ -2,22 +2,23 @@ \section{Introduction}
22
\label{sec:intro}
33

44
\textit{Facial emotion recognition} (FER)~\cite{Ko18,JainSS19} is a frontier topic of significant and ongoing debate,
5-
not only in our daily life, but also in the fields of \textit{artificial intelligence} (AI) and computer vision.
5+
not only in our daily lives but also in the fields of \textit{artificial intelligence} (AI) and computer vision.
66
In this short proposal, we aim to leverage several \textit{deep neural networks} (DNNs),
77
which contain convolution layers and residual/attention blocks,
88
to detect and interpret six basic universally recognized and expressed human facial emotions
99
(i.e., happiness, surprise, sadness, anger, disgust, and fear).
1010
To make our model more transparent,
11-
we explain this emotion classification task with \textit{class activation mapping} (CAM) and \textit{gradient-weighted class activation mapping} (Grad-CAM).
11+
we explain this emotion classification task with \textit{class activation mapping} (CAM)
12+
and \textit{gradient-weighted class activation mapping} (Grad-CAM).
1213

1314
The remainder of this report is organized as follows.
1415
% \Cref{sec:related} contains the related work of our research.
1516
In \Cref{sec:approach},
1617
we address the datasets we collected and the model architecture we implemented.
1718
The preliminary evaluation results of our models are given in \Cref{sec:result}.
18-
\Cref{sec:optim} describes the optimization strategies we have plan to investigate in the coming weeks.
19-
\Cref{fig:result} illustrates the empirical results of our current best model,
20-
an overview of our time schedule for the entire final project is given in \Cref{fig:schedule}.
19+
\Cref{sec:optim} describes the optimization strategies we plan to investigate in the coming weeks.
20+
\Cref{fig:result} illustrates the empirical results of our current best model.
21+
An overview of our time schedule for the entire final project is given in \Cref{fig:schedule}.
2122
Our code and supplementary material are available at \url{https://github.com/werywjw/SEP-CVDL}.
2223

2324
% add demo: see https://github.com/werywjw/SEP-CVDL/blob/main/paper/Selvaraju_Cogswell_Grad-CAM.pdf
@@ -34,23 +35,13 @@ \subsection{Dataset Acquisition and Processing}
3435
TFEID~\cite{tfeid,LiGL22},
3536
as well as the video database DISFA~\cite{MavadatiMBTC13},
3637
from public institutions and GitHub repositories~\footnote{\url{https://github.com/spenceryee/CS229}}.
37-
Based on these databases, we created a dataset by augmentation to increase variety,
38-
full details of augmentation (see \Cref{sec:optim:aug} for details). % is given in~
38+
Based on these databases, we created a dataset by augmentation to increase the variety,
39+
and the full details of the augmentation are given in \Cref{sec:optim:aug}.
3940
Regarding the content of the pictures used, we exclusively analyze human faces representing six emotions.
4041
That is,
4142
we generated a folder structure annotating the labels 1 (surprise), 2 (fear), 3 (disgust), 4 (happiness), 5 (sadness), and 6 (anger).
4243
Besides the original format of images and videos, we set standards for extracting frames from the videos,
43-
resize training pictures to 64x64 pixels, and save them as the JPG format.
44-
45-
The images are converted to greyscale with three channels,
46-
as our original \textit{convolutional neural network} (CNN) is designed to work with three-channel inputs with random rotation and crop.
47-
Emotions were assigned tags to each individual picture in a CSV file to facilitate further processing in the model.
48-
We create a custom dataset, which is a collection of data relating to all training images we collected,
49-
using PyTorch~\footnote{\url{https://pytorch.org}},
50-
as it includes plenty existing functions to load various custom datasets in domain libraries such as \texttt{TorchVision}, \texttt{TorchText}, \texttt{TorchAudio}, and \texttt{TorchRec}.
51-
52-
% a specific problem you're working on.
53-
% In essence, a custom dataset can be comprised of almost anything.
44+
resizing training pictures to 64x64 pixels, and saving them in the JPG format.
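To make the preprocessing standard above concrete, the sketch below extracts frames from a video with OpenCV, resizes them to 64x64 pixels, and saves them as JPG files. The paths, sampling stride, and file naming are illustrative assumptions, not the repository's actual script.

# Sketch: extract frames from a video, resize to 64x64, save as JPG.
# Paths, sampling stride, and naming scheme are illustrative assumptions.
import os
import cv2  # OpenCV

def extract_frames(video_path, out_dir, size=(64, 64), stride=10):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx, saved = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:  # keep every `stride`-th frame
            frame = cv2.resize(frame, size, interpolation=cv2.INTER_AREA)
            cv2.imwrite(os.path.join(out_dir, f"frame_{saved:05d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example usage (hypothetical paths):
# extract_frames("disfa_subject01.avi", "frames/subject01")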
5445

5546
\begin{figure}[ht]
5647
\centering
@@ -62,6 +53,16 @@ \subsection{Dataset Acquisition and Processing}
6253
\label{fig:result}
6354
\end{figure}
6455

56+
The images are converted to greyscale with three channels,
57+
as our original \textit{convolutional neural network} (CNN) is designed for three-channel inputs with random rotation and crop.
58+
Emotion tags were assigned to each individual picture in a CSV file to facilitate further processing in the model.
59+
We create a custom dataset comprising all the training images we collected,
60+
using PyTorch~\footnote{\url{https://pytorch.org}},
61+
as it includes plenty of existing functions to load various custom datasets in domain libraries such as \texttt{TorchVision}, \texttt{TorchText}, \texttt{TorchAudio}, and \texttt{TorchRec}.
62+
63+
% a specific problem you're working on.
64+
% In essence, a custom dataset can be comprised of almost anything.
65+
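A minimal sketch of what such a custom PyTorch dataset could look like is given below. It assumes a CSV file with one image path and one label per row, converts each image to three-channel greyscale, and applies random rotation and crop; the column layout and transform parameters are assumptions for illustration, not the repository's actual code.

# Minimal sketch of a custom PyTorch dataset (illustrative; CSV layout,
# paths, and transform parameters are assumptions).
import csv
from PIL import Image
import torch
from torch.utils.data import Dataset
from torchvision import transforms

class EmotionDataset(Dataset):
    def __init__(self, csv_file, root_dir):
        with open(csv_file, newline="") as f:
            # assumed CSV layout: <relative image path>,<label 1..6>
            self.samples = [(row[0], int(row[1])) for row in csv.reader(f)]
        self.root_dir = root_dir
        self.transform = transforms.Compose([
            transforms.Grayscale(num_output_channels=3),  # greyscale, 3 channels
            transforms.RandomRotation(10),
            transforms.RandomCrop(64, padding=4),
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        img = Image.open(f"{self.root_dir}/{path}").convert("RGB")
        return self.transform(img), torch.tensor(label - 1)  # 0-based class index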
6566
\subsection{Model Architecture}
6667
We implement an emotion classification model from scratch, starting with four convolutional layers.
6768
Following each convolutional layer,
@@ -80,7 +81,7 @@ \subsection{Model Architecture}
8081
we add the residual connections,
8182
as they allow gradients to flow through the network more easily, improving the training for deep architectures.
8283
Moreover,
83-
we add squeeze and excitation (SE) blocks to apply channel-wise attention.
84+
we add \textit{squeeze and excitation} (SE) blocks to apply channel-wise attention.
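As a rough sketch of how an SE block combined with a residual connection might look in PyTorch (the channel count and reduction ratio are illustrative assumptions, not our model's exact configuration):

# Sketch of an SE (squeeze-and-excitation) block inside a residual unit.
# Channel count and reduction ratio are illustrative assumptions.
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # squeeze: global average pool to one value per channel
        w = x.mean(dim=(2, 3))
        # excitation: channel-wise attention weights in (0, 1)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w

class ResidualSEUnit(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.se = SEBlock(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = self.se(out)
        return self.relu(out + x)  # residual connection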
8485

8586
\begin{table}%[ht]
8687
\centering
@@ -112,9 +113,10 @@ \section{Preliminary Results}
112113
We report all the training, testing, and validation accuracy in \% to compare the performance of our models.
113114

114115
\Cref{fig:result} shows the test result aggregated from the database RAF-DB~\footnote{\url{https://www.kaggle.com/datasets/shuvoalok/raf-db-dataset}}.
116+
Different combinations of functions from the \texttt{pytorch.transforms} library are tested for augmentation, alongside already established filters. % that have been developed.
115117
As seen in \Cref{tab:model},
116-
our CNN without random augmentation outperforms the other models in terms of the accuracy,
117-
indicating that this kind of augmentation is not able to help our model predict the correct the label,
118+
our CNN without random augmentation outperforms the other models in terms of accuracy,
119+
indicating that this kind of augmentation is not able to help our model predict the correct label;
118120
thus we later aim to optimize with other augmentation techniques to capture more representative features of different emotions.
119121
Further research is oriented towards papers engaging in similar investigations~\cite{ZeilerF14,li_reliable_2017,VermaMRMV23}.
120122

@@ -136,7 +138,7 @@ \section{Preliminary Results}
136138
}
137139
\caption{Accuracy (\%) for different models in our experiments
138140
(Note that Aug stands for data augmentation, SE for squeeze and excitation, and Res for residual connections;
139-
+/- represent with/without respectively.)}
141+
+/- represent with/without respectively)}
140142
\label{tab:model}
141143
\end{table}
142144

@@ -151,7 +153,8 @@ \section{Optimization Strategies}
151153

152154
\subsection{Data Augmentation}
153155
\label{sec:optim:aug}
154-
In machine learning and AI,
156+
157+
In deep learning and AI, %machine
155158
augmentation stands as a transformative technique,
156159
empowering algorithms to learn from and adapt to a wider range of data.
157160
By introducing subtle modifications to existing data points,
@@ -163,7 +166,6 @@ \subsection{Data Augmentation}
163166
% which is a common pitfall in machine learning.
164167
Additionally, we guide the training process to enhance the recognition and handling of real-world variations.
165168
During the project, we pursue various approaches.
166-
We are implementing different combinations of functions from the \texttt{pytorch.transforms} library and testing already established filters that have been developed.
167169
% in other research contexts.
168170
Meanwhile, we create various replications of existing photos by randomly altering different properties such as size, brightness, color channels, or perspectives.
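One possible combination of such random alterations, sketched with torchvision.transforms (the specific operations and parameter ranges are assumptions for illustration, not the exact pipeline we use):

# Sketch of a random-augmentation pipeline with torchvision.transforms.
# The chosen operations and parameter ranges are illustrative assumptions.
from torchvision import transforms

augment = transforms.Compose([
    transforms.Resize((72, 72)),                            # slight upscaling before cropping
    transforms.RandomResizedCrop(64, scale=(0.8, 1.0)),     # random size/crop
    transforms.ColorJitter(brightness=0.3, contrast=0.2),   # brightness/color changes
    transforms.RandomPerspective(distortion_scale=0.2, p=0.5),  # perspective changes
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])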
169171

@@ -212,44 +214,46 @@ \subsection{Data Augmentation}
212214
\label{fig:schedule}
213215
\end{figure*}
214216

215-
\subsection{CAM} % aggregate
217+
\subsection{Classification Scores}
218+
\label{sec:optim:csv}
219+
To further analyze the separate scores of each class of the model,
220+
we write a script that takes a folder path as input and iterates through the images inside a subfolder to record the performance of the model with respect to each emotion class.
221+
The output is a CSV file with the corresponding classification scores.
222+
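A sketch of what such a script could look like is shown below; the folder layout (one subfolder per emotion class), the preprocessing, and the CSV columns are assumptions for illustration.

# Sketch: iterate over images grouped in per-class subfolders and write the
# model's classification scores to a CSV file. Folder layout, preprocessing,
# and CSV columns are illustrative assumptions.
import csv
from pathlib import Path

import torch
from PIL import Image
from torchvision import transforms

CLASSES = ["surprise", "fear", "disgust", "happiness", "sadness", "anger"]
to_tensor = transforms.Compose([
    transforms.Grayscale(num_output_channels=3),
    transforms.Resize((64, 64)),
    transforms.ToTensor(),
])

def write_scores(model, folder, out_csv="scores.csv"):
    model.eval()
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image", "true_label", *CLASSES])
        for img_path in sorted(Path(folder).glob("*/*.jpg")):
            x = to_tensor(Image.open(img_path).convert("RGB")).unsqueeze(0)
            with torch.no_grad():
                probs = torch.softmax(model(x), dim=1).squeeze(0)
            writer.writerow([img_path.name, img_path.parent.name,
                             *[f"{p:.4f}" for p in probs.tolist()]])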
223+
\subsection{CAM and Grad-CAM} % aggregate Class Activation Mapping (CAM)
216224
\label{sec:optim:cam}
217225

218-
Generally speaking, Class Activation Mapping is a visualization technique designed to highlight the regions of an image or video that contribute the most to the prediction of a specific class by a neural network,
219-
typically the final convolutional layer of a CNN before the fully connected layers.
220-
Technically, CAM generates a heatmap that highlights the important regions of the image in terms of the decision of the model.
221-
Besides proposing a method to visualize the discriminative regions of a classification-trained CNN,
222-
we adapte this approach from \citet{ZhouKLOT16} to localize objects without providing the model with any bounding box annotations.
223-
The model can thus learn the classification task with class labels and is then able to localize the object of a specific class in an image.
226+
In general,
227+
CAM helps interpret CNN decisions by providing visual cues about the regions that influenced the classification,
228+
as it highlights the important regions of an image or a video,
229+
aiding in the understanding of the behavior of the model,
230+
which is especially useful for model debugging and improvement.
231+
Besides proposing a method to visualize the discriminative regions of a CNN trained for the classification task, % classification-trained
232+
we adopt this approach from \citet{ZhouKLOT16} to localize objects without providing the model with any bounding box annotations.
233+
The model can therefore learn the classification task with class labels and is then able to localize the object of a specific class in an image or video.
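For reference, a sketch of the CAM computation in the spirit of Zhou et al. is given below. It assumes an architecture that ends in global average pooling followed by a single linear classifier; the function name and shapes are illustrative assumptions.

# Sketch of the original CAM computation. Assumes the last conv layer is
# followed by global average pooling and one linear classifier.
import torch
import torch.nn.functional as F

def class_activation_map(feature_maps, fc_weight, class_idx, out_size=(64, 64)):
    """feature_maps: (C, H, W) output of the last conv layer for one image.
    fc_weight: (num_classes, C) weight matrix of the final linear layer."""
    weights = fc_weight[class_idx]                             # (C,)
    cam = (weights[:, None, None] * feature_maps).sum(dim=0)   # weighted sum of maps
    cam = F.relu(cam)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalize to [0, 1]
    cam = F.interpolate(cam[None, None], size=out_size,
                        mode="bilinear", align_corners=False)
    return cam.squeeze()                                       # heatmap at input resolution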
224234

235+
% Technically, CAM generates a heatmap that highlights the important regions of the image in terms of the decision of the model.
225236
%~\footnote{~\url{https://medium.com/@stepanulyanin/implementing-grad-cam-in-pytorch-ea0937c31e82}}
226237
% CAM is a technique popularly used in CNNs to visualize and understand the regions of an input image that contribute most to a particular class prediction.
227238
% Model Architecture:
228239
% CAM is typically applied to the final convolutional layer of a CNN, just before the fully connected layers.
229240
% CAM Process:
230-
The final convolutional layer produces feature maps, and the GAP layer computes the average value of each feature map.
231-
The weights connecting the feature maps to the output class are obtained.
232-
The weighted combination of feature maps, representing the importance of each spatial location, is used to generate the CAM heatmap.
241+
% The final convolutional layer produces feature maps, and
233242
% Application:
234-
CAM helps interpret CNN decisions by providing visual cues about the regions that influenced the classification.
235-
It aids in understanding the model's behavior and can be useful for model debugging and improvement.
236-
The global average pooling (GAP) layer is used to obtain a spatial average of the feature maps.
237-
238-
\subsection{Grad-CAM} % aggregate
239-
\label{sec:optim:gcam}
240-
243+
% The GAP layer is used to obtain a spatial average of the feature maps.
244+
% The \textit{global average pooling} (GAP) layer computes the average value of each feature map to obtain a spatial average of feature maps.
245+
% The weights connecting the feature maps to the output class are obtained.
246+
% The weighted combination of feature maps, representing the importance of each spatial location, is used to generate the CAM heatmap.
247+
% CAM is a visualization technique designed to highlight the important regions of an image or video that contribute the most to the prediction of a specific class by a neural network,
248+
% typically the final convolutional layer of a CNN before the fully connected layers.
241249
Although CAM can provide valuable insights into the decision-making process of deep learning models, especially CNNs,
242-
CAM must be implemented in the last layer of a CNN,
250+
CAM must be implemented in the last layer of a CNN or before the fully connected layers.
243251
% Grad-CAM can be implemented with every architecture without big effort.
244-
We thus follow up Gradient-weighted CAM~\cite{SelvarajuCDVPB17},
252+
We will therefore also compare with Gradient-weighted CAM~\cite{SelvarajuCDVPB17},
245253
introduced as a technique that is easier to implement with different architectures.
246-
This task will be implemented by using the libraries of Pytorch and OpenCV~\footnote{~\url{https://opencv.org}}.
247-
248-
\subsection{Table of Classification scores}
249-
\label{sec:optim:csv}
250-
To further analyze the separate scores of the each class of the model,
251-
we wrote a script that takes a folder path as input and iterates through the images inside a subfolder.
252-
The output is a CSV file representing the corresponding classification scores.
254+
This task will be implemented using the PyTorch and OpenCV~\footnote{~\url{https://opencv.org}} libraries.
255+
% \subsection{Grad-CAM} % aggregate
256+
% \label{sec:optim:gcam}
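For comparison, a minimal Grad-CAM sketch using PyTorch hooks and OpenCV is given below; the target layer, input size, and heatmap handling are illustrative assumptions rather than our final implementation.

# Minimal Grad-CAM sketch using forward/backward hooks on the last conv layer.
# Layer choice, input size, and overlay details are illustrative assumptions.
import cv2
import numpy as np
import torch
import torch.nn.functional as F

def grad_cam(model, image_tensor, target_layer, class_idx=None):
    feats, grads = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: feats.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    logits = model(image_tensor.unsqueeze(0))               # shape (1, num_classes)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    model.zero_grad()
    logits[0, class_idx].backward()
    h1.remove(); h2.remove()

    fmap, grad = feats[0].squeeze(0), grads[0].squeeze(0)   # (C, H, W)
    weights = grad.mean(dim=(1, 2))                         # GAP over the gradients
    cam = F.relu((weights[:, None, None] * fmap).sum(dim=0))
    cam = (cam / (cam.max() + 1e-8)).detach().numpy()
    cam = cv2.resize(cam, (64, 64))                         # match input resolution
    heatmap = cv2.applyColorMap(np.uint8(255 * cam), cv2.COLORMAP_JET)
    return class_idx, heatmap                               # overlay heatmap as needed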
253257

254258
\subsection*{Author Contributions}
255259
\label{sec:author}
@@ -263,7 +267,7 @@ \subsection*{Author Contributions}
263267
She also takes part in the explainable AI and Grad-CAM work.
264268
\item \textbf{Mahdi Mohammadi} implemented the augmentation, did the literature search, the research for the conclusion, the data preprocessing, and the CAM image inquiry.
265269
\item \textbf{Jiawen Wang} implemented the model architecture, training and testing infrastructure, and optimization strategies.
266-
In the specific writing part, she also draw the figures and tables and improved this report from other team members.
270+
In the writing part, she also drew the figures and tables and improved this report based on contributions from the other team members.
267271
\end{itemize}
268272

269273
\section*{Acknowledgements}
