Commit a84b855

Merge pull request #181 from lanl/develop ("Develop")
2 parents ff683c8 + 387bb92

74 files changed: +894 additions, -190 deletions


CITATION.cff

Lines changed: 2 additions & 2 deletions

@@ -1,4 +1,4 @@
-cff-version: 1.2.0
+version: 0.0.35
 message: "If you use this software, please cite it as below."
 authors:
 - family-names: Eren
@@ -20,7 +20,7 @@ authors:
 - family-names: Alexandrov
   given-names: Boian
 title: "Tensor Extraction of Latent Features (T-ELF)"
-version: 0.0.34
+version: 0.0.35
 url: https://github.com/lanl/T-ELF
 doi: 10.5281/zenodo.10257897
 date-released: 2023-12-04

README.md

Lines changed: 16 additions & 1 deletion

@@ -97,20 +97,35 @@ python post_install.py # use the following, for example, for GPU system: <python
 | WNMFk | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | NMFk with weighting - used for recommendation system | [Link](examples/WNMFk/WNMFk.ipynb) | :white_check_mark: |
 | HNMFk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Hierarchical NMFk | [Link](examples/HNMFk/HNMFk.ipynb) | :white_check_mark: |
 | BNMFk | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Boolean NMFk | [Link](examples/BNMFk/BNMFk.ipynb) | :white_check_mark: |
-
+| LMF | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | | | Logistic Matrix Factorization | [Link](examples/LMF/LMF.ipynb) | :white_check_mark: |
+| SPLIT NMFk | | | | | | | Joint NMFk factorization of multiple data via SPLIT | | :soon: |
+| SPLIT Transfer Classifier | | | | | | | Supervised transfer learning method via SPLIT and NMFk | | :soon: |
+
 ### TELF.pre_processing
 
 | **Method** | **Multiprocessing** | **HPC** | **Description** | **Example** | **Release Status** |
 |:----------:|:-------------------:|:-------------------:|:------------------------------------------------------------------:|:-----------:|:------------------:|
 | Vulture | :heavy_check_mark: | :heavy_check_mark: | Advanced text processing tool for cleaning and NLP | [Link](examples/Vulture) | :white_check_mark: |
 | Beaver | :heavy_check_mark: | :heavy_check_mark: | Fast matrix and tensor building tool for text mining | [Link](examples/Beaver) | :white_check_mark: |
+| iPenguin | | | Online Semantic Scholar information retrieval tool | | :soon: |
+| Orca | | | Duplicate author detector for text mining and information retrieval | | :soon: |
+
+### TELF.post_processing
 
+| **Method** | **Description** | **Example** | **Release Status** |
+|:----------:|:----------------------------------------------------------:|:-----------:|:------------------:|
+| Peacock | Data visualization and generation of actionable statistics | | :soon: |
+| Wolf | Graph centrality and ranking tool | | :soon: |
+| Fox | Report generation tool for text data | | :soon: |
+| SeaLion | Generic report generation tool | | :soon: |
 
 ### TELF.applications
 
 | **Method** | **Description** | **Example** | **Release Status** |
 |:----------:|:--------------------------------------------------------------------:|:-----------:|:------------------:|
 | Cheetah | Fast search by keywords and phrases | [Link](examples/Cheetah) | :white_check_mark: |
+| Bunny | Dataset generation tool for documents and their citations/references | | :soon: |
+| Termite | Knowledge graph building tool | | :soon: |
 
 
 ## How to Cite T-ELF?

TELF/applications/Cheetah/cheetah.py

Lines changed: 9 additions & 1 deletion

@@ -833,7 +833,15 @@ def _index_affiliation_country(self, data:dict) -> tuple:
 
         for affil_id, affil_info_dict in curr_info_dict.items():
             affil_id = str(affil_id).strip().lower()
-            country = affil_info_dict["country"].strip().lower()
+
+            # Check type and country of affiliation
+            if not isinstance(affil_info_dict, dict):
+                continue
+            country = affil_info_dict.get("country")
+            if country:
+                country = country.strip().lower()
+            else:
+                country = ''
 
             # affiliation
             if str(affil_id) in affiliation_index_tmp:
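
The new guard handles two failure modes in raw affiliation records: a value that is not a dict at all, and a dict whose "country" key is missing or None; the old line crashed on both. A standalone sketch of the same logic (the sample records below are hypothetical, not taken from the commit):

    # Hypothetical affiliation records illustrating the guarded lookup above.
    curr_info_dict = {
        "60025004": {"country": " USA "},        # well-formed entry
        "60000001": "unparsed affiliation",       # not a dict: old code raised TypeError
        "60000002": {"name": "no country key"},   # dict without "country": old code raised KeyError
    }

    for affil_id, affil_info_dict in curr_info_dict.items():
        affil_id = str(affil_id).strip().lower()
        if not isinstance(affil_info_dict, dict):
            continue  # skip malformed records instead of crashing
        country = affil_info_dict.get("country")
        country = country.strip().lower() if country else ''
        print(affil_id, repr(country))  # "60025004 'usa'", then "60000002 ''"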
Lines changed: 242 additions & 0 deletions

@@ -0,0 +1,242 @@
from tqdm import tqdm
import matplotlib.pyplot as plt
import numpy as np

try:
    import cupy as cp
except Exception:
    cp = None

class LogisticMatrixFactorization:
    def __init__(self, k=30, l2_p=1e-6, epochs=1000, learning_rate=0.001, tolerance=1e-4, device="cpu", random_state=None):
        """
        Logistic Matrix Factorization with a mask.

        Parameters:
        - k: Number of latent factors.
        - l2_p: Regularization parameter (L2 penalty).
        - epochs: Number of training epochs.
        - learning_rate: Learning rate for gradient descent.
        - tolerance: Early stopping criterion based on loss change.
        """
        self.k = k
        self.l2_p = l2_p
        self.epochs = epochs
        self.learning_rate = learning_rate
        self.tolerance = tolerance
        self.np = np
        self.random_state = random_state

        if device == "cpu":
            self.device = device
        elif device == "gpu":
            self.device = 0
        elif isinstance(device, int) and device >= 0:
            self.device = device
        else:
            raise Exception("Device should be 'cpu', 'gpu' (CUDA:0), or a GPU number between 0 and N-1 where N is the number of GPUs.")

        if self.device != "cpu" and cp is None:
            print("No CUDA found! Using CPU!")
            self.device = "cpu"

    def fit(self, Xtrain, MASK, plot_loss=True):
        """
        Train the logistic matrix factorization model.

        Parameters:
        - Xtrain: Training interaction matrix (m x n).
        - MASK: Binary mask matrix with 1s for observed entries in Xtrain.

        Returns:
        - W: Learned row (user) latent feature matrix (m x k).
        - H: Learned column (item) latent feature matrix (k x n).
        - row_bias: Learned row bias vector (m x 1).
        - col_bias: Learned column bias vector (1 x n).
        """
        if self.device != "cpu":
            self.np = cp

        m, n = Xtrain.shape
        W, H, row_bias, col_bias = self._initialize_embeddings(m, n)

        if self.device != "cpu":
            with cp.cuda.Device(self.device):
                losses = cp.zeros(self.epochs)
                MASK = cp.array(MASK)
                Xtrain = cp.array(Xtrain)
                W, H, row_bias, col_bias, losses = self._factorization_routine(W, H, row_bias, col_bias, MASK, Xtrain, losses)

                # to CPU
                W = cp.asnumpy(W)
                H = cp.asnumpy(H)
                row_bias = cp.asnumpy(row_bias)
                col_bias = cp.asnumpy(col_bias)
                MASK = cp.asnumpy(MASK)
                Xtrain = cp.asnumpy(Xtrain)
                losses = cp.asnumpy(losses)
            self.np = np
        else:
            losses = np.zeros(self.epochs)
            W, H, row_bias, col_bias, losses = self._factorization_routine(W, H, row_bias, col_bias, MASK, Xtrain, losses)

        # Plot loss
        if plot_loss:
            plt.plot(losses)
            plt.xlabel('Epoch')
            plt.ylabel('Loss')
            plt.title('Training Loss')
            plt.show()

        return W, H, row_bias, col_bias, losses

    def predict(self, W, H, row_bias, col_bias):
        """
        Predict all entries in the matrix.

        Parameters:
        - W: Learned row latent feature matrix (m x k).
        - H: Learned column latent feature matrix (k x n).
        - row_bias: Learned row bias vector (m x 1).
        - col_bias: Learned column bias vector (1 x n).

        Returns:
        - Xtilda: Predicted matrix of interaction probabilities.
        """
        return self._sigmoid(self.np.dot(W, H) + row_bias + col_bias)

    def map_probabilities_to_binary(self, Xtilda, threshold=0.5):
        """
        Map probabilities to binary values (0 or 1) using a threshold.

        Parameters:
        - Xtilda: numpy array, predicted probabilities (values in [0, 1]).
        - threshold: float, the cutoff for mapping probabilities to 0 or 1.

        Returns:
        - Xtilda_binary: numpy array, binary Xtilda (0s and 1s).
        """
        return (Xtilda >= threshold).astype(int)

    def _initialize_embeddings(self, m, n):
        """
        Initialize embeddings (W and H) and biases for rows (users) and columns (items).
        """
        np.random.seed(self.random_state)

        W = np.random.normal(scale=0.1, size=(m, self.k))
        H = np.random.normal(scale=0.1, size=(self.k, n))
        row_bias = np.random.normal(scale=0.1, size=(m, 1))
        col_bias = np.random.normal(scale=0.1, size=(1, n))

        if self.device != "cpu":
            with cp.cuda.Device(self.device):
                W, H, row_bias, col_bias = cp.array(W), cp.array(H), cp.array(row_bias), cp.array(col_bias)

        return W, H, row_bias, col_bias

    def _sigmoid(self, x):
        return 1 / (1 + self.np.exp(-x))

    def _compute_loss(self, X_train, Xtilda, MASK, W, H):
        """
        Compute binary cross-entropy loss.

        Parameters:
        - X_train: Training interaction matrix.
        - Xtilda: Predicted matrix.
        - MASK: Binary mask matrix.

        Returns:
        - loss: Binary cross-entropy loss.
        """
        loss = -self.np.sum(
            MASK * (X_train * self.np.log(Xtilda + 1e-8) + (1 - X_train) * self.np.log(1 - Xtilda + 1e-8))
        )
        loss += self.l2_p * (self.np.sum(W ** 2) + self.np.sum(H ** 2))
        return loss

    def _factorization_routine(self, W, H, row_bias, col_bias, MASK, Xtrain, losses):
        """
        Performs matrix factorization using full-batch gradient descent with regularization and optional early stopping.

        This function iteratively optimizes the latent factor matrices (`W` and `H`), row biases, and column biases
        to minimize the reconstruction error between the observed entries in the input matrix (`Xtrain`) and the predicted
        matrix (`Xtilda`). It incorporates L2 regularization and supports early stopping if the loss improvement falls
        below a specified tolerance.

        Parameters:
        W (numpy.ndarray):
            A matrix of shape `(num_rows, latent_factors)` representing the initial latent factors for rows.
        H (numpy.ndarray):
            A matrix of shape `(latent_factors, num_columns)` representing the initial latent factors for columns.
        row_bias (numpy.ndarray):
            A vector of shape `(num_rows, 1)` representing the row-wise biases.
        col_bias (numpy.ndarray):
            A vector of shape `(1, num_columns)` representing the column-wise biases.
        MASK (numpy.ndarray):
            A binary mask matrix of the same shape as `Xtrain`, where 1 indicates an observed entry and 0 indicates missing.
        Xtrain (numpy.ndarray):
            The observed training data matrix of shape `(num_rows, num_columns)`.
        losses (list or numpy.ndarray):
            A pre-allocated container to store the loss values at each epoch.

        Returns:
        W (numpy.ndarray):
            The updated latent factor matrix for rows after optimization.
        H (numpy.ndarray):
            The updated latent factor matrix for columns after optimization.
        row_bias (numpy.ndarray):
            The updated row-wise biases.
        col_bias (numpy.ndarray):
            The updated column-wise biases.
        losses (list or numpy.ndarray):
            The updated list or array containing the training loss at each epoch.

        Steps:
        1. **Prediction**: The predicted matrix (`Xtilda`) is computed using the current `W`, `H`, `row_bias`, and `col_bias`.
        2. **Error Calculation**: The reconstruction error is calculated only for observed entries using the binary mask (`MASK`).
        3. **Gradient Calculation**: Gradients for `W`, `H`, `row_bias`, and `col_bias` are computed using the observed errors
           and L2 regularization.
        4. **Parameter Updates**: The latent factor matrices (`W`, `H`) and biases (`row_bias`, `col_bias`) are updated using
           the gradients and a specified learning rate.
        5. **Loss Calculation**: The reconstruction loss is computed for the current epoch and stored in the `losses` array.
        6. **Early Stopping**: If the loss improvement between consecutive epochs falls below a predefined tolerance, the
           optimization process terminates early.

        Notes:
        - The `_compute_loss` function computes the loss using both observed reconstruction errors and regularization terms.
        - Early stopping can significantly reduce computation time when the optimization converges quickly.
        - The function updates the input parameters in-place, and the returned values reflect the final state after optimization.
        """
        for epoch in tqdm(range(self.epochs)):
            # Compute Xtilda (predictions)
            Xtilda = self.predict(W, H, row_bias=row_bias, col_bias=col_bias)

            # Compute errors for observed entries
            errors = MASK * (Xtilda - Xtrain)

            # Gradients
            grad_W = self.np.dot(errors, H.T) + self.l2_p * W
            grad_H = self.np.dot(W.T, errors) + self.l2_p * H
            grad_row_bias = self.np.sum(errors, axis=1, keepdims=True) + self.l2_p * row_bias
            grad_col_bias = self.np.sum(errors, axis=0, keepdims=True) + self.l2_p * col_bias

            # Update embeddings and biases
            W -= self.learning_rate * grad_W
            H -= self.learning_rate * grad_H
            row_bias -= self.learning_rate * grad_row_bias
            col_bias -= self.learning_rate * grad_col_bias

            # Compute training loss
            loss = self._compute_loss(Xtrain, Xtilda, MASK, W, H)
            losses[epoch] = loss

            # Early stopping based on tolerance
            if self.tolerance is not None and (epoch > 0 and abs(losses[epoch] - losses[epoch-1]) < self.tolerance):
                print(f"Early stopping at epoch {epoch + 1}. Loss change below tolerance.")
                break

        return W, H, row_bias, col_bias, losses
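
For reference (an editorial sketch, not part of the commit): writing M for MASK, X for Xtrain, lambda for l2_p, and epsilon = 1e-8, `_compute_loss` evaluates the masked binary cross-entropy

    \tilde{X} = \sigma\bigl(WH + b_{\text{row}} + b_{\text{col}}\bigr), \qquad \sigma(x) = \frac{1}{1 + e^{-x}},

    \mathcal{L} = -\sum_{i,j} M_{ij}\Bigl[ X_{ij}\log(\tilde{X}_{ij} + \varepsilon) + (1 - X_{ij})\log(1 - \tilde{X}_{ij} + \varepsilon) \Bigr] + \lambda\bigl( \lVert W \rVert_F^2 + \lVert H \rVert_F^2 \bigr).

Because the sigmoid/cross-entropy pair gives \partial\mathcal{L}/\partial(WH) = M \odot (\tilde{X} - X) =: E on the observed entries, the gradients in `_factorization_routine` follow directly:

    \nabla_W = E H^{\top} + \lambda W, \qquad
    \nabla_H = W^{\top} E + \lambda H, \qquad
    \nabla_{b_{\text{row}}} = E\,\mathbf{1} + \lambda b_{\text{row}}, \qquad
    \nabla_{b_{\text{col}}} = \mathbf{1}^{\top} E + \lambda b_{\text{col}}.

Note the updates use lambda*W rather than 2*lambda*W, i.e. they treat the penalty as (lambda/2) times the squared Frobenius norms, while `_compute_loss` reports it without the 1/2; this only rescales the regularization term in the logged loss.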

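A minimal usage sketch for the new class (editorial, not from the commit; it assumes LogisticMatrixFactorization is in scope, since the new file's path is not shown above):

    import numpy as np

    rng = np.random.default_rng(42)
    X = (rng.random((50, 40)) > 0.5).astype(float)     # binary interaction matrix (m x n)
    MASK = (rng.random((50, 40)) > 0.2).astype(float)  # 1 = observed entry, 0 = held out

    model = LogisticMatrixFactorization(k=5, epochs=200, learning_rate=0.01,
                                        device="cpu", random_state=42)
    W, H, row_bias, col_bias, losses = model.fit(X, MASK, plot_loss=False)

    Xtilda = model.predict(W, H, row_bias, col_bias)     # probabilities in (0, 1)
    Xbinary = model.map_probabilities_to_binary(Xtilda)  # thresholded at 0.5
    print(Xbinary.shape)                                 # (50, 40)

One design note: `fit` swaps `self.np` between numpy and cupy, so `predict`, `_sigmoid`, `_compute_loss`, and the gradient code run unchanged on either device, and a `device="gpu"` request falls back to CPU when cupy is not importable.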
TELF/version.py

Lines changed: 1 addition & 1 deletion

@@ -1 +1 @@
-__version__ = '0.0.34'
+__version__ = "0.0.35"

docs/Beaver.html

Lines changed: 3 additions & 3 deletions

@@ -8,7 +8,7 @@
 <meta charset="utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />
 
-<title>TELF.pre_processing.Beaver: Fast matrix and tensor building tool &#8212; TELF 0.0.34 documentation</title>
+<title>TELF.pre_processing.Beaver: Fast matrix and tensor building tool &#8212; TELF 0.0.35 documentation</title>
 
@@ -40,7 +40,7 @@
 <link rel="preload" as="script" href="_static/scripts/bootstrap.js?digest=26a4bc78f4c0ddb94549" />
 <link rel="preload" as="script" href="_static/scripts/pydata-sphinx-theme.js?digest=26a4bc78f4c0ddb94549" />
 
-<script src="_static/documentation_options.js?v=d15f0e27"></script>
+<script src="_static/documentation_options.js?v=6aa38c3a"></script>
 <script src="_static/doctools.js?v=9bcbadda"></script>
 <script src="_static/sphinx_highlight.js?v=dc90522c"></script>
 <script src="_static/scripts/sphinx-book-theme.js?v=887ef09a"></script>
@@ -126,7 +126,7 @@
 
-<p class="title logo__title">TELF 0.0.34 documentation</p>
+<p class="title logo__title">TELF 0.0.35 documentation</p>
 
 </a></div>
 <div class="sidebar-primary-item">

docs/Cheetah.html

Lines changed: 3 additions & 3 deletions

@@ -8,7 +8,7 @@
 <meta charset="utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />
 
-<title>TELF.applications.Cheetah: Advanced search by keywords and phrases &#8212; TELF 0.0.34 documentation</title>
+<title>TELF.applications.Cheetah: Advanced search by keywords and phrases &#8212; TELF 0.0.35 documentation</title>
 
@@ -40,7 +40,7 @@
 <link rel="preload" as="script" href="_static/scripts/bootstrap.js?digest=26a4bc78f4c0ddb94549" />
 <link rel="preload" as="script" href="_static/scripts/pydata-sphinx-theme.js?digest=26a4bc78f4c0ddb94549" />
 
-<script src="_static/documentation_options.js?v=d15f0e27"></script>
+<script src="_static/documentation_options.js?v=6aa38c3a"></script>
 <script src="_static/doctools.js?v=9bcbadda"></script>
 <script src="_static/sphinx_highlight.js?v=dc90522c"></script>
 <script src="_static/scripts/sphinx-book-theme.js?v=887ef09a"></script>
@@ -126,7 +126,7 @@
 
-<p class="title logo__title">TELF 0.0.34 documentation</p>
+<p class="title logo__title">TELF 0.0.35 documentation</p>
 
 </a></div>
 <div class="sidebar-primary-item">

docs/HNMFk.html

Lines changed: 3 additions & 3 deletions

@@ -8,7 +8,7 @@
 <meta charset="utf-8" />
 <meta name="viewport" content="width=device-width, initial-scale=1.0" /><meta name="viewport" content="width=device-width, initial-scale=1" />
 
-<title>TELF.factorization.HNMFk: Hierarchical Non-negative Matrix Factorization with Automatic Model Determination &#8212; TELF 0.0.34 documentation</title>
+<title>TELF.factorization.HNMFk: Hierarchical Non-negative Matrix Factorization with Automatic Model Determination &#8212; TELF 0.0.35 documentation</title>
 
@@ -40,7 +40,7 @@
 <link rel="preload" as="script" href="_static/scripts/bootstrap.js?digest=26a4bc78f4c0ddb94549" />
 <link rel="preload" as="script" href="_static/scripts/pydata-sphinx-theme.js?digest=26a4bc78f4c0ddb94549" />
 
-<script src="_static/documentation_options.js?v=d15f0e27"></script>
+<script src="_static/documentation_options.js?v=6aa38c3a"></script>
 <script src="_static/doctools.js?v=9bcbadda"></script>
 <script src="_static/sphinx_highlight.js?v=dc90522c"></script>
 <script src="_static/scripts/sphinx-book-theme.js?v=887ef09a"></script>
@@ -126,7 +126,7 @@
 
-<p class="title logo__title">TELF 0.0.34 documentation</p>
+<p class="title logo__title">TELF 0.0.35 documentation</p>
 
 </a></div>
 <div class="sidebar-primary-item">
