maligrad is a small NumPy-array-based autograd engine with an included NN library.
It can do most of the things numpy can:
```python
import numpy as np
from maligrad.autograd import Variable  # import path within the repo may differ

a = Variable(2 * np.ones((2, 2)), requires_grad=True)
b = Variable(3 * np.ones((2, 2)))

# addition
print(a + b)
>>> Variable(
data=
[[5. 5.]
 [5. 5.]]
)

# multiplication
print(a * b)
>>> Variable(
data=
[[6. 6.]
 [6. 6.]]
)

# matrix multiplication
print(a @ b[0])
>>> Variable(
data=
[12. 12.]
)

# and much more
```
while also enabling reverse mode automatic differentiation:
```python
(a[0] * b).sum().backward()
print(a.grad)
>>> [[6., 6.],
     [0., 0.]]
```
The project was inspired by the great micrograd. If you watched the video tutorial or implemented micrograd yourself, you know that its engine is based on scalars. While really simple, it is still powerful enough to define pretty much any neural net. The downside of that approach, however, is that it is not very efficient.
So, I wondered how much more work would be needed to make a similar automatic differentiation engine, but based on tensors.
It turns out that not that much is different. For comparison, micrograd's engine is around 70 lines of code, while maligrad has roughly 350 lines under the hood, and that includes all the extra operations that are not in micrograd. In the end, the broadcasting and array operation shenanigans sorted themselves out almost on their own (I attribute this to my great planning and look-ahead. No blind luck was involved.).
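To give a flavour of what those shenanigans are: whenever numpy broadcasts an operand, the gradient flowing back has to be summed over the broadcast axes so that it ends up with the operand's original shape. The helper below is a generic sketch of that rule (the name `unbroadcast` and the code are illustrative, not maligrad's actual implementation):

```python
import numpy as np

def unbroadcast(grad: np.ndarray, shape: tuple) -> np.ndarray:
    # Drop the leading axes that broadcasting added ...
    while grad.ndim > len(shape):
        grad = grad.sum(axis=0)
    # ... and collapse the axes that were stretched from size 1.
    for axis, size in enumerate(shape):
        if size == 1:
            grad = grad.sum(axis=axis, keepdims=True)
    return grad

# A (2,) operand broadcast against a (2, 2) one gets its upstream gradient
# summed back down to shape (2,).
g = np.ones((2, 2))
print(unbroadcast(g, (2,)))    # [2. 2.]
print(unbroadcast(g, (2, 2)))  # unchanged
```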
At first, my computation graph was constructed in much the same way as in Andrej Karpathy's video lecture. I did, however, decide to split the nodes into function nodes and data nodes, since I knew that arrays would require more operations for basic functionality, and I wanted to split the code to make it more manageable (in the end this didn't quite work out, due to circular typing dependencies).
Later, I noticed that this approach, while very simple, is not the most memory-efficient. Without going too deep into the details: the compute graph I was constructing held on to the tensors of all the intermediate results. Think about it for a second. While training, all the intermediate tensors that would otherwise get GC'd had to stay in memory because they were nodes in the computation graph. Worse yet, the gradients were allocated at construction, which roughly doubled the memory footprint.
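As a caricature of the problem (this is an illustration, not the old code), imagine every node eagerly storing both its result and a same-shaped gradient buffer:

```python
import numpy as np

class EagerNode:
    def __init__(self, data: np.ndarray):
        self.data = data
        self.grad = np.zeros_like(data)  # allocated up front, before backward() is ever called

x = EagerNode(np.ones((1024, 1024)))  # ~8 MB for the data ...
# ... plus another ~8 MB for x.grad, for every intermediate result kept alive by the graph.
```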
This had to change, so I completely rewrote the engine from scratch after nearly calling the project a wrap. Fortunately, I had unit tests, so the whole rewrite took pretty much a day.
Anyhow, here's how it works. Suppose we want to know the gradients through the following computation:
```python
a = Variable(2, requires_grad=True)
b = Variable(3, requires_grad=True)

x = a * b  # step 1
y = a + x  # step 2

y.backward()
```
The first node that gets constructed is the multiplication node:
*(figure in `assets`: the step-1 graph, with the multiplication node and its output)*
The second step is much the same:
Instead of the multiplication, the addition node now generates the output.
That is essentially it.
After you call `.backward()`, the gradients flow through the blue graph and get deposited into the inputs.
There are certain other details and optimizations that are still needed for backprop to be successful, but they cannot be explained as quickly.
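For this toy example the result is easy to check by hand: `y = a + a * b`, so `dy/da = 1 + b = 4` and `dy/db = a = 2`. After `y.backward()`, those values should end up in the inputs (exact print formatting aside):

```python
print(a.grad)  # 1 + b = 4
print(b.grad)  # a     = 2
```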
- `assets` includes images for this markdown.
- `examples` includes notebooks with neural net training examples.
- `maligrad` includes the main juice:
  - `autograd` includes the automatic differentiation engine. If you have to check out one thing, look at this engine.
  - `data` includes a data engine similar to that of pytorch.
  - `nn` includes a torch-style nn library. It can do convolutions. Check this one out if you are interested, after checking out autograd.
- `tests` includes test cases for maligrad. Same folder structure as in maligrad.
Install numpy if you haven't already:

```sh
pip install numpy
```
If you wish to run the example notebooks, also install jupyter and matplotlib:
```sh
pip install ipykernel matplotlib
```
Besides micrograd, the project also draws inspiration from pytorch (the `nn` interface).
Tinygrad is also cool, though not many influences were drawn from it.
The project may look like it is useful in some way. The very professionally executed logo may add to the effect. It is not. It was written purely so I could get familiar with how automatic differentiation works, and it probably has a multitude of bugs (even when I thought it was finished and set out to write the examples, I ran into a couple of them). You'd better stick with pytorch or JAX.
I will probably never get around to doing any of these:

- Inplace modifications are not supported and never will be. Some operations, like padding in convolution, would nevertheless require something similar. The easiest route is probably something like the `.at` property in JAX (see the short snippet after this list).
- Currently, maligrad is pretty memory inefficient, because the compute graphs are longer than they need to be, since only a few basic ops have `Function` implementations. For example, division is implemented as `a / b = a * b ** -1`, which produces two nodes in the DAG. This is not a lone example, so at least the operator overrides should have an associated `Function`.
- RNNs and other nn stuff.
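For reference, this is what the JAX-style out-of-place update mentioned in the first bullet looks like; the snippet uses plain `jax.numpy` and is only meant to illustrate the kind of API maligrad could mimic:

```python
import jax.numpy as jnp

x = jnp.zeros((2, 4))
# .at[...] does not mutate x; it returns a new array with the requested
# elements replaced, which is handy for things like padding.
padded = x.at[:, 1:3].set(1.0)
print(padded)
```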