
author: niplav, created: 2022-10-19, modified: 2022-12-20, language: english, status: in progress, importance: 2, confidence: likely

Solutions to the textbook “Maths for Intelligent Systems”.

Solutions to “Maths for Intelligent Systems”

Chapter 2

Stray Non-Exercise 1

Let me start with an example: We have three real-valued quantities $x, g$ and $f$ which depend on each other. Specifically, $f(x,g)=3x+2g$ and $g(x)=2x$.
Question: What is the “derivative of $f$ w.r.t. $x$”?

Intuitively, I'd say that $\frac{\partial}{\partial x}f(x,g)=3$. But then I notice that $g$ is supposedly a "real-valued quantity": what is that supposed to mean? Is it not a function?

Indeed, plugging $g$ into $f$ gives $f(x)=3x+2(2x)=7x$, and so the total derivative is $\frac{d}{d x}f(x)=3+4=7$.
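A central-difference spot check (a small sketch; the test point is arbitrary) confirms the total derivative $3 + 2 \cdot 2 = 7$:

```julia
# Total derivative of f(x, g(x)) = 3x + 2g with g(x) = 2x,
# checked by a central finite difference.
g(x) = 2x
f(x) = 3x + 2g(x)          # f after substituting g, i.e. 7x
ε = 1e-6
x = 1.7                     # arbitrary test point
total = (f(x + ε) - f(x - ε)) / (2ε)
# ∂f/∂x holding g fixed is 3; the total derivative picks up 2·g'(x) = 4 on top.
@assert isapprox(total, 7.0; atol=1e-6)
```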

2.4

(i)

$$XA+A^{\top}=\mathbf{I} \Leftrightarrow \\ XA=\mathbf{I}-A^{\top} \Leftrightarrow \\ X=(\mathbf{I}-A^{\top})A^{-1}$$
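A quick numerical sanity check of this closed form (a sketch with a random, deliberately well-conditioned $A$):

```julia
using LinearAlgebra

A = randn(4, 4) + 4I               # shift keeps A comfortably invertible
X = (I - A') * inv(A)              # the closed-form solution above
@assert norm(X * A + A' - I) < 1e-8
```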

(ii)

$$ X^{\top}C=(2A(X+B))^{\top} \Leftrightarrow \\ X^{\top}C=(2AX)^{\top}+(2AB)^{\top} \Leftrightarrow \\ X^{\top}C-X^{\top}(2A)^{\top}=(2AB)^{\top} \Leftrightarrow \\ X^{\top}(C-(2A)^{\top})=(2AB)^{\top} \Leftrightarrow \\ X^{\top}=(2AB)^{\top}(C-(2A)^{\top})^{-1} \Leftrightarrow \\ X=((C-(2A)^{\top})^{-1})^{\top}2AB \Leftrightarrow \\ X=(C^{\top}-2A)^{-1}2AB $$

(Note that inversion does not distribute over the difference, so $(C-(2A)^{\top})^{-1}$ has to stay in one piece.)
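Since matrix inversion does not distribute over sums or differences, a closed form here is worth checking numerically. Right-multiplying $X^{\top}(C-(2A)^{\top})=(2AB)^{\top}$ by $(C-(2A)^{\top})^{-1}$ and transposing gives $X=(C^{\top}-2A)^{-1}\,2AB$, which passes a random spot check:

```julia
using LinearAlgebra

A = randn(4, 4)
B = randn(4, 4)
C = randn(4, 4) + 8I               # shift keeps C - (2A)' comfortably invertible
X = inv(C' - 2A) * 2A * B          # candidate solution X = (C' - 2A)⁻¹ 2AB
@assert norm(X' * C - (2A * (X + B))') < 1e-8
```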

(iii)

$$(Ax-y)^{\top}A=\mathbf{0}_n^{\top} \Leftrightarrow \\ A^{\top}(Ax-y)=\mathbf{0}_n \Leftrightarrow \\ A^{\top}Ax-A^{\top}y=\mathbf{0}_n \Leftrightarrow \\ x=(A^{\top}A)^{-1}A^{\top}y$$
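These are the normal equations of least squares; a numerical spot check (assuming $A$ has full column rank, which a random tall matrix has with probability 1):

```julia
using LinearAlgebra

A = randn(8, 3)                    # tall A, full column rank almost surely
y = randn(8)
x = inv(A' * A) * A' * y           # the closed form above
@assert norm((A * x - y)' * A) < 1e-8
@assert isapprox(x, A \ y; atol=1e-8)   # agrees with Julia's least-squares solve
```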

(iv)

$$(Ax-y)^{\top}A+x^{\top}B=\mathbf{0}_n^{\top} \Leftrightarrow \\ A^{\top}(Ax-y)+B^{\top}x=\mathbf{0}_n \Leftrightarrow \\ A^{\top}Ax-A^{\top}y+B^{\top}x=\mathbf{0}_n \Leftrightarrow \\ (A^{\top}A+B^{\top})x=A^{\top}y \Leftrightarrow \\ x=(A^{\top}A+B^{\top})^{-1}A^{\top}y $$
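The same kind of spot check works for this variant (assuming $A^{\top}A+B^{\top}$ is invertible, which holds for generic random matrices):

```julia
using LinearAlgebra

A = randn(8, 3)
B = randn(3, 3)
y = randn(8)
x = inv(A' * A + B') * (A' * y)    # the closed form above
@assert norm((A * x - y)' * A + x' * B) < 1e-8
```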

(v)

(vi)

2.6.1

I… I don't know what the skew matrix is :-/, and Wikipedia isn't very helpful (I don't think it's meant to be the skew-Hermitian, skew-symmetric, or skew-Hamiltonian matrix).

2.6.2

2.6.3

Writing code: This I can do.

(i)

using LinearAlgebra

function gradient_check(x, f, df)
    n=length(x)
    d=length(f(x))
    ε=1e-6                 # finite-difference step (10^-6 would be a DomainError on Ints)
    J=zeros(d, n)
    for i in 1:n
        unit=zeros(n)      # i-th standard basis vector
        unit[i]=1
        # central difference; .= also works when f is scalar-valued
        J[:,i].=(f(x+ε*unit)-f(x-ε*unit))/(2*ε)
    end
    return norm(J-df(x), Inf)<1e-4
end

(ii)

julia> A=rand(Float64, (10, 15))
julia> f(x)=A*x
julia> df(x)=A
julia> x=randn(15)
15-element Vector{Float64}:
  1.536516645971545
  1.0136394994998532
 -0.09863977762813898
  1.3510191388362935
  0.84503226122143
  0.09296670831415606
 -1.5390337565597376
  1.4679194319980104
 -0.7085023577127753
 -0.10676335224166593
 -0.8686753109089055
  1.2912744597257453
  0.7364123079861109
  0.5736005534388826
  0.5332386427039576
julia> gradient_check(x, f, df)
true

And now the cooler $f$:

julia> f(x)=transpose(x)*x
f (generic function with 1 method)
julia> df(x)=2*transpose(x)
df (generic function with 1 method)
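Running the check on this pair as well (a self-contained sketch that repeats `gradient_check` from (i), with a broadcast `.=` in the Jacobian assignment so that the scalar-valued $f$ can be written into a row of $J$):

```julia
using LinearAlgebra

# Finite-difference Jacobian check as in (i); .= also accepts scalar-valued f.
function gradient_check(x, f, df)
    n, d = length(x), length(f(x))
    ε = 1e-6
    J = zeros(d, n)
    for i in 1:n
        unit = zeros(n); unit[i] = 1
        J[:, i] .= (f(x + ε*unit) - f(x - ε*unit)) / (2ε)
    end
    return norm(J - df(x), Inf) < 1e-4
end

f(x) = transpose(x) * x      # ℝⁿ → ℝ, the quadratic form xᵀx
df(x) = 2 * transpose(x)     # its 1×n Jacobian, a row vector
x = randn(15)
@assert gradient_check(x, f, df)
```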

2.6.4

The derivative of $σ(W_0 \times x_0)$ w.r.t. $x_0$, using the chain rule and writing $σ'$ for $\frac{dσ}{dx}$, is $σ'(W_0 \times x_0) \times W_0$.

Applying this again for $W_1 \times σ(W_0 \times x_0)$, we get $W_1 \times σ'(W_0 \times x_0) \times W_0$.

Again: $\frac{d}{d x_0} σ(W_1 \times σ(W_0 \times x_0))=σ'(W_1 \times σ(W_0 \times x_0)) \times W_1 \times σ'(W_0 \times x_0) \times W_0$.

And finally: $\frac{d}{d x_0} W_2 \times σ(W_1 \times σ(W_0 \times x_0))=W_2 \times σ'(W_1 \times σ(W_0 \times x_0)) \times W_1 \times σ'(W_0 \times x_0) \times W_0$.

Then the general formula for computing $\frac{d f}{d x_0}$ is $\frac{d f}{d x_0}=W_m \times \prod_{l=m-1}^{0} σ'(z_{l+1}) \times W_l$, where $z_{l+1}$ is the input to the $(l+1)$-th $σ$ (so $z_1=W_0 \times x_0$), $m$ is the number of $σ$-layers (so $W_m$ is the outermost matrix), the product multiplies factors onto the left (later layers come leftmost), and $σ'(z)$ is understood as the diagonal matrix with entries $σ'(z_i)$, since $σ$ acts elementwise.
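A numerical spot check of this formula (a sketch using $\tanh$ as the elementwise $σ$; the weight shapes are arbitrary, and $σ'(z_{l+1})$ enters the product as the diagonal matrix $\mathrm{diag}(σ'(z_{l+1}))$):

```julia
using LinearAlgebra

σ(z) = tanh.(z)                    # elementwise nonlinearity (tanh as an example)
dσ(z) = 1 .- tanh.(z).^2           # its elementwise derivative

W0, W1, W2 = randn(6, 4), randn(5, 6), randn(3, 5)
f(x) = W2 * σ(W1 * σ(W0 * x))

x0 = randn(4)
z1 = W0 * x0                        # input to the first σ
z2 = W1 * σ(z1)                     # input to the second σ
# The chain-rule formula, with σ'(z) read as the diagonal matrix diag(σ'(z)):
J = W2 * Diagonal(dσ(z2)) * W1 * Diagonal(dσ(z1)) * W0

# Compare against a central finite difference, column by column.
ε = 1e-6
Jnum = zeros(3, 4)
for i in 1:4
    e = zeros(4); e[i] = 1
    Jnum[:, i] = (f(x0 + ε*e) - f(x0 - ε*e)) / (2ε)
end
@assert norm(J - Jnum, Inf) < 1e-6
```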