Solving for X, Part 2: Tackling Tougher Linear Equations in Data Science

Welcome back to our linear algebra for data science series! In the last post, we kicked things off by looking at how we represent data using matrices and how we can spot linear relationships hiding in that data using the idea of the null space. Matrices are fundamental, and understanding those relationships is super powerful.

Today, we’re moving on to another absolute core concept: solving linear equations. If you think about a ton of problems in data science, machine learning, and even everyday modeling, they often boil down to having a bunch of equations and needing to find the values that make them true. This is where solving matrix equations becomes key.

Turning Equations into Matrix Problems: $AX = B$

Generally, when you have a set of linear equations, you can write them in a neat matrix form that looks like this:

$$AX = B$$

Let’s break down what’s in this equation based on what we learned before:

  • $A$ is a matrix. If you have $M$ equations and $N$ variables, $A$ is typically an $M \times N$ matrix. As we saw, $M$ is the number of rows (equations, in this case) and $N$ is the number of columns (variables).
  • $X$ is a column vector containing the variables you’re trying to solve for. Since $A$ has $N$ columns (variables), $X$ needs to have $N$ rows, so it’s an $N \times 1$ matrix (a column vector).
  • $B$ is a column vector containing the constants from the right-hand side of your equations. For the matrix multiplication to work out, $B$ needs to have the same number of rows as $A$, so it’s an $M \times 1$ matrix.

So, when you write $AX = B$, you’re really representing a system of $M$ linear equations involving $N$ variables.

Now, depending on how $M$ (the number of equations) stacks up against $N$ (the number of variables), things can play out in three main ways:

  1. $M = N$: The number of equations is exactly the same as the number of variables. This is often the “nicest” case to solve.
  2. $M > N$: You have more equations than variables. Think of it as having too many rules for your variables to follow. Usually, this means there’s no single perfect solution that satisfies all the equations.
  3. $M < N$: You have fewer equations than variables. This means you have more variables than you strictly need for the given equations. In this scenario, you’ll usually find that there are many, many possible solutions.

We’re going to look at these cases, and then see how a cool idea called the pseudo-inverse can bring them all together.

A Quick Refresher on Rank

Remember rank from our last post? It’s going to be super important here. The rank of a matrix is the number of its linearly independent rows or columns. And row rank always equals column rank – you can’t have a different number of independent rows versus columns!

For an $M \times N$ matrix, the maximum possible rank it can have is the smaller of $M$ and $N$. If $M < N$, the max rank is $M$. If $N < M$, the max rank is $N$.
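
If you want to check this numerically, NumPy can compute the rank for you. Here’s a quick sketch (the two matrices below are just made-up illustrations):

import numpy as np

# A 2x3 matrix whose second row is twice the first: only 1 independent row
A_dep = np.array([[1, 2, 3],
                  [2, 4, 6]])
print(np.linalg.matrix_rank(A_dep))   # 1 -- less than min(M, N) = 2, so rank deficient

# A 2x2 matrix with independent rows and columns
A_full = np.array([[1, 3],
                   [2, 4]])
print(np.linalg.matrix_rank(A_full))  # 2 -- full rank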

Case 1: When Equations Equal Variables ($M = N$)

This is the case where your matrix $A$ is square ($M \times N$ with $M = N$).

  • If $A$ is Full Rank: “Full rank” here means the rank of the matrix is equal to $M$ (or $N$, since they’re the same). What does this mean? It means all your equations on the left-hand side are independent. You can’t create any one equation by combining the others. In this happy case, there is a unique solution to $AX = B$. You might remember from algebra that you can solve this by finding the inverse of $A$, written as $A^{-1}$. The solution is simply $X = A^{-1}B$. You can find $A^{-1}$ if the determinant of $A$ is not zero. This is the standard, straightforward scenario.

  • If $A$ is Not Full Rank: This means the rank of $A$ is less than $M$. In this situation, some of your equations on the left-hand side are actually linear combinations of others – they are linearly dependent. When this happens, depending on what the values are on the right-hand side (in the $B$ vector), you get two possibilities:

    1. Consistent System: If the dependencies on the left-hand side match up exactly with the dependencies on the right-hand side (the $B$ vector values), the equations are consistent. But because they are dependent, you don’t have enough independent equations to pin down a single solution. This leads to infinite solutions.
    2. Inconsistent System: If the dependencies on the left-hand side don’t match up with the $B$ vector values, the equations are inconsistent. They contradict each other, and there is no solution that can satisfy all of them.

Let’s look at a couple of examples for this $M = N$ case:

Example 1: Full Rank (Unique Solution)

Consider the system of equations:

$$x_1 + 3x_2 = 7$$
$$2x_1 + 4x_2 = 10$$

In matrix form $AX = B$, this is:

$$\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 7 \\ 10 \end{bmatrix}$$

Here, $A = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$, with $M = 2$, $N = 2$. We can check if $A$ is full rank by calculating its determinant: $(1 \times 4) - (3 \times 2) = 4 - 6 = -2$. Since the determinant is not 0, the matrix is full rank (rank = 2), so there is a unique solution, $X = A^{-1}B$. The inverse of $A$ is $\begin{bmatrix} -2 & 1.5 \\ 1 & -0.5 \end{bmatrix}$. So,

$$X = \begin{bmatrix} -2 & 1.5 \\ 1 & -0.5 \end{bmatrix} \begin{bmatrix} 7 \\ 10 \end{bmatrix} = \begin{bmatrix} (-2 \times 7) + (1.5 \times 10) \\ (1 \times 7) + (-0.5 \times 10) \end{bmatrix} = \begin{bmatrix} -14 + 15 \\ 7 - 5 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$

The unique solution is $x_1 = 1$ and $x_2 = 2$. You can plug these back into the original equations to check ($1 + 3(2) = 7$, $2(1) + 4(2) = 10$). It works!

You can solve this easily in software too. In Python, you could define A and B using NumPy and use the numpy.linalg.solve() command or compute the inverse and multiply.
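
For example, here’s a minimal sketch using the matrices from Example 1 (either call should give the same answer):

import numpy as np

A = np.array([[1, 3],
              [2, 4]])
B = np.array([[7],
              [10]])

# Preferred: let NumPy solve the system directly
X = np.linalg.solve(A, B)
print(X)                     # [[1.]
                             #  [2.]]  -> x1 = 1, x2 = 2

# Equivalent, but generally less numerically robust: invert and multiply
X_alt = np.linalg.inv(A) @ B
print(X_alt)                 # same result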

Example 2: Not Full Rank, Consistent (Infinite Solutions)

Consider the system:

$$x_1 + 2x_2 = 5$$
$$2x_1 + 4x_2 = 10$$

In matrix form $AX = B$:

$$\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix}$$

Here, $A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$, with $M = 2$, $N = 2$. Check the determinant: $(1 \times 4) - (2 \times 2) = 4 - 4 = 0$. The determinant is 0, so the matrix is not full rank (rank = 1). The columns are dependent (column 2 is 2 times column 1), and the rows are dependent (row 2 is 2 times row 1).

Now, look at the equations themselves. The second equation, $2x_1 + 4x_2 = 10$, is exactly 2 times the first equation, $x_1 + 2x_2 = 5$. This dependency on the left-hand side ($2x_1 + 4x_2$ being $2 \times (x_1 + 2x_2)$) is matched on the right-hand side ($10$ being $2 \times 5$). Since the left and right sides have the same linear dependency, the system is consistent. However, because the second equation is just a scaled version of the first, you effectively only have one independent equation ($x_1 + 2x_2 = 5$) with two variables ($x_1, x_2$). This means you have a “free” variable. You can pick any value for $x_2$, and then calculate the $x_1$ that works ($x_1 = 5 - 2x_2$). For example:

  • If $x_2 = 0$, $x_1 = 5$. Solution: $(5, 0)$.
  • If $x_2 = 1$, $x_1 = 3$. Solution: $(3, 1)$.
  • If $x_2 = 2.5$, $x_1 = 0$. Solution: $(0, 2.5)$.

Since you can choose any value for $x_2$, there are infinite solutions to this system (the quick check below confirms a few of them numerically).
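
Here’s a small sketch of that check. Note that np.linalg.solve() would refuse this system outright, because $A$ is singular:

import numpy as np

A = np.array([[1, 2],
              [2, 4]])
B = np.array([[5],
              [10]])

print(np.linalg.matrix_rank(A))   # 1 -- rank deficient
# np.linalg.solve(A, B) would raise LinAlgError: Singular matrix

# Pick a few values for the free variable x2 and confirm each pair solves the system
for x2 in [0.0, 1.0, 2.5]:
    x1 = 5 - 2 * x2
    X = np.array([[x1], [x2]])
    print(x1, x2, np.allclose(A @ X, B))   # True for every choice of x2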

Example 3: Not Full Rank, Inconsistent (No Solution)

Consider the system:

$$x_1 + 2x_2 = 5$$
$$2x_1 + 4x_2 = 9$$

In matrix form $AX = B$:

$$\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 9 \end{bmatrix}$$

Here, $A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$ is the same as before, so it’s not full rank (rank = 1). The left-hand side has the same dependency (row 2 is 2 times row 1). But look at the right-hand side (the $B$ vector). The first equation’s right side is 5. If you multiply it by 2 (the scaling factor between the left-hand sides), you get $2 \times 5 = 10$. However, the second equation’s right side is 9, not 10. The dependency on the left-hand side (where the second equation’s left side is twice the first) does not match the right-hand side ($9 \neq 2 \times 5$). The equations contradict each other ($2x_1 + 4x_2$ cannot equal both 10 and 9 simultaneously, given the first equation). The system is inconsistent, and there is no solution that can satisfy both equations.

So, for the $M = N$ case (square matrix $A$): if $A$ is full rank, there is a unique solution; if $A$ is not full rank, check consistency – if consistent, infinite solutions; if inconsistent, no solution.
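
One handy way to run that consistency check in code is to compare the rank of $A$ with the rank of the augmented matrix $[A \mid B]$: if they match, the system is consistent; if stacking $B$ next to $A$ raises the rank, the system is inconsistent. Here’s a rough sketch (the helper name classify_square_system is just made up for this post):

import numpy as np

def classify_square_system(A, B):
    # Rough classification of AX = B for a square A (a sketch, not production code)
    n = A.shape[0]
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(np.hstack([A, B]))
    if rank_A == n:
        return "unique solution"
    if rank_A == rank_aug:
        return "infinite solutions (consistent, rank deficient)"
    return "no solution (inconsistent)"

A_singular = np.array([[1, 2], [2, 4]])
print(classify_square_system(A_singular, np.array([[5], [10]])))                   # infinite solutions
print(classify_square_system(A_singular, np.array([[5], [9]])))                    # no solution
print(classify_square_system(np.array([[1, 3], [2, 4]]), np.array([[7], [10]])))   # unique solution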

Case 2: More Equations Than Variables ($M > N$)

Now let’s look at the case where you have more equations than variables, like trying to find two numbers ($x_1, x_2$) that satisfy five different conditions.

$$AX = B \quad (\text{where } M > N)$$

Since you have more equations than variables, it’s generally impossible to find a perfect solution $X$ that makes every single equation true (i.e., makes $AX$ exactly equal to $B$). If you could find such a perfect $X$, then $AX - B$ would be a vector of all zeros.

But since a perfect solution is usually out of reach, what’s the next best thing? We want to find a solution $X$ that gets us as close as possible to satisfying all the equations. In other words, we want to find an $X$ that makes the difference between $AX$ and $B$ as small as possible.

Think of $AX - B$ as a vector of “errors,” where each element is the error in one equation:

$$e_1 = (A_{11}x_1 + A_{12}x_2 + \dots + A_{1N}x_N) - B_1$$
$$e_2 = (A_{21}x_1 + A_{22}x_2 + \dots + A_{2N}x_N) - B_2$$
$$\vdots$$
$$e_M = (A_{M1}x_1 + A_{M2}x_2 + \dots + A_{MN}x_N) - B_M$$

The vector of errors is $E = AX - B$. We want to find the $X$ that makes this error vector $E$ as “small” as possible. How do we measure the size of a vector of errors? We don’t just add the errors ($e_1 + e_2 + \dots$), because positive and negative errors could cancel out, making a big overall error look like zero.

A common way to measure the overall error is to minimize the sum of the squared errors: $e_1^2 + e_2^2 + \dots + e_M^2$. This is because squaring makes all the errors positive, and larger errors contribute more significantly. This approach is called finding the least squares solution.

Minimizing the sum of squared errors ($e_1^2 + \dots + e_M^2$) is the same as minimizing the squared length (or squared norm) of the error vector $E = AX - B$. We write the squared length of a vector $v$ as $||v||^2$, which is $v^T v$ (the vector transposed multiplied by the original vector). So, we want to minimize $||AX - B||^2$, which is $(AX - B)^T (AX - B)$.

$$\text{Minimize } (AX - B)^T (AX - B)$$

To find the $X$ that minimizes this, you can use calculus. You take the derivative of this expression with respect to the vector $X$ and set it equal to zero. After doing the calculus and some matrix algebra (which we won’t go into detail on here), the equation you end up having to solve for $X$ is:

$$A^T A X = A^T B$$

where $A^T$ is the transpose of matrix $A$ (you flip its rows and columns).

Now, if the matrix $A^T A$ is invertible (which happens if the columns of the original matrix $A$ are linearly independent – related to the rank!), you can solve for $X$:

$$X = (A^T A)^{-1} A^T B$$

This solution $X$ is the least squares solution. It’s the $X$ that minimizes the sum of squared differences between $AX$ and $B$. This $X$ might not make $AX$ exactly equal to $B$ (since a perfect solution might not exist), but it gives you the best possible fit in a least squares sense.

This optimization perspective gives us a way to find a meaningful “solution” even when we have more equations than variables, a common situation in data fitting.
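
If you want to see this in code, here’s a minimal sketch that builds the least squares solution straight from the formula above and compares it against NumPy’s built-in np.linalg.lstsq. The helper name least_squares and the tiny line-fitting dataset are made up for illustration:

import numpy as np

def least_squares(A, B):
    # X = (A^T A)^{-1} A^T B, assuming the columns of A are linearly independent
    return np.linalg.inv(A.T @ A) @ A.T @ B

# Fit a line y = a + b*x through the points (0, 1), (1, 2), (2, 2.5):
# three equations (one per point), two unknowns (a and b)
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
B = np.array([[1.0], [2.0], [2.5]])

print(least_squares(A, B))                   # roughly [[1.083], [0.75]]
print(np.linalg.lstsq(A, B, rcond=None)[0])  # same answer, computed more stably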

Let’s work through an $M > N$ example and see how this least squares solution plays out.

Example system of 3 equations with 2 variables:

$$x_1 = 1$$
$$2x_1 = -0.5$$
$$3x_1 + x_2 = 5$$

In matrix form $AX = B$:

$$A = \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix}, \quad X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ -0.5 \\ 5 \end{bmatrix}$$

Here $M = 3$, $N = 2$.

Applying the formula $X = (A^T A)^{-1} A^T B$ (with $A^T A = \begin{bmatrix} 14 & 3 \\ 3 & 1 \end{bmatrix}$ and $A^T B = \begin{bmatrix} 15 \\ 5 \end{bmatrix}$) gives:

$$X = \begin{bmatrix} 0 \\ 5 \end{bmatrix}$$

So the least squares solution is $x_1 = 0$, $x_2 = 5$. Notice that this doesn’t satisfy the first two equations perfectly ($0 \neq 1$ and $2 \times 0 = 0 \neq -0.5$), but it makes the third equation exact ($3 \times 0 + 5 = 5$) and minimizes the total squared error across all three.
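
A quick way to convince yourself of that is to compute the solution and its residuals numerically, as in this sketch:

import numpy as np

A = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 1.0]])
B = np.array([[1.0], [-0.5], [5.0]])

X = np.linalg.inv(A.T @ A) @ A.T @ B
print(X)                          # [[0.]
                                  #  [5.]]

residuals = A @ X - B
print(residuals.ravel())          # [-1.   0.5  0. ] -> errors in the three equations
print(np.sum(residuals**2))       # 1.25, the smallest possible sum of squared errors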

Now take a second $M > N$ example, this time one where a perfect solution exists:

$$x_1 = 1$$
$$2x_1 = 2$$
$$3x_1 + x_2 = 5$$

$$A = \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix}, \quad X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ 2 \\ 5 \end{bmatrix}$$

Applying the same formula $X = (A^T A)^{-1} A^T B$, we get:

$$X = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$

This matches the perfect solution $(x_1 = 1, x_2 = 2)$ that satisfies all three equations, because in this case the minimum sum of squared errors is zero.

So, the least squares solution $X = (A^T A)^{-1} A^T B$ is a powerful way to get the “best fit” solution when you have more equations than variables, and it finds the exact solution if one exists. This formula works as long as the columns of $A$ are linearly independent (which makes $A^T A$ invertible).

Case 3: More Variables Than Equations ($M < N$)

Now for the last case: fewer equations than variables. Imagine having one equation, like $x_1 + x_2 + x_3 = 10$, and trying to find values for $x_1, x_2, x_3$. You can probably already see you have lots of options!

$$AX = B \quad (\text{where } M < N)$$

When you have more variables than independent equations, you’ll have “free” variables. For example, with 2 equations and 3 variables, you can often pick any value for one variable (say, $x_3$), plug it into the equations, and then you’re left with 2 equations and 2 variables to solve for $x_1$ and $x_2$. Since you could pick any value for $x_3$, you end up with infinite solutions.

Since there are endless possibilities for $X$ that satisfy $AX = B$, how do you pick just one solution that makes the most sense for your problem? The equations themselves don’t give you enough information to choose. You need some other rule or criterion.

Similar to the $M > N$ case, we can use an optimization idea! This time, instead of minimizing the error in the equations (since many solutions make the error zero!), we need a different objective. A common and often useful criterion is to find the solution $X$ that has the minimum “size” or “length”. This is called finding the minimum norm solution.

We measure the “size” or “norm” of the solution vector $X$ using its length; specifically, we minimize the squared length $||X||^2 = X^T X$. Why minimize $X^T X$? Think of it as finding the solution vector $X$ that is closest to the origin $(0, 0, \dots, 0)$ in your variable space. From an engineering or modeling perspective, finding a solution where the variables have smaller values might be desirable in some contexts (like keeping design parameters small).

So, for the $M < N$ case, the optimization problem is:

$$\text{Minimize } \frac{1}{2} X^T X \quad \text{subject to } AX = B$$

The $\frac{1}{2}$ is just there to make the math cleaner when you solve it using calculus. The important part is “subject to $AX = B$.” This is a constrained optimization problem – you’re minimizing an objective function ($X^T X$) while making sure your solution still satisfies the original equations $AX = B$. This is different from the $M > N$ case, where we had unconstrained optimization (just minimize the error, without worrying about satisfying the equations perfectly).

Solving constrained optimization problems involves techniques like using Lagrangian functions. While the detailed steps for solving this are part of optimization theory (which often comes after linear algebra, but uses a lot of it!), the resulting formula for the minimum norm solution, assuming the rows of $A$ are linearly independent (making $A A^T$ invertible), is:

$$X = A^T (A A^T)^{-1} B$$

This formula gives you the unique solution that satisfies $AX = B$ and has the minimum possible squared length $||X||^2$.
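
Here’s a minimal sketch of that formula in code, applied to the one-equation example from the start of this section ($x_1 + x_2 + x_3 = 10$). The helper name min_norm_solution is just made up for this post:

import numpy as np

def min_norm_solution(A, B):
    # X = A^T (A A^T)^{-1} B, assuming the rows of A are linearly independent
    return A.T @ np.linalg.inv(A @ A.T) @ B

# One equation, three unknowns: x1 + x2 + x3 = 10
A = np.array([[1.0, 1.0, 1.0]])
B = np.array([[10.0]])

X = min_norm_solution(A, B)
print(X.ravel())               # [3.333... 3.333... 3.333...] -- the shortest solution spreads the total evenly
print(np.allclose(A @ X, B))   # True: it still satisfies the equation exactly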

Let’s try an example for this $M < N$ case. Consider the system of 2 equations with 3 variables:

$$x_1 + 2x_2 + 3x_3 = 2$$
$$x_3 = 1$$

In matrix form $AX = B$:

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 0 & 1 \end{bmatrix}, \quad X = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad B = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$$

Here $M = 2$, $N = 3$, so $M < N$.

We know there are infinite solutions here (like $(-1, 0, 1)$, $(-3, 1, 1)$, etc.). Let’s use the minimum norm formula $X = A^T (A A^T)^{-1} B$.

First, find $A^T$:

$$A^T = \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix}$$

Next, calculate $A A^T$:

$$A A^T = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix} = \begin{bmatrix} 14 & 3 \\ 3 & 1 \end{bmatrix}$$

This matrix has a determinant of 5, so it’s invertible:

$$(A A^T)^{-1} = \begin{bmatrix} 0.2 & -0.6 \\ -0.6 & 2.8 \end{bmatrix}$$

Finally, calculate $X = A^T (A A^T)^{-1} B$:

$$X = \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix} \begin{bmatrix} 0.2 & -0.6 \\ -0.6 & 2.8 \end{bmatrix} \begin{bmatrix} 2 \\ 1 \end{bmatrix}$$

Carrying out the multiplication gives:

$$X = \begin{bmatrix} -0.2 \\ -0.4 \\ 1 \end{bmatrix}$$

The minimum norm solution is $x_1 = -0.2$, $x_2 = -0.4$, and $x_3 = 1$. This solution satisfies the original equations, and among all the infinite solutions that work, the vector $[-0.2, -0.4, 1]^T$ is the one closest to the origin $(0, 0, 0)$.
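
You can double-check both claims numerically. Here’s a short sketch that compares the formula’s answer with another exact solution picked by hand (the point $(-1, 0, 1)$ from above):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 1.0]])
B = np.array([[2.0], [1.0]])

X_min = A.T @ np.linalg.inv(A @ A.T) @ B
print(X_min.ravel())                         # [-0.2 -0.4  1. ]

X_other = np.array([[-1.0], [0.0], [1.0]])   # another exact solution, chosen by hand

print(np.allclose(A @ X_min, B), np.allclose(A @ X_other, B))   # True True -- both are exact
print(np.linalg.norm(X_min), np.linalg.norm(X_other))           # ~1.095 vs ~1.414 -- the formula's answer is shorter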

So, the formula $X = A^T (A A^T)^{-1} B$ gives us the minimum norm solution when you have more variables than equations, provided the rows of $A$ are linearly independent (making $A A^T$ invertible).

One Solution to Rule Them All: The Pseudo-inverse!

We’ve looked at the three cases for $AX = B$:

  • $M = N$: Unique solution ($A^{-1}B$) if full rank; infinite or no solutions if not full rank.
  • $M > N$: Generally no exact solution; find the least squares solution ($X = (A^T A)^{-1} A^T B$) that minimizes the error.
  • $M < N$: Infinite exact solutions; find the minimum norm solution ($X = A^T (A A^T)^{-1} B$) that is closest to the origin.

This is a lot to keep track of! Wouldn’t it be awesome if there was one single formula that just gave you the right kind of solution (unique exact, least squares, or minimum norm) no matter the size or rank of $A$?

It turns out there is! This is where the concept of the Moore-Penrose pseudo-inverse (often written as $A^+$) comes in. It generalizes the idea of the matrix inverse $A^{-1}$.

The magical formula for solving $AX = B$ in all cases is simply:

$$X = A^+ B$$

where $A^+$ is the pseudo-inverse of $A$.

How is this magical?

  • If $M = N$ and $A$ is full rank, $A^+$ is exactly the same as $A^{-1}$. So you get $X = A^{-1}B$, the unique exact solution.
  • If $M > N$ and the columns of $A$ are independent (making $A^T A$ invertible), $A^+$ is equal to $(A^T A)^{-1} A^T$. So you get $X = (A^T A)^{-1} A^T B$, the least squares solution.
  • If $M < N$ and the rows of $A$ are independent (making $A A^T$ invertible), $A^+$ is equal to $A^T (A A^T)^{-1}$. So you get $X = A^T (A A^T)^{-1} B$, the minimum norm solution. (You can verify all three identities numerically, as in the sketch right below.)
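
Here’s a quick sketch of that check, reusing matrices from our earlier examples:

import numpy as np

# Square, full rank: pinv matches the ordinary inverse
A_sq = np.array([[1.0, 3.0], [2.0, 4.0]])
print(np.allclose(np.linalg.pinv(A_sq), np.linalg.inv(A_sq)))        # True

# Tall (M > N) with independent columns: pinv matches (A^T A)^{-1} A^T
A_tall = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
print(np.allclose(np.linalg.pinv(A_tall),
                  np.linalg.inv(A_tall.T @ A_tall) @ A_tall.T))      # True

# Wide (M < N) with independent rows: pinv matches A^T (A A^T)^{-1}
A_wide = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 1.0]])
print(np.allclose(np.linalg.pinv(A_wide),
                  A_wide.T @ np.linalg.inv(A_wide @ A_wide.T)))      # True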

And what about the cases where $A$ is not full rank (determinant is zero for $M = N$, columns/rows are dependent)? The pseudo-inverse still works! In those rank-deficient cases:

  • If the system is consistent ($M = N$ not full rank, or $M < N$ rank deficient but consistent), $A^+ B$ gives you the minimum norm solution among the infinite exact solutions.
  • If the system is inconsistent ($M = N$ not full rank, or $M > N$ rank deficient and inconsistent), $A^+ B$ gives you the least squares solution that minimizes the error $||AX - B||^2$.

So, the pseudo-inverse $A^+$ gives you:

  • The unique exact solution if it exists.
  • The minimum norm exact solution if there are infinite exact solutions.
  • The minimum norm least squares solution if there are no exact solutions (i.e., the least squares solution with the smallest norm; when the columns of $A$ are independent, as in the typical $M > N$ case, there is only one least squares solution anyway).

How do you calculate this pseudo-inverse $A^+$? One common way is using something called Singular Value Decomposition (SVD), which is a really powerful technique in linear algebra. But for using it to solve $AX = B$, you often don’t need to know the details of SVD right away.

Many software packages can compute the pseudo-inverse directly. In Python, you can use NumPy’s linalg.pinv() function.

Here are examples in Python using numpy.linalg.pinv() on the matrices we just looked at:

import numpy as np

# Example 1 (M > N)
A1 = np.array([[1, 0], [2, 0], [3, 1]])
B1 = np.array([[1], [-0.5], [5]])

A1_plus = np.linalg.pinv(A1)
X1 = A1_plus @ B1 # Using the matrix multiplication operator @

print("Solution for Example 1 (M>N):")
print(X1)
# Output should be close to:
# [[0.]
#  [5.]]

print("\n---")

# Example 2 (M < N)
A2 = np.array([[1, 2, 3], [0, 0, 1]])
B2 = np.array([[2], [1]])

A2_plus = np.linalg.pinv(A2)
X2 = A2_plus @ B2 # Using the matrix multiplication operator @

print("Solution for Example 2 (M<N):")
print(X2)
# Output should be close to:
# [[-0.2]
#  [-0.4]
#  [ 1. ]]
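
As one more sanity check, here’s what pinv() does in the square but rank-deficient (consistent) case from Example 2 earlier in this post. Among the infinitely many exact solutions, it should hand back the minimum norm one:

import numpy as np

# Example 3 (M = N, rank deficient, consistent) -- the singular system from earlier
A3 = np.array([[1, 2], [2, 4]])
B3 = np.array([[5], [10]])

A3_plus = np.linalg.pinv(A3)
X3 = A3_plus @ B3

print("Solution for Example 3 (M=N, rank deficient):")
print(X3)
# Output should be close to:
# [[1.]
#  [2.]]
# That's the point on the solution line x1 + 2*x2 = 5 closest to the origin.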