Solving for X, Part 2: Tackling Tougher Linear Equations in Data Science

Welcome back to our linear algebra for data science series! In the last post, we kicked things off by looking at how we represent data using matrices and how we can spot linear relationships hiding in that data using the idea of the null space. Matrices are fundamental, and understanding those relationships is super powerful.

Today, we’re moving on to another absolute core concept: solving linear equations. If you think about a ton of problems in data science, machine learning, and even everyday modeling, they often boil down to having a bunch of equations and needing to find the values that make them true. This is where solving matrix equations becomes key.

Turning Equations into Matrix Problems: $AX = B$

Generally, when you have a set of linear equations, you can write them in a neat matrix form that looks like this:

$$AX = B$$

Let’s break down what’s in this equation based on what we learned before:

  • $A$ is a matrix. If you have $M$ equations and $N$ variables, $A$ is typically an $M \times N$ matrix. As we saw, $M$ is the number of rows (equations, in this case) and $N$ is the number of columns (variables).
  • $X$ is a column vector containing the variables you’re trying to solve for. Since $A$ has $N$ columns (variables), $X$ needs to have $N$ rows, so it’s an $N \times 1$ matrix (a column vector).
  • $B$ is a column vector containing the constants from the right-hand side of your equations. For the matrix multiplication to work out, $B$ needs to have the same number of rows as $A$, so it’s an $M \times 1$ matrix.

So, when you write $AX = B$, you’re really representing a system of $M$ linear equations involving $N$ variables.

Now, depending on how $M$ (the number of equations) stacks up against $N$ (the number of variables), things can play out in three main ways:

  1. $M = N$: The number of equations is exactly the same as the number of variables. This is often the “nicest” case to solve.
  2. $M > N$: You have more equations than variables. Think of it as having too many rules for your variables to follow. Usually, this means there’s no single perfect solution that satisfies all the equations.
  3. $M < N$: You have fewer equations than variables. This means you have more variables than you strictly need for the given equations. In this scenario, you’ll usually find that there are many, many possible solutions.

We’re going to look at these cases, and then see how a cool idea called the pseudo-inverse can bring them all together.

A Quick Refresher on Rank

Remember rank from our last post? It’s going to be super important here. The rank of a matrix is the number of its linearly independent rows or columns. And row rank always equals column rank – you can’t have a different number of independent rows versus columns!

For an $M \times N$ matrix, the maximum possible rank it can have is the smaller of $M$ and $N$. If $M < N$, the max rank is $M$. If $N < M$, the max rank is $N$.
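
If you want to check this numerically, NumPy can compute the rank for you. Here’s a quick sketch (the two matrices below are just made-up illustrations):

import numpy as np

# A 2x3 matrix whose second row is twice the first: only 1 independent row
A_dep = np.array([[1, 2, 3],
                  [2, 4, 6]])
print(np.linalg.matrix_rank(A_dep))   # 1 -- less than min(M, N) = 2, so rank deficient

# A 2x2 matrix with independent rows and columns
A_full = np.array([[1, 3],
                   [2, 4]])
print(np.linalg.matrix_rank(A_full))  # 2 -- full rank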

Case 1: When Equations Equal Variables ($M = N$)

This is the case where your matrix $A$ is square ($M \times N$ with $M = N$).

  • If $A$ is Full Rank: “Full rank” here means the rank of the matrix is equal to $M$ (or $N$, since they’re the same). What does this mean? It means all your equations on the left-hand side are independent. You can’t create any one equation by combining the others. In this happy case, there is a unique solution to $AX = B$. You might remember from algebra that you can solve this by finding the inverse of $A$, written as $A^{-1}$. The solution is simply $X = A^{-1}B$. You can find $A^{-1}$ if the determinant of $A$ is not zero. This is the standard, straightforward scenario.

  • If $A$ is Not Full Rank: This means the rank of $A$ is less than $M$. In this situation, some of your equations on the left-hand side are actually linear combinations of others – they are linearly dependent. When this happens, depending on what the values are on the right-hand side (in the $B$ vector), you get two possibilities:

    1. Consistent System: If the dependencies on the left-hand side match up exactly with the dependencies on the right-hand side (the $B$ vector values), the equations are consistent. But because they are dependent, you don’t have enough independent equations to pin down a single solution. This leads to infinite solutions.
    2. Inconsistent System: If the dependencies on the left-hand side don’t match up with the $B$ vector values, the equations are inconsistent. They contradict each other, and there is no solution that can satisfy all of them.

Let’s look at a couple of examples for this $M = N$ case:

Example 1: Full Rank (Unique Solution)

Consider the system of equations:

$$x_1 + 3x_2 = 7$$
$$2x_1 + 4x_2 = 10$$

In matrix form $AX = B$, this is:

$$\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 7 \\ 10 \end{bmatrix}$$

Here, $A = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}$, with $M = 2$, $N = 2$. We can check if $A$ is full rank by calculating its determinant: $(1 \times 4) - (3 \times 2) = 4 - 6 = -2$. Since the determinant is not 0, the matrix is full rank (rank = 2), so there is a unique solution, $X = A^{-1}B$. The inverse of $A$ is $\begin{bmatrix} -2 & 1.5 \\ 1 & -0.5 \end{bmatrix}$. So,

$$X = \begin{bmatrix} -2 & 1.5 \\ 1 & -0.5 \end{bmatrix} \begin{bmatrix} 7 \\ 10 \end{bmatrix} = \begin{bmatrix} (-2 \times 7) + (1.5 \times 10) \\ (1 \times 7) + (-0.5 \times 10) \end{bmatrix} = \begin{bmatrix} -14 + 15 \\ 7 - 5 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$

The unique solution is $x_1 = 1$ and $x_2 = 2$. You can plug these back into the original equations to check ($1 + 3(2) = 7$, $2(1) + 4(2) = 10$). It works!

You can solve this easily in software too. In Python, you could define A and B using NumPy and use the numpy.linalg.solve() command or compute the inverse and multiply.
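
For example, here’s a minimal sketch using the matrices from Example 1 (either call should give the same answer):

import numpy as np

A = np.array([[1, 3],
              [2, 4]])
B = np.array([[7],
              [10]])

# Preferred: let NumPy solve the system directly
X = np.linalg.solve(A, B)
print(X)                     # [[1.]
                             #  [2.]]  -> x1 = 1, x2 = 2

# Equivalent, but generally less numerically robust: invert and multiply
X_alt = np.linalg.inv(A) @ B
print(X_alt)                 # same result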

Example 2: Not Full Rank, Consistent (Infinite Solutions)

Consider the system:

$$x_1 + 2x_2 = 5$$
$$2x_1 + 4x_2 = 10$$

In matrix form $AX = B$:

$$\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix}$$

Here, $A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$, with $M = 2$, $N = 2$. Check the determinant: $(1 \times 4) - (2 \times 2) = 4 - 4 = 0$. The determinant is 0, so the matrix is not full rank (rank = 1). The columns are dependent (column 2 is 2 times column 1), and the rows are dependent (row 2 is 2 times row 1).

Now, look at the equations themselves. The second equation, $2x_1 + 4x_2 = 10$, is exactly 2 times the first equation, $x_1 + 2x_2 = 5$. This dependency on the left-hand side ($2x_1 + 4x_2$ being $2 \times (x_1 + 2x_2)$) is matched on the right-hand side ($10$ being $2 \times 5$). Since the left and right sides have the same linear dependency, the system is consistent. However, because the second equation is just a scaled version of the first, you effectively only have one independent equation ($x_1 + 2x_2 = 5$) with two variables ($x_1, x_2$). This means you have a “free” variable. You can pick any value for $x_2$, and then calculate the $x_1$ that works ($x_1 = 5 - 2x_2$). For example:

  • If $x_2 = 0$, $x_1 = 5$. Solution: $(5, 0)$.
  • If $x_2 = 1$, $x_1 = 3$. Solution: $(3, 1)$.
  • If $x_2 = 2.5$, $x_1 = 0$. Solution: $(0, 2.5)$.

Since you can choose any value for $x_2$, there are infinite solutions to this system (the quick check below confirms a few of them numerically).
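
Here’s a small sketch of that check. Note that np.linalg.solve() would refuse this system outright, because $A$ is singular:

import numpy as np

A = np.array([[1, 2],
              [2, 4]])
B = np.array([[5],
              [10]])

print(np.linalg.matrix_rank(A))   # 1 -- rank deficient
# np.linalg.solve(A, B) would raise LinAlgError: Singular matrix

# Pick a few values for the free variable x2 and confirm each pair solves the system
for x2 in [0.0, 1.0, 2.5]:
    x1 = 5 - 2 * x2
    X = np.array([[x1], [x2]])
    print(x1, x2, np.allclose(A @ X, B))   # True for every choice of x2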

Example 3: Not Full Rank, Inconsistent (No Solution)

Consider the system:

$$x_1 + 2x_2 = 5$$
$$2x_1 + 4x_2 = 9$$

In matrix form $AX = B$:

$$\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 9 \end{bmatrix}$$

Here, $A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}$ is the same as before, so it’s not full rank (rank = 1). The left-hand side has the same dependency (row 2 is 2 times row 1). But look at the right-hand side (the $B$ vector). The first equation’s right side is 5. If you multiply it by 2 (the scaling factor between the left-hand sides), you get $2 \times 5 = 10$. However, the second equation’s right side is 9, not 10. The dependency on the left-hand side (where the second equation’s left side is twice the first) does not match the right-hand side ($9 \neq 2 \times 5$). The equations contradict each other ($2x_1 + 4x_2$ cannot equal both 10 and 9 simultaneously, given the first equation). The system is inconsistent, and there is no solution that can satisfy both equations.

So, for the $M = N$ case (square matrix $A$): if $A$ is full rank, there is a unique solution; if $A$ is not full rank, check consistency – if consistent, infinite solutions; if inconsistent, no solution.
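
One handy way to run that consistency check in code is to compare the rank of $A$ with the rank of the augmented matrix $[A \mid B]$: if they match, the system is consistent; if stacking $B$ next to $A$ raises the rank, the system is inconsistent. Here’s a rough sketch (the helper name classify_square_system is just made up for this post):

import numpy as np

def classify_square_system(A, B):
    # Rough classification of AX = B for a square A (a sketch, not production code)
    n = A.shape[0]
    rank_A = np.linalg.matrix_rank(A)
    rank_aug = np.linalg.matrix_rank(np.hstack([A, B]))
    if rank_A == n:
        return "unique solution"
    if rank_A == rank_aug:
        return "infinite solutions (consistent, rank deficient)"
    return "no solution (inconsistent)"

A_singular = np.array([[1, 2], [2, 4]])
print(classify_square_system(A_singular, np.array([[5], [10]])))                   # infinite solutions
print(classify_square_system(A_singular, np.array([[5], [9]])))                    # no solution
print(classify_square_system(np.array([[1, 3], [2, 4]]), np.array([[7], [10]])))   # unique solution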

Case 2: More Equations Than Variables ($M > N$)

Now let’s look at the case where you have more equations than variables, like trying to find two numbers ($x_1, x_2$) that satisfy five different conditions.

$$AX = B \quad (\text{where } M > N)$$

Since you have more equations than variables, it’s generally impossible to find a perfect solution $X$ that makes every single equation true (i.e., makes $AX$ exactly equal to $B$). If you could find such a perfect $X$, then $AX - B$ would be a vector of all zeros.

But since a perfect solution is usually out of reach, what’s the next best thing? We want to find a solution $X$ that gets us as close as possible to satisfying all the equations. In other words, we want to find an $X$ that makes the difference between $AX$ and $B$ as small as possible.

Think of $AX - B$ as a vector of “errors,” where each element is the error in one equation:

$$e_1 = (A_{11}x_1 + A_{12}x_2 + \dots + A_{1N}x_N) - B_1$$
$$e_2 = (A_{21}x_1 + A_{22}x_2 + \dots + A_{2N}x_N) - B_2$$
$$\vdots$$
$$e_M = (A_{M1}x_1 + A_{M2}x_2 + \dots + A_{MN}x_N) - B_M$$

The vector of errors is $E = AX - B$. We want to find the $X$ that makes this error vector $E$ as “small” as possible. How do we measure the size of a vector of errors? We don’t just add the errors ($e_1 + e_2 + \dots$), because positive and negative errors could cancel out, making a big overall error look like zero.

A common way to measure the overall error is to minimize the sum of the squared errors: $e_1^2 + e_2^2 + \dots + e_M^2$. This is because squaring makes all the errors positive, and larger errors contribute more significantly. This approach is called finding the least squares solution.

Minimizing the sum of squared errors ($e_1^2 + \dots + e_M^2$) is the same as minimizing the squared length (or squared norm) of the error vector $E = AX - B$. We write the squared length of a vector $v$ as $||v||^2$, which is $v^T v$ (the vector transposed multiplied by the original vector). So, we want to minimize $||AX - B||^2$, which is $(AX - B)^T (AX - B)$.

$$\text{Minimize } (AX - B)^T (AX - B)$$

To find the $X$ that minimizes this, you can use calculus. You take the derivative of this expression with respect to the vector $X$ and set it equal to zero. After doing the calculus and some matrix algebra (which we won’t go into detail on here), the equation you end up having to solve for $X$ is:

$$A^T A X = A^T B$$

where $A^T$ is the transpose of matrix $A$ (you flip its rows and columns).

Now, if the matrix $A^T A$ is invertible (which happens if the columns of the original matrix $A$ are linearly independent – related to the rank!), you can solve for $X$:

$$X = (A^T A)^{-1} A^T B$$

This solution $X$ is the least squares solution. It’s the $X$ that minimizes the sum of squared differences between $AX$ and $B$. This $X$ might not make $AX$ exactly equal to $B$ (since a perfect solution might not exist), but it gives you the best possible fit in a least squares sense.

This optimization perspective gives us a way to find a meaningful “solution” even when we have more equations than variables, a common situation in data fitting.
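
If you want to see this in code, here’s a minimal sketch that builds the least squares solution straight from the formula above and compares it against NumPy’s built-in np.linalg.lstsq. The helper name least_squares and the tiny line-fitting dataset are made up for illustration:

import numpy as np

def least_squares(A, B):
    # X = (A^T A)^{-1} A^T B, assuming the columns of A are linearly independent
    return np.linalg.inv(A.T @ A) @ A.T @ B

# Fit a line y = a + b*x through the points (0, 1), (1, 2), (2, 2.5):
# three equations (one per point), two unknowns (a and b)
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
B = np.array([[1.0], [2.0], [2.5]])

print(least_squares(A, B))                   # roughly [[1.083], [0.75]]
print(np.linalg.lstsq(A, B, rcond=None)[0])  # same answer, computed more stably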

Let’s work through an $M > N$ example and see how this least squares solution plays out.

Example system of 3 equations with 2 variables:

$$x_1 = 1$$
$$2x_1 = -0.5$$
$$3x_1 + x_2 = 5$$

In matrix form $AX = B$:

$$A = \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix}, \quad X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ -0.5 \\ 5 \end{bmatrix}$$

Here $M = 3$, $N = 2$.

Applying the formula $X = (A^T A)^{-1} A^T B$ (with $A^T A = \begin{bmatrix} 14 & 3 \\ 3 & 1 \end{bmatrix}$ and $A^T B = \begin{bmatrix} 15 \\ 5 \end{bmatrix}$) gives:

$$X = \begin{bmatrix} 0 \\ 5 \end{bmatrix}$$

So the least squares solution is $x_1 = 0$, $x_2 = 5$. Notice that this doesn’t satisfy the first two equations perfectly ($0 \neq 1$ and $2 \times 0 = 0 \neq -0.5$), but it makes the third equation exact ($3 \times 0 + 5 = 5$) and minimizes the total squared error across all three.
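
A quick way to convince yourself of that is to compute the solution and its residuals numerically, as in this sketch:

import numpy as np

A = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [3.0, 1.0]])
B = np.array([[1.0], [-0.5], [5.0]])

X = np.linalg.inv(A.T @ A) @ A.T @ B
print(X)                          # [[0.]
                                  #  [5.]]

residuals = A @ X - B
print(residuals.ravel())          # [-1.   0.5  0. ] -> errors in the three equations
print(np.sum(residuals**2))       # 1.25, the smallest possible sum of squared errors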

Now take a second $M > N$ example, this time one where a perfect solution exists:

$$x_1 = 1$$
$$2x_1 = 2$$
$$3x_1 + x_2 = 5$$

$$A = \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix}, \quad X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ 2 \\ 5 \end{bmatrix}$$

Applying the same formula $X = (A^T A)^{-1} A^T B$, we get:

$$X = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$$

This matches the perfect solution $(x_1 = 1, x_2 = 2)$ that satisfies all three equations, because in this case the minimum sum of squared errors is zero.

So, the least squares solution $X = (A^T A)^{-1} A^T B$ is a powerful way to get the “best fit” solution when you have more equations than variables, and it finds the exact solution if one exists. This formula works as long as the columns of $A$ are linearly independent (which makes $A^T A$ invertible).

Case 3: More Variables Than Equations ($M < N$)

Now for the last case: fewer equations than variables. Imagine having one equation, like $x_1 + x_2 + x_3 = 10$, and trying to find values for $x_1, x_2, x_3$. You can probably already see you have lots of options!

$$AX = B \quad (\text{where } M < N)$$

When you have more variables than independent equations, you’ll have “free” variables. For example, with 2 equations and 3 variables, you can often pick any value for one variable (say, $x_3$), plug it into the equations, and then you’re left with 2 equations and 2 variables to solve for $x_1$ and $x_2$. Since you could pick any value for $x_3$, you end up with infinite solutions.

Since there are endless possibilities for $X$ that satisfy $AX = B$, how do you pick just one solution that makes the most sense for your problem? The equations themselves don’t give you enough information to choose. You need some other rule or criterion.

Similar to the $M > N$ case, we can use an optimization idea! This time, instead of minimizing the error in the equations (since many solutions make the error zero!), we need a different objective. A common and often useful criterion is to find the solution $X$ that has the minimum “size” or “length”. This is called finding the minimum norm solution.

We measure the “size” or “norm” of the solution vector $X$ using its length; specifically, we minimize the squared length $||X||^2 = X^T X$. Why minimize $X^T X$? Think of it as finding the solution vector $X$ that is closest to the origin $(0, 0, \dots, 0)$ in your variable space. From an engineering or modeling perspective, finding a solution where the variables have smaller values might be desirable in some contexts (like keeping design parameters small).

So, for the $M < N$ case, the optimization problem is:

$$\text{Minimize } \frac{1}{2} X^T X \quad \text{subject to } AX = B$$

The $\frac{1}{2}$ is just there to make the math cleaner when you solve it using calculus. The important part is “subject to $AX = B$.” This is a constrained optimization problem – you’re minimizing an objective function ($X^T X$) while making sure your solution still satisfies the original equations $AX = B$. This is different from the $M > N$ case, where we had unconstrained optimization (just minimize the error, without worrying about satisfying the equations perfectly).

Solving constrained optimization problems involves techniques like using Lagrangian functions. While the detailed steps for solving this are part of optimization theory (which often comes after linear algebra, but uses a lot of it!), the resulting formula for the minimum norm solution, assuming the rows of $A$ are linearly independent (making $A A^T$ invertible), is:

$$X = A^T (A A^T)^{-1} B$$

This formula gives you the unique solution that satisfies $AX = B$ and has the minimum possible squared length $||X||^2$.
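
Here’s a minimal sketch of that formula in code, applied to the one-equation example from the start of this section ($x_1 + x_2 + x_3 = 10$). The helper name min_norm_solution is just made up for this post:

import numpy as np

def min_norm_solution(A, B):
    # X = A^T (A A^T)^{-1} B, assuming the rows of A are linearly independent
    return A.T @ np.linalg.inv(A @ A.T) @ B

# One equation, three unknowns: x1 + x2 + x3 = 10
A = np.array([[1.0, 1.0, 1.0]])
B = np.array([[10.0]])

X = min_norm_solution(A, B)
print(X.ravel())               # [3.333... 3.333... 3.333...] -- the shortest solution spreads the total evenly
print(np.allclose(A @ X, B))   # True: it still satisfies the equation exactly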

Let’s try an example for this $M < N$ case. Consider the system of 2 equations with 3 variables:

$$x_1 + 2x_2 + 3x_3 = 2$$
$$x_3 = 1$$

In matrix form $AX = B$:

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 0 & 1 \end{bmatrix}, \quad X = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix}, \quad B = \begin{bmatrix} 2 \\ 1 \end{bmatrix}$$

Here $M = 2$, $N = 3$, so $M < N$.

We know there are infinite solutions here (like $(-1, 0, 1)$, $(-3, 1, 1)$, etc.). Let’s use the minimum norm formula $X = A^T (A A^T)^{-1} B$.

First, find $A^T$:

$$A^T = \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix}$$

Next, calculate $A A^T$:

$$A A^T = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix} = \begin{bmatrix} 14 & 3 \\ 3 & 1 \end{bmatrix}$$

This matrix has a determinant of 5, so it’s invertible:

$$(A A^T)^{-1} = \begin{bmatrix} 0.2 & -0.6 \\ -0.6 & 2.8 \end{bmatrix}$$

Finally, calculate $X = A^T (A A^T)^{-1} B$:

$$X = \begin{bmatrix} 1 & 0 \\ 2 & 0 \\ 3 & 1 \end{bmatrix} \begin{bmatrix} 0.2 & -0.6 \\ -0.6 & 2.8 \end{bmatrix} \begin{bmatrix} 2 \\ 1 \end{bmatrix}$$

Carrying out the multiplication gives:

$$X = \begin{bmatrix} -0.2 \\ -0.4 \\ 1 \end{bmatrix}$$

The minimum norm solution is $x_1 = -0.2$, $x_2 = -0.4$, and $x_3 = 1$. This solution satisfies the original equations, and among all the infinite solutions that work, the vector $[-0.2, -0.4, 1]^T$ is the one closest to the origin $(0, 0, 0)$.
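
You can double-check both claims numerically. Here’s a short sketch that compares the formula’s answer with another exact solution picked by hand (the point $(-1, 0, 1)$ from above):

import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 1.0]])
B = np.array([[2.0], [1.0]])

X_min = A.T @ np.linalg.inv(A @ A.T) @ B
print(X_min.ravel())                         # [-0.2 -0.4  1. ]

X_other = np.array([[-1.0], [0.0], [1.0]])   # another exact solution, chosen by hand

print(np.allclose(A @ X_min, B), np.allclose(A @ X_other, B))   # True True -- both are exact
print(np.linalg.norm(X_min), np.linalg.norm(X_other))           # ~1.095 vs ~1.414 -- the formula's answer is shorter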

So, the formula $X = A^T (A A^T)^{-1} B$ gives us the minimum norm solution when you have more variables than equations, provided the rows of $A$ are linearly independent (making $A A^T$ invertible).

One Solution to Rule Them All: The Pseudo-inverse!

We’ve looked at the three cases for $AX = B$:

  • $M = N$: Unique solution ($A^{-1}B$) if full rank; infinite or no solutions if not full rank.
  • $M > N$: Generally no exact solution; find the least squares solution ($X = (A^T A)^{-1} A^T B$) that minimizes the error.
  • $M < N$: Infinite exact solutions; find the minimum norm solution ($X = A^T (A A^T)^{-1} B$) that is closest to the origin.

This is a lot to keep track of! Wouldn’t it be awesome if there was one single formula that just gave you the right kind of solution (unique exact, least squares, or minimum norm) no matter the size or rank of $A$?

It turns out there is! This is where the concept of the Moore-Penrose pseudo-inverse (often written as $A^+$) comes in. It generalizes the idea of the matrix inverse $A^{-1}$.

The magical formula for solving $AX = B$ in all cases is simply:

$$X = A^+ B$$

where $A^+$ is the pseudo-inverse of $A$.

How is this magical?

  • If $M = N$ and $A$ is full rank, $A^+$ is exactly the same as $A^{-1}$. So you get $X = A^{-1}B$, the unique exact solution.
  • If $M > N$ and the columns of $A$ are independent (making $A^T A$ invertible), $A^+$ is equal to $(A^T A)^{-1} A^T$. So you get $X = (A^T A)^{-1} A^T B$, the least squares solution.
  • If $M < N$ and the rows of $A$ are independent (making $A A^T$ invertible), $A^+$ is equal to $A^T (A A^T)^{-1}$. So you get $X = A^T (A A^T)^{-1} B$, the minimum norm solution. (You can verify all three identities numerically, as in the sketch right below.)
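
Here’s a quick sketch of that check, reusing matrices from our earlier examples:

import numpy as np

# Square, full rank: pinv matches the ordinary inverse
A_sq = np.array([[1.0, 3.0], [2.0, 4.0]])
print(np.allclose(np.linalg.pinv(A_sq), np.linalg.inv(A_sq)))        # True

# Tall (M > N) with independent columns: pinv matches (A^T A)^{-1} A^T
A_tall = np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 1.0]])
print(np.allclose(np.linalg.pinv(A_tall),
                  np.linalg.inv(A_tall.T @ A_tall) @ A_tall.T))      # True

# Wide (M < N) with independent rows: pinv matches A^T (A A^T)^{-1}
A_wide = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 1.0]])
print(np.allclose(np.linalg.pinv(A_wide),
                  A_wide.T @ np.linalg.inv(A_wide @ A_wide.T)))      # True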

And what about the cases where $A$ is not full rank (determinant is zero for $M = N$, columns/rows are dependent)? The pseudo-inverse still works! In those rank-deficient cases:

  • If the system is consistent ($M = N$ not full rank, or $M < N$ rank deficient but consistent), $A^+ B$ gives you the minimum norm solution among the infinite exact solutions.
  • If the system is inconsistent ($M = N$ not full rank, or $M > N$ rank deficient and inconsistent), $A^+ B$ gives you the least squares solution that minimizes the error $||AX - B||^2$.

So, the pseudo-inverse $A^+$ gives you:

  • The unique exact solution if it exists.
  • The minimum norm exact solution if there are infinite exact solutions.
  • The minimum norm least squares solution if there are no exact solutions (i.e., the least squares solution with the smallest norm; when the columns of $A$ are independent, as in the typical $M > N$ case, there is only one least squares solution anyway).

How do you calculate this pseudo-inverse $A^+$? One common way is using something called Singular Value Decomposition (SVD), which is a really powerful technique in linear algebra. But for using it to solve $AX = B$, you often don’t need to know the details of SVD right away.

Many software packages can compute the pseudo-inverse directly. In Python, you can use NumPy’s linalg.pinv() function.

Here are examples in Python using numpy.linalg.pinv() on the matrices we just looked at:

import numpy as np

# Example 1 (M > N)
A1 = np.array([[1, 0], [2, 0], [3, 1]])
B1 = np.array([[1], [-0.5], [5]])

A1_plus = np.linalg.pinv(A1)
X1 = A1_plus @ B1 # Using the matrix multiplication operator @

print("Solution for Example 1 (M>N):")
print(X1)
# Output should be close to:
# [[0.]
#  [5.]]

print("\n---")

# Example 2 (M < N)
A2 = np.array([[1, 2, 3], [0, 0, 1]])
B2 = np.array([[2], [1]])

A2_plus = np.linalg.pinv(A2)
X2 = A2_plus @ B2 # Using the matrix multiplication operator @

print("Solution for Example 2 (M<N):")
print(X2)
# Output should be close to:
# [[-0.2]
#  [-0.4]
#  [ 1. ]]
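
As one more sanity check, here’s what pinv() does in the square but rank-deficient (consistent) case from Example 2 earlier in this post. Among the infinitely many exact solutions, it should hand back the minimum norm one:

import numpy as np

# Example 3 (M = N, rank deficient, consistent) -- the singular system from earlier
A3 = np.array([[1, 2], [2, 4]])
B3 = np.array([[5], [10]])

A3_plus = np.linalg.pinv(A3)
X3 = A3_plus @ B3

print("Solution for Example 3 (M=N, rank deficient):")
print(X3)
# Output should be close to:
# [[1.]
#  [2.]]
# That's the point on the solution line x1 + 2*x2 = 5 closest to the origin.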