Solving for X: Cracking the Code of Linear Equations in Data Science

Alright, team! In our last post, we kicked off our dive into linear algebra for data science by looking at how we represent data using matrices and how we can spot linear relationships hiding in that data using the idea of the null space. Matrices are fundamental, and understanding those relationships is super powerful.

Today, we’re moving on to another absolute core concept: solving linear equations. If you think about a ton of problems in data science, machine learning, and even everyday modeling, they often boil down to having a bunch of equations and needing to find the values that make them true. This is where solving matrix equations becomes key.

Turning Equations into Matrix Problems: AX = B

Generally, when you have a set of linear equations, you can write them in a neat matrix form that looks like this:

AX = B

Let’s break down what’s in this equation based on what we learned before:

  • A is a matrix. If you have M equations and N variables, A is typically an M × N matrix. As we saw, M is the number of rows (equations in this case) and N is the number of columns (variables).
  • X is a column vector containing the variables you’re trying to solve for. Since A has N columns (variables), X needs to have N rows, so it’s an N × 1 matrix (a column vector).
  • B is a column vector containing the constants from the right-hand side of your equations. For the matrix multiplication to work out, B needs to have the same number of rows as A, so it’s an M × 1 matrix.

So, when you write AX = B, you’re really representing a system of M linear equations involving N variables.
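To make the shapes concrete, here’s a minimal R sketch (with made-up numbers) of a system with M = 3 equations and N = 2 variables; the dimensions are the thing to notice:

# A hypothetical system of M = 3 equations in N = 2 variables, written as AX = B:
#   1*x1 + 2*x2 = 3
#   4*x1 + 5*x2 = 6
#   7*x1 + 8*x2 = 9
A <- matrix(c(1, 2,
              4, 5,
              7, 8), nrow = 3, byrow = TRUE)  # M x N = 3 x 2
B <- matrix(c(3, 6, 9), ncol = 1)             # M x 1 = 3 x 1
dim(A)  # 3 2
dim(B)  # 3 1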

Now, depending on how M (the number of equations) stacks up against N (the number of variables), things can play out in three main ways:

  1. M = N: The number of equations is exactly the same as the number of variables. This is often the “nicest” case to solve.
  2. M > N: You have more equations than variables. Think of it as having too many rules for your variables to follow. Usually, this means there’s no single perfect solution that satisfies all the equations.
  3. M < N: You have fewer equations than variables. This means you have more variables than you strictly need for the given equations. In this scenario, you’ll usually find that there are many, many possible solutions.

We’re going to look at these cases, and then see how a cool idea called the pseudo-inverse can bring them all together.

A Quick Refresher on Rank

Remember rank from our last post? It’s going to be super important here. The rank of a matrix is the number of its linearly independent rows or columns. Remember, row rank always equals column rank – you can’t have a different number of independent rows versus columns!

For an M × N matrix, the maximum possible rank it can have is the smaller of M and N. If M < N, the max rank is M. If N < M, the max rank is N.
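If you want to check rank in code, one convenient option in R is the QR decomposition (a rough sketch; other functions exist too):

# The rank of a matrix can be read off its QR decomposition
A <- matrix(c(1, 2,
              2, 4), nrow = 2, byrow = TRUE)  # row 2 is 2 * row 1
qr(A)$rank  # 1 -> less than the maximum possible rank of 2 for a 2 x 2 matrix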

Case 1: When Equations Equal Variables (M = N)

This is the case where your matrix A is square (M × N with M = N).

  • If A is Full Rank: “Full rank” here means the rank of the matrix is equal to M (or N, since they’re the same). What does this mean? It means all your equations on the left-hand side are independent. You can’t create any one equation by combining the others. In this happy case, there is a unique solution to AX = B. You might remember from algebra that you can solve this by finding the inverse of A, written as A^{-1}. The solution is simply X = A^{-1} B. You can find A^{-1} if the determinant of A is not zero. This is the standard, straightforward scenario.

  • If A is Not Full Rank: This means the rank of A is less than M. In this situation, some of your equations on the left-hand side are actually linear combinations of others – they are linearly dependent. When this happens, depending on the values on the right-hand side (in the B vector), you get two possibilities:

    1. Consistent System: If the dependencies on the left-hand side match up exactly with the dependencies on the right-hand side (the B vector values), the equations are consistent. But because they are dependent, you don’t have enough independent equations to pin down a single solution. This leads to infinite solutions.
    2. Inconsistent System: If the dependencies on the left-hand side don’t match up with the B vector values, the equations are inconsistent. They contradict each other, and there is no solution that can satisfy all of them.

Let’s look at a couple of examples for this M = N case:

Example 1: Full Rank (Unique Solution)

Consider the system of equations:

x_1 + 3x_2 = 7
2x_1 + 4x_2 = 10

In matrix form AX = B, this is:

\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 7 \\ 10 \end{bmatrix}

Here, A = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}, with M = 2 and N = 2. We can check whether A is full rank by calculating its determinant: (1 × 4) - (3 × 2) = 4 - 6 = -2. Since the determinant is not 0, the matrix is full rank (rank = 2), and there is a unique solution X = A^{-1} B. The inverse of A is \begin{bmatrix} -2 & 1.5 \\ 1 & -0.5 \end{bmatrix}, so

X = \begin{bmatrix} -2 & 1.5 \\ 1 & -0.5 \end{bmatrix} \begin{bmatrix} 7 \\ 10 \end{bmatrix} = \begin{bmatrix} (-2 \times 7) + (1.5 \times 10) \\ (1 \times 7) + (-0.5 \times 10) \end{bmatrix} = \begin{bmatrix} -14 + 15 \\ 7 - 5 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}

The unique solution is x_1 = 1 and x_2 = 2. You can plug these back into the original equations to check (1 + 3(2) = 7, 2(1) + 4(2) = 10). It works!

You can solve this easily in software too. In R, you could define A and B and use the solve() command.
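Here’s roughly what that looks like, using the numbers from Example 1:

# Example 1 in R: a full-rank 2 x 2 system with a unique solution
A <- matrix(c(1, 3,
              2, 4), nrow = 2, byrow = TRUE)
B <- c(7, 10)

det(A)          # -2, non-zero, so A is invertible
solve(A, B)     # c(1, 2)  ->  x1 = 1, x2 = 2
solve(A) %*% B  # same answer, computed via the explicit inverse of A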

Example 2: Not Full Rank, Consistent (Infinite Solutions)

Consider the system:

x_1 + 2x_2 = 5
2x_1 + 4x_2 = 10

In matrix form AX = B:

\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix}

Here, A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}, with M = 2 and N = 2. Check the determinant: (1 × 4) - (2 × 2) = 4 - 4 = 0. The determinant is 0, so the matrix is not full rank (rank = 1). The columns are dependent (column 2 is 2 × column 1), and the rows are dependent (row 2 is 2 × row 1).

Now, look at the equations themselves. The second equation, 2x_1 + 4x_2 = 10, is exactly 2 times the first equation, x_1 + 2x_2 = 5. This dependency on the left-hand side (2x_1 + 4x_2 being 2 × (x_1 + 2x_2)) is matched on the right-hand side (10 being 2 × 5). Since the left and right sides share the same linear dependency, the system is consistent. However, because the second equation is just a scaled version of the first, you effectively only have one independent equation (x_1 + 2x_2 = 5) with two variables (x_1, x_2). This means you have a “free” variable. You can pick any value for x_2, and then calculate the x_1 that works (x_1 = 5 - 2x_2). For example:

  • If x_2 = 0, x_1 = 5. Solution: (5, 0).
  • If x_2 = 1, x_1 = 3. Solution: (3, 1).
  • If x_2 = 2.5, x_1 = 0. Solution: (0, 2.5).

Since you can choose any value for x_2, there are infinite solutions to this system.
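In code, a singular A is exactly where solve() gives up. Here’s a small R sketch using the numbers from Example 2:

# Example 2 in R: A is singular, so there is no inverse to apply
A <- matrix(c(1, 2,
              2, 4), nrow = 2, byrow = TRUE)
B <- c(5, 10)

det(A)  # 0 -> A is not invertible; solve(A, B) would error out

# The system is still consistent: choose x2 freely, then x1 = 5 - 2 * x2
x2 <- c(0, 1, 2.5)
x1 <- 5 - 2 * x2
cbind(x1, x2)  # each row is a valid solution: (5, 0), (3, 1), (0, 2.5)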

Example 3: Not Full Rank, Inconsistent (No Solution)

Consider the system:

x_1 + 2x_2 = 5
2x_1 + 4x_2 = 9

In matrix form AX = B:

\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 9 \end{bmatrix}

Here, A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} is the same as before, so it’s not full rank (rank = 1). The left-hand side has the same dependency (row 2 is 2 × row 1). But look at the right-hand side (the B vector). The first equation’s right side is 5. If you multiply it by 2 (the scaling factor between the left-hand sides), you get 2 × 5 = 10. However, the second equation’s right side is 9, not 10. The dependency on the left-hand side (where the second equation’s left side is twice the first) does not match the right-hand side (9 ≠ 2 × 5). The equations contradict each other (2x_1 + 4x_2 cannot equal both 10 and 9 at the same time, given the first equation). The system is inconsistent, and there is no solution that satisfies both equations.
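A common numerical way to check consistency (not covered above, but a standard trick) is to compare the rank of A with the rank of the augmented matrix [A | B]: if appending B as an extra column increases the rank, the system is inconsistent. A quick R sketch with the Example 3 numbers:

# Example 3 in R: consistency check by comparing rank(A) with rank([A | B])
A <- matrix(c(1, 2,
              2, 4), nrow = 2, byrow = TRUE)
B <- c(5, 9)

qr(A)$rank            # 1
qr(cbind(A, B))$rank  # 2 -> B adds an independent direction, so no solution exists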

So, for the M = N case (square matrix A): if A is full rank, there is a unique solution; if A is not full rank, check consistency – if consistent, infinite solutions; if inconsistent, no solution.

Case 2: More Equations than Variables (M > N)

Now let’s look at the case where you have more equations than variables. Like trying to find two numbers (x_1, x_2) that satisfy five different conditions.

AX = B (where M > N)

Since you have more equations than variables, it’s generally impossible to find a perfect solution X that makes every single equation true (i.e., makes AX exactly equal to B). If you could find such a perfect X, then AX - B would be a vector of all zeros.

But since a perfect solution is usually out of reach, what’s the next best thing? We want to find a solution X that gets us as close as possible to satisfying all the equations. In other words, we want to find an X that makes the difference between AX and B as small as possible.

Think of AX - B as a vector of “errors,” where each element is the error in one equation:

Equation 1 error: e_1 = (A_{11} x_1 + A_{12} x_2 + ... + A_{1N} x_N) - B_1
Equation 2 error: e_2 = (A_{21} x_1 + A_{22} x_2 + ... + A_{2N} x_N) - B_2
…
Equation M error: e_M = (A_{M1} x_1 + A_{M2} x_2 + ... + A_{MN} x_N) - B_M

The vector of errors is E = AX - B. We want to find the X that makes this error vector E as “small” as possible. How do we measure the size of a vector of errors? We don’t just add the errors (e_1 + e_2 + ...), because positive and negative errors could cancel out, making a big overall error look like zero.

A common way to measure the overall error is to minimize the sum of the squared errors: e_1^2 + e_2^2 + ... + e_M^2. Squaring makes all the errors positive, and larger errors contribute more significantly. This approach is called finding the least squares solution.

Minimizing the sum of squared errors (e_1^2 + ... + e_M^2) is the same as minimizing the squared length (or squared norm) of the error vector E = AX - B. We write the squared length of a vector v as ||v||^2, which is v^T v (the vector transposed, multiplied by the original vector). So, we want to minimize ||AX - B||^2, which is (AX - B)^T (AX - B).

Minimize (AX - B)^T (AX - B)

To find the X that minimizes this, you can use calculus: take the derivative of this expression with respect to the vector X and set it equal to zero. After the calculus and some matrix algebra (we won’t grind through every step here), the equation you get to solve for X is:

A^T A X = A^T B

Where A^T is the transpose of matrix A (you flip its rows and columns).
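For the curious, here is a quick sketch of the algebra we skipped. Expanding the product (and using the fact that X^T A^T B is a scalar, so it equals its transpose B^T A X):

(AX - B)^T (AX - B) = X^T A^T A X - 2 B^T A X + B^T B

Taking the derivative with respect to X and setting it to zero gives 2 A^T A X - 2 A^T B = 0, which rearranges to exactly the equation above.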

Now, if the matrix A^T A is invertible (which happens if the columns of the original matrix A are linearly independent – related to the rank!), you can solve for X:

X = (A^T A)^{-1} A^T B

This solution X is the least squares solution. It’s the X that minimizes the sum of squared differences between AX and B. This X might not make AX exactly equal to B (since a perfect solution might not exist), but it gives you the best possible fit in a least squares sense.
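To see it in action, here’s a small R sketch on a made-up overdetermined system (M = 3 equations, N = 2 variables), solving the normal equations directly; the exact numbers don’t matter, only the recipe:

# A hypothetical overdetermined system (M = 3, N = 2):
#   x1 +   x2 = 2
#   x1 + 2*x2 = 3
#   x1 + 3*x2 = 5
A <- matrix(c(1, 1,
              1, 2,
              1, 3), nrow = 3, byrow = TRUE)
B <- c(2, 3, 5)

# Least squares via the normal equations: solve (A^T A) X = A^T B
X <- solve(t(A) %*% A, t(A) %*% B)
X                # roughly (0.333, 1.5)

qr.solve(A, B)   # same least squares answer, computed more stably via QR

# No exact solution exists, but this X minimizes the squared error ||AX - B||^2
A %*% X - B      # small residuals that cannot all be driven to zero at once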

This optimization perspective gives us a way to find a meaningful “solution” even when we have more equations than variables, a common situation in data fitting.

What’s Next?

So far, we’ve covered the case where M=NM=N (same number of equations and variables) and the case where M>NM > N (more equations than variables), introducing the idea of the least squares solution for the latter.

In the next part, we’ll look at the third case where M<NM < N (fewer equations than variables), which usually has infinite solutions. We’ll also see how an optimization idea can help us find a specific, useful solution in that case too.

Finally, we’ll see how this least squares idea and another concept called the pseudo-inverse can actually provide a single, elegant way to think about solving AX=BAX=B that covers all three cases!

Stay tuned!