Solving for X: Cracking the Code of Linear Equations in Data Science

Alright, team! In our last post, we kicked off our dive into linear algebra for data science by looking at how we represent data using matrices and how we can spot linear relationships hiding in that data using the idea of the null space. Matrices are fundamental, and understanding those relationships is super powerful.

Today, we’re moving on to another absolute core concept: solving linear equations. If you think about a ton of problems in data science, machine learning, and even everyday modeling, they often boil down to having a bunch of equations and needing to find the values that make them true. This is where solving matrix equations becomes key.

Turning Equations into Matrix Problems: AX = B

Generally, when you have a set of linear equations, you can write them in a neat matrix form that looks like this:

AX = B

Let’s break down what’s in this equation based on what we learned before:

  • A is a matrix. If you have M equations and N variables, A is typically an M × N matrix. As we saw, M is the number of rows (equations in this case) and N is the number of columns (variables).
  • X is a column vector containing the variables you’re trying to solve for. Since A has N columns (variables), X needs to have N rows, so it’s an N × 1 matrix (a column vector).
  • B is a column vector containing the constants from the right-hand side of your equations. For the matrix multiplication to work out, B needs to have the same number of rows as A, so it’s an M × 1 matrix.

So, when you write AX = B, you’re really representing a system of M linear equations involving N variables.
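To make the shapes concrete, here’s a minimal R sketch (with made-up numbers) of a system with M = 3 equations and N = 2 variables; the dimensions are the thing to notice:

# A hypothetical system of M = 3 equations in N = 2 variables, written as AX = B:
#   1*x1 + 2*x2 = 3
#   4*x1 + 5*x2 = 6
#   7*x1 + 8*x2 = 9
A <- matrix(c(1, 2,
              4, 5,
              7, 8), nrow = 3, byrow = TRUE)  # M x N = 3 x 2
B <- matrix(c(3, 6, 9), ncol = 1)             # M x 1 = 3 x 1
dim(A)  # 3 2
dim(B)  # 3 1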

Now, depending on how M (the number of equations) stacks up against N (the number of variables), things can play out in three main ways:

  1. M = N: The number of equations is exactly the same as the number of variables. This is often the “nicest” case to solve.
  2. M > N: You have more equations than variables. Think of it as having too many rules for your variables to follow. Usually, this means there’s no single perfect solution that satisfies all the equations.
  3. M < N: You have fewer equations than variables. This means you have more variables than you strictly need for the given equations. In this scenario, you’ll usually find that there are many, many possible solutions.

We’re going to look at these cases, and then see how a cool idea called the pseudo-inverse can bring them all together.

A Quick Refresher on Rank

Remember rank from our last post? It’s going to be super important here. The rank of a matrix is the number of its linearly independent rows or columns. Remember, row rank always equals column rank – you can’t have a different number of independent rows versus columns!

For an M × N matrix, the maximum possible rank it can have is the smaller of M and N. If M < N, the max rank is M. If N < M, the max rank is N.
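If you want to check rank in code, one convenient option in R is the QR decomposition (a rough sketch; other functions exist too):

# The rank of a matrix can be read off its QR decomposition
A <- matrix(c(1, 2,
              2, 4), nrow = 2, byrow = TRUE)  # row 2 is 2 * row 1
qr(A)$rank  # 1 -> less than the maximum possible rank of 2 for a 2 x 2 matrix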

Case 1: When Equations Equal Variables (M = N)

This is the case where your matrix A is square (M × N with M = N).

  • If A is Full Rank: “Full rank” here means the rank of the matrix is equal to M (or N, since they’re the same). What does this mean? It means all your equations on the left-hand side are independent. You can’t create any one equation by combining the others. In this happy case, there is a unique solution to AX = B. You might remember from algebra that you can solve this by finding the inverse of A, written as A^{-1}. The solution is simply X = A^{-1} B. You can find A^{-1} if the determinant of A is not zero. This is the standard, straightforward scenario.

  • If A is Not Full Rank: This means the rank of A is less than M. In this situation, some of your equations on the left-hand side are actually linear combinations of others – they are linearly dependent. When this happens, depending on the values on the right-hand side (in the B vector), you get two possibilities:

    1. Consistent System: If the dependencies on the left-hand side match up exactly with the dependencies on the right-hand side (the B vector values), the equations are consistent. But because they are dependent, you don’t have enough independent equations to pin down a single solution. This leads to infinite solutions.
    2. Inconsistent System: If the dependencies on the left-hand side don’t match up with the B vector values, the equations are inconsistent. They contradict each other, and there is no solution that can satisfy all of them.

Let’s look at a couple of examples for this M = N case:

Example 1: Full Rank (Unique Solution)

Consider the system of equations:

x_1 + 3x_2 = 7
2x_1 + 4x_2 = 10

In matrix form AX = B, this is:

\begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 7 \\ 10 \end{bmatrix}

Here, A = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}, with M = 2 and N = 2. We can check whether A is full rank by calculating its determinant: (1 × 4) - (3 × 2) = 4 - 6 = -2. Since the determinant is not 0, the matrix is full rank (rank = 2), and there is a unique solution X = A^{-1} B. The inverse of A is \begin{bmatrix} -2 & 1.5 \\ 1 & -0.5 \end{bmatrix}, so

X = \begin{bmatrix} -2 & 1.5 \\ 1 & -0.5 \end{bmatrix} \begin{bmatrix} 7 \\ 10 \end{bmatrix} = \begin{bmatrix} (-2 \times 7) + (1.5 \times 10) \\ (1 \times 7) + (-0.5 \times 10) \end{bmatrix} = \begin{bmatrix} -14 + 15 \\ 7 - 5 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix}

The unique solution is x_1 = 1 and x_2 = 2. You can plug these back into the original equations to check (1 + 3(2) = 7, 2(1) + 4(2) = 10). It works!

You can solve this easily in software too. In R, you could define A and B and use the solve() command.
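Here’s roughly what that looks like, using the numbers from Example 1:

# Example 1 in R: a full-rank 2 x 2 system with a unique solution
A <- matrix(c(1, 3,
              2, 4), nrow = 2, byrow = TRUE)
B <- c(7, 10)

det(A)          # -2, non-zero, so A is invertible
solve(A, B)     # c(1, 2)  ->  x1 = 1, x2 = 2
solve(A) %*% B  # same answer, computed via the explicit inverse of A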

Example 2: Not Full Rank, Consistent (Infinite Solutions)

Consider the system:

x_1 + 2x_2 = 5
2x_1 + 4x_2 = 10

In matrix form AX = B:

\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 10 \end{bmatrix}

Here, A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}, with M = 2 and N = 2. Check the determinant: (1 × 4) - (2 × 2) = 4 - 4 = 0. The determinant is 0, so the matrix is not full rank (rank = 1). The columns are dependent (column 2 is 2 × column 1), and the rows are dependent (row 2 is 2 × row 1).

Now, look at the equations themselves. The second equation, 2x_1 + 4x_2 = 10, is exactly 2 times the first equation, x_1 + 2x_2 = 5. This dependency on the left-hand side (2x_1 + 4x_2 being 2 × (x_1 + 2x_2)) is matched on the right-hand side (10 being 2 × 5). Since the left and right sides share the same linear dependency, the system is consistent. However, because the second equation is just a scaled version of the first, you effectively only have one independent equation (x_1 + 2x_2 = 5) with two variables (x_1, x_2). This means you have a “free” variable. You can pick any value for x_2, and then calculate the x_1 that works (x_1 = 5 - 2x_2). For example:

  • If x_2 = 0, x_1 = 5. Solution: (5, 0).
  • If x_2 = 1, x_1 = 3. Solution: (3, 1).
  • If x_2 = 2.5, x_1 = 0. Solution: (0, 2.5).

Since you can choose any value for x_2, there are infinite solutions to this system.
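In code, a singular A is exactly where solve() gives up. Here’s a small R sketch using the numbers from Example 2:

# Example 2 in R: A is singular, so there is no inverse to apply
A <- matrix(c(1, 2,
              2, 4), nrow = 2, byrow = TRUE)
B <- c(5, 10)

det(A)  # 0 -> A is not invertible; solve(A, B) would error out

# The system is still consistent: choose x2 freely, then x1 = 5 - 2 * x2
x2 <- c(0, 1, 2.5)
x1 <- 5 - 2 * x2
cbind(x1, x2)  # each row is a valid solution: (5, 0), (3, 1), (0, 2.5)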

Example 3: Not Full Rank, Inconsistent (No Solution)

Consider the system:

x_1 + 2x_2 = 5
2x_1 + 4x_2 = 9

In matrix form AX = B:

\begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} = \begin{bmatrix} 5 \\ 9 \end{bmatrix}

Here, A = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix} is the same as before, so it’s not full rank (rank = 1). The left-hand side has the same dependency (row 2 is 2 × row 1). But look at the right-hand side (the B vector). The first equation’s right side is 5. If you multiply it by 2 (the scaling factor between the left-hand sides), you get 2 × 5 = 10. However, the second equation’s right side is 9, not 10. The dependency on the left-hand side (where the second equation’s left side is twice the first) does not match the right-hand side (9 ≠ 2 × 5). The equations contradict each other (2x_1 + 4x_2 cannot equal both 10 and 9 at the same time, given the first equation). The system is inconsistent, and there is no solution that satisfies both equations.
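A common numerical way to check consistency (not covered above, but a standard trick) is to compare the rank of A with the rank of the augmented matrix [A | B]: if appending B as an extra column increases the rank, the system is inconsistent. A quick R sketch with the Example 3 numbers:

# Example 3 in R: consistency check by comparing rank(A) with rank([A | B])
A <- matrix(c(1, 2,
              2, 4), nrow = 2, byrow = TRUE)
B <- c(5, 9)

qr(A)$rank            # 1
qr(cbind(A, B))$rank  # 2 -> B adds an independent direction, so no solution exists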

So, for the M = N case (square matrix A): if A is full rank, there is a unique solution; if A is not full rank, check consistency – if consistent, infinite solutions; if inconsistent, no solution.

Case 2: More Equations than Variables (M > N)

Now let’s look at the case where you have more equations than variables. Like trying to find two numbers (x_1, x_2) that satisfy five different conditions.

AX = B (where M > N)

Since you have more equations than variables, it’s generally impossible to find a perfect solution X that makes every single equation true (i.e., makes AX exactly equal to B). If you could find such a perfect X, then AX - B would be a vector of all zeros.

But since a perfect solution is usually out of reach, what’s the next best thing? We want to find a solution X that gets us as close as possible to satisfying all the equations. In other words, we want to find an X that makes the difference between AX and B as small as possible.

Think of AX - B as a vector of “errors,” where each element is the error in one equation:

Equation 1 error: e_1 = (A_{11} x_1 + A_{12} x_2 + ... + A_{1N} x_N) - B_1
Equation 2 error: e_2 = (A_{21} x_1 + A_{22} x_2 + ... + A_{2N} x_N) - B_2
…
Equation M error: e_M = (A_{M1} x_1 + A_{M2} x_2 + ... + A_{MN} x_N) - B_M

The vector of errors is E = AX - B. We want to find the X that makes this error vector E as “small” as possible. How do we measure the size of a vector of errors? We don’t just add the errors (e_1 + e_2 + ...), because positive and negative errors could cancel out, making a big overall error look like zero.

A common way to measure the overall error is to minimize the sum of the squared errors: e_1^2 + e_2^2 + ... + e_M^2. Squaring makes all the errors positive, and larger errors contribute more significantly. This approach is called finding the least squares solution.

Minimizing the sum of squared errors (e_1^2 + ... + e_M^2) is the same as minimizing the squared length (or squared norm) of the error vector E = AX - B. We write the squared length of a vector v as ||v||^2, which is v^T v (the vector transposed, multiplied by the original vector). So, we want to minimize ||AX - B||^2, which is (AX - B)^T (AX - B).

Minimize (AX - B)^T (AX - B)

To find the X that minimizes this, you can use calculus: take the derivative of this expression with respect to the vector X and set it equal to zero. After the calculus and some matrix algebra (we won’t grind through every step here), the equation you get to solve for X is:

A^T A X = A^T B

Where A^T is the transpose of matrix A (you flip its rows and columns).
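For the curious, here is a quick sketch of the algebra we skipped. Expanding the product (and using the fact that X^T A^T B is a scalar, so it equals its transpose B^T A X):

(AX - B)^T (AX - B) = X^T A^T A X - 2 B^T A X + B^T B

Taking the derivative with respect to X and setting it to zero gives 2 A^T A X - 2 A^T B = 0, which rearranges to exactly the equation above.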

Now, if the matrix A^T A is invertible (which happens if the columns of the original matrix A are linearly independent – related to the rank!), you can solve for X:

X = (A^T A)^{-1} A^T B

This solution X is the least squares solution. It’s the X that minimizes the sum of squared differences between AX and B. This X might not make AX exactly equal to B (since a perfect solution might not exist), but it gives you the best possible fit in a least squares sense.
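To see it in action, here’s a small R sketch on a made-up overdetermined system (M = 3 equations, N = 2 variables), solving the normal equations directly; the exact numbers don’t matter, only the recipe:

# A hypothetical overdetermined system (M = 3, N = 2):
#   x1 +   x2 = 2
#   x1 + 2*x2 = 3
#   x1 + 3*x2 = 5
A <- matrix(c(1, 1,
              1, 2,
              1, 3), nrow = 3, byrow = TRUE)
B <- c(2, 3, 5)

# Least squares via the normal equations: solve (A^T A) X = A^T B
X <- solve(t(A) %*% A, t(A) %*% B)
X                # roughly (0.333, 1.5)

qr.solve(A, B)   # same least squares answer, computed more stably via QR

# No exact solution exists, but this X minimizes the squared error ||AX - B||^2
A %*% X - B      # small residuals that cannot all be driven to zero at once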

This optimization perspective gives us a way to find a meaningful “solution” even when we have more equations than variables, a common situation in data fitting.

What’s Next?

So far, we’ve covered the case where M=NM=N (same number of equations and variables) and the case where M>NM > N (more equations than variables), introducing the idea of the least squares solution for the latter.

In the next part, we’ll look at the third case where M<NM < N (fewer equations than variables), which usually has infinite solutions. We’ll also see how an optimization idea can help us find a specific, useful solution in that case too.

Finally, we’ll see how this least squares idea and another concept called the pseudo-inverse can actually provide a single, elegant way to think about solving AX=BAX=B that covers all three cases!

Stay tuned!