Lagrange multipliers
When a continuous function of one variable is strictly increasing or decreasing, it has no maximum or minimum points unless we restrict it to a closed interval, in which case the maximum and minimum are attained at the endpoints. For functions of two variables we can do the same and restrict the domain to limit our search for maximum and minimum points. The difference is that the domain of a function of two variables lies in [math]\displaystyle{ \mathbb{R}^2 }[/math], which means that the restricted domain is going to be all points of a certain subset of the plane, a circle for example. For functions of three variables we can't see the graph, but we can plot level surfaces and visualise constraints in [math]\displaystyle{ \mathbb{R}^3 }[/math].
Many textbooks begin the explanation of Lagrange multipliers with the partial derivatives and a system of equations. I'm going to use the graphical interpretation first to make it easier to understand the concept:
The function is [math]\displaystyle{ f(x,y) = x + y }[/math] and the constraint is [math]\displaystyle{ x^2 + y^2 = 1 }[/math]. As you can see, the function's domain has been restricted to the points that satisfy the equation of a circle with radius equal to 1. If we displace the circle along the vertical axis, [math]\displaystyle{ z = f(x,y) }[/math] in this case, it is going to intersect the function's surface at some heights z, and at the highest and lowest of those heights it only touches the surface at a single point. Now the interesting property of such a point is that, if we consider the constraint to be a level curve of a function of two variables, we have two parallel gradients there.
How do we know that both gradients are parallel at that point? If you think in terms of level curves, every function of two variables can be seen as a set of infinitely many level curves. At every point of a level curve the gradient, when it is not null, is perpendicular to that curve. At some height z the circle is going to touch the graph of [math]\displaystyle{ f }[/math] and that point is going to be on some level curve of [math]\displaystyle{ f }[/math]. In turn, the equation that gives the constraint can be seen as a level curve of some function [math]\displaystyle{ g }[/math], and if it's tangent to a level curve of [math]\displaystyle{ f }[/math], then both gradients are perpendicular to the same tangent line, so they are parallel to each other.
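The parallel-gradient claim can be checked numerically for the example above. The following is only a minimal sketch: the helper names (`grad_f`, `grad_g`, `cross_2d`, `cross_on_circle`) are made up for illustration, and the constraint is written as [math]\displaystyle{ g(x,y) = x^2 + y^2 - 1 }[/math]. It walks along the circle and evaluates the 2D cross product of the two gradients, which is zero exactly where they are parallel.

```python
import math

def grad_f(x, y):
    # Gradient of f(x, y) = x + y (the same at every point)
    return (1.0, 1.0)

def grad_g(x, y):
    # Gradient of g(x, y) = x^2 + y^2 - 1
    return (2.0 * x, 2.0 * y)

def cross_2d(u, v):
    # Zero exactly when u and v are parallel (or one of them is the zero vector)
    return u[0] * v[1] - u[1] * v[0]

def cross_on_circle(theta):
    # Point of the unit circle at angle theta, then compare both gradients there
    x, y = math.cos(theta), math.sin(theta)
    return cross_2d(grad_f(x, y), grad_g(x, y))

for degrees in (0, 45, 90, 225):
    value = cross_on_circle(math.radians(degrees))
    print(f"theta = {degrees:3d} deg, cross product = {value:+.3f}")
```

The cross product vanishes (up to rounding) at 45 and 225 degrees, which are the two points where the circle is tangent to a level curve of [math]\displaystyle{ f }[/math], and it is clearly non-zero at the other sampled angles.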
Most textbooks are going to have a graphical explanation of the Lagrange multipliers by means of showing multiple level curves of [math]\displaystyle{ f }[/math] and the constraint, a circle for example. Some people may be confused by the fact that the constraint is intersecting multiple level curves. Remember that when we plot level curves they are all drawn on the XY plane, but each one should have its own "height". We don't plot level curves in 3D with the Z axis. At least not by hand on paper.
Every exercise is going to require us to solve this system of equations (nonlinear, because we often have squares and products of variables). Remember, an equality between vectors means an equality between each pair of corresponding coordinates:
[math]\displaystyle{ \begin{cases} \nabla f(x,y) & = \ \ \lambda \nabla g(x,y) \\ \ \ \ \ g(x,y) & = \ \ 0 \end{cases} }[/math]
(for three or more variables it's the same concept)
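As a worked instance, take the example above with [math]\displaystyle{ g(x,y) = x^2 + y^2 - 1 }[/math]. The system becomes [math]\displaystyle{ 1 = 2\lambda x }[/math], [math]\displaystyle{ 1 = 2\lambda y }[/math] and [math]\displaystyle{ x^2 + y^2 = 1 }[/math]. The first two equations give [math]\displaystyle{ x = y = 1/(2\lambda) }[/math], and substituting in the third gives [math]\displaystyle{ \lambda = \pm 1/\sqrt{2} }[/math]. The sketch below just evaluates these closed-form candidates; the function names are illustrative and not part of any library.

```python
import math

def f(x, y):
    return x + y

def g(x, y):
    # Constraint written as g(x, y) = 0
    return x**2 + y**2 - 1.0

def lagrange_candidates():
    # From 1 = 2*lam*x and 1 = 2*lam*y we get x = y = 1 / (2*lam);
    # the constraint then forces lam = +1/sqrt(2) or lam = -1/sqrt(2).
    candidates = []
    for lam in (1.0 / math.sqrt(2.0), -1.0 / math.sqrt(2.0)):
        x = y = 1.0 / (2.0 * lam)
        candidates.append((lam, x, y))
    return candidates

for lam, x, y in lagrange_candidates():
    print(f"lambda = {lam:+.4f}, point = ({x:+.4f}, {y:+.4f}), "
          f"f = {f(x, y):+.4f}, g = {g(x, y):.1e}")
```

Comparing the value of [math]\displaystyle{ f }[/math] at the two candidates tells us which one is the maximum and which one is the minimum: the system itself only finds candidate points.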
To explain [math]\displaystyle{ g(x,y) = 0 }[/math] just think about the constraint. Every function of two variables has its domain on the XY plane, where all points have the form [math]\displaystyle{ (x,y,0) }[/math]. The constraint is a subset of the XY plane itself, because what the constraint does is restrict which points of the function's domain are allowed. Some textbooks may present [math]\displaystyle{ g(x,y) = k }[/math]. It's the same thing, except that the constraint is seen as the level curve at height [math]\displaystyle{ k }[/math] of a secondary function.
The [math]\displaystyle{ \lambda }[/math] is called the Lagrange multiplier. Since [math]\displaystyle{ f }[/math] and [math]\displaystyle{ g }[/math] are in general different functions, their respective gradients are parallel at that point but need not have the same magnitude: one is the other multiplied by some unknown constant, which can also be negative when the gradients point in opposite directions. In some cases the constant may be [math]\displaystyle{ \lambda = 1 }[/math] and the gradients at that point are equal to each other.
Alternate method: In the example above we could have isolated [math]\displaystyle{ y = \pm \sqrt{1 - x^2} }[/math] and substituted it in the function itself. This way we reduce the problem from two variables to a single one, with [math]\displaystyle{ x }[/math] in the closed interval [math]\displaystyle{ [-1, 1] }[/math]. By analogy, a function of three variables with the constraint being some surface in 3D can also be reduced to a problem with two variables.
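A rough sketch of the substitution method for the same example: solving the circle equation for [math]\displaystyle{ y }[/math] gives [math]\displaystyle{ y = \pm \sqrt{1 - x^2} }[/math], so [math]\displaystyle{ f }[/math] becomes one single-variable function on each half of the circle. Instead of differentiating, the code below simply samples [math]\displaystyle{ x }[/math] on a grid, which is a crude stand-in for the calculus; the grid size and function names are arbitrary choices made for illustration.

```python
import math

def f_upper(x):
    # f restricted to the upper half of the circle: y = +sqrt(1 - x^2)
    return x + math.sqrt(max(0.0, 1.0 - x * x))

def f_lower(x):
    # f restricted to the lower half of the circle: y = -sqrt(1 - x^2)
    return x - math.sqrt(max(0.0, 1.0 - x * x))

def grid(n=2001):
    # n evenly spaced samples of x in the closed interval [-1, 1]
    return [-1.0 + 2.0 * i / (n - 1) for i in range(n)]

xs = grid()
x_max = max(xs, key=f_upper)   # the maximum lies on the upper half
x_min = min(xs, key=f_lower)   # the minimum lies on the lower half

print(f"max near x = {x_max:+.3f}, f = {f_upper(x_max):+.4f}")
print(f"min near x = {x_min:+.3f}, f = {f_lower(x_min):+.4f}")
```

Up to the grid resolution, the sampled extrema agree with the points [math]\displaystyle{ x = y = \pm 1/\sqrt{2} }[/math] that the gradient condition gives for this example.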
Multiple constraints: we can have more than one constraint. The domain of a function of three variables lies in [math]\displaystyle{ \mathbb{R}^3 }[/math]. If we have one constraint, it can be the surface of a sphere for example. If we intersect that sphere with a plane, we have a circle. It's a bit more complicated to visualise. What we are going to have is:
[math]\displaystyle{ \nabla f(x,y,z) = \lambda_1 \nabla g(x,y,z) + \lambda_2 \nabla h(x,y,z) }[/math]
Separately, [math]\displaystyle{ \nabla g }[/math] and [math]\displaystyle{ \nabla h }[/math] are in general not parallel to [math]\displaystyle{ \nabla f }[/math], but some linear combination of them, with coefficients [math]\displaystyle{ \lambda_1 }[/math] and [math]\displaystyle{ \lambda_2 }[/math], is equal to [math]\displaystyle{ \nabla f }[/math]. Much like the previous case, the domain is all points that belong to some curve. But now the curve lies in [math]\displaystyle{ \mathbb{R}^3 }[/math]. In the previous case the domain was a circle lying flat on the XY plane, while in [math]\displaystyle{ \mathbb{R}^3 }[/math] the domain can be a tilted circle for example.
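To make the two-constraint condition concrete, here is a small sketch with an example chosen purely for illustration (it is not taken from any particular textbook): [math]\displaystyle{ f(x,y,z) = z }[/math], the unit sphere [math]\displaystyle{ g = x^2 + y^2 + z^2 - 1 = 0 }[/math] and the plane [math]\displaystyle{ h = x + y + z = 0 }[/math], whose intersection is a tilted circle. The code solves two of the three component equations for [math]\displaystyle{ \lambda_1 }[/math] and [math]\displaystyle{ \lambda_2 }[/math] and reports how far the remaining equation is from being satisfied.

```python
import math

def grad_f(p):
    # f(x, y, z) = z
    return (0.0, 0.0, 1.0)

def grad_g(p):
    # g(x, y, z) = x^2 + y^2 + z^2 - 1 (unit sphere)
    x, y, z = p
    return (2.0 * x, 2.0 * y, 2.0 * z)

def grad_h(p):
    # h(x, y, z) = x + y + z (plane through the origin)
    return (1.0, 1.0, 1.0)

def multipliers(p):
    # Solve the x and z components of grad f = l1*grad g + l2*grad h
    # for (l1, l2) with Cramer's rule.
    gf, gg, gh = grad_f(p), grad_g(p), grad_h(p)
    det = gg[0] * gh[2] - gg[2] * gh[0]
    l1 = (gf[0] * gh[2] - gf[2] * gh[0]) / det
    l2 = (gg[0] * gf[2] - gg[2] * gf[0]) / det
    return l1, l2

def residual(p):
    # How far the remaining y component is from being satisfied
    l1, l2 = multipliers(p)
    return grad_f(p)[1] - l1 * grad_g(p)[1] - l2 * grad_h(p)[1]

s6, s2 = math.sqrt(6.0), math.sqrt(2.0)
top = (-1.0 / s6, -1.0 / s6, 2.0 / s6)   # highest point of the tilted circle
other = (1.0 / s2, -1.0 / s2, 0.0)       # another point of the same circle

print("top:  ", multipliers(top), "residual", residual(top))
print("other:", multipliers(other), "residual", residual(other))
```

At the highest point of the tilted circle all three component equations hold with a single pair [math]\displaystyle{ (\lambda_1, \lambda_2) }[/math]; at the other point no such pair exists, which shows up as a non-zero residual.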
Beyond 3D the algebra is the same, but we can no longer visualise the problem, and with calculus alone we are often unable to solve more complex problems in higher dimensions. We can also have inequalities and multiple conditions to be met, but to solve these types of problems we need linear programming or nonlinear programming.
Note: not every exercise is going to have both a maximum and a minimum. Sometimes the constraint is not a closed curve such as a circle; it can be a parabola for example. If the constraint is a circle, then [math]\displaystyle{ g }[/math] is differentiable everywhere. Bear in mind that the method of Lagrange multipliers requires the functions to be differentiable and the gradient of the constraint not to be null. If at some point we don't have a gradient, the gradient is null, or we have [math]\displaystyle{ \lambda = 0 }[/math] (which means that [math]\displaystyle{ \nabla f }[/math] is null there), we have to rely on other pieces of information to know what is going on at that particular point.
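To illustrate the first remark of the note, here is a sketch with an example chosen for illustration: the same [math]\displaystyle{ f(x,y) = x + y }[/math], but with the parabola [math]\displaystyle{ g(x,y) = y - x^2 = 0 }[/math] as the constraint. The Lagrange system [math]\displaystyle{ (1, 1) = \lambda (-2x, 1) }[/math] gives [math]\displaystyle{ \lambda = 1 }[/math] and [math]\displaystyle{ x = -1/2 }[/math], a single candidate point. The code compares it with a few other points of the parabola.

```python
def f(x, y):
    return x + y

def on_parabola(x):
    # Points that satisfy the constraint g(x, y) = y - x^2 = 0
    return (x, x * x)

# Only solution of (1, 1) = lam * (-2x, 1): lam = 1, x = -1/2
candidate = on_parabola(-0.5)
print("candidate", candidate, "f =", f(*candidate))

# f keeps growing as we move away along the parabola in either direction
for x in (-100.0, -10.0, 10.0, 100.0):
    print(f"x = {x:+7.1f}, f = {f(*on_parabola(x)):.1f}")
```

The candidate turns out to be a minimum, and since [math]\displaystyle{ f }[/math] grows without bound along the parabola there is no maximum to be found.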