$$\begin{aligned}
h(\theta \mid \theta^{(t)}) &\ge f(\theta) \quad\text{(dominance condition)}, \\
h(\theta^{(t)} \mid \theta^{(t)}) &= f(\theta^{(t)}) \quad\text{(tangent condition)}.
\end{aligned}$$

Simply put, the surface of $h(\theta \mid \theta^{(t)})$ lies above $f(\theta)$, except at $\theta^{(t)}$ where they are equal. An MM update consists of minimizing the majorization:
$$\theta^{(t+1)} = \underset{\theta}{\operatorname{argmin}} \; h(\theta \mid \theta^{(t)}).$$

Things to note:
- MM updates are guaranteed to decrease the value of the objective function (descent property).
- Majorizations are often used to split parameters, allowing updates to be done element-wise.
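For concreteness, here is a minimal sketch of the generic MM loop in Python; the names `f` and `minimize_majorization` are illustrative placeholders (a user-supplied objective and a routine returning $\operatorname{argmin}_\theta h(\theta \mid \theta^{(t)})$), not part of any particular library.

```python
import numpy as np

def mm(f, minimize_majorization, theta0, maxiter=100, tol=1e-8):
    """Generic MM loop: repeatedly minimize the majorization built at the
    current iterate.  `minimize_majorization(t)` returns argmin_theta h(theta | t)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(maxiter):
        theta_new = minimize_majorization(theta)
        # Descent property: f(theta_new) <= h(theta_new | theta)
        #                                <= h(theta | theta) = f(theta).
        if f(theta) - f(theta_new) < tol:
            return theta_new
        theta = theta_new
    return theta
```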
[Figure: MM iterations, showing the objective (blue), the majorization (green), and the parameter estimate (red).]
One technique for majorizing a convex function is a quadratic upper bound. If we can find a matrix $M$ such that $M - d^2f(\theta)$ is nonnegative definite for all $\theta$, then
$$f(\theta) \le f(\theta^{(t)}) + \nabla f(\theta^{(t)})^T(\theta - \theta^{(t)}) + \frac{1}{2}(\theta - \theta^{(t)})^T M (\theta - \theta^{(t)}).$$

Using the right-hand side as our majorization, updates take the form
$$\theta^{(t+1)} = \theta^{(t)} - M^{-1}\nabla f(\theta^{(t)}).$$

This method produces an update rule that looks like Newton's method, but uses an approximation of the Hessian which guarantees descent.
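To see where this update comes from, set the gradient of the quadratic majorization (the right-hand side above) with respect to $\theta$ equal to zero:

$$\nabla f(\theta^{(t)}) + M(\theta - \theta^{(t)}) = 0 \quad\Longrightarrow\quad \theta = \theta^{(t)} - M^{-1}\nabla f(\theta^{(t)}).$$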
The negative log-likelihood for the logistic regression model provides the components
$$\begin{aligned}
f(\beta) &= \sum_i \left\{-y_i x_i^T\beta + \ln\left[1 + \exp(x_i^T\beta)\right]\right\}, \\
\nabla f(\beta) &= \sum_i -\left[y_i - \hat{y}_i(\beta)\right] x_i, \\
d^2 f(\beta) &= \sum_i \hat{y}_i(\beta)\left[1 - \hat{y}_i(\beta)\right] x_i x_i^T,
\end{aligned}$$

where each $x_i$ is a vector of predictors, $y_i \in \{0, 1\}$ is the response, and $\hat{y}_i(\beta) = \left(1 + \exp(-x_i^T\beta)\right)^{-1}$. Continuing to create the majorization:
- In matrix form, the Hessian is $d^2f(\beta) = X^T W X$, where $W$ is a diagonal matrix with entries $\hat{y}_i(1 - \hat{y}_i)$.
- $\hat{y}_i$ can only take values in $(0, 1)$, so $\frac{1}{4} \ge \hat{y}_i(1 - \hat{y}_i)$.
- Therefore, we can use $M = X^T X / 4$ to create a quadratic upper bound.

Compare this MM update to Newton's method:
$$\begin{aligned}
\text{MM:} \qquad \beta^{(t+1)} &= \beta^{(t)} + 4(X^TX)^{-1}X^T\left(y - \hat{y}(\beta^{(t)})\right), \\
\text{Newton:} \qquad \beta^{(t+1)} &= \beta^{(t)} + (X^TWX)^{-1}X^T\left(y - \hat{y}(\beta^{(t)})\right).
\end{aligned}$$

In this case, MM has the advantage of being able to reuse $(X^TX)^{-1}$ after it is computed once, whereas Newton's method must solve a new linear system at each iteration.
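As a rough illustration (not taken from any particular package), here is a NumPy sketch of both updates, assuming a dense design matrix `X` whose rows are the $x_i^T$ and a 0/1 response vector `y`; the names `yhat`, `logistic_mm`, and `logistic_newton` are made up for this example.

```python
import numpy as np

def yhat(X, beta):
    """Fitted probabilities yhat_i(beta) = 1 / (1 + exp(-x_i' beta))."""
    return 1.0 / (1.0 + np.exp(-X @ beta))

def logistic_mm(X, y, maxiter=100, tol=1e-8):
    """MM for logistic regression using the quadratic bound M = X'X / 4."""
    _, p = X.shape
    beta = np.zeros(p)
    Minv = 4.0 * np.linalg.inv(X.T @ X)        # computed once, reused every iteration
    for _ in range(maxiter):
        grad = -X.T @ (y - yhat(X, beta))      # gradient of f at current beta
        beta_new = beta - Minv @ grad          # MM update
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta

def logistic_newton(X, y, maxiter=25, tol=1e-8):
    """Newton's method: a new weighted system X'WX is solved at each iteration."""
    _, p = X.shape
    beta = np.zeros(p)
    for _ in range(maxiter):
        p_hat = yhat(X, beta)
        grad = -X.T @ (y - p_hat)
        W = p_hat * (1.0 - p_hat)              # diagonal entries of W
        H = X.T @ (W[:, None] * X)             # Hessian X'WX
        beta_new = beta - np.linalg.solve(H, grad)
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```

In practice the one-time cost would more likely be a Cholesky factorization of $X^TX$ than an explicit inverse, but the point is the same: the MM step reuses that factorization, while Newton's method must refactor $X^TWX$ at every iteration.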