
Why isn't the Jensen-Shannon divergence used more often than the Kullback-Leibler (since JS is symmetric, thus possibly a better indicator of distance)?

3 Answers

The Kullback-Leibler divergence has a few nice properties, one of them being that [math]KL[q;p][/math] kind of abhors regions where [math]q(x)[/math] has non-null mass and [math]p(x)[/math] has null mass. This might look like a bug, but it’s actually a feature in certain situations.

If you’re trying to approximate a complex (intractable) distribution [math]p(x)[/math] by a (tractable) approximate distribution [math]q(x)[/math], you want to be absolutely sure that any [math]x[/math] that would be very improbable under [math]p(x)[/math] is also very improbable under [math]q(x)[/math]. That KL has this property is easily shown: there’s a [math]q(x)\log[q(x)/p(x)][/math] in the integrand. When [math]q(x)[/math] is small but [math]p(x)[/math] is not, that’s fine. But when [math]p(x)[/math] is small, this term grows very rapidly unless [math]q(x)[/math] is also small. So, if you’re choosing [math]q(x)[/math] to minimize [math]KL[q;p][/math], it’s very improbable that [math]q(x)[/math] will assign much mass to regions where [math]p(x)[/math] is near zero.

The Jensen-Shannon divergence doesn’t have this property. It is well behaved both when [math]p(x)[/math] and [math]q(x)[/math] are small. This means that it won’t penalize as heavily a distribution [math]q(x)[/math] from which you can sample values that are impossible under [math]p(x)[/math].
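
Here is a minimal numerical sketch of the last two paragraphs (plain NumPy, made-up discrete distributions, natural logs): [math]KL[q;p][/math] blows up when [math]q(x)[/math] puts mass where [math]p(x)[/math] is nearly zero, while the Jensen-Shannon divergence stays bounded.

```python
import numpy as np

def kl(q, p):
    """KL[q; p] = sum_x q(x) * log(q(x) / p(x)), in nats."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0                       # terms with q(x) = 0 contribute nothing
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

def js(q, p):
    """JS(q, p) = 0.5 * KL[q; m] + 0.5 * KL[p; m], with m = (q + p) / 2."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    m = 0.5 * (q + p)
    return 0.5 * kl(q, m) + 0.5 * kl(p, m)

# p puts almost no mass on the third outcome; q puts quite a lot there.
p = np.array([0.5, 0.5 - 1e-12, 1e-12])
q = np.array([0.4, 0.3, 0.3])

print(kl(q, p))   # ~7.7 nats, and it grows without bound as p(x) -> 0
print(js(q, p))   # ~0.12 nats; JS is always bounded by log(2) ~ 0.693 nats
```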

JQ Veenstra
It depends on what question you want to answer.  As a pure distance measure, JS is superior - it is symmetric, and its square root is even a metric!  What more could one ask for?

The usefulness of KL is that, simply, it is the fundamental building block of the JS divergences (the symmetric JS divergence compares two distributions against their mixture, with each KL term weighted by 1/2).  If you were only able to calculate one divergence measure (for whatever reason), you could calculate JS from KL but not the reverse.  (OK, I admit, this is not technically true if you can preset the weights, etc., every time you calculate.  But that's missing the point.)  That "whatever reason" is often, for example, the desire not to introduce too many definitions (when we're in the realm of paper-writing; there are others).  But it's also so much more than that.
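
To make that building-block relationship concrete, here is a small sketch (plain NumPy, made-up discrete distributions, natural logs) of the weighted JS divergence assembled from two KL terms; setting the weight to 1/2 gives the symmetric JS whose square root is a metric.

```python
import numpy as np

def kl(p, q):
    """KL[p; q] for discrete distributions, in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q, w=0.5):
    """Weighted JS: w * KL[p; m] + (1 - w) * KL[q; m], with m = w*p + (1 - w)*q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = w * p + (1 - w) * q
    return w * kl(p, m) + (1 - w) * kl(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

print(js(p, q), js(q, p))    # equal: w = 1/2 makes JS symmetric
print(np.sqrt(js(p, q)))     # the square root of this symmetric JS is a metric
```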

For example, the KL divergence measures the expected number of extra bits (with log base 2) required to code samples from one distribution with a code optimized for another.  That in itself is huge, and can be more useful than anything JS can tell you, simply because you often want to have a reference distribution for many reasons.  And it can be used in other situations as well.
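
A tiny worked illustration of that coding interpretation (log base 2, a made-up three-symbol source): coding samples from [math]p[/math] with a code optimized for [math]q[/math] costs the cross-entropy per symbol, and the overhead over the optimal [math]H(p)[/math] bits is exactly [math]KL[p;q][/math].

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # the true source distribution
q = np.array([0.25, 0.25, 0.5])   # the distribution the code was optimized for

entropy_p     = -np.sum(p * np.log2(p))      # optimal code: 1.5 bits/symbol
cross_entropy = -np.sum(p * np.log2(q))      # mismatched code: 1.75 bits/symbol
kl_pq         = np.sum(p * np.log2(p / q))   # overhead: 0.25 extra bits/symbol

print(cross_entropy - entropy_p, kl_pq)      # both 0.25: KL is exactly the overhead
```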
Jay Verkuilen
KL has a lot of nice theorems attached and is just better known.

Many problems to which it is applied are not symmetric, so symmetry isn't necessarily a big sell. However, the fact that KLD is infinite whenever the target measure isn't absolutely continuous with respect to the reference is an issue.
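
A tiny sketch of that absolute-continuity issue (made-up discrete distributions, with [math]p[/math] as the target and [math]q[/math] as the reference): as soon as [math]p[/math] puts mass on an outcome to which [math]q[/math] assigns zero probability, [math]KL[p;q][/math] is infinite.

```python
import numpy as np

p = np.array([0.5, 0.5, 0.0])   # target: can produce the second outcome
q = np.array([0.5, 0.0, 0.5])   # reference: assigns it zero probability

with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(p > 0, p * np.log(p / q), 0.0)
print(terms.sum())              # inf -- KL[p; q] blows up
```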