
Why isn't the Jensen-Shannon divergence used more often than the Kullback-Leibler (since JS is symmetric, thus possibly a better indicator of distance)?

3 Answers

The Kullback-Leibler divergence has a few nice properties, one of them being that [math]KL[q;p][/math] kind of abhors regions where [math]q(x)[/math] has non-null mass and [math]p(x)[/math] has null mass. This might look like a bug, but it’s actually a feature in certain situations.

If you’re trying to approximate a complex (intractable) distribution [math]p(x)[/math] by a (tractable) approximate distribution [math]q(x)[/math], you want to be absolutely sure that any [math]x[/math] that would be very improbable under [math]p(x)[/math] is also very improbable under [math]q(x)[/math]. That KL has this property is easily shown: there’s a [math]q(x)\log[q(x)/p(x)][/math] in the integrand. When [math]q(x)[/math] is small but [math]p(x)[/math] is not, that’s fine. But when [math]p(x)[/math] is small, this term grows very rapidly unless [math]q(x)[/math] is also small. So, if you’re choosing [math]q(x)[/math] to minimize [math]KL[q;p][/math], it’s very improbable that [math]q(x)[/math] will assign much mass to regions where [math]p(x)[/math] is near zero.

The Jensen-Shannon divergence doesn’t have this property. It is well behaved both when [math]p(x)[/math] and [math]q(x)[/math] are small. This means that it won’t penalize as heavily a distribution [math]q(x)[/math] from which you can sample values that are impossible under [math]p(x)[/math].
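
Here is a minimal numerical sketch of the last two paragraphs (plain NumPy, made-up discrete distributions, natural logs): [math]KL[q;p][/math] blows up when [math]q(x)[/math] puts mass where [math]p(x)[/math] is nearly zero, while the Jensen-Shannon divergence stays bounded.

```python
import numpy as np

def kl(q, p):
    """KL[q; p] = sum_x q(x) * log(q(x) / p(x)), in nats."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    mask = q > 0                       # terms with q(x) = 0 contribute nothing
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

def js(q, p):
    """JS(q, p) = 0.5 * KL[q; m] + 0.5 * KL[p; m], with m = (q + p) / 2."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    m = 0.5 * (q + p)
    return 0.5 * kl(q, m) + 0.5 * kl(p, m)

# p puts almost no mass on the third outcome; q puts quite a lot there.
p = np.array([0.5, 0.5 - 1e-12, 1e-12])
q = np.array([0.4, 0.3, 0.3])

print(kl(q, p))   # ~7.7 nats, and it grows without bound as p(x) -> 0
print(js(q, p))   # ~0.12 nats; JS is always bounded by log(2) ~ 0.693 nats
```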

JQ Veenstra
It depends on what question you want to answer.  As a pure distance measure, JS is superior - it is symmetric, and its square root is even a metric!  What more could one ask for?

The usefulness of KL is that, simply, it is the fundamental building block of the JS divergences (the symmetric JS divergence compares two distributions against their mixture, with each KL term weighted by 1/2).  If you were only able to calculate one divergence measure (for whatever reason), you could calculate JS from KL but not the reverse.  (OK, I admit, this is not technically true if you can preset the weights, etc., every time you calculate.  But that's missing the point.)  That "whatever reason" is often, for example, the desire not to introduce too many definitions (when we're in the realm of paper-writing; there are others).  But it's also so much more than that.
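
To make that building-block relationship concrete, here is a small sketch (plain NumPy, made-up discrete distributions, natural logs) of the weighted JS divergence assembled from two KL terms; setting the weight to 1/2 gives the symmetric JS whose square root is a metric.

```python
import numpy as np

def kl(p, q):
    """KL[p; q] for discrete distributions, in nats."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js(p, q, w=0.5):
    """Weighted JS: w * KL[p; m] + (1 - w) * KL[q; m], with m = w*p + (1 - w)*q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = w * p + (1 - w) * q
    return w * kl(p, m) + (1 - w) * kl(q, m)

p = np.array([0.1, 0.4, 0.5])
q = np.array([0.3, 0.3, 0.4])

print(js(p, q), js(q, p))    # equal: w = 1/2 makes JS symmetric
print(np.sqrt(js(p, q)))     # the square root of this symmetric JS is a metric
```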

For example, the KL divergence measures the expected number of extra bits (with log base 2) required to code samples from one distribution with a code optimized for another.  That in itself is huge, and can be more useful than anything JS can tell you, simply because you often want to have a reference distribution for many reasons.  And it can be used in other situations as well.
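
A tiny worked illustration of that coding interpretation (log base 2, a made-up three-symbol source): coding samples from [math]p[/math] with a code optimized for [math]q[/math] costs the cross-entropy per symbol, and the overhead over the optimal [math]H(p)[/math] bits is exactly [math]KL[p;q][/math].

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # the true source distribution
q = np.array([0.25, 0.25, 0.5])   # the distribution the code was optimized for

entropy_p     = -np.sum(p * np.log2(p))      # optimal code: 1.5 bits/symbol
cross_entropy = -np.sum(p * np.log2(q))      # mismatched code: 1.75 bits/symbol
kl_pq         = np.sum(p * np.log2(p / q))   # overhead: 0.25 extra bits/symbol

print(cross_entropy - entropy_p, kl_pq)      # both 0.25: KL is exactly the overhead
```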
Jay Verkuilen
KL has a lot of nice theorems attached and is just better known.

Many problems to which it is applied are not symmetric, so symmetry isn't necessarily a big sell. However, the fact that KLD is infinite whenever the target measure isn't absolutely continuous with respect to the reference is an issue.
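
A tiny sketch of that absolute-continuity issue (made-up discrete distributions, with [math]p[/math] as the target and [math]q[/math] as the reference): as soon as [math]p[/math] puts mass on an outcome to which [math]q[/math] assigns zero probability, [math]KL[p;q][/math] is infinite.

```python
import numpy as np

p = np.array([0.5, 0.5, 0.0])   # target: can produce the second outcome
q = np.array([0.5, 0.0, 0.5])   # reference: assigns it zero probability

with np.errstate(divide="ignore", invalid="ignore"):
    terms = np.where(p > 0, p * np.log(p / q), 0.0)
print(terms.sum())              # inf -- KL[p; q] blows up
```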