*In which a headache is induced in prescriptive grammarians.*

Different types of entropy/information and their disambiguations; a work in progress. The seductive power of the logarithm.

To cover - information versus entropy (both in the sense that information entropy of a random event is the expected value of its self-information, and the border wars between statistical mechanics and information theory, in the sense of having a war over whether there is even a border).

A proven path to influence is to find a new derivative measure based on Shannon information, and apply it to something provocative.

Vanilla information, thanks be to Claude Shannon. Given a random variable \(X\) taking values \(x \in \mathcal{X}\) from some discrete alphabet \(\mathcal{X}\), with probability mass function \(p(x)\), the entropy is

\[\begin{split}\begin{array}{ccc}
H(X) & \overset{\underset{\mathrm{def}}{}}{=} & -\sum_{x \in \mathcal{X}} p(x) \log p(x) \\
& \equiv & E( \log 1/p(X) )
\end{array}\end{split}\]
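A minimal sketch in plain Python (the function name is mine), with the usual convention that \(0 \log 0 = 0\):

```python
import math

def shannon_entropy(pmf, base=2.0):
    """H(X) = -sum_x p(x) log p(x), with 0 log 0 taken as 0."""
    return -sum(p * math.log(p, base) for p in pmf if p > 0)

# A fair coin carries exactly one bit; a certain outcome carries none.
print(shannon_entropy([0.5, 0.5]))  # → 1.0
print(shannon_entropy([1.0]))       # → 0.0 (well, -0.0)
```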

Because “Kullback-Leibler divergence” is a lot of syllables for something you use so very often, even if usually in sentences like “unlike the K-L divergence”. Or you could call it the “relative entropy”, but that sounds like something to do with my uncle after the seventh round of Christmas drinks.

It is defined between the probability mass functions of two discrete random variables, \(P,Q\), where those probability mass functions are given by \(p(x)\) and \(q(x)\) respectively.

\[\begin{split}\begin{array}{cccc}
D(P \parallel Q) & \overset{\underset{\mathrm{def}}{}}{=} & \sum_{x \in \mathcal{X}} p(x) \log \frac{p(x)}{q(x)} \\
& \equiv & E \log \frac{p(X)}{q(X)}
\end{array}\end{split}\]
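A plug-in sketch of that definition (again, names are mine), with the standard convention that the divergence is infinite wherever \(q\) assigns zero mass to something \(p\) does not:

```python
import math

def kl_divergence(p, q, base=2.0):
    """D(P || Q) = sum_x p(x) log(p(x)/q(x)); infinite if q(x)=0 where p(x)>0."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0:
            if qx == 0:
                return math.inf
            total += px * math.log(px / qx, base)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, p))  # → 0.0 : a distribution diverges from itself not at all
print(kl_divergence(p, q))  # strictly positive
print(kl_divergence(q, p))  # different again — it is not symmetric
```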

That is, the entropy of conditional probability distributions. This is defined in precisely the manner you’d expect, but I should probably go into detail, esp. on the chain rule for it, because it’s useful later.

The “informativeness” of one variable about another... Most simply, the K-L divergence between the product distribution and the joint distribution of two random variables. (That is, it vanishes iff the two variables are independent.)

Now, take \(X\) and \(Y\) with joint probability mass function \(p_{XY}(x,y)\) and, for clarity, marginal distributions \(p_X\) and \(p_Y\).

Then the mutual information \(I\) is given by

\[I(X; Y) = H(X) - H(X|Y)\]

Estimating this one has been giving me grief lately, so I’ll be happy when I
get to this section and solve it forever. See *Nonparametric mutual information*.

Getting an intuition of what this measure does is handy, so I’ll expound some equivalent definitions that emphasise different characteristics:

\[\begin{split}\begin{array}{cccc}
I(X; Y) & \overset{\underset{\mathrm{def}}{}}{=} &
\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}}
p_{XY}(x, y) \log \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)} \\
& = & D( p_{XY} \parallel p_X p_Y) \\
& = & E \log \frac{p_{XY}(x,y)}{p_X(x)p_Y(y)}
\end{array}\end{split}\]
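The \(D( p_{XY} \parallel p_X p_Y)\) form is the easiest to compute directly for a small discrete joint distribution. A sketch (representing the joint pmf as a 2-D list, function name mine):

```python
import math

def mutual_information(joint):
    """I(X;Y) = D(p_XY || p_X p_Y), for a joint pmf joint[i][j] = p(x_i, y_j)."""
    px = [sum(row) for row in joint]            # marginal of X
    py = [sum(col) for col in zip(*joint)]      # marginal of Y
    mi = 0.0
    for i, row in enumerate(joint):
        for j, pxy in enumerate(row):
            if pxy > 0:
                mi += pxy * math.log2(pxy / (px[i] * py[j]))
    return mi

indep = [[0.25, 0.25], [0.25, 0.25]]  # two independent fair bits
corr = [[0.5, 0.0], [0.0, 0.5]]       # two perfectly correlated fair bits
print(mutual_information(indep))  # → 0.0 : independence means zero shared information
print(mutual_information(corr))   # → 1.0 : one bit fully determines the other
```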

Some folks hope information measures can be massaged to give some general
estimate of how “complex” a system is. It can’t in a general sense - see my unsupported rant over at *Complexity (What it is)*. However, in highly specific circumstances it does some neat things. Here are examples.

Crutchfield, Packard and Feldman

(see http://tuvalu.santafe.edu/~cmg/compmech/tutorials/ComplexityMeasures.pdf or http://tuvalu.santafe.edu/events/workshops/images/e/e9/ComplexityLecture2C.pdf or http://hornacek.coa.edu/dave/Publications/RURO.html or http://hornacek.coa.edu/dave/Publications/DNCO.html )

I don’t know much about this yet, since I have only encountered it in passing.

Schreiber says:

> If \(I\) is obtained by coarse graining a continuous system \(X\) at
> resolution \(\epsilon\), the entropy \(H_X(\epsilon)\) and entropy rate
> \(h_X(\epsilon)\) will depend on the partitioning and in general diverge
> like \(\log(\epsilon)\) when \(\epsilon \to 0\). However, for the
> special case of a deterministic dynamical system, \(\lim_{\epsilon\to 0}
> h_X(\epsilon) = h_{KS}\) may exist and is then called the *Kolmogorov-Sinai
> entropy*. (For non-Markov systems, the limit \(k \to \infty\) also needs
> to be taken.)

That is, it is a special case of the entropy rate. Is that all? Does the entropy rate really diverge for any non-deterministic system, even one of bounded variation? Must work that out.
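The \(\log(\epsilon)\) divergence is easy to see concretely even for the blandest continuous distribution. Coarse-graining the uniform density on \([0,1)\) at resolution \(\epsilon = 2^{-k}\) gives \(2^k\) bins of mass \(\epsilon\) each, so \(H(\epsilon) = \log_2(1/\epsilon) = k\) bits, which grows without bound:

```python
import math

# Coarse-grain the uniform density on [0, 1) at resolution eps = 2**-k.
# Every bin has mass eps, so H(eps) = -sum_bins eps*log2(eps) = log2(1/eps) = k bits.
for k in range(1, 6):
    eps = 2.0 ** -k
    bins = [eps] * int(round(1.0 / eps))   # pmf of the coarse-grained variable
    h = -sum(p * math.log2(p) for p in bins)
    print(f"eps = 2^-{k}:  H(eps) = {h:.1f} bits")
```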

Related to entropy convergence rates.

Hm. See that Crutchfield paper, above.

Bialek and Tishby.

Schreiber. Approximating states as Markovian processes, we measure their influence on each other’s transition probabilities.

(How can this be generalised to non-Markovian processes, or at least ones with hidden state?)

See also Granger causality. See also more general causality, Judea Pearl-style.
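A plug-in sketch of transfer entropy for two binary series with order-1 histories, \(T_{Y \to X} = \sum p(x_{t+1}, x_t, y_t) \log_2 \frac{p(x_{t+1}|x_t, y_t)}{p(x_{t+1}|x_t)}\) (the function name and toy series are mine, and a serious estimator would need bias correction):

```python
import math
import random
from collections import Counter

def transfer_entropy(x, y):
    """Plug-in estimate of T_{Y->X} with order-1 histories:
    how much y's present improves prediction of x's next step beyond x's own past."""
    n = len(x) - 1
    triples = Counter((x[t + 1], x[t], y[t]) for t in range(n))
    pairs_src = Counter((x[t], y[t]) for t in range(n))
    pairs_own = Counter((x[t + 1], x[t]) for t in range(n))
    singles = Counter(x[t] for t in range(n))
    te = 0.0
    for (x1, x0, y0), c in triples.items():
        p_joint = c / n
        p_with_source = c / pairs_src[(x0, y0)]      # p(x1 | x0, y0)
        p_own_past = pairs_own[(x1, x0)] / singles[x0]  # p(x1 | x0)
        te += p_joint * math.log2(p_with_source / p_own_past)
    return te

# Toy check: X copies Y with a one-step lag, so information flows Y -> X only.
random.seed(0)
y = [random.randint(0, 1) for _ in range(10000)]
x = [0] + y[:-1]
te_y_to_x = transfer_entropy(x, y)   # close to 1 bit: y determines x's next step
te_x_to_y = transfer_entropy(y, x)   # close to 0: x adds nothing about y's next step
print(te_y_to_x, te_x_to_y)
```

Note the asymmetry in the toy example, which is the whole point: mutual information between the two series would be large in both directions.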

Lizier.

Shannon’s less-successful formula.

E.T. Jaynes’ entrant into the field. See also K-L divergence for continuous distributions.

http://en.wikipedia.org/wiki/Limiting_density_of_discrete_points

This is more of a raw statistical thing, a matrix that tells you how much a new datum affects your parameter estimates. It is related, I am told, to garden variety Shannon information, and when that non-obvious fact becomes clearer to me I shall explain precisely how.
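For a one-parameter scalar case it is just the expected squared score. A sketch for a Bernoulli(\(\theta\)) observation (function name mine; this shows the definition only, not the connection to Shannon information):

```python
# Fisher information of a single Bernoulli(theta) observation,
# I(theta) = E[(d/dtheta log p(X; theta))^2].
# The score is 1/theta when X=1 and -1/(1-theta) when X=0.
def bernoulli_fisher_information(theta):
    score_one = 1.0 / theta
    score_zero = -1.0 / (1.0 - theta)
    return theta * score_one ** 2 + (1 - theta) * score_zero ** 2

theta = 0.3
print(bernoulli_fisher_information(theta))  # expected squared score
print(1.0 / (theta * (1.0 - theta)))        # closed form 1/(theta(1-theta)): same number
```

Information is largest near \(\theta = 0\) or \(1\), where a single observation pins the parameter down sharply.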

Also, the Hartley measure.

You don’t need to use a logarithm in your information summation. Free energy, something something. (?)

The observation that many of the attractive features of information measures are simply due to the concavity of the logarithm term in the function. So, why not whack another concave function with even more handy features in there? Bam, you are now working on Rényi information. How do you feel?
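A sketch of the order-\(\alpha\) Rényi entropy \(H_\alpha = \frac{1}{1-\alpha}\log_2 \sum_x p(x)^\alpha\) (function names mine), showing numerically that the \(\alpha \to 1\) limit recovers Shannon entropy:

```python
import math

def renyi_entropy(pmf, alpha):
    """H_alpha = log2(sum_x p(x)**alpha) / (1 - alpha), for alpha != 1."""
    return math.log2(sum(p ** alpha for p in pmf if p > 0)) / (1.0 - alpha)

def shannon(pmf):
    return -sum(p * math.log2(p) for p in pmf if p > 0)

pmf = [0.7, 0.2, 0.1]
for alpha in (0.5, 0.9999, 2.0):
    print(alpha, renyi_entropy(pmf, alpha))   # non-increasing in alpha
print("shannon:", shannon(pmf))               # matches the alpha -> 1 values
```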

Attempting to make information measures non-extensive. *q*-entropy. Seems to
have made a big splash in Brazil, but less in other countries. There are good
books about highly localised mathematical oddities in that nation,
but not ones I’d cite in academic articles. Non-extensive measures are an
intriguing idea, though. I wonder if it’s parochialism that keeps everyone off
Tsallis statistics, or a lack of demonstrated use?
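The non-extensivity is concrete enough to check in a few lines. For independent systems \(A\) and \(B\), the Tsallis entropy \(S_q = \frac{1 - \sum_x p(x)^q}{q - 1}\) obeys the pseudo-additive rule \(S_q(A,B) = S_q(A) + S_q(B) + (1-q)S_q(A)S_q(B)\) rather than a plain sum (function name and toy distributions mine):

```python
def tsallis_entropy(pmf, q):
    """S_q = (1 - sum_x p(x)**q) / (q - 1), for q != 1."""
    return (1.0 - sum(p ** q for p in pmf)) / (q - 1.0)

q = 2.0
a = [0.6, 0.4]
b = [0.3, 0.7]
joint = [pa * pb for pa in a for pb in b]   # independent joint distribution
lhs = tsallis_entropy(joint, q)
rhs = tsallis_entropy(a, q) + tsallis_entropy(b, q) \
    + (1 - q) * tsallis_entropy(a, q) * tsallis_entropy(b, q)
print(lhs, rhs)  # equal: the cross term is exactly the "non-extensive" part
```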

Hmm.

Due to Murali Rao.

- John Baez’s A Characterisation of Entropy
- David Ellerman’s History of the Logical Entropy Formula and From Partition Logic to Information Theory