Strictly speaking, Shannon's formula is a measure of uncertainty, which increases with the number of bits needed to optimally encode a sequence of realizations of $J$.
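To illustrate this encoding interpretation, the following minimal sketch computes the Shannon entropy (in bits) of an empirical distribution; the function name, the use of NumPy, and the coin examples are illustrative assumptions, not part of the formal exposition:

```python
import numpy as np

def shannon_entropy(counts):
    """Shannon entropy in bits of the empirical distribution given by counts."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                      # drop zero-probability states (0 * log 0 := 0)
    return -np.sum(p * np.log2(p))

# A fair coin needs 1 bit per realization; a biased coin can be encoded with fewer.
print(shannon_entropy([1, 1]))       # 1.0
print(shannon_entropy([9, 1]))       # ~0.469
```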
In order to measure the information flow between two processes, Shannon entropy is combined with the concept of the Kullback-Leibler distance [@KL51], under the assumption that the underlying processes evolve over time according to a Markov process [@schreiber2000].
Let $I$ and $J$ denote two discrete random variables with marginal probability distributions $p(i)$ and $p(j)$ and joint probability distribution $p(i,j)$, whose dynamical structures correspond to stationary Markov processes of order $k$ (process $I$) and $l$ (process $J$).
The Markov property implies that the probability of observing $I$ at time $t+1$ in state $i$ conditional on the $k$ previous observations is $p(i_{t+1}|i_t,...,i_{t-k+1})=p(i_{t+1}|i_t,...,i_{t-k})$, i.e., conditioning on more than the last $k$ observations does not change the transition probability.
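This property can be checked empirically on simulated data. The sketch below, which assumes a hypothetical binary order-1 chain with a hand-picked transition matrix (not taken from the text), verifies that conditioning on one additional lag leaves the estimated transition probability unchanged:

```python
import numpy as np

rng = np.random.default_rng(1)
P = np.array([[0.9, 0.1],        # illustrative transition matrix of an order-1 chain
              [0.4, 0.6]])
x = [0]
for _ in range(200_000):
    x.append(rng.choice(2, p=P[x[-1]]))
x = np.array(x)

# Compare p(x_{t+1}=1 | x_t=0) with p(x_{t+1}=1 | x_t=0, x_{t-1}=1):
# for an order-1 chain the extra lag should not matter.
m1 = (x[:-1] == 0)
m2 = (x[1:-1] == 0) & (x[:-2] == 1)
print(x[1:][m1].mean(), x[2:][m2].mean())   # both ~0.1
```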
The average number of bits needed to encode the observation at time $t+1$ if the previous $k$ values are known is given by
$$
h_I(k) = - \sum_i p\left(i_{t+1}, i_t^{(k)}\right) \cdot \log \left(p\left(i_{t+1}|i_t^{(k)}\right)\right),
$$
where $i_t^{(k)} = (i_t, \ldots, i_{t-k+1})$.
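A plug-in estimator of $h_I(k)$ replaces the probabilities with relative frequencies of observed blocks. The following sketch shows one such estimator under stated assumptions; the function name `h_cond` and the i.i.d. test sequence are hypothetical choices for illustration:

```python
import numpy as np
from collections import Counter

def h_cond(series, k):
    """Plug-in estimate of h_I(k): average bits to encode i_{t+1} given i_t^{(k)}."""
    # Count (k+1)-blocks (history plus next value) and k-blocks (history only);
    # for each t, the "next value" is series[t] and the history is series[t-k:t].
    joint = Counter(tuple(series[t - k:t + 1]) for t in range(k, len(series)))
    hist = Counter(tuple(series[t - k:t]) for t in range(k, len(series)))
    n = sum(joint.values())
    h = 0.0
    for block, c in joint.items():
        p_joint = c / n                       # estimate of p(i_{t+1}, i_t^{(k)})
        p_cond = c / hist[block[:-1]]         # estimate of p(i_{t+1} | i_t^{(k)})
        h -= p_joint * np.log2(p_cond)
    return h

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=100_000)          # i.i.d. fair coin: the past is uninformative
print(h_cond(x, k=1))                         # ~1.0 bit
```

For the i.i.d. fair coin, knowing the previous $k$ values saves nothing, so the estimate stays near 1 bit; for a serially dependent process it would fall below the unconditional entropy.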