Zero-truncated count distributions and their negative log-likelihoods
Recently, I performed research on single-cell transcriptomic transformers. When porting language model paradigms to this data modality, a pre-training task can be formulated that predicts gene expression levels from their partly masked versions. The gene expression levels typically consist of discrete counts. These counts are commonly assumed to follow a (1) Poisson, (2) negative binomial, or (3) zero-inflated negative binomial distribution.
Using single-cell transformers on gene expression profiles poses challenges w.r.t. data dimensionality. Every expression profile contains roughly 20 000 genes, each making up one input token. This number of tokens is problematic for research-institute-scale hardware, even with memory-efficient versions of self-attention. Luckily, it is not unreasonable to assume that all the relevant expression information is contained within the non-zero counts of the profile[1]. Once all zero counts are removed from the expression profiles, however, the usual distributional assumptions no longer hold: the distributions above become zero-truncated. This blog post describes the zero-truncated versions of the Poisson and negative binomial distributions.
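As a toy illustration of this filtering step (my own sketch, not taken from any particular library), only the expressed genes are kept as input tokens:

```python
import torch

# toy expression profile: counts for 8 genes (a real profile has ~20,000)
counts = torch.tensor([0, 3, 0, 0, 1, 7, 0, 2])

# keep only the expressed genes: their indices act as token identities,
# their (non-zero) counts as the values to be partly masked and predicted
gene_idx = torch.nonzero(counts, as_tuple=True)[0]  # tensor([1, 4, 5, 7])
gene_counts = counts[gene_idx]                      # tensor([3, 1, 7, 2])
```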
The zero-truncated Poisson distribution #
The Poisson distribution is given by:

$$P(k; \lambda) = \frac{\lambda^{k} e^{-\lambda}}{k!},$$

characterizing the probability of observing a (true) count $k$ given a rate parameter $\lambda$. Since all zero counts are removed from the data, the distribution has to be conditioned on $k > 0$. Renormalizing by $P(k > 0) = 1 - e^{-\lambda}$ gives the zero-truncated Poisson distribution:

$$P(k; \lambda \mid k > 0) = \frac{\lambda^{k} e^{-\lambda}}{k!\,(1 - e^{-\lambda})}.$$

The negative log-likelihood of the zero-truncated Poisson distribution is, hence:

$$\mathrm{NLL}(k; \lambda) = -k \log \lambda + \lambda + \log(k!) + \log\!\left(1 - e^{-\lambda}\right).$$

When estimating $\lambda$ (e.g. with gradient descent), the $\log(k!)$ term is constant with respect to $\lambda$ and can be dropped. Using $\lambda + \log(1 - e^{-\lambda}) = \log(e^{\lambda} - 1)$, the loss simplifies to:

$$\mathcal{L}(k; \lambda) = -k \log \lambda + \log\!\left(e^{\lambda} - 1\right).$$

One can see that substituting the second loss term $\lambda$ of the "default" Poisson NLL, $-k \log \lambda + \lambda$, by $\log(e^{\lambda} - 1)$ yields its zero-truncated counterpart. Since $\log(e^{\lambda} - 1) \approx \lambda$ for large $\lambda$, the two losses coincide in that regime.
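As a minimal sketch of how this simplified loss could be written in PyTorch (the function name and the clamping for numerical stability are my own choices, not necessarily those of the implementation linked below):

```python
import torch

def zero_truncated_poisson_nll(k: torch.Tensor, lam: torch.Tensor) -> torch.Tensor:
    """Zero-truncated Poisson NLL, up to the constant log(k!) term.

    k:   observed (non-zero) counts
    lam: predicted rate parameter lambda (> 0)
    """
    lam = lam.clamp(min=1e-6)
    # log(exp(lam) - 1) rewritten as lam + log(1 - exp(-lam)) to avoid overflow for large lam
    return -k * torch.log(lam) + lam + torch.log1p(-torch.exp(-lam))

# toy usage with per-gene counts and predicted rates
k = torch.tensor([1.0, 2.0, 5.0])
lam = torch.tensor([0.5, 1.0, 4.0])
print(zero_truncated_poisson_nll(k, lam).mean())
```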
The zero-truncated negative binomial distribution #
The negative binomial distribution is given by:

$$P(k; \mu, \theta) = \frac{\Gamma(k + \theta)}{k!\,\Gamma(\theta)} \left(\frac{\theta}{\theta + \mu}\right)^{\theta} \left(\frac{\mu}{\theta + \mu}\right)^{k},$$

characterizing the probability of observing a (true) count $k$ given a mean $\mu$ and a dispersion parameter $\theta$. The probability of a zero count is $P(0; \mu, \theta) = \left(\frac{\theta}{\theta + \mu}\right)^{\theta}$, so conditioning on $k > 0$ gives the zero-truncated negative binomial distribution:

$$P(k; \mu, \theta \mid k > 0) = \frac{P(k; \mu, \theta)}{1 - \left(\frac{\theta}{\theta + \mu}\right)^{\theta}}.$$

The negative log-likelihood of the zero-truncated negative binomial distribution is, hence:

$$\mathrm{NLL}(k; \mu, \theta) = -\log\Gamma(k + \theta) + \log\Gamma(\theta) + \log(k!) - \theta \log\!\left(\frac{\theta}{\theta + \mu}\right) - k \log\!\left(\frac{\mu}{\theta + \mu}\right) + \log\!\left(1 - \left(\frac{\theta}{\theta + \mu}\right)^{\theta}\right).$$

Similarly, when estimating $\mu$ and $\theta$, the constant $\log(k!)$ term can be dropped; the only difference with the "default" NB NLL is the added correction term $\log\!\left(1 - \left(\frac{\theta}{\theta + \mu}\right)^{\theta}\right)$.

One can see that the "default" NB NLL becomes a good approximation of the zero-truncated NB NLL at high values of $\mu$: the probability of a zero count then approaches zero, and the correction term vanishes.
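A similar sketch for the zero-truncated NB loss derived above (again an illustrative implementation under the $\mu$/$\theta$ parameterization, not the code linked below):

```python
import torch

def zero_truncated_nb_nll(k: torch.Tensor, mu: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """Zero-truncated negative binomial NLL, up to the constant log(k!) term.

    k:     observed (non-zero) counts
    mu:    predicted mean (> 0)
    theta: dispersion parameter (> 0)
    """
    eps = 1e-8
    mu, theta = mu.clamp(min=eps), theta.clamp(min=eps)
    log_theta_frac = torch.log(theta) - torch.log(theta + mu)  # log(theta / (theta + mu))
    log_mu_frac = torch.log(mu) - torch.log(theta + mu)        # log(mu / (theta + mu))

    # "default" NB NLL without the constant log(k!) term
    nll = (
        -torch.lgamma(k + theta)
        + torch.lgamma(theta)
        - theta * log_theta_frac
        - k * log_mu_frac
    )
    # zero-truncation correction: + log(1 - P(0)), with P(0) = (theta / (theta + mu))^theta
    p_zero = torch.exp(theta * log_theta_frac).clamp(max=1.0 - 1e-6)
    return nll + torch.log1p(-p_zero)

# toy usage
k = torch.tensor([1.0, 3.0, 10.0])
mu = torch.tensor([0.8, 2.5, 9.0])
theta = torch.tensor([2.0, 2.0, 2.0])
print(zero_truncated_nb_nll(k, mu, theta).mean())
```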
The code for using these as loss functions within PyTorch can be found here.
References and footnotes #
[1] Just like we do not need to explicitly tell a transformer which words were not present in a sentence, it might not be necessary to communicate to a transformer which genes were inactive. This can be communicated implicitly, simply by the corresponding tokens not being present. ↩︎