One recurrent problem we are experiencing in the office is, as the title suggests, about fitting statistical distributions. Given data from physical experiments, or somehow related to them (such as gene expression profiles or the parameters of a gene selection method), we spend quite some time figuring out the distribution that generated those data.
Well, not just any distribution: the best one. One naive way of doing this is, of course, to plot the data (e.g. plot the histogram) and inspect the distribution graphically. For instance, the R code below
dataN <- rnorm(1000, mean=3, sd=7.5)
hist(dataN, breaks=15)
would plot a Gaussian-like histogram, suggesting that the data have indeed been generated by a normal distribution.
dataP <- rpois(n=1000, lambda=4.9)
hist(dataP, breaks=15)
would plot the histogram of a Poisson-like distribution. Yet another example (trying not to be annoying here) would assume that the shape of the distribution is a Gamma
dataG <- rgamma(1000, shape=1.2, scale=1.3)
hist(dataG, breaks=15)
So far so good. Once we are “sure” about the shape of the distribution, the next big step is to guess the right parameters that govern the real trends within that distribution. Therefore, in the case of a Normal distribution we should estimate the right $latex \mu$ and $latex \sigma$, while for a Poisson distribution we might need to estimate the $latex \lambda$ parameter.
If our assumptions are all about a Gamma distribution, then the parameters $latex k$ and $latex \theta$ should be estimated.
How about the shape?
Even though it’s quite nice to see curves on the screen, the graphical way of guessing the distribution of unknown data is not really the most reliable one. Quantile-quantile plots are much better in that sense. They are still a graphical technique, but a much more explicit one, since they plot empirical quantiles against theoretical quantiles on the two axes, giving a straight line when the two match consistently. Therefore:
qqplot(dataN, rnorm(1000, 3, 7.5), main="QQ-plot Normal")
qqplot(dataP, rpois(1000, 4.9), main="QQ-plot Poisson")
qqplot(dataG, rgamma(1000, shape=1.2, scale=1.3), main="QQ-plot Gamma")
How about the parameters?
This is indeed the biggest deal. Statisticians know several methods to estimate the parameters of a distribution, assuming the shape has been correctly identified. One is the method of moments, in which the moments computed directly from the data are equated to the theoretical moments of the candidate distribution. For instance, if we assume that the data have been generated from a Normal distribution, we should estimate the mean and the variance. Who better than the sample mean and sample variance could do that? An approximation would be $latex \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and $latex \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$, and those will be thrown into the normal distribution analytical form we are familiar with.
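The same trick works for the Gamma data generated earlier: since a Gamma distribution has mean $latex k\theta$ and variance $latex k\theta^2$, equating them to the sample moments gives closed-form estimates. A minimal sketch (the variable names mom_shape and mom_scale are just illustrative):

```r
# Method-of-moments sketch for the Gamma example above.
# Equating sample mean/variance to the theoretical moments
# (mean = k*theta, variance = k*theta^2) and solving for k, theta.
set.seed(42)
dataG <- rgamma(1000, shape=1.2, scale=1.3)
m <- mean(dataG)
v <- var(dataG)
mom_shape <- m^2 / v   # estimate of k
mom_scale <- v / m     # estimate of theta
```

With 1000 samples, both estimates should land reasonably close to the true shape 1.2 and scale 1.3.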
Another method I personally prefer is Maximum Likelihood Estimation. Of course it relies on the assumed distribution, as the method of moments does. Here is an analytical example for the Gamma distribution. The maximum likelihood estimation method consists in maximising the likelihood function, which is (in general, for any distribution) $latex L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$. The MLE consists in finding the $latex \theta$ that maximises the likelihood (or equivalently its logarithm).
For the Gamma distribution it would be something like $latex L(k, \theta) = \prod_{i=1}^{n} \frac{x_i^{k-1} e^{-x_i/\theta}}{\Gamma(k)\,\theta^{k}}$ and the logarithmic form is $latex \ell(k, \theta) = (k-1)\sum_{i=1}^{n} \ln x_i - \frac{1}{\theta}\sum_{i=1}^{n} x_i - nk\ln\theta - n\ln\Gamma(k)$. As a Pig in love with Math I would keep it like that.
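For the Gamma there is no closed-form solution in $latex k$, but the log-likelihood can be maximised numerically. A sketch with base R's optim() (the names negll, mle_shape, mle_scale are illustrative; the log transform just keeps the optimiser in the positive domain):

```r
# Sketch: maximise the Gamma log-likelihood numerically with optim().
set.seed(42)
dataG <- rgamma(1000, shape=1.2, scale=1.3)

# negative log-likelihood; parameters passed on the log scale
# so that shape and scale stay positive during optimisation
negll <- function(p) {
  k <- exp(p[1]); theta <- exp(p[2])
  -sum(dgamma(dataG, shape=k, scale=theta, log=TRUE))
}

fit <- optim(c(0, 0), negll)  # Nelder-Mead by default
mle_shape <- exp(fit$par[1])
mle_scale <- exp(fit$par[2])
```

This is essentially what fitdistr() does under the hood, which is why the one-liner below is so convenient.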
But my roommate, the engineer, usually dislikes that form and prefers to see it in a more “useful” way, as he claims, whenever he applies this masterpiece to real data. In R there is a function that does all that in just one line of code
library(MASS)
fd_g <- fitdistr(dataG, "gamma")
est_shape <- fd_g$estimate["shape"]
est_rate <- fd_g$estimate["rate"]
The estimated parameters of the assumed Gamma distribution are stored in est_shape and est_rate. I have the shape, I have the parameters. What now? The final step of any distribution fitting procedure consists in measuring how well the estimated distribution (shape and parameters) fits the data. Several possibilities are offered by mathematical statistics that usually go under the name of goodness of fit.
These measures basically match the empirical frequencies of the observed data with the ones fitted by the theoretical distribution (the assumed one). The literature offers absolute and relative measures. Two absolute measures are $latex d_1 = \frac{1}{m}\sum_{i=1}^{m} |f_i - \hat{f}_i|$ and $latex d_{\infty} = \max_i |f_i - \hat{f}_i|$, where $latex f_i$ and $latex \hat{f}_i$ are the empirical frequencies and the theoretical ones respectively. A relative measure is obtained by normalising with the mean empirical frequency, e.g. $latex \frac{d_1}{\bar{f}} \cdot 100$.
Again, I found the engineer somewhere between disgusted and scared.
He wanted me to pull some numbers out, and he translated this into R code
# assume dataP come from a Poisson distribution
est_lambda <- mean(dataP)  # MLE of lambda for a Poisson
# table of empirical frequencies, including counts of zero
emp_freq <- table(factor(dataP, levels=0:max(dataP)))
freq <- as.vector(emp_freq)  # vector of empirical frequencies
# vector of expected frequencies under the fitted Poisson
exp_freq <- dpois(0:max(dataP), lambda=est_lambda) * length(dataP)
gf_a <- mean(abs(freq - exp_freq))  # absolute goodness of fit
gf_r <- gf_a/mean(freq) * 100       # relative goodness of fit
Even though hypothesis testing might also reveal the significance of the difference between the two distributions under investigation, the two measures explained above are usually good indicators of how good (or bad) the assumed distribution is.
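For the curious, such a hypothesis test can be sketched with a chi-squared goodness-of-fit test on the Poisson example. One caveat, loudly stated: the probabilities below ignore the small degrees-of-freedom correction for the estimated lambda, so the p-value is only indicative (the names vals, probs and gof are illustrative):

```r
# Sketch: chi-squared goodness-of-fit test for the Poisson example.
set.seed(42)
dataP <- rpois(1000, lambda=4.9)
est_lambda <- mean(dataP)

vals <- 0:max(dataP)
obs <- as.vector(table(factor(dataP, levels=vals)))  # observed counts
probs <- dpois(vals, lambda=est_lambda)
# fold the upper tail into the last cell so the probabilities sum to 1
probs[length(probs)] <- probs[length(probs)] +
  ppois(max(dataP), est_lambda, lower.tail=FALSE)

gof <- suppressWarnings(chisq.test(obs, p=probs))
```

A large p-value here simply means the data give no reason to reject the assumed Poisson shape.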
Next time we’ll try to put everything together and write some code that automatically selects the best in a list of distributions given as candidates.
Before you go
If you enjoyed this post, you will love the newsletter of Data Science at Home. It’s my FREE digest of the best content in Artificial Intelligence, data science, predictive analytics and computer science. Subscribe!