### Fitting distributions in R

One recurrent problem that we are experiencing in the office is, as the title suggests, about fitting statistical distributions. Given data from physical experiments, or somehow related to them (such as be gene expression profiles or parameters of a gene selection method), we are spending some time to figure out the distribution that generated those data.

Well, not just a distribution. Just the best one. One naive way of doing this is, of course, to plot the data (e.g. plot the histogram) and inspect the distribution in a graphical way. For instance the R code below

```
dataN <- rnorm(1000, mean=3, sd= 7.5)
hist(dataN, breaks=15)
```

would plot a gaussian-like histogram, suggesting that the data have been indeed generated by a normal distribution.

```
dataP <- rpois(n=1000, lambda=4.9)
hist(dataP, breaks=15)
```

would plot the histogram of a poisson-like distribution, indeed. Yet another example (trying not to be annoying here) would assume that the shape of the distribution is a Gamma

```
dataG <- rgamma(1000, shape=1.2, scale=1.3)
hist(dataG, breaks=15)
```

So far so good. Once we are *“sure”* about the shape of the distribution, the next
big step is to guess the right parameters that govern the real trends within
that distribution. Therefore, in the case of a Normal distribution we should
estimate the right and , or for a Poisson
distribution we might need to estimate the parameter.

If our assumptions are all about a Gamma distribution, then parameters $latex k$ and $latex theta $ should be estimated.

## How about the shape?

Even though it’s quite nice to see curves on the screen, the graphical way of guessing the best distribution of unknown data, is not really the most suitable one. Quantile-quantile plots are much better in that sense. It’s still a graphical technique. But much more explicit since it plots empirical quantiles and theoretical quantiles on the two axes, giving a straight line in case the two match consistently. Therefore:

```
qqplot(dataN, rnorm(1000, 3, 7.5), main="QQ-plot Normal")
qqplot(dataP, rpois(1000, 4.9), main="QQ-plot Poisson")
qqplot(dataG, rgamma(1000, 1.2, 1.3), main="QQ-plot Gamma")
```

would plot (almost) a straight line, or points aligned onto it. [caption id=”attachment_1704” align=”alignnone” width=”640”]

## How about the parameters?

This is indeed the biggest deal. Statisticians know about several methods to estimate parameters of a distribution, assuming the shape has been correctly identified. One is the method of moments in which all parameters sampled directly from data will be assumed as parameters of the candidate distribution. For instance, if we assume that data have been generated from a Normal distribution, we should estimate the mean and the variance. Who better than the sample mean and sample variance could do that? An approximation would be: and that will be thrown into the normal distribution analytical form we are familiar with

Another method I personally prefer is the Maximum Likelihood Estimation. Of course it relies on the assumed distribution, as the method of moments. Here is an analytical example for the Gamma distribution. The maximum likelihood estimation method consists in maximising the likelihood function which is (in general for any distribution) The MLE consists in finding the that miximises the likelihood (or equivalently its logarithmic function).

For the Gamma distribution it would be something like and the logarithmic function is As a Pig in love with Math I would keep it like that.

But my room mate, the engineer, usually dislikes that form and prefers to see it in a
more *“useful”* way, as he claims, whenever he applies this masterpiece to
real data. In R there is a function that does all that, in just one line of
code

```
library(MASS)
fd_g <- fitdistr(dataG, "gamma")
est_shape = fd_g$estimate[[1]]
est_rate = fd_g$estimate[[2]]
```

The estimated parameters of the assumed Gamma distribution are stored in
*est_shape* and *est_rate*. I have the shape, I have the parameters. What now?
The final step of any distribution fitting procedure consists in measuring how
good the estimated distribution (shape and parameters) fit the data. Several
possibilities are offered by mathematical statistics that usually go under the
name of *goodness of fit*.

These measures basically match the empirical frequencies of observed data with the ones fitted by a theoretical distribution (the assumed one). The literature is offering absolute and relative measures. Two absolute measures are where and are the empirical frequencies and the theoretical ones respectively. Some relative measures are:

Again, I found the engineer somewhere between disgusted and scared.

He wanted me to pull some numbers out and he translated this in R code

```
#assume dataP come from a Poisson distribution
emp_freq <- table(dataP) # table of empirical frequencies
freq <- vector() # vector of empirical frequencies
for(i in 1:length(emp_freq))
freq[i] <- emp_freq[[i]]
# vector of expected frequencies
exp_freq <- dpois(0:max(dataP), lambda=est_lambda)
gf_a <- mean(abs(freq - trunc(exp_freq))) #absolute goodness of fit
gf_r <- gf_a/mean(freq) * 100 # relative goodness of fit
```

Even though hypothesis testing might reveal also the significance of the difference between the two distributions under investigation, the two measures explained above are usually good indicators of how good (or bad) the assumed distribution is.

Next time we’ll try to put everything together and write some code that automatically selects the best in a list of distributions given as candidates.

(oo)

## Before you go

If you enjoyed this post, you will love the newsletter of Data Science at Home. It’s my FREE digest of the best content in Artificial Intelligence, data science, predictive analytics and computer science. Subscribe!