
Compare an empirical data set against several candidate distributions to help identify the one that may fit best.

Usage

tidy_distribution_comparison(
  .x,
  .distribution_type = "continuous",
  .round_to_place = 3
)

Arguments

.x

The data set being passed to the function

.distribution_type

The kind of data being supplied. Must be one of "continuous" or "discrete". The default is "continuous".

.round_to_place

How many decimal places the parameter estimates should be rounded to for distribution construction. The default is 3.

Value

An invisible list object. A tibble is printed.

Details

The purpose of this function is to take the data set provided and try to find a distribution that may fit it best. The .distribution_type parameter must be set to either continuous or discrete so that the function tries the appropriate types of distributions.

The following distributions are used:

Continuous:

  • tidy_beta

  • tidy_cauchy

  • tidy_exponential

  • tidy_gamma

  • tidy_logistic

  • tidy_lognormal

  • tidy_normal

  • tidy_pareto

  • tidy_uniform

  • tidy_weibull

Discrete:

  • tidy_binomial

  • tidy_geometric

  • tidy_hypergeometric

  • tidy_poisson

The function returns a list of tibbles. The following tibbles are returned:

  • comparison_tbl

  • deviance_tbl

  • total_deviance_tbl

  • aic_tbl

  • kolmogorov_smirnov_tbl

  • multi_metric_tbl

The comparison_tbl is a long tibble that lists the values of the density function against the given data.

The deviance_tbl and the total_deviance_tbl give the simple difference between the actual (empirical) density and the estimated density for each estimated distribution.
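The idea of a simple density difference can be illustrated with a minimal base R sketch. This is not the package's internal code: the normal fit, the evaluation grid, and the variable names are assumptions made for illustration, and taking the absolute value of the summed differences is inferred only from the abs_tot_deviance column name.

```r
# Illustrative only: compare an empirical density with one fitted density.
x <- mtcars$mpg

# Empirical kernel density, evaluated at as many grid points as observations
emp <- stats::density(x, n = length(x))

# Assumed candidate: a normal distribution with moment-based parameters,
# evaluated on the same grid as the empirical density
est_dy <- stats::dnorm(emp$x, mean = mean(x), sd = stats::sd(x))

# Pointwise difference between the empirical and the estimated density
deviance <- emp$y - est_dy

# Assumed summary: absolute value of the summed differences
abs_tot_deviance <- abs(sum(deviance))
abs_tot_deviance
```

A smaller value suggests the two density curves track each other more closely on this grid, although positive and negative differences can cancel in the sum.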

The aic_tbl provides the AIC for an lm model of the estimated density against the empirical density.
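As a rough sketch of what an AIC from an lm of one density on another looks like, the base R snippet below regresses an assumed normal density on the empirical density. The choice of a normal candidate, the grid, and the data frame column names are illustrative assumptions, not the package's implementation.

```r
# Illustrative only: AIC of a linear model of estimated vs. empirical density.
x <- mtcars$mpg

emp <- stats::density(x, n = length(x))

dens_df <- data.frame(
  emp_dy = emp$y,                                                 # empirical density
  est_dy = stats::dnorm(emp$x, mean = mean(x), sd = stats::sd(x)) # assumed candidate
)

# Linear model of the estimated density against the empirical density
fit <- stats::lm(est_dy ~ emp_dy, data = dens_df)

aic_value <- stats::AIC(fit)
abs_aic   <- abs(aic_value)
c(aic_value = aic_value, abs_aic = abs_aic)
```

Because density values are small, the residual variance of such a model is tiny and its log-likelihood can be positive, so a negative AIC is not unusual here. That is consistent with the negative aic_value entries and the separate abs_aic column in the example output.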

The kolmogorov_smirnov_tbl currently provides the two.sided ks.test of the estimated density against the empirical density.
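A two-sided Kolmogorov-Smirnov comparison can be sketched in base R as follows. The simulated normal sample, the seed, and the variable names are assumptions for illustration. The package may set additional ks.test options that are not shown here.

```r
# Illustrative only: two-sided, two-sample KS test of data vs. a simulated fit.
set.seed(123)
x <- mtcars$mpg

# Assumed candidate: draws from a normal distribution with moment-based parameters
sim <- stats::rnorm(length(x), mean = mean(x), sd = stats::sd(x))

# Ties in x may produce a warning about exact p-values in some R versions
ks <- stats::ks.test(x, sim, alternative = "two.sided")

ks$statistic   # maximum distance between the two empirical CDFs
ks$p.value
ks$alternative
```

A small statistic together with a large p-value gives no evidence that the two samples come from different distributions, which is how the ks_statistic and ks_pvalue columns can be read.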

The multi_metric_tbl will summarise all of these metrics into a single tibble.
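Since the result is a list, individual tibbles can be pulled out by name. The sketch below assumes the TidyDensity package is installed and uses only the element and column names visible in the example output. Sorting by ks_statistic is just one possible way to rank candidates, not a recommendation made by the package.

```r
# Illustrative only: extract and sort the combined metrics table.
library(TidyDensity)

output_c <- tidy_distribution_comparison(mtcars$mpg, "continuous")

# Names of the tibbles held in the returned list
names(output_c)

# Pull out the summary tibble and order it by the KS statistic (smallest first)
mm <- output_c$multi_metric_tbl
mm[order(mm$ks_statistic),
   c("dist_type", "abs_tot_deviance", "aic_value", "ks_statistic", "ks_pvalue")]
```

No single metric is decisive. In the example output the distributions rank differently under total deviance, AIC, and the KS statistic, so it helps to look at them together.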

Author

Steven P. Sanderson II, MPH

Examples

xc <- mtcars$mpg
output_c <- tidy_distribution_comparison(xc, "continuous")
#> For the beta distribution, its mean 'mu' should be 0 < mu < 1. The data will
#> therefore be scaled to enforce this.

xd <- trunc(xc)
output_d <- tidy_distribution_comparison(xd, "discrete")

output_c
#> $comparison_tbl
#> # A tibble: 352 × 8
#>    sim_number     x     y    dx       dy     p     q dist_type
#>    <fct>      <int> <dbl> <dbl>    <dbl> <dbl> <dbl> <fct>    
#>  1 1              1  21    2.97 0.000114 0.625  10.4 Empirical
#>  2 1              2  21    4.21 0.000455 0.625  10.4 Empirical
#>  3 1              3  22.8  5.44 0.00142  0.781  13.3 Empirical
#>  4 1              4  21.4  6.68 0.00355  0.688  14.3 Empirical
#>  5 1              5  18.7  7.92 0.00721  0.469  14.7 Empirical
#>  6 1              6  18.1  9.16 0.0124   0.438  15   Empirical
#>  7 1              7  14.3 10.4  0.0192   0.125  15.2 Empirical
#>  8 1              8  24.4 11.6  0.0281   0.812  15.2 Empirical
#>  9 1              9  22.8 12.9  0.0395   0.781  15.5 Empirical
#> 10 1             10  19.2 14.1  0.0516   0.531  15.8 Empirical
#> # ℹ 342 more rows
#> 
#> $deviance_tbl
#> # A tibble: 352 × 2
#>    name                         value
#>    <chr>                        <dbl>
#>  1 Empirical                  0.451  
#>  2 Beta c(1.107, 1.577, 0)    0.189  
#>  3 Cauchy c(19.2, 7.375)     -0.451  
#>  4 Exponential c(0.05)        0.0990 
#>  5 Gamma c(11.47, 1.752)     -0.220  
#>  6 Logistic c(20.091, 3.27)   0.154  
#>  7 Lognormal c(2.958, 0.293)  0.00502
#>  8 Pareto c(10.4, 1.624)      0.0953 
#>  9 Uniform c(8.341, 31.841)  -0.515  
#> 10 Weibull c(3.579, 22.288)  -0.337  
#> # ℹ 342 more rows
#> 
#> $total_deviance_tbl
#> # A tibble: 10 × 2
#>    dist_with_params          abs_tot_deviance
#>    <chr>                                <dbl>
#>  1 Logistic c(20.091, 3.27)             0.183
#>  2 Lognormal c(2.958, 0.293)            1.37 
#>  3 Gaussian c(20.091, 5.932)            2.24 
#>  4 Uniform c(8.341, 31.841)             2.39 
#>  5 Weibull c(3.579, 22.288)             3.33 
#>  6 Beta c(1.107, 1.577, 0)              3.99 
#>  7 Gamma c(11.47, 1.752)                4.04 
#>  8 Pareto c(10.4, 1.624)                7.21 
#>  9 Exponential c(0.05)                  7.46 
#> 10 Cauchy c(19.2, 7.375)               12.5  
#> 
#> $aic_tbl
#> # A tibble: 10 × 3
#>    dist_type                 aic_value abs_aic
#>    <fct>                         <dbl>   <dbl>
#>  1 Beta c(1.107, 1.577, 0)        14.9    14.9
#>  2 Pareto c(10.4, 1.624)          86.9    86.9
#>  3 Gamma c(11.47, 1.752)        -157.    157. 
#>  4 Weibull c(3.579, 22.288)     -168.    168. 
#>  5 Gaussian c(20.091, 5.932)    -192.    192. 
#>  6 Logistic c(20.091, 3.27)     -195.    195. 
#>  7 Uniform c(8.341, 31.841)     -195.    195. 
#>  8 Exponential c(0.05)          -202.    202. 
#>  9 Cauchy c(19.2, 7.375)        -218.    218. 
#> 10 Lognormal c(2.958, 0.293)    -227.    227. 
#> 
#> $kolmogorov_smirnov_tbl
#> # A tibble: 10 × 6
#>    dist_type              ks_statistic ks_pvalue ks_method alternative dist_char
#>    <fct>                         <dbl>     <dbl> <chr>     <chr>       <chr>    
#>  1 Beta c(1.107, 1.577, …       0.781   0.000500 Monte-Ca… two-sided   Beta c(1…
#>  2 Cauchy c(19.2, 7.375)        0.562   0.000500 Monte-Ca… two-sided   Cauchy c…
#>  3 Exponential c(0.05)          0.438   0.00700  Monte-Ca… two-sided   Exponent…
#>  4 Gamma c(11.47, 1.752)        0.25    0.276    Monte-Ca… two-sided   Gamma c(…
#>  5 Logistic c(20.091, 3.…       0.125   0.970    Monte-Ca… two-sided   Logistic…
#>  6 Lognormal c(2.958, 0.…       0.0938  1        Monte-Ca… two-sided   Lognorma…
#>  7 Pareto c(10.4, 1.624)        0.688   0.000500 Monte-Ca… two-sided   Pareto c…
#>  8 Uniform c(8.341, 31.8…       0.25    0.267    Monte-Ca… two-sided   Uniform …
#>  9 Weibull c(3.579, 22.2…       0.125   0.964    Monte-Ca… two-sided   Weibull …
#> 10 Gaussian c(20.091, 5.…       0.188   0.627    Monte-Ca… two-sided   Gaussian…
#> 
#> $multi_metric_tbl
#> # A tibble: 10 × 8
#>    dist_type abs_tot_deviance aic_value abs_aic ks_statistic ks_pvalue ks_method
#>    <fct>                <dbl>     <dbl>   <dbl>        <dbl>     <dbl> <chr>    
#>  1 Logistic…            0.183    -195.    195.        0.125   0.970    Monte-Ca…
#>  2 Lognorma…            1.37     -227.    227.        0.0938  1        Monte-Ca…
#>  3 Gaussian…            2.24     -192.    192.        0.188   0.627    Monte-Ca…
#>  4 Uniform …            2.39     -195.    195.        0.25    0.267    Monte-Ca…
#>  5 Weibull …            3.33     -168.    168.        0.125   0.964    Monte-Ca…
#>  6 Beta c(1…            3.99       14.9    14.9       0.781   0.000500 Monte-Ca…
#>  7 Gamma c(…            4.04     -157.    157.        0.25    0.276    Monte-Ca…
#>  8 Pareto c…            7.21       86.9    86.9       0.688   0.000500 Monte-Ca…
#>  9 Exponent…            7.46     -202.    202.        0.438   0.00700  Monte-Ca…
#> 10 Cauchy c…           12.5      -218.    218.        0.562   0.000500 Monte-Ca…
#> # ℹ 1 more variable: alternative <chr>
#> 
#> attr(,".x")
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
#> attr(,".n")
#> [1] 32