Choosing the Best Fit

We've seen many ways to fit a function to a data set, but how do we decide which one is best? There are three general rules to follow:

If you're creating a fit based on a theoretical prediction then you should use exactly that form. For example, if theory predicts a form like y = mx then you should not perform a fit using y = mx + b because the theory predicts no intercept b.
If you're creating a fit for something you suspect obeys a simple physical phenomenon then you should choose a form that's appropriate to that phenomenon. For example, in most cases involving chemical reactions an exponential form is reasonable whereas a sinusoidal form is not.
Lacking all other intuition, choose the simplest form, with the fewest significant coefficients, that still describes the data sufficiently well.

For the third rule, and indeed any time you need to compare two or more fits, you've got two statistical tools to help out: goodness-of-fit parameters and coefficient confidence intervals.

Goodness-of-fit Parameter

For a quick comparison the goodness-of-fit parameter, also known as R², is sufficient. More detailed and accurate evaluations can be obtained by ANOVA (specifically by comparing the p-values of the two models), but for ease of use we'll stick with R² because it almost always lies between 0 and 1, with the following interpretation:

R² = 0 means that the model explains none of the variability of the response data around its mean.
R² = 1 means that the model explains all of the variability of the response data around its mean.
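
To make this interpretation concrete, here is a minimal sketch that computes R² by hand for a straight-line fit using only base MATLAB (polyfit and polyval). The x and y values are made-up illustration data, not the k and Y used on this page.

```matlab
% Hypothetical data for illustration only
x = [1 2 3 4 5 6];
y = [2.1 4.3 6.2 8.4 9.9 12.3];

p = polyfit(x, y, 1);          % least-squares straight line, p = [slope intercept]
yfit = polyval(p, x);          % model predictions at each x

sse = sum((y - yfit).^2);      % variability left unexplained by the model
sst = sum((y - mean(y)).^2);   % total variability of y around its mean

R2 = 1 - sse/sst               % fraction of the variability explained by the model
```

For the same data and model, this should match the rsquare value that fit reports. The result is close to 1 here only because the made-up data are nearly linear.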

In most chemical engineering experiments that you'll encounter, a value of R² > 0.95 is fairly good, but there's no strict definition of "fairly good" and it can vary from experiment to experiment. You can ask the fit function to output the goodness-of-fit statistics by providing an additional output argument:

    [fobj, gof] = fit(k', Y', 'poly1')

    fobj =
         Linear model Poly1:
         fobj(x) = p1*x + p2
         Coefficients (with 95% confidence bounds):
           p1 =       2.303  (1.974, 2.632)
           p2 =    -0.09937  (-1.095, 0.8967)

    gof =
      struct with fields:

               sse: 0.9828
           rsquare: 0.9895
               dfe: 4
        adjrsquare: 0.9869
              rmse: 0.4957
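
The fields of gof are related to one another, which gives a quick sanity check on the output above. The sketch below recomputes rmse and adjrsquare from the printed sse, rsquare and dfe. The only assumption is that the number of data points is n = dfe + 2, since poly1 has two fitted coefficients.

```matlab
% Values copied from the gof output above
sse = 0.9828;
rsquare = 0.9895;
dfe = 4;                 % degrees of freedom for error = n - (number of coefficients)

n = dfe + 2;             % assumption: poly1 has two coefficients (p1, p2), so n = 6

rmse = sqrt(sse/dfe)                          % root-mean-square error
adjrsquare = 1 - (1 - rsquare)*(n - 1)/dfe    % R-square adjusted for degrees of freedom
```

Both values agree with the printed rmse and adjrsquare to the four digits shown. Because adjrsquare penalizes extra coefficients, it is usually the better of the two R² values for comparing models with different numbers of coefficients.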

Coefficient Confidence Intervals

You should always look at the coefficient confidence intervals for a simple reason:

If a coefficient's confidence interval contains zero then the coefficient should be excluded and you should repeat the fit procedure.

This is how you can decide, for example, between a third-order polynomial and a fourth-order polynomial when both give almost identical goodness-of-fit parameters. The exception to this rule is a fit to a theoretical form: if the confidence interval of one of its coefficients includes zero then you have two choices:

Get better data and try the fit again.
Re-evaluate the potential validity of the model.
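
As a worked illustration of the rule, the sketch below applies the zero test to the confidence bounds printed for the poly1 fit earlier on this page. The bounds are typed in by hand so that the snippet runs in base MATLAB. With a fit object, confint(fobj) from the Curve Fitting Toolbox returns a matrix of the same shape (lower bounds in the first row, upper bounds in the second). The variable names ci, names and containsZero are illustrative.

```matlab
% 95% confidence bounds copied from the poly1 output earlier on this page.
% Row 1 holds the lower bounds, row 2 the upper bounds; columns are p1, p2.
ci = [1.974  -1.095;
      2.632   0.8967];
names = {'p1', 'p2'};

% An interval contains zero when its lower bound is <= 0 and its upper bound is >= 0
containsZero = ci(1,:) <= 0 & ci(2,:) >= 0;

for i = 1:numel(names)
    if containsZero(i)
        fprintf('%s: interval contains zero, consider dropping it\n', names{i});
    else
        fprintf('%s: interval excludes zero, keep it\n', names{i});
    end
end
```

Here only p2 is flagged, which suggests refitting with the intercept removed (a model of the form y = p1*x) unless theory requires the intercept.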
