Overfitting

questions concerning analysis/theory using program PRESENCE

Post by dbuhl » Wed Dec 11, 2019 11:35 am

I have a couple of questions regarding occupancy analysis.

1) Given a particular design and sample size, how many parameters can I estimate without overfitting? For example, if I have 100 sites and 5 surveys at each site, how many parameters can I estimate in PRESENCE if the species was present at 80% of the sites with a detection probability of 0.8? Or what if I only observed my species at 10% of the 100 sites with a detection probability of 0.20, then how many parameters can I estimate?

2) If you have covariates with levels that have no detections, is this a problem? For example, I have a covariate for region, but my species was only seen in 2 of the 3 regions so for the 3rd region the detection history is all zeros. In logistic regression, this would be a problem. Is it a problem in occupancy analysis? I ran a couple simulations in PRESENCE to try and figure it out and in those simulations I found that if the covariate was used to model detection, it resulted in huge standard errors (like what would happen in logistic regression). But, if the covariate was used to model occupancy, it did not seem to have any problems.

Any insight on either of these questions would be greatly appreciated. Thank you.

Re: Overfitting

Post by jhines » Wed Dec 11, 2019 3:37 pm

Old reliable answer: it depends. Each parameter you estimate requires data, so with 100 sites you need data behind each site covariate you use in the model. Say you want to build a model with 1 site-specific covariate (e.g., good/bad habitat type). If sites are distributed evenly, you'll have 50 sites with good habitat and 50 with bad. If you have 2 surveys with p=.2, then most of the detection histories will be all zeros (64% of them, even if every site is occupied). So you'll have information about detection from only around 18 sites for each habitat type (p* = 1-(1-p)*(1-p) = 1-.8*.8 = .36, 50*p* = 18), and from fewer still if psi<1.

Each additional covariate in the model divides the data further, and if the data are not distributed evenly among the categories, there is a good chance that some categories will have little to no data.

When you have no detections for a level of a covariate, it is impossible to distinguish whether psi=0, or psi>0 and p=0. That is a problem for occupancy models, and PRESENCE tends to give the result psi=1 with an unreasonable standard error.
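For anyone who wants to redo the arithmetic above for their own design, here is a rough sketch in Python. It is not PRESENCE code, and the function names are made up for illustration. It assumes a model with constant psi and p, where the probability of an all-zero history at one site is (1-psi) + psi*(1-p)^K for K surveys. Treating the "10% of sites" scenario from the question as psi=0.1 is also an assumption, since that figure was the observed proportion and not true occupancy.

```python
import math


def prob_detected_at_least_once(p, n_surveys):
    """p* = 1 - (1 - p)^K: chance an occupied site yields at least one detection."""
    return 1.0 - (1.0 - p) ** n_surveys


def expected_sites_with_detections(n_sites, psi, p, n_surveys):
    """Expected number of sites whose detection history is not all zeros."""
    return n_sites * psi * prob_detected_at_least_once(p, n_surveys)


def all_zero_log_likelihood(n_sites, psi, p, n_surveys):
    """Log-likelihood of n_sites all-zero histories under constant psi and p."""
    prob_all_zero = (1.0 - psi) + psi * (1.0 - p) ** n_surveys
    return n_sites * math.log(prob_all_zero)


# Example from this reply: 50 sites per habitat type, 2 surveys, p = 0.2,
# assuming every site is occupied (psi = 1).
print(round(expected_sites_with_detections(50, 1.0, 0.2, 2), 1))  # 18.0

# The two designs from the question (psi values are assumed, see above).
print(round(expected_sites_with_detections(100, 0.8, 0.8, 5), 1))
print(round(expected_sites_with_detections(100, 0.1, 0.2, 5), 1))

# A covariate level with no detections: the all-zero data are fit perfectly
# (log-likelihood of 0) both by psi = 0 with any p and by any psi with p = 0,
# so the data cannot separate the two explanations.
print(all_zero_log_likelihood(33, 0.0, 0.5, 5))
print(all_zero_log_likelihood(33, 0.7, 0.0, 5))
```

The last two calls return the same value, which is one way to see why the standard errors blow up when a category has no detections at all.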

My suggestion is to 1) think of plausible models before running anything (i.e., have a hypothesis in mind as the reason for running each model; don't just try models to see what's important), and 2) start with the simplest model and work towards more complicated ones. Once the standard errors become unreasonable, you've probably reached the limit of what the data can support. Also, 3) check the results to make sure the estimates and standard errors make sense.

