Hi everyone,

I am trying to estimate relative species richness (RSR) for birds using a dynamic occupancy model in RPresence, with a long-term (12-year) data set. My primary questions are: what is the RSR for my study area, and how has it changed over time with respect to environmental covariates of interest? I am also interested in temporal turnover of the bird community, both long-term and seasonal.

Because of the large size of my data and the number of parameters, the null model alone requires >7 days and >100 GB of RAM to run. I am wondering whether anyone has ideas on how to deal with the computational problem (e.g., parallel processing for a single model, if that is possible) and/or any thoughts or suggestions regarding the approaches I am currently using (described below).
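As far as I know, a single likelihood maximization in RPresence cannot be parallelized from within R, but *independent* fits (candidate models, or subsets of the data) can be farmed out across cores with base R's parallel package. A minimal sketch, where fit_one() is a hypothetical stand-in for the real model-fitting call (e.g., RPresence::occMod()):

```r
# Sketch only: run several independent model fits in parallel.
# fit_one() is a hypothetical wrapper; replace its body with the
# actual RPresence call (e.g., occMod(model = ..., data = pao)).
library(parallel)

fit_one <- function(model_spec) {
  Sys.sleep(0.1)                      # stand-in for a long model fit
  list(spec = model_spec, aic = NA)   # return whatever you need to keep
}

model_specs <- list("psi(.)p(.)", "psi(gridID)p(.)", "psi(NDVI)p(effort)")

# mclapply forks one worker per model (use parLapply on Windows)
results <- mclapply(model_specs, fit_one, mc.cores = 2)
length(results)   # 3
```

This does not make any single model faster, but it does let the subset- or bootstrap-based options below use the available cores.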

To describe my data and first approach:

I am using avian point count data collected over a 12-year period, with 2-4 surveys per year (36 surveys in total), each at least several months apart (so there is no temporal replication under closure). To deal with the lack of replication under closure, I am using a space-for-time substitution: I have divided the park into 750 m x 750 m grid cells and treat the points within each grid cell as secondary sampling occasions. This gives 85 grid cells, each containing 1-11 points. To reduce the data, I have pooled the points within each grid cell into two groups (one of up to five points and one of up to six), with points randomly assigned to each group. Because the number of points surveyed differs among grids and primary surveys, I have added effort (the number of points surveyed per secondary occasion) as a covariate on p.
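For concreteness, the random assignment of points to the two secondary occasions, and the resulting effort covariate, can be sketched in base R (object names here are illustrative, not from my actual scripts):

```r
# Sketch: randomly assign the points in one grid cell to two secondary
# occasions (groups of up to 5 and 6 points) and record survey effort.
set.seed(42)
points_in_grid <- paste0("pt", 1:11)   # a grid cell with the full 11 points
shuffled <- sample(points_in_grid)     # random permutation
group1 <- shuffled[1:5]
group2 <- shuffled[6:11]

# effort covariate for p: number of points surveyed per occasion
effort <- c(occ1 = length(group1), occ2 = length(group2))
effort   # occ1 = 5, occ2 = 6
```

Grid cells with fewer points would simply yield smaller (possibly zero-point) groups, which is exactly why effort enters as a detection covariate.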

I have structured the data so that the rows are all possible species, with the complete species list repeated for each grid cell, and grid cell ID included as a site-specific (species-specific) covariate. With 179 species and 85 grid cells, the number of sampling units is 179 x 85 = 15215.
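The stacked "unit = species x grid cell" structure is easy to build with expand.grid(); a minimal sketch (species and grid labels are made up):

```r
# Sketch: build the stacked unit table, one row per species x grid cell,
# with gridID carried along as a unit-level covariate.
species <- paste0("sp", sprintf("%03d", 1:179))
grids   <- paste0("g",  sprintf("%02d", 1:85))

units <- expand.grid(species = species, gridID = grids,
                     stringsAsFactors = TRUE)
nrow(units)   # 179 * 85 = 15215
```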

Brief summary:

nunits = 15215

nsurveys = 72

nseasons = 36

nsurveyseason = 2 (the two groups of points)

The null model for this approach is psi(gridID)gamma(gridID)epsilon(gridID)p(.), which includes 256 parameters (85 grid cells x 3 parameters for psi, gamma, and epsilon, plus 1 for p) and is not computationally feasible.

To reduce the number of parameters and the size of the data, one suggestion I received was to structure the data and grid cells based on combinations of categorical covariates instead of grid cell IDs. For example, with 3 covariates (NDVI, elevation, distance to edge), each cut into 3 categories, I would summarize my data and detection histories over all 27 possible combinations (e.g., highNDVI_midELEV_nearEDGE and highNDVI_lowELEV_farEDGE). This, however, assumes that RSR is approximately constant within each combination of the covariates I have available and choose. Preliminary analyses using this alternative approach suggest I may have issues arising from uneven sampling area per covariate group (i.e., the number of grid cells per group) and from covariates influencing RSR that are not being accounted for. For example, if one combination of covariates is more heavily sampled (and thus incorporates more environmental heterogeneity) than another, RSR is overestimated for the former relative to the latter. I am not sure how best to handle these issues.
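The grouping step itself is straightforward in base R; the covariate values below are simulated purely for illustration, and the tabulation at the end is what exposes the uneven-sampling problem (some of the 27 groups hold many grid cells, others few or none):

```r
# Sketch: collapse grid cells into groups defined by all combinations of
# three categorical covariates (3 levels each -> 27 groups).
set.seed(1)
grid_cov <- data.frame(
  gridID = paste0("g", 1:85),
  ndvi   = runif(85),            # simulated covariates, for illustration
  elev   = runif(85, 0, 1500),
  edge   = runif(85, 0, 2000)
)

# cut each covariate into 3 classes, then form the combination label
grid_cov$ndvi_c <- cut(grid_cov$ndvi, 3, labels = c("low", "mid", "high"))
grid_cov$elev_c <- cut(grid_cov$elev, 3, labels = c("low", "mid", "high"))
grid_cov$edge_c <- cut(grid_cov$edge, 3, labels = c("near", "mid", "far"))
grid_cov$group  <- interaction(grid_cov$ndvi_c, grid_cov$elev_c,
                               grid_cov$edge_c, drop = FALSE)

nlevels(grid_cov$group)   # 27 possible combinations
table(grid_cov$group)     # shows how unevenly the 85 grids fill the groups
```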

Other possibilities include:

1) Analyze sets of grid cells separately (e.g., break the study area into quadrants and run 4 separate analyses)

2) Try a bootstrapping-with-replacement approach, running the model many times on randomly selected subsets of grid cells

3) Analyze temporal subsets of the data separately (e.g., years 1-3, 4-6, 7-9 and 10-12)

4) Model sets of primary seasons as secondary seasons, thus violating closure (and I think rendering p largely meaningless).
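Option 2 above could be sketched as follows; fit_subset() is a hypothetical wrapper around the real RPresence fit, here returning a fake scalar just so the resampling loop runs, and the subset size of 20 grid cells is an arbitrary assumption:

```r
# Sketch of the bootstrap idea: repeatedly fit the model to random
# subsets of grid cells and summarize estimates across replicates.
set.seed(7)
grids <- paste0("g", 1:85)

fit_subset <- function(grid_ids) {
  # placeholder: subset the detection histories to grid_ids, fit the
  # model, and return the estimate of interest (fake value here)
  length(grid_ids) / 85
}

n_boot <- 100
estimates <- replicate(n_boot, {
  sampled <- sample(grids, size = 20, replace = TRUE)  # with replacement
  fit_subset(unique(sampled))
})

mean(estimates)
quantile(estimates, c(0.025, 0.975))   # simple percentile interval
```

Even this would only help if a single fit on ~20 grid cells is tractable, and the replicates could be run in parallel across cores since they are independent.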

Any feedback would be greatly appreciated!

Thank you!