Package 'gasplit' reference manual

Title:	Flexible estimation of 2-category proportions
Description:	Like a GLM/GAM where the true category of each sample is unknown, but the response data is a continuous variable with known distributions for category-1 and category-2 samples.
Authors:	Mark Bravington <[email protected]>
Maintainer:	Mark Bravington <[email protected]>
License:	CC-BY
Version:	1.1.2
Built:	2025-03-11 04:25:37 UTC
Source:	https://github.com/markbravington/gasplit

Flexible models for 2-category proportions from continuous data

Description

Suppose your data each come from one or the other of two categories (say, Eastern or Western Atlantic). You don't know for sure which one, but you do have a "measurement" for each one (say, a log-likelihood ratio) which tends to be high for one category and low for the other, with an a-priori-known distribution for each category (say, based on other samples of known category).

Now you want to estimate the proportion of category-1 data in your data, but stratified according to covariates of the data (say, year and location)— and/or simply dependent on individual covariates, (say, size). Ideally, you'd like to be able to specify a flexible model formula just as you might in a GLM or GAM. Well, now you can! For example:

  addmod <- gasplit( LGLR ~ year + region - 1, data= ew_data,
      d1= known_distro_of_measurement_for_category_1,
      d2= ditto_for_2
    )

to fit a two-way model with no interactions, to a dataset with covariates "year" and "region" based on a measurement "LGLR".

Having fitted your model, you can make a subsequent call to predict the proportion for (presumably) new covariate values, via the predict_from_previous argument..

If you want to use smoothers, i.e. s() terms a la mgcv, then you should call gasplit2 instead. The latter requires RTMB. Results should be the same as gasplit if there are no smooth terms; IDNK which one is faster. NDIRC (Nor Do I Really Care.)

See Examples of test_gasplit (qv) for practical stuff.

Usage

gasplit(formula, data, d1, d2, start=0.001,
    predict_from_previous=NULL)
gasplit2(
  formula,
  data,
  d1, d2,
  predict_from_previous= NULL,
  ... # for gam()
)
gasplit(formula, data, d1, d2, start=0.001,
    predict_from_previous=NULL)
gasplit2(
  formula,
  data,
  d1, d2,
  predict_from_previous= NULL,
  ... # for gam()
)

Arguments

`formula`	a classic R model formula (eventually allowing `mgcv` extensions)
`data`	a dataframe...
`d1`, `d2`	these are functions giving the PDF of the response variable for category-1 and category-2 samples. You gotta pre-fit those to other data. See Examples.
`start`	(just `gasplit`, not `gasplit2`). Either a scalar (probably close to 0) which will be used for all coefficients, or a vector with length `ncol(<model matrix>)`. TBH I'd leave it blank; the default will prolly work.
`predict_from_previous`	Optional result from a previous call to `gasplit`. If set, then no fitting will happen, but `gasplit` will make predictions at the (presumably new) values of `data`, based on the parameters of the previous fit.
`...`	for `gasplit2` only, additional args to `mgcv::gam` (qv), eg `knots`.

Details

Given parameters beta and a model-matrix X from the formula, the log-likelihood of observation i with response y_i is:

  ppn_i*d1( y_i) + (1-ppn_i)*d2( y_i)

where ppn_i = inv.logit( sum_j{ X_{i,j} *beta_j}).

Value

If predict_from_previous is supplied (and data, of course), gasplit and gasplit2 both return predicted proportions at each row of data. Otherwise, gasplit returns a list with elements par (raw parameter estimates from optim) and ppn, which is a vector of estimated proportion-1 for each observation. If the "strata" are discrete, there will be lots of identical ppn values, but not necessarily the case. gasplit2 returns the RTMB object (from MakeADFun) after fitting. A few extra components are added: G (from gam setup), beta (underlying params; could perhaps be used for variance computations via delta-method and hessian and blah-blah) and ppn (fitted values), plus convergence, message, and evaluations from nlminb. Calling <obj>$report() on the output will give a list with beta and ppn. If gasplit2 is called without any s() terms, then the internal details of the RTMB object will be a bit different, but the extra components will be the same.

Examples

?test_gasplit
?test_gasplit

Simulate dataset for 'gasplit'

Description

make_fake_ppns simulates a dataset for fitting with gasplit, using two factors Y and Z and giving a response variable LGLR (so the strata are all combinations of Y and Z). You can control the number of factor-levels, distributions of LGLR for category-1 and category-2 observations (and thus the "power" of LGLR for assignment), the number of samples per stratum, the average strength of each Y-effect and Z-effect, the average strength of Y*Z interactions, etc. It requires the offarray package (purely for my programming convenience; it's not apparent in the output).

test_gasplit fits a gasplit model (with no interactions) to the output of make_fake_ppns, running the latter itself if required. The output is the (true and) fitted proportions per strata, which can be compared: see Examples. Note that it's a bit harsh to simulate with interpow>0 and then test with gasplit, coz the latter assumes there's no interactions. (Of course, you can fit a model with interactions if you want.)

Usage

make_fake_ppns(nyears = 4, nzones = 3, interpow = 0.1,
    df_t = 5, mean_nsamp = 100,  meanE = 1, prange = 1, seed = 2)
test_gasplit(
    sim=NULL,
    formulalala= LGLR ~ Y+Z-1,
    use2= 's' %in% all.names( formula),
    ...
  )
make_fake_ppns(nyears = 4, nzones = 3, interpow = 0.1,
    df_t = 5, mean_nsamp = 100,  meanE = 1, prange = 1, seed = 2)
test_gasplit(
    sim=NULL,
    formulalala= LGLR ~ Y+Z-1,
    use2= 's' %in% all.names( formula),
    ...
  )

Arguments

`nyears`, `nzones`	number of factor levels
`interpow`	average strength of an interaction relative to a main effect
`mean_nsamp`	actually there are exactly this many total samples from each stratum
`df_t`, `meanE`	`df_t` degrees-of-freedom for the t-distributions of response by category, which will have variance 1 and means `+meanE` and `-meanE`. A `meanE` of 1 corresponds to pretty poor separation; 3 is pretty good.
`prange`	how strong the effects should be. If they are too strong, then
`seed`	random number seed. The system seed is restored on completion.
`sim`, `...`	(test_gasplit) `sim` should come from a previous call to `make_fake_ppns`. If not supplied, then `make_fake_ppns` will be run first, using the `...` arguments.
`formulalala`	what model to fit. The default corresponds to the simulation in `make_fake_ppns`. You can try putting smoothers in there, to test `gasplit2`.
`use2`	if TRUE, force `test_gasplit` to fit with `gasplit2` rather than `gasplit`. Normally auto-deduced based on whether `formulalala` contains a call to `s` (but the test is not completely reliable, if you care). Even without a smoother, `gasplit` and `gasplit2` should give the same result.

Value

make_fake_ppns returns a dataframe that has all the stuff you need, plus an attribute truth which is a list containing the real dope, as stratum-specific offarrays. Note that truth$samp_ppnE is the actual ppn of category-1 ("E") samples in each stratum, which will differ from the nominal value given by the factor strengths and interactions because of binomial sampling variability. test_gasplit returns a great long list of Stuff. See Examples and use yr noddle!

Examples

if( require( offarray)){
sim2 <- make_fake_ppns(
    meanE= -2, # not-brilliant separation
    mean_nsamp=200, # reasonable samp sizes
    interpow=0) # no interactoins
hist( sim2$LGLR, nc=40) # to see how good the separation is
test2 <- test_gasplit( sim2)
with( test2, plot( tru_ppnE, fit_ppnE))
abline( 0,1)
test2b <- test_gasplit( sim2, use2=TRUE)
summary( test2$fit_ppnE - test2b$fit_ppnE) # close to 0...
# Make up some new data...
splodge <- with( test2, simpure[ 1:7,])
# ... and predict. The result of the original call to gasplit() is
# stored in test2$gg1
splodge_pred <- gasplit( data=splodge, predict_from_previous=test2$gg1)
# ... for normal non-test situations, do something more like this:
if( FALSE){
  pfit <- gasplit( myform, mydata)
  ppred <- gasplit( data=newmydata, predict_from_previous=pfit)
}
}
if( require( offarray)){
sim2 <- make_fake_ppns(
    meanE= -2, # not-brilliant separation
    mean_nsamp=200, # reasonable samp sizes
    interpow=0) # no interactoins
hist( sim2$LGLR, nc=40) # to see how good the separation is
test2 <- test_gasplit( sim2)
with( test2, plot( tru_ppnE, fit_ppnE))
abline( 0,1)
test2b <- test_gasplit( sim2, use2=TRUE)
summary( test2$fit_ppnE - test2b$fit_ppnE) # close to 0...
# Make up some new data...
splodge <- with( test2, simpure[ 1:7,])
# ... and predict. The result of the original call to gasplit() is
# stored in test2$gg1
splodge_pred <- gasplit( data=splodge, predict_from_previous=test2$gg1)
# ... for normal non-test situations, do something more like this:
if( FALSE){
  pfit <- gasplit( myform, mydata)
  ppred <- gasplit( data=newmydata, predict_from_previous=pfit)
}
}

Package 'gasplit'

Help Index

Flexible models for 2-category proportions from continuous data

Description

Usage

Arguments

Details

Value

See Also

Examples

Simulate dataset for 'gasplit'

Description

Usage

Arguments

Value

See Also

Examples