Package 'gasplit'

Title: Flexible estimation of 2-category proportions
Description: Like a GLM/GAM where the true category of each sample is unknown, but the response data is a continuous variable with known distributions for category-1 and category-2 samples.
Authors: Mark Bravington <markb2@summerinsouth.net>
Maintainer: Mark Bravington <markb2@summerinsouth.net>
License: CC-BY
Version: 1.1.1
Built: 2025-02-27 03:00:45 UTC
Source: https://github.com/markbravington/gasplit

Help Index


Flexible models for 2-category proportions from continuous data

Description

Suppose your data each come from one or the other of two categories (say, Eastern or Western Atlantic). You don't know for sure which one, but you do have a "measurement" for each one (say, a log-likelihood ratio) which tends to be high for one category and low for the other, with an a-priori-known distribution for each category (say, based on other samples of known category).

Now you want to estimate the proportion of category-1 data in your data, but stratified according to covariates of the data (say, year and location)— and/or simply dependent on individual covariates, (say, size). Ideally, you'd like to be able to specify a flexible model formula just as you might in a GLM or GAM. Well, now you can! For example:

  addmod <- gasplit( LGLR ~ year + region - 1, data= ew_data,
      d1= known_distro_of_measurement_for_category_1,
      d2= ditto_for_2
    )

to fit a two-way model with no interactions, to a dataset with covariates "year" and "region" based on a measurement "LGLR".

Having fitted your model, you can make a subsequent call to predict the proportion for (presumably) new covariate values, via the predict_from_previous argument..

If you want to use smoothers, i.e. s() terms a la mgcv, then you should call gasplit2 instead. The latter requires RTMB. Results should be the same as gasplit if there are no smooth terms; IDNK which one is faster. NDIRC (Nor Do I Really Care.)

See Examples of test_gasplit (qv) for practical stuff.

Usage

gasplit(formula, data, d1, d2, start=0.001,
    predict_from_previous=NULL)
gasplit2(
  formula,
  data,
  d1, d2,
  predict_from_previous= NULL,
  ... # for gam()
)

Arguments

formula

a classic R model formula (eventually allowing mgcv extensions)

data

a dataframe...

d1, d2

these are functions giving the PDF of the response variable for category-1 and category-2 samples. You gotta pre-fit those to other data. See Examples.

start

(just gasplit, not gasplit2). Either a scalar (probably close to 0) which will be used for all coefficients, or a vector with length ncol(<model matrix>). TBH I'd leave it blank; the default will prolly work.

predict_from_previous

Optional result from a previous call to gasplit. If set, then no fitting will happen, but gasplit will make predictions at the (presumably new) values of data, based on the parameters of the previous fit.

...

for gasplit2 only, additional args to mgcv::gam (qv), eg knots.

Details

Given parameters beta and a model-matrix X from the formula, the log-likelihood of observation i with response y_i is:

  ppn_i*d1( y_i) + (1-ppn_i)*d2( y_i)

where ppn_i = inv.logit( sum_j{ X_{i,j} *beta_j}).

Value

If predict_from_previous is supplied (and data, of course), gasplit and gasplit2 both return predicted proportions at each row of data. Otherwise, gasplit returns a list with elements par (raw parameter estimates from optim) and ppn, which is a vector of estimated proportion-1 for each observation. If the "strata" are discrete, there will be lots of identical ppn values, but not necessarily the case. gasplit2 returns the RTMB object (from MakeADFun) after fitting. A few extra components are added: G (from gam setup), beta (underlying params; could perhaps be used for variance computations via delta-method and hessian and blah-blah) and ppn (fitted values), plus convergence, message, and evaluations from nlminb. Calling <obj>$report() on the output will give a list with beta and ppn. If gasplit2 is called without any s() terms, then the internal details of the RTMB object will be a bit different, but the extra components will be the same.

See Also

test.gasplit

Examples

?test_gasplit

Simulate dataset for 'gasplit'

Description

make_fake_ppns simulates a dataset for fitting with gasplit, using two factors Y and Z and giving a response variable LGLR (so the strata are all combinations of Y and Z). You can control the number of factor-levels, distributions of LGLR for category-1 and category-2 observations (and thus the "power" of LGLR for assignment), the number of samples per stratum, the average strength of each Y-effect and Z-effect, the average strength of Y*Z interactions, etc. It requires the offarray package (purely for my programming convenience; it's not apparent in the output).

test_gasplit fits a gasplit model (with no interactions) to the output of make_fake_ppns, running the latter itself if required. The output is the (true and) fitted proportions per strata, which can be compared: see Examples. Note that it's a bit harsh to simulate with interpow>0 and then test with gasplit, coz the latter assumes there's no interactions. (Of course, you can fit a model with interactions if you want.)

Usage

make_fake_ppns(nyears = 4, nzones = 3, interpow = 0.1,
    df_t = 5, mean_nsamp = 100,  meanE = 1, prange = 1, seed = 2)
test_gasplit(
    sim=NULL,
    formulalala= LGLR ~ Y+Z-1,
    use2= 's' %in% all.names( formula),
    ...
  )

Arguments

nyears, nzones

number of factor levels

interpow

average strength of an interaction relative to a main effect

mean_nsamp

actually there are exactly this many total samples from each stratum

df_t, meanE

df_t degrees-of-freedom for the t-distributions of response by category, which will have variance 1 and means +meanE and -meanE. A meanE of 1 corresponds to pretty poor separation; 3 is pretty good.

prange

how strong the effects should be. If they are too strong, then

seed

random number seed. The system seed is restored on completion.

sim, ...

(test_gasplit) sim should come from a previous call to make_fake_ppns. If not supplied, then make_fake_ppns will be run first, using the ... arguments.

formulalala

what model to fit. The default corresponds to the simulation in make_fake_ppns. You can try putting smoothers in there, to test gasplit2.

use2

if TRUE, force test_gasplit to fit with gasplit2 rather than gasplit. Normally auto-deduced based on whether formulalala contains a call to s (but the test is not completely reliable, if you care). Even without a smoother, gasplit and gasplit2 should give the same result.

Value

make_fake_ppns returns a dataframe that has all the stuff you need, plus an attribute truth which is a list containing the real dope, as stratum-specific offarrays. Note that truth$samp_ppnE is the actual ppn of category-1 ("E") samples in each stratum, which will differ from the nominal value given by the factor strengths and interactions because of binomial sampling variability. test_gasplit returns a great long list of Stuff. See Examples and use yr noddle!

See Also

gasplit

Examples

if( require( offarray)){
sim2 <- make_fake_ppns(
    meanE= -2, # not-brilliant separation
    mean_nsamp=200, # reasonable samp sizes
    interpow=0) # no interactoins
hist( sim2$LGLR, nc=40) # to see how good the separation is
test2 <- test_gasplit( sim2)
with( test2, plot( tru_ppnE, fit_ppnE))
abline( 0,1)
test2b <- test_gasplit( sim2, use2=TRUE)
summary( test2$fit_ppnE - test2b$fit_ppnE) # close to 0...
# Make up some new data...
splodge <- with( test2, simpure[ 1:7,])
# ... and predict. The result of the original call to gasplit() is
# stored in test2$gg1
splodge_pred <- gasplit( data=splodge, predict_from_previous=test2$gg1)
# ... for normal non-test situations, do something more like this:
if( FALSE){
  pfit <- gasplit( myform, mydata)
  ppred <- gasplit( data=newmydata, predict_from_previous=pfit)
}
}