| Title: | Flexible estimation of 2-category proportions |
|---|---|
| Description: | Like a GLM/GAM where the true category of each sample is unknown, but the response data is a continuous variable with known distributions for category-1 and category-2 samples. |
| Authors: | Mark Bravington <[email protected]> |
| Maintainer: | Mark Bravington <[email protected]> |
| License: | GPL-3 |
| Version: | 1.4.16 |
| Built: | 2026-06-05 11:50:44 UTC |
| Source: | https://github.com/markbravington/gasplit |
example_prep4missy is a "template" showing how to prep a dataset with (one) sometimes-missing covariate (ie field) into a form suitable for gasplit_missy (qv). You could adapt example_prep4missy to your own situation.
There is an example dataset babydat just to demonstrate the prep (you can't use it to fit a model, coz the d1/d2 distributions are missing). The covariate in question is Sex, taking values "F", "M", or NA. Someone has already fitted a model which predicts the probability that a case will be Female ("F") based on its other covariates (but not on whatever the ultimate response variable will be!); that prediction has been stored in the field PrFem_yl. Cases where Sex is known are left untouched; cases where Sex is NA are duplicated (since there are only two possible values for this covariate), with one copy getting Sex="F", and the other Sex="M"; the field PROBIMP is set accordingly based on PrFem_yl. The MULTIMP field is used to link together the multiple versions of each case.
Remember that you have to develop whatever submodel for probabilities of the alternative values that your missing covariates might take! If you've got missing values on continuous covariate, rather than a discrete one like Sex, my recommendation is to add rows based on a few quantiles of the covariate (which will potentially be at different values for each case), and give them all equal weight (ie in the analog of PrFem_yl). I'd start by trying 5 quantiles (0.1; 0.3; ...; 0.9), but you could increase & see if it matters.
example_prep4missy( data)example_prep4missy( data)
data |
data.frame as per description. |
babydat # with some missing values for 'Sex' example_prep4missy( babydat)babydat # with some missing values for 'Sex' example_prep4missy( babydat)
Suppose your data each come from one or the other of two categories (say, Eastern or Western Atlantic). You don't know for sure which one, but you do have a "measurement" for each one (e.g. some kind of log-likelihood ratio, but it could be anything) which tends to be high for one category and low for the other, with an a-priori-known distribution for each category (say, based on other samples of known category).
Now you want to estimate the proportion of category-1 samples in your data, but stratified according to covariates of each sample (say, year and location)— and/or simply dependent on individual covariates, (say, size). Ideally, you'd like to be able to specify a flexible model formula just as you might in a GLM or GAM. Well, now you can! For example:
addmod <- gasplit( DISCRIM ~ factor(year) + factor(region) - 1,
data= ew_data,
d1= known_distro_of_measurement_for_category_1,
d2= ditto_for_2
)
to fit a two-way model with no interactions, to a dataset with covariates "year" and "region" based on a measurement "DISCRIM".
Having fitted your model, you can make a subsequent call to predict the proportion for (presumably) new covariate values, via the predict_from_previous argument. And, provided that your "measurement" really is a log-likelihood ratio, you can also calculate a posterior probability (of being category-1) for each case, using posterior.
If you want to use smoothers, i.e. s() terms a la mgcv (including non-sparse random effects handled via s(bs="re")), then you should call gasplit2 instead. The latter requires package RTMB, in order to estimate smoothing parameters/random-effect variances. Results should be the same as gasplit if there are no smooth terms nor random effects; IDNK which routine is faster. NDIRC (Nor Do I Really Care.) If your covariates have missing values (MAR), you can do "full information maximum likelihood" using gasplit_missy, described next.
See Examples of test_gasplit (qv) for practical stuff. I should update that for smoothers and missy.
gasplit_missy can handle Missing-At-Random (MAR) values in covariates via "full information maximum likelihood", though you have to have already done your homework by fitting sub-models that predict the probabilities of the different possible covariate values for each case. The covariate(s) should be intrinsically discrete, or reasonably discretizable; if they are continuous, a good approximation would be use a few quantiles of their conditional distribution.
Your data will be a copy of the original dataset with extra rows added (one or more extra rows each case with missing covariates), and two extra columns, by default assumed to be called MULTIMP and PROBIMP though you can change that. Their meaning (which is very simple) is explained in the next paragraph. You might find it helpful to look at an example in example_prep4missy (qv), and the test dataset babydat. (It's not for use in actually fitting gasplit_missy— just showing how the data prep works.)
MULTIMP— a misnomer really, since this is actually FIML not "multiple implementation"— should hold the "case number", so that a missing-covariate case will appear several times with the same value for some of its columns (including MULTIMP and the response variable), but different values for the missing covariate(s) in each appearance (which I might loosely call each "imputation"). The probabilities of those alternative covariate values are in PROBIMP, from that submodel(s) you've fitted (which will usually be conditional on the covariates that are observed). Cases with no (relevant) missing covariates should have a unique MULTIMP so they only appear once, and a PROBIMP of 1. MULTIMP does not have to contain numbers; any unique case-identifier will do. Fitted values are only returned for the first occurence of each case, but take into account all "imputations" of that case. Predictions on new data don't allow for missing covariates, and the new data does not need MULTIMP or PROBIMP fields.
If there are cases where several covariates have missing values, then in principle you should produce extra rows for all combinations of possible values for that case. That could get messy for you, but fortunately it's not my problem; you might decide to do Multiple-Imputation instead (though it is less statistically efficient, and more fiddly).
I've tried to support most mgcv features that I more-or-less understand. For example, I think gasplit2 can cope with s(...,id=...). However:
I'm not sure how robust my solutions are;
there are lots of things in mgcv that I don't understand (and haven't used) which may not work at all.
An example of something that might well misfire, is trying to fix some smoothing parameters at chosen values. Let me know how y'all go...
There's deffo no support for sparsity.
gasplit( formula, data, d1, d2, start=0.001, predict_from_previous= NULL, dbeta= FALSE ) gasplit2( formula, data, d1, d2, predict_from_previous= NULL, dbeta= FALSE, ... # for gam() ) gasplit_missy( formula, data, d1, d2, link_field = "MULTIMP", prob_field = "PROBIMP", predict_from_previous= NULL, dbeta= FALSE, ... # for gam() )gasplit( formula, data, d1, d2, start=0.001, predict_from_previous= NULL, dbeta= FALSE ) gasplit2( formula, data, d1, d2, predict_from_previous= NULL, dbeta= FALSE, ... # for gam() ) gasplit_missy( formula, data, d1, d2, link_field = "MULTIMP", prob_field = "PROBIMP", predict_from_previous= NULL, dbeta= FALSE, ... # for gam() )
formula |
a classic R model formula (eventually allowing |
data |
a dataframe... |
d1, d2
|
these are functions giving the PDF of the response variable for category-1 and category-2 samples. You gotta pre-fit those to other data. See Examples. |
start |
(just |
link_field, prob_field
|
what names the missing-data columns columns have. Probably simpler to make sure your dataset has them under the default names. |
predict_from_previous |
Optional result from a previous call to |
dbeta |
whether to calculate derivatives of each prediction WRTO the parameters (mainly for subsequent variance calculations). |
... |
for |
Given parameters beta and a model-matrix X from the formula, the log-likelihood of observation i with response y_i is:
log( ppn_i*d1( y_i) + (1-ppn_i)*d2( y_i))
where ppn_i = inv.logit( sum_j{ X_{i,j} *beta_j}). The log-likelihood extracts the constant factor d2(y_i), to leave (in the non-missy case)
log( 1 + ppn_i*(d1_i/d2_i-1)) + const
In a general setting, when a covariate z_i is missing for case i but other covariates x_i may be observed, the FIML log-lik of an observation y_i is
log( sum_z{ Pr[y_i|x_i,z] * Pr[Z_i=z|x_i]})
where the sum ranges over possible values z that the random variable Z_i might take, and the second prob is your "submodel". The point here is that the weighted sum is (and should be) applied to the unlogged probabilities, before the whole lot gets logged finally. That's why the code needs to be adapted.
For our particular application, this becomes:
log( sum_z{ (ppn_i(z)*d1( y_i) + (1-ppn_i(z))*d2( y_i)) * prob(z_i=z)}
= log( 1 + (d1_i/d2_i-1) * sum_z{ ppn_i(z) * prob( z_i=z)}
with some obvious & harmless abuse of notation.
Smoothers and random-effects add "penalties" (ie priors) on the coefficients in the usual mgcv way, and RTMB computes the automated Laplace approximation required for consistent estimation of smoopars/RE-variances.
If predict_from_previous is supplied (and data, of course), all three functions return predicted proportions at each row of data. If dbeta==TRUE then the result will be not a vector, but a matrix whose first column is the point estimate, the remaining columns being its derivatives WRTO the elements of beta. You can use that to get standard errors on the predictions, something like this:
t( ppn[, -1]) %**% V_beta %**% ppn[, -1]
|
Otherwise, the functions return a list, with various elements including at least:
beta, SE_beta, V_beta
|
for the raw paraameter estimates; |
ppn |
estimated proportions, casewise (or a matrix if |
G |
setup for the underlying |
plus various convergence diagnostics. gasplit2 and gasplit_missy also returns $obj which is the RTMB object (from MakeADFun) after fitting, plus the estimated "outer parameters" (smoothing parameters and/or random-effect variances) as outerpars (NB probably on log scale). The latter are treated as fixed when computing the coefficient Hessian, i.e. there's no Bayes-Empirical-Bayes correction.
gasplit_missy also returns the link_field and prob_field arguments, plus a 3-column matrix $split, with columns "PPN", <link>, and <prob> (the latter two set by link_field and prob_field). This corresponds row-wise to the input dataframe (and to $G$X), so that a single case might be split into several imputations. It is used by posterior. Note that the casewise proportions $ppn has just row per case.
test_gasplit, example_prep4gasplit
?test_gasplit ?example_prep4gasplit?test_gasplit ?example_prep4gasplit
make_fake_ppns simulates a dataset for fitting with gasplit, using two factors Y and Z and giving a response variable DISCOSTAT; the strata comprise all combinations of Y and Z. You can control the number of factor-levels, distributions of DISCOSTAT for category-1 and category-2 observations (and thus the "power" of DISCOSTAT for assignment), the number of samples per stratum, the average strength of each Y-effect and Z-effect, the average strength of Y*Z interactions, etc. It requires the offarray package (purely for my programming convenience; it's not apparent in the output).
test_gasplit fits a gasplit model (with no interactions) to the output of make_fake_ppns, running the latter itself if required. The output is the (true and) fitted proportions per strata, which can be compared: see Examples. Note that it's a bit harsh to simulate with interpow>0 and then test with gasplit, coz the latter assumes there's no interactions. (Of course, you can fit a model with interactions if you want.)
make_fake_ppns(nyears = 4, nzones = 3, interpow = 0.1, df_t = 5, mean_nsamp = 100, meanE = 1, prange = 1, seed = 2) test_gasplit( sim=NULL, formulalala= DISCOSTAT ~ Y+Z-1, use2= 's' %in% all.names( formula), ... )make_fake_ppns(nyears = 4, nzones = 3, interpow = 0.1, df_t = 5, mean_nsamp = 100, meanE = 1, prange = 1, seed = 2) test_gasplit( sim=NULL, formulalala= DISCOSTAT ~ Y+Z-1, use2= 's' %in% all.names( formula), ... )
nyears, nzones
|
number of factor levels |
interpow |
average strength of an interaction relative to a main effect |
mean_nsamp |
actually there are exactly this many total samples from each stratum |
df_t, meanE
|
|
prange |
how strong the effects should be. If they are too strong, then |
seed |
random number seed. The system seed is restored on completion. |
sim, ...
|
(test_gasplit) |
formulalala |
what model to fit. The default corresponds to the simulation in |
use2 |
if TRUE, force |
make_fake_ppns returns a dataframe that has all the stuff you need, plus an attribute truth which is a list containing the real dope, as stratum-specific offarrays. Note that truth$samp_ppnE is the actual ppn of category-1 ("E") samples in each stratum, which will differ from the nominal value given by the factor strengths and interactions because of binomial sampling variability.
test_gasplit returns a great long list of Stuff. See Examples and use yr noddle!
if( require( offarray)){ sim2 <- make_fake_ppns( meanE= +2, # not-brilliant separation mean_nsamp=200, # reasonable samp sizes interpow=0) # no interactoins hist( sim2$DISCOSTAT, nc=40) # to see how good the separation is test2 <- test_gasplit( sim2) with( test2, plot( tru_ppnE, fit_ppnE)) abline( 0,1) test2b <- test_gasplit( sim2, use2=TRUE) summary( test2$fit_ppnE - test2b$fit_ppnE) # close to 0... # Make up some new data... splodge <- with( test2, simpure[ 1:7,]) # ... and predict. The result of the original call to gasplit() is # stored in test2$gg1 splodge_pred <- gasplit( data=splodge, predict_from_previous=test2$gg1) # ... for normal non-test situations, do something more like this: if( FALSE){ pfit <- gasplit( myform, mydata) ppred <- gasplit( data=newmydata, predict_from_previous=pfit) } # See also *Examples* for 'posterior' }if( require( offarray)){ sim2 <- make_fake_ppns( meanE= +2, # not-brilliant separation mean_nsamp=200, # reasonable samp sizes interpow=0) # no interactoins hist( sim2$DISCOSTAT, nc=40) # to see how good the separation is test2 <- test_gasplit( sim2) with( test2, plot( tru_ppnE, fit_ppnE)) abline( 0,1) test2b <- test_gasplit( sim2, use2=TRUE) summary( test2$fit_ppnE - test2b$fit_ppnE) # close to 0... # Make up some new data... splodge <- with( test2, simpure[ 1:7,]) # ... and predict. The result of the original call to gasplit() is # stored in test2$gg1 splodge_pred <- gasplit( data=splodge, predict_from_previous=test2$gg1) # ... for normal non-test situations, do something more like this: if( FALSE){ pfit <- gasplit( myform, mydata) ppred <- gasplit( data=newmydata, predict_from_previous=pfit) } # See also *Examples* for 'posterior' }
After fitting a gasplit model to estimate the prior probability of whateverness by covariates, you can also calculate a per-observation posterior probability.
The posteriors are also estimates, since they depend on the estimated model coefficients beta. You can also get their derivatives WRTO the coefficients (useful for downstream variance propagation).
Missing data can be handled a la gasplit_missy (qv) by setting the link_field and prob_field arguments. This means the outputs are aggregated to case level with weighted sums over the imputations (which I think is the correct approach). It is most likely to be useful for extracting posteriors from an existing model that was fitted by gasplit_missy (when it will be automatic), rather than for brand-new predictions. But, whatever floats your modeling boat.
posterior( object, newdata = NULL, dbeta = FALSE, link_field= object$link_field %||% NA, prob_field= object$prob_field %||% NA )posterior( object, newdata = NULL, dbeta = FALSE, link_field= object$link_field %||% NA, prob_field= object$prob_field %||% NA )
object |
from previous |
newdata |
default is to use the original data, but it can accept a dataframe with the necessary covariates. If |
dbeta |
if TRUE, return derivs of the posterior probs WRTO the params |
link_field, prob_field
|
optional for "multiple imputation" with missing data; see |
If dbeta==FALSE, a vector of estimated posterior probs, one per case. If not, then a matrix where the first column is the posteriors, and the remaining columns are the derivs WRTO each component of beta.
sim2 <- make_fake_ppns( meanE= +2, # not-brilliant separation mean_nsamp=200, # reasonable samp sizes interpow=0) # no interactoins hist( sim2$DISCOSTAT, nc=40) test <- test_gasplit( sim2) with( test, plot( tru_ppnE, fit_ppnE)) abline( 0,1) # Posteriors, and their derivs posty <- posterior( test$gg1, dbeta=TRUE) # Base-R does not define logit()--- super-annoying. Sigh (again). plot( sim2$DISCOSTAT, mvbutils::logit( posty[,1]), pch='.') abline( 0,1,col='red') lm( posty[,1] ~ test$gg1$ppn) # 0, 1 pretty much posty[1,2] # d postprob[1] / dparam[2] # ... Points with high DISCOSTAT tend to pushed upwards in the posterior # ... which makes sense, coz if you've got high DISCOSTAT, your neighbours prolly will too # ... so your prior would be pushed upsim2 <- make_fake_ppns( meanE= +2, # not-brilliant separation mean_nsamp=200, # reasonable samp sizes interpow=0) # no interactoins hist( sim2$DISCOSTAT, nc=40) test <- test_gasplit( sim2) with( test, plot( tru_ppnE, fit_ppnE)) abline( 0,1) # Posteriors, and their derivs posty <- posterior( test$gg1, dbeta=TRUE) # Base-R does not define logit()--- super-annoying. Sigh (again). plot( sim2$DISCOSTAT, mvbutils::logit( posty[,1]), pch='.') abline( 0,1,col='red') lm( posty[,1] ~ test$gg1$ppn) # 0, 1 pretty much posty[1,2] # d postprob[1] / dparam[2] # ... Points with high DISCOSTAT tend to pushed upwards in the posterior # ... which makes sense, coz if you've got high DISCOSTAT, your neighbours prolly will too # ... so your prior would be pushed up