What is ‘item sampling’ and why is it useful? Or: how can we measure person characteristics with large item pools?

I guess every social scientist has at one point been confronted with a quite common dilemma: We know that we would need a fairly large number of items to assess a certain latent concept comprehensively and validly, but we do not have enough space in our questionnaire to use the entire long scale or even inventory. More often than not, we then choose the “best” items (e.g., those with the highest factor loadings). However, by reducing the number of items, we are likely reducing the scale’s validity as well. After all, we reduce the diversity and breadth of the manifest indicators that make up the scale.

As an alternative to choosing a static subset of a scale that we administer to all participants in the exact same way, we can use so-called item sampling. Here, the idea is to randomly select a subset of items for each respondent from a larger pool. This way, each participant answers a different subset of items. Due to the randomness in this sampling procedure, we can still gather sufficient information overall to accurately estimate the latent trait for each individual participant and impute any missing responses. The question is, however, how many missing values per person are acceptable? And to what extent is the quality of the imputed scale contingent on the size of the item pool, the size of the sample, and the original quality of the scale (e.g., as expressed by reliability indices such as Cronbach’s alpha)?

Statistical background

Classical Test Theory (CTT) offers a framework in which each observed score (𝑋) is conceptualized as the sum of a true score (𝑇) and an error score (𝑒):

$$X = T + e$$

In this context, the true score represents the actual level of the latent trait, and the error score encompasses all item-specific variance, measurement errors, or random noise.
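
To make this decomposition concrete, here is a minimal simulation sketch (the variances are arbitrarily chosen, purely for illustration):

set.seed(1)
true_score <- rnorm(10000, mean = 0, sd = 1)    # T: latent true scores
error      <- rnorm(10000, mean = 0, sd = 0.5)  # e: independent measurement error
x          <- true_score + error                # X = T + e

# The observed variance is (approximately) the sum of both components
var(x)
var(true_score) + var(error)

# Share of observed variance due to the true score
# (this ratio is what CTT calls reliability, ~.80 here)
var(true_score) / var(x)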

CTT defines reliability as the proportion of variance in the observed scores that is attributable to the true scores. When items are randomly sampled, reliability can still be maintained by ensuring that the items are representative of the entire item pool and that the sampling process introduces minimal bias. The basic idea of item sampling is thus as follows:

  • Random Selection of Items: Instead of administering the entire set of items (or a fixed subset) to every respondent, a small, randomly selected subset is given to each respondent, resulting in unique short scales per participant. This reduces the burden on respondents and shortens the questionnaire.
  • Imputation of Missing Data: Using imputation techniques (e.g., full information maximum likelihood in SEM), the responses to the unadministered items can be accurately imputed based on the administered subset. This approach leverages the correlations among items and respondents’ overall response patterns.

Despite the reduction in the number of items per respondent, the randomness and overlap between different respondents’ subsets ensure that the overall information about the latent trait is preserved. Properly implemented item sampling thus maintains the reliability and validity of the measurement while significantly improving the efficiency of data collection.
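
As a minimal illustration of the random selection step, here is a quick sketch (with hypothetical numbers: a pool of 10 items of which each respondent answers 4):

set.seed(42)
k   <- 10  # size of the (hypothetical) item pool
k_s <- 4   # items administered per respondent

# Each row holds the item indices one respondent actually gets to see;
# all other items remain unanswered (NA) and are imputed later on.
t(replicate(6, sort(sample(1:k, k_s))))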

Mathematical derivation

Given a large item pool, if we denote the total number of items by 𝑘 and the number of items sampled for each respondent by 𝑘𝑠, the reliability of the item-sampled test can be approximated by the Spearman-Brown prophecy formula (see e.g., de Vet et al., 2017)

$$\alpha_{new} = \frac{n*\alpha}{1+(n-1)*\alpha}$$

where:

  • 𝛼𝑛𝑒𝑤 is the “new” reliability after item sampling
  • 𝑛 = 𝑘𝑠/𝑘 is the ratio of sampled items to total items
  • 𝛼 is the original reliability of the full item pool

By ensuring that 𝑘𝑠 is sufficiently large, the reliability of the scale can be preserved to a certain degree. But how large should those two parameters be? Let’s compute the reliability after sampling 5 items from a pool of 10, given an alpha of .9:

# Set parameters
alpha <- .9
k_s <- 5
k <- 10

# Formula from above
new_alpha <- ((k_s/k)*alpha)/(1+((k_s/k)-1)*alpha)
round(new_alpha, 2)
## [1] 0.82

As we can see, the resulting reliability is still .82 (compared to the original .90), which may be completely sufficient in some contexts. Now, let’s explore the relationship between the ratio of sampled to total items and alpha more systematically. Yet, so far, we did not account for the impact of sample size. Further, the above formula only tells us whether the reliability can be preserved. What we would also like to know is whether the validity is preserved (e.g., whether the correlation between the short and the full scale is large).

The impact of sample size can be included by considering the standard error of the reliability estimate and the standard error of the correlation. Larger sample sizes typically result in more precise estimates of reliability and validity:

$$SE_{\alpha} = \frac{1-\alpha}{\sqrt{N}}$$

where:

  • 𝛼 is the reliability of the reduced scale
  • 𝑁 is the sample size

Similarly, the standard error of the correlation (SEr) between two sets of scores can be approximated by:

$$SE_r = \frac{1-r^2}{\sqrt{N}}$$

where:

  • 𝑟 is the estimated correlation between the two scales
  • 𝑁 is the sample size

These standard errors indicate the precision of our reliability and correlation estimates. While the formulas used for the reliability and validity provide point estimates, the standard errors derived here show how these estimates would vary with different sample sizes.

Systematic approach

So let’s put all these formulas into an R function:

library(tidyverse)

# Custom function to compute reliability and correlation
reli_new <- function(alpha, k_s, k, N = 100) {
  
  # Ratio between sample items and total items
  n <- k_s/k
  
  # Spearman-Brown formula
  alpha_new <- ((n)*alpha)/(1+((n)-1)*alpha)
  
  # Standard error of alpha_new
  se_a <- (1-alpha_new)/sqrt(N)
  
  # Correlation between scales
  r <- sqrt(alpha_new/alpha)
  se_r <- (1-r^2)/sqrt(N)
  
  # Create output table
  tibble(N = N, 
         k = k, 
         k_s = k_s, 
         n = n,
         alpha = alpha, 
         alpha_new = alpha_new, 
         se_a = se_a, 
         r = r, 
         se_r = se_r)
}

# Check function
reli_new(alpha = .9, k_s = 5, k = 10, N = 100)
## # A tibble: 1 × 9
##       N     k   k_s     n alpha alpha_new   se_a     r    se_r
##   <dbl> <dbl> <dbl> <dbl> <dbl>     <dbl>  <dbl> <dbl>   <dbl>
## 1   100    10     5   0.5   0.9     0.818 0.0182 0.953 0.00909

As we can see, given 𝑘𝑠 = 5 and 𝑘 = 10 (i.e., 𝑛 = .5), an alpha level of 𝛼 = .9 and a sample size of 𝑁 = 100, the reduced scale should have an alpha of 𝛼𝑛𝑒𝑤 = .82, 95%-CI[.78, .85]. Further, the reduced scale should still correlate highly with the original scale: r = .95, 95%-CI[.94, .97].
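
As a side note, these are simple normal-approximation intervals (estimate ± 1.96 standard errors), computed from the values that reli_new() returns:

est <- reli_new(alpha = .9, k_s = 5, k = 10, N = 100)

# 95% CI for the reliability of the reduced scale (roughly [.78, .85])
round(est$alpha_new + c(-1.96, 1.96) * est$se_a, 2)

# 95% CI for the correlation with the full scale (roughly [.94, .97])
round(est$r + c(-1.96, 1.96) * est$se_r, 2)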

We can now create a grid of possible parameter ranges to explore the effect of item sampling under various conditions:

# Compute new alpha across grid
results <- expand.grid(
  k   = seq(10, 100, by = 20),
  alpha = seq(.6, .9, by = .05),
  n = seq(.1, .9, by = .2),
  N = c(100, 250, 500, 750, 1000, 1250)
  ) |> 
  mutate(k_s = round(n*k, 0)) |> 
  dplyr::select(-n) |> 
  as_tibble() |> 
  pmap_dfr(reli_new)

# Plot results
results |> 
  ggplot(aes(x = alpha, y = alpha_new, ymin = alpha_new-1.96*se_a, ymax = alpha_new+1.96*se_a, color = n, group = n)) +
  geom_pointrange() +
  geom_line() +
  geom_hline(yintercept = .8, linetype = "dashed", color = "black") +
  facet_wrap(~N) +
  theme_bw() +
  labs(x = "alpha (full item pool)", 
       y = "alpha (sampled items)", 
       color = "n = ratio between \nsampled (k_s) and \ntotal items (k)",
       title = "Relationship between original and sampled alpha")

We can see that the ability to “preserve” the reliability drops significantly as the number of items per person decreases. It also drops faster if the original reliability was already low. Yet, there are some interesting insights here: For example, if the original reliability is .9, we can “afford” up to 50% missings per person and still end up with a reliability of .80. If we only use a third of the items, we can still reach reliabilities of .75, which again may be acceptable in some contexts.

results |> 
  ggplot(aes(x = alpha, y = r, ymin = r-1.96*se_r, ymax = r+1.96*se_r, color = n, group = n)) +
  geom_pointrange() +
  geom_line() +
  geom_hline(yintercept = .8, linetype = "dashed", color = "black") +
  geom_hline(yintercept = .9, linetype = "dashed", color = "black") +
  theme_bw() +
  facet_wrap(~N) +
  labs(x = "alpha (full item pool)", 
       y = "r", 
       color = "n = ratio between \nsampled (k_s) and \ntotal items (k)",
       title = "Correlation estimates between original and sample scale")

With regard to the validity of the reduced scale, we can see that for scales with a reliability of at least .8, we can afford up to 50% missings and still obtain good convergent validity with the full scale (r > .90), and up to 70% missings if a validity of r > .80 is still sufficient.

Bear in mind that this is just the result based on the formulas outlined above. We can account for sampling variability by computing the confidence intervals for these estimates given a certain sample size. Obviously, the 95% confidence intervals become wider with smaller sample sizes as we lose estimation precision.

Practically, however, more things may have to be taken into account:

  • Normal Distribution Assumption: The formulas assume that the underlying data is normally distributed, which might not be the case in reality.
  • Equal Item Parameters: The formulas assume that all items have equal difficulty and discrimination parameters, while in reality, items can vary significantly.
  • Multidimensionality: Real data might have multiple underlying factors, while the theoretical formula assumes unidimensionality.
  • Nonlinear Relationships: Real data might include non-linear relationships between items and the latent trait, which are not considered in the formulas.
  • Sampling Variability: In real data, there is inherent variability due to random sampling. This can lead to differences between theoretical expectations and observed outcomes.
  • Computation of the scales’ factor scores: The method used to handle missing data could significantly impact the results. Different imputation methods might lead to different results.

To explore these assumptions a bit further, let’s simulate some data and introduce missings per person systematically.

Simulation approach

library(MASS)
library(psych)
library(lavaan)
library(semTools)
library(mice)

Setting up custom functions to simulate data and missingness

To simulate the ideas outlined above, we first need to simulate a data set that corresponds to a typical social scientific questionnaire. More specifically, we want to simulate items of a scale with varying reliabilities. Here, I first create the correlation matrix between all items. To do so, we need to compute the average inter-item correlation implied by the alpha level we are providing. We can then generate a data set of normally distributed items that correlate in the prescribed manner using the function mvrnorm() from the MASS package. In order to obtain typical Likert-type answer options, we cut these items into 5 categories and label them 1 to 5. Finally, we need to make sure that these items are numeric.

set.seed(123)  # for reproducibility

sim_likert <- function(n_items, n_respondents, alpha, answer_options = 5){
  
  alpha <- alpha + 0.005
  
  # Create a correlation matrix with high internal consistency
  average_r <- alpha / (n_items - alpha * (n_items - 1))
  correlation_matrix <- matrix(average_r, n_items, n_items)
  diag(correlation_matrix) <- 1
  
  # Generate multivariate normal data
  normal_data <- mvrnorm(n_respondents, mu = rep(0, n_items), 
                         Sigma = correlation_matrix)
  
  # Convert the normal data to Likert-type scale (1 to 5)
  likert_data <- apply(normal_data, 2, function(x) cut(x,
                                                       breaks = quantile(x, probs = seq(0, 1, length.out = answer_options+1)),
                                                       labels = 1:answer_options,
                                                       include.lowest = TRUE))
  # Transform to numeric values
  likert_data <- as_tibble(likert_data) |> 
    mutate(across(everything(), as.numeric))
  return(likert_data)
}

# Check function
test_data <- sim_likert(n_items = 10, n_respondents = 100, alpha = .9)
test_data
## # A tibble: 100 × 10
##       V1    V2    V3    V4    V5    V6    V7    V8    V9   V10
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     3     5     4     3     4     5     3     5     3     4
##  2     1     5     3     4     3     4     3     5     2     4
##  3     1     1     1     2     2     1     1     1     1     2
##  4     3     3     2     4     5     1     1     3     4     4
##  5     4     2     4     2     3     3     2     5     2     3
##  6     1     1     1     1     1     1     2     1     1     1
##  7     2     1     3     5     3     2     1     4     3     1
##  8     5     4     5     4     5     3     5     5     3     3
##  9     3     5     4     3     5     3     5     4     5     5
## 10     4     3     3     3     3     3     5     3     5     4
## # ℹ 90 more rows

As we can see, the function creates a data set with 10 items and n = 100 participants. Let’s quickly check whether the alpha level of this scale is close enough to the specified alpha value of .90:

psych::alpha(test_data)$total
##  raw_alpha std.alpha   G6(smc) average_r      S/N        ase mean       sd
##  0.8899945 0.8899945 0.8922854 0.4472222 8.090452 0.01645043    3 1.007547
##  median_r
##      0.45

Indeed, it is very close.
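
As an aside, the average_r line in sim_likert() above simply inverts the standardized alpha formula. A quick sketch of that round trip (ignoring the small +0.005 adjustment in the function):

# Standardized alpha: alpha = k * r / (1 + (k - 1) * r)
# Solved for r:       r = alpha / (k - alpha * (k - 1))
k_items      <- 10
target_alpha <- .9
r_bar <- target_alpha / (k_items - target_alpha * (k_items - 1))
r_bar                                          # average inter-item correlation, ~.47
k_items * r_bar / (1 + (k_items - 1) * r_bar)  # recovers the target alpha of .9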

Next, we need a function that allows us to simulate the item sampling approach. We can think of a random item sampling procedure as being similar to introducing a certain number of missings per person (i.e., per row). As we want to work with a percentage of missings, we first need to transform the percentage of missings into the actual number of items that should be NA (i.e., not sampled per person). Then, we loop through all rows to create missing indices for which we will later introduce NAs.

add_random_missing_per_row <- function(data, n) {
  
  # Compute actual number of items to drop
  num_missing_per_row <- round((1-n)*length(data), 0)
  
  # Get the number of rows and columns in the dataframe
  n_rows <- nrow(data)
  n_cols <- ncol(data)
  
  # Check if the number of missing values per row is less than the number of columns
  if (num_missing_per_row >= n_cols) {
    stop("Number of missing values per row should be less than the number of columns")
  }
  
  # Loop through each row and introduce missing values
  for (i in 1:n_rows) {
    # Randomly select columns for missing values in the current row
    missing_indices <- sample(seq_len(n_cols), num_missing_per_row, replace = FALSE)
    
    # Introduce missing values
    data[i, missing_indices] <- NA
  }
  
  return(data)
}

# Check function
test_data_na <- add_random_missing_per_row(test_data, .8)
test_data_na
## # A tibble: 100 × 10
##       V1    V2    V3    V4    V5    V6    V7    V8    V9   V10
##    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1     4    NA     1    NA     2     2     2     1     1     1
##  2     3     4    NA     4     4     5    NA     3     4     4
##  3    NA     4     3     1     3     3     2     3     2    NA
##  4     2     4    NA     1     1     4     3     2    NA     3
##  5    NA     1     4     4     2     2     2    NA     3     4
##  6     2     3     5     4     3     5    NA     2    NA     2
##  7     1    NA     2     2     1     1     2     2    NA     1
##  8     5    NA     4     2    NA     4     4     4     1     3
##  9     3     1     1     1     2    NA     1     1    NA     1
## 10     1     4    NA     5     4     2     5     2    NA     3
## # ℹ 90 more rows

Here, I sample 80% of the items per row (which equates to 2 of the 10 items missing). As we can see, it has worked: each row now has two randomly placed NAs. In fact, this corresponds to an overall amount of missings in the data set of 20%:

pmmisc::count_na(test_data_na)
## # A tibble: 3 × 3
##   missings     n percent
##   <chr>    <int>   <dbl>
## 1 FALSE      800      80
## 2 TRUE       200      20
## 3 sum       1000     100
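
pmmisc is a small personal helper package; if it is not available, the same check can be done in base R:

# Overall share of missing cells (should be ~.20 here)
mean(is.na(test_data_na))

# Missing values per respondent (should be exactly 2 in every row)
table(rowSums(is.na(test_data_na)))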

Creating a custom function to fit all models and extract relevant result parameters

Next, we build a function that imputes the data with missings, fits unidimensional CFAs on both the full data set and the data set that was imputed (or, alternatively, uses full information maximum likelihood estimation as implemented in lavaan), extracts factor scores and reliability estimates from both models, and finally computes the correlation between the two sets of factor scores.

extract_results <- function(full_data, missing_data = NULL, method = "fiml") {
  
  # Impute missing data using Predictive mean matching as implemented in mice
  imputed_data <- complete(mice(missing_data, 1, printFlag = F))
  
  # Fit unidimensional CFA on the full data
  model <- paste("G =~", paste(names(full_data), collapse = " + "))
  fit_full <- cfa(model, full_data)
  
  # Extract factor score and reliability from the full model 
  full_fscore = predict(fit_full)
  alpha_sim <- reliability(fit_full)[1,1]
  
  # Fit unidimensional CFA on the missing data/imputed data
  if(method == "fiml") {
    fit_missing <- cfa(model, missing_data, missing = "fiml")
  } else {
    fit_missing <- cfa(model, imputed_data)
  }
  
  # Extract factor score and reliability from the model with missings/based on imputed data
  missing_fscore <- predict(fit_missing)
  alpha_new <- reliability(fit_missing)[1,1]
  
  # Compute correlation between factor scores
  r <- cor.test(full_fscore, missing_fscore)
  
  # Bind altogether
  tibble(alpha_sim = alpha_sim,
         alpha_new = alpha_new,
         r = r$estimate)
}

# Check function
extract_results(test_data, test_data_na, method = "mice")
## # A tibble: 1 × 3
##   alpha_sim alpha_new      r
##       <dbl>     <dbl>  <dbl>
## 1     0.871     0.871 -0.104

Using the data sets we created earlier, we can see that given 10 items, n = 100 participants, an alpha level of .90 and an amount of missings of 20%, the resulting factor scores (from the full and the imputed data set) still correlate extremely highly (.97), and the imputed scale still has a reliability of alpha = .86.

Putting it all together and simulating the item sampling approach

Now, we put it all together in one function that we can then map across a grid of different parameters.

sim_func <- function(N, k, alpha, n, method, answer_options = 5) {
  full_data <- sim_likert(n_items = k, n_respondents = N, 
                          alpha = alpha, answer_options = answer_options)
  missing_data <- add_random_missing_per_row(full_data, n = n)
  res <- extract_results(full_data, missing_data, method = method)
  tibble(N = N, k = k, k_s = round(n*k, 0), n = n, alpha = alpha, 
         method = method, a_options = answer_options) |> 
    bind_cols(res)
}

# Check sim function with mice
sim_func(N = 100, 
         k = 10,
         alpha = .9,
         n = .5, 
         answer_options = 5,
         method = "mice")
## # A tibble: 1 × 10
##       N     k   k_s     n alpha method a_options alpha_sim alpha_new     r
##   <dbl> <dbl> <dbl> <dbl> <dbl> <chr>      <dbl>     <dbl>     <dbl> <dbl>
## 1   100    10     5   0.5   0.9 mice           5     0.871     0.857 0.844

We first set up a grid that contains the parameter ranges that we want to explore. The function expand.grid() creates all combinations of the values in the grid.

# Setup simulation grid
setup <- expand.grid(
  N = c(500, 750, 1000),
  k = c(20, 40, 60),
  alpha = seq(.75,.95, by = .05),
  n = c(.3, .4, .5, .6, .7),
  method = "mice",
  answer_options = 5
)

# check setup
head(setup)
##      N  k alpha   n method answer_options
## 1  500 20  0.75 0.3   mice              5
## 2  750 20  0.75 0.3   mice              5
## 3 1000 20  0.75 0.3   mice              5
## 4  500 40  0.75 0.3   mice              5
## 5  750 40  0.75 0.3   mice              5
## 6 1000 40  0.75 0.3   mice              5

As we can see, the resulting data set includes all combinations of the input parameters. Now, we need to run our custom function sim_func() from above across all of these combinations. Although the computations per row are not very intensive, the following step requires us to impute thousands of data sets, fit many models and extract many parameters. I hence use the package furrr to parallelize the procedure. I am only doing 100 simulations per combination to save time (which already results in 22,500 iterations), but we should do 5,000 to 10,000 per parameter combination to get stable outcomes. So take any of the following results with a grain of salt.

# Setup parallelization
library(furrr)
plan(multisession, workers = 14)

# Run simulation
n_sim <- 100
if (file.exists("sim_results.RData")) {
  load("sim_results.RData")
} else {
  results <- do.call("rbind", replicate(n_sim, setup, simplify = FALSE)) |>
  future_pmap_dfr(sim_func, .progress = T)
  save(results, file = "sim_results.RData")
}

# Check results
head(results)
## # A tibble: 6 × 10
##       N     k   k_s     n alpha method a_options alpha_sim alpha_new     r
##   <dbl> <dbl> <dbl> <dbl> <dbl> <fct>      <dbl>     <dbl>     <dbl> <dbl>
## 1   500    20     6   0.3  0.75 mice           5     0.721     0.575 0.451
## 2   750    20     6   0.3  0.75 mice           5     0.755     0.668 0.545
## 3  1000    20     6   0.3  0.75 mice           5     0.698     0.656 0.524
## 4   500    40    12   0.3  0.75 mice           5     0.738     0.581 0.360
## 5   750    40    12   0.3  0.75 mice           5     0.717     0.508 0.426
## 6  1000    40    12   0.3  0.75 mice           5     0.742     0.633 0.548

The resulting data frame contains the parameter combinations and all outcomes.

Correlation / Convergent validity

In a first step, we can plot the resulting correlations between the factor score extracted from the model that was estimated on the full data set and the factor score from the model that was estimated on the imputed data set:

summarized <- results |> 
  group_by(N, k, alpha, n) |> 
  summarize(r = mean(r)) 

results |> 
  ggplot(aes(x = jitter(alpha,factor = 1.75), y = r, color = factor(n))) + 
  geom_point(alpha = .1, size = .75) +
  geom_line(mapping = aes(x = alpha, y = r, color = factor(n)), data = summarized, linewidth = 1.2) +
  geom_point(mapping = aes(x = alpha, y = r, color = factor(n)),data = summarized, size = 3) + 
  geom_hline(yintercept = .9, linetype = "dotted") +
  geom_hline(yintercept = .8, linetype = "dotted") +
  scale_color_brewer(palette = "Spectral") +
  facet_grid(rows = vars(N),
             cols = vars(k)) +
  ylim(0, 1) +
  theme_bw() +
  labs(x = "alpha", y = "Person's r", 
       color = "n")

We already see that the correlation between the factor scores decreases significantly with higher amounts of missings per person and lower reliability. Sample size and the size of the initial item pool seem to be less influential. But we can already see some interesting things here: For example, if our initial scale has a high reliability (alpha = .90), our sample is n > 500 and the amount of missings is < 60%, convergent validity is still quite high with r > .80. If the alpha level is still > .80, we still reach correlations of r > .80 if the amount of missings per person is < 50%. With higher amounts of missings or lower initial alpha levels, however, the validity of the imputed scale decreases quickly. Let’s test this statistically. We can run a simple regression model in which the parameters predict the correlation value (I am rescaling the variables here to make the coefficients more interpretable):

# Rescaling
results_r <- results |> 
  mutate(N = N/100, k = k/10, alpha = alpha*10, n = n*10)

# Regression model
m1r <- lm(r ~ N + k + n + alpha, results_r)
broom::tidy(m1r)
## # A tibble: 5 × 5
##   term        estimate std.error statistic   p.value
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept) -0.584    0.00991      -59.0 0        
## 2 N            0.00782  0.000364      21.5 3.81e-101
## 3 k           -0.0106   0.000456     -23.3 2.73e-118
## 4 n            0.0734   0.000526     140.  0        
## 5 alpha        0.119    0.00105      113.  0
# Coefficient plot
broom::tidy(m1r, conf.int = T) |> 
  slice(-1) |> 
  ggplot(aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange(color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw() +
  coord_flip() +
  labs(y = "unstd. estimate (change in r)",
       x = "")

We can see that all parameters exert an influence on the correlation. For example, sampling 10% fewer items leads to a reduction in the correlation of b = -0.07. Similarly, if the original alpha level is, e.g., .8 instead of .9, the correlation drops by b = .12. On the other hand, a sample size increase of n = 100 leads only to a small improvement of b = 0.01, and a larger original item pool (e.g., a change of k = 10 items) leads to a slightly worse correlation (b = -0.01).
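
These coefficients can also be combined into rough predictions. As a quick sketch, here is the expected correlation for a hypothetical scenario with N = 500, k = 20, alpha = .85, and 50% of the items sampled (entered on the rescaled metric the model was fit on):

# Values are entered on the rescaled metric: N/100, k/10, n*10, alpha*10
predict(m1r, newdata = data.frame(N = 5, k = 2, n = 5, alpha = 8.5))
# ~ .81 given the coefficients shown above; the exact value will vary between runs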

Recoverability of the reliability

Next, we can also investigate to what extent the imputed scale has a similar reliability as the original (full) scale:

results |> 
  ggplot(aes(x = alpha_sim, y = alpha_new, color = n, shape = factor(alpha))) +
  geom_point() +
  geom_hline(yintercept = .8, linetype = "dotted") +
  geom_hline(yintercept = .7, linetype = "dotted") +
  facet_grid(N~k) +
  theme_bw() +
  labs(x = "alpha (original)", y = "alpha (new)", 
       color = "amount of missing\nper person (%)",
       shape = "alpha level \n(input to the simulation)")

We can clearly see that with higher amounts of missings per person, the variance in the imputed reliability increases and lower reliabilities become more likely. However, we also see quite clearly that higher sample sizes lead to less reduction in reliability. Let’s again test these assumptions statistically using a standard linear regression model in which we predict the difference between the original alpha and the imputed alpha:

# Rescaling and creating outcome variable
results_alpha <- results |> 
  mutate(alpha_diff = alpha_sim - alpha_new,
         N = N/100, k = k/10, alpha = alpha*10, n = n*10)

# Regression model
m1a <- lm(alpha_diff ~ N + k + n + alpha, results_alpha)
broom::tidy(m1a)
## # A tibble: 5 × 5
##   term        estimate std.error statistic p.value
##   <chr>          <dbl>     <dbl>     <dbl>   <dbl>
## 1 (Intercept)  0.289    0.00291       99.3       0
## 2 N           -0.00547  0.000107     -51.2       0
## 3 k            0.00954  0.000134      71.4       0
## 4 n           -0.0176   0.000154    -114.        0
## 5 alpha       -0.0198   0.000309     -64.1       0
# Coefficient plot
broom::tidy(m1a, conf.int = T) |> 
  slice(-1) |> 
  ggplot(aes(x = term, y = estimate, ymin = conf.low, ymax = conf.high)) +
  geom_pointrange(color = "steelblue") +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_bw() +
  coord_flip() +
  labs(y = "unstd. estimate (difference in alpha)",
       x = "")

We see that sampling 10% fewer items per person reduces the reliability by only b = .02. Similarly, a .1 lower original alpha leads to only a b = 0.02 reduction in the imputed alpha. For the alpha recoverability, the number of items in the original item pool actually matters quite a bit. On the other hand, higher sample sizes can compensate for this, as 100 more participants reduce the alpha difference by b = 0.01.

So how does the simulation compare to the mathematical approach outlined above?

Let’s quickly compare the results of the simulation with the results of the mathematical model:

# Simulation results
results |> 
  filter(N == 500 & k == 20 & k_s == 12 & alpha == .8) |> 
  group_by(N, k, k_s, alpha) |> 
  summarize(r = mean(r),
            alpha_new = mean(alpha_new)) |> 
  dplyr::select(N, k, k_s, alpha, alpha_new, r)
## # A tibble: 1 × 6
## # Groups:   N, k, k_s [1]
##       N     k   k_s alpha alpha_new     r
##   <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl>
## 1   500    20    12   0.8     0.774 0.863
# Mathematical results
reli_new(alpha = .8, k_s = 12, k = 20, N = 500) |> 
  mutate(ll = r - 1.96*se_r,
         ul = r + 1.96*se_r) |> 
  dplyr::select(N, k, k_s, alpha, alpha_new, r, ul, ll)
## # A tibble: 1 × 8
##       N     k   k_s alpha alpha_new     r    ul    ll
##   <dbl> <dbl> <dbl> <dbl>     <dbl> <dbl> <dbl> <dbl>
## 1   500    20    12   0.8     0.706 0.939 0.950 0.929

As we can see, the simulation seems to suggest a higher alpha, but lower r for this combination of parameters. The r is also not within the 95% CIs of the mathematical solution. From my point of view, this probably shows that the mathematical formula makes some fairly idealistic/unrealistic assumptions about the composition of the scale. Yet, the general tendency seems to align between both approaches.

Conclusion and recommendations

So what does this all mean? From my point of view, the findings from these simulations indicate that item sampling is a viable strategy for efficiently measuring latent traits in social science research (e.g., scales of 10-30 Likert-type items) without significant loss of validity and reliability. Specifically, when the original scale has a high reliability (e.g., α ≥ 0.80) and the proportion of sampled items remains sufficiently large (e.g., missing data per respondent ≤ 50%), the resulting correlations between factor scores from full and imputed data sets remain robust (r > 0.80). Furthermore, higher sample sizes (n ≥ 500) and larger item pools (k ≥ 20) help mitigate any reductions in scale reliability due to item sampling.

This technique not only reduces respondent burden and shortens questionnaires but also maintains the integrity of the data collected. By leveraging the power of item sampling, we can conduct more efficient survey assessments while preserving essential psychometric properties, thereby enhancing the overall quality and feasibility of large-scale studies.

References

  • de Vet, H. C., Mokkink, L. B., Mosmuller, D. G., & Terwee, C. B. (2017). Spearman–Brown prophecy formula and Cronbach’s alpha: different faces of reliability and opportunities for new applications. Journal of clinical epidemiology, 85, 45-49.
  • Husek, T. R., & Sirotnik, K. (1967). Item sampling in educational research. https://files.eric.ed.gov/fulltext/ED013975.pdf
sessionInfo()
## R version 4.4.0 (2024-04-24)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sonoma 14.3
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/Amsterdam
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] furrr_0.3.1     future_1.33.2   mice_3.16.0     magrittr_2.0.3 
##  [5] semTools_0.5-6  lavaan_0.6-17   psych_2.4.3     MASS_7.3-60.2  
##  [9] lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4    
## [13] purrr_1.0.2     readr_2.1.5     tidyr_1.3.1     tibble_3.2.1   
## [17] ggplot2_3.5.1   tidyverse_2.0.0 knitr_1.47     
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.5      shape_1.4.6.1     xfun_0.44         bslib_0.7.0      
##  [5] lattice_0.22-6    tzdb_0.4.0        quadprog_1.5-8    vctrs_0.6.5      
##  [9] tools_4.4.0       generics_0.1.3    stats4_4.4.0      parallel_4.4.0   
## [13] fansi_1.0.6       pan_1.9           pkgconfig_2.0.3   jomo_2.7-6       
## [17] Matrix_1.7-0      lifecycle_1.0.4   compiler_4.4.0    munsell_0.5.1    
## [21] mnormt_2.1.1      codetools_0.2-20  htmltools_0.5.8.1 sass_0.4.9       
## [25] yaml_2.3.8        glmnet_4.1-8      nloptr_2.0.3      pillar_1.9.0     
## [29] jquerylib_0.1.4   cachem_1.1.0      iterators_1.0.14  rpart_4.1.23     
## [33] boot_1.3-30       foreach_1.5.2     mitml_0.4-5       parallelly_1.37.1
## [37] nlme_3.1-164      tidyselect_1.2.1  digest_0.6.35     stringi_1.8.4    
## [41] listenv_0.9.1     splines_4.4.0     fastmap_1.2.0     grid_4.4.0       
## [45] colorspace_2.1-0  cli_3.6.2         survival_3.5-8    utf8_1.2.4       
## [49] broom_1.0.6       pbivnorm_0.6.0    withr_3.0.0       scales_1.3.0     
## [53] backports_1.5.0   timechange_0.3.0  rmarkdown_2.27    globals_0.16.3   
## [57] nnet_7.3-19       lme4_1.1-35.3     hms_1.1.3         evaluate_0.23    
## [61] rlang_1.1.4       Rcpp_1.0.12       glue_1.7.0        minqa_1.2.7      
## [65] rstudioapi_0.16.0 jsonlite_1.8.8    R6_2.5.1
