Overview

arbalistIO facilitates the import of single-cell ATAC-seq and Multiome (ATAC + Gene Expression) data into Bioconductor’s MultiAssayExperiment. It provides efficient C++ based parsers for 10x Genomics fragment files to generate HDF5-backed sparse matrices for both genomic tiles and custom regions.

1. Preparing Mock Data

For this vignette, we will mock data that simulates the output of 10x Genomics Cell Ranger. This includes a fragment file (ATAC-seq) and a filtered feature matrix (Gene Expression).

Generate a Fragment File

We use mockFragmentFile to create a fragment file containing reads for 50 cells.

tmp_dir <- tempdir()
fragment_file <- file.path(tmp_dir, "fragments.tsv.gz")

seq_lengths <- c(chr1 = 10000, chr2 = 5000)
cell_names <- paste0("Cell", 1:50)

mockFragmentFile(
  output.file = fragment_file,
  seq.lengths = seq_lengths,
  num.fragments = 5000,
  cell.names = cell_names
)

Generate a Mock Feature Matrix (HDF5)

createArbalistMAE requires the Cell Ranger filtered feature matrix (HDF5) to identify valid barcodes.

h5_file <- file.path(tmp_dir, "filtered_feature_bc_matrix.h5")

mockCellRangerH5(h5_file, 
                 n.cells=50,
                 cell.names = cell_names)

2. Creating the MultiAssayExperiment

Now we can use createArbalistMAE to import everything into a single object. We will also define a set of genomic regions (e.g., promoters) to quantify accessibility alongside the standard tile matrix.

promoters <- GRanges(
  seqnames = "chr1",
  ranges = IRanges(start = seq(100, 1000, by = 200), width = 100)
)

mae <- createArbalistMAE(
  sample.names = "MockSample",
  fragment.files = fragment_file,
  filtered.feature.matrix.files = h5_file,
  output.dir = tmp_dir,
  gene.grs = promoters,
  tile.size = 500,
  seq.lengths = seq_lengths
)

print(mae)
#> A MultiAssayExperiment object of 3 listed
#>  experiments with user-defined names and respective classes.
#>  Containing an ExperimentList class object of length 3:
#>  [1] TileMatrix500: SingleCellExperiment with 30 rows and 50 columns
#>  [2] GeneAccessibilityMatrix: SingleCellExperiment with 5 rows and 50 columns
#>  [3] GeneExpressionMatrix: SingleCellExperiment with 50 rows and 50 columns
#> Functionality:
#>  experiments() - obtain the ExperimentList instance
#>  colData() - the primary/phenotype DataFrame
#>  sampleMap() - the sample coordination DataFrame
#>  `$`, `[`, `[[` - extract colData columns, subset, or experiment
#>  *Format() - convert into a long or wide DataFrame
#>  assays() - convert ExperimentList to a SimpleList of matrices
#>  exportClass() - save data to flat files

The resulting MultiAssayExperiment contains three experiments:

  1. TileMatrix500: 500bp tile counts (ATAC).
  2. GeneAccessibilityMatrix: Counts over the provided gene.grs regions (ATAC).
  3. GeneExpressionMatrix: RNA counts imported from the HDF5 file.

Exploring the Object

tiles <- experiments(mae)[["TileMatrix500"]]
print(tiles)
#> class: SingleCellExperiment 
#> dim: 30 50 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(0):
#> colnames(50): MockSample#Cell1-1 MockSample#Cell2-1 ...
#>   MockSample#Cell49-1 MockSample#Cell50-1
#> colData names(1): Sample
#> reducedDimNames(0):
#> mainExpName: TileMatrix500
#> altExpNames(0):

rna <- experiments(mae)[["GeneExpressionMatrix"]]
print(rna)
#> class: SingleCellExperiment 
#> dim: 50 50 
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(5): feature_type genome id interval name
#> colnames(50): MockSample#Cell1-1 MockSample#Cell2-1 ...
#>   MockSample#Cell49-1 MockSample#Cell50-1
#> colData names(1): Sample
#> reducedDimNames(0):
#> mainExpName: GeneExpressionMatrix
#> altExpNames(0):

head(colData(mae))
#> DataFrame with 1 row and 1 column
#>                     fragment_file
#>                       <character>
#> MockSample /tmp/Rtmp7UWYZV/frag..

3. Low-Level Matrix Generation

If you do not need the full MultiAssayExperiment structure, you can use the underlying C++ wrappers to generate specific matrices.

Generating a Tile Matrix

Use saveTileMatrix to bin fragments into genome-wide tiles.

tile_h5 <- file.path(tmp_dir, "my_tiles.h5")

tile_res <- saveTileMatrix(
  fragment.file = fragment_file,
  output.file = tile_h5,
  output.name = "counts",
  seq.lengths = seq_lengths,
  tile.size = 1000
)

# Returns a list with the GRanges tiles and the HDF5Matrix
tile_res$tiles
#> GRanges object with 15 ranges and 0 metadata columns:
#>        seqnames    ranges strand
#>           <Rle> <IRanges>  <Rle>
#>    [1]     chr1    1-1000      *
#>    [2]     chr1 1001-2000      *
#>    [3]     chr1 2001-3000      *
#>    [4]     chr1 3001-4000      *
#>    [5]     chr1 4001-5000      *
#>    ...      ...       ...    ...
#>   [11]     chr2    1-1000      *
#>   [12]     chr2 1001-2000      *
#>   [13]     chr2 2001-3000      *
#>   [14]     chr2 3001-4000      *
#>   [15]     chr2 4001-5000      *
#>   -------
#>   seqinfo: 2 sequences from an unspecified genome
tile_res$counts
#> <15 x 50> sparse DelayedMatrix object of type "integer":
#>        Cell2  Cell5 Cell22 Cell15 ... Cell43 Cell35 Cell46 Cell28
#>  [1,]    238    217    236    243   .    262    247    247    237
#>  [2,]    352    341    373    387   .    371    394    394    415
#>  [3,]    376    412    386    350   .    364    367    399    404
#>  [4,]    397    411    398    358   .    404    395    342    408
#>  [5,]    397    379    380    393   .    388    399    353    381
#>   ...      .      .      .      .   .      .      .      .      .
#> [11,]    495    514    500    486   .    452    475    497    460
#> [12,]    750    774    745    740   .    722    765    738    709
#> [13,]    749    760    763    724   .    717    764    747    703
#> [14,]    745    778    754    744   .    769    762    745    776
#> [15,]    757    730    764    779   .    793    757    737    762

Generating a Region Matrix

Use saveRegionMatrix to count fragments overlapping specific genomic regions

region_h5 <- file.path(tmp_dir, "my_regions.h5")

# Define arbitrary regions
peaks <- GRanges("chr1", IRanges(start = c(10, 5000), width = 200))

region_mat <- saveRegionMatrix(
  fragment.file = fragment_file,
  output.file = region_h5,
  output.name = "counts",
  regions = peaks
)

print(region_mat)
#> <2 x 50> sparse DelayedMatrix object of type "raw":
#>       Cell2  Cell5 Cell22 ... Cell46 Cell28
#> [1,]     37     28     2b   .     31     2c
#> [2,]     5e     68     58   .     58     56

That’s it! For analyzing multiome datasets, check out the arbalist package.