arbalistIO facilitates the import of single-cell
ATAC-seq and Multiome (ATAC + Gene Expression) data into Bioconductor’s
MultiAssayExperiment. It provides efficient C++ based
parsers for 10x Genomics fragment files to generate HDF5-backed sparse
matrices for both genomic tiles and custom regions.
library(arbalistIO)
library(MultiAssayExperiment)
library(SingleCellExperiment)
library(GenomicRanges)
library(rhdf5)
library(Matrix)For this vignette, we will mock data that simulates the output of 10x Genomics Cell Ranger. This includes a fragment file (ATAC-seq) and a filtered feature matrix (Gene Expression).
We use mockFragmentFile to create a fragment file
containing reads for 50 cells.
tmp_dir <- tempdir()
fragment_file <- file.path(tmp_dir, "fragments.tsv.gz")
seq_lengths <- c(chr1 = 10000, chr2 = 5000)
cell_names <- paste0("Cell", 1:50)
mockFragmentFile(
output.file = fragment_file,
seq.lengths = seq_lengths,
num.fragments = 5000,
cell.names = cell_names
)createArbalistMAE requires the Cell Ranger filtered
feature matrix (HDF5) to identify valid barcodes.
h5_file <- file.path(tmp_dir, "filtered_feature_bc_matrix.h5")
mockCellRangerH5(h5_file,
n.cells=50,
cell.names = cell_names)Now we can use createArbalistMAE to import everything
into a single object. We will also define a set of genomic regions
(e.g., promoters) to quantify accessibility alongside the standard tile
matrix.
promoters <- GRanges(
seqnames = "chr1",
ranges = IRanges(start = seq(100, 1000, by = 200), width = 100)
)
mae <- createArbalistMAE(
sample.names = "MockSample",
fragment.files = fragment_file,
filtered.feature.matrix.files = h5_file,
output.dir = tmp_dir,
gene.grs = promoters,
tile.size = 500,
seq.lengths = seq_lengths
)
print(mae)
#> A MultiAssayExperiment object of 3 listed
#> experiments with user-defined names and respective classes.
#> Containing an ExperimentList class object of length 3:
#> [1] TileMatrix500: SingleCellExperiment with 30 rows and 50 columns
#> [2] GeneAccessibilityMatrix: SingleCellExperiment with 5 rows and 50 columns
#> [3] GeneExpressionMatrix: SingleCellExperiment with 50 rows and 50 columns
#> Functionality:
#> experiments() - obtain the ExperimentList instance
#> colData() - the primary/phenotype DataFrame
#> sampleMap() - the sample coordination DataFrame
#> `$`, `[`, `[[` - extract colData columns, subset, or experiment
#> *Format() - convert into a long or wide DataFrame
#> assays() - convert ExperimentList to a SimpleList of matrices
#> exportClass() - save data to flat filesThe resulting MultiAssayExperiment contains three
experiments:
gene.grs regions (ATAC).
tiles <- experiments(mae)[["TileMatrix500"]]
print(tiles)
#> class: SingleCellExperiment
#> dim: 30 50
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(0):
#> colnames(50): MockSample#Cell1-1 MockSample#Cell2-1 ...
#> MockSample#Cell49-1 MockSample#Cell50-1
#> colData names(1): Sample
#> reducedDimNames(0):
#> mainExpName: TileMatrix500
#> altExpNames(0):
rna <- experiments(mae)[["GeneExpressionMatrix"]]
print(rna)
#> class: SingleCellExperiment
#> dim: 50 50
#> metadata(0):
#> assays(1): counts
#> rownames: NULL
#> rowData names(5): feature_type genome id interval name
#> colnames(50): MockSample#Cell1-1 MockSample#Cell2-1 ...
#> MockSample#Cell49-1 MockSample#Cell50-1
#> colData names(1): Sample
#> reducedDimNames(0):
#> mainExpName: GeneExpressionMatrix
#> altExpNames(0):
head(colData(mae))
#> DataFrame with 1 row and 1 column
#> fragment_file
#> <character>
#> MockSample /tmp/Rtmp7UWYZV/frag..If you do not need the full MultiAssayExperiment
structure, you can use the underlying C++ wrappers to generate specific
matrices.
Use saveTileMatrix to bin fragments into genome-wide
tiles.
tile_h5 <- file.path(tmp_dir, "my_tiles.h5")
tile_res <- saveTileMatrix(
fragment.file = fragment_file,
output.file = tile_h5,
output.name = "counts",
seq.lengths = seq_lengths,
tile.size = 1000
)
# Returns a list with the GRanges tiles and the HDF5Matrix
tile_res$tiles
#> GRanges object with 15 ranges and 0 metadata columns:
#> seqnames ranges strand
#> <Rle> <IRanges> <Rle>
#> [1] chr1 1-1000 *
#> [2] chr1 1001-2000 *
#> [3] chr1 2001-3000 *
#> [4] chr1 3001-4000 *
#> [5] chr1 4001-5000 *
#> ... ... ... ...
#> [11] chr2 1-1000 *
#> [12] chr2 1001-2000 *
#> [13] chr2 2001-3000 *
#> [14] chr2 3001-4000 *
#> [15] chr2 4001-5000 *
#> -------
#> seqinfo: 2 sequences from an unspecified genome
tile_res$counts
#> <15 x 50> sparse DelayedMatrix object of type "integer":
#> Cell2 Cell5 Cell22 Cell15 ... Cell43 Cell35 Cell46 Cell28
#> [1,] 238 217 236 243 . 262 247 247 237
#> [2,] 352 341 373 387 . 371 394 394 415
#> [3,] 376 412 386 350 . 364 367 399 404
#> [4,] 397 411 398 358 . 404 395 342 408
#> [5,] 397 379 380 393 . 388 399 353 381
#> ... . . . . . . . . .
#> [11,] 495 514 500 486 . 452 475 497 460
#> [12,] 750 774 745 740 . 722 765 738 709
#> [13,] 749 760 763 724 . 717 764 747 703
#> [14,] 745 778 754 744 . 769 762 745 776
#> [15,] 757 730 764 779 . 793 757 737 762Use saveRegionMatrix to count fragments overlapping
specific genomic regions
region_h5 <- file.path(tmp_dir, "my_regions.h5")
# Define arbitrary regions
peaks <- GRanges("chr1", IRanges(start = c(10, 5000), width = 200))
region_mat <- saveRegionMatrix(
fragment.file = fragment_file,
output.file = region_h5,
output.name = "counts",
regions = peaks
)
print(region_mat)
#> <2 x 50> sparse DelayedMatrix object of type "raw":
#> Cell2 Cell5 Cell22 ... Cell46 Cell28
#> [1,] 37 28 2b . 31 2c
#> [2,] 5e 68 58 . 58 56That’s it! For analyzing multiome datasets, check out the arbalist package.