wrapper function to scan sterile/productive IGH molecules from BAM file
getIGHmapping.Rd
getIGHmapping is the wrapper function intended for users to supply a BAM file and scan for sterile and productive IGH transcripts over each C gene, as defined in the parameter definitions.
Usage
getIGHmapping(
bam,
definitions,
cellBarcodeTag = "CB",
umiTag = "UB",
paired = FALSE,
flank = 5000
)
Arguments
- bam
filepath to the BAM file to read
- definitions
A list of GenomicRanges::GRanges objects, each specifying the genomic coordinates of VDJ genes, C gene coding segments and sterile C transcripts. See the data objects 'human_definitions' and 'mouse_definitions' in the sciCSR package for the expected format.
- cellBarcodeTag
Name of tag holding information about cell barcode. The code expects and extracts this tag from each line in the BAM alignments. Set as NULL or NA if no such information is available in the BAM file. (Default: "CB")
- umiTag
Name of tag holding information about molecule barcode. The code expects and extracts this tag from each line in the BAM alignments. Set as NULL or NA if no such information is available in the BAM file. (Default: "UB")
- paired
Are the sequencing reads paired-end? (Default: FALSE)
- flank
either (1) an integer (indicating the distance 5' of the CH exons) or (2) a GRanges object (indicating exact genomic positions) for defining sterile IgH transcripts. Ignored if the positions of sterile transcripts are already included in definitions. (Default: 5000)
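A minimal sketch of a call with the arguments listed above. It assumes the sciCSR package is installed and that 'human_definitions' can be loaded with data(); the BAM path is a hypothetical placeholder, and the call is guarded so that nothing runs if either is missing.

```r
# Hypothetical path; replace with the location of your own BAM file.
bam_path <- "possorted_genome_bam.bam"

if (requireNamespace("sciCSR", quietly = TRUE) && file.exists(bam_path)) {
  # Assumption: the definitions ship as a data object loadable via data().
  data("human_definitions", package = "sciCSR")

  mapping <- sciCSR::getIGHmapping(
    bam = bam_path,
    definitions = human_definitions,
    cellBarcodeTag = "CB",  # set to NULL or NA if the BAM has no barcode tag
    umiTag = "UB",          # set to NULL or NA if the BAM has no UMI tag
    paired = FALSE,
    flank = 5000
  )

  # Inspect the two items of the returned list.
  str(mapping$read_count)
  str(mapping$junction_reads)
}
```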
Value
A list with two items:
- read_count
a data.frame in wide format indicating, for each cell barcode and UMI combination, the number of **reads** (Note: NOT UMIs!) covering the VDJ region, and the coding region (C) or 5' intronic region (I) of each IGH C gene.
- junction_reads
a data.frame of spliced reads and their mapped cell barcodes and UMIs. These are either genuine spliced productive IgH transcripts or unusual molecules potentially worth detailed inspection.
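Because read_count holds read counts rather than UMI counts, a per-cell molecule count has to be derived from it. The sketch below shows one way to do this on a toy table; the column names (CB, UB, VDJ, IGHM_C, IGHM_I) are hypothetical stand-ins, so check the columns of your own output before adapting it.

```r
# Toy stand-in for read_count; all column names here are hypothetical.
read_count <- data.frame(
  CB     = c("AAAC", "AAAC", "AAAC", "TTTG"),
  UB     = c("U1", "U2", "U3", "U1"),
  VDJ    = c(2, 0, 0, 1),
  IGHM_C = c(3, 1, 0, 4),
  IGHM_I = c(0, 2, 5, 0)
)

# Each row is one cell barcode + UMI combination, so a molecule is counted
# once for a region if it has at least one read there.
regions <- c("VDJ", "IGHM_C", "IGHM_I")
has_read <- read_count[regions] > 0

# Sum the per-molecule indicators within each cell barcode.
umi_count <- aggregate(has_read, by = list(CB = read_count$CB), FUN = sum)
print(umi_count)
```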
Details
The function reads in the GenomicRanges::GRanges objects supplied in definitions, which define the genomic
coordinates of the IGHC genes and the VDJ genes. It scans the BAM file for reads covering these regions,
extracting their mapped cell barcodes and Unique Molecular Identifiers (UMIs). Sterile reads are defined as
those covering the intronic region upstream of the 5' end of the C gene coding region. By default this
region is (min(previous_C_CDS_end, -flank), 0), where previous_C_CDS_end refers to the 3' end of the coding
region of the previous C gene, and flank is an integer, given by the user, which indicates how far 5' of
the coding region the function should look for sterile reads.
If you have information on where the sterile transcripts begin, these coordinates can be passed as a
GRanges object to the parameter flank (see examples).
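The default window can be worked through with a small base R sketch. It is a literal reading of the formula in this section, not the package's internal code: sterile_window() is a hypothetical helper, and coordinates are assumed to be relative to the 5' end of the C coding region (position 0, with upstream positions negative).

```r
# Hypothetical helper, not part of sciCSR: evaluates the documented formula
# (min(previous_C_CDS_end, -flank), 0) in coordinates relative to the 5' end
# of the C coding region.
sterile_window <- function(previous_C_CDS_end, flank = 5000) {
  stopifnot(previous_C_CDS_end <= 0, flank >= 0)
  c(start = min(previous_C_CDS_end, -flank), end = 0)
}

# Previous C gene ends 12 kb upstream: the window reaches back to that end.
sterile_window(previous_C_CDS_end = -12000, flank = 5000)

# Previous C gene ends 3 kb upstream: the window still spans the full flank.
sterile_window(previous_C_CDS_end = -3000, flank = 5000)
```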
Outputs data frames of cell barcodes and molecules mapped to
each IGHC gene, classified as sterile/productive/C-only.