How do I ensure that I accurately represent repetitive RNAs (rRNA, tRNA, snRNA, repetitive elements, etc.) in my analysis?
If you perform a standard genome mapping and keep only uniquely mapping reads, you will likely misrepresent the ncRNA portion of your CLIP library. RNA species such as rRNA, tRNA and snRNA have many copies in the genome, often with high sequence identity, so a short CLIP read can map equally well to multiple gene copies; if you exclude multi-mapping reads, these reads are lost. A further difficulty is that genome annotations for rRNA and tRNA are likely incomplete, so reads that appear to come from pre-mRNA might actually come from, for example, an unannotated intronic tRNA (see (Schwartz, 2018) for an example of where this can become contentious).
One approach to this issue is to randomly assign each multi-mapping read to one of the genomic locations where it maps. The problems with this approach arise in gene-by-gene quantification: a gene with multiple high-identity copies in the genome will be penalised in terms of counts, because its multi-mapping reads will be spread amongst all copies. In addition, assigning reads to categories can become complicated: if a read maps equally well to a tRNA, an intron and an intergenic region, for example, you will be forced to come up with a hierarchy.
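To make the dilution effect concrete, here is a minimal sketch in Python. It assumes a toy input where each read ID is paired with the list of loci it aligns to equally well; the names `assign_randomly`, `resolve_category` and `HIERARCHY`, and the priority order itself, are hypothetical and not taken from any particular pipeline.

```python
import random
from collections import Counter

# Hypothetical priority order: earlier entries win when a read
# overlaps several feature types. Any real order is a judgment call.
HIERARCHY = ["rRNA", "tRNA", "snRNA", "intron", "intergenic"]


def assign_randomly(read_hits, seed=0):
    """Assign each read to one of its candidate loci, chosen at random.

    read_hits maps a read ID to the list of loci it aligns to equally well.
    Returns a Counter of assigned reads per locus.
    """
    rng = random.Random(seed)  # seeded so the toy example is reproducible
    counts = Counter()
    for loci in read_hits.values():
        counts[rng.choice(loci)] += 1
    return counts


def resolve_category(categories, hierarchy=HIERARCHY):
    """Pick a single category for a read that overlaps several feature types."""
    for category in hierarchy:
        if category in categories:
            return category
    return "unassigned"


# Toy data: 300 reads from a gene family with three identical copies,
# plus 100 reads from a single-copy gene.
copies = ["tRNA_copy1", "tRNA_copy2", "tRNA_copy3"]
read_hits = {f"multi_{i}": copies for i in range(300)}
read_hits.update({f"unique_{i}": ["single_copy_gene"] for i in range(100)})

counts = assign_randomly(read_hits)
family_total = sum(counts[c] for c in copies)

print(counts)        # each copy receives only a share of the 300 family reads
print(family_total)  # the family-level total is preserved
print(resolve_category({"intron", "tRNA"}))
```

In this toy case the family as a whole has three times as many reads as the single-copy gene, yet each individual copy ends up with a count similar to it. Summing over copies recovers the family total, which is one reason to quantify at the level of groups rather than individual genes.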
Other approaches to this issue involve some kind of pre-mapping, that is, mapping reads to certain groups of RNA species before mapping to the whole genome. This effectively reduces the sequence space available for reads to map to, so there are several considerations: if I am too lenient in terms of mismatches etc., I may map a read to (say) tRNA when it would map much better, without mismatches, somewhere else in the genome; if I am too stringent, I will fail to assign a true tRNA read as “tRNA”. Pre-mapping also involves some assumptions, in that it is typical to map to the most abundant ncRNAs first (rRNA, tRNA). Because they are the most abundant, it makes sense that a read that could map to these RNAs probably does originate from them, but this might not always be the case. Even pre-mapping will leave you with problems when it comes to individual gene quantification. How best, for example, to categorise reads mapping to some combination of the 193 annotated U1 snRNA genes in the human genome? We can use some knowledge of biology to help us make sensible groupings.
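The leniency trade-off in pre-mapping can be illustrated with a toy sketch. This is not how a real aligner works: it assumes ungapped matching by Hamming distance against short made-up sequences, and the function names (`premap`, `best_mismatches`), the tier order and the sequences are all hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_mismatches(read, reference):
    """Fewest mismatches of the read against any ungapped window of the reference.

    Returns None if the read is longer than the reference.
    """
    n = len(read)
    if n > len(reference):
        return None
    return min(hamming(read, reference[i:i + n])
               for i in range(len(reference) - n + 1))


def premap(reads, reference_tiers, max_mismatches=1):
    """Assign each read to the first tier it matches within max_mismatches.

    reference_tiers is an ordered list of (tier_name, {name: sequence});
    earlier tiers are tried first, mimicking sequential pre-mapping.
    """
    assignments = {}
    for read_id, seq in reads.items():
        assignments[read_id] = "unassigned"
        for tier_name, refs in reference_tiers:
            scores = [best_mismatches(seq, ref) for ref in refs.values()]
            scores = [s for s in scores if s is not None]
            if scores and min(scores) <= max_mismatches:
                assignments[read_id] = tier_name
                break
    return assignments


# Made-up sequences: the genome contains a perfect match for read r1,
# while the toy tRNA differs from r1 at a single position.
tiers = [
    ("rRNA", {"rDNA_toy": "ACGTACGTTTGACCA"}),
    ("tRNA", {"tRNA_toy": "GGGCCCATAGCTCAG"}),
    ("genome", {"chr_toy": "TTTTGGGCCCATTGCTCAGAAAA"}),
]
reads = {"r1": "GGGCCCATTGCTCAG", "r2": "ACGTACGTTTGACCA"}

lenient = premap(reads, tiers, max_mismatches=1)
strict = premap(reads, tiers, max_mismatches=0)
print(lenient)
print(strict)
```

With one mismatch allowed, r1 is captured by the tRNA tier even though it matches the genome perfectly; with no mismatches allowed it falls through to the genome. The opposite failure, a true tRNA read carrying a sequencing error being rejected by the strict setting, follows from the same threshold.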
In the case of the FAST-iCLIP pipeline, for example, the authors quantify tRNAs at the level of anticodon groups rather than individual genes. A further, CLIP-specific issue is assigning the location of crosslinks to reads that multi-map within ncRNA. One option here is to make metaprofiles over groups of ncRNA. Another solution is to map to single representative copies of specific species, for example one copy of the 45S rDNA cluster in the case of rRNA. Some software tools that aim to help in this area are: