A simple pipeline to annotate transposable elements (TEs) in a genome assembly
Updated: Apr 8, 2021
In this tutorial, we are going to find, classify and quantify transposable elements (TEs) in an unannotated genome assembly. As an example, we will use the chromosome 2L of Drosophila melanogaster to represent a newly assembled genome.
You will need the following programs/scripts to complete the analysis (click on link to find and install them):
RepeatModeler2 (TE assembly and classification)
RepeatMasker (TE mapping, quantification)
ParseRM scripts (A. Kapusta) (Parser for RepeatMasker)
rm2gff3.sh (produce a coloured gff3 track for your TEs!)
1. Building a de-novo TE library with RepeatModeler 2
RepeatModeler 2 (RM2) combines multiple repeat detection tools to find and assemble TE families in a genome assembly. RM2 produces a multi-fasta file containing one consensus sequence for each putative TE family.
RM2 includes a basic (but very helpful) classification module. However, it is based on sequence homology, so its sensitivity will vary depending on the target species. Nevertheless, unknown repeats will still be reported and processed normally.
First, let's download and prepare a toy genome
mkdir toygenome cd toygenome wget ftp://ftp.flybase.net/genomes/Drosophila_melanogaster/dmel_r6.38_FB2021_01/fasta/dmel-all-chromosome-r6.38.fasta.gz gunzip dmel-all-chromosome-r6.38.fasta.gz awk 'NR < 293924' dmel-all-chromosome-r6.38.fasta > genome.fa rm dmel-all-chromosome-r6.38.fasta
Build the RM2 database for your genome
<RepeatModelerPath>/BuildDatabase -name genome genome.fa Building database genome: Reading genome.fa... Number of sequences (bp) added to database: 1 ( 23513712 bp )
Run RepeatModeler 2
nohup <RepeatModelerPath>/RepeatModeler -database genome -pa 20 -LTRStruct >& run.out &
Note: the "nohup <command> &" syntax allows long jobs to run in the background. However if your machines use a scheduler such as SLURM or qsub, these can be ignored
In this example, we are using 20 cpus and activate the LTR module of RepeatModeler. This option is highly recommended and improve the results for LTR retrotransposons.
Depending the size of your target genome and the repeat load, a RepeatModeler 2 run can take between a few hours and a few days. A log is stored in the file "run.out"
A quick look at RM2 outputs
Before moving on, locate the library produced by RepeatModeler 2. The final consensus sequences are stored in the file <genome>-families.fa.
The header format of the TE library is as follow:
>TE-family-name#classification [extra info]
The family name can be either rnd-x_family-y or ltr-x_family-y depending if the consensus comes from the main or the LTR modules.
A classification is given in the RepeatMasker format (# followed by subclass/superfamily) - alternatively (# Unknown)
LTR elements are split in 2 distinct consensus: one for the internal region (Type=INT) and one for the LTR sequence (Type=LTR)
>rnd-1_family-23#LINE/I [...] >rnd-1_family-78#Unknown [...]
>ltr-1_family-43#LTR/Pao [ Type=LTR, Final Multiple Alignment Size = 1 ] >ltr-1_family-42#LTR/Pao [ Type=INT, Final Multiple Alignment Size = 1 ] >ltr-1_family-40#LTR/Unknown [ Type=LTR, Final Multiple Alignment Size = 1 ]
Note: one advantage of using RepeatModeler 2 is that the classification format is directly recognized by RepeatMasker and the follow-up scripts, however, this strict syntax is not mandatory and all repeats will be processed identically whether they are classified or not.
2. Annotate your genome using RepeatMasker and the newly build TE library
Running Repeat Masker
You are ready to run RepeatMasker to map and quantify TE discovered in the previous step. By default, RepeatMasker will use rmblastn, a modified version of blastn.
nohup <RepeatMaskerPath>/RepeatMasker -pa 5 -a -s -gff -no_is -lib genome-families.fa genome.fa &> RM.run.out &
-pa: CPUs **WARNING** RepeatMasker multiplies CPU x 4 while using rmblastn !!!
-a: .align file (needed for TE landscapes) **
-s: "slow"-search mode (recommended) *
-gff: produces a gff track
-no_is: don't look for insertion sequences (prokaryotic TE) *
*These parameters are based on Permal et al., 2012 for consistency
**The align file is necessary for downstream analyses based on TE divergence (e.g. TE landscapes)
Note: RepeatMasker can take from minutes to days to complete depending on the genome size and TE content. For larger genomes (such as human), it is highly recommended to split RepeatMasker runs per chromosome and take advantage of parallelization when possible.
You can follow your RepeatMasker run by reading in the log file:
tail -f RM.run.out nohup: ignoring input RepeatMasker version 4.1.0 Search Engine: NCBI/RMBLAST [ 2.10.0+ ] Master RepeatMasker Database: /programs/RepeatMasker_4-1-0/Libraries/RepeatMaskerLib.embl ( Complete Database: CONS-Dfam_3.1 ) Custom Repeat Library: genome-families.fa analyzing file genome.fa identifying Simple Repeats in batch 1 of 406 identifying Simple Repeats in batch 2 of 406 identifying Simple Repeats in batch 3 of 406
at the end RepeatMasker can spend a significant amount of time compiling the results using a single thread. To be sure that the RepeatMasker run is done, check the log for the message:
Generating output... ................. masking done
Repeat Masker outputs
The outputs of RepeatMasker are named with the prefix genome.fa.*
genome.fa.out: main RepeatMasker output, this is a table with a 3 lines header. It gives the genomic coordinates of TE hits on the genome.
genome.fa.align: alignment details for the retained RM hits. We will use it in the parsing step
genome.fa.out.gff: gff version of .out for visualisation (IGV...)
genome.fa.masked: masked version of the genome (.fasta), with TE residues marked as "X"
genome.fa.tbl: summary table; this is great to have a quick estimate of the repeat content. As you can see, the repeat content is broken down by type of elements. This will only shows if the database used is in the RepeatModeler/RepeatMasker formats as described before, but it doesn't have an impact on the actual results.
3. Parsing the RepeatMasker output for statistical analysis
As is, the results from RepeatMasker need a little polishing before being loaded in R (or your favorite statistical analysis software, but... why using something else?). The goal is to produce a data table that will summarize for each TE family the total bp masked, number of hits, average hit size, divergence, etc. For this, we are going to use a very convenient script created by Aurelie Kapusta which automatically parse the RM outputs into analyzable data tables.
The script has two types of outputs "parse" (summary data tables) and "landscape" (to perform TE age analysis). I will dedicate a specific tutorials for the landscapes, so I will focus here on the "parse" routine. It will produce two tables, one for the per-family results, and one "summary", according to the classification levels (if available).
Note. If your target genome has a larger TE content than D. melanogaster, you will observe that TE hits sometimes overlap. It can be the case if a region has homology to different consensus of the library. RM somewhat allow some of these overlaps to happen, and thus, the raw results from RepeatMasker can be often slightly overestimated. The script we use will automatically get rid of these overlaps by choosing the less divergent consensus in the overlap region.
Parsing the output with ParseRM.pl
First download the script from Aurelie's GitHub
git clone https://github.com/4ureliek/Parsing-RepeatMasker-Outputs.git
And run the parsing script
nohup perl ./Parsing-RepeatMasker-Outputs/parseRM.pl -i genome.fa.align -p -f genome.fa -r genome-families.fa -v &> parseRM.log &
-i: input (.align) -- in our case genome.fa.align
-p: parse mode
-f: genome file -- [from the script's help; "If the file THAT WAS MASKED, typically a genome, is in the same directory as the .out or .align file(s). If not provided, the % of genome masked won't be calculated. Names should correspond before .align and .fa(sta), and/or .out and .fa(sta), for ex: '-i hg38.out -f' will look for hg38.fa and 'hg38.out_RM405_RB20140131 -f' will also look for hg38.fa"]
--> this works in our case, and I encourage to keep this workflow.
-r: the repeat database (consensus from RepeatModeler 2)
-v: verbose mode for log
It only takes a few minutes in this example, but it can easily run for hours (e.g.: one night for 750Mb genome with 74% TE) - be patient, it is worth it!
TO DO: add parse outputs details
To be continued... with:
BONUS: TE visualization on the genome browser (IGV)