BLAST notes

For simple tasks, we can use the NCBI BLAST web service directly.

De novo assembly

GTF files

A common format for storing biological coordinate data, such as the chromosome, start, and end position of a feature, is the GTF/GFF format. The General Feature Format (GFF) and the General Transfer Format (GTF) are similar formats that differ in some technical details. These details can be found online, for example in the GFF article on Wikipedia and the GFF/GTF File Format page at Ensembl. Table 1 shows an example GTF file produced by the BRAKER pipeline.

Table 1. Example GTF file produced by BRAKER including a single ‘gene’ (W103_g1) that consists of several features.
V1 V2 V3 V4 V5 V6 V7 V8 V9
000000F|arrow AUGUSTUS gene 13924 18566 . - . W103_g1
000000F|arrow AUGUSTUS transcript 13924 18566 . - . W103_g1.t1
000000F|arrow AUGUSTUS stop_codon 13924 13926 . - 0 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 13924 13960 1 - 1 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 13924 13960 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 13961 14096 1 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 14097 14168 1 - 1 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 14097 14168 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 14169 14974 1 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 14975 15055 1 - 1 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 14975 15055 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 15056 15272 1 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 15273 15323 1 - 1 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 15273 15323 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 15324 15486 1 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 15487 15581 1 - 0 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 15487 15581 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 15582 16447 1 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 16448 16528 1 - 0 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 16448 16528 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 16529 16623 1 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 16624 16708 1 - 1 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 16624 16708 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 16709 16943 1 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 16944 17023 1 - 0 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 16944 17023 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 17024 17138 1 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 17139 17253 1 - 1 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 17139 17253 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 17254 17849 1 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 17850 17920 1 - 0 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 17850 17920 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 17921 18012 1 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 18013 18100 0.97 - 1 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 18013 18100 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 18101 18323 0.97 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 18324 18396 0.54 - 2 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 18324 18396 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS intron 18397 18559 0.45 - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS CDS 18560 18566 0.68 - 0 transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS exon 18560 18566 . - . transcript_id W103_g1.t1; gene_id W103_g1;
000000F|arrow AUGUSTUS start_codon 18564 18566 . - 0 transcript_id W103_g1.t1; gene_id W103_g1;

The format does not include a ‘header’, but the specification does name the columns. For example, column 1 is the ‘seqid’, which in our case is the name of the sequence (contig, chromosome, etc.). Columns 4 and 5 are the start and end positions. They are always ordered so that column 4 is less than or equal to column 5, regardless of the orientation of the feature, which is why column 7 is needed to specify the strand. In Table 1 the stop codon has lower coordinates than the start codon, which indicates that this gene lies on the reverse strand. Column 3 is the ‘type’ and is important when extracting information from the file. The contents of this column are part of a ‘controlled vocabulary’ that we can look up at the Sequence Ontology. This ‘gene’ consists of 1 ‘gene’ record, 1 ‘transcript’ record, 1 ‘start_codon’ record, and 1 ‘stop_codon’ record, as well as 13 ‘CDS’, 13 ‘exon’, and 12 ‘intron’ records.
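The column names and record counts described above can be checked with a short script. The following is a minimal Python sketch, not part of the BRAKER pipeline: the function names `parse_gtf_line` and `count_types` are hypothetical, the column names follow the GFF3 naming used in the text (GTF documentation calls column 1 ‘seqname’ and column 8 ‘frame’), and the strand value `-` in the example records is an assumption inferred from the stop codon having lower coordinates than the start codon.

```python
from collections import Counter

# Column names as given in the GFF3 specification; the file itself has no header.
GTF_COLUMNS = ["seqid", "source", "type", "start", "end",
               "score", "strand", "phase", "attributes"]


def parse_gtf_line(line):
    """Split one tab-delimited record into a dict keyed by column name."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    record = dict(zip(GTF_COLUMNS, fields))
    # Start and end are 1-based integer coordinates with start <= end.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def count_types(lines):
    """Count how often each feature type (column 3) occurs."""
    counts = Counter()
    for line in lines:
        # Skip blank lines and comment lines.
        if not line.strip() or line.startswith("#"):
            continue
        counts[parse_gtf_line(line)["type"]] += 1
    return counts


# Four records from Table 1, written with explicit tabs.
# The strand "-" is an assumption (see the paragraph above the code).
attrs = "transcript_id W103_g1.t1; gene_id W103_g1;"
example = [
    "000000F|arrow\tAUGUSTUS\tgene\t13924\t18566\t.\t-\t.\tW103_g1",
    "000000F|arrow\tAUGUSTUS\tstop_codon\t13924\t13926\t.\t-\t0\t" + attrs,
    "000000F|arrow\tAUGUSTUS\tCDS\t13924\t13960\t1\t-\t1\t" + attrs,
    "000000F|arrow\tAUGUSTUS\texon\t13924\t13960\t.\t-\t.\t" + attrs,
]

counts = count_types(example)
for feature_type, n in sorted(counts.items()):
    print(feature_type, n)
```

Applied to every line of the full file rather than to these four records, `count_types` should report the counts listed above (13 ‘CDS’, 13 ‘exon’, 12 ‘intron’, and so on). Splitting on tabs rather than on whitespace matters because the attributes in column 9 contain spaces.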