Methods

BLAST notes

For simple tasks, we can just use the NCBI BLAST web service.

De novo assembly

GTF files

A common format for storing biological coordinate data, such as the chromosome, start, and end position of a feature, is the GTF/GFF format. The General Feature Format (GFF) and the General Transfer Format (GTF) are similar formats that differ in some technical details. These details can be found online, for example at GFF on Wikipedia and GFF/GTF File Format at Ensembl. Table 1 is an example GTF file produced by the BRAKER pipeline.

**Table 1.** Example GTF file produced by BRAKER, showing a single ‘gene’ (W103_g1) that consists of several features. The headers V1 to V9 are column numbers; column V7 (strand) is not shown.
V1	V2	V3	V4	V5	V6	V8	V9
000000F\|arrow	AUGUSTUS	gene	13924	18566	.	.	W103_g1
000000F\|arrow	AUGUSTUS	transcript	13924	18566	.	.	W103_g1.t1
000000F\|arrow	AUGUSTUS	stop_codon	13924	13926	.	0	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	13924	13960	1	1	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	13924	13960	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	13961	14096	1	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	14097	14168	1	1	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	14097	14168	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	14169	14974	1	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	14975	15055	1	1	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	14975	15055	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	15056	15272	1	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	15273	15323	1	1	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	15273	15323	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	15324	15486	1	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	15487	15581	1	0	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	15487	15581	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	15582	16447	1	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	16448	16528	1	0	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	16448	16528	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	16529	16623	1	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	16624	16708	1	1	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	16624	16708	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	16709	16943	1	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	16944	17023	1	0	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	16944	17023	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	17024	17138	1	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	17139	17253	1	1	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	17139	17253	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	17254	17849	1	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	17850	17920	1	0	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	17850	17920	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	17921	18012	1	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	18013	18100	0.97	1	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	18013	18100	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	18101	18323	0.97	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	18324	18396	0.54	2	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	18324	18396	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	intron	18397	18559	0.45	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	CDS	18560	18566	0.68	0	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	exon	18560	18566	.	.	transcript_id W103_g1.t1; gene_id W103_g1;
000000F\|arrow	AUGUSTUS	start_codon	18564	18566	.	0	transcript_id W103_g1.t1; gene_id W103_g1;

The format does not include a header line, but the specification does name the columns. For example, column 1 is the ‘seqid’, which in our case is the name of the sequence (contig, chromosome, etc.). Columns 4 and 5 are the start and end positions. They are ordered so that column 4 is never greater than column 5, regardless of the orientation of the feature, which is why column 7 is needed to specify the strand. Column 3 is the ‘type’ and is important when extracting information from the file. The contents of this column are part of a ‘controlled vocabulary’ that we can look up at the Sequence Ontology. In Table 1, the ‘gene’ consists of 1 ‘gene’ record, 1 ‘transcript’ record, 1 ‘start_codon’ record, and 1 ‘stop_codon’ record, as well as 13 ‘CDS’, 13 ‘exon’, and 12 ‘intron’ records.
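The column layout described above can be sketched in a few lines of Python. This is a minimal illustration, not part of BRAKER or any GTF library: the names `GtfRecord`, `parse_gtf_line`, `count_types`, and `EXAMPLE_LINES` are hypothetical, and the sketch assumes a tab-delimited file with exactly nine columns. The example lines are adapted from Table 1, and because the table does not show the strand column, the `-` used here is an assumption.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class GtfRecord(NamedTuple):
    """One GTF/GFF line; field names follow the nine columns of the format."""
    seqid: str       # column 1: sequence name (contig, chromosome, etc.)
    source: str      # column 2: program that produced the record
    type: str        # column 3: feature type (gene, exon, CDS, ...)
    start: int       # column 4: start position
    end: int         # column 5: end position (never less than start)
    score: str       # column 6: score, or '.' when absent
    strand: str      # column 7: '+', '-' or '.'
    phase: str       # column 8: '0', '1', '2' or '.'
    attributes: str  # column 9: free-text attributes


def parse_gtf_line(line: str) -> GtfRecord:
    """Split one tab-delimited line into named fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, found {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    return GtfRecord(seqid, source, ftype, int(start), int(end),
                     score, strand, phase, attrs)


def count_types(lines: Iterable[str]) -> Counter:
    """Count records per feature type, skipping blank and comment lines."""
    counts: Counter = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        counts[parse_gtf_line(line).type] += 1
    return counts


# Three records adapted from Table 1. The strand value '-' is an assumption,
# because the strand column is not shown in the table.
ATTRS = "transcript_id W103_g1.t1; gene_id W103_g1;"
EXAMPLE_LINES = [
    f"000000F|arrow\tAUGUSTUS\texon\t13924\t13960\t.\t-\t.\t{ATTRS}",
    f"000000F|arrow\tAUGUSTUS\tintron\t13961\t14096\t1\t-\t.\t{ATTRS}",
    f"000000F|arrow\tAUGUSTUS\texon\t14097\t14168\t.\t-\t.\t{ATTRS}",
]

if __name__ == "__main__":
    for feature_type, n in sorted(count_types(EXAMPLE_LINES).items()):
        print(feature_type, n)
```

Passing an open file handle to `count_types` instead of `EXAMPLE_LINES` should give the per-type tallies for a whole GTF file, which is one way to check counts like those reported for W103_g1.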


Brian J. Knaus

7/28/2021
