BLAST

Setting up your project

The easy way

If you're on a Mac or Linux machine, run setup.pl from your project root. It will:

create the data directory
download all sequences
download the query file
create the database
index the database
create stub files

Be sure to read through the script before executing it.

Installing BLAST

Download one of the installers from NCBI.

Project layout

Tree

├── README
├── data/
│   ├── NC_003997.fna
│   ├── NC_005945.fna
│   ├── NC_012581.fna
│   ├── NC_012659.fna
│   ├── anthraxDB.fna
│   └── queryNuc.txt
├── setup.pl
└── src/
    ├── query_nuc.pl
    └── query_prot.pl

Explanation


File	Purpose
`README`	How to run your program, etc.
`data/`	Contains data for your project
`setup.pl`	Automate setting up data
`src/`	Source code
`src/query_nuc.pl`	Performs nucleotide searches
`src/query_prot.pl`	Performs protein searches

Downloading sequences

Downloading from the command-line

OS X: Use curl -O URL
Linux: Use wget URL
Windows: Use your browser
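
If you would rather script the download step, the rule above can be sketched in Perl. This is only an illustration: `download_cmd` is a hypothetical helper, the URL is a placeholder, and the OS check relies on Perl's built-in `$^O` variable (`darwin` on OS X, `linux` on Linux).

```perl
use strict;
use warnings;

# Hypothetical helper: pick a download command for the current OS.
# $os defaults to Perl's built-in $^O ('darwin' on OS X, 'linux' on Linux).
sub download_cmd {
    my ($url, $os) = @_;
    $os = $^O unless defined $os;
    return "curl -O '$url'" if $os eq 'darwin';
    return "wget '$url'"    if $os eq 'linux';
    return;    # no command-line tool assumed (e.g., Windows): use a browser
}

# Placeholder URL, not a real download location.
my $cmd = download_cmd('https://example.org/NC_003997.fna');
print defined $cmd ? "$cmd\n" : "download with your browser\n";
```

The sketch only builds the command string; pass it to `system` once you have filled in a real URL.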

Files

Preparing the DB

We need to merge all of our .fna files into anthraxDB.fna so BLAST can search them as a single database.

Mac & Linux

cd data
cat NC_*.fna > anthraxDB.fna

Windows

cd data
copy /a NC_*.fna anthraxDB.fna
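
The same merge can also be done from Perl, which avoids the platform-specific commands. The following is a minimal sketch, not what setup.pl necessarily does; `merge_fasta` is a hypothetical name, and the `data/NC_*.fna` pattern assumes the file names shown in the project layout.

```perl
use strict;
use warnings;

# Concatenate the given FASTA files into one output file.
# Returns the number of input files merged.
sub merge_fasta {
    my ($out_fn, @in_fns) = @_;
    open my $out, '>', $out_fn or die "Cannot write $out_fn: $!";
    for my $in_fn (@in_fns) {
        open my $in, '<', $in_fn or die "Cannot read $in_fn: $!";
        while (my $line = <$in>) {
            $line =~ s/\r?\n\z//;      # normalize line endings
            print {$out} "$line\n";    # every line ends with a newline
        }
        close $in;
    }
    close $out or die "Cannot close $out_fn: $!";
    return scalar @in_fns;
}

# Glob only the NC_* genome files so the output file is never read as input.
my @genomes = glob('data/NC_*.fna');
merge_fasta('data/anthraxDB.fna', @genomes) if @genomes;
```

Writing a newline after every line guards against an input file that lacks a final newline, which would otherwise glue the next file's header onto the end of a sequence.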

Index the database

From your project root (i.e., the directory that contains data/), type

makeblastdb -in data/anthraxDB.fna -dbtype nucl

Download the query file

Save queryNuc.txt to data/ (we'll use it in the next tutorial)

Running BLAST

From the shell

Do a search against the database:

blastn -db data/anthraxDB.fna -query data/queryNuc.txt

Look at the E-values in the output (smaller is better).
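
To compare E-values in a script rather than by eye, one option is BLAST's tabular output (`-outfmt 6`), which prints one tab-separated line per hit with the E-value in the 11th column and the bit score in the 12th. The sketch below assumes that format and its default columns; `parse_tabular` is a hypothetical helper and the sample rows are made up for illustration.

```perl
use strict;
use warnings;

# Parse BLAST tabular output (-outfmt 6) into a list of hits.
# Default columns: qseqid sseqid pident length mismatch gapopen
#                  qstart qend sstart send evalue bitscore
sub parse_tabular {
    my ($text) = @_;
    my @hits;
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*(?:#|$)/;    # skip comments and blank lines
        my @f = split /\t/, $line;
        next unless @f >= 12;              # ignore malformed rows
        push @hits, {
            subject  => $f[1],
            evalue   => $f[10] + 0,
            bitscore => $f[11] + 0,
        };
    }
    return @hits;
}

# Made-up sample rows for illustration only; real rows would come from
#   blastn -db data/anthraxDB.fna -query data/queryNuc.txt -outfmt 6
my $sample = join "\n",
    join("\t", qw(q1 subjA 100.000 50 0 0 1 50 101 150 2e-20 93.5)),
    join("\t", qw(q1 subjB 98.000 50 1 0 1 50 201 250 9e-18 87.9));

# Smaller E-value is better, so sort ascending and take the first hit.
my ($best) = sort { $a->{evalue} <=> $b->{evalue} } parse_tabular($sample);
print "best: $best->{subject} (E = $best->{evalue})\n";
```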

Planning

Remember our goal

We are trying to automate queries against BLAST to determine whether ~100 fragments are from the A0248 strain.

Example Run

$ perl src/query_nuc.pl
best hit:       ambiguous
best hit:       ambiguous
:
best hit:       Bacillus anthracis str. A0248
:
:
best hit:       ambiguous

votes for ambiguous: XX
votes for Bacillus anthracis str. A0248: YY

Input & Output

input: results from a BLAST command
output: how many hits were ambiguous or conclusive

Process (pseudocode)

Read the sequences in from the query file (queryNuc.txt)
For each query sequence:
1. Write the query to a text file to use in a BLAST command
2. Run the BLAST command
3. Parse the output to determine which strain the query matches
Print out how many hits were ambiguous and how many were conclusive

Process

At the top…

use strict;
use warnings;

my $db = 'data/anthraxDB.fna'; # path to BLAST DB

my $query_fn = 'tmp_query.txt'; # temp file we rewrite for each query sequence

# this is defined on the "snippets" page -- paste
# the definition into your file
my @queries = fasta_to_array('data/queryNuc.txt');

my $num_ambiguous  = 0; # number of ambiguous hits
my $num_conclusive = 0; # number of conclusive hits

# command to run for each query (i.e., 100 times)
my $cmd = "blastn -db $db -query $query_fn -evalue 1e-10";
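
If you do not have the snippets page handy, here is a rough stand-in for `fasta_to_array`. It is an assumption about what that function does, not the course's definition: this version returns one plain sequence string per FASTA record, dropping the header lines and joining wrapped sequence lines.

```perl
use strict;
use warnings;

# Stand-in for fasta_to_array (the real definition is on the "snippets"
# page and may differ). Returns one sequence string per record, with
# header lines dropped and wrapped sequence lines joined.
sub fasta_to_array {
    my ($fn) = @_;
    open my $fh, '<', $fn or die "Cannot read $fn: $!";
    my @seqs;
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;        # strip newline and trailing whitespace
        next if $line eq '';
        if ($line =~ /^>/) {
            push @seqs, '';        # header: start a new record
        }
        elsif (@seqs) {
            $seqs[-1] .= $line;    # sequence line: append to current record
        }
    }
    close $fh;
    return @seqs;
}
```

If the real function returns something richer (for example, headers as well as sequences), adjust the loop body accordingly.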

In the middle…

for my $query (@queries) {
    # create a BLAST-query for the current sequence ($query)
    # ...write it to $query_fn

    # execute BLAST
    # my $result = `$cmd`;

    # see what we got (parse the output)
}
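
The two stubbed steps in the loop could be filled in along these lines. Treat this as a sketch under assumptions: it supposes `$query` is a bare sequence string and that the BLAST output has already been parsed into hash refs with `subject` and `bitscore` keys. The rule for "ambiguous" used here (no hits, or a tie for the top bit score between different subjects) is one plausible choice, not necessarily the one this project expects. Both helper names are hypothetical.

```perl
use strict;
use warnings;

# Write one sequence to $fn as a single-record FASTA file.
# The header text "query" is an arbitrary placeholder.
sub write_query {
    my ($fn, $seq) = @_;
    open my $fh, '>', $fn or die "Cannot write $fn: $!";
    print {$fh} ">query\n$seq\n";
    close $fh or die "Cannot close $fn: $!";
    return;
}

# Hypothetical rule: return the subject with the highest bit score,
# or 'ambiguous' when there are no hits or the top score is shared
# by more than one subject.
sub classify {
    my @hits = sort { $b->{bitscore} <=> $a->{bitscore} } @_;
    return 'ambiguous' unless @hits;
    my %subjects;
    for my $hit (@hits) {
        last if $hit->{bitscore} < $hits[0]{bitscore};
        $subjects{ $hit->{subject} } = 1;
    }
    return keys(%subjects) == 1 ? $hits[0]{subject} : 'ambiguous';
}
```

Inside the loop you would call `write_query($query_fn, $query)` before running `$cmd`, then increment `$num_ambiguous` or `$num_conclusive` depending on what `classify` returns.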

At the bottom…

print "Total ambiguous: $num_ambiguous\n";
print "Total conclusive: $num_conclusive\n";
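
To print per-label vote counts like those in the example run, a hash keyed by label is a convenient alternative to two separate counters. A minimal sketch follows; `tally` is a hypothetical helper and the labels are made-up stand-ins for the per-query results.

```perl
use strict;
use warnings;

# Count how many times each label occurs.
sub tally {
    my %votes;
    $votes{$_}++ for @_;
    return %votes;
}

# Made-up labels standing in for the per-query results.
my %votes = tally('ambiguous', 'Bacillus anthracis str. A0248', 'ambiguous');
print "votes for $_: $votes{$_}\n" for sort keys %votes;
```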