Document the original 534 coding sequence targets

Description of capture probe design

On the supercomputer: (wd: /fslhome/jaudall/fsl_groups/fslg_cotton/compute/Nimblegen_seq_capture/Cotton_genome_blast_databases)

We did a blastn of the coding sequences:

/fslapps/blast/blast-2.2.21/bin/blastall -i Cotton_Targets_Sequences_021910.cds.fasta -o blast_output.2.txt -p blastn -e 1e-50 -d try2.scafSeq /fslapps/blast/blast-2.2.21/bin/blastall -i Cotton_Targets_Sequences_021910.cds.fasta -o blast_output.1.txt -p blastn -e 1e-50 -d try1.scafSeq 

We wrote a script to extract the upstream regions of coding sequences based on blast hits. This script creates a file with the upstream sequences in it. It a new file (overwritten every time the script is run) and it is located in the directory:

Udall's computer: /Users/jaudall/Nimblegen_48_hirsutum

get_upstream_sequence_from_blast.pl blast_output.1.txt Cotton_Targets_Sequences_021910.cds.fasta > get_upstream_sequence_from_blast.out 

I looked carefully the upstream regions of Cotton16_20270_01.

New Procedure

We rewrote the script that extracts the upstream regions of the coding sequences based on the blast hits. There are now three steps to extracting the promoter sequences. First, we wrote a script that opens one blast file and looks at the first hsp of every hit.

It then takes the the number that the coding sequence started at and subtracts 1500 from it, if possible. Otherwise, it just sets the start of the promoter sequence to 1. It also takes inversions into account. If there is an inversion, it will take the end of the coding sequence and add 1500 to it since the promoter sequence is to the right of it. The name of the query, the name of the hit and the start and end numbers of the promoter are then written to a file. The script only does one blast file at a time and the file that is written to is specified in the perl script:

all files are in: /Users/dbrandt/Documents/upstream coding sequences 
First script is run like this: perl parse_first_hsp.pl blast_output.1.txt perl parse_first_hsp.pl blast_output.2.txt (Don't forget to change the name of the file the script writes to when running different blast outputs. The file name is changed in the script.) 

Once the first script is run, we changed changed it to just output the hit ID's of each gene. This is done by commenting out everything that the first script writes to the file except forprintf TEXT $hit->name."\n";

After we created one ID file for each blast output, we used a script we found online to extract the needed genes out of the fasta files created from the cotton genome. The script basically takes a list of IDs and creates a new fasta file with only the genes listed in the ID file

Second Script: perl extractSeqByID.pl (Note that nothing is passed in through the command line. The files are hard coded in the script) 

The last step is to take the new fasta files that were created and run a script, that we wrote, which uses the file that has the names and the numbers of the start and end of the promoter sequences (created with the first script). The script reads these files into an array and then uses that information to parse the new fasta files and pull out the promoter sequences. The script trims down the end number of the promoter sequence if it is longer then the sequence in the fasta file. This is only a problem with inversions because the promoter sequence is to the right of the coding sequence. The script creates two promoter sequence files, one for each fasta file. The script also creates two stats pages that summarize the average length of each promoter sequence extracted. One stats page for each fasta file is created.

Third Script: perl get_upstream_sequences_genome.pl try1_revised.fasta try2_revised.fasta (The file with the names, and start and end of each promoter sequence is hard coded in the script. The files where the promoter sequences and stats are written are hard coded in the script as well.)