这篇文章主要为大家展示了“perl如何分离NR NT库”,内容简而易懂,条理清晰,希望能够帮助大家解决疑惑,下面让小编带领大家一起研究并学习一下“perl如何分离NR NT库”这篇文章吧。
分离NR NT库,快速blast本地比对同源注释基因
我们需要对自己的基因做注释,需要blast同源比对NCBI当中的NR NT库;通常做无参转录组,会组织出10几万的unigene ,如果比对全库的话,就太浪费时间了,我们可以根据NCBI的分类数据库将数据库分开,可下载如下文件,然后利用下面的perl脚本就可以把NR或者NT库分开成小库:
wget -c ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nr.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/nt.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz
wget -c ftp://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
perl脚本如下,由于脚本会把分类文件全部读入内存,所有脚本内存消耗较大,建议确保100G以上的内存空间:
#The gi_taxid_nucl.dmp is about 160 MB and contains two columns: the nucleotide's gi and taxid.
#The gi_taxid_prot.dmp is about 17 MB and contains two columns: the protein's gi and taxid.
#Divisions file (division.dmp):
# division id -- taxonomy database division id
# division cde -- GenBank division code (three characters)
# division name -- e.g. BCT, PLN, VRT, MAM, PRI...
# comments
#0 | BCT | Bacteria | |
#1 | INV | Invertebrates | |
#2 | MAM | Mammals | |
#3 | PHG | Phages | |
#4 | PLN | Plants and Fungi | |
#5 | PRI | Primates | |
#6 | ROD | Rodents | |
#7 | SYN | Synthetic and Chimeric | |
#8 | UNA | Unassigned | No species nodes should inherit this division assignment |
#9 | VRL | Viruses | |
#10 | VRT | Vertebrates | |
#11 | ENV | Environmental samples | Anonymous sequences cloned directly from the environment |
#nodes.dmp file consists of taxonomy nodes. The description for each node includes the following
#fields:
# tax_id -- node id in GenBank taxonomy database
# parent tax_id -- parent node id in GenBank taxonomy database
# rank -- rank of this node (superkingdom, kingdom, ...)
# embl code -- locus-name prefix; not unique
# division id -- see division.dmp file
# inherited div flag (1 or 0) -- 1 if node inherits division from parent
# genetic code id -- see gencode.dmp file
# inherited GC flag (1 or 0) -- 1 if node inherits genetic code from parent
# mitochondrial genetic code id -- see gencode.dmp file
# inherited MGC flag (1 or 0) -- 1 if node inherits mitochondrial gencode from parent
# GenBank hidden flag (1 or 0) -- 1 if name is suppressed in GenBank entry lineage
# hidden subtree root flag (1 or 0) -- 1 if this subtree has no sequence data yet
# comments -- free-text comments and citations
die "perl $0 <division.dmp> <nodes.dmp> <gi_taxid_n/p.dmp> <nr.fa/nt.fa> <od>" unless(@ARGV==5);
use Math::BigFloat;
use Bio::SeqIO;
use Bio::Seq;
use Data::Dumper;
use PerlIO::gzip;
use FileHandle;
use Cwd qw(abs_path getcwd);
if ($ARGV[3]=~/gz$/){
open $Fa, "<:gzip", "$ARGV[3]" or die "$!";
}else{
open $Fa, "$ARGV[3]" or die "$!";
}
my$od=$ARGV[4];
$od||=getcwd;
$od=abs_path($od);
unless(-d $od){ mkdir $od;}
my $in = Bio::SeqIO->new(-fh => $Fa ,
-format => 'Fasta');
my %division2name=();
open IN,"$ARGV[0]" or die "$!";
while(<IN>){
chomp;
my($taxid,$name)=split(/\s+\|\s+/);
$division2name{$taxid}=$name;
}
close(IN);
my%fout=();
for my$i (keys %division2name){
my$FO=FileHandle->new("> $od/$division2name{$i}.fa");
my $out = Bio::SeqIO->new(-fh => $FO , -format => 'Fasta');
$fout{$i}=$out;
}
print "$ARGV[0] readed\n";
#print Dumper(\%fout);
#print Dumper(\%division2name);
open IN,"$ARGV[1]" or die "$!";
my%taxid2division=();
while(<IN>){
chomp;
my@tmp=split(/\s+\|\s+/);
$taxid2division{$tmp[0]}=$tmp[4];
#last if $.>100;
}
close(IN);
print "$ARGV[1] readed\n";
my %gi2taxid=();
open IN,"$ARGV[2]" or die "$!";
while(<IN>){
chomp;
my@tmp=split(/\s+/);
$gi2taxid{$tmp[0]}=$tmp[1];
#last if $.>100;
}
close(IN);
print "$ARGV[2] readed\n";
#print Dumper(\%gi2taxid);
while( my $seq = $in->next_seq()){
my $id=$seq->id;
my ($gi)=($id=~/gi\|(\d+)\|ref\|/);
if(exists($gi2taxid{$gi}) and exists($taxid2division{$gi2taxid{$gi}})){
$fout{$taxid2division{$gi2taxid{$gi}}}->write_seq($seq);
}else{
print "unknown gi:$gi\n";
}
}
for my $i (keys %fout){
$fout{$i}->close();
}
以上是“perl如何分离NR NT库”这篇文章的所有内容,感谢各位的阅读!相信大家都有了一定的了解,希望分享的内容对大家有所帮助,如果还想学习更多知识,欢迎关注亿速云行业资讯频道!
亿速云「云服务器」,即开即用、新一代英特尔至强铂金CPU、三副本存储NVMe SSD云盘,价格低至29元/月。点击查看>>
免责声明:本站发布的内容(图片、视频和文字)以原创、转载和分享为主,文章观点不代表本网站立场,如果涉及侵权请联系站长邮箱:is@yisu.com进行举报,并提供相关证据,一经查实,将立刻删除涉嫌侵权内容。