This function calculates (conditional-)reciprocal best hit (CRBHit) pairs from two amino acid (AA) fasta files. Conditional-reciprocal best hit pairs were introduced by Aubry S, Kelly S et al. (2014). Sequence searches are performed with last (Kiełbasa, SM et al. 2011) [default], mmseqs2 (Steinegger, M and Soeding, J 2017), diamond (Buchfink, B et al. 2021) or lambda3. If aafile1 and aafile2 point to the same input, a selfblast is conducted.
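To make the term concrete, the following base-R sketch derives classical reciprocal best hits from two toy hit tables. The tables, the column names (query, target, evalue) and the helper best_hit() are hypothetical and are not taken from CRBHits internals; ties between equally good hits are ignored here.

```r
# Toy hit tables (hypothetical data): one row per hit, lower evalue = better.
hits_1to2 <- data.frame(
  query  = c("a1", "a1", "a2", "a3"),
  target = c("b1", "b2", "b2", "b1"),
  evalue = c(1e-50, 1e-10, 1e-40, 1e-5),
  stringsAsFactors = FALSE
)
hits_2to1 <- data.frame(
  query  = c("b1", "b1", "b2"),
  target = c("a1", "a3", "a2"),
  evalue = c(1e-48, 1e-6, 1e-42),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the lowest-evalue hit for every query.
best_hit <- function(hits) {
  hits <- hits[order(hits$query, hits$evalue), ]
  hits[!duplicated(hits$query), c("query", "target")]
}

best_1to2 <- best_hit(hits_1to2)
best_2to1 <- best_hit(hits_2to1)

# A pair is reciprocal when the best hit of the target points back to the query.
rbh <- merge(best_1to2, best_2to1,
             by.x = c("query", "target"), by.y = c("target", "query"))
print(rbh)
```

In this toy data a3 hits b1 best, but b1 hits a1 best, so the pair a3/b1 is not reciprocal and is dropped.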
Usage
aafile2rbh(
  aafile1,
  aafile2,
  dbfile1 = NULL,
  dbfile2 = NULL,
  searchtool = "last",
  lastpath = paste0(find.package("CRBHits"), "/extdata/last-1595/bin/"),
  lastD = 1e+06,
  lastm = 10,
  mmseqs2path = NULL,
  mmseqs2sensitivity = 5.7,
  mmseqs2maxseqs = 300,
  diamondpath = NULL,
  diamondsensitivity = "--sensitive",
  diamondmaxtargetseqs = 0,
  lambda3path = NULL,
  lambda3sensitivity = "sensitive",
  lambda3nummatches = 25,
  outpath = "/tmp",
  crbh = TRUE,
  keepSingleDirection = FALSE,
  eval = 0.001,
  qcov = 0,
  tcov = 0,
  pident = 0,
  alnlen = 0,
  rost1999 = FALSE,
  filter = NULL,
  plotCurve = FALSE,
  fit.type = "mean",
  fit.varweight = 0.1,
  fit.min = 5,
  threads = 1,
  aafile2tmp = TRUE,
  remove = TRUE,
  remove.db = TRUE
)
Arguments
- aafile1
aa1 fasta file [mandatory]
- aafile2
aa2 fasta file [mandatory]
- dbfile1
aa1 db file [optional]
- dbfile2
aa2 db file [optional]
- searchtool
specify sequence search algorithm last, mmseqs2, diamond or lambda3 [default: last]
- lastpath
specify the PATH to the last binaries [default: extdata/last-1595/bin/ inside the installed CRBHits package]
- lastD
last option D: query letters per random alignment [default: 1e6]
- lastm
last option m: maximum initial matches per query position [default: 10]
- mmseqs2path
specify the PATH to the mmseqs2 binaries [default: NULL]
- mmseqs2sensitivity
specify the sensitivity option of mmseqs2 [default: 5.7]
- mmseqs2maxseqs
mmseqs2 option: Maximum results per query sequence allowed to pass the prefilter [default: 300]
- diamondpath
specify the PATH to the diamond binaries [default: NULL]
- diamondsensitivity
specify the sensitivity option of diamond [default: --sensitive]
- diamondmaxtargetseqs
specify the maximum number of target sequences per query option of diamond [default: 0]
- lambda3path
specify the PATH to the lambda3 binaries [default: NULL]
- lambda3sensitivity
specify the sensitivity option of lambda3 [default: sensitive]
- lambda3nummatches
specify the number of matches per query option of lambda3 [default: 25]
- outpath
specify the output PATH [default: /tmp]
- crbh
specify if conditional-reciprocal hit pairs should be retained as secondary hits [default: TRUE]
- keepSingleDirection
specify if single direction secondary hit pairs should be retained [default: FALSE]
- eval
evalue [default: 1e-3]
- qcov
query coverage [default: 0.0]
- tcov
target coverage [default: 0.0]
- pident
percent identity [default: 0.0]
- alnlen
alignment length [default: 0.0]
- rost1999
specify if hit pairs should be filtered by equation 2 of Rost 1999 [default: FALSE]
- filter
specify additional custom filters as list to be applied on hit pairs [default: NULL]
- plotCurve
specify if crbh fitting curve should be plotted [default: FALSE]
- fit.type
specify if mean or median should be used for fitting [default: mean]
- fit.varweight
factor for fitting function to consider neighborhood [default: 0.1]
- fit.min
specify minimum neighborhood alignment length [default: 5]
- threads
number of parallel threads [default: 1]
- aafile2tmp
specify if aa input files sequenceIDs should be reduced to the first word and be written to a temporary aa file [default: TRUE]
- remove
specify if last result files should be removed [default: TRUE]
- remove.db
specify if last db files should be removed [default: TRUE]
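The filter argument takes custom filters as a list. The base-R sketch below shows one way such a list could be applied to a hit table, under the assumption that each filter is a function that receives the hit pairs and returns a subset of them. The toy table, its column names and the two filter functions are hypothetical; consult the package documentation for the columns actually available.

```r
# Toy hit table; the column names are hypothetical, not CRBHits output names.
hits <- data.frame(
  query  = c("a1", "a2", "a3"),
  target = c("b1", "b2", "b3"),
  alnlen = c(250, 40, 120),
  pident = c(92.5, 35.0, 61.0),
  stringsAsFactors = FALSE
)

# Hypothetical custom filters: each takes a hit table and returns a subset.
min_alnlen <- function(hits, value = 100) hits[hits$alnlen >= value, , drop = FALSE]
min_pident <- function(hits, value = 50) hits[hits$pident >= value, , drop = FALSE]

# Filters collected in a list, mirroring the list shape the argument asks for.
custom_filters <- list(min_alnlen, min_pident)

# Apply every filter in turn, starting from the full table.
filtered <- Reduce(function(tbl, f) f(tbl), custom_filters, hits)
print(filtered)
```

Only a1 and a3 pass both filters in this toy table; a2 is removed by either one.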
Value
List of three (crbh=FALSE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
List of four (crbh=TRUE)
1: $crbh.pairs
2: $crbh1 matrix; query > target
3: $crbh2 matrix; target > query
4: $rbh1_rbh2_fit; evalue fitting function
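As a rough illustration of what an evalue fitting function can look like, the base-R sketch below derives a cutoff per alignment length from toy RBH data by summarising -log10(evalue) over a neighborhood of alignment lengths. The data, the name fit_cutoff() and the exact neighborhood rule are assumptions that only loosely mirror fit.type, fit.varweight and fit.min; the function returned in $rbh1_rbh2_fit may differ in detail.

```r
# Toy RBH data (hypothetical): alignment length and evalue of each RBH pair.
rbh <- data.frame(
  alnlen = c(20, 22, 50, 55, 100, 105, 110),
  evalue = c(1e-5, 1e-7, 1e-20, 1e-24, 1e-50, 1e-54, 1e-58)
)
rbh$score <- -log10(rbh$evalue)

# Hypothetical fitting helper: for an alignment length L, summarise the scores
# of all RBHs whose alignment length lies within L +/- max(varweight * L, min_width).
fit_cutoff <- function(L, rbh, type = "mean", varweight = 0.1, min_width = 5) {
  width <- max(varweight * L, min_width)
  near <- rbh$score[abs(rbh$alnlen - L) <= width]
  if (length(near) == 0) return(NA_real_)
  if (type == "median") median(near) else mean(near)
}

cutoff_21 <- fit_cutoff(21, rbh)
cutoff_105 <- fit_cutoff(105, rbh)

# A non-RBH candidate hit is kept as a secondary hit in this sketch
# when its score reaches the cutoff for its alignment length.
candidate <- data.frame(alnlen = 105, evalue = 1e-60)
keep <- -log10(candidate$evalue) >= fit_cutoff(candidate$alnlen, rbh)
print(c(cutoff_21 = cutoff_21, cutoff_105 = cutoff_105))
print(keep)
```

The neighborhood widens with alignment length, so short alignments are judged against a fixed minimum window while long alignments are compared with proportionally more distant neighbors.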
References
Aubry S, Kelly S et al. (2014) Deep Evolutionary Comparison of Gene Expression Identifies Parallel Recruitment of Trans-Factors in Two Independent Origins of C4 Photosynthesis. PLOS Genetics, 10(6) e1004365.
Kiełbasa, SM et al. (2011) Adaptive seeds tame genomic sequence comparison. Genome Research, 21(3), 487-493.
Rost B (1999) Twilight zone of protein sequence alignments. Protein Engineering, 12(2), 85-94.
Examples
## compile last-1595 within CRBHits
CRBHits::make_last()
## load example sequence data
athfile <- system.file("fasta", "ath.aa.fasta.gz", package="CRBHits")
alyfile <- system.file("fasta", "aly.aa.fasta.gz", package="CRBHits")
## get CRBHit pairs
ath_aly_crbh <- aafile2rbh(
  aafile1 = athfile,
  aafile2 = alyfile,
  plotCurve = TRUE)
dim(ath_aly_crbh$crbh.pairs)
#> [1] 211 3
## get classical reciprocal best hit (RBHit) pairs
ath_aly_rbh <- aafile2rbh(
  aafile1 = athfile,
  aafile2 = alyfile,
  crbh = FALSE)
dim(ath_aly_rbh$crbh.pairs)
#> [1] 181 3
## selfblast
ath_selfblast_crbh <- aafile2rbh(
  aafile1 = athfile,
  aafile2 = athfile,
  plotCurve = TRUE)
## see ?cds2rbh for more examples