IMMAN is designed to retrieve the interolog protein network shared across diverse species. To this end, we first extract orthology relationships among sequences from the various species by iterating over every pair of input species, using the Needleman-Wunsch alignment algorithm and a reciprocal best hit strategy to identify the orthologues through all-versus-all pairwise cross-species alignments. From the orthology assignment, we derive Orthologous Protein Sets (OPSs), clusters of orthologues (at most one per species) that form the nodes of the so-called Interolog Protein Network (IPN).
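The reciprocal best hit idea can be illustrated with a small base R sketch. It assumes a precomputed matrix of alignment scores between the proteins of two species; the matrix values and the helper name `reciprocal_best_hits` are hypothetical and are not part of IMMAN.

```r
# Toy alignment scores: rows are proteins of species A, columns of species B.
# The values are made up for illustration only.
scores <- matrix(
  c(90, 20, 10,
    15, 80, 70,
    12, 30, 25),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("a1", "a2", "a3"), c("b1", "b2", "b3"))
)

# Hypothetical helper: keep a pair (a, b) only if b is the best hit of a
# and a is the best hit of b.
reciprocal_best_hits <- function(s) {
  best_for_row <- unname(apply(s, 1, which.max))  # best B partner of each A protein
  best_for_col <- unname(apply(s, 2, which.max))  # best A partner of each B protein
  keep <- best_for_col[best_for_row] == seq_len(nrow(s))
  data.frame(
    speciesA = rownames(s)[keep],
    speciesB = colnames(s)[best_for_row[keep]],
    score    = s[cbind(which(keep), best_for_row[keep])],
    stringsAsFactors = FALSE
  )
}

rbh <- reciprocal_best_hits(scores)
print(rbh)
```

In this toy case a3 is dropped: its best hit is b2, but the best hit of b2 is a2, so the relationship is not reciprocal.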

We then extract n species-specific interolog protein networks from the STRING database, where each node maps to a single OPS in the IPN, and determine the edges of the resulting IPN by keeping only those edges that link nodes in the IPN and are also present in at least k species-specific networks (where ‘k’ is set as a parameter).
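The edge selection rule can be sketched in base R as follows. The sketch assumes that each species-specific network has already been translated to OPS labels; the edge lists, the OPS names, and the helper `filter_by_coverage` are hypothetical and only illustrate the "at least k networks" criterion, not IMMAN's internal code.

```r
# Hypothetical per-species edge lists, already mapped to OPS labels.
species_edges <- list(
  data.frame(node1 = c("OPS1", "OPS1", "OPS2"),
             node2 = c("OPS2", "OPS3", "OPS3"), stringsAsFactors = FALSE),
  data.frame(node1 = c("OPS2", "OPS1"),
             node2 = c("OPS1", "OPS3"), stringsAsFactors = FALSE),
  data.frame(node1 = "OPS1", node2 = "OPS2", stringsAsFactors = FALSE)
)

# Hypothetical helper: keep edges supported by at least k species networks.
filter_by_coverage <- function(edge_lists, k) {
  # Build an order-independent key per edge, de-duplicated within a species.
  keys <- lapply(edge_lists, function(e) {
    unique(paste(pmin(e$node1, e$node2), pmax(e$node1, e$node2), sep = "|"))
  })
  counts <- table(unlist(keys))
  kept <- names(counts)[counts >= k]
  parts <- strsplit(kept, "|", fixed = TRUE)
  data.frame(
    node1   = vapply(parts, `[`, character(1), 1),
    node2   = vapply(parts, `[`, character(1), 2),
    support = as.integer(counts[kept]),
    stringsAsFactors = FALSE
  )
}

ipn_k2 <- filter_by_coverage(species_edges, k = 2)
print(ipn_k2)
```

Raising k to the number of species keeps only the edges conserved in every network, which is why a stricter threshold tends to give a smaller IPN.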

The alignment process uses a scoring system, a set of values that quantify the likelihood that one residue has been substituted by another in an alignment. This scoring system is called a substitution matrix, and it is derived from statistical analysis of residue substitution data from sets of reliable alignments of highly related sequences.

The identityU value, which ranges from 0 to 100, lets the user control how large the resulting IPN will be. The higher the value of identityU, the more similar the orthologues found by the algorithm must be, and vice versa. The gapOpening and gapExtension arguments set the numeric gap penalties used when aligning candidate orthologous proteins. When the alignment skips a position in one of the sequences, a gap is opened and the gapOpening penalty is applied; the fewer gaps an alignment contains, the more similar the aligned proteins are. The score_threshold argument is specified for evaluating the similarity values between two proteins in substitutionMatrix. It ranges from 0 to 100; however, commonly used values lie between 25 and 30.

The transfer of interactions among orthologues of different species is called the interolog approach. The BestHit argument selects the proteins with the highest similarity in the all-versus-all protein alignment. If an interaction exists between a pair of proteins belonging to two OPSs, an edge is added between those OPSs in the IPN. The coverage threshold (the coverage argument) specifies the minimum number of species in which such an interaction must exist among the pairs of proteins of the OPSs. It ranges from 1 to the number of species. The higher the coverage threshold, the more robust, and usually the smaller, the final IPN. The NetworkShrinkage argument determines whether two similar OPSs that have orthologous proteins in common should be merged. If it is TRUE, the resulting IPN is smaller.
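To make the role of the scoring system and gap penalties concrete, here is a minimal base R sketch of Needleman-Wunsch global alignment scoring. It is a simplification and not IMMAN's implementation: it assumes a plain match/mismatch score in place of a substitution matrix such as BLOSUM62, and a single linear gap cost in place of separate gapOpening and gapExtension penalties. The function name `nw_score` and its default values are hypothetical.

```r
# Simplified Needleman-Wunsch: returns the optimal global alignment score.
# Assumptions: match/mismatch scoring and one linear gap penalty.
nw_score <- function(a, b, match = 1, mismatch = -1, gap = -2) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  n <- length(x)
  m <- length(y)
  # H[i + 1, j + 1] holds the best score for aligning x[1..i] with y[1..j].
  H <- matrix(0, n + 1, m + 1)
  H[, 1] <- gap * (0:n)  # prefix of a aligned against gaps only
  H[1, ] <- gap * (0:m)  # prefix of b aligned against gaps only
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (x[i] == y[j]) match else mismatch
      H[i + 1, j + 1] <- max(H[i, j] + s,        # align x[i] with y[j]
                             H[i, j + 1] + gap,  # x[i] against a gap in b
                             H[i + 1, j] + gap)  # y[j] against a gap in a
    }
  }
  H[n + 1, m + 1]
}

identical_score <- nw_score("GATTACA", "GATTACA")
gapped_score    <- nw_score("GATTACA", "GATTAA")
```

Each gap lowers the score, so an alignment with fewer gaps ends up with a higher score and indicates a closer pair of sequences.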

To use this package, we assume that the “IMMAN” package has been properly installed in the R environment. After installation, the “IMMAN” package can be loaded via

library(IMMAN)

For illustration, we will read two datasets from different species which can be accessed via:

data(Celegance)
data(FruitFly)

Then, we have to make a list of the species datasets and set their taxonomy IDs.

ProteinLists = list(as.character(Celegance$V1), as.character(FruitFly$V1))

List1_Species_ID = 6239  # taxonomy ID Celegance
List2_Species_ID = 7227  # taxonomy ID FruitFly

Species_IDs  = c(List1_Species_ID, List2_Species_ID)

Next, set the parameters to run the analysis. Here is a description of the parameters in IMMAN. For more information, refer to the paper.

identityU: Cutoff value for selecting proteins whose alignment score is greater than or equal to identityU.

substitutionMatrix: The scoring matrix to be used for alignment.

gapOpening and gapExtension: The gap penalties to be used for alignment.

For NetworkShrinkage, coverage, and BestHit, refer to the paper.

STRINGversion: Indicates which version of the STRING database the program should search for the scores of PPIs.

Then, we will set the argument values:

identityU = 30
substitutionMatrix = "BLOSUM62"
gapOpening = -8
gapExtension = -8
NetworkShrinkage = FALSE
coverage = 1
BestHit = TRUE
score_threshold = 400
STRINGversion = "10"
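Before running the analysis, it can help to confirm that the values fall inside the ranges described earlier (identityU between 0 and 100, coverage between 1 and the number of species). The helper below is a hypothetical convenience check and is not part of IMMAN; it repeats two of the example values so that it can be run on its own.

```r
# Hypothetical sanity check for two of the arguments; not an IMMAN function.
check_imman_args <- function(identityU, coverage, n_species) {
  stopifnot(
    is.numeric(identityU), identityU >= 0, identityU <= 100,
    is.numeric(coverage), coverage >= 1, coverage <= n_species
  )
  invisible(TRUE)
}

# Example values from this vignette: two species, coverage of 1.
args_ok <- check_imman_args(identityU = 30, coverage = 1, n_species = 2)
```

A coverage value larger than the number of species would stop with an error here, before any sequences are downloaded.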

Finally, we can run the IMMAN function:

output = IMMAN(ProteinLists, fileNames = NULL, Species_IDs,
               identityU, substitutionMatrix,
               gapOpening, gapExtension, BestHit,
               coverage, NetworkShrinkage,
               score_threshold, STRINGversion,
               InputDirectory = getwd())
## Step 1/4:Downloading amino acid sequences...
## Downloading amino acid sequences of List1
## Downloading amino acid sequences of List2
## Step 2/4: Alignment...
## Align List1 with List2
## Step 3/4: Detection in STRING...
## Detecting List1 in STRING
## Detecting List2 in STRING
## Step 4/4: Retrieving String Network...
## Retrieving List1
## Retrieving List2
## Producing IPN...
## DONE!

In order to see some particular parts of the result, you can use:

head(output$IPNEdges)
##     node1   node2
## 1 OPS0001 OPS0004
## 2 OPS0001 OPS0009
## 3 OPS0001 OPS0019
## 4 OPS0001 OPS0021
## 5 OPS0001 OPS0025
## 6 OPS0001 OPS0027
head(output$IPNNodes)
##               node1            node2 OPSLabel
## 1      6239.C18E9.6 7227.FBpp0085338  OPS0001
## 2      6239.R07B7.5 7227.FBpp0088005  OPS0002
## 3    6239.Y67H2A.4a 7227.FBpp0111941  OPS0003
## 4 6239.Y22D7AL.5a.2 7227.FBpp0073290  OPS0004
## 5    6239.R07G3.5.2 7227.FBpp0070350  OPS0005
## 6      6239.MTCE.31 7227.FBpp0100177  OPS0006
str(output$Networks)
## List of 2
##  $ :'data.frame':    141 obs. of  2 variables:
##   ..$ from: Factor w/ 32 levels "6239.B0250.5",..: 2 1 4 6 7 1 6 7 8 2 ...
##   ..$ to  : Factor w/ 35 levels "6239.C05G5.4.1",..: 1 2 3 4 5 6 6 6 6 7 ...
##  $ :'data.frame':    168 obs. of  2 variables:
##   ..$ from: Factor w/ 37 levels "7227.FBpp0070087",..: 3 1 1 3 7 2 3 3 7 3 ...
##   ..$ to  : Factor w/ 41 levels "7227.FBpp0070873",..: 1 2 3 3 4 5 5 6 6 7 ...
head(output$Networks[[1]])
##             from               to
## 1 6239.C03G5.1.2   6239.C05G5.4.1
## 2   6239.B0250.5 6239.C16A3.10a.1
## 3 6239.C06G3.11a     6239.C18E9.6
## 4   6239.C18E9.6   6239.C29E4.8.1
## 5 6239.C29E4.8.1   6239.C34B2.6.1
## 6   6239.B0250.5  6239.C34C12.8.1
str(output$maps)
## List of 2
##  $ :'data.frame':    49 obs. of  2 variables:
##   ..$ UNIPROT_AC: chr [1:49] "Q9XTI0" "Q22347" "Q9XXK1" "P46561" ...
##   ..$ STRING_id : chr [1:49] "6239.B0250.5" "6239.T08G2.3.1" "6239.H28O16.1a" "6239.C34E10.6.2" ...
##  $ :'data.frame':    55 obs. of  2 variables:
##   ..$ UNIPROT_AC: chr [1:55] "Q9V8M5" "Q9VSA3" "P35381" "Q05825" ...
##   ..$ STRING_id : chr [1:55] "7227.FBpp0085821" "7227.FBpp0076520" "7227.FBpp0071794" "7227.FBpp0305828" ...
head(output$maps[[2]])
##   UNIPROT_AC        STRING_id
## 1     Q9V8M5 7227.FBpp0085821
## 2     Q9VSA3 7227.FBpp0076520
## 3     P35381 7227.FBpp0071794
## 4     Q05825 7227.FBpp0305828
## 5     O02649 7227.FBpp0073290
## 6     Q9W401 7227.FBpp0070871
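The pieces of the result can be combined with base R. The sketch below shows one possible way to attach UniProt accessions to OPS labels and to count how many edges each OPS has. To stay self-contained it rebuilds a few rows copied from the printed output above; with a real run you would presumably use output$IPNNodes, output$maps[[2]], and output$IPNEdges directly. The variable names are hypothetical.

```r
# A few rows copied from the printed output above, for illustration only.
ipn_nodes <- data.frame(
  node1    = c("6239.C18E9.6", "6239.Y22D7AL.5a.2"),
  node2    = c("7227.FBpp0085338", "7227.FBpp0073290"),
  OPSLabel = c("OPS0001", "OPS0004"),
  stringsAsFactors = FALSE
)
fly_map <- data.frame(
  UNIPROT_AC = c("Q9V8M5", "O02649"),
  STRING_id  = c("7227.FBpp0085821", "7227.FBpp0073290"),
  stringsAsFactors = FALSE
)
ipn_edges <- data.frame(
  node1 = rep("OPS0001", 6),
  node2 = c("OPS0004", "OPS0009", "OPS0019", "OPS0021", "OPS0025", "OPS0027"),
  stringsAsFactors = FALSE
)

# Attach the UniProt accession of the fruit fly member to each OPS.
# Only OPSs whose fly protein appears in the mapping table are kept.
annotated <- merge(ipn_nodes, fly_map, by.x = "node2", by.y = "STRING_id")
print(annotated)

# Number of IPN edges touching each OPS, highest first.
ops_degree <- sort(table(c(ipn_edges$node1, ipn_edges$node2)), decreasing = TRUE)
print(ops_degree)
```

Because merge() performs an inner join by default, OPSs without a matching entry in the mapping table are silently dropped; passing all.x = TRUE would keep them with a missing accession.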