Are the results of SIMAP interesting for Rosetta?

Author	Message
tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0	Message 17578 - Posted: 3 Jun 2006, 14:42:33 UTC
I wonder whether the BOINC-SIMAP project produces valuable results for Rosetta. SIMAP is a database of protein similarities. Do those similarities provide useful information about the shape of a protein?
http://boinc.bio.wzw.tum.de/boincsimap/
ID: 17578 · Rating: 2

stewjack Joined: 23 Apr 06 Posts: 39 Credit: 95,871 RAC: 0	Message 17708 - Posted: 5 Jun 2006, 21:16:11 UTC - in response to Message 17578. Last modified: 5 Jun 2006, 21:28:52 UTC
Quote: "I wonder whether the BOINC-SIMAP project produces valuable results for Rosetta. SIMAP is a database of protein similarities. Do those similarities provide useful information about the shape of a protein? http://boinc.bio.wzw.tum.de/boincsimap/"
I just noticed your post. I also run Rosetta@home (75%) and boincsimap (25%). I don't know if the Rosetta project has ever used the SIMAP database, because it is so new, but I am quite certain that the Rosetta@home project makes use of protein similarities.
What is SIMAP? SIMAP is a database of protein similarities. It contains nearly all currently published protein sequences and is continuously updated. Protein similarities are computed using the FASTA algorithm, which provides optimal speed and sensitivity. SIMAP is to our knowledge the only project that combines comprehensive coverage of all known proteins with incremental update capabilities. http://boinc.bio.wzw.tum.de/boincsimap/project.php
What is SIMAP used for? Because of the huge number of known protein sequences in public databases, it became clear that most of them will not be experimentally characterized in the near future. Nevertheless, proteins that have evolved from a common ancestor (so-called orthologs) often share the same functions. So it is possible to infer the function of a non-characterized protein from an ortholog with known function. A well-known example is the investigation of mouse genes and proteins, whose results in many cases also hold for the orthologous human genes and proteins. Protein similarities provide information about relations between proteins and are necessary for the prediction of orthologs.
Note: I have noticed that Rosetta@home often mentions homologs while SIMAP talks about orthologs, but I think that orthologs are just a subset of homologs. I don't fully understand whether the difference in terminology is significant.
Homolog: any member of a set of genes, DNA sequences or protein sequences whose sequences show a high degree of one-to-one correspondence.
Ortholog: a homologous sequence found in different species and derived from a common ancestor.
Jack
ID: 17708 · Rating: 0

tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0	Message 17807 - Posted: 6 Jun 2006, 17:02:14 UTC Last modified: 6 Jun 2006, 17:06:59 UTC
Rhiju, Kim, Bin, please let us know whether you already know about this, use it, or don't use it: http://webclu.bio.wzw.tum.de/cgi-bin/simap/start.pl
ID: 17807 · Rating: 1

stewjack Joined: 23 Apr 06 Posts: 39 Credit: 95,871 RAC: 0	Message 17862 - Posted: 6 Jun 2006, 23:52:37 UTC Last modified: 6 Jun 2006, 23:59:36 UTC
Tralala, are you aware of the new Windows/x86 application that SIMAP or MIPS is developing?
HMMER@home: http://boinc.bio.wzw.tum.de/boincsimap/apps.php
See the SIMAP message board thread "New Application!": http://boinc.bio.wzw.tum.de/boincsimap/forum/viewtopic.php?p=3044
Jack
ID: 17862 · Rating: -0.99999999999999

tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0	Message 17909 - Posted: 7 Jun 2006, 9:00:33 UTC Last modified: 7 Jun 2006, 9:09:14 UTC
Another link from SIMAP, where one can find similarities that aren't in the database yet: http://mips.gsf.de/genre/proj/simap
@stewjack: I am aware of that, but what does it have to do with the topic? This thread is about the scientific usefulness of SIMAP for Rosetta.
ID: 17909 · Rating: 0

stewjack Joined: 23 Apr 06 Posts: 39 Credit: 95,871 RAC: 0	Message 18053 - Posted: 8 Jun 2006, 3:30:01 UTC - in response to Message 17909. Last modified: 8 Jun 2006, 3:45:57 UTC
Quote: "@stewjack: I am aware of that, but what does it have to do with the topic? This thread is about the scientific usefulness of SIMAP for Rosetta."
You are the only other person who has mentioned the connection between SIMAP and protein folding that I also believe exists. I guess I was interested in the relationship between SIMAP's data products, and possibly also HMMER's, and the scientific goals of Rosetta and other folding projects. I was interested in SIMAP's usefulness to any effort to improve our knowledge of protein 3D structure.
If HMMER does not replace SIMAP, then I will have to consider attaching to HMMER@home as well as SIMAP and Rosetta. It will depend on what I learn about all three projects. I will not change anything during CASP, so I have plenty of time to learn more about HMMER. Heck, I may move on to Einstein at the end of the summer, but right now protein folding, and particularly Rosetta, is much more interesting.
Jack
ID: 18053 · Rating: 0

tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0	Message 26077 - Posted: 5 Sep 2006, 8:10:14 UTC
bump
Now that CASP is over and the transition to the new credit system has been successfully completed, maybe someone from the project team can find the time to answer whether BOINC SIMAP (Similarity Matrix of Proteins) is of any use for Rosetta: http://webclu.bio.wzw.tum.de/cgi-bin/simap/start.pl and the BOINC project is here: http://boinc.bio.wzw.tum.de/boincsimap/
ID: 26077 · Rating: 0

James Thompson Joined: 13 Oct 05 Posts: 46 Credit: 186,109 RAC: 0	Message 26116 - Posted: 5 Sep 2006, 17:18:02 UTC - in response to Message 26077.
We do not currently use the results of SIMAP in our laboratory. However, searching a sequence database for sequences similar to a given query sequence is a very common task in computational biology, and there are applications that use this technique.
First, let me give you the ten-second rundown on algorithms for aligning two protein (or DNA) sequences:
- Smith-Waterman: an exhaustive algorithm guaranteed to find the best local alignment given a scoring system.
- FASTA: a heuristic approximation to Smith-Waterman that looks for "seeds" of true matches by finding submatches of a given length (usually 3-5 for protein sequences, or 10-12 for DNA sequences).
- BLAST: another heuristic algorithm that improves on the speed of FASTA without significantly decreasing its ability to find good matches.
These algorithms can be extended to compare one protein vs. many proteins, as one might do with a CASP target of unknown function. However, using Smith-Waterman quickly becomes unreasonable, because its running time grows with the product of the two sequence lengths for every pair compared, which is too slow for searches against a large database.
SIMAP is using the FASTA algorithm for computing similarity between proteins, while for most of our purposes we use BLAST (and most often a variant of BLAST known as PSI-BLAST). BLAST is definitely faster than FASTA, and for similar sequences they give the same results. The speed gains from using BLAST are especially significant for our purposes: we're dealing with a large number of comparisons on our hardware because we're doing [...]
Certainly the idea of pre-computing protein similarities is a good one. However, when performing a PSI-BLAST search, pre-computing these similarities presents a number of problems. I am not quite sure why SIMAP decided on FASTA rather than BLAST, as I have not reviewed that project extensively.
I need to run right now, but if I have time later today or tomorrow I'll post more on why we use PSI-BLAST rather than BLAST, and I'll tell you folks about a project similar to SIMAP that tries to accomplish a similar goal using 3-dimensional structures of proteins. Hope that this makes sense!
Cheers, James Thompson
Quote: "bump Now that CASP is over and the transition to the new credit system has been successfully completed, maybe someone from the project team can find the time to answer whether BOINC SIMAP (Similarity Matrix of Proteins) is of any use for Rosetta: http://webclu.bio.wzw.tum.de/cgi-bin/simap/start.pl and the BOINC project is here: http://boinc.bio.wzw.tum.de/boincsimap/"
ID: 26116 · Rating: 0
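
As a rough illustration of the exhaustive approach listed in the post above, here is a minimal Python sketch of Smith-Waterman local alignment scoring. It is not code from Rosetta or SIMAP: the function name `smith_waterman` and the scoring values (match, mismatch, linear gap penalty) are assumptions chosen for readability, whereas real tools use substitution matrices and more elaborate gap penalties. The two nested loops show why the cost per pair grows with the product of the sequence lengths.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Scoring values are illustrative assumptions, not those of any real tool.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] holds the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):          # len(a) iterations ...
        for j in range(1, cols):      # ... times len(b) iterations
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = H[i - 1][j] + gap    # gap in b
            left = H[i][j - 1] + gap  # gap in a
            # The 0 lets an alignment restart anywhere, which makes it local.
            H[i][j] = max(0, diag, up, left)
            best = max(best, H[i][j])
    return best


# The shared core "ACDE" is found despite unrelated flanking residues.
print(smith_waterman("XXACDEXX", "YYACDEYY"))  # → 8
```

Heuristics such as FASTA and BLAST avoid filling this whole table for every database sequence by first looking for short exact matches and only aligning around them.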

tralala Joined: 8 Apr 06 Posts: 376 Credit: 581,806 RAC: 0	Message 26130 - Posted: 5 Sep 2006, 20:41:50 UTC - in response to Message 26116.
Hi James, thanks for your answer. In a nutshell, you are saying that protein similarities are interesting for Rosetta and that you currently compute them yourselves. If SIMAP had all known protein sequences precomputed, wouldn't it be easier and faster for you just to look them up rather than calculating them? If I understand you correctly, BLAST does not give better results, just faster ones. Since SIMAP precomputes them and has potentially unlimited computing resources via BOINC, perhaps they decided on FASTA to get the best accuracy one can achieve. They are trying a new algorithm soon, which they call hmmer; I don't know what that means, though.
I'd appreciate it if you could elaborate a bit on how you do it, if you find the time.
ID: 26130 · Rating: 0

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0	Message 26145 - Posted: 6 Sep 2006, 0:29:26 UTC Last modified: 6 Sep 2006, 3:16:02 UTC
Sequence similarity searches are very important and relevant to our work. They are, in one way or another, the first step in protein structure prediction. Proteins with similar sequences are likely to have similar structures, depending on the level of similarity. So, generally, the more similar a sequence is to that of a known structure, the easier it is to predict its structure. Structure prediction based on similarity to an already known structure is termed comparative modeling (CM) or homology modeling, and is typically done by using the aligned regions as a starting template and then modeling the remaining variable regions. This is of course an oversimplification, as there are many different CM methods out there.
The difficulty arises when sequences are less similar. For more distant sequences there are better detection/alignment methods, which we use; as James mentioned, PSI-BLAST is one of them. PSI-BLAST is more sensitive than FASTA because it uses an iterative approach in which each step uses a position-specific profile generated from the previous search, so variable positions do not contribute much to the score. There are many more sensitive methods that use profile-profile searches, and even methods that use more than just sequence information, such as predicted structural elements like secondary structure. The best automated remote detection methods actually use multiple methods and then select based on a consensus.
In our automated structure prediction server, Robetta, the first step is a sequence similarity search using BLAST and PSI-BLAST. If a very similar match is found, we go directly to our comparative modeling method, which still tries to refine the alignment further using an alignment method developed in our lab by Dylan Chivian, who also developed Robetta. Perhaps in this initial step, SIMAP could be used to detect the most similar matches very quickly.
On a side note, Robetta also uses hmmer to search the Pfam database to try to assign protein domains if similarity to a known structure is not found.
(minor edits made)
ID: 26145 · Rating: 0
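
To make the remark about position-specific profiles a bit more concrete, here is a small Python sketch. It is only an illustration under stated assumptions, not PSI-BLAST itself: the names `build_profile` and `score`, the uniform background frequency, and the simple pseudocount are all made up for this example, and the iterative re-searching that PSI-BLAST performs is left out. It shows how a column that is conserved in the aligned hits ends up with a much wider score range than a variable column, so the variable column influences the total less.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_profile(aligned, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length, gap-free sequences.

    Assumes a uniform background distribution, which real tools do not use.
    """
    background = 1.0 / len(ALPHABET)
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return profile


def score(profile, seq):
    """Sum the position-specific scores of seq against the profile."""
    return sum(column[aa] for column, aa in zip(profile, seq))


# Columns 0 and 2 are conserved (G, W); column 1 varies freely.
hits = ["GAW", "GCW", "GDW", "GEW"]
profile = build_profile(hits)

# Changing the variable middle residue costs less than changing a conserved one.
print(score(profile, "GAW") > score(profile, "GKW") > score(profile, "KAW"))  # → True
```

In an iterative search, a profile like this would be rebuilt from each round's hits and used as the query for the next round.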

adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 42	Message 26165 - Posted: 6 Sep 2006, 11:52:56 UTC
Quote: "They are trying a new algorithm soon, which they call hmmer; I don't know what that means, though."
The hmmer application at SIMAP is an implementation of "Hidden Markov Models", a type of probabilistic sequence model. It is not released yet, but is pretty close.
As an aside, the current SIMAP program is the CPU usage champion on my machines (both Intel P-IVs): the CPU temperature usually climbs 2-3 degrees over average when SIMAP is running, and 4-5 degrees if SIMAP is running on both virtual CPUs of my hyper-threaded chip.
Wave upon wave of demented avengers march cheerfully out of obscurity into the dream.
ID: 26165 · Rating: -1

Mod.DE Volunteer moderator Joined: 23 Aug 06 Posts: 78 Credit: 0 RAC: 0	Message 26175 - Posted: 6 Sep 2006, 13:24:19 UTC - in response to Message 26145. Last modified: 6 Sep 2006, 14:29:16 UTC
Quote: "On a side note, Robetta also uses hmmer to search the Pfam database to try to assign protein domains if similarity to a known structure is not found."
Thanks for the insights. Is the hmmer you use, an implementation of hidden Markov models, the same one that SIMAP will roll out soon? If they have a complete database with both FASTA and hmmer results, would it be a good starting point for your Robetta refinement, without the computation costs? Could one add PSI-BLAST as a third algorithm besides FASTA and hmmer as well, or is it a decision between either FASTA or PSI-BLAST?
(edited for clarity)
I am a forum moderator! Am I?
ID: 26175 · Rating: 0

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0	Message 26183 - Posted: 6 Sep 2006, 16:45:15 UTC
hmmer is a sequence analysis package that uses hidden Markov models. See their website at http://hmmer.wustl.edu/. The source is available under the GNU GPL. The package includes hmmpfam, which we use to search a Pfam HMM library. Pfam includes a curated database of multiple sequence alignments and HMMs representing protein domains and families. See their site at http://www.sanger.ac.uk/Software/Pfam/.
Precomputed PSI-BLAST results would help reduce the computational demand of running the searches, but this is not trivial: we change parameters to loosen or tighten the searches when generating the profiles, and the sequence databases are very large, which makes distributed PSI-BLAST difficult (but not impossible). In fact, PSI-BLAST has become a limiting factor in our Robetta server, because of old hardware and because the sequence database has grown dramatically in the last few years due to genome sequencing projects. Currently, we are working on ways to fix this problem as we transition to a new data center.
Distributed PSI-BLAST would be an interesting, yet difficult, project. Keep in mind that the sequence databases are continually expanding, so the databases and searches would have to be continually updated as well.
ID: 26183 · Rating: 1

David E K Volunteer moderator Project administrator Project developer Project scientist Joined: 1 Jul 05 Posts: 1018 Credit: 4,334,829 RAC: 0	Message 26188 - Posted: 6 Sep 2006, 18:10:57 UTC
I forgot to mention that for the HPF World Community Grid project, Rich Bonneau and his colleagues, including Lars, who is currently working in our lab, used a scaled-down version of Ginzu, the domain detection method that Dylan Chivian developed and that is used in Robetta's initial step to determine which sequences and sub-sequences (predicted domains) to target. The scaled-down method was basically PSI-BLAST, a remote detection method (ORFeus), Pfam, and an in-house alignment-based method for domain parsing. They have search results for a number of genomes, I believe. This was pre-HPF processing to determine which targets to use before the first work units were even sent out a couple of years ago.
They are working on the paper now, and the hope/plan is to link the Robetta server up with the HPF database to use as a prefilter. So if a user submits a sequence that has already been processed by HPF, we would send them to the HPF database. The caveat to this approach is that methods have improved since then. However, there are efforts to develop a confidence function that will let a researcher know whether a prediction is likely to be correct, and, to my knowledge, there will also be experimental data in the HPF database that researchers can access for more insight into the particular target of interest.
ID: 26188 · Rating: 1

rattei Joined: 6 Sep 06 Posts: 1 Credit: 0 RAC: 0	Message 26206 - Posted: 6 Sep 2006, 20:42:35 UTC - in response to Message 26116.
Dear James,
Quote: "I am not quite sure why SIMAP decided on FASTA rather than BLAST, as I have not reviewed that project extensively."
Sorry, I only saw this thread today. We chose FASTA because of its better sensitivity compared to BLAST when ktup=1 is used, and that is what we do for the SIMAP workunits. We store the Smith-Waterman alignment data for the hits found by FASTA in our database, which is equivalent to the best HSP of blastp. So if there is any need for BLASTP-like alignment data of known proteins, just contact me. It would be great to establish cooperation between two BOINC projects.
Best regards
Thomas Rattei
SIMAP and BOINCSIMAP
http://mips.gsf.de/simap and http://boinc.bio.wzw.tum.de/boincsimap
ID: 26206 · Rating: 8
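
The sensitivity argument for ktup=1 can be illustrated with a small Python sketch of the seeding step. This is a simplified illustration, not FASTA's actual code: the function name `ktup_seeds` and the toy sequences are assumptions, and the real program goes on to score diagonals and rescore candidate regions, which is omitted here. The point is only that a word length of 1 still finds seed matches between diverged sequences where a word length of 2 finds none.

```python
from collections import defaultdict


def ktup_seeds(query, subject, ktup):
    """Return (query_pos, subject_pos) pairs where a word of length ktup matches exactly."""
    # Index every word of length ktup in the query by its start position.
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    # Look up each subject word in that index.
    seeds = []
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            seeds.append((i, j))
    return seeds


# Two toy sequences that agree only at every other position.
print(ktup_seeds("MKVLA", "MRVIA", 1))  # → [(0, 0), (2, 2), (4, 4)]
print(ktup_seeds("MKVLA", "MRVIA", 2))  # → []
```

The trade-off is speed: shorter words produce many more seeds to follow up, which matters less when the work is spread over many volunteer machines.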

daniele Joined: 12 Oct 06 Posts: 18 Credit: 20,328 RAC: 0	Message 29575 - Posted: 18 Oct 2006, 13:39:24 UTC - in response to Message 26206. Last modified: 18 Oct 2006, 13:40:38 UTC
Quote: "So if there is any need for BLASTP-like alignment data of known proteins, just contact me. It would be great to establish cooperation between two BOINC projects."
Let us know how this story ends, it's important!!!
ID: 29575 · Rating: 0

adrianxw Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 42	Message 29663 - Posted: 19 Oct 2006, 21:26:04 UTC Last modified: 19 Oct 2006, 21:35:01 UTC
This could be very important, as SIMAP is now (within experimental error) up to date with SIMAP and HMMER, and has lapsed into a periodic update project. I am also wondering if "we" (note: SIMAP hat on) can help "us" (note: now with Rosetta hat on).
OT - Waves to Thomas!
ID: 29663 · Rating: 0

[BAT]Krikke Joined: 20 Sep 05 Posts: 5 Credit: 669,899 RAC: 0	Message 42284 - Posted: 18 Jun 2007, 13:27:09 UTC - in response to Message 29663.
Quote: "This could be very important, as SIMAP is now (within experimental error) up to date with SIMAP and HMMER, and has lapsed into a periodic update project. I am also wondering if "we" (note: SIMAP hat on) can help "us" (note: now with Rosetta hat on). OT - Waves to Thomas!"
Is there any news on this subject? Is there cooperation between the two projects, or is this impossible?
ID: 42284 · Rating: 0

Tom Philippart Joined: 29 May 06 Posts: 183 Credit: 834,667 RAC: 0	Message 42288 - Posted: 18 Jun 2007, 16:09:14 UTC - in response to Message 42284. Last modified: 18 Jun 2007, 16:09:47 UTC
I asked about it at the SIMAP forums some time ago.
Original answer (23 Oct 2006): "Ja, wir haben ueber Moeglichkeiten der Zusammenarbeit diskutiert. Es ist momentan noch unklar, ob die SIMAP-Daten fuer Rosetta hilfreich sind. Bislang war das nicht der Fall, weil nur Aehnlichkeiten mit einer kleinen Proteindatenbank (PDB) benoetigt wurden, die konnte man schneller selbst rechnen als aus SIMAP holen. Aber es wird momentan diskutiert diese Aehnlichkeitssuche auszuweiten. Da warte ich noch auf das feedback ob das tatsaechlich so kommt. Beste Gruesse Thomas"
http://boinc.bio.wzw.tum.de/boincsimap/forum/viewtopic.php?t=567
My translation (I tried to keep it as accurate as possible): "Yes, we have discussed possibilities for collaboration. At the moment it is still unclear whether the SIMAP data are helpful for Rosetta. So far that has not been the case, because only similarities to a small protein database (PDB) were needed, and those could be computed faster in-house than fetched from SIMAP. But extending this similarity search is currently under discussion. I am still waiting for feedback on whether that will actually happen. Best regards, Thomas"
ID: 42288 · Rating: 0