Message boards : Number crunching : Constant computation errors.
Author | Message |
---|---|
Bryn Send message Joined: 28 Nov 20 Posts: 3 Credit: 920,974 RAC: 0 |
Why am I even bothering to try and contribute to the solving of corona when everything keeps coming up with this problem? Application Rosetta 4.20 Name preetham_gen_26749_0001_0001_0_SAVE_ALL_OUT_2911458_130 State Computation error Received 16/03/2022 12:04:15 Report deadline 19/03/2022 12:04:09 Estimated computation size 80,000 GFLOPs CPU time 00:00:02 Elapsed time 00:00:16 Executable rosetta_4.20_windows_x86_64.exe |
Jean-David Beyer Send message Joined: 2 Nov 05 Posts: 187 Credit: 6,328,464 RAC: 6,028 |
Why am I even bothering to try and contribute to the solving of corona when everything keeps coming up with this problem? The same reason I do, perhaps? It is my guess that they set up a huge group of work units that are defective, and did not test them before letting them loose on us. Either by incompetence, or by some terrible mistake. And perhaps they have inadequate staff to watch how things were going. On my machine, all recent work units of this batch failed. I disabled getting any new ones for an hour or so, and then tried again for a little while. I now have a 100% failure rate on over 300 work units, so I stopped getting new work units. Most of mine have some other machine working on them, and they all failed too. My machine has an Intel Xeon processor running Red Hat Enterprise Linux 5.4. Other users that failed on my work units run either Linux or Windows. They all fail too. Sooner or later, my guess is that someone will get back from the long weekend, notice the 100% failure rate and do something about it. |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1667 Credit: 17,444,761 RAC: 24,783 |
Why am I even bothering to try and contribute to the solving of corona when everything keeps coming up with this problem? The last batch that died like that only did so on Windows systems, but processed OK on LINUX. Looks like they fixed the problem with those Tasks so these new ones now fail on all systems. Grant Darwin NT |
adrianxw Send message Joined: 18 Sep 05 Posts: 653 Credit: 11,840,739 RAC: 335 |
All failing on both my Windows 8.1x64 systems after about 70 seconds. The exit code is "1 (0x00000001) Unknown error code" which is not much help. Wave upon wave of demented avengers march cheerfully out of obscurity into the dream. |
mmonnin Send message Joined: 2 Jun 16 Posts: 57 Credit: 23,165,110 RAC: 57,119 |
Error like this? process exited with code 1 (0x1, -255)</message> <stderr_txt> command: ../../projects/boinc.bakerlab.org_rosetta/rosetta_4.20_x86_64-pc-linux-gnu @preetham_gen_38675_0001_0001_0.flags -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1371728 Using database: database_357d5d93529_n_methyl/minirosetta_database ERROR: Error in simple_cycpcp_predict app read_sequence() function! The minimum number of residues for a cyclic peptide is 4. (GenKIC requires three residues, plus a fourth to serve as an anchor). ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 2264 BOINC:: Error reading and gzipping output datafile: default.out 16:47:07 (139426): called boinc_finish(1) |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1990 Credit: 9,487,902 RAC: 12,207 |
The last batch that died like that only did so on Windows systems, but processed OK on LINUX. :-P |
Bryn Send message Joined: 28 Nov 20 Posts: 3 Credit: 920,974 RAC: 0 |
All from this person are corrupt. Who do I contact to tell them? Application Rosetta 4.20 Name preetham_gen_10036_0001_0001_0_SAVE_ALL_OUT_2912745_775 State Computation error Received 17/03/2022 17:01:55 Report deadline 20/03/2022 17:01:54 Estimated computation size 80,000 GFLOPs CPU time 00:01:48 Elapsed time 00:02:06 Executable rosetta_4.20_windows_x86_64.exe |
Grant (SSSF) Send message Joined: 28 Mar 20 Posts: 1667 Credit: 17,444,761 RAC: 24,783 |
All from this person are corrupt. Who do I contact to tell them? Nobody. In case you haven't been paying attention, it was the project that released a batch of faulty Tasks. And they decided to do nothing about it and just let them error out. That batch is now gone, although there will be plenty of resends over the next week and a half or so that will continue to error out until they are all gone. Grant Darwin NT |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1990 Credit: 9,487,902 RAC: 12,207 |
Nobody. As usual, I wrote a tweet to the R@H account. But my questions are: are the R@H servers unattended? Is it possible, in the BOINC server, to activate triggers to warn admins in case of problems? |
Dr Who Fan Send message Joined: 28 May 06 Posts: 64 Credit: 260,690 RAC: 449 |
As usual, i wrote a tweet to R@H account. I think you will be waiting indefinitely for a reply. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
Nobody. Haven't you been paying attention to what I have said in other posts about this exact same thing? This project is NOT monitored. NO ONE watches Twitter (I hammered them in a tweet and they did nothing). NO ONE watches the boards here. They ignore all emails. Grant, I think, used to have an in, but not any more. If they get a 50% return, i.e. Linux and not Windows or the other way around, then they just have a smaller data set to work with, but still data. If stuff craps out because they don't do the code right, oh well, no need to fix it. Correct it later and resubmit it. So don't waste your time on Twitter or email. They will all be ignored. We figure it out on our own, and if not, you're SOL. As for triggers...who you kidding. That's too advanced for this group. They can barely write protein software correctly, you really think they know anything about the code behind Rosetta? Or how to set alerts based on errors received? Sorry man, you get what you get, good or bad, and that's how it is. We just roll with it and bitch about it. But there is nothing we can do. |
Sid Celery Send message Joined: 11 Feb 08 Posts: 2112 Credit: 41,044,764 RAC: 21,216 |
All from this person are corrupt. Who do I contact to tell them? Nobody. Just to say it publicly, I left for France last Monday and returned to London on Thursday night, where the PC I use there was merrily running RB tasks, so I didn't appreciate what Grant PM'd me about while I was away. I have only now (Sunday am) returned home, while there are no Rosetta 4.20 tasks to run, so I missed everything that was going wrong. Sorry about that. Not sure if it's a plus, but I'll be stuck in one place for the next few months, so hopefully I pick up on problems a lot sooner to pass on. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1990 Credit: 9,487,902 RAC: 12,207 |
As for triggers...who you kidding. That's too advanced for this group. Misunderstanding. The "trigger functionality" may be introduced by BOINC's developers, not by Rosetta admins. Alerts based on errors should be mandatory during server installation/configuration. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
As for triggers...who you kidding. That's too advanced for this group. OK, interesting. But the new version is not ready yet. Someone got it in a Linux uncompiled personal release, so in the meantime we still have this situation. But then again, it's just an alert; RAH could just as easily ignore those alerts, just as we ignore constant non-threatening alerts on our systems. Unless it blocks their screens from allowing any new input until acknowledged. But they could just click 'ok' or whatever and be done with the problem. I don't have much faith in their internal production despite all the old rah rah on the homepage and other locations. It's more like the outside sources produce more reliable tasks than internal. It looks more and more like the outside sources are where the action is now rather than the lab itself. The lab studies the results and reports back the data. They created the backbone that a lot of the programs use now. I find it interesting the way this is all structured. You have the molecular chemistry or whatever department at the university, then the institute, then it seems Baker Lab falls under that umbrella and splits out into robetta, rosetta and foldit. A lot of names that are just sub units of something. It's almost like a circus juggling act. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1990 Credit: 9,487,902 RAC: 12,207 |
You have the molecular chemistry or whatever department at the university, then the institute, then it seems Baker Lab falls under that umbrella and splits out into robetta, rosetta and foldit. A lot of names that are just sub units of something. It's almost like a circus juggling act. I understand you're tired of this situation (like me), but I think you're being a little bit impolite. This project is not "a circus", it's science, and every kind of help, from simple CPU time to Foldit volunteers, is given with a purpose. |
rjs5 Send message Joined: 22 Nov 10 Posts: 273 Credit: 22,950,823 RAC: 17,561 |
You have the molecular chemistry or whatever department at the university, then the institute, then it seems Baker Lab falls under that umbrella and splits out into robetta, rosetta and foldit. A lot of names that are just sub units of something. It's almost like a circus juggling act. Rosetta may not be a "circus", BUT the person integrating the "science program" with the "real world machines" is unqualified to do the job. There are simple warning messages and parameter testing limits that could be implemented to screen out most of the error situations before they reach volunteer machines. Simple things like a "Set the ALLOW computer detail switch to enable Python jobs" message. There are many of these informational messages that could be added, but the integrator is unqualified or simply lazy. My suggestion: require each researcher submitting WUs to the public to have an identifier embedded in the WU name. Make incompetence public and traceable, and give researchers CREDIT for their successes and failures. 8-) |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1990 Credit: 9,487,902 RAC: 12,207 |
My suggestion: require each researcher submitting WU to the public have an identifier embedded in the WU name. Make incompetence public, traceable and give researchers CREDIT for their successes and failures. Constructive criticism is always welcome!! In the past, they sometimes put the name of the researcher into the WU name. P.S. I'm still waiting for them to introduce your suggestions about CPU optimization (SSEx, AVX) :-P |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
You have the molecular chemistry or whatever department at the university, then the institute, then it seems Baker Lab falls under that umbrella and splits out into robetta, rosetta and foldit. A lot of names that are just sub units of something. It's almost like a circus juggling act. That is what RALPH is for. Alpha and Beta testing before release to Rosetta. But probably since hardly anyone signs up on RALPH they just try them on their end and if they work, toss them out. Again, if a 50% return is valid, I think that is a good enough data set for this person. You can track non-Python stuff back to Robetta if you have the patience to go through all the listings and see who the submitter is. But that is a lot of work. |
Greg_BE Send message Joined: 30 May 06 Posts: 5691 Credit: 5,859,226 RAC: 0 |
The science is good. The execution to the PC group is bad. The organization reads like a flow chart. And if the bottom of the flow chart, Rosetta and Baker Lab, cannot get their act together and make tasks that work on both Linux and Windows, and communicate or listen and respond to people telling them things, then it is a circus act in its function. The ignorance of the lab in not assigning someone to monitor the boards, to monitor bad tasks, to communicate that they know there are issues and are fixing them or offering a way to fix them, shows just more of the same. They are in their world of nice chains of proteins and we are out here struggling to understand why tasks do not work so as to give them nice chains of proteins to work with. It is as if they really don't care what goes on out here, as long as they get results. I have been with this project since its early stages. Back when Dr. B took the time to write interesting things. When DEK took care of technical things and watched here for issues popping up. Or a MOD (grad student or whatever) that also monitored things here and reported back to DEK and also wrote up information on what we were crunching. This all disappeared long ago, back when they started adding names to their group. We used to read about how the protein chain we had just finished analyzing was then done in crystal and how the results were close or exactly what the computers had come up with. Now it's just a lot of dead air and a figure-it-out-yourself mentality, and suggestions written to them are ignored. Comments via Twitter or other means are ignored. Emails to the project are ignored. If you're not a scientist or a whatever dealing with science, they don't want to hear from you. I keep going because this is my first project. But there are always other interesting, steadier, more technically stable projects that I also joined. There are many little things that they could do to make this project so much better. But no. That is not of interest. But the science is good. |
[VENETO] boboviz Send message Joined: 1 Dec 05 Posts: 1990 Credit: 9,487,902 RAC: 12,207 |
That is what RALPH is for. Alpha and Beta testing before release to Rosetta. Are you kidding? When, very rarely, they released work on Ralph, it was finished after a few hours. There are a lot of volunteers ready to test WUs... |
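
The "trigger" idea raised in the thread (warn the admins when a batch starts failing everywhere) can be sketched in a few lines. This is only an illustration under stated assumptions, not how the BOINC server or Rosetta@home actually works: the function names, the thresholds, and the rule that a batch is identified by the first two underscore-separated tokens of the task name (so `preetham_gen_26749_...` maps to `preetham_gen`) are all hypothetical, and how the results would be pulled from the project server is left out.

```python
from collections import defaultdict


def batch_key(task_name: str) -> str:
    """Hypothetical grouping rule: the first two '_'-separated tokens name the batch."""
    return "_".join(task_name.split("_")[:2])


def failing_batches(results, min_tasks=20, max_error_rate=0.5):
    """Return {batch: error_rate} for batches that look broken.

    results       -- iterable of (task_name, succeeded) pairs
    min_tasks     -- ignore batches with fewer reported tasks (assumed threshold)
    max_error_rate -- flag batches whose error rate exceeds this (assumed threshold)
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for name, succeeded in results:
        key = batch_key(name)
        totals[key] += 1
        if not succeeded:
            errors[key] += 1

    flagged = {}
    for key, total in totals.items():
        rate = errors[key] / total
        if total >= min_tasks and rate > max_error_rate:
            flagged[key] = rate
    return flagged


if __name__ == "__main__":
    # Made-up sample data: one batch failing on every host, one healthy batch.
    sample = [(f"preetham_gen_{i}_0001_0001_0", False) for i in range(30)]
    sample += [(f"other_batch_{i}", True) for i in range(30)]
    for batch, rate in failing_batches(sample).items():
        print(f"ALERT: batch {batch} has a {rate:.0%} error rate")
```

The minimum task count is there so that a handful of early failures does not raise an alert. As Greg_BE points out, an alert only helps if someone reads it, so a check like this would more usefully pause further sends of the flagged batch than just post a message.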
©2024 University of Washington
https://www.bakerlab.org