Current issues with 7+ boinc client

Message boards : Number crunching : Current issues with 7+ boinc client

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 74024 - Posted: 15 Oct 2012, 19:54:23 UTC

First, sorry for my long hiatus. Mod.Sense recently brought this issue to our attention and I'd like to fix it as soon as possible.

I have installed the latest client on a new Mac and it successfully completed a task. I'll try the other platforms.

Does this issue still exist for the latest client version?

Any positive input that might help us track this down is greatly appreciated.

Thanks,

David Kim
ID: 74024 · Rating: 0
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 74026 - Posted: 16 Oct 2012, 1:14:07 UTC - in response to Message 74024.  

Not quite sure what you're looking for here, but I've been running R@h with BOINC 7 on a couple of machines without any great problems other than less frequent checkpointing: BOINC 7.0.31 on Mac OS X 10.6.8 and 7.0.28 on Windows 7.
ID: 74026 · Rating: 0
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 74030 - Posted: 16 Oct 2012, 13:12:12 UTC

Hi David, welcome back. Good to see you on here again.

I personally have had no problems with BOINC 7.x on Windows or Linux machines, but:
1. I only run Rosetta.
2. On the machines that do have CUDA-capable GPUs, the GPUs are either disabled with cc_config.xml or otherwise not performing any tasks in BOINC.

I believe it has been theorized in other threads here that the troubles with BOINC 7 are related to running other projects alongside Rosetta, especially when running GPU tasks from other projects. There may be other symptoms or problems present that are not related to the above.
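For anyone who wants to reproduce the "GPU disabled" setup from point 2, here is a minimal sketch of one way to do it. It assumes the `<no_gpus>` option in the `<options>` section of cc_config.xml is the mechanism used; the post does not say which option was actually set, and the helper name `build_cc_config` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def build_cc_config(disable_gpus: bool = True) -> str:
    """Return the text of a minimal cc_config.xml (hypothetical helper)."""
    root = ET.Element("cc_config")
    options = ET.SubElement(root, "options")
    # Assumption: <no_gpus>1</no_gpus> tells the client not to use any GPU.
    ET.SubElement(options, "no_gpus").text = "1" if disable_gpus else "0"
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Print the XML; save it as cc_config.xml in the BOINC data directory.
    print(build_cc_config())
```

The client has to re-read its config files (or be restarted) before a change like this takes effect.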
ID: 74030 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 74031 - Posted: 16 Oct 2012, 15:53:26 UTC

That's encouraging news. I'll definitely check this out, running R@h with a GPU project. Thanks for the info! There is a significant number of "hybridize" jobs that unfortunately do not yet have checkpointing at a finer resolution than a whole model. It will take some time to code in checkpointing for these jobs because there's a lot of information that has to be serialized, but we will be working on it.
ID: 74031 · Rating: 0
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 74035 - Posted: 16 Oct 2012, 22:27:37 UTC
Last modified: 16 Oct 2012, 22:30:12 UTC

For me, this machine could not get any WU validated (it finished them without errors, though), both with and without a GPU project running in parallel:

https://boinc.bakerlab.org/rosetta/show_host_detail.php?hostid=1569977

I tried all the possible combinations: downgraded BOINC (to both 32- and 64-bit versions), updated the BIOS, ran Rosetta@home EXCLUSIVELY (as in, no GPU project in parallel), etc. All I had left to do was reinstall the OS, but that's just ridiculous.
So I was forced to abandon Rosetta (on this machine) due to this issue.

Note: This machine is currently running WCG and GPUGRID at the same time with no problems.
ID: 74035 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 74036 - Posted: 17 Oct 2012, 4:06:06 UTC - in response to Message 74035.  

Which BOINC client version? Do you still see this issue with the latest version?

Sorry for your troubles. This is the exact issue I want to fix as soon as possible.
ID: 74036 · Rating: 0
Polian
Joined: 21 Sep 05
Posts: 152
Credit: 10,141,266
RAC: 0
Message 74037 - Posted: 17 Oct 2012, 4:55:54 UTC
Last modified: 17 Oct 2012, 4:56:19 UTC

From the stderr output it looks like he used 7.0.28 here: https://boinc.bakerlab.org/rosetta/result.php?resultid=536434124

and downgraded to try 6.12.34 here: https://boinc.bakerlab.org/rosetta/result.php?resultid=536537603

Could the runtime preference be too short for these units (3600 s and 7200 s)?
ID: 74037 · Rating: 0
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 74039 - Posted: 17 Oct 2012, 7:27:38 UTC - in response to Message 74037.  

I usually let it run for 2-3 hours. While trying to troubleshoot the source of the problem I reduced the runtime to 1 hour.
ID: 74039 · Rating: 0
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 74040 - Posted: 17 Oct 2012, 7:34:24 UTC - in response to Message 74035.  
Last modified: 17 Oct 2012, 7:47:39 UTC

I tried almost all versions, both 32- and 64-bit.
I even ran a single WU on this 8-threaded CPU, thinking that running 8 WUs at the same time might be the source of the problem. (Hint: it's not.)

My other machines are running BOINC 7.x, and some are even crunching on the GPU as well (Collatz and Moo!, for example) with no problems. The only differences between those and this machine are the CPU (Ivy Bridge), RAM (some PC3-12800), and the NVIDIA GPU (GTX 660M).

Edit: BTW, from reading all the errors people are getting, I think the source of this issue is more of a hardware "incompatibility" problem than a pure software problem. I, for instance, have multiple machines with no problem and one with the problem; the only differences are the OS (one has Win 7 Ultimate, the other Win 7 Home Premium) and the hardware. It's a really weird bug.
ID: 74040 · Rating: 0
Daedalus

Joined: 1 Aug 08
Posts: 39
Credit: 10,100,422
RAC: 1,202
Message 74046 - Posted: 18 Oct 2012, 22:17:38 UTC

Still the same problem: one WU, one error. The error is not visible in the client:

<core_client_version>7.0.27</core_client_version>
<![CDATA[
<stderr_txt>
[2012-10-18 19:30: 7:] :: BOINC:: Initializing ... ok.
[2012-10-18 19:30: 7:] :: BOINC :: boinc_init()
BOINC:: Setting up shared resources ... ok.
BOINC:: Setting up semaphores ... ok.
BOINC:: Updating status ... ok.
BOINC:: Registering timer callback... ok.
BOINC:: Worker initialized successfully.
Registering options..
Registered extra options.
Initializing broker options ...
Registered extra options.
Initializing core...
Initializing options.... ok
Options::initialize()
Options::adding_options()
Options::initialize() Check specs.
Options::initialize() End reached
Loaded options.... ok
Processed options.... ok
Initializing random generators... ok
Initialization complete.
Setting WU description ...
Unpacking zip data: ../../projects/boinc.bakerlab.org_rosetta/minirosetta_database_rev50262.zip
Unpacking WU data ...
Unpacking data: ../../projects/boinc.bakerlab.org_rosetta/2012_10_9_mini_y001_folding.zip
Setting database description ...
Setting up checkpointing ...
Setting up graphics native ...
Setting up folding (abrelax) ...
Beginning folding (abrelax) ...
BOINC:: Worker startup.
Starting watchdog...
Watchdog active.
Starting work on structure: _00001
Starting work on structure: _00002
Starting work on structure: _00003
Starting work on structure: _00004
Starting work on structure: _00005
Starting work on structure: _00006
Starting work on structure: _00007
Starting work on structure: _00008
Starting work on structure: _00009
Starting work on structure: _00010
Starting work on structure: _00011
Starting work on structure: _00012
Starting work on structure: _00013
Starting work on structure: _00014
Starting work on structure: _00015
Starting work on structure: _00016
Starting work on structure: _00017
Starting work on structure: _00018
======================================================
DONE :: 1 starting structures 10282.1 cpu seconds
This process generated 18 decoys from 18 attempts
======================================================
BOINC :: WS_max 1.51771e+82

BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down cleanly ...
called boinc_finish

</stderr_txt>
]]>
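When comparing many stderr dumps like the one above, it can help to pull out the few fields that matter. The following is only a rough sketch: the function name `summarize_stderr` and the choice of fields are my own assumptions, and the patterns are based solely on the lines visible in this one output.

```python
import re


def summarize_stderr(text: str) -> dict:
    """Extract a few fields from a minirosetta stderr dump (hypothetical helper)."""
    version = re.search(r"<core_client_version>([^<]+)</core_client_version>", text)
    decoys = re.search(r"generated (\d+) decoys from (\d+) attempts", text)
    cpu = re.search(r"([\d.]+) cpu seconds", text)
    started = re.findall(r"^Starting work on structure:", text, re.MULTILINE)
    return {
        "client_version": version.group(1) if version else None,
        "structures_started": len(started),
        "decoys": int(decoys.group(1)) if decoys else None,
        "attempts": int(decoys.group(2)) if decoys else None,
        "cpu_seconds": float(cpu.group(1)) if cpu else None,
        # True when the log shows the application reached boinc_finish.
        "finished_cleanly": "called boinc_finish" in text,
    }


# Shortened excerpt of the output above, used only as sample input.
SAMPLE = """<core_client_version>7.0.27</core_client_version>
Starting work on structure: _00001
Starting work on structure: _00002
DONE :: 1 starting structures 10282.1 cpu seconds
This process generated 18 decoys from 18 attempts
called boinc_finish
"""

if __name__ == "__main__":
    print(summarize_stderr(SAMPLE))
```

A task that reports decoys and reaches boinc_finish, as this one does, looks healthy on the client side, which matches the reports here that the failure only shows up at validation.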


ID: 74046 · Rating: 0
Mad_Max

Joined: 31 Dec 09
Posts: 209
Credit: 25,743,799
RAC: 14,341
Message 74048 - Posted: 18 Oct 2012, 23:40:45 UTC - in response to Message 74035.  
Last modified: 18 Oct 2012, 23:43:53 UTC

In our team (TSC! Russia) we now have five computers (from different owners/members) with the same symptoms. All of them have been switched to other projects (which run successfully and without errors), because they cannot run R@H at all: calculations finish without any errors (in the local BOINC client or in the logs), but after passing through the validator 100% of the WUs are marked as invalid.
One of these computers was attached to R@H for a short time to check whether the errors continue. They do; here are 2 bad WUs as an example (after which the computer was switched to other projects again):
https://boinc.bakerlab.org/rosetta/results.php?hostid=1555324
ID: 74048 · Rating: 0
David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 74066 - Posted: 21 Oct 2012, 7:23:15 UTC

Thanks for all the info. It definitely helps. Hopefully I'll have some time to look into this further next week.
ID: 74066 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 74147 - Posted: 31 Oct 2012, 22:51:42 UTC

Two Win7 machines, both on BOINC 7.0.36. One has problems consistently; the other works fine, consistently. See thread for details.
Rosetta Moderator: Mod.Sense
ID: 74147 · Rating: 0
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 74169 - Posted: 4 Nov 2012, 17:35:40 UTC

Like others, I'm seeing messages in the event log (Mac OS X 10.6.8/Boinc 7.0.31) reporting this error:

exited with zero status but no 'finished' file

Sample output


Sat Nov 3 23:17:29 2012 | rosetta@home | Scheduler request completed: got 0 new tasks
Sat Nov 3 23:19:04 2012 | rosetta@home | Finished download of input_hyb_al_02_bench_3slkB_yfsong.zip
Sat Nov 3 23:19:47 2012 | | Suspending network activity - user request
Sun Nov 4 02:53:03 2012 | rosetta@home | Computation for task Ploop4_2_abinitio_design_y465_009_60334_1680_0 finished
Sun Nov 4 02:53:20 2012 | rosetta@home | Starting task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_05_62798_11_0 using minirosetta version 341 in slot 1
Sun Nov 4 02:55:19 2012 | rosetta@home | Task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 exited with zero status but no 'finished' file
Sun Nov 4 02:55:19 2012 | rosetta@home | If this happens repeatedly you may need to reset the project.
Sun Nov 4 02:55:19 2012 | rosetta@home | Restarting task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 using minirosetta version 341 in slot 0
Sun Nov 4 08:36:36 2012 | rosetta@home | Computation for task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_05_62798_11_0 finished
Sun Nov 4 08:36:48 2012 | rosetta@home | Starting task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_10_03_62798_11_0 using minirosetta version 341 in slot 1
Sun Nov 4 08:40:53 2012 | rosetta@home | Task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 exited with zero status but no 'finished' file
Sun Nov 4 08:40:53 2012 | rosetta@home | If this happens repeatedly you may need to reset the project.
Sun Nov 4 08:40:53 2012 | rosetta@home | Restarting task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 using minirosetta version 341 in slot 0
Sun Nov 4 08:42:18 2012 | rosetta@home | work fetch suspended by user
Sun Nov 4 08:42:56 2012 | rosetta@home | task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 aborted by user
Sun Nov 4 08:42:57 2012 | rosetta@home | Starting task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_07_62798_7_0 using minirosetta version 341 in slot 2
Sun Nov 4 08:43:39 2012 | rosetta@home | Computation for task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 finished
Sun Nov 4 08:44:15 2012 | rosetta@home | Task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_07_62798_7_0 exited with zero status but no 'finished' file
Sun Nov 4 08:44:15 2012 | rosetta@home | If this happens repeatedly you may need to reset the project.
Sun Nov 4 08:44:15 2012 | rosetta@home | Restarting task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_07_62798_7_0 using minirosetta version 341 in slot 2
Sun Nov 4 08:44:17 2012 | rosetta@home | Task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_10_03_62798_11_0 exited with zero status but no 'finished' file
Sun Nov 4 08:44:17 2012 | rosetta@home | If this happens repeatedly you may need to reset the project.
Sun Nov 4 08:44:17 2012 | rosetta@home | Restarting task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_10_03_62798_11_0 using minirosetta version 341 in slot 1
Sun Nov 4 08:44:20 2012 | | Resuming network activity
Sun Nov 4 08:44:20 2012 | rosetta@home | Started upload of Ploop4_2_abinitio_design_y465_009_60334_1680_0_0
Sun Nov 4 08:44:20 2012 | rosetta@home | Started upload of rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_05_62798_11_0_0
Sun Nov 4 08:44:25 2012 | rosetta@home | Finished upload of Ploop4_2_abinitio_design_y465_009_60334_1680_0_0
Sun Nov 4 08:44:27 2012 | rosetta@home | Finished upload of rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_05_62798_11_0_0
Sun Nov 4 08:44:31 2012 | rosetta@home | Sending scheduler request: To report completed tasks.
Sun Nov 4 08:44:31 2012 | rosetta@home | Reporting 3 completed tasks
Sun Nov 4 08:44:31 2012 | rosetta@home | Not requesting tasks: scheduler RPC backoff
Sun Nov 4 08:44:35 2012 | rosetta@home | Scheduler request completed
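To see at a glance which tasks keep hitting this message, the event log can be tallied per task. This is a small sketch, assuming the log has been saved as plain text in the format shown above; `count_unfinished_exits` is a name I made up for illustration.

```python
import re
from collections import Counter

# Matches event-log lines of the form
# "... | rosetta@home | Task <name> exited with zero status but no 'finished' file"
PATTERN = re.compile(r"\| Task (\S+) exited with zero status but no 'finished' file")


def count_unfinished_exits(log_text: str) -> Counter:
    """Count 'no finished file' restarts per task name (hypothetical helper)."""
    return Counter(PATTERN.findall(log_text))


# Three lines copied from the log above, used only as sample input.
SAMPLE_LOG = """\
Sun Nov 4 02:55:19 2012 | rosetta@home | Task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 exited with zero status but no 'finished' file
Sun Nov 4 08:40:53 2012 | rosetta@home | Task hyb_al_08_bench_3slkB_SAVE_ALL_OUT_IGNORE_THE_REST_60945_2133_0 exited with zero status but no 'finished' file
Sun Nov 4 08:44:15 2012 | rosetta@home | Task rb_11_03_30323_64727_h001__sp1_IGNORE_THE_REST_08_07_62798_7_0 exited with zero status but no 'finished' file
"""

if __name__ == "__main__":
    for task, restarts in count_unfinished_exits(SAMPLE_LOG).most_common():
        print(restarts, task)
```

A task that shows up repeatedly, like the hyb_al_08 one in this excerpt, is a good candidate to report by name.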
ID: 74169 · Rating: 0
Mod.Sense
Volunteer moderator

Joined: 22 Aug 06
Posts: 4018
Credit: 0
RAC: 0
Message 74172 - Posted: 4 Nov 2012, 22:42:41 UTC

I get the same with Windows and BOINC 6, so I don't think it is part of what this thread was created for. Please open a new thread if you like, to keep the two issues separate. I get that error when I shut down my laptop with sleep or hibernate.
Rosetta Moderator: Mod.Sense
ID: 74172 · Rating: 0
svincent

Joined: 30 Dec 05
Posts: 219
Credit: 12,120,035
RAC: 0
Message 74181 - Posted: 5 Nov 2012, 18:28:58 UTC - in response to Message 74172.  

Well, the problems thread, where I agree this post really belongs, is getting a bit unwieldy with 400+ entries. I will start a new thread on this, though.
ID: 74181 · Rating: 0
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 74188 - Posted: 6 Nov 2012, 17:45:21 UTC

I wonder if it's safe to attach Rosetta to this machine again. I'm not sure whether the Rosetta team is even working on this very weird bug.
ID: 74188 · Rating: 0
WR-HW95

Joined: 5 Jan 06
Posts: 2
Credit: 8,086,818
RAC: 0
Message 74192 - Posted: 6 Nov 2012, 22:08:36 UTC
Last modified: 6 Nov 2012, 22:10:00 UTC

My other machine (Win XP, Phenom 965, GTX 275 SLI) works fine running S@H and R@h at the same time under BOINC 6, but this machine (Win 7 Ultimate, Phenom 1090, GTX 470 + GTX 660 Ti) fails validation on every R@h task under BOINC 7.0.28.
ID: 74192 · Rating: 0
Chilean
Joined: 16 Oct 05
Posts: 711
Credit: 26,694,507
RAC: 0
Message 74233 - Posted: 10 Nov 2012, 4:33:49 UTC

Another host producing nothing but client errors (reported by the validator; the WUs finish with no problems):

https://boinc.bakerlab.org/rosetta/results.php?hostid=1577411
ID: 74233 · Rating: 0
Mad_Max

Joined: 31 Dec 09
Posts: 209
Credit: 25,743,799
RAC: 14,341
Message 74244 - Posted: 10 Nov 2012, 20:24:03 UTC

To David E K:

There is another thing I want to draw your attention to. One thing common to all computers with this bug (100% error rate at the validation stage) is that the minirosetta version is missing from the logs.
For example: https://boinc.bakerlab.org/rosetta/result.php?resultid=543001353
Validate state: Invalid
Claimed credit: 34.1018733270665
Granted credit: 34.1018733270665
Application version: ---


Could this missing version information be the reason the validator marks all such WUs as invalid, even though they were actually calculated correctly?
ID: 74244 · Rating: 0



©2024 University of Washington
https://www.bakerlab.org