Client Errors

Nuadormrac

Joined: 27 Sep 05
Posts: 37
Credit: 202,469
RAC: 0
Message 902 - Posted: 2 Oct 2005, 9:45:42 UTC
Last modified: 2 Oct 2005, 9:48:03 UTC

I too have seen some access violation error messages. No indication of benchmarks from what I've seen. I'm also seeing

10/2/2005 12:01:04 AM|rosetta@home|Result 1pvaA_abrelax_no_cst_17565_0 exited with zero status but no 'finished' file
10/2/2005 12:01:04 AM|rosetta@home|If this happens repeatedly you may need to reset the project.

from time to time, though I've seen this error a couple of times with results that have returned successfully and been given credit as well. Anyhow, these are the results I'm seeing Access Violations on:

https://boinc.bakerlab.org/rosetta/result.php?resultid=127887
https://boinc.bakerlab.org/rosetta/result.php?resultid=79885

Ironically, a moment ago my results page showed a result as done that came up as "server status unsent, client status initial", and clicking on it led back to this WU, the unsent copy of the same one. That seems to have cleared itself up on the results page as of the time of this post.

BTW, this box has only 1 CPU, so these errors are not limited to multi-CPU computers. They are essentially what used to be referred to (back in the Win 3.1 days) as a general protection fault.

JimB
Joined: 17 Sep 05
Posts: 19
Credit: 228,111
RAC: 0
Message 903 - Posted: 2 Oct 2005, 11:55:09 UTC
Last modified: 2 Oct 2005, 11:55:40 UTC

I tracked one of my "exited with zero status" wu's out of curiosity, and it does get credit; it is also documented in the Wiki - "most of the time the best thing to do is to do nothing" :

2005-09-20 06:40:59 [SETI@home] Starting result 08no03aa.2694.26145.129826.90_1 using setiathome version 4.18
2005-09-20 07:33:14 [SETI@home] Result 08no03aa.2694.26145.129826.90_1 exited with zero status but no 'finished' file
2005-09-20 07:33:14 [SETI@home] Restarting result 08no03aa.2694.26145.129826.90_1 using setiathome version 4.18
2005-09-20 08:16:48 [SETI@home] Computation for result 08no03aa.2694.26145.129826.90_1 finished
2005-09-20 08:16:49 [SETI@home] Started upload of 08no03aa.2694.26145.129826.90_1_0
2005-09-20 08:16:51 [SETI@home] Finished upload of 08no03aa.2694.26145.129826.90_1_0

115364537 1140844 19 Sep 2005 21:31:33 UTC 20 Sep 2005 12:16:57 UTC Over Success Done 5,356.25 9.62 29.98

Result ID 115364537



"Be all that you can be...considering." Harold Green

Pconfig
Joined: 26 Sep 05
Posts: 6
Credit: 56,254
RAC: 0
Message 905 - Posted: 2 Oct 2005, 12:27:34 UTC
Last modified: 2 Oct 2005, 12:31:41 UTC

Benchmarking errors out over here:

1-10-2005 23:00:47||Suspending computation and network activity - running CPU benchmarks
1-10-2005 23:00:47|rosetta@home|Pausing result 1btn__abrelax_no_cst_16723_0 (removed from memory)
1-10-2005 23:00:47|rosetta@home|Pausing result 1btn__abrelax_no_cst_16971_0 (removed from memory)
1-10-2005 23:00:48|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16971_0 ( - exit code -1073741819 (0xc0000005))
1-10-2005 23:00:48||request_reschedule_cpus: process exited
1-10-2005 23:00:49|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16723_0 ( - exit code -1073741819 (0xc0000005))
1-10-2005 23:00:49||request_reschedule_cpus: process exited
1-10-2005 23:01:47|rosetta@home|Computation for result 1btn__abrelax_no_cst_16723_0 finished
1-10-2005 23:01:47|rosetta@home|resume_or_start(): unexpected process state 2

(I'm going to set it to leave applications in memory while preempted.) I think it's strange that a WU isn't aborted when BOINC gets restarted, only when BOINC tries to benchmark...
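
For anyone who wants to pull these failures out of a saved message log, here is a minimal Python sketch. It assumes the pipe-delimited layout seen in the lines above (timestamp, project, message) and the exact "Unrecoverable error for result" wording shown there. The names `find_unrecoverable` and `ERROR_RE` are made up for this illustration and are not part of BOINC.

```python
import re

# Sample lines copied from the log above.
LOG = """\
1-10-2005 23:00:47||Suspending computation and network activity - running CPU benchmarks
1-10-2005 23:00:47|rosetta@home|Pausing result 1btn__abrelax_no_cst_16971_0 (removed from memory)
1-10-2005 23:00:48|rosetta@home|Unrecoverable error for result 1btn__abrelax_no_cst_16971_0 ( - exit code -1073741819 (0xc0000005))
"""

# Hypothetical pattern, written against the message wording shown in this thread.
ERROR_RE = re.compile(
    r"Unrecoverable error for result (\S+) \( - exit code (-?\d+) "
)


def find_unrecoverable(log_text):
    """Return (timestamp, project, result_name, exit_code) for each failure line."""
    failures = []
    for line in log_text.splitlines():
        parts = line.split("|", 2)  # timestamp | project | message
        if len(parts) != 3:
            continue  # not a log line in the expected layout
        timestamp, project, message = parts
        match = ERROR_RE.search(message)
        if match:
            failures.append(
                (timestamp, project, match.group(1), int(match.group(2)))
            )
    return failures


for failure in find_unrecoverable(LOG):
    print(failure)
```

Splitting with a maximum of two splits keeps any later "|" characters inside the message text intact.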
Proud member of the Dutch Power Cows

Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 907 - Posted: 2 Oct 2005, 14:57:53 UTC

The "Client Errors" appear to occur when the Rosetta@Home Science Application is suspended and removed from memory. I do not recall the exact mechanics of the operation, but if the API is not called in the proper order, that could be part of it. This is one of the reasons I have suggested that the official BOINC documentation is lacking. It is also why I am trying to add more information to the Wiki about development issues, such as where the messages are created.

At some point, I am going to have to start actually doing development work to "test" the documentation as it stands. But from my review as I added the pages, speaking as an old and tired systems engineer, it is woefully inadequate. It assumes much knowledge on the part of the developer and does not seem to trace the logical development path well ... The good news is that a couple of project developers have added some of what they learned when burned ...

@JimB

There are a couple of causes of that message, if it was not clear from the Wiki. Fundamentally, there are timing issues between the Science Application and the BOINC Daemon, and this message is the result. Later versions of the BOINC Client Software, when compiled into the Science Application, should "cure" this ... or reduce its frequency.

One of the more interesting "clues" is that if you adjust the system clock, or when it adjusts itself on synchronization with an Internet time update, that also produces the message ....

JimB
Joined: 17 Sep 05
Posts: 19
Credit: 228,111
RAC: 0
Message 910 - Posted: 2 Oct 2005, 17:25:32 UTC
Last modified: 2 Oct 2005, 17:31:48 UTC

These errors are interesting. I've done a *very* quick summary list of errors I recall seeing in this forum (feel free to add to or ignore):


1. wu's take longer than some expect
2. wu's fail when switching projects
3. wu's fail when removed from memory to run benchmarks
4. wu's "flail" as regards progress & time to completion
5. wu's hang at 1% and need a BOINC restart
6. wu's don't seem to keep proper time crunched
7. some wu's never finish
8. some trial and error configuring pref's - leave in memory, 120 min swap, etc.
9. different problems on different processors



I've only had #3 since I set "leave in memory" to yes; no recent problems with the "exited with zero status" error.

I'm using BOINC 4.45 (tried 4.72 but didn't work well for me) & rosetta 4.77 on HT processor, leave in memory=yes, not running other projects on this box to avoid switching.


I found the wiki entry to be very well done, particularly the line-by-line log analysis. Very nice! But what a chore to do.




"Be all that you can be...considering." Harold Green

Nuadormrac
Joined: 27 Sep 05
Posts: 37
Credit: 202,469
RAC: 0
Message 922 - Posted: 3 Oct 2005, 10:17:03 UTC
Last modified: 3 Oct 2005, 10:19:25 UTC

Oddly, in my case all my projects have been set to leave applications in memory pretty much since I signed up for my first BOINC project. Unless a benchmark was auto-invoked, I hadn't requested one, and Rosetta was the last project I signed up for and connected to. I haven't seen a message in the log about a benchmark.

I do wonder about these Access Violations though which I have seen on 2 units, and some others have claimed to be seeing. A slight blurb on this.

Basically, when one of our PC processors is running in 32-bit protected mode, there is a degree of protection in place that stops client software from accessing memory it isn't entitled to (hence "protected mode"). This restriction is enforced by the CPU itself, in hardware, whenever code is running in user mode (ring 3). The OS code (the system kernel) runs at ring 0, along with various device drivers, etc., and is entitled to access any memory in the system, while application software is launched in user mode (and yes, Windows XP launches things this way). The application is only entitled to access memory that belongs to it, or to make various API calls asking the operating system to perform some needed function for it. If it attempts to directly read from or write to memory that doesn't belong to that process, it generates an Access Violation. On a programming level, Rosetta would be attempting to read (in my case the error message specified a read) from memory it wasn't entitled to.

Either the memory was de-allocated and still referenced, or there's a bad pointer or something that crops up from time to time which is attempting to de-reference memory which the program has no valid access to. I don't envy the project devs who might have to hunt down the bit of code that might be causing this.
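
As a side note on the numbers in these reports: the exit code -1073741819 and 0xc0000005 are the same 32-bit value, once read as signed and once as unsigned. 0xC0000005 is the Windows status code for an access violation (STATUS_ACCESS_VIOLATION). A small Python sketch of the conversion follows; the function name `to_ntstatus` and the one-entry lookup table are made up for this illustration.

```python
# Hypothetical lookup table holding only the code discussed in this thread.
KNOWN_STATUS = {
    0xC0000005: "STATUS_ACCESS_VIOLATION",
}


def to_ntstatus(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return exit_code & 0xFFFFFFFF


status = to_ntstatus(-1073741819)
print(hex(status))                           # 0xc0000005
print(KNOWN_STATUS.get(status, "unknown"))   # STATUS_ACCESS_VIOLATION
```

Masking with 0xFFFFFFFF is enough here because Python integers are unbounded, so the mask simply keeps the low 32 bits of the two's-complement value.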

The Pirate
Joined: 22 Sep 05
Posts: 20
Credit: 7,090,933
RAC: 0
Message 933 - Posted: 4 Oct 2005, 2:10:31 UTC

Neither of my multi-CPU computers has had the error since the 27th, when I set Rosetta to stay in memory. One of them has run a benchmark since then and that didn't cause any errors; of course, that one has 8 gigs of memory.

Joe
Joined: 26 Sep 05
Posts: 3
Credit: 607,639
RAC: 1,345
Message 970 - Posted: 5 Oct 2005, 6:20:43 UTC

I just had 6 results on a P4 come back invalid, exiting with an access violation. I think I'll abort the rest of the units on that machine....

petrusbroder
Joined: 23 Sep 05
Posts: 9
Credit: 2,111,764
RAC: 0
Message 1009 - Posted: 6 Oct 2005, 4:45:16 UTC
Last modified: 6 Oct 2005, 5:12:14 UTC

Oh, BTW, I have not had any errors on PCs for the last 8 hours, and only 2 in the last 3 days. OTOH I have no dual-core CPUs, and my newest CPU is an Athlon 64 3200+ ...

However, my Macs (a minimac with a G4 at 1.40 GHz and a PowerMac with 2 x G5 at 2 GHz) report 37 WUs in a row with "client error" and a code, etc. I looked at the details and was surprised to see that on some WUs three computers reported a client error while a fourth computer returned a correct result. For example:

application Rosetta
created 3 Oct 2005 23:51:49 UTC
name 1cfyA_abrelax_01967
canonical result 179136
granted credit 24.72


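Result ID | Computer ID | Sent                     | Received                 | Server state | Outcome      | Client state | CPU time (sec) | Claimed credit | Granted credit
166737    | 9320        | 4 Oct 2005 7:39:28 UTC   | 5 Oct 2005 14:38:38 UTC  | Over         | Client error | Computing    | 3,590.89       | 3.59           | ---
178654    | 10758       | 5 Oct 2005 15:29:19 UTC  | 5 Oct 2005 15:29:28 UTC  | Over         | Client error | Computing    | 0.00           | 0.00           | ---
178811    | 8313        | 5 Oct 2005 16:58:16 UTC  | 5 Oct 2005 17:01:36 UTC  | Over         | Client error | Computing    | 0.00           | 0.00           | ---
179136    | 5895        | 5 Oct 2005 18:18:46 UTC  | 5 Oct 2005 23:46:35 UTC  | Over         | Success      | Done         | 10,384.97      | 24.72          | 24.72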


The report looks like this (for the mac):
Result ID 178811
Name 1cfyA_abrelax_01967_2
Workunit 139188
Created 5 Oct 2005 15:29:31 UTC
Sent 5 Oct 2005 16:58:16 UTC
Received 5 Oct 2005 17:01:36 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status 5 (0x5)
Computer ID 8313
Report deadline 2 Nov 2005 16:58:16 UTC
CPU time 0
stderr out

<core_client_version>4.43</core_client_version>
<message>process got signal 5
</message>
<stderr_txt>
dyld: rosetta_4.76_powerpc-apple-darwin Undefined symbols:
rosetta_4.76_powerpc-apple-darwin undefined reference to _floorl expected to be defined in /usr/lib/libmx.A.dylib
rosetta_4.76_powerpc-apple-darwin undefined reference to _log10l expected to be defined in /usr/lib/libmx.A.dylib
rosetta_4.76_powerpc-apple-darwin undefined reference to _statvfs expected to be defined in /usr/lib/libSystem.B.dylib

</stderr_txt>

Validate state Invalid
Claimed credit 0
Granted credit 0
application version 4.76


I have checked 8 of the failed WUs for the Mac and the <stderr_txt> sections all look the same.
For the dual CPU PC the report looks like this:
Result ID 166737
Name 1cfyA_abrelax_01967_0
Workunit 139188
Created 3 Oct 2005 23:51:54 UTC
Sent 4 Oct 2005 7:39:28 UTC
Received 5 Oct 2005 14:38:38 UTC
Server state Over
Outcome Client error
Client state Computing
Exit status -1073741819 (0xc0000005)
Computer ID 9320
Report deadline 1 Nov 2005 7:39:28 UTC
CPU time 3590.890625
stderr out

<core_client_version>4.45</core_client_version>
<message> - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>

***UNHANDLED EXCEPTION****
Reason: Access Violation (0xc0000005) at address 0x006652D8 read attempt to address 0x06DB4120

Exiting...

</stderr_txt>

Validate state Invalid
Claimed credit 3.59271960053677
Granted credit 0
application version 4.77






It seems that the successful CPUs were single-core and "old": PIIIs or P4s below 2 GHz, or older Athlons.
There were also some very fast single-core CPUs among those running the failed WUs ...

petrusbroder
Joined: 23 Sep 05
Posts: 9
Credit: 2,111,764
RAC: 0
Message 1037 - Posted: 6 Oct 2005, 20:33:55 UTC

I have to add some clarifications:

The PowerMac has had 2 WUs with error codes but has during the same period produced 58 WUs with correct results.

The minimac has processed 251 WUs with errors such as the one noted in my previous post.

It is interesting to see that the processor makes such a difference - because the minimac has enough RAM - 512 MBytes.

Makes one think ...

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1038 - Posted: 6 Oct 2005, 20:54:27 UTC - in response to Message 1037.  


The following error is due to our OSX rosetta application only supporting 10.4+ currently:


dyld: rosetta_4.76_powerpc-apple-darwin Undefined symbols:
rosetta_4.76_powerpc-apple-darwin undefined reference to _floorl expected to be defined in /usr/lib/libmx.A.dylib
rosetta_4.76_powerpc-apple-darwin undefined reference to _log10l expected to be defined in /usr/lib/libmx.A.dylib
rosetta_4.76_powerpc-apple-darwin undefined reference to _statvfs expected to be defined in /usr/lib/libSystem.B.dylib
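
To make the dyld output above easier to read, the missing symbols can be grouped by the library that was expected to provide them. The following is only a sketch in Python; the regular expression is written against the exact wording of the lines above, and the names `UNDEF_RE` and `missing_symbols` are made up for this illustration.

```python
import re

# The dyld output copied from the report above.
DYLD_OUTPUT = """\
dyld: rosetta_4.76_powerpc-apple-darwin Undefined symbols:
rosetta_4.76_powerpc-apple-darwin undefined reference to _floorl expected to be defined in /usr/lib/libmx.A.dylib
rosetta_4.76_powerpc-apple-darwin undefined reference to _log10l expected to be defined in /usr/lib/libmx.A.dylib
rosetta_4.76_powerpc-apple-darwin undefined reference to _statvfs expected to be defined in /usr/lib/libSystem.B.dylib
"""

# Hypothetical pattern matching the message wording shown in this thread.
UNDEF_RE = re.compile(
    r"undefined reference to (\S+) expected to be defined in (\S+)"
)


def missing_symbols(text):
    """Group missing symbol names by the library expected to define them."""
    by_library = {}
    for symbol, library in UNDEF_RE.findall(text):
        by_library.setdefault(library, []).append(symbol)
    return by_library


for library, symbols in missing_symbols(DYLD_OUTPUT).items():
    print(library, symbols)
```

Here two of the three symbols are expected in libmx and one in libSystem, which fits the explanation that the system libraries on the failing machine are older than the ones the application was built against.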


Paul D. Buck
Joined: 17 Sep 05
Posts: 815
Credit: 1,812,737
RAC: 0
Message 1041 - Posted: 6 Oct 2005, 21:59:01 UTC - in response to Message 1038.  


Is there some reason we cannot do the fix similar to what had to be done for CPDN?

Almost the same error set ...

See ... http://climateapps2.oucs.ox.ac.uk/cpdnboinc/forum_thread.php?id=2635

David E K
Volunteer moderator
Project administrator
Project developer
Project scientist
Joined: 1 Jul 05
Posts: 1018
Credit: 4,334,829
RAC: 0
Message 1042 - Posted: 6 Oct 2005, 22:45:10 UTC

I don't know if that will work but regardless I would rather make a build that is compatible with OSX versions prior to 10.4. The problem is that anything built with Xcode2's gcc 4 will not run on anything prior to 10.3.9 (and even to support 10.3.9, the Cross-Development SDK has to be used). There may be a performance trade-off. I haven't yet had the time to look into this.

Shaktai
Joined: 21 Sep 05
Posts: 56
Credit: 575,419
RAC: 0
Message 1045 - Posted: 7 Oct 2005, 2:21:42 UTC - in response to Message 1042.  

There may be a performance trade-off. I haven't yet had the time to look into this.


Typically we haven't seen a noticeable performance trade-off on other projects. The 10.3.x compiles have generally been very successful. However, there has been no success with compiling BOINC for 10.2.x or earlier. The Cross-Development SDK has usually worked well. If you create a cross-development compile, I think our team can scrounge up some testers.

Just curious, have you tried Xcode 2's gcc 4 auto-vectorization function? It has sometimes helped with G4 and G5 processors that have AltiVec capabilities.

Of course, I am not a coder, just a user. A 10.3.9 compatible version will draw a lot of new mac users though. Upgrades from earlier 10.3.x versions to 10.3.9 are free, so most users have or will upgrade.


Team MacNN - The best Macintosh team ever.

petrusbroder
Joined: 23 Sep 05
Posts: 9
Credit: 2,111,764
RAC: 0
Message 1050 - Posted: 7 Oct 2005, 5:09:39 UTC - in response to Message 1038.  

Oh, Sorry - never realised that - got to update ... /blushing/

UBT - Halifax--lad
Joined: 17 Sep 05
Posts: 157
Credit: 2,687
RAC: 0
Message 1072 - Posted: 7 Oct 2005, 17:49:10 UTC

Are all the problems with WUs getting errors gone yet? I would like to get back to the project, but I don't want to waste any time or WUs if they are going to abort.

devn
Joined: 17 Sep 05
Posts: 18
Credit: 2,063
RAC: 0
Message 1080 - Posted: 7 Oct 2005, 21:51:41 UTC

I'm having far more success with WUs since upgrading to cc4.72; also, I'm now running only 1 CPU with Rosetta even though I have HT. Benchmarks no longer cause a problem either.

devn
Joined: 17 Sep 05
Posts: 18
Credit: 2,063
RAC: 0
Message 1083 - Posted: 7 Oct 2005, 23:24:18 UTC

Update to previous post: benchmarks just ran and caused an "unrecoverable error." Why this time and not last, I have no idea; nothing else was going on.

©2024 University of Washington
https://www.bakerlab.org