Many crashes.

Message boards : Number crunching : Many crashes.


Jean-David Beyer

Joined: 2 Nov 05
Posts: 188
Credit: 6,431,332
RAC: 4,520
Message 103470 - Posted: 22 Nov 2021, 17:21:08 UTC

I recently got 13 tasks and 12 of them failed; only one completed successfully. The machine runs other BOINC projects without problems. A typical failure looks like this:
Task 1451454977
Name 	rb_11_21_153050_149232_ab_t000__robetta_cstwt_5.0_FT_IGNORE_THE_REST_07_08_2728289_69_0
Workunit 	1295214447
Created 	22 Nov 2021, 7:04:32 UTC
Sent 	22 Nov 2021, 7:07:03 UTC
Report deadline 	25 Nov 2021, 7:07:03 UTC
Received 	22 Nov 2021, 16:39:33 UTC
Server state 	Over
Outcome 	Computation error
Client state 	Compute error
Exit status 	1 (0x00000001) Unknown error code
Computer ID 	5958977
Run time 	14 min 38 sec
CPU time 	14 min 19 sec
Validate state 	Invalid
Credit 	0.00
Device peak FLOPS 	3.86 GFLOPS
Application version 	Rosetta v4.20 windows_x86_64
Peak working set size 	373.02 MB
Peak swap size 	352.90 MB
Peak disk usage 	0.33 MB
Stderr output

<core_client_version>7.16.20</core_client_version>
<![CDATA[
<message>
Incorrect function.
 (0x1) - exit code 1 (0x1)</message>
<stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.20_windows_x86_64.exe @rb_11_21_153050_149232_ab_t000__robetta_FLAGS -in::file::fasta t000_.fasta -jumps:pairing_file t000_.fasta.bbcontacts.jumps -jumps:random_sheets 4 -constraints::cst_file t000_.fasta.CB.cst -constraints:cst_weight 5.0 -constraints::cst_fa_file t000_.fasta.MIN.cst -constraints:cst_fa_weight 5.0 -in:file:boinc_wu_zip rb_11_21_153050_149232_ab_t000__robetta.zip -frag3 rb_11_21_153050_149232_ab_t000__robetta.200.3mers.index.gz -fragA rb_11_21_153050_149232_ab_t000__robetta.200.8mers.index.gz -fragB rb_11_21_153050_149232_ab_t000__robetta.200.7mers.index.gz -nstruct 10000 -cpu_run_time 28800 -boinc:max_nstruct 20000 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -boinc::cpu_run_timeout 36000 -run::rng mt19937 -constant_seed -jran 1169817
Using database: database_357d5d93529_n_methylminirosetta_database

[ ERROR ]: Caught exception:

File: C:\cygwin64\home\boinc\4.17\Rosetta\main\source\src\core/pack/dunbrack/SingleResidueDunbrackLibrary.hh:306
chi angle must be between -180 and 180: -nan(ind)


Did I get a bad batch, or is something else going on?
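
For readers puzzling over the error text: `-nan(ind)` is how the Microsoft C runtime prints an indefinite NaN, and a NaN fails every ordered comparison, so any "must be between -180 and 180" check will reject it. The sketch below only illustrates that behaviour; `check_chi` is a hypothetical stand-in, not Rosetta's actual code, and it says nothing about where the NaN came from.

```python
def check_chi(chi: float) -> float:
    """Hypothetical range check, loosely modelled on the error message above."""
    # Every ordered comparison involving NaN is False, so the range test
    # below is False for NaN and the check raises.
    if not (-180.0 <= chi <= 180.0):
        raise ValueError(f"chi angle must be between -180 and 180: {chi}")
    return chi


def nan_is_rejected() -> bool:
    """Return True if a NaN chi angle trips the range check."""
    try:
        check_chi(float("nan"))
    except ValueError:
        return True
    return False


if __name__ == "__main__":
    print(check_chi(90.0))      # an ordinary angle passes through unchanged
    print(nan_is_rejected())    # NaN is rejected by the same check
```

So the message points to an upstream calculation that produced NaN rather than to an angle that was merely slightly out of range.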
.clair.

Joined: 2 Jan 07
Posts: 274
Credit: 26,399,595
RAC: 0
Message 103471 - Posted: 22 Nov 2021, 18:05:52 UTC

I had some rb_11 tasks get funky, mostly running over time; others took a walk on the wild side.
It'll pass.
Grant (SSSF)

Joined: 28 Mar 20
Posts: 1682
Credit: 17,854,150
RAC: 18,215
Message 103484 - Posted: 23 Nov 2021, 7:02:22 UTC - in response to Message 103470.  

Did I get a bad batch, or is something else going on?
Bad batch.
If you click on the link for the Work Unit, you can see that the other systems that tried to process those Tasks also errored out.
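
The rule of thumb above can be written down as a rough sketch. It assumes you have copied the outcome of each host's attempt from the Work Unit page by hand; the function name, the category wording, and the `"Success"` label are all made up for illustration, and nothing here talks to the project server.

```python
def classify_workunit(outcomes):
    """Guess whether failures point at the work unit or at one host.

    `outcomes` is a list of outcome strings, one per host that returned
    the task, e.g. ["Computation error", "Success"] (hypothetical labels).
    """
    successes = sum(1 for o in outcomes if o == "Success")
    failures = len(outcomes) - successes

    if failures == 0:
        return "no failures"
    if successes == 0 and failures >= 2:
        # Several independent hosts failed and nobody succeeded.
        return "likely bad work unit"
    if successes == 0:
        # Only one host has reported so far; not enough to tell.
        return "inconclusive"
    # At least one host finished it, so the input itself can be processed.
    return "possibly host- or platform-specific"


if __name__ == "__main__":
    print(classify_workunit(["Computation error", "Computation error"]))
    print(classify_workunit(["Computation error", "Success"]))
```

A mixed result, where one host fails and another succeeds, is the case worth a closer look, since it suggests the task is not bad for everyone.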
Grant
Darwin NT
Jean-David Beyer

Joined: 2 Nov 05
Posts: 188
Credit: 6,431,332
RAC: 4,520
Message 103492 - Posted: 23 Nov 2021, 14:04:53 UTC - in response to Message 103484.  

Did I get a bad batch, or is something else going on?

Bad batch.
If you click on the link for the Work Unit, you can see that the other systems that tried to process those Tasks also errored out.


I agree about a bad batch. I have since had 5 work units complete successfully and no more failures.

However, one of my failed tasks was completed successfully by another user.
Jean-David Beyer

Joined: 2 Nov 05
Posts: 188
Credit: 6,431,332
RAC: 4,520
Message 103497 - Posted: 24 Nov 2021, 13:35:26 UTC

I notice all my units run on my Linux machine end up valid.
And about half my units on my Windows machine are now coming up valid.
FWIW.



