Message boards : Number crunching : Rosetta 4.0+
Author | Message |
---|---|
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 5,361 |
"Version 4.06: Getting 'computational error' after 1 second of trying on a Mac Pro, Boinc 7.6.33." It looks to me like Rosetta version 4.06 is compiled with AVX2 enabled. My rosetta_4.06_x86_64-pc-linux-gnu binary had a number of AVX2 instructions, but I doubt it makes much performance difference. Your Harpertown computer does not support any AVX instructions. All the 3.78 binaries passed (no AVX2). All the 4.06 binaries failed (AVX2). IMO, it looks like someone at Rosetta turned on the AVX compile switch on 4.06 without fixing the server job dispatcher to send 4.06 jobs ONLY to CPUs that support them. Negative impact ... burns network traffic, power, disk space, slows job completion for Rosetta job submitters, ... UGH! Asleep at the switch. I don't see any PREFERENCE to tell Rosetta to stop sending the version 4.06 AVX jobs, so it appears that you and everyone else in that situation are stuck. Too bad they could not have figured this out using RALPH .... 8-) |
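The check described in this post comes down to testing a host's CPU feature flags before running an AVX build. Here is a minimal sketch, assuming an x86 Linux host whose /proc/cpuinfo contains a "flags" line. The function names and the sample text are hypothetical, and this is not how the project's scheduler makes the decision.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports(cpuinfo_text, feature):
    """True if the named feature (e.g. 'avx2') is listed in the flags."""
    return feature.lower() in cpu_flags(cpuinfo_text)


# Illustrative sample only; it does not describe any host from this thread.
SAMPLE = (
    "processor\t: 0\n"
    "model name\t: Example CPU\n"
    "flags\t\t: fpu sse sse2 ssse3 sse4_1\n"
)

if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            text = f.read()
    except OSError:
        text = SAMPLE  # fall back to the sample on non-Linux systems
    for feature in ("avx", "avx2"):
        print(feature, supports(text, feature))
```

On a CPU without AVX, both lines print False, which is the case where an AVX-enabled binary would be expected to abort with an illegal-instruction error.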
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1995 Credit: 9,635,489 RAC: 6,843 |
"It looks to me like Rosetta 4.06 version is compiled with AVX2 enabled. My rosetta_4.06_x86_64-pc-linux-gnu binary had a number of AVX2 instructions," Uh, in the Windows version I see only 64 bits active. I will look deeper, if I'm able to. "but I doubt it makes much performance difference." Maybe this is only the beginning. Maybe they will put bigger simulations in AVX WUs. "IMO, it looks like someone at Rosetta turned on the AVX compile switch on 4.06 without fixing the server job dispatcher to send 4.06 jobs ONLY to CPU that did support them." With the new servers this is easier to do. "Too bad they could not have figured this out using RALPH .... 8-)" I prefer that WUs crash in Ralph rather than in Rosetta. Sometimes I think that Rosetta's admins are afraid to use Ralph.... |
rjs5 Joined: 22 Nov 10 Posts: 273 Credit: 23,054,272 RAC: 5,361 |
"It looks to me like Rosetta 4.06 version is compiled with AVX2 enabled. My rosetta_4.06_x86_64-pc-linux-gnu binary had a number of AVX2 instructions," The Windows exception event should show that an "ILLEGAL INSTRUCTION" is the cause of the abort. If you can disassemble, look for instructions using the YMM registers. That is the easiest way on Linux. I just use "objdump -d binary > binary.od" and then look at the registers used. If you see ymm registers, the binary was compiled with AVX (or AVX2) enabled.
grep ymm binary.od
5a23591: c4 e3 7d 18 44 0f 10 vinsertf128 $0x1,0x10(%rdi,%rcx,1),%ymm0,%ymm0
5a235a0: c4 c3 7d 19 44 0a 20 vextractf128 $0x1,%ymm0,0x20(%r10,%rcx,1)
5a23faa: c4 e2 7d 19 45 c8 vbroadcastsd -0x38(%rbp),%ymm0
5a23fb4: c4 c1 7d 7f 02 vmovdqa %ymm0,(%r10)
5a24f4c: c4 e3 7d 18 40 10 01 vinsertf128 $0x1,0x10(%rax),%ymm0,%ymm0
5a24f66: c4 e3 7d 19 84 24 90 vextractf128 $0x1,%ymm0,0x190(%rsp)
5a24f76: c4 e3 7d 18 40 30 01 vinsertf128 $0x1,0x30(%rax),%ymm0,%ymm0
5a24f86: c4 e3 7d 19 84 24 b0 vextractf128 $0x1,%ymm0,0x1b0(%rsp) |
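The objdump-and-grep step above can also be scripted to count which instructions touch the YMM registers. This is a rough sketch, assuming AT&T-syntax output from "objdump -d" saved to a text file. The function names are hypothetical, and the sample lines are copied from the post. YMM usage only indicates AVX-class code; telling AVX from AVX2 would need a per-mnemonic check that this sketch does not attempt.

```python
import re
from collections import Counter

YMM = re.compile(r"%ymm\d+")
HEX_BYTE = re.compile(r"[0-9a-f]{2}")


def mnemonic(line):
    """Return the first token after the address that is not a raw hex byte."""
    _, sep, rest = line.partition(":")
    if not sep:
        return None
    for token in rest.split():
        if not HEX_BYTE.fullmatch(token):
            return token
    return None


def ymm_mnemonics(disassembly):
    """Count mnemonics of instructions that reference a %ymm register."""
    counts = Counter()
    for line in disassembly.splitlines():
        if YMM.search(line):
            name = mnemonic(line)
            if name:
                counts[name] += 1
    return counts


# Two lines from the post plus one made-up non-AVX line for contrast.
SAMPLE = """\
5a23591: c4 e3 7d 18 44 0f 10 vinsertf128 $0x1,0x10(%rdi,%rcx,1),%ymm0,%ymm0
5a23faa: c4 e2 7d 19 45 c8 vbroadcastsd -0x38(%rbp),%ymm0
401000: 55 push %rbp
"""

if __name__ == "__main__":
    for name, count in ymm_mnemonics(SAMPLE).most_common():
        print(name, count)
```

An empty result for a whole binary would suggest it contains no 256-bit AVX code, although 128-bit VEX-encoded instructions using only XMM registers would not be caught by this filter.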
Saenger Joined: 19 Sep 05 Posts: 271 Credit: 824,883 RAC: 0 |
The last three of my 4.06 WUs (967991507, 967991500 and 967991525) all errored out after a few seconds with the following message or similar:
ERROR: Error in simple_cycpep_predict app: The N-methylation position indices must be within the pose!
ERROR:: Exit from: src/protocols/cyclic_peptide_predict/SimpleCycpepPredictApplication.cc line: 1399
BACKTRACE: [0xe60f258] [0x8914d8a] [0x891762b] [0x805620d] [0xeabf881] [0xeabfa7d] [0x82f2057]
BOINC:: Error reading and gzipping output datafile: default.out
08:35:42 (30370): called boinc_finish(1)
Four others (967991472, 967991483, 967991493 and 967991490) are currently running without problems. I fail to see any pattern in the list on my host with regard to names or such. I have to wait approximately another 6-10h for the first of the currently running ones to finish, to see whether it will error out later. Greetings from Sänger |
James W Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0 |
Re: My host, Windows XP with a Pentium 4 CPU. I have had an issue with v4.06 windows_intelx86 since I began getting these workunits. They will not actually process, and I get these messages and errors soon after starting the WUs: 01/20/2018 1:42:00 PM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1) I've reset the project with no change in processing. FYI - Thanks. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 8,235 |
No errors in the last week on mini-rosetta 3.78 but all these on 4.06:
PF04295.12_aivan_SAVE_ALL_OUT_03_09_541715_1877_0
PF09868.8_aivan_SAVE_ALL_OUT_03_09_541716_745_0
PF12787.6_aivan_SAVE_ALL_OUT_03_09_541721_799_0
PF06763.10_aivan_SAVE_ALL_OUT_03_09_541721_1445_0
PF10076.8_aivan_SAVE_ALL_OUT_03_09_541721_1445_0
PF11732.7_aivan_SAVE_ALL_OUT_03_09_541716_1461_0
PF07762.13_aivan_SAVE_ALL_OUT_03_09_541716_1248_1
All the above show this same error, which I first mentioned in October:
std::cerr: Exception was thrown:
CycA_AGPF_6res_hydrophobic_designs_2_CycA_AGPF_c.17.8_0001_SAVE_ALL_OUT_542098_301_0
CycA_AGPF_6res_hydrophobic_designs_2_CycA_AGPF_c.31.6_0001_SAVE_ALL_OUT_542108_763_0
Both the above show this error:
ERROR: Error in simple_cycpep_predict app: The N-methylation position indices must be within the pose!
CycA_HP_6res_hydrophobic_automated_c.2.5_0001_SAVE_ALL_OUT_542206_138_0
ERROR: in::file::boinc_wu_zip CycA_HP_6res_hydrophobic_designs_2_c.2.5_0001.zip does not exist! |
James W Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0 |
Another v4.06 windows_intelx86 error for my host, Windows XP with a Pentium 4 CPU, which occurred shortly after starting the WU.
Re: Workunit 873834971
CycA_AGPF_7res_hydrophobic_designs_CycA_AGPF_7res_c.6.10_0001_SAVE_ALL_OUT_542366_836_0
Client state: Compute error
Exit status: -185 (0xFFFFFF47) ERR_RESULT_START
Computer ID: 1580783
Stderr output: <core_client_version>7.8.3</core_client_version>
<message>couldn't start app: CreateProcess() failed - (unknown error)</message> |
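The exit status in this report appears in two forms, -185 and 0xFFFFFF47. They are the same 32-bit value read as signed and as unsigned. A small sketch of the conversion follows; the helper names are hypothetical.

```python
def as_unsigned32(status):
    """Reinterpret a signed 32-bit exit status as an unsigned value."""
    return status & 0xFFFFFFFF


def as_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed exit status."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value & 0x80000000 else value


if __name__ == "__main__":
    print(f"0x{as_unsigned32(-185):08X}")  # 0xFFFFFF47
    print(as_signed32(0xFFFFFF47))         # -185
```

The same helpers apply to small positive codes such as 193 (0xc1), which stay unchanged in both directions.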
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1995 Credit: 9,635,489 RAC: 6,843 |
Seems that 4.06 will need some debugging. |
[AF>Le_Pommier] Jerome_C2005 Joined: 22 Aug 06 Posts: 42 Credit: 1,258,039 RAC: 0 |
PLUS it would be nice if the admin who created this topic would come and read it sometimes! |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1995 Credit: 9,635,489 RAC: 6,843 |
"PLUS it would be nice that the admin who created this topic would come and read it, sometimes !" Lack of communication in Rosetta is a long (and sad) story. |
Sid Celery Joined: 11 Feb 08 Posts: 2125 Credit: 41,249,734 RAC: 8,235 |
Another week goes by, another 7 PF* tasks coming up with the same "nan" error after running to apparent completion:
PF14335.5_aivan_SAVE_ALL_OUT_03_09_541721_2511_0
PF11824.7_aivan_SAVE_ALL_OUT_03_09_541716_1913_0
PF11981.7_aivan_SAVE_ALL_OUT_03_09_541721_3743_0
PF10092.8_aivan_SAVE_ALL_OUT_03_09_541721_2663_0
PF03169.14_aivan_SAVE_ALL_OUT_03_09_541715_2865_0
PF10972.7_aivan_SAVE_ALL_OUT_03_09_541721_4953_0
PF10070.8_aivan_SAVE_ALL_OUT_03_09_541721_4953_0
<core_client_version>7.8.3</core_client_version> |
James W Joined: 25 Nov 12 Posts: 130 Credit: 1,766,254 RAC: 0 |
Re: My host, Windows XP with a Pentium 4 CPU. I have had an issue with v4.06 windows_intelx86 since I began getting these workunits. It will not actually process the WUs, and I get these messages and errors soon after starting them: 01/31/2018 2:17:15 AM | Rosetta@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1) Is the above a problem unique to XP, or does it occur across various OSes? |
Juha Joined: 28 Mar 16 Posts: 13 Credit: 705,034 RAC: 0 |
"Windows XP. Issue with v4.06" Looks like 4.06 was compiled with Visual Studio 2015, which by default doesn't create XP-compatible program files. I don't know if the project decided to drop XP support or if it happened accidentally. |
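One way to test this explanation is to read the minimum subsystem version stored in the executable's PE header. As far as I know, 32-bit Windows XP only loads images that ask for version 5.01 or lower, and a default Visual Studio 2015 build asks for 6.00. That would fit the error code 193 (ERROR_BAD_EXE_FORMAT) reported earlier in the thread. The sketch below assumes the standard PE layout. The function names are hypothetical, and fake_image() only builds header bytes for demonstration; it is not a real executable.

```python
import struct


def subsystem_version(image):
    """Return (major, minor) subsystem version from the bytes of a PE file."""
    if image[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    (pe_offset,) = struct.unpack_from("<I", image, 0x3C)
    if image[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    optional = pe_offset + 4 + 20  # skip the signature and the COFF header
    (magic,) = struct.unpack_from("<H", image, optional)
    if magic not in (0x10B, 0x20B):  # PE32 or PE32+
        raise ValueError("unknown optional header magic")
    # Major/MinorSubsystemVersion sit at offsets 48 and 50 in both formats.
    return struct.unpack_from("<HH", image, optional + 48)


def loads_on_xp_x86(image):
    """Assumption: 32-bit XP accepts subsystem versions up to 5.01."""
    return subsystem_version(image) <= (5, 1)


def fake_image(major, minor):
    """Build header-only bytes with the given subsystem version (demo only)."""
    dos = bytearray(0x40)
    dos[:2] = b"MZ"
    struct.pack_into("<I", dos, 0x3C, 0x40)
    optional = bytearray(96)
    struct.pack_into("<H", optional, 0, 0x10B)
    struct.pack_into("<HH", optional, 48, major, minor)
    return bytes(dos) + b"PE\0\0" + bytes(20) + bytes(optional)


if __name__ == "__main__":
    for version in ((5, 1), (6, 0)):
        image = fake_image(*version)
        print(subsystem_version(image), loads_on_xp_x86(image))
```

To inspect a real file, read it in binary mode and pass the bytes to subsystem_version(); only the first few hundred bytes are needed.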
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1995 Credit: 9,635,489 RAC: 6,843 |
"I don't know if the project decided to drop XP support or if it happened accidentally." From the Apps page: "Microsoft Windows (98 or later) running on an Intel x86-compatible CPU" 4.06. So it seems that they haven't dropped XP (even though I think it's a good idea to abandon XP). |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1995 Credit: 9,635,489 RAC: 6,843 |
972378950
ERROR: Assertion `! lines.empty()` failed.
ERROR:: Exit from: ..\..\..\src\core\io\pdb\pdb_reader.cc line: 78
BOINC:: Error reading and gzipping output datafile: default.out
22:06:16 (14768): called boinc_finish(1) |
pututu Joined: 12 Jun 16 Posts: 5 Credit: 10,028,325 RAC: 0 |
Got a few of these errors running Rosetta v4.06 over the past few days.
Task ID / WU name:
973051977 PF09826.8_bnd_aivan_SAVE_ALL_OUT_03_09_543807_2050_0
973046720 PF06980.10_bnd_aivan_SAVE_ALL_OUT_03_09_543807_2040_0
973000035 PF06980.10_bnd_aivan_SAVE_ALL_OUT_03_09_543807_1949_0
972990240 PF10070.8_bnd_aivan_SAVE_ALL_OUT_03_09_543807_1934_0
972988377 PF13584.5_bnd_aivan_SAVE_ALL_OUT_03_09_543807_1931_0
972971744 PF10070.8_bnd_aivan_SAVE_ALL_OUT_03_09_543807_1906_0
Sample error message: <core_client_version>7.8.3</core_client_version> <![CDATA[ <stderr_txt>
command: projects/boinc.bakerlab.org_rosetta/rosetta_4.06_windows_intelx86.exe @PF09826.8.bnd.flags -in:file:boinc_wu_zip PF09826.8.bnd.zip -nstruct 10000 -cpu_run_time 28800 -watchdog -boinc:max_nstruct 600 -checkpoint_interval 120 -mute all -database minirosetta_database -in::file::zip minirosetta_database.zip -boinc::watchdog -run::rng mt19937 -constant_seed -jran 3839119
Starting watchdog...
Watchdog active.
BOINC:: CPU time: 21742.4s, 14400s + 7200s[2018- 2-10 12:37:57:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
----- 0 -----
Stream information inconsistent.
Writing W_0000001
====================================================== DONE :: 1 starting structures 21742.4 cpu seconds
This process generated 1 decoys from 1 attempts ======================================================
12:37:57 (5308): called boinc_finish(0) </stderr_txt> |
Trotador Joined: 30 May 09 Posts: 108 Credit: 291,214,977 RAC: 0 |
This is the same error I already reported 2 months ago, and I can see it still continues. It is the reason I'm not crunching Rosetta except on some Android devices that are not affected by this error. "Got a few of these errors running Rosetta v4.06 over the past few days." |
mikey Joined: 5 Jan 06 Posts: 1895 Credit: 9,179,826 RAC: 3,209 |
"The same error I already reported 2 months ago and I can see it still continues." I'm having that problem, plus the problem of something 'preempting' my workunits, so I have taken one of my machines off of here!! It's a BOINC-ONLY machine and nothing is preempting anything anywhere!! I have other machines that aren't over the 50% error mark; most are under 10%. |
Aladar42 Joined: 14 Nov 17 Posts: 2 Credit: 67,864 RAC: 0 |
Getting a good amount of errors myself:
https://boinc.bakerlab.org/workunit.php?wuid=878899708
https://boinc.bakerlab.org/workunit.php?wuid=878668362
https://boinc.bakerlab.org/workunit.php?wuid=878668207
https://boinc.bakerlab.org/workunit.php?wuid=878668371
https://boinc.bakerlab.org/workunit.php?wuid=878668369 |
LarryMajor Joined: 1 Apr 16 Posts: 22 Credit: 31,533,212 RAC: 0 |
I started getting errors about a week ago. The common points are that the jobs are all PF*_bnd_aivan_SAVE_ALL_OUT*, and that I only get errors on the machine with AMD Opterons. Some WUs with this name run successfully, and the ones that fail all exceed the target CPU time by four hours before failing. The error, in part, is “WARNING! cannot get file size for default.out.gz: could not open file” and “Output exists: default.out.gz Size: -1.” The Exit Status is 11. About half the jobs fail when re-sent to other machines, but when I looked at one that finished successfully on another machine, I see the same errors in both outputs:
Failed: https://boinc.bakerlab.org/result.php?resultid=974837214
Completed: https://boinc.bakerlab.org/result.php?resultid=975103716
After seeing the same errors, but an Exit Status 0 on the re-send, I’m really confused about where the problem lies, and will appreciate any help you guys can give me. |
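Comparing a failed and a completed result by hand, as in this post, can be partly automated by pulling the same two signals out of each stderr text: whether the default.out.gz warning is present, and what value was passed to boinc_finish. This is a rough sketch. The function name is hypothetical, the sample lines are copied from the stderr quoted earlier in the thread, and the boinc_finish value is not necessarily the same thing as the "Exit Status" shown on the result page.

```python
import re

FINISH = re.compile(r"called boinc_finish\((-?\d+)\)")
GZ_WARNING = "cannot get file size for default.out.gz"


def summarize(stderr_text):
    """Report the gz warning and the last boinc_finish code found in stderr."""
    codes = [int(code) for code in FINISH.findall(stderr_text)]
    return {
        "gz_warning": GZ_WARNING in stderr_text,
        "finish_code": codes[-1] if codes else None,
    }


# Lines copied from the stderr sample posted earlier in this thread.
SAMPLE = """\
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
12:37:57 (5308): called boinc_finish(0)
"""

if __name__ == "__main__":
    print(summarize(SAMPLE))
```

Running summarize() on both outputs and comparing the two dictionaries would show whether the warning alone separates failures from successes. The post suggests it does not.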
©2024 University of Washington
https://www.bakerlab.org