Message boards : Number crunching : Problems and Technical Issues with Rosetta@home
Author | Message |
---|---|
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982 |
Over the last 90 min, the Validator backlog has dropped by over 100k. Looks like it's dropping by around 35k per hour (when the Validators were down completely, the rate of increase was roughly 12k per hour). Err... backlog to validate - nil |
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982 |
Over the last 90 min, the Validator backlog has dropped by over 100k. Looks like it's dropping by around 35k per hour (when the Validators were down completely, the rate of increase was roughly 12k per hour). Not quite sure what's happening atm, but the validation backlog is up at 10k, but I don't think it's stopped working - just not quite keeping up for some reason. The weirdness continues |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647 |
Not quite sure what's happening atm, but the validation backlog is up at 10k, but I don't think it's stopped working - just not quite keeping up for some reason. 26k now. The server has had issues for months now. I'm wondering if this is a symptom of those issues as they progressively worsen? Someone there really needs to take a close look at the system logs to see just what is going on - WTF does the server keep crashing? And why is it now having so much trouble Validating work? I'm thinking it's time for it to be replaced - a decade and a half is a very long time in computer hardware development. Grant Darwin NT |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
Someone there really needs to take a close look at the system logs to see just what is going on - WTF does the server keep crashing? And why is it now having so much trouble Validating work? I'm thinking it's time for it to be replaced - a decade and a half is a very long time in computer hardware development. As I said a long time ago, we don't know if the server page has been updated. If not, the hw and (above all) the os/sw are very old. Ubuntu 16.... |
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982 |
Someone there really needs to take a close look at the system logs to see just what is going on - WTF does the server keep crashing? And why is it now having so much trouble Validating work? I'm thinking it's time for it to be replaced - a decade and a half is a very long time in computer hardware development. Just being 'old' isn't the worst thing in the world. Being old and having failure issues every few weeks is a sign that if you don't fix this stuff, it's going to fail altogether. Which will inevitably result in someone asking whether they can afford the time and trouble to update it all or whether they should go in another direction entirely. I'm not sure how convinced I am they'll update the hw & sw to continue here tbh. In the meantime, I think all tasks have just run out, so we'll soon see if the validation backlog (currently 59k) will start to edge back down again. Edit: Just checked and no-one in my team has <any> tasks pending validation. Am I just lucky? Or is the backlog not real? |
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982 |
In the meantime, I think all tasks have just run out, so we'll soon see if the validation backlog (currently 59k) will start to edge back down again. Well, that changed quick. Validation backlog back down to nil and 700k tasks have popped up. We live to crunch another day |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
Just being 'old' isn't the worst thing in the world. Not for servers exposed constantly to the internet. Security fixes, bug fixes and support are fundamental (if you care about the project). There is also the performance factor: do you see the difference between a recent file system (ZFS 2.5) and an old one (0.7 - if true)? Which will inevitably result in someone asking whether they can afford the time and trouble to update it all or whether they should go in another direction entirely. +1 |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
Validation backlog back down to nil and 700k tasks have popped up And another day with over 46k wus pending validation... :-( |
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982 |
Validation backlog back down to nil and 700k tasks have popped up Yes, and now 88k. But I just looked through my team's tasks again and it's the same as a few days ago. A high figure showing on the server status page, but none of my team have <any> tasks awaiting validation. Is this 2 coincidences in a row? I'm certainly confused. |
Bryn Mawr Joined: 26 Dec 18 Posts: 393 Credit: 12,110,248 RAC: 6,015 |
Validation backlog back down to nil and 700k tasks have popped up You are the lucky one. The problem appears to have started for me at 02:00 GMT; for the next hour I had about 50% pending, and since then I’ve only had 5 validated out of nearly 100 completed. |
robertmiles Joined: 16 Jun 08 Posts: 1232 Credit: 14,281,662 RAC: 1,807 |
Validation backlog back down to nil and 700k tasks have popped up Could it mean that the validator processes for some operating systems correctly, but not for some others? |
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982 |
Validation backlog back down to nil and 700k tasks have popped up I've just looked at your pending tasks and I'm amazed at the backlog. I just returned 8 tasks and, while they didn't validate immediately, it only took 20-30 minutes, not 20hrs! I'm almost apologetic about my success. I've done nothing to warrant it, certainly. Definitely some strange and inexplicable business going on. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647 |
Could it mean that the validator processes for some operating systems correctly, but not for some others? Most likely a disk issue: if your results are on the disk that is having issues, then you get stuck with all the pendings. If you're lucky and they're on those that are OK, then it's no problem for the database to read & Validate them & then transition the result & then remove it. To me it's looking more and more like a dodgy drive in an array issue (or if it's a hardware RAID controller, then the disk(s) might be OK but the controller might be having issues with a channel or two...). All wild speculation on my part. Unfortunately the site that provides the BOINC graphs has been having issues, but it's come back up and it shows the Validation backlog this time hit 125k, but is now falling at roughly 40k per hour. Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647 |
Validator backlog is back with a vengeance, and boinc-process host is officially dead again on the Server Status page. Oh, and for an idea of how much better CPUs have become over the years, the Xeon E3-1280 v5 in the graph below is 200 MHz slower & has 3 GB/s less memory bandwidth than the E3-1270 v6 CPU being used for the database server here at Rosetta (so close enough for there to be bugger all difference in performance between them). The EPYC 4124P has the same thread & core count as the Xeon E3-1280 v5, the same TDP rating, but double the performance. And the EPYC 4564P, a bit over double the power, but with 8 times the performance.... Grant Darwin NT |
[VENETO] boboviz Joined: 1 Dec 05 Posts: 1994 Credit: 9,623,704 RAC: 9,591 |
The EPYC 4124P has the same thread & core count as the Xeon E3-1280 v5, the same TDP rating, but double the performance. I don't know if the problem is the cpu. I think it's much more likely the hd/ssd systems. And I continue to consider that the sw/os is as important as the hw. If you scroll this page, you can see how old the file system of the R@H server is. And here, here, etc, some ideas about optimization of file system resources |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647 |
I don't know if the problem is the cpu. I think it's much more likely the hd/ssd systems. So do I; however, a couple of new systems could replace all of the existing systems, provide much better performance, and use less power. They could spend days, weeks, months (and money) sorting out exactly what is dying on the current system, or just get one new half-decent system to replace the existing problem hardware and sort out the old one at leisure & keep it for emergencies/other needs. And I continue to consider that the sw/os is as important as the hw. Actually extremely old. There have been plenty of performance updates over the years, let alone security-based ones, that would make it worthwhile upgrading to the current releases IMHO. Grant Darwin NT |
Sid Celery Joined: 11 Feb 08 Posts: 2124 Credit: 41,228,659 RAC: 10,982 |
Boinc-process server is back and validation seems to be working, with a 330k backlog to work through |
JLDun Joined: 31 May 08 Posts: 8 Credit: 71,072 RAC: 527 |
Getting some "transient https errors" in attempting to download some tasks. |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647 |
Getting some "transient https errors" in attempting to download some tasks. No issues with your net connection in general? Looked at my Event log, and no signs of issues with uploads or downloads. Grant Darwin NT |
Grant (SSSF) Joined: 28 Mar 20 Posts: 1681 Credit: 17,854,150 RAC: 22,647 |
Boinc-process server is back and validation seems to be working, with a 330k backlog to work through This time it appears to be doing well straight off - the backlog is almost cleared and all of my Pendings have already cleared. Grant Darwin NT |
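
For anyone wanting to sanity-check the backlog figures quoted in this thread, here is a rough sketch of the arithmetic. It assumes the backlog drains at a constant net rate, which the posts above suggest is not really true, and the function name `hours_to_clear` is made up for illustration. The only inputs are numbers read off the server status page and the BOINC graphs by posters (125k backlog, falling at roughly 40k per hour).

```python
def hours_to_clear(backlog: int, drain_per_hour: float) -> float:
    """Hours needed to clear `backlog` results at a constant net drain rate.

    Hypothetical helper: assumes the rate never changes, which is a
    simplification of what the validators actually do.
    """
    if drain_per_hour <= 0:
        # A zero or negative drain rate means the backlog is not shrinking.
        raise ValueError("drain rate must be positive for the backlog to clear")
    return backlog / drain_per_hour


# Approximate figures quoted in the thread: 125k pending, ~40k validated per hour.
print(hours_to_clear(125_000, 40_000))
```

At those quoted figures the estimate comes out at a little over three hours, so a peak of that size clearing within an afternoon is plausible as long as the validators stay up.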
©2024 University of Washington
https://www.bakerlab.org