Ticket #232 (new defect)

Opened 2 years ago

Last modified 1 year ago

FahMon 2.3.99.1 says WU is hung when it's not

Reported by: RubberDuck
Assigned to: uncle_fungus
Priority: critical
Milestone:
Component: Monitoring system
Keywords:
Cc:

Description

FahMon 2.3.99.1 says the WU is hung when it's not. On the last 3 or 4 WUs, FahMon 2.3.99.1 has reported the WUs as hung when they're not. I have Electron Microscope III running and it says the WU is good. Also, if you watch it, the progress keeps going up (68% to 69% to 70% and so on). The WU finishes, and when the client downloads a new one, FahMon reports it as hung before progress gets past 1%.

Attachments

untitled2.JPG (38.2 kB) - added by RubberDuck on 11/02/09 12:13:30.
FahMon screenshot

Change History

11/02/09 12:13:30 changed by RubberDuck

  • attachment untitled2.JPG added.

FahMon screenshot

11/03/09 09:27:24 changed by lp0cua0

Same issue here. It appears to be caused by a low-% WU getting tossed due to UNSTABLE_MACHINE with a Checkpoint failure (caused by failing HD/SATA retry timeouts in this case). FahMon reports 3% complete, then HUNG for the ETA. I do notice that the line immediately following the 3% is a full line of whitespace; this does not occur in any normal log, as far as I can tell from looking over my old logs. Not sure if this has anything to do with the error, but worth pointing out.

(Look for linebreaks to point out where the error happens, the add/change box stripped the LF's from a notepad copy/paste)

[05:24:46] Starting GUI Server
[05:32:03] Completed 1%
[05:39:16] Completed 2%
[05:43:32] + Working...

[05:46:38] Completed 3%

<<Full line of whitespaces occurs here, webform may not show them correctly>>

--- Opening Log file [November 3 06:04:03 UTC]

# Windows GPU Console Edition #################################################
###############################################################################

Folding@Home Client Version 6.23

http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Users\Administrator\AppData\Roaming\Folding@home-gpu

[06:04:03] - Ask before connecting: No
[06:04:03] - User name: lp0 (Team 0)
[06:04:03] - User ID: 31BB072156BFD2DF
[06:04:03] - Machine ID: 2
[06:04:03]
[06:04:04] Loaded queue successfully.
[06:04:04] Initialization complete
[06:04:04]
[06:04:04] + Processing work unit
[06:04:05] Core required: FahCore_11.exe
[06:04:05] Core found.
[06:04:05] Working on queue slot 08 [November 3 06:04:05 UTC]
[06:04:05] + Working ...
[06:04:06]
[06:04:06] *------------------------------*
[06:04:06] Folding@Home GPU Core - Beta
[06:04:06] Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
[06:04:06]
[06:04:06] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[06:04:06] Build host: amoeba
[06:04:06] Board Type: AMD
[06:04:06] Core :
[06:04:06] Preparing to commence simulation
[06:04:06] - Ensuring status. Please wait.
[06:04:16] - Looking at optimizations...
[06:04:16] - Working with standard loops on this execution.
[06:04:16] - Previous termination of core was improper.
[06:04:16] - Files status OK
[06:04:16] - Expanded 96572 -> 489152 (decompressed 506.5 percent)
[06:04:16] Called DecompressByteArray: compressed_data_size=96572 data_size=489152, decompressed_data_size=489152 diff=0
[06:04:16] - Digital signature verified
[06:04:16]
[06:04:16] Project: 5739 (Run 2, Clone 108, Gen 415)
[06:04:16]
[06:04:16] Entering M.D.
[06:04:22] Will resume from checkpoint file
[06:04:22] Tpr hash work/wudata_08.tpr: 3118545753 304953393 3450966592 283153951 2716831802
[06:04:24] Working on Protein
[06:04:25] Client config found, loading data.
[06:04:26] Starting GUI Server
[06:04:31] Resuming from checkpoint
[06:04:31] fcCheckPointResume: retreived and current tpr file hash:
[06:04:31] 0 0 3118545753
[06:04:31] 1 0 304953393
[06:04:31] 2 0 3450966592
[06:04:31] 3 0 283153951
[06:04:31] 4 0 2716831802

[06:04:31] fcCheckPointResume: file hashes different -- aborting.

[06:04:31] mdrun_gpu returned

[06:04:31] Checkpoint failure

[06:04:31]

[06:04:31] Folding@home Core Shutdown: UNSTABLE_MACHINE

[06:04:34] CoreStatus = 7A (122)
[06:04:34] Sending work to server
[06:04:34] Project: 5739 (Run 2, Clone 108, Gen 415)
[06:04:34] - Read packet limit of 540015616... Set to 524286976.
[06:04:34] - Error: Could not get length of results file work/wuresults_08.dat
[06:04:34] - Error: Could not read unit 08 file. Removing from queue.
[06:04:34] - Preparing to get new work unit...
[06:04:34] + Attempting to get work packet
[06:04:34] - Connecting to assignment server
[06:04:34] - Successful: assigned to (171.64.65.102).
[06:04:34] + News From Folding@Home: Welcome to Folding@Home
[06:04:35] Loaded queue successfully.
[06:04:36] + Closed connections
[06:04:41]
[06:04:41] + Processing work unit
[06:04:41] Core required: FahCore_11.exe
[06:04:41] Core found.
[06:04:41] Working on queue slot 09 [November 3 06:04:41 UTC]
[06:04:41] + Working ...
[06:04:41]
[06:04:41] *------------------------------*
[06:04:41] Folding@Home GPU Core - Beta
[06:04:41] Version 1.24 (Mon Feb 9 11:00:12 PST 2009)
[06:04:41]
[06:04:41] Compiler : Microsoft (R) 32-bit C/C++ Optimizing Compiler Version 14.00.50727.762 for 80x86
[06:04:41] Build host: amoeba
[06:04:41] Board Type: AMD
[06:04:41] Core :
[06:04:41] Preparing to commence simulation
[06:04:41] - Looking at optimizations...
[06:04:41] - Created dyn
[06:04:41] - Files status OK
[06:04:41] - Expanded 96572 -> 489152 (decompressed 506.5 percent)
[06:04:41] Called DecompressByteArray: compressed_data_size=96572 data_size=489152, decompressed_data_size=489152 diff=0
[06:04:41] - Digital signature verified
[06:04:41]
[06:04:41] Project: 5739 (Run 2, Clone 108, Gen 415)
[06:04:41]
[06:04:41] Assembly optimizations on if available.
[06:04:41] Entering M.D.
[06:04:47] Tpr hash work/wudata_09.tpr: 3118545753 304953393 3450966592 283153951 2716831802
[06:04:47] Working on Protein
[06:04:47] Client config found, loading data.
[06:04:47] Starting GUI Server

[06:12:27] Completed 1%

[06:20:05] Completed 2%

[06:27:47] Completed 3%

[06:35:17] Completed 4%

[06:42:48] Completed 5%

[06:50:14] Completed 6%

[06:57:44] Completed 7%
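
The whitespace-only line mentioned above is easy to miss by eye, so a small script can check older logs for it. This is only a sketch: the function name is made up for illustration, and nothing here shows that such a line is what actually trips FahMon's hung detection.

```python
def whitespace_only_lines(log_text):
    """Return 1-based numbers of lines that are not empty but hold only whitespace."""
    return [
        number
        for number, line in enumerate(log_text.splitlines(), start=1)
        if line != "" and line.strip() == ""
    ]


# Sample built from the log excerpt above; the run of spaces is an assumption
# about what the stripped line looked like.
sample = (
    "[05:46:38] Completed 3%\n"
    "      \n"
    "\n"
    "--- Opening Log file [November 3 06:04:03 UTC]\n"
)
print(whitespace_only_lines(sample))
```

Running it over the text of a known-good FAHlog.txt and over the failing one would show whether the whitespace line really is unique to the failing case.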

(follow-up: ↓ 3 ) 11/03/09 19:59:15 changed by RubberDuck

Here is a copy of the F@H log for that WU

Folding@Home Client Version 6.23

http://folding.stanford.edu

###############################################################################
###############################################################################

Launch directory: C:\Abit
Service: C:\Abit\Folding@home-Win32-x86.exe
Arguments: -svcstart -d C:\Abit -forceasm

Launched as a service. Entered C:\Abit to do work.

Warning:

By using the -forceasm flag, you are overriding safeguards in the program. If you did not intend to do this, please restart the program without -forceasm. If work units are not completing fully (and particularly if your machine is overclocked), then please discontinue use of the flag.

[01:52:38] - Ask before connecting: No
[01:52:38] - User name: Cummins (Team 13285)
[01:52:38] - User ID: 444E177940790DF5
[01:52:38] - Machine ID: 1
[01:52:38]
[01:52:38] Loaded queue successfully.
[01:52:38]
[01:52:38] + Processing work unit
[01:52:38] Core required: FahCore_82.exe
[01:52:38] Core found.
[01:52:38] Working on queue slot 04 [November 3 01:52:38 UTC]
[01:52:38] + Working ...
[01:52:40]
[01:52:40] *------------------------------*
[01:52:40] Folding@Home PMD Core
[01:52:40] Version 1.03 (September 7, 2005)
[01:52:40]
[01:52:40] Preparing to commence simulation
[01:52:40] - Assembly optimizations manually forced on.
[01:52:40] - Not checking prior termination.
[01:52:40] - Expanded 22012 -> 140259 (decompressed 637.1 percent)
[01:52:40]
[01:52:40] Project: 4608 (Run 7, Clone 88, Gen 28)
[01:52:40]
[01:52:40] Assembly optimizations on if available.
[01:52:40] Entering M.D.
[01:57:39] Protein: p4608_T0_PSBD-16_minout
[01:57:39]
[01:57:39] Completed 1900000 out of 2500000 steps (76%)
[02:05:42] Writing local files
[02:05:42] Completed 1925000 out of 2500000 steps (77%)
[02:07:40] Writing checkpoint files
[02:12:27] Writing local files
[02:12:27] Completed 1950000 out of 2500000 steps (78%)
[02:12:27] Writing checkpoint files
[02:19:10] Writing local files
[02:19:10] Completed 1975000 out of 2500000 steps (79%)
[02:25:51] Writing local files
[02:25:51] Completed 2000000 out of 2500000 steps (80%)
[02:25:51] Writing checkpoint files
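
To show that the WU really is progressing, the time between "Completed" lines can be pulled out of the log. The sketch below is not FahMon code; the regular expression and function name are assumptions written to fit the two progress formats that appear in the logs on this ticket.

```python
import re
from datetime import datetime, timedelta

# Matches "[hh:mm:ss] Completed 3%" as well as
# "[hh:mm:ss] Completed 1925000 out of 2500000 steps (77%)".
PROGRESS = re.compile(
    r"^\[(\d{2}:\d{2}:\d{2})\] Completed (?:\d+ out of \d+ steps \()?(\d+)%\)?"
)


def frame_times(log_lines):
    """Return (from_percent, to_percent, elapsed) for each pair of consecutive progress lines."""
    points = []
    for line in log_lines:
        match = PROGRESS.match(line.strip())
        if match:
            stamp = datetime.strptime(match.group(1), "%H:%M:%S")
            points.append((stamp, int(match.group(2))))

    frames = []
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        elapsed = t1 - t0
        if elapsed < timedelta(0):
            # The log carries no date, so a negative gap is treated as crossing midnight.
            elapsed += timedelta(days=1)
        frames.append((p0, p1, elapsed))
    return frames


log = [
    "[01:57:39] Completed 1900000 out of 2500000 steps (76%)",
    "[02:05:42] Writing local files",
    "[02:05:42] Completed 1925000 out of 2500000 steps (77%)",
    "[02:12:27] Completed 1950000 out of 2500000 steps (78%)",
]
for start, end, elapsed in frame_times(log):
    print(f"{start}% -> {end}%: {elapsed}")
```

Steady gaps of a few minutes per percent, as in this log, are what a running client looks like; a hung one would stop producing new "Completed" lines.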

(in reply to: ↑ 2 ; follow-up: ↓ 4 ) 11/16/09 17:51:25 changed by RubberDuck


The only way I can get FahMon to work (not say the WU is hung) is to put a check mark next to "Client is on a Virtual Machine".

(in reply to: ↑ 3 ; follow-up: ↓ 5 ) 11/16/09 17:53:56 changed by RubberDuck


Update

The only way I can get FahMon to work (not say the WU is hung) is to put a check mark next to "Client is on a Virtual Machine". If there is anything else you need, let me know. Thanks

(in reply to: ↑ 4 ) 11/16/09 17:54:33 changed by RubberDuck


(follow-up: ↓ 7 ) 11/16/09 17:54:59 changed by RubberDuck

Update

The only way I can get FahMon to work (not say the WU is hung) is to put a check mark next to "Client is on a Virtual Machine". If there is anything else you need, let me know. Thanks

(in reply to: ↑ 6 ; follow-ups: ↓ 8 ↓ 9 ↓ 10 ) 11/16/09 17:55:44 changed by RubberDuck

Replying to RubberDuck:

Update: The only way I can get FahMon to work (not say the WU is hung) is to put a check mark next to "Client is on a Virtual Machine". If there is anything else you need, let me know. Thanks

(in reply to: ↑ 7 ) 11/16/09 17:55:55 changed by RubberDuck

(in reply to: ↑ 7 ) 11/17/09 23:43:41 changed by RubberDuck

Sorry duplicate post.

(in reply to: ↑ 7 ) 11/17/09 23:45:58 changed by RubberDuck

delete

10/22/10 10:10:38 changed by uncle_fungus

  • component changed from Documentation to Monitoring system.

11/06/10 21:09:28 changed by Racer43

This is also happening with 2.3.99.3. When running the SMP client solo, it runs fine. After adding a GPU client, FahMon now reports that both are hung; both are in fact running according to their individual logs and CPU usage. Ran the same clients with the same data files under 2.3.99.1 with no hung messages. There is no unstable machine error in either log file. Checking "Client is on Virtual Machine" does clear up the error.

11/12/10 11:29:06 changed by uncle_fungus

Please check that your timezone settings are correct in FahMon.
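
For anyone wondering why the timezone matters: the client writes its log timestamps in UTC, so a monitor has to convert its own clock to UTC before comparing. The sketch below illustrates that general idea only. It is not FahMon's actual check, which this ticket does not describe; the function names and the 30-minute threshold are hypothetical.

```python
from datetime import datetime, timedelta


def apparent_age(last_log_utc, local_now, configured_utc_offset_hours):
    """Age of the last log entry as seen by a monitor that converts its
    local clock to UTC using the configured offset."""
    now_utc = local_now - timedelta(hours=configured_utc_offset_hours)
    return now_utc - last_log_utc


def looks_hung(age, threshold=timedelta(minutes=30)):
    # Hypothetical rule: no log activity for longer than the threshold.
    return age > threshold


# Last progress line from the log above, written in UTC.
last_log = datetime(2009, 11, 3, 6, 57, 44)
# Wall clock on an assumed machine running at UTC+1.
local_now = datetime(2009, 11, 3, 8, 0, 0)

correct = apparent_age(last_log, local_now, configured_utc_offset_hours=1)
wrong = apparent_age(last_log, local_now, configured_utc_offset_hours=0)

print(correct, looks_hung(correct))
print(wrong, looks_hung(wrong))
```

With the right offset the last entry is a couple of minutes old. With the offset wrong by one hour it appears over an hour old, which a rule like this would flag as hung even though the client is working.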