There are two interesting points to take away from this:

1. The serial time is dominated by writing checkpoints: a large spike appears every 10 minutes (the high core-count jobs finish sooner, so they show fewer data points). Checkpoints are insurance against system crashes, and losing an hour of walltime in such a rare event is not a big deal, so checkpointing every 10 minutes pays a large cost for very little extra benefit. At 256 cores, the code spends over 15% of its CPU time writing checkpoints. Put another way, the time saved by checkpointing less frequently could simply be used to re-compute the lost work in the (rare) event of a crash. Note that we cannot invoke Amdahl's law directly here, because the snapshots are triggered by wall time, not by iterations (CPU time). A rough cost model for choosing the checkpoint interval is sketched just after this list.

2. The barrier time (a measure of the dispersion in the workers' finishing times) starts quite high, at 20-40%, before decreasing to zero. This is the other reason for the poor scaling: in the 256-core case, the job finishes before the dispersion has dropped to zero. A minimal sketch of how this barrier time can be measured also follows the list.
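To make the checkpointing trade-off concrete, here is a minimal back-of-the-envelope sketch. All of the inputs (checkpoint write time, crash rate, run length) are illustrative assumptions rather than measured values from these runs; the point is only that the overhead of frequent checkpoints quickly dwarfs the expected cost of re-computing lost work.

```python
# Back-of-the-envelope model for choosing a checkpoint interval.
# All numbers below are illustrative assumptions, not measurements.

def checkpoint_overhead_fraction(write_time_s, interval_s):
    """Fraction of wall time spent writing checkpoints."""
    return write_time_s / (interval_s + write_time_s)

def expected_recompute_s(interval_s, crash_rate_per_hour, run_hours):
    """Expected wall time lost to re-computation, assuming on average
    half a checkpoint interval of work is lost per crash."""
    expected_crashes = crash_rate_per_hour * run_hours
    return expected_crashes * interval_s / 2.0

if __name__ == "__main__":
    write_time_s = 90.0        # assumed time to write one checkpoint
    crash_rate = 1.0 / 500.0   # assumed crashes per wall-clock hour
    run_hours = 10.0           # assumed total run length

    for interval_min in (10, 30, 60, 120):
        interval_s = interval_min * 60.0
        overhead = checkpoint_overhead_fraction(write_time_s, interval_s)
        recompute = expected_recompute_s(interval_s, crash_rate, run_hours)
        print(f"every {interval_min:4d} min: overhead {overhead:6.1%}, "
              f"expected recompute {recompute:7.1f} s")
```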
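The barrier time itself can be measured per worker. The sketch below assumes an MPI-style setup and uses mpi4py purely for illustration; it is not the instrumentation used in the profiled code. Each rank times how long it waits at the barrier, and the spread of those waits is the dispersion discussed in point 2.

```python
# Minimal illustration of measuring per-rank barrier (wait) time with mpi4py.
# This is a generic sketch, not the actual instrumentation of the profiled code.
import time
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

def do_work(rank):
    """Stand-in for one block of the worker's real computation."""
    time.sleep(0.01 * (rank % 4 + 1))

compute_start = time.perf_counter()
do_work(rank)
barrier_start = time.perf_counter()
comm.Barrier()                      # fast ranks wait here for slow ones
barrier_end = time.perf_counter()

compute_s = barrier_start - compute_start
barrier_s = barrier_end - barrier_start
barrier_fraction = barrier_s / (compute_s + barrier_s)

# Collect each rank's barrier fraction on rank 0 for reporting.
fractions = comm.gather(barrier_fraction, root=0)
if rank == 0:
    print("barrier fraction per rank:",
          ["{:.0%}".format(f) for f in fractions])
```

Run with, for example, `mpirun -n 4 python barrier_timing.py`.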
## Reduced Order Quadrature

ROQ should be a factor of 5-10 faster, but this is not what we observe: the run completed only 1.4x faster. Profiling shows that, as before, the time spent doing computations is close to 100%. The fraction of barrier time is slightly higher, but not by enough to account for the discrepancy.
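A rough accounting using the numbers above gives a sense of the size of the gap:

```python
# Rough accounting of the ROQ speedup gap using the figures quoted above.
expected_speedup = 5.0   # lower end of the expected 5-10x range
observed_speedup = 1.4   # measured end-to-end speedup

# Compute time is ~100% and barrier time only slightly higher than before,
# so essentially none of the gap is explained by parallel overheads; it must
# come from each evaluation being slower than the ROQ approximation predicts.
implied_per_eval_shortfall = expected_speedup / observed_speedup
print(f"each evaluation is ~{implied_per_eval_shortfall:.1f}x slower than expected")
```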
The performance discrepancy is therefore most likely caused by a slowdown within each individual worker task. The next step is to confirm whether the expected ROQ speedup is achievable in serial, for example by timing single likelihood evaluations outside the sampler, as sketched below.
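The pattern below is generic: `full_likelihood` and `roq_likelihood` are hypothetical stand-ins for however the two likelihoods are actually constructed in this analysis, and should be replaced with the real calls.

```python
# Generic pattern for checking the ROQ speedup in serial, outside the sampler.
# full_likelihood() and roq_likelihood() are hypothetical stand-ins for the
# real likelihood evaluations; swap in the actual calls before drawing conclusions.
import random
import timeit

def full_likelihood():
    """Stand-in for one full likelihood evaluation."""
    return sum(random.random() for _ in range(100_000))

def roq_likelihood():
    """Stand-in for one ROQ likelihood evaluation."""
    return sum(random.random() for _ in range(20_000))

n = 200
t_full = timeit.timeit(full_likelihood, number=n) / n
t_roq = timeit.timeit(roq_likelihood, number=n) / n
print(f"full: {t_full * 1e3:.3f} ms/eval, ROQ: {t_roq * 1e3:.3f} ms/eval, "
      f"speedup: {t_full / t_roq:.1f}x")
```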