Update ADACS: Scaling and Profiling notes (authored by Conrad Chan)
There are 2 interesting points to take away from this:
1. The serial time is dominated by writing checkpoints: a huge spike is visible every 10 minutes (the high core count jobs finish sooner, so they have fewer data points). Checkpoints are insurance against system crashes, and losing 1 walltime hour of compute in that rare event is not a big deal, so checkpointing every 10 minutes pays a huge cost for very little extra benefit. At 256 cores, the code spends over 15% of its CPU time writing checkpoints. Put another way, the time saved by checkpointing less frequently could be used to re-compute the lost work in the event of a crash. Note: we can't invoke Amdahl's law directly here, because the snapshots happen based on wall time, not iterations (CPU time).
2. The barrier time (a measure of the dispersion of the workers' finishing times) starts quite high (20-40%) before decreasing to zero. This is the other reason for poor scaling. For the 256 core case, the job finishes before the dispersion drops to zero.
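As a rough sanity check on the checkpoint trade-off, Young's approximation gives the overhead-minimising checkpoint interval as sqrt(2 · C · MTBF), where C is the cost of writing one checkpoint. A minimal sketch of the arithmetic (the 90 s write cost and one-week MTBF are illustrative assumptions, not measured values from these runs):

```python
import math

def overhead_fraction(checkpoint_cost_s, interval_s):
    """Fraction of wall time spent writing checkpoints at a fixed cadence."""
    return checkpoint_cost_s / interval_s

def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's approximation for the checkpoint interval that minimises
    total overhead (checkpoint writes + expected recompute after a crash)."""
    return math.sqrt(2 * checkpoint_cost_s * mtbf_s)

# Illustrative numbers: a 90 s checkpoint write every 10 minutes...
print(overhead_fraction(90, 600))                 # 0.15
# ...versus the interval suggested by a one-week node MTBF, in hours.
print(young_interval(90, 7 * 24 * 3600) / 3600)
```

With these assumed numbers, the 10-minute cadence reproduces an overhead of about 15%, while Young's formula suggests an interval of roughly 3 hours, consistent with the argument above that frequent checkpoints buy very little.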
## Reduced Order Quadrature
ROQ should be a factor of 5-10 faster, but this is not the case: the run completed only 1.4x faster. Profiling shows that the fraction of time spent doing computations is similarly close to 100%. Though the fraction of barrier time is slightly higher, this does not account for the discrepancy.
![std_8](uploads/751dfa029a09b055c12d12d10fdb5aa4/std_8.png)
![roq_8](uploads/dc9f696a409207490fe60f20bf322b23/roq_8.png)
The performance discrepancy must therefore be caused by a slowdown within each individual worker task. We need to confirm whether the expected speedup is achievable in serial.
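One way to check this is to time a single likelihood evaluation outside MPI, so barrier and communication effects cannot mask per-task cost. A minimal sketch; the evaluator names are hypothetical stand-ins for the real entry points of the code being profiled:

```python
import time

def time_serial(fn, *args, repeats=5):
    """Best-of-N wall-clock timing of a single-worker call.

    Taking the minimum over several repeats reduces noise from the OS
    and from one-off warm-up costs (caches, JIT, file reads).
    """
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

# Hypothetical evaluators -- substitute the real standard and ROQ
# likelihood calls and compare:
# t_std = time_serial(standard_likelihood, params)
# t_roq = time_serial(roq_likelihood, params)
# print(f"serial speedup: {t_std / t_roq:.2f}x")
```

If the serial ratio falls well short of 5-10x, the slowdown lives in the worker task itself rather than in the parallel machinery.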