GraceDB Server issues
https://git.ligo.org/computing/gracedb/server/-/issues
2023-07-28T19:19:26Z
https://git.ligo.org/computing/gracedb/server/-/issues/323
Consider increasing the configuration parameter "max_wal_size".
2023-07-28T19:19:26Z
Alexander Pace
There were some timeouts on `gracedb-playground` this afternoon (2023-07-28), from roughly 18:40-18:43 UTC, that I think were triggered in part by a `VACUUM FULL` I was running while doing some exploratory maintenance on playground's db. During the period in question, `gracedb-playground`'s RDS logs contained the following lines:
```
2023-07-28 18:35:50 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:36:12 UTC::@:[393]:LOG: checkpoint complete: wrote 39902 buffers (16.5%); 0 WAL file(s) added, 0 removed, 16 recycled; write=20.183 s, sync=1.326 s, total=21.691 s; sync files=211, longest=1.323 s, average=0.007 s; distance=1048579 kB, estimate=1048579 kB
2023-07-28 18:36:13 UTC::@:[393]:LOG: checkpoints are occurring too frequently (23 seconds apart)
2023-07-28 18:36:13 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:36:13 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:36:39 UTC::@:[393]:LOG: checkpoint complete: wrote 231 buffers (0.1%); 0 WAL file(s) added, 0 removed, 13 recycled; write=25.661 s, sync=0.420 s, total=26.123 s; sync files=112, longest=0.399 s, average=0.004 s; distance=1048586 kB, estimate=1048586 kB
2023-07-28 18:36:49 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:14 UTC::@:[393]:LOG: checkpoint complete: wrote 2019 buffers (0.8%); 0 WAL file(s) added, 2 removed, 17 recycled; write=24.321 s, sync=0.191 s, total=25.505 s; sync files=138, longest=0.190 s, average=0.002 s; distance=1049475 kB, estimate=1049475 kB
2023-07-28 18:37:17 UTC::@:[393]:LOG: checkpoints are occurring too frequently (28 seconds apart)
2023-07-28 18:37:17 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:37:17 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:24 UTC::@:[393]:LOG: checkpoint complete: wrote 69 buffers (0.0%); 0 WAL file(s) added, 0 removed, 10 recycled; write=6.996 s, sync=0.342 s, total=7.539 s; sync files=34, longest=0.342 s, average=0.011 s; distance=1065103 kB, estimate=1065103 kB
2023-07-28 18:37:30 UTC::@:[393]:LOG: checkpoints are occurring too frequently (13 seconds apart)
2023-07-28 18:37:30 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:37:30 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:33 UTC::@:[393]:LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 9 recycled; write=0.480 s, sync=0.190 s, total=2.933 s; sync files=4, longest=0.190 s, average=0.048 s; distance=1056458 kB, estimate=1064239 kB
2023-07-28 18:38:33 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:38:49 UTC::@:[393]:LOG: checkpoint complete: wrote 171 buffers (0.1%); 0 WAL file(s) added, 0 removed, 19 recycled; write=15.533 s, sync=0.120 s, total=16.420 s; sync files=89, longest=0.120 s, average=0.002 s; distance=1034294 kB, estimate=1061244 kB
2023-07-28 18:39:19 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:39:36 UTC::@:[393]:LOG: checkpoint complete: wrote 171 buffers (0.1%); 0 WAL file(s) added, 0 removed, 14 recycled; write=17.051 s, sync=0.006 s, total=17.104 s; sync files=94, longest=0.006 s, average=0.001 s; distance=1063328 kB, estimate=1063328 kB
2023-07-28 18:40:59 UTC::@:[393]:LOG: checkpoint complete: wrote 517 buffers (0.2%); 0 WAL file(s) added, 11 removed, 17 recycled; write=28.949 s, sync=0.112 s, total=29.842 s; sync files=181, longest=0.111 s, average=0.001 s; distance=1040638 kB, estimate=1061059 kB
2023-07-28 18:41:00 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:41:11 UTC::@:[393]:LOG: checkpoint complete: wrote 118 buffers (0.0%); 0 WAL file(s) added, 0 removed, 14 recycled; write=10.732 s, sync=0.280 s, total=11.601 s; sync files=47, longest=0.280 s, average=0.006 s; distance=1084223 kB, estimate=1084223 kB
2023-07-28 18:41:14 UTC::@:[393]:LOG: checkpoints are occurring too frequently (14 seconds apart)
2023-07-28 18:41:14 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:41:14 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:41:16 UTC::@:[393]:LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 5 recycled; write=1.227 s, sync=0.054 s, total=2.786 s; sync files=2, longest=0.054 s, average=0.027 s; distance=1037553 kB, estimate=1079556 kB
2023-07-28 18:42:12 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:42:16 UTC::@:[393]:LOG: checkpoint complete: wrote 34 buffers (0.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=3.448 s, sync=0.090 s, total=3.948 s; sync files=22, longest=0.090 s, average=0.005 s; distance=1012093 kB, estimate=1072810 kB
2023-07-28 18:43:39 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:43:41 UTC::@:[393]:LOG: checkpoint complete: wrote 11 buffers (0.0%); 0 WAL file(s) added, 0 removed, 16 recycled; write=1.116 s, sync=0.181 s, total=2.198 s; sync files=8, longest=0.181 s, average=0.023 s; distance=1103069 kB, estimate=1103069 kB
```
This also occurred during a period of high load on the relational database:
![Screen_Shot_2023-07-28_at_3.13.56_PM](/uploads/95a62730a64f5a8d0c75d39d8c809705/Screen_Shot_2023-07-28_at_3.13.56_PM.png)
I haven't seen these hints and warnings on production, even when the database gets `VACUUM`'ed, so hopefully this can be chalked up to another of playground's growing pains. Either way, it's worth considering some of the recommendations that the internet has to offer (a rough way to check playground's current settings is sketched after the list):
* https://www.crunchydata.com/blog/tuning-your-postgres-database-for-high-write-loads
* https://www.enterprisedb.com/blog/tuning-maxwalsize-postgresql
* https://stackoverflow.com/questions/75134262/why-do-i-have-the-message-max-wal-size-suddenly-appearing-in-my-postgres-logs
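As a starting point, here is a sketch of how we could check playground's current WAL/checkpoint settings and see how often checkpoints are being forced by WAL volume rather than by the timeout. The endpoint, user, and database names are placeholders; `pg_stat_bgwriter` is where PostgreSQL (through v16) keeps the checkpoint counters:

```
# Current WAL/checkpoint-related settings (stock PostgreSQL defaults are
# max_wal_size=1GB and checkpoint_timeout=5min; the RDS parameter group may override them).
psql -h <playground-rds-endpoint> -U <user> -d <dbname> -c "
  SELECT name, setting, unit
  FROM pg_settings
  WHERE name IN ('max_wal_size', 'min_wal_size',
                 'checkpoint_timeout', 'checkpoint_completion_target');"

# Checkpoints forced by WAL volume (checkpoints_req) vs. scheduled ones
# (checkpoints_timed); a large checkpoints_req count is the same signal as the
# 'checkpoints are occurring too frequently' HINTs in the log excerpt above.
psql -h <playground-rds-endpoint> -U <user> -d <dbname> -c "
  SELECT checkpoints_timed, checkpoints_req FROM pg_stat_bgwriter;"
```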
And once those parameters are tuned and validated in the `gracedb-postgresql-dev` parameter group, apply the same changes to production.
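For reference, a rough sketch of what bumping `max_wal_size` on the dev parameter group could look like with the AWS CLI. The value 4096 (i.e. 4 GB) is only illustrative; on recent PostgreSQL versions the parameter is in MB, and since it's a dynamic parameter, `ApplyMethod=immediate` should not require a reboot:

```
# Raise max_wal_size on the dev/playground parameter group
# (value in MB; 4096 = 4 GB is illustrative, not a recommendation).
aws rds modify-db-parameter-group \
    --db-parameter-group-name gracedb-postgresql-dev \
    --parameters "ParameterName=max_wal_size,ParameterValue=4096,ApplyMethod=immediate"

# Confirm what the parameter group now holds for max_wal_size.
aws rds describe-db-parameters \
    --db-parameter-group-name gracedb-postgresql-dev \
    --query "Parameters[?ParameterName=='max_wal_size'].[ParameterValue,ApplyMethod]"
```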