GraceDB Server issues
https://git.ligo.org/computing/gracedb/server/-/issues
2019-05-16T17:15:46Z
https://git.ligo.org/computing/gracedb/server/-/issues/141
Sentry instance for AWS
2019-05-16T17:15:46Z
Tanner Prestegard
We need a better way of handling errors in the AWS gracedb instance. The issues with superevent directory creation for S190412m generated about 4000 error emails to each of the admins and disrupted my email traffic for days.
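One option, since the issue title already points at Sentry: report exceptions to a Sentry instance instead of relying on Django's `ADMINS` email reports, since Sentry groups repeated errors into a single issue with an event count rather than sending one email per occurrence. Below is a minimal sketch, not the actual GraceDB configuration; the `SENTRY_DSN` environment variable and the sample-rate choice are assumptions.

```python
# Minimal sketch (assumed settings, not the real GraceDB config): send Django
# errors to a Sentry instance instead of fanning out ADMINS emails.
import os

import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    dsn=os.environ.get("SENTRY_DSN"),    # placeholder: DSN of the Sentry project
    integrations=[DjangoIntegration()],  # capture unhandled Django exceptions
    traces_sample_rate=0.0,              # error reporting only, no tracing
    send_default_pii=False,              # avoid leaking user data in reports
)
```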
https://git.ligo.org/computing/gracedb/server/-/issues/323
Consider increasing the configuration parameter "max_wal_size".
2023-07-28T19:19:26Z
Alexander Pace
There were some timeouts on `gracedb-playground` this afternoon (2023-07-28), from around 18:40-18:43 UTC, that I think were triggered at least in part by a `VACUUM FULL` I ran while doing some exploratory maintenance on playground's database. During the period in question, `gracedb-playground`'s RDS logs contained the following lines:
```
2023-07-28 18:35:50 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:36:12 UTC::@:[393]:LOG: checkpoint complete: wrote 39902 buffers (16.5%); 0 WAL file(s) added, 0 removed, 16 recycled; write=20.183 s, sync=1.326 s, total=21.691 s; sync files=211, longest=1.323 s, average=0.007 s; distance=1048579 kB, estimate=1048579 kB
2023-07-28 18:36:13 UTC::@:[393]:LOG: checkpoints are occurring too frequently (23 seconds apart)
2023-07-28 18:36:13 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:36:13 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:36:39 UTC::@:[393]:LOG: checkpoint complete: wrote 231 buffers (0.1%); 0 WAL file(s) added, 0 removed, 13 recycled; write=25.661 s, sync=0.420 s, total=26.123 s; sync files=112, longest=0.399 s, average=0.004 s; distance=1048586 kB, estimate=1048586 kB
2023-07-28 18:36:49 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:14 UTC::@:[393]:LOG: checkpoint complete: wrote 2019 buffers (0.8%); 0 WAL file(s) added, 2 removed, 17 recycled; write=24.321 s, sync=0.191 s, total=25.505 s; sync files=138, longest=0.190 s, average=0.002 s; distance=1049475 kB, estimate=1049475 kB
2023-07-28 18:37:17 UTC::@:[393]:LOG: checkpoints are occurring too frequently (28 seconds apart)
2023-07-28 18:37:17 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:37:17 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:24 UTC::@:[393]:LOG: checkpoint complete: wrote 69 buffers (0.0%); 0 WAL file(s) added, 0 removed, 10 recycled; write=6.996 s, sync=0.342 s, total=7.539 s; sync files=34, longest=0.342 s, average=0.011 s; distance=1065103 kB, estimate=1065103 kB
2023-07-28 18:37:30 UTC::@:[393]:LOG: checkpoints are occurring too frequently (13 seconds apart)
2023-07-28 18:37:30 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:37:30 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:37:33 UTC::@:[393]:LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 9 recycled; write=0.480 s, sync=0.190 s, total=2.933 s; sync files=4, longest=0.190 s, average=0.048 s; distance=1056458 kB, estimate=1064239 kB
2023-07-28 18:38:33 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:38:49 UTC::@:[393]:LOG: checkpoint complete: wrote 171 buffers (0.1%); 0 WAL file(s) added, 0 removed, 19 recycled; write=15.533 s, sync=0.120 s, total=16.420 s; sync files=89, longest=0.120 s, average=0.002 s; distance=1034294 kB, estimate=1061244 kB
2023-07-28 18:39:19 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:39:36 UTC::@:[393]:LOG: checkpoint complete: wrote 171 buffers (0.1%); 0 WAL file(s) added, 0 removed, 14 recycled; write=17.051 s, sync=0.006 s, total=17.104 s; sync files=94, longest=0.006 s, average=0.001 s; distance=1063328 kB, estimate=1063328 kB
2023-07-28 18:40:59 UTC::@:[393]:LOG: checkpoint complete: wrote 517 buffers (0.2%); 0 WAL file(s) added, 11 removed, 17 recycled; write=28.949 s, sync=0.112 s, total=29.842 s; sync files=181, longest=0.111 s, average=0.001 s; distance=1040638 kB, estimate=1061059 kB
2023-07-28 18:41:00 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:41:11 UTC::@:[393]:LOG: checkpoint complete: wrote 118 buffers (0.0%); 0 WAL file(s) added, 0 removed, 14 recycled; write=10.732 s, sync=0.280 s, total=11.601 s; sync files=47, longest=0.280 s, average=0.006 s; distance=1084223 kB, estimate=1084223 kB
2023-07-28 18:41:14 UTC::@:[393]:LOG: checkpoints are occurring too frequently (14 seconds apart)
2023-07-28 18:41:14 UTC::@:[393]:HINT: Consider increasing the configuration parameter "max_wal_size".
2023-07-28 18:41:14 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:41:16 UTC::@:[393]:LOG: checkpoint complete: wrote 4 buffers (0.0%); 0 WAL file(s) added, 0 removed, 5 recycled; write=1.227 s, sync=0.054 s, total=2.786 s; sync files=2, longest=0.054 s, average=0.027 s; distance=1037553 kB, estimate=1079556 kB
2023-07-28 18:42:12 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:42:16 UTC::@:[393]:LOG: checkpoint complete: wrote 34 buffers (0.0%); 0 WAL file(s) added, 0 removed, 18 recycled; write=3.448 s, sync=0.090 s, total=3.948 s; sync files=22, longest=0.090 s, average=0.005 s; distance=1012093 kB, estimate=1072810 kB
2023-07-28 18:43:39 UTC::@:[393]:LOG: checkpoint starting: wal
2023-07-28 18:43:41 UTC::@:[393]:LOG: checkpoint complete: wrote 11 buffers (0.0%); 0 WAL file(s) added, 0 removed, 16 recycled; write=1.116 s, sync=0.181 s, total=2.198 s; sync files=8, longest=0.181 s, average=0.023 s; distance=1103069 kB, estimate=1103069 kB
```
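To put a number on how often WAL-size-driven ("requested") checkpoints happen compared to timed ones, something like the query below could be run against the instance. `pg_stat_bgwriter` is the standard statistics view for this on PostgreSQL up to version 16; the connection parameters here are placeholders, not the real RDS endpoint.

```python
# Rough sketch: compare timed vs. requested (WAL-size-driven) checkpoints.
# Connection details are placeholders; adjust for the actual RDS instance.
import psycopg2

conn = psycopg2.connect(
    host="gracedb-playground-db.example.com",  # placeholder RDS endpoint
    dbname="gracedb",
    user="gracedb",
    password="********",
)
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT checkpoints_timed, checkpoints_req, stats_reset "
        "FROM pg_stat_bgwriter;"
    )
    timed, requested, reset = cur.fetchone()
    # A large checkpoints_req relative to checkpoints_timed is the same
    # signal as the "checkpoints are occurring too frequently" log hint.
    print(f"timed={timed} requested={requested} since {reset}")
conn.close()
```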
This also occurred during a period of high load on the relational database:
![Screen_Shot_2023-07-28_at_3.13.56_PM](/uploads/95a62730a64f5a8d0c75d39d8c809705/Screen_Shot_2023-07-28_at_3.13.56_PM.png)
I haven't seen these hints and warnings on production, even when the database gets `VACUUM`'ed, so hopefully this can be chalked up to another example of playground's growing pains. Either way, it's worth considering some of the recommendations available online:
* https://www.crunchydata.com/blog/tuning-your-postgres-database-for-high-write-loads
* https://www.enterprisedb.com/blog/tuning-maxwalsize-postgresql
* https://stackoverflow.com/questions/75134262/why-do-i-have-the-message-max-wal-size-suddenly-appearing-in-my-postgres-logs
Once those parameters are tuned and validated in the `gracedb-postgresql-dev` parameter group, apply them to production.
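For reference, a sketch of what bumping `max_wal_size` in the dev parameter group might look like via boto3. The 4096 MB value is only an example, not a validated setting, and applying the change to production would target the production parameter group (not named here). `max_wal_size` is a dynamic PostgreSQL parameter, so `immediate` should apply without a reboot, but that's worth confirming for the engine version in use.

```python
# Sketch: raise max_wal_size in the gracedb-postgresql-dev parameter group.
# The value below is an example target, not a tuned/validated number.
import boto3

rds = boto3.client("rds")
rds.modify_db_parameter_group(
    DBParameterGroupName="gracedb-postgresql-dev",
    Parameters=[
        {
            "ParameterName": "max_wal_size",
            "ParameterValue": "4096",   # units are MB; pick based on the tuning links above
            "ApplyMethod": "immediate", # dynamic parameter, no reboot expected
        }
    ],
)
```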