Occasional 500 error when reading files
There is an occasional 500 error returned by the cloud instances when attempting to read files. It occurs infrequently and randomly enough that I'm not able to reproduce it, but it does it gwcelery's workflow on occasion (~2 times per week). And example error traceback looks like:
Traceback (most recent call last):
File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/exception.py", line 47, in inner
response = get_response(request)
File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 181, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python3.7/dist-packages/django/views/decorators/cache.py", line 44, in _wrapped_view_func
response = view_func(request, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/rest_framework/viewsets.py", line 125, in view
return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/rest_framework/views.py", line 509, in dispatch
response = self.handle_exception(exc)
File "/usr/local/lib/python3.7/dist-packages/rest_framework/views.py", line 469, in handle_exception
self.raise_uncaught_exception(exc)
File "/usr/local/lib/python3.7/dist-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
raise exc
File "/usr/local/lib/python3.7/dist-packages/rest_framework/views.py", line 506, in dispatch
response = handler(request, *args, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/usr/local/lib/python3.7/dist-packages/retry/api.py", line 74, in retry_decorator
logger)
File "/usr/local/lib/python3.7/dist-packages/retry/api.py", line 33, in __retry_internal
return f()
File "/app/gracedb_project/gracedb/api/v1/superevents/views.py", line 321, in list
file_list = get_file_list(viewable_logs, parent_superevent.datadir)
File "/app/gracedb_project/gracedb/core/file_utils.py", line 32, in get_file_list
pointed_to = os.path.basename(os.path.realpath(full_path))
File "/usr/lib/python3.7/posixpath.py", line 395, in realpath
path, ok = _joinrealpath(filename[:0], filename, {})
File "/usr/lib/python3.7/posixpath.py", line 443, in _joinrealpath
path, ok = _joinrealpath(path, os.readlink(newpath), seen)
Exception Type: OSError at /api/superevents/MS220919n/files/
Exception Value: [Errno 5] Input/output error: '/app/db_data/9a/6/8ac9f1720d59940bed2d8e384d57c98049c82/bayestar.multiorder.coherence.png'
It appears to be triggering the retrying hook that I put in, but it doesn't seem to work because it is retying four times to get the file, sleeping one second between each attempt:
gracedb-swarm-test-us-west-2a-docker-mgr-01.log:Sep 19 13:38:13 gracedb-swarm-test-us-west-2a-docker-mgr-01 gracedb_docker_gracedb_gracedb.2.o400wqmzk6yutaoaz1cd8mjyt: DJANGO | 2022-09-19 13:38:13.591 | e459e5951d2a | 10.0.2.51 | api.v1.superevents.views | WARNING | api.py, line 40 | [Errno 5] Input/output error: '/app/db_data/9a/6/8ac9f1720d59940bed2d8e384d57c98049c82/bayestar.multiorder.coherence.png', retrying in 1.0 seconds...
gracedb-swarm-test-us-west-2a-docker-mgr-01.log:Sep 19 13:38:14 gracedb-swarm-test-us-west-2a-docker-mgr-01 gracedb_docker_gracedb_gracedb.2.o400wqmzk6yutaoaz1cd8mjyt: DJANGO | 2022-09-19 13:38:14.608 | e459e5951d2a | 10.0.2.51 | api.v1.superevents.views | WARNING | api.py, line 40 | [Errno 5] Input/output error: '/app/db_data/9a/6/8ac9f1720d59940bed2d8e384d57c98049c82/bayestar.multiorder.coherence.png', retrying in 1.0 seconds...
gracedb-swarm-test-us-west-2a-docker-mgr-01.log:Sep 19 13:38:15 gracedb-swarm-test-us-west-2a-docker-mgr-01 gracedb_docker_gracedb_gracedb.2.o400wqmzk6yutaoaz1cd8mjyt: DJANGO | 2022-09-19 13:38:15.622 | e459e5951d2a | 10.0.2.51 | api.v1.superevents.views | WARNING | api.py, line 40 | [Errno 5] Input/output error: '/app/db_data/9a/6/8ac9f1720d59940bed2d8e384d57c98049c82/bayestar.multiorder.coherence.png', retrying in 1.0 seconds...
gracedb-swarm-test-us-west-2a-docker-mgr-01.log:Sep 19 13:38:16 gracedb-swarm-test-us-west-2a-docker-mgr-01 gracedb_docker_gracedb_gracedb.2.o400wqmzk6yutaoaz1cd8mjyt: DJANGO | 2022-09-19 13:38:16.636 | e459e5951d2a | 10.0.2.51 | api.v1.superevents.views | WARNING | api.py, line 40 | [Errno 5] Input/output error: '/app/db_data/9a/6/8ac9f1720d59940bed2d8e384d57c98049c82/bayestar.multiorder.coherence.png', retrying in 1.0 seconds...
Traefik
is showing that the request is returning a 500 error and is taking almost five seconds because of the retries:
Sep 19 13:38:18 gracedb-swarm-test-us-west-2a-docker-mgr-01 gracedb_docker_webgateway_webgateway.1.l4j2u8hibrrtgelvsfhiubxfh: 131.215.113.198 - - [19/Sep/2022:13:38:13 +0000] "GET /api/superevents/MS220919n/files/ HTTP/1.1" 500 10472 "-" "-" 174967 "gracedb@docker" "http://10.0.2.51:80" 4815ms
For reference the nfs mounts are mounted with: nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev