Skip to content

Occasional 500 error when reading files

There is an occasional 500 error returned by the cloud instances when attempting to read files. It occurs infrequently and randomly enough that I'm not able to reproduce it, but it does it gwcelery's workflow on occasion (~2 times per week). And example error traceback looks like:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/exception.py", line 47, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.7/dist-packages/django/core/handlers/base.py", line 181, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.7/dist-packages/django/views/decorators/cache.py", line 44, in _wrapped_view_func
    response = view_func(request, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/rest_framework/viewsets.py", line 125, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/rest_framework/views.py", line 509, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.7/dist-packages/rest_framework/views.py", line 469, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.7/dist-packages/rest_framework/views.py", line 480, in raise_uncaught_exception
    raise exc
  File "/usr/local/lib/python3.7/dist-packages/rest_framework/views.py", line 506, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/usr/local/lib/python3.7/dist-packages/retry/api.py", line 74, in retry_decorator
    logger)
  File "/usr/local/lib/python3.7/dist-packages/retry/api.py", line 33, in __retry_internal
    return f()
  File "/app/gracedb_project/gracedb/api/v1/superevents/views.py", line 321, in list
    file_list = get_file_list(viewable_logs, parent_superevent.datadir)
  File "/app/gracedb_project/gracedb/core/file_utils.py", line 32, in get_file_list
    pointed_to = os.path.basename(os.path.realpath(full_path))
  File "/usr/lib/python3.7/posixpath.py", line 395, in realpath
    path, ok = _joinrealpath(filename[:0], filename, {})
  File "/usr/lib/python3.7/posixpath.py", line 443, in _joinrealpath
    path, ok = _joinrealpath(path, os.readlink(newpath), seen)

Exception Type: OSError at /api/superevents/MS220919n/files/
Exception Value: [Errno 5] Input/output error: '/app/db_data/9a/6/8ac9f1720d59940bed2d8e384d57c98049c82/bayestar.multiorder.coherence.png'

It appears to be triggering the retrying hook that I put in, but it doesn't seem to work because it is retying four times to get the file, sleeping one second between each attempt:

gracedb-swarm-test-us-west-2a-docker-mgr-01.log:Sep 19 13:38:13 gracedb-swarm-test-us-west-2a-docker-mgr-01 gracedb_docker_gracedb_gracedb.2.o400wqmzk6yutaoaz1cd8mjyt: DJANGO | 2022-09-19 13:38:13.591 | e459e5951d2a | 10.0.2.51 | api.v1.superevents.views | WARNING | api.py, line 40 | [Errno 5] Input/output error: '/app/db_data/9a/6/8ac9f1720d59940bed2d8e384d57c98049c82/bayestar.multiorder.coherence.png', retrying in 1.0 seconds...
gracedb-swarm-test-us-west-2a-docker-mgr-01.log:Sep 19 13:38:14 gracedb-swarm-test-us-west-2a-docker-mgr-01 gracedb_docker_gracedb_gracedb.2.o400wqmzk6yutaoaz1cd8mjyt: DJANGO | 2022-09-19 13:38:14.608 | e459e5951d2a | 10.0.2.51 | api.v1.superevents.views | WARNING | api.py, line 40 | [Errno 5] Input/output error: '/app/db_data/9a/6/8ac9f1720d59940bed2d8e384d57c98049c82/bayestar.multiorder.coherence.png', retrying in 1.0 seconds...
gracedb-swarm-test-us-west-2a-docker-mgr-01.log:Sep 19 13:38:15 gracedb-swarm-test-us-west-2a-docker-mgr-01 gracedb_docker_gracedb_gracedb.2.o400wqmzk6yutaoaz1cd8mjyt: DJANGO | 2022-09-19 13:38:15.622 | e459e5951d2a | 10.0.2.51 | api.v1.superevents.views | WARNING | api.py, line 40 | [Errno 5] Input/output error: '/app/db_data/9a/6/8ac9f1720d59940bed2d8e384d57c98049c82/bayestar.multiorder.coherence.png', retrying in 1.0 seconds...
gracedb-swarm-test-us-west-2a-docker-mgr-01.log:Sep 19 13:38:16 gracedb-swarm-test-us-west-2a-docker-mgr-01 gracedb_docker_gracedb_gracedb.2.o400wqmzk6yutaoaz1cd8mjyt: DJANGO | 2022-09-19 13:38:16.636 | e459e5951d2a | 10.0.2.51 | api.v1.superevents.views | WARNING | api.py, line 40 | [Errno 5] Input/output error: '/app/db_data/9a/6/8ac9f1720d59940bed2d8e384d57c98049c82/bayestar.multiorder.coherence.png', retrying in 1.0 seconds...

Traefik is showing that the request is returning a 500 error and is taking almost five seconds because of the retries:

Sep 19 13:38:18 gracedb-swarm-test-us-west-2a-docker-mgr-01 gracedb_docker_webgateway_webgateway.1.l4j2u8hibrrtgelvsfhiubxfh: 131.215.113.198 - - [19/Sep/2022:13:38:13 +0000] "GET /api/superevents/MS220919n/files/ HTTP/1.1" 500 10472 "-" "-" 174967 "gracedb@docker" "http://10.0.2.51:80" 4815ms

For reference the nfs mounts are mounted with: nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport,_netdev

To upload designs, you'll need to enable LFS and have an admin enable hashed storage. More information