Commit 9270065e authored by Sean Leavey

Merge branch 'release/0.8.0'

parents 9c5080c9 43656a4b
......@@ -109,6 +109,10 @@ pages:
stage: deploy
needs:
- docs/html
# Ideally we'd have a "stage: test" here too, but this isn't possible yet; see
# https://gitlab.com/gitlab-org/gitlab/-/issues/220758
- test/py38
- test/py310
only:
- tags@sean-leavey/dcc
- /^dcc-\d(.\d){2}$/
......@@ -124,6 +128,10 @@ pypi:
image: python:3.8
needs:
- build/any
# Ideally we'd have a "stage: test" here too, but this isn't possible yet; see
# https://gitlab.com/gitlab-org/gitlab/-/issues/220758
- test/py38
- test/py310
only:
- tags@sean-leavey/dcc
- /^dcc-\d(.\d){2}$/
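The tag filter shared by both jobs can be checked outside CI. The sketch below uses Python's `re` module as a stand-in for GitLab's own regular expression engine (an assumption; the two should agree on a pattern this simple), and the helper name `is_release_tag` is hypothetical. It also shows that the unescaped `.` accepts any separator character, not only a literal dot.

```python
import re

# Pattern copied from the `only:` rules above, without the surrounding slashes.
TAG_PATTERN = re.compile(r"^dcc-\d(.\d){2}$")


def is_release_tag(tag):
    """Return True if `tag` passes the pattern above (hypothetical helper)."""
    return TAG_PATTERN.match(tag) is not None


for tag in ("dcc-0.8.0", "dcc-0.7.6", "dcc-0x8y0", "dcc-0.10.0", "v0.8.0"):
    print(f"{tag}: {is_release_tag(tag)}")
```

Escaping the dot (`\.`) would restrict matches to dotted versions. Multi-digit components such as `0.10.0` are rejected either way, because each component is a single `\d`.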
......
......@@ -2,6 +2,29 @@
# Change log
All notable changes to `dcc` will be documented in this file.
## 0.8.0
- CLI:
- Changed `dcc update` `--dry-run` flag to `--confirm/--no-confirm`; now it shows
the changes and prompts for confirmation
- Added interactive mode to `dcc archive` command, which prompts before downloading
files
- Changed `dcc archive` arguments to accept DCC numbers directly; numbers can still be
loaded from a file using `--from-file`
- Package:
- Made remote file stream handling (for progress bars, too large files, etc.) more
generic, allowing for greater flexibility over downloads
- Added `DCCFile.exists` method
- Titles and filename strings are now sanitised upon `DCCFile` instantiation
- Added `DCCArchive.latest_revisions` method
- Moved default session function to `dcc.sessions`
- Exposed some imports on the package scope
- Developer tools:
- Added archive tests
- Renamed test data files
- Documentation:
- Added link to PyPI project on installation page
- Updated release procedure
## 0.7.6 (hotfix)
- Developer tools:
- Fixed PyPI deployment on CI
......
......@@ -77,10 +77,6 @@ Options require a value of some sort, whereas flags don't.
Show or hide a download progress bar. For small files the progress bar may not be
shown. By default this is enabled.
.. option:: -n, --dry-run
Perform a trial run of a potentially destructive operation, making no real changes.
.. option:: -v, --verbose
Increase the program's verbosity. This can be specified multiple times to further
......@@ -116,20 +112,24 @@ Options require a value of some sort, whereas flags don't.
.. program:: dcc archive
Archive remote DCC records locally using DCC numbers listed in file.
Archive remote DCC records locally.
Each DCC number in :option:`SRC <dcc archive SRC>` should be a DCC record designation
Each specified :option:`NUMBER <dcc archive NUMBER>` should be a DCC record designation
with optional version such as 'D040105' or 'D040105-v1'.
If a DCC number contains a version and is present in the local archive, it is used
If a DCC number contains a version and is present in the local archive, it is skipped
unless :option:`--force <dcc archive --force>` is specified. If the DCC number does not
contain a version, a version exists in the local archive, and :option:`--ignore-version
<dcc archive --ignore-version>` is specified, the latest local version is used. In all
<dcc archive --ignore-version>` is specified, its archival is skipped as well. In all
other cases, the latest record is fetched from the remote host.
.. option:: SRC
.. option:: NUMBER
The number for the DCC record to archive.
A DCC number to archive (can be specified multiple times).
.. option:: --from-file
Archive records specified in file.
.. option:: --depth
......@@ -147,6 +147,12 @@ other cases, the latest record is fetched from the remote host.
In addition to fetching the record, fetch its attached files too.
.. option:: -i, --interactive
Enable interactive mode, which prompts for confirmation before downloading files.
This flag implies :option:`--files <dcc archive --files>`, and
:option:`--max-file-size <dcc archive --max-file-size>` is ignored.
.. option:: -s, --archive-dir
Directory to use to archive and retrieve downloaded documents and files. If not
......@@ -466,9 +472,10 @@ metadata for that field.
An author in the form "Albert Einstein" (can be specified multiple times).
.. option:: -n, --dry-run
.. option:: --confirm, --no-confirm
Perform a trial run of the remote update, making no real changes.
Prompt (``--confirm``) or don't prompt (``--no-confirm``) for confirmation before
actually submitting the update to the remote DCC host.
.. option:: -s, --archive-dir
......
......@@ -15,8 +15,8 @@ then install the package as editable, alongside the developer dependencies:
.. code-block:: text
$ cd /path/to/cloned/dcc/repository
$ pip install -e .[dev]
$ cd /path/to/cloned/dcc/repository
$ pip install -e .[dev]
The project uses `pre-commit <https://pre-commit.com/>`__ to perform checking and code
formatting as part of the git commit process. To initialise this, run:
......@@ -72,19 +72,22 @@ Creating a tagged release
#. Check out the ``master`` branch again, and merge the release branch with ``git merge
--no-ff release/X.Y.Z``.
#. Delete the now fully-merged release branch with ``git branch -d release/X.Y.Z``.
#. Push the branches and tags to the remote with ``git push develop``, ``git push
master`` and ``git push --tags``.
#. Push the branches and tags to the remote with ``git push origin master develop
dcc-X.Y.Z``.
Uploading to PyPI
~~~~~~~~~~~~~~~~~
Deployment to PyPI is automatic for tagged branches pushed to the main repository at
``sean-leavey/dcc``. The steps for manual deployment are listed below in case needed.
The following instructions are based on
https://packaging.python.org/en/latest/tutorials/packaging-projects/.
.. note::
Uploading to `PyPI <https://pypi.org/>`__ requires an account that is a maintainer of
the `dcc project <https://pypi.org/project/dcc>`__ there.
Uploading to `PyPI <https://pypi.org/>`__ requires an account that is a maintainer
of the `dcc project <https://pypi.org/project/dcc>`__ there.
#. Check out the tag for the package you wish to publish with ``git checkout
dcc-X.Y.Z`` (``setuptools_scm`` used for versioning requires a tagged branch for a
......
......@@ -17,7 +17,51 @@ output into :program:`dcc archive`, passing a directory to store the results.
# Scrape the page for the QNWG session from the September 2021 LVK Meeting, then
# archive the records and attachments corresponding to the extracted DCC numbers.
$ dcc convert "https://dcc.ligo.org/cgi-bin/private/DocDB/DisplayMeeting?sessionid=5120" - | dcc archive -s /path/to/local/archive --files --force -
$ dcc convert "https://dcc.ligo.org/cgi-bin/private/DocDB/DisplayMeeting?sessionid=5120" - | dcc archive -s /path/to/archive --from-file - --files --force
The archive directory at ``/path/to/local/archive`` will then contain the DCC records
The archive directory at ``/path/to/archive`` will then contain the DCC records
and files associated with the session.
Check existing archive for missing downloads
--------------------------------------------
The :option:`--max-file-size <dcc --max-file-size>` option allows ignoring large files
when archiving. You may wish to archive files of a certain type without size limits.
This script searches the archive for (latest) records with missing files with certain
extensions and reports them:
.. code-block:: python
from pathlib import Path
from dcc import DCCArchive
# Attachment extensions to check exist.
KEEP = [".stl", ".step", ".dwg"]
# The local archive.
archive = DCCArchive("/path/to/archive")
for record in archive.latest_revisions:
report = False
for file_ in record.files:
if file_.exists():
continue
path = Path(file_.filename)
if path.suffix.casefold() in KEEP:
report = True
if report:
print(record.dcc_number)
The output from this program is in a form that can be easily passed to :program:`dcc
archive`:
.. code-block:: text
# Assume script above is stored in "find_missing.py".
$ python find_missing.py > missing.txt
$ dcc archive -s /path/to/archive --from-file missing.txt --files
The interactive mode flag :option:`-i <dcc archive -i>` (or :option:`--interactive <dcc
archive --interactive>`) can be useful here: it prompts before downloading each file,
allowing you to skip the ones you don't want.
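The extension test used by the script above can be tried in isolation, without an archive. This is a minimal sketch using only the standard library; the helper name `wanted` and the sample filenames are made up for illustration.

```python
from pathlib import Path

# Same extension list as in the script above.
KEEP = [".stl", ".step", ".dwg"]


def wanted(filename):
    """Return True if `filename` ends in one of the KEEP extensions, ignoring case."""
    return Path(filename).suffix.casefold() in KEEP


for name in ("bracket.STEP", "drawing.dwg", "notes.pdf", "model.tar.stl", "README"):
    print(name, wanted(name))
```

Only the final suffix is compared, so a name like ``model.tar.stl`` counts as ``.stl``, and a name without any extension never matches.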
......@@ -45,7 +45,7 @@ Archive a record and its files locally:
.. code-block:: text
$ echo "T010075" | dcc archive -s /path/to/archive --files -
$ dcc archive -s /path/to/archive T010075 --files
$ tree /path/to/archive
/path/to/archive
└── T010075
......
......@@ -15,8 +15,8 @@ choose to manage system dependencies yourself, ensure you have the relevant Kerb
packages above (provided on Linux by e.g. ``krb5-user`` on Debian derivatives or
``krb5-workstation`` on Red Hat derivatives).
``dcc`` can be installed using ``pip`` or your favourite Python package manager using
e.g.:
``dcc`` can be installed from `PyPI <https://pypi.org/project/dcc/>`__ using ``pip`` or
your favourite Python package manager using e.g.:
.. code-block:: text
......
......@@ -32,7 +32,7 @@ file system. For example:
.. code-block:: text
# Use a directory called "dcc" in your home directory.
$ echo "T010075" | dcc archive -s ~/dcc --files -
$ dcc archive -s ~/dcc --files T010075
If the archive directory is not given, ``dcc`` uses a temporary directory each time it
is invoked, and the data is lost upon program exit.
......
......@@ -79,30 +79,46 @@ Record archival
DCC records can be archived locally using :program:`dcc archive`. This downloads
records' metadata, and optionally attached files, and stores them in the :ref:`local
archive <local_archive>` for later retrieval. The command requires an input file
containing the DCC numbers to archive, separated by whitespace. For example:
archive <local_archive>` for later retrieval. The command requires one or more
:option:`NUMBER <dcc archive NUMBER>` arguments and/or a :option:`--from-file <dcc
archive --from-file>` option followed by a path to a file containing the DCC numbers
(separated by whitespace) to archive. For example:
.. code-block:: text
# Archive the latest version of T010075:
$ echo "T010075" > to-archive.txt
$ dcc archive -s /path/to/archive to-archive.txt
$ dcc archive -s /path/to/archive T010075
# Archive a specific version of T010075:
$ echo "T010075-v1" > to-archive.txt
$ dcc archive -s /path/to/archive to-archive.txt
$ dcc archive -s /path/to/archive T010075-v1
The input can also be set to ``stdin`` by specifying ``-``:
# Archive multiple records:
$ dcc archive -s /path/to/archive T010075 E1300945
# Alternatively specify the path to a file containing the records to archive:
$ echo "T010075 E1300945" > to-archive.txt
$ dcc archive -s /path/to/archive --from-file to-archive.txt
Similar to the behaviour of standard Unix utilities, the :option:`--from-file <dcc
archive --from-file>` option can also be set to ``stdin`` by specifying ``-``:
.. code-block:: text
$ echo "T010075" | dcc archive -s /path/to/archive -
$ echo "T010075 E1300945" | dcc archive -s /path/to/archive --from-file -
Files are not automatically archived. To fetch them too, specify the :option:`--files
<dcc --files>` flag. By default, files of any size will be retrieved. To limit the
maximum size of files retrieved, specify the :option:`--max-file-size <dcc
--max-file-size>` option, specifying a maximum file size in MB.
Interactive mode
~~~~~~~~~~~~~~~~
Specifying :option:`-i <dcc archive -i>` or :option:`--interactive <dcc archive
--interactive>` will prompt you for confirmation before downloading each record's files,
giving you the opportunity to skip unnecessary files. This flag implies :option:`--files
<dcc archive --files>`.
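How interactive mode and the size limit interact can be modelled with a small pure function. This is a simplified sketch of the behaviour described above, not the package's code: the function name, the return strings and the `confirm` callable (standing in for the prompt) are hypothetical, and both sizes are assumed to be in the same unit.

```python
def download_decision(content_length, max_file_size, interactive, confirm):
    """Return "download", "skipped" or "too large" for a single file.

    `confirm` is a zero-argument callable returning True or False; it stands in
    for the interactive prompt. `content_length` is None if the size is unknown.
    """
    if interactive:
        # Interactive mode: the user decides and the size limit is ignored.
        return "download" if confirm() else "skipped"

    if (
        content_length is not None
        and max_file_size is not None
        and content_length > max_file_size
    ):
        return "too large"

    # No limit set, size unknown, or file within the limit.
    return "download"


print(download_decision(10, 5, interactive=False, confirm=lambda: True))
print(download_decision(10, 5, interactive=True, confirm=lambda: True))
```

When the size is unknown the limit cannot be enforced, so the sketch lets the download proceed.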
Scraping a URL for links to DCC records
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -121,7 +137,7 @@ scrape a URL for DCC numbers and archive them locally. For example:
# Fetch the "System Engineering" topic page, then extract and archive its DCC
# numbers.
$ dcc convert https://dcc.ligo.org/cgi-bin/private/DocDB/ListBy?topicid=18 - | dcc archive -s /path/to/archive -
$ dcc convert https://dcc.ligo.org/cgi-bin/private/DocDB/ListBy?topicid=18 - | dcc archive -s /path/to/archive --from-file -
Archival of referenced and referencing records
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
......@@ -148,16 +164,16 @@ to" and "referenced by" records can be switched on and off using
--depth>` is likely to lead to thousands of records being downloaded. Typically only
a value of 1 or 2 is sufficient to archive almost every relevant related record.
For example, the referenced documents of ``T010075`` can be archived alongside
``T010075`` itself using:
For example, the referenced documents of ``E1300945`` can be archived alongside
``E1300945`` itself using:
.. code-block:: text
# Fetch "related to" documents as well as T010075 itself:
$ echo "T010075" | dcc archive -s /path/to/archive --depth 1 -
# Fetch "related to" documents as well as E1300945 itself:
$ dcc archive -s /path/to/archive E1300945 --depth 1
# Fetch "referenced by" documents as well:
$ echo "T010075" | dcc archive -s /path/to/archive --depth 1 --fetch-referencing -
$ dcc archive -s /path/to/archive E1300945 --depth 1 --fetch-referencing
.. _updating_record_metadata:
......@@ -174,18 +190,23 @@ Record metadata can be updated via ``dcc`` using :program:`dcc update`. This acc
The :option:`--keyword <dcc update --keyword>`, :option:`--related <dcc update
--related>`, and :option:`--author <dcc update --author>` options can be specified
multiple times to set multiple values. Author names should be as written, e.g. "Albert
Einstein", and should correspond to real DCC users.
Einstein", and should correspond to real DCC users. For example:
.. code-block:: text
# Update the title of T2200016.
$ dcc update T2200016 --title "A new title"
By default, :program:`dcc update` will prompt for confirmation before sending the
updated record to the DCC. To make changes without any confirmation, specify the flag
:option:`--no-confirm <dcc update --no-confirm>`. Submitted changes are irreversible, so
be careful.
.. note::
The DCC does not appear to perform error checking on author names. If an author is
not given correctly, it is simply discarded.
A dry run can be performed, meaning nothing actually gets updated on the remote DCC
host, by specifying the :option:`-n <dcc -n>` or :option:`--dry-run <dcc --dry-run>`
flag. Used in combination with :option:`-v <dcc -v>`, this can give you an idea of the
changes that will be made to the record without actually making them.
.. _changing_host:
Changing the DCC or login host
......
......@@ -10,4 +10,23 @@ try:
except ImportError:
raise FileNotFoundError("Could not find version.py. Ensure you have run setup.")
__all__ = ("PROGRAM", "AUTHORS", "PROJECT_URL", "__version__")
# Import some modules into the package namespace.
from .records import DCCArchive, DCCNumber, DCCRecord
from .sessions import (
default_session,
DCCAuthenticatedSession,
DCCUnauthenticatedSession,
)
__all__ = (
"PROGRAM",
"AUTHORS",
"PROJECT_URL",
"__version__",
"DCCArchive",
"DCCNumber",
"DCCRecord",
"default_session",
"DCCAuthenticatedSession",
"DCCUnauthenticatedSession",
)
......@@ -15,16 +15,16 @@ import click
from . import __version__, PROGRAM, AUTHORS, PROJECT_URL
from .records import DCCArchive, DCCNumber, DCCAuthor
from .sessions import DCCAuthenticatedSession, DCCUnauthenticatedSession
from .sessions import DCCSession, DCCAuthenticatedSession, DCCUnauthenticatedSession
from .parsers import DCCParser
from .env import DEFAULT_HOST, DEFAULT_IDP
from .util import change_exc_msg
from .util import change_exc_msg, human_file_size
from .exceptions import (
NotLoggedInError,
UnrecognisedDCCRecordError,
UnauthorisedError,
FileTooLargeError,
DryRun,
FileSkippedException,
TooLargeFileSkippedException,
)
......@@ -167,18 +167,6 @@ download_progress_option = click.option(
help="Show progress bar.",
)
# Updating.
dry_run_option = click.option(
"-n",
"--dry-run",
is_flag=True,
default=False,
show_default=True,
callback=partial(_set_state_flag, flag="dry_run"),
expose_value=False,
help="Perform a trial run with no changes made.",
)
# Verbosity.
verbose_option = click.option(
"-v",
......@@ -308,8 +296,8 @@ def _archive_record(
number,
ignore_version=ignore_version,
overwrite=force,
fetch_files=files,
ignore_too_large=True,
# Don't fetch files yet.
fetch_files=False,
session=session,
)
except UnrecognisedDCCRecordError as err:
......@@ -338,7 +326,19 @@ def _archive_record(
result.archived += 1
if files:
result.files_archived += len(record.files)
for index in range(len(record.files)):
try:
archive.fetch_record_file(
record,
index + 1,
ignore_too_large=True, # Don't throw exception.
overwrite=force,
session=session,
)
except FileSkippedException as err:
state.echo_exception(err)
else:
result.files_archived += 1
if level > 0:
if fetch_related:
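The per-file loop in this hunk relies on `try`/`except`/`else` so that a skipped file is reported without being counted. The pattern can be sketched on its own as follows; `FileSkipped`, `fetch` and `archive_files` are hypothetical stand-ins, not names from the package.

```python
class FileSkipped(Exception):
    """Stand-in for the package's FileSkippedException (hypothetical)."""


def fetch(index, skip):
    """Pretend to download file number `index`; raise if it is in `skip`."""
    if index in skip:
        raise FileSkipped(f"file {index} skipped")


def archive_files(count, skip=()):
    """Fetch files 1..count and return how many were actually archived."""
    archived = 0
    for index in range(count):
        try:
            fetch(index + 1, skip)  # File numbers are 1-based.
        except FileSkipped as err:
            print(err)  # Report the skip, then carry on with the next file.
        else:
            archived += 1  # Only reached when no exception was raised.
    return archived


print(archive_files(4, skip={2, 3}))
```

Counting in the `else` branch keeps the total accurate even when some files are declined or exceed a size limit.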
......@@ -357,6 +357,10 @@ def _archive_record(
try:
_do_fetch(dcc_number, level=depth)
except click.exceptions.Abort as err:
# Aborts during e.g. click.prompt() are not proper KeyboardInterrupts so we have
# to make them one.
raise KeyboardInterrupt() from err
except Exception as err:
change_exc_msg(err, f"Archival error: {err}")
state.echo_exception(err)
......@@ -371,7 +375,7 @@ class _State:
self.dcc_host = DEFAULT_HOST
self.idp_host = DEFAULT_IDP
self.archive_dir = None
self.dry_run = None
self.interactive = None
self.max_file_size = None
self.show_progress = None
self.public = None
......@@ -380,17 +384,7 @@ class _State:
self._verbosity = logging.WARNING
def dcc_session(self):
progress = None
if self.show_progress:
# Only show progress when not being quiet.
if self.verbose:
progress = self._download_progress_hook
kwargs = dict(
max_file_size=self.max_file_size,
simulate=self.dry_run,
download_progress_hook=progress,
)
kwargs = dict(stream_hook=self._stream_hook)
if self.public:
self.echo_info("Creating unauthenticated DCC session.")
......@@ -431,28 +425,62 @@ class _State:
self.echo_debug(f"Using {archive_dir} as archive.")
return DCCArchive(archive_dir)
def _download_progress_hook(self, thing, chunks, total_length):
def _stream_hook(self, response_type, item, response):
if response_type is not DCCSession.STREAM_FILE:
raise RuntimeError(f"Unrecognised response type {repr(response_type)}.")
# We're downloading a file.
content_length = response.headers.get("content-length")
if content_length:
content_length = int(content_length)
self.echo_debug(f"Content length: {content_length}")
if self.interactive:
if content_length:
# Show file size.
value, unit = human_file_size(content_length)
item_size = f" ({value:.2f} {unit})"
else:
item_size = ""
if item.exists():
prompt = f"{repr(str(item))} already archived. Re-download{item_size}?"
else:
prompt = f"Download {repr(str(item))}{item_size}?"
if not click.confirm(prompt):
raise FileSkippedException(item)
if content_length:
if not self.interactive:
if (
self.max_file_size is not None
and content_length > self.max_file_size
):
raise TooLargeFileSkippedException(
item, content_length, self.max_file_size
)
# Only show progress when not being quiet.
if self.show_progress and self.verbose:
response = self._download_progress_hook(item, response, content_length)
else:
self.echo_debug(
"Can't show progress or check file size: no Content-Length header."
)
yield from response
def _download_progress_hook(self, item, chunks, total_length):
# Iterate over the chunks, yielding each chunk and updating the progress bar.
with click.progressbar(length=total_length) as progressbar:
display_length = ""
if total_length:
# Convert to user friendly length.
if total_length >= 1024 * 1024 * 1024:
value = total_length / (1024 * 1024 * 1024)
unit = "GB"
elif total_length >= 1024 * 1024:
value = total_length / (1024 * 1024)
unit = "MB"
elif total_length >= 1024:
value = total_length / 1024
unit = "kB"
else:
value = total_length
unit = "B"
value, unit = human_file_size(total_length)
display_length = f" ({value:.2f} {unit})"
self.echo(f"Downloading {thing}{display_length}")
self.echo(f"Downloading {item}{display_length}")
for chunk in chunks:
yield chunk
progressbar.update(len(chunk))
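The `human_file_size` helper imported from `.util` is not shown in this diff. A minimal sketch, assuming it mirrors the inline conversion removed above and returns a `(value, unit)` pair as its call sites suggest, might look like this:

```python
def human_file_size(nbytes):
    """Return (value, unit) for a size in bytes, using multiples of 1024.

    Assumed behaviour, reconstructed from the inline code this commit removes.
    """
    for threshold, unit in ((1024**3, "GB"), (1024**2, "MB"), (1024, "kB")):
        if nbytes >= threshold:
            return nbytes / threshold, unit
    return nbytes, "B"


value, unit = human_file_size(5 * 1024 * 1024)
print(f"{value:.2f} {unit}")
```

Returning the number and unit separately leaves formatting to the caller, which is how both the prompt and the progress message use it.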
......@@ -742,7 +770,7 @@ def open_file(ctx, dcc_number, file_number, ignore_version, locate, force):
except (NotLoggedInError, UnauthorisedError) as err:
change_exc_msg(err, f"You are not authorised to access {dcc_number}.")
state.echo_exception(err, exit_=True)
except FileTooLargeError as err:
except FileSkippedException as err:
state.echo_exception(err, exit_=True)
if state.archive_is_temporary:
......@@ -766,11 +794,25 @@ def open_file(ctx, dcc_number, file_number, ignore_version, locate, force):
@dcc.command()
@click.argument("src", type=click.File("r"))
@click.argument("number", type=DCC_NUMBER_TYPE, nargs=-1)
@click.option(
"--from-file", type=click.File("r"), help="Archive records specified in file."
)
@depth_option
@fetch_related_option
@fetch_referencing_option
@files_option
@click.option(
"-i",
"--interactive",
is_flag=True,
default=False,
callback=partial(_set_state_flag, flag="interactive"),
expose_value=False,
help=(
"Enable interactive mode, which prompts for confirmation before downloading "
"files. This flag implies --files, and --max-file-size is ignored."
),
)
@archive_dir_option
@ignore_version_option
@max_file_size_option
......@@ -786,7 +828,8 @@ def open_file(ctx, dcc_number, file_number, ignore_version, locate, force):
@click.pass_context
def archive(
ctx,
src,
number,
from_file,
depth,
fetch_related,
fetch_referencing,
......@@ -795,49 +838,54 @@ def archive(
skip_category,
force,
):
"""Archive remote DCC records locally using DCC numbers listed in file.
"""Archive remote DCC records locally.
Each DCC number in SRC should be a DCC record designation with optional version
such as 'D040105' or 'D040105-v1'.
Each specified NUMBER should be a DCC record designation with optional version such
as 'D040105' or 'D040105-v1'.
If a DCC number contains a version and is present in the local archive, it is used
unless --force is specified. If the DCC number does not contain a version, a version
exists in the local archive, and --ignore-version is specified, the latest local
version is used. In all other cases, the latest record is fetched from the remote
If a DCC number contains a version and is present in the local archive, it is
skipped unless --force is specified. If the DCC number does not contain a version, a
version exists in the local archive, and --ignore-version is specified, its archival
is skipped as well. In all other cases, the latest record is fetched from the remote
host.
It is recommended to specify -s/--archive-dir or set the DCC_ARCHIVE environment