Demonstration scripts and notebooks showing how to use the Data Hub API for data submissions, reporting, and other tasks.
Note: All of these scripts require a Data Hub API key. Instructions for obtaining an API token can be found in the data submission documentation. For security reasons, these tokens should be stored as environment variables on your system. The scripts expect the production and stage API keys to be stored in the environment variables PRODAPI and STAGEAPI, respectively.
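A minimal sketch of how a script might pick up the right token (the environment variable names come from the note above; the helper function name is ours):

```python
import os

def get_api_token(tier: str) -> str:
    """Return the Data Hub API token for the requested tier.

    Tokens are read from the PRODAPI / STAGEAPI environment variables
    rather than being hard-coded into the script.
    """
    env_var = {"prod": "PRODAPI", "stage": "STAGEAPI"}[tier]
    token = os.environ.get(env_var)
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your API token")
    return token
```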
This Jupyter notebook walks through a basic example of how to do a CRDC submission using the Data Hub APIs. Topics covered in this notebook include:
- Finding the studies you are approved to submit to
- Creating a new submission or working on an existing submission
- Uploading the data submission templates
- Running the data and metadata validations
- Reviewing the results from validations
- Final submission, cancellation, or withdrawal of a submission
This notebook covers several queries that can provide more detailed information on the status of your submissions such as:
- Listing all the submissions you have
- Getting high-level summary information about a specific submission
- Getting detailed information about specific submissions
- Getting a detailed inventory of the data that you've added to a submission
- Deleting specific information from a submission
- Retrieving a populated configuration file for use in uploading data files with the CLI Upload Tool
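For example, a high-level summary over the list-submissions results might be computed like this (the 'status' field name is an assumption for illustration):

```python
from collections import Counter

def summarize_submissions(submissions: list[dict]) -> dict:
    """Count submissions by status (New, In Progress, Submitted, ...).

    `submissions` is the list of records returned by the list-submissions
    query; the 'status' field name is an assumption for illustration.
    """
    return dict(Counter(s.get("status", "Unknown") for s in submissions))
```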
This is a Python Dash application that uses the APIs to create a personal dashboard of your submissions. To use this script, run the script ($ python3 SubmissionReportDashboard.py), then launch a browser and navigate to http://localhost:8050.
Required Python libraries: dash, dash_bootstrap_components, plotly, requests, pandas, datetime, pytz
Similar to the SubmissionReportDashboard, but built with Python Shiny instead of Dash.
Submissions that are inactive for extended periods of time start generating warning emails and, after 180 days, get deleted. The remedy is to log into the Submission Portal and look at the submission; however, this gets burdensome if there are a large number of submissions to check. This script (also available in notebook form) will query for all the submissions that are either New or In Progress and will request information from each of them. This resets the inactivity timer.
Currently, the Submission Portal does not aggregate the warnings that are generated by data updates. This can make it difficult to see whether the updates that are about to be applied are correct. These programs aggregate all the warnings and display them in a paired manner so that it is easier to see what changes are about to be applied.
Both scripts take the following configurations. In the notebook you will find these in a marked cell; edit that cell as needed. For the script, the configurations are provided in a YAML file (see warning_configs.yml for an example).
- subid: A list of the submission IDs to be checked. Submission IDs can be obtained from the Submission Portal
- severity: Should be set to 'All'
- nodelist: A list of the nodes that should be checked for warnings
- outputdirectory: A local directory where output can be written
- tier: The tier to use, should be either 'stage' or 'prod'
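A configuration file along the lines of warning_configs.yml might look like this (all values are placeholders):

```yaml
subid:
  - 11111111-2222-3333-4444-555555555555
severity: All
nodelist:
  - participant
  - sample
outputdirectory: /tmp/warning_reports
tier: stage
```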
Script runtime options
- -c/--configfile: The YAML configuration file
- -v/--verbose: The level of verbosity. Add more v's to be more verbose.
This script resets the inactivity timer for all of your New or In Progress submissions to the current date.
Script runtime options
- -t/--tier: The tier to check. Should be either 'prod' or 'stage'
- -v/--verbose: The level of verbosity. Add more v's to be more verbose.
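The core of the reset logic can be sketched like this (the '_id' and 'status' field names are assumptions for illustration):

```python
def submissions_to_reset(submissions: list[dict]) -> list[str]:
    """Pick out the submission IDs whose inactivity timers need touching.

    Only New and In Progress submissions accumulate inactivity warnings,
    so those are the ones the script requests details for. The '_id' and
    'status' field names are assumptions for illustration.
    """
    active_statuses = {"New", "In Progress"}
    return [s["_id"] for s in submissions if s.get("status") in active_statuses]

# For each returned ID the script then issues a details query; that
# request is enough to reset the submission's inactivity timer.
```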
This is a graphical version (Python Dash) of the SubmissionReset.py script. Select a tier from the drop-down and a table of your current New and In Progress submissions will be generated. Select the submissions you wish to reset the inactivity timer on and then click on the Reset Time on Selected Submissions button below the table. Each submission selected will be reset to the current date.
To use, run the script ($ python3 SubmissionResetGUI.py) and then bring up a browser and navigate to http://localhost:8050.
Required Python libraries: dash, dash_bootstrap_components, requests, pandas, datetime, pytz
When updating a submission that has previously been through Data Hub, it's possible to get a great number of warnings that data is going to be changed. Unfortunately, the current Submission Portal interface doesn't have a way to aggregate and display these warnings, which can make checking them difficult and tedious. This script and notebook aggregate all the warnings in a submission and display alternating old and new lines in a table (notebook) or output a CSV file (script).
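The aggregation step can be sketched like this (the warning record shape, with entry / old / new fields, is an assumption for illustration):

```python
def alternating_rows(warnings: list[dict]) -> list[dict]:
    """Expand each update warning into an 'old' row followed by a 'new'
    row so the before/after values sit on adjacent lines of the table
    or CSV. The warning record shape here is an assumption.
    """
    rows = []
    for w in warnings:
        rows.append({"entry": w["entry"], "version": "old", "value": w["old"]})
        rows.append({"entry": w["entry"], "version": "new", "value": w["new"]})
    return rows
```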
This script addresses a weakness in the Submission Portal: deleting some, but not all, entries in a node can become tedious. The graphical interface nicely supports deleting individual entries as well as entire nodes, but it does not support deleting dozens or hundreds of entries at once.
DeleTron.py takes a Data Hub CSV loading sheet and, instead of adding the information to the submission, deletes all of the listed entries from the submission. This allows a submitter to start with one of their existing loading sheets and edit it down (or copy it to a new load sheet) to just the entries they wish to delete. Like submission, deletion works on a node-by-node basis, and a separate deletion sheet has to be provided for each node to be deleted.
Data Hub will also delete any child nodes that are orphaned by deleting a parent. For example, if a sample is orphaned when a participant is deleted, the sample will also be deleted even though a sample load sheet was never provided. For this reason it's usually useful to understand the existing relationships before deleting, and to consider whether updating the information would be a better approach.
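A minimal sketch of the load-sheet-to-IDs step (the ID column name varies by node, so the caller supplies it; the function name is ours):

```python
import csv

def entries_to_delete(loadsheet_path: str, id_column: str) -> list[str]:
    """Read a Data Hub CSV loading sheet and collect the node IDs it lists.

    Instead of submitting these rows, DeleTron-style code would pass the
    collected IDs to the delete API, one node at a time.
    """
    with open(loadsheet_path, newline="") as fh:
        return [row[id_column] for row in csv.DictReader(fh) if row.get(id_column)]
```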
DeleTron requires a YAML file with the following parameters. It is recommended you start with the delete_configs.yml example:
- tier: The Data Hub tier you wish to use. Likely either stage or prod
- deletefile: The full path to the file that contains the information to be deleted.
- submissionid: The UUID for the submission you are editing. This can be copied from the upper left of the submission view in the GUI.
- node: The node you will be deleting information from. For example: file, diagnosis, or participant
There is additional required information in the mdffiles section that should not be edited. If you create your own yaml configuration file, make sure this section is copied over and not edited.
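Put together, a configuration might look like this (all values are placeholders; the required mdffiles section is not shown and should be copied unchanged from delete_configs.yml):

```yaml
tier: stage
deletefile: /path/to/participants_to_delete.csv
submissionid: 11111111-2222-3333-4444-555555555555
node: participant
# mdffiles: copy this required section unchanged from delete_configs.yml
```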