Interacting with ENA Database

Data

Data in ENA are organized into 11 domains (or types):

Domain      Description
Assembly    Information describing the construction of reads and sequence contigs into higher order scaffolds and chromosomes
Sequence    Assembled and, optionally, annotated assembled reads
Coding      A virtual domain comprising sequence regions reported by data providers as being protein-coding regions
Non-coding  A virtual domain comprising sequence regions reported by data providers as representing non-protein-coding (RNA) genes
Marker      A virtual domain comprising information relating to phylogenetic, identification and molecular ecology marker data
Analysis    Derived data forms, such as recalibrated aligned reads and metabarcoding identifications
Read        Raw sequencing reads from next generation platforms
Trace       Raw sequencing data from capillary platforms
Taxon       Information relating to the organism that was the source of the sequenced biological sample
Sample      Information relating to the biological sample studied in the sequencing experiment
Study       Information relating to the scope of the sequencing effort, also known as ‘Project’; the primary use of a study is to unite content otherwise dispersed across the ENA domains

Some domains are further subdivided into data classes. These subdivisions, called results, are what can be accessed:

Domain         Result              Description
Assembly       assembly            Genome assemblies
Sequence       sequence_release    Nucleotide sequences (Release)
               sequence_update     Nucleotide sequences (Update)
               wgs_set             Genome assembly contig sets (WGS)
               tsa_set             Transcriptome assembly contig sets (TSA)
Coding         coding_release      Protein coding sequences (Release)
               coding_update       Protein coding sequences (Update)
Non-coding     noncoding_release   Non-coding sequences (Release)
               noncoding_update    Non-coding sequences (Update)
Analysis       analysis_study      Studies used for nucleotide sequence analyses from reads
               analysis            Nucleotide sequence analyses from reads
Read           read_experiment     Experiments used for raw reads
               read_run            Raw reads
               read_study          Studies used for raw reads
Sample         sample              Samples
Taxon          taxon               Taxonomic classification
Environmental  environmental       Environmental samples
Study          study               Studies

This list can be accessed with get_results.
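
As a rough illustration, the domain/result table can be held as a plain mapping and searched in reverse. This sketch does not call ENASearch: RESULTS_BY_DOMAIN and domain_of are hypothetical names introduced here, the contents are copied by hand from the table, and the authoritative list is whatever get_results returns.

```python
# Hypothetical in-memory copy of the domain/result table.
# The authoritative list comes from get_results; this is only a sketch.
RESULTS_BY_DOMAIN = {
    "Assembly": ["assembly"],
    "Sequence": ["sequence_release", "sequence_update", "wgs_set", "tsa_set"],
    "Coding": ["coding_release", "coding_update"],
    "Non-coding": ["noncoding_release", "noncoding_update"],
    "Analysis": ["analysis_study", "analysis"],
    "Read": ["read_experiment", "read_run", "read_study"],
    "Sample": ["sample"],
    "Taxon": ["taxon"],
    "Environmental": ["environmental"],
    "Study": ["study"],
}


def domain_of(result):
    """Return the domain that a result id belongs to, or None if unknown."""
    for domain, results in RESULTS_BY_DOMAIN.items():
        if result in results:
            return domain
    return None
```

For example, domain_of("read_run") gives "Read", and an id that is not in the table gives None.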

Each “result” can be searched, and the output can be formatted and sorted using different fields. These fields are listed by the following commands:

  • get_filter_fields to obtain the fields to build a query or filter (more information about the type of these filters with get_filter_types)
  • get_returnable_fields to obtain the fields extractable for a result
  • get_sortable_fields to obtain the fields usable to sort the outputs
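
One possible use of these field lists is to check a request before sending it. The sketch below assumes the list of permitted fields has already been obtained (for instance from get_returnable_fields or get_sortable_fields) and is passed in as a plain collection of names. check_fields and the field names in the usage lines are hypothetical and only serve the illustration.

```python
def check_fields(requested, allowed):
    """Split requested field names into those present in `allowed` and those not.

    The order of `requested` is preserved in both returned lists.
    """
    allowed = set(allowed)
    valid = [f for f in requested if f in allowed]
    unknown = [f for f in requested if f not in allowed]
    return valid, unknown


# Hypothetical field names, for illustration only.
returnable = ["accession", "description", "tax_id"]
valid, unknown = check_fields(["accession", "lenght"], returnable)
print(valid)    # ['accession']
print(unknown)  # ['lenght']
```

Reporting the unknown names separately, rather than silently dropping them, makes typos in field names visible before a query is run.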

Programmatic access

Data in ENA can be accessed programmatically with ENASearch:

  • The ENA database can be queried via search_data

  • Data with an accession id can be retrieved via retrieve_data

    This function cannot be used to:

    • Retrieve taxonomic data

      This must be done via the taxon portal with retrieve_taxons. The taxonomy results can be accessed via get_taxonomy_results

    • Retrieve a run file report via a study accession (ERP, SRP, DRP, PRJ prefixes), experiment accession (ERX, SRX, DRX prefixes), sample accession (ERS, SRS, DRS, SAM prefixes) or run accession (ERR, SRR, DRR prefixes)

      Use retrieve_run_report instead. The fields accessible for the run report can be obtained with get_run_fields

    • Retrieve an analysis report via a study accession (ERP, SRP, DRP, PRJ prefixes), sample accession (ERS, SRS, DRS, SAM prefixes) or analysis accession (ERZ, SRZ, DRZ prefixes)

      Use retrieve_analysis_report instead. The fields accessible for the analysis report can be obtained with get_analysis_fields
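
The prefix rules above can be sketched as a small lookup that suggests which report function applies to a given accession. This is an illustration only: it does not call ENASearch, it checks nothing beyond the leading prefix, and report_functions_for and the two constants are hypothetical names introduced here.

```python
# Accession prefixes listed above for each report type.
RUN_REPORT_PREFIXES = (
    "ERP", "SRP", "DRP", "PRJ",   # study
    "ERX", "SRX", "DRX",          # experiment
    "ERS", "SRS", "DRS", "SAM",   # sample
    "ERR", "SRR", "DRR",          # run
)
ANALYSIS_REPORT_PREFIXES = (
    "ERP", "SRP", "DRP", "PRJ",   # study
    "ERS", "SRS", "DRS", "SAM",   # sample
    "ERZ", "SRZ", "DRZ",          # analysis
)


def report_functions_for(accession):
    """Return the names of the report functions whose listed prefixes match."""
    acc = accession.strip().upper()
    names = []
    # str.startswith accepts a tuple and matches if any prefix fits.
    if acc.startswith(RUN_REPORT_PREFIXES):
        names.append("retrieve_run_report")
    if acc.startswith(ANALYSIS_REPORT_PREFIXES):
        names.append("retrieve_analysis_report")
    return names
```

With made-up accessions, report_functions_for("ERR000001") returns only retrieve_run_report, report_functions_for("ERZ000001") returns only retrieve_analysis_report, and a study accession such as "PRJEB0000" returns both, since study accessions are accepted by either report.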