Interacting with ENA Database
Data
Data in ENA are organized into 11 domains (or types):
| Domain | Description |
|---|---|
| Assembly | Information describing the construction of reads and sequence contigs into higher order scaffolds and chromosomes |
| Sequence | Assembled and, optionally, annotated assembled reads |
| Coding | A virtual domain comprising sequence regions reported by data providers as being protein-coding regions |
| Non-coding | A virtual domain comprising sequence regions reported by data providers as representing non-protein-coding (RNA) genes |
| Marker | A virtual domain comprising information relating to phylogenetic, identification and molecular ecology marker data |
| Analysis | Derived data forms, such as recalibrated aligned reads and metabarcoding identifications |
| Read | Raw sequencing reads from next generation platforms |
| Trace | Raw sequencing data from capillary platforms |
| Taxon | Information relating to the organism that was the source of the sequenced biological sample |
| Sample | Information relating to the biological sample studied in the sequencing experiment |
| Study | Information relating to the scope of the sequencing effort; also known as ‘Project’. The primary use of a study is to unite content otherwise dispersed across the ENA domains |
In some cases, a domain is further subdivided into data classes. These subdivisions are the results that can be accessed:
| Domain | Result | Description |
|---|---|---|
| Assembly | assembly | Genome assemblies |
| Sequence | sequence_release | Nucleotide sequences (Release) |
| | sequence_update | Nucleotide sequences (Update) |
| | wgs_set | Genome assembly contig sets (WGS) |
| | tsa_set | Transcriptome assembly contig sets (TSA) |
| Coding | coding_release | Protein coding sequences (Release) |
| | coding_update | Protein coding sequences (Update) |
| Non-coding | noncoding_release | Non-coding sequences (Release) |
| | noncoding_update | Non-coding sequences (Update) |
| Analysis | analysis_study | Studies used for nucleotide sequence analyses from reads |
| | analysis | Nucleotide sequence analyses from reads |
| Read | read_experiment | Experiments used for raw reads |
| | read_run | Raw reads |
| | read_study | Studies used for raw reads |
| Sample | sample | Samples |
| Taxon | taxon | Taxonomic classification |
| Environmental | environmental | Environmental samples |
| Study | study | Studies |
This list can be accessed with get_results.
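To make the relationship between domains and results concrete, the table above can be held in a plain dictionary. This is only a sketch: it does not call ENASearch, it hard-codes the table, and the names `RESULTS_BY_DOMAIN`, `domain_of` and `all_results` are hypothetical helpers rather than part of the library.

```python
# Domain -> result identifiers, transcribed from the table above.
# Assumption: the Study result is written in lowercase like the others.
RESULTS_BY_DOMAIN = {
    "Assembly": ["assembly"],
    "Sequence": ["sequence_release", "sequence_update", "wgs_set", "tsa_set"],
    "Coding": ["coding_release", "coding_update"],
    "Non-coding": ["noncoding_release", "noncoding_update"],
    "Analysis": ["analysis_study", "analysis"],
    "Read": ["read_experiment", "read_run", "read_study"],
    "Sample": ["sample"],
    "Taxon": ["taxon"],
    "Environmental": ["environmental"],
    "Study": ["study"],
}


def domain_of(result):
    """Return the domain a result identifier belongs to, or None if unknown."""
    for domain, results in RESULTS_BY_DOMAIN.items():
        if result in results:
            return domain
    return None


def all_results():
    """Flatten the mapping into a sorted list of result identifiers."""
    return sorted(r for results in RESULTS_BY_DOMAIN.values() for r in results)
```

For instance, `domain_of("read_run")` looks up the domain of the raw-reads result, and an identifier that is not in the table yields `None`.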
Each “result” can be searched, and the outputs can be formatted and sorted by different fields. These fields can be listed with the following commands:
- get_filter_fields to obtain the fields to build a query or filter (get_filter_types gives more information about the types of these filters)
- get_returnable_fields to obtain the fields extractable for a result
- get_sortable_fields to obtain the fields usable to sort the outputs
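The three field lists play different roles: filter fields constrain which records match, returnable fields select the output columns, and sortable fields order the output. A minimal sketch of checking a request against such lists before sending it might look as follows. The field names are illustrative placeholders, not output copied from ENASearch, and `check_request` is a hypothetical helper.

```python
def check_request(requested, allowed):
    """Return the requested field names that are not in the allowed list."""
    allowed_set = set(allowed)
    return [name for name in requested if name not in allowed_set]


# Placeholder lists standing in for what get_returnable_fields and
# get_sortable_fields would report for a given result (assumed values).
returnable = ["run_accession", "study_accession", "base_count"]
sortable = ["run_accession", "base_count"]

# Fields a caller would like to display and sort by.
unknown_columns = check_request(["run_accession", "read_length"], returnable)
unknown_sort_keys = check_request(["base_count"], sortable)
```

An empty list means every requested field is allowed; any names left over can be reported to the user before a query is issued.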
Programmatic access
The data in ENA can be accessed programmatically with ENASearch:
- The ENA database can be queried via search_data
- Data with an accession id can be retrieved via retrieve_data. This function cannot be used to:
  - Retrieve taxonomic data. This must be done via the taxon portal with retrieve_taxons. The taxonomy results can be accessed via get_taxonomy_results
  - Retrieve a run file report via a study accession (ERP, SRP, DRP, PRJ prefixes), an experiment accession (ERX, SRX, DRX prefixes), a sample accession (ERS, SRS, DRS, SAM prefixes) or a run accession (ERR, SRR, DRR prefixes). Use retrieve_run_report instead. The fields accessible for the run report can be obtained with get_run_fields
  - Retrieve an analysis report via a study accession (ERP, SRP, DRP, PRJ prefixes), a sample accession (ERS, SRS, DRS, SAM prefixes) or an analysis accession (ERZ, SRZ, DRZ prefixes). Use retrieve_analysis_report instead. The fields accessible for the analysis report can be obtained with get_analysis_fields
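The prefix rules above determine which report function accepts a given accession. The sketch below encodes them as lookup tables. It does not call ENASearch: `accession_type` and `report_functions_for` are hypothetical helpers, and the accessions in the usage note are made-up placeholders.

```python
# Accession prefixes per record type, as listed above.
PREFIXES = {
    "study": ("ERP", "SRP", "DRP", "PRJ"),
    "experiment": ("ERX", "SRX", "DRX"),
    "sample": ("ERS", "SRS", "DRS", "SAM"),
    "run": ("ERR", "SRR", "DRR"),
    "analysis": ("ERZ", "SRZ", "DRZ"),
}

# Record types each report function accepts, per the text above.
ACCEPTED = {
    "retrieve_run_report": {"study", "experiment", "sample", "run"},
    "retrieve_analysis_report": {"study", "sample", "analysis"},
}


def accession_type(accession):
    """Classify an accession by its three-letter prefix, or return None."""
    prefix = accession[:3].upper()
    for kind, prefixes in PREFIXES.items():
        if prefix in prefixes:
            return kind
    return None


def report_functions_for(accession):
    """Return the sorted names of the report functions that accept this accession."""
    kind = accession_type(accession)
    return sorted(name for name, kinds in ACCEPTED.items() if kind in kinds)
```

With these tables, a placeholder run accession such as `ERR0000001` maps only to retrieve_run_report, while a study accession such as `PRJ0000001` is accepted by both report functions.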