Code: pylprotpredictor module

CDS class

class pylprotpredictor.cds.CDS(seq_id='', origin_seq=None, origin_seq_id='', start=-1, end=-1, strand='forward', seq=None, alternative_ends=[], alternative_cds=[], alignments=[], conserved_cds=None, rejected_cds=[], status='')[source]

Class to describe a CDS

add_alignment(alignment)[source]

Add an alignment object to the list of alignments

Parameters:alignment – an alignment object
add_alternative_cds(alternative_cds)[source]

Add an alternative CDS to the list of possible alternative CDS

Parameters:alternative_cds – a CDS object
add_id_alignment(seq_id, alignment)[source]

Add alignment to the correct CDS object

Parameters:
  • seq_id – id of the CDS
  • alignment – alignment object to add
add_rejected_cds(rejected_cds)[source]

Add a rejected CDS to the list of rejected CDS

Parameters:rejected_cds – a CDS object
export_description()[source]

Export the description of the CDS

Returns:string with the description
export_to_dict()[source]

Export the CDS object to a dictionary

Returns:dict corresponding to CDS object
extract_possible_alternative_seq()[source]

Extract the start, end and sequence of different possible sequences for a CDS identified as potential PYL CDS

find_alternative_ends()[source]

Find alternative ends (on the same ORF) for a CDS until the next found STOP codon on the genome (or its complement if the CDS is on the reverse strand)
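The scan described above can be illustrated with a minimal sketch. It is not the package's implementation: the helper name `sketch_alternative_ends` is hypothetical, the genome is a plain forward-strand string, positions are 0-based, and `end` is assumed to point just past the TAG codon that closes the CDS. It also assumes that a further in-frame TAG is recorded as a candidate end and read through, while TAA or TGA stops the scan.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def sketch_alternative_ends(genome, end):
    """Collect candidate ends downstream of a TAG-ending CDS.

    `genome` is a forward-strand DNA string and `end` is the 0-based index
    just past the TAG codon closing the CDS (assumed conventions).
    """
    alternative_ends = []
    pos = end
    # Read in-frame codons while a full codon still fits in the genome
    while pos + 3 <= len(genome):
        codon = genome[pos:pos + 3]
        pos += 3
        if codon in STOP_CODONS:
            alternative_ends.append(pos)
            # Assumption: another TAG may also be read through, so keep
            # scanning; TAA and TGA terminate the scan
            if codon != "TAG":
                break
    return alternative_ends


# Toy genome: ATG AAA TAG GGG TAG CCC TAA TTT
genome = "ATGAAATAGGGGTAGCCCTAATTT"
ends = sketch_alternative_ends(genome, 9)
print(ends)
```

For a CDS on the reverse strand, the same scan would run on the complement strand, as the description states.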

get_alignments()[source]

Return the list of alignments

Returns:list of alignment objects
get_alternative_cds()[source]

Return the list of possible alternative CDS if the CDS is ending with TAG STOP codon

Returns:list of CDS objects
get_alternative_end()[source]

Return the list of alternative CDS ends

Returns:list of the ends of the alternative CDS
get_alternative_ends()[source]

Return the list of possible alternative ends if the CDS is ending with TAG STOP codon

Returns:list of int corresponding to the alternative ends
get_alternative_start()[source]

Return the list of alternative CDS starts

Returns:list of the starts of the alternative CDS
get_conserved_cds()[source]

Return the CDS object of the conserved CDS as correct CDS (start, end, sequence)

Returns:CDS object of the conserved CDS
get_end()[source]

Return the end position of the CDS on the origin sequence

Returns:int corresponding to the end position
get_highest_bitscore()[source]

Return the highest bitscore for all alignments

Returns:float
get_id()[source]

Return the id of the CDS

Returns:string corresponding to the id
get_lowest_evalue_alignment()[source]

Return the alignment with the lowest evalue

Returns:alignment
get_origin_seq()[source]

Return the SeqRecord object corresponding to the origin seq of the CDS

Returns:SeqRecord object
get_origin_seq_id()[source]

Return the id of origin seq of the CDS

Returns:string corresponding to the origin seq
get_origin_seq_size()[source]

Return the length of the origin sequence

Returns:int corresponding to the length of the origin sequence
get_origin_seq_string()[source]

Return the string of the origin sequence

Returns:string corresponding to the origin sequence
get_rejected_cds()[source]

Return a list of the rejected CDS objects as correct CDS (start, end, sequence)

Returns:list of CDS objects
get_seq()[source]

Return the sequence of the CDS

Returns:string with the sequence
get_seqrecord()[source]

Return a SeqRecord of the CDS

Returns:SeqRecord
get_start()[source]

Return the start position of the CDS on the origin sequence

Returns:int corresponding to the start position
get_status()[source]

Return the status

Returns:string with the status of the CDS
get_stop_codon()[source]

Return the STOP codon of the CDS

Returns:the STOP codon
get_strand()[source]

Return the strand of the CDS on the origin sequence

Returns:string corresponding to the strand (forward or reverse)
get_translated_alternative_seq()[source]

Return a list of the translated sequences of the alternative sequences

Returns:list of SeqRecord objects
get_translated_seq()[source]

Return the translated sequence of the CDS

Returns:SeqRecord object corresponding to the translated sequence
has_alternative_cds()[source]

Test if the CDS has alternative cds

Returns:boolean
has_alternative_ends()[source]

Test if the list of alternative ends is not empty

Returns:boolean
has_origin_seq()[source]

Test if the CDS has an origin seq

Returns:boolean
identify_cons_rej_cds()[source]

Identify which alternative CDS to conserve or reject based on the evalue and the alignment length: keep the sequence with the lowest evalue and the longest alignment

Returns:better alignment
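The selection rule can be sketched as follows. How the two criteria are combined is not specified here, so this sketch assumes the evalue is compared first and the alignment length breaks ties. `Hit`, `sketch_pick_conserved` and the CDS ids are hypothetical stand-ins, not names from the package.

```python
from collections import namedtuple

# Hypothetical stand-in for an alignment: only the fields used for ranking
Hit = namedtuple("Hit", ["cds_id", "evalue", "length"])


def sketch_pick_conserved(hits):
    """Return (conserved, rejected) from the best hit of each candidate CDS."""
    # Lowest evalue first; among equal evalues, longest alignment first
    ranked = sorted(hits, key=lambda h: (h.evalue, -h.length))
    return ranked[0], ranked[1:]


hits = [
    Hit("cds_1", 1e-20, 110),
    Hit("cds_1-alt1", 1e-35, 180),
    Hit("cds_1-alt2", 1e-35, 150),
]
conserved, rejected = sketch_pick_conserved(hits)
print(conserved.cds_id, [h.cds_id for h in rejected])
```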
init_from_dict(in_dict)[source]

Initiate a CDS instance with a dictionary

Parameters:in_dict – dictionary with attribute for a CDS object
init_from_record(record)[source]

Initiate a CDS instance with a SeqRecord object

Parameters:record – a SeqRecord object
is_potential_pyl()[source]

Test if the sequence has a status for potential pyl

Returns:boolean
is_reverse_strand()[source]

Test if the strand is reverse

Returns:boolean
is_tag_ending()[source]

Test if the sequence has a status for tag-ending

Returns:boolean
is_tag_ending_seq()[source]

Test if the sequence is ending with TAG STOP codon

Returns:boolean
reset_alignments()[source]

Reset the list of alignments

reset_alternative_cds()[source]

Reset the list of alternative cds

reset_rejected_cds()[source]

Reset the list of rejected cds

set_alternative_ends(alternative_ends)[source]

Change the list of alternative ends

Parameters:alternative_ends – list of int corresponding to the new alternative ends
set_conserved_cds(conserved_cds)[source]

Change the conserved CDS

Parameters:conserved_cds – CDS object of the conserved CDS
set_end(end)[source]

Change the end position of the CDS

Parameters:end – new end value (int)
set_evalue(evalue)[source]

Change the evalue

Parameters:evalue – new evalue
set_id(seq_id)[source]

Change the id of the CDS

Parameters:seq_id – new seq id value
set_origin_seq(origin_seq)[source]

Change the SeqRecord object corresponding to the origin seq of the CDS

Parameters:origin_seq – SeqRecord object
set_origin_seq_id(origin_seq_id)[source]

Change the id of the origin sequence of the CDS

Parameters:origin_seq_id – new origin seq id value
set_seq(seq)[source]

Change the sequence object of the CDS

Parameters:seq – new Seq object with the sequence of the CDS
set_start(start)[source]

Change the start position of the CDS

Parameters:start – new start value (int)
set_status(status)[source]

Change the status

Parameters:status – new status
set_strand(strand)[source]

Change the strand value of the CDS

Parameters:strand – new strand (forward or reverse)
pylprotpredictor.cds.extract_seq_desc(desc)[source]

Extract from description the seq id, the origin sequence, start, end and strand from a predicted CDS

Parameters:desc – description of a predicted CDS with Prodigal
Returns:id of predicted CDS
Returns:id of the origin sequence
Returns:start position of the predicted CDS
Returns:end position of the predicted CDS
Returns:strand of the predicted CDS
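A minimal sketch of this kind of parsing, assuming the usual Prodigal FASTA header layout (`<id> # <start> # <end> # <strand> # <attributes>`, with the CDS id formed as `<origin id>_<index>`). The function name `sketch_extract_seq_desc` is hypothetical, and the sketch returns positions exactly as written in the header; whether the package shifts them to 0-based coordinates is not stated here.

```python
def sketch_extract_seq_desc(desc):
    """Split a Prodigal-style header into id, origin id, start, end, strand."""
    fields = [field.strip() for field in desc.split("#")]
    seq_id = fields[0].lstrip(">")
    # Assumption: the origin sequence id is the CDS id without its last "_<n>"
    origin_seq_id = seq_id.rsplit("_", 1)[0]
    start = int(fields[1])
    end = int(fields[2])
    strand = "forward" if int(fields[3]) == 1 else "reverse"
    return seq_id, origin_seq_id, start, end, strand


desc = "contig_1_2 # 337 # 2799 # -1 # ID=1_2;partial=00;start_type=ATG"
parsed = sketch_extract_seq_desc(desc)
print(parsed)
```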
pylprotpredictor.cds.find_stop_codon_pos_in_seq(seq)[source]

Find position of STOP codon inside a sequence (not the last position)

Parameters:seq – string sequence of amino acids
Returns:list of positions for possible STOP codons in a sequence
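As a rough illustration, the search can be written as a scan over the amino acid string that skips its final position. The sketch assumes STOP codons appear as `*` in the translated string, which is the common convention but is not confirmed here; `sketch_find_stop_codon_pos` is a hypothetical name.

```python
def sketch_find_stop_codon_pos(seq, stop_symbol="*"):
    """Return 0-based positions of internal STOP symbols in `seq`."""
    # seq[:-1] drops the last residue, so a terminal STOP is never reported
    return [i for i, aa in enumerate(seq[:-1]) if aa == stop_symbol]


positions = sketch_find_stop_codon_pos("MKT*LLV*GA*")
print(positions)
```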
pylprotpredictor.cds.test_to_continue(end, origin_seq_size)[source]

Test if possible to extract next codon: position still in the genome

Parameters:
  • end – int corresponding to the current end
  • origin_seq_size – size of the origin sequence
Returns:

boolean

pylprotpredictor.cds.transform_strand(strand_id)[source]

Transform strand from numerical value to string value

Parameters:strand_id – numerical value to represent a strand (1 or -1)
Returns:string value (forward or reverse) for the strand
pylprotpredictor.cds.translate(seq)[source]

Translate a sequence into amino acids while replacing any possible STOP codon encoded by TAG by a Pyl amino acid

Parameters:seq – a Seq object
Returns:string with the corresponding amino acid sequence, in which TAG-encoded STOP codons are replaced by the Pyl amino acid
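The idea can be sketched in plain Python without Biopython. This is an illustration under assumptions, not the package's code: it takes a DNA string rather than a Seq object, uses the standard genetic code, writes pyrrolysine with the one-letter symbol `O`, and marks unknown codons as `X`. The name `sketch_translate` is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated with bases in T, C, A, G order
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def sketch_translate(dna, pyl_symbol="O"):
    """Translate `dna`, reading every TAG codon as pyrrolysine."""
    dna = dna.upper()
    protein = []
    # Ignore a trailing partial codon, if any
    for i in range(0, len(dna) - len(dna) % 3, 3):
        codon = dna[i:i + 3]
        if codon == "TAG":
            protein.append(pyl_symbol)
        else:
            # Codons with ambiguous bases are marked "X" (assumption)
            protein.append(CODON_TABLE.get(codon, "X"))
    return "".join(protein)


protein = sketch_translate("ATGAAATAGGGGTAA")
print(protein)
```

Only TAG is rewritten; TAA and TGA keep their usual STOP symbol.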

Alignment class

class pylprotpredictor.alignment.Alignment(sseqid='', pident=0, length=0, mismatch=0, gapopen=0, qstart=0, qend=0, sstart=0, send=0, evalue=10, bitscore=0)[source]

Class to describe a DIAMOND alignment

get_bitscore()[source]

Return bit score

Returns:bit score
get_evalue()[source]

Return expect value

Returns:float
get_gapopen()[source]

Return number of gap openings

Returns:int
get_length()[source]

Return alignment length

Returns:int
get_mismatch()[source]

Return number of mismatches

Returns:int
get_pident()[source]

Return percentage of identical matches

Returns:float
get_qend()[source]

Return end of alignment in query

Returns:int
get_qstart()[source]

Return start of alignment in query

Returns:start of alignment in query
get_send()[source]

Return end of alignment in subject

Returns:int
get_sseqid()[source]

Return subject sequence id

Returns:string
get_sstart()[source]

Return start of alignment in subject

Returns:int
init_from_search_report_row(row)[source]

Initiate an Alignment instance with a row extracted from a BLAST/DIAMOND table

Parameters:row – a pandas row
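A minimal sketch of building an alignment from one report row. It uses a plain dict in place of a pandas row and a dataclass in place of the Alignment class; `SketchAlignment` and `from_row` are hypothetical names. The column names follow the standard 12-column BLAST/DIAMOND tabular output, and the field types are inferred from the constructor defaults, which is an assumption.

```python
from dataclasses import dataclass, fields


@dataclass
class SketchAlignment:
    # Defaults mirror the constructor signature documented above
    sseqid: str = ""
    pident: float = 0.0
    length: int = 0
    mismatch: int = 0
    gapopen: int = 0
    qstart: int = 0
    qend: int = 0
    sstart: int = 0
    send: int = 0
    evalue: float = 10.0
    bitscore: float = 0.0

    @classmethod
    def from_row(cls, row):
        """Build an instance from a mapping keyed by column name."""
        # Cast each value to the type of the matching field default
        return cls(**{f.name: type(f.default)(row[f.name]) for f in fields(cls)})


row = {
    "qseqid": "cds_1", "sseqid": "ref_42", "pident": "87.5", "length": "180",
    "mismatch": "20", "gapopen": "1", "qstart": "1", "qend": "180",
    "sstart": "5", "send": "184", "evalue": "1e-35", "bitscore": "250.4",
}
alignment = SketchAlignment.from_row(row)
print(alignment)
```

The query id (`qseqid`) is not stored on the alignment itself; in this module it identifies the CDS the alignment is attached to.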
set_bitscore(bitscore)[source]

Modify bit score

Parameters:bitscore – int
set_evalue(evalue)[source]

Modify evalue

Parameters:evalue – float
set_gapopen(gapopen)[source]

Modify number of gap openings

Parameters:gapopen – int
set_length(length)[source]

Modify alignment length

Parameters:length – int
set_mismatch(mismatch)[source]

Modify number of mismatches

Parameters:mismatch – int
set_pident(pident)[source]

Modify percentage of identical matches

Parameters:pident – float
set_qend(qend)[source]

Modify end of alignment in query

Parameters:qend – int
set_qstart(qstart)[source]

Modify start of alignment in query

Parameters:qstart – int
set_send(send)[source]

Modify end of alignment in subject

Parameters:send – int
set_sseqid(sseqid)[source]

Modify subject Seq-id

Parameters:sseqid – string
set_sstart(sstart)[source]

Modify start of alignment in subject

Parameters:sstart – int

Predict

pylprotpredictor.predict.extract_potential_pyl_cds(pred_cds, pot_pyl_cds_filepath, pot_pyl_cds_info_filepath, pred_cds_obj_filepath)[source]

Extract potential PYL CDS from TAG-ending CDS

Parameters:
  • pred_cds – a dictionary with the predicted CDS represented as CDS objects
  • pot_pyl_cds_filepath – path to fasta file in which the protein sequences of the potential PYL CDS are saved
  • pot_pyl_cds_info_filepath – path to a CSV file to get information about potential PYL CDS
  • pred_cds_obj_filepath – path to generated JSON file to store the list of predicted CDS objects
pylprotpredictor.predict.extract_predicted_cds(pred_cds_path, pred_cds_info_path, tag_ending_cds_info_path, genome_filepath)[source]

Extract the list of predicted CDS and identify the CDS ending with TAG STOP codon

Parameters:
  • pred_cds_path – path to the output of CDS prediction (Prodigal)
  • pred_cds_info_path – path to a CSV file in which the information (start, end, strand, origin) are collected for each predicted CDS
  • tag_ending_cds_info_path – path to CSV file to export the information about the TAG ending CDS
  • genome_filepath – path to reference genome
Returns:

a dictionary with the predicted CDS represented by CDS object

pylprotpredictor.predict.extract_seqs(seq_filepath)[source]

Extract the sequences in a fasta file

Parameters:seq_filepath – path to a fasta file
Returns:a dictionary with all sequences indexed by their id, their length and their complement sequence
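One way to picture the returned structure is the sketch below, which reads FASTA text from a file-like object with the standard library only. The key names (`seq`, `length`, `complement`) and the function name `sketch_extract_seqs` are assumptions for illustration; the package may store these values differently (for example as Biopython objects).

```python
import io

# Complement table for unambiguous DNA bases (other symbols are left as-is)
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def sketch_extract_seqs(handle):
    """Index FASTA records by id with their length and complement."""
    seqs = {}
    seq_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The id is the first word of the header line
            seq_id = line[1:].split()[0]
            seqs[seq_id] = {"seq": ""}
        elif seq_id is not None:
            seqs[seq_id]["seq"] += line
    for entry in seqs.values():
        entry["length"] = len(entry["seq"])
        # Complement only, not reverse complement
        entry["complement"] = entry["seq"].translate(COMPLEMENT)
    return seqs


fasta = io.StringIO(">contig_1 toy genome\nATGC\nGGTA\n>contig_2\nTTAA\n")
seqs = sketch_extract_seqs(fasta)
print(seqs["contig_1"])
```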
pylprotpredictor.predict.predict_pyl_proteins(genome_filepath, pred_cds_filepath, pot_pyl_seq_filepath, log_filepath, pred_cds_info_filepath, tag_ending_cds_info_filepath, pot_pyl_seq_info_filepath, pred_cds_obj_filepath)[source]

Run prediction of potential PYL CDS:

  • Extraction of predicted CDS into a dictionary
  • Identification of TAG-ending proteins
  • Extraction of potential PYL sequences
Parameters:
  • genome_filepath – path to file with genome sequence
  • pred_cds_filepath – path to the output of CDS prediction (Prodigal)
  • pot_pyl_seq_filepath – path to fasta file with potential PYL CDS sequence
  • log_filepath – path to log file
  • pred_cds_info_filepath – path to CSV file with predicted CDS info
  • tag_ending_cds_info_filepath – path to CSV file with TAG-ending CDS info
  • pot_pyl_seq_info_filepath – path to CSV file with potential PYL CDS info
  • pred_cds_obj_filepath – path to generated JSON file to store the list of predicted CDS objects

Check

pylprotpredictor.check.check_pyl_proteins(pot_pyl_similarity_search, pred_cds_obj_filepath, cons_pred_cds_seq, info_filepath)[source]

Check predicted PYL CDS:

  • Get the potential PYL CDS
  • Parse the similarity search report
  • Identify and extract the correct CDS sequence (the one with the lowest evalue and longest alignment for potential PYL)
Parameters:
  • pot_pyl_similarity_search – path to similarity search report of potential PYL CDS against a reference database
  • pred_cds_obj_filepath – path to generated JSON file to store the list of predicted CDS objects
  • cons_pred_cds_seq – path to a FASTA file for the conserved CDS sequences
  • info_filepath – path to a CSV file with final information about the CDS
pylprotpredictor.check.extract_correct_cds(pred_cds, cons_pred_cds_seq, info_filepath)[source]

Identify and extract the correct CDS sequence

Parameters:
  • pred_cds – dictionary of the predicted CDS
  • cons_pred_cds_seq – path to a FASTA file for the conserved CDS sequences
  • info_filepath – path to a CSV file with final information about the CDS
pylprotpredictor.check.get_cds_obj(cds_id, pred_cds)[source]

Find the CDS object given an id

Parameters:
  • cds_id – id of the CDS to find
  • pred_cds – dictionary of the predicted CDS
Returns:

a CDS object

pylprotpredictor.check.import_cds(cds_obj_filepath)[source]
Parameters:cds_obj_filepath – path to JSON file with collection of CDS objects
Returns:dictionary of the CDS objects
pylprotpredictor.check.parse_similarity_search_report(pot_pyl_similarity_search, pred_cds)[source]

Parse the similarity search report and add information to the list of potential PYL CDS

Parameters:
  • pot_pyl_similarity_search – path to similarity search report of potential PYL CDS against a reference database
  • pred_cds – dictionary of the predicted CDS
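The parsing step can be sketched with the standard library, assuming the report is the 12-column tab-separated BLAST/DIAMOND table without a header line. `sketch_parse_report`, the ids, and the returned dictionary layout are hypothetical; in the package the hits are attached to CDS objects rather than collected in plain dicts.

```python
import csv
import io
from collections import defaultdict

# Standard BLAST/DIAMOND tabular columns (outfmt 6)
COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def sketch_parse_report(handle):
    """Group the hits of a tabular similarity report by query id."""
    hits = defaultdict(list)
    reader = csv.DictReader(handle, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        hits[row["qseqid"]].append({
            "sseqid": row["sseqid"],
            "length": int(row["length"]),
            "evalue": float(row["evalue"]),
            "bitscore": float(row["bitscore"]),
        })
    return dict(hits)


report = io.StringIO(
    "cds_1\tref_42\t80.0\t110\t22\t0\t1\t110\t1\t110\t1e-20\t150.0\n"
    "cds_1-alt1\tref_42\t87.5\t180\t20\t1\t1\t180\t5\t184\t1e-35\t250.4\n"
    "cds_1-alt1\tref_7\t60.0\t90\t36\t0\t10\t99\t3\t92\t1e-08\t70.2\n"
)
hits = sketch_parse_report(report)
print({query: len(found) for query, found in hits.items()})
```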

Write report

pylprotpredictor.write_report.extract_row_number(csv_filepath)[source]

Extract row number of a CSV file

Parameters:csv_filepath – path to a CSV file
Returns:an integer corresponding to the number of lines in the CSV file
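A minimal sketch of such a count, reading from a file-like object instead of a path so that it stays self-contained. The name `sketch_count_rows` is hypothetical, and the sketch counts every row including a header row; whether the package excludes the header is not stated here.

```python
import csv
import io


def sketch_count_rows(handle):
    """Count the rows of a CSV stream, header row included."""
    return sum(1 for _ in csv.reader(handle))


table = io.StringIO("id,start,end\ncds_1,1,90\ncds_2,120,300\n")
row_number = sketch_count_rows(table)
print(row_number)
```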
pylprotpredictor.write_report.write_report(pred_cds, tag_ending_cds, pot_pyl_cds, final_cds, report_filepath)[source]

Write HTML report to summarize the full analysis

Parameters:
  • pred_cds – path to CSV file with predicted CDS info
  • tag_ending_cds – path to CSV file with TAG-ending CDS info
  • pot_pyl_cds – path to CSV file with potential PYL CDS info
  • final_cds – path to a CSV file with final information about the CDS
  • report_filepath – path to the HTML file in which the report is written