hybkit (module)

Module storing primary hybkit classes and hybkit API.

This module contains classes and methods for reading, writing, and manipulating data in the hyb genomic sequence format ([Travis2014]). For more information, see the hybkit Hyb File Specification.

An example string of a hyb-format line from [Gay2018] is:

2407_718\tATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC\t.\tMIMAT0000078_MirBase_miR-23a_microRNA\t1\t21\t1\t21\t0.0027\tENSG00000188229_ENST00000340384_TUBB2C_mRNA\t23\t49\t1181\t1207\t1.2e-06

Hybkit functionality is primarily based on classes for storage and evaluation of chimeric genomic sequences and associated fold-information:

HybRecord

Class to store a single hyb (hybrid) sequence record

FoldRecord

Class to store predicted RNA secondary structure information for hybrid reads

Also included are classes for reading, writing, and iterating over files containing hybrid information:

HybFile

Class for reading and writing hyb-format files [Travis2014] containing chimeric RNA sequence information as HybRecord objects

ViennaFile

Class for reading and writing Vienna-format files [ViennaFormat] containing RNA secondary structure information in dot-bracket format as FoldRecord objects

CtFile

-BETA- Class for reading Connectivity Table (.ct)-format files [CTFormat], as used by UNAFold, containing predicted RNA secondary-structure information as FoldRecord objects

HybFoldIter

Class for concurrent iteration over a HybFile and a ViennaFile or CtFile

HybRecord Class

class hybkit.HybRecord(id: str, seq: str, energy: Optional[Union[float, int, str]] = None, seg1_props: Optional[Dict[str, Union[float, int, str]]] = None, seg2_props: Optional[Dict[str, Union[float, int, str]]] = None, flags: Optional[Dict[str, Any]] = None, read_count: Optional[int] = None, allow_undefined_flags: Optional[bool] = None)

Class for storing and analyzing chimeric (hybrid) RNA-seq reads in hyb format.

The hyb file format is a GFF-related format described by [Travis2014]. Each entry contains information about a genomic sequence read identified as a hybrid by a chimeric read caller. Each line contains 15 or 16 columns separated by tabs ("\t") and provides annotations on each component. An example hyb-format line from [Gay2018]:

2407_718\tATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC\t.\tMIMAT0000078_MirBase_miR-23a_microRNA\t1\t21\t1\t21\t0.0027\tENSG00000188229_ENST00000340384_TUBB2C_mRNA\t23\t49\t1181\t1207\t1.2e-06

The columns are respectively described in hybkit as:

id, seq, energy, seg1_ref_name, seg1_read_start, seg1_read_end, seg1_ref_start, seg1_ref_end, seg1_score, seg2_ref_name, seg2_read_start, seg2_read_end, seg2_ref_start, seg2_ref_end, seg2_score, flags

(For more information, see the hybkit Hyb File Specification)
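
As a minimal sketch of how the column names above map onto a line, the following plain-Python snippet splits the example line on tabs and pairs each value with its column name. The parse_hyb_columns helper is hypothetical and is not part of hybkit; it performs no type conversion or validation beyond counting columns.

```python
# Illustrative only: parse_hyb_columns is a hypothetical helper, not a hybkit function.
HYB_COLUMNS = (
    'id', 'seq', 'energy',
    'seg1_ref_name', 'seg1_read_start', 'seg1_read_end',
    'seg1_ref_start', 'seg1_ref_end', 'seg1_score',
    'seg2_ref_name', 'seg2_read_start', 'seg2_read_end',
    'seg2_ref_start', 'seg2_ref_end', 'seg2_score',
    'flags',
)


def parse_hyb_columns(line):
    """Map the 15 or 16 tab-separated values of a hyb line to column names."""
    values = line.rstrip('\n').split('\t')
    if len(values) not in (15, 16):
        raise ValueError('Expected 15 or 16 columns, found %d' % len(values))
    # zip() stops at the shorter input, so a 15-column line has no 'flags' key.
    return dict(zip(HYB_COLUMNS, values))


example_line = (
    '2407_718\tATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC\t.\t'
    'MIMAT0000078_MirBase_miR-23a_microRNA\t1\t21\t1\t21\t0.0027\t'
    'ENSG00000188229_ENST00000340384_TUBB2C_mRNA\t23\t49\t1181\t1207\t1.2e-06'
)
columns = parse_hyb_columns(example_line)
print(columns['id'], columns['seg2_ref_name'])
```

All values stay as strings in this sketch; the "." in the energy column is the placeholder for a missing value.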

The preferred method for reading hyb records from lines is with the HybRecord.from_line() constructor:

# line = "2407_718\tATC..."
hyb_record = hybkit.HybRecord.from_line(line)

This is the constructor used by the HybFile class to parse hyb files. For example, to print all hybrid identifiers in a hyb file:

with hybkit.HybFile('path/to/file.hyb', 'r') as hyb_file:
    # performs "hyb_record = hybkit.HybRecord.from_line(line)" for each line in file
    for hyb_record in hyb_file:
        print(hyb_record.id)

HybRecord objects can also be constructed directly. A minimum amount of data necessary for a HybRecord object is the genomic sequence and its corresponding identifier.

Examples

hyb_record_1 = hybkit.HybRecord('1_100', 'ACTG')
hyb_record_2 = hybkit.HybRecord('2_107', 'CTAG', '-7.3')
hyb_record_3 = hybkit.HybRecord('3_295', 'CTTG', energy='-10.3')

Details about segments are provided via python dictionaries with keys specific to each segment. Data can be provided either as strings or as floats/integers (where appropriate). For example, to create a HybRecord object representing the example line given above:

seg1_props = {'ref_name': 'MIMAT0000078_MirBase_miR-23a_microRNA',
             'read_start': '1',
             'read_end': '21',
             'ref_start': '1',
             'ref_end': '21',
             'score': '0.0027'}
seg2_props = {'ref_name': 'ENSG00000188229_ENST00000340384_TUBB2C_mRNA',
             'read_start': 23,
             'read_end': 49,
             'ref_start': 1181,
             'ref_end': 1207,
             'score': 1.2e-06}
seq_id = '2407_718'
seq = 'ATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC'
energy = None

hyb_record = hybkit.HybRecord(seq_id, seq, energy, seg1_props, seg2_props)
# OR
hyb_record = hybkit.HybRecord(seq_id, seq, seg1_props=seg1_props, seg2_props=seg2_props)

Parameters
  • id (str) -- Identifier for the hyb record

  • seq (str) -- Nucleotide sequence of the hyb record

  • energy (str or float, optional) -- Predicted energy of sequence folding in kcal/mol

  • seg1_props (dict, optional) -- Properties of segment 1 of the record, containing possible segment column keys: (ref_name, read_start, read_end, ref_start, ref_end, score)

  • seg2_props (dict, optional) -- Properties of segment 2 of the record, containing possible segment column keys: (ref_name, read_start, read_end, ref_start, ref_end, score)

  • flags (dict, optional) -- Dict with keys of flags for the record and their associated values. By default, flags must be defined in ALL_FLAGS, but custom flags can be supplied by changing HybRecord.settings['custom_flags']. This check can also be disabled by setting 'allow_undefined_flags' to True in HybRecord.settings.

  • allow_undefined_flags (bool, optional) -- If True, allows flags not defined in ALL_FLAGS or HybRecord.settings['custom_flags'] to be added to the record. If not provided, defaults to the value in HybRecord.settings['allow_undefined_flags'].

Variables
  • id (str) -- Identifier for the hyb record (Hyb format: <read-num>_<read-count>)

  • seq (str) -- Nucleotide sequence of the hyb record

  • energy (str) -- Predicted energy of folding

  • seg1_props (dict) -- Information on chimeric segment 1, contains segment column keys: ref_name (str), read_start (int), read_end (int), ref_start (int), ref_end (int), and score (float).

  • seg2_props (dict) -- Information on segment 2, contains segment column keys: ref_name (str), read_start (int), read_end (int), ref_start (int), ref_end (int), and score (float).

  • flags (dict) -- Dict of flags with possible flag keys and values as defined in the Flags section of the hybkit Hyb File Specification.

  • fold_record (FoldRecord) -- Information on the predicted secondary structure of the sequence set by set_fold_record().

  • allow_undefined_flags (bool) -- Whether to allow undefined flags to be set.

HYBRID_COLUMNS = ('id', 'seq', 'energy')

Record columns 1-3 defining parameters of the overall hybrid, defined by the Hyb format

SEGMENT_COLUMNS = ('ref_name', 'read_start', 'read_end', 'ref_start', 'ref_end', 'score')

Record columns 4-9 and 10-15, defining annotated parameters of seg1 and seg2 respectively, defined by the Hyb format

ALL_FLAGS = ('count_total', 'count_last_clustering', 'two_way_merged', 'seq_IDs_in_cluster', 'read_count', 'orient', 'det', 'seg1_type', 'seg2_type', 'seg1_det', 'seg2_det', 'miRNA_seg', 'target_reg', 'ext', 'dataset')

Flags defined by the hybkit package. Flags 1-4 are utilized by the Hyb software package. For information on flags, see the Flags portion of the hybkit Hyb File Specification.
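
The flag check described for the flags and allow_undefined_flags parameters can be sketched in plain Python as follows. This is an illustration of the documented rule only: check_flag_allowed is a hypothetical helper, and the use of ValueError is an assumption, since the exception type raised by hybkit is not stated here.

```python
# Illustrative only: check_flag_allowed is a hypothetical helper, not a hybkit function.
ALL_FLAGS = ('count_total', 'count_last_clustering', 'two_way_merged',
             'seq_IDs_in_cluster', 'read_count', 'orient', 'det',
             'seg1_type', 'seg2_type', 'seg1_det', 'seg2_det',
             'miRNA_seg', 'target_reg', 'ext', 'dataset')


def check_flag_allowed(flag_key, custom_flags=(), allow_undefined_flags=False):
    """Return True if flag_key may be set, otherwise raise ValueError (assumed)."""
    if allow_undefined_flags:
        return True  # the check is disabled entirely
    if flag_key in ALL_FLAGS or flag_key in custom_flags:
        return True
    raise ValueError('Flag %r is not in ALL_FLAGS or custom_flags' % flag_key)


print(check_flag_allowed('dataset'))
print(check_flag_allowed('my_flag', custom_flags=['my_flag']))
```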

settings = {'allow_undefined_flags': False, 'allow_unknown_seg_types': False, 'custom_flags': [], 'hyb_placeholder': '.', 'mirna_types': ['miRNA', 'microRNA'], 'reorder_flags': True}

Class-level settings. See settings.HybRecord_settings_info for descriptions.

TypeFinder

Link to type_finder.TypeFinder class for parsing sequence identifiers in assigning segment types by eval_types().

SET_PROPS = ('energy', 'full_seg_props', 'fold_record', 'eval_types', 'eval_mirna', 'eval_target')

Properties for the is_set() method.

GEN_PROPS = ('has_indels',)

General record properties for the prop() method.

  • has_indels : Either the seg1 or the seg2 alignment has insertions/deletions, shown by differing read and reference lengths for the same alignment
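
The has_indels criterion can be illustrated with a short plain-Python sketch that compares the read span and reference span of each segment. The helper names are hypothetical and this is not hybkit's implementation; it only restates the rule given above.

```python
# Illustrative only: these helpers are hypothetical, not hybkit functions.
def seg_has_indel(seg_props):
    """Return True if the read span and reference span differ in length."""
    read_len = int(seg_props['read_end']) - int(seg_props['read_start']) + 1
    ref_len = int(seg_props['ref_end']) - int(seg_props['ref_start']) + 1
    return read_len != ref_len


def has_indels(seg1_props, seg2_props):
    """Return True if either segment alignment contains an insertion/deletion."""
    return seg_has_indel(seg1_props) or seg_has_indel(seg2_props)


# Segment positions taken from the example hyb line above.
seg1 = {'read_start': 1, 'read_end': 21, 'ref_start': 1, 'ref_end': 21}
seg2 = {'read_start': 23, 'read_end': 49, 'ref_start': 1181, 'ref_end': 1207}
print(has_indels(seg1, seg2))
```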

STR_PROPS = ('id_is', 'id_prefix', 'id_suffix', 'id_contains', 'seq_is', 'seq_prefix', 'seq_suffix', 'seq_contains', 'seg1_is', 'seg1_prefix', 'seg1_suffix', 'seg1_contains', 'seg2_is', 'seg2_prefix', 'seg2_suffix', 'seg2_contains', 'any_seg_is', 'any_seg_prefix', 'any_seg_suffix', 'any_seg_contains', 'seg1_type_is', 'seg1_type_prefix', 'seg1_type_suffix', 'seg1_type_contains', 'seg2_type_is', 'seg2_type_prefix', 'seg2_type_suffix', 'seg2_type_contains', 'any_seg_type_is', 'any_seg_type_prefix', 'any_seg_type_suffix', 'any_seg_type_contains')

String-comparison properties for the prop() method.

MIRNA_PROPS = ('has_mirna', 'no_mirna', 'mirna_dimer', 'mirna_not_dimer', '5p_mirna', '3p_mirna')

miRNA-evaluation-related properties for the prop() method. Requires miRNA_seg flag to be set by eval_mirna() method.

  • has_mirna : Either or both of seg1 and seg2 have been identified as a miRNA

  • no_mirna : Both seg1 and seg2 have been identified as not a miRNA

  • mirna_dimer : Both seg1 and seg2 have been identified as a miRNA

  • mirna_not_dimer : One and only one of seg1 or seg2 has been identified as a miRNA

  • 5p_mirna : seg1 (5p) has been identified as a miRNA

  • 3p_mirna : seg2 (3p) has been identified as a miRNA
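
The relationship between these properties can be sketched from the descriptions above, starting from the two segment types. The evaluate_mirna_props function is hypothetical and works directly on type strings; hybkit itself evaluates these properties from the miRNA_seg flag, so treat this only as a restatement of the definitions.

```python
# Illustrative only: evaluate_mirna_props is a hypothetical helper, not a hybkit function.
MIRNA_TYPES = frozenset(['miRNA', 'microRNA'])  # default settings['mirna_types']


def evaluate_mirna_props(seg1_type, seg2_type, mirna_types=MIRNA_TYPES):
    """Return a dict of the six miRNA properties for a pair of segment types."""
    seg1_mirna = seg1_type in mirna_types
    seg2_mirna = seg2_type in mirna_types
    return {
        'has_mirna': seg1_mirna or seg2_mirna,
        'no_mirna': not (seg1_mirna or seg2_mirna),
        'mirna_dimer': seg1_mirna and seg2_mirna,
        'mirna_not_dimer': seg1_mirna != seg2_mirna,  # exactly one is a miRNA
        '5p_mirna': seg1_mirna,
        '3p_mirna': seg2_mirna,
    }


# Segment types from the example hyb line: seg1 is a microRNA, seg2 is an mRNA.
props = evaluate_mirna_props('microRNA', 'mRNA')
print(props)
```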

MIRNA_STR_PROPS = ('mirna_is', 'mirna_prefix', 'mirna_suffix', 'mirna_contains', 'target_is', 'target_prefix', 'target_suffix', 'target_contains', 'mirna_seg_type_is', 'mirna_seg_type_prefix', 'mirna_seg_type_suffix', 'mirna_seg_type_contains', 'target_seg_type_is', 'target_seg_type_prefix', 'target_seg_type_suffix', 'target_seg_type_contains')

miRNA-associated string-comparison properties for the prop() method.

  • Comparisons:

  • is : Comparison string matches field exactly

  • prefix : Comparison string matches beginning of field

  • suffix : Comparison string matches end of field

  • contains : Comparison string is contained within field
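
The four comparison suffixes correspond to ordinary Python string tests. The sketch below shows that mapping with a hypothetical compare_field helper; it is not hybkit code, and raising ValueError for an unknown mode is an assumption.

```python
# Illustrative only: compare_field is a hypothetical helper, not a hybkit function.
def compare_field(field, mode, comparison):
    """Apply one of the 'is', 'prefix', 'suffix', or 'contains' comparisons."""
    if mode == 'is':
        return field == comparison
    if mode == 'prefix':
        return field.startswith(comparison)
    if mode == 'suffix':
        return field.endswith(comparison)
    if mode == 'contains':
        return comparison in field
    raise ValueError('Unknown comparison mode: %r' % mode)


ref_name = 'MIMAT0000078_MirBase_miR-23a_microRNA'
print(compare_field(ref_name, 'prefix', 'MIMAT'))
print(compare_field(ref_name, 'contains', 'miR-23a'))
```

Based on the prop() signature documented below, the corresponding hybkit check would presumably take the form hyb_record.prop('seg1_contains', 'miR-23a').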

HAS_PROPS = ('has_indels', 'id_is', 'id_prefix', 'id_suffix', 'id_contains', 'seq_is', 'seq_prefix', 'seq_suffix', 'seq_contains', 'seg1_is', 'seg1_prefix', 'seg1_suffix', 'seg1_contains', 'seg2_is', 'seg2_prefix', 'seg2_suffix', 'seg2_contains', 'any_seg_is', 'any_seg_prefix', 'any_seg_suffix', 'any_seg_contains', 'seg1_type_is', 'seg1_type_prefix', 'seg1_type_suffix', 'seg1_type_contains', 'seg2_type_is', 'seg2_type_prefix', 'seg2_type_suffix', 'seg2_type_contains', 'any_seg_type_is', 'any_seg_type_prefix', 'any_seg_type_suffix', 'any_seg_type_contains', 'has_mirna', 'no_mirna', 'mirna_dimer', 'mirna_not_dimer', '5p_mirna', '3p_mirna', 'mirna_is', 'mirna_prefix', 'mirna_suffix', 'mirna_contains', 'target_is', 'target_prefix', 'target_suffix', 'target_contains', 'mirna_seg_type_is', 'mirna_seg_type_prefix', 'mirna_seg_type_suffix', 'mirna_seg_type_contains', 'target_seg_type_is', 'target_seg_type_prefix', 'target_seg_type_suffix', 'target_seg_type_contains')

All allowed properties for the prop() method. See GEN_PROPS, STR_PROPS, MIRNA_PROPS, and MIRNA_STR_PROPS

set_flag(flag_key: str, flag_val: Optional[Union[float, int, str, bool]], allow_undefined_flags: Optional[bool] = None) None

Set the value of record flag_key to flag_val.

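Parameters
  • flag_key (str) -- Key of the flag to set.

  • flag_val (float, int, str, bool, or None) -- Value to assign to the flag.

  • allow_undefined_flags (bool, optional) -- If True, allow a flag not defined in ALL_FLAGS or HybRecord.settings['custom_flags'] to be set.
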
get_seg1_type(require: bool = False) Optional[str]

Return the seg1_type flag if defined, or return None.

Parameters

require (bool, optional) -- If True, raise an error if seg1_type is not defined.

get_seg2_type(require: bool = False) Optional[str]

Return the seg2_type flag if defined, or return None.

Parameters

require (bool, optional) -- If True, raise an error if seg2_type is not defined.

get_seg_types(require: bool = False) Tuple[Optional[str], Optional[str]]

Return "seg1_type" (or None), "seg2_type" (or None) flags.

Return a tuple of the seg1_type and seg2_type flags for each respective flag that is defined, or None for each flag that is not.

Parameters

require (bool, optional) -- If True, raise an error if either flag is not defined.

get_read_count(require: bool = False) Optional[int]

Return the read_count flag if defined, otherwise return None.

Parameters

require (bool, optional) -- If True, raise an error if the "read_count" flag is not defined.

get_record_count(require: bool = False) int

Return count_total flag if defined, or return 1 (this record).

Parameters

require (bool, optional) -- If True, raise an error if the "count_total" flag is not defined.

get_mirna_props(allow_mirna_dimers: bool = False, require: bool = True) Optional[Dict]

Return the seg_props dict corresponding to the miRNA segment, if set.

If eval_mirna() has been run, return the seg_props dict corresponding to the miRNA segment type as determined by checking the miRNA_seg flag, or None if the record does not contain a miRNA.

Parameters
  • allow_mirna_dimers (bool, optional) -- If True, consider miRNA dimers as a miRNA/target pair and return the 5p miRNA segment properties.

  • require (bool, optional) -- If True, raise an error if the read does not contain a miRNA-annotated segment (Default: True).

get_target_props(allow_mirna_dimers: bool = False, require: bool = True) Optional[Dict]

Return the seg_props dict corresponding to the target segment, if set.

If eval_mirna() has been run, return the seg_props dict corresponding to the target segment, determined by checking the miRNA_seg flag and returning the other (non-miRNA) segment. Return None if the record does not contain a miRNA or contains two miRNAs.

Parameters
  • allow_mirna_dimers (bool, optional) -- If True, consider miRNA dimers as a miRNA/target pair and return the 3p miRNA segment properties as the arbitrarily-selected "target" of the dimer pair.

  • require (bool, optional) -- If True, raise an error if the read does not contain a single target-annotated segment (Default: True).
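
The selection rules of get_mirna_props() and get_target_props() can be summarized in a plain-Python sketch. The select_mirna_and_target function is hypothetical; it takes two booleans instead of reading the miRNA_seg flag, does not model the require parameter, and treats a dimer without allow_mirna_dimers as having no pair, which is an assumption for the miRNA side.

```python
# Illustrative only: select_mirna_and_target is a hypothetical helper, not a hybkit function.
def select_mirna_and_target(seg1_props, seg2_props, seg1_is_mirna, seg2_is_mirna,
                            allow_mirna_dimers=False):
    """Return (mirna_props, target_props), or (None, None) if no pair is defined."""
    if seg1_is_mirna and seg2_is_mirna:
        if allow_mirna_dimers:
            # The 5p segment is treated as the miRNA, the 3p segment as the "target".
            return seg1_props, seg2_props
        return None, None
    if seg1_is_mirna:
        return seg1_props, seg2_props
    if seg2_is_mirna:
        return seg2_props, seg1_props
    return None, None  # no miRNA in the record


seg1 = {'ref_name': 'MIMAT0000078_MirBase_miR-23a_microRNA'}
seg2 = {'ref_name': 'ENSG00000188229_ENST00000340384_TUBB2C_mRNA'}
mirna, target = select_mirna_and_target(seg1, seg2, True, False)
print(mirna['ref_name'], target['ref_name'])
```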

eval_types(allow_unknown: Optional[bool] = None) None

Find the types of each segment using the TypeFinder class.

This method provides HybRecord.seg1_props and HybRecord.seg2_props to the TypeFinder class, linked as attribute HybRecord.TypeFinder. It uses the TypeFinder.find method, as set by TypeFinder.set_method or TypeFinder.set_custom_method, to set the seg1_type and seg2_type flags if they are not already set.

To use a type-finding method other than the default, prepare the TypeFinder class by setting TypeFinder.params and calling TypeFinder.set_method.

Parameters

allow_unknown (bool, optional) -- If True, allow segment types that cannot be identified and set them as "unknown". Otherwise, raise an error. If not provided, uses the setting in settings['allow_unknown_seg_types'].

set_fold_record(fold_record: Union[FoldRecord, Tuple[FoldRecord, Any]], allow_energy_mismatch: bool = False) None

Check and set provided fold_record (FoldRecord) as attribute fold_record.

Ensures that the fold_record argument is an instance of FoldRecord and that its sequence matches that of this HybRecord, then sets it as HybRecord.fold_record.

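Parameters
  • fold_record (FoldRecord) -- FoldRecord instance to set as fold_record.

  • allow_energy_mismatch (bool, optional) -- If True, allow the energy of the fold_record to differ from the energy of this HybRecord.
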
eval_mirna(override: bool = False, mirna_types: Optional[bool] = None) None

Analyze and set miRNA properties from type properties in the hyb record.

If not already done, determine whether a miRNA exists within this record and set the miRNA_seg flag. This evaluation requires the seg1_type and seg2_type flags to be populated, which can be performed by the eval_types() method.

Parameters
  • override (bool, optional) -- If True, override existing miRNA_seg flag if present.

  • mirna_types (list, tuple, or set, optional) -- Iterable of strings representing sequence types considered as miRNA. If not provided, the types in settings['mirna_types'] are used. Providing a set is suggested for fastest checking.

mirna_details(detail: Literal['all', 'mirna_ref', 'target_ref', 'mirna_seg_type', 'target_seg_type', 'mirna_seq', 'target_seq', 'mirna_fold', 'target_fold'] = 'all', allow_mirna_dimers: bool = False) Optional[Union[Dict, str]]

Provide a detail about the miRNA or target following eval_mirna().

Analyze miRNA properties within the sequence record and provide a detail as output. Unless allow_mirna_dimers is True, this method requires the record to contain a non-dimer miRNA; otherwise an error will be raised.

Parameters
  • detail (str) --

    Type of detail to return. Options include:
    all : Dict of all properties (default)
    mirna_ref : Identifier for Assigned miRNA
    target_ref : Identifier for Assigned Target
    mirna_seg_type : Assigned seg_type of miRNA
    target_seg_type : Assigned seg_type of target
    mirna_seq : Annotated subsequence of miRNA
    target_seq : Annotated subsequence of target
    mirna_fold : Annotated fold substring of miRNA (requires fold_record set)
    target_fold : Annotated fold substring of target (requires fold_record set)

  • allow_mirna_dimers (bool, optional) -- Allow miRNA/miRNA dimers. The 5p-position will be assigned as the "miRNA", and the 3p-position will be assigned as the "target".

mirna_detail(*args, **kwargs)

Deprecated alias for mirna_details().

Deprecated since version v0.3.0.

is_set(prop: str) bool

Return True if HybRecord property "prop" is set (if relevant) and is not None.

Options described in SET_PROPS.

Parameters

prop (str) -- Property / Analysis to check

not_set(prop: str) bool

Return False if HybRecord property "prop" is set (if relevant) and is not None.

(Returns not is_set(prop).)

Parameters

prop (str) -- Property / Analysis to check

prop(prop: str, prop_compare: Optional[str] = None) bool

Return True if HybRecord has property: prop.

Check property against list of allowed properties in HAS_PROPS. If query property has a string comparator, provide this in prop_compare. Raises an error if a prerequisite field is not set (use is_set() to check whether properties are set).

Specific properties available to check are described in attributes:

GEN_PROPS

General Record Properties

STR_PROPS

Field String Comparison Properties

MIRNA_PROPS

miRNA-Associated Record Properties

MIRNA_STR_PROPS

miRNA-Associated String Comparison Properties

Parameters
  • prop (str) -- Property to check

  • prop_compare (str, optional) -- Comparator to check.

has_prop(*args, **kwargs)

Return True if HybRecord has property: prop.

Deprecated since version v0.3.0: Use prop() instead.

to_line(newline: bool = True, sep: str = '\t') str

Return a hyb-format string representation of the record.

Parameters
  • newline (bool, optional) -- Terminate returned string with a newline (default: True)

  • sep (str, optional) -- Separator between fields (Default: "\t")
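
The effect of the newline and sep parameters can be sketched with a small stand-in that joins field values into a line. The fields_to_line helper is hypothetical and not part of hybkit; it assumes missing values are written with the default hyb_placeholder "." from the class settings, and it does not cover formatting of the flags column.

```python
# Illustrative only: fields_to_line is a hypothetical helper, not a hybkit function.
def fields_to_line(values, newline=True, sep='\t', placeholder='.'):
    """Join field values into one delimited line, writing None as the placeholder."""
    parts = [placeholder if value is None else str(value) for value in values]
    line = sep.join(parts)
    return line + '\n' if newline else line


values = ['2407_718', 'ATC', None, 'MIMAT0000078_MirBase_miR-23a_microRNA', 1, 21]
print(repr(fields_to_line(values)))
print(repr(fields_to_line(values, newline=False, sep=',')))
```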

to_csv(newline: bool = False) str

Return a comma-separated hyb-format string representation of the record.

Parameters

newline (bool, optional) -- If True, end the returned string with a newline.

to_fields(missing_obj: Optional[Union[float, int, str, bool]] = None) dict

Return a python dictionary representation of the record.

Returns a dictionary with keys corresponding to the fields in the hyb-format file, and values corresponding to the values in the record. Output can be used with the pandas DataFrame constructor or csv.DictWriter.

Parameters

missing_obj (optional) -- Object to use for missing values. Default = None.
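
As a sketch of the csv.DictWriter use mentioned above, the following standard-library example writes one dictionary of record fields to an in-memory CSV. The record_fields dict is a made-up stand-in for the output of to_fields(); the field names are copied from the to_fields_header() signature in this document.

```python
import csv
import io

# Column names as listed in the to_fields_header() signature.
FIELD_NAMES = [
    'id', 'seq', 'energy',
    'seg1_ref_name', 'seg1_read_start', 'seg1_read_end',
    'seg1_ref_start', 'seg1_ref_end', 'seg1_score',
    'seg2_ref_name', 'seg2_read_start', 'seg2_read_end',
    'seg2_ref_start', 'seg2_ref_end', 'seg2_score',
    'flags',
]

# Stand-in for the dict that hyb_record.to_fields() would return (values are made up).
record_fields = dict.fromkeys(FIELD_NAMES)  # every field starts as None (missing)
record_fields.update({'id': '1_100', 'seq': 'ACTG'})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELD_NAMES)
writer.writeheader()
writer.writerow(record_fields)  # csv writes None values as empty cells
csv_text = buffer.getvalue()
print(csv_text)
```

Passing a different missing_obj to to_fields() would change what appears in place of the empty cells.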

to_fasta_record(mode: Literal['hybrid', 'seg1', 'seg2', 'mirna', 'target'] = 'hybrid', annotate: bool = True, allow_mirna_dimers: bool = False) None

Return nucleotide sequence as BioPython SeqRecord object.

Parameters
  • mode (str, optional) --

    Determines which sequence component to return. Options:
    hybrid: Entire hybrid sequence (default)
    seg1: Sequence 1 (if defined)
    seg2: Sequence 2 (if defined)
    mirna: miRNA sequence of miRNA/target pair (if defined, else None)
    target: Target sequence of miRNA/target pair (if defined, else None)

  • annotate (bool, optional) -- Add name of components to fasta sequence identifier if present.

  • allow_mirna_dimers (bool, optional) -- If True, allow miRNA dimers to be returned as the miRNA sequence (the 5p segment will be selected as the "miRNA").

to_fasta_str(mode: Literal['hybrid', 'seg1', 'seg2', 'mirna', 'target'] = 'hybrid', annotate: bool = True) str

Return nucleotide sequence as a fasta string.

Parameters
  • mode (str, optional) -- Determines which sequence component to return. Options are the same as for the to_fasta_record() method.

  • annotate (bool, optional) -- Add name of components to fasta sequence identifier if present.

classmethod from_line(line: str, hybformat_id: bool = False, hybformat_ref: bool = False) Self

Construct a HybRecord instance from a single-line hyb-format string.

The Hyb software package ([Travis2014]) records read-count information in the "id" field of the record, which can be read by setting hybformat_id=True. Additionally, the Hyb hOH7 database contains the segment type in the identifier of each reference in the 4th field, which can be read by setting hybformat_ref=True.

Parameters
  • line (str) -- hyb-format string containing record information.

  • hybformat_id (bool, optional) -- If True, read count information from identifier in <read_number>_<read_count> format.

  • hybformat_ref (bool, optional) -- If True, read additional record information from identifier in <gene_id>_<transcript_id>_<gene_name>_<seg_type> format.

Returns

HybRecord instance containing record information.
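
The two identifier layouts read by hybformat_id and hybformat_ref can be illustrated by splitting the identifiers from the example line. Both helper functions are hypothetical and are not hybkit's parser; in particular, allowing underscores inside the gene name is a design choice of this sketch.

```python
# Illustrative only: both helpers are hypothetical, not hybkit functions.
def parse_hybformat_id(record_id):
    """Split '<read_number>_<read_count>' into (read_number, read_count)."""
    read_number, read_count = record_id.split('_')
    return read_number, int(read_count)


def parse_hybformat_ref(ref_name):
    """Split '<gene_id>_<transcript_id>_<gene_name>_<seg_type>' into a dict."""
    gene_id, transcript_id, remainder = ref_name.split('_', 2)
    # Split the segment type off the right so gene names may contain underscores.
    gene_name, seg_type = remainder.rsplit('_', 1)
    return {'gene_id': gene_id, 'transcript_id': transcript_id,
            'gene_name': gene_name, 'seg_type': seg_type}


print(parse_hybformat_id('2407_718'))
print(parse_hybformat_ref('ENSG00000188229_ENST00000340384_TUBB2C_mRNA'))
```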

classmethod from_fasta_records(seg1_record: None, seg2_record: None, hyb_id: Optional[str] = None, energy: Optional[Union[float, int, str]] = None, flags: Optional[Dict[str, Any]] = None) Self

Construct a HybRecord instance from two BioPython SeqRecord Objects.

Create an artificial HybRecord from two SeqRecord objects. For the hybrid:

id: [seg1_record.id]--[seg2_record.id] (overwritten by the "hyb_id" parameter if provided)
seq: seg1_record.seq + seg2_record.seq

For each segment:

FASTA_Sequence_ID -> segN_ref_name
FASTA_Description -> Flags: segN_det (Overwritten if segN_det flag is provided directly)

Optional fields to add via function arguments:

hyb_id
energy
flags

Parameters
  • seg1_record (SeqRecord) -- Biopython SeqRecord object containing information on the left/first/5p hybrid segment (seg1)

  • seg2_record (SeqRecord) -- Biopython SeqRecord object containing information on the right/second/3p hybrid segment (seg2)

  • hyb_id (str, optional) -- Identifier for the hyb record (overwrites generated id if provided)

  • energy (str or float, optional) -- Predicted energy of sequence folding in kcal/mol

  • flags (dict, optional) -- Dict with keys of flags for the record and their associated values. Any flags provided overwrite default-generated flags.

Returns

HybRecord instance containing record information.
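
The mapping from two FASTA entries to hybrid fields described above can be sketched without Biopython. Here FastaEntry is a hypothetical namedtuple standing in for a SeqRecord, and combine_entries is not a hybkit function; only the id, seq, ref_name, and segN_det rules stated in this section are modeled.

```python
from collections import namedtuple

# Hypothetical stand-in for a Biopython SeqRecord (id, description, seq only).
FastaEntry = namedtuple('FastaEntry', ['id', 'description', 'seq'])


def combine_entries(seg1, seg2, hyb_id=None):
    """Build hybrid fields from two entries, following the rules listed above."""
    record_id = hyb_id if hyb_id is not None else '%s--%s' % (seg1.id, seg2.id)
    return {
        'id': record_id,
        'seq': str(seg1.seq) + str(seg2.seq),
        'seg1_ref_name': seg1.id,
        'seg2_ref_name': seg2.id,
        'flags': {'seg1_det': seg1.description, 'seg2_det': seg2.description},
    }


# Made-up entries for illustration.
entry1 = FastaEntry('mirA', 'example miRNA', 'ACTG')
entry2 = FastaEntry('geneB', 'example target', 'CTAG')
combined = combine_entries(entry1, entry2)
print(combined['id'], combined['seq'])
```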

classmethod to_fields_header() Literal['id', 'seq', 'energy', 'seg1_ref_name', 'seg1_read_start', 'seg1_read_end', 'seg1_ref_start', 'seg1_ref_end', 'seg1_score', 'seg2_ref_name', 'seg2_read_start', 'seg2_read_end', 'seg2_ref_start', 'seg2_ref_end', 'seg2_score', 'flags']

Return a list of the fields in a HybRecord object.

For use with the to_fields() method.

classmethod to_csv_header(newline: bool = False) Literal['id,seq,energy,seg1_ref_name,seg1_read_start,seg1_read_end,seg1_ref_start,seg1_ref_end,seg1_score,seg2_ref_name,seg2_read_start,seg2_read_end,seg2_ref_start,seg2_ref_end,seg2_score,flags']

Return a comma-separated string representation of the fields in the record.

For use with the to_csv() method.

Parameters

newline (bool, optional) -- If True, end the returned string with a newline.

HybFile Class

class hybkit.HybFile(path: str, *args: Any, hybformat_id: Optional[bool] = None, hybformat_ref: Optional[bool] = None, from_file_like: bool = False, **kwargs: Any)

Wrapper for a hyb-format text file which returns entries (lines) as HybRecord objects.

Parameters
  • path (str) -- Path to text file to open as hyb-format file.

  • *args -- Arguments passed to open() function to open a text file for reading/writing.

  • hybformat_id (bool, optional) -- If True, parse read-count information from the identifier (<read_number>_<read_count> format) during line parsing. Defaults to the value in settings['hybformat_id'].

  • hybformat_ref (bool, optional) -- If True, parse additional record information from the reference identifier (<gene_id>_<transcript_id>_<gene_name>_<seg_type> format) during line parsing. Defaults to the value in settings['hybformat_ref'].

  • from_file_like (bool, optional) -- If True, the first argument is treated as a file-like object (such as io.StringIO or gzip.GzipFile) and the remaining positional arguments are ignored. (Default: False)

  • **kwargs -- Keyword arguments passed to open() function to open a text file for reading/writing.

Variables
  • hybformat_id (bool) -- Read count information from identifier during line parsing

  • hybformat_ref (bool) -- Read type information from reference name during line parsing

  • fh (file) -- Underlying file handle for the HybFile object.

settings = {'hybformat_id': False, 'hybformat_ref': False}

Class-level settings. See hybkit.settings.HybFile_settings_info for descriptions.

close() None

Close the file.

read_record() str

Return next line of hyb file as HybRecord object.

read_records() List[str]

Return list of all (remaining) records in hyb file as HybRecord objects.

write_record(write_record: HybRecord) None

Write a HybRecord object to file as a Hyb-format string.

Unlike the file.write() method, this method will add a newline to the end of each written record line.

Parameters

write_record (HybRecord) -- Record to write.

write_records(write_records: Iterable[HybRecord]) None

Write a sequence of HybRecord objects as hyb-format lines to the Hyb file.

Unlike the file.writelines() method, this method will add a newline to the end of each written record line.

Parameters

write_records (list) -- List of HybRecord objects to write.

write_fh(*args, **kwargs) None

Write directly to the underlying file handle.

write(*_args, **_kwargs) None

Implement no-op / error for "write" method to catch errors.

Use write_record() or write_fh() instead.

classmethod open(path: str, *args: Any, hybformat_id: Optional[bool] = None, hybformat_ref: Optional[bool] = None, **kwargs: Any) Self

Open a path to a text file using open() and return a HybFile object.

Arguments match those of the Python3 built-in open() function and are passed directly to it.

This method is provided as a convenience function for drop-in replacement of the built-in open() function.

Specific keyword arguments are provided for HybFile-specific settings:

Parameters
  • path (str) -- Path to file to open.

  • hybformat_id (bool, optional) -- If True, parse read-count information from the identifier (<read_number>_<read_count> format) during line parsing. Defaults to the value in settings['hybformat_id'].

  • hybformat_ref (bool, optional) -- If True, parse additional record information from the reference identifier (<gene_id>_<transcript_id>_<gene_name>_<seg_type> format) during line parsing. Defaults to the value in settings['hybformat_ref'].

Example usage:

with hybkit.HybFile.open('path/to/file.hyb', 'r') as hyb_file:
    for record in hyb_file:
        print(record)

Parameters
  • *args -- Passed directly to open().

  • **kwargs -- Passed directly to open().

Returns

HybFile object.

FoldRecord Class

class hybkit.FoldRecord(id: str, seq: str, fold: str, energy: Optional[Union[float, int, str]] = None, seq_type: Optional[Literal['static', 'dynamic']] = None)

Class for storing secondary structure (folding) information for a nucleotide sequence.

This class supports the following file types: (Data courtesy of [Gay2018])

  • The ".vienna" file format used by the ViennaRNA package ([ViennaFormat]; [Lorenz2011]):
    Example:
    34_151138_MIMAT0000076_MirBase_miR-21_microRNA_1_19-...
    TAGCTTATCAGACTGATGTTAGCTTATCAGACTGATG
    .....((((((.((((((......)))))).))))))   (-11.1)
    
  • The ".ct" file format used by UNAFold and other packages ([CTFormat], [Zuker2003]):
    Example:
    41        dG = -8 dH = -93.9      seq1_name-seq2_name
    1 A       0       2       0       1       0       0
    2 G       1       3       0       2       0       0
    ...
    ...
    ...
    40        G       39      41      11      17      39      41
    41        T       40      0       10      18      40      0
    

The minimum data required for a FoldRecord object is a sequence identifier, a genomic sequence, and its fold representation.

Two types of FoldRecord objects are supported, 'static' and 'dynamic'. Static FoldRecord objects are those where the 'seq' attribute exactly matches the corresponding HybRecord.seq attribute (where applicable). Dynamic FoldRecord objects are those where FoldRecord.seq is reconstructed from the aligned regions of a HybRecord.seq chimeric read: longer for chimeras with overlapping alignments, shorter for chimeras with gapped alignments.

Overlapping Alignment Example:

Static:
seg1: 1111111111111111111111
seg2:                   222222222222222222222
seq:  TAGCTTATCAGACTGATGTTTTAGCTTATCAGACTGATG

Dynamic:
seg1: 1111111111111111111111
seg2:                       222222222222222222222
seq:  TAGCTTATCAGACTGATGTTTTTTTTAGCTTATCAGACTGATG

Gapped Alignment Example:

Static:
seg1:   1111111111111111
seg2:                     222222222222222222
seq:  TTAGCTTATCAGACTGATGTTAGCTTATCAGACTGATG

Dynamic:
seg1: 1111111111111111
seg2:                 222222222222222222
seq:  AGCTTATCAGACTGATTAGCTTATCAGACTGATG
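
The dynamic sequences in the diagrams above can be reproduced by joining the two aligned regions of the static sequence. The build_dynamic_seq helper is hypothetical and not part of hybkit, and the segment positions used below are read off the gapped diagram and assumed to be 1-based and inclusive, like read_start and read_end in the hyb format.

```python
# Illustrative only: build_dynamic_seq is a hypothetical helper, not a hybkit function.
def build_dynamic_seq(seq, seg1_start, seg1_end, seg2_start, seg2_end):
    """Join the two aligned regions of seq, given 1-based inclusive positions."""
    seg1_part = seq[seg1_start - 1:seg1_end]
    seg2_part = seq[seg2_start - 1:seg2_end]
    return seg1_part + seg2_part


# Gapped alignment example: seg1 covers positions 3-18, seg2 covers 21-38.
gapped_static = 'TTAGCTTATCAGACTGATGTTAGCTTATCAGACTGATG'
gapped_dynamic = build_dynamic_seq(gapped_static, 3, 18, 21, 38)
print(gapped_dynamic)
```

With overlapping segments the overlapped bases are included twice, which is why the dynamic sequence is longer than the static one in that case.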

Dynamic sequences are found in the Hyb program *_hybrids_ua.hyb file type. This is primarily relevant for error-checking when calling the HybRecord.set_fold_record() method.

When the 'static' FoldRecord type is used, count_hyb_record_mismatches() uses static_count_hyb_record_mismatches() for HybRecord.fold_record error-checking.

When the 'dynamic' FoldRecord type is used, count_hyb_record_mismatches() uses dynamic_count_hyb_record_mismatches() for HybRecord.fold_record error-checking.

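Parameters
  • id (str) -- Sequence identifier (often seg1name-seg2name)

  • seq (str) -- Genomic sequence

  • fold (str) -- Dot-bracket fold representation, '(', '.', and ')' characters

  • energy (str or float, optional) -- Predicted energy of folding

  • seq_type (str, optional) -- Whether the sequence is 'static' or 'dynamic'. If not provided, defaults to the value in settings['seq_type'].
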
Variables
  • id (str) -- Sequence Identifier (often seg1name-seg2name)

  • seq (str) -- Genomic Sequence

  • fold (str) -- Dot-bracket Fold Representation, '(', '.', and ')' characters

  • energy (str) -- Predicted energy of folding

  • seq_type (str) -- Whether sequence is 'static' or 'dynamic' (Default: 'static'; see Args for details)

settings = {'allowed_mismatches': 0, 'error_mode': 'raise', 'fold_placeholder': '.', 'seq_type': 'static'}

Class-level settings. See hybkit.settings.FoldRecord_settings_info for descriptions.

to_vienna_lines(newline: bool = True) List[str]

Return a list of lines for the record in vienna format.

See (Vienna File Format).

Parameters

newline (bool, optional) -- Add newline character to the end of each returned line. (Default: True)

to_vienna_string(newline: bool = True) str

Return a 3-line string for the record in vienna format.

See (Vienna File Format).

Parameters

newline (bool, optional) -- Terminate the returned string with a newline character. (Default: True)

count_hyb_record_mismatches(hyb_record: HybRecord) int

Count mismatches between hyb_record.seq and fold_record.seq.

Uses static_count_hyb_record_mismatches() if seq_type is static, or dynamic_count_hyb_record_mismatches() if seq_type is dynamic.

Parameters

hyb_record (HybRecord) -- hyb_record for comparison.

static_count_hyb_record_mismatches(hyb_record: HybRecord) int

Count mismatches between hyb_record.seq and fold_record.seq.

Parameters

hyb_record (HybRecord) -- hyb_record for comparison.
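
One simple way to count mismatches between two sequences is a position-by-position comparison, sketched below. This is an assumption about what a mismatch count could look like and is not hybkit's algorithm; the helper names are hypothetical, and counting each unpaired trailing base as one mismatch is a choice made for this sketch.

```python
from itertools import zip_longest


# Illustrative only: these helpers are hypothetical, not hybkit functions.
def count_mismatches(seq_a, seq_b):
    """Count positions where the sequences differ; extra bases count as mismatches."""
    return sum(1 for base_a, base_b in zip_longest(seq_a, seq_b) if base_a != base_b)


def sequences_match(seq_a, seq_b, allowed_mismatches=0):
    """Return True if the mismatch count is <= allowed_mismatches."""
    return count_mismatches(seq_a, seq_b) <= allowed_mismatches


print(count_mismatches('ACTG', 'ACTA'))
print(sequences_match('ACTG', 'ACTA', allowed_mismatches=1))
```

The default of 0 allowed mismatches mirrors the 'allowed_mismatches': 0 entry in the FoldRecord class settings.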

dynamic_count_hyb_record_mismatches(hyb_record: HybRecord) int

Count mismatches between hyb_record.seq and dynamic fold_record.seq.

Parameters

hyb_record (HybRecord) -- hyb_record for comparison

matches_hyb_record(hyb_record: HybRecord, allowed_mismatches: Optional[int] = None) bool

Return True if self.seq and hyb_record.seq mismatches are <= allowed_mismatches.

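Parameters
  • hyb_record (HybRecord) -- hyb_record for comparison.

  • allowed_mismatches (int, optional) -- Number of mismatches allowed. If not provided, defaults to the value in settings['allowed_mismatches'].
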
ensure_matches_hyb_record(hyb_record: HybRecord, allowed_mismatches: Optional[int] = None) None

Ensure self.seq matches hyb_record.seq, else raise an error.

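Parameters
  • hyb_record (HybRecord) -- hyb_record for comparison.

  • allowed_mismatches (int, optional) -- Number of mismatches allowed. If not provided, defaults to the value in settings['allowed_mismatches'].
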
classmethod from_vienna_lines(record_lines: List[str], error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Construct instance from a list of 3 strings of vienna-format ([ViennaFormat]) lines.

See Vienna File Format for more details.

Parameters

record_lines (list or tuple) -- Iterable of 3 strings corresponding to lines of a vienna-format record.

classmethod from_vienna_string(record_string: str, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Construct instance from a string representing 3 vienna-format ([ViennaFormat]) lines.

See Vienna File Format for more details.

Parameters

record_string (str) -- 3-line string containing a vienna-format record.
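The sketch below shows one way a 3-line record string could be split into its parts. It is not hybkit's parser: parse_vienna_string_sketch is a hypothetical helper, the fold-line layout (dot-bracket, whitespace, energy in parentheses) is an assumption, and the ('NOFOLD', ...) tuple only imitates the shape of the documented return annotation.

```python
import re
from typing import Dict, Tuple, Union


def parse_vienna_string_sketch(
        record_string: str) -> Union[Dict[str, object], Tuple[str, str]]:
    """Split a 3-line vienna-style string into id, seq, fold, and energy.

    Hypothetical helper, not part of hybkit.
    """
    lines = record_string.strip('\n').split('\n')
    if len(lines) != 3:
        raise ValueError(f'expected 3 lines, got {len(lines)}')
    record_id, seq, fold_line = (line.strip() for line in lines)
    # Assumed layout: "<dot-bracket><whitespace>(<energy>)".
    match = re.fullmatch(r'([.()]+)\s+\(\s*(-?\d+(?:\.\d+)?)\s*\)', fold_line)
    if match is None:
        # Sentinel tuple modeled on the documented return type (assumed trigger).
        return ('NOFOLD', record_string)
    return {
        'id': record_id,
        'seq': seq,
        'fold': match.group(1),
        'energy': float(match.group(2)),
    }


parsed = parse_vienna_string_sketch('example_id\nGGGAAACCC\n(((...)))\t(-1.2)\n')
unparsed = parse_vienna_string_sketch('example_id\nGGGAAACCC\nno fold here\n')
```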

classmethod from_ct_lines(record_lines: List[str], error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Create a FoldRecord from a list of record lines in ".ct" format ([CTFormat]).

See CT File Format for more details.

Warning

This method is in beta and is not well tested.

Parameters

record_lines (list or tuple) -- List containing the lines of a ct record.

classmethod from_ct_string(record_string: str, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Create a FoldRecord entry from a multi-line string from ".ct" format ([CTFormat]).

See CT File Format for more details.

Warning

This method is in beta and is not well tested.

Parameters

record_string (str) -- String containing the lines of a ct record.

ViennaFile Class

class hybkit.ViennaFile(*args: Any, seq_type: Optional[Literal['static', 'dynamic']] = None, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, from_file_like: bool = False, **kwargs: Any)

Vienna file wrapper that returns vienna-format file lines as FoldRecord objects.

See Vienna File Format for more information.

Parameters
  • seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).

  • error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.

  • from_file_like (bool, optional) -- If True, treat the first argument as a file-like object (such as io.StringIO or gzip.GzipFile) and ignore the remaining positional arguments. (Default: False)

  • *args -- Passed to open().

  • **kwargs -- Passed to open().

Variables
  • fh (file) -- File handle for the file being wrapped.

  • foldrecord_seq_type (str) -- Type of FoldRecord to return (see Parameters).

  • error_mode (str) -- Mode for error catching (see Parameters).

Warning

Fold files are occasionally poorly formatted. This iterator attempts to catch such errors, but does not always succeed, so verbose error modes are encouraged.
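The three error modes described above can be sketched as a small dispatch function. This is an illustration under stated assumptions, not hybkit's code: handle_record_error_sketch is a hypothetical name, and the exception and warning types actually used by hybkit are not documented here.

```python
import warnings


def handle_record_error_sketch(message: str, error_value: object,
                               error_mode: str = 'raise') -> object:
    """Apply one of the documented error modes to a record-level problem.

    Hypothetical helper, not part of hybkit.
    """
    if error_mode == 'raise':
        # Assumed exception type; chosen only for the example.
        raise RuntimeError(message)
    if error_mode == 'warn_return':
        warnings.warn(message)
        return error_value
    if error_mode == 'return':
        return error_value
    raise ValueError(f'unknown error_mode: {error_mode!r}')


# 'return' hands back the error value silently.
value = handle_record_error_sketch('missing fold', ('NOFOLD', 'raw text'),
                                   error_mode='return')
```

Using 'raise' or 'warn_return' while developing a pipeline makes malformed records visible instead of letting error values pass through unnoticed.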

read_record(override_error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None) Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Read next three lines and return output as FoldRecord object.

Parameters

override_error_mode (str) -- Override the error_mode set in the ViennaFile object. See the ViennaFile Constructor for more information on allowed error modes.
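Reading records three lines at a time from a file-like object can be sketched as follows. The generator below works on any text handle, such as io.StringIO; it is a hypothetical stand-in that yields raw line lists rather than FoldRecord objects, and its treatment of a truncated final record is an assumption.

```python
import io
from typing import Iterator, List, TextIO


def iter_three_line_records_sketch(fh: TextIO) -> Iterator[List[str]]:
    """Yield consecutive groups of three lines from a text handle.

    Hypothetical helper, not part of hybkit.
    """
    while True:
        lines = [fh.readline() for _ in range(3)]
        if not lines[0]:
            return  # clean end of file
        if not all(lines):
            # Assumed behavior: an incomplete final record is an error.
            raise ValueError('truncated record at end of file')
        yield [line.rstrip('\n') for line in lines]


handle = io.StringIO(
    'id1\nACGU\n....\t(0.0)\n'
    'id2\nGGGAAACCC\n(((...)))\t(-1.2)\n'
)
records = list(iter_three_line_records_sketch(handle))
```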

close() None

Close the file handle.

classmethod open(path: str, *args: Any, **kwargs: Any) Self

Open a path to a text file using open() and return relevant file object.

Arguments match those of the Python3 built-in open() function and are passed directly to it.

This method is provided as a convenience function for drop-in replacement of the built-in open() function.

Specific keyword arguments are provided for fold-file-specific settings:

Parameters
  • path (str) -- Path to file to open.

  • seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).

  • error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.

  • *args -- Passed directly to open().

  • **kwargs -- Passed directly to open().

Returns

ViennaFile object.

read_records() List[FoldRecord]

Return list of all FoldRecord objects for this file type.

settings = {}

Class-level settings. See hybkit.settings.FoldFile_settings_info for descriptions.

write_fh(*args: Any, **kwargs: Any) None

Write directly to the underlying file handle.

write_record(write_record: FoldRecord) None

Write a FoldRecord object for this file type.

Unlike the file.write() method, this method will add a newline to the end of each written record line.

Parameters

write_record (FoldRecord) -- FoldRecord object to write.

write_records(write_records: Iterable[FoldRecord]) None

Write a sequence of FoldRecord objects for this file type.

Unlike the file.writelines() method, this method will add a newline to the end of each written record line.

Parameters

write_records (iterable) -- Iterable of FoldRecord objects to write.
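The newline behavior can be illustrated with a plain text handle. In this sketch each record is represented as a list of line strings without trailing newlines; write_record_lines_sketch is a hypothetical helper, not the hybkit method.

```python
import io
from typing import Iterable, List, TextIO


def write_record_lines_sketch(fh: TextIO, records: Iterable[List[str]]) -> None:
    """Write every line of every record, appending a newline to each line.

    Hypothetical helper, not part of hybkit.
    """
    for record_lines in records:
        for line in record_lines:
            fh.write(line + '\n')


out = io.StringIO()
write_record_lines_sketch(out, [
    ['id1', 'ACGU', '....\t(0.0)'],
    ['id2', 'GGGAAACCC', '(((...)))\t(-1.2)'],
])
written = out.getvalue()
```

By contrast, the built-in writelines() writes its strings exactly as given, so the caller would have to supply the line endings.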

CtFile Class

class hybkit.CtFile(*args: Any, seq_type: Optional[Literal['static', 'dynamic']] = None, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, from_file_like: bool = False, **kwargs: Any)

Ct file wrapper that returns ".ct" file lines as FoldRecord objects.

See CT File Format for more information.

Warning

This class is in beta and is not well tested.

Parameters
  • seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).

  • error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.

  • from_file_like (bool, optional) -- If True, treat the first argument as a file-like object (such as io.StringIO or gzip.GzipFile) and ignore the remaining positional arguments. (Default: False)

  • *args -- Passed to open().

  • **kwargs -- Passed to open().

Variables
  • fh (file) -- File handle for the file being wrapped.

  • foldrecord_seq_type (str) -- Type of FoldRecord to return (see Parameters).

  • error_mode (str) -- Mode for error catching (see Parameters).

Warning

Fold files are occasionally poorly formatted. This iterator attempts to catch such errors, but does not always succeed, so verbose error modes are encouraged.

read_record() Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Return the next CT record as a FoldRecord object.

Call next(self.fh) to return the first line of the next entry. Determine the expected number of following lines in the entry, and read that number of lines further. Return lines as a FoldRecord object.
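The header-then-body reading strategy described above can be sketched as follows. The sketch assumes that the first whitespace-separated token of the header line is the number of lines that follow, as in the common connectivity-table layout; read_ct_record_lines_sketch is a hypothetical helper that returns raw lines rather than a FoldRecord.

```python
from typing import Iterator, List


def read_ct_record_lines_sketch(line_iter: Iterator[str]) -> List[str]:
    """Read one ct-style record: a header line plus the lines it announces.

    Hypothetical helper, not part of hybkit.
    """
    header = next(line_iter)  # StopIteration here signals the end of input
    # Assumption: the first token of the header is the count of body lines.
    expected = int(header.split()[0])
    body = []
    for _ in range(expected):
        try:
            body.append(next(line_iter))
        except StopIteration:
            raise ValueError('ct record ended before all lines were read') from None
    return [header] + body


ct_text = (
    '3 dG = -1.0 example\n'
    '1 G 0 2 3 1\n'
    '2 A 1 3 0 2\n'
    '3 C 2 0 1 3\n'
    '2 dG = 0.0 second\n'
    '1 A 0 2 0 1\n'
    '2 U 1 0 0 2\n'
)
ct_lines = iter(ct_text.splitlines())
first_record = read_ct_record_lines_sketch(ct_lines)
second_record = read_ct_record_lines_sketch(ct_lines)
```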

write_record = None

CtFile Record Writing Not Implemented

write_records = None

CtFile Record Writing Not Implemented

close() None

Close the file handle.

classmethod open(path: str, *args: Any, **kwargs: Any) Self

Open a path to a text file using open() and return relevant file object.

Arguments match those of the Python3 built-in open() function and are passed directly to it.

This method is provided as a convenience function for drop-in replacement of the built-in open() function.

Specific keyword arguments are provided for fold-file-specific settings:

Parameters
  • path (str) -- Path to file to open.

  • seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).

  • error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.

  • *args -- Passed directly to open().

  • **kwargs -- Passed directly to open().

Returns

CtFile object.

read_records() List[FoldRecord]

Return list of all FoldRecord objects for this file type.

settings = {}

Class-level settings. See hybkit.settings.FoldFile_settings_info for descriptions.

write_fh(*args: Any, **kwargs: Any) None

Write directly to the underlying file handle.

HybFoldIter Class

class hybkit.HybFoldIter(hybfile_handle: HybFile, foldfile_handle: FoldFile, combine: bool = False, iter_error_mode: Optional[Literal['raise', 'warn_return', 'warn_skip', 'skip', 'return']] = None)

Iterator for simultaneous iteration over a HybFile and FoldFile object.

This class provides an iterator that steps through a HybFile and either a ViennaFile or a CtFile simultaneously, returning a HybRecord and a FoldRecord at each step.

Basic error checking / catching is performed based on the value of the settings['iter_error_mode'] setting.

Parameters
Returns

(HybRecord, FoldRecord)

settings = {'error_checks': ['hybrecord_indel', 'foldrecord_nofold', 'max_mismatch', 'energy_mismatch'], 'iter_error_mode': 'warn_skip', 'max_sequential_skips': 100}

Class-level settings. See hybkit.settings.HybFoldIter_settings_info for descriptions.
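A rough sketch of paired iteration with a skip-on-error policy and a cap on consecutive skips. It is not hybkit's implementation: paired_iter_sketch and its is_valid callback are hypothetical, the records are plain strings, and how hybkit counts skips or reacts when max_sequential_skips is exceeded is an assumption here.

```python
from typing import Callable, Iterable, Iterator, Tuple


def paired_iter_sketch(hyb_records: Iterable[str],
                       fold_records: Iterable[str],
                       is_valid: Callable[[str, str], bool],
                       max_sequential_skips: int = 100,
                       ) -> Iterator[Tuple[str, str]]:
    """Yield (hyb, fold) pairs in step, skipping pairs that fail a check.

    Hypothetical helper, not part of hybkit.
    """
    sequential_skips = 0
    for hyb, fold in zip(hyb_records, fold_records):
        if is_valid(hyb, fold):
            sequential_skips = 0  # a good pair resets the counter
            yield hyb, fold
            continue
        sequential_skips += 1
        if sequential_skips > max_sequential_skips:
            # Assumed reaction: stop rather than silently skip indefinitely.
            raise RuntimeError('too many sequential skipped record pairs')


pairs = list(paired_iter_sketch(
    ['AA', 'CC', 'GG'],
    ['AA', 'CU', 'GG'],
    is_valid=lambda hyb, fold: hyb == fold,
))
```

Capping consecutive skips guards against the case where the two files have drifted out of step, in which every later pair would otherwise be skipped without any visible failure.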

report() List[str]

Return a report of information from iteration.

print_report() None

Print a report of information from iteration.