hybkit (module)

Module storing primary hybkit classes and hybkit API.

This module contains classes and methods for reading, writing, and manipulating data in the hyb genomic sequence format ([Travis2014]). For more information, see the hybkit Hyb File Specification.

An example string of a hyb-format line from [Gay2018] is:

2407_718\tATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC\t.\tMIMAT0000078_MirBase_miR-23a_microRNA\t1\t21\t1\t21\t0.0027\tENSG00000188229_ENST00000340384_TUBB2C_mRNA\t23\t49\t1181\t1207\t1.2e-06

Hybkit functionality is primarily based on classes for storage and evaluation of chimeric genomic sequences and associated fold-information:

`HybRecord`	Class to store a single hyb (hybrid) sequence record
`FoldRecord`	Class to store predicted RNA secondary structure information for hybrid reads

Also included are classes for reading, writing, and iterating over files containing hybrid information:

`HybFile`	Class for reading and writing hyb-format files [Travis2014] containing chimeric RNA sequence information as `HybRecord` objects
`ViennaFile`	Class for reading and writing Vienna-format files [ViennaFormat] containing RNA secondary structure information in dot-bracket format as `FoldRecord` objects
`CtFile`	-BETA- Class for reading Connectivity Table (.ct)-format files [CTFormat] containing predicted RNA secondary-structure information as used by UNAFold as `FoldRecord` objects
`HybFoldIter`	Class for concurrent iteration over a `HybFile` and a `ViennaFile` or `CtFile`

HybRecord Class

class hybkit.HybRecord(id: str, seq: str, energy: Optional[Union[float, int, str]] = None, seg1_props: Optional[Dict[str, Union[float, int, str]]] = None, seg2_props: Optional[Dict[str, Union[float, int, str]]] = None, flags: Optional[Dict[str, Any]] = None, read_count: Optional[int] = None, allow_undefined_flags: Optional[bool] = None)

Class for storing and analyzing chimeric (hybrid) RNA-seq reads in hyb format.

The hyb file format is a GFF-related format described by [Travis2014]. Each entry contains information about a genomic sequence read identified as a hybrid by a chimeric read caller. Each line contains 15 or 16 columns separated by tabs ("\t") and provides annotations on each component. An example hyb-format line from [Gay2018]:

2407_718\tATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC\t.\tMIMAT0000078_MirBase_miR-23a_microRNA\t1\t21\t1\t21\t0.0027\tENSG00000188229_ENST00000340384_TUBB2C_mRNA\t23\t49\t1181\t1207\t1.2e-06

The columns are respectively described in hybkit as:

id, seq, energy, seg1_ref_name, seg1_read_start, seg1_read_end, seg1_ref_start, seg1_ref_end, seg1_score, seg2_ref_name, seg2_read_start, seg2_read_end, seg2_ref_start, seg2_ref_end, seg2_score, flags

(For more information, see the hybkit Hyb File Specification)
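
As a rough illustration of the column layout above, the following standalone sketch splits a hyb-format line into named columns using only the Python standard library. It is not hybkit's parser (HybRecord.from_line() is the supported route); the helper name split_hyb_line is hypothetical, and all values are left as strings.

```python
# Column names as listed above; the 16th column (flags) is optional.
HYB_COLUMNS = (
    'id', 'seq', 'energy',
    'seg1_ref_name', 'seg1_read_start', 'seg1_read_end',
    'seg1_ref_start', 'seg1_ref_end', 'seg1_score',
    'seg2_ref_name', 'seg2_read_start', 'seg2_read_end',
    'seg2_ref_start', 'seg2_ref_end', 'seg2_score',
    'flags',
)


def split_hyb_line(line):
    """Hypothetical helper: map a tab-separated hyb line to column names."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) not in (15, 16):
        raise ValueError(f'Expected 15 or 16 columns, found {len(fields)}')
    # zip() stops at the shorter input, so 'flags' is omitted for 15 columns.
    return dict(zip(HYB_COLUMNS, fields))


example_line = (
    '2407_718\tATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC\t.\t'
    'MIMAT0000078_MirBase_miR-23a_microRNA\t1\t21\t1\t21\t0.0027\t'
    'ENSG00000188229_ENST00000340384_TUBB2C_mRNA\t23\t49\t1181\t1207\t1.2e-06'
)
columns = split_hyb_line(example_line)
print(columns['id'], columns['seg1_ref_name'], columns['seg2_ref_start'])
```

The example line has 15 columns, so the resulting dictionary has no 'flags' key.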

The preferred method for reading hyb records from lines is with the HybRecord.from_line() constructor:

# line = "2407_718\tATC..."
hyb_record = hybkit.HybRecord.from_line(line)

This is the constructor used by the HybFile class to parse hyb files. For example, to print all hybrid identifiers in a hyb file:

with hybkit.HybFile('path/to/file.hyb', 'r') as hyb_file:
    # performs "hyb_record = hybkit.HybRecord.from_line(line)" for each line in file
    for hyb_record in hyb_file:
        print(hyb_record.id)

HybRecord objects can also be constructed directly. The minimum data required for a HybRecord object is the genomic sequence and its corresponding identifier.

Examples

hyb_record_1 = hybkit.HybRecord('1_100', 'ACTG')
hyb_record_2 = hybkit.HybRecord('2_107', 'CTAG', '-7.3')
hyb_record_3 = hybkit.HybRecord('3_295', 'CTTG', energy='-10.3')

Details about segments are provided via python dictionaries with keys specific to each segment. Data can be provided either as strings or as floats/integers (where appropriate). For example, to create a HybRecord object representing the example line given above:

seg1_props = {'ref_name': 'MIMAT0000078_MirBase_miR-23a_microRNA',
             'read_start': '1',
             'read_end': '21',
             'ref_start': '1',
             'ref_end': '21',
             'score': '0.0027'}
seg2_props = {'ref_name': 'ENSG00000188229_ENST00000340384_TUBB2C_mRNA',
             'read_start': 23,
             'read_end': 49,
             'ref_start': 1181,
             'ref_end': 1207,
             'score': 1.2e-06}
seq_id = '2407_718'
seq = 'ATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC'
energy = None

hyb_record = hybkit.HybRecord(seq_id, seq, energy, seg1_props, seg2_props)
# OR
hyb_record = hybkit.HybRecord(seq_id, seq, seg1_props=seg1_props, seg2_props=seg2_props)

Parameters

id (str) -- Identifier for the hyb record
seq (str) -- Nucleotide sequence of the hyb record
energy (str or float, optional) -- Predicted energy of sequence folding in kcal/mol
seg1_props (dict, optional) -- Properties of segment 1 of the record, containing possible segment column keys: (ref_name, read_start, read_end, ref_start, ref_end, score)
seg2_props (dict, optional) -- Properties of segment 2 of the record, containing possible segment column keys: (ref_name, read_start, read_end, ref_start, ref_end, score)
flags (dict, optional) -- Dict with keys of flags for the record and their associated values. By default flags must be defined in ALL_FLAGS, but custom flags can be supplied by changing HybRecord.settings['custom_flags']. This check can also be disabled by setting 'allow_undefined_flags' to True in HybRecord.settings.
allow_undefined_flags (bool, optional) -- If True, allows flags not defined in ALL_FLAGS or HybRecord.settings['custom_flags'] to be added to the record. If not provided, defaults to the value in HybRecord.settings['allow_undefined_flags'].

Variables

id (str) -- Identifier for the hyb record (Hyb format: <read-num>_<read-count>)
seq (str) -- Nucleotide sequence of the hyb record
energy (str) -- Predicted energy of folding
seg1_props (dict) -- Information on chimeric segment 1, contains segment column keys: ref_name (str), read_start (int), read_end (int), ref_start (int), ref_end (int), and score (float).
seg2_props (dict) -- Information on segment 2, contains segment column keys: ref_name (str), read_start (int), read_end (int), ref_start (int), ref_end (int), and score (float).
flags (dict) -- Dict of flags with possible flag keys and values as defined in the Flags section of the hybkit Hyb File Specification.
fold_record (FoldRecord) -- Information on the predicted secondary structure of the sequence set by set_fold_record().
allow_undefined_flags (bool) -- Whether to allow undefined flags to be set.

HYBRID_COLUMNS = ('id', 'seq', 'energy'): Record columns 1-3 defining parameters of the overall hybrid, defined by the Hyb format

SEGMENT_COLUMNS = ('ref_name', 'read_start', 'read_end', 'ref_start', 'ref_end', 'score'): Record columns 4-9 and 10-15, defining annotated parameters of seg1 and seg2 respectively, defined by the Hyb format

ALL_FLAGS = ('count_total', 'count_last_clustering', 'two_way_merged', 'seq_IDs_in_cluster', 'read_count', 'orient', 'det', 'seg1_type', 'seg2_type', 'seg1_det', 'seg2_det', 'miRNA_seg', 'target_reg', 'ext', 'dataset'): Flags defined by the hybkit package. Flags 1-4 are utilized by the Hyb software package. For information on flags, see the Flags portion of the hybkit Hyb File Specification.

settings = {'allow_undefined_flags': False, 'allow_unknown_seg_types': False, 'custom_flags': [], 'hyb_placeholder': '.', 'mirna_types': ['miRNA', 'microRNA'], 'reorder_flags': True}: Class-level settings. See settings.HybRecord_settings_info for descriptions.

TypeFinder

Link to type_finder.TypeFinder class for parsing sequence identifiers in assigning segment types by eval_types().

SET_PROPS = ('energy', 'full_seg_props', 'fold_record', 'eval_types', 'eval_mirna', 'eval_target')

Properties for the is_set() method.

energy : energy is not None
full_seg_props : Each seg key is in segN_props dict and is not None
fold_record : fold_record has been set
eval_types : seg1_type and seg2_type flags have been set
eval_mirna : miRNA_seg flag has been set

GEN_PROPS = ('has_indels',)

General record properties for the prop() method.

has_indels : either seg1 or seg2 alignments has insertions/deletions, shown by differing read/reference length for the same alignment
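
The has_indels description above can be sketched in plain Python as a comparison of read-span length against reference-span length for each segment. This is only an illustration of the stated rule, not hybkit's implementation; the function names are hypothetical, and the segment dictionaries reuse the key names from SEGMENT_COLUMNS.

```python
def segment_has_indel(seg_props):
    """Hypothetical helper: True if read and reference spans differ in length."""
    read_len = int(seg_props['read_end']) - int(seg_props['read_start'])
    ref_len = int(seg_props['ref_end']) - int(seg_props['ref_start'])
    return read_len != ref_len


def has_indels(seg1_props, seg2_props):
    """True if either segment alignment shows a length difference."""
    return segment_has_indel(seg1_props) or segment_has_indel(seg2_props)


# Coordinates taken from the example hyb line in this document.
seg1 = {'read_start': 1, 'read_end': 21, 'ref_start': 1, 'ref_end': 21}
seg2 = {'read_start': 23, 'read_end': 49, 'ref_start': 1181, 'ref_end': 1207}
# Invented variant with a reference span two bases longer than the read span.
seg2_gapped = dict(seg2, ref_end=1209)

print(has_indels(seg1, seg2))
print(has_indels(seg1, seg2_gapped))
```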

STR_PROPS = ('id_is', 'id_prefix', 'id_suffix', 'id_contains', 'seq_is', 'seq_prefix', 'seq_suffix', 'seq_contains', 'seg1_is', 'seg1_prefix', 'seg1_suffix', 'seg1_contains', 'seg2_is', 'seg2_prefix', 'seg2_suffix', 'seg2_contains', 'any_seg_is', 'any_seg_prefix', 'any_seg_suffix', 'any_seg_contains', 'seg1_type_is', 'seg1_type_prefix', 'seg1_type_suffix', 'seg1_type_contains', 'seg2_type_is', 'seg2_type_prefix', 'seg2_type_suffix', 'seg2_type_contains', 'any_seg_type_is', 'any_seg_type_prefix', 'any_seg_type_suffix', 'any_seg_type_contains')

String-comparison properties for the prop() method.

Field Types:
- id : record.id
- seq : record.seq
- seg1 : seg1_props['ref_name']
- seg2 : seg2_props['ref_name']
- any_seg : seg1_props['ref_name'] OR seg2_props['ref_name']
- seg1_type : seg1_type flag
- seg2_type : seg2_type flag
- any_seg_type : seg1_type OR seg2_type flags
Comparisons:
- is : Comparison string matches field exactly
- prefix : Comparison string matches beginning of field
- suffix : Comparison string matches end of field
- contains : Comparison string is contained within field
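
Each STR_PROPS name combines a field type with a comparison, joined by an underscore. The sketch below shows one way to decompose such a name and apply the comparison using only built-in string operations. It is an illustration of the naming scheme rather than hybkit's code; split_str_prop and check_str_prop are hypothetical helpers, and the caller is assumed to supply the field values.

```python
# Comparison names mapped to plain string tests, following the list above.
COMPARISONS = {
    'is': lambda field, compare: field == compare,
    'prefix': lambda field, compare: field.startswith(compare),
    'suffix': lambda field, compare: field.endswith(compare),
    'contains': lambda field, compare: compare in field,
}


def split_str_prop(prop):
    """Split e.g. 'any_seg_type_prefix' into ('any_seg_type', 'prefix')."""
    field_type, comparison = prop.rsplit('_', 1)
    if comparison not in COMPARISONS:
        raise ValueError(f'Unknown comparison in property: {prop}')
    return field_type, comparison


def check_str_prop(prop, field_values, compare):
    """Apply the comparison to each candidate value (two for 'any_seg' types)."""
    _, comparison = split_str_prop(prop)
    test = COMPARISONS[comparison]
    return any(test(value, compare) for value in field_values)


seg1_ref = 'MIMAT0000078_MirBase_miR-23a_microRNA'
seg2_ref = 'ENSG00000188229_ENST00000340384_TUBB2C_mRNA'

print(split_str_prop('any_seg_type_prefix'))
print(check_str_prop('seg1_suffix', [seg1_ref], 'microRNA'))
print(check_str_prop('any_seg_contains', [seg1_ref, seg2_ref], 'TUBB2C'))
```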

MIRNA_PROPS = ('has_mirna', 'no_mirna', 'mirna_dimer', 'mirna_not_dimer', '5p_mirna', '3p_mirna')

miRNA-evaluation-related properties for the prop() method. Requires miRNA_seg flag to be set by eval_mirna() method.

has_mirna : Either or both of seg1 and seg2 have been identified as a miRNA
no_mirna : Neither seg1 nor seg2 has been identified as a miRNA
mirna_dimer : Both seg1 and seg2 have been identified as a miRNA
mirna_not_dimer : One and only one of seg1 or seg2 has been identified as a miRNA
5p_mirna : seg1 (5p) has been identified as a miRNA
3p_mirna : seg2 (3p) has been identified as a miRNA
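
The relationship between these properties can be sketched from the two segment types alone. The following is a minimal illustration, assuming a segment counts as a miRNA when its type appears in the 'mirna_types' setting shown above; it does not reproduce how hybkit stores or reads the miRNA_seg flag, and the function name mirna_props is hypothetical.

```python
# Default miRNA type names, copied from the 'mirna_types' setting shown above.
DEFAULT_MIRNA_TYPES = frozenset({'miRNA', 'microRNA'})


def mirna_props(seg1_type, seg2_type, mirna_types=DEFAULT_MIRNA_TYPES):
    """Hypothetical helper: derive the six miRNA properties from segment types."""
    seg1_mirna = seg1_type in mirna_types
    seg2_mirna = seg2_type in mirna_types
    return {
        'has_mirna': seg1_mirna or seg2_mirna,
        'no_mirna': not (seg1_mirna or seg2_mirna),
        'mirna_dimer': seg1_mirna and seg2_mirna,
        'mirna_not_dimer': seg1_mirna != seg2_mirna,
        '5p_mirna': seg1_mirna,
        '3p_mirna': seg2_mirna,
    }


pair = mirna_props('microRNA', 'mRNA')
dimer = mirna_props('miRNA', 'microRNA')
print(pair)
print(dimer)
```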

MIRNA_STR_PROPS = ('mirna_is', 'mirna_prefix', 'mirna_suffix', 'mirna_contains', 'target_is', 'target_prefix', 'target_suffix', 'target_contains', 'mirna_seg_type_is', 'mirna_seg_type_prefix', 'mirna_seg_type_suffix', 'mirna_seg_type_contains', 'target_seg_type_is', 'target_seg_type_prefix', 'target_seg_type_suffix', 'target_seg_type_contains')

Comparisons:
is : Comparison string matches field exactly
prefix : Comparison string matches beginning of field
suffix : Comparison string matches end of field
contains : Comparison string is contained within field

HAS_PROPS = ('has_indels', 'id_is', 'id_prefix', 'id_suffix', 'id_contains', 'seq_is', 'seq_prefix', 'seq_suffix', 'seq_contains', 'seg1_is', 'seg1_prefix', 'seg1_suffix', 'seg1_contains', 'seg2_is', 'seg2_prefix', 'seg2_suffix', 'seg2_contains', 'any_seg_is', 'any_seg_prefix', 'any_seg_suffix', 'any_seg_contains', 'seg1_type_is', 'seg1_type_prefix', 'seg1_type_suffix', 'seg1_type_contains', 'seg2_type_is', 'seg2_type_prefix', 'seg2_type_suffix', 'seg2_type_contains', 'any_seg_type_is', 'any_seg_type_prefix', 'any_seg_type_suffix', 'any_seg_type_contains', 'has_mirna', 'no_mirna', 'mirna_dimer', 'mirna_not_dimer', '5p_mirna', '3p_mirna', 'mirna_is', 'mirna_prefix', 'mirna_suffix', 'mirna_contains', 'target_is', 'target_prefix', 'target_suffix', 'target_contains', 'mirna_seg_type_is', 'mirna_seg_type_prefix', 'mirna_seg_type_suffix', 'mirna_seg_type_contains', 'target_seg_type_is', 'target_seg_type_prefix', 'target_seg_type_suffix', 'target_seg_type_contains'): All allowed properties for the prop() method. See GEN_PROPS, STR_PROPS, MIRNA_PROPS, and MIRNA_STR_PROPS

set_flag(flag_key: str, flag_val: Optional[Union[float, int, str, bool]], allow_undefined_flags: Optional[bool] = None) → None

Set the value of record flag_key to flag_val.

Parameters

flag_key (str) -- Key for flag to set.
flag_val -- Value for flag to set.
allow_undefined_flags (bool, optional) -- Allow inclusion of flags not defined in ALL_FLAGS or in settings['custom_flags']. If not provided, uses setting in 'HybRecord.allow_undefined_flags' (Defaults to value in: settings['allow_undefined_flags'] ).
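
The flag-acceptance rule described for allow_undefined_flags can be sketched as a simple membership check. This is an illustration of the documented rule only: the helper check_flag_key is hypothetical, and raising ValueError is an assumption, since the exception hybkit raises is not stated here.

```python
# Flag names copied from ALL_FLAGS as listed in this document.
ALL_FLAGS = (
    'count_total', 'count_last_clustering', 'two_way_merged',
    'seq_IDs_in_cluster', 'read_count', 'orient', 'det', 'seg1_type',
    'seg2_type', 'seg1_det', 'seg2_det', 'miRNA_seg', 'target_reg',
    'ext', 'dataset',
)


def check_flag_key(flag_key, custom_flags=(), allow_undefined_flags=False):
    """Hypothetical helper: accept defined, custom, or explicitly allowed flags."""
    if allow_undefined_flags:
        return
    if flag_key in ALL_FLAGS or flag_key in custom_flags:
        return
    # Assumption: the real exception type may differ.
    raise ValueError(f'Flag {flag_key!r} is not defined')


check_flag_key('dataset')                                # defined flag
check_flag_key('my_flag', custom_flags=['my_flag'])      # custom flag
check_flag_key('my_flag', allow_undefined_flags=True)    # check disabled
```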

get_seg1_type(require: bool = False) → Optional[str]

Return the seg1_type flag if defined, or return None.

Parameters: require (bool, optional) -- If True, raise an error if seg1_type is not defined.

get_seg2_type(require: bool = False) → Optional[str]

Return the seg2_type flag if defined, or return None.

Parameters: require (bool, optional) -- If True, raise an error if seg2_type is not defined.

get_seg_types(require: bool = False) → Tuple[Optional[str], Optional[str]]

Return "seg1_type" (or None), "seg2_type" (or None) flags.

Return a tuple of the seg1_type and seg2_type flags for each respective flag that is defined, or None for each flag that is not.

Parameters: require (bool, optional) -- If True, raise an error if either flag is not defined.

get_read_count(require: bool = False) → Optional[int]

Return the read_count flag if defined, otherwise return None.

Parameters: require (bool, optional) -- If True, raise an error if the "read_count" flag is not defined.

get_record_count(require: bool = False) → int

Return count_total flag if defined, or return 1 (this record).

Parameters: require (bool, optional) -- If True, raise an error if the "count_total" flag is not defined.

get_mirna_props(allow_mirna_dimers: bool = False, require: bool = True) → Optional[Dict]

Return the seg_props dict corresponding to the miRNA segment, if set.

If eval_mirna() has been run, return the seg_props dict corresponding to the miRNA segment type as determined by checking the miRNA_seg flag, or None if the record does not contain a miRNA.

Parameters

allow_mirna_dimers (bool, optional) -- If True, consider miRNA dimers as a miRNA/target pair and return the 5p miRNA segment properties.
require (bool, optional) -- If True, raise an error if the read does not contain a miRNA-annotated segment (Default: True).

get_target_props(allow_mirna_dimers: bool = False, require: bool = True) → Optional[Dict]

Return the seg_props dict corresponding to the target segment, if set.

If eval_mirna() has been run, return the seg_props dict corresponding to the target segment type as determined by checking the miRNA_seg flag, (and returning the other segment), or None if the record does not contain a miRNA or contains two miRNAs.

Parameters

allow_mirna_dimers (bool, optional) -- If True, consider miRNA dimers as a miRNA/target pair and return the 3p miRNA segment properties as the arbitrarily-selected "target" of the dimer pair.
require (bool, optional) -- If True, raise an error if the read does not contain a single target-annotated segment (Default: True).

eval_types(allow_unknown: Optional[bool] = None) → None

Find the types of each segment using the TypeFinder class.

This method provides HybRecord.seg1_props and HybRecord.seg2_props to the TypeFinder class, linked as attribute HybRecord.TypeFinder. It uses the TypeFinder.find method, as set by TypeFinder.set_method or TypeFinder.set_custom_method, to set the seg1_type and seg2_type flags if they are not already set.

To use a type-finding method other than the default, prepare the TypeFinder class by preparing and setting TypeFinder.params and using TypeFinder.set_method.

Parameters: allow_unknown (bool, optional) -- If True, allow segment types that cannot be identified and set them as "unknown". Otherwise raise an error. If not provided uses setting in settings['allow_unknown_seg_types'].

set_fold_record(fold_record: Union[FoldRecord, Tuple[FoldRecord, Any]], allow_energy_mismatch: bool = False) → None

Check and set provided fold_record (FoldRecord) as attribute fold_record.

Ensures that the fold_record argument is an instance of FoldRecord and has a sequence matching this HybRecord, then sets it as HybRecord.fold_record.

Parameters

fold_record (FoldRecord) -- FoldRecord instance to set as HybRecord.fold_record.
allow_energy_mismatch (bool, optional) -- If True, allow mismatched fold_record and HybRecord energy. Otherwise, raise an error.

eval_mirna(override: bool = False, mirna_types: Optional[bool] = None) → None

Analyze and set miRNA properties from type properties in the hyb record.

If not already done, determine whether a miRNA exists within this record and set the miRNA_seg flag. This evaluation requires the seg1_type and seg2_type flags to be populated, which can be performed by the eval_types() method.

Parameters

override (bool, optional) -- If True, override existing miRNA_seg flag if present.
mirna_types (list, tuple, or set, optional) -- Iterable of strings representing sequence types considered as miRNA. If not provided, the types are taken from settings['mirna_types'] (it is suggested that this be provided as a set for fastest checking).

mirna_details(detail: Literal['all', 'mirna_ref', 'target_ref', 'mirna_seg_type', 'target_seg_type', 'mirna_seq', 'target_seq', 'mirna_fold', 'target_fold'] = 'all', allow_mirna_dimers: bool = False) → Optional[Union[Dict, str]]

Provide a detail about the miRNA or target following eval_mirna().

Analyze miRNA properties within the sequence record and provide a detail as output. Unless allow_mirna_dimers is True, this method requires record to contain a non-dimer miRNA, otherwise an error will be raised.

Parameters

detail (str) --

Type of detail to return. Options include:

all : Dict of all properties (default)

mirna_ref : Identifier for Assigned miRNA

target_ref : Identifier for Assigned Target

mirna_seg_type : Assigned seg_type of miRNA

target_seg_type : Assigned seg_type of target

mirna_seq : Annotated subsequence of miRNA

target_seq : Annotated subsequence of target

mirna_fold : Annotated fold substring of miRNA (requires fold_record set)

target_fold : Annotated fold substring of target (requires fold_record set)
allow_mirna_dimers (bool, optional) -- Allow miRNA/miRNA dimers. The 5p-position will be assigned as the "miRNA", and the 3p-position will be assigned as the "target".

mirna_detail(*args, **kwargs): Deprecated alias for mirna_details().

Deprecated since version v0.3.0.

is_set(prop: str) → bool

Return True if HybRecord property "prop" is set (if relevant) and is not None.

Options described in SET_PROPS.

Parameters: prop (str) -- Property / Analysis to check

not_set(prop: str) → bool

Return False if HybRecord property "prop" is set (if relevant) and is not None.

( returns not is_set(prop) )

Parameters: prop (str) -- Property / Analysis to check

prop(prop: str, prop_compare: Optional[str] = None) → bool

Return True if HybRecord has property: prop.

Check property against list of allowed properties in HAS_PROPS. If query property has a string comparator, provide this in prop_compare. Raises an error if a prerequisite field is not set (use is_set() to check whether properties are set).

Specific properties available to check are described in attributes:

GEN_PROPS

General Record Properties

STR_PROPS

Field String Comparison Properties

MIRNA_PROPS

miRNA-Associated Record Properties

MIRNA_STR_PROPS

miRNA-Associated String Comparison Properties

Parameters

prop (str) -- Property to check
prop_compare (str, optional) -- Comparator to check.

has_prop(*args, **kwargs): Return True if HybRecord has property: prop.

Deprecated since version v0.3.0: Use prop() instead.

to_line(newline: bool = True, sep: str = '\t') → str

Return a hyb-format string representation of the record.

Parameters

newline (bool, optional) -- Terminate returned string with a newline (default: True)
sep (str, optional) -- Separator between fields (Default: "\t")

to_csv(newline: bool = False) → str

Return a comma-separated hyb-format string representation of the record.

Parameters: newline (bool, optional) -- If True, end the returned string with a newline.

to_fields(missing_obj: Optional[Union[float, int, str, bool]] = None) → dict

Return a python dictionary representation of the record.

Returns a dictionary with keys corresponding to the fields in the hyb-format file, and values corresponding to the values in the record. Output can be used with the pandas DataFrame constructor or csv.DictWriter.

Parameters: missing_obj (optional) -- Object to use for missing values. Default = None.
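
To illustrate the csv.DictWriter usage mentioned above, the sketch below writes a plain dictionary with hyb field names to CSV using only the standard library. The dictionary is a hand-built stand-in for the output of to_fields() with hypothetical values; it is not produced by hybkit here.

```python
import csv
import io

# Field names as listed for to_fields_header() in this document.
FIELD_NAMES = [
    'id', 'seq', 'energy',
    'seg1_ref_name', 'seg1_read_start', 'seg1_read_end',
    'seg1_ref_start', 'seg1_ref_end', 'seg1_score',
    'seg2_ref_name', 'seg2_read_start', 'seg2_read_end',
    'seg2_ref_start', 'seg2_ref_end', 'seg2_score',
    'flags',
]

# Stand-in for a to_fields() result: missing values are None (the default).
fields = {name: None for name in FIELD_NAMES}
fields.update({'id': '1_100', 'seq': 'ACTG', 'energy': '-7.3'})

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELD_NAMES, lineterminator='\n')
writer.writeheader()
writer.writerow(fields)  # csv writes None values as empty strings

csv_text = buffer.getvalue()
print(csv_text)
```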

to_fasta_record(mode: Literal['hybrid', 'seg1', 'seg2', 'mirna', 'target'] = 'hybrid', annotate: bool = True, allow_mirna_dimers: bool = False) → None

Return nucleotide sequence as BioPython SeqRecord object.

Parameters

mode (str, optional) --

Determines which sequence component to return. Options:

hybrid: Entire hybrid sequence (default)

seg1: Sequence 1 (if defined)

seg2: Sequence 2 (if defined)

mirna: miRNA sequence of miRNA/target pair (if defined, else None)

target: Target sequence of miRNA/target pair (if defined, else None)
annotate (bool, optional) -- Add name of components to fasta sequence identifier if present.
allow_mirna_dimers (bool, optional) -- If True, allow miRNA dimers to be returned as miRNA sequence (the 5p segment will be selected as the "miRNA").

to_fasta_str(mode: Literal['hybrid', 'seg1', 'seg2', 'mirna', 'target'] = 'hybrid', annotate: bool = True) → str

Return nucleotide sequence as a fasta string.

Parameters

mode (str, optional) -- As with the to_fasta_record() method.
annotate (bool, optional) -- Add name of components to fasta sequence identifier if present.

classmethod from_line(line: str, hybformat_id: bool = False, hybformat_ref: bool = False) → Self

Construct a HybRecord instance from a single-line hyb-format string.

The Hyb software package ([Travis2014]) records read-count information in the "id" field of the record, which can be read by setting hybformat_id=True. Additionally, the Hyb hOH7 database contains the segment type in the identifier of each reference in the 4th field, which can be read by setting hybformat_ref=True.

Parameters

line (str) -- hyb-format string containing record information.
hybformat_id (bool, optional) -- If True, read count information from identifier in <read_number>_<read_count> format.
hybformat_ref (bool, optional) -- If True, read additional record information from identifier in <gene_id>_<transcript_id>_<gene_name>_<seg_type> format.

Returns

HybRecord instance containing record information.
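
The two identifier layouts named above can be illustrated with plain string splitting. This sketch only demonstrates the <read_number>_<read_count> and <gene_id>_<transcript_id>_<gene_name>_<seg_type> patterns; the helper names are hypothetical, the splitting strategy is an assumption, and it does not show which flags hybkit sets from these values.

```python
def parse_hybformat_id(record_id):
    """Hypothetical helper: split '<read_number>_<read_count>'."""
    read_number, read_count = record_id.split('_')
    return read_number, int(read_count)


def parse_hybformat_ref(ref_name):
    """Hypothetical helper: split a reference name into its four parts."""
    # Assumption: the first two and the last underscore delimit the fields,
    # so any extra underscores are kept inside gene_name.
    gene_id, transcript_id, remainder = ref_name.split('_', 2)
    gene_name, seg_type = remainder.rsplit('_', 1)
    return {
        'gene_id': gene_id,
        'transcript_id': transcript_id,
        'gene_name': gene_name,
        'seg_type': seg_type,
    }


id_parts = parse_hybformat_id('2407_718')
seg1_ref = parse_hybformat_ref('MIMAT0000078_MirBase_miR-23a_microRNA')
seg2_ref = parse_hybformat_ref('ENSG00000188229_ENST00000340384_TUBB2C_mRNA')
print(id_parts, seg1_ref['seg_type'], seg2_ref['seg_type'])
```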

classmethod from_fasta_records(seg1_record: None, seg2_record: None, hyb_id: Optional[str] = None, energy: Optional[Union[float, int, str]] = None, flags: Optional[Dict[str, Any]] = None) → Self

Construct a HybRecord instance from two BioPython SeqRecord Objects.

Create an artificial HybRecord from two SeqRecord objects. For the hybrid:

id: [seg1_record.id]--[seg2_record.id] (overwritten by the "hyb_id" parameter if provided)

seq: seg1_record.seq + seg2_record.seq

For each segment:

FASTA_Sequence_ID -> segN_ref_name

FASTA_Description -> Flags: segN_det (Overwritten if segN_det flag is provided directly)

Optional fields to add via function arguments:

hyb_id

energy

flags

Parameters

seg1_record (SeqRecord) -- Biopython SeqRecord object containing information on the left/first/5p hybrid segment (seg1)
seg2_record (SeqRecord) -- Biopython SeqRecord object containing information on the right/second/3p hybrid segment (seg2)
hyb_id (str, optional) -- Identifier for the hyb record (overwrites generated id if provided)
energy (str or float, optional) -- Predicted energy of sequence folding in kcal/mol
flags (dict, optional) -- Dict with keys of flags for the record and their associated values. Any flags provided overwrite default-generated flags.

Returns

HybRecord instance containing record information.

classmethod to_fields_header() → Literal['id', 'seq', 'energy', 'seg1_ref_name', 'seg1_read_start', 'seg1_read_end', 'seg1_ref_start', 'seg1_ref_end', 'seg1_score', 'seg2_ref_name', 'seg2_read_start', 'seg2_read_end', 'seg2_ref_start', 'seg2_ref_end', 'seg2_score', 'flags']

Return a list of the fields in a HybRecord object.

For use with the to_fields() method.

classmethod to_csv_header(newline: bool = False) → Literal['id,seq,energy,seg1_ref_name,seg1_read_start,seg1_read_end,seg1_ref_start,seg1_ref_end,seg1_score,seg2_ref_name,seg2_read_start,seg2_read_end,seg2_ref_start,seg2_ref_end,seg2_score,flags']

Return a comma-separated string representation of the fields in the record.

For use with the to_csv() method.

Parameters: newline (bool, optional) -- If True, end the returned string with a newline.

HybFile Class

class hybkit.HybFile(path: str, *args: Any, hybformat_id: Optional[bool] = None, hybformat_ref: Optional[bool] = None, from_file_like: bool = False, **kwargs: Any)

Wrapper for a hyb-format text file which returns entries (lines) as HybRecord objects.

Parameters

path (str) -- Path to text file to open as hyb-format file.
*args -- Arguments passed to open() function to open a text file for reading/writing.
hybformat_id (bool, optional) -- If True, during parsing of lines read count information from identifier in <read_number>_<read_count> format. Defaults to value in settings['hybformat_id'].
hybformat_ref (bool, optional) -- If True, during parsing of lines read additional record information from identifier in <gene_id>_<transcript_id>_<gene_name>_<seg_type> format. Defaults to value in settings['hybformat_ref'].
from_file_like (bool, optional) -- If True, the first argument is treated as a file-like object (such as io.StringIO or gzip.GzipFile) and the remaining positional arguments are ignored. (Default: False)
**kwargs -- Keyword arguments passed to open() function to open a text file for reading/writing.

Variables

hybformat_id (bool) -- Read count information from identifier during line parsing
hybformat_ref (bool) -- Read type information from reference name during line parsing
fh (file) -- Underlying file handle for the HybFile object.

settings = {'hybformat_id': False, 'hybformat_ref': False}: Class-level settings. See hybkit.settings.HybFile_settings_info for descriptions.

close() → None: Close the file.

read_record() → str: Return next line of hyb file as HybRecord object.

read_records() → List[str]: Return list of all (remaining) records in hyb file as HybRecord objects.

write_record(write_record: HybRecord) → None

Write a HybRecord object to file as a Hyb-format string.

Unlike the file.write() method, this method will add a newline to the end of each written record line.

Parameters: write_record (HybRecord) -- Record to write.

write_records(write_records: Iterable[HybRecord]) → None

Write a sequence of HybRecord objects as hyb-format lines to the Hyb file.

Unlike the file.writelines() method, this method will add a newline to the end of each written record line.

Parameters: write_records (list) -- List of HybRecord objects to write.

write_fh(*args, **kwargs) → None: Write directly to the underlying file handle.

write(*_args, **_kwargs) → None

Implement no-op / error for "write" method to catch errors.

Use write_record() or write_fh() instead.

classmethod open(path: str, *args: Any, hybformat_id: Optional[bool] = None, hybformat_ref: Optional[bool] = None, **kwargs: Any) → Self

Open a path to a text file using open() and return a HybFile object.

Arguments match those of the Python3 built-in open() function and are passed directly to it.

This method is provided as a convenience function for drop-in replacement of the built-in open() function.

Specific keyword arguments are provided for HybFile-specific settings:

Parameters

path (str) -- Path to file to open.
hybformat_id (bool, optional) -- If True, during parsing of lines read count information from identifier in <read_number>_<read_count> format. Defaults to value in settings['hybformat_id'].
hybformat_ref (bool, optional) -- If True, during parsing of lines read additional record information from identifier in <gene_id>_<transcript_id>_<gene_name>_<seg_type> format. Defaults to value in settings['hybformat_ref'].

Example usage:

with HybFile.open('path/to/file.hyb', 'r') as hyb_file:
    for record in hyb_file:
        print(record)

Parameters

*args -- Passed directly to open().
**kwargs -- Passed directly to open().

Returns

HybFile object.

FoldRecord Class

class hybkit.FoldRecord(id: str, seq: str, fold: str, energy: Optional[Union[float, int, str]] = None, seq_type: Optional[Literal['static', 'dynamic']] = None)

Class for storing secondary structure (folding) information for a nucleotide sequence.

This class supports the following file types: (Data courtesy of [Gay2018])

The ".vienna" file format used by the ViennaRNA package ([ViennaFormat]; [Lorenz2011]):

Example:

34_151138_MIMAT0000076_MirBase_miR-21_microRNA_1_19-...
TAGCTTATCAGACTGATGTTAGCTTATCAGACTGATG
.....((((((.((((((......)))))).))))))   (-11.1)

The ".ct" file format used by UNAFold and other packages ([CTFormat], [Zuker2003]):

Example:

41        dG = -8 dH = -93.9      seq1_name-seq2_name
1 A       0       2       0       1       0       0
2 G       1       3       0       2       0       0
...
...
...
40        G       39      41      11      17      39      41
41        T       40      0       10      18      40      0
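
As a rough illustration of the three-line Vienna layout shown above (identifier, sequence, then fold with energy in parentheses), the sketch below parses one record with plain string operations. It is not the parser used by FoldRecord.from_vienna_lines(); the helper name parse_vienna_record is hypothetical, and the truncated identifier is kept exactly as it appears in the example.

```python
def parse_vienna_record(lines):
    """Hypothetical helper: parse an (id, sequence, fold + energy) line triple."""
    record_id, seq, fold_line = (line.strip() for line in lines)
    tokens = fold_line.split()
    fold = tokens[0]
    # Join remaining tokens so both '(-11.1)' and '( -11.1)' are handled.
    energy_text = ''.join(tokens[1:]).strip('()')
    energy = float(energy_text) if energy_text else None
    return {'id': record_id, 'seq': seq, 'fold': fold, 'energy': energy}


vienna_lines = [
    '34_151138_MIMAT0000076_MirBase_miR-21_microRNA_1_19-...',
    'TAGCTTATCAGACTGATGTTAGCTTATCAGACTGATG',
    '.....((((((.((((((......)))))).))))))   (-11.1)',
]
record = parse_vienna_record(vienna_lines)
print(record['fold'], record['energy'])
```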

The minimum data required for a FoldRecord object is a sequence identifier, a genomic sequence, and its fold representation.

Two types of FoldRecord objects are supported, 'static' and 'dynamic'. Static FoldRecord objects are those where the 'seq' attribute matches exactly to the corresponding HybRecord.seq attribute (where applicable). Dynamic FoldRecord objects are those where FoldRecord.seq is reconstructed from aligned regions of a HybRecord.seq chimeric read: Longer for chimeras with overlapping alignments, shorter for chimeras with gapped alignments.

Overlapping Alignment Example:

Static:
seg1: 1111111111111111111111
seg2:                   222222222222222222222
seq:  TAGCTTATCAGACTGATGTTTTAGCTTATCAGACTGATG

Dynamic:
seg1: 1111111111111111111111
seg2:                       222222222222222222222
seq:  TAGCTTATCAGACTGATGTTTTTTTTAGCTTATCAGACTGATG

Gapped Alignment Example:

Static:
seg1:   1111111111111111
seg2:                     222222222222222222
seq:  TTAGCTTATCAGACTGATGTTAGCTTATCAGACTGATG

Dynamic:
seg1: 1111111111111111
seg2:                 222222222222222222
seq:  AGCTTATCAGACTGATTAGCTTATCAGACTGATG

Dynamic sequences are found in the Hyb program *_hybrids_ua.hyb file type. This is primarily relevant in error-checking when calling the HybRecord.set_fold_record() method.
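
The static and dynamic examples above can be reproduced by concatenating the two aligned read regions of the static sequence. The sketch below assumes 1-based, inclusive read coordinates read off the diagrams (seg1 1-22 and seg2 19-39 for the overlapping case; seg1 3-18 and seg2 21-38 for the gapped case). The function dynamic_seq is a hypothetical helper, not part of hybkit.

```python
def dynamic_seq(static_seq, seg1_read, seg2_read):
    """Hypothetical helper: join the two aligned regions of a chimeric read.

    Coordinates are 1-based and inclusive (start, end) pairs.
    """
    (start1, end1), (start2, end2) = seg1_read, seg2_read
    return static_seq[start1 - 1:end1] + static_seq[start2 - 1:end2]


# Overlapping alignments: the shared bases appear twice, so the result is longer.
overlap_static = 'TAGCTTATCAGACTGATGTTTTAGCTTATCAGACTGATG'
overlap_dynamic = dynamic_seq(overlap_static, (1, 22), (19, 39))

# Gapped alignments: unaligned bases are dropped, so the result is shorter.
gapped_static = 'TTAGCTTATCAGACTGATGTTAGCTTATCAGACTGATG'
gapped_dynamic = dynamic_seq(gapped_static, (3, 18), (21, 38))

print(len(overlap_static), len(overlap_dynamic))
print(len(gapped_static), len(gapped_dynamic))
```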

When the 'static' FoldRecord type is used, the following methods are used for HybRecord.fold_record error-checking:

static_count_hyb_record_mismatches()

When the 'dynamic' FoldRecord type is used, the following methods are used for HybRecord.fold_record error-checking:

dynamic_count_hyb_record_mismatches()

Parameters

id (str) -- Identifier for record
seq (str) -- Nucleotide sequence of record.
fold (str) -- Fold representation of record.
energy (str or float, optional) -- Energy of folding for record.
seq_type (str, optional) -- Expect sequence to be 'static' (match exactly to corresponding HybRecord.seq), or 'dynamic' (constructed from pieces of HybRecord.seq). If not provided, defaults to the settings['seq_type'] setting. See hybkit.settings.FoldRecord_settings_info for descriptions.

Variables

id (str) -- Sequence Identifier (often seg1name-seg2name)
seq (str) -- Genomic Sequence
fold (str) -- Dot-bracket Fold Representation, '(', '.', and ')' characters
energy (str) -- Predicted energy of folding
seq_type (str) -- Whether sequence is 'static' or 'dynamic' (Default: 'static'; see Parameters for details)

settings = {'allowed_mismatches': 0, 'error_mode': 'raise', 'fold_placeholder': '.', 'seq_type': 'static'}: Class-level settings. See hybkit.settings.FoldRecord_settings_info for descriptions.

to_vienna_lines(newline: bool = True) → List[str]

Return a list of lines for the record in vienna format.

See (Vienna File Format).

Parameters: newline (bool, optional) -- Add newline character to the end of each returned line. (Default: True)

to_vienna_string(newline: bool = True) → str

Return a 3-line string for the record in vienna format.

See Vienna File Format.

Parameters: newline (bool, optional) -- Terminate the returned string with a newline character. (Default: True)
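
As a rough illustration of the difference between the two methods, the sketch below formats a record as three lines. The layout is an assumption (identifier, sequence, then the dot-bracket fold followed by a tab and the energy in parentheses); consult the Vienna File Format specification for the authoritative layout. The function names mirror the methods above but are standalone, hypothetical helpers, and the record values are toy data.

```python
def to_vienna_lines(record_id, seq, fold, energy, newline=True):
    """Return the three lines of a record as a list of strings."""
    suffix = "\n" if newline else ""
    return [
        record_id + suffix,
        seq + suffix,
        # Assumed layout: fold, a tab, then the energy in parentheses.
        f"{fold}\t({energy}){suffix}",
    ]


def to_vienna_string(record_id, seq, fold, energy, newline=True):
    """Return the same three lines joined into a single string."""
    lines = to_vienna_lines(record_id, seq, fold, energy, newline=False)
    return "\n".join(lines) + ("\n" if newline else "")


# Toy record for illustration only.
vienna_lines = to_vienna_lines("rec1", "GGGAAACCC", "(((...)))", -1.2)
vienna_string = to_vienna_string("rec1", "GGGAAACCC", "(((...)))", -1.2)
```

In this sketch, newline controls every line of the list form but only the final terminator of the string form, since the internal line breaks are always needed there.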

count_hyb_record_mismatches(hyb_record: HybRecord) → int

Count mismatches between hyb_record.seq and fold_record.seq.

Uses static_count_hyb_record_mismatches() if seq_type is static, or dynamic_count_hyb_record_mismatches() if seq_type is dynamic.

Parameters: hyb_record (HybRecord) -- hyb_record for comparison.
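
The dispatch described above might look like the following sketch. It is not hybkit's code: the counting rule (differing positions plus any length difference) and the explicit 1-based segment ranges are assumptions made for illustration, and the real methods take a HybRecord rather than raw strings.

```python
def count_mismatches(seq_a, seq_b):
    """Count differing positions, plus any difference in length (assumed rule)."""
    differing = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differing + abs(len(seq_a) - len(seq_b))


def count_hyb_record_mismatches(fold_seq, hyb_seq, seq_type,
                                seg1_range=None, seg2_range=None):
    """Choose the expected sequence based on seq_type, then count mismatches."""
    if seq_type == "static":
        # Static: the fold sequence should equal the hyb sequence exactly.
        expected = hyb_seq
    elif seq_type == "dynamic":
        # Dynamic: the fold sequence is assembled from the aligned pieces.
        # Hypothetical convention: ranges are 1-based and inclusive.
        expected = (hyb_seq[seg1_range[0] - 1:seg1_range[1]]
                    + hyb_seq[seg2_range[0] - 1:seg2_range[1]])
    else:
        raise ValueError(f"Unknown seq_type: {seq_type!r}")
    return count_mismatches(fold_seq, expected)


# Sequences from the Gapped Alignment Example.
hyb_seq = "TTAGCTTATCAGACTGATGTTAGCTTATCAGACTGATG"
fold_seq = "AGCTTATCAGACTGATTAGCTTATCAGACTGATG"
dynamic_count = count_hyb_record_mismatches(
    fold_seq, hyb_seq, "dynamic", seg1_range=(3, 18), seg2_range=(21, 38))
static_count = count_hyb_record_mismatches(hyb_seq, hyb_seq, "static")
```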

static_count_hyb_record_mismatches(hyb_record: HybRecord) → int

Count mismatches between hyb_record.seq and fold_record.seq.

Parameters: hyb_record (HybRecord) -- hyb_record for comparison.

dynamic_count_hyb_record_mismatches(hyb_record: HybRecord) → int

Count mismatches between hyb_record.seq and dynamic fold_record.seq.

Parameters: hyb_record (HybRecord) -- hyb_record for comparison

matches_hyb_record(hyb_record: HybRecord, allowed_mismatches: Optional[int] = None) → bool

Return True if the number of mismatches between self.seq and hyb_record.seq is <= allowed_mismatches.

Parameters

hyb_record (HybRecord) -- hyb_record to compare.
allowed_mismatches (int, optional) -- Number of mismatches allowed for a match. If not provided, defaults to the option in settings['allowed_mismatches'].

ensure_matches_hyb_record(hyb_record: HybRecord, allowed_mismatches: Optional[int] = None) → None

Ensure self.seq matches hyb_record.seq, else raise an error.

Parameters

hyb_record (HybRecord) -- hyb_record to compare.
allowed_mismatches (int, optional) -- Number of mismatches allowed for a match. If not provided, defaults to the option in settings['allowed_mismatches'].
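
The relationship between matches_hyb_record() and ensure_matches_hyb_record() can be sketched as a boolean check and a raising wrapper around it. The helpers below are hypothetical standalone functions operating on plain strings; the SETTINGS dict stands in for the class-level settings, and the exception type is an assumption.

```python
# Stand-in for the class-level settings (default taken from the docs above).
SETTINGS = {"allowed_mismatches": 0}


def count_mismatches(seq_a, seq_b):
    """Count differing positions, plus any difference in length (assumed rule)."""
    differing = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differing + abs(len(seq_a) - len(seq_b))


def matches(fold_seq, hyb_seq, allowed_mismatches=None):
    """Return True if the mismatch count is within the allowed threshold."""
    if allowed_mismatches is None:
        # Fall back to the settings value when no threshold is given.
        allowed_mismatches = SETTINGS["allowed_mismatches"]
    return count_mismatches(fold_seq, hyb_seq) <= allowed_mismatches


def ensure_matches(fold_seq, hyb_seq, allowed_mismatches=None):
    """Raise an error (type assumed here) if the sequences do not match."""
    if not matches(fold_seq, hyb_seq, allowed_mismatches):
        raise ValueError("Fold sequence does not match hyb sequence")


strict_result = matches("ACGT", "ACGA")
lenient_result = matches("ACGT", "ACGA", allowed_mismatches=1)
```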

classmethod from_vienna_lines(record_lines: List[str], error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) → Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Construct instance from a list of 3 strings of vienna-format ([ViennaFormat]) lines.

See Vienna File Format for more details.

Parameters: record_lines (list or tuple) -- Iterable of 3 strings corresponding to lines of a vienna-format record.

classmethod from_vienna_string(record_string: str, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) → Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Construct instance from a string representing 3 vienna-format ([ViennaFormat]) lines.

See Vienna File Format for more details.

Parameters: record_string (str) -- 3-line string containing a vienna-format record

classmethod from_ct_lines(record_lines: List[str], error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) → Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Create a FoldRecord from a list of record lines in ".ct" format ([CTFormat]).

See CT File Format for more details.

Warning

This method is in beta stage, and is not well-tested.

Parameters: record_lines (list or tuple) -- list containing lines of ct record

classmethod from_ct_string(record_string: str, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) → Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Create a FoldRecord entry from a multi-line string from ".ct" format ([CTFormat]).

See CT File Format for more details.

Warning

This method is in beta stage, and is not well-tested.

Parameters: record_string (str) -- String containing lines of ct record

ViennaFile Class

class hybkit.ViennaFile(*args: Any, seq_type: Optional[Literal['static', 'dynamic']] = None, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, from_file_like: bool = False, **kwargs: Any)

Vienna file wrapper that returns vienna-format file lines as FoldRecord objects.

See Vienna File Format for more information.

Parameters

seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).
error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.
from_file_like (bool, optional) -- If True, treat the first argument as a file-like object (such as io.StringIO or gzip.GzipFile) and ignore the remaining positional arguments (Default: False).
*args -- Passed to open().
**kwargs -- Passed to open().

Variables

fh (file) -- File handle for the file being wrapped.
foldrecord_seq_type (str) -- Type of FoldRecord to return (see Args)
error_mode (str) -- Mode for error catching (see Args)

Warning

Fold files are occasionally poorly formatted. This iterator attempts to catch the resulting errors, but does not always succeed, so verbose error modes are encouraged.

read_record(override_error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None) → Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Read next three lines and return output as FoldRecord object.

Parameters: override_error_mode (str) -- Override the error_mode set in the ViennaFile object. See the ViennaFile Constructor for more information on allowed error modes.
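
A minimal sketch of how the three error modes could interact with reading a three-line record is shown below. It does not use hybkit: read_vienna_record and handle_error are hypothetical helpers, a plain dict stands in for a FoldRecord, and the conditions that produce the 'NOFOLD' and 'NOENERGY' sentinels are assumptions inferred from the return type above.

```python
import io
import warnings


def handle_error(error_mode, sentinel, message):
    """Apply an error mode: raise, warn then return, or return silently."""
    if error_mode == "raise":
        raise ValueError(message)
    if error_mode == "warn_return":
        warnings.warn(message)
    elif error_mode != "return":
        raise ValueError(f"Unknown error_mode: {error_mode!r}")
    return (sentinel, message)


def read_vienna_record(fh, error_mode="raise"):
    """Read three lines from fh; return a dict or an error tuple."""
    lines = [fh.readline().rstrip("\n") for _ in range(3)]
    if not all(lines):
        return handle_error(error_mode, None, "Incomplete record")
    record_id, seq, fold_line = lines
    fields = fold_line.split(maxsplit=1)
    fold = fields[0] if fields else ""
    # Assumed 'NOFOLD' condition: not a dot-bracket string of matching length.
    if not fold or set(fold) - set("(.)") or len(fold) != len(seq):
        return handle_error(error_mode, "NOFOLD", f"No fold for {record_id}")
    # Assumed 'NOENERGY' condition: no parseable energy after the fold.
    try:
        energy = float(fields[1].strip().strip("()"))
    except (IndexError, ValueError):
        return handle_error(error_mode, "NOENERGY", f"No energy for {record_id}")
    return {"id": record_id, "seq": seq, "fold": fold, "energy": energy}


# Toy data: the second record is missing its energy.
handle = io.StringIO(
    "rec1\nGGGAAACCC\n(((...)))\t(-1.2)\n"
    "rec2\nGGGAAACCC\n(((...)))\n"
)
first = read_vienna_record(handle)
second = read_vienna_record(handle, error_mode="return")
```

Because error returns are tuples rather than records, callers using the "return" or "warn_return" modes would need to check the type of each result before treating it as a record.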

close() → None: Close the file handle.

classmethod open(path: str, *args: Any, **kwargs: Any) → Self

Open a path to a text file using open() and return relevant file object.

Arguments match those of the Python3 built-in open() function and are passed directly to it.

This method is provided as a convenience function for drop-in replacement of the built-in open() function.

Specific keyword arguments are provided for fold-file-specific settings:

Parameters

path (str) -- Path to file to open.
seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).
error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.
*args -- Passed directly to open().
**kwargs -- Passed directly to open().

Returns

ViennaFile object.

read_records() → List[FoldRecord]: Return list of all FoldRecord objects for this file type.

settings = {}: Class-level settings. See hybkit.settings.FoldFile_settings_info for descriptions.

write_fh(*args: Any, **kwargs: Any) → None: Write directly to the underlying file handle.

write_record(write_record: FoldRecord) → None

Write a FoldRecord object for this file type.

Unlike the file.write() method, this method will add a newline to the end of each written record line.

Parameters: write_record (FoldRecord) -- FoldRecord object to write.

write_records(write_records: Iterable[FoldRecord]) → None

Write a sequence of FoldRecord objects for this file type.

Unlike the file.writelines() method, this method will add a newline to the end of each written record line.

Parameters: write_records (list) -- List of FoldRecord objects to write.
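
The newline behavior noted above can be contrasted with the built-in file.writelines() using io.StringIO. The write_record_lines function is a hypothetical stand-in for the record-writing methods, and the lines are toy data.

```python
import io


def write_record_lines(fh, record_lines):
    """Write each line followed by a newline (stand-in for write_records)."""
    for line in record_lines:
        fh.write(line + "\n")


record_lines = ["rec1", "GGGAAACCC", "(((...)))\t(-1.2)"]

# Built-in writelines() writes the strings back to back, with no separators.
plain = io.StringIO()
plain.writelines(record_lines)

# The stand-in terminates every line, so records stay on separate lines.
wrapped = io.StringIO()
write_record_lines(wrapped, record_lines)
```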

CtFile Class

class hybkit.CtFile(*args: Any, seq_type: Optional[Literal['static', 'dynamic']] = None, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, from_file_like: bool = False, **kwargs: Any)

Ct file wrapper that returns ".ct" file lines as FoldRecord objects.

See CT File Format for more information.

Warning

This class is in beta stage, and is not well-tested.

Parameters

seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).
error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.
from_file_like (bool, optional) -- If True, treat the first argument as a file-like object (such as io.StringIO or gzip.GzipFile) and ignore the remaining positional arguments (Default: False).
*args -- Passed to open().
**kwargs -- Passed to open().

Variables

fh (file) -- File handle for the file being wrapped.
foldrecord_seq_type (str) -- Type of FoldRecord to return (see Args)
error_mode (str) -- Mode for error catching (see Args)

Warning

Fold files are occasionally poorly formatted. This iterator attempts to catch the resulting errors, but does not always succeed, so verbose error modes are encouraged.

read_record() → Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]

Return the next CT record as a FoldRecord object.

Call next(self.fh) to return the first line of the next entry. Determine the expected number of following lines in the entry, and read that number of lines further. Return lines as a FoldRecord object.
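
The reading procedure described above can be sketched as follows, assuming the first whitespace-separated field of each entry's header line gives the number of base rows that follow. The helpers are hypothetical and return raw lines rather than a FoldRecord; the CT text is toy data.

```python
import io


def read_ct_record_lines(fh):
    """Read one CT entry: a header line plus the rows it announces."""
    header = next(fh)
    # Assumption: the header's first field is the number of rows that follow.
    expected_rows = int(header.split()[0])
    rows = [next(fh) for _ in range(expected_rows)]
    return [header] + rows


def ct_rows_to_seq(rows):
    """Join the base column (second field) of each row into a sequence."""
    return "".join(row.split()[1] for row in rows)


# Two toy entries of different lengths, back to back in one file.
ct_text = (
    "4\tdG = -1.0\ttoy_record_1\n"
    "1\tG\t0\t2\t4\t1\n"
    "2\tA\t1\t3\t0\t2\n"
    "3\tA\t2\t4\t0\t3\n"
    "4\tC\t3\t0\t1\t4\n"
    "2\tdG = 0.0\ttoy_record_2\n"
    "1\tA\t0\t2\t0\t1\n"
    "2\tU\t1\t0\t0\t2\n"
)
handle = io.StringIO(ct_text)
first_entry = read_ct_record_lines(handle)
second_entry = read_ct_record_lines(handle)
```

Reading the count first is what lets entries of different lengths be separated without a delimiter line between them.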

write_record = None: CtFile Record Writing Not Implemented

write_records = None: CtFile Record Writing Not Implemented

close() → None: Close the file handle.

classmethod open(path: str, *args: Any, **kwargs: Any) → Self

Open a path to a text file using open() and return relevant file object.

Arguments match those of the Python3 built-in open() function and are passed directly to it.

This method is provided as a convenience function for drop-in replacement of the built-in open() function.

Specific keyword arguments are provided for fold-file-specific settings:

Parameters

path (str) -- Path to file to open.
seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).
error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.
*args -- Passed directly to open().
**kwargs -- Passed directly to open().

Returns

CtFile object.

read_records() → List[FoldRecord]: Return list of all FoldRecord objects for this file type.

settings = {}: Class-level settings. See hybkit.settings.FoldFile_settings_info for descriptions.

write_fh(*args: Any, **kwargs: Any) → None: Write directly to the underlying file handle.

HybFoldIter Class

class hybkit.HybFoldIter(hybfile_handle: HybFile, foldfile_handle: FoldFile, combine: bool = False, iter_error_mode: Optional[Literal['raise', 'warn_return', 'warn_skip', 'skip', 'return']] = None)

Iterator for simultaneous iteration over a HybFile and FoldFile object.

This class provides an iterator that steps through a HybFile and either a ViennaFile or a CtFile simultaneously, returning a HybRecord and FoldRecord for each pair of entries.

Basic error checking / catching is performed based on the value of the settings['iter_error_mode'] setting.

Parameters

hybfile_handle (HybFile) -- HybFile object for iteration
foldfile_handle (ViennaFile or CtFile) -- ViennaFile or CtFile object for iteration
combine (bool, optional) -- Use HybRecord.set_fold_record(FoldRecord) and return only the HybRecord.
iter_error_mode (str, optional) -- Error mode to use for reading FoldRecord objects. If not set, defaults to the value in settings['iter_error_mode'].

Returns

(HybRecord, FoldRecord)
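
The paired iteration can be approximated with a generator over two record streams. This is a sketch under stated assumptions rather than hybkit's logic: iter_hyb_fold is a hypothetical function, dicts stand in for HybRecord and FoldRecord, the only checks are "fold read returned an error tuple" and "sequences differ", and the handling of each iter_error_mode value is inferred from its name.

```python
import warnings


def iter_hyb_fold(hyb_records, fold_records, combine=False,
                  iter_error_mode="warn_skip", max_sequential_skips=100):
    """Yield paired records, applying an error mode to problem pairs."""
    sequential_skips = 0
    for hyb, fold in zip(hyb_records, fold_records):
        error = None
        if isinstance(fold, tuple):
            # An error tuple such as ('NOFOLD', message) from the fold reader.
            error = f"{fold[0]}: {fold[1]}"
        elif fold["seq"] != hyb["seq"]:
            error = f"Sequence mismatch for {hyb['id']}"
        if error is not None:
            if iter_error_mode == "raise":
                raise RuntimeError(error)
            if iter_error_mode in ("warn_skip", "warn_return"):
                warnings.warn(error)
            if iter_error_mode in ("warn_skip", "skip"):
                sequential_skips += 1
                if sequential_skips > max_sequential_skips:
                    raise RuntimeError("Too many sequential skips")
                continue
        else:
            sequential_skips = 0
        if combine:
            # Stand-in for HybRecord.set_fold_record(): attach and return one object.
            yield dict(hyb, fold_record=fold)
        else:
            yield hyb, fold


# Toy records: the second fold entry is an error tuple and gets skipped.
hyb_records = [
    {"id": "a", "seq": "GGGAAACCC"},
    {"id": "b", "seq": "AAAA"},
    {"id": "c", "seq": "CCCC"},
]
fold_records = [
    {"id": "a", "seq": "GGGAAACCC", "fold": "(((...)))"},
    ("NOFOLD", "No fold for b"),
    {"id": "c", "seq": "CCCC", "fold": "...."},
]
pairs = list(iter_hyb_fold(hyb_records, fold_records, iter_error_mode="skip"))
combined = list(iter_hyb_fold(hyb_records, fold_records, combine=True,
                              iter_error_mode="skip"))
```

Counting sequential skips guards against the two files drifting out of step, where every subsequent pair would fail the same check.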

settings = {'error_checks': ['hybrecord_indel', 'foldrecord_nofold', 'max_mismatch', 'energy_mismatch'], 'iter_error_mode': 'warn_skip', 'max_sequential_skips': 100}: Class-level settings. See settings.HybFoldIter_settings_info for descriptions.

report() → List[str]: Return a report of information from iteration.

print_report() → None: Print a report of information from iteration.