hybkit (module)
Module storing primary hybkit classes and hybkit API.
This module contains classes and methods for reading, writing, and manipulating data in the hyb genomic sequence format ([Travis2014]). For more information, see the hybkit Hyb File Specification.
An example string of a hyb-format line from [Gay2018] is:
2407_718\tATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC\t.\tMIMAT0000078_MirBase_miR-23a_microRNA\t1\t21\t1\t21\t0.0027\tENSG00000188229_ENST00000340384_TUBB2C_mRNA\t23\t49\t1181\t1207\t1.2e-06
Hybkit functionality is primarily based on classes for storage and evaluation of chimeric genomic sequences and associated fold-information:
HybRecord -- Class to store a single hyb (hybrid) sequence record
FoldRecord -- Class to store predicted RNA secondary structure information for hybrid reads
Also included are classes for reading, writing, and iterating over files containing hybrid information:
HybFile -- Class for reading and writing hyb-format files [Travis2014] containing chimeric RNA sequence information as HybRecord objects
ViennaFile -- Class for reading and writing Vienna-format files [ViennaFormat] containing RNA secondary structure information in dot-bracket format as FoldRecord objects
CtFile -- -BETA- Class for reading Connectivity Table (.ct)-format files [CTFormat] containing predicted RNA secondary-structure information as used by UNAFold as FoldRecord objects
HybFoldIter -- Class for concurrent iteration over a HybFile and a ViennaFile or CtFile
HybRecord Class
- class hybkit.HybRecord(id: str, seq: str, energy: Optional[Union[float, int, str]] = None, seg1_props: Optional[Dict[str, Union[float, int, str]]] = None, seg2_props: Optional[Dict[str, Union[float, int, str]]] = None, flags: Optional[Dict[str, Any]] = None, read_count: Optional[int] = None, allow_undefined_flags: Optional[bool] = None)
Class for storing and analyzing chimeric (hybrid) RNA-seq reads in hyb format.
Hyb-format (hyb) entries follow a GFF-related file format described by [Travis2014] and contain information about a genomic sequence read identified as a hybrid by a chimeric read caller. Each line contains 15 or 16 columns separated by tabs ("\t") and provides annotations on each component. An example hyb-format line from [Gay2018]:
2407_718\tATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC\t.\tMIMAT0000078_MirBase_miR-23a_microRNA\t1\t21\t1\t21\t0.0027\tENSG00000188229_ENST00000340384_TUBB2C_mRNA\t23\t49\t1181\t1207\t1.2e-06
The columns are respectively described in hybkit as:
id, seq, energy, seg1_ref_name, seg1_read_start, seg1_read_end, seg1_ref_start, seg1_ref_end, seg1_score, seg2_ref_name, seg2_read_start, seg2_read_end, seg2_ref_start, seg2_ref_end, seg2_score, flags
(For more information, see the hybkit Hyb File Specification)
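As a rough illustration of the column layout above, the following sketch splits a hyb-format line on tabs and pairs each field with its column name, using only the Python standard library. It is not hybkit's parser (HybRecord.from_line() handles the real parsing and type conversion); HYB_COLUMNS and split_hyb_line are hypothetical names introduced here.

```python
# Column names in hyb-format order, as listed in the documentation above.
HYB_COLUMNS = (
    'id', 'seq', 'energy',
    'seg1_ref_name', 'seg1_read_start', 'seg1_read_end',
    'seg1_ref_start', 'seg1_ref_end', 'seg1_score',
    'seg2_ref_name', 'seg2_read_start', 'seg2_read_end',
    'seg2_ref_start', 'seg2_ref_end', 'seg2_score',
    'flags',
)


def split_hyb_line(line):
    """Return a dict mapping column names to raw string fields (hypothetical helper)."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) not in (15, 16):
        raise ValueError(f'Expected 15 or 16 columns, found {len(fields)}')
    # zip() stops at the shorter input, so a 15-column line has no 'flags' key.
    return dict(zip(HYB_COLUMNS, fields))


example_line = '\t'.join([
    '2407_718', 'ATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC', '.',
    'MIMAT0000078_MirBase_miR-23a_microRNA', '1', '21', '1', '21', '0.0027',
    'ENSG00000188229_ENST00000340384_TUBB2C_mRNA', '23', '49', '1181', '1207',
    '1.2e-06',
])
columns = split_hyb_line(example_line)
print(columns['seg1_ref_name'], columns['seg2_ref_name'])
```

All values remain strings in this sketch; converting positions to int and scores to float is left to the caller.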
The preferred method for reading hyb records from lines is the HybRecord.from_line() constructor:
    # line = "2407_718\tATC..."
    hyb_record = hybkit.HybRecord.from_line(line)
This is the constructor used by the HybFile class to parse hyb files. For example, to print all hybrid identifiers in a hyb file:
    with hybkit.HybFile('path/to/file.hyb', 'r') as hyb_file:
        # performs "hyb_record = hybkit.HybRecord.from_line(line)" for each line in file
        for hyb_record in hyb_file:
            print(hyb_record.id)
HybRecord objects can also be constructed directly. The minimum data required for a HybRecord object is the genomic sequence and its corresponding identifier.
Examples
    hyb_record_1 = hybkit.HybRecord('1_100', 'ACTG')
    hyb_record_2 = hybkit.HybRecord('2_107', 'CTAG', '-7.3')
    hyb_record_3 = hybkit.HybRecord('3_295', 'CTTG', energy='-10.3')
Details about segments are provided via Python dictionaries with keys specific to each segment. Data can be provided either as strings or as floats/integers (where appropriate). For example, to create a HybRecord object representing the example line given above:
    seg1_props = {'ref_name': 'MIMAT0000078_MirBase_miR-23a_microRNA',
                  'read_start': '1',
                  'read_end': '21',
                  'ref_start': '1',
                  'ref_end': '21',
                  'score': '0.0027'}
    seg2_props = {'ref_name': 'ENSG00000188229_ENST00000340384_TUBB2C_mRNA',
                  'read_start': 23,
                  'read_end': 49,
                  'ref_start': 1181,
                  'ref_end': 1207,
                  'score': 1.2e-06}
    seq_id = '2407_718'
    seq = 'ATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC'
    energy = None

    hyb_record = hybkit.HybRecord(seq_id, seq, energy, seg1_props, seg2_props)
    # OR
    hyb_record = hybkit.HybRecord(seq_id, seq, seg1_props=seg1_props, seg2_props=seg2_props)
- Parameters
id (str) -- Identifier for the hyb record
seq (str) -- Nucleotide sequence of the hyb record
energy (str or float, optional) -- Predicted energy of sequence folding in kcal/mol
seg1_props (dict, optional) -- Properties of segment 1 of the record, containing possible segment column keys: (ref_name, read_start, read_end, ref_start, ref_end, score)
seg2_props (dict, optional) -- Properties of segment 2 of the record, containing possible segment column keys: (ref_name, read_start, read_end, ref_start, ref_end, score)
flags (dict, optional) -- Dict with keys of flags for the record and their associated values. By default, flags must be defined in ALL_FLAGS, but custom flags can be supplied by changing HybRecord.settings['custom_flags']. This check can also be disabled by setting 'allow_undefined_flags' to True in HybRecord.settings.
allow_undefined_flags (bool, optional) -- If True, allows flags not defined in ALL_FLAGS or HybRecord.settings['custom_flags'] to be added to the record. If not provided, defaults to the value in HybRecord.settings['allow_undefined_flags'].
- Variables
id (str) -- Identifier for the hyb record (Hyb format: <read-num>_<read-count>)
seq (str) -- Nucleotide sequence of the hyb record
energy (str) -- Predicted energy of folding
seg1_props (dict) -- Information on chimeric segment 1, contains segment column keys: ref_name (str), read_start (int), read_end (int), ref_start (int), ref_end (int), and score (float).
seg2_props (dict) -- Information on segment 2, contains segment column keys: ref_name (str), read_start (int), read_end (int), ref_start (int), ref_end (int), and score (float).
flags (dict) -- Dict of flags with possible flag keys and values as defined in the Flags section of the hybkit Hyb File Specification.
fold_record (FoldRecord) -- Information on the predicted secondary structure of the sequence set by set_fold_record().
allow_undefined_flags (bool) -- Whether to allow undefined flags to be set.
- HYBRID_COLUMNS = ('id', 'seq', 'energy')
Record columns 1-3 defining parameters of the overall hybrid, defined by the Hyb format
- SEGMENT_COLUMNS = ('ref_name', 'read_start', 'read_end', 'ref_start', 'ref_end', 'score')
Record columns 4-9 and 10-15, defining annotated parameters of seg1 and seg2 respectively, defined by the Hyb format
- ALL_FLAGS = ('count_total', 'count_last_clustering', 'two_way_merged', 'seq_IDs_in_cluster', 'read_count', 'orient', 'det', 'seg1_type', 'seg2_type', 'seg1_det', 'seg2_det', 'miRNA_seg', 'target_reg', 'ext', 'dataset')
Flags defined by the hybkit package. Flags 1-4 are utilized by the Hyb software package. For information on flags, see the Flags portion of the hybkit Hyb File Specification.
- settings = {'allow_undefined_flags': False, 'allow_unknown_seg_types': False, 'custom_flags': [], 'hyb_placeholder': '.', 'mirna_types': ['miRNA', 'microRNA'], 'reorder_flags': True}
Class-level settings. See settings.HybRecord_settings_info for descriptions.
- TypeFinder
Link to type_finder.TypeFinder class for parsing sequence identifiers in assigning segment types by eval_types().
- SET_PROPS = ('energy', 'full_seg_props', 'fold_record', 'eval_types', 'eval_mirna', 'eval_target')
Properties for the is_set() method.
energy: energy is not None
full_seg_props: Each seg key is in segN_props dict and is not None
fold_record: fold_record has been set
eval_mirna: miRNA_seg flag has been set
- GEN_PROPS = ('has_indels',)
General record properties for the prop() method.
has_indels: either the seg1 or the seg2 alignment has insertions/deletions, shown by differing read/reference lengths for the same alignment
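The has_indels criterion described above (read length differing from reference length within one alignment) can be sketched as follows. This is an illustration rather than hybkit's implementation, and it assumes that start/end positions are 1-based and inclusive; seg_has_indel and record_has_indels are hypothetical helper names.

```python
def seg_has_indel(seg_props):
    """Return True if the aligned read and reference spans differ in length."""
    # Assumption: coordinates are 1-based and inclusive, so length = end - start + 1.
    read_len = int(seg_props['read_end']) - int(seg_props['read_start']) + 1
    ref_len = int(seg_props['ref_end']) - int(seg_props['ref_start']) + 1
    return read_len != ref_len


def record_has_indels(seg1_props, seg2_props):
    """Return True if either segment alignment shows an insertion/deletion."""
    return seg_has_indel(seg1_props) or seg_has_indel(seg2_props)


# Segment coordinates from the example hyb line in this documentation.
seg1 = {'read_start': 1, 'read_end': 21, 'ref_start': 1, 'ref_end': 21}
seg2 = {'read_start': 23, 'read_end': 49, 'ref_start': 1181, 'ref_end': 1207}
# A modified copy of seg2 whose reference span is one base longer than the read span.
seg2_shifted = dict(seg2, ref_end=1208)

print(record_has_indels(seg1, seg2), record_has_indels(seg1, seg2_shifted))
```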
- STR_PROPS = ('id_is', 'id_prefix', 'id_suffix', 'id_contains', 'seq_is', 'seq_prefix', 'seq_suffix', 'seq_contains', 'seg1_is', 'seg1_prefix', 'seg1_suffix', 'seg1_contains', 'seg2_is', 'seg2_prefix', 'seg2_suffix', 'seg2_contains', 'any_seg_is', 'any_seg_prefix', 'any_seg_suffix', 'any_seg_contains', 'seg1_type_is', 'seg1_type_prefix', 'seg1_type_suffix', 'seg1_type_contains', 'seg2_type_is', 'seg2_type_prefix', 'seg2_type_suffix', 'seg2_type_contains', 'any_seg_type_is', 'any_seg_type_prefix', 'any_seg_type_suffix', 'any_seg_type_contains')
String-comparison properties for the prop() method.
Field Types:
id: record.id
seq: record.seq
seg1: seg1_props['ref_name']
seg2: seg2_props['ref_name']
any_seg: seg1_props['ref_name'] OR seg2_props['ref_name']
seg1_type: seg1_type flag
seg2_type: seg2_type flag
Comparisons:
is: Comparison string matches field exactly
prefix: Comparison string matches beginning of field
suffix: Comparison string matches end of field
contains: Comparison string is contained within field
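The property names in STR_PROPS combine a field type with a comparison (for example seg1_type_prefix). A minimal sketch of how such a name could be evaluated is shown below; it is an assumption about the logic and not hybkit's code. COMPARISONS, check_str_prop, and the fields dict are hypothetical stand-ins for the record's values.

```python
# The four documented comparisons, expressed as plain string operations.
COMPARISONS = {
    'is': lambda value, query: value == query,
    'prefix': lambda value, query: value.startswith(query),
    'suffix': lambda value, query: value.endswith(query),
    'contains': lambda value, query: query in value,
}


def check_str_prop(prop, prop_compare, fields):
    """Evaluate a '<field>_<comparison>' property against a dict of field values."""
    field_name, comparison = prop.rsplit('_', 1)
    compare = COMPARISONS[comparison]
    if field_name.startswith('any_'):
        # 'any_seg' -> seg1/seg2; 'any_seg_type' -> seg1_type/seg2_type
        base = field_name[len('any_'):]
        names = [base.replace('seg', 'seg1', 1), base.replace('seg', 'seg2', 1)]
        return any(compare(fields[name], prop_compare) for name in names)
    return compare(fields[field_name], prop_compare)


# Field values taken from the example hyb line; the segment types are illustrative.
fields = {
    'id': '2407_718',
    'seq': 'ATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC',
    'seg1': 'MIMAT0000078_MirBase_miR-23a_microRNA',
    'seg2': 'ENSG00000188229_ENST00000340384_TUBB2C_mRNA',
    'seg1_type': 'microRNA',
    'seg2_type': 'mRNA',
}
print(check_str_prop('seg1_suffix', 'microRNA', fields))
```

With a real record, the equivalent query would go through the prop() method, passing the comparison string as prop_compare.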
- MIRNA_PROPS = ('has_mirna', 'no_mirna', 'mirna_dimer', 'mirna_not_dimer', '5p_mirna', '3p_mirna')
miRNA-evaluation-related properties for the prop() method. Requires the miRNA_seg flag to be set by the eval_mirna() method.
has_mirna: Either or both of seg1 and seg2 have been identified as a miRNA
no_mirna: Both seg1 and seg2 have been identified as not a miRNA
mirna_dimer: Both seg1 and seg2 have been identified as a miRNA
mirna_not_dimer: One and only one of seg1 or seg2 has been identified as a miRNA
5p_mirna: Seg1 (5p) has been identified as a miRNA
3p_mirna: Seg2 (3p) has been identified as a miRNA
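The relationships between these miRNA properties can be sketched as boolean combinations of "seg1 is a miRNA" and "seg2 is a miRNA". The sketch below derives them directly from segment types using the default settings['mirna_types'] values; hybkit itself reads the miRNA_seg flag set by eval_mirna(), so this is an illustration under that simplifying assumption, and mirna_props is a hypothetical helper.

```python
# Default miRNA type names from the documented settings['mirna_types'].
MIRNA_TYPES = frozenset({'miRNA', 'microRNA'})


def mirna_props(seg1_type, seg2_type, mirna_types=MIRNA_TYPES):
    """Return the six documented miRNA properties as a dict of booleans."""
    seg1_is_mirna = seg1_type in mirna_types
    seg2_is_mirna = seg2_type in mirna_types
    return {
        'has_mirna': seg1_is_mirna or seg2_is_mirna,
        'no_mirna': not (seg1_is_mirna or seg2_is_mirna),
        'mirna_dimer': seg1_is_mirna and seg2_is_mirna,
        'mirna_not_dimer': seg1_is_mirna != seg2_is_mirna,
        '5p_mirna': seg1_is_mirna,
        '3p_mirna': seg2_is_mirna,
    }


props = mirna_props('microRNA', 'mRNA')
dimer_props = mirna_props('miRNA', 'microRNA')
print(props['mirna_not_dimer'], dimer_props['mirna_dimer'])
```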
- MIRNA_STR_PROPS = ('mirna_is', 'mirna_prefix', 'mirna_suffix', 'mirna_contains', 'target_is', 'target_prefix', 'target_suffix', 'target_contains', 'mirna_seg_type_is', 'mirna_seg_type_prefix', 'mirna_seg_type_suffix', 'mirna_seg_type_contains', 'target_seg_type_is', 'target_seg_type_prefix', 'target_seg_type_suffix', 'target_seg_type_contains')
Comparisons:
is: Comparison string matches field exactly
prefix: Comparison string matches beginning of field
suffix: Comparison string matches end of field
contains: Comparison string is contained within field
- HAS_PROPS = ('has_indels', 'id_is', 'id_prefix', 'id_suffix', 'id_contains', 'seq_is', 'seq_prefix', 'seq_suffix', 'seq_contains', 'seg1_is', 'seg1_prefix', 'seg1_suffix', 'seg1_contains', 'seg2_is', 'seg2_prefix', 'seg2_suffix', 'seg2_contains', 'any_seg_is', 'any_seg_prefix', 'any_seg_suffix', 'any_seg_contains', 'seg1_type_is', 'seg1_type_prefix', 'seg1_type_suffix', 'seg1_type_contains', 'seg2_type_is', 'seg2_type_prefix', 'seg2_type_suffix', 'seg2_type_contains', 'any_seg_type_is', 'any_seg_type_prefix', 'any_seg_type_suffix', 'any_seg_type_contains', 'has_mirna', 'no_mirna', 'mirna_dimer', 'mirna_not_dimer', '5p_mirna', '3p_mirna', 'mirna_is', 'mirna_prefix', 'mirna_suffix', 'mirna_contains', 'target_is', 'target_prefix', 'target_suffix', 'target_contains', 'mirna_seg_type_is', 'mirna_seg_type_prefix', 'mirna_seg_type_suffix', 'mirna_seg_type_contains', 'target_seg_type_is', 'target_seg_type_prefix', 'target_seg_type_suffix', 'target_seg_type_contains')
All allowed properties for the prop() method. See GEN_PROPS, STR_PROPS, MIRNA_PROPS, and MIRNA_STR_PROPS.
- set_flag(flag_key: str, flag_val: Optional[Union[float, int, str, bool]], allow_undefined_flags: Optional[bool] = None) None
Set the value of record flag_key to flag_val.
- Parameters
flag_key (str) -- Key for flag to set.
flag_val -- Value for flag to set.
allow_undefined_flags (bool, optional) -- Allow inclusion of flags not defined in ALL_FLAGS or in settings['custom_flags']. If not provided, uses the setting in HybRecord.allow_undefined_flags (which defaults to the value in settings['allow_undefined_flags']).
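The flag-key check described for set_flag() can be pictured as a membership test against ALL_FLAGS plus any custom flags. The following is a hedged sketch of that rule only; the function name, the error type, and the message are assumptions, not hybkit's behavior.

```python
# Flag names copied from the documented HybRecord.ALL_FLAGS attribute.
ALL_FLAGS = (
    'count_total', 'count_last_clustering', 'two_way_merged', 'seq_IDs_in_cluster',
    'read_count', 'orient', 'det', 'seg1_type', 'seg2_type', 'seg1_det',
    'seg2_det', 'miRNA_seg', 'target_reg', 'ext', 'dataset',
)


def check_flag_key(flag_key, custom_flags=(), allow_undefined_flags=False):
    """Raise ValueError for an undefined flag key unless undefined flags are allowed."""
    if allow_undefined_flags:
        return
    if flag_key in ALL_FLAGS or flag_key in custom_flags:
        return
    raise ValueError(f'Flag {flag_key!r} is not defined')


check_flag_key('dataset')                                   # defined in ALL_FLAGS
check_flag_key('my_flag', custom_flags=['my_flag'])         # defined as a custom flag
check_flag_key('other_flag', allow_undefined_flags=True)    # check disabled

try:
    check_flag_key('other_flag')
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```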
- get_seg1_type(require: bool = False) Optional[str]
Return the seg1_type flag if defined, or return None.
- Parameters
require (bool, optional) -- If True, raise an error if seg1_type is not defined.
- get_seg2_type(require: bool = False) Optional[str]
Return the seg2_type flag if defined, or return None.
- Parameters
require (bool, optional) -- If True, raise an error if seg2_type is not defined.
- get_seg_types(require: bool = False) Tuple[Optional[str], Optional[str]]
Return "seg1_type" (or None), "seg2_type" (or None) flags.
Return a tuple of the seg1_type and seg2_type flags for each respective flag that is defined, or None for each flag that is not.
- Parameters
require (bool, optional) -- If True, raise an error if either flag is not defined.
- get_read_count(require: bool = False) Optional[int]
Return the read_count flag if defined, otherwise return None.
- Parameters
require (bool, optional) -- If True, raise an error if the "read_count" flag is not defined.
- get_record_count(require: bool = False) int
Return count_total flag if defined, or return 1 (this record).
- Parameters
require (bool, optional) -- If True, raise an error if the "count_total" flag is not defined.
- get_mirna_props(allow_mirna_dimers: bool = False, require: bool = True) Optional[Dict]
Return the seg_props dict corresponding to the miRNA segment, if set.
If eval_mirna() has been run, return the seg_props dict corresponding to the miRNA segment type as determined by checking the miRNA_seg flag, or None if the record does not contain a miRNA.
- get_target_props(allow_mirna_dimers: bool = False, require: bool = True) Optional[Dict]
Return the seg_props dict corresponding to the target segment, if set.
If eval_mirna() has been run, return the seg_props dict corresponding to the target segment type as determined by checking the miRNA_seg flag (and returning the other segment), or None if the record does not contain a miRNA or contains two miRNAs.
- Parameters
allow_mirna_dimers (bool, optional) -- If True, consider miRNA dimers as a miRNA/target pair and return the 3p miRNA segment properties as the arbitrarily-selected "target" of the dimer pair.
require (bool, optional) -- If True, raise an error if the read does not contain a single target-annotated segment (Default: True).
- eval_types(allow_unknown: Optional[bool] = None) None
Find the types of each segment using the TypeFinder class.
This method provides HybRecord.seg1_props and HybRecord.seg2_props to the TypeFinder class, linked as attribute HybRecord.TypeFinder. This uses the method TypeFinder.find, set by TypeFinder.set_method or TypeFinder.set_custom_method, to set the seg1_type and seg2_type flags if not already set.
To use a type-finding method other than the default, prepare the TypeFinder class by preparing and setting TypeFinder.params and using TypeFinder.set_method.
- Parameters
allow_unknown (bool, optional) -- If True, allow segment types that cannot be identified and set them as "unknown". Otherwise raise an error. If not provided, uses the setting in settings['allow_unknown_seg_types'].
- set_fold_record(fold_record: Union[FoldRecord, Tuple[FoldRecord, Any]], allow_energy_mismatch: bool = False) None
Check and set provided fold_record (FoldRecord) as attribute fold_record.
Ensures that the fold_record argument is an instance of FoldRecord and has a sequence matching this HybRecord, then sets it as HybRecord.fold_record.
- Parameters
fold_record (FoldRecord) -- FoldRecord instance to set as HybRecord.fold_record.
allow_energy_mismatch (bool, optional) -- If True, allow mismatched fold_record and HybRecord energy. Otherwise, raise an error.
- eval_mirna(override: bool = False, mirna_types: Optional[bool] = None) None
Analyze and set miRNA properties from type properties in the hyb record.
If not already done, determine whether a miRNA exists within this record and set the miRNA_seg flag. This evaluation requires the seg1_type and seg2_type flags to be populated, which can be performed by the eval_types() method.
- Parameters
override (bool, optional) -- If True, override existing miRNA_seg flag if present.
mirna_types (list, tuple, or set, optional) -- Iterable of strings representing sequence types considered as miRNA. If not provided, the types are taken from settings['mirna_types'] (it is suggested that this be provided as a set for fastest checking).
- mirna_details(detail: Literal['all', 'mirna_ref', 'target_ref', 'mirna_seg_type', 'target_seg_type', 'mirna_seq', 'target_seq', 'mirna_fold', 'target_fold'] = 'all', allow_mirna_dimers: bool = False) Optional[Union[Dict, str]]
Provide a detail about the miRNA or target following eval_mirna().
Analyze miRNA properties within the sequence record and provide a detail as output. Unless allow_mirna_dimers is True, this method requires the record to contain a non-dimer miRNA; otherwise an error will be raised.
- Parameters
detail (str) -- Type of detail to return. Options include:
all: Dict of all properties (default)
mirna_ref: Identifier for assigned miRNA
target_ref: Identifier for assigned target
mirna_seg_type: Assigned seg_type of miRNA
target_seg_type: Assigned seg_type of target
mirna_seq: Annotated subsequence of miRNA
target_seq: Annotated subsequence of target
mirna_fold: Annotated fold substring of miRNA (requires fold_record set)
target_fold: Annotated fold substring of target (requires fold_record set)
allow_mirna_dimers (bool, optional) -- Allow miRNA/miRNA dimers. The 5p-position will be assigned as the "miRNA", and the 3p-position will be assigned as the "target".
- mirna_detail(*args, **kwargs)
Deprecated alias for mirna_details().
Deprecated since version v0.3.0.
- is_set(prop: str) bool
Return True if HybRecord property "prop" is set (if relevant) and is not None.
Options described in SET_PROPS.
- Parameters
prop (str) -- Property / Analysis to check
- not_set(prop: str) bool
Return False if HybRecord property "prop" is set (if relevant) and is not None.
(Returns not is_set(prop).)
- Parameters
prop (str) -- Property / Analysis to check
- prop(prop: str, prop_compare: Optional[str] = None) bool
Return True if HybRecord has property: prop.
Check property against list of allowed properties in HAS_PROPS. If the query property has a string comparator, provide this in prop_compare. Raises an error if a prerequisite field is not set (use is_set() to check whether properties are set).
Specific properties available to check are described in attributes:
General Record Properties (GEN_PROPS)
Field String Comparison Properties (STR_PROPS)
miRNA-Associated Record Properties (MIRNA_PROPS)
miRNA-Associated String Comparison Properties (MIRNA_STR_PROPS)
- has_prop(*args, **kwargs)
Return True if HybRecord has property: prop.
Deprecated since version v0.3.0: Use prop() instead.
- to_line(newline: bool = True, sep: str = '\t') str
Return a hyb-format string representation of the record.
- to_csv(newline: bool = False) str
Return a comma-separated hyb-format string representation of the record.
- Parameters
newline (bool, optional) -- If True, end the returned string with a newline.
- to_fields(missing_obj: Optional[Union[float, int, str, bool]] = None) dict
Return a python dictionary representation of the record.
Returns a dictionary with keys corresponding to the fields in the hyb-format file, and values corresponding to the values in the record. Output can be used with the pandas DataFrame constructor or csv.DictWriter.
- Parameters
missing_obj (optional) -- Object to use for missing values. Default = None.
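To illustrate how a field dictionary of this shape can be fed to csv.DictWriter, the sketch below writes a header and one row to an in-memory buffer. The record_fields dict is a hand-written stand-in for what to_fields() would return (hybkit is not imported here), and FIELD_NAMES mirrors the field list shown for to_fields_header().

```python
import csv
import io

# Field names in the order documented for to_fields_header().
FIELD_NAMES = [
    'id', 'seq', 'energy',
    'seg1_ref_name', 'seg1_read_start', 'seg1_read_end',
    'seg1_ref_start', 'seg1_ref_end', 'seg1_score',
    'seg2_ref_name', 'seg2_read_start', 'seg2_read_end',
    'seg2_ref_start', 'seg2_ref_end', 'seg2_score',
    'flags',
]

# Stand-in for hyb_record.to_fields(); None marks a missing value (missing_obj=None).
record_fields = {
    'id': '2407_718',
    'seq': 'ATCACATTGCCAGGGATTTCCAATCCCCAACAATGTGAAAACGGCTGTC',
    'energy': None,
    'seg1_ref_name': 'MIMAT0000078_MirBase_miR-23a_microRNA',
    'seg1_read_start': 1, 'seg1_read_end': 21,
    'seg1_ref_start': 1, 'seg1_ref_end': 21, 'seg1_score': 0.0027,
    'seg2_ref_name': 'ENSG00000188229_ENST00000340384_TUBB2C_mRNA',
    'seg2_read_start': 23, 'seg2_read_end': 49,
    'seg2_ref_start': 1181, 'seg2_ref_end': 1207, 'seg2_score': 1.2e-06,
    'flags': None,
}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=FIELD_NAMES, lineterminator='\n')
writer.writeheader()
writer.writerow(record_fields)  # the csv module writes None as an empty field
csv_text = buffer.getvalue()
print(csv_text)
```

The same list of dicts could be passed to the pandas DataFrame constructor instead of csv.DictWriter.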
- to_fasta_record(mode: Literal['hybrid', 'seg1', 'seg2', 'mirna', 'target'] = 'hybrid', annotate: bool = True, allow_mirna_dimers: bool = False) None
Return nucleotide sequence as BioPython SeqRecord object.
- Parameters
mode (str, optional) -- Determines which sequence component to return. Options:
hybrid: Entire hybrid sequence (default)
seg1: Sequence 1 (if defined)
seg2: Sequence 2 (if defined)
mirna: miRNA sequence of miRNA/target pair (if defined, else None)
target: Target sequence of miRNA/target pair (if defined, else None)
annotate (bool, optional) -- Add name of components to fasta sequence identifier if present.
allow_mirna_dimers (bool, optional) -- If True, allow miRNA dimers to be returned as miRNA sequence (the 5p segment will be selected as the "miRNA").
- to_fasta_str(mode: Literal['hybrid', 'seg1', 'seg2', 'mirna', 'target'] = 'hybrid', annotate: bool = True) str
Return nucleotide sequence as a fasta string.
- Parameters
mode (str, optional) -- As with the to_fasta_record() method.
annotate (bool, optional) -- Add name of components to fasta sequence identifier if present.
- classmethod from_line(line: str, hybformat_id: bool = False, hybformat_ref: bool = False) Self
Construct a HybRecord instance from a single-line hyb-format string.
The Hyb software package ([Travis2014]) records read-count information in the "id" field of the record, which can be read by setting hybformat_id=True. Additionally, the Hyb hOH7 database contains the segment type in the identifier of each reference in the 4th field, which can be read by setting hybformat_ref=True.
- Parameters
line (str) -- hyb-format string containing record information.
hybformat_id (bool, optional) -- If True, read count information from identifier in <read_number>_<read_count> format.
hybformat_ref (bool, optional) -- If True, read additional record information from identifier in <gene_id>_<transcript_id>_<gene_name>_<seg_type> format.
- Returns
HybRecord instance containing record information.
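The two identifier layouts named by hybformat_id and hybformat_ref can be illustrated with plain string splitting. This is a sketch of the formats only, not of how from_line() parses them internally; both helper names are hypothetical, and the handling of gene names that themselves contain underscores is an assumption.

```python
def parse_hybformat_id(record_id):
    """Split '<read_number>_<read_count>' into two integers."""
    read_number, read_count = record_id.split('_')
    return int(read_number), int(read_count)


def parse_hybformat_ref(ref_name):
    """Split '<gene_id>_<transcript_id>_<gene_name>_<seg_type>' into its parts."""
    parts = ref_name.split('_')
    if len(parts) < 4:
        raise ValueError(f'Expected at least 4 underscore-separated parts: {ref_name!r}')
    # Assumption: any extra underscores belong to the gene name.
    return {
        'gene_id': parts[0],
        'transcript_id': parts[1],
        'gene_name': '_'.join(parts[2:-1]),
        'seg_type': parts[-1],
    }


read_number, read_count = parse_hybformat_id('2407_718')
seg1_ref = parse_hybformat_ref('MIMAT0000078_MirBase_miR-23a_microRNA')
seg2_ref = parse_hybformat_ref('ENSG00000188229_ENST00000340384_TUBB2C_mRNA')
print(read_count, seg1_ref['seg_type'], seg2_ref['seg_type'])
```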
- classmethod from_fasta_records(seg1_record: None, seg2_record: None, hyb_id: Optional[str] = None, energy: Optional[Union[float, int, str]] = None, flags: Optional[Dict[str, Any]] = None) Self
Construct a HybRecord instance from two BioPython SeqRecord Objects.
Create an artificial HybRecord from two SeqRecord objects.
For the hybrid:
id: [seg1_record.id]--[seg2_record.id] (overwritten by the hyb_id parameter if provided)
seq: seg1_record.seq + seg2_record.seq
For each segment:
FASTA_Sequence_ID -> segN_ref_name
FASTA_Description -> Flags: segN_det (overwritten if the segN_det flag is provided directly)
Optional fields to add via function arguments:
hyb_id
energy
flags
- Parameters
seg1_record (SeqRecord) -- Biopython SeqRecord object containing information on the left/first/5p hybrid segment (seg1)
seg2_record (SeqRecord) -- Biopython SeqRecord object containing information on the right/second/3p hybrid segment (seg2)
hyb_id (str, optional) -- Identifier for the hyb record (overwrites generated id if provided)
energy (str or float, optional) -- Predicted energy of sequence folding in kcal/mol
flags (dict, optional) -- Dict with keys of flags for the record and their associated values. Any flags provided overwrite default-generated flags.
- Returns
HybRecord instance containing record information.
- classmethod to_fields_header() Literal['id', 'seq', 'energy', 'seg1_ref_name', 'seg1_read_start', 'seg1_read_end', 'seg1_ref_start', 'seg1_ref_end', 'seg1_score', 'seg2_ref_name', 'seg2_read_start', 'seg2_read_end', 'seg2_ref_start', 'seg2_ref_end', 'seg2_score', 'flags']
Return a list of the fields in a HybRecord object.
For use with the to_fields() method.
- classmethod to_csv_header(newline: bool = False) Literal['id,seq,energy,seg1_ref_name,seg1_read_start,seg1_read_end,seg1_ref_start,seg1_ref_end,seg1_score,seg2_ref_name,seg2_read_start,seg2_read_end,seg2_ref_start,seg2_ref_end,seg2_score,flags']
Return a comma-separated string representation of the fields in the record.
For use with the to_csv() method.
- Parameters
newline (bool, optional) -- If True, end the returned string with a newline.
HybFile Class
- class hybkit.HybFile(path: str, *args: Any, hybformat_id: Optional[bool] = None, hybformat_ref: Optional[bool] = None, from_file_like: bool = False, **kwargs: Any)
Wrapper for a hyb-format text file which returns entries (lines) as HybRecord objects.
- Parameters
path (str) -- Path to text file to open as hyb-format file.
*args -- Arguments passed to the open() function to open a text file for reading/writing.
hybformat_id (bool, optional) -- If True, during parsing of lines, read count information from the identifier in <read_number>_<read_count> format. Defaults to value in settings['hybformat_id'].
hybformat_ref (bool, optional) -- If True, during parsing of lines, read additional record information from the identifier in <gene_id>_<transcript_id>_<gene_name>_<seg_type> format. Defaults to value in settings['hybformat_ref'].
from_file_like (bool, optional) -- If True, the first argument is treated as a file-like object (such as io.StringIO or gzip.GzipFile) and the remaining positional arguments are ignored. (Default: False)
**kwargs -- Keyword arguments passed to the open() function to open a text file for reading/writing.
- Variables
- settings = {'hybformat_id': False, 'hybformat_ref': False}
Class-level settings. See hybkit.settings.HybFile_settings_info for descriptions.
- write_record(write_record: HybRecord) None
Write a HybRecord object to file as a Hyb-format string.
Unlike the file.write() method, this method will add a newline to the end of each written record line.
- Parameters
write_record (HybRecord) -- Record to write.
- write_records(write_records: Iterable[HybRecord]) None
Write a sequence of HybRecord objects as hyb-format lines to the Hyb file.
Unlike the file.writelines() method, this method will add a newline to the end of each written record line.
- write(*_args, **_kwargs) None
Implement no-op / error for "write" method to catch errors.
Use write_record() or write_fh() instead.
- classmethod open(path: str, *args: Any, hybformat_id: Optional[bool] = None, hybformat_ref: Optional[bool] = None, **kwargs: Any) Self
Open a path to a text file using open() and return a HybFile object.
Arguments match those of the Python3 built-in open() function and are passed directly to it.
This method is provided as a convenience function for drop-in replacement of the built-in open() function.
Specific keyword arguments are provided for HybFile-specific settings:
- Parameters
path (str) -- Path to file to open.
hybformat_id (bool, optional) -- If True, during parsing of lines, read count information from the identifier in <read_number>_<read_count> format. Defaults to value in settings['hybformat_id'].
hybformat_ref (bool, optional) -- If True, during parsing of lines, read additional record information from the identifier in <gene_id>_<transcript_id>_<gene_name>_<seg_type> format. Defaults to value in settings['hybformat_ref'].
- Example usage:
    with HybFile.open('path/to/file.hyb', 'r') as hyb_file:
        for record in hyb_file:
            print(record)
FoldRecord Class
- class hybkit.FoldRecord(id: str, seq: str, fold: str, energy: Optional[Union[float, int, str]] = None, seq_type: Optional[Literal['static', 'dynamic']] = None)
Class for storing secondary structure (folding) information for a nucleotide sequence.
This class supports the following file types: (Data courtesy of [Gay2018])
- Vienna format ([ViennaFormat]) example:
    34_151138_MIMAT0000076_MirBase_miR-21_microRNA_1_19-...
    TAGCTTATCAGACTGATGTTAGCTTATCAGACTGATG
    .....((((((.((((((......)))))).)))))) (-11.1)
- Connectivity Table (.ct) format ([CTFormat]) example:
    41 dG = -8 dH = -93.9 seq1_name-seq2_name
    1 A 0 2 0 1 0 0
    2 G 1 3 0 2 0 0
    ... ... ...
    40 G 39 41 11 17 39 41
    41 T 40 0 10 18 40 0
The minimum data required for a FoldRecord object is a sequence identifier, a genomic sequence, and its fold representation.
Two types of FoldRecord objects are supported, 'static' and 'dynamic'. Static FoldRecord objects are those where the 'seq' attribute matches exactly to the corresponding HybRecord.seq attribute (where applicable). Dynamic FoldRecord objects are those where FoldRecord.seq is reconstructed from aligned regions of a HybRecord.seq chimeric read: longer for chimeras with overlapping alignments, shorter for chimeras with gapped alignments.
Overlapping Alignment Example:
    Static:
    seg1: 1111111111111111111111
    seg2:                   222222222222222222222
    seq:  TAGCTTATCAGACTGATGTTTTAGCTTATCAGACTGATG
    Dynamic:
    seg1: 1111111111111111111111
    seg2:                       222222222222222222222
    seq:  TAGCTTATCAGACTGATGTTTTTTTTAGCTTATCAGACTGATG
Gapped Alignment Example:
    Static:
    seg1:   1111111111111111
    seg2:                     222222222222222222
    seq:  TTAGCTTATCAGACTGATGTTAGCTTATCAGACTGATG
    Dynamic:
    seg1: 1111111111111111
    seg2:                 222222222222222222
    seq:  AGCTTATCAGACTGATTAGCTTATCAGACTGATG
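The relationship between the static and dynamic sequences in these two examples can be sketched as joining the aligned read regions of each segment. This is an illustration and not hybkit's implementation; the function name is hypothetical, and the 1-based inclusive coordinates below are assumptions chosen to reproduce the example sequences.

```python
def build_dynamic_seq(static_seq, seg1_span, seg2_span):
    """Concatenate the aligned regions of a static read sequence.

    Each span is a (start, end) pair, assumed 1-based and inclusive.
    """
    pieces = []
    for start, end in (seg1_span, seg2_span):
        pieces.append(static_seq[start - 1:end])
    return ''.join(pieces)


# Overlapping alignments: positions 19-22 are covered by both segments,
# so the dynamic sequence is longer than the static one.
overlapping = build_dynamic_seq(
    'TAGCTTATCAGACTGATGTTTTAGCTTATCAGACTGATG', (1, 22), (19, 39))

# Gapped alignments: positions 1-2 and 19-20 are not covered by either
# segment, so the dynamic sequence is shorter than the static one.
gapped = build_dynamic_seq(
    'TTAGCTTATCAGACTGATGTTAGCTTATCAGACTGATG', (3, 18), (21, 38))

print(overlapping)
print(gapped)
```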
Dynamic sequences are found in the Hyb program *_hybrids_ua.hyb file type. This is primarily relevant in error-checking when setting a fold record with the HybRecord.set_fold_record() method.
When the 'static' FoldRecord type is used, static_count_hyb_record_mismatches() is used for HybRecord.fold_record error-checking.
When the 'dynamic' FoldRecord type is used, dynamic_count_hyb_record_mismatches() is used for HybRecord.fold_record error-checking.
- Parameters
id (str) -- Identifier for record
seq (str) -- Nucleotide sequence of record.
fold (str) -- Fold representation of record.
energy (str or float, optional) -- Energy of folding for record.
seq_type (str, optional) -- Expect sequence to be 'static' (match exactly to corresponding HybRecord.seq), or 'dynamic' (constructed from pieces of HybRecord.seq). If not provided, defaults to the settings['seq_type'] setting. See hybkit.settings.FoldRecord_settings_info for descriptions.
- Variables
id (str) -- Sequence Identifier (often seg1name-seg2name)
seq (str) -- Genomic Sequence
fold (str) -- Dot-bracket Fold Representation, '(', '.', and ')' characters
energy (str) -- Predicted energy of folding
seq_type (str) -- Whether sequence is 'static' or 'dynamic' (Default: 'static'; see Args for details)
- settings = {'allowed_mismatches': 0, 'error_mode': 'raise', 'fold_placeholder': '.', 'seq_type': 'static'}
Class-level settings. See hybkit.settings.FoldRecord_settings_info for descriptions.
- to_vienna_lines(newline: bool = True) List[str]
Return a list of lines for the record in vienna format.
See (Vienna File Format).
- Parameters
newline (bool, optional) -- Add newline character to the end of each returned line. (Default: True)
- to_vienna_string(newline: bool = True) str
Return a 3-line string for the record in vienna format.
See (Vienna File Format).
- Parameters
newline (bool, optional) -- Terminate the returned string with a newline character. (Default: True)
- count_hyb_record_mismatches(hyb_record: HybRecord) int
Count mismatches between hyb_record.seq and fold_record.seq.
Uses static_count_hyb_record_mismatches() if seq_type is static, or dynamic_count_hyb_record_mismatches() if seq_type is dynamic.
- Parameters
hyb_record (HybRecord) -- hyb_record for comparison.
- static_count_hyb_record_mismatches(hyb_record: HybRecord) int
Count mismatches between hyb_record.seq and fold_record.seq.
- Parameters
hyb_record (HybRecord) -- hyb_record for comparison.
- dynamic_count_hyb_record_mismatches(hyb_record: HybRecord) int
Count mismatches between hyb_record.seq and dynamic fold_record.seq.
- Parameters
hyb_record (HybRecord) -- hyb_record for comparison
- matches_hyb_record(hyb_record: HybRecord, allowed_mismatches: Optional[int] = None) bool
Return True if self.seq and hyb_record.seq mismatches are <= allowed_mismatches.
- Parameters
hyb_record (HybRecord) -- hyb_record to compare.
allowed_mismatches (int, optional) -- Number of mismatches allowed for a match. If not provided, defaults to the option in settings['allowed_mismatches'].
- ensure_matches_hyb_record(hyb_record: HybRecord, allowed_mismatches: Optional[int] = None) None
Ensure self.seq matches hyb_record.seq, else raise an error.
- Parameters
hyb_record (HybRecord) -- hyb_record to compare.
allowed_mismatches (int, optional) -- Number of mismatches allowed for a match. If not provided, defaults to the option in settings['allowed_mismatches'].
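The "count, match, ensure" pattern shared by the three methods above can be sketched for a pair of plain strings. The counting rule used here (position-by-position comparison, with any length difference counted as mismatches) is an assumption for illustration and may differ from hybkit's own counting; all three function names are hypothetical.

```python
def count_mismatches(seq_a, seq_b):
    """Count differing positions; a length difference counts as extra mismatches."""
    positional = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return positional + abs(len(seq_a) - len(seq_b))


def seqs_match(seq_a, seq_b, allowed_mismatches=0):
    """Return True if the mismatch count is within the allowed number."""
    return count_mismatches(seq_a, seq_b) <= allowed_mismatches


def ensure_seqs_match(seq_a, seq_b, allowed_mismatches=0):
    """Raise ValueError if the sequences do not match within the allowed number."""
    if not seqs_match(seq_a, seq_b, allowed_mismatches):
        raise ValueError(
            f'{count_mismatches(seq_a, seq_b)} mismatches exceed '
            f'allowed_mismatches={allowed_mismatches}')


print(count_mismatches('ACTG', 'ACAG'), seqs_match('ACTG', 'ACAG', allowed_mismatches=1))
```

The default of zero allowed mismatches mirrors the documented settings['allowed_mismatches'] value.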
- classmethod from_vienna_lines(record_lines: List[str], error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]
Construct instance from a list of 3 strings of vienna-format ([ViennaFormat]) lines.
See Vienna File Format for more details.
- classmethod from_vienna_string(record_string: str, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]
Construct instance from a string representing 3 vienna-format ([ViennaFormat]) lines.
See Vienna File Format for more details.
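To make the three-line layout concrete, the sketch below splits a record into an identifier, a sequence, and a dot-bracket fold with an optional energy. It is a stand-alone illustration, not hybkit's parser: the function name, the returned dictionary, the assumption that the energy appears in parentheses at the end of the third line, and the example record itself are all hypothetical.

```python
import re
from typing import Dict, List, Optional, Union

# Assumption: the energy is a number in parentheses at the end of the fold line.
ENERGY_PATTERN = re.compile(r"\(\s*([-+]?\d+(?:\.\d+)?)\s*\)\s*$")


def parse_vienna_lines(record_lines: List[str]) -> Dict[str, Union[str, Optional[float]]]:
    """Parse three Vienna-style lines into a dictionary (hypothetical helper)."""
    if len(record_lines) != 3:
        raise ValueError(f"Expected 3 lines, got {len(record_lines)}")
    record_id = record_lines[0].strip().lstrip(">")
    seq = record_lines[1].strip()
    fold_line = record_lines[2].strip()
    if not fold_line:
        raise ValueError("Missing fold line")
    # The dot-bracket string is the first whitespace-delimited field.
    fold = fold_line.split()[0]
    if len(fold) != len(seq):
        raise ValueError("Fold and sequence lengths differ")
    match = ENERGY_PATTERN.search(fold_line)
    energy = float(match.group(1)) if match else None
    return {"id": record_id, "seq": seq, "fold": fold, "energy": energy}


# Made-up record for illustration only.
example = ["example_record", "GGGAAACCC", "(((...)))\t(-1.2)"]
print(parse_vienna_lines(example))
```

A string-based variant could call the same helper on record_string.splitlines().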
- classmethod from_ct_lines(record_lines: List[str], error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) -> Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]
Create a FoldRecord from a list of record lines in ".ct" format ([CTFormat]).
See CT File Format for more details.
Warning
This method is in beta stage, and is not well-tested.
- classmethod from_ct_string(record_string: str, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, seq_type: Optional[Literal['static', 'dynamic']] = None) -> Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]
Create a FoldRecord entry from a multi-line string from ".ct" format ([CTFormat]).
See CT File Format for more details.
Warning
This method is in beta stage, and is not well-tested.
- Parameters
record_string (str) -- String containing the lines of a ct record.
ViennaFile Class
- class hybkit.ViennaFile(*args: Any, seq_type: Optional[Literal['static', 'dynamic']] = None, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, from_file_like: bool = False, **kwargs: Any)
Vienna file wrapper that returns vienna-format file lines as FoldRecord objects.
See Vienna File Format for more information.
- Parameters
seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).
error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.
from_file_like (bool, optional) -- If True, treat the first argument as a file-like object (such as io.StringIO or gzip.GzipFile) and the remaining positional arguments are ignored (Default False).
*args -- Passed to open().
**kwargs -- Passed to open().
- Variables
Warning
Fold files are occasionally poorly formatted. This iterator attempts to catch such errors but does not always succeed, so verbose error modes are encouraged.
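The three error modes described for the constructor can be pictured with a small dispatcher. This is only a sketch under stated assumptions: the helper name, the RuntimeError, and the (None, message) error value (chosen to mirror the Tuple[None, str] member of the documented return types) are illustrative and not taken from hybkit's source.

```python
import warnings
from typing import Tuple

ERROR_MODES = ("raise", "warn_return", "return")


def handle_read_error(message: str, error_mode: str = "raise") -> Tuple[None, str]:
    """Apply one of three error modes to a failed record read (hypothetical helper)."""
    if error_mode not in ERROR_MODES:
        raise ValueError(f"Unknown error_mode: {error_mode!r}")
    if error_mode == "raise":
        # Stop immediately and surface the problem to the caller.
        raise RuntimeError(message)
    if error_mode == "warn_return":
        # Report the problem, then fall through to return the error value.
        warnings.warn(message)
    # Assumption: the error value is a (None, message) tuple.
    return (None, message)


print(handle_read_error("malformed record", "return"))
```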
- read_record(override_error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None) -> Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]
Read next three lines and return output as FoldRecord object.
- Parameters
override_error_mode (str) -- Override the error_mode set in the ViennaFile object. See the ViennaFile Constructor for more information on allowed error modes.
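Reading a file three lines at a time can be sketched with a plain generator over any file-like object. The function name and the decision to raise on an incomplete trailing record are assumptions for this example; the record contents are made up.

```python
import io
from itertools import islice
from typing import Iterator, List, TextIO


def iter_three_line_records(handle: TextIO) -> Iterator[List[str]]:
    """Yield consecutive groups of three lines from a file-like object."""
    while True:
        # islice pulls at most three lines from the shared file iterator.
        lines = [line.rstrip("\n") for line in islice(handle, 3)]
        if not lines:
            return
        if len(lines) < 3:
            raise ValueError("Incomplete record at end of file")
        yield lines


# Made-up content for illustration only.
handle = io.StringIO("rec1\nACGU\n(..)\nrec2\nGGCC\n....\n")
for record_lines in iter_three_line_records(handle):
    print(record_lines)
```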
- classmethod open(path: str, *args: Any, **kwargs: Any) -> Self
Open a path to a text file using open() and return relevant file object.
Arguments match those of the Python3 built-in open() function and are passed directly to it.
This method is provided as a convenience function for drop-in replacement of the built-in open() function.
Specific keyword arguments are provided for fold-file-specific settings:
- Parameters
path (str) -- Path to file to open.
seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).
error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.
*args -- Passed directly to open().
**kwargs -- Passed directly to open().
- Returns
ViennaFile object.
- read_records() -> List[FoldRecord]
Return list of all FoldRecord objects for this file type.
- settings = {}
Class-level settings. See hybkit.settings.FoldFile_settings_info for descriptions.
- write_record(write_record: FoldRecord) -> None
Write a FoldRecord object for this file type.
Unlike the file.write() method, this method will add a newline to the end of each written record line.
- Parameters
write_record (FoldRecord) -- FoldRecord object to write.
- write_records(write_records: Iterable[FoldRecord]) -> None
Write a sequence of FoldRecord objects for this file type.
Unlike the file.writelines() method, this method will add a newline to the end of each written record line.
- Parameters
write_records (list) -- List of FoldRecord objects to write.
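The newline behavior noted for the write methods can be illustrated with plain strings standing in for records. The helper names are hypothetical, and stripping any existing trailing newline before appending one is an assumption of this sketch rather than documented hybkit behavior.

```python
import io
from typing import Iterable, TextIO


def write_record_string(handle: TextIO, record_string: str) -> None:
    """Write one record string so that it ends with exactly one newline."""
    handle.write(record_string.rstrip("\n") + "\n")


def write_record_strings(handle: TextIO, record_strings: Iterable[str]) -> None:
    """Write several record strings, each terminated by a newline."""
    for record_string in record_strings:
        write_record_string(handle, record_string)


buffer = io.StringIO()
write_record_strings(buffer, ["rec1\nACGU\n(..)", "rec2\nGGCC\n...."])
print(buffer.getvalue(), end="")
```

By contrast, the built-in writelines() would have written the two strings back to back with no separator between them.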
CtFile Class
- class hybkit.CtFile(*args: Any, seq_type: Optional[Literal['static', 'dynamic']] = None, error_mode: Optional[Literal['raise', 'warn_return', 'return']] = None, from_file_like: bool = False, **kwargs: Any)
Ct file wrapper that returns ".ct" file lines as FoldRecord objects.
See CT File Format for more information.
Warning
This class is in beta stage, and is not well-tested.
- Parameters
seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).
error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.
from_file_like (bool, optional) -- If True, treat the first argument as a file-like object (such as io.StringIO or gzip.GzipFile) and the remaining positional arguments are ignored (Default False).
*args -- Passed to open().
**kwargs -- Passed to open().
- Variables
Warning
Fold files are occasionally poorly formatted. This iterator attempts to catch such errors but does not always succeed, so verbose error modes are encouraged.
- read_record() -> Union[Tuple[None, str], Tuple[Literal['NOFOLD'], str], Tuple[Literal['NOENERGY'], str], Self]
Return the next CT record as a FoldRecord object.
Call next(self.fh) to return the first line of the next entry. Determine the expected number of following lines in the entry, and read that number of lines further. Return lines as a FoldRecord object.
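The header-then-count reading strategy described above can be sketched on its own. This example assumes that the first whitespace-delimited field of a CT header line is the number of base lines that follow; the function name, the error type, and the sample entry are hypothetical, and the sketch returns raw lines rather than a FoldRecord.

```python
import io
from typing import List, Optional, TextIO


def read_ct_record_lines(handle: TextIO) -> Optional[List[str]]:
    """Read one CT entry: a header line plus the number of lines it announces."""
    try:
        header = next(handle).rstrip("\n")
    except StopIteration:
        return None  # No further entries in the file.
    fields = header.split()
    if not fields:
        raise ValueError("Empty CT header line")
    # Assumption: the first header field is the count of base lines.
    expected = int(fields[0])
    lines = [header]
    for _ in range(expected):
        try:
            lines.append(next(handle).rstrip("\n"))
        except StopIteration:
            raise ValueError("CT entry ended before the expected number of lines") from None
    return lines


# Made-up four-base entry for illustration only.
ct_text = (
    "4\tdG = -1.0\texample\n"
    "1\tG\t0\t2\t4\t1\n"
    "2\tA\t1\t3\t0\t2\n"
    "3\tA\t2\t4\t0\t3\n"
    "4\tC\t3\t0\t1\t4\n"
)
handle = io.StringIO(ct_text)
print(read_ct_record_lines(handle))
print(read_ct_record_lines(handle))
```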
- write_record = None
CtFile Record Writing Not Implemented
- write_records = None
CtFile Record Writing Not Implemented
- classmethod open(path: str, *args: Any, **kwargs: Any) -> Self
Open a path to a text file using open() and return relevant file object.
Arguments match those of the Python3 built-in open() function and are passed directly to it.
This method is provided as a convenience function for drop-in replacement of the built-in open() function.
Specific keyword arguments are provided for fold-file-specific settings:
- Parameters
path (str) -- Path to file to open.
seq_type (str, optional) -- Type of FoldRecord to return: static, or dynamic (if not provided, uses FoldRecord.settings['seq_type']).
error_mode (str, optional) -- String representing the error mode. If None, defaults to the value set in settings['error_mode']. Options: "raise": Raise an error when encountered and exit program; "warn_return": Print a warning and return the error_value; "return": Return the error value with no warnings.
*args -- Passed directly to open().
**kwargs -- Passed directly to open().
- Returns
CtFile object.
- read_records() -> List[FoldRecord]
Return list of all FoldRecord objects for this file type.
- settings = {}
Class-level settings. See hybkit.settings.FoldFile_settings_info for descriptions.
HybFoldIter Class
- class hybkit.HybFoldIter(hybfile_handle: HybFile, foldfile_handle: FoldFile, combine: bool = False, iter_error_mode: Optional[Literal['raise', 'warn_return', 'warn_skip', 'skip', 'return']] = None)
Iterator for simultaneous iteration over a HybFile and FoldFile object.
This class provides an iterator to iterate through a HybFile and either a ViennaFile or CtFile simultaneously, returning a HybRecord and FoldRecord.
Basic error checking / catching is performed based on the value of the settings['error_mode'] setting.
- Parameters
hybfile_handle (HybFile) -- HybFile object for iteration.
foldfile_handle (ViennaFile or CtFile) -- ViennaFile or CtFile object for iteration.
combine (bool, optional) -- Use HybRecord.set_fold_record(FoldRecord) and return only the HybRecord.
iter_error_mode (str, optional) -- Error mode to use for reading FoldRecord objects. If not set, defaults to the value in settings['iter_error_mode'].
- Returns
- settings = {'error_checks': ['hybrecord_indel', 'foldrecord_nofold', 'max_mismatch', 'energy_mismatch'], 'iter_error_mode': 'warn_skip', 'max_sequential_skips': 100}
Class-level settings. See settings.HybFoldIter_settings_info for descriptions.
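The idea of paired iteration with skipping can be sketched without hybkit objects. In the example below, plain strings and tuples stand in for records, and the function name, the is_valid_pair callback, and the RuntimeError are assumptions. Only the "raise", "warn_skip", and "skip" modes are modeled, and the way max_sequential_skips is enforced is a guess at the intent of that setting, not hybkit's logic.

```python
import warnings
from typing import Callable, Iterable, Iterator, Tuple, TypeVar

H = TypeVar("H")
F = TypeVar("F")


def iter_record_pairs(
    hyb_records: Iterable[H],
    fold_records: Iterable[F],
    is_valid_pair: Callable[[H, F], bool],
    iter_error_mode: str = "warn_skip",
    max_sequential_skips: int = 100,
) -> Iterator[Tuple[H, F]]:
    """Yield matching record pairs, skipping pairs that fail a check."""
    if iter_error_mode not in ("raise", "warn_skip", "skip"):
        raise ValueError(f"Unsupported iter_error_mode: {iter_error_mode!r}")
    sequential_skips = 0
    for hyb_record, fold_record in zip(hyb_records, fold_records):
        if is_valid_pair(hyb_record, fold_record):
            sequential_skips = 0  # A good pair resets the skip counter.
            yield hyb_record, fold_record
            continue
        message = f"Record pair failed checks: {hyb_record!r} / {fold_record!r}"
        if iter_error_mode == "raise":
            raise RuntimeError(message)
        if iter_error_mode == "warn_skip":
            warnings.warn(message)
        sequential_skips += 1
        if sequential_skips > max_sequential_skips:
            raise RuntimeError("Too many sequential skipped record pairs")


# Made-up stand-ins: a sequence string and a (sequence, fold) tuple.
hyb = ["ACGU", "GGCC", "UUAA"]
fold = [("ACGU", "(..)"), ("GGCA", "(..)"), ("UUAA", "....")]
same_seq = lambda h, f: h == f[0]
print(list(iter_record_pairs(hyb, fold, same_seq, iter_error_mode="skip")))
```

Capping sequential skips guards against silently discarding the rest of a file once the two inputs fall out of step.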