hybkit.type_finder

type_finder.py module.

This module contains the TypeFinder class, which works with HybRecord to parse sequence identifiers and identify sequence type.

class hybkit.type_finder.TypeFinder

Class for parsing identifiers to identify sequence ‘type’.

Designed to be used by the hybkit.HybRecord class.

Variables

params (dict) – Stored parameters for string parsing, where applicable.

find = None

Placeholder for storing active method, set with set_method().

params = None

Placeholder for parameters for active method, set with set_method().

classmethod set_method(method, params={})

Select method to use when finding types.

Available methods are listed in methods.

Parameters
  • method (str) – Method option from methods to set for use as find().

  • params (dict, optional) – Dict of parameters for the selected method to use.
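The selection pattern described here (a class-level find placeholder filled in from a dict of named methods) can be sketched without hybkit. Everything in this sketch is hypothetical and for illustration only: the MiniFinder class, the 'last_field' method name, and the 'ref_name' key are assumptions, not hybkit's implementation.

```python
class MiniFinder:
    """Hypothetical stand-in showing the set_method() selection pattern."""

    find = None     # placeholder for the active method
    params = None   # placeholder for the active method's parameters
    methods = {}    # filled in below, once the class exists

    @staticmethod
    def method_last_field(seg_props, params=None, check_complete=False):
        # Assumption: the identifier is stored under the 'ref_name' key.
        return seg_props['ref_name'].split('_')[-1]

    @classmethod
    def set_method(cls, method, params=None):
        if method not in cls.methods:
            raise ValueError('Unknown method: %s' % method)
        # Wrap in staticmethod so find() receives no implicit first argument.
        cls.find = staticmethod(cls.methods[method])
        cls.params = {} if params is None else params


MiniFinder.methods = {'last_field': MiniFinder.method_last_field}

MiniFinder.set_method('last_field')
seg_type = MiniFinder.find({'ref_name': 'MIMAT0000076_MirBase_miR-21_microRNA'})
```

Storing the active method and its parameters on the class means a single call to set_method() affects every later lookup made through the class.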

classmethod set_custom_method(method, params={})

Set a custom method to use when finding segment types.

This method is for providing a custom function. To use the included functions, use set_method(). Custom functions provided must have the signature:

seg_type = custom_method(self, seg_props, params, check_complete)

This function should return the string of the assigned segment type if found, or None if the type cannot be found. It can also take a dictionary in the “params” argument that specifies additional or dynamic search properties, as desired. If check_complete is True, the function should search for all possibilities for a given sequence instead of stopping after the first is found.

Parameters
  • method (method) – Method to set for use.

  • params (dict, optional) – dict of custom parameters to set for use.
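A minimal sketch of a function with the required signature follows. The function name find_by_prefix, the 'ref_name' key read from seg_props, the 'prefix_types' key in params, and the choice to raise ValueError on conflicting matches are all assumptions made for illustration; none of them is defined by hybkit.

```python
def find_by_prefix(self, seg_props, params, check_complete=False):
    """Return a segment type chosen by identifier prefix, or None."""
    # Assumption: the identifier is stored under the 'ref_name' key.
    seg_name = seg_props['ref_name']
    found = set()
    # Assumption: params['prefix_types'] maps identifier prefixes to types.
    for prefix, seg_type in params.get('prefix_types', {}).items():
        if seg_name.startswith(prefix):
            if not check_complete:
                return seg_type  # stop at the first match
            found.add(seg_type)
    # With check_complete, every prefix has been tested by this point.
    if len(found) > 1:
        raise ValueError('Conflicting types for %s: %s'
                         % (seg_name, sorted(found)))
    return found.pop() if found else None


example_params = {'prefix_types': {'MIMAT': 'microRNA', 'ENST': 'mRNA'}}
example_type = find_by_prefix(None, {'ref_name': 'MIMAT0000076_x'},
                              example_params)
```

A function of this shape would then be passed to set_custom_method() together with its params dict.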

static method_hybformat(seg_props, params={}, check_complete=False)

Return the type of the provided segment, or None if segment cannot be identified.

This method works with sequence / alignment mapping identifiers in the format of the reference database provided by the Hyb Software Package, specifically identifiers of the format:

<gene_id>_<transcript_id>_<gene_name>_<seg_type>

This method returns the last component of the identifier, split by “_”, as the identified sequence type.

Example

"MIMAT0000076_MirBase_miR-21_microRNA"  --->  "microRNA".

Parameters
  • seg_props (dict) – seg_props from hybkit.HybRecord

  • params (dict, optional) – Unused in this method.

  • check_complete (bool, optional) – Unused in this method.
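The identifier split described above amounts to a single string operation. This standalone sketch uses a hypothetical helper name, last_component, and works on a plain identifier string rather than a seg_props dict.

```python
def last_component(identifier):
    """Return the text after the final '_' in the identifier."""
    return identifier.split('_')[-1]


seg_type = last_component("MIMAT0000076_MirBase_miR-21_microRNA")
print(seg_type)  # microRNA
```

Note that the hyphen in "miR-21" is not a separator, so only the three underscores split the identifier.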

static method_string_match(seg_props, params={}, check_complete=False)

Return the type of the provided segment, or None if unidentified.

This method attempts to find a string matching a specific pattern within the identifier of the aligned segment. Search options include “startswith”, “contains”, “endswith”, and “matches”. The required params dict should contain a key for each desired search type, with a list of 2-tuples for each search-string with assigned-type.

Example

params = {'endswith': [('_miR', 'microRNA'),
                       ('_trans', 'mRNA')   ]}

This dict can be generated with the associated make_string_match_params() method and an associated csv legend file with format:

#commentline
#search_type,search_string,seg_type
endswith,_miR,microRNA
endswith,_trans,mRNA

Parameters
  • seg_props (dict) – HybRecord segment properties dict to evaluate.

  • params (dict, optional) – Dict with search parameters as described above.

  • check_complete (bool, optional) – If true, the method will continue checking search options after an option has been found, to ensure that no options conflict (more sure method). If False, it will stop after the first match is found (faster method). (Default: False)
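The search logic described above can be sketched as a standalone function. This is an illustration under stated assumptions, not hybkit's code: the function name string_match is hypothetical, it takes the identifier string directly instead of a seg_props dict, “matches” is assumed to mean exact equality, and a conflict found with check_complete is assumed to raise ValueError.

```python
def string_match(seg_name, params, check_complete=False):
    """Return the type assigned to seg_name by the search params, or None."""
    # Assumption: 'matches' means the whole identifier equals the string.
    checks = {
        'startswith': lambda name, text: name.startswith(text),
        'contains': lambda name, text: text in name,
        'endswith': lambda name, text: name.endswith(text),
        'matches': lambda name, text: name == text,
    }
    found = set()
    for search_type, test in checks.items():
        for search_string, seg_type in params.get(search_type, []):
            if test(seg_name, search_string):
                if not check_complete:
                    return seg_type  # faster: stop at the first match
                found.add(seg_type)
    # Assumption: conflicting assignments are reported as an error.
    if len(found) > 1:
        raise ValueError('Conflicting types for %s: %s'
                         % (seg_name, sorted(found)))
    return found.pop() if found else None


params = {'endswith': [('_miR', 'microRNA'),
                       ('_trans', 'mRNA')]}
print(string_match('ENST0001_trans', params))  # mRNA
```

With check_complete=False the order of the search types decides which match wins, which is why the complete check is the safer option when search strings might overlap.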

static make_string_match_params(legend_file)

Read csv and return a dict of search parameters for method_string_match().

The csv legend file should have the format:

#commentline
#search_type,search_string,seg_type
endswith,_miR,microRNA
endswith,_trans,mRNA

Search_type options include “startswith”, “contains”, “endswith”, and “matches”. The produced dict object contains a key for each search type, with a list of 2-tuples for each search-string and associated segment-type.

For example:

{'endswith': [('_miR', 'microRNA'),
              ('_trans', 'mRNA')   ]}
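A rough sketch of how such a legend could be parsed into the dict shown above. The helper name read_legend is hypothetical, and unlike make_string_match_params() it reads from an open file handle, so the example can run on in-memory text.

```python
import csv
import io


def read_legend(handle):
    """Build a {search_type: [(search_string, seg_type), ...]} dict."""
    params = {}
    for row in csv.reader(handle):
        # Skip blank lines and lines that begin with '#'.
        if not row or row[0].startswith('#'):
            continue
        search_type, search_string, seg_type = (item.strip() for item in row)
        params.setdefault(search_type, []).append((search_string, seg_type))
    return params


legend_text = """#commentline
#search_type,search_string,seg_type
endswith,_miR,microRNA
endswith,_trans,mRNA
"""
legend_params = read_legend(io.StringIO(legend_text))
print(legend_params)
# {'endswith': [('_miR', 'microRNA'), ('_trans', 'mRNA')]}
```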

static method_id_map(seg_props, params={}, check_complete=False)

Return the type of the provided segment or None if it cannot be identified.

This method checks whether the identifier of the segment is present in the params dict. params should be formatted as a dict with sequence identifier names as keys and the corresponding types as values.

Example

params = {'MIMAT0000076_MirBase_miR-21_microRNA': 'microRNA',
          'ENSG00000XXXXXX_NR003287-2_RN28S1_rRNA': 'rRNA'}

This dict can be generated with the associated make_id_map_params() method.

Parameters
  • seg_props (dict) – HybRecord segment properties dict to evaluate.

  • params (dict) – Dict mapping sequence identifiers to sequence types.

  • check_complete (bool, optional) – Unused in this method.

Returns

Identified sequence type, or None if it cannot be found.

Return type

str
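The lookup described for this method is a plain dict access that returns None for unknown identifiers. A minimal standalone sketch, with a hypothetical helper name lookup_type that takes the identifier string directly:

```python
def lookup_type(seg_name, id_map):
    """Return the type mapped to seg_name, or None if it is not present."""
    return id_map.get(seg_name)


id_map = {'MIMAT0000076_MirBase_miR-21_microRNA': 'microRNA',
          'ENSG00000XXXXXX_NR003287-2_RN28S1_rRNA': 'rRNA'}

print(lookup_type('MIMAT0000076_MirBase_miR-21_microRNA', id_map))  # microRNA
print(lookup_type('not_in_the_map', id_map))  # None
```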

static make_id_map_params(mapped_id_files=None, type_file_pairs=None)

Read file(s) into a mapping of sequence identifiers.

This method reads one or more files into a dict for use with the method_id_map() method. The method requires passing either a list/tuple of one or more files to mapped_id_files, or a list/tuple of one or more (sequence type, file name) pairs to type_file_pairs. Files listed in the mapped_id_files argument should have the format:

#commentline
#seg_id,seg_type
seg1_unique_id,seg1_type
seg2_unique_id,seg2_type

Entries in the list/tuple passed to type_file_pairs should have the format: (seg1_type, file1_name)

Example

[(seg1_type, file1_name), (seg2_type, file2_name),]

The first entry in each (non-commented, non-blank) file line will be read and added to the mapping dictionary mapped to the provided seg_type.

Parameters
  • mapped_id_files (list or tuple, optional) – Iterable object containing strings of paths to files containing id/type mapping information.

  • type_file_pairs (list or tuple, optional) – Iterable object containing 2-tuples of (seg_type, file_name), where each file lists the identifiers to map to that seg_type.
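The two input styles can be sketched as two small readers that fill the same mapping. The helper names read_mapped_ids and read_typed_ids are hypothetical, both read from open file handles instead of file paths, and a comma is assumed to be the field delimiter in both file styles.

```python
import io


def read_mapped_ids(handle, id_map=None):
    """Read 'seg_id,seg_type' lines into id_map (mapped_id_files style)."""
    id_map = {} if id_map is None else id_map
    for line in handle:
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank and commented lines
        seg_id, seg_type = line.split(',')[:2]
        id_map[seg_id.strip()] = seg_type.strip()
    return id_map


def read_typed_ids(seg_type, handle, id_map=None):
    """Map the first entry of each line to seg_type (type_file_pairs style)."""
    id_map = {} if id_map is None else id_map
    for line in handle:
        line = line.strip()
        if not line or line.startswith('#'):
            continue
        # Assumption: entries on a line are comma-separated.
        id_map[line.split(',')[0].strip()] = seg_type
    return id_map


mapped_text = "#seg_id,seg_type\nseg1_unique_id,seg1_type\nseg2_unique_id,seg2_type\n"
combined = read_mapped_ids(io.StringIO(mapped_text))
combined = read_typed_ids('rRNA', io.StringIO("#ids\nseg3_unique_id\n"), combined)
```

Passing the same dict through both readers shows how several files can accumulate into one mapping for method_id_map().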

methods = {'hybformat': <staticmethod object>, 'id_map': <staticmethod object>, 'string_match': <staticmethod object>}

Dict of provided methods available to assign segment types.

‘hybformat’

method_hybformat()

‘string_match’

method_string_match()

‘id_map’

method_id_map()

param_methods = {'hybformat': None, 'id_map': <function TypeFinder.make_id_map_params>, 'string_match': <function TypeFinder.make_string_match_params>}

Dict of parameter generation methods for the type finding methods.

‘hybformat’

None

‘string_match’

make_string_match_params()

‘id_map’

make_id_map_params()

param_methods_needs_file = {'hybformat': False, 'id_map': True, 'string_match': True}

Dict indicating whether each parameter generation method needs an input file.

‘hybformat’

False

‘string_match’

True

‘id_map’

True