data.h5py.H5SeqConverter

Class · Context · Source

converter = mdnc.data.h5py.H5SeqConverter(
    file_in_name=None, file_out_name=None
)

Convert any supervised .h5 data file into its sequence version. This class lets users choose some keywords and convert the corresponding datasets into the sequence format, where each converted dataset is saved as a series of continuous segments. It can serve as a random splitter when preparing data for training an LSTM.

The following figure shows how the data are converted. The converted dataset is cut into several segments with random lengths.

Converted files should only be loaded with mdnc.data.h5py.H5CParser.
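The random cut can be pictured with a minimal sketch in plain Python. This is not the library's implementation: the helper name `random_segments` is hypothetical, its defaults mirror the `seq_len`, `seq_len_max`, and `random_seed` defaults of `config()`, and letting the final segment be shorter than `seq_len` is an assumption about how the tail is handled.

```python
import random


def random_segments(n_samples, seq_len=10, seq_len_max=20, seed=2048):
    """Cut range(n_samples) into consecutive (start, stop) index pairs.

    Hypothetical helper, not part of mdnc. Each segment length is drawn
    uniformly from [seq_len, seq_len_max]; the last segment is truncated
    at n_samples (an assumption about tail handling).
    """
    rng = random.Random(seed)  # a private generator keeps the cut reproducible
    segments = []
    start = 0
    while start < n_samples:
        length = rng.randint(seq_len, seq_len_max)  # both bounds inclusive
        stop = min(start + length, n_samples)
        segments.append((start, stop))
        start = stop
    return segments


segments = random_segments(1000)
print(len(segments), segments[:3])
```

With 1000 samples and lengths between 10 and 20, such a cut always yields between 50 and 100 segments, which is in line with the 64 parts reported in the example logs on this page.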

Warning

During the conversion, attributes are lost, and links and virtual datasets are treated as h5py.Datasets.
Although this class supports context management, it does not support dictionary-style APIs like h5py.Group.

Arguments

Requires

Argument	Type	Description
`file_in_name`	`str`	The path of the input file in the non-sequence format. If not set, no file is opened.
`file_out_name`	`str`	The path of the output data file. If not set, it defaults to `file_in_name + '_seq'`.
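The default naming rule can be sketched as follows. The helper `resolve_names` is hypothetical and not part of mdnc. It assumes that names are passed without an extension and that `.h5` is appended afterwards, which matches the file names shown in the example logs on this page but is not confirmed for other inputs.

```python
def resolve_names(file_in_name, file_out_name=None, ext='.h5'):
    """Return (input_path, output_path) under the documented default rule.

    Hypothetical helper. Assumes both names are given without extension.
    """
    if file_out_name is None:
        # Documented default: file_in_name + '_seq'
        file_out_name = file_in_name + '_seq'
    return file_in_name + ext, file_out_name + ext


paths = resolve_names('alpha-test/test_data_h5seqconverter1')
print(paths)
```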

Methods

`config`

converter.config(logver=0, set_shuffle=False, seq_len=10, seq_len_max=20, random_seed=2048, **kwargs)

Configure the converter. Only the explicitly given arguments are used to change the configuration of this instance.

Requires

Argument	Type	Description
`logver`	`int`	The verbosity level of the output. When set to 0, the converter runs silently.
`set_shuffle`	`bool`	Whether to shuffle the order of segments during the conversion.
`seq_len`	`int`	The lower bound of the random segment length.
`seq_len_max`	`int`	The upper bound of the random segment length.
`random_seed`	`int`	The random seed used in this instance.
`**kwargs`		Any arguments used for creating `h5py.Dataset`. The given arguments override the default values during dataset creation.
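One way to read "only the explicitly given arguments" is a partial update, sketched below. The class `ConfigSketch` is hypothetical and says nothing about how mdnc implements `config()`. Using `None` as the "not given" marker is an assumption, and the initial values are copied from the documented signature.

```python
class ConfigSketch:
    """Hypothetical stand-in that illustrates partial configuration updates."""

    def __init__(self):
        # Initial values taken from the documented config() signature.
        self.settings = {
            'logver': 0, 'set_shuffle': False,
            'seq_len': 10, 'seq_len_max': 20, 'random_seed': 2048,
        }
        self.dataset_kwargs = {}  # extra options for dataset creation

    def config(self, logver=None, set_shuffle=None, seq_len=None,
               seq_len_max=None, random_seed=None, **kwargs):
        given = {
            'logver': logver, 'set_shuffle': set_shuffle, 'seq_len': seq_len,
            'seq_len_max': seq_len_max, 'random_seed': random_seed,
        }
        for key, value in given.items():
            if value is not None:  # skip arguments that were not passed
                self.settings[key] = value
        self.dataset_kwargs.update(kwargs)
        return self


cfg = ConfigSketch()
cfg.config(seq_len=5, compression='gzip')
cfg.config(logver=1)  # seq_len stays at 5, compression stays configured
print(cfg.settings, cfg.dataset_kwargs)
```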

`convert`

converter.convert(keyword, **kwargs)

Convert the h5py.Dataset given by keyword into a segmented dataset and save it to the output file. Before the conversion, the data should be arranged continuously along the batch axis.

If the keyword has already been converted or copied, do not process it again.

Requires

Argument	Type	Description
`keyword`	`str`	The keyword of the dataset to be converted into a segmented dataset.
`**kwargs`		Any arguments used for creating `h5py.Dataset`. The given arguments override the default values and the configuration set by `config()` during dataset creation.
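The override order described for `**kwargs` (defaults first, then `config()`, then the per-call arguments) can be sketched with plain dictionaries. The helper `merge_dataset_kwargs` is hypothetical, and the sample values are illustrative only; the string `'float32'` merely stands in for the `dtype` entry seen in the example logs.

```python
def merge_dataset_kwargs(defaults, configured, call_kwargs):
    """Merge three layers of options; later layers win on conflicts.

    Hypothetical helper, not part of mdnc.
    """
    merged = dict(defaults)      # lowest priority
    merged.update(configured)    # values set through config()
    merged.update(call_kwargs)   # values passed to convert() or copy()
    return merged


defaults = {'dtype': 'float32'}
configured = {'compression': 'gzip', 'fletcher32': True}
call_kwargs = {'compression': 'lzf'}

merged = merge_dataset_kwargs(defaults, configured, call_kwargs)
print(merged)
```

In this sketch the per-call `compression` value replaces the configured one for a single dataset, while the other configured options still apply.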

`copy`

converter.copy(keyword, **kwargs)

Copy the h5py.Dataset given by keyword into the output file.

If the keyword has already been converted or copied, do not process it again.

Requires

Argument	Type	Description
`keyword`	`str`	The keyword of the dataset to be copied into the output file.
`**kwargs`		Any arguments used for creating `h5py.Dataset`. The given arguments override the default values and the configuration set by `config()` during dataset creation.

`open`

converter.open(file_in_name, file_out_name=None)

Open a new file. If a file was opened before, it is closed first. Both this method and the __init__ method (when file_in_name is specified) support context management.

Requires

Argument	Type	Description
`file_in_name`	`str`	The path of the input file in the non-sequence format.
`file_out_name`	`str`	The path of the output data file. If not set, it defaults to `file_in_name + '_seq'`.
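The reopen-and-close behaviour can be illustrated with a small stand-in class. `ReopenableSketch` is hypothetical and holds only a name instead of real file handles. It assumes that `open()` returns the instance itself so that it can be used in a `with` statement, which is consistent with the usage in Example 2 but is not a statement about the library's internals.

```python
class ReopenableSketch:
    """Hypothetical stand-in for an object that can be reopened many times."""

    def __init__(self, name=None):
        self.name = None
        self.closed_names = []  # record of closed resources, for illustration
        if name is not None:
            self.open(name)

    def open(self, name):
        self.close()   # a previously opened resource is closed first
        self.name = name
        return self    # returning self allows "with obj.open(...) as o:"

    def close(self):
        if self.name is not None:
            self.closed_names.append(self.name)
            self.name = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False   # do not suppress exceptions


res = ReopenableSketch()
with res.open('first') as handle:
    same_object = handle is res
with res.open('second'):
    pass
print(res.closed_names)
```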

`close`

converter.close()

Close the converter.

Examples

Example 1

Code

import os
import mdnc

root_folder = 'alpha-test'
os.makedirs(root_folder, exist_ok=True)

if __name__ == '__main__':
    # Prepare the datasets.
    set_list_file = os.path.join(root_folder, 'web-data')
    mdnc.data.webtools.DataChecker.init_set_list(set_list_file)
    dc = mdnc.data.webtools.DataChecker(root=root_folder, set_list_file=set_list_file, token='', verbose=False)
    dc.add_query_file('test_data_h5seqconverter1.h5')
    dc.query()

    # Perform test.
    with mdnc.data.h5py.H5SeqConverter(os.path.join(root_folder, 'test_data_h5seqconverter1')) as cvt:
        cvt.config(logver=1, shuffle=True, fletcher32=True, compression='gzip')
        cvt.convert('data_to_sequence')
        cvt.copy('data_only_copied')

Output

data.webtools: All required datasets are available.
data.h5py: Current configuration is: {'dtype': <class 'numpy.float32'>, 'shuffle': True, 'fletcher32': True, 'compression': 'gzip'}
data.h5py: Convert data_to_sequence into the output file. The original data shape is (1000,), splitted into 64 parts.
data.h5py: Copy data_only_copied into the output file. The data shape is (1000,).

Example 2

Code

import os
import mdnc

root_folder = 'alpha-test'
os.makedirs(root_folder, exist_ok=True)

if __name__ == '__main__':
    # Prepare the datasets.
    set_list_file = os.path.join(root_folder, 'web-data')
    mdnc.data.webtools.DataChecker.init_set_list(set_list_file)
    dc = mdnc.data.webtools.DataChecker(root=root_folder, set_list_file=set_list_file, token='', verbose=False)
    dc.add_query_file(['test_data_h5seqconverter1.h5', 'test_data_h5seqconverter2.h5'])
    dc.query()

    # Perform test.
    converter = mdnc.data.h5py.H5SeqConverter()
    converter.config(logver=1, shuffle=True, fletcher32=True, compression='gzip')
    with converter.open(os.path.join(root_folder, 'test_data_h5seqconverter1')) as cvt:
        cvt.convert('data_to_sequence')
        cvt.copy('data_only_copied')
    with converter.open(os.path.join(root_folder, 'test_data_h5seqconverter2')) as cvt:
        cvt.convert('data_to_sequence')
        cvt.copy('data_only_copied')

Output

data.webtools: All required datasets are available.
data.h5py: Current configuration is: {'dtype': <class 'numpy.float32'>, 'shuffle': True, 'fletcher32': True, 'compression': 'gzip'}
data.h5py: Open a new read file: alpha-test\test_data_h5seqconverter1.h5
data.h5py: Open a new output file: alpha-test\test_data_h5seqconverter1_seq.h5
data.h5py: Convert data_to_sequence into the output file. The original data shape is (1000,), splitted into 64 parts.
data.h5py: Copy data_only_copied into the output file. The data shape is (1000,).
data.h5py: Open a new read file: alpha-test\test_data_h5seqconverter2.h5
data.h5py: Open a new output file: alpha-test\test_data_h5seqconverter2_seq.h5
data.h5py: Convert data_to_sequence into the output file. The original data shape is (1000,), splitted into 64 parts.
data.h5py: Copy data_only_copied into the output file. The data shape is (1000,).

Last update: March 14, 2021
