
data.h5py.H5SeqConverter

Class · Context · Source

converter = mdnc.data.h5py.H5SeqConverter(
    file_in_name=None, file_out_name=None
)

Convert any supervised .h5 data file into its sequence version. This class lets users choose some keywords and convert the corresponding datasets into sequence form, where they are saved as continuous sequences. It can serve as a random splitter when preparing training data for an LSTM.

The following figure shows how the data are converted. The converted dataset is cut into several segments with random lengths.

The converted files should only be loaded by mdnc.data.h5py.H5CParser.
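To make the random segmentation concrete, here is a minimal sketch of cutting a batch axis into consecutive segments of random length. It is an illustration under assumptions, not the library's actual algorithm: the helper name random_segments is hypothetical, the lengths are assumed to be drawn uniformly between seq_len and seq_len_max, and the last segment is assumed to be truncated to fit.

```python
import random


def random_segments(n_samples, seq_len=10, seq_len_max=20, seed=2048):
    """Cut range(n_samples) into consecutive (start, stop) segments.

    Assumption: each length is drawn uniformly from [seq_len, seq_len_max],
    and the final segment is shortened if it would run past the end.
    """
    rng = random.Random(seed)  # a private generator keeps the split reproducible
    segments = []
    start = 0
    while start < n_samples:
        length = rng.randint(seq_len, seq_len_max)
        stop = min(start + length, n_samples)
        segments.append((start, stop))
        start = stop
    return segments


if __name__ == '__main__':
    segments = random_segments(1000)
    print(len(segments), segments[:3])
```

Because the generator is seeded, calling the helper twice with the same arguments yields the same split, which mirrors the purpose of the random_seed option.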

Warning
  • During the conversion, attributes are lost, and links and virtual datasets are treated as ordinary h5py.Dataset objects.
  • Although this class supports context management, it does not support dictionary-style APIs like h5py.Group.

Arguments

Requires

Argument Type Description
file_in_name str The path of the non-sequence-formatted input file. If not set, no dataset is opened.
file_out_name str The path of the output data file. If not set, it defaults to file_in_name + '_seq'.
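The default naming rule for the output file can be sketched as follows. The helper default_output_name is hypothetical and only mirrors the rule stated in the table. How the file extension is handled is not covered here.

```python
def default_output_name(file_in_name, file_out_name=None):
    """Return the output path, falling back to file_in_name + '_seq'."""
    if file_out_name is None:
        return file_in_name + '_seq'
    return file_out_name


if __name__ == '__main__':
    print(default_output_name('alpha-test/my_data'))
    print(default_output_name('alpha-test/my_data', 'alpha-test/custom'))
```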

Methods

config

converter.config(logver=0, set_shuffle=False, seq_len=10, seq_len_max=20, random_seed=2048, **kwargs)

Configure the converter. Only the explicitly given arguments change the configuration of this instance.

Requires

Argument Type Description
logver int The verbosity level of the outputs. When set to 0, the converter runs silently.
set_shuffle bool Whether to shuffle the order of segments during the conversion.
seq_len int The lower bound of the random segment length.
seq_len_max int The upper bound of the random segment length.
random_seed int The random seed used in this instance.
**kwargs Any arguments used for creating h5py.Dataset. The given arguments override the default values during the dataset creation.
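The "only explicitly given arguments are changed" behavior can be pictured with the sketch below. It is a simplified model, not the library's implementation: the class ConfigSketch and its attributes are hypothetical. It also assumes that keywords outside the named options are collected as dataset-creation arguments, which is consistent with the examples on this page, where shuffle=True passed to config() appears in the logged dataset configuration.

```python
class ConfigSketch:
    """Hypothetical stand-in that models a partial configuration update."""

    def __init__(self):
        # Converter-level options with the defaults listed in the signature.
        self.settings = {
            'logver': 0,
            'set_shuffle': False,
            'seq_len': 10,
            'seq_len_max': 20,
            'random_seed': 2048,
        }
        # Extra keywords, assumed to be forwarded to dataset creation.
        self.dataset_kwargs = {}

    def config(self, **kwargs):
        # Only the keys that were actually passed are touched.
        for key, value in kwargs.items():
            if key in self.settings:
                self.settings[key] = value
            else:
                self.dataset_kwargs[key] = value


if __name__ == '__main__':
    cfg = ConfigSketch()
    cfg.config(logver=1, compression='gzip')
    print(cfg.settings)
    print(cfg.dataset_kwargs)
```

In this model, a second call such as cfg.config(seq_len=5) would leave logver and compression exactly as the first call set them.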

convert

converter.convert(keyword, **kwargs)

Convert the h5py.Dataset given by keyword into a segmented dataset and save it to the output file. Note that before the conversion, the data should be arranged continuously along the batch axis.

If you have already converted or copied this keyword, do not do it again.

Requires

Argument Type Description
keyword str The keyword of the dataset to be converted into a segmented dataset.
**kwargs Any arguments used for creating h5py.Dataset. The given arguments override the default values and the configuration set by config() during the dataset creation.
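The override order described for **kwargs (defaults first, then values from config(), then the per-call arguments) amounts to a layered dictionary merge. The sketch below shows that precedence only. The function resolve_dataset_kwargs and the sample values are hypothetical, not taken from the library.

```python
def resolve_dataset_kwargs(defaults, configured, call_kwargs):
    """Merge three layers of options; later layers win on conflicts."""
    merged = dict(defaults)      # lowest priority: built-in defaults
    merged.update(configured)    # middle priority: values set by config()
    merged.update(call_kwargs)   # highest priority: per-call arguments
    return merged


if __name__ == '__main__':
    defaults = {'dtype': 'float32'}
    configured = {'compression': 'gzip', 'fletcher32': True}
    call_kwargs = {'compression': 'lzf'}
    print(resolve_dataset_kwargs(defaults, configured, call_kwargs))
```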

copy

converter.copy(keyword, **kwargs)

Copy the h5py.Dataset given by keyword into the output file.

If you have already converted or copied this keyword, do not do it again.

Requires

Argument Type Description
keyword str The keyword of the dataset to be copied into the output file.
**kwargs Any arguments used for creating h5py.Dataset. The given arguments override the default values and the configuration set by config() during the dataset creation.

open

converter.open(file_in_name, file_out_name=None)

Open a new file. If a file is already open, it is closed first. Both this method and the constructor (when file_in_name is specified) support context management.

Requires

Argument Type Description
file_in_name str The path of the non-sequence-formatted input file.
file_out_name str The path of the output data file. If not set, it defaults to file_in_name + '_seq'.
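One common way to let both the constructor and an open() method work inside a with statement is to have open() return the instance itself. The sketch below illustrates that pattern with a hypothetical ConverterSketch class that only tracks state. It does not touch any files and is not the library's implementation.

```python
class ConverterSketch:
    """Hypothetical stand-in showing the open/close and context pattern."""

    def __init__(self, file_in_name=None):
        self.file_in_name = None
        self.is_open = False
        if file_in_name is not None:
            self.open(file_in_name)

    def open(self, file_in_name):
        # Close the previous file first, as the description above states.
        if self.is_open:
            self.close()
        self.file_in_name = file_in_name
        self.is_open = True
        return self  # returning self lets "with obj.open(...)" work

    def close(self):
        self.is_open = False

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.close()
        return False  # do not suppress exceptions


if __name__ == '__main__':
    with ConverterSketch('first') as cvt:
        print(cvt.file_in_name, cvt.is_open)
    shared = ConverterSketch()
    with shared.open('second') as cvt:
        print(cvt is shared, cvt.is_open)
    print(shared.is_open)
```

With this design, a single configured instance can be reused for several files in sequence, which is the usage pattern shown in Example 2.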

close

converter.close()

Close the converter.

Examples

Example 1
import os
import mdnc

root_folder = 'alpha-test'
os.makedirs(root_folder, exist_ok=True)

if __name__ == '__main__':
    # Prepare the datasets.
    set_list_file = os.path.join(root_folder, 'web-data')
    mdnc.data.webtools.DataChecker.init_set_list(set_list_file)
    dc = mdnc.data.webtools.DataChecker(root=root_folder, set_list_file=set_list_file, token='', verbose=False)
    dc.add_query_file('test_data_h5seqconverter1.h5')
    dc.query()

    # Perform test.
    with mdnc.data.h5py.H5SeqConverter(os.path.join(root_folder, 'test_data_h5seqconverter1')) as cvt:
        cvt.config(logver=1, shuffle=True, fletcher32=True, compression='gzip')
        cvt.convert('data_to_sequence')
        cvt.copy('data_only_copied')
data.webtools: All required datasets are available.
data.h5py: Current configuration is: {'dtype': <class 'numpy.float32'>, 'shuffle': True, 'fletcher32': True, 'compression': 'gzip'}
data.h5py: Convert data_to_sequence into the output file. The original data shape is (1000,), splitted into 64 parts.
data.h5py: Copy data_only_copied into the output file. The data shape is (1000,).
Example 2
import os
import mdnc

root_folder = 'alpha-test'
os.makedirs(root_folder, exist_ok=True)

if __name__ == '__main__':
    # Prepare the datasets.
    set_list_file = os.path.join(root_folder, 'web-data')
    mdnc.data.webtools.DataChecker.init_set_list(set_list_file)
    dc = mdnc.data.webtools.DataChecker(root=root_folder, set_list_file=set_list_file, token='', verbose=False)
    dc.add_query_file(['test_data_h5seqconverter1.h5', 'test_data_h5seqconverter2.h5'])
    dc.query()

    # Perform test.
    converter = mdnc.data.h5py.H5SeqConverter()
    converter.config(logver=1, shuffle=True, fletcher32=True, compression='gzip')
    with converter.open(os.path.join(root_folder, 'test_data_h5seqconverter1')) as cvt:
        cvt.convert('data_to_sequence')
        cvt.copy('data_only_copied')
    with converter.open(os.path.join(root_folder, 'test_data_h5seqconverter2')) as cvt:
        cvt.convert('data_to_sequence')
        cvt.copy('data_only_copied')
data.webtools: All required datasets are available.
data.h5py: Current configuration is: {'dtype': <class 'numpy.float32'>, 'shuffle': True, 'fletcher32': True, 'compression': 'gzip'}
data.h5py: Open a new read file: alpha-test\test_data_h5seqconverter1.h5
data.h5py: Open a new output file: alpha-test\test_data_h5seqconverter1_seq.h5
data.h5py: Convert data_to_sequence into the output file. The original data shape is (1000,), splitted into 64 parts.
data.h5py: Copy data_only_copied into the output file. The data shape is (1000,).
data.h5py: Open a new read file: alpha-test\test_data_h5seqconverter2.h5
data.h5py: Open a new output file: alpha-test\test_data_h5seqconverter2_seq.h5
data.h5py: Convert data_to_sequence into the output file. The original data shape is (1000,), splitted into 64 parts.
data.h5py: Copy data_only_copied into the output file. The data shape is (1000,).

Last update: March 14, 2021
