
data.h5py.H5GParser

Class · Context · Source

dset = mdnc.data.h5py.H5GParser(
    file_name, keywords, batch_size=32, shuffle=True, shuffle_seed=1000,
    preprocfunc=None, num_workers=4, num_buffer=10
)

Group-wise parsing dataset. This class allows users to feed one .h5 file and convert it to an mdnc.data.sequence.MPSequence. The workflow could be described as follows:

  1. Create an .h5 file indexer. This indexer would be initialized by sequence.MPSequence, and it would use the user-defined keywords to fetch a group of h5py.Dataset objects.
  2. Estimate the h5py.Dataset sizes; each dataset should share the same length (but could have different shapes).
  3. Use the dataset size to create a sequence.MPSequence, and allow it to randomly shuffle the indices in each epoch.
  4. Invoke the sequence.MPSequence APIs to serve the parallel dataset parsing.

Certainly, you could also use this parser to load a single h5py.Dataset. For details about the parallel parsing workflow, please check mdnc.data.sequence.MPSequence.
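For instance, a minimal end-to-end usage could look like the following sketch. The file name 'test_dataset' and the keywords 'x', 'y' are hypothetical placeholders; any .h5 file containing equally sized datasets under those keywords would work.

import mdnc

# Hypothetical file "test_dataset.h5" with two equally sized datasets "x" and "y".
dset = mdnc.data.h5py.H5GParser('test_dataset', ['x', 'y'],
                                batch_size=32, shuffle=True, num_workers=4)
with dset.start() as p:
    for x, y in p:
        ...  # each mini-batch is unpacked in the order of the given keywords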

Arguments

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| file_name | str | The path of the .h5 file (could be without the postfix). |
| keywords | (str, ) | A list of keywords (or a single keyword). |
| batch_size | int | Number of samples in each mini-batch. |
| shuffle | bool | If enabled, shuffle the dataset at the beginning of each epoch. |
| shuffle_seed | int | The seed for random shuffling. |
| preprocfunc | object | This function would be applied to the produced data, serving as a pre-processing tool. Note that it processes the mini-batches produced by the parser. The details of this argument are explained in the following tips. |
| num_workers | int | The number of parallel workers. |
| num_buffer | int | The buffer size of the data pool, i.e. the maximal number of mini-batches stored in the memory. |
Tip

The minimal requirement for the argument preprocfunc is to be a function, or an object implementing the __call__() method. This function accepts all input mini-batch variables formatted as np.ndarray, and returns the pre-processed results. The number of returned variables could differ from the number of input variables. In some cases, you could use the pre-processors provided in the mdnc.data.preprocs module. The processors in this module support our Broadcasting Pre- and Post- Processor Protocol. For example:

Example

import mdnc

def preprocfunc(x1, x2):
    return x1 + x2

mdnc.data.h5py.H5GParser(..., keywords=['x_1', 'x_2'],
                         preprocfunc=preprocfunc)
import mdnc

class PreprocWithArgs:
    def __init__(self, a):
        self.a = a

    def __call__(self, x1, x2):
        return x1, self.a * x2

mdnc.data.h5py.H5GParser(..., keywords=['x_1', 'x_2'],
                         preprocfunc=PreprocWithArgs(a=0.1))
import mdnc

mdnc.data.h5py.H5GParser(..., keywords=['x_1', 'x_2'],
                         preprocfunc=mdnc.data.preprocs.ProcScaler())

Warning

The argument preprocfunc is required to be a picklable object. Therefore, a lambda function, or a function defined inside the if __name__ == '__main__' block, is not allowed in this case.
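As an illustration of this requirement, the following sketch (using the hypothetical keywords from the examples above) contrasts an allowed module-level function with definitions that would fail to pickle:

import mdnc

def scale_batch(x1, x2):  # defined at module level: picklable, allowed
    return 0.5 * x1, 0.5 * x2

mdnc.data.h5py.H5GParser(..., keywords=['x_1', 'x_2'],
                         preprocfunc=scale_batch)

# Not allowed: a lambda cannot be pickled.
# mdnc.data.h5py.H5GParser(..., preprocfunc=lambda x1, x2: (0.5 * x1, 0.5 * x2))

# Not allowed: a function defined inside "if __name__ == '__main__':" cannot be
# pickled by the worker processes either.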

Methods

check_dsets

sze = dset.check_dsets(file_path, keywords)

Check the size of the h5py.Dataset objects and validate all datasets. A valid group of datasets requires each h5py.Dataset to share the same length (sample number). On success, the method returns the size of the datasets. This method is invoked during the initialization, so users do not need to call it explicitly.

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| file_path | str | The path of the HDF5 dataset to be validated. |
| keywords | (str, ) | The keywords to be validated. Each keyword should point or redirect to an h5py.Dataset. |

Returns

| Argument | Description |
| :------- | :---------- |
| sze | An int, the shared size (sample number) of all datasets. |
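Although this method is called automatically, it could also be invoked manually, for example to validate another file that follows the same keyword layout (the file name below is hypothetical):

sze = dset.check_dsets('another_dataset.h5', ['one', 'zero'])
print('data.h5py: all datasets share {0} samples.'.format(sze))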

get_attrs

attrs = dset.get_attrs(keyword, *args, attr_names=None)

Get the attributes by the keyword.

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| keyword | str | The keyword of an h5py.Dataset in the to-be-loaded file. |
| attr_names | (str, ) | A sequence of required attribute names. |
| *args | | Other attribute names; they would be appended to the argument attr_names by list.extend(). |

Returns

| Argument | Description |
| :------- | :---------- |
| attrs | A list of the required attribute values. |
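For instance, assuming the dataset 'one' carries the attributes 'mean' and 'std' (hypothetical names), the two calls below would be equivalent:

mean, std = dset.get_attrs('one', 'mean', 'std')
mean, std = dset.get_attrs('one', attr_names=['mean', 'std'])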

get_file

f = dset.get_file(enable_write=False)

Get a file object of the to-be-loaded file.

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| enable_write | bool | If enabled, the file would be opened in the a (append) mode. Otherwise, the r (read-only) mode is used. |

Returns

| Argument | Description |
| :------- | :---------- |
| f | The h5py.File object of the to-be-loaded file. |
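The returned h5py.File could be used directly, for example to inspect the stored shapes (the keywords follow the examples below):

f = dset.get_file()  # opened in the r mode by default
print(f['one'].shape, f['zero'].shape)
f.close()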

start

dset.start(compat=None)

Start the process pool. This method is implemented by mdnc.data.sequence.MPSequence. It supports context management.

Running start() or start_test() would interrupt the started sequence.

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| compat | bool | Whether to fall back to multi-threading for the sequence out-type converter. If set to None, the decision would be made by checking os.name. The compatible mode is required on Windows. |
Tip

This method supports context management. Using the context is recommended. Here we show two examples:

dset.start()
for ... in dset:
    ...
dset.finish()
with dset.start() as ds:
    for ... in ds:
        ...
Danger

A cuda.Tensor could not be put into the queue on Windows (but it could on Linux), see

https://pytorch.org/docs/stable/notes/windows.html#cuda-ipc-operations

To solve this problem, we need to fall back to multi-threading for the sequence out-type converter on Windows.
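If the automatic detection does not fit a particular setup, the fallback could also be requested explicitly; the sketch below forces it on Windows only (compat=None would already make this decision automatically):

import os

dset.start(compat=(os.name == 'nt'))  # multi-threading fallback on Windows only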

Warning

Even if you set shuffle=False, due to the mechanism of the parallelization, the sample order during the iteration may still get slightly shuffled. To ensure that your sample order is not changed, please use shuffle=False during the initialization and use start_test() instead.


start_test

dset.start_test(test_mode='default')

Start the test mode. In the test mode, the process pool would not be opened. All operations would be finished in the main thread. However, the random indices are still generated with the same seed as in the parallel dset.start() mode.

Running start() or start_test() would interrupt the started sequence.

Requires

| Argument | Type | Description |
| :------- | :--- | :---------- |
| test_mode | str | Could be 'default', 'cpu', or 'numpy' (see below). |

  • 'default': the output would be converted as in the start() mode.
  • 'cpu': even if 'cuda' is set as the output type, the testing output would not be moved to the GPU.
  • 'numpy': would ignore all out_type configurations and return the original output. This output is still pre-processed.
Tip

This method also supports context management. See start() to check how to use it.
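For instance, a deterministic single-thread pass over the data could look like this sketch, where the 'numpy' mode keeps the pre-processed np.ndarray outputs:

with dset.start_test(test_mode='numpy') as p:
    for x_one, x_zero in p:
        ...  # pre-processed np.ndarray mini-batches, produced in the main thread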


finish

dset.finish()

Finish the process pool. The compatible mode would be auto-detected according to the previous start().

Properties

len(), batch_num

len(dset)
dset.batch_num

The length of the dataset. It is the number of mini-batches, which is also the number of iterations for each epoch.
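Assuming the trailing incomplete mini-batch is kept (as suggested by the batch_size note below), the value would relate to size and batch_size as in this sketch:

import math

assert len(dset) == dset.batch_num == math.ceil(dset.size / dset.batch_size)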


iter()

for x1, x2, ... in dset:
    ...

The iterator. It is recommended to use it inside the context. The unpacked variables x1, x2, ... are ordered according to the argument keywords given during the initialization.


size

dset.size

The size of the dataset. It is the total number of samples for each epoch.


batch_size

dset.batch_size

The size of each mini-batch. This value is given by the argument batch_size during the initialization. The last mini-batch may be smaller than this value.


preproc

dset.preproc

The argument preprocfunc given during the initialization. This property allows users to invoke the pre-processor manually.
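For example, when one of the mdnc.data.preprocs processors is used, the property could be applied to arbitrary arrays and reverted afterwards (the arrays below are hypothetical; see also Example 2):

import numpy as np

x_one, x_zero = np.random.rand(3, 20), np.random.rand(3, 10)  # hypothetical mini-batch arrays
y_one, y_zero = dset.preproc(x_one, x_zero)                    # apply the pre-processing by hand
x_one_, x_zero_ = dset.preproc.postprocess(y_one, y_zero)      # revert it (preprocs processors only)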

Examples

Example 1
import os
import mdnc

root_folder = 'alpha-test'
os.makedirs(root_folder, exist_ok=True)

if __name__ == '__main__':
    # Prepare the datasets.
    set_list_file = os.path.join(root_folder, 'web-data')
    mdnc.data.webtools.DataChecker.init_set_list(set_list_file)
    dc = mdnc.data.webtools.DataChecker(root=root_folder, set_list_file=set_list_file, token='', verbose=False)
    dc.add_query_file('test_data_h5gparser.h5')
    dc.query()

    # Perform test.
    dset = mdnc.data.h5py.H5GParser(os.path.join(root_folder, 'test_data_h5gparser'), ['one', 'zero'],
                                    batch_size=3, num_workers=4, shuffle=True, preprocfunc=None)
    with dset.start() as p:
        for i, data in enumerate(p):
            print('data.h5py: Epoch 1, Batch {0}'.format(i), data[0].shape, data[1].shape)

        for i, data in enumerate(p):
            print('data.h5py: Epoch 2, Batch {0}'.format(i), data[0].shape, data[1].shape)
data.webtools: All required datasets are available.
data.h5py: Epoch 1, Batch 0 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 1 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 2 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 3 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 4 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 5 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 6 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 7 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 8 torch.Size([1, 20]) torch.Size([1, 10])
data.h5py: Epoch 2, Batch 0 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 1 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 2 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 3 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 4 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 5 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 6 torch.Size([1, 20]) torch.Size([1, 10])
data.h5py: Epoch 2, Batch 7 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 8 torch.Size([3, 20]) torch.Size([3, 10])
Example 2
import os
import numpy as np
import mdnc

root_folder = 'alpha-test'
os.makedirs(root_folder, exist_ok=True)

if __name__ == '__main__':
    # Prepare the datasets.
    set_list_file = os.path.join(root_folder, 'web-data')
    mdnc.data.webtools.DataChecker.init_set_list(set_list_file)
    dc = mdnc.data.webtools.DataChecker(root=root_folder, set_list_file=set_list_file, token='', verbose=False)
    dc.add_query_file('test_data_h5gparser.h5')
    dc.query()

    # Perform test.
    dset = mdnc.data.h5py.H5GParser(os.path.join(root_folder, 'test_data_h5gparser'), ['one', 'zero'],
                                    batch_size=3, num_workers=4, shuffle=True,
                                    preprocfunc=mdnc.data.preprocs.ProcScaler())
    with dset.start_test() as p:
        for i, (d_one, d_two) in enumerate(p):
            d_one, d_two = d_one.cpu().numpy(), d_two.cpu().numpy()
            std_one, std_two = np.std(d_one), np.std(d_two)
            d_one, d_two = p.preproc.postprocess(d_one, d_two)
            std_one_, std_two_ = np.std(d_one), np.std(d_two)
            print('Before: {0}, {1}; After: {2}, {3}.'.format(std_one, std_two, std_one_, std_two_))
data.webtools: All required datasets are available.
Before: 0.4213927686214447, 0.5810447931289673; After: 3.4976863861083984, 4.269893169403076.
Before: 0.47204485535621643, 0.5270004868507385; After: 3.2560627460479736, 5.232884407043457.
Before: 0.380888432264328, 0.5548458099365234; After: 2.69606876373291, 4.5017008781433105.
Before: 0.555243968963623, 0.5082056522369385; After: 3.231991767883301, 5.085717678070068.
Before: 0.39406657218933105, 0.5630286931991577; After: 2.8078441619873047, 5.10365629196167.
Before: 0.49584802985191345, 0.5255910754203796; After: 2.706739664077759, 5.646749019622803.
Before: 0.4346843361854553, 0.5725106000900269; After: 2.7871317863464355, 4.466533660888672.
Before: 0.5043540000915527, 0.5292088389396667; After: 2.373351573944092, 4.446733474731445.
Before: 0.46324262022972107, 0.6497944593429565; After: 2.350776433944702, 5.593009948730469.

Last update: March 14, 2021
