data.h5py.H5GParser¶
Class · Context · Source
dset = mdnc.data.h5py.H5GParser(
file_name, keywords, batch_size=32, shuffle=True, shuffle_seed=1000,
preprocfunc=None, num_workers=4, num_buffer=10
)
A parser for grouped datasets. This class allows users to feed one .h5 file, and converts it to an `mdnc.data.sequence.MPSequence`. The workflow could be described as:

- Create the `.h5` file indexer; this indexer would be initialized by `sequence.MPSequence`. It would use the user-defined `keywords` to get a group of `h5py.Dataset` objects.
- Estimate the `h5py.Dataset` sizes; each dataset should share the same size (but could have different shapes).
- Use the dataset size to create a `sequence.MPSequence`, and allow it to randomly shuffle the indices in each epoch.
- Invoke the `sequence.MPSequence` APIs to serve the parallel dataset parsing.
Certainly, you could use this parser to load a single `h5py.Dataset`. To find details about the parallel parsing workflow, please check `mdnc.data.sequence.MPSequence`.
Arguments¶
Requires
Argument | Type | Description |
---|---|---|
`file_name` | `str` | The path of the `.h5` file (could be without the postfix). |
`keywords` | `(str, )` | Should be a list of keywords (or a single keyword). |
`batch_size` | `int` | Number of samples in each mini-batch. |
`shuffle` | `bool` | If enabled, shuffle the dataset at the beginning of each epoch. |
`shuffle_seed` | `int` | The seed for random shuffling. |
`preprocfunc` | `object` | This function would be applied to the produced data so that it could serve as a pre-processing tool. Note that this tool processes the batches produced by the parser. The details about this argument are shown in the following tips. |
`num_workers` | `int` | The number of parallel workers. |
`num_buffer` | `int` | The buffer size of the data pool; it means the maximal number of mini-batches stored in the memory. |
Tip
The minimal requirement for the argument `preprocfunc` is to be a function, or an object implementing the `__call__()` method. This function accepts all input mini-batch variables formatted as `np.ndarray`, and returns the pre-processed results. The number of returned variables could be different from the number of input variables. In some cases, you could use the pre-processors provided in the `mdnc.data.preprocs` module. The processors in this module support our Broadcasting Pre- and Post- Processor Protocol. For example:
Example
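The original snippet was lost during extraction. Below is a minimal sketch of a custom pre-processor satisfying the stated requirements (a picklable callable working on `np.ndarray` mini-batches); the class name, file name, and keywords are illustrative placeholders, not part of the mdnc API:

```python
import numpy as np
import mdnc

class ZScore:
    '''A hypothetical pre-processor: normalize each mini-batch
    variable to zero mean and unit variance.'''
    def __call__(self, *batches):
        # Accepts all mini-batch variables as np.ndarray and returns
        # the processed results (the count may differ in general).
        return tuple((b - b.mean()) / (b.std() + 1e-8) for b in batches)

dset = mdnc.data.h5py.H5GParser('test_data', keywords=('dset_1', 'dset_2'),
                                batch_size=32, preprocfunc=ZScore())
```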
Warning
The argument `preprocfunc` is required to be a picklable object. Therefore, a lambda function, or a function defined inside `if __name__ == '__main__':`, is not allowed in this case.
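For instance, a sketch of what is and is not acceptable (the function and file names are hypothetical):

```python
import mdnc

def double(x):
    # OK: a module-level function is picklable.
    return 2.0 * x

# Not allowed: lambdas (and functions defined under
# `if __name__ == '__main__':`) cannot be pickled.
# preprocfunc=lambda x: 2.0 * x

dset = mdnc.data.h5py.H5GParser('test_data', keywords=('dset_1', ),
                                preprocfunc=double)
```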
Methods¶
check_dsets¶
sze = dset.check_dsets(file_path, keywords)
Check the sizes of the `h5py.Dataset` objects and validate all datasets. A valid group of datasets requires that each `h5py.Dataset` shares the same length (sample number). On success, this method returns the size of the datasets. It is invoked during the initialization, so users do not need to call it explicitly.
Requires
Argument | Type | Description |
---|---|---|
`file_path` | `str` | The path of the HDF5 dataset to be validated. |
`keywords` | `(str, )` | The keywords to be validated. Each keyword should point to or redirect to an `h5py.Dataset`. |
Returns
Argument | Description |
---|---|
sze | An `int`, the size of all datasets. |
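A hypothetical invocation (the file path and keywords are placeholders):

```python
# Returns the shared sample number of the validated datasets.
sze = dset.check_dsets('test_data', ('dset_1', 'dset_2'))
```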
get_attrs¶
attrs = dset.get_attrs(keyword, *args, attr_names=None)
Get the attributes by the keyword.
Requires
Argument | Type | Description |
---|---|---|
`keyword` | `str` | The keyword of an `h5py.Dataset` in the to-be-loaded file. |
`attr_names` | `(str, )` | A sequence of required attribute names. |
`*args` | `(str, )` | Other attribute names; they would be attached to the argument `attr_names` by `list.extend()`. |
Returns
Argument | Description |
---|---|
attrs | A list of the required attribute values. |
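A hypothetical usage, assuming the dataset keyed by `'dset_1'` carries attributes named `'mean'` and `'std'`:

```python
# Positional extra names are appended to attr_names by list.extend().
mean, std = dset.get_attrs('dset_1', 'mean', 'std')
# Equivalent form, using the keyword argument only:
mean, std = dset.get_attrs('dset_1', attr_names=('mean', 'std'))
```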
get_file¶
f = dset.get_file(enable_write=False)
Get a file object of the to-be-loaded file.
Requires
Argument | Type | Description |
---|---|---|
`enable_write` | `bool` | If enabled, the file would be opened with the `a` mode. Otherwise, the `r` mode would be used. |
Returns
Argument | Description |
---|---|
f | The `h5py.File` object of the to-be-loaded file. |
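A minimal sketch of inspecting the file directly (`h5py.File` itself supports context management):

```python
# List the dataset keys in read-only mode.
with dset.get_file() as f:
    print(list(f.keys()))
```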
start¶
dset.start(compat=None)
Start the process pool. This method is implemented by `mdnc.data.sequence.MPSequence`. It supports context management.

Running `start()` or `start_test()` again would interrupt the already started sequence.
Requires
Argument | Type | Description |
---|---|---|
`compat` | `bool` | Whether to fall back to multi-threading for the sequence out-type converter. If set to `None`, the decision would be made by checking `os.name`. The compatible mode is required to be enabled on Windows. |
Tip
This method supports context management. Using the context is recommended. Here we show two examples:
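The original snippets were lost during extraction; the two sketches below follow the usage described above, assuming `start()` returns the parser itself as the context manager:

```python
# Recommended: context management; the pool is cleaned up automatically
# when leaving the context.
with dset.start() as p:
    for x1, x2 in p:
        ...  # consume the mini-batches
```

```python
# Equivalent manual form: remember to call finish() afterwards.
dset.start()
for x1, x2 in dset:
    ...  # consume the mini-batches
dset.finish()
```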
Danger
The `cuda.Tensor` could not be put into the queue on Windows (but on Linux it could); see
https://pytorch.org/docs/stable/notes/windows.html#cuda-ipc-operations
To solve this problem, we need to fall back to multi-threading for the sequence out-type converter on Windows.
Warning
Even if you set `shuffle=False`, due to the mechanism of the parallelization, the sample order during the iteration may still get a little bit shuffled. To ensure that your sample order is not changed, please use `shuffle=False` during the initialization and use `start_test()` instead.
start_test¶
dset.start_test(test_mode='default')
Start the test mode. In the test mode, the process pool would not be opened. All operations would be finished in the main thread. However, the random indices are still generated with the same seed as the parallel `dset.start()` mode.

Running `start()` or `start_test()` again would interrupt the already started sequence.
Requires
Argument | Type | Description |
---|---|---|
`test_mode` | `str` | Could be `'default'`, `'cpu'`, or `'numpy'`. |
Tip
This method also supports context management. See `start()` to check how to use it.
finish¶
dset.finish()
Finish the process pool. The compatible mode would be auto-detected according to the previous `start()` call.
Properties¶
len(), batch_num¶
len(dset)
dset.batch_num
The length of the dataset. It is the number of mini-batches, also the number of iterations for each epoch.
iter()¶
for x1, x2, ... in dset:
...
The iterator. It is recommended to use it inside the context. The unpacked variables `x1, x2, ...` are ordered according to the argument `keywords` given during the initialization.
size¶
dset.size
The size of the dataset. It is the total number of samples for each epoch.
batch_size¶
dset.batch_size
The size of each mini-batch. This value is given by the argument `batch_size` during the initialization. The size of the last batch may be smaller than this value.
preproc¶
dset.preproc
The argument `preprocfunc` given during the initialization. This property helps users to invoke the pre-processor manually.
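A hypothetical manual invocation (assuming the parser was initialized with a pre-processor and `x` stands in for a raw mini-batch):

```python
import numpy as np

x = np.random.rand(4, 20)  # a stand-in raw mini-batch
x_proc = dset.preproc(x)   # apply the same pre-processing manually
```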
Examples¶
Example 1
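The original code block was lost during extraction. The sketch below is consistent with the logged output (25 samples, `batch_size=3`, two epochs); the original example fetched a prepared test file via `mdnc.data.webtools` (hence the first log line), while this sketch writes an equivalent file locally. The file name and keywords are placeholders.

```python
import numpy as np
import h5py
import mdnc

if __name__ == '__main__':
    # Stand-in for the downloaded test data: 25 samples, two variables.
    with h5py.File('test_data.h5', 'w') as f:
        f.create_dataset('dset_1', data=np.random.rand(25, 20))
        f.create_dataset('dset_2', data=np.random.rand(25, 10))

    dset = mdnc.data.h5py.H5GParser('test_data', keywords=('dset_1', 'dset_2'),
                                    batch_size=3, num_workers=4, shuffle=True)
    # Assuming the sequence can be iterated once per epoch inside the context.
    with dset.start() as p:
        for epoch in range(2):
            for i, (x, y) in enumerate(p):
                print('data.h5py: Epoch {0}, Batch {1}'.format(epoch + 1, i),
                      x.shape, y.shape)
```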
data.webtools: All required datasets are available.
data.h5py: Epoch 1, Batch 0 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 1 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 2 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 3 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 4 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 5 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 6 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 7 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 1, Batch 8 torch.Size([1, 20]) torch.Size([1, 10])
data.h5py: Epoch 2, Batch 0 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 1 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 2 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 3 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 4 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 5 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 6 torch.Size([1, 20]) torch.Size([1, 10])
data.h5py: Epoch 2, Batch 7 torch.Size([3, 20]) torch.Size([3, 10])
data.h5py: Epoch 2, Batch 8 torch.Size([3, 20]) torch.Size([3, 10])
Example 2
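This code block was also lost during extraction. Judging from the log, the example compared mini-batch statistics before and after pre-processing. The pre-processor used originally is unknown, so a hypothetical scaler stands in below; the printed values would not match the log exactly.

```python
import numpy as np
import h5py
import mdnc

class Scaler:
    '''A hypothetical picklable pre-processor: amplify each variable.'''
    def __init__(self, scale=8.0):
        self.scale = scale

    def __call__(self, *batches):
        return tuple(self.scale * b for b in batches)

if __name__ == '__main__':
    with h5py.File('test_data.h5', 'w') as f:
        f.create_dataset('dset_1', data=np.random.rand(25, 20))
        f.create_dataset('dset_2', data=np.random.rand(25, 10))

    dset = mdnc.data.h5py.H5GParser('test_data', keywords=('dset_1', 'dset_2'),
                                    batch_size=3)
    proc = Scaler()  # could equally be passed as preprocfunc
    with dset.start() as p:
        for x, y in p:
            x, y = np.asarray(x), np.asarray(y)
            px, py = proc(x, y)
            print('Before: {0}, {1}; After: {2}, {3}.'.format(
                x.mean(), y.mean(), px.mean(), py.mean()))
```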
data.webtools: All required datasets are available.
Before: 0.4213927686214447, 0.5810447931289673; After: 3.4976863861083984, 4.269893169403076.
Before: 0.47204485535621643, 0.5270004868507385; After: 3.2560627460479736, 5.232884407043457.
Before: 0.380888432264328, 0.5548458099365234; After: 2.69606876373291, 4.5017008781433105.
Before: 0.555243968963623, 0.5082056522369385; After: 3.231991767883301, 5.085717678070068.
Before: 0.39406657218933105, 0.5630286931991577; After: 2.8078441619873047, 5.10365629196167.
Before: 0.49584802985191345, 0.5255910754203796; After: 2.706739664077759, 5.646749019622803.
Before: 0.4346843361854553, 0.5725106000900269; After: 2.7871317863464355, 4.466533660888672.
Before: 0.5043540000915527, 0.5292088389396667; After: 2.373351573944092, 4.446733474731445.
Before: 0.46324262022972107, 0.6497944593429565; After: 2.350776433944702, 5.593009948730469.