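TensorFlow runs out of GPU memory: Allocator (GPU_0_bfc) ran out of memory trying to allocate
Question
I am new to TensorFlow and I am having trouble building a dataset. I work on Windows 10 with TensorFlow 2.6.0 and CUDA. I have two NumPy arrays, X_train and X_test (already split). The training array is 5 GB and the test array is 1.5 GB. The shapes are:
X_train: (259018, 30, 30, 3), <class 'numpy.ndarray'>
Y_train: (259018, 1), <class 'numpy.ndarray'>
I create the dataset with the following code: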
dataset_train = tf.data.Dataset.from_tensor_slices((X_train , Y_train)).batch(BATCH_SIZE)
with BATCH_SIZE = 32.
But I cannot create the dataset; I get the following error:
2021-09-02 15:26:35.429930: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX AVX2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-09-02 15:26:35.772235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 3495 MB memory: -> device: 0, name: NVIDIA GeForce RTX 3060 Laptop GPU, pci bus id: 0000:01:00.0, compute capability: 8.6
2021-09-02 15:26:36.414627: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 2700000000 exceeds 10% of free system memory.
2021-09-02 15:26:47.146977: W tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 607.1KiB (rounded to 621824)requested by op _EagerConst
If the cause is memory fragmentation maybe the environment variable 'TF_GPU_ALLOCATOR=cuda_malloc_async' will improve the situation.
Current allocation summary follows.
Current allocation summary follows.
2021-09-02 15:26:47.147299: I tensorflow/core/common_runtime/bfc_allocator.cc:1004] BFCAllocator dump for GPU_0_bfc
2021-09-02 15:26:47.147383: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (256): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.147514: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (512): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.147636: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1024): Total Chunks: 1, Chunks in use: 1. 1.2KiB allocated for chunks. 1.2KiB in use in bin. 1.0KiB client-requested in use in bin.
2021-09-02 15:26:47.147761: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.147905: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148040: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8192): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148157: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148276: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148402: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (65536): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148518: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (131072): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148645: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148786: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (524288): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.148918: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (1048576): Total Chunks: 1, Chunks in use: 1. 1.91MiB allocated for chunks. 1.91MiB in use in bin. 1.91MiB client-requested in use in bin.
2021-09-02 15:26:47.149079: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (2097152): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.149212: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (4194304): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.149342: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (8388608): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.149477: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (16777216): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.164471: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (33554432): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.164619: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.164765: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (134217728): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2021-09-02 15:26:47.164884: I tensorflow/core/common_runtime/bfc_allocator.cc:1011] Bin (268435456): Total Chunks: 2, Chunks in use: 2. 3.41GiB allocated for chunks. 3.41GiB in use in bin. 3.30GiB client-requested in use in bin.
2021-09-02 15:26:47.164982: I tensorflow/core/common_runtime/bfc_allocator.cc:1027] Bin for 607.2KiB was 512.0KiB, Chunk State:
2021-09-02 15:26:47.165040: I tensorflow/core/common_runtime/bfc_allocator.cc:1040] Next region of size 3665166336
2021-09-02 15:26:47.165106: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at b0e200000 of size 2700000000 next 1
2021-09-02 15:26:47.165159: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at baf0ebb00 of size 1280 next 2
2021-09-02 15:26:47.165208: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at baf0ec000 of size 2000128 next 3
2021-09-02 15:26:47.165250: I tensorflow/core/common_runtime/bfc_allocator.cc:1060] InUse at baf2d4500 of size 963164928 next 18446744073709551615
2021-09-02 15:26:47.165297: I tensorflow/core/common_runtime/bfc_allocator.cc:1065] Summary of in-use Chunks by size:
2021-09-02 15:26:47.165341: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 1280 totalling 1.2KiB
2021-09-02 15:26:47.165382: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 2000128 totalling 1.91MiB
2021-09-02 15:26:47.165426: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 963164928 totalling 918.54MiB
2021-09-02 15:26:47.165470: I tensorflow/core/common_runtime/bfc_allocator.cc:1068] 1 Chunks of size 2700000000 totalling 2.51GiB
2021-09-02 15:26:47.165514: I tensorflow/core/common_runtime/bfc_allocator.cc:1072] Sum Total of in-use chunks: 3.41GiB
2021-09-02 15:26:47.165558: I tensorflow/core/common_runtime/bfc_allocator.cc:1074] total_region_allocated_bytes_: 3665166336 memory_limit_: 3665166336 available bytes: 0 curr_region_allocation_bytes_: 7330332672
2021-09-02 15:26:47.165633: I tensorflow/core/common_runtime/bfc_allocator.cc:1080] Stats:
Limit: 3665166336
InUse: 3665166336
MaxInUse: 3665166336
NumAllocs: 4
MaxAllocSize: 2700000000
Reserved: 0
PeakReserved: 0
LargestFreeBlock: 0
2021-09-02 15:26:47.165771: W tensorflow/core/common_runtime/bfc_allocator.cc:468] *************************************************************************************************xxx
Traceback (most recent call last):
File "C:/Users/headl/Documents/github projects/datascience/DL_model_deep_insight.py", line 100, in <module>
dataset_train, dataset_test = prepare_tf_dataset(path_to_x_train, config.y_train_combined,
File "C:/Users/headl/Documents/github projects/datascience/DL_model_deep_insight.py", line 28, in prepare_tf_dataset
dataset_test = tf.data.Dataset.from_tensor_slices((X_test , Y_test)).batch(BATCH_SIZE)
File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 685, in from_tensor_slices
return TensorSliceDataset(tensors)
File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\data\ops\dataset_ops.py", line 3844, in __init__
element = structure.normalize_element(element)
File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\data\util\structure.py", line 129, in normalize_element
ops.convert_to_tensor(t, name="component_%d" % i, dtype=dtype))
File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\profiler\trace.py", line 163, in wrapped
return func(*args, **kwargs)
File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\ops.py", line 1566, in convert_to_tensor
ret = conversion_func(value, dtype=dtype, name=name, as_ref=as_ref)
File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\tensor_conversion_registry.py", line 52, in _default_conversion_function
return constant_op.constant(value, dtype, name=name)
File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\constant_op.py", line 271, in constant
return _constant_impl(value, dtype, shape, name, verify_shape=False,
File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\constant_op.py", line 283, in _constant_impl
return _constant_eager_impl(ctx, value, dtype, shape, verify_shape)
File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\constant_op.py", line 308, in _constant_eager_impl
t = convert_to_eager_tensor(value, ctx, dtype)
File "C:\Users\headl\Documents\virtual_env\datascience\lib\site-packages\tensorflow\python\framework\constant_op.py", line 106, in convert_to_eager_tensor
return ops.EagerTensor(value, ctx.device_name, dtype)
tensorflow.python.framework.errors_impl.InternalError: Failed copying input tensor from /job:localhost/replica:0/task:0/device:CPU:0 to /job:localhost/replica:0/task:0/device:GPU:0 in order to run _EagerConst: Dst tensor is not initialized.
Process finished with exit code 1
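It looks like the GPU is running out of memory. Indeed, when I watch the process in the Windows Task Manager, I can see GPU usage peak just before the script dies.
I tried using only part of X_train. I can create the dataset with up to X_train[:240000]; when I add more rows after that, the error appears.
I thought a TensorFlow dataset was a generator that would handle the memory problem through batching? Also, reducing the batch size has no effect.
I also tried the suggested 'TF_GPU_ALLOCATOR=cuda_malloc_async', but that did not work either.
How can I load the entire data?
Thanks in advance!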
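Answer
That is working as designed. from_tensor_slices is really only useful for small amounts of data. tf.data.Dataset is designed for large datasets that need to be streamed from disk.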
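To make the "small amounts of data" point concrete: the traceback in the question shows from_tensor_slices calling convert_to_tensor, which tries to copy the whole array to the GPU as a single constant (_EagerConst), regardless of the batch size. A rough size check like the one below shows whether that can fit. This is only a sketch: the dtype of X_train is not stated in the question, so float64 is an assumption (it is roughly consistent with the reported 5 GB), and the limit is the memory_limit_ value from the allocator log.

```python
import math

import numpy as np

# memory_limit_ reported by the BFC allocator in the log from the question.
GPU_MEMORY_LIMIT_BYTES = 3_665_166_336


def array_nbytes(shape, dtype):
    """Bytes needed to hold one dense array of this shape and dtype."""
    return math.prod(shape) * np.dtype(dtype).itemsize


# Assumption: X_train is float64 (the question does not state the dtype).
x_train_bytes = array_nbytes((259018, 30, 30, 3), np.float64)

print(x_train_bytes)                           # 5594788800
print(x_train_bytes > GPU_MEMORY_LIMIT_BYTES)  # True
```

Under that assumption the array alone is larger than everything the allocator is allowed to use, which is why changing BATCH_SIZE makes no difference.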
The harder but ideal way to do this is to write your NumPy array data to TFRecords and then read it back in as a dataset via TFRecordDataset. Here is the guide: https://www.tensorflow.org/tutorials/load_data/tfrecord
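A minimal sketch of that TFRecord round trip might look like the following. The helper names (write_tfrecord, parse_example, read_tfrecord) and the feature keys "x" and "y" are made up for illustration, and the storage choices are assumptions: images are written as raw float32 bytes and labels as a single integer per row, which may not match your actual dtypes.

```python
import numpy as np
import tensorflow as tf

IMAGE_SHAPE = (30, 30, 3)  # per-example shape taken from the question


def write_tfrecord(path, x, y):
    """Write (x[i], y[i]) pairs to a TFRecord file, one Example per row."""
    with tf.io.TFRecordWriter(path) as writer:
        for xi, yi in zip(x, y):
            feature = {
                # Assumption: each image is stored as raw float32 bytes.
                "x": tf.train.Feature(bytes_list=tf.train.BytesList(
                    value=[np.asarray(xi, dtype=np.float32).tobytes()])),
                # Assumption: labels are integers with shape (1,).
                "y": tf.train.Feature(int64_list=tf.train.Int64List(
                    value=[int(yi[0])])),
            }
            example = tf.train.Example(
                features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())


def parse_example(record):
    """Decode one serialized Example back into an (image, label) pair."""
    parsed = tf.io.parse_single_example(record, {
        "x": tf.io.FixedLenFeature([], tf.string),
        "y": tf.io.FixedLenFeature([1], tf.int64),
    })
    image = tf.reshape(tf.io.decode_raw(parsed["x"], tf.float32), IMAGE_SHAPE)
    return image, parsed["y"]


def read_tfrecord(path, batch_size=32):
    """Stream batches from disk instead of holding the array in one tensor."""
    return tf.data.TFRecordDataset(path).map(parse_example).batch(batch_size)
```

With this layout you would call write_tfrecord once per split, then build the training dataset with read_tfrecord(path, BATCH_SIZE); only the batches being processed need to be materialized as tensors.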
The easier but less performant way is Dataset.from_generator. Here is a minimal example:
>>> import numpy as np
>>> import tensorflow as tf
>>> ds = tf.data.Dataset.from_generator(lambda: np.arange(100), output_signature=tf.TensorSpec(shape=(), dtype=tf.int32))
>>> for d in ds:
... print(d)
...
tf.Tensor(0, shape=(), dtype=int32)
tf.Tensor(1, shape=(), dtype=int32)
...
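Applied to arrays shaped like the ones in the question, the same idea could be sketched as below. The helper name make_dataset is hypothetical, and the output signature is derived from the arrays themselves because the question does not state their dtypes.

```python
import numpy as np
import tensorflow as tf


def make_dataset(x, y, batch_size=32):
    """Yield (x[i], y[i]) one row at a time so no full-array tensor is built."""
    def gen():
        for i in range(len(x)):
            yield x[i], y[i]

    return tf.data.Dataset.from_generator(
        gen,
        output_signature=(
            # Per-example shape and dtype are taken from the arrays.
            tf.TensorSpec(shape=x.shape[1:], dtype=tf.as_dtype(x.dtype)),
            tf.TensorSpec(shape=y.shape[1:], dtype=tf.as_dtype(y.dtype)),
        ),
    ).batch(batch_size)
```

Something like dataset_train = make_dataset(X_train, Y_train, BATCH_SIZE) would then replace the from_tensor_slices call. The arrays stay in host memory as NumPy data, and only individual batches are converted to tensors, at the cost of running the generator in Python.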