$ python $NCI_AI_ML_ROOT/examples/mnist/ray_tensorflow_mnist.py -n ${PBS_NGPUS} --use-gpu
2022-08-10 17:07:42,327 INFO trainer.py:223 -- Trainer logs will be logged in: /home/900/rxy900/ray_results/train_2022-08-10_17-07-42
2022-08-10 17:07:47,625 INFO trainer.py:229 -- Run results will be logged in: /home/900/rxy900/ray_results/train_2022-08-10_17-07-42/run_001
(BaseWorkerMixin pid=164396, ip=10.6.10.12) 2022-08-10 17:07:47.956994: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=164396, ip=10.6.10.12) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=164394, ip=10.6.10.12) 2022-08-10 17:07:47.906882: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=164394, ip=10.6.10.12) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=4170012) 2022-08-10 17:07:47.977785: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=4170012) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=4170013) 2022-08-10 17:07:47.934390: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=4170013) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=4170010) 2022-08-10 17:07:47.884217: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=4170010) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:48.015297: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=164395, ip=10.6.10.12) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=164397, ip=10.6.10.12) 2022-08-10 17:07:48.151559: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=164397, ip=10.6.10.12) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=4170011) 2022-08-10 17:07:48.121281: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
(BaseWorkerMixin pid=4170011) To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:48.826591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3e:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:48.832490: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:5/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3e:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:48.842818: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:48.843293: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.12:60421
(BaseWorkerMixin pid=164396, ip=10.6.10.12) 2022-08-10 17:07:48.855935: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b1:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164396, ip=10.6.10.12) 2022-08-10 17:07:48.861014: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:6/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b1:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164396, ip=10.6.10.12) 2022-08-10 17:07:48.873149: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=164396, ip=10.6.10.12) 2022-08-10 17:07:48.873456: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.12:42355
(BaseWorkerMixin pid=164397, ip=10.6.10.12) 2022-08-10 17:07:48.842712: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164397, ip=10.6.10.12) 2022-08-10 17:07:48.849770: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:7/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164397, ip=10.6.10.12) 2022-08-10 17:07:48.865833: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=164397, ip=10.6.10.12) 2022-08-10 17:07:48.866318: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.12:57715
(BaseWorkerMixin pid=164394, ip=10.6.10.12) 2022-08-10 17:07:48.840350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164394, ip=10.6.10.12) 2022-08-10 17:07:48.844701: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:4/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0
(BaseWorkerMixin pid=164394, ip=10.6.10.12) 2022-08-10 17:07:48.854210: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=164394, ip=10.6.10.12) 2022-08-10 17:07:48.854448: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.12:41421
(BaseWorkerMixin pid=4170012) 2022-08-10 17:07:48.813775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b1:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170012) 2022-08-10 17:07:48.818832: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:2/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b1:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170012) 2022-08-10 17:07:48.828559: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=4170012) 2022-08-10 17:07:48.828828: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.6:53477
(BaseWorkerMixin pid=4170013) 2022-08-10 17:07:48.816269: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170013) 2022-08-10 17:07:48.820912: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:3/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:b2:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170013) 2022-08-10 17:07:48.834523: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=4170013) 2022-08-10 17:07:48.834990: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.6:46449
(BaseWorkerMixin pid=4170010) 2022-08-10 17:07:48.837828: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170010) 2022-08-10 17:07:48.842887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3d:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170010) 2022-08-10 17:07:48.856112: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=4170010) 2022-08-10 17:07:48.856568: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.6:51865
(BaseWorkerMixin pid=4170011) 2022-08-10 17:07:48.824444: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3e:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170011) 2022-08-10 17:07:48.829377: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:worker/replica:0/task:1/device:GPU:0 with 30989 MB memory: -> device: 0, name: Tesla V100-SXM2-32GB, pci bus id: 0000:3e:00.0, compute capability: 7.0
(BaseWorkerMixin pid=4170011) 2022-08-10 17:07:48.839946: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:272] Initialize GrpcChannelCache for job worker -> {0 -> 10.6.10.6:51865, 1 -> 10.6.10.6:56733, 2 -> 10.6.10.6:53477, 3 -> 10.6.10.6:46449, 4 -> 10.6.10.12:41421, 5 -> 10.6.10.12:60421, 6 -> 10.6.10.12:42355, 7 -> 10.6.10.12:57715}
(BaseWorkerMixin pid=4170011) 2022-08-10 17:07:48.840232: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:437] Started server with target: grpc://10.6.10.6:56733
(BaseWorkerMixin pid=164395, ip=10.6.10.12) 2022-08-10 17:07:51.007341: W tensorflow/core/grappler/optimizers/data/auto_shard.cc:776] AUTO sharding policy will apply DATA sharding policy as it failed to apply FILE sharding policy because of the following reason: Found an unshardable source dataset: name: "TensorSliceDataset/_2" ...