Extremely low throughput of running on IBM POWER9 processor #407
I worked with the POWER9 team before, and our numbers were often very close. I have no idea how this CPU should perform, but your result might not be far off; your thread settings are likely not good. As of TF 1.14 (I think), and certainly in the nightly builds, we have built in some of the MKL-DNN open-source features. I have no idea how that impacts POWER9; it worked fine on AMD, as I suspect it just checks for the supported instruction sets, but POWER9 is a very different architecture. You do not need most of those flags. Let me give you some data. I am not sure of your end objective, and I hope this helps a little.
I realize some of this info is a bit sloppy. I do not know exactly what you want, so I went with sharing a mix of material. Feel free to ping / mention me, or whatever they call it on GitHub. :-) I would like to have you testing with the official models; most of them should work on TF 1.x (1.14) or nightly. TF 2.0 would be better, but it is still in alpha, and I run nightly versions, so it is a bit bleeding edge.
Hi @tfboyd, thank you very much! I'm still playing with TF 1.12 for now, and the best number I see is around 3.6 images/sec. I may need to double-check the vector instruction extensions. BTW, can I build the MKL-DNN feature you mentioned on an IBM machine? I understand it is optimized for the Intel architecture. Thank you very much!
MKL is extremely specific to Intel processors, so you'll have to use the Eigen builds. In any case, training on CPUs is very slow; use those V100s. And I agree: thread settings and SMT mode will likely make a difference. TensorFlow will create a lot of threads if you let it, and that doesn't always help.
Thank you @jayfurmanek. Yes, I learned that MKL is for Intel and AMD processors. Actually, I think the vector instruction sets supported by TensorFlow (MKL, SSE, etc.) all target the Intel architecture. I'm still in the trial-and-error stage for threading and SMT mode; so far I haven't observed any clear trend. I'm using my self-built TF 1.12.
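To make the threading discussion concrete, here is a small sketch (my own illustration, not from the thread) of how one might pick a starting intra-op thread count from the POWER9 SMT mode: a 40-core POWER9 in SMT4 mode exposes 160 hardware threads, but setting intra-op parallelism to the number of physical cores is often a better starting point than the full hardware-thread count.

```python
def physical_cores(hw_threads: int, smt_mode: int) -> int:
    """Number of physical cores given total hardware threads and SMT mode.

    POWER9 supports SMT1, SMT2, and SMT4 (SMT8 on some parts);
    e.g. 160 hardware threads in SMT4 mode means 40 physical cores.
    """
    if smt_mode not in (1, 2, 4, 8):
        raise ValueError(f"unexpected SMT mode: {smt_mode}")
    return hw_threads // smt_mode

# A common starting point for a CPU-only benchmark:
# intra-op threads = physical cores, inter-op threads kept small.
intra = physical_cores(160, 4)
inter = 2
```

These values would then be passed as `--num_intra_threads` and `--num_inter_threads` (or via `tf.ConfigProto` in TF 1.x), sweeping around them rather than assuming one setting.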
The latest version of Eigen has specific support for POWER9, if TF is built with a compiler that supports POWER9 (this depends on the Linux distro). Also, you mention images/sec, so I presume this is some image-processing benchmark. TF uses NumPy and other libraries for image pre-processing. You should check whether you have OpenBLAS installed (the most recent release of OpenBLAS has some POWER9 enhancements, and further enhancements are committed but not yet available in an official release). Depending on the format of the images (jpg, etc.), one also needs to ensure that the best libraries are installed, e.g., libjpeg-turbo. You don't mention your configuration, your Linux distro, or the source of the components, but just as one installs MKL-DNN, etc. for Intel/AMD, one needs to install the appropriate optimized libraries for POWER9 to perform a meaningful comparison.
Thank you @edelsohn for introducing OpenBLAS to me. My system is Red Hat 7.6 with OpenBLAS 0.3.5, but it seems that POWER9 support landed in 0.3.7. Is that correct? Also, do I need to build my own ops to use OpenBLAS? Based on my understanding, the ops in TensorFlow are mainly based on Eigen and MKL.
0.3.7 contains the double-precision optimizations; the single-precision optimizations will be in the next release. They are in the GitHub master repo, so you can download it and build it yourself. OpenBLAS doesn't affect the TF ops -- the ops use Eigen. But not everything in TF is the DL tensor ops. You mention images/sec, so something needs to handle the image ingestion and preprocessing. Even with a GPU, TF's ingestion and pre-processing are handled by the CPU. Especially if you are testing inference, you shouldn't assume that the TF ops dominate the time. The preliminary ingestion and pre-processing are provided by NumPy, Python, and other libraries (OpenBLAS, libjpeg-turbo, libpng, FFmpeg, etc.).
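As a quick sanity check (my own sketch, not from the thread), Python's standard library can tell you whether optimized native libraries such as OpenBLAS or the JPEG/PNG codecs are even visible to the dynamic linker; the library names below are assumptions about typical sonames on Linux:

```python
import ctypes.util

def locate_native_libs(names=("openblas", "jpeg", "turbojpeg", "png")):
    """Map each library name to the soname/path the loader would use,
    or None if the linker cannot find it."""
    return {name: ctypes.util.find_library(name) for name in names}

found = locate_native_libs()
for name, path in found.items():
    print(f"{name}: {path or 'not found'}")
```

A `None` result for `openblas` or `turbojpeg` would suggest the optimized library is missing (or installed somewhere the linker does not search), which matters for the pre-processing path described above.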
If you are building TensorFlow yourself, you'll want to ensure your .tf_configure.bazelrc file targets power9 (or at least power8).
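As a sketch of what that could look like (the exact option lines were lost from the original comment; the `-mcpu=power9` / `-mtune=power9` copts here are my assumption based on the standard GCC flags for POWER targets):

```
build:opt --copt=-mcpu=power9
build:opt --copt=-mtune=power9
```

Substituting `power8` in both lines would give a binary that also runs on POWER8 hardware, at the cost of any POWER9-specific tuning.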
@tfboyd What are the new "perf tools" that will be added for TensorFlow 2.0? tf_cnn_benchmarks is deprecated going into TF 2.0, though it is still a great tool pre-TF 2.0. The new "perf tools" are different and focus on end-user performance (plus flags for some "magic"), whereas tf_cnn_benchmarks was 100% focused on testing any kind of hardware we could find and avoided many of the high-level APIs. Is this the tfprof tool? Is there a GitHub location where this activity is occurring? Thanks for any additional information.
@edelsohn Thank you for your help. However, I'm using synthetic ImageNet data for testing right now, so I believe there is no image ingestion or preprocessing going on. I will build my own OpenBLAS 0.3.7 and test with a real dataset to see if it helps.
Hi,
I'm running the TensorFlow benchmark on an IBM machine (POWER9 processor + V100 GPUs). I know it is not the optimal way to go, but I'm just trying out the performance of the POWER9 without using the GPUs. It turns out the performance is VERY low (~0.5 to 4 images/sec) regardless of my tuning of the thread count (from 16 to 160). I'm not sure if anyone has been playing with a similar setup, but I cannot seem to find any reported performance numbers. I doubt these numbers, because POWER9 seems to have a very high CPU frequency, despite the lack of MKL.
So can anyone give me any suggestions? I'm attaching the script here:
```shell
python ~/benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py \
  --data_format=NHWC --batch_size=128 --num_batches=50 --model=resnet50 \
  --optimizer=sgd --variable_update=replicated --use_fp16=False \
  --nodistortions --gradient_repacking=2 --datasets_use_prefetch=True \
  --loss_type_to_report=base_loss --compute_lr_on_cpu=True \
  --single_l2_loss_op=True --device=cpu --local_parameter_device=cpu \
  --display_every=10 --num_intra_threads=128 --num_inter_threads=1
```
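Since no single thread setting has worked so far, a systematic sweep may be more informative than individual runs. This is a hypothetical sketch (my own, not from the thread) that simply builds one benchmark command line per intra-op thread count, mirroring the key flags of the command above:

```python
def sweep_commands(script="tf_cnn_benchmarks.py",
                   thread_counts=(16, 32, 64, 128, 160)):
    """Build one tf_cnn_benchmarks command line per intra-op thread count."""
    base = [
        "python", script,
        "--device=cpu", "--local_parameter_device=cpu",
        "--model=resnet50", "--batch_size=128", "--num_batches=50",
        "--data_format=NHWC", "--num_inter_threads=1",
    ]
    return [base + [f"--num_intra_threads={t}"] for t in thread_counts]

for cmd in sweep_commands():
    print(" ".join(cmd))
```

Each printed line can be run (or fed to `subprocess.run`) and the reported images/sec compared across thread counts, ideally repeating the sweep once per SMT mode.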