This will be a bit different in my integration: I need to set up and manage long-running instances of this for chat context preservation. I'm also using Python for the tensor and tokenizer code.
I only have 8 GB of VRAM on my GTX 1080, so I will need to be careful!
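With that little headroom, one thing that helps once TensorFlow is installed (see below) is telling it not to grab the whole card up front. A minimal sketch of that, nothing more:

import tensorflow as tf

# Allocate GPU memory on demand instead of reserving all 8 GB at startup.
for gpu in tf.config.list_physical_devices("GPU"):
    tf.config.experimental.set_memory_growth(gpu, True)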
NVIDIA makes it easy with their containers to get great inference and training speeds. Doing it on the host OS can be a pain with the various kernels, Python releases, and dependency hell(s). It reminded me of the old Red Hat RPM days.
I finally got good speed after setting up the following:
The latest NVIDIA GPU driver, which supports up to CUDA 12.2.
I run Ubuntu 20.04 as my host OS, which ships Python 3.8 and GCC 9.4.0. TensorFlow publishes a compatibility matrix of what's required; I needed to work towards version 2.12.0.
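A quick throwaway check that the host side matches that row of the matrix (just my own poking around, nothing official):

import platform
import subprocess

# Ubuntu 20.04 should report roughly Python 3.8.x and GCC 9.4.0 here.
print("Python:", platform.python_version())
gcc = subprocess.run(["gcc", "--version"], capture_output=True, text=True)
print(gcc.stdout.splitlines()[0])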
A CUDA installation (nvcc), but no, not the latest... it had to be 11.8 for the above TensorFlow! But wait, the versions of torch and numpy are too new for TensorFlow! Gotta install slightly older versions.
pip install numpy==1.23.4
pip install torch==2.1.0
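To confirm the pins took and that torch can actually see the card, a quick sanity check looks something like this:

import numpy as np
import torch

print("numpy:", np.__version__)        # expecting 1.23.4
print("torch:", torch.__version__)     # expecting 2.1.0
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))   # should report the GTX 1080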
Now I could install TensorFlow. But wait again, there are no longer two separate packages (CPU/GPU).
pip install tensorflow[and-cuda]
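And to make sure the single package really picked up CUDA and the GPU (again, just my own sanity check, not from the docs):

import tensorflow as tf

print("TensorFlow:", tf.__version__)
print("GPUs:", tf.config.list_physical_devices("GPU"))   # the GTX 1080 should show up
build = tf.sysconfig.get_build_info()
print("CUDA:", build.get("cuda_version"), "cuDNN:", build.get("cudnn_version"))

# A tiny matmul forces a kernel launch on the GPU if one was found.
with tf.device("/GPU:0"):
    x = tf.random.normal((1024, 1024))
    print("matmul OK:", float(tf.reduce_sum(tf.matmul(x, x))))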
Once all the above works well, a nice-to-have is GPU-accelerated NVIDIA TensorRT for (I could be wrong) the matrix operations, to speed up processing of the inputs. I'm still working on this one. Again, what a pain.
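For reference, the sort of TF-TRT conversion I'm attempting looks like this; treat it as an untested sketch, and the SavedModel paths are just placeholders:

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Convert an existing SavedModel into a TensorRT-optimised one.
converter = trt.TrtGraphConverterV2(
    input_saved_model_dir="my_saved_model",        # placeholder path
    precision_mode=trt.TrtPrecisionMode.FP16,      # may fall back on Pascal cards
)
converter.convert()
converter.save("my_saved_model_trt")               # placeholder output path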
I could have set all of this up with Conda, but I like a challenge.
These were pretty easy to integrate with. I gave C# a shot for the tensor work.