A very cool project that provides a BitTorrent-esque distributed inference and training platform.
Now that I have capable hardware, I've been slowly working towards fine-tuning models. These are good resources on fine-tuning with frozen weights while adding in small trainable layers. This cuts down on memory and processing needs.
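A minimal PyTorch sketch of that pattern, with a stand-in backbone in place of a real pretrained model:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained backbone; in practice this would be a loaded model.
backbone = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 32))

# Freeze the pretrained weights so they need no gradients or optimizer state.
for param in backbone.parameters():
    param.requires_grad = False

# Add a small trainable head for the new task; only it gets updated.
head = nn.Linear(32, 4)  # 4 = example number of target classes
model = nn.Sequential(backbone, head)

# Hand the optimizer only the parameters that still require gradients.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-3)
```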
Talk is cheap though so I have my work cut out for me!
8 GB of RAM and no tensor cores made for slow progress. Microcenter has a nice deal on refurbished RTX 3090 Ti cards for $799. I'm impressed with how much faster inference is now!
Mr. Aurimas Griciūnas has captured in graphics what my basic architectural goals are. Granted, mine will be simplified in areas!
Here is his website: (a very good resource)
This is an amazing article about how one could update or add domain-specific user data to existing models.
This outfit provides a single interface for many AI providers. Very neat work!
Argh! I only have 8 GB of VRAM on my GTX 1080. I'm having fun with local LLMs but keep exceeding the available memory. Perhaps this is in part due to the ONNX Runtime I'm using. PyTorch-based models are large as well.
Shrinking the weights (quantization) is an option; see the sketch after the links below.
I really need to finish up my model accuracy measurement features. If weights are changed, how well does the model still work?
Nice tools below.
A good article about model shards
A good overview of model quantization
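On the ONNX side, post-training dynamic quantization is nearly a one-liner; a rough sketch with placeholder file names:

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Convert fp32 weights to int8, shrinking the model file and its memory footprint.
# "model.onnx" and "model.int8.onnx" are placeholder paths.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model.int8.onnx",
    weight_type=QuantType.QInt8,
)
```

Re-running my accuracy measurements against the quantized copy would show how much the smaller weights actually cost.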
In relation to the vector DB mentioned below, embeddings are an amazing way to derive meanings and similarities not just from the simple (subject-predicate-object) mappings used in ontologies, but from more complex things like entire paragraphs.
Whole sections of text can be projected into a higher-dimensional space, where SME-derived, domain-specific base embeddings can be compared for similarity to deduce meaning and content.
That all goes back to the n-dimensional vector databases out there.
This paper has cool stuff on translating well established OWL ontologies to vectors for use with embeddings.
I need to get smarter on these things. Since tensor inputs/outputs are encoded as vectors, vector DBs would be (and are used as) a great fit for finding similar data. That whole dot-product similarity thing and all.
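A minimal NumPy example of that dot-product similarity; the vectors here are stand-ins for whatever an encoder or vector DB would hand back:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Dot product normalized by vector lengths: 1.0 means pointing the same way."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in embeddings; a real pipeline would get these from an embedding model.
paragraph_a = np.array([0.12, 0.87, 0.33, 0.05])
paragraph_b = np.array([0.10, 0.91, 0.30, 0.07])
unrelated   = np.array([0.95, 0.02, 0.01, 0.80])

print(cosine_similarity(paragraph_a, paragraph_b))  # close to 1.0 -> similar meaning
print(cosine_similarity(paragraph_a, unrelated))    # lower -> less related
```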
This weekend I was interested in how I could integrate existing AI/ML Python code into my C# code. Python.NET allows just this. Another arrow in the quiver, but at this point not much use to me. It is still a neat project with other use cases.
Pros:
Run existing Python code directly within C# code.
Cons:
Not multi-threaded in this case; the Python interpreter is limited by the GIL.
Slow. Every time this code executes, a Python interpreter must be created and then destroyed.
Hard to proxy data between the languages.
I was able to use the standard ONNX Runtime C# libraries to query input and output tensor metadata.
With the names and dimensions I can more easily infer input/output requirements and chain models together.
I also learned that a dimension of "-1" means "any size".
While I'm sure PyTorch / TorchScript have analogous features, I found that ONNX models are fairly easy to poke and prod to get their expected input/output data dimensions.
A little C# hacking had me outputting what's needed.
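My hacking was in C#, but the same kind of reflection is available from the onnxruntime Python API; "model.onnx" is a placeholder path, and note that the Python side reports dynamic axes as None or a symbolic name rather than -1:

```python
import onnxruntime as ort

# Placeholder path to any ONNX model on disk.
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

for node in session.get_inputs():
    # shape entries can be ints, None, or symbolic names like "batch_size"
    print("input: ", node.name, node.type, node.shape)

for node in session.get_outputs():
    print("output:", node.name, node.type, node.shape)
```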
I now appreciate how much model runtimes are like compilers. ONNX, for example, is an intermediate representation (IR) for models. Anyone can implement their own runtime to consume these models. Each runtime can target its own hardware features and perform its own model optimizations.
ONNX Runtime is C/C++-based, I believe, with tons of language bindings and runtimes spanning large servers down to phones for lighter-weight inferencing.
PyTorch / TorchScript have various libraries to run on many platforms.
OpenVINO is Intel-only, for leveraging their FPGAs and Xeon hardware.
How can I offload segments of unused models from the graphics card to system memory?
Can I wrap existing processes so as to not need to re-code?
CUDA 6+ has long had unified memory support, but how can PyTorch and others easily leverage this for large training efforts and inference processing?
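I don't have a clean answer yet, but at the PyTorch level the blunt approach is simply moving idle modules between devices rather than true unified memory; a sketch:

```python
import torch
import torch.nn as nn

# Stand-in for a large sub-model that is not needed for the current step.
big_block = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

if torch.cuda.is_available():
    big_block.to("cuda")       # resident in GPU memory while in use
    # ... run whatever needs it ...
    big_block.to("cpu")        # park the weights in system RAM
    torch.cuda.empty_cache()   # return cached GPU blocks to the driver
    big_block.to("cuda")       # bring it back when it's needed again
```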
Neat notes on profiling CUDA memory usage.
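The built-in counters in torch.cuda already cover a lot of the basics:

```python
import torch

if torch.cuda.is_available():
    x = torch.randn(1024, 1024, device="cuda")  # allocate something to measure
    print(torch.cuda.memory_allocated() / 1e6, "MB currently allocated")
    print(torch.cuda.max_memory_allocated() / 1e6, "MB peak since start")
    print(torch.cuda.memory_summary())           # detailed allocator breakdown
```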
How best to define programmatically the expected inputs and outputs of models?
ONNX from Azure exists but seems best for just the Microsoft ecosystem.
I was way wrong here. ONNX is used by tons of platforms!
This is the way forward for me.
https://learn.microsoft.com/en-us/windows/ai/windows-ml/tutorials/pytorch-convert-model
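The heart of that tutorial is torch.onnx.export; a minimal sketch with a tiny stand-in model and placeholder names:

```python
import torch
import torch.nn as nn

# Tiny stand-in model; in practice this is the trained network.
model = nn.Sequential(nn.Linear(16, 8), nn.ReLU(), nn.Linear(8, 2))
model.eval()

dummy_input = torch.randn(1, 16)  # example input with the expected feature shape

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",                  # placeholder output path
    input_names=["input"],         # these names show up in the ONNX metadata
    output_names=["output"],
    dynamic_axes={
        "input": {0: "batch"},     # axis 0 may be any size (the -1 noted above)
        "output": {0: "batch"},
    },
)
```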
Can I infer (reflectively) a model's needs within a consuming program?
Netron does this for ONNX https://github.com/lutzroeder/netron
PyTorch references "Flask" for serving a model behind a REST API that does similar things. That would mean keeping a live HTTP service running (even if only locally).
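For reference, serving a model behind a small REST API looks roughly like this (a hedged sketch using ONNX Runtime rather than the raw PyTorch model the tutorial uses; the route and model path are placeholders):

```python
import numpy as np
import onnxruntime as ort
from flask import Flask, jsonify, request

app = Flask(__name__)
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])

@app.route("/predict", methods=["POST"])
def predict():
    # Expects a JSON body like {"input": [[...floats...]]}
    data = np.array(request.json["input"], dtype=np.float32)
    outputs = session.run(None, {session.get_inputs()[0].name: data})
    return jsonify({"output": outputs[0].tolist()})

if __name__ == "__main__":
    app.run(host="127.0.0.1", port=5000)  # local-only service; no wider network needed
```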
Should I roll my own that best fits in my platform?
No.
As it stands now, my platform can run many training and inference processes in parallel on a single GPU.
I've been thinking about Quartz as a way to schedule processes one by one, but jobs scheduled that way are uninterruptible.
Can I stop executing processes, save full state, layer in higher priority work, then continue later? A preemptive scheduler of sorts?
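Even without a true preemptive scheduler, a cooperative version could checkpoint at step boundaries and yield the GPU; a rough sketch with placeholder names:

```python
import torch

CHECKPOINT = "parked_job.pt"  # placeholder path for the paused job's state

def park_job(model, optimizer, step):
    """Save enough state that higher-priority work can take over the GPU."""
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, CHECKPOINT)

def resume_job(model, optimizer):
    """Reload the saved state and report where training left off."""
    state = torch.load(CHECKPOINT)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]
```

The training loop would check a "yield" flag between batches, call park_job, and exit; the scheduler relaunches the job later and calls resume_job to pick up at the saved step.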