100 sats \ 1 reply \ @aljaz OP 8h \ parent \ on: dawn of the token based economy econ
have you looked into combining large action models with llms for orchestration?
there was a research paper explaining the methodology of using large-context frontier models for planning and smaller locally hosted models for execution. it was also used for working with sensitive data, since no data was shared with the cloud-based model, only processed by the local ones. but i can't find it now...
I've been looking into something like that: I envision a work queue where:
- I spin up AWS Inf2 instances on demand, packed with a large 100B+ instruct-tuned reasoning LLM or maybe a LAM (but I'd have to learn more about that first). These do decomposition, review, and maybe even prompt tuning?
- Local M4 box(es) then run smaller models like devstral or codellama for actual operations.
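A minimal sketch of that split, just to make the shape concrete: the function names (`plan_with_frontier_model`, `run_local_model`) and the task strings are hypothetical stand-ins, not real APIs — in practice the planner call would hit the remote instance and the executor would invoke a locally hosted model.

```python
import queue

def plan_with_frontier_model(task: str) -> list[str]:
    # Stand-in for the big cloud-hosted planner: it only ever sees the
    # high-level task description, never the sensitive payload.
    return [f"{task}: step {i}" for i in range(1, 4)]

def run_local_model(subtask: str, sensitive_data: str) -> str:
    # Stand-in for a small local model (devstral, codellama, etc.);
    # the sensitive data is only ever touched on this machine.
    return f"done: {subtask} (processed {len(sensitive_data)} bytes locally)"

def orchestrate(task: str, sensitive_data: str) -> list[str]:
    work: queue.Queue[str] = queue.Queue()
    for sub in plan_with_frontier_model(task):  # cloud: decomposition only
        work.put(sub)
    results = []
    while not work.empty():
        results.append(run_local_model(work.get(), sensitive_data))  # local: execution
    return results

print(orchestrate("refactor auth module", "secret-config"))
```

The point of the queue in the middle is that the planner and executors can scale independently — the Inf2 instance can be torn down once the plan is queued, while the local boxes drain the queue at their own pace.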