Yep, I totally agree that context engineering is everything here, but the jump in model quality in just the last 4 months has been insane. They are just way better at this now. In the case of my DAW, I went even more fundamental and created a node-based visual UI, giving the agent the ability to program new modules using the Web Audio API or to choose from a selection of stock instruments and effects. Modules are editable after instantiation, and the system automatically generates a UI for each module based on its parameters, inputs, and outputs. The agent could spawn modules and wire them up, do sound design, that sort of thing.
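To give a feel for what I mean by auto-generated module UI, here's a minimal sketch. The names (`filterModule`, `buildUI`) are made up for illustration, not my actual API: the idea is just that each module declares its parameter metadata, and the host derives a control per parameter from that.

```javascript
// Hypothetical module descriptor: a Web Audio wrapper declares its
// I/O and parameters, and the host derives a UI control for each one.
const filterModule = {
  name: "LowpassFilter",
  inputs: ["in"],
  outputs: ["out"],
  params: {
    cutoff:    { min: 20,  max: 20000, default: 800, unit: "Hz" },
    resonance: { min: 0.1, max: 30,    default: 1,   unit: "Q"  },
  },
};

// Derive a UI description (one slider per parameter) from the metadata.
function buildUI(mod) {
  return Object.entries(mod.params).map(([name, p]) => ({
    kind: "slider",
    label: `${name} (${p.unit})`,
    min: p.min,
    max: p.max,
    value: p.default,
  }));
}

const controls = buildUI(filterModule);
// controls[0] is a slider labeled "cutoff (Hz)" starting at 800
```

Because the UI is derived from the same descriptor the agent writes, any module it programs shows up with editable controls for free.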
I also recently tried Gemini 3.1 Pro on audio, and you should give it a spin if you haven't yet. It's the first model I've seen that can really talk about music in terms of frequency and time with great accuracy. It can break down songs by instrumentation, composition, sound design, arrangement, etc.
Its philosophical take on the music itself isn't always great, but it is precise, and at a high level you can see where things are headed. Some of its advice was definitely valid and actionable. I want to plug it into my DAW or the Ableton MCP and see what happens. It might actually be able to do real sound design. What I want is not just to ask for a melody, but to be able to say things like "let's throw a Reese bass in there" or "sidechain everything under the kick" and have the model know what I'm talking about. So not just music theory, but sound design as well.
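For concreteness, here's roughly what I'd want the model to translate those two requests into, sketched as plain data the host could turn into Web Audio nodes. All names here (`reeseVoices`, `sidechainGainPoints`) are hypothetical, and this is the textbook version of each technique, not a claim about any particular implementation:

```javascript
// A classic Reese bass is (roughly) two sawtooth oscillators detuned
// against each other; the slow beating between them gives the growl.
function reeseVoices(baseHz, cents = 15) {
  const ratio = Math.pow(2, cents / 1200); // cents -> frequency ratio
  return [
    { type: "sawtooth", frequency: baseHz * ratio },
    { type: "sawtooth", frequency: baseHz / ratio },
  ];
}

// Web Audio's DynamicsCompressorNode has no external sidechain input,
// so a common workaround is gain automation: dip a GainNode at each
// kick hit and ramp back up (the points you'd feed to
// gainNode.gain.linearRampToValueAtTime).
function sidechainGainPoints(kickTimes, { duck = 0.2, release = 0.25 } = {}) {
  const points = [];
  for (const t of kickTimes) {
    points.push({ time: t, gain: duck });          // duck on the hit
    points.push({ time: t + release, gain: 1.0 }); // recover before the next
  }
  return points;
}

const voices = reeseVoices(55);                    // A1, typical bass register
const duckEnv = sidechainGainPoints([0, 0.5, 1.0]); // kicks on the half-beat
```

The point is the vocabulary: "Reese bass" and "sidechain under the kick" each compile down to a small, well-defined recipe, and a model that knows the terms can emit the recipe.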
I'd love to chat about this more somewhere and cross-pollinate ideas if you're up for it, email's in my bio.