I thought the tuning procedures, such as RLHF, kind of mess up the probabilities, so you can’t really tell how confident the model is in its output (and I’m not sure how accurate these probabilities were in the first place)?
Also, it seems that past a certain point, the more context the models are given, the less accurate the output gets. A few times, I asked ChatGPT something, and it used its browsing functionality to look it up, and it was still wrong even though the sources were correct. But when I disabled “browsing” so it would just use its internal model, it was correct.
It doesn’t seem there are too many expert services tied to ChatGPT (I’m just using this as an example, because that’s the one I use). There’s obviously some kind of guardrail system for “safety,” there’s a search/browsing system (it shows you when it uses this), and there’s a Python interpreter. Of course, OpenAI is now very closed, so they may be hiding that it’s using expert services (beyond the “experts” in the MoE model they’re speculated to be using).
Oh for sure, it’s not perfect, and IMO this is where the current improvements and research are going. If you’re relying on an LLM to hit hundreds of endpoints with complex contracts, it’s going to either hallucinate what it needs to do, or it’s going to call several and go down the wrong path. I would imagine that most systems do this in a very closed way anyway, and will only show you what they want to show you. Logically speaking, for questions like “should I wear a coat today” they’ll need a service to check the weather in your location, and a service to get information about the user and their location.
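To make the coat example concrete, here’s a rough sketch of what I mean by a small, closed set of services. Everything in it is made up for illustration: the function names (`get_user_location`, `get_temperature_c`), the tiny `TOOLS` registry, the hard-coded data, and the 10 °C threshold are all assumptions, and I have no idea how ChatGPT or anyone else actually wires this up. The only point is that a narrow whitelist of tools leaves far less room to go down the wrong path than hundreds of endpoints would.

```python
from typing import Callable, Dict

# Hypothetical stand-ins for real services. A real system would call
# external APIs here; the data is hard-coded purely for illustration.
def get_user_location(user_id: str) -> str:
    return {"alice": "Oslo", "bob": "Cairo"}.get(user_id, "unknown")

def get_temperature_c(location: str) -> float:
    return {"Oslo": 2.0, "Cairo": 28.0}[location]

# A small, explicit registry: only these tools can ever be called.
TOOLS: Dict[str, Callable] = {
    "get_user_location": get_user_location,
    "get_temperature_c": get_temperature_c,
}

def call_tool(name: str, *args):
    # Reject anything outside the registry instead of guessing.
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](*args)

def should_wear_coat(user_id: str, threshold_c: float = 10.0) -> str:
    # Step 1: the user-info service tells us where the user is.
    location = call_tool("get_user_location", user_id)
    if location == "unknown":
        return "I don't know where you are, so I can't check the weather."
    # Step 2: the weather service gives the temperature for that location.
    temp = call_tool("get_temperature_c", location)
    if temp < threshold_c:
        return f"Yes, it's {temp:.0f} C in {location}."
    return f"No, it's {temp:.0f} C in {location}."
```

Calling `should_wear_coat("alice")` goes through both services in order, and an unrecognized tool name raises an error rather than being improvised, which is about the opposite of what happens when a model hallucinates an endpoint.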