wrong answers?
It isn't truly intelligent. It isn't aware of whether it's right or wrong. I've seen lots of AI experts point out that it isn't so much that LLMs sometimes hallucinate as that they're always hallucinating, and people just label the most obviously wrong outputs as hallucinations.
A few months ago it was thought that GPT-5.2 had solved Erdős problem 281, but it was later discovered that the problem had already been solved:
https://news.ycombinator.com/item?id=46664631
These things are scattershot. If you cherry-pick the best results, you get an appearance of intelligence far above what you see once you factor in all the other results.
See these comments on 5.2 from the OpenAI community:
https://community.openai.com/t/random-performance-drop-hallucinations-during-certain-periods-of-time/1370311
I've noticed that during certain periods of time, the model (GPT-5.2) has an extremely high hallucination rate. I spend most of that time wasting tokens and money trying to correct the model, usually having to give up. During other periods, the model performs almost flawlessly. These periods seem to be quite random, and now I have to depend on pure luck to get the performance I need and my work done! And this happens despite no problems being announced on the status page.
Is anyone else experiencing this issue?
This behaviour is common across most LLMs, but it became more obvious after GPT-4.
OpenAI models become highly inconsistent, with noticeable variation in output quality over time. Some days the model feels reliable; other days hallucinations spike for the same prompts.
I agree. I am working on an app in Android Studio and GPT ignores its own instructions. For example, it tells me to add a code block, then questions why I added the code block, so I remove it, only to be advised that the code is missing and causing the error I reported. I find I have to scroll back through the chat history and copy/paste the advice it gave for it to acknowledge the cycle. It does seem to get worse when the chat has been going on for some time.
Significant drop-off. Random in nature. Tends to stick, but can suddenly recover too.
I am using it with Codex, and it'll just suddenly be dumb. I can't trust it.
It just randomly sucks sometimes.