Since OpenAI removed access to GPT-4.5, I am looking for something comparable from any other company.
Personally, I used it when 4o was not good enough. 4.5 was way better at research and doing more complex programming tasks.
What is comparably good in your experience?
Qwen coder for coding, specifically.
For more general stuff, GLM 4.5 is the new hotness these days. Look up Z AI chat. It’s not great at long context, though.
Jamba is a sleeper model outside of coding and benchmarks. It’s amazing at long context (like feeding it papers) and its world knowledge is great.
And there’s always Gemini Pro. It’s just so smart in general understanding of complex contexts, though it’s pretty “deep fried” and TBH I don’t use it for coding.
For local models, I love Nemotron 49B for STEM stuff, and it nicely squeezes into a 3090’s VRAM as an exl3. Haven’t settled on a coding model beyond Qwen3 32B, though some finetunes are interesting.
do you mind running stuff locally? if not, then you can try the new qwen models (the 2507 releases) - whichever is the largest that fits in your ram + vram. never go above q4km unless the model is tiny (4B or lower), in which case try q5km; higher-precision quants are not that useful, but the extra speed you get from running a smaller quant is.
for api stuff, maybe try hugging chat? if you make an account, you can use multiple models, with (arguably) better privacy. there are more generic inference providers, but I do not know enough about them to trust them.
I have a Radeon RX 6500 8GB and 32GB RAM
I can run llama.cpp/ollama locally, but all models I tried so far were super slow and much worse than what I got with the ChatGPT subscription.
I’m trying local models from time to time, so I will check out all the suggestions from this thread.
Thanks for the hint about hugging chat.
Yeah, Qwen 30B is not bad but TBH you are best off using free chat apps, unless privacy is an issue.
you have enough vram to run some moes. community opinion on it is not great, but maybe try gpt-oss 20b. afaik it has ~3.6B active params, and the rest can easily fit in your plentiful ram. if you use llama.cpp, there is a new --cpu-moe flag for mixture-of-experts models. i think you can get in the ballpark of 20 tps, and that is very fast imo.
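rough sketch of what launching that could look like, wrapped in python (the model filename here is hypothetical, and flag support varies by llama.cpp version, so check `llama-server --help` on your build first):

```python
# rough sketch, not a recipe: wraps a llama-server command line from python.
# the model path is hypothetical; verify --cpu-moe exists in your llama.cpp build.
import subprocess

subprocess.run([
    "./llama-server",
    "-m", "gpt-oss-20b-Q4_K_M.gguf",  # hypothetical quantized model file
    "--n-gpu-layers", "99",           # push as many layers as possible onto the 8GB gpu
    "--cpu-moe",                      # keep the moe expert weights in system ram
    "-c", "8192",                     # modest context to save vram
])
```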
You suggested a 20b model but OP said they have 8G VRAM (and 32G RAM). I thought a [very] rough approximation of VRAM size needed for a model was 1:1 with billion parameters (so 20G VRAM in this example). It seems I’m wrong. What’s a better approximation or heuristic to use?
the heuristic is correct, but there is more to it. these models are MoE (mixture of experts) - essentially tinier, more specialised models packaged together. at a given time, n out of m experts may be active (usually n is 2; i have heard there is not much gain from increasing n beyond 2). you can consider one expert to be in the ballpark of 2B params, and let's say there are 8-9 of them. then there are some extra "mandatory" layers which always have to run (you can assume they orchestrate things - at least that is how i have understood it). the benefit of these "sparse" models is that at any given time only n experts are being used, so compute is faster.

most big models are moes (the deepseek 670B+, or kimi 1000B+), with the size of individual experts never practically exceeding the 30s (B). the largest open dense model afaik was llama 405B, and below that some 100B+ cohere and mistral stuff, but that was 1+ year ago (decades in ai terms). industry found that moes were in general better bang for buck.

moes have the disadvantage that you have to hold all of the model in memory (not necessarily vram, ram is fine too, but ideally you should not be reading from disk - the n active experts out of m keep changing, so having to read an expert from storage each time makes it that much slower), but the actual compute only happens with n experts, so it is fast. as a general rule of thumb, a moe will always be stupider than a dense model of equal weight (one single large model), but much faster. training a large dense model gets very hard, so you train many small models and package them together (you do not really train them separately, but that is a separate discussion).
now back to vram requirements - if using a non-quantised model, each param takes 1 or 2 bytes of space in memory (fp8 or fp16). 1 billion bytes = 1 GB of memory (not GiB), so a 20B fp16 model would be ~40GB. here fp16 is a floating point number with 16 bits - a representation of a real number with finite precision. imagine π extending to infinite digits; you usually do not need all the digits (famously, nasa uses something like 15 or 16 digits after the decimal). in this case the 16 bits include the sign, exponent and mantissa (1 bit for the sign, 5 for the exponent and 10 for the mantissa, but read about this online, it is not that important for the current conversation).

usually you do not need all this precision while running the model, so we quantise these models (imagine rounding to lower precision). a common quant is q4 (4 bits as opposed to 16), but there is some stuff we do not round down that much - we keep it at something like 8 bits (that gives us the extra letters in q4"km"). quantisation is not a free operation: what you gain in reduced memory usage and slightly faster computation, you lose in smartness, but the trade-off is usually worth it. you can measure how much smarts you lose with something called "perplexity", comparing the quant against the unquantised model - the lower the perplexity, the better the quant. usually you do not want to go below q4, but for huge models (670/1000B+) you have to use something like q1 or q2, because you have got to do what you have got to do to at least run it.

so you can figure that q4km roughly requires 0.6 bytes per param, which in this case is something like 12-13 GB. you have to hold all of this in memory.
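if it helps, here is the same arithmetic as a tiny python sketch (the bytes-per-param numbers are rough assumptions, not exact figures for any particular gguf file):

```python
# back-of-the-envelope sizing, weights only (ignores context / kv cache).
# bytes-per-param values are rough assumptions, not exact gguf figures.
BYTES_PER_PARAM = {
    "fp16": 2.0,     # 16 bits per weight
    "q8_0": 1.0,     # ~8 bits per weight
    "q4_k_m": 0.6,   # ~4.8 bits per weight on average (some tensors kept bigger)
}

def model_size_gb(params_billion: float, quant: str) -> float:
    """Rough weight-only size estimate in GB."""
    return params_billion * BYTES_PER_PARAM[quant]

print(model_size_gb(20, "fp16"))    # ~40 GB unquantised
print(model_size_gb(20, "q4_k_m"))  # ~12 GB, the 0.6x rule of thumb above
```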
Ideally, both your context and the model weights should be held in vram. the size of the context cache varies by model, but i usually do not think much about context (32k context requires close to 10GB of vram iirc; my stuff usually never goes beyond 4-6k).
many people cannot afford much vram, so they run cpu + ram combos.
in this case, i have heard that with 8GB vram and 64GB system memory, someone ran gpt-oss 120b at roughly 10-11 tps, which is quite fast imo, because the active params are only ~6B.
but the user here has 40GB total, so the best they can do is a ~32B dense model. qwen3 32B is good, but I think it would not be fast on that hardware, so i recommended something much smaller.
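and as a toy illustration of that fit check (illustrative numbers only: 8GB vram + 32GB ram, q4km at ~0.6 bytes per param, and a few GB of headroom assumed for the OS, context and overhead):

```python
# toy fit check for the numbers in this thread (illustrative only).
VRAM_GB, RAM_GB, HEADROOM_GB = 8, 32, 6
budget_gb = VRAM_GB + RAM_GB - HEADROOM_GB   # ~34 GB usable for weights

models = [
    ("qwen3 32b (dense)", 32),
    ("gpt-oss 20b (moe)", 20),
    ("gpt-oss 120b (moe)", 120),
]
for name, params_b in models:
    size_gb = params_b * 0.6  # rough q4km size
    verdict = "fits (speed depends on how much lands in vram)" if size_gb <= budget_gb else "too big"
    print(f"{name}: ~{size_gb:.0f} GB -> {verdict}")
```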
Wow that’s a lot to process! Thanks for the explanations, #sga@lemmings.world !
thank you, matey
https://nano-gpt.com/ is nice, but you have to trust them to stand by what they say
I generally use the big Qwen 3 models, DeepSeek chat, and GPT OSS 120b (inferior in many tasks, but pretty good for language)
Why don’t you use GPT 5?
> Why don’t you use GPT 5?
I have a suite of tests that I run to compare AI quality on “problems” I use it for that actually came up in real life - not synthetic benchmarks that might mean nothing for my use case. I also use them to evaluate open-source models that I can run. The gist is: 5 performs worse than 4.5, even today after the update where they supposedly made it more intelligent.
A big one that I noticed: I ask it for a challenging coding task, then ask it to extend the result with additional functionality. It removed functions that I specifically requested in the first step, and I never implied those were no longer needed. When asking it to put the old functions back in, it removed the newly added ones. I tried it multiple times in new chats, and also logged out in a private tab with a VPN.
I don’t like the company’s recent moves in general; it feels like enshittification and the opposite of what OpenAI originally stood for, so I stopped paying and would like to delete my account soon.
This seems really good! I like the pricing model. Will try it out, thanks a lot for sharing!
Damn, I was afraid of this about GPT 5. My personal favorite was o4-mini-high, which had nice thinking, could easily research stuff, and was pretty versatile. It sucks.
They released a FOSS model that isn’t too bad, but not great either, so that’s that. I didn’t expect them to make that move. I believe there are far worse companies than OpenAI, but they’re really not perfect, I agree. They lack transparency and it’s painfully obvious.
> This seems really good! I like the pricing model. Will try it out, thanks a lot for sharing!
You’re welcome! Paying by prompt and having access to many models is really nice. I only used OpenAI’s stuff because I couldn’t be bothered to make accounts on other websites with their weird interfaces… so having a place where I can pay anonymously, that claims not to store conversations, and has a lot of models where you pay by prompt - it’s great.
I do hope they enforce their privacy stance, though. They claim they ask the companies not to use the data you input, but you’re relying on whichever company provides the model. Good thing they have TEE models, but they’re still in the middle, and you can’t be certain of them (even though they say it’s provable - is it really?)
It seems generally well regarded in the cryptocurrency privacy ecosystem, so there might not be anything better currently. And they embrace the NANO cryptocurrency, which I just discovered. Bad for privacy, but a fabulous crypto (no tx fees, eco-friendly)
> I have a suite of tests that I run to compare AI quality on “problems” I use it for that actually came up in real life - not synthetic benchmarks that might mean nothing for my use case.
Yeah, the best way to know is to input things you would really try
Recently discovered https://lmarena.ai/ - sometimes I try a prompt that isn’t personal or sensitive and check how the different models do. It lets me find the best (or least bad) one
I feel like I’m spamming you, sorry haha
If you use nano-gpt, you can make two accounts, and use #1 to create a referral code and send it to #2, and #2 creates one for #1. This way you get a permanent 5% discount on usage (when using the website), plus you earn back 10% of what you spend on the other account. Not sure if that’s authorized, but it works
If you have an account feel free to send me a referral code by pm
I prefer to preserve my anonymity, but that’s very kind of you ❤️
They don’t do IP detection; your account is saved in a cookie or tied to an email (I recommend you set an email so you don’t lose your account if you clear your cookies)
Making two accounts is as easy as opening a private tab or a second browser. Just put in two different emails and there you go - you’ll get 15% savings total. When your main account is empty, you can switch to the second one to use the 10% you earned back