A note that this setup runs a 671B model in Q4 quantization at 3-4 TPS; running it at Q8 would need something beefier. To run a 671B model in the original Q8 at 6-8 TPS you'd need a dual-socket EPYC server motherboard with 768GB of RAM.
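For a rough sense of why, the weights alone scale with bits per parameter. Here's a back-of-envelope sketch (ignoring KV cache, activations, and runtime overhead, so real requirements are higher; quantization formats also carry some per-block metadata this skips):

```python
# Rough weight memory for a 671B-parameter model at different
# quantization widths. Weights only; real usage is higher.

PARAMS = 671e9  # 671B parameters

def weights_gib(bits_per_param: float) -> float:
    """Approximate weight size in GiB for a given quantization width."""
    return PARAMS * bits_per_param / 8 / 2**30

print(f"Q4: ~{weights_gib(4):.0f} GiB")  # ~312 GiB of weights
print(f"Q8: ~{weights_gib(8):.0f} GiB")  # ~625 GiB -> hence the 768GB dual-socket box
```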
Oh yeah, I once tried a small 8B LLM locally too. I can't remember if it was DeepSeek's, but I think it was. It was writing at like one token every 2-3 seconds, and after like 5 minutes, seeing the first message still not done, I realized it was too much to ask of my poor GT 1030. I also heard about the cheap API; many people were delighted, since ChatGPT's cost much, much more than that. Let's hope DeepSeek's API becomes free soon! Even now, I'm assuming you can get days' worth of conversation with just 1€.
I estimated that to translate ProleWiki from English into 5 languages (the API charges per input token and per output token, i.e. what you feed it -> the English content, and what it outputs -> the translated content) it would cost us $50 maximum with the DeepSeek API. ChatGPT is so expensive I didn't even try; it was going to be in the hundreds of dollars lol. DeepSeek's output per 1M tokens is 50 cents in the off-hours (easy, just run your code during the off-hours automatically), and GPT's is $1.60 for their "mini" model, which is still over 3x as expensive.
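For a feel of that math, here's a quick sketch. The token counts are hypothetical placeholders (not ProleWiki's real size), the $0.50/1M off-hours output rate is the one quoted above, and the input rate is an assumption; check DeepSeek's pricing page before relying on any of it:

```python
# Quick cost model: the API bills input tokens (the English source you
# feed it) and output tokens (the translation it returns), each at a
# per-1M-token rate.

def api_cost(input_tokens: float, output_tokens: float,
             in_rate: float, out_rate: float) -> float:
    """Total cost in dollars; rates are per 1M tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1e6

source_tokens = 15e6  # hypothetical: ~15M tokens of English content
languages = 5
in_total = source_tokens * languages   # the source goes in once per language
out_total = source_tokens * languages  # each translation is roughly the same length

# $0.50/1M output is the off-hours rate quoted above;
# the $0.135/1M input rate is a guess, not a quoted figure.
print(f"DeepSeek off-hours: ~${api_cost(in_total, out_total, 0.135, 0.50):.0f}")
# -> roughly $48, in line with the 'maximum $50' estimate
```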
There are other Chinese models coming along; I think Xiaomi is making one. They're also innovating in image and video generation models, not just text models. One that came out shortly after DeepSeek is the one someone described as too cheap to meter (because it literally uses so few resources to run that it makes no sense to even keep track of usage!), but I haven't heard more about it since.