I’ve recently been wondering whether Lemmy should switch from NGINX to Caddy. I haven’t used Caddy myself, but it looks like a great, fast alternative. What do you all think?
EDIT: I meant beehaw not Lemmy as a whole
Why? What’s wrong with nginx?
While I can’t speak for others, I’ve found NGINX to have weird issues where sometimes it just dies and I have to manually restart the systemd service.
The configuration files are verbose, and maybe Caddy would have better performance? I haven’t investigated it much.
I’m running a lot of services off my nginx reverse proxy. This is my general setup for each subdomain - each in its own config file. I wouldn’t consider this verbose in any way - and it’s never crashed on me
service.conf
server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server_name [something].0x-ia.moe;
    include /etc/nginx/acl_local.conf;
    include /etc/nginx/default_settings.conf;
    include /etc/nginx/ssl_0x-ia.conf;
    location / {
        proxy_pass http://[host]:[port]/;
    }
}
- there are hidden configs
- this adds up quickly for more complex scenarios
- Yeah, fair enough, it is really a preference thing, and Caddy supports it
nginx was built for performance, so I doubt Caddy would make any significant difference in that regard. I’ve not found config verbosity to be a problem for me, but I guess to each their own. I’m aware I may come across as some gatekeeper - I assure you that is not my intention. It just feels like replacing a perfectly working, battle-tested service with another one just because it’s newer is a bit of a waste of resources. Besides - you can do it yourself on your instance. It’s just a load balancer in front of a docker image.
Isn’t caddy battle tested too? And looking into alternatives is not really a waste of resources. It just feels like nginx is not as reliable and likes to drop requests. It’s not just a load balancer, mind you.
http3 is available in nginx 1.25 if you want to run their current release.
The problems I see with Lemmy performance all point to SQL being poorly optimized. In particular, federation is doing database inserts of new content from other servers - and many servers can be incoming at the same time with their new postings, comments, votes. Priority is not given to interactive webapp/API users.
Using a SQL database as the backend of a website with unique data all over the place is very tricky. You really have to program the app to avoid touching the database, and create cached output, incoming queues and such wherever you can. Reddit (at least 9 years ago, when they open sourced it) is also based on PostgreSQL - and you will see they do not do live SQL inserts of comments like Lemmy does - they queue them using something other than the main database, then insert them in batches.
The email MTA apps I’ve seen do the same thing: they queue files to disk before putting them into the main database.
I don’t think nginx is the problem, the bottleneck is the backend of the backend, PostgreSQL doing all that I/O and record locking.
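To make the queue-then-batch idea concrete, here is a minimal sketch, not how Reddit or Lemmy actually do it. Everything in it is an assumption for illustration: the BatchWriter class, the comment table and its columns are made up, and sqlite3 stands in for PostgreSQL so the example runs on its own.

```python
import queue
import sqlite3


def make_db():
    # Hypothetical schema; sqlite3 is only a stand-in for the real database.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comment (id INTEGER PRIMARY KEY, author TEXT, body TEXT)"
    )
    return conn


class BatchWriter:
    """Collects incoming rows in memory and writes them in batches."""

    def __init__(self, conn, batch_size=100):
        self.conn = conn
        self.batch_size = batch_size
        self.pending = queue.Queue()

    def submit(self, author, body):
        # Cheap for the caller: no database work happens here.
        self.pending.put((author, body))

    def flush(self):
        # Drain up to batch_size rows and insert them in one transaction.
        rows = []
        while len(rows) < self.batch_size:
            try:
                rows.append(self.pending.get_nowait())
            except queue.Empty:
                break
        if rows:
            with self.conn:  # commits on success, rolls back on error
                self.conn.executemany(
                    "INSERT INTO comment (author, body) VALUES (?, ?)", rows
                )
        return len(rows)
```

In a real deployment flush() would be called by a background worker on a timer, and the queue would probably live somewhere durable rather than in process memory, since anything still queued is lost if the process dies.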
nginx 100% isn’t the problem, and you’re right on all counts. I’ll also add that I’ve seen reports that Lemmy has some pretty poorly optimized SQL queries.
They need to add support for a message broker system like RabbitMQ. That way their poor postgres instance stops being the bottleneck.
PostgreSQL is tricky to get right, and I can’t fault anyone for wanting a different solution like RabbitMQ to work around it. One thing I did back in the day, when dealing with high write traffic where the data itself was not mission critical, was to set up a tmpfs on Linux with a specified amount of RAM to serve as a cache. I would create a duplicate of the data table there, alongside the one stored on SSD/HDD, and then create a view that combined both, checking the cache first before querying the HDD/SSD.
Each insert/update statement would trigger a condition that incremented a variable (a semaphore). Once it reached a certain value, it would run a partitioned check on the cache table, scan for old data that wasn’t in active use based on timestamp, and write it to HDD/SSD; data that had been in the cache long enough was written to HDD/SSD as well. Doing it this way, I was able to increase throughput more than 100-fold and still have data retained in the database.
Obviously, doing this incurs some additional risk, since you are putting your data in volatile memory, although it’s less of a risk with ECC memory on servers. If the power goes out, whatever is stored in RAM is gone, so I assumed that in the cloud they would have backup power and other solutions in place to ensure that doesn’t happen. They might have a network outage, but it’s rare for servers to hard fail.
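For anyone trying to picture the flow, here is a rough sketch of the same write-behind idea. It is not the original setup: a plain Python dict stands in for the tmpfs-backed cache table, sqlite3 stands in for the on-disk PostgreSQL table, and the WriteBehindStore class, the item table and the flush_every / max_age parameters are all invented for the example.

```python
import sqlite3
import time


class WriteBehindStore:
    """Writes land in RAM first; old entries are moved to disk in bulk."""

    def __init__(self, cold_conn, flush_every=100, max_age=60.0,
                 clock=time.monotonic):
        self.cold = cold_conn  # stand-in for the table on SSD/HDD
        self.cold.execute(
            "CREATE TABLE IF NOT EXISTS item (key TEXT PRIMARY KEY, value TEXT)"
        )
        self.hot = {}  # key -> (value, last write time); lost on a crash
        self.flush_every = flush_every
        self.max_age = max_age
        self.clock = clock
        self.writes_since_flush = 0

    def put(self, key, value):
        self.hot[key] = (value, self.clock())
        # Counter plays the role of the "semaphore" variable.
        self.writes_since_flush += 1
        if self.writes_since_flush >= self.flush_every:
            self.flush_old()

    def get(self, key):
        # Check the RAM cache first, then fall back to the disk table.
        if key in self.hot:
            return self.hot[key][0]
        row = self.cold.execute(
            "SELECT value FROM item WHERE key = ?", (key,)
        ).fetchone()
        return row[0] if row else None

    def flush_old(self):
        # Move entries not written to for at least max_age seconds to disk.
        cutoff = self.clock() - self.max_age
        old = [(k, v) for k, (v, t) in self.hot.items() if t <= cutoff]
        if old:
            with self.cold:  # one transaction for the whole batch
                self.cold.executemany(
                    "INSERT OR REPLACE INTO item (key, value) VALUES (?, ?)",
                    old,
                )
            for k, _ in old:
                del self.hot[k]
        self.writes_since_flush = 0
        return len(old)
```

The trade-off is the same as described above: anything still sitting in the hot dict is gone if the process or machine goes down before the next flush.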
Hm, that’s an interesting take. To be quite honest, I saw issues with diesel-rs in production on another website I was contributing to; maybe that’s the issue?
I doubt it is anything at that level. The problem is the data itself, in the database.
A reddit-like website is like email, every load from the database has unique content. You really have to be very careful when designing for scalability when almost all the data is unique. Especially in modern times where users block other users, and even 2 people loading the same posting do not get the same comments. It’s anti-cache, and you have to really work hard to design that to run efficiently on small servers.
As opposed to a website like Amazon where the listing for a toothbrush is not unique on every page load. There aren’t new comments and new votes altering the toothbrush listing every time a user refreshes the page. And people aren’t switching brands of toothbrush every 24 hours like the front page of Reddit abandons old data and starts with fresh data.
Would a good solution be to just defer changes to data with something like Apache Kafka? Or changing to something that can be scaled, like CockroachDB or neondb? I also heard ScyllaDB could be a great alternative, mostly from reading the Discord technical blog.
You can use any reverse proxy you’d like, doesn’t have anything to do with lemmy
nginx is like, the gold standard. it’s performant as heck. the issues are likely a culmination of many small sub-optimal pieces.
One more thing I forgot to mention. The nginx 500 errors people are getting on multiple Lemmy sites could improve shortly with the release of 0.18 that stops using websockets. Right now Lemmy webapp is passing those through nginx for every web browser client.