Hey, first off, I really appreciate what you are doing here with this Lemmy instance! This was the first one I joined during the Great Rexxit of 2023.

However, about a week later, I created another account on a larger server (.world) and have been using that as my main, because I could not find a lot of communities on this instance at all (!mlemapp@lemmy.ml was the first one where I hit this issue, but the majority I tried to subscribe to at the time were not showing up at all, while lemmy.world could see everything except BeeHaw). With the update to 0.18 this got a lot better, since I can now just keep retrying until it finds the community.

Of course, .world got super overloaded at the beginning of July with everybody finally leaving Reddit, so I subscribed to all the communities I had there over here on the Outpost, but over a week later, some communities are still not showing all of their posts. The most egregious example I have is !antiquememesroadshow@lemmy.world, which still isn’t showing any posts at all on this instance.

Now that things have been going better with .world, I’ve been directly comparing the two and I’m still missing a lot of posts here that I can see on .world (whether the community is on .world or elsewhere).

Is this a Lemmy glitch? Can anything be done about this? Thanks in advance!

Ah ha! I think I know (mostly) exactly what happened here. @danielton@outpost.zeuslink.net, could you try to find a community you’re subscribed to that is stuck on “Subscribe Pending” (this link will take you directly to your subscriptions), click it to unsubscribe, and then click it again to retry the subscription? If you refresh, it should switch to “Joined” within 30 seconds or so. If there are zero posts in that community, it should try to backfill about 10~20 posts - I don’t believe Lemmy backfills the comments, though (I might’ve been wrong about backfilling, or it just takes a bit of time to happen) - and that’ll confirm that everything is working properly.

If anyone wants the technical breakdown/details, it continues from this point onward.

When Lemmy 0.18.0’s release came out, one of the targeted fixes was for federation. I believe multiple fixes were made in Lemmy’s codebase regarding message re-queue times, but one of the other recommended changes was updating the nginx config (nginx being the web server that actually receives connections from the internet and proxies requests between the Lemmy docker containers and the general internet).

When I wrote my initial comment here, there was this line:

Additionally, there doesn’t seem to be a firewall/routing issue in terms of us connecting to both LW and lemmy.ml, as I can query the API for both instances directly from the VM that is running our instance.

Right before I submitted that comment, I figured it might be a good idea to test querying our own API (and from outside the Lemmy VM, just to be extra cautious) to see what happens, and the results were very telling.

I took a look at our Nginx config and compared it to the recommended config, and it looked correct, but running the above API query gave me a hint at where the problem was, which is line 12 in that paste:

                "~^(?:GET|HEAD):.*?application\/(?:activity|ld)\+json" "http://lemmy";

Specifically, this line tells Nginx where to send requests that match a regex pattern for an Accept: application/activity+json header on an HTTP GET or HEAD request. At the end of that line is the destination, in this case http://lemmy. I know that looks incorrect at first glance because http://lemmy isn’t a public address, but it isn’t supposed to be a public address; instead, it’s supposed to match an “upstream” configured in Nginx, and our config has the following defined:

        upstream lemmy-backend {
                server lemmy:8536;
        }
        upstream lemmy-frontend {
                server lemmy-ui:1234;
        }

In other words, this tells Nginx that anything in the config that references http://lemmy-backend should actually go to http://lemmy:8536, and likewise that requests to http://lemmy-frontend should go to http://lemmy-ui:1234. lemmy and lemmy-ui refer to the docker containers for the backend and frontend components of Lemmy, respectively.
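
As a rough sketch (not our exact config), an upstream name like that gets referenced elsewhere in the config as a proxy destination, along these lines:

        # illustrative only: send matched requests to the lemmy-backend
        # upstream defined above (i.e. the lemmy container on port 8536)
        location / {
                proxy_pass http://lemmy-backend;
        }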

When I was originally putting together our instance, I explicitly chose to make sure the upstream names didn’t match the container names, because in the past I’ve had issues with exactly that. Unfortunately, that is not how the recommended config is structured - it uses the container names as the upstream names - which means that when I updated our Nginx config (everything between line #4 and line #22 was new in the 0.18.0 update), I forgot to swap out that upstream name. And only for that one line, since line #6 and line #18 refer to the right upstream names (lemmy-frontend and lemmy-backend).
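
So the fixed version of that line just swaps in the upstream name we actually defined:

                "~^(?:GET|HEAD):.*?application\/(?:activity|ld)\+json" "http://lemmy-backend";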

Fixing that line and restarting the server now returns the expected response.

Now, my knowledge of Lemmy’s internals is something I’m still trying to correlate with how ActivityPub works, but I suspect what happened is this: any requests we sent out to remote instances/communities were indeed received by those instances, but when they tried to query us back to make sure our instance was also “speaking ActivityPub”/the same language, so to speak, they hit that broken endpoint, never confirmed the subscription, and so we never started receiving activity updates. For communities that were subscribed to before this update, our subscriptions were already “confirmed”, so we kept receiving updates for those.
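
To make that concrete, the kind of request a federating server (or an admin testing by hand) makes is just an HTTP GET with an ActivityPub Accept header. Here’s a small sketch of that check using Python’s standard library - the URL is a placeholder for one of our own community endpoints, not the exact query I ran:

        # Sketch: fetch one of our communities the way a federating server would,
        # i.e. a GET with an ActivityPub Accept header. With the broken map line,
        # this kind of request never made it to the Lemmy backend.
        import json
        import urllib.request

        url = "https://outpost.zeuslink.net/c/outpost"  # placeholder endpoint
        req = urllib.request.Request(
            url,
            headers={"Accept": "application/activity+json"},
            method="GET",
        )

        with urllib.request.urlopen(req, timeout=10) as resp:
            content_type = resp.headers.get("Content-Type", "")
            body = resp.read()

        # A healthy instance answers with ActivityPub JSON rather than lemmy-ui's HTML.
        if "json" in content_type:
            data = json.loads(body)
            print("OK:", data.get("type"), data.get("id"))
        else:
            print("Unexpected Content-Type:", content_type)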

That, of course, led to things appearing to work as long as you weren’t actively subscribing to new communities (and from my point of view, looking at the linked dashboard from my previous comment, I could see that federation was actively occurring), thus not raising any alarm bells on my side, so I couldn’t investigate until you notified me. I only had some suspicions that the larger instances were sometimes being problematic (both LW and lemmy.ml have had downtime, between LW getting DDoS’d yesterday and lemmy.ml having general growing pains as the flagship instance) and thought it was on their end.

I sincerely apologize for that. I try to be as cautious as I can when performing updates to make sure that nothing breaks, but when something breaks and isn’t super obvious to spot, it becomes difficult for me to fix what I don’t know about. Our userbase isn’t very large, which isn’t necessarily an issue to me (as long as at least one person is getting some use out of The Outpost, I’m happy), but it does mean it’s harder to detect problems at the rate the larger instances would, since they have a bigger audience reporting issues. Additionally, the “Known Communities” metric on our dashboard kept going up even though the 0.18.0 update happened about 20 days ago - I’m not sure how to explain that one…

Ideally I’d love to come up with a series of tests I can run after each update to make sure that nothing is broken. So far the Lemmy stats dashboard has done a good job of letting me see when there are potential issues (for example, if the “Comments by Local Users” metric shot up insanely high over, say, a 30-minute period, that would tell me that someone is possibly spamming somewhere, allowing me to find that user and kindly tell them to knock it off), but it definitely failed me on this occasion. At the very least, I’ll make sure to perform more federation tests after updates so this doesn’t break again, and I’ll look into setting up some sort of automated monitoring against that endpoint so that I get a notification the moment it stops working.
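
For the automated monitoring piece, my current thinking is something as simple as a small script run from cron that repeats the same ActivityPub-style request from the sketch above and alerts when the response stops looking like JSON. A rough sketch (the endpoint and the notification part are placeholders):

        # Rough sketch of a federation health check to run from cron.
        # It re-does the ActivityPub-style GET and exits non-zero (or fires
        # some placeholder notification) if nginx stops routing it to the
        # Lemmy backend.
        import sys
        import urllib.error
        import urllib.request

        CHECK_URL = "https://outpost.zeuslink.net/c/outpost"  # placeholder endpoint


        def endpoint_healthy(url: str) -> bool:
            req = urllib.request.Request(
                url, headers={"Accept": "application/activity+json"}
            )
            try:
                with urllib.request.urlopen(req, timeout=10) as resp:
                    content_type = resp.headers.get("Content-Type", "")
                    return resp.status == 200 and "json" in content_type
            except (urllib.error.URLError, TimeoutError):
                return False


        if __name__ == "__main__":
            if endpoint_healthy(CHECK_URL):
                print("Federation endpoint check OK")
            else:
                # Placeholder: swap in whatever notification we end up using
                # (email, Matrix webhook, etc.). Non-zero exit lets cron/systemd alert too.
                print("Federation endpoint check FAILED", file=sys.stderr)
                sys.exit(1)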

Hey, I’m sorry for the delay… Everything is showing as subscribed instead of pending now. Thank you!

It’s all good - I apologize about the issue occurring in the first place!

Now if only I could figure out why Lemmy seems to be “freaking out” every so often, with insanely high spikes in response times for a minute or two until it recovers… It doesn’t seem to be a resource starvation issue as far as I can tell, but that’s what I’ve been trying to tackle next!
