Why are the forums so often offline?

This now happens several times a day, every day.

It is horrible when you are writing something and everything is lost because the forum goes offline.

Makes me really anxious as I follow some people and I like to read the updates from them.

  • Neither of those, regrettably. We are continuing to work on the issue from two directions. However, at present the interim steps that I mentioned in my previous message are still in place. It's good to know that things have improved from your perspective, but that won't prevent us from looking for a more complete solution.

    Thursday amounts to a maintenance release for the WWW site - a series of small changes affecting presentation and security that you probably won't notice, even though they matter. Some of them are tidying up after the recent switch to https. This Community site is affected because the WWW site provides its single sign-on, so this site will be here, but read-only for an hour or two.

  • Seems much better for the last week. Is it fixed?  Or is the coming planned maintenance on Thursday to fix it?

  • Who knows? Neither of us have access to logs or source (did you look at the backtrace in the report?), although some vague coincidences (my (a) above) suggested an ordinary user like me could trigger it when posting to a thread, abusive or not.

    I'm not sure why CanUserReviewAppeals() calls GetThreadById()... or vice versa. Maybe the fix is in CanManageAbuse() - just want to know if user has a general privilege, not for the page being viewed.

    Of course it's not in a spambot or crawler's interest to bring a site down.
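
    To make the speculation concrete, here is a minimal sketch of how two checks that call each other can recurse without limit, and how a general privilege check would break the cycle. Only the function names are borrowed from the backtrace discussion. The bodies, arguments and data are hypothetical, since none of us has seen the real source.

```python
# Hypothetical data: thread id -> whether the thread has a pending appeal.
THREADS = {1: {"has_appeal": True}}
# Hypothetical data: users holding a site-wide moderation privilege.
MODERATORS = {"mod"}


def get_thread_by_id(user, thread_id):
    """Buggy version: loading a thread asks about appeal rights..."""
    thread = dict(THREADS[thread_id])
    thread["can_review"] = can_user_review_appeals(user, thread_id)
    return thread


def can_user_review_appeals(user, thread_id):
    """...and checking appeal rights loads the thread again, without end."""
    thread = get_thread_by_id(user, thread_id)
    return thread["has_appeal"] and user in MODERATORS


def can_manage_abuse(user):
    """Assumed shape of a fix: a general privilege check with no page lookup."""
    return user in MODERATORS


def get_thread_by_id_fixed(user, thread_id):
    """Fixed version: the permission check no longer re-enters the thread load."""
    thread = dict(THREADS[thread_id])
    thread["can_review"] = thread["has_appeal"] and can_manage_abuse(user)
    return thread


def crashes(user, thread_id):
    """Return True if the buggy path recurses until Python gives up."""
    try:
        get_thread_by_id(user, thread_id)
    except RecursionError:
        return True
    return False
```

    Python raises a catchable RecursionError here. In a .NET application a stack overflow normally terminates the whole process instead, which would fit with the application pool going down for everyone rather than one request failing.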

  • Not unless there's some URL redirect mechanism that also involves the abuse system, and the crawler gets stuck on a redirect to a redirect to a redirect.

    How about a bot attempting to appeal against a moderation?

  • Not sure if these are for me or WebPM, but I'll take some guesses (maybe the people from Telligent will even see it):

    Q1: Telligent. Whether or not something is automatically seen as spam is determined by the software, but can be tuned by NAS. Similarly handling of abuse reports. See this link: https://community.telligent.com/community/9/w/user-documentation/51652/moderation-spam-and-abuse It's possible that some combination of settings NAS is using exposes the bug.

    Q2: Can only guess from function names, but no, I don't think it's a series of links. It's when posting. It may be something like whether a comment is moderated depends on whether the top-level post is moderated, which depends on whether a comment is posted. Or maybe when an abuse notification takes a thread over a threshold and then there is a post to it.

    Q3: Web crawlers can cause havoc, but they don't post, so I'd say not. Not unless there's some URL redirect mechanism that also involves the abuse system, and the crawler gets stuck on a redirect to a redirect to a redirect.

  • the software problem is the bug I link to as 'techie link' above, briefly referred to on the Telligent Community community [sic], unlimited recursion (speculating even further, from supplying wrong arguments to a hash function).

    Question1: who implemented the abuse / appeals process on this site? NAS or Telligent?

    Question2: Do the recursive stack frames map nicely to an erroneously recursive implementation when users click on a series of links?

    Question3: If a web crawler performs a depth-first traversal of question 2...?

  • Can I respond to the low-tech first?

    I was wondering if NAS (their servers/machines) are simply offline so much of late for similar reasons I am...? We are stuck in London,

    Nice idea, but the servers are probably in a big warehousey place, maybe in Liverpool, which should have some efficient air conditioning. I hope you find equally efficient cooling, or suitable downtime. :)

    High-tech:

    I set a crontab to log the results from name resolution once every 60 seconds

    Ha. I have a crontab to log a wget to see when it gives a 503 (the script on the forum checks more often, every 20 seconds, and gives an orange box if it doesn't give the expected result).  It looks like the problem with the DNS was resolved 5 months ago, and as WebPM says, this current problem is unrelated and software-related, involving the IIS application pool crashing.

    So far as I'm aware from that log it was OK from about 4pm on Tuesday (3 July 2018) until a brief downtime today Thursday 1444-1445. I suspect that (a) my posting to a Copybot thread may have precipitated that crash (sorry if so, but it didn't happen doing very similar things); (b) the hosts have set something to restart the server every 5 minutes to reduce the downtime and work around the problem; (c) the software problem is the bug I link to as 'techie link' above, briefly referred to on the Telligent Community community [sic], unlimited recursion (speculating even further, from supplying wrong arguments to a hash function).

    So it's better than it was, but probably still awaiting a software fix from the supplier. Thanks, WebPM.
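
    For anyone who wants to keep a similar log, this is a rough Python sketch of the cron-driven check described above, not the actual script. The function names and log format are my own invention, and it treats anything other than a 200 as down, which is an assumption.

```python
import urllib.error
import urllib.request
from datetime import datetime


def fetch_status(url, timeout=10):
    """Return the HTTP status code for url, or None if no HTTP reply arrived."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 503 while the application pool is down
    except (urllib.error.URLError, OSError):
        return None  # DNS failure, refused connection, timeout, ...


def log_line(url, now=None, fetch=fetch_status):
    """Build one log line: timestamp, status code, and UP or DOWN."""
    now = now or datetime.now()
    status = fetch(url)
    state = "UP" if status == 200 else "DOWN"
    return f"{now.isoformat(timespec='seconds')} {status} {state}"


# Intended use from cron, once a minute, appending the output to a file:
#     print(log_line("https://community.autism.org.uk/"))
```

    The fetch argument is only there so the formatting can be checked without touching the network.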

  • An update.

    The problem is complex. We are working on it with our suppliers, but I have to ask your patience while we find a resolution. In the meantime, we are taking several steps to limit service interruptions and therefore reduce the effect on you.

    I think Cassandro was right, in the other thread, to suggest that further questions and discussions (of this problem) happen here in this thread. That other thread did originally refer to a DNS problem, which was something of a one-off five months ago, and has no bearing on the current issue. This problem appears to be software-related. Apologies to anyone, especially , who was confused by the use of one thread for two different things that have happened at different times.

    I'll provide updates as they are available.

    Regards

  • Greetings anyone, from myself, after many days away, and likely many more to come...!

    Perhaps someone can dismiss my own wondering, which is Low-Tech, amidst all of this High-Tech Talk...? (Manual/Hardware, rather than Digital/Software.)

    I was wondering if NAS (their servers/machines) are simply offline so much of late for similar reasons I am...? We are stuck in London, and it gets too hot, and at over 80-85°Fahrenheit, we can do very little at all: we want to, but we cannot...?

    (There is a slight cool spell just now, and so I can write (my usual waffle)... yet until September, I do not know when I can again --- and cannot even follow all of what is going on here, sorry.)

  • The alternate thread you reference refers to DNS issues. What sort of issues?

    I had wondered if you might have been hinting at a sinkhole, possibly affecting several of your supplier's clients, perhaps because of sub-optimal behaviour by some of them. 

    To that end I set a crontab to log the results from name resolution once every 60 seconds. I have seen no changes to the resolved address regardless of the evident state of the forum. 

    Given your assertion of DNS issues, and the consistent 185.68.1.27, does this imply that it's reverse lookups at your end that are failing?
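
    A minimal sketch of the kind of check that crontab runs, written in Python for illustration. The function names and log format are assumptions, not the script actually used. It only tests forward resolution, so it cannot say anything about reverse lookups on the server side.

```python
import socket
from datetime import datetime


def resolve(hostname):
    """Return the sorted IPv4 addresses for hostname, or [] if lookup fails."""
    try:
        infos = socket.getaddrinfo(
            hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
        )
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


def dns_log_line(hostname, previous, now=None, resolver=resolve):
    """Return (log line, addresses); flag the line if the answer changed."""
    now = now or datetime.now()
    addresses = resolver(hostname)
    shown = ",".join(addresses) or "FAILED"
    line = f"{now.isoformat(timespec='seconds')} {hostname} {shown}"
    if previous is not None and addresses != previous:
        line += " CHANGED"
    return line, addresses


# Intended use from cron, once every 60 seconds, keeping the last answer
# somewhere between runs:
#     line, last = dns_log_line("community.autism.org.uk", last)
```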

  • It was down today until about 10am BST, then down again between 12:00 and 12:36pm, and between 1324 and 1411, and then briefly from 1439 to 1440. Maybe you have access to logs showing those times? It gives 503 when trying to load or reload any page on community.autism.org.uk, and if you still have a forum page open in the browser you get the orange pop-up box 'You've gone offline', which disappears about a minute after the server starts up again.

    When it's down, it's down for everyone. I had a weird idea though that it went down whenever I tried to comment on this thread. Probably that's just me being paranoid because I've seen the orange box there a few times for other reasons, but I wonder if posting on particular threads may cause the IIS application pool to crash (edit: I can neither reproduce the crash reliably nor rule out the idea that 503 errors for a few minutes are related to posting to such a thread; also down 1604-1605). Also it showed a reply I posted to that thread, but it then disappeared.

    (techie link - to a forum that hopefully doesn't crash :). Could be database corruption or even disk failure, but if that 'CanManageAbuse' shows up in the stack trace repeatedly as in that bug report, it's presumably a bug in the software, maybe triggered on particular moderation settings.  Maybe as a workaround, the host could also write some script to restart automatically on a 503 error.)

    Anyway, hope they get it sorted soonish.
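
    The restart-on-503 workaround could look something like this sketch. The restart action is deliberately left as a callable the host would supply, because I don't know what command their setup needs. The threshold of three consecutive failures is also just an assumption, chosen so a single blip doesn't trigger a restart.

```python
def watchdog_step(status, failures, restart, threshold=3):
    """Process one poll result and return the new consecutive-failure count."""
    if status == 200:
        return 0  # healthy reply, so the failure streak is over
    failures += 1
    if failures >= threshold:
        restart()  # hypothetical hook: whatever restarts the site
        return 0   # start counting afresh after a restart
    return failures


def run_watchdog(statuses, restart, threshold=3):
    """Feed a sequence of poll results through the watchdog.

    Returns how many restarts were triggered.
    """
    failures = 0
    restarts = 0

    def counted_restart():
        nonlocal restarts
        restarts += 1
        restart()

    for status in statuses:
        failures = watchdog_step(status, failures, counted_restart, threshold)
    return restarts
```

    Polling every 20 seconds with this threshold would mean a restart roughly a minute into an outage.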

  • We're really aware of this, and have been working with our suppliers on the technical issues affecting the forum. We don't have a full solution yet, but we have put a measure in place that should improve things. There is also discussion in this thread.

    I am sorry for the inconvenience and concern that this is causing.