I was burned by a former admin who, instead of diagnosing why a mail service kept failing, hid a workaround in /etc/cron.d: an entry named “…” (three dots) which, unless you were careful, you’d never notice in a casual "ls" listing. The cron job ran a similarly named script every 5 minutes. It would launch the mail service, but duplicate instances weren’t allowed on the same box, so if the service was already running, nothing happened; this later explained the hundreds of “[program] service is already running” errors in our logs. It was every 5 minutes because our SolarWinds check would only alert if the service had been down for 5 minutes. The crash itself was eventually fixed in a patch, but nobody knew about this little “helper” script for years.
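For anyone who hasn’t run into this trick, here’s a minimal sketch of what such a cron fragment might look like. The real entry was named “…” and the script it called is long gone, so the path and name below are purely illustrative:

```
# Illustrative only: the actual /etc/cron.d file was named "..." and the
# script it ran is unknown; this path is a placeholder.
# Every 5 minutes, as root, try to (re)start the mail service. If it is
# already up, the start attempt just logs "service is already running".
*/5 * * * * root /usr/local/sbin/start-mail.sh >/dev/null 2>&1
```

A filename starting with a dot is also hidden from a plain "ls"; you only see it with "ls -a", where three dots sit right next to the usual "." and ".." entries and are easy to skim past.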
Until one day, we had a failover from the primary to the backup. Normally we ran two mail servers behind a load balancer, which served only the IP that was reporting as up. In the past we had manually disabled the other server's network port during a failover, but this time that step was forgotten, so BOTH IPs were listening. We shut down the primary's mail service, but after 5 minutes the hidden cron job brought it back up. The mail software synced all mail from one server to the other in one direction only (primary to backup, or the reverse, but never both at once). With both servers up, the load balancer just sent traffic to a random one.
So now both IPs were receiving and sending mail, and both were serving the web interface. With mail landing on both servers, it created mass confusion, and the mailbox sync was copying from backup to primary. Mail would appear and disappear seemingly at random; when it disappeared, it was the backup overwriting the primary during a sync. The sync was slow, and the first to notice, over the next several days, were the handful of IMAP customers. Those customers were always complaining because they had old and cranky systems, and our weekend customer service just told them to wait until Monday. But more and more POP3 customers started to notice, and after 5 days we finally figured out what had happened. We only ran NetBackup once a week, so thousands of legitimate emails were lost for good across more than 3,000 customers. A lot of them were lawyers.
Oof.
Did those lawyers not also have a separate archiving service like they should have?