Fix receiving email into discourse server

TL;DR: emails from the mailing lists have not been mirrored to discourse since Oct. '21 (:scream:), but this is fixed now :cake:. I suspect that creating and responding to topics via email was also broken.

To see my debugging in real time, see matplotlib/matplotlib - Gitter.

Thank you to @noatamir for doing the due diligence to verify whether a post to the mailing list got mirrored, which identified this as a problem!


details

To mirror the mailing lists into discourse, we subscribe an @discourse.matplotlib.org email address to the mailing list. These emails are routed to mail-receiver, which forwards them to the discourse server, which in turn ingests and posts them. Details of how this is set up: Direct-delivery incoming email for self-hosted sites - sysadmin - Discourse Meta

The first problem we encountered is that in late 2021 Let’s Encrypt (whom we use for the https SSL certificates for our discourse deployment) changed its root certificate. There was a known issue where the mail-receiver image was missing the new certificate (Self-hosted mail-receiver update following Let's Encrypt root certificate change - announcements - Discourse Meta). Although the rest of the containers are updated regularly (there are prompts through the website UI), it seems that mail-receiver is not, so at the end of Sept 2021 we stopped forwarding emails. The errors from this look like:

<22>Apr 21 14:12:07 postfix/pipe[9431]: F3ECE180D06: to=<matplotlib-devel@discourse.matplotlib.org>, relay=discourse, delay=75063, delays=75063/0.01/0/0.14, dsn=4.3.0, status=deferred (temporary failure)
<22>Apr 21 14:17:06 postfix/qmgr[80]: 47F5C180C94: from=<tcaswell@gmail.com>, size=3101, nrcpt=1 (queue active)
<23>Apr 21 14:17:06 receive-mail[9438]: Recipient: matplotlib-devel@discourse.matplotlib.org
<19>Apr 21 14:17:06 receive-mail[9438]: Failed to POST the e-mail to https://discourse.matplotlib.org/admin/email/handle_mail: SSL_connect returned=1 errno=0 state=SSLv3 read server certificate B: certificate verify failed (OpenSSL::SSL::SSLError)
<19>Apr 21 14:17:06 receive-mail[9438]:   /usr/local/lib/ruby/2.3.0/net/protocol.rb:44:in `connect_nonblock'
  /usr/local/lib/ruby/2.3.0/net/protocol.rb:44:in `ssl_socket_connect'
  /usr/local/lib/ruby/2.3.0/net/http.rb:928:in `connect'
  /usr/local/lib/ruby/2.3.0/net/http.rb:863:in `do_start'
  /usr/local/lib/ruby/2.3.0/net/http.rb:852:in `start'
  /usr/local/lib/ruby/2.3.0/net/http.rb:1384:in `request'
  /usr/local/lib/ruby/site_ruby/mail_receiver/discourse_mail_receiver.rb:42:in `process'
  /usr/local/bin/receive-mail:12:in `<main>'
<22>Apr 21 14:17:06 postfix/pipe[9437]: 47F5C180C94: to=<matplotlib-devel@discourse.matplotlib.org>, relay=discourse, delay=1100, delays=1100/0.01/0/0.13, dsn=4.3.0, status=deferred (temporary failure)

This first problem was fixed by rebuilding the container (which pulled the new images with the new certs) via:

./launcher rebuild mail-receiver

Once this was fixed we started getting errors that looked like

Apr 21 14:44:19 discourse-mail-receiver postfix/master[1]: daemon started -- version 3.5.6, configuration /etc/postfix
Apr 21 14:45:49 discourse-mail-receiver postfix/smtpd[99]: connect from mail-qk1-f171.google.com[209.85.222.171]
Apr 21 14:45:49 discourse-mail-receiver postfix/smtpd[99]: ACAEF180D75: client=mail-qk1-f171.google.com[209.85.222.171]
Apr 21 14:45:49 discourse-mail-receiver postfix/cleanup[105]: ACAEF180D75: message-id=<CAA48SF96d7EA4RCeBNmzk+MAQEFcqMe8oZwONZxwcHnoX172eQ@mail.gmail.com>
Apr 21 14:45:49 discourse-mail-receiver postfix/qmgr[97]: ACAEF180D75: from=<tcaswell@gmail.com>, size=3166, nrcpt=1 (queue active)
Apr 21 14:45:49 discourse-mail-receiver postfix/smtpd[99]: disconnect from mail-qk1-f171.google.com[209.85.222.171] ehlo=1 mail=1 rcpt=1 bdat=1 quit=1 commands=5
<23>Apr 21 14:45:49 receive-mail[107]: Recipient: nobody@discourse.matplotlib.org
<19>Apr 21 14:45:49 receive-mail[107]: Failed to POST the e-mail to https://discourse.matplotlib.org/admin/email/handle_mail: 404
Apr 21 14:45:49 discourse-mail-receiver postfix/pipe[106]: ACAEF180D75: to=<nobody@discourse.matplotlib.org>, relay=discourse, delay=0.32, delays=0.15/0.01/0/0.17, dsn=4.3.0, status=deferred (temporary failure)
Apr 21 14:49:09 discourse-mail-receiver postfix/anvil[101]: statistics: max connection rate 1/60s for (smtp:209.85.222.171) at Apr 21 14:45:49
Apr 21 14:49:09 discourse-mail-receiver postfix/anvil[101]: statistics: max connection count 1 for (smtp:209.85.222.171) at Apr 21 14:45:49
Apr 21 14:49:09 discourse-mail-receiver postfix/anvil[101]: statistics: max cache size 1 at Apr 21 14:45:49
Apr 21 14:49:19 discourse-mail-receiver postfix/qmgr[97]: 04243180D0E: from=<tcaswell@gmail.com>, size=3175, nrcpt=1 (queue active)
<23>Apr 21 14:49:20 receive-mail[115]: Recipient: nobody@discourse.matplotlib.org
<19>Apr 21 14:49:20 receive-mail[115]: Failed to POST the e-mail to https://discourse.matplotlib.org/admin/email/handle_mail: 404
Apr 21 14:49:20 discourse-mail-receiver postfix/pipe[114]: 04243180D0E: to=<nobody@discourse.matplotlib.org>, relay=discourse, delay=558, delays=557/0.02/0/0.33, dsn=4.3.0, status=deferred (temporary failure)

which is a 404 when pushing the email to the discourse server. Despite not working, this is still a step forward as we are at least getting to the server!
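
To tell the two failure modes apart in a long log, something like the following might help. It is only a sketch: the function name classify_failures is made up, and the patterns are taken from the log excerpts above rather than from any documented log format.

```shell
# Hypothetical helper: count receive-mail failures by cause.
# Reads log text on stdin; the patterns match the excerpts quoted in this post.
classify_failures() {
    awk '
        /Failed to POST the e-mail/ {
            if ($0 ~ /certificate verify failed/) ssl++
            else if ($0 ~ /handle_mail: 404/)     notfound++
            else                                  other++
        }
        END {
            printf "ssl=%d 404=%d other=%d\n", ssl, notfound, other
        }
    '
}
```

It could be fed from the container logs, for example `./launcher logs mail-receiver | classify_failures`.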

The end point that ingests the email requires an API key (which is a secret in its config files) so that only the mail-receiver can inject emails that should turn into posts. However, the API key we were using expired Mar 30 (it is not clear to me why it expired). The thread Inbound mail is not handled anymore due to invalid API key - #7 by zogstrip - support - Discourse Meta was key to identifying this as the problem.

To fix this I generated a new API key that is restricted to only have power to hit the handle_mail end point and only for the system user, updated the config, and restarted the mail-receiver. Once that was done, the emails made it to the discourse server and began to be processed again :tada:. The emails from today (my test and @noatamir's actual post to the mailing list) got reprocessed.
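
Before rebuilding after a config change, a quick check that the container definition still carries every setting the receiver needs could look like this. This is a sketch under assumptions: the helper name is invented, and the four variable names are assumed from the stock samples/mail-receiver.yml and may differ between versions.

```shell
# Hypothetical helper: print each expected setting that is missing from a
# mail-receiver container definition. The key names below are assumptions.
missing_mail_receiver_keys() {
    config=$1
    for key in MAIL_DOMAIN DISCOURSE_BASE_URL DISCOURSE_API_KEY DISCOURSE_API_USERNAME; do
        # Accept "  KEY: value"; commented-out or empty entries count as missing.
        grep -Eq "^[[:space:]]*${key}:[[:space:]]*[^[:space:]]" "$config" || echo "$key"
    done
}
```

Running `missing_mail_receiver_keys containers/mail-receiver.yml` prints nothing when all four settings are present.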

There is still work to be done to back-fill the missing 7 months of mailing list traffic. I am hopeful that this can be done using the same method we used to initially ingest the historic mailing lists. It may also be worth looking through the logs to get an accounting of how much email was dropped on the floor.
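
For that accounting, one possible starting point is to collect the distinct postfix queue IDs that were ever deferred. This is a rough sketch, not a tested procedure: the function name is hypothetical, the pattern is modelled on the postfix/pipe lines quoted above, and it can only see whatever the container logs still retain.

```shell
# Hypothetical helper: list distinct postfix queue IDs that were deferred.
# Reads log text on stdin, e.g. from `./launcher logs mail-receiver`.
deferred_queue_ids() {
    # Deferred deliveries look like
    #   postfix/pipe[PID]: QUEUEID: to=<...>, ... status=deferred (...)
    # A retried message keeps its queue ID, so sort -u avoids double counting.
    sed -n 's|.*postfix/pipe\[[0-9]*\]: \([0-9A-F]*\): to=.*status=deferred.*|\1|p' | sort -u
}
```

Piping the result through `wc -l` gives a count: `./launcher logs mail-receiver | deferred_queue_ids | wc -l`.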


Log of what I did at the shell:

cd /var/discourse
ls containers/
/launcher logs mail-receiver  # typo: missing the leading '.'
./launcher logs mail-receiver
df -h
./launcher stop mail-receiver
./launcher bootstrap mail-receiver
./launcher start mail-receiver
./launcher logs mail-receiver | tail -n 50
./launcher update mail-receiver
./launcher rebuild mail-receiver
./launcher logs mail-receiver | tail -n 50
cat samples/mail-receiver.yml
./launcher logs mail-receiver | tail -n 50
vi containers/mail-receiver.yml
./launcher rebuild mail-receiver
./launcher logs mail-receiver | tail -n 50
./launcher rebuild app
git pull
history