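Line breaks ("=") in IMAP: how to decode?
Problem description
I am trying to build an email scraper that fetches certain emails and looks for values in them to store in a CSV file. I have tried many ways to solve this, but so far none of them has worked.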
import imaplib

# Function to get the email content, i.e. its body part
def get_body(msg):
    if msg.is_multipart():
        return get_body(msg.get_payload(decode=True)).decode()
    else:
        return msg.get_payload(decode=True).decode()

# Function to search for a key-value pair
def search(key, value, con):
    result, data = con.search(None, key, '"{}"'.format(value))
    return data

# Function to get the list of emails under this label
def get_emails(result_bytes):
    print("get email")
    msgs = []  # all the email data is pushed into a list
    for num in result_bytes[0].split():
        typ, data = con.fetch(num, '(RFC822)')
        msgs.append(data)
    return msgs

# This is done to make an SSL connection with Gmail
con = imaplib.IMAP4_SSL(imap_url)
con.login(user, password)
con.select('Inbox')
msg_ids = get_emails(search('SUBJECT', 'TESTTITELPYTHON', con))
for msg in msg_ids[::-1]:
    for sent in msg:
        if type(sent) is tuple:
            print(msg)
            # encoding set as utf-8
            content = sent[1], 'utf-8'
            data = str(content)
            # Handling errors related to UnicodeEncodeError
            try:
                indexstart = data.find("span")
                data2 = data[indexstart + 5: len(data)]
                indexend = data2.find("</div>")
                # printing the required content which we need
                # to extract from our email, i.e. our body
                waarde = data2[0: indexend]
                test_naam_1 = waarde.split("Naam: ", 1)[1]
                echte_naam = test_naam_1.split("Email: ", -1)[0]
                email_test = waarde.split("Email: ", 1)[1]
                echte_email = email_test.split("Tel nr.: ", -1)[0]
                tel_test = waarde.split("Tel nr.: ", 1)[1]
                echte_tel = tel_test.split("Onderwerp: ", -1)[0]
                subj_test = waarde.split("Onderwerp: ", 1)[1]
                echte_subj = subj_test.split("Bericht: ", -1)[0]
                print("---ADRESGEGEVENS---")
                print("---Naam: " + echte_naam + "---")
                print("---Naam: " + echte_email + "---")
                print("---Naam: " + echte_tel + "---")
                print("---Naam: " + echte_subj + "---")
            except UnicodeEncodeError:
                pass
Now in my result I still get these ugly line breaks, which look like this in my markup:
[(b'12638 (RFC822 {1973}', b'MIME-Version: 1.0
Date: Mon, 25 Oct 2021 16:41:46 +0200
Message-ID: <CAJDn=xsVynQqp7BwYoGZB=v21-AAR5=xcMkQ8D2kXE7ZpYFNNQ@mail.example.com>
Subject: TESTTITELPYTHON
From: Patrick Merkx <patrick@example.nl>
To: Patrick Merkx <patrick@example.nl>
Content-Type: multipart/alternative; boundary="00000000000042e6ae05cf2e5c7e"
--00000000000042e6ae05cf2e5c7e
Content-Type: text/plain; charset="UTF-8"
Contactformulier ingevuld door:
Naam: Patrick Merkx
Email: merkx.patrick@example.com
Tel nr.: 0611381219
Onderwerp: Nog een test
Bericht:
Bericht
--00000000000042e6ae05cf2e5c7e
Content-Type: text/html; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable
<div dir=3D"ltr"><div><div dir=3D"ltr" class=3D"gmail_signature" data-smart=
mail=3D"gmail_signature"><div dir=3D"ltr"><div><div dir=3D"ltr"><div><div d=
ir=3D"ltr"><div style=3D"font-stretch:normal;font-size:13.33px;line-height:=
19.99px;background:none;border:0px rgb(34,34,34);width:600px;overflow:visib=
le;min-height:0px;outline-width:0px"><span class=3D"gmail-il" style=3D"font=
-size:small">Contactformulier</span><span style=3D"font-size:small">=C2=A0i=
ngevuld door:</span><br style=3D"font-size:small"><span style=3D"font-size:=
small">Naam: Patrick Merkx</span><br style=3D"font-size:small"><span style=
=3D"font-size:small">Email:=C2=A0</span><a href=3D"mailto:merkx.patrick@gma=
il.com" target=3D"_blank" style=3D"font-size:small">merkx.patrick@example.com=
</a><br style=3D"font-size:small"><span style=3D"font-size:small">Tel nr.: =
0611381219</span><br style=3D"font-size:small"><br style=3D"font-size:small=
"><span style=3D"font-size:small">Onderwerp: Nog een test</span><br style=
=3D"font-size:small"><br style=3D"font-size:small"><span style=3D"font-size=
:small">Bericht:</span><br style=3D"font-size:small"><span style=3D"font-si=
ze:small">Bericht</span><br></div></div></div></div></div></div></div></div=
></div>
--00000000000042e6ae05cf2e5c7e--'), b')']
class=3D"gmail-il" style=3D"font=
-size:small">Contactformulier</span><span style=3D"font-size:small">=C2=A0i=
ngevuld door:</span><br style=3D"font-size:small"><span style=3D"font-size:=
small">Naam: Patrick Merkx</span><br style=3D"font-size:small"><span style=
=3D"font-size:small">Email:=C2=A0</span><a href=3D"mailto:merkx.patrick@gma=
il.com" target=3D"_blank" style=3D"font-size:small">merkx.patrick@gmail.com=
</a><br style=3D"font-size:small"><span style=3D"font-size:small">Tel nr.: =
0611381219</span><br style=3D"font-size:small"><br style=3D"font-size:small=
"><span style=3D"font-size:small">Onderwerp: Nog een test</span><br style=
=3D"font-size:small"><br style=3D"font-size:small"><span style=3D"font-size=
:small">Bericht:</span><br style=3D"font-size:small"><span style=3D"font-si=
ze:small">Bericht</span><br>
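I have also tried stripping the body tags, decoding, and several other solutions, but no luck so far. I can't seem to remove these line breaks with any method I know.
What am I doing wrong?
Solution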
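The MIME part you are looking at has Content-Transfer-Encoding: quoted-printable. The correct way to decode it is to walk the MIME structure and interpret each part as you go. But there is no need to do this explicitly; Python's email library already does it for you.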
from email import message_from_bytes
from email.policy import default

...
msg_ids = get_emails(search('SUBJECT', 'TESTTITELPYTHON', con))
for msg in msg_ids[::-1]:
    for sent in msg:
        if type(sent) is tuple:
            msg = message_from_bytes(sent[1], policy=default)
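Unfortunately, without an example of the MIME structure of these messages, I can't tell you exactly what to do with the resulting message. You probably have a MIME body part of some kind; msg.get_body(preferencelist=('html', 'plain')) will pull it out, and calling get_content() on the result will extract the actual body.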
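As a minimal sketch of that approach, assume a message shaped like the one in the question: a multipart/alternative with a text/plain part and a quoted-printable text/html part. The RAW bytes below are a shortened, hypothetical stand-in for sent[1] from the fetch loop, and the boundary name is made up for illustration.

```python
from email import message_from_bytes
from email.policy import default

# Hypothetical, shortened stand-in for sent[1] from the IMAP fetch.
# In quoted-printable, "=" at the end of a line is a soft line break,
# "=3D" is a literal "=", and "=C2=A0" is a UTF-8 non-breaking space.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b"Subject: TESTTITELPYTHON\r\n"
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b'Content-Type: text/plain; charset="UTF-8"\r\n'
    b"\r\n"
    b"Naam: Patrick Merkx\r\n"
    b"Tel nr.: 0611381219\r\n"
    b"--BOUNDARY\r\n"
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b'<div dir=3D"ltr"><span style=3D"font-size:small">Naam: Patrick Me=\r\n'
    b"rkx</span><br><span>Tel nr.:=C2=A00611381219</span></div>\r\n"
    b"--BOUNDARY--\r\n"
)

msg = message_from_bytes(RAW, policy=default)

# Prefer the HTML alternative, fall back to plain text.
html_part = msg.get_body(preferencelist=("html", "plain"))
html = html_part.get_content()  # transfer encoding and charset already undone

# The plain-text alternative is usually easier to split into fields.
plain_part = msg.get_body(preferencelist=("plain",))
plain = plain_part.get_content()

print(html)
print(plain)
```

In the question's loop, RAW would be replaced by sent[1]. If the form data is all that is needed, the text/plain alternative is probably the simpler one to parse, since it carries no markup.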
The policy=default keyword argument selects the email.message.EmailMessage object class introduced in Python 3.6, rather than the legacy email.message.Message object used in older versions.
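The difference is easy to check. The following sketch parses the same made-up two-line message with and without the policy argument and compares the resulting classes:

```python
from email import message_from_bytes
from email.message import EmailMessage, Message
from email.policy import default

raw = b"Subject: test\r\n\r\nbody\r\n"  # made-up minimal message

legacy = message_from_bytes(raw)                  # default compat32 policy
modern = message_from_bytes(raw, policy=default)  # modern API

print(type(legacy).__name__, hasattr(legacy, "get_body"))
print(type(modern).__name__, hasattr(modern, "get_body"))
```

Only the object created with policy=default offers get_body() and get_content(), which is why the keyword matters here.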
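In more detail, trying to decode the raw email body as UTF-8 is simply wrong. A typical MIME message has several parts, each of which may use a different encoding, and many of them will certainly not be UTF-8. UTF-8 is becoming more popular, but even then the actual UTF-8 will usually sit behind a content transfer encoding that protects it from damage while it travels over routes that may not be 8-bit clean.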
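To illustrate, here is a small sketch with a made-up two-part message in which the plain-text part is ISO-8859-1 sent as 8bit and the HTML part is UTF-8 sent as quoted-printable. The message content and boundary name are assumptions chosen for the demonstration, not taken from the question.

```python
from email import message_from_bytes
from email.policy import default

# Made-up message: each part declares its own charset and transfer encoding.
RAW = (
    b"MIME-Version: 1.0\r\n"
    b'Content-Type: multipart/alternative; boundary="BOUNDARY"\r\n'
    b"\r\n"
    b"--BOUNDARY\r\n"
    b'Content-Type: text/plain; charset="ISO-8859-1"\r\n'
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
    b"Naam: Caf\xe9\r\n"
    b"--BOUNDARY\r\n"
    b'Content-Type: text/html; charset="UTF-8"\r\n'
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"<p>Naam:=C2=A0Caf=C3=A9</p>\r\n"
    b"--BOUNDARY--\r\n"
)

# Decoding the whole message as UTF-8 fails on the ISO-8859-1 byte.
naive_error = None
try:
    RAW.decode("utf-8")
except UnicodeDecodeError as exc:
    naive_error = exc

# Walking the parts lets the library apply each part's own settings.
msg = message_from_bytes(RAW, policy=default)
parts = []
for part in msg.walk():
    if part.is_multipart():
        continue  # containers have no text of their own
    parts.append((
        part.get_content_type(),
        part.get_content_charset(),
        str(part["Content-Transfer-Encoding"]),
        part.get_content().strip(),
    ))

print(naive_error)
for entry in parts:
    print(entry)
```

Both parts come out as the same readable text even though their bytes on the wire differ, which a single blanket decode of the raw message cannot achieve.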