Read a file into a buffer from FTP in Python
Question
I am trying to read a file from an FTP server. The file is a .gz file. I would like to know whether I can work on this file while the socket is open. I tried to follow two StackOverflow questions, one on reading files without writing to disk and one on reading files from FTP without downloading, but was not successful.
I know how to extract data from, and work on, the downloaded file, but I'm not sure whether I can do it on the fly. Is there a way to connect to the site, get the data into a buffer, possibly do some data extraction, and exit?
When I tried this with StringIO, I got the following error:
>>> from ftplib import FTP
>>> from StringIO import StringIO
>>> ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
Traceback (most recent call last):
  File "<pyshell#2>", line 1, in <module>
    ftp = FTP('ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz')
  File "C:\Python27\lib\ftplib.py", line 117, in __init__
    self.connect(host)
  File "C:\Python27\lib\ftplib.py", line 132, in connect
    self.sock = socket.create_connection((self.host, self.port), self.timeout)
  File "C:\Python27\lib\socket.py", line 553, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
gaierror: [Errno 11004] getaddrinfo failed
I just need to know how I can get the data into some variable and loop over it until the whole file has been read from FTP.
I appreciate your time and help. Thanks!
Solution
Make sure to log in to the FTP server first. After that, use retrbinary, which pulls the file in binary mode and invokes a callback on each chunk of the file. You can use the callback to load the data into a string.
from ftplib import FTP
ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@
# Setup a cheap way to catch the data (could use StringIO too)
data = []
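def handle_binary(more_data):
    # Called by retrbinary once per received chunk
    data.append(more_data)
resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
# Join the chunks; the b"" prefix also works on Python 2.6+, where it is a plain str
data = b"".join(data)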
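The traceback in the question comes from passing a full URL to FTP(), which expects only a host name; the file path belongs in the RETR command instead. As a minimal sketch (Python 3, and split_ftp_url is a hypothetical helper name, not part of ftplib), the URL could be split like this:

```python
from urllib.parse import urlsplit


def split_ftp_url(url):
    """Split an ftp:// URL into (host, path without the leading slash)."""
    parts = urlsplit(url)
    if parts.scheme != "ftp" or not parts.hostname:
        raise ValueError("not an ftp:// URL: %r" % (url,))
    return parts.hostname, parts.path.lstrip("/")


host, path = split_ftp_url("ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/PMC-ids.csv.gz")
# host is what FTP() should receive; path is what follows "RETR ".
```

With these values, FTP(host) followed by retrbinary("RETR " + path, ...) matches the calls used in this answer.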
Bonus points: how about we decompress the string while we're at it?
Easy mode, using the data string from above:
import gzip
import StringIO
zippy = gzip.GzipFile(fileobj=StringIO.StringIO(data))
uncompressed_data = zippy.read()
Slightly better, a complete solution:
from ftplib import FTP
import gzip
import StringIO
ftp = FTP('ftp.ncbi.nlm.nih.gov')
ftp.login() # Username: anonymous password: anonymous@
sio = StringIO.StringIO()
def handle_binary(more_data):
    sio.write(more_data)
resp = ftp.retrbinary("RETR pub/pmc/PMC-ids.csv.gz", callback=handle_binary)
sio.seek(0) # Go back to the start
zippy = gzip.GzipFile(fileobj=sio)
uncompressed = zippy.read()
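The code in this answer targets Python 2, where the StringIO module holds byte strings. On Python 3 the same idea would probably use io.BytesIO instead. The following is a sketch under that assumption; read_gzip_from_ftp is a hypothetical helper name, and it takes an already logged-in FTP object:

```python
import gzip
import io


def read_gzip_from_ftp(ftp, path):
    """Download `path` over an open ftplib.FTP connection and gunzip it in memory."""
    buf = io.BytesIO()
    # buf.write serves as the per-chunk callback
    ftp.retrbinary("RETR " + path, buf.write)
    buf.seek(0)  # Go back to the start before reading
    with gzip.GzipFile(fileobj=buf) as gz:
        return gz.read()
```

A call might look like `with FTP('ftp.ncbi.nlm.nih.gov') as ftp: ftp.login(); data = read_gzip_from_ftp(ftp, 'pub/pmc/PMC-ids.csv.gz')`. The result is a bytes object, so it needs decoding before it can be handled as text.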
In reality, it would be much better to decompress on the fly, but I don't see a way to do that with the built-in libraries (at least not easily).
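One standard-library option that may cover this case, though it is not part of the original answer, is zlib.decompressobj: passing 16 + zlib.MAX_WBITS tells zlib to expect a gzip header, and each FTP chunk can then be decompressed as it arrives. The sketch below is written for Python 3; make_gunzip_callback and on_data are hypothetical names, and the code assumes the file contains a single gzip member.

```python
import zlib


def make_gunzip_callback(on_data):
    """Return (callback, finish) for incremental gzip decompression.

    callback(chunk) decompresses one compressed chunk and hands any
    output to on_data; finish() flushes whatever is still buffered.
    Assumes a single gzip member in the stream.
    """
    # 16 + MAX_WBITS makes zlib expect a gzip header and trailer
    decomp = zlib.decompressobj(16 + zlib.MAX_WBITS)

    def callback(chunk):
        out = decomp.decompress(chunk)
        if out:
            on_data(out)

    def finish():
        tail = decomp.flush()
        if tail:
            on_data(tail)

    return callback, finish
```

The callback would be passed to retrbinary in place of handle_binary, with finish() called once the transfer completes. Output chunks do not align with line boundaries, so a consumer that wants CSV rows has to buffer partial lines itself.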