Scraping user locations from Twitter
Problem description
I am trying to scrape the latitude and longitude of users from Twitter based on user names. The user name list is a CSV file with more than 50 names in one input file. Below are two attempts I have made so far. Neither of them seems to be working. Corrections to either program, or an entirely new approach, are welcome.
I have a list of User_names and I am trying to look up each user's profile and pull the geolocation from the profile or timeline. I could not find many samples anywhere on the Internet.
I am looking for a better approach to get the geolocations of users from Twitter. I could not find even a single example that shows harvesting a user location by User_name or user_id. Is it even possible in the first place?
Input: the input file has more than 50k rows. A sample:
AfsarTamannaah,6.80E+17,12/24/2015,#chennaifloods
DEEPU_S_GIRI,6.80E+17,12/24/2015,#chennaifloods
DEEPU_S_GIRI,6.80E+17,12/24/2015,#weneverletyoudownstr
ndtv,6.80E+17,12/24/2015,#chennaifloods
1andonlyharsha,6.79E+17,12/21/2015,#chennaifloods
Shashkya,6.79E+17,12/21/2015,#moneyonmobile
Shashkya,6.79E+17,12/21/2015,#chennaifloods
timesofindia,6.79E+17,12/20/2015,#chennaifloods
ANI_news,6.78E+17,12/20/2015,#chennaifloods
DrAnbumaniPMK,6.78E+17,12/19/2015,#chennaifloods
timesofindia,6.78E+17,12/18/2015,#chennaifloods
SRKCHENNAIFC,6.78E+17,12/18/2015,#dilwalefdfs
SRKCHENNAIFC,6.78E+17,12/18/2015,#chennaifloods
AmeriCares,6.77E+17,12/16/2015,#india
AmeriCares,6.77E+17,12/16/2015,#chennaifloods
ChennaiRainsH,6.77E+17,12/15/2015,#chennairainshelp
ChennaiRainsH,6.77E+17,12/15/2015,#chennaifloods
AkkiPritam,6.77E+17,12/15/2015,#chennaifloods
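For reference, the sample has no header row, so the screen names are simply the first column. A minimal reading sketch using the standard csv module (the file name user_keyword.csv is taken from the code below) could look like this:

import csv

# Collect the screen names from the first column of the headerless input file
with open('user_keyword.csv') as f:
    screen_names = [row[0] for row in csv.reader(f) if row]

print("%d screen names read" % len(screen_names))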
Code:
import tweepy
from tweepy import Stream
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
import pandas as pd
import json
import csv
import sys
import time

CONSUMER_KEY = 'XYZ'
CONSUMER_SECRET = 'XYZ'
ACCESS_KEY = 'XYZ'
ACCESS_SECRET = 'XYZ'

auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
api = tweepy.API(auth)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)

data = pd.read_csv('user_keyword.csv')
df = ['user_name', 'user_id', 'date', 'keyword']

test = api.lookup_users(user_ids=['user_name'])

for user in test:
    print user.user_name
    print user.user_id
    print user.date
    print user.keyword
    print user.geolocation
Error:
Traceback (most recent call last):
  File "user_profile_location.py", line 24, in <module>
    test = api.lookup_users(user_ids=['user_name'])
  File "/usr/lib/python2.7/dist-packages/tweepy/api.py", line 150, in lookup_users
    return self._lookup_users(list_to_csv(user_ids), list_to_csv(screen_names))
  File "/usr/lib/python2.7/dist-packages/tweepy/binder.py", line 197, in _call
    return method.execute()
  File "/usr/lib/python2.7/dist-packages/tweepy/binder.py", line 173, in execute
    raise TweepError(error_msg, resp)
tweepy.error.TweepError: [{'message': 'No user matches for specified terms.', 'code': 17}]
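The error above most likely occurs because the literal string 'user_name' is passed as user_ids, so Twitter finds no matching account; lookup_users needs actual numeric IDs or screen names. A hedged sketch of what passing the real screen names could look like (it assumes the tweepy version from the traceback and the column layout from the input sample; credentials are placeholders):

import pandas as pd
import tweepy

auth = tweepy.OAuthHandler('XYZ', 'XYZ')
auth.set_access_token('XYZ', 'XYZ')
api = tweepy.API(auth)

# Read the headerless CSV; the column names here are assumptions based on the sample
names = pd.read_csv('user_keyword.csv', header=None,
                    names=['user_name', 'user_id', 'date', 'keyword'])['user_name']

# users/lookup accepts at most 100 names per request
for user in api.lookup_users(screen_names=names[:100].tolist()):
    print("%s: %s" % (user.screen_name, user.location))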
I understand that not every user shares a geolocation, but getting it for those who keep their profiles publicly open would be great.
User locations as a place name and/or lat/lon are what I am looking for.
If this approach isn't correct, then I am open to alternatives as well.
Update One: after some deeper searching I found this website, which provides a very close solution, but I am getting an error while trying to read the userName column from the input file.
It says that only 100 users' information can be grabbed per call. What is a better way to lift that limitation?
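One possible way (a sketch only, reusing the api object and the names series from the sketch above) to stay within that per-call limit is to split the list into chunks of 100 and issue one users/lookup request per chunk:

import time

def chunks(seq, size=100):
    # Yield successive slices of at most `size` items
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

all_users = []
for batch in chunks(names.drop_duplicates().tolist()):
    all_users.extend(api.lookup_users(screen_names=batch))
    time.sleep(5)  # crude pause between requests; tune to the actual rate limits

for user in all_users:
    print("%s: %s" % (user.screen_name, user.location))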
Code:
import sys
import string
import simplejson
from twython import Twython
import csv
import pandas as pd

#WE WILL USE THE VARIABLES DAY, MONTH, AND YEAR FOR OUR OUTPUT FILE NAME
import datetime
now = datetime.datetime.now()
day = int(now.day)
month = int(now.month)
year = int(now.year)

#FOR OAUTH AUTHENTICATION -- NEEDED TO ACCESS THE TWITTER API
t = Twython(app_key='ABC',
            app_secret='ABC',
            oauth_token='ABC',
            oauth_token_secret='ABC')

#INPUT HAS NO HEADER NO INDEX
ids = pd.read_csv('user_keyword.csv', header=['userName', 'userID', 'Date', 'Keyword'], usecols=['userName'])

#ACCESS THE LOOKUP_USER METHOD OF THE TWITTER API -- GRAB INFO ON UP TO 100 IDS WITH EACH API CALL
users = t.lookup_user(user_id = ids)

#NAME OUR OUTPUT FILE - %i WILL BE REPLACED BY CURRENT MONTH, DAY, AND YEAR
outfn = "twitter_user_data_%i.%i.%i.csv" % (now.month, now.day, now.year)

#NAMES FOR HEADER ROW IN OUTPUT FILE
fields = "id, screen_name, name, created_at, url, followers_count, friends_count, statuses_count, favourites_count, listed_count, contributors_enabled, description, protected, location, lang, expanded_url".split()

#INITIALIZE OUTPUT FILE AND WRITE HEADER ROW
outfp = open(outfn, "w")
outfp.write(string.join(fields, " ") + "\n")  # header

#THE VARIABLE 'USERS' CONTAINS INFORMATION OF THE TWITTER USER IDS LISTED ABOVE
#THIS BLOCK WILL LOOP OVER EACH OF THESE IDS, CREATE VARIABLES, AND OUTPUT TO FILE
for entry in users:
    #CREATE EMPTY DICTIONARY
    r = {}
    for f in fields:
        r[f] = ""
    #ASSIGN VALUE OF 'ID' FIELD IN JSON TO 'ID' FIELD IN OUR DICTIONARY
    r['id'] = entry['id']
    #SAME WITH 'SCREEN_NAME' HERE, AND FOR REST OF THE VARIABLES
    r['screen_name'] = entry['screen_name']
    r['name'] = entry['name']
    r['created_at'] = entry['created_at']
    r['url'] = entry['url']
    r['followers_count'] = entry['followers_count']
    r['friends_count'] = entry['friends_count']
    r['statuses_count'] = entry['statuses_count']
    r['favourites_count'] = entry['favourites_count']
    r['listed_count'] = entry['listed_count']
    r['contributors_enabled'] = entry['contributors_enabled']
    r['description'] = entry['description']
    r['protected'] = entry['protected']
    r['location'] = entry['location']
    r['lang'] = entry['lang']
    #NOT EVERY ID WILL HAVE A 'URL' KEY, SO CHECK FOR ITS EXISTENCE WITH IF CLAUSE
    if 'url' in entry['entities']:
        r['expanded_url'] = entry['entities']['url']['urls'][0]['expanded_url']
    else:
        r['expanded_url'] = ''
    print r
    #CREATE EMPTY LIST
    lst = []
    #ADD DATA FOR EACH VARIABLE
    for f in fields:
        lst.append(unicode(r[f]).replace("/", "/"))
    #WRITE ROW WITH DATA IN LIST
    outfp.write(string.join(lst, " ").encode("utf-8") + "\n")
outfp.close()
Error:
File "user_profile_location.py", line 35, in <module>
ids = pd.read_csv('user_keyword.csv', header=['userName', 'userID', 'Date', 'Keyword'], usecols=['userName'])
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 315, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 645, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 799, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1202, in __init__
ParserBase.__init__(self, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 918, in __init__
raise ValueError("cannot specify usecols when "
ValueError: cannot specify usecols when specifying a multi-index header
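For context, pandas raises this error because header= expects row numbers; passing a list of strings makes it look for a multi-row (multi-index) header, which cannot be combined with usecols. The column labels belong in names= instead. A possible read (the labels are assumptions based on the input sample) would be:

import pandas as pd

ids = pd.read_csv('user_keyword.csv', header=None,
                  names=['userName', 'userID', 'Date', 'Keyword'],
                  usecols=['userName'])
user_names = ids['userName'].tolist()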
Solution
Assuming that you just want to get the location a user has put on his/her profile page, you can use API.get_user from Tweepy. Below is working code.
#!/usr/bin/env python
from __future__ import print_function
#Import the necessary methods from tweepy library
import tweepy
from tweepy import OAuthHandler

#user credentials to access Twitter API
access_token = "your access token here"
access_token_secret = "your access token secret key here"
consumer_key = "your consumer key here"
consumer_secret = "your consumer secret key here"

def get_user_details(username):
    userobj = api.get_user(username)
    return userobj

if __name__ == '__main__':
    #authenticating the app (https://apps.twitter.com/)
    auth = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    #for list of usernames, put them in iterable and call the function
    username = 'thinkgeek'
    userOBJ = get_user_details(username)
    print(userOBJ.location)
Note: This is a crude implementation. Write a proper sleeper function to obey Twitter API access limits.
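As one possible illustration only (not a definitive implementation; it reuses the credential variables from the code above and assumes the column layout shown in the question), tweepy's built-in wait_on_rate_limit option plus a small pause can serve as that sleeper:

import time
import pandas as pd
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
# Let tweepy block automatically when the rate limit is hit
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Column names are assumptions based on the input sample in the question
names = pd.read_csv('user_keyword.csv', header=None,
                    names=['userName', 'userID', 'Date', 'Keyword'])['userName'].drop_duplicates()

for name in names:
    try:
        print("%s: %s" % (name, api.get_user(name).location))
    except tweepy.TweepError as e:
        print("%s: lookup failed (%s)" % (name, e))
    time.sleep(1)  # extra safety margin on top of wait_on_rate_limit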