Scraping user locations from Twitter
Problem description
I am trying to scrape the latitude and longitude of users from Twitter based on user names. The user name list is a CSV file with more than 50 names in one input file. Below are two attempts I have made so far. Neither of them seems to be working. Corrections to either program, or an entirely new approach, are welcome.
I have a list of User_names and I am trying to look up each user's profile and pull the geolocation from the profile or timeline. I could not find many samples anywhere on the Internet.
I am looking for a better approach to get the geolocations of users from Twitter. I could not find even a single example that shows harvesting a user location by User_name or user_id. Is it even possible in the first place?
Input: the input file has more than 50k rows. A sample:
AfsarTamannaah,6.80E+17,12/24/2015,#chennaifloods
DEEPU_S_GIRI,6.80E+17,12/24/2015,#chennaifloods
DEEPU_S_GIRI,6.80E+17,12/24/2015,#weneverletyoudownstr
ndtv,6.80E+17,12/24/2015,#chennaifloods
1andonlyharsha,6.79E+17,12/21/2015,#chennaifloods
Shashkya,6.79E+17,12/21/2015,#moneyonmobile
Shashkya,6.79E+17,12/21/2015,#chennaifloods
timesofindia,6.79E+17,12/20/2015,#chennaifloods
ANI_news,6.78E+17,12/20/2015,#chennaifloods
DrAnbumaniPMK,6.78E+17,12/19/2015,#chennaifloods
timesofindia,6.78E+17,12/18/2015,#chennaifloods
SRKCHENNAIFC,6.78E+17,12/18/2015,#dilwalefdfs
SRKCHENNAIFC,6.78E+17,12/18/2015,#chennaifloods
AmeriCares,6.77E+17,12/16/2015,#india
AmeriCares,6.77E+17,12/16/2015,#chennaifloods
ChennaiRainsH,6.77E+17,12/15/2015,#chennairainshelp
ChennaiRainsH,6.77E+17,12/15/2015,#chennaifloods
AkkiPritam,6.77E+17,12/15/2015,#chennaifloods
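For reference, the sample has no header row, so the screen names are simply the first column. A minimal reading sketch using the standard csv module (the file name user_keyword.csv is taken from the code below) could look like this:

import csv

# Collect the screen names from the first column of the headerless input file
with open('user_keyword.csv') as f:
    screen_names = [row[0] for row in csv.reader(f) if row]

print("%d screen names read" % len(screen_names))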
Code:
import tweepy
from tweepy import Stream
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
import pandas as pd
import json
import csv
import sys
import time

CONSUMER_KEY = 'XYZ'
CONSUMER_SECRET = 'XYZ'
ACCESS_KEY = 'XYZ'
ACCESS_SECRET = 'XYZ'

auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
api = tweepy.API(auth)
auth.set_access_token(ACCESS_KEY, ACCESS_SECRET)

data = pd.read_csv('user_keyword.csv')
df = ['user_name', 'user_id', 'date', 'keyword']

test = api.lookup_users(user_ids=['user_name'])

for user in test:
    print user.user_name
    print user.user_id
    print user.date
    print user.keyword
    print user.geolocation
Error:
Traceback (most recent call last):
  File "user_profile_location.py", line 24, in <module>
    test = api.lookup_users(user_ids=['user_name'])
  File "/usr/lib/python2.7/dist-packages/tweepy/api.py", line 150, in lookup_users
    return self._lookup_users(list_to_csv(user_ids), list_to_csv(screen_names))
  File "/usr/lib/python2.7/dist-packages/tweepy/binder.py", line 197, in _call
    return method.execute()
  File "/usr/lib/python2.7/dist-packages/tweepy/binder.py", line 173, in execute
    raise TweepError(error_msg, resp)
tweepy.error.TweepError: [{'message': 'No user matches for specified terms.', 'code': 17}]
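The error above most likely occurs because the literal string 'user_name' is passed as user_ids, so Twitter finds no matching account; lookup_users needs actual numeric IDs or screen names. A hedged sketch of what passing the real screen names could look like (it assumes the tweepy version from the traceback and the column layout from the input sample; credentials are placeholders):

import pandas as pd
import tweepy

auth = tweepy.OAuthHandler('XYZ', 'XYZ')
auth.set_access_token('XYZ', 'XYZ')
api = tweepy.API(auth)

# Read the headerless CSV; the column names here are assumptions based on the sample
names = pd.read_csv('user_keyword.csv', header=None,
                    names=['user_name', 'user_id', 'date', 'keyword'])['user_name']

# users/lookup accepts at most 100 names per request
for user in api.lookup_users(screen_names=names[:100].tolist()):
    print("%s: %s" % (user.screen_name, user.location))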
I understand that not every user shares a geolocation, but getting it for those who keep their profiles publicly open would be great.
User locations as a place name and/or lat/lon are what I am looking for.
If this approach isn't correct, then I am open to alternatives as well.
Update One: after some deeper searching I found this website, which provides a very close solution, but I am getting an error while trying to read the userName column from the input file.
It says that only 100 users' information can be grabbed per call. What is a better way to lift that limitation?
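One possible way (a sketch only, reusing the api object and the names series from the sketch above) to stay within that per-call limit is to split the list into chunks of 100 and issue one users/lookup request per chunk:

import time

def chunks(seq, size=100):
    # Yield successive slices of at most `size` items
    for i in range(0, len(seq), size):
        yield seq[i:i + size]

all_users = []
for batch in chunks(names.drop_duplicates().tolist()):
    all_users.extend(api.lookup_users(screen_names=batch))
    time.sleep(5)  # crude pause between requests; tune to the actual rate limits

for user in all_users:
    print("%s: %s" % (user.screen_name, user.location))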
Code:
import sys
import string
import simplejson
from twython import Twython
import csv
import pandas as pd

#WE WILL USE THE VARIABLES DAY, MONTH, AND YEAR FOR OUR OUTPUT FILE NAME
import datetime
now = datetime.datetime.now()
day = int(now.day)
month = int(now.month)
year = int(now.year)

#FOR OAUTH AUTHENTICATION -- NEEDED TO ACCESS THE TWITTER API
t = Twython(app_key='ABC',
            app_secret='ABC',
            oauth_token='ABC',
            oauth_token_secret='ABC')

#INPUT HAS NO HEADER NO INDEX
ids = pd.read_csv('user_keyword.csv', header=['userName', 'userID', 'Date', 'Keyword'], usecols=['userName'])

#ACCESS THE LOOKUP_USER METHOD OF THE TWITTER API -- GRAB INFO ON UP TO 100 IDS WITH EACH API CALL
users = t.lookup_user(user_id = ids)

#NAME OUR OUTPUT FILE - %i WILL BE REPLACED BY CURRENT MONTH, DAY, AND YEAR
outfn = "twitter_user_data_%i.%i.%i.csv" % (now.month, now.day, now.year)

#NAMES FOR HEADER ROW IN OUTPUT FILE
fields = "id, screen_name, name, created_at, url, followers_count, friends_count, statuses_count, favourites_count, listed_count, contributors_enabled, description, protected, location, lang, expanded_url".split()

#INITIALIZE OUTPUT FILE AND WRITE HEADER ROW
outfp = open(outfn, "w")
outfp.write(string.join(fields, " ") + "\n")  # header

#THE VARIABLE 'USERS' CONTAINS INFORMATION OF THE TWITTER USER IDS LISTED ABOVE
#THIS BLOCK WILL LOOP OVER EACH OF THESE IDS, CREATE VARIABLES, AND OUTPUT TO FILE
for entry in users:
    #CREATE EMPTY DICTIONARY
    r = {}
    for f in fields:
        r[f] = ""
    #ASSIGN VALUE OF 'ID' FIELD IN JSON TO 'ID' FIELD IN OUR DICTIONARY
    r['id'] = entry['id']
    #SAME WITH 'SCREEN_NAME' HERE, AND FOR REST OF THE VARIABLES
    r['screen_name'] = entry['screen_name']
    r['name'] = entry['name']
    r['created_at'] = entry['created_at']
    r['url'] = entry['url']
    r['followers_count'] = entry['followers_count']
    r['friends_count'] = entry['friends_count']
    r['statuses_count'] = entry['statuses_count']
    r['favourites_count'] = entry['favourites_count']
    r['listed_count'] = entry['listed_count']
    r['contributors_enabled'] = entry['contributors_enabled']
    r['description'] = entry['description']
    r['protected'] = entry['protected']
    r['location'] = entry['location']
    r['lang'] = entry['lang']
    #NOT EVERY ID WILL HAVE A 'URL' KEY, SO CHECK FOR ITS EXISTENCE WITH IF CLAUSE
    if 'url' in entry['entities']:
        r['expanded_url'] = entry['entities']['url']['urls'][0]['expanded_url']
    else:
        r['expanded_url'] = ''
    print r
    #CREATE EMPTY LIST
    lst = []
    #ADD DATA FOR EACH VARIABLE
    for f in fields:
        lst.append(unicode(r[f]).replace("/", "/"))
    #WRITE ROW WITH DATA IN LIST
    outfp.write(string.join(lst, " ").encode("utf-8") + "\n")
outfp.close()
Error:
File "user_profile_location.py", line 35, in <module>
ids = pd.read_csv('user_keyword.csv', header=['userName', 'userID', 'Date', 'Keyword'], usecols=['userName'])
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 562, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 315, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 645, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 799, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 1202, in __init__
ParserBase.__init__(self, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 918, in __init__
raise ValueError("cannot specify usecols when "
ValueError: cannot specify usecols when specifying a multi-index header
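For context, pandas raises this error because header= expects row numbers; passing a list of strings makes it look for a multi-row (multi-index) header, which cannot be combined with usecols. The column labels belong in names= instead. A possible read (the labels are assumptions based on the input sample) would be:

import pandas as pd

ids = pd.read_csv('user_keyword.csv', header=None,
                  names=['userName', 'userID', 'Date', 'Keyword'],
                  usecols=['userName'])
user_names = ids['userName'].tolist()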
Solution
Assuming that you just want to get the location a user has put on his/her profile page, you can use API.get_user from Tweepy. Below is working code.
#!/usr/bin/env python
from __future__ import print_function
#Import the necessary methods from tweepy library
import tweepy
from tweepy import OAuthHandler

#user credentials to access Twitter API
access_token = "your access token here"
access_token_secret = "your access token secret key here"
consumer_key = "your consumer key here"
consumer_secret = "your consumer secret key here"

def get_user_details(username):
    userobj = api.get_user(username)
    return userobj

if __name__ == '__main__':
    #authenticating the app (https://apps.twitter.com/)
    auth = tweepy.auth.OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    api = tweepy.API(auth)

    #for list of usernames, put them in iterable and call the function
    username = 'thinkgeek'
    userOBJ = get_user_details(username)
    print(userOBJ.location)
Note: This is a crude implementation. Write a proper sleeper function to obey Twitter API access limits.
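As one possible illustration only (not a definitive implementation; it reuses the credential variables from the code above and assumes the column layout shown in the question), tweepy's built-in wait_on_rate_limit option plus a small pause can serve as that sleeper:

import time
import pandas as pd
import tweepy

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
# Let tweepy block automatically when the rate limit is hit
api = tweepy.API(auth, wait_on_rate_limit=True, wait_on_rate_limit_notify=True)

# Column names are assumptions based on the input sample in the question
names = pd.read_csv('user_keyword.csv', header=None,
                    names=['userName', 'userID', 'Date', 'Keyword'])['userName'].drop_duplicates()

for name in names:
    try:
        print("%s: %s" % (name, api.get_user(name).location))
    except tweepy.TweepError as e:
        print("%s: lookup failed (%s)" % (name, e))
    time.sleep(1)  # extra safety margin on top of wait_on_rate_limit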