User recognition without cookies or local storage
I'm building an analytics tool, and I can currently get the user's IP address, plus their browser and operating system from the user agent.
I'm wondering if there is a possibility to detect the same user without using cookies or local storage? I'm not expecting code examples here; just a simple hint of where to look further.
Forgot to mention that it would need to be cross-browser compatible if it's the same computer/device. Basically, I'm after device recognition rather than recognition of the actual user.
Solution
Introduction
If I understand you correctly, you need to identify a user for whom you don't have a Unique Identifier, so you want to figure out who they are by matching Random Data. You can't store the user's identity reliably because:
- Cookies can be deleted
- The IP address can change
- The browser can change
- The browser cache may be deleted
A Java Applet or COM object would have been an easy solution using a hash of hardware information, but these days people are so security-aware that it would be difficult to get people to install these kinds of programs on their system. This leaves you stuck with using Cookies and other, similar tools.
Cookies and other, similar tools
You might consider building a Data Profile, then using Probability tests to identify a Probable User. A profile useful for this can be generated by some combination of the following:
- IP Address
  - Real IP Address
  - Proxy IP Address (users often use the same proxy repeatedly)
- Cookies
  - HTTP Cookies
  - Session Cookies
  - 3rd Party Cookies
  - Flash Cookies (most people don't know how to delete these)
- Web Bugs (less reliable because bugs get fixed, but still useful)
  - PDF Bug
  - Flash Bug
  - Java Bug
- Browsers
  - Click Tracking (many users visit the same series of pages on each visit)
  - Browser Fingerprint: Installed Plugins (people often have varied, somewhat unique sets of plugins)
  - Cached Images (people sometimes delete their cookies but leave cached images)
  - Using Blobs
  - URL(s) (browser history or cookies may contain unique user IDs in URLs, such as https://stackoverflow.com/users/1226894 or http://www.facebook.com/barackobama?fref=ts)
  - System Fonts Detection (this is a little-known but often unique key signature)
- HTML5 & Javascript
  - HTML5 LocalStorage
  - HTML5 Geolocation API and Reverse Geocoding
  - Architecture, OS Language, System Time, Screen Resolution, etc.
  - Network Information API
  - Battery Status API
The items I listed are, of course, just a few possible ways a user can be identified uniquely. There are many more.
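As a rough illustration of how a Data Profile might start, here is a minimal sketch that collects a few server-side data points and reports which of them two requests share. The function names `buildProfile` and `matchingFields` and the choice of fields are assumptions made for this example; the `$_SERVER`-style keys are the standard PHP ones. Headers such as X-Forwarded-For are client-supplied and easy to spoof, so treat them as weak evidence only.

```php
<?php
declare(strict_types=1);

/**
 * Collect a few request-level data points into a profile array.
 * (Hypothetical helper; the selected fields are only an example.)
 */
function buildProfile(array $server): array
{
    return [
        'ip'         => $server['REMOTE_ADDR'] ?? '',
        'forwarded'  => $server['HTTP_X_FORWARDED_FOR'] ?? '', // client-supplied, spoofable
        'user_agent' => $server['HTTP_USER_AGENT'] ?? '',
        'language'   => $server['HTTP_ACCEPT_LANGUAGE'] ?? '',
        'encoding'   => $server['HTTP_ACCEPT_ENCODING'] ?? '',
    ];
}

/**
 * Return the names of the fields that are non-empty and identical in both profiles.
 */
function matchingFields(array $a, array $b): array
{
    $same = [];
    foreach ($a as $key => $value) {
        if ($value !== '' && isset($b[$key]) && $b[$key] === $value) {
            $same[] = $key;
        }
    }
    return $same;
}

// Two requests from the same address, made with different browsers.
$first = buildProfile([
    'REMOTE_ADDR'          => '203.0.113.7',
    'HTTP_USER_AGENT'      => 'BrowserA/1.0',
    'HTTP_ACCEPT_LANGUAGE' => 'en-GB',
]);
$second = buildProfile([
    'REMOTE_ADDR'          => '203.0.113.7',
    'HTTP_USER_AGENT'      => 'BrowserB/2.0',
    'HTTP_ACCEPT_LANGUAGE' => 'en-GB',
]);

echo implode(', ', matchingFields($first, $second)), PHP_EOL;
```

In a real request you would pass `$_SERVER` to `buildProfile`. The shared fields are the raw material for the probability or scoring steps that follow.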
With this set of Random Data elements to build a Data Profile from, what's next?
The next step is to develop some Fuzzy Logic, or, better yet, an Artificial Neural Network (which uses fuzzy logic). In either case, the idea is to train your system, and then combine its training with Bayesian Inference to increase the accuracy of your results.
The NeuralMesh library for PHP allows you to generate Artificial Neural Networks. To implement Bayesian Inference, check out the following links:
- Implement Bayesian inference using PHP, Part 1
- Implement Bayesian inference using PHP, Part 2
- Implement Bayesian inference using PHP, Part 3
At this point, you may be thinking:
Why so much Math and Logic for a seemingly simple task?
Basically, because it is not a simple task. What you are trying to achieve is, in fact, Pure Probability. For example, given the following known users:
User1 = A + B + C + D + G + K
User2 = C + D + I + J + K + F
When you receive the following data:
B + C + E + G + F + K
The question which you are essentially asking is:
What is the probability that the received data (B + C + E + G + F + K) is actually User1 or User2? And which of those two matches is most probable?
In order to effectively answer this question, you need to understand Frequency vs Probability Format and why Joint Probability might be a better approach. The details are too much to get into here (which is why I'm giving you links), but a good example would be a Medical Diagnosis Wizard Application, which uses a combination of symptoms to identify possible diseases.
Think for a moment of the series of data points which comprise your Data Profile (B + C + E + G + F + K in the example above) as Symptoms, and Unknown Users as Diseases. By identifying the disease, you can further identify an appropriate treatment (treat this user as User1).
Obviously, a Disease for which we have identified more than 1 Symptom is easier to identify. In fact, the more Symptoms we can identify, the easier and more accurate our diagnosis is almost certain to be.
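To make the User1/User2 question concrete, here is a minimal naive Bayes sketch. It is not the method from the linked articles: it assumes equal priors for both users, independent data points, and two made-up likelihoods (`$hit` = 0.9 for observing a data point that is in a user's profile, `$noise` = 0.1 for observing one that is not). The function name `posterior` and the feature universe A to K are also assumptions for this example.

```php
<?php
declare(strict_types=1);

/**
 * Posterior probability of each known user given the observed data points,
 * under equal priors and independent features (naive Bayes).
 *
 * $hit   = assumed P(data point observed | it is in the user's profile)
 * $noise = assumed P(data point observed | it is not in the user's profile)
 */
function posterior(array $users, array $observed, array $universe,
                   float $hit = 0.9, float $noise = 0.1): array
{
    $likelihood = [];
    foreach ($users as $name => $profile) {
        $p = 1.0;
        foreach ($universe as $feature) {
            $expected = in_array($feature, $profile, true);
            $seen     = in_array($feature, $observed, true);
            $pSeen    = $expected ? $hit : $noise;
            // Multiply in P(seen) or P(not seen) for this feature.
            $p *= $seen ? $pSeen : 1.0 - $pSeen;
        }
        $likelihood[$name] = $p;
    }
    // Equal priors cancel out, so normalising the likelihoods gives the posterior.
    $total = array_sum($likelihood);
    return array_map(fn(float $p): float => $p / $total, $likelihood);
}

$universe = range('A', 'K');
$users = [
    'User1' => ['A', 'B', 'C', 'D', 'G', 'K'],
    'User2' => ['C', 'D', 'I', 'J', 'K', 'F'],
];
$result = posterior($users, ['B', 'C', 'E', 'G', 'F', 'K'], $universe);

foreach ($result as $name => $p) {
    printf("%s: %.4f\n", $name, $p);
}
// User1: 0.9878
// User2: 0.0122
```

User1 agrees with the received data on 7 of the 11 possible data points and User2 on only 5, so under these assumed likelihoods User1 is 81 times more likely. Real likelihoods would have to be estimated per data point from your own traffic.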
Are there any other alternatives?
Of course. As an alternative measure, you might create your own simple scoring algorithm, and base it on exact matches. This is not as accurate as a probabilistic approach, but may be simpler for you to implement.
As an example, consider this simple score chart:
    +-------------------------+--------+------------+
    | Property                | Weight | Importance |
    +-------------------------+--------+------------+
    | Real IP address         |     60 |          5 |
    | Used proxy IP address   |     40 |          4 |
    | HTTP Cookies            |     80 |          8 |
    | Session Cookies         |     80 |          6 |
    | 3rd Party Cookies       |     60 |          4 |
    | Flash Cookies           |     90 |          7 |
    | PDF Bug                 |     20 |          1 |
    | Flash Bug               |     20 |          1 |
    | Java Bug                |     20 |          1 |
    | Frequent Pages          |     40 |          1 |
    | Browsers Finger Print   |     35 |          2 |
    | Installed Plugins       |     25 |          1 |
    | Cached Images           |     40 |          3 |
    | URL                     |     60 |          4 |
    | System Fonts Detection  |     70 |          4 |
    | Localstorage            |     90 |          8 |
    | Geolocation             |     70 |          6 |
    | AOLTR                   |     70 |          4 |
    | Network Information API |     40 |          3 |
    | Battery Status API      |     20 |          1 |
    +-------------------------+--------+------------+
For each piece of information which you can gather on a given request, award the associated score, then use Importance to resolve conflicts when scores are the same.
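A minimal sketch of that scoring idea follows, using a subset of the chart. The function names `scoreCandidate` and `rankCandidates` are hypothetical, and summing the Importance values to break ties is one possible reading of "use Importance to resolve conflicts", not the only one.

```php
<?php
declare(strict_types=1);

// Subset of the score chart: property => [weight, importance].
$chart = [
    'Real IP address'        => [60, 5],
    'HTTP Cookies'           => [80, 8],
    'Flash Cookies'          => [90, 7],
    'System Fonts Detection' => [70, 4],
    'Localstorage'           => [90, 8],
    'Geolocation'            => [70, 6],
];

/** Sum weight and importance over the properties that matched exactly. */
function scoreCandidate(array $matched, array $chart): array
{
    $score = 0;
    $importance = 0;
    foreach ($matched as $property) {
        if (!array_key_exists($property, $chart)) {
            continue; // unknown properties contribute nothing
        }
        [$weight, $imp] = $chart[$property];
        $score += $weight;
        $importance += $imp;
    }
    return ['score' => $score, 'importance' => $importance];
}

/** Rank candidates by score, then by importance (both descending). */
function rankCandidates(array $candidates, array $chart): array
{
    $ranked = [];
    foreach ($candidates as $name => $matched) {
        $ranked[] = ['name' => $name] + scoreCandidate($matched, $chart);
    }
    usort($ranked, fn(array $a, array $b): int =>
        [$b['score'], $b['importance']] <=> [$a['score'], $a['importance']]);
    return $ranked;
}

// Properties of the incoming request that exactly match each stored candidate.
$ranked = rankCandidates([
    'Z' => ['Real IP address'],
    'Y' => ['Localstorage', 'System Fonts Detection'],
    'X' => ['Flash Cookies', 'Geolocation'],
], $chart);

foreach ($ranked as $row) {
    printf("%s score=%d importance=%d\n", $row['name'], $row['score'], $row['importance']);
}
// X score=160 importance=13
// Y score=160 importance=12
// Z score=60 importance=5
```

X and Y tie on score (160), so the summed importance (13 versus 12) decides the order.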
Proof of Concept
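For a simple proof of concept, please take a look at the Perceptron. The Perceptron is an RNA model that is generally used in pattern recognition applications (RNA here presumably stands for the Portuguese "Rede Neural Artificial", i.e. an artificial neural network; the code below keeps that abbreviation). There is even an old PHP class which implements it perfectly, but you would likely need to modify it for your purposes.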
Despite being a great tool, Perceptron can still return multiple results (possible matches), so using a Score and Difference comparison is still useful to identify the best of those matches.
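For orientation, here is a minimal sketch of the textbook perceptron learning rule. It is not the modified class used in this answer; the function names `trainPerceptron` and `predict`, the learning rate and the epoch limit are all assumptions for the example. As in the proof of concept, it is trained on one pattern (target 1) and its negation (target -1).

```php
<?php
declare(strict_types=1);

/**
 * Textbook perceptron training.
 * $samples is a list of [inputs, target] pairs, with target +1 or -1.
 * Returns ['w' => weights, 'b' => bias].
 */
function trainPerceptron(array $samples, float $rate = 1.0, int $maxEpochs = 100): array
{
    $w = array_fill(0, count($samples[0][0]), 0.0);
    $b = 0.0;
    for ($epoch = 0; $epoch < $maxEpochs; $epoch++) {
        $errors = 0;
        foreach ($samples as [$x, $target]) {
            if (predict($w, $b, $x) !== $target) {
                // Misclassified: move the weights towards the target.
                foreach ($x as $i => $xi) {
                    $w[$i] += $rate * $target * $xi;
                }
                $b += $rate * $target;
                $errors++;
            }
        }
        if ($errors === 0) {
            break; // every sample is classified correctly
        }
    }
    return ['w' => $w, 'b' => $b];
}

/** Classify one input vector as +1 or -1. */
function predict(array $w, float $b, array $x): int
{
    $sum = $b;
    foreach ($x as $i => $xi) {
        $sum += $w[$i] * $xi;
    }
    return $sum > 0 ? 1 : -1;
}

$pattern = [1, -1, 1, -1];                                  // the unknown user
$negated = array_map(fn(int $v): int => -$v, $pattern);
$model = trainPerceptron([[$pattern, 1], [$negated, -1]]);

var_dump(predict($model['w'], $model['b'], [1, -1, 1, 1]));   // int(1): differs in one feature
var_dump(predict($model['w'], $model['b'], [-1, 1, -1, 1]));  // int(-1): differs in three features
```

A candidate is accepted whenever the weighted sum is positive, which is why several stored users can pass the test at once.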
Assumptions
- Store all possible information about each user (IP, cookies, etc.)
- Where result is an exact match, increase score by 1
- Where result is not an exact match, decrease score by 1
Expectation
- Generate RNA labels
- Generate random users emulating a database
- Generate a single Unknown user
- Generate Unknown user RNA and Values
- The system will merge RNA information and teach the Perceptron
- After training the Perceptron, the system will have a set of weightings
- You can now test the Unknown user's pattern and the Perceptron will produce a result set.
- Store all Positive matches
- Sort the matches first by Score, then by Difference (as described above)
- Output the two closest matches, or, if no matches are found, output empty results
Code for Proof of Concept
    $features = array(
        'Real IP address' => .5,
        'Used proxy IP address' => .4,
        'HTTP Cookies' => .9,
        'Session Cookies' => .6,
        '3rd Party Cookies' => .6,
        'Flash Cookies' => .7,
        'PDF Bug' => .2,
        'Flash Bug' => .2,
        'Java Bug' => .2,
        'Frequent Pages' => .3,
        'Browsers Finger Print' => .3,
        'Installed Plugins' => .2,
        'URL' => .5,
        'Cached PNG' => .4,
        'System Fonts Detection' => .6,
        'Localstorage' => .8,
        'Geolocation' => .6,
        'AOLTR' => .4,
        'Network Information API' => .3,
        'Battery Status API' => .2
    );

    // Get RNA labels (x1 .. x20)
    $labels = array();
    $n = 1;
    foreach ($features as $k => $v) {
        $labels[$k] = "x" . $n;
        $n++;
    }

    // Create users (named A to E) with random profiles
    $users = array();
    for ($i = 0, $name = "A"; $i < 5; $i++, $name++) {
        $users[] = new Profile($name, $features);
    }

    // Generate unknown user
    $unknown = new Profile("Unknown", $features);

    // Generate unknown RNA: row 0 is the pattern (desired output 1),
    // row 1 is its negation (desired output -1)
    $unknownRNA = array(
        0 => array("o" => 1),
        1 => array("o" => -1)
    );

    // Create RNA values
    foreach ($unknown->data as $item => $point) {
        $unknownRNA[0][$labels[$item]] = $point;
        $unknownRNA[1][$labels[$item]] = (-1 * $point);
    }

    // Start Perceptron class
    $perceptron = new Perceptron();

    // Train results
    $trainResult = $perceptron->train($unknownRNA, 1, 1);

    // Find matches
    $matchs = array();
    foreach ($users as $profile) {
        // Use shorter labels
        $data = array_combine($labels, $profile->data);
        if ($perceptron->testCase($data, $trainResult) == true) {
            $score = $diff = 0;

            // Determine the score and difference
            foreach ($unknown->data as $item => $found) {
                if ($unknown->data[$item] === $profile->data[$item]) {
                    if ($profile->data[$item] > 0) {
                        // Feature present in both profiles
                        $score += $features[$item];
                    } else {
                        // Feature absent in both profiles
                        $diff += $features[$item];
                    }
                }
            }
            // Set score and diff
            $profile->setScore($score, $diff);
            $matchs[] = $profile;
        }
    }

    // Sort based on score and output
    // (the original tested count($matchs) > 1, which reported a single match as no match)
    if (count($matchs) > 0) {
        usort($matchs, function ($a, $b) {
            // If score is the same use difference
            if ($a->score == $b->score) {
                // The lower the difference the better
                return $a->diff == $b->diff ? 0 : ($a->diff > $b->diff ? 1 : -1);
            }
            // The higher the score the better
            return $a->score > $b->score ? -1 : 1;
        });

        echo "<br />Possible Match ", implode(",", array_slice(array_map(function ($v) {
            return sprintf(" %s (%0.4f|%0.4f) ", $v->name, $v->score, $v->diff);
        }, $matchs), 0, 2));
    } else {
        echo "<br />No match Found ";
    }
Output:
    Possible Match D (0.7416|0.1685),C (0.5393|0.2809)
Print_r of "D":
    echo "<pre>";
    print_r($matchs[0]);

    Profile Object
    (
        [name] => D
        [data] => Array
            (
                [Real IP address] => -1
                [Used proxy IP address] => -1
                [HTTP Cookies] => 1
                [Session Cookies] => 1
                [3rd Party Cookies] => 1
                [Flash Cookies] => 1
                [PDF Bug] => 1
                [Flash Bug] => 1
                [Java Bug] => -1
                [Frequent Pages] => 1
                [Browsers Finger Print] => -1
                [Installed Plugins] => 1
                [URL] => -1
                [Cached PNG] => 1
                [System Fonts Detection] => 1
                [Localstorage] => -1
                [Geolocation] => -1
                [AOLTR] => 1
                [Network Information API] => -1
                [Battery Status API] => -1
            )

        [score] => 0.74157303370787
        [diff] => 0.1685393258427
        [base] => 8.9
    )
If Debug = true you would be able to see Input (Sensor & Desired), Initial Weights, Output (Sensor, Sum, Network), Error, Correction and Final Weights.
    +----+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+----+---------+---------+---------+---------+---------+---------+---------+---------+---------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------+
    | o  | x1 | x2 | x3 | x4 | x5 | x6 | x7 | x8 | x9 | x10 | x11 | x12 | x13 | x14 | x15 | x16 | x17 | x18 | x19 | x20 | Bias | Yin | Y  | deltaW1 | deltaW2 | deltaW3 | deltaW4 | deltaW5 | deltaW6 | deltaW7 | deltaW8 | deltaW9 | deltaW10 | deltaW11 | deltaW12 | deltaW13 | deltaW14 | deltaW15 | deltaW16 | deltaW17 | deltaW18 | deltaW19 | deltaW20 | W1 | W2 | W3 | W4 | W5 | W6 | W7 | W8 | W9 | W10 | W11 | W12 | W13 | W14 | W15 | W16 | W17 | W18 | W19 | W20 | deltaBias |
    +----+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+----+---------+---------+---------+---------+---------+---------+---------+---------+---------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------+
    | 1  | 1  | -1 | -1 | -1 | -1 | -1 | -1 | 1  | 1  | 1   | 1   | 1   | 1   | 1   | -1  | -1  | -1  | -1  | 1   | 1   | 1    | 0   | -1 | 0       | -1      | -1      | -1      | -1      | -1      | -1      | 1       | 1       | 1        | 1        | 1        | 1        | 1        | -1       | -1       | -1       | -1       | 1        | 1        | 0  | -1 | -1 | -1 | -1 | -1 | -1 | 1  | 1  | 1   | 1   | 1   | 1   | 1   | -1  | -1  | -1  | -1  | 1   | 1   | 1         |
    | -1 | -1 | 1  | 1  | 1  | 1  | 1  | 1  | -1 | -1 | -1  | -1  | -1  | -1  | -1  | 1   | 1   | 1   | 1   | -1  | -1  | 1    | -19 | -1 | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0  | -1 | -1 | -1 | -1 | -1 | -1 | 1  | 1  | 1   | 1   | 1   | 1   | 1   | -1  | -1  | -1  | -1  | 1   | 1   | 1         |
    | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --  | --  | --  | --  | --  | --  | --  | --  | --  | --  | --  | --   | --  | -- | --      | --      | --      | --      | --      | --      | --      | --      | --      | --       | --       | --       | --       | --       | --       | --       | --       | --       | --       | --       | -- | -- | -- | -- | -- | -- | -- | -- | -- | --  | --  | --  | --  | --  | --  | --  | --  | --  | --  | --  | --        |
    | 1  | 1  | -1 | -1 | -1 | -1 | -1 | -1 | 1  | 1  | 1   | 1   | 1   | 1   | 1   | -1  | -1  | -1  | -1  | 1   | 1   | 1    | 19  | 1  | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0  | -1 | -1 | -1 | -1 | -1 | -1 | 1  | 1  | 1   | 1   | 1   | 1   | 1   | -1  | -1  | -1  | -1  | 1   | 1   | 1         |
    | -1 | -1 | 1  | 1  | 1  | 1  | 1  | 1  | -1 | -1 | -1  | -1  | -1  | -1  | -1  | 1   | 1   | 1   | 1   | -1  | -1  | 1    | -19 | -1 | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0       | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0        | 0  | -1 | -1 | -1 | -1 | -1 | -1 | 1  | 1  | 1   | 1   | 1   | 1   | 1   | -1  | -1  | -1  | -1  | 1   | 1   | 1         |
    | -- | -- | -- | -- | -- | -- | -- | -- | -- | -- | --  | --  | --  | --  | --  | --  | --  | --  | --  | --  | --  | --   | --  | -- | --      | --      | --      | --      | --      | --      | --      | --      | --      | --       | --       | --       | --       | --       | --       | --       | --       | --       | --       | --       | -- | -- | -- | -- | -- | -- | -- | -- | -- | --  | --  | --  | --  | --  | --  | --  | --  | --  | --  | --  | --        |
    +----+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+----+---------+---------+---------+---------+---------+---------+---------+---------+---------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----------+----+----+----+----+----+----+----+----+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----------+
x1 to x20 represent the features converted by the code.
    // Get RNA Labels
    $labels = array();
    $n = 1;
    foreach ($features as $k => $v) {
        $labels[$k] = "x" . $n;
        $n++;
    }
Here is an online demo
Class Used:
    class Profile {
        public $name, $data = array(), $score, $diff, $base;

        function __construct($name, array $importance) {
            $values = array(-1, 1); // Perceptron values
            $this->name = $name;
            foreach ($importance as $item => $point) {
                // Generate a random true/false (1 / -1) value for each item
                $this->data[$item] = $values[mt_rand(0, 1)];
            }
            $this->base = array_sum($importance);
        }

        public function setScore($score, $diff) {
            // Normalise both values by the sum of all feature weights
            $this->score = $score / $this->base;
            $this->diff = $diff / $this->base;
        }
    }
Modified Perceptron Class
    class Perceptron {
        private $w = array();
        private $dw = array();
        public $debug = false;

        private function initialize($colums) {
            // Initialize perceptron vars
            for ($i = 1; $i <= $colums; $i++) {
                // weighting vars
                $this->w[$i] = 0;
                $this->dw[$i] = 0;
            }
        }

        // Note: $alpha and $teta are accepted but never used below;
        // the activation threshold is hard-coded to 1.
        function train($input, $alpha, $teta) {
            $colums = count($input[0]) - 1;
            $weightCache = array_fill(1, $colums, 0);
            $checkpoints = array();
            $keepTrainning = true;
            $debug_vars = array();

            // Initialize RNA vars
            $this->initialize(count($input[0]) - 1);
            $just_started = true;
            $totalRun = 0;
            $yin = 0;

            // Trains RNA until it gets stable
            while ($keepTrainning == true) {
                // Sweeps each row of the input subject
                foreach ($input as $row_counter => $row_data) {
                    // Finds out the number of columns the input has
                    $n_columns = count($row_data) - 1;

                    // Calculates Yin
                    $yin = 0;
                    for ($i = 1; $i <= $n_columns; $i++) {
                        $yin += $row_data["x" . $i] * $weightCache[$i];
                    }

                    // Calculates real output
                    $Y = ($yin <= 1) ? -1 : 1;

                    // Sweeps columns ...
                    $checkpoints[$row_counter] = 0;
                    for ($i = 1; $i <= $n_columns; $i++) {
                        /** DELTAS **/
                        if ($just_started == true) {
                            // Is it the first row? The flag is cleared right away,
                            // so only the first column of the first row takes this branch.
                            $this->dw[$i] = $weightCache[$i];
                            $just_started = false;
                        } elseif ($Y == $row_data["o"]) {
                            // Found desired output?
                            $this->dw[$i] = 0;
                        } else {
                            // Calculates Delta Ws
                            $this->dw[$i] = $row_data["x" . $i] * $row_data["o"];
                        }

                        /** WEIGHTS **/
                        // Calculate weights
                        $this->w[$i] = $this->dw[$i] + $weightCache[$i];
                        $weightCache[$i] = $this->w[$i];

                        /** CHECK-POINT **/
                        $checkpoints[$row_counter] += $this->w[$i];
                    } // END - for

                    foreach ($this->w as $index => $w_item) {
                        $debug_w["W" . $index] = $w_item;
                        $debug_dw["deltaW" . $index] = $this->dw[$index];
                    }

                    // Special for script debugging
                    $debug_vars[] = array_merge($row_data, array(
                        "Bias" => 1,
                        "Yin" => $yin,
                        "Y" => $Y
                    ), $debug_dw, $debug_w, array(
                        "deltaBias" => 1
                    ));
                } // END - foreach

                // Special for script debugging
                $empty_data_row = array();
                for ($i = 1; $i <= $n_columns; $i++) {
                    $empty_data_row["x" . $i] = "--";
                    $empty_data_row["W" . $i] = "--";
                    $empty_data_row["deltaW" . $i] = "--";
                }
                $debug_vars[] = array_merge($empty_data_row, array(
                    "o" => "--",
                    "Bias" => "--",
                    "Yin" => "--",
                    "Y" => "--",
                    "deltaBias" => "--"
                ));

                // Counts training times
                $totalRun++;

                // Now checks if the RNA is stable already
                $referer_value = end($checkpoints);
                // if all rows match the desired output ...
                $sum = array_sum($checkpoints);
                $n_rows = count($checkpoints);
                if ($totalRun > 1 && ($sum / $n_rows) == $referer_value) {
                    $keepTrainning = false;
                }
            } // END - while

            // Prepares the final result
            $result = array();
            for ($i = 1; $i <= $n_columns; $i++) {
                $result["w" . $i] = $this->w[$i];
            }

            $this->debug($this->print_html_table($debug_vars));

            return $result;
        } // END - train

        function testCase($input, $results) {
            // Sweeps input columns
            $result = 0;
            $i = 1;
            foreach ($input as $column_value) {
                // Calculates the test Y
                $result += $results["w" . $i] * $column_value;
                $i++;
            }
            // The case matches when the weighted sum is positive
            return $result > 0;
        } // END - testCase

        // Returns the HTML code of a table built from an array of hash arrays
        function print_html_table($array) {
            $html = "";
            $inner_html = "";
            $table_header_composed = false;
            $table_header = array();

            // Builds table contents
            foreach ($array as $array_item) {
                $inner_html .= "<tr> ";
                foreach ($array_item as $array_col_label => $array_col) {
                    $inner_html .= "<td> ";
                    $inner_html .= $array_col;
                    $inner_html .= "</td> ";

                    if ($table_header_composed == false) {
                        $table_header[] = $array_col_label;
                    }
                }
                $table_header_composed = true;
                $inner_html .= "</tr> ";
            }

            // Builds full table
            $html = "<table border=1> ";
            $html .= "<tr> ";
            foreach ($table_header as $table_header_item) {
                $html .= "<td> ";
                $html .= "<b>" . $table_header_item . "</b>";
                $html .= "</td> ";
            }
            $html .= "</tr> ";

            $html .= $inner_html . "</table>";

            return $html;
        } // END - print_html_table

        // Debug function
        function debug($message) {
            if ($this->debug == true) {
                echo "<b>DEBUG:</b> $message";
            }
        } // END - debug
    } // END - class
Conclusion
Identifying a user without a Unique Identifier is not a straightforward or simple task. It depends on gathering a sufficient amount of Random Data from the user, which you can collect by a variety of methods.
Even if you choose not to use an Artificial Neural Network, I suggest at least using a Simple Probability Matrix with priorities and likelihoods - and I hope the code and examples provided above give you enough to go on.