用于在列中显示词频的sql server函数

2021-12-21 00:00:00 indexing string frequency text sql-server

我有一个表格,其中列出了一项调查中的自由文本输入,允许参与者输入他们的回答(关于他们希望在婚礼中使用的颜色)

I have a table that lists a freet text input from a survey where enterents were allowed to enter their responses (regarding colours they would like to have in their wedding)

我想写一个sql函数,把这个列的所有信息都收集起来,orders统计每个词出现的频率,按照这个计数对结果集进行排序.

I would like to write a sql function that gathers all the information from this column, and orders counts the frequency of each word, ordering the result set by this count.

Response
--------
Red and White
green
White and blue
Blue
Dark blue

我希望上表的顺序如下

Response  Frequency
--------  ---------
Blue      3
White     2
And       2
Red       1
Green     1

我可以在函数运行后去掉所有像and"这样的垃圾词.有谁知道产生这种行为的任何好的功能?

I can strip all the rubbish words like "and" after the function has run. Does anyone know any good functions that produce this behaviour?

推荐答案

好的,这是一种享受.首先是一个分离值的函数......

Okay this works a treat. Firstly a function to separate the values...

Alter Function dbo.SeparateValues    

(    
 @data VARCHAR(MAX),    
 @delimiter VARCHAR(10)     
)     
RETURNS     
@tbldata TABLE(col VARCHAR(MAX))    
As    
--Declare @data VARCHAR(MAX) ,@delimiter VARCHAR(10)     
--Declare @tbldata TABLE(col VARCHAR(10))    
--Set @data = 'hello,how,are,you?,234234'    
--Set @delimiter = ','    
--DECLARE @tbl TABLE(col VARCHAR(10))    
Begin    
DECLARE @pos INT    
DECLARE @prevpos INT    
SET @pos = 1     
SET @prevpos = 0    

WHILE @pos > 0     
BEGIN    
SET @pos = CHARINDEX(@delimiter, @data, @prevpos+1)    
if @pos > 0     
INSERT INTO @tbldata(col) VALUES(LTRIM(RTRIM(SUBSTRING(@data, @prevpos+1, @pos-@prevpos-1))))    
else    
INSERT INTO @tbldata(col) VALUES(LTRIM(RTRIM(SUBSTRING(@data, @prevpos+1, len(@data)-@prevpos))))    
SET @prevpos = @pos     
End    

RETURN       
END    

然后我就将它应用到我的桌子上...

then I just apply it to my table...

Select Count(*), sep.Col FROM (
        Select * FROM (
            Select value = Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(response, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' '), '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), '  ', ' '), '  ', ' ')))) FROM Responses
        ) easyValues
        Where value <> '' 
    ) actualValues 
    Cross Apply dbo.SeparateValues(value, ' ') sep
    Group By sep.Col
    Order By Count(*) Desc

好的,所以我使用嵌套表进行了 OTT,但我已经去除了所有废话字符,分离了值并保留了最常用单词的运行总数.

Okay, so I went OTT with my nested tables, but I've stripped out all the crap characters, separated the values and kept a running total of the most frequently used words.

相关文章