Question
I would like a SQL query to return the top 5 most frequent itemsets containing a given tag (string). In my example below I used the Apriori algorithm, but I would also be happy to see any simpler alternative approaches using Postgres if that is hard to implement.
Data
I have the following tables:
table tag (
  id SERIAL PRIMARY KEY,
  label TEXT UNIQUE
);
table article_tag (
  article_id INT REFERENCES article(id) ON DELETE CASCADE ON UPDATE CASCADE,
  tag_id INT REFERENCES tag(id) ON DELETE CASCADE ON UPDATE CASCADE,
  PRIMARY KEY (article_id, tag_id)
);
table author (
  id SERIAL PRIMARY KEY,
  name TEXT UNIQUE
);
- tags are programming languages or technologies
- author is the writer of an article
- article_tag is a map of articles to tags
I have just started to use Postgres SQL and would like to replicate what I have done with this data using Python.
Python code:
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
tags = """1 python
2 azure-devops
3 lua"""
tagrows = [i.split() for i in tags.splitlines()]
df_tags = pd.DataFrame(tagrows,columns = ["id","label"])
df_tags["id"] = df_tags["id"].astype(int)
df_tags.set_index(df_tags.id,inplace=True)
article_map = """1 1
2 1
1 2
3 2"""
article_map = [i.split() for i in article_map.splitlines()]
df_article_map = pd.DataFrame(article_map,columns=["article_id","tag_id"])
df_article_map["tag_id"] = df_article_map["tag_id"].astype(int)
df_article_map["article_id"] = df_article_map["article_id"].astype(int)
df_article_map.set_index(df_article_map.tag_id,inplace=True)
merged = df_tags.merge(df_article_map,left_index=True, right_index=True)
merged_group = merged.groupby('article_id').label.apply(list).reset_index()
merged_group.set_index("article_id").sort_index()
tag_list_of_lists = merged_group.label.values.tolist()
te = TransactionEncoder()
te_ary = te.fit(tag_list_of_lists).transform(tag_list_of_lists)
df = pd.DataFrame(te_ary, columns=te.columns_)
frequent_itemsets = apriori(df, min_support=0.02, use_colnames=True)
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(lambda x: len(x))
frequent_itemsets
This allows me to find frequent itemsets in the dataframe with length > 1. (I have only added a minimal, reproducible example here so we don't get many frequent itemsets.)
string_hit = 'python'
frequent_itemsets[(frequent_itemsets.itemsets.apply(lambda x : string_hit in x) & (frequent_itemsets.length > 1) & (frequent_itemsets.support > 0.03))]
Output:
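As an aside, the "simpler alternative" I would also accept is just counting pairwise tag co-occurrences (length-2 itemsets), which is what a self-join on article_tag would compute in Postgres. A minimal Python sketch of that idea, using the same toy data as above (the variable names here are illustrative, not taken from my code):

```python
from collections import Counter
from itertools import combinations

# Toy data matching the example above: tag ids -> labels,
# and article ids -> the tag ids attached to each article.
tag_labels = {1: "python", 2: "azure-devops", 3: "lua"}
article_tags = {1: [1, 2], 2: [1], 3: [2]}

# Count every unordered pair of tags that co-occurs on an article.
pair_counts = Counter()
for tags in article_tags.values():
    labels = sorted(tag_labels[t] for t in tags)
    for pair in combinations(labels, 2):
        pair_counts[pair] += 1

# Top 5 pairs containing a given tag, ordered by co-occurrence count.
string_hit = "python"
top5 = [(pair, n) for pair, n in pair_counts.most_common()
        if string_hit in pair][:5]
print(top5)  # [(('azure-devops', 'python'), 1)]
```

Dividing each count by the number of articles would give the same support values that apriori reports for length-2 itemsets, so this only replaces the mlxtend step, not the result.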