China's Forbidden Mind-Reading AI Is Free
A single developer in China built an AI that analyzes sentiment from 1.4 billion people and put it on GitHub for free. This multi-agent system scrapes social media, debates its findings, and creates reports—but using it could be illegal.
The Tool That Shouldn't Be Free
Surveillance tech usually hides behind paywalls and procurement contracts. BettaFish, a public opinion analysis system built by a single Chinese student, lives on GitHub as a free download, source code and all. It promises sentiment insights on 1.4 billion people using the same techniques that governments and marketing giants pay serious money for.
The repository has exploded past 30,000 stars, a signal that the global dev community is not just curious but actively fascinated. Stars on GitHub are a crude metric, but crossing that threshold puts BettaFish in the same popularity tier as mainstream frameworks and tools, not niche research projects. This is a surveillance-grade experiment with the engagement of a front-page JavaScript library.
BettaFish scrapes Chinese social platforms at scale—TikTok’s China twin Douyin, Weibo, Zhihu, and others—then tries to answer questions like “What do Chinese people really think of Donald Trump, Marvel films, or Apple?” Reports floating around the web show it surfacing soybean price panics among elderly WeChat users, lukewarm sentiment toward Marvel, and distrust of Apple over defective batteries. It reads less like a toy and more like a turnkey population sentiment dashboard.
That power triggers immediate legal and ethical alarms. The system leans on aggressive web scraping, a custom “mind spider” crawler, and analysis of content that users never consented to feed into a mass opinion engine. In jurisdictions with personal information protection laws—from China’s PIPL to the EU’s GDPR—running BettaFish at full tilt could move from gray area to outright violation fast.
Under the hood, this is not a single Python script wired to an API. BettaFish runs as a multi-agent architecture orchestrated by a Python Flask backend, with separate agents for insights, media, and web queries. A crawler fills MySQL or Postgres databases with posts tagged by hotness scores and sentiment, turning chaotic social chatter into structured fuel.
Those agents do not just dump data; they argue. A forum-style coordination layer has LLMs moderate a debate between agents, force them to reconcile conflicting evidence, and then pass everything to a report generator. The result: polished, narrative-style opinion reports that feel uncomfortably close to mind reading at national scale.
Decoding the 'Mind-Reading' Engine
Mind-reading sounds dramatic, but BettaFish (Weiyu) is, at its core, a highly automated public opinion analysis engine. It does not peer into brains; it peers into feeds, comments, and repost chains, then turns that chaos into structured reports about what people appear to think.
Built by a single Chinese student and released on GitHub, BettaFish behaves more like a full-blown in-house analytics platform than a side project. Its design assumes access to data at Chinese social scale, targeting a population of roughly 1.4 billion people whose digital traces run through a handful of dominant apps.
The name itself is a mission statement. “Weiyu” comes from a Chinese phrase meaning “small but powerful,” a nod to both the tiny dev team (one person) and the outsized leverage of pointing industrial-strength AI at public chatter.
BettaFish’s primary job: scrape, process, and synthesize sentiment from Chinese social media on any topic a user can type. Ask what Chinese users think of Donald Trump, Marvel films, or Apple, and the system assembles a dossier from platforms like Douyin, Weibo, and Zhihu.
Under the hood, a Python Flask orchestrator receives a natural-language question and fans it out to multiple AI agents. A crawler runs continuously in the background, dumping posts, comments, and engagement metrics into MySQL or Postgres, tagging each entry with a hotness score and sentiment label.
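The fan-out pattern described above can be sketched in a few lines. This is a minimal illustration, not BettaFish's actual code: the real system runs a Flask web app and LLM-backed agents, while the agent functions and return shapes here are invented stand-ins, and the parallelism uses a plain thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

# Stub agents standing in for BettaFish's insight/media/query agents.
# Names and return shapes are illustrative, not the project's actual API.
def insight_agent(question):
    return {"agent": "insight", "finding": f"SQL stats for: {question}"}

def media_agent(question):
    return {"agent": "media", "finding": f"video/image signals for: {question}"}

def query_agent(question):
    return {"agent": "query", "finding": f"news coverage for: {question}"}

AGENTS = [insight_agent, media_agent, query_agent]

def orchestrate(question):
    """Fan a natural-language question out to all agents in parallel
    and collect their findings for the downstream debate stage."""
    with ThreadPoolExecutor(max_workers=len(AGENTS)) as pool:
        futures = [pool.submit(agent, question) for agent in AGENTS]
        return [f.result() for f in futures]

findings = orchestrate("What do Chinese users think of Apple?")
```

The key design point survives the simplification: the orchestrator never answers anything itself, it only routes the question and gathers structured findings.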
Where typical “social listening” tools stop at dashboards and keyword clouds, BettaFish keeps going. It spins up three main agents in parallel:
- An insight agent that analyzes local or private databases via generated SQL
- A media agent that inspects images and video using Playwright and multimodal models
- A query agent that scans news and broader web content
Those agents do not just aggregate; they argue. A dedicated forum engine forces them into an AI-moderated debate, with a large language model pushing for evidence, resolving contradictions, and reconciling outlier takes before anything reaches the user.
Finally, a report agent distills the debate into narrative form: charts of sentiment, breakdowns by demographic proxy, recurring themes like soybean prices or battery defects. That automated argument-to-report pipeline is what elevates BettaFish far beyond standard analytics dashboards.
Inside the AI Agent Hivemind
Queries into BettaFish do not hit a model first; they hit infrastructure. A user’s question lands on a Python Flask Orchestrator, a slim web app that behaves like an air traffic controller for everything that follows. It parses intent, fans the request out to multiple agents, and keeps track of which subsystem is still thinking.
From there, three primary AI agents spin up in parallel, each pointed at a different slice of reality. The Insight Agent talks directly to structured data, generating SQL to interrogate MySQL or Postgres tables filled with scraped posts, hotness scores, and sentiment labels. It behaves like an automated data analyst, turning a natural‑language prompt into JOINs, filters, and aggregations.
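A rough sketch of that automated-data-analyst behavior, with SQLite standing in for the MySQL/Postgres store and a fixed query template standing in for the SQL an LLM would generate. The schema mirrors the article's description (posts tagged with hotness scores and sentiment labels), but every column and row here is invented for illustration.

```python
import sqlite3

# SQLite stands in for MySQL/Postgres; in BettaFish an LLM writes the SQL.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE posts (
    platform TEXT, content TEXT, hotness REAL, sentiment TEXT)""")
conn.executemany(
    "INSERT INTO posts VALUES (?, ?, ?, ?)",
    [("weibo", "battery died again", 0.9, "negative"),
     ("zhihu", "solid supply chain", 0.4, "positive"),
     ("douyin", "overpriced repairs", 0.7, "negative")])

def insight_agent(topic_filter):
    """Aggregate sentiment counts weighted by hotness -- the kind of
    JOIN/filter/aggregate query a generated SQL statement might run."""
    rows = conn.execute(
        """SELECT sentiment, COUNT(*), SUM(hotness)
           FROM posts WHERE content LIKE ?
           GROUP BY sentiment ORDER BY SUM(hotness) DESC""",
        (f"%{topic_filter}%",)).fetchall()
    return [{"sentiment": s, "posts": n, "weight": w} for s, n, w in rows]

summary = insight_agent("")  # empty filter matches everything
```

Swapping the template for LLM-generated SQL is the only conceptual jump between this toy and the described agent.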
Running beside it, the Media Agent focuses on the visual firehose. Using Playwright to drive headless browsers, it loads pages from platforms like Douyin or Weibo, captures frames, and hands images or video snippets to multimodal models for classification, OCR, and sentiment. In theory, it can tell you not just what users wrote about Trump, but how protest signs looked, how often Apple logos show up, or which Marvel scenes go viral.
The third pillar, the Query Agent, acts as a networked researcher. It hits web and news search APIs, pulls in coverage from state media, independent outlets, and forums, then summarizes and normalizes those sources into something the other agents can cross‑reference. Together, the trio can answer a single question by triangulating databases, social feeds, and the broader web at once.
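The "normalize for cross-referencing" step might look like the following sketch. The source names, field names, and common schema are assumptions for illustration; the article only says the Query Agent summarizes and normalizes heterogeneous web and news results into something the other agents can consume.

```python
def normalize(source, record):
    """Map heterogeneous news/search results into one common schema
    the other agents can cross-reference. Field names are illustrative."""
    if source == "news_api":
        return {"title": record["headline"], "url": record["link"],
                "outlet": record.get("publisher", "unknown")}
    if source == "web_search":
        return {"title": record["title"], "url": record["href"],
                "outlet": record.get("site", "unknown")}
    raise ValueError(f"unknown source: {source}")

raw_results = [
    ("news_api", {"headline": "Apple expands in China",
                  "link": "https://example.com/a", "publisher": "state-media"}),
    ("web_search", {"title": "Forum thread on iPhone batteries",
                    "href": "https://example.com/b"}),
]
normalized = [normalize(src, rec) for src, rec in raw_results]
```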
Crucially, none of these agents hard‑code a favorite model. BettaFish uses a model‑agnostic design where each agent’s backend LLM is just a config entry: Gemini, GPT‑4, DeepSeek, Kimi, or open‑source models piped in through OpenRouter or direct APIs. The repo on GitHub explicitly treats models as interchangeable parts, not sacred dependencies.
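"Models as config entries" can be sketched like this. The config keys, model names, and `StubClient` class are all hypothetical; a real deployment would return an actual OpenAI, Gemini, or OpenRouter client where the stub sits.

```python
# Hypothetical per-agent model config, mirroring the claim that each
# agent's backend LLM is just a config entry. Model names are examples.
AGENT_MODELS = {
    "insight": {"provider": "openrouter", "model": "deepseek-chat"},
    "media":   {"provider": "google",     "model": "gemini-pro-vision"},
    "query":   {"provider": "openai",     "model": "gpt-4"},
}

class StubClient:
    """Stand-in for a real provider SDK."""
    def __init__(self, provider, model):
        self.provider, self.model = provider, model

    def complete(self, prompt):
        # A real client would call the provider's API here.
        return f"[{self.provider}:{self.model}] {prompt[:40]}"

def client_for(agent_name):
    cfg = AGENT_MODELS[agent_name]
    return StubClient(cfg["provider"], cfg["model"])

reply = client_for("query").complete("Summarize coverage of Apple in China")
```

Because agents only ever see the `complete()` interface, swapping a cheap open-source model in for bulk work and reserving GPT-4 for synthesis is a config change, not a code change.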
That modularity turns one student’s project into a kind of plug‑and‑play AI observability stack for public opinion. Swap in a cheaper open‑source model for bulk scraping, reserve GPT‑4 or Gemini for final synthesis, or specialize the Media Agent with a vision model tuned for memes. The official repository, BettaFish – Multi-Agent Public Opinion Analysis System on GitHub, documents how each component talks over HTTP and queues, so developers can bolt on new data sources, add more agents, or point the whole thing at a different country’s social networks without rewriting the core.
The Forum Where AI Agents Argue
Forget sentiment dashboards that just spit out charts. BettaFish’s ForumEngine turns its AI agents into a panel of quarrelsome analysts, forcing them to argue until they reach something like consensus.
Each agent walks into this virtual room with its own evidence stack. The Query Agent brings scraped news coverage and web articles, the Media Agent drags in screenshots, video transcripts, and comment threads, and the Insight Agent shows up with SQL‑pulled stats from local databases.
Instead of quietly merging outputs, ForumEngine runs a structured debate. Agents present claims, cite sources, and get interrogated when their conclusions clash with everyone else’s.
At the center sits a moderator LLM acting like a relentless editor. It checks whether an agent’s claim actually follows from its evidence, demands more samples when data looks thin, and pushes for clarification when two agents describe the same trend in opposite ways.
Imagine a query like: “What do Chinese users really think of Apple?” The Query Agent might surface neutral corporate news and a few positive profiles of Apple’s supply chain and iPhone launches from major outlets.
Meanwhile, the Media Agent is knee‑deep in Douyin and Weibo comments under iPhone teardown videos, where users complain about defective batteries, repair hassles, and nationalistic calls to buy domestic brands. Sentiment there skews sharply negative, especially among younger, tech‑savvy users.
ForumEngine notices the mismatch. The moderator LLM challenges the Query Agent: are its news sources over‑indexed on official media? It then asks the Media Agent whether the angry comments represent a broad trend or a niche subculture.
Agents respond by pulling more data. The Query Agent widens its search to include independent tech blogs and user forums; the Media Agent samples additional videos and different regions. Each round, the moderator summarizes points of agreement and flags unresolved conflicts.
Only after several of these cycles does ForumEngine allow a synthesis: for example, “state‑aligned news coverage remains cautiously positive on Apple’s economic role, while grassroots video comments show concentrated anger around batteries and pricing.”
Fueling the Machine: The Data Harvester
Fuel for this so‑called mind‑reading engine comes from a swarm of crawlers quietly combing through more than 30 social platforms. BettaFish points its custom “mind spider” at Chinese giants like Weibo, Douyin, and Xiaohongshu, plus forums, news sites, and smaller apps that collectively represent a user base well north of 1 billion people. The crawlers run continuously, not on‑demand, so the system always chews on fresh discourse.
Each crawler streams raw posts, comments, and metadata into a staging layer before anything touches an AI model. From there, standardized pipelines clean the text, normalize timestamps, and de‑duplicate viral reposts that would otherwise skew results. Only after this pass does content land in a structured MySQL or Postgres database, ready for instant querying.
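The staging-layer steps (normalize timestamps, de-duplicate viral reposts) can be sketched as below. Field names and the dedup key are assumptions, not BettaFish's actual schema.

```python
from datetime import datetime, timezone

def clean(raw_posts):
    """Normalize timestamps to UTC ISO-8601 and drop duplicate repost
    bodies before anything reaches the database."""
    seen, out = set(), []
    for post in raw_posts:
        key = post["text"].strip().lower()
        if key in seen:  # de-duplicate viral reposts that would skew results
            continue
        seen.add(key)
        ts = datetime.fromtimestamp(post["ts"], tz=timezone.utc)
        out.append({"text": post["text"].strip(), "ts": ts.isoformat()})
    return out

cleaned = clean([
    {"text": "Soybean oil is 105 yuan a barrel!", "ts": 1700000000},
    {"text": "soybean oil is 105 yuan a barrel! ", "ts": 1700000300},  # repost
    {"text": "Battery drained overnight again", "ts": 1700000600},
])
```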
BettaFish treats that database as its private firehose. Every row represents a post with author ID (usually pseudonymous), platform, engagement metrics, and language tags. By pre‑indexing this material, the system can answer a new query about “Donald Trump” or “Apple batteries” by hitting SQL, not by scraping the web in real time.
Before storage, each item passes through a hotness classifier that estimates how much oxygen a post is getting online. That score blends factors like:
- Raw views and likes
- Reposts, quote‑tweets, and comment velocity
- Platform‑specific boosts, such as trending lists or front‑page placement
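One plausible shape for such a blend, with invented weights: log-scaling keeps raw view counts from drowning out everything else, and the trending boost is additive. The article only names the input signals, so treat the formula itself as a guess.

```python
import math

def hotness(views, likes, reposts, comment_velocity, trending_boost=0.0):
    """Blend engagement signals into one hotness score.
    Weights and log-scaling are invented for illustration; the source
    only says the classifier mixes views/likes, repost and comment
    velocity, and platform boosts like trending-list placement."""
    engagement = math.log10(1 + views) + 2 * math.log10(1 + likes)
    spread = 3 * math.log10(1 + reposts) + comment_velocity
    return engagement + spread + trending_boost

quiet = hotness(views=500, likes=20, reposts=2, comment_velocity=0.1)
viral = hotness(views=2_000_000, likes=90_000, reposts=987_000,
                comment_velocity=5.0, trending_boost=2.0)
```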
Alongside hotness, a multilingual sentiment analysis layer assigns polarity and emotion labels. Chinese, English, and other languages route through configurable LLM or smaller sentiment models, producing tags like “strongly negative,” “sarcastic,” or “nationalistic pride.” These labels become first‑class columns in the database, not bolted‑on annotations.
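A sketch of the language-routing idea: pick a classifier per language, attach the resulting label as a real field before storage. The keyword lists stand in for what would actually be an LLM or dedicated sentiment model, and all field names are hypothetical.

```python
# Keyword stand-ins for per-language sentiment models; a deployment
# would route each language to an LLM or a trained classifier instead.
NEGATIVE_HINTS = {
    "zh": ["垃圾", "退货", "气死"],
    "en": ["defective", "refund", "furious"],
}

def classify(text, lang):
    hints = NEGATIVE_HINTS.get(lang, [])
    if any(h in text for h in hints):
        return "strongly negative"
    return "neutral"

def label_row(row):
    """Attach the sentiment tag as a first-class field, so it can be
    stored as a real database column rather than a bolt-on annotation."""
    return {**row, "sentiment": classify(row["text"], row["lang"])}

labeled = label_row({"text": "Battery is defective, I want a refund",
                     "lang": "en", "platform": "weibo"})
```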
Scale turns this from a fancy scraper into infrastructure. With millions of posts ingested and scored each day, BettaFish approximates a near real‑time, queryable mirror of online public opinion for a population of roughly 1.4 billion people. When an agent later asks what Chinese users think about Marvel or soybean prices, it is not starting a search; it is interrogating a living, constantly updated dataset.
A Real-World Test Drive: Power and Pitfalls
Booting BettaFish up in the real world starts with a rented Hetzner CX31 server and a Docker compose file. The Better Stack team pulls the GitHub repo, wires it to OpenRouter for LLM access, and exposes the Python Flask orchestrator. Within minutes, a surveillance-grade multi-agent stack runs on a cheap European VPS.
For the first query, they go straight for geopolitics: “What do the Chinese media really think of Donald Trump?” That single sentence fans out across the Insight Agent, Query Agent, and Media Agent, each spinning up tasks, logging progress, and feeding into the ForumEngine. Terminal windows fill with timestamps, SQL calls, and crawling logs in real time.
Then the critical failure hits. The Media Agent crashes with a blunt error: missing “Bcker web search API key.” That key requires a linked WeChat account, a hurdle many non-Chinese users cannot clear, so the whole media pipeline stalls. Because the report generator waits for all three agents, the polished final report never arrives.
Workaround mode kicks in. The team pivots to the ForumEngine output, copying raw debate logs and stuffing them into Gemini 1.5 for report generation. Under the hood, the system has still scraped data from over 30 platforms, run sentiment analysis, and ranked content by hotness scores, even if one agent failed.
Those raw logs expose what makes BettaFish dangerous and fascinating. Among the Trump chatter, the system surfaces a viral WeChat thread: “Dear aunties and grandmas, soybean oil is already 105 yuan a barrel,” forwarded 987,000 times by middle-aged and elderly users. Soybean prices, not trade wars or NATO, dominate a huge slice of Trump-related sentiment.
That soybean fixation reveals BettaFish’s real power: surfacing non-obvious, hyper-local obsessions at national scale. The BettaFish English README (Technical Overview and Features) makes clear this is not a toy sentiment scraper but an industrial-grade public-opinion radar.
Navigating the 'Forbidden' Legal Minefield
Forbidden here does not mean classified; it means legally radioactive. BettaFish sits at the intersection of surveillance tech, mass data extraction, and cross‑border privacy law, and almost every part of that stack steps on someone’s rules.
Start with scraping. BettaFish’s crawler cluster hits more than 30 platforms—including Weibo, Douyin, and Xiaohongshu—at industrial scale, then stores posts in MySQL or Postgres with hotness scores and sentiment tags. That goes far beyond casual browsing and slams into platform Terms of Service, which typically ban automated scraping, bulk collection, and reuse of content for commercial analytics.
History here is ugly. In the US, Meta has sued scraping outfits like BrandTotal and Bright Data; LinkedIn spent years fighting hiQ Labs over automated scraping of “public” profiles. Courts sent mixed signals, but the message from platforms is clear: large‑scale scraping, especially for profiling, invites cease‑and‑desist letters, IP blocking, and potentially Computer Fraud and Abuse Act arguments if you ignore technical barriers.
Privacy law raises the stakes even higher. BettaFish aggregates nominally public posts into rich behavioral dossiers, then runs sentiment analysis and topic clustering to infer attitudes, fears, and loyalties. Under China’s Personal Information Protection Law (PIPL) and Europe’s GDPR, that starts to look like large‑scale profiling and “special category” inference, often without explicit consent or clear legal basis.
Regulators increasingly treat “public” as not a free‑for‑all. GDPR cases against Clearview AI showed that scraping open web content to build face‑recognition databases can be unlawful. A BettaFish deployment aimed at EU users could trigger obligations for:
- Lawful basis for processing
- Data protection impact assessments
- Data subject access and deletion rights
Misuse risk is where the “forbidden mind‑reading” label stops feeling like hype. A system that maps emotional triggers across millions of users can optimize misinformation campaigns, A/B test propaganda narratives in real time, or micro‑target outrage to specific demographics. Governments and political consultancies already pay for far cruder dashboards.
Corporate players could quietly plug BettaFish into internal datasets for commercial espionage, tracking employee sentiment, union organizing, or whistleblower chatter. Combined with “private databases” and real‑time monitoring, the same pipeline that explains what Chinese aunties think about soybean oil can also flag dissidents, identify boycott organizers, or pressure activists before they trend.
Beyond China: Global Potential and Peril
Dropped into Western social feeds, BettaFish would stop being a curiosity about 1.4 billion people and start looking like a turnkey opinion dragnet. Swap Weibo and Douyin for X, Reddit, Facebook, YouTube, Instagram, and TikTok, and the same crawler stack could vacuum up millions of posts per hour, tag them by geography, ideology, or community, and feed them into the same multi‑agent debate loop. With OpenAI, Anthropic, or local LLMs slotted in, you get near‑real‑time synthesis of what any slice of the internet “really thinks” about Gaza, Taylor Swift, or the S&P 500.
For legitimate players, that is catnip. A hedge fund could wire BettaFish into Reddit’s r/wallstreetbets, X finance, and YouTube finance influencers to quantify meme‑stock momentum before it hits Bloomberg terminals. Public‑health agencies could monitor spikes in “chest pain after running,” “Ozempic side effects,” or anti‑vaccine narratives across Facebook groups and Telegram channels, then target interventions days earlier. Brands already pay six figures for social listening; a hardened BettaFish fork could give them granular reputation tracking across languages, subcultures, and fringe platforms for the cost of cloud GPUs and a DevOps hire.
The same mechanics become ugly fast in Western politics. Once a tool like this is open‑sourced, any campaign, PAC, or foreign influence shop can run 24/7 narrative reconnaissance: which talking points resonate in Michigan suburbs, which conspiracy hashtags are about to break out in Brazil, which influencer clusters swing on immigration or trans rights. Couple that with cheap content farms and ad APIs, and you get automated feedback loops that A/B test propaganda in public, then amplify only what polarizes hardest.
BettaFish shows how hard dual‑use AI is to contain. The code is on GitHub, already starred tens of thousands of times, and nothing stops forks tuned for US, EU, or Indian politics from spreading via private repos and Discord servers. You cannot meaningfully “recall” a multi‑agent surveillance‑grade analysis system once it exists; you can only race to build norms, regulation, and counter‑tooling before the next student ships an even sharper version.
The Creator's Paradoxical Vision
BettaFish’s creator does not pitch it as a weapon. He talks about a system that can “break free from echo chambers” by mapping a “real sentiment landscape” across platforms, scraping millions of posts to show what 1.4 billion people actually argue about, not just what state media or viral outrage threads amplify. In his framing, more data and more nuance equal more truth.
That idealism extends into the official roadmap. Future versions promise graph neural networks that model relationships between users, topics, and narratives, and time-series pipelines that track those graphs over days or months. The goal: not just describing what Chinese social media thinks about Donald Trump or Apple today, but forecasting where sentiment will move next.
Roadmap notes talk about combining:
- Cross-platform social graphs
- Historical “hotness” scores and sentiment curves
- External signals like news cycles or policy events
Together, those inputs would let BettaFish run simulations of opinion cascades—who influences whom, how fast outrage decays, which demographics flip first.
That same architecture also looks indistinguishable from a mass surveillance and psychological profiling engine. A system that clusters users into influence graphs, tags them by sentiment, and predicts their future reactions does not just describe a population; it creates a targeting matrix for advertisers, political operatives, or security agencies. Documentation and explainers, like the in-depth introduction to the open-source BettaFish (WeiYu) platform, frame this as analytical power, but the line between “analysis” and “control” shrinks as prediction improves.
So the project sits on a paradox. To truly “break echo chambers,” BettaFish must see everything, remember everything, and model everyone, which almost guarantees collateral damage to privacy and digital rights. The open question is whether any public opinion engine this granular can remain a tool for transparency once states, platforms, or bad actors plug into it.
The Double-Edged Sword on Your Server
Power sits uncomfortably close to anyone who can run `docker compose up`. BettaFish turns a mid-range Hetzner box into a surveillance-grade sentiment radar, quietly scraping Weibo, Douyin, Xiaohongshu, and dozens of other platforms, then fusing millions of posts into polished reports on what 1.4 billion people supposedly “really think.”
That reach comes with a catch baked directly into the README. Buried under the hype are blunt disclaimers: the author distances themselves from any misuse, and every legal and ethical consequence lands on whoever actually deploys this code. In other words, BettaFish is free, but liability is fully privatized.
Those warnings are not academic. Continuous scraping, cross-platform correlation, and real-time trend tracking collide with China’s Personal Information Protection Law and similar privacy regimes elsewhere. Run this stack against Twitter (X), Reddit, Facebook, or YouTube, and you are suddenly operating a homebrew social listening platform at a scale that usually belongs to ad-tech giants and intelligence agencies.
What makes BettaFish unsettling is not that it is uniquely evil, but that it is unusually honest about what modern AI can do. Multi-agent debate, automated SQL generation, and a crawler cluster feeding a single sentiment database are exactly how commercial reputation-monitoring and political consulting tools already work—just behind paywalls and NDAs instead of on GitHub stars and Docker Hub pulls.
So the question stops being “Is this tool good or bad?” and becomes “Who gets to do this, and under what rules?” A government ministry, a hedge fund, a troll farm, and a lone grad student now share access to roughly the same capabilities: scrape, cluster, analyze, and predict mass opinion in near real time, at almost zero marginal cost.
BettaFish crystallizes the current AI era into a single command-line decision. You can fork it, wire in OpenRouter, point it at your favorite platforms, and watch the reports roll in. Before you do, ask yourself: in an age where open-source code can read the crowd at planetary scale, where do you draw the line between insight and intrusion?
Frequently Asked Questions
What is BettaFish AI?
BettaFish (Weiyu) is an open-source, multi-agent AI system designed to analyze public opinion by scraping data from social media platforms, using different AI agents to process the information, debate findings, and generate comprehensive reports.
How does BettaFish work?
It uses a crawler to scrape social media, then deploys multiple AI agents in parallel: a Query Agent for web news, a Media Agent for images/videos, and an Insight Agent for private data. A unique 'ForumEngine' has these agents debate their findings before a Report Agent synthesizes the final output.
Is it legal to use BettaFish?
Using BettaFish exists in a legal gray area. Its web scraping functionality may violate the terms of service of many social media platforms and could contravene data protection laws (like GDPR or China's PIPL) depending on how and where it is used. The project's GitHub page includes disclaimers advising users to comply with local laws.
What social media platforms can BettaFish analyze?
BettaFish is primarily designed to analyze major Chinese social media platforms like Weibo, Douyin (TikTok China), Xiaohongshu, and Zhihu. However, its architecture is extensible and could potentially be adapted to scrape other global platforms like Twitter (X), Reddit, or YouTube.