AIFORUS — AI News Intelligence Platform

Production full-stack system I designed and built solo at Aiforus (Seoul). A Python pipeline (FastAPI + PostgreSQL + ML tagging) continuously discovers AI-related news from global media, scrapes and normalizes content, and applies zero-shot classification. A React + Vite dashboard with Leaflet maps and Recharts visualizes the output for non-technical users.

Solo Engineer (backend + frontend)
Pipeline Architecture
ML Integration
Data Visualization
Docker / nginx Deploy

Overview of the AIFORUS system: pipeline feeding the dashboard

The problem

Aiforus needed a foundation for AI-news intelligence that could grow over years without being rewritten. The first step was a collection layer that kept up with thousands of global sources, recovered cleanly from failures, and stayed neutral about downstream interpretation — so summarization, scoring, and analytics could be iterated independently above it. The second step was a dashboard that a non-engineer could open and immediately answer two questions: what is happening in AI news right now, and where is it coming from.

I built both ends — the Python pipeline and the React dashboard — as the solo engineer on the project.

Architecture: pipeline + dashboard

The backend is split into four independently runnable services orchestrated by a thin operator layer. Each service owns one responsibility and never reaches across boundaries.

CLT — discovers and collects news URLs from global feeds and sitemaps.

SCR — fetches each URL, parses the article body, normalizes it.

TAG — applies AI tagging: language detection, AI relevance scoring, zero-shot topic classification.

API — a FastAPI surface that exposes trends and articles to the dashboard.

A separate policy/ directory holds the rules in bilingual KO/EN docs. Policy is authoritative: the code implements the spec, not the other way around. This keeps the system audit-friendly and lets non-engineers evolve the rules without touching Python.

End-to-end architecture: CLT, SCR, TAG, API, policy layer, and React dashboard

Backend — URL discovery (CLT)

The collector reads from per-country source sets that I maintain separately from the code. Each source has its own discovery strategy — RSS, sitemap, listing pages — and its own polling cadence. Failures are first-class: a source that throws on Tuesday is expected to recover on Wednesday, and the scheduler treats the failure log as signal rather than an exception.

The CLT service writes only discovered URLs and metadata. It never fetches article bodies and never makes editorial decisions — that lives downstream so the collection layer stays cheap and unopinionated.

CLT scheduler running per-country source sets with failure logging
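
The collection loop described above can be sketched roughly as follows. This is a minimal illustration under assumptions, not the production scheduler: `Source`, `CycleResult`, `run_cycle`, and the in-memory failure log are hypothetical names introduced here, the discovery strategies (RSS, sitemap, listing pages) are stubbed as plain callables, and polling cadence is left out.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Iterable


@dataclass
class Source:
    """One entry in a per-country source set (hypothetical shape)."""
    name: str
    country: str
    strategy: Callable[[], Iterable[str]]  # e.g. an RSS, sitemap, or listing-page reader


@dataclass
class CycleResult:
    discovered: list = field(default_factory=list)  # URLs plus metadata only
    failures: list = field(default_factory=list)    # failure log, kept as data


def run_cycle(sources: Iterable[Source]) -> CycleResult:
    """Poll every source once; a failing source is logged, never fatal."""
    result = CycleResult()
    for source in sources:
        now = datetime.now(timezone.utc).isoformat()
        try:
            urls = list(source.strategy())
        except Exception as exc:  # broad on purpose: any source may break
            result.failures.append(
                {"source": source.name, "error": repr(exc), "at": now}
            )
            continue
        for url in urls:
            # No article body is fetched here; that is left to SCR.
            result.discovered.append(
                {"url": url, "source": source.name,
                 "country": source.country, "discovered_at": now}
            )
    return result


def broken_feed() -> Iterable[str]:
    raise TimeoutError("feed unreachable")


sources = [
    Source("example-kr", "KR", lambda: ["https://example.com/a", "https://example.com/b"]),
    Source("example-us", "US", broken_feed),
]
result = run_cycle(sources)
print(len(result.discovered), "URLs,", len(result.failures), "failed source(s)")
```

Because the failure is stored as a row rather than raised, the next cycle simply tries the source again, and the log can be queried to spot sources that stay broken.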

Backend — content normalization (SCR)

The scraper picks up new URLs from the queue, fetches the page, and runs it through a two-stage parser: trafilatura for the primary extraction and newspaper3k as a fallback for pages where trafilatura underperforms. Language is detected with fasttext so downstream tagging knows which model to apply.

Each fetch goes through a filter chain — paywalls, JS-only pages, duplicate canonicals — before normalization. Rejected URLs are kept with a reason code, which was essential for debugging source-set quality.

SCR fetch and parse pipeline with trafilatura primary and newspaper3k fallback
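
A rough sketch of the filter chain and the two-stage parse, under stated assumptions. The extractors are passed in as callables so the example runs without third-party packages; in practice the primary could be a thin wrapper around `trafilatura.extract` and the fallback a newspaper3k-based one. `ScrapeResult`, `scrape`, the reason codes, and the `MIN_CHARS` threshold for "underperforms" are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]    # HTML in, article text (or None) out
Filter = Callable[[str, str], Optional[str]]  # (url, html) -> reason code or None

MIN_CHARS = 200  # assumed threshold for "the primary extractor underperformed"


@dataclass
class ScrapeResult:
    url: str
    text: Optional[str] = None
    parser: Optional[str] = None         # which stage produced the text
    reject_reason: Optional[str] = None  # kept so source-set quality can be debugged


def extract_with_fallback(html: str, primary: Extractor, fallback: Extractor):
    """Try the primary extractor; fall back when it returns too little text."""
    text = primary(html)
    if text and len(text) >= MIN_CHARS:
        return text, "primary"
    text = fallback(html)
    if text and len(text) >= MIN_CHARS:
        return text, "fallback"
    return None, None


def scrape(url, html, filters, primary, fallback) -> ScrapeResult:
    """Run the filter chain first, then extract and normalize whitespace."""
    for check in filters:
        reason = check(url, html)
        if reason is not None:
            return ScrapeResult(url, reject_reason=reason)
    text, parser = extract_with_fallback(html, primary, fallback)
    if text is None:
        return ScrapeResult(url, reject_reason="extraction_failed")
    return ScrapeResult(url, text=" ".join(text.split()), parser=parser)


# Stand-in filter and extractors so the sketch is self-contained.
def paywall_filter(url: str, html: str) -> Optional[str]:
    return "paywall" if "subscribe to continue" in html.lower() else None


def primary(html: str) -> Optional[str]:
    return None  # pretend the primary extractor found nothing


def fallback(html: str) -> Optional[str]:
    return "AI  news body. " * 20  # pretend the fallback succeeded


ok = scrape("https://example.com/a", "<html>article</html>", [paywall_filter], primary, fallback)
blocked = scrape("https://example.com/b", "Subscribe to continue", [paywall_filter], primary, fallback)
print(ok.parser, blocked.reject_reason)
```

Returning a result object for rejected URLs, rather than dropping them, is what makes the reason codes available later when reviewing a source set.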

Backend — AI tagging (TAG)

The tagger applies three independent passes per article: an AI relevance score, a zero-shot topic classification, and a separate Korean-language topic detector built on anchor phrases. Each pass writes to its own column so we can run shadow versions side-by-side without breaking consumers.

Models live behind a thin abstraction: sentence-transformers today, but the interface is small enough that swapping to an LLM API tomorrow is a contained change. This matters because the team explicitly does not want to be locked to one vendor's model.

TAG service applying AI scoring, zero-shot classification, and Korean topic detection
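
One way to picture the thin model abstraction and the column-per-pass layout is the sketch below. It is an assumption-laden illustration, not the actual interface: `TopicClassifier`, `KeywordClassifier`, `tag_article`, and the column names are hypothetical, and the toy keyword backend merely stands in for an embedding model or an LLM API.

```python
from typing import Optional, Protocol


class TopicClassifier(Protocol):
    """The small interface the tagger depends on (hypothetical)."""
    version: str

    def classify(self, text: str, labels: list) -> dict:
        """Return a score per candidate label."""
        ...


class KeywordClassifier:
    """Toy stand-in backend; a real one might wrap an embedding model or an LLM API."""

    def __init__(self, version: str) -> None:
        self.version = version

    def classify(self, text: str, labels: list) -> dict:
        lowered = text.lower()
        return {label: float(label.lower() in lowered) for label in labels}


def tag_article(text: str, labels: list, live: TopicClassifier,
                shadow: Optional[TopicClassifier] = None) -> dict:
    """Write each pass to its own column so a shadow run never overwrites the live one."""
    row = {}
    scores = live.classify(text, labels)
    row["topic"] = max(scores, key=scores.get)
    row["topic_model_version"] = live.version
    if shadow is not None:
        shadow_scores = shadow.classify(text, labels)
        row["topic_shadow"] = max(shadow_scores, key=shadow_scores.get)
        row["topic_shadow_model_version"] = shadow.version
    return row


row = tag_article(
    "New EU regulation for AI systems",
    ["robotics", "regulation"],
    live=KeywordClassifier("v1"),
    shadow=KeywordClassifier("v2-shadow"),
)
print(row)
```

Swapping backends then means writing another class with the same `classify` method; `tag_article` and its consumers stay unchanged.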

Dashboard — map view

The dashboard is a React 19 + Vite SPA that consumes the FastAPI surface and visualizes the pipeline's output. The map view uses react-leaflet to plot tracked sources, sized by article volume in the selected window. Clicking a marker filters the news panel to that source. The marker layer is memoized so re-renders during filtering only touch the news panel, not Leaflet.

Leaflet world map with markers sized by article volume per source

Dashboard — trends view

A Recharts area chart shows article volume by AI topic over time. Korean-language topics surface as a separate stacked breakdown because the pipeline classifies them with a dedicated detector; the UI follows the data, not the other way around. Hover gives the exact count per topic per day.

Recharts time-series area chart of article volume by AI topic
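
The per-day, per-topic counts behind the chart could be produced by an aggregation along these lines. This is only a sketch: `daily_topic_counts` and the `published` / `topic` field names are hypothetical, the sample rows are made up, and the real API may well aggregate in SQL instead. The output is one object per day with a key per topic, which is the row-per-x-value layout a Recharts chart is typically fed.

```python
from collections import Counter, defaultdict
from datetime import date


def daily_topic_counts(articles):
    """Group articles into one row per day with a count per topic."""
    per_day = defaultdict(Counter)
    for article in articles:
        per_day[article["published"]][article["topic"]] += 1
    # One dict per day, sorted by date.
    return [
        {"date": day.isoformat(), **counts}
        for day, counts in sorted(per_day.items())
    ]


# Made-up sample rows, for illustration only.
articles = [
    {"published": date(2025, 1, 2), "topic": "regulation"},
    {"published": date(2025, 1, 1), "topic": "robotics"},
    {"published": date(2025, 1, 2), "topic": "regulation"},
    {"published": date(2025, 1, 2), "topic": "robotics"},
]
rows = daily_topic_counts(articles)
print(rows)
```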

Dashboard — news panel

The news panel is a virtualized list of articles filtered by whatever is selected on the map and trends views. Each row shows the source, the detected language, and the topic tag. Clicking opens the original URL in a new tab — the dashboard never mirrors article content, by policy. A top-level ErrorBoundary plus a centralized API client keeps backend hiccups from breaking the UI.

News panel listing articles with source, language, and topic tags

Stack & operations

Backend: Python 3.11, FastAPI, PostgreSQL with additive-only migrations, torch / transformers / sentence-transformers for the AI passes, fasttext for language detection, trafilatura + newspaper3k for parsing. Two Docker images (API + scheduler) plus a docker-compose dev stack.

Frontend: React 19, Vite 7, react-leaflet, Recharts, served from a multi-stage nginx Docker image. No state library: React built-ins plus a small fetch helper were enough for a read-heavy dashboard that is stateless between views.

What I learned

The biggest win wasn't a clever model — it was treating failures and policy changes as first-class workflows. Once "this source broke" and "we now classify Korean articles differently" stopped being incidents and became routine operations, the system started compounding instead of degrading. On the frontend side, the dashboard launched fast because I skipped TypeScript and a state library — both correct calls at the time, but TS is the next thing I'd add given how much the API surface has grown.