# Technology Documentation FY2 S3

This documentation is current for Centum version 2.6.12.

# Introduction

This document explains the intended outcome and the current form of Centum technology.

Centum technology exists to generate a catalog of cards containing interesting facts about interesting topics, then present them to people visiting our website and mobile apps. To do so, Centum AI writes contents cards for topics and saves them in the Centum Catalog DB; Centum Shell then gets the contents cards from the DB and presents them to the visitor.

### Document Overview

1. **Introduction** section, which you are currently reading, explains the overall purpose of Centum technology and the purpose and structure of this document.
2. **Tenets** section explains how we currently make technical decisions, to provide context for the following sections.
3. **Application** section explains how Centum works from the perspective of our customers.
4. **Database** section explains the Catalog DB.
5. **Codebase** section explains the program code for Centum AI, Centum Catalog DB, and Centum Shell Interface.
6. **Hardware** section explains how we use different computers inside and outside of our company space to run production and to test code changes before publishing them to production.
7. **Current Compromises** section explains decisions we made that traded optimality for practicality, and suggests considerations for the future.

# **Tenets**

### **First create. Then optimize.**

For new features, rush to reach a point where we can test on devices, then refactor the code to optimize for performance and readability (while maintaining functionality).

### **Partner with maintainers.**

For any important technology that deserves constant dedicated attention, prefer taking a dependency on external components that are maintained by dedicated teams outside of our company.
# Application

Visitors can enter Centum as a website (from Windows, macOS, Android, iOS) or as a mobile app (from Android, iOS).

Centum Shell Interface (“Shell” for short) has 2 pages (excluding auxiliary pages such as Privacy Policy and Terms of Service):

- Front Page lists cards in an infinite-scroll contents feed with a search bar at the top.
- Back Page provides controls and information such as Chat Support and compliance documents.

Visitors can flip the page (from front to back and vice versa) by tapping the logo icon at the top of each page.

On the Front Page:

- Contents Feed fetches 5 discovery cards at a time from the DB.
- A discovery card presents short interesting facts on a topic and invites the visitor to trigger elaboration (“Elaborate on {topic}”).
- Tapping a linked topic word (shown in blue) opens all 9 contents cards on that topic immediately below the current card (the one the visitor just tapped).

Opening a topic from the search bar inserts the contents cards on the searched topic at the top of the contents feed (index: 0).
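The feed behavior described above can be sketched as plain list operations. This is a minimal, language-neutral model written in Python for illustration only; the actual Shell code is not shown in this document, and the names `Card`, `ContentsFeed`, `append_discovery_batch`, `open_topic_below`, and `open_topic_from_search` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Card:
    topic: str
    title: str
    is_discovery_card: bool = False


@dataclass
class ContentsFeed:
    """In-memory stand-in for the Front Page contents feed (hypothetical model)."""

    cards: List[Card] = field(default_factory=list)

    def append_discovery_batch(self, batch: List[Card]) -> None:
        # Infinite scroll: each fetch appends one batch of discovery cards to the end.
        self.cards.extend(batch)

    def open_topic_below(self, tapped_index: int, contents: List[Card]) -> None:
        # Tapping a linked topic: contents cards go immediately below the tapped card.
        insert_at = tapped_index + 1
        self.cards[insert_at:insert_at] = contents

    def open_topic_from_search(self, contents: List[Card]) -> None:
        # Search: contents cards go to the top of the feed (index 0).
        self.cards[0:0] = contents


# Demo: one batch of 5 discovery cards, one tapped link, one search.
feed = ContentsFeed()
feed.append_discovery_batch([Card(f"topic-{i}", "Discovery", True) for i in range(5)])
feed.open_topic_below(1, [Card("linked", f"Card {n}") for n in range(1, 10)])
feed.open_topic_from_search([Card("searched", f"Card {n}") for n in range(1, 10)])
print([card.topic for card in feed.cards])
```

Slice assignment keeps both insertions in one step and preserves the order of the inserted cards, so the visitor's position relative to the tapped card stays predictable.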
# Database

### Topics

```sql
CREATE TABLE topics (
    id UUID PRIMARY KEY NOT NULL DEFAULT gen_random_uuid(),
    creation_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    last_edit_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    variation TEXT,
    variation_id UUID REFERENCES variations(id) ON DELETE CASCADE,
    text_search_vector tsvector,
    allowed_in_site BOOLEAN NOT NULL DEFAULT FALSE,
    disallowed_in_site BOOLEAN NOT NULL DEFAULT FALSE,
    generation_priority INT NOT NULL DEFAULT 0,
    origin TEXT,
    topic TEXT NOT NULL UNIQUE,
    intro TEXT,
    has_intro BOOLEAN NOT NULL DEFAULT FALSE,
    has_subtopics BOOLEAN NOT NULL DEFAULT FALSE,
    has_discovery_cards BOOLEAN NOT NULL DEFAULT FALSE,
    has_contents BOOLEAN NOT NULL DEFAULT FALSE,
    has_edited_contents BOOLEAN NOT NULL DEFAULT FALSE,
    is_presentable BOOLEAN NOT NULL DEFAULT FALSE
);
```

### Cards

```sql
CREATE TABLE cards (
    id UUID PRIMARY KEY NOT NULL DEFAULT gen_random_uuid(),
    creation_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    last_edit_time TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
    variation TEXT,
    variation_id UUID REFERENCES variations(id) ON DELETE CASCADE,
    topic_id UUID REFERENCES topics(id) ON DELETE CASCADE,
    topic TEXT NOT NULL,
    title TEXT NOT NULL,
    contents JSONB,
    edited_contents JSONB,
    is_discovery_card BOOLEAN NOT NULL DEFAULT FALSE,
    is_presentable BOOLEAN NOT NULL DEFAULT FALSE,
    UNIQUE(topic, title)
);
```

# Codebase

The Centum codebase is a monorepo.

## AI

### Add Topics to Catalog DB

Currently, topics are added to the Catalog DB in three ways:

1. **Curated**: I manually added the topic to the Catalog.
2. **Linked**: Centum AI decided to add a link to a term in an existing contents card for an existing topic, then added the term as a standalone topic in the Catalog, because the topic was not yet in the Catalog.
3. **Searched**: A visitor searched for the topic using the search bar in the Shell, and the Shell added the topic to the Catalog, because the topic was not yet in the Catalog.
Before migrating to seeding the Catalog with manually curated topics, I first tried seeding it with topics ideated automatically by Generative AI, and discovered that Generative AI currently fails to come up with sufficiently interesting topics.

### Centum Writer

For Generative AI, we are currently using Mistral:v0.3:7B.

- `write_intro_for_topics`
    - From DB, gets topics that do not have intros.
    - Using Generative AI, writes intro.
    - Saves to DB.
- `curate_subtopics_for_topics(self, num_topics: int =`
    - From DB, gets topics that do not have subtopics.
    - Using Generative AI, writes subtopics.
    - Saves to DB.
- `write_discovery_card_for_topics(self, num_topics: int = 10)`
    - From DB, gets topics that do not have discovery cards.
    - Using Generative AI, writes discovery cards.
    - Saves to DB.
- `write_contents_for_topics(self, num_topics: int = 10)`
    - From DB, gets topics that do not have contents cards.
    - Using Generative AI, writes contents cards.
    - Saves to DB.

### Centum Editor

- `add_emoticons`
    - From DB, gets some topics. For each topic, gets the discovery card for that topic.
    - For each fact in the discovery card, using Generative AI, adds a contextually appropriate emoticon as a prefix for the bullet point. Currently, we apply this operation to a small proportion of the topics, so that visitors are delighted by occasional emoticons instead of getting frustrated by seeing emoticons all the time.
- `add_links_to_cards`
    - From DB, gets topics that have not been edited. For each topic, gets its contents cards.
    - For each contents card, performs named entity recognition to get terms labeled `["PRODUCT", "ORG", "WORK_OF_ART", "EVENT", "FAC", "LOC", "GPE", "PERSON", "LAW"]`. For named entity recognition, we are currently using the spaCy NLP model `en_core_web_trf`.
    - For each discovered entity term, adds a link to the term using the syntax `[[display_text|link_destination]]`.
- `edit_using_policy`
    - Performs brute-force string replacement operations as a temporary fallback to overcome mistakes by AI models. For example, when asked to write surprising facts about topics, AI models prefix facts with the string “Did you know?” despite being explicitly asked not to do so. For now, we just remove the “Did you know?” string if it exists.

### Scripts

The `fill_catalog` script monitors the execution time of each operation performed by Centum Writer and Centum Editor. The script executes commands in 3 modes:

- Waterfall Mode: For each batch of topics, the script performs all operations in order, from `write_intro_for_topics` to `edit_using_policy`.
- Flare Mode: For each operation, from `write_intro_for_topics` to `edit_using_policy`, the script performs the operation on a batch of topics.
- Infinity Mode: The script runs Flare Mode in an endless loop (in order to fill the catalog for production).

In all modes, execution time is reported as minutes per topic.

### Memo

Generating contents for 1 topic currently takes 2-3 minutes in Centum version 2.6.12, down from 3-5 minutes in version 2.6.1. We aim to get this down to 1-2 minutes in 2.9, then to under 1 minute in 3.0. This assumes constant characteristics for contents, technology, and business, which is likely false since we plan to enhance contents quality (harder task → slower), migrate to servers hosted by 3P cloud service providers (stronger computers → faster), and sell subscription memberships (more funding → faster).

## DB

**Seed.SQL**: See the Database section of this document for the DB schema.

**Functions**:

`add_topics_or_increment_priority`: We call this function from AI or Shell when we encounter a topic that is not yet ready for presentation in our Catalog.

- If a topic is not in the catalog, then add the topic to the catalog so that AI will eventually generate contents for it.
- If a topic is in the catalog, then bump up the generation priority for the topic so that AI will generate contents for it sooner.
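The two branches above can be sketched as follows. This is an illustration only, using Python's built-in `sqlite3` as a stand-in for the production Postgres database; the real function body is not shown in this document. The helper `create_catalog`, the Python signature, the two-column table, and the increment of 1 per call are all assumptions.

```python
import sqlite3
from typing import Iterable


def create_catalog(conn: sqlite3.Connection) -> None:
    # Hypothetical helper: only the two columns this sketch needs.
    # The real `topics` table has many more columns.
    conn.execute(
        "CREATE TABLE topics ("
        " topic TEXT NOT NULL UNIQUE,"
        " generation_priority INTEGER NOT NULL DEFAULT 0)"
    )


def add_topics_or_increment_priority(
    conn: sqlite3.Connection, topics: Iterable[str]
) -> None:
    """Add unseen topics; bump generation_priority for topics already present."""
    with conn:  # run the whole batch in one transaction
        for topic in topics:
            updated = conn.execute(
                "UPDATE topics SET generation_priority = generation_priority + 1"
                " WHERE topic = ?",
                (topic,),
            )
            if updated.rowcount == 0:
                # Not in the catalog yet: add it with the default priority (0).
                conn.execute("INSERT INTO topics (topic) VALUES (?)", (topic,))


# Demo: "Tardigrade" is requested twice, so its priority is bumped once.
conn = sqlite3.connect(":memory:")
create_catalog(conn)
add_topics_or_increment_priority(conn, ["Apollo 11", "Tardigrade"])
add_topics_or_increment_priority(conn, ["Tardigrade"])
print(conn.execute(
    "SELECT topic, generation_priority FROM topics"
    " ORDER BY generation_priority DESC, topic"
).fetchall())
```

In Postgres, the same behavior could plausibly be expressed as a single `INSERT ... ON CONFLICT (topic) DO UPDATE` statement, since `topic` is declared `UNIQUE` in the schema.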
This logic ensures that topics people find interesting are reliably prepared in our catalog, and sooner.

## Shell

`/packages/app` contains the monorepo code that is delivered to website visitors via `apps/next` and to mobile app visitors via `apps/expo`.

# Hardware

For production:

- Website in production is hosted on Vercel.
- Mobile apps are distributed on the app stores operated by Apple and Google.
- Database in production is hosted on Supabase.

For testing:

- Servers are hosted on my laptops in my home office (macOS and Windows).
- App is tested on test devices running Windows, macOS, Android, and iOS.

# Current Compromises

- Currently, Centum AI writes directly to the production Catalog DB. Until now, filling the catalog rapidly was imperative, because our catalog was too small to deliver enough utility to visitors, and we were introducing many breaking changes between version updates, which made us restart the catalog filling process from scratch. Now, the production catalog has 2K presentable topics and, despite still falling short of our own standards, it delivers meaningful utility to daily visitors. We plan to start buffering updates to the Production Catalog DB by adding a Buffer Catalog DB. Centum AI will write to the Buffer Catalog DB, then the buffered changes will be pushed to the Production Catalog DB in batch updates. This change will increase stability and decrease egress cost for the production DB.
- Catalog Backlog (keeping track of topics to generate contents for) is in Postgres DB instead of Redis. Consider testing Redis in the future.
- Consider documenting tools.