PROJECT N° 04
2023
DATA COLLECTION ENGINEER

Global Image Dataset, Web Scraping

A remote engagement with Limitless Capital (Singapore) building the upstream pipeline that feeds computer-vision models, emphasising geographic diversity, licensing hygiene, and labelling consistency.

Python / ETL / Data Quality
§01 / CONTEXT

The brief.

AI inclusivity starts at the dataset. Limitless Capital needed image corpora that genuinely represented users across continents, not the usual North-American skew. The role combined engineering, curation, and quality assurance under remote, performance-tracked conditions.

§02 / APPROACH

How it was built.

  1. STEP 01: Wrote modular Python scrapers covering 50+ sources with respectful rate limiting and retry logic.
  2. STEP 02: Curated images across 30+ countries, balancing demographic and contextual diversity.
  3. STEP 03: Authored data-quality standards adopted by the cross-functional team.
  4. STEP 04: Operated independently against weekly performance benchmarks in a remote environment.
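
The rate limiting and retry logic in the first step can be sketched roughly as follows. This is an illustrative sketch, not the project's actual code: the `PoliteFetcher` class, its parameters, and the default delays are all assumptions. It wraps any fetch callable, so it runs without third-party packages.

```python
import time


class PoliteFetcher:
    """Hypothetical wrapper: spaces out requests and retries failures with backoff."""

    def __init__(self, fetch, min_interval=1.0, max_retries=3, base_delay=0.5,
                 retry_on=(OSError,), sleep=time.sleep, clock=time.monotonic):
        self.fetch = fetch                # callable taking a URL, e.g. a Requests call
        self.min_interval = min_interval  # minimum seconds between two attempts
        self.max_retries = max_retries    # retries after the first failed attempt
        self.base_delay = base_delay      # first backoff delay, doubled on each retry
        self.retry_on = retry_on          # exception types treated as retryable
        self.sleep = sleep                # injectable so tests need not really wait
        self.clock = clock
        self._last_call = None

    def _wait_for_slot(self):
        # Rate limiting: sleep until min_interval has passed since the last attempt.
        if self._last_call is not None:
            remaining = self.min_interval - (self.clock() - self._last_call)
            if remaining > 0:
                self.sleep(remaining)
        self._last_call = self.clock()

    def get(self, url):
        for attempt in range(self.max_retries + 1):
            self._wait_for_slot()
            try:
                return self.fetch(url)
            except self.retry_on:
                if attempt == self.max_retries:
                    raise  # retries exhausted, surface the last error
                # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
                self.sleep(self.base_delay * (2 ** attempt))
```

With Requests, the fetch callable could be something like `lambda url: requests.get(url, timeout=10)`. Requests' exceptions derive from `OSError`, so the default `retry_on` would cover connection errors and timeouts. HTTP error statuses would only be retried if the callable also calls `raise_for_status()`.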
§03 / OUTCOMES

What it moved.

  • Collected 15,000+ images while cutting manual collection time by 65%.
  • Improved AI inclusivity metrics on downstream models by 28%.
  • Established the QA checklist that became the team's onboarding document.
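
A QA checklist of this kind could be partly automated. The sketch below is only an illustration: the record fields, the licence allow-list, and the minimum size are hypothetical and not taken from the project's actual standards.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ImageRecord:
    # Hypothetical metadata schema for one collected image.
    url: str
    country: str   # two-letter country code, e.g. "SG"
    license: str
    width: int
    height: int


ALLOWED_LICENSES = {"CC0", "CC-BY", "CC-BY-SA"}  # assumed allow-list
MIN_SIDE = 256                                   # assumed minimum side in pixels


def check_record(rec):
    """Return a list of checklist violations for one record (empty if it passes)."""
    problems = []
    if rec.license not in ALLOWED_LICENSES:
        problems.append("license not on allow-list")
    if len(rec.country) != 2 or not rec.country.isalpha():
        problems.append("country is not a two-letter code")
    if min(rec.width, rec.height) < MIN_SIDE:
        problems.append("image too small")
    return problems


def country_shares(records):
    """Fraction of the dataset contributed by each country, to spot geographic skew."""
    counts = Counter(r.country for r in records)
    total = sum(counts.values())
    return {country: n / total for country, n in counts.items()}
```

Running `check_record` over every row and reviewing `country_shares` before each delivery would give one simple way to check licensing and geographic balance together.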
§04 / STACK

Tools used.

TOOL: Python
TOOL: Requests
TOOL: BeautifulSoup
TOOL: Pandas