PROJECT N° 04
2023
DATA COLLECTION ENGINEER
Global Image Dataset, Web Scraping
A remote engagement with Limitless Capital (Singapore) building the upstream pipeline that feeds computer-vision models, emphasising geographic diversity, licensing hygiene, and labelling consistency.
Python · ETL · Data Quality
§01 / CONTEXT
The brief.
AI inclusivity starts at the dataset. Limitless Capital needed image corpora that genuinely represented users across continents, not the usual North-American skew. The role combined engineering, curation, and quality assurance under remote, performance-tracked conditions.
§02 / APPROACH
How it was built.
- STEP 01: Wrote modular Python scrapers covering 50+ sources with respectful rate limiting and retry logic.
- STEP 02: Curated images across 30+ countries, balancing demographic and contextual diversity.
- STEP 03: Authored data-quality standards adopted by the cross-functional team.
- STEP 04: Operated independently against weekly performance benchmarks in a remote environment.
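To make the "respectful rate limiting and retry logic" in Step 01 concrete, here is a minimal sketch, not the project's actual code. The class name `PoliteFetcher`, the one-second per-host delay, the retryable status set, and the exponential backoff schedule are all illustrative assumptions. The HTTP call is passed in as `get`, so a `requests.Session().get` can be supplied in real use.

```python
import time
from urllib.parse import urlparse

# Assumption: statuses treated as transient and worth retrying.
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class PoliteFetcher:
    """Hypothetical helper: per-host minimum delay plus bounded retries."""

    def __init__(self, get, min_delay=1.0, max_retries=3, backoff=2.0,
                 sleep=time.sleep):
        self.get = get                # e.g. requests.Session().get
        self.min_delay = min_delay    # seconds between hits to one host
        self.max_retries = max_retries
        self.backoff = backoff        # base of the exponential backoff
        self.sleep = sleep            # injectable so tests need not wait
        self._last_hit = {}           # host -> monotonic time of last request

    def _throttle(self, url):
        """Wait until at least min_delay has passed since the last hit to this host."""
        host = urlparse(url).netloc
        last = self._last_hit.get(host)
        if last is not None:
            wait = self.min_delay - (time.monotonic() - last)
            if wait > 0:
                self.sleep(wait)
        self._last_hit[host] = time.monotonic()

    def fetch(self, url):
        """Return a response, or None once every attempt has failed."""
        for attempt in range(self.max_retries + 1):
            self._throttle(url)
            try:
                response = self.get(url, timeout=10)
            except OSError:
                # requests' RequestException subclasses OSError, so
                # connection errors and timeouts land here.
                response = None
            if response is not None and response.status_code not in RETRYABLE_STATUS:
                return response
            if attempt < self.max_retries:
                # Backoff waits of 1 s, 2 s, 4 s, ... between attempts.
                self.sleep(self.backoff ** attempt)
        return None
```

With Requests installed, `PoliteFetcher(requests.Session().get).fetch(url)` would be one way to use it. Non-retryable responses such as a 404 are returned as-is, so the caller still needs to check `status_code`.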
§03 / OUTCOMES
What it moved.
- → Collected 15,000+ images while cutting manual collection time by 65%.
- → Improved AI inclusivity metrics on downstream models by 28%.
- → Established the QA checklist that became the team's onboarding document.
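The contents of the QA checklist are not reproduced here, so the sketch below is only a guess at how such checks could be automated with Pandas, built around the three themes named in the project summary: geographic diversity, licensing hygiene, and labelling consistency. The column names (`country`, `licence`, `label`, `sha256`), the licence allow-list, and the function `qa_report` are all hypothetical.

```python
import pandas as pd

# Assumption: an illustrative allow-list, not the project's real policy.
ALLOWED_LICENCES = {"CC0", "CC-BY", "CC-BY-SA"}


def qa_report(meta: pd.DataFrame) -> pd.DataFrame:
    """Return one boolean column per check; True means the row fails it."""
    country = meta["country"].fillna("")
    label = meta["label"].fillna("")

    report = pd.DataFrame(index=meta.index)
    # Geographic diversity cannot be measured without a country value.
    report["missing_country"] = country.str.strip() == ""
    # Licensing hygiene: anything outside the allow-list is flagged.
    report["bad_licence"] = ~meta["licence"].isin(ALLOWED_LICENCES)
    # Labelling consistency: labels must be non-empty, trimmed, lower case.
    report["inconsistent_label"] = (label == "") | (label != label.str.strip().str.lower())
    # Later copies of the same file (by content hash) are duplicates.
    report["duplicate_image"] = meta.duplicated(subset="sha256", keep="first")
    return report


def country_share(meta: pd.DataFrame) -> pd.Series:
    """Fraction of images per country, for a quick check on geographic skew."""
    return meta["country"].value_counts(normalize=True)
```

Rows with any `True` in the report would go back for manual review. `country_share` gives a quick read on whether one region dominates the corpus.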
§04 / STACK
Tools used.
TOOL: Python
TOOL: Requests
TOOL: BeautifulSoup
TOOL: Pandas