An automated lead generation system that finds and extracts contact information for manufacturers and exporters selling into the European Union (EU). The system searches for companies by commodity, country, and industry, then scrapes their websites to extract structured contact data.
- EU-Focused Search: Targeted queries to find companies exporting to EU markets
- Intelligent Query Generation: Uses an LLM to generate 40+ diverse search queries, including site: operators
- Concurrent Crawling: Processes up to 5 websites in parallel for faster results
- Anti-Detection Crawling: User agent rotation, custom headers, browser fingerprinting
- Smart Page Discovery: Automatically finds Contact and About pages
- Obfuscation Handling: Decodes emails like "info [at] company . com"
- No Hallucination: Returns null for missing data instead of guessing
- Social Media Filtering: Focuses on business directories and corporate sites
- Deduplication: Removes duplicate companies by normalized name and merges contact details
- Retry Logic: Exponential backoff for API failures and rate limits
- Comprehensive Extraction: 16 data fields including EU destinations, LinkedIn, social links
- Progress Tracking: Rich console with progress bars and structured logging
The system follows a 3-stage pipeline:
- Discovery (Search Agent) - Generates 40+ EU-focused search queries and gathers company URLs
- Intelligence (Scraper Agent + Crawler Toolkit) - Concurrently crawls websites with anti-detection features
- Action (Analyst Agent + Output Writer) - Validates trade status, scores leads, deduplicates, and saves results
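The deduplication step in stage 3 could look roughly like the sketch below. The function names, the list of legal suffixes, and the choice of which fields to merge are all assumptions; only the idea (group by normalized company name, then merge contact details) comes from the description above. Field names are borrowed from the CSV columns.

```python
import re

# Assumed list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"ltd", "limited", "pvt", "inc", "gmbh", "llc", "co"}
# Assumed set of list-valued contact fields to merge.
LIST_FIELDS = ("direct_emails", "phone_numbers")


def normalize_name(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", name.lower())
    return " ".join(w for w in cleaned.split() if w not in LEGAL_SUFFIXES)


def deduplicate(leads: list[dict]) -> list[dict]:
    """Keep one record per normalized name, merging list-valued contact fields."""
    merged: dict[str, dict] = {}
    for lead in leads:
        key = normalize_name(lead["company_name"])
        # The first record seen for a key supplies the scalar fields.
        target = merged.setdefault(key, {**lead, **{f: [] for f in LIST_FIELDS}})
        for field in LIST_FIELDS:
            for value in lead.get(field) or []:
                if value not in target[field]:
                    target[field].append(value)
    return list(merged.values())
```

This ASCII-only normalization is deliberately simple; names in other scripts would need a more careful approach.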
- Python 3.10 or higher
- API keys for OpenRouter and a search provider (Tavily or Serper)
- Clone the repository:
git clone https://github.com/Paraschamoli/Contact-Detail-Agent.git
cd Contact-Detail-Agent
- Install dependencies:
# Using uv (recommended)
uv sync
# Or using pip
pip install -e .
Important: Always use uv run to execute the script so that the correct Python environment is used:
uv run python index.py --commodity textiles --country India
- Set up environment variables:
cp .env.example .env
# Edit .env with your API keys
- OpenRouter: Get your key at https://openrouter.ai/keys
- Tavily (recommended): Get your key at https://app.tavily.com/
- Serper (alternative): Get your key at https://serper.dev/
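A quick way to confirm the keys are visible before a run is a small check like the one below. It is a sketch, not part of the project: the helper name `missing_api_keys` is hypothetical, and it only inspects the process environment (it does not read `.env` itself). The variable names are the ones listed in the troubleshooting notes.

```python
import os


def missing_api_keys(env=None) -> list[str]:
    """Return a description of each required key that is unset or empty."""
    env = os.environ if env is None else env
    missing = []
    if not env.get("OPENROUTER_API_KEY"):
        missing.append("OPENROUTER_API_KEY")
    # Only one of the two search providers is needed.
    if not (env.get("TAVILY_API_KEY") or env.get("SERPER_API_KEY")):
        missing.append("TAVILY_API_KEY or SERPER_API_KEY")
    return missing


if __name__ == "__main__":
    problems = missing_api_keys()
    print("All required keys found" if not problems else f"Missing: {problems}")
```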
uv run python index.py --commodity textiles --country India --industry textiles

uv run python index.py \
  --commodity electronics \
  --country Germany \
  --industry manufacturing \
  --queries-per-pattern 10 \
  --model "anthropic/claude-3.5-sonnet" \
  --output-dir results

uv run python index.py \
  --commodity textiles \
  --country India \
  --outreach

This generates personalized B2B email drafts for leads with score >= 80.
uv run contact-agent --commodity textiles --country India
- --commodity, -c: Product/commodity to search for (required)
- --country, -C: Country to search in (required)
- --industry, -i: Optional industry category
- --queries-per-pattern, -q: Number of results per search query (default: 10)
- --model, -m: LLM model to use (default: openai/gpt-oss-120b:nitro)
- --output-dir, -o: Output directory for results (default: output)
- --outreach: Generate personalized email drafts for leads with score >= 80
The system generates a timestamped CSV file with the following columns:
Core Information:
- tier: Lead quality tier (Tier 1 = best, Tier 3 = lowest)
- lead_score: AI-generated lead score (0-100)
- legitimacy_level: Trade legitimacy assessment (Green/Yellow/Red)
- company_name: Official company name
- website: Company website URL
- location: Company address or location
- country: Country where company is located
- contact_person: Specific contact person if available
- product_category: Product category (e.g., textiles, electronics)
- business_description: Brief company business description
Contact Information:
- direct_emails: All department emails (sales@, exports@, info@, etc.)
- email_confidence_avg: Average email confidence score
- has_verified_email: Whether email was verified
- phone_numbers: All phone numbers in international format
- linkedin_profile: LinkedIn company page URL
- social_links: Other social media profile URLs
Export Information:
- export_details: Specific products the company exports
- export_region: Export markets or regions served
- eu_destinations: Specific EU countries they export to
- certifications: Certifications (ISO, CE, REX, etc.)
- key_executives: Key executives with names and titles
Analysis & Metadata:
- product_match: Product match assessment
- eu_compliance: EU compliance status
- company_type: Manufacturer vs. Trader/Middleman
- reasoning: AI reasoning for the score
- email_draft_subject/recipient/body: Outreach email drafts (if --outreach used)
- backup_url: Backup URL if primary crawl failed
- crawl_failure: Failure reason if crawl failed
Example output file: output/textiles_India_detailed_20240108_143022.csv
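Once a run has produced a CSV, the highest-scoring leads can be pulled out with the standard library alone. The sketch below assumes the file has the `lead_score` column described above and that scores are written as plain numbers; the function name `load_top_leads` is hypothetical, and the default threshold of 80 mirrors the outreach cutoff.

```python
import csv
from pathlib import Path


def load_top_leads(csv_path, min_score: float = 80.0) -> list[dict]:
    """Return rows with lead_score >= min_score, best first."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        rows = list(csv.DictReader(handle))

    top = []
    for row in rows:
        try:
            score = float(row.get("lead_score") or "")
        except ValueError:
            continue  # Skip rows with a missing or non-numeric score.
        if score >= min_score:
            top.append(row)
    return sorted(top, key=lambda r: float(r["lead_score"]), reverse=True)
```

It could be called as `load_top_leads("output/textiles_India_detailed_20240108_143022.csv")`, using the example file name above.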
Edit config/settings.yaml to customize search query patterns. The default patterns are EU-focused to target exporters:
global_search_queries:
- "{commodity} exporters in {country} to EU directory"
- "list of {commodity} manufacturers in {country} exporting to Europe"
- "site:europages.com {commodity} {country}"
- "site:kompass.com {commodity} exporters {country} EU"
- "REX registered {commodity} exporters {country}"
The system automatically generates 40+ diverse queries using:
- Site-specific searches (europages.com, kompass.com, alibaba.com, indiamart.com)
- EU compliance searches (REX registered, CE certified)
- Specific EU country targets (Germany, France, Netherlands)
- Government/export promotion sites
- Industry association directories
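To see how the `{commodity}` and `{country}` placeholders in these patterns turn into concrete queries, here is a minimal sketch using `str.format`. The function name `expand_patterns` is an assumption, and the LLM-driven generation of additional queries is not shown; this covers only the template substitution.

```python
# A few of the default patterns from config/settings.yaml, copied for illustration.
PATTERNS = [
    "{commodity} exporters in {country} to EU directory",
    "site:europages.com {commodity} {country}",
    "REX registered {commodity} exporters {country}",
]


def expand_patterns(patterns: list[str], commodity: str, country: str) -> list[str]:
    """Fill in the placeholders and drop exact duplicates, preserving order."""
    seen: set[str] = set()
    queries = []
    for pattern in patterns:
        query = pattern.format(commodity=commodity, country=country)
        if query not in seen:
            seen.add(query)
            queries.append(query)
    return queries


if __name__ == "__main__":
    for query in expand_patterns(PATTERNS, "textiles", "India"):
        print(query)
```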
# Install development dependencies
uv sync --dev
# Or with pip
pip install -e ".[dev]"
# Format code
black .
# Lint code
ruff check .
# Type checking
mypy .
# Run tests
pytest
contact-detail-agent/
├── agents/
│ ├── search_agent.py # LLM-powered EU-focused search query generation
│ ├── scraper_agent.py # Concurrent crawling with backup URL support
│ └── analyst_agent.py # Lead scoring and EU compliance assessment
├── tools/
│ ├── search_toolkit.py # Search API integration (Tavily/Serper) with retry logic
│ ├── crawler_toolkit.py # Playwright-based web crawling with anti-detection
│ ├── trade_validator.py # EU trade legitimacy validation (VIES, REX)
│ ├── mailer_toolkit.py # Personalized B2B email drafting
│ └── verification_toolkit.py # Email verification (syntax, domain, mailbox)
├── utils/
│ ├── llm_extractor.py # Structured contact information extraction (16 fields)
│ └── output_writer.py # CSV/Excel output with deduplication
├── config/
│ ├── settings.yaml # EU-focused search patterns
│ └── industry_params.json # Industry-specific parameters
├── output/ # Generated results (CSV files)
├── storage/ # Crawlee runtime storage (auto-generated, can be deleted)
├── index.py # Main CLI entry point
├── pyproject.toml # Project configuration
├── .env.example # Environment variables template
└── .gitignore # Git ignore rules
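The concurrent crawling attributed to scraper_agent.py (up to 5 websites in parallel) is commonly implemented with a semaphore. The sketch below shows that general pattern with `asyncio`; it is not the project's code, and `fetch` is a placeholder coroutine standing in for the Playwright-based crawler.

```python
import asyncio
from typing import Awaitable, Callable


async def crawl_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrency: int = 5,
) -> list[tuple[str, object]]:
    """Fetch every URL with at most max_concurrency requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def crawl_one(url: str) -> tuple[str, object]:
        async with semaphore:
            try:
                return url, await fetch(url)
            except Exception as exc:
                # Return the error instead of raising so one bad site
                # does not cancel the rest of the batch.
                return url, exc

    # gather preserves the input order of the URLs.
    return await asyncio.gather(*(crawl_one(u) for u in urls))
```

Returning the exception alongside the URL leaves room for a caller to retry with a backup URL, which is how the failure handling is described elsewhere in this document.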
- Tavily: Better search results, supports advanced filtering
- Serper: Google Search API integration
Supports models available via OpenRouter:
- Claude 3.5 Sonnet (recommended)
- GPT-4o
- Other OpenRouter models
- API keys stored in environment variables (never committed)
- Social media links filtered out for privacy
- No personal data stored or cached
- Anti-detection features for ethical crawling
- ModuleNotFoundError: Always use uv run to execute scripts; it ensures the correct Python environment is used:
  uv run python index.py --commodity textiles --country India
- API Key Errors: Ensure all required API keys are set in .env:
  - OPENROUTER_API_KEY is required
  - TAVILY_API_KEY or SERPER_API_KEY is required
- Rate Limits: Reduce --queries-per-pattern if you are hitting API limits:
  uv run python index.py --commodity textiles --country India --queries-per-pattern 5
- Crawling Failures: Some sites may block crawlers; results vary. The system automatically tries backup URLs.
- Empty Results: Try different search terms or increase query count. The system uses 40+ queries by default.
- Low Quality Results: Search APIs may return directory pages instead of actual company sites. This is a limitation of the search provider.
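For context on the rate-limit item, the retry behavior described in the feature list (exponential backoff for API failures) generally follows the shape below. This is a generic sketch with assumed parameter names and defaults, not the project's implementation; real code would catch only the specific errors that are safe to retry.

```python
import random
import time


def retry_with_backoff(func, max_attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call func(), retrying on failure with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error.
            # Delay doubles each attempt (1s, 2s, 4s, ...) up to max_delay.
            delay = min(max_delay, base_delay * 2 ** attempt)
            # A little random jitter avoids synchronized retries.
            sleep(delay + random.uniform(0, delay * 0.1))
```

The `sleep` parameter exists only so the wait can be replaced in tests.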
Detailed logs are written to agent.log in the project directory. Check this file for debugging issues with search, crawling, or extraction.
The storage/ directory contains temporary Crawlee runtime data. It can be safely deleted:
rm -rf storage/
MIT License - see LICENSE file for details.
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
For issues and questions:
- Open an issue on GitHub
- Check the troubleshooting section
- Review the configuration documentation