Web Reconnaissance Cheatsheet
A cheatsheet of methods and tools for Web Reconnaissance.
This cheatsheet was summarized by AI from the Information Gathering - Web Edition module on HackTheBox.
Reconnaissance Methodology
Two fundamental methodologies:
Active: port scanning, vulnerability scanning, network mapping, banner grabbing, OS fingerprinting, service enumeration, web spidering.
Passive: search engines, WHOIS, DNS, web archives, social media, code repositories.
WHOIS
Provides: registrar, registration/expiration dates, nameservers, contact information of domain owner.
Value in pentest:
- Identify key personnel (names, emails, phone numbers) for social engineering and phishing.
- Discover network infrastructure (name servers, IP addresses) to find entry points and misconfiguration.
- Analyze historical data via WhoisFreaks to track changes in ownership and technical details.
Note: data may be inaccurate or obscured by privacy services.
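A raw WHOIS reply is line-oriented `Key: Value` text, so the interesting fields can be pulled out with a small parser. This is a minimal sketch: the field names below (`Registrar`, `Creation Date`, ...) are common examples, but registrars vary in exact naming, and the sample reply is fabricated for illustration.

```python
# Minimal WHOIS response parser (sketch). Field names vary by registrar;
# the keys matched below are common examples, not a complete list.
def parse_whois(raw: str) -> dict:
    fields = {}
    for line in raw.splitlines():
        if ":" not in line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key in ("registrar", "creation date", "registry expiry date"):
            fields[key] = value
        elif key == "name server":
            fields.setdefault("name servers", []).append(value)
    return fields

# Fabricated sample reply for demonstration:
sample = """\
Registrar: Example Registrar, Inc.
Creation Date: 1995-08-14T04:00:00Z
Registry Expiry Date: 2026-08-13T04:00:00Z
Name Server: A.IANA-SERVERS.NET
Name Server: B.IANA-SERVERS.NET
"""
print(parse_whois(sample))
```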
DNS
DNS Digging
Tools: dig, nslookup, host, dnsenum, fierce, dnsrecon, theHarvester.
- Discover assets: subdomains, mail servers, name server records (CNAME pointing to outdated server may be vulnerable).
- Map network infrastructure: for example, identify hosting provider via NS records, discover load balancer via A records.
- Monitor changes: new subdomains (VPN endpoints), TXT records revealing tools in use (e.g. a 1Password domain-verification record).
DNS record types:
| Record Type | Description |
|---|---|
| A | Map hostname → IPv4 |
| AAAA | Map hostname → IPv6 |
| CNAME | Alias hostname → another hostname |
| MX | Mail servers handling email |
| NS | Delegate DNS zone to name server |
| TXT | Store arbitrary text information |
| SOA | Administrative info of DNS zone |
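On the command line, `dig` and the other tools listed above handle these lookups. As a sketch of what an A-record lookup does, the standard library's system resolver can be used directly; other record types (MX, TXT, ...) need a dedicated DNS library such as dnspython, which is not shown here.

```python
import socket

# A-record lookup via the system resolver (sketch). Returns all IPv4
# addresses for the hostname, or an empty list if it does not resolve.
def a_records(hostname: str) -> list[str]:
    try:
        _name, _aliases, addrs = socket.gethostbyname_ex(hostname)
        return addrs
    except socket.gaierror:
        return []

print(a_records("localhost"))  # typically ['127.0.0.1']
```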
DNS Zone Transfer
Zone transfer reveals:
- Subdomains: complete list, including development servers, staging environments, admin panels.
- IP addresses of each subdomain.
- Name server records: hosting provider and misconfiguration.
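A zone transfer is typically requested with `dig axfr @ns1.example.com example.com`. The AXFR output is one resource record per line, so extracting the subdomain list is simple text processing; the sketch below parses dig-style output (the sample records are fabricated for illustration).

```python
# Parse dig-style AXFR output into (name, type, value) records (sketch).
def parse_axfr(output: str) -> list[tuple[str, str, str]]:
    records = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and dig's ";"-prefixed comment lines
        if not line or line.startswith(";"):
            continue
        parts = line.split()
        if len(parts) >= 5:
            name, rtype, value = parts[0], parts[3], " ".join(parts[4:])
            records.append((name, rtype, value))
    return records

# Fabricated sample output for demonstration:
sample = """\
; <<>> DiG 9.18 <<>> axfr example.com
example.com.         3600 IN SOA ns1.example.com. admin.example.com. 1 7200 900 1209600 86400
staging.example.com.  300 IN A   10.0.0.5
admin.example.com.    300 IN A   10.0.0.6
"""
subs = [name for name, rtype, _ in parse_axfr(sample) if rtype == "A"]
print(subs)
```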
Subdomain & Virtual Host Enumeration
Subdomains: extensions of main domain (blog.example.com), have their own DNS records.
Virtual Hosts: web server configuration allowing multiple websites on single server, distinguished via HTTP Host header. Can be top-level domains or subdomains, each VHost has separate config.
Subdomains often contain:
- Development/staging environments with relaxed security.
- Hidden login portals (admin panels).
- Legacy applications with outdated software.
- Sensitive information: confidential documents, internal data, config files.
Active methods: zone transfer, ffuf, gobuster, dirbuster, dnsenum, amass, assetfinder, puredns, fierce, dnsrecon, sublist3r.
Passive methods: Certificate Transparency logs, search engines (site:).
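The active brute-force approach that ffuf, gobuster, and similar tools implement boils down to: prepend each wordlist entry to the target domain and keep the names that resolve. A minimal stdlib sketch (the domain and wordlist are placeholders; note that wildcard DNS on the target can produce false positives, which real tools filter out):

```python
import socket

# Subdomain brute-force sketch: resolve wordlist.domain candidates and
# keep the hits. Wildcard DNS would make everything "resolve", so real
# tools also check for and filter wildcard responses.
def enumerate_subdomains(domain: str, wordlist: list[str]) -> list[str]:
    found = []
    for word in wordlist:
        candidate = f"{word}.{domain}"
        try:
            socket.gethostbyname(candidate)
            found.append(candidate)
        except socket.gaierror:
            pass  # did not resolve
    return found

# .invalid is reserved (RFC 2606), so nothing should resolve here:
print(enumerate_subdomains("example.invalid", ["www", "mail", "dev"]))
```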
Certificate Transparency Logs
Tools: crt.sh, search.censys.io.
Advantages:
- Not limited by a wordlist or brute-force algorithm, so no exhaustive guessing is needed.
- Historical and comprehensive view of subdomains.
- Discover subdomains with old/expired certificates (potentially vulnerable).
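crt.sh can return results as JSON (a live query looks like `https://crt.sh/?q=%25.example.com&output=json`). Each entry's `name_value` field may hold several newline-separated hostnames, including wildcards. A sketch that deduplicates them, using a fabricated sample response shaped like crt.sh's output:

```python
import json

# Extract unique hostnames from a crt.sh-style JSON response (sketch).
def subdomains_from_ct(raw_json: str) -> set[str]:
    names = set()
    for entry in json.loads(raw_json):
        # name_value may contain several newline-separated hostnames
        for name in entry.get("name_value", "").splitlines():
            names.add(name.strip().lstrip("*."))  # drop wildcard prefixes
    return names

# Fabricated sample imitating crt.sh's output=json structure:
sample = json.dumps([
    {"name_value": "www.example.com\ndev.example.com"},
    {"name_value": "*.example.com"},
])
print(sorted(subdomains_from_ct(sample)))
```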
Fingerprinting
Techniques: banner grabbing, HTTP headers, unique responses (error messages), page content (source code, scripts, copyright).
Tools: Wappalyzer, BuiltWith, WhatWeb, Nmap, Netcraft, wafw00f, Nikto.
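The header-based part of fingerprinting is just inspecting a handful of revealing response headers. A sketch (the header names checked are common tells, and the sample values are fabricated; real servers may strip or spoof them):

```python
# Collect fingerprint hints from HTTP response headers (sketch).
# Servers can strip or spoof these, so treat hints as leads, not proof.
def fingerprint(headers: dict) -> dict:
    hints = {}
    for key in ("Server", "X-Powered-By", "X-AspNet-Version", "Set-Cookie"):
        if key in headers:
            hints[key] = headers[key]
    return hints

# Fabricated sample response headers:
sample = {
    "Server": "Apache/2.4.41 (Ubuntu)",
    "X-Powered-By": "PHP/7.4.3",
    "Content-Type": "text/html",
}
print(fingerprint(sample))
```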
Crawling
Crawling (spidering): automated process of browsing the web, following links to collect information.
Breadth-first: prioritize width, crawl all links on seed page before going deep. Good for website structure overview.
Depth-first: prioritize depth, follow one path of links as far as possible before backtracking. Good for finding specific content or reaching deep structure.
Information collected:
- Links: internal and external links to map structure, discover hidden pages.
- Comments: may reveal sensitive details, internal processes, hints about vulnerabilities.
- Metadata: titles, descriptions, keywords, author names, dates.
- Sensitive files: backup files (.bak, .old), config files (web.config, settings.php), log files (error_log, access_log), files containing passwords, API keys.
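The core of any crawler is extracting links from each fetched page and feeding them back into the queue. A minimal stdlib sketch of the link-extraction step (the sample HTML, including the leaked comment, is fabricated):

```python
from html.parser import HTMLParser

# Minimal link extractor: the building block a breadth-first crawler
# uses to queue further pages (sketch).
class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Fabricated sample page with an admin link and a telling comment:
html = ('<a href="/admin/">admin</a>'
        '<!-- TODO: remove before launch -->'
        '<a href="backup.bak">old</a>')
parser = LinkExtractor()
parser.feed(html)
print(parser.links)
```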
Crawlers
- Burp Suite Spider: powerful active crawler, map web apps, discover hidden content and vulnerabilities.
- OWASP ZAP: free, open-source scanner, includes spider component.
- Scrapy: flexible, scalable Python framework, build custom crawlers.
- Apache Nutch: Java, highly extensible and scalable, handle massive crawls.
ReconSpider (custom Scrapy spider):
wget -O ReconSpider.zip https://academy.hackthebox.com/storage/modules/144/ReconSpider.v1.2.zip
unzip ReconSpider.zip
python3 ReconSpider.py http://example.com
Results are saved in results.json.
robots.txt
Text file in root directory, structure:
User-agent: specifies which crawler/bot the rules apply to. Wildcard (*) = all bots.
Directives:
| Directive | Description | Example |
|---|---|---|
| Disallow | Paths bot should not crawl | Disallow: /admin/ |
| Allow | Explicitly permit crawling specific paths | Allow: /public/ |
| Crawl-delay | Delay between requests (seconds) | Crawl-delay: 10 |
| Sitemap | URL to XML sitemap | Sitemap: https://example.com/sitemap.xml |
Value:
- Disallowed paths often contain sensitive info, backup files, admin panels.
- Analyze allowed/disallowed paths to create website map.
- Some disallowed entries may be honeypot directories planted to lure and detect malicious bots.
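The standard library can parse these directives directly, which makes it easy to enumerate what a site asks crawlers to avoid; from a recon perspective, the disallowed paths are often the interesting ones. A sketch with a fabricated robots.txt:

```python
from urllib.robotparser import RobotFileParser

# Fabricated robots.txt for demonstration:
robots = """\
User-agent: *
Disallow: /admin/
Allow: /public/
""".splitlines()

rp = RobotFileParser()
rp.parse(robots)
# Disallowed paths are recon leads:
print(rp.can_fetch("*", "https://example.com/admin/"))   # False
print(rp.can_fetch("*", "https://example.com/public/"))  # True
```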
Well-Known URIs
Per RFC 8615, the /.well-known/ directory at the domain root centralizes standardized metadata: configuration files, service endpoints, protocols, and security mechanisms.
openid-configuration: part of OpenID Connect Discovery protocol (identity layer on OAuth 2.0). Endpoint https://example.com/.well-known/openid-configuration returns JSON containing metadata about provider endpoints, authentication methods, token issuance.
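The returned JSON maps well-known key names to endpoint URLs, so the recon-relevant endpoints can be picked out directly. A sketch using a fabricated document shaped like an OpenID Connect Discovery response:

```python
import json

# Fabricated openid-configuration document imitating the JSON shape
# defined by OpenID Connect Discovery:
sample = json.dumps({
    "issuer": "https://example.com",
    "authorization_endpoint": "https://example.com/oauth2/authorize",
    "token_endpoint": "https://example.com/oauth2/token",
    "jwks_uri": "https://example.com/.well-known/jwks.json",
})

# Pull the endpoints most useful for mapping the auth surface (sketch).
def oidc_endpoints(raw: str) -> dict:
    doc = json.loads(raw)
    keys = ("authorization_endpoint", "token_endpoint", "jwks_uri")
    return {k: doc[k] for k in keys if k in doc}

print(oidc_endpoints(sample))
```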
Google Dorking
Using search operators to discover sensitive information, vulnerabilities, hidden content:
| Operator | Description | Example |
|---|---|---|
| site: | Limit results to domain | site:example.com |
| inurl: | Find term in URL | inurl:login |
| filetype: | Find specific file type | filetype:pdf |
| intitle: | Find term in title | intitle:"confidential report" |
| intext:/inbody: | Find term in body text | intext:"password reset" |
| cache: | Display cached version | cache:example.com |
| link: | Find pages linking to webpage | link:example.com |
| related: | Find related websites | related:example.com |
| info: | Summary information | info:example.com |
| define: | Define word/phrase | define:phishing |
| numrange: | Find numbers in range | numrange:1000-2000 |
| allintext: | All words in body | allintext:admin password reset |
| allinurl: | All words in URL | allinurl:admin panel |
| allintitle: | All words in title | allintitle:confidential report 2023 |
| AND | Require all terms | site:example.com AND inurl:admin |
| OR | Any term | "linux" OR "ubuntu" |
| NOT | Exclude term | site:bank.com NOT inurl:login |
| * | Wildcard | filetype:pdf user* manual |
| .. | Range search | "price" 100..500 |
| " " | Exact phrase | "information security policy" |
| - | Exclude term | -inurl:sports |
Reference: https://www.exploit-db.com/google-hacking-database
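Dork queries are just space-separated operator terms, so building them from reusable pieces is trivial; this small helper (a sketch, with placeholder domain and terms) keeps a set of queries consistent when targeting many domains:

```python
# Compose Google dork queries from operator terms (sketch).
def dork(*terms: str) -> str:
    return " ".join(terms)

target = "example.com"  # placeholder target domain
queries = [
    dork(f"site:{target}", "filetype:pdf"),
    dork(f"site:{target}", 'intitle:"index of"', "-inurl:www"),
]
for q in queries:
    print(q)
```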
Web Archives
Wayback Machine provides:
- Uncovering hidden assets: old pages, directories, files, subdomains not currently accessible, may contain sensitive info or security flaws.
- Tracking changes: compare historical snapshots to observe evolution (structure, content, technologies, vulnerabilities).
- Gathering intelligence: OSINT about past activities, marketing strategies, employees, technology choices.
- Stealthy reconnaissance: passive activity doesn't interact directly with target infrastructure, less detectable.
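Beyond the web interface, the Wayback Machine exposes a CDX API that returns an index of archived snapshots for a URL pattern, which is convenient for scripted enumeration of historical pages. A sketch that only builds the query URL (the extra parameters shown, such as `collapse` and `fl`, are illustrative):

```python
from urllib.parse import urlencode

# Build a Wayback Machine CDX API query URL (sketch). The endpoint
# returns an index of archived snapshots for the given URL pattern.
def cdx_url(target: str, **params: str) -> str:
    query = {"url": target, "output": "json", **params}
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(query)

print(cdx_url("example.com/*", collapse="urlkey", fl="original,timestamp"))
```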
Automated Reconnaissance Tools
- FinalRecon: Python, modules for SSL certificate checking, WHOIS, header analysis, crawling. Modular structure enables easy customization.
- Recon-ng: modular Python framework with modules for DNS enumeration, subdomain discovery, port scanning, web crawling, and vulnerability exploitation.
- theHarvester: Command-line Python, collect emails, subdomains, hosts, employee names, open ports, banners from search engines, PGP key servers, SHODAN.
- SpiderFoot: Open-source intelligence automation, integrates multiple data sources to collect IP addresses, domains, emails, social media profiles. Performs DNS lookups, web crawling, port scanning.
- OSINT Framework: Collection of tools and resources for OSINT, includes social media, search engines, public records.
Public Buckets
https://buckets.grayhatwarfare.com/
Public search portal for “open buckets” (public cloud storage containers) and contents. Search file names, filter by extension, keywords, date ranges to find exposed files.
Aggregates data from Amazon S3, Azure Blob Storage, Google Cloud Storage. Discover misconfigured buckets leaking data.