ScrapeViz: Hierarchical Representations for Web Scraping Macros

Programming-by-demonstration (PBD) makes it possible to create web scraping macros without writing code. However, it can still be challenging for users to understand the exact scraping behavior that is inferred and to verify that the scraped data is correct, especially when scraping occurs across multiple web pages. We present ScrapeViz, a new PBD tool for authoring and visualizing distributed hierarchical web scraping macros. ScrapeViz's key novelty is in providing users with a visual representation of their web scraping macro: the sequences of pages visited, generalized scraping behavior across similar pages and page elements (shown through grouping layouts and color coding), and the source of scraped data (shown through an interactive output table linked to page context). We conducted a lab study with 12 participants comparing ScrapeViz to the existing web scraping tool Rousillon. Participants found ScrapeViz helpful for understanding high-level scraping behavior, the source of scraped data, and anomalies, and for validating macros in real time while authoring.