{"flag":true,"single":true,"pageTitle":"Simple Grabing tutorial | Create Spider, Extract data, Save data in json, with pagination scraping  | scrapy cheatsheet Python","post":{"id":38,"user_id":"1","slug":"simple-grabing-tutorial-create-spider-extract-data-save-data-in-json-with-pagination-scraping-scrapy-cheatsheet-python-skcm","title":"Simple Grabing tutorial | Create Spider, Extract data, Save data in json, with pagination scraping  | scrapy cheatsheet Python","body":"<p>We will scrape <a href=\"https:\/\/quotes.toscrape.com\/\">https:\/\/quotes.toscrape.com\/<\/a><\/p>\r\n<p>1. Run the command below to start a Scrapy project:<\/p>\r\n<pre class=\"language-markup\"><code>scrapy startproject tutorial<\/code><\/pre>\r\n<p>It creates a new <strong>tutorial<\/strong> folder with the following contents:<\/p>\r\n<p><strong>scrapy.cfg &nbsp;<\/strong> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;# deploy configuration file<\/p>\r\n<p>&nbsp; &nbsp; <strong>tutorial\/ <\/strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; # project's Python module, you'll import your code from here<\/p>\r\n<p>&nbsp; &nbsp; &nbsp; &nbsp;<strong> items.py <\/strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;# project items definition file<\/p>\r\n<p>&nbsp; &nbsp; &nbsp; &nbsp; <strong>middlewares.py <\/strong>&nbsp; &nbsp;# project middlewares file<\/p>\r\n<p>&nbsp; &nbsp; &nbsp; &nbsp; <strong>pipelines.py <\/strong>&nbsp; &nbsp; &nbsp;# project pipelines file<\/p>\r\n<p>&nbsp; &nbsp; &nbsp; &nbsp; <strong>settings.py <\/strong>&nbsp; &nbsp; &nbsp; # project settings file<\/p>\r\n<p>&nbsp; &nbsp; &nbsp; &nbsp; <strong>spiders\/ <\/strong>&nbsp; &nbsp; &nbsp; &nbsp; &nbsp;# a directory where you'll later put your spiders<\/p>\r\n<p>Spiders are classes that you define and that Scrapy uses to scrape information from a website. A spider defines:<\/p>\r\n<ul>\r\n<li>the initial requests to make,<\/li>\r\n<li>how to follow links in the pages,<\/li>\r\n<li>how to parse the downloaded page content to extract data.<\/li>\r\n<\/ul>\r\n<p>2. 
Change directory to the tutorial folder:<\/p>\r\n<pre class=\"language-markup\"><code>cd tutorial<\/code><\/pre>\r\n<p>3. Create a new file&nbsp;<strong>quotes_spider.py<\/strong> inside the <strong>tutorial\/spiders<\/strong> directory.<\/p>\r\n<pre class=\"language-markup\"><code>from pathlib import Path\r\n\r\nimport scrapy\r\n\r\n\r\nclass QuotesSpider(scrapy.Spider):\r\n    name = \"quotes\"\r\n\r\n    def start_requests(self):\r\n        urls = [\r\n            'https:\/\/quotes.toscrape.com\/page\/1\/',\r\n            'https:\/\/quotes.toscrape.com\/page\/2\/',\r\n        ]\r\n        for url in urls:\r\n            yield scrapy.Request(url=url, callback=self.parse)\r\n\r\n    def parse(self, response):\r\n        # The second-to-last URL segment is the page number\r\n        page = response.url.split(\"\/\")[-2]\r\n        filename = f'quotes-{page}.html'\r\n        # Save the raw page body to a local file\r\n        Path(filename).write_bytes(response.body)\r\n        self.log(f'Saved file {filename}')<\/code><\/pre>\r\n<p><strong>name:<\/strong> identifies the Spider. It must be unique within a project, that is, you can&rsquo;t set the same name for different Spiders.<\/p>\r\n<p><strong>start_requests():<\/strong> must return an iterable of Requests (a list or a generator) from which the Spider will begin to crawl.<\/p>\r\n<p><strong>parse():<\/strong> handles the response downloaded for each of the requests made. 
The <strong>response parameter<\/strong> is an instance of TextResponse that holds the page content.<\/p>\r\n<p><strong>4. Run the spider<\/strong><\/p>\r\n<pre class=\"language-markup\"><code>scrapy crawl quotes<\/code><\/pre>\r\n<p>Two new files are created, <strong>quotes-1.html and quotes-2.html<\/strong>, in the directory where you ran the command (the project root).<\/p>\r\n<p><strong>How it works in the background:<\/strong><\/p>\r\n<ul>\r\n<li>Scrapy schedules the <strong>scrapy.Request objects <\/strong>returned by the start_requests method of the Spider.<\/li>\r\n<li>Upon receiving a response for each request, it instantiates a Response object and calls the callback method associated with the request (in this case, the parse method), passing the response as an argument.<\/li>\r\n<\/ul>\r\n<p>parse() is Scrapy&rsquo;s default callback method, so we can define a start_urls list instead of implementing start_requests(), which gives <strong>shorter code<\/strong>:<\/p>\r\n<pre class=\"language-markup\"><code>from pathlib import Path\r\nimport scrapy\r\nclass QuotesSpider(scrapy.Spider):\r\n    name = \"quotes\"\r\n    start_urls = [\r\n        'https:\/\/quotes.toscrape.com\/page\/1\/',\r\n        'https:\/\/quotes.toscrape.com\/page\/2\/',\r\n    ]\r\n    def parse(self, response):\r\n        page = response.url.split(\"\/\")[-2]\r\n        filename = f'quotes-{page}.html'\r\n        Path(filename).write_bytes(response.body)<\/code><\/pre>\r\n<p><em><strong>Extract Data<\/strong><\/em><\/p>\r\n<p>Run the command:<\/p>\r\n<pre class=\"language-markup\"><code>scrapy shell \"https:\/\/quotes.toscrape.com\/page\/1\/\"<\/code><\/pre>\r\n<p>Now try the commands below.<\/p>\r\n<p><strong>Using CSS Selector<\/strong><\/p>\r\n<pre class=\"language-markup\"><code>response.css('title') # returns a SelectorList object\r\nresponse.css('title').getall() # returns a list of the matching elements, tags included\r\nresponse.css('title::text').getall() # returns a list of the text inside the matching elements\r\nresponse.css('title::text').get() # first result only, or None if nothing matches\r\nresponse.css('title::text')[0].get() # first result only; raises IndexError if nothing matches\r\nresponse.css('title::text').re(r'Quotes.*') # return all elements that 
start with Quotes\r\nresponse.css('title::text').re(r'(\\w+) to (\\w+)') # returns the two captured words as a list\r\nresponse.css(\"div.quote\")\r\n# Sub-queries using a CSS selector\r\nquote = response.css(\"div.quote\")[0]\r\ntext = quote.css(\"span.text::text\").get() # returns the first matching element\r\ntags = quote.css(\"a.tag::text\").getall() # returns a list<\/code><\/pre>\r\n<p><strong>Using XPath<\/strong><\/p>\r\n<pre class=\"language-markup\"><code># XPath expressions are more powerful than CSS selectors\r\nresponse.xpath('\/\/title') # returns a SelectorList object\r\nresponse.xpath('\/\/title\/text()').get() # returns the text<\/code><\/pre>\r\n<p><strong>Details: <\/strong>https:\/\/docs.scrapy.org\/en\/latest\/topics\/selectors.html#topics-selectors<\/p>\r\n<p>________________________<\/p>\r\n<p>Extract data in our project<\/p>\r\n<p>A Scrapy spider typically generates <strong>many dictionaries<\/strong> containing the data extracted from the page.<\/p>\r\n<p>\"yield\" is a keyword used in generator functions to return a sequence of values one at a time.<\/p>\r\n<pre class=\"language-markup\"><code>import scrapy\r\nclass QuotesSpider(scrapy.Spider):\r\n    name = \"quotes\"\r\n    start_urls = [\r\n        'https:\/\/quotes.toscrape.com\/page\/1\/',\r\n        'https:\/\/quotes.toscrape.com\/page\/2\/',\r\n    ]\r\n    def parse(self, response):\r\n        # Yield one dictionary per quote block on the page\r\n        for quote in response.css('div.quote'):\r\n            yield {\r\n                'text': quote.css('span.text::text').get(),\r\n                'author': quote.css('small.author::text').get(),\r\n                'tags': quote.css('div.tags a.tag::text').getall(),\r\n            }<\/code><\/pre>\r\n<p>The command below generates a quotes.json file containing all scraped items, serialized in JSON.<\/p>\r\n<pre class=\"language-markup\"><code>scrapy crawl quotes -O quotes.json<\/code><\/pre>\r\n<p>-O: overwrites any existing file with the same name.<\/p>\r\n<p><strong>Save data as JSON Lines; -o appends the new data to an existing file:<\/strong><\/p>\r\n<pre 
class=\"language-markup\"><code>scrapy crawl quotes -o quotes.jsonl<\/code><\/pre>\r\n<p><em><strong>With Pagination:<\/strong><\/em><\/p>\r\n<p>Get the href attribute of the next-page link:<\/p>\r\n<pre class=\"language-markup\"><code>response.css('li.next a::attr(href)').get()\r\n# or\r\nresponse.css('li.next a').attrib['href']<\/code><\/pre>\r\n<p>Scrape all pages of <strong>https:\/\/quotes.toscrape.com\/<\/strong><\/p>\r\n<pre class=\"language-markup\"><code>import scrapy\r\nclass QuotesSpider(scrapy.Spider):\r\n    name = \"quotes\"\r\n    start_urls = [\r\n        'https:\/\/quotes.toscrape.com\/page\/1\/',\r\n    ]\r\n    def parse(self, response):\r\n        for quote in response.css('div.quote'):\r\n            yield {\r\n                'text': quote.css('span.text::text').get(),\r\n                'author': quote.css('small.author::text').get(),\r\n                'tags': quote.css('div.tags a.tag::text').getall(),\r\n            }\r\n        # Follow the link to the next page, if there is one\r\n        next_page = response.css('li.next a::attr(href)').get()\r\n        if next_page is not None:\r\n            # The href is relative, so build an absolute URL from it\r\n            next_page = response.urljoin(next_page)\r\n            yield scrapy.Request(next_page, callback=self.parse)<\/code><\/pre>","category_id":"11","is_private":"0","created_at":"2023-03-21T11:43:06.000000Z","updated_at":"2023-03-21T12:58:26.000000Z","category":{"id":11,"user_id":"1","name":"Scrapy Python","slug":"python-vk7t","parent_id":"5","created_at":"2023-03-21T09:29:28.000000Z","updated_at":"2023-03-21T09:29:43.000000Z"},"user":{"id":1,"name":"R GONDAL","email":"rizikmw@gmail.com","email_verified_at":null,"two_factor_confirmed_at":null,"current_team_id":"1","profile_photo_path":null,"created_at":"2023-03-12T10:49:33.000000Z","updated_at":"2025-01-10T12:59:00.000000Z","profile_photo_url":"https:\/\/ui-avatars.com\/api\/?name=R+G&color=7F9CF5&background=EBF4FF"}},"pageDesc":"We will grabe https:\/\/quotes.toscrape.com\/ 1.Run command below to start a scrapy project&nbsp; scrapy startproject tutorial It will create a - Simple Grabing 
tutorial | Create Spider, Extract data, Save data in json, with pagination scraping  | scrapy cheatsheet Python (Updated: March 21, 2023) - Read more about Simple Grabing tutorial | Create Spider, Extract data, Save data in json, with pagination scraping  | scrapy cheatsheet Python at my programming site [SITE]","categories":[]}