The “RGPV Result Scraper” is a Python script designed to automate the process of extracting student result data from the Rajiv Gandhi Proudyogiki Vishwavidyalaya (RGPV) website. This script enables users to retrieve results for specific branches of study and semesters, subsequently saving the scraped data into CSV files for further analysis or record-keeping.
Web Scraping: The script utilizes web scraping techniques to access and extract information from the RGPV result portal.
Captcha Handling: It reads the portal's captchas automatically using Optical Character Recognition (OCR) with Tesseract, so results can be fetched without manual captcha entry.
Customization: Users can select their desired branch of study and semester before initiating the scraping process, allowing flexibility in result retrieval.
Data Export: The extracted result data, including student roll numbers, names, SGPA, CGPA, and results, is organized and saved in CSV files.
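As a rough illustration of the data export step, the sketch below writes result rows to a CSV file with Python's standard csv module. The column names and the save_results helper are assumptions made for this example; the actual script may name and order its columns differently.

```python
import csv
from pathlib import Path

# Hypothetical column names, based on the fields listed above.
FIELDS = ["roll_number", "name", "sgpa", "cgpa", "result"]


def save_results(rows, csv_path):
    """Write a list of result dictionaries to csv_path and return the row count."""
    path = Path(csv_path)
    # newline="" lets the csv module control line endings on every platform.
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    return len(rows)
```

Each row is a dictionary keyed by the column names, so adding a column only requires extending FIELDS.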
You can watch the video tutorial on how to use this project by clicking on the following image:
Before using this script, make sure you have the following dependencies installed:
Clone the repository to your local machine:
git clone https://github.com/devnamdev2003/result_automation_system.git
Navigate to the project directory:
cd result_automation_system
Install the required Python libraries:
pip install -r requirements.txt
Edit the main() function in the project.py script to customize the scraping parameters, such as the branch, semester, and starting roll number.
Ensure that the msedgedriver.exe WebDriver executable is placed in the project directory.
Run the script:
python project.py
The script will start scraping student results and save them to CSV files in the project directory.
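The starting roll number mentioned in the steps above can be pictured as a counter appended to a fixed enrollment prefix. The sketch below illustrates only that idea: the enrollment_numbers helper, the placeholder prefix 0000XX00, and the three-digit width are hypothetical and are not taken from project.py.

```python
def enrollment_numbers(prefix, start, count, width=3):
    """Yield `count` enrollment numbers: prefix followed by a zero-padded roll number."""
    for roll in range(start, start + count):
        # Pad the roll number with leading zeros up to `width` digits.
        yield f"{prefix}{roll:0{width}d}"
```

For example, list(enrollment_numbers("0000XX00", 1, 3)) gives 0000XX00001, 0000XX00002, and 0000XX00003.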
To adapt this code for your college, follow these steps:
Open the project.py script in a text editor or code editor of your choice.
Update the main() function to match your college’s website structure. Specifically, adjust the following:
You can customize the following parameters in the main() function of project.py:
branch: Enter the branch number according to your college’s branch codes.
sem: Set the semester for which you want to scrape results.
The starting roll number (i) and the roll number format (en_num) should match your college’s enrollment number format.
Tesseract OCR is an open-source optical character recognition engine that is widely used for text recognition in images. To use Tesseract OCR in your project, follow these steps to install it on your system:
Windows: Download and run the Tesseract installer (for example, tesseract-ocr-w64-setup-vX.X.X.exe).
Linux (Debian/Ubuntu): Open a terminal and run the following commands to install Tesseract:
sudo apt update
sudo apt install tesseract-ocr
sudo apt install libtesseract-dev
macOS (Homebrew): Open a terminal and run the following command to install Tesseract:
brew install tesseract
To verify that Tesseract OCR is installed correctly, open a terminal or command prompt and run the following command:
tesseract --version
You should see the Tesseract OCR version information, indicating that the installation was successful.
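The same check can be done from Python before the scraper starts. This is a minimal sketch that uses only the standard library; the find_tesseract and tesseract_version helper names are made up for this example and are not part of the project.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path of the Tesseract binary found on PATH, or None."""
    return shutil.which(executable)


def tesseract_version(executable="tesseract"):
    """Return the first line of `tesseract --version`, or None if it is unavailable."""
    path = find_tesseract(executable)
    if path is None:
        return None
    completed = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False
    )
    # Some Tesseract releases print the version to stderr, so check both streams.
    output = (completed.stdout or completed.stderr).strip()
    return output.splitlines()[0] if output else None
```

A return value of None from tesseract_version() suggests that Tesseract is not installed or not on your PATH.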
You can now use Tesseract OCR in your Python scripts or applications. Make sure to install the pytesseract Python library, which provides a Python wrapper for Tesseract OCR. You can install it using pip:
pip install pytesseract
Refer to the pytesseract documentation for information on how to use Tesseract OCR with Python.
Example: If Tesseract is installed at "C://Program Files//Tesseract-OCR//tesseract.exe", set tesseract_path inside the read_text() function as follows:
def read_text(captcha_image):
    # Replace this with the location of tesseract.exe on your machine.
    tesseract_path = r"C://Program Files//Tesseract-OCR//tesseract.exe"
    # Rest of your code...
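For orientation, here is one way a read_text() function could use that path with pytesseract. It is a sketch, not the project's actual implementation: the clean_captcha_text helper and its cleaning rule (keep only letters and digits, converted to upper case) are assumptions, and the tesseract_path argument is added here for illustration.

```python
import os
import re


def clean_captcha_text(raw_text):
    """Keep only letters and digits and upper-case them (an assumed cleaning rule)."""
    return re.sub(r"[^A-Za-z0-9]", "", raw_text).upper()


def read_text(captcha_image, tesseract_path=None):
    """OCR a captcha image, clean the text, then delete the image file."""
    # Imported here so the cleaning helper works without the OCR packages installed.
    import pytesseract
    from PIL import Image

    if tesseract_path:
        # Point pytesseract at the Tesseract executable.
        pytesseract.pytesseract.tesseract_cmd = tesseract_path
    with Image.open(captcha_image) as image:
        raw_text = pytesseract.image_to_string(image)
    os.remove(captcha_image)
    return clean_captcha_text(raw_text)
```

Keeping the cleaning step in its own function makes it easy to adjust if your portal's captchas use a different character set.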
Import Libraries:
The script starts by importing necessary Python libraries, including those for web scraping, web automation, image processing, and file handling.
Helper Functions:
find_src(source_code): Parses the HTML source code of a web page to locate and extract the URL of the captcha image.
download_image(url, image_name): Downloads the captcha image from a given URL and saves it locally with the specified name.
read_text(captcha_image): Reads text from a captcha image using Tesseract OCR, cleans the extracted text, and deletes the image file.
Main Function (main()): Holds the scraping parameters (branch, semester, and starting roll number) and runs the scraping process.
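To make the find_src() helper more concrete, the sketch below extracts a captcha image URL using only the standard library's html.parser. It is an illustration under assumptions: the real script may use a different parser (such as BeautifulSoup), and the base_url and keyword parameters, along with the rule of matching "captcha" in the image src, are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _ImageSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def find_src(source_code, base_url="", keyword="captcha"):
    """Return the URL of the first image whose src contains keyword, else None."""
    collector = _ImageSrcCollector()
    collector.feed(source_code)
    for src in collector.sources:
        if keyword.lower() in src.lower():
            # Resolve relative paths against the page URL.
            return urljoin(base_url, src)
    return None
```

Returning None when no image matches lets the caller decide whether to retry the page or stop.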
Main Execution:
The script checks whether it is being run as the main module (if __name__ == "__main__":) and calls the main() function to start the scraping process.
Ensure you have Microsoft Edge WebDriver (or the appropriate WebDriver for your browser) installed and set the edge_driver_path variable in the script to its location.
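The WebDriver setup could look roughly like the sketch below, which assumes Selenium 4 and a driver named msedgedriver.exe in the project directory. The resolve_driver_path and start_edge helpers are hypothetical names for this example; check project.py for how edge_driver_path is actually used.

```python
from pathlib import Path


def resolve_driver_path(project_dir=".", driver_name="msedgedriver.exe"):
    """Return the absolute driver path, or raise FileNotFoundError if it is missing."""
    driver_path = Path(project_dir).resolve() / driver_name
    if not driver_path.is_file():
        raise FileNotFoundError(f"WebDriver not found at {driver_path}")
    return driver_path


def start_edge(edge_driver_path):
    """Start Microsoft Edge through Selenium 4 using the given driver executable."""
    # Imported here so the path check above works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.edge.service import Service

    service = Service(executable_path=str(edge_driver_path))
    return webdriver.Edge(service=service)
```

Checking the path first gives a clear error message when the driver file is missing, instead of a less obvious failure at browser startup.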
Make sure you have the correct branch and semester selected before running the script.
Happy scraping!
This project is licensed under the MIT License - see the LICENSE file for details.
Feel free to contribute to the project or report issues on the GitHub repository.