Assignment Task
Module Learning Outcomes 1-5:
- Demonstrate a sound knowledge of information retrieval principles
- Apply main data structures used in index construction in Python or a similar high-level language
- Implement a typical web crawler and query processor, in Python or a similar high-level language
- Acquire knowledge and skills to apply common machine learning methods for text classification and document clustering
- Build the outline of a minimum viable vertical search engine for text retrieval
Tasks:
There are two tasks in this coursework. You can use any general-purpose programming language of your choice to perform these tasks; however, Python is recommended.
Task 1. Search Engine
Your system crawls the relevant web pages and retrieves information about all available publications. For each publication, it extracts the available data (such as authors, publication year, and title) and the links to both the publication page and the author's profile (also called "pureportal" profile) page. It also visits the publication page and extracts the abstract, if one is available.

Make sure that your crawler is polite, i.e. it obeys the robots.txt rules and does not hit the servers unnecessarily or too fast.

Because this information changes infrequently, your crawler may be scheduled to look for new information, say, once per week, but it should ideally be able to do so automatically, as a scheduled task. Every time it runs, it should update the index with the new data instead of building it from scratch.

Make sure you apply the required pre-processing tasks to both the crawled data and the users' queries.

From the user's point of view, your system has an interface similar to the Google Scholar main page, where the user can type in their queries/keywords about the publications they want to find. Your system then displays the results, sorted by relevance, in a similar way to Google Scholar, except that the search results are restricted to the publications by the members of the specified school.

You must show in your report and viva that your system is accurate by trying various queries. For example, you must use both short and long queries, both with and without stop words, queries with various keywords, and more challenging queries to prove the robustness of your system.
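As a minimal sketch of the politeness requirement above, the standard library's `urllib.robotparser` can check robots.txt rules before each fetch, and a fixed delay can rate-limit requests. The URLs, the delay value, and the function names here are illustrative assumptions, not values taken from the brief; the actual download step is omitted so the sketch stays offline.

```python
# Polite-crawler sketch: consult robots.txt rules and pause between fetches.
import time
from urllib import robotparser

def make_robot_parser(robots_txt_lines):
    """Build a parser from the lines of a robots.txt file."""
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt_lines)
    return rp

def polite_fetch(urls, rp, user_agent="*", delay=2.0):
    """Yield (url, allowed) pairs, sleeping `delay` seconds after each permitted fetch."""
    for url in urls:
        if not rp.can_fetch(user_agent, url):
            yield url, False          # robots.txt disallows this page: skip it
            continue
        # A real crawler would download the page here (e.g. with requests or
        # urllib.request); omitted to keep the sketch self-contained.
        yield url, True
        time.sleep(delay)             # do not hit the server too fast
```

In a real run, the robots.txt lines would be fetched from the target site, and `delay` set to respect any declared crawl delay.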
Task 2. Document Clustering
Develop a document clustering system.
First, collect a number of documents that belong to different categories, namely sport, science, and business. Each document should be at least one sentence (longer is usually better). The total number of documents is up to you but should be at least 100 (more is usually better). You may collect these documents from publicly available websites such as the BBC News website, but make sure you respect their copyright and terms of use and clearly cite them in your work. You may simply copy-paste such texts manually; writing an RSS feed reader/crawler to do it automatically is NOT mandatory.

Once you have collected sufficient documents, cluster them using a standard clustering method (e.g. K-means).

Finally, use the created model to assign a new document to one of the existing clusters. That is, the user enters a document (e.g. a sentence) and your system outputs the right cluster.
NOTE: You must show in your report and viva that your system suggests the right cluster for a variety of inputs, e.g. short and long inputs, inputs with and without stop words, inputs of different topics, as well as more challenging inputs, to show the system is robust enough.
Fully working crawler component (Task 1)
Must completely crawl the required web pages and find all the publications. It should be scheduled to re-crawl to extract new data automatically. Must be polite by preserving robots.txt rules and not hitting the servers too fast. It must extract the required data for each publication (at least author(s), title, publication year, abstract, the link to the publication page and the link to the CU author page). Should apply appropriate pre-processing tasks before passing the data to the indexer.
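The pre-processing step mentioned above (and again for queries in Task 1) typically means lowercasing, tokenisation, stop-word removal, and stemming. A toy sketch follows; the stop-word list and the suffix-stripping rules are illustrative assumptions, and a real system would use a fuller stop list and a proper stemmer such as NLTK's `PorterStemmer`.

```python
# Minimal pre-processing pipeline: lowercase, tokenise, drop stop words,
# and apply a crude suffix-stripping stemmer (a toy stand-in for Porter).
import re

STOP_WORDS = {"a", "an", "and", "are", "the", "of", "in", "on", "for", "to", "is"}

def crude_stem(token):
    """Strip a few common English suffixes; purely illustrative rules."""
    for suffix in ("ing", "ers", "er", "ies", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Return the stemmed, stop-word-free tokens of `text`."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

The same function would be applied both to crawled publication data before indexing and to user queries before matching, so that, for example, "Clustering" and "clusters" reduce to the same term.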
Construction of Inverted Index (Task 1)
Construction of the index based on appropriate data structures studied in the module, as opposed to naive database tables. The index should be updated incrementally once new data are received from the crawler component, as opposed to being constructed from scratch every time it is run. No marks are awarded for this component if Elastic Search is used. (15 marks)
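The incremental-update requirement can be sketched with the classic inverted-index structure: a dictionary mapping each term to a postings map of document id to term frequency. The class and method names below are illustrative assumptions; tokens are assumed to be already pre-processed.

```python
# Incrementally updatable inverted index: term -> {doc_id: term_frequency}.
# Re-indexing a document replaces its old postings instead of rebuilding
# the whole index, which is what the brief asks for.
from collections import defaultdict

class InvertedIndex:
    def __init__(self):
        self.postings = defaultdict(dict)   # term -> {doc_id: tf}
        self.docs = set()

    def add_document(self, doc_id, tokens):
        """Add (or re-add) one document; only its own postings change."""
        if doc_id in self.docs:
            # Re-crawl of a known document: drop its stale postings first.
            for plist in self.postings.values():
                plist.pop(doc_id, None)
        self.docs.add(doc_id)
        for term in tokens:
            plist = self.postings[term]
            plist[doc_id] = plist.get(doc_id, 0) + 1

    def lookup(self, term):
        """Return the postings for `term` as a plain dict (empty if unseen)."""
        return dict(self.postings.get(term, {}))
```

Each weekly crawl then simply calls `add_document` for new or changed publications, leaving the rest of the index untouched.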
Fully working query processor component (Task 1)
Showing results relevant to queries given by the user. The accuracy and robustness of the system must be proved in the report (via screenshots) and the viva (by running the system and showing the results) for various input queries. For example, appropriate inputs should be used to prove the system properly performs the pre-processing tasks (such as stop-word removal and stemming) and ranked retrieval. (25 marks)
Fully working document clustering component (Task 2)
Enough data should be used. A standard clustering algorithm such as K-means must be used. In addition, the learned model must be used to identify the right cluster for a given input. Various inputs must be used both in the report (via screenshots) and the viva to show that the system is accurate and robust. The system should be evaluated using one of the methods introduced in the module. (25 marks)
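The fit-then-assign workflow required here can be sketched with a tiny pure-Python K-means over term-frequency vectors. Everything below (function names, deterministic initialisation from the first k documents, plain Euclidean distance) is an illustrative assumption; in practice scikit-learn's `KMeans` over TF-IDF features is the usual choice.

```python
# Toy K-means: fit centroids on a document collection, then assign a new
# document to its nearest centroid ("the right cluster" in the brief).
import math

def tf_vector(tokens, vocab):
    """Term-frequency vector of `tokens` over a fixed vocabulary index."""
    v = [0.0] * len(vocab)
    for t in tokens:
        if t in vocab:
            v[vocab[t]] += 1.0
    return v

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(vectors, k, iters=20):
    """Return k centroids; first k vectors seed the centroids (deterministic)."""
    centroids = [list(v) for v in vectors[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: dist(v, centroids[c]))
            clusters[nearest].append(v)
        for c, members in enumerate(clusters):
            if members:   # empty clusters keep their old centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def assign(vector, centroids):
    """Cluster id of the nearest centroid, for a new user-entered document."""
    return min(range(len(centroids)), key=lambda c: dist(vector, centroids[c]))
```

A new input sentence is pre-processed, vectorised with the same vocabulary, and passed to `assign`, which is exactly the demonstration the report and viva must cover.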
Overall usability (Tasks 1-2)
Acceptable response time, easy-to-use interface, and anything else that might affect the usability of the systems. (10 marks)

Based on the above marking scheme, an example of a mark of 40 is a working search engine that accepts users' queries/keywords and displays some partially correct results, without a proper index and with no document clustering (Task 2) component. Alternatively, Task 2 may be properly accomplished but the search engine might not be fully working because of an inappropriate query processor and indexer, albeit with a reasonably working crawler.

An example of a 70+ mark is a fully working search engine with reasonable accuracy and speed. This requires fully working crawler and query processor components. In addition, the system must have at least one, and preferably both, of the other two components, i.e. the inverted index (without using Elastic Search) and the document clustering component, in fully working status. If Task 2 is missing, then the rest must be perfect; for example, the inverted index must be fully implemented (without using Elastic Search) and be updated incrementally. Alternatively, if Task 2 is perfect, then the inverted index may be implemented using Elastic Search, in which case the output of Elastic Search must be reformatted into more readable results for the user.

Other marks are possible based on the above marking scheme.

To show that your system meets each of the above-mentioned requirements, your report must provide sufficient evidence, including a clear description, complete source code, and complete screenshots where applicable. Your viva must also demonstrate the fully working systems by trying numerous and various inputs. See Appendix 1 for the items to cover in your report and viva.
Part 1 – Search engine
1. Crawler:
1.1 Whether or not all the publications are crawled. If not, how many?
1.2. Information collected about each publication (e.g. links, title, year, author, abstract, or any additional part)
1.3. Which pre-processing tasks are performed before passing data to Indexer/Elastic Search
1.4. When the crawler operates, e.g. scheduled or run manually
1.5. Brief explanation of how it works
2. Indexer
2.1. Whether you implemented the index or used Elastic Search (note that if Elastic Search is used you will lose the whole mark for index construction, but the project becomes easier).
2.2. If you implemented it, which data structure is used (for example, incidence matrix or inverted index)
2.3. If you implemented it, whether it is incremental, i.e. it grows and gets updated over time, or it is constructed from scratch every time your crawler is run
2.4. If you implemented it, show some part of its content (e.g. the constructed dictionary) as screenshot and in your viva
2.5. Brief explanation of how it works
3. Query processor
3.1. Which pre-processing tasks are applied to a given query
3.2. Whether you only support Boolean queries (using AND, OR, NOT, etc.) or accept plain keywords as Google does (without any need for AND, OR, NOT, etc.). The latter is preferred, like the common queries we give Google.
3.3. If Elastic Search is used, how you convert a user query to an appropriate query for Elastic Search
3.4. If Elastic Search is NOT used, whether or not you perform ranked retrieval; if yes, specify whether or not you used vector space and the method used to calculate the ranks
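The vector-space ranked retrieval mentioned in 3.4 is commonly done with TF-IDF weights and cosine similarity. A minimal sketch follows, assuming documents and the query are already pre-processed into token lists; the function names and the `log(N/df)` IDF form are illustrative choices, not requirements from the brief.

```python
# Vector-space ranked retrieval: score each document against the query by
# cosine similarity over TF-IDF weights, then sort by descending score.
import math
from collections import Counter

def tfidf_weights(token_lists):
    """Per-document {term: tf*idf} dicts plus the shared idf table."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    idf = {t: math.log(n / df[t]) for t in df}
    docs = [{t: tf[t] * idf[t] for t in tf}
            for tf in (Counter(tokens) for tokens in token_lists)]
    return docs, idf

def cosine(a, b):
    num = sum(a[t] * b.get(t, 0.0) for t in a)
    denom = math.sqrt(sum(v * v for v in a.values())) * \
            math.sqrt(sum(v * v for v in b.values()))
    return num / denom if denom else 0.0

def rank(query_tokens, token_lists):
    """Document indices sorted by descending cosine score (ties by index)."""
    docs, idf = tfidf_weights(token_lists)
    qtf = Counter(query_tokens)
    qvec = {t: qtf[t] * idf.get(t, 0.0) for t in qtf}
    scores = [(cosine(qvec, d), i) for i, d in enumerate(docs)]
    return [i for s, i in sorted(scores, key=lambda p: (-p[0], p[1]))]
```

Documents sharing more rare query terms rank higher, which is the behaviour the demonstration queries in 3.5 should exhibit.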
3.5. Demonstration of the running system (use screenshots in your report and run your software in your viva). You must run your system on numerous and various input queries to prove the accuracy and robustness of your system. For example, you must use appropriate queries to prove your system performs stop-word removal, stemming, and ranked retrieval.
3.6. Brief explanation of how it works
4. (Optional)
Any other important point you may want to mention, including any restriction, extras, issues
Part 2 – Document clustering
1. How and how many input documents are collected
2. Which document clustering method (e.g. K-means with appropriate K value) has been used and how its performance is measured
3. Which type of clustering is used (hierarchical/flat and hard/soft)
4. Screenshot and demonstration of its accuracy and robustness for numerous and various inputs
5. Which of the metrics explained in the lecture is used to evaluate the system
6. Brief explanation of how it works
7. (Optional) any other important point you may want to mention
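For items 2 and 5 above, one standard evaluation metric is the silhouette coefficient; whether it is the metric covered in the lectures is an assumption here, and the pure-Python version below is only a sketch (scikit-learn's `silhouette_score` is the usual tool).

```python
# Silhouette coefficient sketch: for each point, a = mean distance to its own
# cluster, b = mean distance to the nearest other cluster; score (b-a)/max(a,b).
# Values near 1 mean tight, well-separated clusters; negative values mean
# many points sit closer to another cluster than to their own.
import math

def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(points, labels):
    """Mean silhouette over all points (labels is a parallel list)."""
    scores = []
    clusters = set(labels)
    for i, p in enumerate(points):
        same = [q for j, q in enumerate(points)
                if labels[j] == labels[i] and j != i]
        if not same:                 # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = sum(_dist(p, q) for q in same) / len(same)
        b = min(sum(_dist(p, q) for j, q in enumerate(points) if labels[j] == c)
                / labels.count(c)
                for c in clusters if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Reporting this score for the chosen K (and for a few alternative K values) is one way to justify the clustering configuration in the report.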



