
Preparing data for LLM Applications using Data Prep Kit
When building data-intensive applications, a significant portion of your time goes to data wrangling (cleaning, de-duplicating, removing markup, and so on). Data Prep Kit (DPK) is an open source Python library that scales from your laptop to a cluster in the cloud. It has been used to prepare terabytes of data for training the IBM Granite Large Language Models (LLMs). Features include:
- de-duping documents (exact dedupe and fuzzy dedupe)
- handling documents and code
- language detection (spoken languages and programming languages)
- removing PII
- malware detection
- creating embeddings for a vector database
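To make the two de-duplication styles in the list above concrete, here is a minimal sketch in plain Python. It does not use DPK's own API; the function names (`exact_dedupe`, `fuzzy_dedupe`), the word-shingle size, and the 0.7 similarity threshold are all assumptions chosen for illustration. Exact dedupe hashes normalized text, while fuzzy dedupe compares sets of word shingles with Jaccard similarity. The pairwise comparison shown here is only practical for small inputs; large-scale pipelines typically rely on approximate techniques such as MinHash.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def exact_dedupe(docs: list[str]) -> list[str]:
    """Keep the first document for each distinct hash of the normalized text."""
    seen: set[str] = set()
    unique: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def shingles(text: str, n: int = 3) -> set[str]:
    """Return the set of n-word shingles (hypothetical default n=3)."""
    words = normalize(text).split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def fuzzy_dedupe(docs: list[str], threshold: float = 0.7) -> list[str]:
    """Drop a document if it is at least `threshold` similar to one already kept."""
    kept: list[str] = []
    kept_shingles: list[set[str]] = []
    for doc in docs:
        current = shingles(doc)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    corpus = [
        "The quick brown fox jumps over the lazy dog",
        "the quick  brown fox jumps over the lazy dog",        # exact duplicate after normalization
        "The quick brown fox jumps over the lazy dog today",   # near duplicate
        "Completely different text about data preparation",
    ]
    print(exact_dedupe(corpus))
    print(fuzzy_dedupe(exact_dedupe(corpus)))
```

Running exact dedupe first keeps the more expensive fuzzy pass smaller, which is a common ordering for this kind of cleanup.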
Check out the IBM TechXchange Dev Day: Virtual Agents spotlight on IBM Developer: https://ibm.biz/virtual-agents-dev-day-highlights
Connect with developers and other practitioners in the IBM TechXchange Community: https://ibm.biz/techxchange-community
Choose a specific community topic group:
IBM Granite AI foundation models: https://ibm.biz/granite-models-group
Global AI and Data Science: https://ibm.biz/global-ai-data-science-group
watsonx.ai: https://ibm.biz/watsonx-ai-group
watsonx Assistant: https://ibm.biz/watsonx-assistant-group
watsonx Orchestrate: https://ibm.biz/watsonx-orchestrate-group
____________________________________________
IBM Developer — write better code, boost your skills, and build something new: https://ibm.biz/ibm-developer-yt
Subscribe to see more developer content: https://ibm.biz/ibm-developer-yt-subscribe
Follow IBM Developer on LinkedIn: https://ibm.biz/ibm-developer-linkedin-yt
More from IBM Developer:
Community: https://community.ibm.com/community/user/community
Blog: https://developer.ibm.com/blogs/
Call for Code: https://developer.ibm.com/callforcode/
#virtualagents
#IBMTechXchange
#IBMDeveloper
#Developer
#Coding