(110h) Data Centric AI—How to Build the Software 2.0 Stack to Maximize ROI on Unstructured Data | AIChE

Data Centric AI - the way to handle unstructured data and develop great AI models

Data-centric AI is the practice of systematically engineering the data used to build AI systems. AI is made of both code and data. Historically, AI has focused primarily on code, with researchers building ever more sophisticated models on fixed datasets. Think GPT-3, with its 175 billion parameters and an estimated $12 million for a single training run. But the real-world experience of those who put models into production shows that when you're trying to improve a model, it's often the quality of the data, and how you iterate on it, that makes your AI project succeed or fail. And quality means two things: for a model to perform well, you need both clean data and diverse data.

Set up a Software 2.0 stack

Starting a company today without machine learning is like starting a company ten years ago without software. AI is Software 2.0. In Software 1.0, the human instructs the machine, line by line. Software 2.0 is a neural network that learns which rules are needed for the desired outcome. Software 2.0 is king when the algorithm itself is difficult to design explicitly. Think object detection in images. If you recognize Software 2.0 as a new and emerging programming paradigm, and data-centric AI (DCAI) as the agile methodology for Software 2.0, you need to set up a Software 2.0 stack.
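To make the contrast concrete, here is a minimal sketch in Python. It is an illustration only: a single-neuron perceptron stands in for a neural network, and the names `rule_v1`, `train_perceptron`, and `predict`, along with the toy dataset, are assumptions made for this example rather than anything taken from a particular library.

```python
def rule_v1(a, b):
    """Software 1.0: a person writes the rule explicitly."""
    return 1 if a == 1 and b == 1 else 0


def predict(w, bias, x):
    """Fire (1) when the weighted sum of the inputs is positive."""
    return 1 if w[0] * x[0] + w[1] * x[1] + bias > 0 else 0


def train_perceptron(samples, labels, epochs=50, lr=1.0):
    """Software 2.0 (toy version): the rule is learned from labeled data."""
    w, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            err = y - predict(w, bias, x)  # 0 when the prediction is right
            w[0] += lr * err * x[0]
            w[1] += lr * err * x[1]
            bias += lr * err
    return w, bias


# Hypothetical labeled dataset: the desired outcome is 1 only when both
# input features are present.
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 0, 1]

w, bias = train_perceptron(samples, labels)
learned = [predict(w, bias, x) for x in samples]
handwritten = [rule_v1(*x) for x in samples]
```

In this toy case both approaches agree on every example. The point of the sketch is that in the second approach, changing the behavior means changing `samples` and `labels`, not the code.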

Find the right IDE

To deliver good software, developers write code in a dedicated tool called an IDE (Integrated Development Environment). An IDE is the Microsoft Word of code. A good IDE is designed for writing good code, not just a lot of code. It supports an interactive, iterative way of working, with a short time between development and testing. It provides syntax highlighting, debuggers, profilers, go-to-definition, Git integration, and more. In the Software 2.0 stack, the programming is done by accumulating, massaging, and cleaning datasets. To switch to Software 2.0, you need a Software 2.0 IDE. It helps with all of the workflows involved in accumulating, visualizing, cleaning, labeling, and sourcing datasets. It bubbles up images that the network suspects are mislabeled. It assists in labeling by seeding labels with predictions. It suggests useful examples to label based on the uncertainty of the network’s predictions. It shares knowledge and enforces the reuse of data across the organization.
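Three of the IDE capabilities described above (surfacing suspected mislabels, seeding labels with predictions, and suggesting uncertain examples) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the function names, the 0.2 threshold, and the `probs` values are hypothetical, and a real tool would likely use more robust criteria than these.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def suspected_mislabels(probs, labels, threshold=0.2):
    """Indices where the model gives the assigned label less than
    `threshold` probability, least plausible label first."""
    flagged = [(p[y], i) for i, (p, y) in enumerate(zip(probs, labels))
               if p[y] < threshold]
    return [i for _, i in sorted(flagged)]


def seed_labels(probs):
    """Pre-fill each label with the model's most likely class."""
    return [max(range(len(p)), key=p.__getitem__) for p in probs]


def most_uncertain(probs, k):
    """Indices of the k predictions with the highest entropy,
    i.e. the examples most worth sending to a human labeler."""
    order = sorted(range(len(probs)), key=lambda i: entropy(probs[i]),
                   reverse=True)
    return order[:k]


# Hypothetical class probabilities for 4 images over 3 classes.
probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.85, 0.05],
    [0.34, 0.33, 0.33],
    [0.05, 0.05, 0.90],
]
labels = [0, 0, 2, 2]  # existing human labels, possibly noisy

review_queue = suspected_mislabels(probs, labels)
proposals = seed_labels(probs)
to_label_next = most_uncertain(probs, k=2)
```

With these made-up numbers, image 1 lands in the review queue because the model gives its assigned label only 10% probability, and image 2 is the top labeling suggestion because its prediction is close to uniform.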

Because it reduces development time

One of the key success factors of the software industry over the past years has been agile. The movement crystallized in February 2001, when a group of 17 software developers, including Martin Fowler, met in Snowbird, Utah, and wrote the Agile Manifesto, following earlier discussions in Oregon in the spring of 2000 on how to speed up development and bring new software to market faster. They made development and testing activities concurrent, allowing more communication between developers, managers, testers, and customers. Reported gains in cost-effectiveness, productivity, quality, cycle time, and customer satisfaction range from 30% to 100%. Data-centric AI is the agile of AI. Labeling, model training, and model diagnostics can run in parallel and directly influence the data used for the AI system. This removes the unnecessary trial-and-error time spent tuning the model while inconsistent data goes unfixed, and it can cut development time by up to 10x.

Set up a human-in-the-loop culture

Revolutionary change is not linear or constant. It is the chaos that disturbs the organization and leads to the reshaping of its culture. DCAI means bringing human intelligence to machine learning. DCAI means leveraging human expertise to train good AI. To do so, you need to put humans right in the center, in a human-in-the-loop machine learning process. This is not trivial. It will change your development processes. It will change the structure of your organization. It may change your business model. It will involve forming a new vision and a new mission. If you do it, you will improve the consistency, accuracy, transparency, and safety of your models. To succeed, start small. Execute DCAI pilot projects to gain momentum. Build a multi-disciplinary in-house DCAI team made up of subject matter experts, ML engineers, and data quality managers. Provide broad DCAI training.