MIT’s ChartNet Lets a 2B AI Model Beat GPT-4o at Charts

MIT-IBM’s ChartNet dataset trains a 2-billion-parameter open model to read charts better than GPT-4o, scoring 70.3% on data extraction versus 46.7%.

Published June 3, 2026

Logan Pierce

ChartNet, a new public dataset from MIT and the MIT-IBM Computing Research Lab, taught a 2-billion-parameter open-source model to read charts more accurately than GPT-4o, OpenAI’s flagship multimodal system. On the task of pulling the underlying numbers out of a chart, the small model scored 70.3 percent against GPT-4o’s 46.7 percent. The dataset holds more than a million chart images, each paired with the code, table, and text that explain it.

That result lands on a sore spot. The finance industry runs on charts, and the promise of feeding a quarterly report to an AI and getting clean figures back has been undercut by models that misread axes and invent values. ChartNet narrows that gap for anyone willing to fine-tune a small model, though the biggest commercial systems still hold an edge on the hardest job: rebuilding a chart’s code from the picture alone.

Why Charts Still Trip Up Frontier AI

A chart is a strange object for a vision-language model (VLM, an AI system that reads images and text together). It is part picture, part table, part sentence. To answer “what was revenue in Q3,” the model has to find a bar, map it to an axis, read a number that may not be printed anywhere, and phrase the answer. Models that ace natural photos and plain text often fall apart on that mix.
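The "map it to an axis" step is, at bottom, a linear interpolation between two known tick marks. The sketch below shows that arithmetic, assuming a linear y-axis and pixel rows that grow downward, as in most image formats. The function name and the numbers are illustrative and do not come from the ChartNet paper.

```python
def pixel_to_value(y_pixel, tick_a, tick_b):
    """Convert a pixel row to a data value on a linear axis.

    tick_a and tick_b are (pixel_row, data_value) pairs for two axis ticks.
    """
    (pa, va), (pb, vb) = tick_a, tick_b
    if pa == pb:
        raise ValueError("ticks must sit on different pixel rows")
    # Fraction of the way from tick_a to tick_b, measured in pixels.
    t = (y_pixel - pa) / (pb - pa)
    return va + t * (vb - va)


# Hypothetical chart: the 0 tick sits at row 400 and the 100 tick at row 100.
# A bar whose top edge is at row 190 is 70 percent of the way up the axis.
q3_revenue = pixel_to_value(190, (400, 0.0), (100, 100.0))
print(q3_revenue)
```

A model has to do the equivalent of this implicitly, and also locate the ticks and the bar edge in the first place, which is where misread axes come from.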

The bottleneck has been training data. Most chart datasets scraped a few thousand images off the web and stopped there, with no code, no clean data table, and thin labels. Jovana Kondic, an MIT electrical engineering and computer science graduate student who led the work, puts the scale problem bluntly: a model may need to see thousands of examples before it reliably recognizes something as a line chart. ChartNet was built to remove that ceiling. Each sample carries five aligned pieces that let a model connect the picture to the meaning:

Plotting code – the Python that generated the chart, from a few hundred to tens of thousands of characters
Chart image – the rendered figure the model actually sees
Data table – the underlying numbers in CSV form
Natural language summary – a written description of what the chart shows
Question-and-answer pairs – with step-by-step reasoning, so the model learns how to get to an answer, not just the answer
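One way to picture a single sample is as a record with those five pieces side by side. The sketch below is a guess at the shape: the class names, field names, and file path are hypothetical and are not the dataset's actual schema on Hugging Face.

```python
from dataclasses import dataclass, field
import csv
import io


@dataclass
class QAPair:
    question: str
    reasoning: str  # step-by-step rationale leading to the answer
    answer: str


@dataclass
class ChartSample:
    plotting_code: str    # Python source that drew the chart
    image_path: str       # rendered figure the model sees
    data_table_csv: str   # underlying numbers as CSV text
    summary: str          # natural language description
    qa_pairs: list[QAPair] = field(default_factory=list)

    def table_rows(self):
        """Parse the CSV text into a list of dicts, one per data row."""
        return list(csv.DictReader(io.StringIO(self.data_table_csv)))


# Hypothetical sample; every value here is made up for illustration.
sample = ChartSample(
    plotting_code="import matplotlib.pyplot as plt  # plotting calls elided",
    image_path="charts/000001.png",
    data_table_csv="quarter,revenue\nQ1,40\nQ2,55\nQ3,70\n",
    summary="Revenue rose each quarter, reaching 70 in Q3.",
    qa_pairs=[QAPair(
        question="What was revenue in Q3?",
        reasoning="Find the Q3 bar, then read its height against the y-axis.",
        answer="70",
    )],
)
q3 = next(r for r in sample.table_rows() if r["quarter"] == "Q3")
print(q3["revenue"])
```

Because the table, the code, and the answer all describe the same chart, a training pipeline can check one against another, which is the alignment the list above describes.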

The public release on Hugging Face spans 24 chart types across six plotting libraries, from bar and line charts to violin plots and heatmaps. It has been downloaded 16,441 times in the past month, a sign the developer community is already building on it. The work is described in the ChartNet research paper accepted to CVPR 2026, the IEEE Computer Vision and Pattern Recognition Conference.

A 2B Model Out-Read GPT-4o on the Numbers

The headline finding is not that ChartNet helps a little. It is that small open-source models fine-tuned on it beat a commercial system many times their size on the tasks businesses care about most. The researchers trained IBM’s Granite Vision series and other open models, then ran them against GPT-4o on the dataset’s own evaluation set. The 2-billion-parameter Granite Vision model led on three of four tasks.

Task	Granite Vision 2B (ChartNet)	GPT-4o
Chart data extraction	70.3%	46.7%
Chart summarization	83.9%	77.1%
Chart QA with reasoning	65.0%	61.1%
Chart reconstruction (code)	90.4%	95.9%

Data extraction is the gap that matters for a finance team. A lead of 23.6 percentage points on reading values out of a chart is the difference between a tool you can trust with a balance sheet and one you have to double-check by hand. Summarization and reasoning went the small model’s way too, by smaller margins of 6.8 and 3.9 points. The one task GPT-4o kept was reconstruction, turning a chart image back into runnable code, where its 95.9 percent edged the open model’s 90.4. The pattern is clean: for reading what a chart says, size stopped being the deciding factor.
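The margins can be recomputed from the table in a few lines. The scores below are copied from the table; the variable names are illustrative.

```python
# (Granite Vision 2B fine-tuned on ChartNet, GPT-4o), in percent.
scores = {
    "Chart data extraction": (70.3, 46.7),
    "Chart summarization": (83.9, 77.1),
    "Chart QA with reasoning": (65.0, 61.1),
    "Chart reconstruction (code)": (90.4, 95.9),
}

# Positive margin means the small model leads, in percentage points.
margins = {
    task: round(small - gpt4o, 1) for task, (small, gpt4o) in scores.items()
}

for task, margin in margins.items():
    leader = "Granite Vision 2B" if margin > 0 else "GPT-4o"
    print(f"{task}: {margin:+.1f} points ({leader} leads)")
```

The small model comes out ahead on three rows and behind on one, matching the three-of-four tally.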

Inside ChartNet’s Synthetic Pipeline

None of this works without solving the data problem first, and the team did it with synthetic data: charts generated by algorithms rather than scraped from the web. That choice is what let them reach more than a million examples without a labeling army. The pipeline runs in two steps and has a gate built in to throw out anything that does not hold up.

From One Chart to Hundreds of Variants

The system starts by translating an existing chart image into code. Then it rewrites that code over and over, changing the chart type, the values, the topic, the colors. One seed chart spawns hundreds of variations. “We can start from a single chart that we use as a seed and come up with hundreds of augmentations of it,” Kondic explains. That multiplier is how a manageable set of starting figures became a dataset of more than a million.
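To make the multiplier concrete, here is a toy version of the idea. The article does not say how the rewriting is done, so this sketch swaps in plain template substitution over chart type, color, and values. The template, the function name, and the parameter choices are all assumptions. The emitted code assumes matplotlib, but the generator itself only builds strings.

```python
import itertools
import random

SEED_TEMPLATE = '''import matplotlib.pyplot as plt
labels = {labels!r}
values = {values!r}
fig, ax = plt.subplots()
ax.{method}(labels, values, color={color!r})
ax.set_title({title!r})
fig.savefig("chart.png")
'''


def make_variants(labels, values, title, n_value_sets=2, seed=0):
    """Yield plotting-code strings that vary chart type, color, and values."""
    rng = random.Random(seed)  # fixed seed keeps the output reproducible
    methods = ["bar", "plot", "scatter"]  # matplotlib Axes methods
    colors = ["tab:blue", "tab:orange"]
    # Keep the original values and add randomly rescaled copies.
    value_sets = [values] + [
        [round(v * rng.uniform(0.5, 1.5), 1) for v in values]
        for _ in range(n_value_sets - 1)
    ]
    for method, color, vals in itertools.product(methods, colors, value_sets):
        yield SEED_TEMPLATE.format(
            labels=labels, values=vals, method=method, color=color, title=title
        )


variants = list(make_variants(["Q1", "Q2", "Q3"], [40, 55, 70], "Revenue"))
print(len(variants))  # 3 methods x 2 colors x 2 value sets = 12
```

Even three small axes of variation turn one seed into a dozen scripts. Adding topics and more chart types is how the count climbs into the hundreds.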

The Quality Filter That Throws Work Away

Diversity alone is not useful if half the charts are broken. An automated check runs each generated sample, confirming the code actually executes and the rendered image comes out clean and accurate. Anything that fails gets discarded before it reaches the dataset. The team’s view was that volume only counts when the information is presented in a way a model can learn from.
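A gate of this kind can be sketched with the standard library alone. The version below covers only the checkable minimum: the code runs without error and writes a non-empty output file. The article does not describe how the team judges that an image is clean and accurate, so that part is left out. The function name, the output file name, and the timeout are assumptions.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_quality_gate(code, output_name="chart.png", timeout=30):
    """Return True if `code` runs cleanly and writes a non-empty output file."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "sample.py"
        script.write_text(code)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # relative output paths land in workdir
                capture_output=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False  # treat a hung script as a failed sample
        if result.returncode != 0:
            return False  # the code raised or exited with an error
        out = Path(workdir) / output_name
        return out.is_file() and out.stat().st_size > 0


# Stand-in samples that avoid a plotting dependency.
good = 'open("chart.png", "wb").write(b"fake image bytes")'
bad = 'raise RuntimeError("broken sample")'
kept = [c for c in (good, bad) if passes_quality_gate(c)]
print(len(kept))
```

Generated code is untrusted, so a real pipeline would run this step inside a sandbox or container rather than directly on the host.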

Where Human Annotators Come In

On top of the synthetic core, ChartNet folds in chart samples annotated by human experts, plus real-world charts contributed by outside collaborators. Those carry validity guarantees the synthetic data cannot, and a developer can use them to fine-tune a model for one specific use, squeezing out extra accuracy on a narrow task.

“We developed ChartNet to be a one-stop shop for chart understanding, covering basically anything that an AI model and a practitioner who is training that model might need. We hope our work motivates researchers to achieve state-of-the-art performance with smaller models that don’t require infinite amounts of computation.”

That is Kondic, summarizing the wager behind the project.

Who Picks Up the Cheaper Chart Reader

The clearest winners are small firms that could never justify a frontier-model bill. A 2-billion-parameter model runs on modest hardware and can be self-hosted, which keeps sensitive financial documents in-house. “The finance industry thrives on charts,” says Dhiraj Joshi, a senior scientist at IBM Research and a co-author. “If vision-language models can extract information out of charts, like descriptions of trends, that facilitates a lot of workflows that happen downstream.”

The compute angle is the quiet part of the story. Kondic’s pitch is explicitly about results that don’t need infinite computation, and that lands at a moment when the cost of running large models is colliding with hard limits. Data-center electricity demand is on track to nearly double by 2030, a squeeze we covered in how power, not chips, now caps Europe’s AI plans. A capable model that fits on a single GPU is not a downgrade in that world; it is the practical option.

It also chips at vendor lock-in. The same shift toward routing work across many smaller models, rather than paying one provider for everything, shows up in consumer tools too, as in a flat monthly layer over GPT-4o, Claude, and Gemini. For chart reading specifically, the open route now comes with receipts. The Granite Vision models trained here are published openly through IBM’s open Granite model family on Hugging Face, alongside the dataset itself.

Where the Commercial Models Still Win

This is not a clean rout, and pretending otherwise would oversell it. GPT-4o held the reconstruction crown, and rebuilding a chart’s exact code from an image is the task closest to true visual reasoning. On the broadest, messiest charts, the largest models still bring more general knowledge to bear. The benchmark wins are real, but they are scoped to the tasks ChartNet was built to teach.

Licensing is the other asterisk. The full ChartNet dataset hosted on Hugging Face ships a large core under the Community Data License Agreement Permissive 2.0, which clears commercial use, but parts of the original release carried a non-commercial restriction tied to how the data was generated. A team planning to ship a product needs to check which subset it is training on. Safety and grounding subsets are still marked as coming soon.

The team plans to keep expanding the dataset with harder charts and to lean on feedback from the research community. For now, the takeaway for any company staring at a stack of financial PDFs is concrete: the tool that reads them no longer has to be the most expensive one on the market.

Frequently Asked Questions

What is ChartNet?

ChartNet is a public dataset of more than a million chart images built by researchers at MIT, the MIT-IBM Computing Research Lab, and IBM Research to train AI models to read and reason about charts. Each chart comes with its plotting code, a data table, a written summary, and question-answer pairs.

Can I use ChartNet for free, including commercially?

Much of it, yes. A large core subset is released under the Community Data License Agreement Permissive 2.0, which allows commercial use. Some of the original synthetic data carries a non-commercial restriction, so check the specific subset on the Hugging Face page before building a product on it.

Which AI models were trained on ChartNet?

The researchers fine-tuned IBM’s Granite Vision models, including a 2-billion-parameter version, along with other open models such as Qwen2.5-VL and a LLaVA variant. Several of these smaller open models then outscored GPT-4o on chart tasks in the team’s evaluation.

Did the small models beat GPT-4o on everything?

No. The fine-tuned 2-billion-parameter model led on data extraction, summarization, and reasoning, but GPT-4o stayed ahead on chart reconstruction, scoring 95.9 percent against 90.4 percent on turning a chart image back into code.

How was ChartNet built without manual labeling?

It uses synthetic data. The pipeline converts existing charts into code, then rewrites that code to change types, values, topics, and colors, spinning one chart into hundreds of variants. An automated check discards any sample whose code fails to run or renders incorrectly, and human-annotated charts are layered on top.