Thursday, October 03, 2024

Antivirus software considered harmful

The IT profession needs to confront policies which demand antivirus be installed as a general measure

IT professionals still have the failure of CrowdStrike's Falcon endpoint protection product fresh in memory: it caused a US$10 billion global IT breakdown. The incident has been extensively covered in news outlets, on YouTube channels, and in blog posts – and rightly so.

However, no-one seems to question the idea of having such software installed on servers in the first place. That has long been a mystery to me.

Lack of independent justification for AV

After seven years, the following well-written ITPro article is still relevant: Does antivirus software do more harm than good? In the article, several sources describe how antivirus (AV) software itself has security holes, and how AV may interfere with other security measures. The problem with AV software is that it often runs in close proximity to the guts of the underlying system, so vulnerabilities in AV software (which are not uncommon) can have extraordinarily nasty security implications, as well as stability implications, as the CrowdStrike incident demonstrated.

In other sectors like research and medicine, it's considered natural to ask "Is there any independent evidence supporting the net gain from spending time and money on X?". Even though the IT world spends huge amounts of resources on AV, I've yet to see studies which provide such evidence. Sure, there are reports demonstrating that a given product detects malware XYZ (though not necessarily malware ZYX). But where's the study which documents how that detection makes a system more secure? Bear in mind that a file with malware might never get a chance to infect, because it relies on specific user interaction which is unlikely to happen, or on operating system vulnerabilities which were patched long ago. On the other hand, when an AV product occasionally introduces gaping security holes of its own, the overall gain may well be negative.

Down-to-earth evidence of AV's questionable role

The above may appear overly academic, so here are some down-to-earth observations:

  • Billions of Internet-connected devices run without AV software, and it generally works very well. Phone operating systems get software through well-understood, curated channels, and they have built-in robustness such as sandboxing and ASLR. So even though the smartphone ecosystem mainly consists of happy-go-lucky non-technical end-users, we have seen little (if anything) akin to an ILOVEYOU pandemic on phones, and we see very little phone ransomware.

  • I did not have to clean up after the CrowdStrike incident. But in my professional life, I've lately had to deal with the detrimental effects of another AV product: it had suddenly decided to "eat" a DLL file belonging to a perfectly valid piece of software which had been installed weeks before. As a consequence, a rather important service stopped working. Paradoxically, the impacted service was helping us conduct safe IT practices. It took time and money to get things back up and running, and it has happened several times. I'm sure many of you have also seen AV software act as a source of arbitrary and unpredictable outages.


AV at odds with pillars of secure IT practice

Let's step back and review some decades-old basic system administration principles, and how they are often broken by AV:

  • Install as little software as possible in order to limit attack surfaces – especially minimize software which runs with high privileges. AV breaks this principle: it is an extra, highly privileged add-on, and it's perfectly possible to run a computer without it (as millions of Macs and Linux servers, and probably also some Windows servers, do).

  • Keep your software up-to-date in order to close the holes which malware will try to exploit. I've seen several cases where AV would conflict with software updates from Microsoft, making this principle harder to follow.

  • Run software with the lowest possible privileges: All non-trivial software has bugs, so processes should be constrained as much as possible. I claim that many AV products run with excessively high privileges and/or in kernel space, breaking the principle. See this article about DLL search order hijacking for extra suspense.

  • Make software hard to infect, primarily by disallowing software to write to its own binaries, but also by using techniques like address space layout randomization (ASLR). The article from 2017 quotes a software developer who spent lots of time fighting tricks performed by AV which could conflict with the product's built-in robustness.

  • Prefer to have software delivered via well-known channels, such as yum repositories, app stores, etc. Many AV products introduce proprietary update services which see little scrutiny and may even decide to replace the product with an entirely different one, see Kaspersky's recent surprise act. On top of that, CrowdStrike's update system is hardly the only one which makes it impossible to try out updates in non-production before applying them to production.

  • Stay away from software which shows signs of sloppy software engineering. Kudos to AV vendors for publishing CVE reports. But less so for what seems to be sloppiness: CrowdStrike has huge financial resources, yet it failed to conduct basic testing steps, and not for the first time in recent history. Other AV products also have questionable track records, but I've yet to hear such aspects being part of IT departments' decision making.

  • Absolutely remove abandoned or end-of-life software. Yet AV software is often rather hard to remove, causing friction when trying to uphold this principle. I've seen many servers running AV software which was no longer being updated after a vendor change long ago (in one case because removal required down-time where the server had to be booted into safe mode before the abandoned product could be removed).

  • Spend time wisely, because there is never time for it all. Time spent administering/maintaining AV software and battling AV-derived problems may result in less time spent on security enhancing activities like log analysis, systems/code cleanup, restore tests, AD hardening, etc.
    I've been in a meeting where a server break-in was discussed. The server had known vulnerabilities which would take time to address. Someone proposed installing AV on the server, to buy time. Had that route been chosen, I'm rather sure it would have stolen much-needed attention from getting the root cause handled, because it would have added a false sense of security.

On top of that, AV software often incurs significant system overhead, adding latency and hurting end-user productivity. The overhead also requires extra CPU and RAM, which goes against green IT ambitions.

Culture and profit

Given all the technical arguments against AV, we may instead be dealing with a cultural issue: What is considered a given in one community is foreign to another. You will probably agree that in some parts of the IT landscape, it would be surprising to run across AV. For example:

  • Internet routers don't run AV, even though they are directly exposed to all sorts of traffic.
  • I claim most VMware administrators would strongly object, if someone requested AV be run directly on hypervisors.
  • Printers don't run AV (and are often not patched, but that's a separate, sad story).
  • Millions of compute nodes in high-performance compute environments deliver gazillions of compute hours, typically without having AV interfere.

The list goes on. And fortunately, many Linux and Mac hosts are still allowed to run without AV, although many overly-general IT policies are starting to force AV into those environments.

AV got started in the old days when Microsoft had not yet introduced proper kernel/userland and administrator/user separation into Windows, and when organizations had no good way to keep users from installing software from arbitrary sources. Fast-forward to 2024, where Windows systems can be very robust, yet AV thinking prevails. Why? Inertia is not foreign to the IT business, of course: "Nobody Gets Fired For Buying IBM" – or, these days, for buying AV. Let me propose an additional reason: Some software vendors and certain security consultants have found AV to be a very profitable business, and they would naturally hate to see profits go down.

Time to start questioning

AV uselessness may appear to be a controversial proposition, and many will disagree with me. On the other hand, I also know of many system administrators and programmers who agree.

IT professionals, can we at least agree that it's our responsibility to start questioning AV as a general requirement? The compliance crowd will not have the guts to do it. So-called IT security consultants often profit from antivirus products, so they don't want to rock the boat.

It's on IT professionals' plate to address this.



PS 1

There are situations where AV makes good sense. I'd argue that AV can be worth having

  • on mail gateways, as part of general anti-junkmail efforts
  • on file shares which are shared by many users, depending on the environment
  • on hosts which – for some reason – can neither be regularly patched, nor brought onto a segregated network (but don't expect it to be very effective)


PS 2

While I generally object to the notion of buying security as an add-on product, I acknowledge that some products can provide real security. A vulnerability scanner like Nessus, for example, is a good addition to an IT organization's arsenal, especially because it can run in an unintrusive manner which neither installs anything on a server, nor tries to intercept all network traffic.

Wednesday, September 04, 2024

Find recently updated files

The following task comes up once in a while, but I keep forgetting how to do it. So I decided to finally write it down.

Find the ten most recently modified entries (files and directories) in the current directory and all subdirectories:

find . -print0 | xargs -0 stat --format '%Y :%y %n' | sort -nr | cut -d: -f2- | head

Same task, but considering only regular files (not directories):

find . -type f -print0 | xargs -0 stat --format '%Y :%y %n' | sort -nr | cut -d: -f2- | head

Note: In case no files/directories are found, the following error message may be returned: "stat: missing operand". 

Note also that this may not work on non-Linux operating systems: For example, I've seen it fail on an old FreeBSD server where the stat command behaves very differently from what I'm used to.
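
If GNU findutils is available, the same can be achieved without stat and xargs (which also avoids the "missing operand" message), at the cost of being even less portable. A sketch:

# %T@ is the modification time in seconds since the epoch, %Tc a human-readable timestamp, %p the path
find . -type f -printf '%T@ %Tc %p\n' | sort -nr | cut -d' ' -f2- | head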


Saturday, October 07, 2023

(Un)safe English locales

I recently spent a frustrating amount of time troubleshooting a Java application which had stopped working properly after an upgrade: Certain features of the product were broken due to the application running on a Linux server which had been configured with the "en_DK" locale.

"en_DK" had been chosen expecting to have system with messages in English, but with currency symbols etc. suitable for Denmark. This makes sense, because it's not uncommon for IT systems to provide more precise (error) messages in English, compared to a small language like Danish.

Unfortunately, there is no universal agreement about locales. Even within Linux distributions, there is not 100% consistency.

The GNU C Library (glibc) seems to have the longest list of recognized locales, including 19 English ones. Java's list of supported locales is somewhat shorter and includes only 11 English locales. Interestingly, Java has an English locale for Malta (en_MT) which glibc does not have.

I haven't been able to find macOS's list of locales, but forum posts suggest it has only 6 English locales: en_AU, en_CA, en_GB, en_IE, en_NZ, en_US.

Ignoring Mac for a moment, these are the unsafe English locales, i.e. locales which are not supported by both glibc and Java:

Locale   Country
en_AG    Antigua and Barbuda
en_BW    Botswana
en_DK    Denmark
en_HK    Hong Kong
en_IL    Israel
en_MT    Malta
en_NG    Nigeria
en_SC    Seychelles
en_ZM    Zambia
en_ZW    Zimbabwe

On the other hand, the following locales should be safe:

Locale   Country        Safe even on Mac
en_AU    Australia      🍎
en_CA    Canada         🍎
en_GB    Great Britain  🍎
en_IE    Ireland        🍎
en_IN    India
en_NZ    New Zealand    🍎
en_PH    Philippines
en_SG    Singapore
en_US    USA            🍎
en_ZA    South Africa

Looking beyond unix/POSIX-like systems: Windows recognizes more than 100 English locale identifiers. Windows' list does not invalidate the safe-list above.

For Danes, I suggest using one of the following locales (a quick way to check what a given system offers is shown after the list):

  • da_DK (and sometimes accept poor error messages)
  • en_IE, as the Irish are sane enough to use a 24-hour date format, etc
  • C, which is the fall-back "POSIX system default" locale
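
On a Linux host, the available English locales can be inspected, and en_IE tried out for the current shell, roughly like this; note that the exact spelling of the names (en_IE.UTF-8 versus en_IE.utf8) varies between systems:

# List the English locales known to glibc on this system
locale -a | grep -i '^en_'

# Try en_IE for the current shell session and check the effect
export LANG=en_IE.UTF-8
locale
date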

Tuesday, July 11, 2023

Unbootable RHEL 9 when using BP-028

When installing Red Hat Enterprise Linux 9, you may choose to apply a security profile, such as ANSSI-BP-028 High.

I've recently seen two VMware-virtualized RHEL 9.2 servers fail to boot properly when installed with the ANSSI-BP-028 High profile; instead, they booted into emergency mode.

The way to fix it:

Add file /etc/modules-load.d/for_uefi.conf containing a single line:

vfat 

Then run the following two commands:

dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
reboot
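
For reference, here is the whole fix as one sequence of commands run as root (the file name for_uefi.conf is arbitrary):

# Make sure the vfat kernel module gets loaded at boot
echo vfat > /etc/modules-load.d/for_uefi.conf

# Rebuild the initramfs for the running kernel, then reboot
dracut -f /boot/initramfs-$(uname -r).img $(uname -r)
reboot

# After the reboot, verify that the module is loaded
lsmod | grep vfat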

Friday, March 21, 2014

parallel_ntp_scan

NTP-based DDoS attacks are currently fashionable.

I've coded a little application which quickly scans a network for NTP servers. For those found, it rates them according to their susceptibility to being implicated in an amplification DDoS attack.
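
The tool aside, a single server can be checked by hand with the classic monlist query, which has been the main NTP amplification vector: a small request yielding a long list of recent clients is a bad sign. Note that ntpdc is deprecated and may be missing on newer systems, and the IP address below is just an example:

# Ask the NTP server for its list of recent clients ("monlist")
ntpdc -n -c monlist 192.0.2.123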

Saturday, October 01, 2011

What a dying SFP looks like

Fibre channel (FC) storage is handy, and generally very reliable, in my experience. I certainly do not miss the days of messing around with disks in a server-room. And I like the fact that RAIDs may be cut up into slices (LUNs) which may be shared by many servers, resulting in very efficient use of the disks (if so wanted).

One part about FC that I dislike (in addition to the price tags): SFPs. Why on earth are transceivers not an integral part of a Fibre Channel switch? Having the transceivers be separate units means more electrical contact points, and a potential support mess (it's not hard to imagine a situation where the support contract of an SFP has run out, while the switch itself is still covered).

Anyway: Today, I experienced a defunct SFP for the first time. The following observations may give a hint of how to discover that an SFP is starting to malfunction. The setup is an IBM DS4800 storage system where port 2 on controller B is connected to port 0 on an IBM TotalStorage SAN32B FC switch (which is an IBM-branded Brocade 5100 switch).

Friday morning at 07:49, in syslog: A few messages like this from the FC switch:
raslogd: 2011/09/30-07:49:07, [SNMP-1008], 2113, WWN 10:... | FID 128, INFO, IBM_2005_B32_B,  The last device change happened at : Fri Sep 30 07:49:01 2011

At the same time the storage system started complaining about "Drive not on preferred path due to ADT/RDAC failover", meaning that at least one server had started using a non-optimal path, most likely due to I/O timeouts on the preferred path. And a first spike in the bad_os count occurred for the FC switch port.


bad_os is a counter which exists in Brocade switches, and possibly others. Brocade describes it as the number of invalid ordered sets received.

At 10:55, in syslog:
raslogd: 2011/09/30-10:55:02, [FW-1424], 2118, WWN 10:... | FID 128, WARNING, IBM_2005_B32_B, Switch status changed from HEALTHY to MARGINAL
At the same time, there was a slightly larger spike in the bad_os graph.
Coinciding: The storage system sent a mail warning about "Data rate negotiation failed" for the port.

At 17:00: The count for bit-traffic flat-lined (graph not shown). I.e.: All traffic had ceased.

At no point did the graphs for C3 discards, encoding errors or CRC errors show any spikes.

The next morning, the optical cable involved was replaced; that didn't help. Inserting another SFP did help, leading to the conclusion that the old SFP had started to malfunction.

Moral: Make sure to not just keep spare cables around. A spare SFP should also be kept in stock.
And monitor your systems: A centralized and regularly inspected syslog is invaluable. Generating graphs for key counters is also mandatory for mature systems operation; one way to collect and display key counters for Brocade switches is to use Munin and a Munin plugin which I wrote.
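
If graphing isn't in place, a rough manual check of the counters can be done over ssh. The command below is for Brocade Fabric OS; exact counter names may vary between FOS versions, so treat it as a sketch (the switch name and port number are from the setup above):

# Show error counters for switch port 0 and pick out invalid ordered sets, CRC and encoding errors
ssh admin@IBM_2005_B32_B "portstatsshow 0" | grep -i -E 'er_bad_os|er_crc|er_enc'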

PS: Brocade documentation states that SFP problems might result in the combination of rises in CRC errors and encoding/disparity errors. This did not happen in this situation.

Wednesday, June 23, 2010

Separation or co-location of database and application

In a classical three-tier architecture (database, application server, client), a choice always has to be made: Should the database and the application server reside on separate servers, or be co-located on a shared server?

Often, I see recommendations for separation, with vague claims about performance improvements. But I assert that the main argument for separation is bureaucratic (license-related), and that separation may well hurt performance. Sure, if you separate the application and the database on separate servers, you gain an easy scale-out effect, but you also end up with a less efficient communication path.

If a database and an application are running on the same operating system instance, you may use very efficient communication channels. With DB2, for example, you may use shared-memory based inter-process communication (IPC) when the application and the database are co-located. If they are on separate servers, TCP must be used. TCP is a nice protocol, offering reliable delivery, congestion control, etc., but it also entails several protocol layers, each contributing overhead.

But let's try to quantify the difference, focusing as closely on the query round-trips as possible.

I wrote a little Java program, LatencyTester, which I used to measure differences between a number of database<>application setups. Java was chosen because it's a compiled, statically typed language; that way, the test-program should have as little internal execution overhead as possible. This can be important: I have sometimes written database benchmark programs in Python (which is a much nicer programming experience), but as Python can be rather inefficient, I ended up benchmarking Python, instead of the database.

The program connects to a DB2 database in one of two ways:
  • If you specify a servername and a username, it will communicate using "DRDA" over TCP
  • If you leave out servername and username, it will use a local, shared-memory based channel.
After connecting to the database, the program issues 10000 queries which shouldn't result in I/Os, because no data in the database system is referenced. The timer starts after connection setup, just before the first query; it stops immediately after the last query.

The application issues statements like VALUES(...) where ... is a value set by the application. Note the lack of a SELECT and a FROM in the statement. When invoking the program, you must choose between short or long statements. If you choose short, statements like this will be issued:
VALUES(? + 1)
where ? is a randomly chosen host variable.
If you choose long,  statements like
VALUES('lksjflkvjw...pvoiwepwvk')
will be issued, where lksjflkvjw...pvoiwepwvk is a 2000-character pseudo-randomly composed string.
In other words: The program may run in a mode where very short or very long units are sent back and forth between the application and the database. The short queries are effectively measuring latency, while the long queries may be viewed as measuring throughput.
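
As an aside, the same local-versus-TCP distinction can be seen outside JDBC, for example from DB2's command line processor. A sketch (host name, port, and database/node names below are made up):

# Local connection: uses shared-memory IPC when the client runs on the database server itself
db2 connect to MYDB

# Remote (TCP) connection: the node and the database must be cataloged first
db2 catalog tcpip node DBNODE remote dbhost.example.com server 50000
db2 catalog database MYDB as MYDBTCP at node DBNODE
db2 connect to MYDBTCP user dbuser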

I used the program to benchmark four different setups:
  • Application and database on the same server, using a local connection
  • Application and database on the same server, using a TCP connection
  • Application on a virtual server hosted on the same physical server as the database, using TCP
  • Application and database on different physical servers, but on the same local area network (LAN), using TCP
The results are displayed in the following graph:

Results from LatencyTester

Clearly, the highest number of queries per second is seen when the database and the application are co-located. This is especially true for the short queries: Here, more than four times as many queries can be executed per second when co-locating on the same server, compared to separating over a LAN. When using TCP on the same server, short-query round-trips run at around half the speed of the local connection.

The results need to be put in perspective, though: While there are clear differences, the absolute numbers may not matter much in the Real World. The average query time for short local queries was 0.1 ms, compared to 0.8 ms for short queries over the LAN. Let's assume that we are dealing with a web application where each page view results in ten short queries. In this case, the round-trip overhead with a local connection is 10 x 0.1 ms = 1 ms, whereas the round-trip overhead for the LAN-connected setup is 10 x 0.8 ms = 8 ms. Other factors (like query disk I/O and browser-to-server round-trips) will most likely dominate, and the user will hardly notice a difference.

Even though queries over a LAN will normally not be noticeably slower, LANs may sometimes exhibit congestion. And all else being equal, the more servers and the more equipment being involved, the more things can go wrong.

Having established that application/database server-separation will not improve query performance (neither latency, nor throughput), what other factors are involved in the decision between co-location and separation?
  • Software licensing terms may make it cheaper to put the database on its own, dedicated hardware: The fewer CPUs beneath the database system, the lower the licensing costs. The same goes for the application server: If it is priced per CPU, it may be very expensive to pay for CPUs which are primarily used for other parts of the solution.
  • Organizational aspects may dictate that the DBA and the application server administrator each have their "own boxes". Or conversely, the organization may put an effort into operating as few servers as possible to keep administration work down.
  • The optimal operating system for the database may not be the optimal operating system for the application server.
  • The database server may need to run on hardware with special, high-performance storage attachments, while the application server (which probably doesn't perform much disk I/O) may be better off running in a virtual server, taking advantage of the administrative flexibility of virtualization.
  • Buying one very powerful piece of server hardware is sometimes more expensive than buying two servers which add up to the same horsepower. But it may also be the other way around, especially if cooling, electricity, service agreements, and rack space are taken into account.
  • Handling authentication and group memberships may be easier when there is only one server. E.g., DB2 and PostgreSQL allow the operating system to handle authentication if an application connects locally, meaning that no authentication configuration needs to be set up in the application; a minimal PostgreSQL example is shown after this list. (Don't you just hate it when passwords sneak into the version control system?)
  • A mis-behaving (e.g. memory-leaking) application may disturb the database if the two are running on the same system. Operating systems generally provide mechanisms for resource-constraining processes, but that can often be tricky to set up.
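
As an example of the authentication point above: In PostgreSQL, a single "peer" line in pg_hba.conf lets a local OS user connect as the matching database user without any password (database and user names below are made up):

# TYPE   DATABASE   USER      METHOD
local    appdb      appuser   peer
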
Summarized:

Pro separation:
  • A misbehaving application will not be able to disturb the database as much.
  • May provide for more tailored installations (the database gets its perfect environment, and so does the application).
  • If the application server is split into two servers in combination with a load balancing system, each application server may be patched individually without affecting the database server, and with little or no visible down-time for the users.
  • May save large amounts of money if the database and/or the application is CPU-licensed.
  • Potentially cheaper to buy two modest servers than to buy one top-of-the-market server.
  • Allows for advanced scale-out/application-clustering scenarios.

Con separation:
  • Slightly higher latency.
  • Less predictable connections on shared networks.
  • More hardware/software/systems to maintain, monitor, back up, and document.
  • Potentially more base software licensing fees (operating systems, supporting software).
  • Potentially more expensive to buy two servers if cooling and electricity are taken into account.
  • Prevents easy packaging of a complete database+application product, such as a turn-key solution in a single VMware or Amazon EC2 image.
Am I overlooking something?
Is the situation markedly different with other DBMSes?