Huge performance difference of the command find with and without using %M option to show permissions

On my CentOS 7.6 system, I created a directory (called many_files) containing 3,000,000 files by running:

for i in {1..3000000}; do echo $i>$i; done;

I am using find to write information about the files in this directory to a file. This works surprisingly fast:

$ time find many_files -printf '%i %y %p\n'>info_file

real 0m6.970s
user 0m3.812s
sys 0m0.904s

Now if I add %M to get the permissions:

$ time find many_files -printf '%i %y %M %p\n'>info_file

real 2m30.677s
user 0m5.148s
sys 0m37.338s

The command takes much longer. This surprises me, since a C program can use struct stat to get both the inode and the permission information of a file, and in the kernel struct inode stores both.



My Questions:



  1. What causes this behavior?

  2. Is there a faster way to get file permissions for so many files?









linux files permissions find performance






edited 1 hour ago by Jeff Schaller
asked 3 hours ago by Bahram












The second question is the wrong question to ask. The real question is what you are doing with the output. If you are piping it somewhere for later processing of files based on the permissions, then you are probably doing it in a roundabout way. Instead you may want to use -perm with find to pick out the files with the permissions you're looking for.

– Kusalananda
3 hours ago
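
The -perm approach mentioned in this comment could look roughly like the following. This is a minimal sketch, assuming GNU find; the temporary directory and the file names a, b and c are hypothetical and exist only for illustration.

```shell
# Hypothetical sample directory; names and modes are for illustration only.
dir=$(mktemp -d)
touch "$dir/a" "$dir/b" "$dir/c"
chmod 644 "$dir/a"
chmod 600 "$dir/b"
chmod 664 "$dir/c"

# Regular files whose mode is exactly 0644.
exact=$(find "$dir" -type f -perm 644 -printf '%f\n' | sort)

# Regular files with the group-write bit set (-perm -MODE: all listed bits set).
group_writable=$(find "$dir" -type f -perm -020 -printf '%f\n' | sort)

echo "exactly 644: $exact"
echo "group-writable: $group_writable"
rm -rf "$dir"
```

Filtering inside find this way avoids writing out and re-parsing the permission strings, although find still has to read each file's metadata to evaluate -perm.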






2 Answers
The first version only needs to readdir(3)/getdents(2) the directory, provided the filesystem stores the file type in its directory entries (ext4: the filetype feature, displayed by tune2fs -l /dev/xxx; xfs: ftype=1, displayed by xfs_info /mount/point ...).

The second version additionally has to stat(2) each file. That means an extra inode lookup per file, and thus more seeks on the filesystem and device, which can be considerably slower on a rotating disk when the cache is cold. The stat is not needed when only the name, inode number and file type are requested, because the directory entry already contains them:




 The linux_dirent structure is declared as follows:

struct linux_dirent {
    unsigned long  d_ino;     /* Inode number */
    unsigned long  d_off;     /* Offset to next linux_dirent */
    unsigned short d_reclen;  /* Length of this linux_dirent */
    char           d_name[];  /* Filename (null-terminated) */
                      /* length is actually (d_reclen - 2 -
                         offsetof(struct linux_dirent, d_name)) */
    /*
    char           pad;       // Zero padding byte
    char           d_type;    // File type (only since Linux
                              // 2.6.4); offset is (d_reclen - 1)
    */
};




The same information is available through readdir(3):




struct dirent {
    ino_t          d_ino;       /* Inode number */
    off_t          d_off;       /* Not an offset; see below */
    unsigned short d_reclen;    /* Length of this record */
    unsigned char  d_type;      /* Type of file; not supported
                                   by all filesystem types */
    char           d_name[256]; /* Null-terminated filename */
};



I suspected this and confirmed it by comparing (on a smaller sample) the two outputs of:

strace -o v1 find many_files -printf '%i %y %p\n'>info_file
strace -o v2 find many_files -printf '%i %y %M %p\n'>info_file

On my Linux amd64 system with kernel 5.0.x, the main difference is:



[...]

 getdents(4, /* 0 entries */, 32768) = 0
 close(4) = 0
 fcntl(5, F_DUPFD_CLOEXEC, 0) = 4
-write(1, "25499894 d many_files\n25502410 f"..., 4096) = 4096
-write(1, "iles/844\n25502253 f many_files/8"..., 4096) = 4096
-write(1, "096 f many_files/686\n25502095 f "..., 4096) = 4096
-write(1, "es/529\n25501938 f many_files/528"..., 4096) = 4096
-write(1, "1 f many_files/371\n25501780 f ma"..., 4096) = 4096
-write(1, "/214\n25497527 f many_files/213\n2"..., 4096) = 4096
-brk(0x55b29a933000) = 0x55b29a933000
+newfstatat(5, "1000", 0644, st_size=5, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "999", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "998", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "997", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "996", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "995", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "994", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "993", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "992", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "991", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "990", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0

[...]

+newfstatat(5, "891", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+write(1, "25499894 d drwxr-xr-x many_files"..., 4096) = 4096
+newfstatat(5, "890", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0

[...]
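
Instead of reading the diff by eye, the stat-family calls in each trace can be counted. The following is only a sketch under assumptions: it uses a hypothetical 200-file sample directory, assumes GNU find, and quietly reports zero for both variants if strace is missing or tracing is not permitted. The grep pattern matches any syscall whose name contains "stat", which is a deliberately loose approximation.

```shell
# Hypothetical small sample directory (names and sizes are illustrative).
dir=$(mktemp -d)
mkdir "$dir/many_files"
for i in $(seq 1 200); do echo "$i" > "$dir/many_files/$i"; done

stat_v1=0
stat_v2=0
if command -v strace >/dev/null 2>&1; then
    # Trace both variants; ignore failures (for example, ptrace not permitted).
    strace -o "$dir/v1" find "$dir/many_files" -printf '%i %y %p\n' >/dev/null 2>&1 || true
    strace -o "$dir/v2" find "$dir/many_files" -printf '%i %y %M %p\n' >/dev/null 2>&1 || true
    # Count lines whose syscall name contains "stat" (newfstatat, statx, fstat, ...).
    [ -f "$dir/v1" ] && stat_v1=$(grep -cE '^[a-z_0-9]*stat[a-z_0-9]*\(' "$dir/v1")
    [ -f "$dir/v2" ] && stat_v2=$(grep -cE '^[a-z_0-9]*stat[a-z_0-9]*\(' "$dir/v2")
fi
echo "stat-family calls without %M: $stat_v1, with %M: $stat_v2"
rm -rf "$dir"
```

On a filesystem that records the file type in directory entries, the second count would be expected to exceed the first by roughly one call per file; where d_type is not available, the two counts may be about the same.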






answered 2 hours ago by A.B (edited 1 hour ago)

  • Unfortunately, the d_type field of a dir entry is a non-standard feature, only present on Linux and BSD, as mentioned in the readdir(3) manpage. (Though on Linux it is implemented on most filesystems that matter).

    – mosvy
    2 hours ago












  • @mosvy That's ok, the question is tagged CentOS. But yes I understand that on other *nix, results may differ

    – A.B
    2 hours ago












  • Hum actually xfs (CentOS' default) support isn't quite clear...

    – A.B
    1 hour ago











  • added how to check if the filetype feature is present on xfs, in case xfs is in use.

    – A.B
    1 hour ago











  • I think it's supported on xfs -- when I was making a testcase for a glibc glob(3) that only triggered when the d_type field was absent, I had to use either minixfs or use the GLOB_ALTDIRFUNC.

    – mosvy
    1 hour ago


















For your 1st question:



I think your problem is not with how quickly the information is accessed, but the output bottleneck.



You are writing the output to info_file.



When you add %M to the find command, you output more text for the permissions: 10 additional characters per line of output, or 30,000,000 more characters in total.



This is more data that has to go through the STDOUT redirect to info_file and get written to disk. More data to push == longer time to write and complete.



In a situation with a single file or a small number of files, it would not be noticeable to a human; time may give you some variation to measure but it might be too slight to notice.



In your question you are working with 3,000,000 files, so obviously it takes longer to write out the permissions output.



2nd question



I have no idea. Do you have a practical use case for needing to collect permissions for 3,000,000 files, or is this an academic exercise?






  • info_file has size 94M after the first command and 125M after the second one. An extra 31M shouldn't cause the command to run 20 times slower!

    – Bahram
    3 hours ago
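
The size difference reported in this comment can be sanity-checked with a little arithmetic. This is a rough sketch; the line count is an assumption (3,000,000 files plus one line for the directory itself), and the 11 bytes per line are the 10 characters of %M plus the separating space in the format string.

```shell
# Assumed line count: 3,000,000 files plus the many_files directory entry.
lines=3000001
# %M prints 10 characters; the format string adds one separating space.
extra_per_line=11
extra_bytes=$(( lines * extra_per_line ))
extra_mib=$(( extra_bytes / 1024 / 1024 ))
echo "extra bytes: $extra_bytes (about $extra_mib MiB)"
# prints "extra bytes: 33000011 (about 31 MiB)"
```

That is in line with the growth from 94M to 125M quoted above.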






  • ... it would be easy to test whether this is the case, by replacing %M with a fixed string like -rw-rw-r--

    – steeldriver
    3 hours ago
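
The test suggested in this comment might be set up as follows. It is a minimal sketch, assuming GNU find and a hypothetical 100-file sample directory; on the real directory each find command would be prefixed with time to compare the runtimes.

```shell
# Hypothetical small sample; file count and names are illustrative assumptions.
dir=$(mktemp -d)
mkdir "$dir/many_files"
for i in $(seq 1 100); do echo "$i" > "$dir/many_files/$i"; done

# Real permissions via %M.
find "$dir/many_files" -printf '%i %y %M %p\n' > "$dir/with_M"

# Fixed 10-character placeholder instead of %M, so the output volume is the same.
find "$dir/many_files" -printf '%i %y -rw-rw-r-- %p\n' > "$dir/with_fixed"

size_M=$(wc -c < "$dir/with_M")
size_fixed=$(wc -c < "$dir/with_fixed")
echo "bytes with %M: $size_M, bytes with fixed string: $size_fixed"
rm -rf "$dir"
```

Because both variants write the same number of bytes, any remaining difference in runtime between them cannot be attributed to the amount of output.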











  • I don't think it's a matter of raw "disk space", I think it has to do with processing each line of output with an extra 10 characters. Depending on exactly what is coming out, that could mean an increase of 30% or more, per line.

    – 0xSheepdog
    3 hours ago











Your Answer








StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "106"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);













draft saved

draft discarded


















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2funix.stackexchange.com%2fquestions%2f512167%2fhuge-performance-difference-of-the-command-find-with-and-without-using-m-option%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









4














The first version requires only to readdir(3)/getdents(2) the directory, when run on a filesystem supporting this feature (ext4: filetype feature displayed with tune2fs -l /dev/xxx, xfs: ftype=1 displayed with xfs_info /mount/point ...).



The second version in addition also requires to stat(2) each file, requiring an additional inode lookup, and thus more seeks on the filesystem and device, possibly quite slower if it's a rotating disk and cache wasn't kept. This stat is not required when looking only for name, inode and filetype because the directory entry is enough:




 The linux_dirent structure is declared as follows:

struct linux_dirent
unsigned long d_ino; /* Inode number */
unsigned long d_off; /* Offset to next linux_dirent */
unsigned short d_reclen; /* Length of this linux_dirent */
char d_name[]; /* Filename (null-terminated) */
/* length is actually (d_reclen - 2 -
offsetof(struct linux_dirent, d_name)) */
/*
char pad; // Zero padding byte
char d_type; // File type (only since Linux
// 2.6.4); offset is (d_reclen - 1)
*/




the same informations are available to readdir(3):




struct dirent 
ino_t d_ino; /* Inode number */
off_t d_off; /* Not an offset; see below */
unsigned short d_reclen; /* Length of this record */
unsigned char d_type; /* Type of file; not supported
by all filesystem types */
char d_name[256]; /* Null-terminated filename */
;



Suspected but confirmed by comparing (on a smaller sample...) the two outputs of:



strace -o v1 find many_files -printf '%i %y %pn'>info_file
strace -o v2 find many_files -printf '%i %y %M %pn'>info_file


Which on my Linux amd64 kernel 5.0.x just shows as main difference:



[...]



 getdents(4, /* 0 entries */, 32768) = 0
close(4) = 0
fcntl(5, F_DUPFD_CLOEXEC, 0) = 4
-write(1, "25499894 d many_filesn25502410 f"..., 4096) = 4096
-write(1, "iles/844n25502253 f many_files/8"..., 4096) = 4096
-write(1, "096 f many_files/686n25502095 f "..., 4096) = 4096
-write(1, "es/529n25501938 f many_files/528"..., 4096) = 4096
-write(1, "1 f many_files/371n25501780 f ma"..., 4096) = 4096
-write(1, "/214n25497527 f many_files/213n2"..., 4096) = 4096
-brk(0x55b29a933000) = 0x55b29a933000
+newfstatat(5, "1000", 0644, st_size=5, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "999", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "998", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "997", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "996", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "995", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "994", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "993", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "992", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "991", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "990", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0


[...]



+newfstatat(5, "891", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+write(1, "25499894 d drwxr-xr-x many_files"..., 4096) = 4096
+newfstatat(5, "890", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0


[...]






share|improve this answer

























  • Unfortunately, the d_type field of a dir entry is a non-standard feature, only present on Linux and BSD, as mentioned in the readdir(3) manpage. (Though on Linux it is implemented on most filesystems that matter).

    – mosvy
    2 hours ago












  • @mosvy That's ok, the question is tagged CentOS. But yes I understand that on other *nix, results may differ

    – A.B
    2 hours ago












  • Hum actually xfs (CentOS' default) support isn't quite clear...

    – A.B
    1 hour ago











  • added how to check if the filetype feature is present on xfs, in case xfs is in use.

    – A.B
    1 hour ago











  • I think it's supported on xfs -- when I was making a testcase for a glibc glob(3) that only triggered when the d_type field was absent, I had to use either minixfs or use the GLOB_ALTDIRFUNC.

    – mosvy
    1 hour ago















4














The first version requires only to readdir(3)/getdents(2) the directory, when run on a filesystem supporting this feature (ext4: filetype feature displayed with tune2fs -l /dev/xxx, xfs: ftype=1 displayed with xfs_info /mount/point ...).



The second version in addition also requires to stat(2) each file, requiring an additional inode lookup, and thus more seeks on the filesystem and device, possibly quite slower if it's a rotating disk and cache wasn't kept. This stat is not required when looking only for name, inode and filetype because the directory entry is enough:




 The linux_dirent structure is declared as follows:

struct linux_dirent
unsigned long d_ino; /* Inode number */
unsigned long d_off; /* Offset to next linux_dirent */
unsigned short d_reclen; /* Length of this linux_dirent */
char d_name[]; /* Filename (null-terminated) */
/* length is actually (d_reclen - 2 -
offsetof(struct linux_dirent, d_name)) */
/*
char pad; // Zero padding byte
char d_type; // File type (only since Linux
// 2.6.4); offset is (d_reclen - 1)
*/




the same informations are available to readdir(3):




struct dirent 
ino_t d_ino; /* Inode number */
off_t d_off; /* Not an offset; see below */
unsigned short d_reclen; /* Length of this record */
unsigned char d_type; /* Type of file; not supported
by all filesystem types */
char d_name[256]; /* Null-terminated filename */
;



Suspected but confirmed by comparing (on a smaller sample...) the two outputs of:



strace -o v1 find many_files -printf '%i %y %pn'>info_file
strace -o v2 find many_files -printf '%i %y %M %pn'>info_file


Which on my Linux amd64 kernel 5.0.x just shows as main difference:



[...]



 getdents(4, /* 0 entries */, 32768) = 0
close(4) = 0
fcntl(5, F_DUPFD_CLOEXEC, 0) = 4
-write(1, "25499894 d many_filesn25502410 f"..., 4096) = 4096
-write(1, "iles/844n25502253 f many_files/8"..., 4096) = 4096
-write(1, "096 f many_files/686n25502095 f "..., 4096) = 4096
-write(1, "es/529n25501938 f many_files/528"..., 4096) = 4096
-write(1, "1 f many_files/371n25501780 f ma"..., 4096) = 4096
-write(1, "/214n25497527 f many_files/213n2"..., 4096) = 4096
-brk(0x55b29a933000) = 0x55b29a933000
+newfstatat(5, "1000", 0644, st_size=5, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "999", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "998", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "997", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "996", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "995", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "994", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "993", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "992", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "991", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "990", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0


[...]



+newfstatat(5, "891", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+write(1, "25499894 d drwxr-xr-x many_files"..., 4096) = 4096
+newfstatat(5, "890", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0


[...]






share|improve this answer

























  • Unfortunately, the d_type field of a dir entry is a non-standard feature, only present on Linux and BSD, as mentioned in the readdir(3) manpage. (Though on Linux it is implemented on most filesystems that matter).

    – mosvy
    2 hours ago












  • @mosvy That's ok, the question is tagged CentOS. But yes I understand that on other *nix, results may differ

    – A.B
    2 hours ago












  • Hum actually xfs (CentOS' default) support isn't quite clear...

    – A.B
    1 hour ago











  • added how to check if the filetype feature is present on xfs, in case xfs is in use.

    – A.B
    1 hour ago











  • I think it's supported on xfs -- when I was making a testcase for a glibc glob(3) that only triggered when the d_type field was absent, I had to use either minixfs or use the GLOB_ALTDIRFUNC.

    – mosvy
    1 hour ago













4












4








4







The first version requires only to readdir(3)/getdents(2) the directory, when run on a filesystem supporting this feature (ext4: filetype feature displayed with tune2fs -l /dev/xxx, xfs: ftype=1 displayed with xfs_info /mount/point ...).



The second version in addition also requires to stat(2) each file, requiring an additional inode lookup, and thus more seeks on the filesystem and device, possibly quite slower if it's a rotating disk and cache wasn't kept. This stat is not required when looking only for name, inode and filetype because the directory entry is enough:




 The linux_dirent structure is declared as follows:

struct linux_dirent
unsigned long d_ino; /* Inode number */
unsigned long d_off; /* Offset to next linux_dirent */
unsigned short d_reclen; /* Length of this linux_dirent */
char d_name[]; /* Filename (null-terminated) */
/* length is actually (d_reclen - 2 -
offsetof(struct linux_dirent, d_name)) */
/*
char pad; // Zero padding byte
char d_type; // File type (only since Linux
// 2.6.4); offset is (d_reclen - 1)
*/




the same informations are available to readdir(3):




struct dirent 
ino_t d_ino; /* Inode number */
off_t d_off; /* Not an offset; see below */
unsigned short d_reclen; /* Length of this record */
unsigned char d_type; /* Type of file; not supported
by all filesystem types */
char d_name[256]; /* Null-terminated filename */
;



Suspected but confirmed by comparing (on a smaller sample...) the two outputs of:



strace -o v1 find many_files -printf '%i %y %pn'>info_file
strace -o v2 find many_files -printf '%i %y %M %pn'>info_file


Which on my Linux amd64 kernel 5.0.x just shows as main difference:



[...]



 getdents(4, /* 0 entries */, 32768) = 0
close(4) = 0
fcntl(5, F_DUPFD_CLOEXEC, 0) = 4
-write(1, "25499894 d many_filesn25502410 f"..., 4096) = 4096
-write(1, "iles/844n25502253 f many_files/8"..., 4096) = 4096
-write(1, "096 f many_files/686n25502095 f "..., 4096) = 4096
-write(1, "es/529n25501938 f many_files/528"..., 4096) = 4096
-write(1, "1 f many_files/371n25501780 f ma"..., 4096) = 4096
-write(1, "/214n25497527 f many_files/213n2"..., 4096) = 4096
-brk(0x55b29a933000) = 0x55b29a933000
+newfstatat(5, "1000", 0644, st_size=5, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "999", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "998", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "997", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "996", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "995", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "994", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "993", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "992", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "991", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "990", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0


[...]



+newfstatat(5, "891", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+write(1, "25499894 d drwxr-xr-x many_files"..., 4096) = 4096
+newfstatat(5, "890", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0


[...]






share|improve this answer















The first version requires only to readdir(3)/getdents(2) the directory, when run on a filesystem supporting this feature (ext4: filetype feature displayed with tune2fs -l /dev/xxx, xfs: ftype=1 displayed with xfs_info /mount/point ...).



The second version in addition also requires to stat(2) each file, requiring an additional inode lookup, and thus more seeks on the filesystem and device, possibly quite slower if it's a rotating disk and cache wasn't kept. This stat is not required when looking only for name, inode and filetype because the directory entry is enough:




 The linux_dirent structure is declared as follows:

struct linux_dirent
unsigned long d_ino; /* Inode number */
unsigned long d_off; /* Offset to next linux_dirent */
unsigned short d_reclen; /* Length of this linux_dirent */
char d_name[]; /* Filename (null-terminated) */
/* length is actually (d_reclen - 2 -
offsetof(struct linux_dirent, d_name)) */
/*
char pad; // Zero padding byte
char d_type; // File type (only since Linux
// 2.6.4); offset is (d_reclen - 1)
*/




the same informations are available to readdir(3):




struct dirent 
ino_t d_ino; /* Inode number */
off_t d_off; /* Not an offset; see below */
unsigned short d_reclen; /* Length of this record */
unsigned char d_type; /* Type of file; not supported
by all filesystem types */
char d_name[256]; /* Null-terminated filename */
;



Suspected but confirmed by comparing (on a smaller sample...) the two outputs of:



strace -o v1 find many_files -printf '%i %y %pn'>info_file
strace -o v2 find many_files -printf '%i %y %M %pn'>info_file


Which on my Linux amd64 kernel 5.0.x just shows as main difference:



[...]



 getdents(4, /* 0 entries */, 32768) = 0
close(4) = 0
fcntl(5, F_DUPFD_CLOEXEC, 0) = 4
-write(1, "25499894 d many_filesn25502410 f"..., 4096) = 4096
-write(1, "iles/844n25502253 f many_files/8"..., 4096) = 4096
-write(1, "096 f many_files/686n25502095 f "..., 4096) = 4096
-write(1, "es/529n25501938 f many_files/528"..., 4096) = 4096
-write(1, "1 f many_files/371n25501780 f ma"..., 4096) = 4096
-write(1, "/214n25497527 f many_files/213n2"..., 4096) = 4096
-brk(0x55b29a933000) = 0x55b29a933000
+newfstatat(5, "1000", 0644, st_size=5, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "999", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "998", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "997", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "996", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "995", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "994", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "993", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "992", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "991", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+newfstatat(5, "990", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0


[...]



+newfstatat(5, "891", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0
+write(1, "25499894 d drwxr-xr-x many_files"..., 4096) = 4096
+newfstatat(5, "890", 0644, st_size=4, ..., AT_SYMLINK_NOFOLLOW) = 0


[...]
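The same split between "data that comes with the directory entry" and "data that needs a stat call per file" can be sketched outside of find. The following is only an illustration in Python, not what find does internally: `os.scandir` exposes the entry type from the directory listing where the filesystem provides d_type, while the permission bits require one `lstat`-style call per entry. The function names `entry_types` and `entry_modes` are made up for this sketch.

```python
import os
import stat


def entry_types(path):
    """Map each name in `path` to a find-style type letter.

    Where the filesystem fills in d_type, these checks are answered from
    the directory entry itself; otherwise Python falls back to a stat call.
    """
    types = {}
    with os.scandir(path) as entries:
        for entry in entries:
            if entry.is_symlink():
                kind = "l"
            elif entry.is_dir(follow_symlinks=False):
                kind = "d"
            elif entry.is_file(follow_symlinks=False):
                kind = "f"
            else:
                kind = "?"  # sockets, FIFOs, devices: not distinguished here
            types[entry.name] = kind
    return types


def entry_modes(path):
    """Map each name in `path` to an ls-style mode string.

    The mode is not part of the directory entry, so every entry costs
    one extra stat call (without following symlinks).
    """
    modes = {}
    with os.scandir(path) as entries:
        for entry in entries:
            st = entry.stat(follow_symlinks=False)
            modes[entry.name] = stat.filemode(st.st_mode)
    return modes
```

On a directory with millions of entries, `entry_modes` performs millions of additional system calls compared with `entry_types`, which mirrors the extra newfstatat calls in the second trace.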







– A.B
answered 2 hours ago, edited 1 hour ago












  • Unfortunately, the d_type field of a dir entry is a non-standard feature, only present on Linux and BSD, as mentioned in the readdir(3) manpage. (Though on Linux it is implemented on most filesystems that matter).

    – mosvy
    2 hours ago












  • @mosvy That's ok, the question is tagged CentOS. But yes I understand that on other *nix, results may differ

    – A.B
    2 hours ago












  • Hum actually xfs (CentOS' default) support isn't quite clear...

    – A.B
    1 hour ago











  • added how to check if the filetype feature is present on xfs, in case xfs is in use.

    – A.B
    1 hour ago











  • I think it's supported on xfs -- when I was making a testcase for a glibc glob(3) that only triggered when the d_type field was absent, I had to use either minixfs or use the GLOB_ALTDIRFUNC.

    – mosvy
    1 hour ago

















For your 1st question:



I think your problem is not with how quickly the information is accessed, but the output bottleneck.



You are writing the output to info_file.



When you add %M to the find command, you are now outputting more text due to the permissions: 10 additional characters per line of output, or 30,000,000 more characters in total.



This is more data that has to go through the STDOUT redirect to info_file and get written to disk. More data to push == longer time to write and complete.



In a situation with a single file or a small number of files, it would not be noticeable to a human; time may give you some variation to measure but it might be too slight to notice.



In your question you are working with 3,000,000 files, so obviously it takes longer to write out the permissions output.



For your 2nd question:



I have no idea. Do you have a practical use case for needing to collect permissions for 3,000,000 files, or is this an academic exercise?







– 0xSheepdog
answered 3 hours ago, edited 3 hours ago












  • info_file has size 94M after the first command and 125M after the second one. An extra 31M shouldn't cause the command to run 20 times slower!

    – Bahram
    3 hours ago






  • 1





    ... it would be easy to test whether this is the case, by replacing %M with a fixed string like -rw-rw-r--

    – steeldriver
    3 hours ago











  • I don't think it's a matter of raw "disk space", I think it has to do with processing each line of output with an extra 10 characters. Depending on exactly what is coming out, that could mean an increase of 30% or more, per line.

    – 0xSheepdog
    3 hours ago
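The test steeldriver suggests (same output width, with and without real permissions) can be approximated outside of find. This is a rough Python analogue under stated assumptions, not a measurement of find itself: `render` and `timed` are hypothetical helper names, and the fixed string `-rw-rw-r--` is the one from the comment above. Both variants emit lines of identical length; only one of them makes a stat call per file.

```python
import os
import stat
import time

FIXED_MODE = "-rw-rw-r--"  # same width as a real mode string


def render(path, use_real_mode):
    """Build one output line per directory entry.

    Only the real-mode variant makes a stat call per entry; the
    fixed-string variant produces output of the same size without it.
    """
    lines = []
    with os.scandir(path) as entries:
        for entry in entries:
            if use_real_mode:
                st = entry.stat(follow_symlinks=False)
                mode = stat.filemode(st.st_mode)
            else:
                mode = FIXED_MODE
            lines.append(f"{entry.inode()} {mode} {entry.name}\n")
    return lines


def timed(path, use_real_mode):
    """Return (elapsed seconds, total characters produced)."""
    start = time.perf_counter()
    lines = render(path, use_real_mode)
    elapsed = time.perf_counter() - start
    return elapsed, sum(len(line) for line in lines)
```

If the two variants produce the same number of characters but very different timings on a large directory, the extra output alone cannot account for the slowdown.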
















