If you are a regular reader of this blog, you likely know I like the ZFS filesystem a lot. ZFS has many very interesting features, but I am a bit tired of hearing negative statements on ZFS performance. It feels a bit like people are telling me “Why do you use InnoDB? I have read that MyISAM is faster.” I found the comparison of InnoDB vs.
MyISAM quite interesting, and I’ll use it in this post. To have some data to support my post, I started an AWS i3.large instance with a 1000GB gp2 EBS volume. A gp2 volume of this size is interesting because it is above the burst IOPS level, so it offers a constant 3000 IOPS performance level. I used sysbench to create a table of 10M rows and then, using export/import tablespace, I copied it 329 times.
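As a rough sketch (not the exact commands from the original setup; the database name, credentials, run time and the table-copy step are assumptions), the dataset and workload could be produced with sysbench 1.0 along these lines:

# sysbench oltp_point_select --mysql-db=sbtest --mysql-user=sbtest --mysql-password=secret \
    --tables=1 --table-size=10000000 prepare
(copy the table 329 times with CREATE TABLE ... LIKE plus FLUSH TABLES ... FOR EXPORT / ALTER TABLE ... IMPORT TABLESPACE)
# sysbench oltp_point_select --mysql-db=sbtest --mysql-user=sbtest --mysql-password=secret \
    --tables=330 --table-size=10000000 --threads=8 --rand-type=uniform --time=600 run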
I ended up with 330 tables for a total size of about 850GB. The dataset generated by sysbench is not very compressible, so I used lz4 compression in ZFS. For the other ZFS settings, I used what can be found in my earlier ZFS posts but with the ARC size limited to 1GB. I then used that plain configuration for the first benchmarks. Here are the results with the sysbench point-select benchmark, a uniform distribution and eight threads.
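Before getting to those results: the ZFS settings just mentioned (lz4 compression, ARC capped at 1GB) might look roughly like the following minimal sketch. The pool/dataset name is made up, recordsize=16k is an assumption matching the InnoDB page size, and 1073741824 bytes is the 1GB ARC limit:

# zfs set compression=lz4 tank/mysql
# zfs set recordsize=16k tank/mysql
# echo 1073741824 > /sys/module/zfs/parameters/zfs_arc_max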
The InnoDB buffer pool was set to 2.5GB. In both cases, the load is IO bound and the disk is doing exactly the allowed 3000 IOPS, with XFS delivering far more point selects per second than ZFS. That looks like a clear demonstration that XFS is much faster than ZFS, right?
But is that really the case? The way the dataset was created is extremely favorable to XFS, since there is absolutely no file fragmentation. Once all the files are opened, a read IOP is just a single seek to an offset followed by a read; XFS doesn't need to access any intermediate metadata block. The above result is about as fair as declaring MyISAM faster than InnoDB based only on the table-scan performance of unfragmented tables with a default configuration. ZFS is much less affected by file-level fragmentation, especially for point accesses.
More on ZFS metadata
ZFS stores the files in a fashion very similar to the way InnoDB stores data.
To access a piece of data in a B-tree, you need to read the top-level page (often called the root node) and then one block per level down to the leaf node containing the data. With no cache, reading something from a three-level B-tree thus requires 3 IOPS. [Figure: a simple three-level B-tree] The extra IOPS performed by ZFS are needed to access those internal blocks in the B-trees of the files. These internal blocks are labeled as metadata. Essentially, in the above benchmark, the ARC is too small to contain all the internal blocks of the table files' B-trees. If we continue the comparison with InnoDB, it would be like running with a buffer pool too small to contain the non-leaf pages.
The test dataset I used has about 600MB of non-leaf pages, about 0.1% of the total size, which was well cached by the 2.5GB buffer pool. So only one InnoDB page, a leaf page, needed to be read per point-select statement. To correctly set the ARC size to cache the metadata, you have two choices: you can guess values for the ARC size and experiment, or you can try to evaluate it by looking at ZFS internal data. Let's review these two approaches. You'll often read or hear the rule of thumb of 1GB of ARC per 1TB of data, which is about the same 0.1% ratio as for InnoDB.
I have written about that ratio a few times, having nothing better to propose. Actually, I found it depends a lot on the recordsize used. The 0.1% ratio implies a ZFS recordsize of 128KB. A ZFS filesystem with a recordsize of 128KB will use much less metadata than one using a recordsize of 16KB, because it has 8x fewer leaf pages. Fewer leaf pages require fewer B-tree internal nodes, and hence less metadata.
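A quick back-of-the-envelope shell calculation (assuming 1TB of uncompressed data and ignoring partially filled records) shows that 8x difference in leaf-block count:

# echo $((1024**4 / (128*1024)))
8388608
# echo $((1024**4 / (16*1024)))
67108864

More leaf blocks means more indirect (metadata) blocks are needed to track them.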
A filesystem with a recordsize of 128KB is excellent for sequential access, as it maximizes compression and reduces the IOPS, but it is poor for small random accesses like the ones MySQL/InnoDB does. To determine the correct ARC size, you can slowly increase the ARC size and monitor the number of metadata cache misses with the arcstat tool. Here's an example:

# echo <new_arc_size_in_bytes> > /sys/module/zfs/parameters/zfs_arc_max
# arcstat -f time,arcsz,mm%,mhit,mread,dread,pread 10
    time  arcsz  mm%  mhit  mread  dread  pread
10:22:49   105M    0     0      0      0      0
10:22:59   113M  100     0     22     73      0
10:23:09   120M  100     0     20     68      0
10:23:19   127M  100     0     20     65      0
10:23:29   135M  100     0     22     74      0

You want 'mm%', the metadata miss percentage, to reach 0. So when the 'arcsz' column is no longer growing and you still see high values for 'mm%', the ARC is too small.
Increase the value of 'zfs_arc_max' and continue to monitor. While the 1GB of ARC per 1TB of data ratio may be adequate for a large ZFS recordsize, it is likely too small for a recordsize of 16KB. Does 8x more leaf pages automatically require 8x more ARC space for the non-leaf pages?
Although likely, let’s verify. The second option we have is the zdb utility that comes with ZFS, which allows us to view many internal structures including the B-tree list of pages for a given file.
The tool needs the inode of a file and the ZFS filesystem as inputs.
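As a sketch (the dataset name, file path and inode number here are made up for illustration), you would first find the file's inode number and then ask zdb to dump that object at high verbosity, which lists its indirect (metadata) blocks:

# ls -i /var/lib/mysql/sbtest/sbtest1.ibd
132 /var/lib/mysql/sbtest/sbtest1.ibd
# zdb -ddddd tank/mysql 132

Counting the indirect blocks reported per file, across all the tables, gives a rough estimate of how much metadata the ARC needs to hold.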
New to Hadoop; I have only set up a three-node Debian cluster for practice. I was researching Hadoop best practices and came across this advice: JBOD, no RAID; filesystem: ext3, ext4 or xfs - none of that fancy COW stuff you see with ZFS and btrfs. So I raise these questions. Everywhere I read that JBOD is better than RAID in Hadoop, and that the best filesystems are xfs, ext3 and ext4.
Aside from the filesystem part, which totally makes sense as to why those are the best: how do you implement this JBOD? You will see my confusion if you do the Google search yourself: JBOD usually alludes to a linear appendage or combination of just a bunch of disks, kind of like a logical volume - at least that's how some people explain it - but Hadoop seems to want a JBOD that doesn't combine. Nobody expands on that.
Question 1) What does everyone in the Hadoop world mean by JBOD, and how do you implement that?
Question 2) Is it as simple as mounting each disk to a different directory?
Question 3) Does that mean that Hadoop runs best on a JBOD where each disk is simply mounted to a different directory?
Question 4) And then you just point Hadoop at those data dirs?
Question 5) I see JBODs going two ways: either each disk going to a separate mount, or a linear concatenation of disks, which can be done with mdadm in linear mode (or probably LVM too), so I don't see the big deal with that. And if that's the case - where the JBOD people are referring to is this concatenation of disks, so mdadm in linear mode or LVM can be used - then which is the best way to 'JBOD' or linearly concatenate disks for Hadoop?
This is off-topic, but can someone verify whether this is correct as well: filesystems that use COW (copy-on-write), like ZFS and btrfs, just slow Hadoop down, and on top of that the COW implementation is a waste with Hadoop?
Question 6) Why are COW and things like RAID a waste on Hadoop? I see it like this: if your system crashes and you use the copy-on-write feature to restore it, by the time you have restored your system there have been so many changes to HDFS that it will probably just consider that machine faulty, and it would be better to rejoin it from scratch (bring it up as a fresh new datanode). Or how will the Hadoop system see the older datanode? My guess is it won't think it's old or new or even a datanode; it will just see it as garbage.
Question 7) What happens if Hadoop sees a datanode that fell off the cluster and then the datanode comes back online with data that is slightly older?
Is there a limit to how old the data can be? How does Hadoop handle this?
Re-asking questions 1 through 4: I just realized my question is so simple, yet so hard for me to explain, that I had to split it up into four questions, and I still didn't get the answer I'm looking for from what sound like very smart individuals, so I must re-ask differently. On paper I could explain it easily, or with a drawing. I'll attempt it with words again, in case there is confusion about what I am asking in the JBOD question.
I can try to answer a few of the questions - tell me wherever you disagree.
1. JBOD: just a bunch of disks; an array of drives, each of which is accessed directly as an independent drive. The linked 'Why not use RAID?' topic says that RAID read and write performance is limited by the slowest disk in the array. In addition, in the case of HDFS, replication of data occurs across different machines residing in different racks. This handles the potential loss of data even if an entire rack fails.
So, RAID isn't that necessary. The namenode, though, can use RAID, as mentioned in the link.
2. Yes. That means independent disks (JBODs) mounted in each of the machines (e.g. /disk1, /disk2, /disk3 etc.) but not combined or partitioned together.
3, 4 & 5: read the addendum below.
6 & 7: see the linked material on how replication of blocks occurs.
Addendum after the comment:
Q1.
Which method is everyone referring to as BEST PRACTICE for Hadoop: this combination JBOD, or a separation of disks (which is still also a JBOD according to online documentation)? Possible answer: from Hadoop: The Definitive Guide - you should also set the dfs.data.dir property, which specifies a list of directories for a datanode to store its blocks. Unlike the namenode, which uses multiple directories for redundancy, a datanode round-robins writes between its storage directories, so for performance you should specify a storage directory for each local disk.
Read performance also benefits from having multiple disks for storage, because blocks will be spread across them, and concurrent reads for distinct blocks will be correspondingly spread across disks. For maximum performance, you should mount storage disks with the noatime option. This setting means that last-accessed-time information is not written on file reads, which gives significant performance gains.
Why is LVM not a good idea? Avoid RAID and LVM on TaskTracker and DataNode machines - it generally reduces performance. This is because LVM creates a logical layer over the individual mounted disks in a machine.
For more details, see TIP 1 in the linked article; there are use cases where LVM performed slowly when running Hadoop jobs.
I'm late to the party, but maybe I can chime in:
JBOD
Question 1) What does everyone in the Hadoop world mean by JBOD, and how do you implement that?
Just a bunch of disks. You simply format each whole disk and include it in hdfs-site.xml (and mapred-site.xml or yarn-site.xml) on the datanodes. Hadoop takes care of distributing blocks across the disks.
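To make the "each disk to a different directory" layout concrete, a datanode setup might look roughly like this (device names and mount points are made up for illustration):

# mkfs.ext4 /dev/sdb
# mkfs.ext4 /dev/sdc
# mkdir -p /data/1 /data/2
# mount -o noatime /dev/sdb /data/1
# mount -o noatime /dev/sdc /data/2

Each mount point is then listed, comma-separated, in the datanode's data directory property (dfs.datanode.data.dir in hdfs-site.xml on Hadoop 2+, dfs.data.dir on older releases), for example /data/1/dfs/dn,/data/2/dfs/dn; HDFS then round-robins block writes across those directories.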
Question 2) Is it as simple as mounting each disk to a different directory?
Question 3) Does that mean that Hadoop runs best on a JBOD where each disk is simply mounted to a different directory?
Hadoop does checksumming of the data and periodically verifies these checksums.
Question 4) And then you just point Hadoop at those data dirs?
But note there are separate directories for data storage (HDFS) and for computation (MapReduce, YARN, ...); you can configure different directories and disks for certain tasks.
Question 5) I see JBODs going two ways: either each disk going to a separate mount, or a linear concatenation of disks, which can be done with mdadm in linear mode (or probably LVM too), so I don't see the big deal with that. And if that's the case - where the JBOD people are referring to is this concatenation of disks, so mdadm in linear mode or LVM can be used - then which is the best way to 'JBOD' or linearly concatenate disks for Hadoop?
The problem is faulty disks. If you keep it simple and just mount each disk separately, you only have to replace that one disk when it fails. If you use mdadm or LVM in a JBOD (concatenated) configuration, you are prone to losing more data when a disk dies, as the striped or concatenated volume may not survive a disk failure: data for more blocks is spread across multiple disks.
Question 6) Why are COW and things like RAID a waste on Hadoop? I see it like this: if your system crashes and you use the copy-on-write feature to restore it, by the time you have restored your system there have been so many changes to HDFS that it will probably just consider that machine faulty, and it would be better to rejoin it from scratch (bring it up as a fresh new datanode). Or how will the Hadoop system see the older datanode? My guess is it won't think it's old or new or even a datanode; it will just see it as garbage.
HDFS is a completely separate layer on top of your native filesystem. Disk failures are expected, and that's why all data blocks are replicated (three times by default) across several machines.
HDFS also does its own checksumming, so if the checksum of a block mismatches, a replica of that block is used and the broken block is deleted by HDFS. So in theory it makes no sense to use RAID or COW for Hadoop drives. It can make sense, though, if you have to deal with faulty disks that can't be replaced instantly.
Question 7) What happens if Hadoop sees a datanode that fell off the cluster and then the datanode comes back online with data that is slightly older? Is there a limit to how old the data can be? How does Hadoop handle this?
The NameNode has a list of blocks and their locations on the datanodes. Each block has a checksum and a set of locations. If a datanode goes down, the NameNode re-replicates the blocks that were on that datanode to other datanodes.
If an older datanode comes back online, it sends its block list (block report) to the NameNode, and depending on how many of those blocks are already sufficiently replicated elsewhere, the unneeded blocks on this datanode will be deleted. The age of the data is not important; it's only about blocks. If the NameNode still tracks the blocks and the datanode has them, they will be used again.
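If you want to see this bookkeeping yourself, standard HDFS tooling (not specific to this thread; replace /some/path with a real HDFS path) shows the block list with locations and the per-datanode state:

# hdfs fsck /some/path -files -blocks -locations
# hdfs dfsadmin -report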
ZFS/btrfs/COW
In theory, the additional features these filesystems provide are not required for Hadoop. However, as you usually use cheap and huge 4TB+ desktop drives for datanodes, you can run into problems when these disks start to fail. ext4 remounts itself read-only on failure, and at that point you'll either see the disk drop out of HDFS on that datanode (if it is configured to tolerate losing drives) or see the datanode die (if failed disks are not allowed). This can be a problem because modern drives often exhibit some bad sectors while still functioning fine for the most part, and it is labor-intensive to fsck those disks and restart the datanode. Another problem is computation through YARN/MapReduce: these also write intermediate data to the disks, and if that data gets corrupted or cannot be written, you'll run into errors.
I'm not sure whether YARN/MapReduce also checksum their temporary files - I think that has been implemented, though. ZFS and btrfs provide some resilience against these errors on modern drives, as they deal better with corrupted metadata and avoid lengthy fsck checks thanks to their internal checksumming. I'm running a Hadoop cluster on ZFS (just JBOD with LZ4) with lots of out-of-warranty disks that exhibit some bad sectors but still perform well, and it works fine despite these errors.
If you can replace faulty disks instantly, it does not matter much. If you need to live with partly broken disks, ZFS/btrfs will buy you some time before you have to replace them. COW itself is not needed, because Hadoop takes care of replication and data safety. Compression can be useful if you store your data uncompressed on the cluster: LZ4 in ZFS should not impose a performance penalty and can speed up sequential reads (which is how HDFS and MapReduce mostly read).
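A per-disk ZFS-as-JBOD layout like the one described could be created roughly as follows (pool names, devices and mount points are illustrative, not taken from the thread):

# zpool create -m /data/1 -O compression=lz4 -O atime=off data1 /dev/sdb
# zpool create -m /data/2 -O compression=lz4 -O atime=off data2 /dev/sdc

Each pool is a single-disk vdev with no redundancy, so a failing disk only takes out its own mount point and HDFS re-replicates the affected blocks elsewhere.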
Performance
The case against RAID is that at least MapReduce already implements something similar. HDFS can read and write concurrently to all of the disks, and usually multiple map and reduce tasks are running, each of which can use a whole disk for reading and writing its data. If you put RAID or striping below Hadoop, all of these tasks have to funnel their reads and writes through the single RAID controller, and overall it is likely slower.
Depending on your jobs, it can make sense to use something like RAID-0 for pairs of disks, but first verify that sequential read or write really is the bottleneck for your job (and not the network, HDFS replication, CPU, ...), and make sure that what you are doing is worth the work and the hassle.
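One way to check that (a generic sketch; the file name, size and target directory are arbitrary) is to measure per-disk sequential throughput with fio and compare it against what your jobs actually move over the network and through HDFS:

# fio --name=seqread --filename=/data/1/fio.tmp --rw=read --bs=1M --size=4G --direct=1
# fio --name=seqwrite --filename=/data/1/fio.tmp --rw=write --bs=1M --size=4G --direct=1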